# Open Questions
## Question 1

* Did your models show signs of overfitting or underfitting? What architectural or training changes could address this?
* How did using class weights affect training stability and final performance?


### MLP Curves
![mlp training curves](mlp_training.png)

### LSTM Curves
![lstm training curves](lstn_training.png)

The LSTM exhibits significant overfitting: the training loss decreased consistently (albeit with considerable jitter) and the training macro F1 score kept rising throughout the 30 epochs, but the validation loss and macro F1 stopped improving meaningfully and plateaued. The widening gap between training and validation performance indicates that the model was overfitting the data and learning its idiosyncrasies instead of generalizable patterns, which is likely caused by the model's higher capacity (three LSTM layers with 128 hidden units). To mitigate this, we could reduce the number of layers or hidden dimensions, introduce dropout, or apply early stopping.
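As a rough illustration of the early stopping idea mentioned above, the sketch below tracks validation loss and signals when it has not improved for a fixed number of epochs. The `EarlyStopping` class, the `patience` and `min_delta` parameters, and the loss values are all hypothetical and are not taken from our training runs.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses (not from the report): improve, then drift upward.
val_losses = [0.90, 0.75, 0.68, 0.66, 0.67, 0.69, 0.70, 0.72]

stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(f"best epoch: {stopper.best_epoch}, stopped at epoch: {stopped_at}")
```

In a real training loop one would typically also save the model weights at the best epoch and restore them after stopping.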

The MLP also displayed signs of overfitting, but much more mildly: the gap appeared later in training and the curves were much steadier. The gap between training and validation macro F1 scores was smaller than for the LSTM. Although the MLP did not reach as high a training F1 as the LSTM, its validation performance was more stable and did not deteriorate over time, suggesting that the MLP's lower capacity limited its ability to memorize the training data and helped it generalize. To improve it, we could increase the depth or width of the hidden layers to add capacity, as long as the model does not start to overfit.

## Question 2

* Which of your two models generalized better to the test set? Provide evidence from your metrics.

As seen in the plots from Question 1, the LSTM generalized better, with a test macro F1 of 0.74 versus 0.71 for the MLP. This makes sense given the LSTM's greater theoretical capacity to capture sentence meaning through sequential processing.
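For reference, macro F1 is the unweighted mean of the per-class F1 scores, so a rarely predicted class counts as much as a common one. The sketch below computes it from scratch; the `macro_f1` helper, the label names, and the toy predictions are illustrative assumptions, not our evaluation code or data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data (hypothetical): the single positive example is predicted as neutral.
labels = ["negative", "neutral", "positive"]
y_true = ["neutral", "neutral", "neutral", "neutral", "positive", "negative"]
y_pred = ["neutral", "neutral", "neutral", "neutral", "neutral", "negative"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, labels)
print(f"accuracy: {accuracy:.3f}, macro F1: {score:.3f}")
```

In this toy case accuracy stays high while macro F1 drops sharply, because the missed minority class contributes an F1 of zero to the average.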

* Which sentiment class was most frequently misclassified? Propose reasons for this pattern.

### MLP

![mlp cm](mlp_confusion_matrix.png)

### LSTM

![lstm cm](lstn_confusion_matrix.png)

Although neutral examples generated the largest number of raw misclassifications in both models, this pattern is largely explained by class imbalance. When examining each class's recall, the positive class had the worst recall in both models. One possible explanation is that positive financial sentiment tends to be expressed in subtle or moderately optimistic language rather than strongly polarized language, perhaps because of a bias toward risk aversion. As a result, positive statements may overlap heavily with neutral statements semantically.
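The distinction between raw error counts and per-class recall can be sketched as follows. The confusion matrix below uses made-up counts (not the values in our plots) and assumes rows are true classes and columns are predicted classes; `per_class_recall` and `per_class_errors` are hypothetical helper names.

```python
def per_class_recall(confusion, labels):
    """Recall per class; assumes rows = true class, columns = predicted class."""
    recall = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        recall[label] = confusion[i][i] / row_total if row_total else 0.0
    return recall


def per_class_errors(confusion, labels):
    """Number of misclassified examples per true class (off-diagonal row sum)."""
    return {label: sum(confusion[i]) - confusion[i][i]
            for i, label in enumerate(labels)}


labels = ["negative", "neutral", "positive"]
# Hypothetical counts for an imbalanced dataset dominated by the neutral class.
confusion = [
    [40, 8, 2],      # true negative
    [12, 264, 24],   # true neutral
    [3, 27, 70],     # true positive
]

recall = per_class_recall(confusion, labels)
errors = per_class_errors(confusion, labels)
most_errors = max(errors, key=errors.get)
worst_recall = min(recall, key=recall.get)
print(f"most raw errors: {most_errors}, lowest recall: {worst_recall}")
```

With these toy counts the majority class produces the most raw errors simply because it has the most examples, while a smaller class has the lowest recall.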


## Question 3

* How did mean-pooled FastText embeddings limit the MLP compared to sequence-based models?

The use of mean-pooled FastText embeddings constrains the representational capacity of the MLP. By averaging the word vectors across the sentence, we effectively discard all information about word order and syntactic structure, reducing the sentence to a single "bag-of-words" representation that loses contextual nuances such as contrast and negation (e.g., "not X" and "X" become very similar because, after pooling, the negation contributes only a small adjustment to the average). As a result, the MLP was limited to learning static semantic averages rather than modeling how words interact across a sequence.
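A minimal sketch of why mean pooling is order-blind: two sentences built from the same words in a different order pool to the same vector. The 2-dimensional vectors here are toy stand-ins chosen for illustration, not real FastText embeddings, and `mean_pool` is a hypothetical helper.

```python
def mean_pool(tokens, embeddings):
    """Average the word vectors of a sentence into one fixed-size vector."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for tok in tokens:
        for k, value in enumerate(embeddings[tok]):
            total[k] += value
    return [x / len(tokens) for x in total]


# Toy 2-d vectors (hypothetical, not FastText output).
embeddings = {
    "profit": [1.0, 0.5],
    "costs": [-1.0, 0.25],
    "rose": [0.5, 1.0],
    "fell": [-0.5, -1.0],
    "while": [0.0, 0.25],
}

sentence_a = "profit rose while costs fell".split()
sentence_b = "costs rose while profit fell".split()  # opposite meaning, same words

pooled_a = mean_pool(sentence_a, embeddings)
pooled_b = mean_pool(sentence_b, embeddings)
same = all(abs(x - y) < 1e-12 for x, y in zip(pooled_a, pooled_b))
print("identical pooled vectors:", same)
```

Since the MLP only ever sees the pooled vector, it has no way to tell these two sentences apart.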

* What advantage did the LSTM's sequential processing provide over the MLP?

The LSTM's ability to process text sequentially and preserve word order allows it to accumulate contextual information as it reads through a sentence, while the MLP cannot see how words interact across a sentence. This enables the model to capture the compositional patterns mentioned earlier, which are critical in financial sentiment analysis. Because the final hidden state reflects information from the entire sequence in order, the LSTM can distinguish sentences that contain the same words in different arrangements, which gives it a higher theoretical ceiling than the MLP.
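To illustrate order sensitivity, the sketch below runs a plain tanh recurrence over two orderings of the same inputs. This is a deliberate simplification: it is a scalar, ungated recurrence rather than an LSTM, and the weights `w_x`, `w_h` and the per-word values are arbitrary assumptions. It only shows that a recurrent final state depends on order even when the inputs' average does not.

```python
import math


def final_hidden_state(inputs, w_x=1.0, w_h=0.5):
    """Run h_t = tanh(w_h * h_{t-1} + w_x * x_t) and return the last state."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_h * h + w_x * x)
    return h


# Toy scalar "embeddings" (hypothetical values).
toy_values = {"profit": 1.0, "rose": 0.5, "while": 0.0, "costs": -1.0, "fell": -0.5}

sentence_a = "profit rose while costs fell".split()
sentence_b = "costs rose while profit fell".split()

inputs_a = [toy_values[w] for w in sentence_a]
inputs_b = [toy_values[w] for w in sentence_b]

# Same multiset of inputs, so the mean-pooled value is identical...
mean_a = sum(inputs_a) / len(inputs_a)
mean_b = sum(inputs_b) / len(inputs_b)

# ...but the recurrent final states differ because order matters.
h_a = final_hidden_state(inputs_a)
h_b = final_hidden_state(inputs_b)
print(f"means equal: {mean_a == mean_b}, final states differ: {abs(h_a - h_b) > 1e-3}")
```

An LSTM adds input, forget, and output gates on top of this kind of recurrence, which is what helps it retain information over longer spans.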

* Did fine-tuned LLMs (BERT/GPT) outperform classical baselines? Explain the performance gap in terms of pretraining and contextual representations.

F1 Macro:

- GPT: 0.7969
- BERT: 0.8132
- GRU: 0.7491
- RNN: 0.6834

Yes, they did. The performance gap reflects the representational advantages of large-scale pretraining and the transformer architecture. Classical models rely on static word embeddings and must learn task-specific patterns solely from the training data, which limits their ability to generalize. In contrast, BERT and GPT generate contextual embeddings in which a word's representation depends on its surroundings, allowing the model to pick up subtle shifts in meaning. Additionally, transformer-based architectures use self-attention mechanisms that model long-range dependencies without the sequential bottleneck found in an RNN.
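The idea of a contextual embedding can be sketched with a stripped-down self-attention step. This is only an illustration under strong simplifying assumptions: queries, keys, and values are all the raw input vectors (real transformers use learned projections, multiple heads, and positional information), and the 2-d word vectors are invented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = inputs."""
    dim = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs


# Toy static vectors (hypothetical): "growth" has one fixed embedding.
static = {
    "growth": [1.0, 0.0],
    "slows": [-1.0, 0.5],
    "accelerates": [1.0, 0.5],
}

# Representation of "growth" after attending to two different contexts.
growth_in_a = self_attention([static["growth"], static["slows"]])[0]
growth_in_b = self_attention([static["growth"], static["accelerates"]])[0]
print("same static vector, different contextual vectors:", growth_in_a != growth_in_b)
```

The static embedding of "growth" is identical in both sentences, but its attended representation shifts with the neighboring word, which is the property a static-embedding baseline lacks.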

* Rank all six models by test performance. What architectural or representational factors explain the ranking?


Rankings based on Test F1 Macro:

1) BERT
2) GPT
3) LSTM
4) GRU
5) MLP
6) RNN

This ranking reflects a clear hierarchy in representational power and architectural sophistication: the transformers rank at the top, followed by the two gated recurrent models, the MLP, and finally the vanilla RNN. BERT achieved the strongest performance, probably due to its bidirectional transformer architecture and large-scale pretraining. GPT followed, probably benefiting from its transformer-based contextual embeddings and pretraining. Among the non-transformer models, the LSTM performed best, reflecting the advantage of gated recurrent architectures in modeling sequential dependencies. This is probably due to its memory cell and gating mechanisms, which allow it to preserve long-range contextual information more effectively than a standard RNN. Interestingly, the MLP ranked above the vanilla RNN, probably due to better hyperparameter tuning, even though the RNN should in theory perform better because it can process sequential information.

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

#### <font color="red">Write your disclosure here.</font>

- ChatGPT 5.2
- Concept explanation, understanding how to structure the script, what to implement and how to preprocess the data, debugging
- Made sure the code did what it was meant to do, checked output, read documentation
