# Training Dynamics

## MLP Training

Based on the loss, accuracy, and F1 curves, my model stopped training before it reached the point of overfitting. I planned for this by implementing early stopping that halts training if the validation score has not improved after 10 steps, and it worked very well. Class weights are also important for training stability and for final model performance: most of the dataset is neutral, so without class weights the model could score well by simply predicting neutral every time. Weighting the classes pushes the model to distinguish all three classes and stay predictive on the minority ones.
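
The early-stopping rule described above can be sketched roughly as follows. This is a minimal illustration, assuming the 10 "steps" are validation checks (one per epoch) and that a higher validation score is better. The names `EarlyStopping`, `min_delta`, and `run_with_early_stopping` are hypothetical and are not taken from the code used for the reported results.

```python
class EarlyStopping:
    """Stop training when the validation score has stopped improving."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience    # allowed checks without improvement
        self.min_delta = min_delta  # smallest change that counts as improvement
        self.best_score = None
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, score, epoch):
        """Record one validation score; return True when training should stop."""
        if self.best_score is None or score > self.best_score + self.min_delta:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def run_with_early_stopping(val_scores, patience=10):
    """Replay a list of validation scores; return (last_epoch_run, best_epoch)."""
    stopper = EarlyStopping(patience=patience)
    for epoch, score in enumerate(val_scores):
        if stopper.step(score, epoch):
            return epoch, stopper.best_epoch
    return len(val_scores) - 1, stopper.best_epoch
```

In a real training loop, the model weights from `best_epoch` would typically be saved and restored once the stop triggers, so the final model is the one with the best validation score rather than the last one trained.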

### MLP Training Results
![MLP Training Curves](outputs/mlp_training_plots.png)

### MLP Confusion Matrix
![MLP Confusion Matrix](outputs/mlp_confusion_matrix.png)

## LSTM Training
Based on the loss, accuracy, and F1 curves, my model showed signs of overfitting. The training loss went down much faster than the validation loss, showing that the model was fitting the training data at the expense of generalizability. As with the MLP, I could stop training once the validation loss stops improving or starts to rise; however, the stop would likely have to trigger before 30 epochs. Class weights again helped training stability: because the majority of the data is neutral, they push the model to distinguish positive and negative rather than safely picking neutral.
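
One common way to derive class weights for an imbalanced dataset is inverse-frequency ("balanced") weighting. The sketch below assumes that scheme; the write-up does not state how the weights were actually computed. The function name and the toy label list are hypothetical, and labels are assumed to be encoded as 0 (negative), 1 (neutral), 2 (positive).

```python
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights: total / (num_classes * count_of_class)."""
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for c in range(num_classes):
        if counts[c] == 0:
            raise ValueError(f"class {c} has no samples")
        weights.append(total / (num_classes * counts[c]))
    return weights


# Toy, made-up label distribution in which neutral (1) dominates.
toy_labels = [0] * 2 + [1] * 6 + [2] * 2
toy_weights = balanced_class_weights(toy_labels, num_classes=3)
print(toy_weights)  # the neutral class receives the smallest weight
```

Weights like these can then be passed to a weighted loss (for example the `weight` argument of `torch.nn.CrossEntropyLoss`), so that mistakes on the rarer positive and negative classes cost more than mistakes on neutral.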

### LSTM Training Results
![LSTM Training Curves](outputs/lstm_curves.png)

### LSTM Confusion Matrix
![LSTM Confusion Matrix](outputs/lstm_cm.png)

# Model Performance and Error Analysis

I think my LSTM model did a far better job of generalizing to the test set. In the LSTM graphs, the training and validation F1 curves stay close together over multiple epochs (albeit only 30 in total) and reach a high F1 score together, while in my MLP training the curves deviate from each other after just 10 epochs. From that point the MLP's training score is much higher than its validation score, meaning that the model overfit the training data and its generalizability weakened as more epochs passed.

The sentiment class most frequently misclassified by both models was positive, which was predicted as neutral. A likely reason is that positive sentiment is hard to distinguish in this training data and subject matter. In financial language, good news may read much like neutral reporting, while bad news in finance tends to be clearly delineated as such, potentially making it easier to classify.
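
The most frequent confusion can be read off a confusion matrix by ranking its off-diagonal cells. The sketch below assumes rows are true labels and columns are predicted labels. The counts are made up purely for illustration and are not the values from the figures above; `top_confusions` is a hypothetical helper name.

```python
LABELS = ["negative", "neutral", "positive"]


def top_confusions(matrix, labels):
    """Return off-diagonal (true, predicted, count) triples, largest first."""
    pairs = []
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j:  # skip correct predictions on the diagonal
                pairs.append((labels[i], labels[j], count))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical counts (rows = true class, columns = predicted class).
example_matrix = [
    [50, 8, 2],
    [10, 200, 15],
    [3, 40, 60],
]

true_label, predicted_label, count = top_confusions(example_matrix, LABELS)[0]
print(f"Most common error: {true_label} predicted as {predicted_label} ({count} times)")
```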

# Cross-Model Comparison

### RNN

#### RNN F1 Learning Curves
![RNN F1 Learning Curves](outputs/rnn_f1_learning_curves.png)

#### RNN Confusion Matrix
![RNN Confusion Matrix](outputs/rnn_confusion_matrix.png)

- Final Test Accuracy: 0.7111
- Test F1 Macro: 0.6919
- Test F1 Weighted: 0.7162

Per-class F1 Scores:
- Negative (0): 0.6968
- Neutral (1): 0.7719
- Positive (2): 0.6069

### GRU

#### GRU F1 Learning Curves
![GRU F1 Learning Curves](outputs/gru_f1_learning_curves.png)

#### GRU Confusion Matrix
![GRU Confusion Matrix](outputs/gru_confusion_matrix.png)


- Final Test Accuracy: 0.7510
- Test F1 Macro: 0.7254
- Test F1 Weighted: 0.7536

Per-class F1 Scores:
- Negative (0): 0.7037
- Neutral (1): 0.8043
- Positive (2): 0.6683


### BERT

#### BERT F1 Learning Curves
![BERT F1 Learning Curves](outputs/bert_f1_learning_curves.png)

#### BERT Confusion Matrix
![BERT Confusion Matrix](outputs/bert_confusion_matrix.png)


BERT final results:

- Final Test Accuracy: 0.8294
- Test F1 Macro: 0.8204
- Test F1 Weighted: 0.8313

Per-class F1 Scores:
- Negative (0): 0.8360
- Neutral (1): 0.8630
- Positive (2): 0.7621

### GPT

#### GPT F1 Learning Curves
![GPT F1 Learning Curves](outputs/gpt_f1_learning_curves.png)

#### GPT Confusion Matrix
![GPT Confusion Matrix](outputs/gpt_confusion_matrix.png)

- Final Test Accuracy: 0.8116
- Test F1 Macro: 0.7717
- Test F1 Weighted: 0.8054

Per-class F1 Scores:
- Negative (0): 0.7514
- Neutral (1): 0.8690
- Positive (2): 0.6947
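
As a sanity check on the numbers reported for the four models above, macro F1 is the unweighted mean of the per-class F1 scores. The short sketch below recomputes it from the per-class values listed in this section; the variable and function names are illustrative only. Weighted F1 cannot be recomputed the same way because it also needs the per-class support counts, which are not listed here.

```python
# Per-class F1 in the order Negative (0), Neutral (1), Positive (2),
# copied from the results reported above.
per_class_f1 = {
    "RNN": [0.6968, 0.7719, 0.6069],
    "GRU": [0.7037, 0.8043, 0.6683],
    "BERT": [0.8360, 0.8630, 0.7621],
    "GPT": [0.7514, 0.8690, 0.6947],
}
reported_macro = {"RNN": 0.6919, "GRU": 0.7254, "BERT": 0.8204, "GPT": 0.7717}


def macro_f1(scores):
    """Macro F1: every class counts equally, regardless of its size."""
    return sum(scores) / len(scores)


for name, scores in per_class_f1.items():
    print(f"{name}: computed {macro_f1(scores):.4f}, reported {reported_macro[name]:.4f}")
```

Because macro F1 weights every class equally, a weaker positive-class score pulls it down more than it pulls down accuracy or weighted F1, which is why macro F1 is the lowest of the three summary metrics for each model here.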



Because the MLP relied on fastText embeddings rather than processing the text as a sequence, it mostly performed worse than the sequence-based models. While most of the models showed some overfitting, the final F1 on validation data is better for the sequence-based models.
This advantage for sequence-processing models likely comes from the fact that financial text differs considerably from the general-purpose text that the fastText embeddings were likely trained on, so the MLP struggles to understand what the data means. Learning from sequences, on the other hand, allows sentiment to be distinguished in context, which lets those models perform much better.
The fine-tuned LLMs did outperform all of the classical baseline models. In terms of pretraining, these LLMs have been trained on a much larger corpus and have seen a great deal of context before they start learning the task, while the classical models have to build up their understanding from scratch. As for contextual representations, self-attention lets the LLMs relate words across the entire sentence easily, while the classical models struggle to retain longer-distance context.
Thus the rankings, by test macro F1, are as follows:
1. BERT (Test F1 Macro: 0.8204, Test F1 Weighted: 0.8313)
2. GPT (Test F1 Macro: 0.7717, Test F1 Weighted: 0.8054)
3. GRU (Test F1 Macro: 0.7254, Test F1 Weighted: 0.7536)
4. LSTM (Test F1 Macro: 0.7242)
5. MLP (Test F1 Macro: 0.6998)
6. RNN (Test F1 Macro: 0.6919, Test F1 Weighted: 0.7162)

The first two models work best thanks to self-attention, which allows parallel computation and connects any two words in a sentence through a path of constant length, O(1), along with strong contextual embeddings in comparison to the mostly static embeddings of the RNN, LSTM, and MLP. GRU edges out LSTM by only about 0.001 macro F1, so the two are effectively tied; both rely on gating to protect information over long spans. Finally, the RNN can suffer from the vanishing gradient problem and so struggles to learn across a sentence, while the MLP, which sits just barely above the RNN, fails to capture word order.
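
To illustrate the long-range interaction point, here is a minimal sketch of scaled dot-product self-attention in plain Python. It is a simplification made for illustration: it omits the learned query/key/value projections, multiple heads, and masking that BERT and GPT actually use, and the toy token vectors are made up. What it does show is that each output row is a weighted mix of every token in the sentence, computed in a single step rather than passed along one position at a time as in a recurrent model.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Returns (outputs, weights); weights[i][j] is how much token i attends to token j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all value vectors.
        outputs.append(
            [sum(wj * v[t] for wj, v in zip(w, values)) for t in range(len(values[0]))]
        )
    return outputs, weights


# Three made-up 2-dimensional token vectors standing in for a short sentence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for i, row in enumerate(weights):
    print(f"token {i} attends to all tokens with weights {[round(x, 3) for x in row]}")
```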


## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

#### <font color="red">Write your disclosure here.</font>


Gemini Flash 1.5 and Claude Sonnet 4.5

I used them to help me draft the code for the LSTM and MLP models and their graphs, and to check that the explanations and reasoning about each model's performance in my write-up were correct.

I went through the architecture of each model and made sure they were consistent with the techniques introduced in the slides and example documentation found online.