In [52]:
from pathlib import Path
import pandas as pd

In [53]:
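def read_metrics_txt(path):
    """Parse a tab-separated `key<TAB>value` metrics file into a dict.

    Returns None if the file does not exist. Values that parse as floats
    are stored as floats; anything else is kept as a stripped string.
    """
    path = Path(path)
    if not path.exists():
        return None
    d = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        # Skip blank lines and lines without a tab separator.
        if not line.strip() or "\t" not in line:
            continue
        k, v = line.split("\t", 1)
        try:
            d[k.strip()] = float(v.strip())
        except ValueError:
            d[k.strip()] = v.strip()
    return d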

models = {
    "MLP": "mlp",
    "LSTM": "lstm",
    "RNN": "rnn",
    "GRU": "gru",
    "BERT": "bert",
    "GPT": "gpt",
}

rows = []
for name, prefix in models.items():
    m = read_metrics_txt(OUT / f"{prefix}_metrics.txt")
    rows.append({
        "Model": name,
        "test_macro_f1": None if m is None else m.get("test_macro_f1"),
        "test_accuracy": None if m is None else m.get("test_accuracy"),
    })

df = pd.DataFrame(rows)
df_sorted = df.sort_values(by="test_macro_f1", ascending=False, na_position="last")
df_sorted

| | Model | test_macro_f1 | test_accuracy |
|---|---|---|---|
| 4 | BERT | 0.800104 | 0.814305 |
| 5 | GPT | 0.798883 | 0.823934 |
| 1 | LSTM | 0.743383 | 0.77304 |
| 3 | GRU | 0.740908 | 0.771664 |
| 0 | MLP | 0.692936 | 0.726272 |
| 2 | RNN | 0.686558 | 0.719395 |


## Plots generated by the training scripts

### MLP

![MLP Learning Curves](outputs/mlp_learning_curves.png)
![MLP Confusion Matrix](outputs/mlp_confusion_matrix.png)

### LSTM

![LSTM Learning Curves](outputs/lstm_learning_curves.png)
![LSTM Confusion Matrix](outputs/lstm_confusion_matrix.png)

### **1. Training Dynamics**
Focus on your MLP and LSTM implementations

#### **Did your models show signs of overfitting or underfitting? What architectural or training changes could address this?**

- MLP:
    - The training loss keeps declining, but the validation loss stops improving early and then stays roughly flat.
    - In both the macro F1 and accuracy curves, there is a small gap between training and validation performance.
    Thus, the MLP shows mild overfitting, at a level that seems acceptable.

- LSTM:
    - The training loss keeps declining, but the validation loss stops improving early and then drifts upward.
    - In both the macro F1 and accuracy curves, there is a large gap between training and validation performance.
    This indicates severe overfitting: the LSTM keeps fitting the training set after validation performance has peaked. A likely cause is that the LSTM has more capacity than the MLP while the labeled dataset is small. Stronger regularization (dropout, weight decay), a smaller hidden size, or early stopping on validation macro F1 could address this.
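
One training-side remedy for a validation loss that bottoms out early and then rises is early stopping. The sketch below is only illustrative: the function name `early_stopping_epoch`, the `patience` value, and the loss values are assumptions chosen to resemble the LSTM curve described above, not numbers taken from the training scripts.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training would stop at the first epoch where the validation loss has not
    improved by more than `min_delta` for `patience` consecutive epochs.
    stop_epoch is None if that never happens.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return None, best_epoch


# Illustrative values only (hypothetical, not the notebook's real losses).
val_losses = [0.90, 0.75, 0.70, 0.72, 0.74, 0.78, 0.81]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"best epoch: {best_epoch}, stop at epoch: {stop_epoch}")
```

In practice the checkpoint from `best_epoch` would be restored, so the evaluated model is the one with the lowest validation loss rather than the last one trained.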

#### **How did using class weights affect training stability and final performance?**

The dataset is imbalanced, with Neutral as the majority class. Without class weights, optimization tends to favor the majority class, which can inflate accuracy but hurt macro F1, since minority classes get low recall. Class weights rescale each example's contribution to the loss, so mistakes on minority classes are penalized more heavily.
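
A minimal sketch of how such weights can be derived, assuming the common inverse-frequency ("balanced") heuristic `n_samples / (n_classes * class_count)`. The training scripts may compute their weights differently, and the label list below is a hypothetical toy sample, not the real training distribution.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical toy labels with Neutral as the majority class.
toy_labels = ["negative"] * 2 + ["neutral"] * 6 + ["positive"] * 4
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these toy counts the rarest class gets the largest weight and the majority class gets a weight below 1, so a misclassified minority example contributes more to the weighted loss than a misclassified Neutral example.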

### **2. Model Performance and Error Analysis**
Focus on your MLP and LSTM implementations

#### **Which of your two models generalized better to the test set? Provide evidence from your metrics.**

In [54]:
df_sorted

| | Model | test_macro_f1 | test_accuracy |
|---|---|---|---|
| 4 | BERT | 0.800104 | 0.814305 |
| 5 | GPT | 0.798883 | 0.823934 |
| 1 | LSTM | 0.743383 | 0.77304 |
| 3 | GRU | 0.740908 | 0.771664 |
| 0 | MLP | 0.692936 | 0.726272 |
| 2 | RNN | 0.686558 | 0.719395 |


The LSTM has both a higher test macro F1 (0.7434 vs. 0.6929) and a higher test accuracy (0.7730 vs. 0.7263) than the MLP. Thus, of the two models, the LSTM generalized better to the test set.

#### **Which sentiment class was most frequently misclassified? Propose reasons for this pattern.**

We focus on the confusion matrices:
![MLP Confusion Matrix](outputs/mlp_confusion_matrix.png)
![LSTM Confusion Matrix](outputs/lstm_confusion_matrix.png)


- MLP (number of misclassified examples, with the per-class error rate)
    - Negative: 17 (18.7%)
    - Neutral: 108 (25.0%)
    - Positive: 74 (36.3%)

- LSTM (number of misclassified examples, with the per-class error rate)
    - Negative: 16 (17.6%)
    - Neutral: 76 (17.6%)
    - Positive: 73 (35.8%)

In both models, the neutral class had the largest number of misclassifications, most likely because it is by far the largest class.

In both models, the positive class had the highest misclassification rate. This likely occurs because mildly positive financial statements often resemble neutral ones in tone, making the boundary between Positive and Neutral difficult to distinguish. As a result, many positive examples are predicted as neutral, increasing the error rate for the Positive class.
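
The distinction between error counts and error rates can be checked with a short calculation. The misclassification counts below are the ones listed above; the per-class test supports (91, 432, 204) are an assumption inferred from those counts and percentages, not values read directly from the notebook. As a sanity check, the implied accuracies agree with the reported test accuracies.

```python
# Misclassified examples per true class, as listed above.
errors = {
    "MLP": {"Negative": 17, "Neutral": 108, "Positive": 74},
    "LSTM": {"Negative": 16, "Neutral": 76, "Positive": 73},
}

# Assumed test-set support per class, inferred from count / percentage.
support = {"Negative": 91, "Neutral": 432, "Positive": 204}


def per_class_error_rate(model_errors, class_support):
    """Fraction of each true class that was misclassified."""
    return {c: model_errors[c] / class_support[c] for c in class_support}


def implied_accuracy(model_errors, class_support):
    """Overall accuracy implied by the per-class error counts."""
    return 1 - sum(model_errors.values()) / sum(class_support.values())


for model, model_errors in errors.items():
    rates = per_class_error_rate(model_errors, support)
    most_errors = max(model_errors, key=model_errors.get)
    worst_rate = max(rates, key=rates.get)
    acc = implied_accuracy(model_errors, support)
    print(f"{model}: most errors={most_errors}, highest rate={worst_rate}, accuracy={acc:.4f}")
```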

### **3. Cross-Model Comparison**
Compare all six models: MLP, RNN, LSTM, GRU, BERT, GPT

#### **How did mean-pooled FastText embeddings limit the MLP compared to sequence-based models?**

Mean-pooled FastText embeddings limit the MLP because averaging word vectors removes word order and structural information from the sentence. This makes the model unable to properly capture negation, phrase-level interactions, or compositional meaning. In contrast, sequence-based models like LSTM process tokens in order and can model contextual dependencies, resulting in richer sentence representations.
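
The loss of word order can be seen in a small sketch. The 2-dimensional vectors in `toy_emb` are hypothetical stand-ins, not real FastText embeddings, and `mean_pool` is an illustrative helper rather than the function used in the training script.

```python
import math

# Hypothetical 2-d toy embeddings (not real FastText vectors).
toy_emb = {
    "profit": [1.0, 0.0],
    "rose": [0.5, 0.5],
    "not": [0.0, -1.0],
    "fell": [-0.5, -0.5],
}


def mean_pool(tokens, emb):
    """Average the token vectors dimension by dimension."""
    dim = len(next(iter(emb.values())))
    return [sum(emb[t][i] for t in tokens) / len(tokens) for i in range(dim)]


# Same bag of words, opposite meaning.
sent_a = "profit rose not fell".split()
sent_b = "profit fell not rose".split()

pooled_a = mean_pool(sent_a, toy_emb)
pooled_b = mean_pool(sent_b, toy_emb)

same_vector = all(
    math.isclose(x, y, abs_tol=1e-12) for x, y in zip(pooled_a, pooled_b)
)
print(pooled_a, pooled_b, same_vector)
```

Because both sentences contain the same tokens, the pooled vectors are identical, so any classifier on top of them must assign both sentences the same prediction.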

#### **What advantage did the LSTM’s sequential processing provide over the MLP?**

The LSTM’s sequential processing allows it to model word order and contextual dependencies within a sentence, while the MLP relies on mean-pooled embeddings and ignores order. This enables the LSTM to better capture negation, modifiers, and multi-word expressions that influence sentiment. As a result, the LSTM can learn richer sentence representations, although its higher capacity also increases the risk of overfitting.
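
As a rough illustration of order sensitivity, the sketch below runs a plain tanh recurrence over two sentences that contain the same words in a different order. This is deliberately much simpler than an LSTM (no gates, fixed scalar weights instead of learned matrices), and the embeddings and the weights `w_in` and `w_rec` are hypothetical values chosen for the example.

```python
import math

# Hypothetical 2-d toy embeddings (not real FastText vectors).
toy_emb = {
    "profit": [1.0, 0.0],
    "rose": [0.5, 0.5],
    "not": [0.0, -1.0],
    "fell": [-0.5, -0.5],
}


def run_recurrence(tokens, emb, w_in=1.0, w_rec=0.5):
    """Final hidden state of h_t = tanh(w_in * x_t + w_rec * h_{t-1}), elementwise."""
    h = [0.0] * len(next(iter(emb.values())))
    for tok in tokens:
        h = [math.tanh(w_in * x + w_rec * prev) for x, prev in zip(emb[tok], h)]
    return h


h_a = run_recurrence("profit rose not fell".split(), toy_emb)
h_b = run_recurrence("profit fell not rose".split(), toy_emb)
print(h_a)
print(h_b)
```

Each hidden state depends on the previous one, so reordering the tokens changes the final state, which gives a downstream classifier a chance to tell the two sentences apart.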

#### **Did fine-tuned LLMs (BERT/GPT) outperform classical baselines? Explain the performance gap in terms of pretraining and contextual representations.**

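Yes. According to the metrics, BERT and GPT reach test macro F1 scores of 0.8001 and 0.7989 and test accuracies of 0.8143 and 0.8239. The best classical model, the LSTM, reaches 0.7434 and 0.7730, so the fine-tuned LLMs lead by roughly 5.5 to 5.7 points of macro F1 and 4 to 5 points of accuracy.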

Fine‑tuned BERT/GPT outperform classical baselines because they start from pretrained contextual representations:
- Pretraining on massive corpora teaches general syntax and semantics before any sentiment labels are seen.
- Unlike static FastText, contextual embeddings represent the meaning of a token conditioned on surrounding tokens.
- Self‑attention captures long-range dependencies without the limitations of vanilla recurrence.

Even with a small labeled dataset, fine‑tuning can leverage this prior knowledge to improve macro F1 and reduce systematic class confusions.
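
The difference between static and contextual representations can be sketched with a single scaled dot-product self-attention step. This is a simplification under stated assumptions: the queries, keys, and values are the input vectors themselves (no learned projections, one head, no positional encoding), and the token vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, X)) for i in range(d)])
    return out


# Hypothetical static vectors: the vector for "bank" is fixed.
static = {"river": [0.0, 1.0], "loan": [1.0, 1.0], "bank": [1.0, 0.0]}

ctx_river = self_attention([static["river"], static["bank"]])[1]
ctx_loan = self_attention([static["loan"], static["bank"]])[1]
print(ctx_river)  # "bank" after attending to "river"
print(ctx_loan)   # "bank" after attending to "loan"
```

The static vector for "bank" is the same in both inputs, but its attention output differs because it is a weighted mix of the surrounding tokens.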

#### **Rank all six models by test performance. What architectural or representational factors explain the ranking?**

In [55]:
df_sorted

| | Model | test_macro_f1 | test_accuracy |
|---|---|---|---|
| 4 | BERT | 0.800104 | 0.814305 |
| 5 | GPT | 0.798883 | 0.823934 |
| 1 | LSTM | 0.743383 | 0.77304 |
| 3 | GRU | 0.740908 | 0.771664 |
| 0 | MLP | 0.692936 | 0.726272 |
| 2 | RNN | 0.686558 | 0.719395 |


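We rank the models by test macro F1, since the dataset is imbalanced and macro F1 better reflects balanced performance across all sentiment classes.

The final ranking is BERT > GPT > LSTM > GRU > MLP > RNN. BERT and GPT are nearly tied on macro F1 (0.8001 vs. 0.7989), and GPT has the higher test accuracy (0.8239 vs. 0.8143). The ranking follows the representational factors discussed above: the two fine-tuned pretrained models with contextual representations come first, the gated recurrent models (LSTM, GRU) that model word order come next, and the MLP on mean-pooled FastText embeddings follows. The vanilla RNN ranks last, slightly below the MLP. A plausible explanation is that, without gating, it is harder to train on longer sequences (vanishing gradients), so its access to word order does not translate into better test performance.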
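
The ranking can be reproduced from the table, and ranking by accuracy instead shows where the two metrics disagree. The sketch below copies the metric values from the table into a plain dictionary; the variable names are illustrative.

```python
# (test_macro_f1, test_accuracy) copied from the table above.
metrics = {
    "MLP": (0.692936, 0.726272),
    "LSTM": (0.743383, 0.773040),
    "RNN": (0.686558, 0.719395),
    "GRU": (0.740908, 0.771664),
    "BERT": (0.800104, 0.814305),
    "GPT": (0.798883, 0.823934),
}

rank_by_f1 = sorted(metrics, key=lambda m: metrics[m][0], reverse=True)
rank_by_acc = sorted(metrics, key=lambda m: metrics[m][1], reverse=True)

print("by macro F1:", " > ".join(rank_by_f1))
print("by accuracy:", " > ".join(rank_by_acc))
```

Only the top two positions swap between the two orderings; the four classical models keep the same relative order under either metric.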

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

#### <font color="red">Write your disclosure here.</font>

- **Tool(s) used:** Google Gemini
- **How you used them:** concept explanation, debugging, drafting code, checking the validity of the result
- **What you verified yourself:** checked outputs/plots, checked shapes
- **What you did *not* use AI for (if applicable):** drafted the notebook, adjusted the format of the notebook, did analysis