## Open-Ended Reflection Questions
## 1. Training Dynamics
The LSTM showed strong signs of overfitting: its training loss went to zero and its training F1 went to one. The MLP also overfit somewhat, but it had a smaller train-validation gap and more stable validation curves. Possible remedies are early stopping, increased dropout, or weight decay. <br/>
Using class weights improved macro-F1 and minority-class recall for both models, but in the LSTM it probably also amplified gradient magnitudes and increased training instability.
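
Two of the ideas above, early stopping and class weighting, can be sketched in plain Python. This is only an illustration under assumptions: the `EarlyStopping` class, the `inverse_frequency_weights` helper, the patience value, and all losses and labels below are made up for the example and are not taken from the notebook.

```python
from collections import Counter


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def inverse_frequency_weights(labels, num_classes):
    """Weight class c by n_samples / (num_classes * count_c), so rare classes count more."""
    counts = Counter(labels)
    total = len(labels)
    # max(..., 1) avoids division by zero for a class absent from `labels`.
    return [total / (num_classes * max(counts[c], 1)) for c in range(num_classes)]


# Hypothetical validation losses: improvement stalls after the second epoch.
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate([0.90, 0.80, 0.82, 0.85, 0.70]):
    if stopper.step(loss):
        stopped_at = epoch
        break

# Hypothetical imbalanced labels: six of class 0, three of class 1, one of class 2.
weights = inverse_frequency_weights([0, 0, 0, 0, 0, 0, 1, 1, 1, 2], num_classes=3)
print(stopped_at, [round(w, 3) for w in weights])
```

In a PyTorch training loop, weights like these would typically be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`. Because the rarest class gets the largest weight, its examples also contribute proportionally larger gradients, which is consistent with the instability noted for the LSTM.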

## 2. Model Performance and Error Analysis
The LSTM generalized better to held-out data, achieving higher validation accuracy (around 76–78%) and higher macro-F1 (around 0.73–0.75) than the MLP (around 70% accuracy and 0.67 macro-F1). Although it overfit more during training, the LSTM still captured richer sequential information and generalized better overall. <br/>

The Neutral class was the most frequently misclassified in both models, with confusion in both directions: between Neutral and Positive and between Neutral and Negative. This is likely due to semantic ambiguity: mildly positive or negative sentences can lie close to neutral statements in the embedding space. Neutral was also the largest class, which likely inflates its absolute number of misclassifications.
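
The distinction between a class's error rate and its absolute error count can be made concrete with a small confusion-matrix sketch. Everything here is an assumption for illustration: the label encoding (0 = Negative, 1 = Neutral, 2 = Positive), the helper names, and the toy predictions are invented and are not the notebook's results.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


def error_share(matrix):
    """Fraction of all misclassifications whose true label is each class."""
    errors = [sum(row) - row[i] for i, row in enumerate(matrix)]
    total = sum(errors)
    return [e / total if total else 0.0 for e in errors]


# Toy data with Neutral (1) as the largest class; values are made up.
y_true = [0, 0, 1, 1, 1, 1, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 1, 1, 0, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
recall = per_class_recall(cm)
share = error_share(cm)
print(cm, recall, share)
```

In this toy case Neutral has the highest recall of the three classes yet still accounts for half of all errors, simply because it has the most examples. Comparing both quantities on the real confusion matrices would show how much of the Neutral confusion is due to ambiguity and how much to class size.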

## 3. Cross-Model Comparison
The mean-pooled FastText embeddings discard word order and contextual dependencies. This prevents the MLP from capturing the sentence-level structure that sequence-based models can learn.
The LSTM's sequential processing allows it to model word order and compositional structure, which is why it has an advantage over the MLP.
Fine-tuned BERT outperformed the classical neural baselines because of large-scale pretraining and contextual self-attention. Unlike the baselines, BERT dynamically adjusts word representations based on sentence context, allowing it to better distinguish sentiment differences and reduce class confusion.
Ranked by performance in descending order, the six models are GPT, BERT, GRU, LSTM, RNN, and MLP, with only a narrow gap between GRU and LSTM. The MLP is at the bottom because its mean-pooled embeddings do not take word order or context into account. The sequence-based models (RNN, LSTM, GRU) model token order and longer-range dependencies, which improves performance. Among them, the gated models (LSTM, GRU) have better gradient flow. BERT's pretraining puts it one step above the LSTM and GRU. Finally, GPT, like BERT, is a pretrained transformer, but it is larger, which leads to the best performance.
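
The claim that mean pooling discards word order can be checked directly. The sketch below uses invented 2-dimensional toy vectors rather than real FastText embeddings, and the recurrence `h = 0.5 * h + x` is a deliberately simple stand-in for an order-sensitive model, not an LSTM.

```python
# Made-up 2-d "embeddings"; real FastText vectors would have hundreds of dimensions.
toy_embeddings = {
    "the": [0.25, 0.0],
    "food": [0.0, 0.25],
    "was": [0.25, 0.25],
    "good": [1.0, 0.5],
    "not": [-0.5, 0.0],
    "bad": [-1.0, -0.5],
}


def mean_pool(tokens):
    """Average the token vectors; the result ignores token order."""
    vectors = [toy_embeddings[t] for t in tokens]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_recurrent_state(tokens, decay=0.5):
    """Toy recurrence h = decay * h + x; later tokens weigh more, so order matters."""
    h = [0.0, 0.0]
    for t in tokens:
        x = toy_embeddings[t]
        h = [decay * h_d + x_d for h_d, x_d in zip(h, x)]
    return h


# Same bag of words, opposite sentiment.
sentence_a = "the food was good not bad".split()
sentence_b = "the food was bad not good".split()

same_mean = mean_pool(sentence_a) == mean_pool(sentence_b)
same_state = toy_recurrent_state(sentence_a) == toy_recurrent_state(sentence_b)
print(same_mean, same_state)
```

The two sentences receive identical mean-pooled vectors, so any classifier on top of them, including the MLP, must give them the same prediction. The order-sensitive recurrence produces different final states, which is the kind of signal a sequence model can exploit.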

## AI Use Disclosure (Required)

If you used any AI-enabled tools (e.g., ChatGPT, GitHub Copilot, Claude, or other LLM assistants) while working on this assignment, you must disclose that use here. The goal is transparency, not punishment.

In your disclosure, briefly include:
- **Tool(s) used:** (name + version if known)
- **How you used them:** (e.g., concept explanation, debugging, drafting code, rewriting text)
- **What you verified yourself:** (e.g., reran the notebook, checked outputs/plots, checked shapes, read documentation)
- **What you did *not* use AI for (if applicable):** (optional)

You are responsible for the correctness of your submission, even if AI suggested code or explanations.

#### <font color="red">Write your disclosure here.</font>


I used ChatGPT to debug and to provide a starting point for the PyTorch code. I then read, understood, and edited the code it gave me using the documentation.