## HW 5: LSTM Language Models
### COSC 426: Fall 2025, Colgate University

Use this notebook to run your experiments and answer questions. Add as many code or markdown chunks as you would like for each (sub)section. Please use markdown chunks for written responses. 

**If you use any external resources (e.g., code snippets, reference articles), please cite them in comments or text!**

## Part 1: Understanding and completing the four classes

In [2]:
import LM
import torch

import pandas as pd

### Part 1.1

In [3]:
dataset = LM.LM_Dataset(
    data_fname="data/yoda_train.tsv",
    vocab_fname="data/HW5_vocab.txt",
    max_length=50,
    lower=True,
)
print("dataset.vocab:", dataset.vocab)
print("dataset.sentences:", dataset.sentences)
print("dataset.sentids:", dataset.sentids)

dataset.vocab: {'darkness', 'forest', 'rain', 'seek', 'force', 'follow', 'in', 'light', 'path', '[EOS]', 'teach', 'night', 'they', 'fight', '[BOS]', 'the', 'i', 'we', 'you', 'through', 'sleep', 'jedi'}
dataset.sentences: ['the jedi the force seek', 'fight they do', 'the force the rain seek', 'the path the darkness teach', 'sleep the force do', 'the darkness the night teach', 'in the rain in the force fight the force do', 'fight we do', 'we the path teach', 'the forest the force follow', 'fight the force do', 'the force the rain seek', 'the forest the jedi seek', 'they the forest follow', 'the night the forest follow', 'fight the force do', 'in we fight the rain do', 'the forest in the darkness i teach', 'fight we do', 'fight the forest do', 'sleep the jedi do', 'the jedi i follow', 'fight the darkness do', 'fight the force do', 'the darkness in the rain the rain follow', 'we the jedi seek', 'in the light in the forest in the force sleep the force do', 'sleep you do', 'fight we do',

The `make_pairs` function creates training data for next-word prediction. For each encoded sentence, it uses all tokens except the last as the context and the same sequence shifted one position to the left (all tokens except the first) as the targets. Both are then left-padded to the maximum sequence length. The model therefore learns to predict each word from all the words that precede it. In contrast, a CBOW model uses a small context window around a target word and predicts the middle word from its neighbors.
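
The shifting and padding described above can be sketched in a few lines. This is a hypothetical stand-in, not the actual `make_pairs` code: the function name `make_pairs_sketch`, the `pad_id` value, the truncation to `max_length`, and the token ids are all assumptions made for illustration.

```python
def make_pairs_sketch(encoded, max_length, pad_id=0):
    """Hypothetical illustration of building (context, target) pairs."""
    contexts, targets = [], []
    for ids in encoded:
        ctx = ids[:-1][:max_length]  # every token except the last
        tgt = ids[1:][:max_length]   # same sequence shifted left by one
        n_pad = max_length - len(ctx)
        contexts.append([pad_id] * n_pad + ctx)  # left-pad to max_length
        targets.append([pad_id] * n_pad + tgt)
    return contexts, targets


# Illustrative ids: [BOS]=1, the=2, jedi=3, seek=4, [EOS]=5
contexts, targets = make_pairs_sketch([[1, 2, 3, 4, 5]], max_length=6)
print(contexts)  # [[0, 0, 1, 2, 3, 4]]
print(targets)   # [[0, 0, 2, 3, 4, 5]]
```

At every position, the target is the token that follows the corresponding context token, so one sentence yields a prediction problem at each time step.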

### Part 1.2

1. Why does the decoder take as input a tensor of nHidden and return tensor of vocabSize? Why does it not have the sequence length (i.e., the number of words in the sequence) as one of the dimensions?

    The decoder maps each LSTM hidden state of size nHidden to a vector of vocabSize scores, one per word in the vocabulary, from which the next-word probabilities are computed. Sequence length is not one of its dimensions because the same transformation is applied to every time step independently. The decoder simply turns hidden representations into word predictions.

2. You will notice that the LSTM layer returns more outputs than the Linear layer. What are these additional outputs that this layer generates? Why are they important?

    The LSTM returns the sequence of hidden states (one per time step) together with the final hidden state and the final cell state. The per-step hidden states are passed to the decoder to generate word predictions, while the final hidden and cell states carry the accumulated memory of the sequence forward. Together, these outputs allow the model to remember and use context.

3. What is the purpose of the init_hidden function?

    The `init_hidden` function initializes the hidden and cell states to zeros before each new batch. This prevents information from previous batches from affecting current computations.

4. Is there any difference in the loss function between the CBOW and LSTM_LM models? Why or why not?

    Both CBOW and the LSTM use cross-entropy loss, but they apply it differently. CBOW predicts a single target word from its context, while the LSTM predicts the next word at every position in a sequence. The loss function is the same, but in the LSTM it is applied across multiple time steps.
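
The shape reasoning in the answers above can be checked with a small sketch. It is not the `LM.LSTM_LM` code: the sizes, the `batch_first=True` layout, the zero-state construction, and all variable names are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sizes; vocab_size is an assumption, not read from the dataset.
batch, seq_len, n_embed, n_hidden, n_layers, vocab_size = 4, 7, 32, 64, 1, 22

lstm = nn.LSTM(n_embed, n_hidden, num_layers=n_layers, batch_first=True)
decoder = nn.Linear(n_hidden, vocab_size)

# Zero initial states, one per layer and per sequence in the batch.
h0 = torch.zeros(n_layers, batch, n_hidden)
c0 = torch.zeros(n_layers, batch, n_hidden)

embedded = torch.randn(batch, seq_len, n_embed)  # stand-in for embedding output
output, (h_n, c_n) = lstm(embedded, (h0, c0))
logits = decoder(output)  # Linear acts on the last dimension only

print(output.shape)  # torch.Size([4, 7, 64])  hidden state at every time step
print(h_n.shape)     # torch.Size([1, 4, 64])  final hidden state
print(c_n.shape)     # torch.Size([1, 4, 64])  final cell state
print(logits.shape)  # torch.Size([4, 7, 22])  one score vector per time step
```

The decoder never needs to know `seq_len`: the same weights are applied at each position, which is why only `n_hidden` and `vocab_size` appear in its definition.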

### Part 1.3

1. Compared to the CBOW trainer, there are a couple of additional steps in this trainer. What are these steps? Why are they important?

    The LSTM trainer adds the initialization of hidden and cell states before processing each batch and the reshaping of predictions and targets for sequence-level loss calculation. The initialization is necessary because the LSTM relies on these states to preserve information through time, and reshaping ensures the loss is computed correctly across all time steps.

2. Compared to the CBOW trainer, there is one additional step after the loss is computed. What is this step? Why is this important?

    After the loss is computed, the LSTM trainer performs gradient clipping (`clip_grad_norm`). This prevents gradients from becoming too large (the exploding-gradient problem).
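
A minimal sketch of the two trainer-specific steps discussed above, reshaping for the sequence-level loss and clipping the gradient norm. It is not the `LM_Trainer` code: the tensor sizes, the stand-in `decoder`, and the `max_norm` threshold are assumptions, and the sketch uses PyTorch's `torch.nn.utils.clip_grad_norm_`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, n_hidden, vocab_size = 4, 7, 64, 22  # illustrative sizes

decoder = nn.Linear(n_hidden, vocab_size)
hidden_states = torch.randn(batch, seq_len, n_hidden)  # stand-in for LSTM output
targets = torch.randint(0, vocab_size, (batch, seq_len))

logits = decoder(hidden_states)  # (batch, seq_len, vocab_size)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size),  # (batch * seq_len, vocab_size)
    targets.reshape(-1),             # (batch * seq_len,)
)
loss.backward()

max_norm = 1.0  # assumed threshold; the trainer's actual value is not shown here
norm_before = torch.nn.utils.clip_grad_norm_(decoder.parameters(), max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in decoder.parameters()))
print(float(norm_before), float(norm_after))
```

Flattening treats every time step as its own classification example, and clipping rescales all gradients together so that their combined norm does not exceed `max_norm`.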

### Part 1.4

1. In Lab5.ipynb answer the following question: what parts of the compute_loss function in this evaluator are different from the CBOW evaluator?
    
    The evaluator initializes hidden and cell states for each batch, feeds sequences through the model to get outputs at every time step, reshapes the logits to two dimensions and flattens the targets so that cross-entropy applies across all positions, and uses a batch size of one. The CBOW evaluator skips state initialization, produces one logit vector per example, and computes the loss directly without reshaping.

2. In Lab5.ipynb answer the following question: For the CBOW model, to generate the prediction for some input we took the label with the maximum logit value. In the case of language modeling why is this not appropriate? What do we actually want to get?

    Our goal is not to select a single most likely word but to estimate the probability over the entire vocabulary for the next word in a sequence/context. Taking the label with the maximum logit would only give one predicted word, losing information about how likely each possible word is. Instead, we need the full set of predicted probabilities so we can measure how well the model assigns high probability to the actual next word.
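
The difference between taking the maximum logit and reading off the probability of the actual next word can be illustrated with a toy example. The vocabulary, the logit values, and the "actual" next word below are made up for illustration and do not come from the trained models.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["the", "jedi", "force", "seek"]  # toy vocabulary (assumption)
logits = [2.0, 0.5, 1.0, -1.0]            # made-up scores for one position
probs = softmax(logits)

# What a classifier would report: only the single most likely word.
argmax_word = vocab[probs.index(max(probs))]

# What a language-model evaluation needs: the probability of the real next word.
actual_next = "force"
p_actual = probs[vocab.index(actual_next)]
neg_log_prob = -math.log(p_actual)

print(argmax_word, round(p_actual, 3), round(neg_log_prob, 3))
```

Here the argmax is "the", which says nothing about how much probability the model gave to "force"; the negative log-probability captures exactly that.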

## Part 2: Generating predictions from models trained on SAE and Yoda sentences

nEmbed: 32  
nHidden: 64  
nLayers: 1

With only about twenty words in the vocabulary, each word does not need a large vector to capture its meaning, so 32 embedding dimensions should be enough. The sentences are relatively short, so I believe one LSTM layer can model the context. A hidden size of 64 should give the network enough "memory" to learn small patterns like “the Jedi sleep” or “the Force fight in you” while staying small enough to train quickly.
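
To give a rough sense of how small this configuration is, the sketch below counts trainable parameters. It assumes an embedding layer, a PyTorch-style LSTM (four gates, two bias vectors per gate), and a linear decoder; the function name is hypothetical, and `vocab_size=22` is an illustrative value taken from the number of entries printed in Part 1.1, not read from `vocabSize`.

```python
def lstm_lm_param_count(vocab_size, n_embed, n_hidden, n_layers=1):
    """Hypothetical helper: parameters in embedding + LSTM + linear decoder."""
    embedding = vocab_size * n_embed
    lstm = 0
    for layer in range(n_layers):
        in_size = n_embed if layer == 0 else n_hidden
        # 4 gates, each with input weights, recurrent weights, and two biases
        lstm += 4 * (in_size * n_hidden + n_hidden * n_hidden + 2 * n_hidden)
    decoder = n_hidden * vocab_size + vocab_size  # weights + bias
    return embedding + lstm + decoder


total = lstm_lm_param_count(vocab_size=22, n_embed=32, n_hidden=64)
print(total)  # 27222
```

Under these assumptions the model has under thirty thousand parameters, most of them in the LSTM layer.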

In [4]:
sae_train = "data/sae_train.tsv"
sae_val = "data/sae_val.tsv"

yoda_train = "data/yoda_train.tsv"
yoda_val = "data/yoda_val.tsv"

eval_path = "data/eval.tsv"

vocab_path = "data/HW5_vocab.txt"

nEmbed = 32
nHidden = 64
nLayers = 1
max_length = 8
batch_size = 32
num_epochs = 20
lr = 0.2
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

sae_train_data = LM.LM_Dataset(
    data_fname=sae_train,
    vocab_fname=vocab_path,
    max_length=max_length,
    lower=True,
)
sae_val_data = LM.LM_Dataset(
    data_fname=sae_val,
    vocab_fname=vocab_path,
    max_length=max_length,
    lower=True,
)

yoda_train_data = LM.LM_Dataset(
    data_fname=yoda_train,
    vocab_fname=vocab_path,
    max_length=max_length,
    lower=True,
)
yoda_val_data = LM.LM_Dataset(
    data_fname=yoda_val,
    vocab_fname=vocab_path,
    max_length=max_length,
    lower=True,
)

In [5]:
cleaned_eval_data_fname = "data/cleaned_eval.tsv"
eval_df = pd.read_csv(eval_path, sep="\t", index_col=0)
cleaned_eval_df = eval_df[["sentence"]]
cleaned_eval_df.to_csv(cleaned_eval_data_fname, sep="\t")

eval_data = LM.LM_Dataset(
    data_fname=cleaned_eval_data_fname,
    vocab_fname=vocab_path,
    max_length=max_length,
    lower=True,
)

In [6]:
sae_model = LM.LSTM_LM(
    vocabSize=sae_train_data.vocabSize,
    nEmbed=nEmbed,
    nHidden=nHidden,
    nLayers=nLayers,
).to(device)

yoda_model = LM.LSTM_LM(
    vocabSize=yoda_train_data.vocabSize,
    nEmbed=nEmbed,
    nHidden=nHidden,
    nLayers=nLayers,
).to(device)

In [7]:
trainer = LM.LM_Trainer(
    num_epochs=num_epochs,
    lr=lr,
    batch_size=batch_size,
    device=device,
)

In [8]:
trainer.train(sae_model, sae_train_data, sae_val_data)

Epoch 0:	 Avg Train Loss: 2.10055	 Avg Val Loss: 1.3925
Epoch 10:	 Avg Train Loss: 0.72673	 Avg Val Loss: 0.70445


In [9]:
trainer.train(yoda_model, yoda_train_data, yoda_val_data)

Epoch 0:	 Avg Train Loss: 2.1776	 Avg Val Loss: 1.43395
Epoch 10:	 Avg Train Loss: 0.73493	 Avg Val Loss: 0.6624


In [10]:
evaluator = LM.LM_Evaluator(eval_data, device)
words_all, probs_all = evaluator.get_preds(yoda_model)

In [11]:
evaluator.save_preds({"yoda": yoda_model, "sae": sae_model}, "predictions/prediction.tsv")

## Part 3: Analyzing and interpreting the predictions with the analyze mode from NLPScholar

In [16]:
result_df = pd.read_csv('./results/result.tsv', sep='\t', index_col=0)
display(result_df)

| Unnamed: 0 | model | acc | diff | expected | unexpected | macrodiff |
|---|---|---|---|---|---|---|
| 0 | sae | 0.777218 | -1.731873 | 4.483778 | 6.215651 | -1.731873 |
| 1 | yoda | 0.717742 | -0.828962 | 4.882544 | 5.711506 | -0.828962 |


1. Which of the outputs of the analyze mode (by_word, by_pair, by_cond) did you decide to look at? Why?
   
    I focused on the by_cond output because it provides a straightforward summary (acc, diff) of how well each model distinguishes between expected and unexpected sentences.

2. Do you think SAE and Yoda models learned different things? Why or why not? Present concrete parts of your results in the ipynb cell output, and reference this in your answer.
   
   Yes, the SAE and Yoda models learned different things. In the by_cond results, the SAE model has a higher accuracy (about 0.78) and a difference that is larger in magnitude (about -1.73) than the Yoda model, which has an accuracy of about 0.72 and a difference of about -0.83. The diff column equals expected minus unexpected, so a more negative value means a wider gap between the two conditions. This means SAE distinguished more strongly between expected and unexpected sentences, finding expected sentences noticeably more predictable. Yoda also showed some learning but to a lesser extent, suggesting it captured the pattern more weakly.
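
As a quick sanity check on how the columns relate, the snippet below recomputes diff from the expected and unexpected values shown in the table above. The numbers are copied from that output; reading diff as expected minus unexpected is an inference from the arithmetic, not from the NLPScholar documentation.

```python
# Values copied from the by_cond output above.
rows = {
    "sae": {"expected": 4.483778, "unexpected": 6.215651, "diff": -1.731873},
    "yoda": {"expected": 4.882544, "unexpected": 5.711506, "diff": -0.828962},
}

matches = {}
for model, r in rows.items():
    recomputed = r["expected"] - r["unexpected"]
    matches[model] = abs(recomputed - r["diff"]) < 1e-5
    print(model, round(recomputed, 6), matches[model])
```

Both rows match, which supports reading a more negative diff as a larger gap between the expected and unexpected conditions.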


## Part 4 (optional): Implement a bidirectional version of the LSTM_LM 

We need to change the LSTM layer’s setting to make it bidirectional. A bidirectional LSTM runs both forward and backward passes, which means it produces twice as many hidden state features at each time step compared to a unidirectional one. Because of this, we must also double the input feature size of the decoder layer to match the new output dimensions and avoid a size mismatch error.

Also, the initialization function for the hidden and cell states must be changed to produce the correct shape: there is now one hidden state and one cell state per direction, so the first dimension of each state tensor is multiplied by 2. No other parts of the pipeline need to be changed.
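
The shape changes described above can be sketched as follows. This is not the `MaskedLM.LSTM_LM` code: the sizes, the `batch_first=True` layout, and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sizes; vocab_size is an assumption, not read from the dataset.
batch, seq_len, n_embed, n_hidden, n_layers, vocab_size = 4, 7, 32, 64, 1, 22

lstm = nn.LSTM(
    n_embed, n_hidden, num_layers=n_layers, batch_first=True, bidirectional=True
)
decoder = nn.Linear(2 * n_hidden, vocab_size)  # input features doubled

# Initial states need one slice per layer and per direction.
h0 = torch.zeros(2 * n_layers, batch, n_hidden)
c0 = torch.zeros(2 * n_layers, batch, n_hidden)

embedded = torch.randn(batch, seq_len, n_embed)  # stand-in for embedding output
output, (h_n, c_n) = lstm(embedded, (h0, c0))
logits = decoder(output)

print(output.shape)  # torch.Size([4, 7, 128])  forward and backward features
print(h_n.shape)     # torch.Size([2, 4, 64])   one final state per direction
print(logits.shape)  # torch.Size([4, 7, 22])
```

Leaving the decoder at `n_hidden` input features or the initial states at `n_layers` slices would produce a size mismatch, which is why both must change together.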

In [17]:
import MaskedLM as LMB

bi_sae_model = LMB.LSTM_LM(
    vocabSize=sae_train_data.vocabSize,
    nEmbed=nEmbed,
    nHidden=nHidden,
    nLayers=nLayers,
).to(device)

bi_yoda_model = LMB.LSTM_LM(
    vocabSize=yoda_train_data.vocabSize,
    nEmbed=nEmbed,
    nHidden=nHidden,
    nLayers=nLayers,
).to(device)

In [19]:
trainer.train(bi_sae_model, sae_train_data, sae_val_data)
trainer.train(bi_yoda_model, yoda_train_data, yoda_val_data)

Epoch 0:	 Avg Train Loss: 1.96818	 Avg Val Loss: 1.18771
Epoch 10:	 Avg Train Loss: 0.04275	 Avg Val Loss: 0.03273
Epoch 0:	 Avg Train Loss: 2.08096	 Avg Val Loss: 1.24085
Epoch 10:	 Avg Train Loss: 0.05345	 Avg Val Loss: 0.04088


In [20]:
evaluator.save_preds(
    {"yoda": bi_yoda_model, "sae": bi_sae_model}, "predictions/bi_prediction.tsv"
)

In [22]:
bi_result_df = pd.read_csv('results/bi_result.tsv', sep='\t', index_col=0)
display(bi_result_df)

| Unnamed: 0 | model | acc | diff | expected | unexpected | macrodiff |
|---|---|---|---|---|---|---|
| 0 | sae | 0.84627 | -2.073843 | 2.885408 | 4.959252 | -2.073843 |
| 1 | yoda | 0.573085 | 0.014839 | 3.100911 | 3.086072 | 0.014839 |


The bidirectional SAE model performed better, showing higher accuracy and a stronger distinction between expected and unexpected sentences. This improvement makes sense because standard English sentences flow naturally from left to right, and seeing both the beginning and the end helps the model form a more complete understanding of how words depend on each other, almost as if it had more training data. 

On the other hand, Yoda’s sentence structure (“sleep the Jedi do”) relies on a word order that goes against normal grammar, with important information appearing at the end ("do"). When the bidirectional Yoda model can already see that final word from the start, it is no longer surprised. As a result, it becomes less sensitive to the structure that defines Yoda-speak, leading to lower accuracy.