## HW 5: LSTM Language Models
### COSC 426: Fall 2025, Colgate University

Use this notebook to run your experiments and answer questions. Add as many code or markdown chunks as you would like for each (sub)section. Please use markdown chunks for written responses. 

**If you use any external resources (e.g., code snippets, reference articles), please cite them in comments or text!**

## Part 1: Understanding and completing the four classes

In [1]:

import LM
import torch
import pandas as pd
import MaskedLM

### Part 1.1

In [2]:
data_file = "data/sae_train.tsv"
valid_file = "data/sae_val.tsv"
eval_file = "data/eval.tsv"
vocab_file = "data/HW5_vocab.txt"
max_len = 5 # maximum number of words to consider at each point
lower = True
my_dataset = LM.LM_Dataset(data_file, vocab_file, max_len, lower)

In [3]:
print(my_dataset.vocab)
print(my_dataset.tokenized)

{'through', 'in', 'rain', 'jedi', 'the', 'we', 'i', 'they', 'seek', 'forest', 'sleep', 'follow', 'fight', 'force', 'teach', '[BOS]', 'you', 'darkness', '[EOS]', 'night', 'light', 'path'}
[['[BOS]', 'the', 'jedi', 'sleep', '[EOS]'], ['[BOS]', 'the', 'light', 'fight', '[EOS]'], ['[BOS]', 'the', 'force', 'fight', 'in', 'you', '[EOS]'], ['[BOS]', 'the', 'force', 'sleep', '[EOS]'], ['[BOS]', 'the', 'darkness', 'seek', 'the', 'darkness', '[EOS]'], ['[BOS]', 'the', 'force', 'seek', 'the', 'jedi', 'in', 'the', 'light', '[EOS]'], ['[BOS]', 'the', 'rain', 'fight', 'through', 'the', 'force', 'in', 'the', 'jedi', '[EOS]'], ['[BOS]', 'the', 'light', 'fight', 'in', 'you', '[EOS]'], ['[BOS]', 'the', 'jedi', 'fight', '[EOS]'], ['[BOS]', 'the', 'rain', 'seek', 'we', 'in', 'the', 'force', '[EOS]'], ['[BOS]', 'the', 'force', 'fight', '[EOS]'], ['[BOS]', 'the', 'jedi', 'fight', '[EOS]'], ['[BOS]', 'the', 'force', 'fight', '[EOS]'], ['[BOS]', 'the', 'night', 'sleep', '[EOS]'], ['[BOS]', 'they', 'fight', '[

How contexts and targets are specified when making pairs:
3. Given a sequence of words of length max_length, the context is the list of words from the first word to the last-but-one word; it does not include the last word in the sequence. 
The target is the list of words that starts with the second word of the sequence and ends with the last word of the sequence.
Both lists are padded with pad tokens and then converted to tensors.

This approach uses every word of the sequence except the last one as the context. In CBOW, by contrast, the context consisted of the words around (before and after) the target word. The target here is itself a sequence of words, the context shifted by one position so that it ends with the final word, whereas the CBOW lab used a single word as the expected target.
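
The shift-by-one relationship between context and target can be sketched in plain Python. This is only an illustration: the function name `make_pair`, the `[PAD]` token, right-side padding, and truncating to the first `max_len` tokens are all assumptions, and the real `LM.LM_Dataset` may window and pad differently.

```python
PAD = "[PAD]"  # hypothetical pad token; the real dataset class may use another one


def make_pair(tokens, max_len, pad=PAD):
    """Return (context, target) for one window of at most max_len tokens."""
    window = tokens[:max_len]
    context = window[:-1]  # first word up to the last-but-one word
    target = window[1:]    # second word up to the last word
    # Pad on the right (an assumption) so every pair has the same length.
    width = max_len - 1
    context = context + [pad] * (width - len(context))
    target = target + [pad] * (width - len(target))
    return context, target


context, target = make_pair(["[BOS]", "the", "jedi", "sleep", "[EOS]"], max_len=5)
short_context, short_target = make_pair(["[BOS]", "sleep", "[EOS]"], max_len=5)
```

At every position `i`, `target[i]` is the word that follows `context[i]`, so one window yields a prediction task for each position rather than for a single word.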

### Part 1.2

1. In building the LSTM model, we specify that it should learn nHidden attributes of the text, so at each position the LSTM outputs a vector of nHidden scores. The decode step takes these scores as input (hence an input size of nHidden) and produces one score per word in the vocabulary (hence an output size equal to the vocabulary size); a softmax turns these scores into a probability distribution over the vocabulary. The sequence length is not specified because the LSTM processes words one after another with the same weights at every position, so its parameters do not depend on the length of the input. 

2. The additional output of the LSTM is its updated hidden state, which carries information about the sequence processed so far; this information affects the probabilities the model assigns to the words being predicted. 

3. The function creates and initializes the starting hidden state for the model. 

4. The loss function is the same for both models because we compute the loss in the same way as before: compare the predictions with the targets, compute the cross-entropy loss, and adjust the weights accordingly. 
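
The four answers above can be tied together in a small PyTorch sketch. The class `TinyLSTMLM` is a hypothetical stand-in, not the actual `LM.LSTM_LM`; its layer names, the zero-valued initial state, `batch_first=True`, and the vocabulary size of 22 are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class TinyLSTMLM(nn.Module):
    """Hypothetical stand-in for LM.LSTM_LM; names and layout are assumptions."""

    def __init__(self, vocab_size, n_embed, n_hidden, n_layers):
        super().__init__()
        self.n_hidden, self.n_layers = n_hidden, n_layers
        self.embed = nn.Embedding(vocab_size, n_embed)
        self.lstm = nn.LSTM(n_embed, n_hidden, n_layers, batch_first=True)
        # The decoder maps each nHidden-sized state to one score per vocabulary word.
        self.decode = nn.Linear(n_hidden, vocab_size)

    def init_hidden(self, batch_size):
        # Starting hidden and cell states, each (n_layers, batch, n_hidden).
        shape = (self.n_layers, batch_size, self.n_hidden)
        return torch.zeros(shape), torch.zeros(shape)

    def forward(self, x, hidden):
        out, hidden = self.lstm(self.embed(x), hidden)
        return self.decode(out), hidden  # scores: (batch, seq_len, vocab_size)


vocab_size = 22  # illustrative value
model = TinyLSTMLM(vocab_size, n_embed=10, n_hidden=5, n_layers=2)
contexts = torch.randint(0, vocab_size, (3, 4))  # 3 sequences of 4 word indices
targets = torch.randint(0, vocab_size, (3, 4))
logits, hidden = model(contexts, model.init_hidden(batch_size=3))

# Flatten so that every position is one classification example for cross-entropy.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size), targets.reshape(-1))

# The same model accepts a longer sequence without any change to its layers.
longer_logits, _ = model(torch.randint(0, vocab_size, (3, 7)), model.init_hidden(3))
```

None of the layer sizes mention the sequence length, which is why the same model handles both the 4-word and the 7-word batches.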

### Part 1.3

1. The LSTM trainer initializes the hidden state, a step that does not happen in the CBOW trainer. This step is important because recurrent processing needs a starting state: the hidden state carries information from earlier words forward through the sequence.
After calling forward on the model, we reshape the result into a format that is comparable to the target. The CBOW trainer does not perform this step. It is needed here because forward produces output in a different shape from the target (predictions for every position in the sequence), so the two must be aligned before the loss is computed. 

2. This step normalizes the gradient. In the CBOW trainer we did not normalize the gradient and simply took the optimization step. Normalizing the gradient here guards against optimization steps that are too large, so that the model takes reasonably sized steps toward the optimal weights.  
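
One common way to keep steps reasonably sized is to rescale the gradient whenever its overall norm exceeds a threshold. The sketch below shows that idea on a flat list of numbers. It assumes the trainer's step is this kind of norm-based rescaling, which the trainer code (not shown here) may or may not match. The function name `clip_by_global_norm` is hypothetical.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already small enough: leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]


clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)    # norm 5.0 is scaled down to 1.0
unchanged = clip_by_global_norm([0.3, 0.4], max_norm=1.0)  # norm 0.5 is left as is
```

The direction of the gradient is preserved and only its length changes. In PyTorch, `torch.nn.utils.clip_grad_norm_` applies a comparable rescaling to a model's parameters between `loss.backward()` and `optimizer.step()`.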

### Part 1.4

1. Initializing the hidden layer of the model, and reshaping the output of running forward on the model. 

2. In this language model, we want to know how probable the target word is given the context (the preceding words). This means we take the probability the model assigns to the target word, which is not necessarily the word with the highest logit value. 
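
The difference between the probability of the target word and the highest-scoring word can be seen in a small pure-Python sketch. The logits and the target index are made-up numbers for a three-word vocabulary, not output from the trained models.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 0.5, 1.0]  # made-up scores for a 3-word vocabulary
target_index = 2          # index of the word that actually came next
probs = softmax(logits)

target_prob = probs[target_index]                           # probability of the target word
best_index = max(range(len(probs)), key=probs.__getitem__)  # word with the highest score
```

Here the highest-scoring word is at index 0, while the target word at index 2 receives a smaller probability. The evaluation needs that target probability, not the top prediction.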

## Part 2: Generating predictions from models trained on SAE and Yoda sentences

1. 
    nEmbed = 10
    Our vocabulary is small (20 words), so this embedding size should be enough to capture the attributes of words we want to learn.
    nHidden = 5 
    Our dataset is small, so we keep the number of hidden features low, but not so low that the model cannot represent the text.
    nLayers = 2 
    Given that our dataset is small, we use only 2 layers to avoid excessive computation time.


In [4]:
nEmbed = 10
nHidden = 5
nLayers = 2

#sae model
sae_model = LM.LSTM_LM(my_dataset.vocabSize, nEmbed, nHidden, nLayers) #create model
valid_data = LM.LM_Dataset(valid_file, vocab_file, max_len, lower) #create validation dataset

#train parameters
epochs = 25
batch_size = 50
lr = 0.02
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#train model
model_trainer = LM.LM_Trainer(epochs, lr, batch_size, device)

#train sae model
model_trainer.train(sae_model, my_dataset, valid_data)



Epoch 0:	 Avg Train Loss: 3.22488	 Avg Val Loss: 3.10107
Epoch 10:	 Avg Train Loss: 2.52987	 Avg Val Loss: 2.52523
Epoch 20:	 Avg Train Loss: 2.24584	 Avg Val Loss: 2.21038


In [5]:
#yoda model
yoda_file = "data/yoda_train.tsv"
yoda_valid_file = "data/yoda_val.tsv"

#train yoda model
yoda_data = LM.LM_Dataset(yoda_file, vocab_file, max_len, lower) #create yoda training dataset
yoda_valid_data = LM.LM_Dataset(yoda_valid_file, vocab_file, max_len, lower) #create yoda validation dataset
yoda_model = LM.LSTM_LM(yoda_data.vocabSize, nEmbed, nHidden, nLayers) #create yoda model
model_trainer.train(yoda_model, yoda_data, yoda_valid_data) #train yoda model

Epoch 0:	 Avg Train Loss: 3.1056	 Avg Val Loss: 3.01284
Epoch 10:	 Avg Train Loss: 2.47913	 Avg Val Loss: 2.43458
Epoch 20:	 Avg Train Loss: 1.92179	 Avg Val Loss: 1.87183


In [6]:
#evaluating models
eval_data = LM.LM_Dataset(eval_file, vocab_file, max_len, lower) #creates test dataset
model_evaluator = LM.LM_Evaluator(eval_data, device)

#evaluate yoda model
words, yoda_probs = model_evaluator.get_preds(yoda_model)

#evaluate sae model
words, sae_probs = model_evaluator.get_preds(sae_model)
print(words)
print(sae_probs)
print(yoda_probs)

[[0, 5, 3, 9, 14], [0, 9, 5, 3, 14], [0, 14, 5, 3, 9], [0, 9, 5, 3, 14], [0, 14, 9, 5, 3], [0, 9, 14, 5, 3], [0, 5, 3, 14, 9], [0, 9, 14, 5, 3], [0, 5, 3, 2, 5], [0, 9, 2, 5, 20], [0, 14, 5, 3, 2], [0, 9, 2, 5, 20], [0, 5, 3, 9, 14], [0, 2, 5, 20, 9], [0, 14, 2, 5, 20], [0, 2, 5, 20, 9], [0, 5, 3, 2, 5], [0, 2, 5, 20, 9], [0, 14, 2, 5, 21], [0, 2, 5, 20, 9], [0, 14, 2, 5, 20], [0, 9, 2, 5, 20], [0, 5, 3, 14, 2], [0, 9, 2, 5, 20], [0, 14, 9, 5, 3], [0, 2, 5, 20, 9], [0, 5, 3, 2, 5], [0, 2, 5, 20, 9], [0, 14, 2, 5, 20], [0, 2, 5, 20, 9], [0, 5, 3, 2, 5], [0, 2, 5, 20, 9], [0, 5, 3, 2, 5], [0, 9, 2, 5, 21], [0, 14, 5, 3, 2], [0, 9, 2, 5, 21], [0, 5, 3, 9, 14], [0, 2, 5, 21, 9], [0, 14, 2, 5, 21], [0, 2, 5, 21, 9], [0, 5, 3, 2, 5], [0, 2, 5, 21, 9], [0, 14, 2, 5, 20], [0, 2, 5, 21, 9], [0, 14, 2, 5, 21], [0, 9, 2, 5, 21], [0, 5, 3, 14, 2], [0, 9, 2, 5, 21], [0, 14, 9, 5, 3], [0, 2, 5, 21, 9], [0, 5, 3, 2, 5], [0, 2, 5, 21, 9], [0, 14, 2, 5, 21], [0, 2, 5, 21, 9], [0, 5, 3, 2, 5], [0, 2, 5,

In [None]:
#view some datapoints from the eval dataset
trial_data = model_evaluator.test_loader
for i,datapoint in enumerate(trial_data):
    print(f"Datapoint {i}: {datapoint}")
    if i == 2:
        break
    

Datapoint 0: [tensor([[ 0.,  6., 12.,  5.,  8.]]), tensor([[ 0, 12,  5,  8, 18]])]
Datapoint 1: [tensor([[ 0.,  6.,  8., 12.,  5.]]), tensor([[ 0,  8, 12,  5, 18]])]
Datapoint 2: [tensor([[ 0.,  6., 18., 12.,  5.]]), tensor([[ 0, 18, 12,  5,  8]])]


In [None]:
#save model predictions
save_fpath1 = "output/predictions1.tsv" 

models_preds = {
    "sae model": sae_model,
    "yoda model": yoda_model
}

#save predictions to file
model_evaluator.save_preds(models_preds, save_fpath1)


## Part 3: Analyzing and interpreting the predictions with the analyze mode from NLPScholar

1. Our goal is to determine whether the yoda model learned from the yoda text and the sae model from the sae text. The output to look at is the by_cond output, with the condition set to 'lang'. In the config file, we ask for the accuracy of each model on each of the two languages, and the by_cond output reports the results of this analysis. 

In [7]:
result_fpath = "results/result.tsv"
my_result = pd.read_csv(result_fpath, sep="\t")
my_result.head()

| Unnamed: 0.1 | Unnamed: 0 | model | lang | acc | diff | expected | unexpected | macrodiff |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | sae model | sae | 0.66129 | -0.239394 | 3.227138 | 3.466532 | -0.239394 |
| 1 | 1 | sae model | yoda | 0.602823 | -0.069829 | 3.396703 | 3.466532 | -0.069829 |
| 2 | 2 | yoda model | sae | 0.671371 | -0.232496 | 3.388431 | 3.620927 | -0.232496 |
| 3 | 3 | yoda model | yoda | 0.701613 | -0.256076 | 3.364851 | 3.620927 | -0.256076 |


2. Based on the result table above, we can conclude that both models learned from their training data. The sae model obtained a higher accuracy on sae sentences (0.66) than on yoda sentences (0.60), which indicates that it learned the patterns of its sae training sentences. Likewise, the yoda model obtained a higher accuracy on yoda sentences (0.70) than on sae sentences (0.67), which indicates that it learned the patterns of the yoda sentences.  
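
The comparison can be made explicit by computing, for each model, the accuracy on its training language minus the accuracy on the other language. The accuracies below are copied from the result table; the helper name `in_domain_gap` is hypothetical.

```python
# Accuracies copied from the by_cond result table: (model, lang) -> acc.
acc = {
    ("sae model", "sae"): 0.66129,
    ("sae model", "yoda"): 0.602823,
    ("yoda model", "sae"): 0.671371,
    ("yoda model", "yoda"): 0.701613,
}


def in_domain_gap(model, train_lang, other_lang):
    """Accuracy on the training language minus accuracy on the other language."""
    return acc[(model, train_lang)] - acc[(model, other_lang)]


sae_gap = in_domain_gap("sae model", "sae", "yoda")    # about 0.058
yoda_gap = in_domain_gap("yoda model", "yoda", "sae")  # about 0.030
```

Both gaps are positive, which supports the conclusion. The comparison is within each model: the yoda model's accuracy on sae sentences (0.671) is slightly higher than the sae model's own accuracy on sae sentences (0.661).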

## Part 4 (optional): Implement a bidirectional version of the LSTM_LM 

For the bidirectional model, we want to consider context from both the left and the right. In the LSTM layer, we specify that the model is bidirectional. The output of this layer then contains results for both the left and right contexts, and we want the decoder to consider both. Thus, we multiply nHidden by two and decode the result of concatenating the left-context and right-context outputs.
Another modification was enlarging the hidden state created in the initialize hidden function so that it holds states for both the left and right directions. 
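
A minimal PyTorch sketch of these two modifications follows. The class `TinyBiLSTMLM` is a hypothetical stand-in for `MaskedLM.Masked_LM`, whose actual code may differ; the layer names, zero-valued initial state, and vocabulary size of 22 are assumptions.

```python
import torch
import torch.nn as nn


class TinyBiLSTMLM(nn.Module):
    """Hypothetical stand-in for MaskedLM.Masked_LM; the real class may differ."""

    def __init__(self, vocab_size, n_embed, n_hidden, n_layers):
        super().__init__()
        self.n_hidden, self.n_layers = n_hidden, n_layers
        self.embed = nn.Embedding(vocab_size, n_embed)
        self.lstm = nn.LSTM(n_embed, n_hidden, n_layers,
                            batch_first=True, bidirectional=True)
        # Left and right outputs are concatenated, so the decoder sees 2 * n_hidden.
        self.decode = nn.Linear(2 * n_hidden, vocab_size)

    def init_hidden(self, batch_size):
        # One state per layer and per direction: 2 * n_layers in the first dimension.
        shape = (2 * self.n_layers, batch_size, self.n_hidden)
        return torch.zeros(shape), torch.zeros(shape)

    def forward(self, x, hidden):
        out, hidden = self.lstm(self.embed(x), hidden)  # out: (batch, seq_len, 2 * n_hidden)
        return self.decode(out), hidden


vocab_size = 22  # illustrative value
bi_model = TinyBiLSTMLM(vocab_size, n_embed=10, n_hidden=5, n_layers=2)
tokens = torch.randint(0, vocab_size, (3, 4))
bi_logits, bi_hidden = bi_model(tokens, bi_model.init_hidden(batch_size=3))
```

With `bidirectional=True`, `nn.LSTM` expects the initial states to have `2 * num_layers` in their first dimension and returns outputs of width `2 * hidden_size`, which is why both the decoder input size and the initial hidden state change.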

In [7]:
#create the bidirectional language model
sae_bidirect_model = MaskedLM.Masked_LM(my_dataset.vocabSize, nEmbed, nHidden, nLayers)
yoda_bidirect_model = MaskedLM.Masked_LM(yoda_data.vocabSize, nEmbed, nHidden, nLayers)

#create trainer for bidirectional model
bidirect_trainer = MaskedLM.MaskedLM_Trainer(epochs, lr, batch_size, device)

#train bidirectional model
bidirect_trainer.train(sae_bidirect_model, my_dataset, valid_data) #train sae bidirectional model
bidirect_trainer.train(yoda_bidirect_model, yoda_data, yoda_valid_data) #train yoda bidirectional model



Epoch 0:	 Avg Train Loss: 3.0744	 Avg Val Loss: 2.95832
Epoch 10:	 Avg Train Loss: 2.34887	 Avg Val Loss: 2.32229
Epoch 20:	 Avg Train Loss: 1.96948	 Avg Val Loss: 1.95021
Epoch 0:	 Avg Train Loss: 3.19692	 Avg Val Loss: 3.08713
Epoch 10:	 Avg Train Loss: 2.31924	 Avg Val Loss: 2.25816
Epoch 20:	 Avg Train Loss: 1.9581	 Avg Val Loss: 1.90666


In [8]:
#evaluate bidirectional models
#evaluate sae bidirectional model
words, sae_bidirect_probs = model_evaluator.get_preds(sae_bidirect_model)

#evaluate yoda bidirectional model
words, yoda_bidirect_probs = model_evaluator.get_preds(yoda_bidirect_model)

print(sae_bidirect_probs)
print(yoda_bidirect_probs)

[[0.9185004830360413, 0.5681387186050415, 0.018598051741719246, 0.06379152089357376, 0.04429509490728378], [0.9268451929092407, 0.020502077415585518, 0.5248281955718994, 0.023121239617466927, 0.04283291846513748], [0.923264741897583, 0.031211834400892258, 0.4522002339363098, 0.023807382211089134, 0.06001531332731247], [0.9268451929092407, 0.020502077415585518, 0.5248281955718994, 0.023121239617466927, 0.04283291846513748], [0.9333169460296631, 0.025065647438168526, 0.039158642292022705, 0.42883583903312683, 0.02520556002855301], [0.9319940209388733, 0.017987607046961784, 0.04295837879180908, 0.35472315549850464, 0.025398017838597298], [0.9145044088363647, 0.5597018599510193, 0.02080269157886505, 0.047033149749040604, 0.05808540806174278], [0.9319940209388733, 0.017987607046961784, 0.04295837879180908, 0.35472315549850464, 0.025398017838597298], [0.9077674150466919, 0.5563839673995972, 0.021715104579925537, 0.03942102566361427, 0.19430285692214966], [0.9286812543869019, 0.01989657245576

In [9]:
#save bidirectional model predictions
save_fpath2 = "output/bidirect_predictions1.tsv" 

bi_models_preds = {
    "bi_sae model": sae_bidirect_model,
    "bi_yoda model": yoda_bidirect_model
}

#save predictions to file
model_evaluator.save_preds(bi_models_preds, save_fpath2)

In [10]:
result_fpath2 = "results/bidirect_result.tsv"
my_result = pd.read_csv(result_fpath2, sep="\t")
my_result.head()

| Unnamed: 0.1 | Unnamed: 0 | model | lang | acc | diff | expected | unexpected | macrodiff |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | bi_sae model | sae | 0.782258 | -0.378419 | 3.199289 | 3.577708 | -0.378419 |
| 1 | 1 | bi_sae model | yoda | 0.631048 | -0.132776 | 3.444932 | 3.577708 | -0.132776 |
| 2 | 2 | bi_yoda model | sae | 0.582661 | -0.180726 | 3.200094 | 3.380821 | -0.180726 |
| 3 | 3 | bi_yoda model | yoda | 0.72379 | -0.306833 | 3.073988 | 3.380821 | -0.306833 |


As expected, the bidirectional sae model performs better on sae sentences than on yoda sentences. The converse is true for the bidirectional yoda model, which performs better on yoda sentences than on sae sentences. 
Compared with the unidirectional models, these bidirectional models perform better: they obtain higher accuracies on the type of sentences they were trained on. For instance, the causal sae and yoda models obtained accuracies of 0.66 and 0.70, respectively, on their own sentence types, whereas the masked models obtained accuracies of 0.78 for sae and 0.72 for yoda. 