# Lab 7: Continuous Bag of Words with PyTorch
### COSC 426: Fall 2025, Colgate University

Use this notebook to answer the questions in `Lab7.md`. Make sure to include in this notebook all the tests and experiments you run. Make sure to also cite any external resources you use. 

## Part 1: Familiarize yourself with the different components of the CBOW model

Add as many code chunks and markdown chunks as required to answer the questions in this part.

In [56]:
import os

import torch
import pickle

import CBOW

# For colorcoding
RED = "\033[31m"
GREEN = "\033[92m"
BLUE = "\033[94m"
END = "\033[0m"

### Part 1.1


In [57]:
data4 = CBOW.CBOW_Dataset("./data/sample-alice.txt", "./data/sample_vocab.txt", 4)

In [58]:
print(f"Num training examples: {len(data4.X)}")

Num training examples: 336


In [59]:
data2 = CBOW.CBOW_Dataset("./data/sample-alice.txt", "./data/sample_vocab.txt", 2)

In [60]:
print(f"Num training examples: {len(data2.X)}")

Num training examples: 336


1. Create a `CBOW_Dataset` object with `sample_alice.txt`, `sample_vocab.txt`, and a window size of 4. How many training examples does the dataset have? (*Hint: Remember you can access the variables in the object's init*) 

    There are 336 training examples.

2. Create a `CBOW_Dataset` object with `sample_alice.txt`, `sample_vocab.txt`, and a window size of 2. How many training examples does the dataset have?

    There are 336 training examples.

3. Does window size affect the number of training examples, why or why not? 

    No. Because we are using padding, both training sets have the same number of examples, which is the number of words in the text (336).

4. What does one training example look like? What is the sequence of steps to go from text in the form of a string to the final format and data types of the training example? 

    A training example is a sequence of integer token IDs of length window size * 2, paired with the ID of the word to predict. We use window size * 2 together with padding so that every word in the text, including words near a sentence boundary, gets a full window. To build the examples, we first divide the text into sentences and then divide each sentence into tokens. The tokens are then converted into integer IDs using the vocab:id mapping. Finally, we chunk these IDs into pairs of the form context words:target word, where the number of context words is set by the window size. 

5. Is this final dataset case-sensitive (i.e., does it treat lower and upper case differently)? If it is, how can you change it to not be case-sensitive and vice versa? 

    The final dataset is not case-sensitive. To make it case-sensitive, we can remove the `.lower()` call in the preprocessing so that capitalized words are treated as different tokens. 
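
The steps described in questions 3 and 4 can be sketched as follows. This is a minimal illustration, not the code in `CBOW.py`: the function name `make_examples`, the `<pad>` and `<unk>` tokens, the period-based sentence split, and the symmetric window (window size words on each side of the target) are all assumptions made for the example, and plain Python lists stand in for the tensors a PyTorch dataset would store.

```python
PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def make_examples(text, vocab, window_size):
    """Return one (context_ids, target_id) pair per token in text."""
    word2id = {word: idx for idx, word in enumerate(vocab)}
    examples = []
    # Lowercasing here is what makes this toy dataset case-insensitive.
    for sentence in text.lower().split("."):
        tokens = sentence.split()
        # Pad both ends so that words near the edges still get a full window.
        padded = [PAD] * window_size + tokens + [PAD] * window_size
        for pos in range(window_size, window_size + len(tokens)):
            context = padded[pos - window_size:pos] + padded[pos + 1:pos + 1 + window_size]
            context_ids = [word2id.get(w, word2id[UNK]) for w in context]
            target_id = word2id.get(padded[pos], word2id[UNK])
            examples.append((context_ids, target_id))
    return examples


toy_vocab = [PAD, UNK, "alice", "was", "tired", "she", "sat", "down"]
toy_text = "Alice was tired. She sat down."

examples_w2 = make_examples(toy_text, toy_vocab, window_size=2)
examples_w4 = make_examples(toy_text, toy_vocab, window_size=4)

print(len(examples_w2), len(examples_w4))  # 6 6
print(examples_w2[0])  # ([0, 0, 3, 4], 2)
```

Because every token gets a padded window, the number of examples depends only on the number of tokens, while the context length doubles when the window size doubles.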

### Part 1.2


1. In the `__init__` part of the `CBOW_Model`, you are initializing a `torch.nn.Embedding` layer. Conceptually, what does this layer do? What are the input and output dimensions of this layer? Why does this make sense?

    Conceptually, the `torch.nn.Embedding` layer is a weight matrix that maps each input word ID to a dense vector, so the number of weights is the size of the input * the size of the output. The input dimension is the size of the vocab and the output dimension is `nEmbed`. This makes sense because we want the embedding to be denser than a representation the size of the vocab.

2. You are also initializing a `torch.nn.Linear` layer. Conceptually, what does this layer do? What are the input and output dimensions of this layer? Why does this make sense? 

    Conceptually, the linear layer is the weights that convert the output vector of the Embedding layer into one score per word in the vocab. The input dimension is `nEmbed` and the output dimension is the vocab size. This makes sense because in the end we want a prediction over every word in the vocab. 

3. Describe in your own words what you think is happening in the `forward` function. 

    The `forward` function applies the embedding weights to the input and calculates the average of the resulting vectors.

4. Describe in your own words what you think is happening in the `loss` function.

    The `loss` function calculates the cross-entropy loss between the predictions and the target words. 
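
The answers above can be made concrete with a small stand-in model. This is a sketch under assumptions, not the `CBOW_Model` from `CBOW.py`: the class name `TinyCBOW` is hypothetical, the attribute name `embed` and the argument order (embedding size, then vocab size) mirror how the notebook uses the real model, and applying the linear layer inside `forward` is an assumption.

```python
import torch


class TinyCBOW(torch.nn.Module):
    """Hypothetical stand-in for a CBOW model: embed, average, project."""

    def __init__(self, n_embed, vocab_size):
        super().__init__()
        # One n_embed-dimensional vector per word in the vocab.
        self.embed = torch.nn.Embedding(vocab_size, n_embed)
        # Maps the averaged context vector to one score per vocab word.
        self.linear = torch.nn.Linear(n_embed, vocab_size)

    def forward(self, X):
        vectors = self.embed(X)          # (batch, context_len, n_embed)
        averaged = vectors.mean(dim=1)   # (batch, n_embed)
        return self.linear(averaged)     # (batch, vocab_size)

    def loss(self, y_pred, y_target):
        # Cross-entropy between the scores and the target word IDs.
        return torch.nn.functional.cross_entropy(y_pred, y_target)


torch.manual_seed(0)
toy = TinyCBOW(n_embed=50, vocab_size=151)
X = torch.randint(0, 151, (8, 8))   # 8 examples, 8 context word IDs each
y = torch.randint(0, 151, (8,))     # 8 target word IDs

logits = toy(X)
print(logits.shape)                 # torch.Size([8, 151])
print(toy.loss(logits, y).item())
```

The shapes show why the two layers mirror each other: the embedding goes from vocab size to `n_embed`, and the linear layer goes back from `n_embed` to vocab size.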

### Part 1.3


1. When you create a `CBOW_Trainer` object, you also have to pass in the following parameters. Explain how these parameters are used during training. 

   - num_epochs: how many times we pass over the full training data.
   - lr: the learning rate, i.e., the step size of each weight update.
   - batch_size: the number of examples we pass through the model in each forward pass.
   - train_data: the data we use to train the model.
   - val_data: held-out data on which the loss is reported during training.
   - device: the device we want to use (cpu, cuda, etc.).

2. We are loading training data using `torch.utils.data.DataLoader`. What are the parameters for this function? What is the format in which this function returns the data? 

    - train_data: dataset from which to load the data.
    - batch_size: how many samples per batch to load
    - shuffle: set to True to have the data reshuffled at every epoch (default: False).

    It returns an iterable over batches; each batch holds the inputs and targets for `batch_size` examples.

3. The following line calls the `forward` function of the model and saves the output. 
    ```
    y_pred = model(X)
    ```
    Describe what you think each of the following lines are doing. 
    
   * `X,y_target = X.to(self.device), y_target.to(self.device)`: Moves the batch of inputs and targets to the selected device (e.g., the GPU). 
   * `loss = model.loss(y_pred, y_target)`: Computes the cross-entropy loss between the predictions and the targets.
   * `optimizer.zero_grad()`: Clears the gradients left over from the previous update. 
   * `loss.backward()`: Calculates the gradient of the loss with respect to each weight. It does not change the weights itself. 
   * `optimizer.step()`: Updates the weights using those gradients in order to reduce the loss.
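
Putting the pieces of this part together, a training loop might look roughly like the sketch below. It is not the `CBOW_Trainer` implementation: the random toy data, the variable names, the use of `torch.nn.EmbeddingBag` (which looks up and averages embeddings in one layer) as a compact stand-in model, and the choice of SGD as the optimizer are all assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy data: 32 examples, each with 8 context word IDs and 1 target word ID.
X_all = torch.randint(0, 20, (32, 8))
y_all = torch.randint(0, 20, (32,))
loader = DataLoader(TensorDataset(X_all, y_all), batch_size=8, shuffle=True)

device = torch.device("cpu")
net = torch.nn.Sequential(
    torch.nn.EmbeddingBag(20, 16, mode="mean"),  # lookup + average
    torch.nn.Linear(16, 20),                     # scores over the vocab
).to(device)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.5)

epoch_losses = []
for epoch in range(30):                      # num_epochs
    total = 0.0
    for X, y_target in loader:               # each batch is an (inputs, targets) pair
        X, y_target = X.to(device), y_target.to(device)
        y_pred = net(X)                      # forward pass
        loss = loss_fn(y_pred, y_target)
        optimizer.zero_grad()                # clear old gradients
        loss.backward()                      # compute new gradients
        optimizer.step()                     # update the weights
        total += loss.item() * len(X)
    epoch_losses.append(total / len(X_all))  # average loss per example

print(f"first epoch: {epoch_losses[0]:.3f}, last epoch: {epoch_losses[-1]:.3f}")
```

Each pass of the inner loop handles one batch, and each pass of the outer loop is one epoch, which is why the number of weight updates is the number of epochs times the number of batches.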

### Part 1.4


1. Describe what `compute_loss` and `get_preds` do. Which of these two functions is closest to the evaluate mode in `NLPScholar`?

    The `compute_loss` function first moves each batch of data to the selected device (e.g., CPU, GPU), then predicts y given X using the model. After the prediction, the function calculates the loss by comparing the predicted result to the expected result. This process is repeated for all the examples in the data to get the total loss, which is then divided by the number of examples to get the average loss. 

    Instead of calculating the loss of the predicted values, `get_preds` simply finds and returns the word to which the model assigned the highest score for each example in the data. 

    I would say the latter function, `get_preds`, is closer to the evaluate mode in `NLPScholar` because NLPScholar also focuses on the accuracy of each prediction rather than the loss. 

2. Why do these functions have `@torch.no_grad()`?

    Gradients are used in a process of training a model. However, since the functions are only used in the context of evaluating the models, gradients are not used and thus there is no need to calculate them.

3. Say you have a `CBOW_Model` object `model`, and a `CBOW_Dataset` object called `test_data`. How will you use this class to calculate the loss of `model` on `test_data`? 

    ```python
    evaluator = CBOW_Evaluator(test_data, batch_size, device)
    evaluator.compute_loss(model)
    ```
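
To make the two evaluation functions more concrete, here is a rough sketch of what gradient-free evaluation can look like. The function names `average_loss` and `predictions`, the toy data, and the `torch.nn.EmbeddingBag` stand-in model are assumptions for illustration; the real `CBOW_Evaluator` may be organized differently.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


@torch.no_grad()  # no gradients are tracked inside this function
def average_loss(net, loader):
    total, count = 0.0, 0
    for X, y in loader:
        logits = net(X)
        # Sum the per-example losses so the final average is per example.
        total += torch.nn.functional.cross_entropy(logits, y, reduction="sum").item()
        count += len(y)
    return total / count


@torch.no_grad()
def predictions(net, loader):
    gold, pred = [], []
    for X, y in loader:
        pred.extend(net(X).argmax(dim=1))  # highest-scoring word per example
        gold.extend(y)
    return gold, pred


torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.EmbeddingBag(20, 16, mode="mean"),
    torch.nn.Linear(16, 20),
)
data = TensorDataset(torch.randint(0, 20, (32, 8)), torch.randint(0, 20, (32,)))
loader = DataLoader(data, batch_size=8)

avg = average_loss(net, loader)
gold, pred = predictions(net, loader)
acc = (torch.stack(gold) == torch.stack(pred)).float().mean().item()
print(f"loss: {avg:.2f}, accuracy: {acc * 100:.2f}%")
```

The accuracy line uses the same `torch.stack(...) == torch.stack(...)` comparison as the evaluation cells in Part 2.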

## Part 2: Train and evaluate a CBOW model on toy data, and explore the word embeddings

Add as many code chunks and markdown chunks as required to answer the questions in this part.


In [61]:
args = {"cuda": True, "mps": True}  # cuda ==> mps ==> cpu


def get_device(args: dict):
    # dict.get falls back to False when a flag is missing from args
    use_cuda = args.get("cuda", False) and torch.cuda.is_available()
    use_mps = args.get("mps", False) and torch.backends.mps.is_available()

    if use_cuda:
        device = torch.device("cuda")
    elif use_mps:
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    return device


device = get_device(args)

### Initialize model and datasets

In [62]:
train_data = CBOW.CBOW_Dataset(
    fname="./data/sample-alice.txt",
    vocab_fname="./data/sample_vocab.txt",
    window_size=4,
)

val_data = CBOW.CBOW_Dataset(
    fname="./data/sample-lookingglass.txt",
    vocab_fname="./data/sample_vocab.txt",
    window_size=4,
)

In [63]:
model = CBOW.CBOW_Model(50, train_data.vocabSize)

### Evaluate randomly initialized model

* Report loss and accuracy on the training data
* Report cosine similarity between the 4 word pairs

In [64]:
ev = CBOW.CBOW_Evaluator(
    test_data=val_data,
    batch_size=8,
    device=device,
)

In [65]:
model = model.to(device)  # sending the model to the selected device

# loss
loss_rand = ev.compute_loss(model)

# acc
gold, pred_rand = ev.get_preds(model)
acc_rand = (torch.stack(gold) == torch.stack(pred_rand)).float().mean()

In [66]:
print("=" * 40)
print(f"{BLUE}Randomly Initialized Model{END}")
print("-" * 40)
print(f"Loss: {loss_rand:.2f}")
print(f"Accuracy: {acc_rand*100:.2f}%")
print("=" * 40)

Randomly Initialized Model
----------------------------------------
Loss: 5.05
Accuracy: 0.54%
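
A quick sanity check on these numbers, assuming the vocabulary size of 151 that `train_data.vocabSize` reports for this toy data: a model that spreads its probability evenly over the vocabulary has a cross-entropy loss of ln(vocab size) and an expected accuracy of 1 / vocab size. The variable names below are illustrative.

```python
import math

vocab_size = 151  # value reported by train_data.vocabSize in this notebook

uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess
chance_accuracy = 1 / vocab_size     # expected accuracy of a uniform guess

print(f"Uniform-guess loss: {uniform_loss:.2f}")               # 5.02
print(f"Chance accuracy: {chance_accuracy * 100:.2f}%")        # 0.66%
```

Both values are close to the loss of 5.05 and the accuracy of 0.54% reported above, which is what we would expect from a model that has not learned anything yet.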


In [None]:
embedding_rand = model.embed.weight.data

seqs = [
    ["think", "thought"],
    ["think", "tired"],
    ["sleepy", "thought"],
    ["sleepy", "tired"],
]

print("=" * 40)
print(f"{BLUE}Cosine Similarities{END}")
print("-" * 40)
for idx, seq in enumerate(seqs):
    w1, w2 = train_data.encode(seq)

    v1 = embedding_rand[w1]
    v2 = embedding_rand[w2]

    similarity = torch.nn.functional.cosine_similarity(v1, v2, dim=0)

    print(f"{idx+1}. {seq[0]}-{seq[1]}: {RED}{similarity:.2f}{END}")
print("=" * 40)

Cosine Similarities
----------------------------------------
1. think-thought: 0.12
2. think-tired: -0.05
3. sleepy-thought: 0.04
4. sleepy-tired: 0.03
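
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths, so it ranges from -1 (opposite directions) to 1 (same direction). The short check below uses made-up vectors, not embeddings from the model, to show that the manual formula agrees with `torch.nn.functional.cosine_similarity`.

```python
import torch

# Made-up vectors for illustration only.
v1 = torch.tensor([1.0, 2.0, 3.0])
v2 = torch.tensor([2.0, 4.0, 6.5])

# cos(v1, v2) = (v1 . v2) / (|v1| * |v2|)
manual = torch.dot(v1, v2) / (v1.norm() * v2.norm())
builtin = torch.nn.functional.cosine_similarity(v1, v2, dim=0)

print(f"manual: {manual:.4f}, builtin: {builtin:.4f}")
```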


### Train the model

In [68]:
trainer = CBOW.CBOW_Trainer(
    num_epochs=1000,
    lr=0.5,
    batch_size=8,
    train_data=train_data,
    val_data=val_data,
    device=device,
)

In [69]:
trainer.train(model)

# rand_model_path = "./models/rand.pkl"

# if not os.path.exists(rand_model_path):
#     trainer.train(model)
#     with open(rand_model_path, 'wb') as f:
#         pickle.dump(model, f)

# with open(rand_model_path, "rb") as f:
#     model = pickle.load(f)

Epoch 0:	 Avg Train Loss: 4.78259	 Avg Val Loss: 5.44737
Epoch 20:	 Avg Train Loss: 0.05395	 Avg Val Loss: 10.30407
Epoch 40:	 Avg Train Loss: 0.02395	 Avg Val Loss: 10.69626
Epoch 60:	 Avg Train Loss: 0.02867	 Avg Val Loss: 10.76187
Epoch 80:	 Avg Train Loss: 0.01161	 Avg Val Loss: 11.02757
Epoch 100:	 Avg Train Loss: 0.01075	 Avg Val Loss: 11.25019
Epoch 120:	 Avg Train Loss: 0.01036	 Avg Val Loss: 11.51962
Epoch 140:	 Avg Train Loss: 0.01007	 Avg Val Loss: 11.64922
Epoch 160:	 Avg Train Loss: 0.01118	 Avg Val Loss: 11.75806
Epoch 180:	 Avg Train Loss: 0.00971	 Avg Val Loss: 11.82172
Epoch 200:	 Avg Train Loss: 0.00898	 Avg Val Loss: 11.90032
Epoch 220:	 Avg Train Loss: 0.01032	 Avg Val Loss: 12.04747
Epoch 240:	 Avg Train Loss: 0.01166	 Avg Val Loss: 12.10977
Epoch 260:	 Avg Train Loss: 0.00688	 Avg Val Loss: 12.18542
Epoch 280:	 Avg Train Loss: 0.00725	 Avg Val Loss: 12.24286
Epoch 300:	 Avg Train Loss: 0.00855	 Avg Val Loss: 12.32814
Epoch 320:	 Avg Train Loss: 0.00818	 Avg Val Lo

### Evaluate trained model

* Report loss and accuracy on the training data
* Report cosine similarity between the 4 word pairs

In [70]:
model = model.to(device)  # sending the model to the selected device

# loss
loss_1000 = ev.compute_loss(model)

# acc
gold, pred_1000 = ev.get_preds(model)
acc_1000 = (torch.stack(gold) == torch.stack(pred_1000)).float().mean()

In [71]:
print("=" * 40)
print(f"{BLUE}Trained Model (1000 epochs){END}")
print("-" * 40)
print(f"Loss: {loss_1000:.2f}")
print(f"Accuracy: {acc_1000*100:.2f}%")
print("=" * 40)

Trained Model (1000 epochs)
----------------------------------------
Loss: 13.69
Accuracy: 9.24%


In [72]:
embedding_1000 = model.embed.weight.data

print("=" * 40)
print(f"{BLUE}Cosine Similarities: 1000 epochs{END}")
print("-" * 40)
for idx, seq in enumerate(seqs):
    w1, w2 = train_data.encode(seq)

    v1 = embedding_1000[w1]
    v2 = embedding_1000[w2]

    similarity = torch.nn.functional.cosine_similarity(v1, v2, dim=0)

    print(f"{idx+1}. {seq[0]}-{seq[1]}: {RED}{similarity:.2f}{END}")
print("=" * 40)

Cosine Similarities: 1000 epochs
----------------------------------------
1. think-thought: 0.44
2. think-tired: -0.16
3. sleepy-thought: 0.13
4. sleepy-tired: 0.09


### Discussion and reflection

7. Do you think that the model has learned the task? Do you think the model has learned useful embeddings?

I think the model has learned the task, but only to a very limited degree. While the accuracy increased substantially, from 0.54% to about 9%, 9% is still quite low in absolute terms. It is interesting that the validation loss more than doubled (from 5.05 to 13.69) while the accuracy increased. Since the training loss fell to about 0.01 over the same period, the model appears to be overfitting the small training text. The embeddings are also only partly useful, as the "sleepy-tired" pair still has a very low cosine similarity (0.09). However, relative to the randomly initialized embeddings, this is a notable improvement, since some of the pairs ("think-thought", "sleepy-thought") now have more sensible values. 
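
One way loss and accuracy can rise together is when a model becomes very confident: it gets a few more examples right, but its confident mistakes are penalized heavily by cross-entropy. The toy probabilities below are invented to illustrate this effect and are not taken from the trained model; the function name `loss_and_accuracy` is hypothetical.

```python
import torch

targets = torch.tensor([0, 1, 2, 3])

# A hesitant model: always slightly wrong, never confident.
hesitant = torch.tensor([
    [0.3, 0.4, 0.2, 0.1],
    [0.4, 0.3, 0.2, 0.1],
    [0.4, 0.2, 0.3, 0.1],
    [0.4, 0.2, 0.1, 0.3],
])
# A confident model: always predicts word 0 with probability 0.997.
confident = torch.tensor([[0.997, 0.001, 0.001, 0.001]] * 4)


def loss_and_accuracy(probs, targets):
    # nll_loss on log-probabilities is the cross-entropy loss.
    loss = torch.nn.functional.nll_loss(probs.log(), targets).item()
    accuracy = (probs.argmax(dim=1) == targets).float().mean().item()
    return loss, accuracy


loss_h, acc_h = loss_and_accuracy(hesitant, targets)
loss_c, acc_c = loss_and_accuracy(confident, targets)
print(f"hesitant:  loss {loss_h:.2f}, accuracy {acc_h:.2f}")   # loss 1.20, accuracy 0.00
print(f"confident: loss {loss_c:.2f}, accuracy {acc_c:.2f}")   # loss 5.18, accuracy 0.25
```

The confident model is right more often, yet its average loss is several times higher because each confident mistake contributes about -ln(0.001) ≈ 6.9 to the total.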

In [None]:
train_data.vocabSize  # vocab size.

151

8. Does it make sense to use embedding size of 300 for this toy data? Why or why not? 

In this case, an embedding size of 300 does not make sense since the vocab size (151) is much smaller than 300. With 300 dimensions, each word's vector would be larger than its one-hot encoding, so the embedding would expand the representation rather than compress it. 
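
The cost of a larger embedding can also be seen by counting weights. The sketch below assumes the model consists only of the `torch.nn.Embedding` and `torch.nn.Linear` layers described in Part 1.2, with a bias term on the linear layer (the PyTorch default); the helper name `cbow_param_count` is hypothetical.

```python
def cbow_param_count(vocab_size, n_embed):
    """Weights in an Embedding(vocab, n_embed) plus a Linear(n_embed, vocab)."""
    embedding = vocab_size * n_embed
    linear = n_embed * vocab_size + vocab_size  # weights + biases
    return embedding + linear


print(cbow_param_count(151, 50))   # 15251
print(cbow_param_count(151, 300))  # 90751
```

Under these assumptions, both settings have far more weights than the 336 training examples in the toy data, and an embedding size of 300 makes the gap roughly six times larger.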

## Part 3: Explore the role of training data on the word embeddings that are learned

### Train models

#### Alice in Wonderland

In [None]:
import pandas as pd

# coca_vocab_10k.txt was missing in the data folder
vocab_fname = "./data/coca_vocab_10k.txt"

# Source: https://www.eapfoundation.com/vocab/general/bnccoca/#listfreq
df = pd.read_excel("./data/BNC_COCA_lists.xlsx", sheet_name="Sheet1")
df = df.sort_values(by="Total frequency", ascending=False)[:10_000]

df["Headword "].to_csv(vocab_fname, index=False, header=False)

In [None]:
aiw_train_data = CBOW.CBOW_Dataset(
    fname="./data/alice_in_wonderland.txt",
    vocab_fname=vocab_fname,
    window_size=4,
)

aiw_model = CBOW.CBOW_Model(50, aiw_train_data.vocabSize)
aiw_model = aiw_model.to(device)

aiw_trainer = CBOW.CBOW_Trainer(
    num_epochs=50,
    lr=0.1,
    batch_size=100,
    train_data=aiw_train_data,
    val_data=aiw_train_data,
    device=device,
)

In [77]:
aiw_trainer.train(aiw_model)

Epoch 0:	 Avg Train Loss: 3.83196	 Avg Val Loss: 3.35886
Epoch 20:	 Avg Train Loss: 2.72914	 Avg Val Loss: 2.69802
Epoch 40:	 Avg Train Loss: 2.54592	 Avg Val Loss: 2.50947
Training done!
Avg Train Loss: 2.48387


In [78]:
aiw_eval = CBOW.CBOW_Evaluator(
    test_data=aiw_train_data,
    batch_size=100,
    device=device,
)

In [80]:
# loss
loss_aiw = aiw_eval.compute_loss(aiw_model)

# acc
gold_aiw, pred_aiw = aiw_eval.get_preds(aiw_model)
acc_aiw = (torch.stack(gold_aiw) == torch.stack(pred_aiw)).float().mean()

print("=" * 40)
print(f"{BLUE}Alice in Wonderland Model{END}")
print("-" * 40)
print(f"Loss: {loss_aiw:.2f}")
print(f"Accuracy: {acc_aiw*100:.2f}%")
print("=" * 40)

Alice in Wonderland Model
----------------------------------------
Loss: 2.46
Accuracy: 49.82%


---

#### Sherlock Holmes

In [84]:
sh_train_data = CBOW.CBOW_Dataset(
    fname="./data/sherlock_holmes_short.txt",
    vocab_fname=vocab_fname,
    window_size=4,
)

sh_model = CBOW.CBOW_Model(50, sh_train_data.vocabSize)
sh_model = sh_model.to(device)

sh_trainer = CBOW.CBOW_Trainer(
    num_epochs=50,
    lr=0.1,
    batch_size=100,
    train_data=sh_train_data,
    val_data=sh_train_data,
    device=device,
)

sh_eval = CBOW.CBOW_Evaluator(
    test_data=sh_train_data,
    batch_size=100,
    device=device,
)

In [85]:
sh_trainer.train(sh_model)

Epoch 0:	 Avg Train Loss: 4.18429	 Avg Val Loss: 3.62921
Epoch 20:	 Avg Train Loss: 3.00917	 Avg Val Loss: 2.97163
Epoch 40:	 Avg Train Loss: 2.80457	 Avg Val Loss: 2.77392
Training done!
Avg Train Loss: 2.73094


In [88]:
# loss
loss_sh = sh_eval.compute_loss(sh_model)

# acc
gold_sh, pred_sh = sh_eval.get_preds(sh_model)
acc_sh = (torch.stack(gold_sh) == torch.stack(pred_sh)).float().mean()

print("=" * 40)
print(f"{BLUE}Sherlock Holmes Model{END}")
print("-" * 40)
print(f"Loss: {loss_sh:.2f}")
print(f"Accuracy: {acc_sh*100:.2f}%")
print("=" * 40)

Sherlock Holmes Model
----------------------------------------
Loss: 2.71
Accuracy: 45.78%


### Come up with list of words
Answer questions 2 and 3 here

In [None]:
sims = [
    "the",
    "and",
    "of",
    "a",
    "to",
    "in",
    "is",
    "was",
    "with",
    "it",
    "that",
    "from",
]

vars = [  # aiw / sh
    "mad",  # hatter / angry
    "queen",  # white|hearts / royalty
    "white",  # queen / color
    "house",  # rabbit / home
    "case",  # box / murder
    "time",  # rabbit / clock
    "king",  # hearts / royalty
    "curious",  # Alice / case
    "mystery",  # wonder / crime
    "off",  # heads / something
]

Since the words in the `sims` list function largely as grammatical words, they should have similar embeddings across different texts. 

On the other hand, the words in the `vars` list are context-heavy words. For example, the word `mad` should have an embedding similar to that of the word `hatter` in the aiw model, but not in the sh model. 
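
The `mad` and `hatter` example is a comparison between two words inside the same model, and it can be checked that way. The sketch below is one possible approach and uses small hand-written matrices in place of the real embeddings; the helper `pair_similarity` and the toy values are assumptions. With the real models, the weights would be `aiw_model.embed.weight.data` and `sh_model.embed.weight.data`, and the IDs would come from the datasets' `encode` method. Comparing pairs within each model also sidesteps the fact that two models trained from separate random initializations need not share the same coordinate axes.

```python
import torch


def pair_similarity(weights, word2id, w1, w2):
    """Cosine similarity between two words inside one embedding matrix."""
    v1, v2 = weights[word2id[w1]], weights[word2id[w2]]
    return torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()


toy_word2id = {"mad": 0, "hatter": 1, "angry": 2}
# Hand-written stand-ins for the two embedding matrices (not real results).
toy_aiw = torch.tensor([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
toy_sh = torch.tensor([[0.1, 1.0], [1.0, 0.0], [0.2, 0.9]])

for name, weights in [("aiw", toy_aiw), ("sh", toy_sh)]:
    hatter = pair_similarity(weights, toy_word2id, "mad", "hatter")
    angry = pair_similarity(weights, toy_word2id, "mad", "angry")
    print(f"{name}: mad-hatter {hatter:.2f}, mad-angry {angry:.2f}")
```

In the toy matrices, `mad` sits next to `hatter` in one space and next to `angry` in the other, which is the pattern the hypothesis predicts for the real models.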

### Test your hypotheses

Answer question 4 here


In [None]:
def cst(vs1: list[torch.Tensor], vs2: list[torch.Tensor]) -> list[torch.Tensor]:
    """
    Given two lists of vectors, returns a list of their cosine similarities.

    Args:
        vs1 (list[torch.Tensor]): list of vectors
        vs2 (list[torch.Tensor]): list of vectors

    Returns:
        list[torch.Tensor]: list of similarities
    """
    if len(vs1) != len(vs2):
        raise ValueError("Input lists must have the same length")

    result = []

    for i in range(len(vs1)):
        similarity = torch.nn.functional.cosine_similarity(vs1[i], vs2[i], dim=0)
        result.append(similarity)

    return result

In [123]:
aiw_encoded_sims = aiw_train_data.encode(sims)
aiw_encoded_vars = aiw_train_data.encode(vars)
aiw_sims = []
aiw_vars = []

for sim in aiw_encoded_sims:
    aiw_sims.append(aiw_model.embed.weight.data[sim])
for var in aiw_encoded_vars:
    aiw_vars.append(aiw_model.embed.weight.data[var])

sh_encoded_sims = sh_train_data.encode(sims)
sh_encoded_vars = sh_train_data.encode(vars)
sh_sims = []
sh_vars = []

for sim in sh_encoded_sims:
    sh_sims.append(sh_model.embed.weight.data[sim])
for var in sh_encoded_vars:
    sh_vars.append(sh_model.embed.weight.data[var])

In [125]:
sims_cst = cst(aiw_sims, sh_sims)
vars_cst = cst(aiw_vars, sh_vars)

In [126]:
for i in range(len(sims_cst)):
    print(f"{i+1}. {sims[i]}: {sims_cst[i]:.2f}")
print("-" * 40)
print(f"Ave. Cosine Similarity: {sum(sims_cst)/len(sims_cst):.2f}")

1. the: 0.00
2. and: -0.09
3. of: -0.36
4. a: 0.01
5. to: -0.01
6. in: -0.19
7. is: 0.03
8. was: 0.03
9. with: 0.02
10. it: -0.01
11. that: 0.02
12. from: -0.03
----------------------------------------
Ave. Cosine Similarity: -0.05


In [127]:
for i in range(len(vars_cst)):
    print(f"{i+1}. {vars[i]}: {vars_cst[i]:.2f}")
print("-" * 40)
print(f"Ave. Cosine Similarity: {sum(vars_cst)/len(vars_cst):.2f}")

1. mad: 0.10
2. queen: 0.09
3. white: -0.02
4. house: 0.37
5. case: -0.05
6. time: -0.21
7. king: -0.13
8. curious: 0.07
9. mystery: -0.07
10. off: 0.06
----------------------------------------
Ave. Cosine Similarity: 0.02


The "similar" words list had a lower average cosine similarity (-0.05) than the "varied" words list (0.02), which seems to indicate that my hypothesis was wrong. However, I suspect that this result stems from the models being undertrained. Both average similarity scores are, for all practical purposes, zero, which suggests the word embeddings have not learned meaningful representations and are still close to their random initialization. The training corpora (Alice in Wonderland and Sherlock Holmes) contain only thousands of words, which is far too few to adequately train vectors for a 10,000-word vocabulary. In a high-dimensional space, the cosine similarity between any two random vectors is expected to be near 0. Therefore, the minor difference between -0.05 and 0.02 is not a meaningful signal of learned semantic relationships, but rather a product of statistical noise and data sparsity.
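
The claim about random vectors can be checked with a small simulation. The sketch below draws pairs of random 50-dimensional vectors with standard-normal entries (an assumption chosen to match the embedding size of 50 used above) and measures their cosine similarities; the variable names are illustrative.

```python
import torch

torch.manual_seed(0)
n_pairs, n_embed = 1000, 50

# Two independent sets of random vectors with standard-normal entries.
a = torch.randn(n_pairs, n_embed)
b = torch.randn(n_pairs, n_embed)

cosines = torch.nn.functional.cosine_similarity(a, b, dim=1)
mean_cos = cosines.mean().item()
std_cos = cosines.std().item()

print(f"mean: {mean_cos:.3f}, std: {std_cos:.3f}")
print(f"1 / sqrt(n_embed): {n_embed ** -0.5:.3f}")
```

The similarities center on 0 with a spread of roughly 1 / sqrt(50) ≈ 0.14, so averages of -0.05 and 0.02 over ten to twelve words are about what unrelated vectors would produce.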

## Part 4 (Optional): Explore the role of other factors on the word embeddings that are learned