# Lab 7: Contextual Bag of Words with Pytorch
### COSC 426: Fall 2025, Colgate University

Use this notebook to answer the questions in `Lab7.md`. Make sure to include in this notebook all the tests and experiments you run. Make sure to also cite any external resources you use. 

## Part 1: Familiarize yourself with the different components of the CBOW model

Add as many code chunks and markdown chunks as required to answer the questions in this part.

In [68]:
import CBOW
import torch

# For colorcoding
RED = "\033[31m"
GREEN = "\033[92m"
BLUE = "\033[94m"
END = "\033[0m"

### Part 1.1


In [58]:
data4 = CBOW.CBOW_Dataset("./data/sample-alice.txt", "./data/sample_vocab.txt", 4)

In [59]:
print(f"Num training examples: {len(data4.X)}")

Num training examples: 336


In [60]:
data2 = CBOW.CBOW_Dataset("./data/sample-alice.txt", "./data/sample_vocab.txt", 2)

In [61]:
print(f"Num training examples: {len(data2.X)}")

Num training examples: 336


1. Create a `CBOW_Dataset` object with `sample_alice.txt`, `sample_vocab.txt`, and a window size of 4. How many training examples does the dataset have? (*Hint: Remember you can access the variables in the object's init*) 

    There are 336 training examples.

2. Create a `CBOW_Dataset` object with `sample_alice.txt`, `sample_vocab.txt`, and a window size of 2. How many training examples does the dataset have?

    There are 336 training examples.

3. Does window size affect the number of training examples, why or why not? 

    No. Because the text is padded, every word gets a full context window regardless of the window size, so both datasets have the same number of examples: one per word in the text.

4. What does one training example look like? What is the sequence of steps to go from text in the form of a string to the final format and data types of the training example? 

    A training example looks like a vector of integer IDs of length window size * 2. We use * 2 because we want to account for every word, so we add padding to make sure each word gets a full window. We first divide the text into sentences, then divide the sentences into tokens. The tokens are then converted into integer values using the vocab:id mapping. Finally, we chunk these values into pairs in the format n words : (n+1)th word, where n is the window size.

5. Is this final dataset case-sensitive (i.e., does it treat lower and upper case differently)? If it is, how can you change it to not be case-sensitive and vice versa? 

    The final dataset is not case-sensitive. To make it case-sensitive, we can remove the `.lower()` call in the preprocessing, so that capitalized words are treated as different tokens.
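
The steps described in Part 1.1 can be sketched in plain Python. This is a minimal illustration and not the lab's `CBOW_Dataset`: the function `build_examples`, the toy vocabulary, and the `<pad>`/`<unk>` tokens are all assumptions, as is the choice to pad both sides of each sentence and to split sentences on periods.

```python
def build_examples(text, vocab, window_size, pad_token="<pad>", unk_token="<unk>"):
    """Turn raw text into (context_ids, target_id) pairs, one pair per token."""
    examples = []
    for sentence in text.split("."):          # naive sentence split (assumption)
        tokens = sentence.lower().split()     # .lower() makes the data case-insensitive
        if not tokens:
            continue
        ids = [vocab.get(tok, vocab[unk_token]) for tok in tokens]
        pad = [vocab[pad_token]] * window_size
        padded = pad + ids + pad              # padding gives every word a full window
        for i, target in enumerate(ids):
            center = i + window_size
            left = padded[center - window_size:center]
            right = padded[center + 1:center + 1 + window_size]
            examples.append((left + right, target))
    return examples


# Hypothetical toy vocabulary and text
vocab = {"<pad>": 0, "<unk>": 1, "alice": 2, "was": 3, "tired": 4, "she": 5, "sat": 6}
text = "Alice was tired. She sat."

ex2 = build_examples(text, vocab, window_size=2)
ex4 = build_examples(text, vocab, window_size=4)
print(len(ex2), len(ex4))   # same count for both window sizes: one example per word
print(ex2[0])               # context of length 2 * window_size, plus the target id
```

Under these assumptions the window size only changes the length of each context, not the number of examples, and dropping `.lower()` would make `Alice` and `alice` map to different IDs.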

### Part 1.2


1. In the `__init__` part of the `CBOW_Model`, you are initializing a `torch.nn.Embedding` layer. Conceptually, what does this layer do? What are the input and output dimensions of this layer? Why does this make sense?

    Conceptually, the `torch.nn.Embedding` layer is a table of weights that converts each input word ID into an output vector, and the number of weights is the size of the input * the size of the output. The input dimension is the size of the vocab and the output dimension is `nEmbed`. This makes sense because we want the embedding to be denser than a vector the size of the vocab.

2. You are also initializing a `torch.nn.Linear` layer. Conceptually, what does this layer do? What are the input and output dimensions of this layer? Why does this make sense? 

    Conceptually, the linear layer is the set of weights that converts the output vector of the Embedding layer into the vector representation of a word. The input dimension is the same as `nEmbed` and the output dimension is the vocab size. This makes sense because in the end we want the result to be the size of the vocab, with one value per word.

3. Describe in your own words what you think is happening in the `forward` function. 

    The `forward` function applies the embedding weights to the input and calculates the average.

4. Describe in your own words what you think is happening in the `loss` function.

    The `loss` function calculates the cross-entropy loss between the model's predictions and the target words.
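
The layer shapes discussed in Part 1.2 can be checked with a small stand-in model. This is a sketch under assumptions, not the lab's `CBOW_Model`: the class name `TinyCBOW`, the attribute names, and the step of passing the averaged embedding through the linear layer are all hypothetical.

```python
import torch
import torch.nn as nn


class TinyCBOW(nn.Module):
    """Hypothetical stand-in for a CBOW model (names and layout are assumptions)."""

    def __init__(self, n_embed, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, n_embed)  # lookup table: word id -> dense vector
        self.linear = nn.Linear(n_embed, vocab_size)    # dense vector -> one score per vocab word

    def forward(self, context_ids):
        vectors = self.embed(context_ids)   # (batch, context_len, n_embed)
        averaged = vectors.mean(dim=1)      # (batch, n_embed)
        return self.linear(averaged)        # (batch, vocab_size) unnormalized scores

    def loss(self, logits, targets):
        return nn.functional.cross_entropy(logits, targets)


torch.manual_seed(0)
model = TinyCBOW(n_embed=50, vocab_size=200)
context = torch.randint(0, 200, (8, 8))   # 8 examples, 8 context ids each
targets = torch.randint(0, 200, (8,))     # one target id per example

logits = model(context)
print(model.embed.weight.shape)           # one row of n_embed weights per vocab word
print(logits.shape)                       # one score per vocab word for each example
print(model.loss(logits, targets).item())
```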

### Part 1.3


1. When you create a `CBOW_Trainer` object, you also have to pass in the following parameters. Explain how these parameters are used during training.

   - num_epochs: how many full passes over the training data we want to make.
   - lr: the learning rate, i.e., how large a change we make to the weights at each update step.
   - batch_size: the number of examples we pass to the model in each forward pass.
   - train_data: the data we use to train the model.
   - val_data: held-out data we use to check the loss during training; it is not used to update the weights.
   - device: the device we want to use (CPU, CUDA, etc.).

2. We are loading training data using `torch.utils.data.DataLoader`. What are the parameters for this function? What is the format in which this function returns the data? 

    - train_data: dataset from which to load the data.
    - batch_size: how many samples per batch to load
    - shuffle: set to True to have the data reshuffled at every epoch (default: False).

    It returns an iterable over batches: looping over it yields one batch at a time, each holding up to `batch_size` examples.

3. The following line calls the `forward` function of the model and saves the output. 
    ```
    y_pred = model(X)
    ```
    Describe what you think each of the following lines are doing. 
    
   * `X,y_target = X.to(self.device), y_target.to(self.device)`: Moves the current batch of inputs and targets to the selected device (e.g., the GPU).
   * `loss = model.loss(y_pred, y_target)`: Computes the cross-entropy loss between the predictions and the targets.
   * `optimizer.zero_grad()`: Clears the gradients left over from the previous update so they do not accumulate.
   * `loss.backward()`: Calculates the gradient of the loss with respect to each weight. It does not change the weights itself.
   * `optimizer.step()`: Updates the weights using the gradients, in the direction that reduces the loss.
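
The pieces in Part 1.3 can be put together in a small, self-contained loop. This is a sketch on random toy data, not the lab's `CBOW_Trainer`: the tensor sizes, the use of `TensorDataset`, and the choice of SGD as the optimizer are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
vocab_size, n_embed = 20, 8
X = torch.randint(0, vocab_size, (32, 4))   # 32 toy examples, 4 context ids each
y = torch.randint(0, vocab_size, (32,))     # one target id per example
loader = DataLoader(TensorDataset(X, y), batch_size=8, shuffle=True)

embed = torch.nn.Embedding(vocab_size, n_embed)
linear = torch.nn.Linear(n_embed, vocab_size)
params = list(embed.parameters()) + list(linear.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)  # optimizer choice is an assumption
device = torch.device("cpu")

before = linear.weight.detach().clone()
for epoch in range(2):                       # num_epochs: full passes over the data
    for X_batch, y_batch in loader:          # each batch is a pair of tensors
        X_batch, y_batch = X_batch.to(device), y_batch.to(device)
        y_pred = linear(embed(X_batch).mean(dim=1))
        loss = torch.nn.functional.cross_entropy(y_pred, y_batch)
        optimizer.zero_grad()                # clear gradients from the previous batch
        loss.backward()                      # compute gradients; weights are unchanged here
        optimizer.step()                     # apply the update, scaled by lr

print(len(loader))                                    # number of batches per epoch
print(torch.equal(before, linear.weight.detach()))    # whether the weights stayed the same
```

With 32 examples and a batch size of 8, the loader yields 4 batches per epoch, and the final comparison shows that the weights only move because of `optimizer.step()`.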

### Part 1.4


1. Describe what `compute_loss` and `get_preds` do. Which of these two functions is closest to the evaluate mode in `NLPScholar`?

    First, the `compute_loss` function moves each batch of data to the selected device (e.g., CPU or GPU), then uses the model to predict y given X. After the prediction, the function calculates the loss by comparing the predicted result to the expected result. This process is repeated for all the examples in the data to get the total loss, which is then divided by the number of examples to get the average loss.

    Instead of calculating the loss of the predicted values, `get_preds` simply finds and returns the word that the model assigned the highest value to for each example.

    I would say the latter function, `get_preds`, is closer to the evaluate mode in `NLPScholar` because NLPScholar also focuses on the accuracy of each prediction rather than the loss.

2. Why do these functions have `@torch.no_grad()`?

    Gradients are used in a process of training a model. However, since the functions are only used in the context of evaluating the models, gradients are not used and thus there is no need to calculate them.

3. Say you have a `CBOW_Model` object `model`, and a `CBOW_Dataset` object called `test_data`. How will you use this class to calculate the loss of `model` on `test_data`? 

    ```python
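    # batch_size=8 is an arbitrary choice; device is the torch.device to run on
    evaluator = CBOW.CBOW_Evaluator(test_data=test_data, batch_size=8, device=device)
    loss = evaluator.compute_loss(model)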
    ```
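
The two evaluation behaviours described in Part 1.4 can be sketched as follows. This is an illustration under assumptions, not the lab's `CBOW_Evaluator`: the helper names `average_loss` and `predictions`, the toy linear model, and the random batches are all hypothetical.

```python
import torch


@torch.no_grad()
def average_loss(model, batches):
    """Average cross-entropy per example over (X, y) batches, with gradient tracking off."""
    total, count = 0.0, 0
    for X, y in batches:
        logits = model(X)
        total += torch.nn.functional.cross_entropy(logits, y, reduction="sum").item()
        count += y.numel()
    return total / count


@torch.no_grad()
def predictions(model, batches):
    """Return the highest-scoring class id for every example."""
    return torch.cat([model(X).argmax(dim=1) for X, _ in batches])


torch.manual_seed(0)
toy_model = torch.nn.Linear(5, 3)   # stand-in model: 5 features -> 3 classes
batches = [(torch.randn(4, 5), torch.randint(0, 3, (4,))) for _ in range(2)]

avg = average_loss(toy_model, batches)
preds = predictions(toy_model, batches)
print(avg, preds.shape)

# Outside no_grad the output is tracked for backpropagation; inside it is not.
tracked_outside = toy_model(batches[0][0]).requires_grad
with torch.no_grad():
    tracked_inside = toy_model(batches[0][0]).requires_grad
print(tracked_outside, tracked_inside)
```

Turning off gradient tracking saves memory and time during evaluation, since no backward pass will be run.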

## Part 2: Train and evaluate a CBOW model on toy data, and explore the word embeddings

Add as many code chunks and markdown chunks as required to answer the questions in this part.


In [62]:
args = {"cuda": True, "mps": True}  # preference order: cuda, then mps, then cpu


def get_device(args: dict):
    # Use a backend only if it was requested in args and is actually available
    use_cuda = args.get("cuda", False) and torch.cuda.is_available()
    use_mps = args.get("mps", False) and torch.backends.mps.is_available()

    if use_cuda:
        device = torch.device("cuda")
    elif use_mps:
        device = torch.device("mps")
    else:
        device = torch.device("cpu")

    return device


device = get_device(args)

### Initialize model and datasets

In [63]:
train_data = CBOW.CBOW_Dataset(
    fname="./data/sample-alice.txt",
    vocab_fname="./data/sample_vocab.txt",
    window_size=4,
)

test_data = CBOW.CBOW_Dataset(
    fname="./data/sample-lookingglass.txt",
    vocab_fname="./data/sample_vocab.txt",
    window_size=4,
)

In [64]:
model = CBOW.CBOW_Model(50, train_data.vocabSize)

### Evaluate randomly initialized model

* Report loss and accuracy on the training data
* Report cosine similarity between the 4 word pairs

In [65]:
ev = CBOW.CBOW_Evaluator(
    test_data=test_data,
    batch_size=8,
    device=device,
)

In [None]:
model = model.to(device)  # send the model to the selected device

# loss
loss_rand = ev.compute_loss(model)

# acc
gold, pred_rand = ev.get_preds(model)
acc_rand = (torch.stack(gold) == torch.stack(pred_rand)).float().mean()

In [71]:
print("=" * 40)
print(f"{BLUE}Randomly Initialized Model{END}")
print("-" * 40)
print(f"Loss: {loss_rand:.2f}")
print(f"Accuracy: {acc_rand*100:.2f}%")
print("=" * 40)

Randomly Initialized Model
----------------------------------------
Loss: 5.08
Accuracy: 0.54%


In [None]:
em_mat = model.embed.weight.data
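
Once the embedding matrix is available, cosine similarity between two words can be computed from their rows. The following is a minimal sketch: the helper `word_similarity`, the toy vocabulary, and the toy matrix are hypothetical, and the real word-to-ID mapping would come from the dataset object.

```python
import torch
import torch.nn.functional as F


def word_similarity(em_mat, vocab, word_a, word_b):
    """Cosine similarity between the embedding rows of two words."""
    vec_a = em_mat[vocab[word_a]]
    vec_b = em_mat[vocab[word_b]]
    return F.cosine_similarity(vec_a, vec_b, dim=0).item()


# Hypothetical toy vocabulary and 2-dimensional embeddings
toy_vocab = {"alice": 0, "rabbit": 1, "queen": 2}
toy_em = torch.tensor([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])

sim_orthogonal = word_similarity(toy_em, toy_vocab, "alice", "rabbit")  # perpendicular vectors
sim_parallel = word_similarity(toy_em, toy_vocab, "alice", "queen")     # same direction
print(sim_orthogonal, sim_parallel)
```

Cosine similarity ignores vector length, so the two parallel rows score close to 1 even though one is twice as long as the other.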

### Train the model

### Evaluate trained model

* Report loss and accuracy on the training data
* Report cosine similarity between the 4 word pairs

### Discussion and reflection

* Do you think that the model has learned the task? Do you think the model has learned useful embeddings?

* Does it make sense to use embedding size of 300 for this toy data? Why or why not? 

## Part 3: Explore the role of training data on the word embeddings that are learned

### Train models

Answer question 1 here

### Come up with list of words
Answer questions 2 and 3 here

### Test your hypotheses

Answer question 4 here


## Part 4 (Optional): Explore the role of other factors on the word embeddings that are learned