# 🦜 NN-Based Language Model
In this exercise we will run a basic RNN-based language model and answer some questions about the code. Running the code on a GPU is advised. First run the code, then answer the questions below that require modifying it.

In [1]:
#@title 🧮 Imports & Hyperparameter Setup
#@markdown Feel free to experiment with the following hyperparameters at your
#@markdown leisure. For the purpose of this assignment, leave the default values
#@markdown and run the code with these suggested values.
# Some part of the code was referenced from below.
# https://github.com/pytorch/examples/tree/master/word_language_model 
# https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/02-intermediate/language_model

! git clone https://github.com/yunjey/pytorch-tutorial/
%cd pytorch-tutorial/tutorials/02-intermediate/language_model/

import torch
import torch.nn as nn
import numpy as np
from torch.nn.utils import clip_grad_norm_

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyper-parameters
embed_size = 128 #@param {type:"number"}
hidden_size = 1024 #@param {type:"number"}
num_layers = 1 #@param {type:"number"}
num_epochs = 5 #@param {type:"slider", min:1, max:10, step:1}
batch_size = 20 #@param {type:"number"}
seq_length = 30 #@param {type:"number"}
learning_rate = 0.002 #@param {type:"number"}
#@markdown Number of words to be sampled ⬇️
num_samples = 50 #@param {type:"number"}  

print(f"--> Device selected: {device}")


Cloning into 'pytorch-tutorial'...
remote: Enumerating objects: 917, done.
remote: Total 917 (delta 0), reused 0 (delta 0), pack-reused 917
Receiving objects: 100% (917/917), 12.80 MiB | 953.00 KiB/s, done.
Resolving deltas: 100% (490/490), done.
/home/wa_ziqia/Documents/assignments/COMP691/assignment2/pytorch-tutorial/tutorials/02-intermediate/language_model
--> Device selected: cuda


In [2]:
from data_utils import Dictionary, Corpus

# Load "Penn Treebank" dataset
corpus = Corpus()
ids = corpus.get_data('data/train.txt', batch_size)
vocab_size = len(corpus.dictionary)
num_batches = ids.size(1) // seq_length

print(f"Vocabulary size: {vocab_size}")
print(f"Number of batches: {num_batches}")

Vocabulary size: 10000
Number of batches: 1549


## 🤖 Model Definition
As you can see below, this model stacks `num_layers` [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) layers vertically to construct our basic RNN-based language model. The diagram below shows a pictorial representation of the model in its simplest form (i.e., `num_layers = 1`).

In [24]:
# RNN based language model
class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super(RNNLM, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x, h):
        # Embed word ids to vectors
        x = self.embed(x)
        
        # Forward propagate LSTM
        out, (h, c) = self.lstm(x, h)
        
        # Reshape output to (batch_size*sequence_length, hidden_size)
        out = out.reshape(out.size(0)*out.size(1), out.size(2))
        
        # Decode hidden states of all time steps
        out = self.linear(out)
        return out, (h, c)
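
To make the tensor shapes in `forward` concrete, here is a minimal shape check. It repeats the `RNNLM` class so that it runs on its own, and every `toy_*` size is an assumption chosen only to keep the check fast; these are not the notebook's hyperparameters.

```python
import torch
import torch.nn as nn


# Same architecture as the RNNLM cell above, repeated so the sketch is self-contained.
class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h):
        x = self.embed(x)                      # (batch, seq, embed)
        out, (h, c) = self.lstm(x, h)          # (batch, seq, hidden)
        out = out.reshape(out.size(0) * out.size(1), out.size(2))
        return self.linear(out), (h, c)        # (batch * seq, vocab)


# Toy sizes (assumptions for illustration only).
toy_vocab, toy_embed, toy_hidden, toy_layers = 50, 8, 16, 2
toy_batch, toy_seq = 4, 5

toy_model = RNNLM(toy_vocab, toy_embed, toy_hidden, toy_layers)
x = torch.randint(0, toy_vocab, (toy_batch, toy_seq))
h0 = torch.zeros(toy_layers, toy_batch, toy_hidden)
c0 = torch.zeros(toy_layers, toy_batch, toy_hidden)

logits, (h, c) = toy_model(x, (h0, c0))
print(logits.shape)      # one row of vocabulary scores per (batch, time step) pair
print(h.shape, c.shape)  # one final hidden and cell state per layer
```

Flattening the output to `(batch * seq, vocab)` is what lets the training loop pass it straight to `nn.CrossEntropyLoss` together with `targets.reshape(-1)`.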

## 🏓 Training
In this section we will train our model. This should take a couple of minutes, so be patient 😊

In [27]:
model = RNNLM(vocab_size, embed_size, hidden_size, num_layers).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Truncated backpropagation
def detach(states):
    return [state.detach() for state in states] 


# Train the model
for epoch in range(num_epochs):
    # Set initial hidden and cell states
    states = (torch.zeros(num_layers, batch_size, hidden_size).to(device),
              torch.zeros(num_layers, batch_size, hidden_size).to(device))
    
    for i in range(0, ids.size(1) - seq_length, seq_length):
        # Get mini-batch inputs and targets
        inputs = ids[:, i:i+seq_length].to(device)
        targets = ids[:, (i+1):(i+1)+seq_length].to(device)
        
        # Forward pass
        states = detach(states)
        outputs, states = model(inputs, states)
        loss = criterion(outputs, targets.reshape(-1))
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        step = (i+1) // seq_length
        if step % 100 == 0:
            print ('Epoch [{}/{}], Step[{}/{}], Loss: {:.4f}, Perplexity: {:5.2f}'
                   .format(epoch+1, num_epochs, step, num_batches, loss.item(), np.exp(loss.item())))

Epoch [1/5], Step[0/1549], Loss: 9.2125, Perplexity: 10021.20
Epoch [1/5], Step[100/1549], Loss: 5.9930, Perplexity: 400.62
Epoch [1/5], Step[200/1549], Loss: 5.9335, Perplexity: 377.48
Epoch [1/5], Step[300/1549], Loss: 5.7679, Perplexity: 319.86
Epoch [1/5], Step[400/1549], Loss: 5.6859, Perplexity: 294.68
Epoch [1/5], Step[500/1549], Loss: 5.1088, Perplexity: 165.47
Epoch [1/5], Step[600/1549], Loss: 5.1871, Perplexity: 178.95
Epoch [1/5], Step[700/1549], Loss: 5.3405, Perplexity: 208.62
Epoch [1/5], Step[800/1549], Loss: 5.1862, Perplexity: 178.80
Epoch [1/5], Step[900/1549], Loss: 5.0756, Perplexity: 160.07
Epoch [1/5], Step[1000/1549], Loss: 5.1047, Perplexity: 164.79
Epoch [1/5], Step[1100/1549], Loss: 5.3735, Perplexity: 215.62
Epoch [1/5], Step[1200/1549], Loss: 5.1934, Perplexity: 180.07
Epoch [1/5], Step[1300/1549], Loss: 5.0448, Perplexity: 155.22
Epoch [1/5], Step[1400/1549], Loss: 4.8532, Perplexity: 128.15
Epoch [1/5], Step[1500/1549], Loss: 5.1126, Perplexity: 166.11
Ep
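
The training log reports perplexity as `np.exp(loss)`. As a quick sanity check on the numbers above, a model that assigns uniform probability to all 10000 vocabulary words has a cross-entropy of ln(10000) and therefore a perplexity of 10000, which is close to the value logged at step 0. The sketch below uses only the standard library; the logged loss is copied from the output above.

```python
import math

vocab_size = 10000  # vocabulary size printed by the data-loading cell

# Cross-entropy of a uniform guess over the vocabulary, and its perplexity.
uniform_loss = math.log(vocab_size)
uniform_perplexity = math.exp(uniform_loss)
print(f"uniform loss: {uniform_loss:.4f}, perplexity: {uniform_perplexity:.2f}")

# Perplexity recomputed from a logged loss (epoch 1, step 1500).
logged_loss = 5.1126
logged_perplexity = math.exp(logged_loss)
print(f"perplexity at loss {logged_loss}: {logged_perplexity:.2f}")
```

The recomputed value can differ from the logged one in the last digit because the log prints the loss rounded to four decimals.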

# 🤔 Questions

## 1️⃣ Q2.1 Detaching or not? (10 points)
The above code implements a version of truncated backpropagation through time. The implementation only requires the `detach()` function (lines 7-9 of the cell) defined above the loop and used once inside the training loop.
* Explain the implementation (compared to not using truncated backprop through time).
* What does the `detach()` call here achieve? Draw a computational graph. You may choose to answer this question outside the notebook.
* When using lines 7-9, we typically observe less GPU memory being used during training. Explain why in your answer.
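
As a starting point for thinking about this question, the effect of `detach()` can be seen on a toy scalar recurrence. This is only a sketch under simplifying assumptions: a single weight `w` and the recurrence `h_t = w * h_{t-1}` stand in for the LSTM, and the names are illustrative.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
h0 = torch.tensor(1.0)

# Full backpropagation: the gradient flows through both steps.
h1 = w * h0
h2 = w * h1                    # h2 = w^2 * h0
h2.backward()
grad_full = w.grad.item()      # d(w^2 * h0)/dw = 2 * w * h0

# Truncated backpropagation: h1 is detached, so it is treated as a constant.
w.grad = None
h1 = w * h0
h2 = w * h1.detach()           # the graph stops at h1
h2.backward()
grad_truncated = w.grad.item() # d(w * const)/dw = const = h1

print(grad_full, grad_truncated)
```

With the detached state, the second step holds no reference to the first step's graph, so that part of the graph does not have to be kept alive for the backward pass.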


## 🔮 Model Prediction
Below we will use our model to generate a text sequence!

In [11]:
# Sample from the model
with torch.no_grad():
    with open('sample.txt', 'w') as f:
        # Set initial hidden and cell states
        state = (torch.zeros(num_layers, 1, hidden_size).to(device),
                 torch.zeros(num_layers, 1, hidden_size).to(device))

        # Select one word id randomly
        prob = torch.ones(vocab_size)
        input = torch.multinomial(prob, num_samples=1).unsqueeze(1).to(device)

        for i in range(num_samples):
            # Forward propagate RNN 
            output, state = model(input, state)

            # Sample a word id
            prob = output.exp()
            word_id = torch.multinomial(prob, num_samples=1).item()

            # Fill input with sampled word id for the next time step
            input.fill_(word_id)

            # File write
            word = corpus.dictionary.idx2word[word_id]
            word = '\n' if word == '<eos>' else word + ' '
            f.write(word)

            if (i+1) % 100 == 0:
                print('Sampled [{}/{}] words and save to {}'.format(i+1, num_samples, 'sample.txt'))
! cat sample.txt

outright <unk> <unk> <unk> so a western spokesman in new york 
rogers got N aliens and instructions have suffered their office 
and it has big early terms with momentum 
we know that he will his next two games needs and credible while our political perspective 
the 

## 2️⃣ Q2.2 Sampling strategy (7 points)
Consider the sampling procedure above. The current code samples a word:
```python
word_id = torch.multinomial(prob, num_samples=1).item()
```
at each output step and feeds it to the model as the input for the next time step. Copy the above cell below and modify this sampling strategy to use greedy sampling, which selects the highest-probability word at each time step and feeds it as the next input.

In [12]:
# Sample greedily from the model
with torch.no_grad():
    with open('sample.txt', 'w') as f:
        # Set initial hidden and cell states
        state = (torch.zeros(num_layers, 1, hidden_size).to(device),
                 torch.zeros(num_layers, 1, hidden_size).to(device))

        # Select one word id randomly
        prob = torch.ones(vocab_size)
        input = torch.multinomial(prob, num_samples=1).unsqueeze(1).to(device)

        for i in range(num_samples):
            # Forward propagate RNN 
            output, state = model(input, state)

            # Pick the most probable word id (greedy)
            prob = output.exp()
            word_id = torch.argmax(prob, dim=1).item()

            # Fill input with sampled word id for the next time step
            input.fill_(word_id)

            # File write
            word = corpus.dictionary.idx2word[word_id]
            word = '\n' if word == '<eos>' else word + ' '
            f.write(word)

            if (i+1) % 100 == 0:
                print('Sampled [{}/{}] words and save to {}'.format(i+1, num_samples, 'sample.txt'))
! cat sample.txt

counter that the leadership change reflects the high of all this 
the <unk> of the <unk> mr. jones has turned <unk> to the <unk> 
mr. roman also raised pinkerton 's equity investment in the u.s. 
he also told reporters that he would n't elaborate 
he would 
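
The greedy cell above still applies `exp()` before `argmax`. Because `exp` is monotonically increasing, the index of the largest value is the same with or without it, so the step is optional for greedy decoding. A small sketch on toy logits (a random stand-in for one row of model output, not the trained model):

```python
import torch

torch.manual_seed(0)
toy_vocab = 10
logits = torch.randn(1, toy_vocab)  # toy stand-in for `output` at one time step

# Greedy choice: identical whether or not exp() is applied first.
greedy_from_weights = torch.argmax(logits.exp(), dim=1).item()
greedy_from_logits = torch.argmax(logits, dim=1).item()
print(greedy_from_weights == greedy_from_logits)

# Stochastic choice: multinomial needs non-negative weights, hence the exp().
sampled = torch.multinomial(logits.exp(), num_samples=1).item()
print(sampled)
```

Greedy decoding is deterministic for a given starting word, whereas multinomial sampling can produce a different sequence on every run.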

## 3️⃣ Q2.3 Embedding Distance (8 points)
Our model has learned a specific set of word embeddings.
* Write a function that takes in 2 words and prints the cosine distance between their embeddings using the word embeddings from the above models.
* Use it to print the cosine distance of the word "army" and the word "taxpayer".

*Refer to the sampling code for how to output the words corresponding to each index. To get the index of a word, you can use the dictionary `corpus.dictionary.word2idx`.*


In [26]:
# Embedding distance
def cosine_distance(model, word1, word2):
    idx1 = corpus.dictionary.word2idx[word1]
    idx2 = corpus.dictionary.word2idx[word2]

    # Retrieve the embeddings
    embed1 = model.embed(torch.tensor([idx1]).to(device)).squeeze()
    embed2 = model.embed(torch.tensor([idx2]).to(device)).squeeze()

    # Normalize the embeddings
    norm1 = embed1 / torch.norm(embed1)
    norm2 = embed2 / torch.norm(embed2)

    # Cosine similarity is the dot product of the normalized embeddings
    similarity = torch.dot(norm1, norm2).item()

    # Cosine distance is one minus the cosine similarity
    return 1 - similarity

word1 = "army"
word2 = "taxpayer"
distance = cosine_distance(model, word1, word2)
print(f"The cosine distance between '{word1}' and '{word2}' is {distance:.4f}")

The cosine distance between 'army' and 'taxpayer' is 0.9354
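
As a cross-check, the same quantity can be computed with PyTorch's built-in `torch.nn.functional.cosine_similarity`. The sketch below compares the manual computation with the built-in one on a small random embedding table; `toy_embed` and the chosen indices are assumptions for illustration, not the trained model's embeddings.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
toy_embed = torch.nn.Embedding(5, 8)  # toy table: 5 "words", 8 dimensions


def cosine_distance_manual(a, b):
    # Normalize both vectors, take the dot product, subtract from one.
    return 1 - torch.dot(a / torch.norm(a), b / torch.norm(b)).item()


def cosine_distance_builtin(a, b):
    # Same quantity using the library's cosine similarity.
    return 1 - F.cosine_similarity(a, b, dim=0).item()


with torch.no_grad():
    a = toy_embed.weight[1]
    b = toy_embed.weight[3]
    d_manual = cosine_distance_manual(a, b)
    d_builtin = cosine_distance_builtin(a, b)

print(f"{d_manual:.4f} {d_builtin:.4f}")
```

Cosine distance ranges from 0 (same direction) to 2 (opposite direction), and a value near 1 means the two vectors are close to orthogonal.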


## 4️⃣ Q2.4 Teacher Forcing (Extra Credit 2 points)
What is teacher forcing?
> Teacher forcing works by using the actual or expected output from the training dataset at the current time step $y(t)$ as input in the next time step $X(t+1)$, rather than the output generated by the network.

In the `🏓 Training` code this is achieved, implicitly, when we pass the entire input sequence (`inputs = ids[:, i:i+seq_length].to(device)`) to the model at once.
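
The implicit teacher forcing comes from how inputs and targets are sliced: the targets are the inputs shifted by one position, so the input at step `t+1` is always the ground-truth token that was the target at step `t`. A toy illustration, where the `toy_*` sizes and the `arange` token ids are assumptions for readability:

```python
import torch

toy_batch, toy_seq = 2, 4
ids = torch.arange(20).view(toy_batch, -1)  # 2 rows of 10 consecutive "token ids"

i = 0
inputs = ids[:, i:i + toy_seq]
targets = ids[:, (i + 1):(i + 1) + toy_seq]
print(inputs.tolist())
print(targets.tolist())

# The input at step t+1 equals the target at step t: the model is always fed
# the true previous token, never its own prediction.
shift_holds = torch.equal(inputs[:, 1:], targets[:, :-1])
print(shift_holds)
```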

Copy the `🏓 Training` code below and modify it to disable teacher forcing. Compare the performance of this model with that of the original model (perplexity and convergence rate). What can you conclude?

In [32]:
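# Training code without Teacher Forcing
# Note: this cell does not re-create `model` or `optimizer`; unless they are
# re-initialized beforehand, training continues from the weights learned above.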
for epoch in range(num_epochs):
    # Set initial hidden and cell states
    states = (torch.zeros(num_layers, batch_size, hidden_size).to(device),
              torch.zeros(num_layers, batch_size, hidden_size).to(device))
    
    for i in range(0, ids.size(1) - seq_length, seq_length):
        # Get mini-batch inputs and targets
        inputs = ids[:, i:i+1].to(device)  # only the first word
        targets = ids[:, (i+1):(i+1)+seq_length].to(device)
        
        # Initialize loss
        loss = 0
        
        for j in range(seq_length):
            # Forward pass. Detaching inside the inner loop truncates
            # backpropagation to a single time step.
            states = detach(states)
            outputs, states = model(inputs, states)
            
            # Calculate loss
            loss += criterion(outputs, targets[:, j])

            # Update input for the next step
            _, predicted = outputs.max(1)
            inputs = predicted.unsqueeze(1).detach()
        
        # Normalize loss by sequence length
        loss /= seq_length
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        step = (i+1) // seq_length
        if step % 100 == 0:
            print ('Epoch [{}/{}], Step[{}/{}], Loss: {:.4f}, Perplexity: {:5.2f}'
                   .format(epoch+1, num_epochs, step, num_batches, loss.item(), np.exp(loss.item())))


Epoch [1/5], Step[0/1549], Loss: 6.6925, Perplexity: 806.36
Epoch [1/5], Step[100/1549], Loss: 6.5009, Perplexity: 665.77
Epoch [1/5], Step[200/1549], Loss: 6.5244, Perplexity: 681.54
Epoch [1/5], Step[300/1549], Loss: 6.6948, Perplexity: 808.16
Epoch [1/5], Step[400/1549], Loss: 6.5107, Perplexity: 672.32
Epoch [1/5], Step[500/1549], Loss: 6.3665, Perplexity: 582.00
Epoch [1/5], Step[600/1549], Loss: 6.3464, Perplexity: 570.43
Epoch [1/5], Step[700/1549], Loss: 6.5332, Perplexity: 687.59
Epoch [1/5], Step[800/1549], Loss: 6.3586, Perplexity: 577.42
Epoch [1/5], Step[900/1549], Loss: 6.4413, Perplexity: 627.23
Epoch [1/5], Step[1000/1549], Loss: 6.4862, Perplexity: 656.02
Epoch [1/5], Step[1100/1549], Loss: 6.5744, Perplexity: 716.54
Epoch [1/5], Step[1200/1549], Loss: 6.4850, Perplexity: 655.22
Epoch [1/5], Step[1300/1549], Loss: 6.6614, Perplexity: 781.61
Epoch [1/5], Step[1400/1549], Loss: 6.4868, Perplexity: 656.39
Epoch [1/5], Step[1500/1549], Loss: 6.4544, Perplexity: 635.51
Epoc

## Conclusion for Q2.4
Compared with the original model, the model trained without teacher forcing converges more slowly and reaches higher perplexity. In the first-epoch logs above, perplexity with teacher forcing falls from about 10021 at step 0 to between roughly 128 and 216 from step 500 onward, whereas without teacher forcing it stays between about 570 and 808 for the whole epoch. The reason is that, without teacher forcing, the model is fed its own predictions, which are often wrong, especially early in training, so errors compound along the sequence and the model takes longer to learn the correct patterns and dependencies in the data. In some cases this can yield a model that copes better with its own mistakes at inference time, but it generally takes longer to train and may not reach the performance of the model trained with teacher forcing.

## 5️⃣ Q2.5 Distance Comparison (+1 point)
Repeat the work you did for `3️⃣ Q2.3 Embedding Distance` for the model in `4️⃣ Q2.4 Teacher Forcing` and compare the distances produced by these two models (i.e. with and without the teacher forcing), what can you conclude?

In [37]:
word1 = "army"
word2 = "taxpayer"
distance_without_tf = cosine_distance(model, word1, word2)
print(f"The cosine distance between '{word1}' and '{word2}' without teacher forcing is {distance_without_tf:.4f}")

The cosine distance between 'army' and 'taxpayer' without teacher forcing is 0.9480


## Conclusion for Q2.5
The two models give slightly different cosine distances between "army" and "taxpayer": 0.9354 with teacher forcing and 0.9480 without. The difference reflects the different training strategies, which lead to different learned representations. Both values are close to 1, so in both models the two embeddings are nearly orthogonal. However, the distance for a single word pair is not sufficient to draw conclusions about the overall quality or performance of the models. To compare them properly, both should be evaluated on a validation dataset and their perplexities compared.
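
One way such a validation comparison could look is sketched below. `evaluate_perplexity` is a hypothetical helper, not part of the notebook or the tutorial repository, and `TinyLM` with random token ids is a toy stand-in used only to make the sketch runnable; in the notebook one would pass each trained model and a tensor of held-out token ids shaped like `ids`.

```python
import math
import torch
import torch.nn as nn


def evaluate_perplexity(model, ids, seq_length, num_layers, hidden_size, device="cpu"):
    """Hypothetical helper: per-token perplexity of `model` on `ids` (batch, length)."""
    criterion = nn.CrossEntropyLoss(reduction="sum")  # sum, then divide by token count
    batch_size = ids.size(0)
    states = (torch.zeros(num_layers, batch_size, hidden_size, device=device),
              torch.zeros(num_layers, batch_size, hidden_size, device=device))
    total_loss, total_tokens = 0.0, 0
    model.eval()  # call model.train() again before resuming training
    with torch.no_grad():
        for i in range(0, ids.size(1) - seq_length, seq_length):
            inputs = ids[:, i:i + seq_length].to(device)
            targets = ids[:, (i + 1):(i + 1) + seq_length].to(device)
            outputs, states = model(inputs, states)
            total_loss += criterion(outputs, targets.reshape(-1)).item()
            total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)


class TinyLM(nn.Module):
    """Toy stand-in with the same forward interface as RNNLM."""
    def __init__(self, vocab_size, hidden_size, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, x, h):
        out, h = self.lstm(self.embed(x), h)
        return self.linear(out.reshape(-1, out.size(2))), h


torch.manual_seed(0)
toy_ids = torch.randint(0, 12, (2, 21))  # random "held-out" ids, illustration only
toy_ppl = evaluate_perplexity(TinyLM(12, 8, 1), toy_ids,
                              seq_length=5, num_layers=1, hidden_size=8)
print(f"toy perplexity: {toy_ppl:.2f}")
```

Averaging the loss over all tokens before exponentiating gives a single perplexity for the whole held-out set, which is easier to compare across models than the per-batch values printed during training.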