Name | Matr.Nr. | Due Date
:--- | ---: | ---:
Ayse Sude Baki | 12211229 | 25.05.2023, 08:00

<h1 style="color:rgb(0,120,170)">Hands-on AI II</h1>
<h2 style="color:rgb(0,120,170)">Unit 5 – Language Modeling with LSTM (Assignment)</h2>

<b>Authors:</b> N. Rekabsaz, B. Schäfl, S. Lehner, J. Brandstetter, E. Kobler, M. Abbass, A. Schörgenhumer<br>
<b>Date:</b> 16-05-2023

This file is part of the "Hands-on AI II" lecture material. The following copyright statement applies to all code within this file.

<b>Copyright statement:</b><br>
This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this material, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

<h3 style="color:rgb(0,120,170)">How to use this notebook</h3>
<p><p>This notebook is designed to run from start to finish. There are different tasks (displayed in <span style="color:rgb(248,138,36)">orange boxes</span>) which might require small code modifications. Most/All of the used functions are imported from the file <code>u5_utils.py</code> which can be seen and treated as a black box. However, for further understanding, you can look at the implementations of the helper functions. In order to run this notebook, the packages which are imported at the beginning of <code>u5_utils.py</code> need to be installed.</p></p>

In [2]:
import u5_utils as u5

import numpy as np
import torch
import os
import time
import math
import ipdb
import matplotlib.pyplot as plt
import seaborn as sns

# Set default plotting style.
sns.set()

# Setup Jupyter notebook (warning: this may affect all Jupyter notebooks running on the same Jupyter server).
u5.setup_jupyter()

# Check minimum versions.
u5.check_module_versions()

Installed Python version: 3.9 (✓)
Installed numpy version: 1.21.5 (✓)
Installed pandas version: 1.4.4 (✓)
Installed PyTorch version: 1.12.1 (✓)


<h2>Language Model Training and Evaluation</h2>

<h3 style="color:rgb(0,120,170)">Data & Dictionary Preperation</h3>

<div class="alert alert-warning">
    <b>Exercise 1. [20 Points]</b>
        <ul>
            <li>Setup the data set using the same parameter settings as in the main exercise notebook but with the changes mentioned below.</li>
            <li>Change the batch size in the initial parameters to $64$ and observe its effect on the created batches. Explain how the corpora are transformed into batches.</li>
            <li>Use a seed of $23$.</li>
            <li>For a specific sequence in <code>val_data_splits</code> (e.g., index $15$), print the corresponding words of its first 25 wordIDs.</li>
        </ul>
</div>

In [3]:
# Input & output parameters
data_path = os.path.join("resources", "penn")
save_path = "model.pt" # path to save the final model

# Training & evaluation parameters
train_batch_size = 64 # batch size for training
eval_batch_size = 64 # batch size for validation/test
max_seq_len = 40 # sequence length

# Random seed to facilitate reproducibility
torch.manual_seed(23)

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
print("Device:", device)

Device: cpu


In [4]:
train_corpus = u5.Corpus(os.path.join(data_path, "train.txt"))
valid_corpus = u5.Corpus(os.path.join(data_path, "valid.txt"))
test_corpus = u5.Corpus(os.path.join(data_path, "test.txt"))

dictionary = u5.Dictionary()
train_corpus.fill_dictionary(dictionary)
ntokens = len(dictionary)
print(f"Number of tokens in dictionary {ntokens}")

Number of tokens in dictionary 10001


In [5]:
train_data = train_corpus.words_to_ids(dictionary)
print(f"Train data: number of tokens {len(train_data)}")

valid_data = valid_corpus.words_to_ids(dictionary)
print(f"Validation data: number of tokens {len(valid_data)}")

test_data = test_corpus.words_to_ids(dictionary)
print(f"Test data: number of tokens {len(test_data)}")

print()
train_data_splits = u5.batchify(train_data, train_batch_size, device)
print(f"Train data split shape: {train_data_splits.shape}")

val_data_splits = u5.batchify(valid_data, eval_batch_size, device)
print(f"Validation data split shape: {val_data_splits.shape}")

test_data_splits = u5.batchify(test_data, eval_batch_size, device)
print(f"Test data batchified shape: {test_data_splits.shape}")

Train data: number of tokens 929589
Validation data: number of tokens 73760
Test data: number of tokens 82430

Train data split shape: torch.Size([14524, 64])
Validation data split shape: torch.Size([1152, 64])
Test data batchified shape: torch.Size([1287, 64])


The words get transformed into word-IDs using a dictionary and sequense of the said IDs get reshaped into batches of the sizes (sequence_length, batch_size). If the sequence_length isn't the multiple of batch_size, the remaining samples are dropped.    

In [6]:
for idx in val_data_splits[:25,15]:
    print(dictionary.idx2word[idx])

weekly
reports
on
school
and
college
construction
plans
<eos>
market
data
<unk>
is
a
<unk>
of
educational
information
and
provides
related
services
<eos>
closely
held


<div class="alert alert-warning">
    <b>Exercise 2. [20 Points]</b>
        <ul>
            <li>Copy the implementation of <code>LM_LSTMModel</code> from the main exercise notebook but make the following changes:</li>
            <ul>
                <li>Add an integer parameter to <code>LM_LSTMModel</code>'s initialization, called <code>num_layers</code> which indicates the number of (vertically) stacked LSTM blocks. Hint: PyTorch's LSTM implementation directly supports this, so you simply have to set it when creating the LSTM instance (see parameter <code>num_layers</code> in the <a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html">documentation</a>).</li>
                <li>Add a new bool parameter to <code>LM_LSTMModel</code>'s initialization, called <code>tie_weights</code>. Extend the implementation of <code>LM_LSTMModel</code> such that if <code>tie_weights</code> is set to <code>True</code>, the model ties/shares the parameters of <code>encoder</code> with the ones of <code>decoder</code>. Consider that <code>encoder</code> and <code>decoder</code> still remain separate components but their parameters are now the same (shared). This process is called <i>weight tying</i>. Feel free to search the internet for relevant resources and implementation hints.</li>
            </ul>
            <li>Create four models:</li>
            <ul>
                <li>1 layer and without weight tying</li>
                <li>1 layer and with weight tying</li>
                <li>2 layers and without weight tying</li>
                <li>2 layers and with weight tying</li>
            </ul>
            <li>Compare the number of parameters of the models and report your observations.</li>
        </ul>
</div>

In [7]:
class LM_LSTMModel(torch.nn.Module):
    
    def __init__(self, ntoken, ninp, nhid, num_layers, tie_weights = False):
        super().__init__()
        self.ntoken = ntoken
        self.encoder = torch.nn.Embedding(ntoken, ninp)  # matrix E in the figure
        self.rnn = torch.nn.LSTM(ninp, nhid, num_layers=num_layers)  # Set num_layers in LSTM
        self.decoder = torch.nn.Linear(nhid, ntoken)  # matrix U in the figure
        self.init_weights()
        self.ninp = ninp
        self.nhid = nhid
        self.num_layers = num_layers
        self.tie_weights = tie_weights
        
        if tie_weights:
            self.decoder.weight = self.encoder.weight
            
    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)
        
    def init_hidden(self, bsz):
        weight = next(self.parameters())
        return (weight.new_zeros(1, bsz, self.nhid),
                weight.new_zeros(1, bsz, self.nhid))
    
    def forward(self, input, hidden=None, return_logs=True):
        #ipdb.set_trace()
        emb = self.encoder(input)
        hiddens, last_hidden = self.rnn(emb, hidden)
        
        decoded = self.decoder(hiddens)
        if return_logs:
            y_hat = torch.nn.LogSoftmax(dim=-1)(decoded)
        else:
            y_hat = torch.nn.Softmax(dim=-1)(decoded)
        
        return y_hat, last_hidden

In [8]:
# Model parameters
emsize = 200  # size of word embeddings
nhid = 200  # number of hidden units per layer

model_1 = LM_LSTMModel(ntokens, emsize, nhid, num_layers=1, tie_weights = False)
model_1.to(device)

print(f"Model 1 without weight tying: {model_1}")
print(f"Model 1 without weight tying total parameters: {sum(p.numel() for p in model_1.parameters())}")
print(f"Model 1 without weight tying total trainable parameters: {sum(p.numel() for p in model_1.parameters() if p.requires_grad)}")

model_1_wt = LM_LSTMModel(ntokens, emsize, nhid, num_layers=1, tie_weights = True)
model_1_wt.to(device)

print(f"Model 1 with weight tying total parameters: {sum(p.numel() for p in model_1_wt.parameters())}")
print(f"Model 1 with weight tying total trainable parameters: {sum(p.numel() for p in model_1_wt.parameters() if p.requires_grad)}")


model_2 = LM_LSTMModel(ntokens, emsize, nhid, num_layers=2, tie_weights = False)
model_2.to(device)

print(f"Model 2 without weight tying: {model_2}")
print(f"Model 2 without weight tying total parameters: {sum(p.numel() for p in model_2.parameters())}")
print(f"Model 2 without weight tying total trainable parameters: {sum(p.numel() for p in model_2.parameters() if p.requires_grad)}")

model_2_wt = LM_LSTMModel(ntokens, emsize, nhid, num_layers=2, tie_weights = True)
model_2_wt.to(device)

print(f"Model 2 with weight tying total parameters: {sum(p.numel() for p in model_2_wt.parameters())}")
print(f"Model 2 with weight tying total trainable parameters: {sum(p.numel() for p in model_2_wt.parameters() if p.requires_grad)}")

models=[model_1, model_1_wt, model_2, model_2_wt]

Model 1 without weight tying: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Model 1 without weight tying total parameters: 4332001
Model 1 without weight tying total trainable parameters: 4332001
Model 1 with weight tying total parameters: 2331801
Model 1 with weight tying total trainable parameters: 2331801
Model 2 without weight tying: LM_LSTMModel(
  (encoder): Embedding(10001, 200)
  (rnn): LSTM(200, 200, num_layers=2)
  (decoder): Linear(in_features=200, out_features=10001, bias=True)
)
Model 2 without weight tying total parameters: 4653601
Model 2 without weight tying total trainable parameters: 4653601
Model 2 with weight tying total parameters: 2653401
Model 2 with weight tying total trainable parameters: 2653401


Weight tying reduces the total number of parameters in the model by sharing the encoder and decoder weights. In both models with weight tying, the total number of parameters is significantly lower compared to the models without weight tying, while the number of trainable parameters remains the same. This reduction in parameters can improve memory and computational efficiency.

<h3 style="color:rgb(0,120,170)">Training and Evaluation</h3>


<div class="alert alert-warning">
    <b>Exercise 3. [30 Points]</b>
    <ul>
        <li>Using the same setup as in the main lecture/exercise notebook, train all four models for $5$ epochs.</li>
        <li>Using <code>ipdb</code>, look inside the <code>forward</code> function of <code>LM_LSTMModel</code> during training. Check the forward process from input to output particularly by looking at the shapes of tensors. Report the shape of all tensors used in <code>forward</code>. Try to translate the numbers into batches $B$ and sequence length $L$. For instance, if we know that the batch size is $B=32$, a tensor of shape $(32, 128, 3)$ can be interpreted as a batch of $32$ sequences with $3$ channels of size $L=128$. Thus, this tensor can be translated into $(32, 128, 3) \rightarrow (B, L, 3)$. Look at the <a href="https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html">official documentation</a> to understand the order of the dimensions.</li>
        <li>Evaluate the models. Compare the performances of all four models on the train, validation and test set (for the test set, use the best model according to the respective validation set performance), and report your observations. To do so, create a plot showing the following curves:</li>
        <ul>
            <li>Loss on each current training batch before every model update step as function of epochs</li>
            <li>Loss on the validation set at every epoch</li>
        </ul>
        <li>Comment on the results!</li>
    </ul>
</div>

In [None]:
CUT_AFTER_BATCHES = 100  # JUST FOR DEBUGGING: cut the loop after these number of batches. Set to -1 to ignore


def plot_losses(model_losses, model_names):
    plt.figure(figsize=(10, 5))

    for model_loss, model_name in zip(model_losses, model_names):
        plt.plot(model_loss[0], label=f'{model_name} Training Loss')
        plt.plot(model_loss[1], label=f'{model_name} Validation Loss')

    plt.title('Loss Curves')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()
    
def get_batch(data, i, seq_len):
    """
    Get a batch of input data and targets from the given data starting at index i.

    Args:
        data: Input data tensor.
        i: Starting index.
        seq_len: Sequence length.

    Returns:
        batch_data: Batch of input data.
        batch_targets: Batch of target data.
    """
    batch_data = data[i:i+seq_len]
    batch_targets = data[i+1:i+seq_len+1].view(-1)
    return batch_data, batch_targets

def repackage_hidden(hidden):
    """
    Detach the hidden state from the computational graph.
    """
    if isinstance(hidden, torch.Tensor):
        return hidden.detach()
    else:
        return tuple(repackage_hidden(h) for h in hidden)

def evaluate(model, dictionary, max_seq_len, eval_batch_size, eval_data_splits):
    """
    Evaluate the model. Evaluation mode turned on to disable dropout.
    """
    model.eval()
    total_loss = 0.0
    ntokens = len(dictionary)
    start_hidden = None
    n_batches = (eval_data_splits.size(0) - 1) // max_seq_len

    with torch.no_grad():
        for i in range(0, eval_data_splits.size(0) - 1, max_seq_len):
            batch_data, batch_targets = get_batch(eval_data_splits, i, max_seq_len)

            if start_hidden is not None:
                start_hidden = repackage_hidden(start_hidden)

            # Forward pass
            y_hat_logprobs, last_hidden = model(batch_data, start_hidden, return_logs=True)

            y_hat_logprobs = y_hat_logprobs.view(-1, ntokens)
            loss = criterion(y_hat_logprobs, batch_targets.view(-1))

            start_hidden = last_hidden
            total_loss += loss.item()

    return total_loss / n_batches

def evaluate(model, criterion, dictionary, max_seq_len, eval_batch_size, eval_data_splits):
    """
    Evaluate the model. Evaluation mode turned on to disable dropout.
    """
    model.eval()
    total_loss = 0.0
    ntokens = len(dictionary)
    start_hidden = None
    n_batches = (eval_data_splits.size(0) - 1) // max_seq_len

    with torch.no_grad():
        for i in range(0, eval_data_splits.size(0) - 1, max_seq_len):
            batch_data, batch_targets = get_batch(eval_data_splits, i, max_seq_len)

            if start_hidden is not None:
                start_hidden = repackage_hidden(start_hidden)

            # Forward pass
            y_hat_logprobs, last_hidden = model(batch_data, start_hidden, return_logs=True)

            y_hat_logprobs = y_hat_logprobs.view(-1, ntokens)
            loss = criterion(y_hat_logprobs, batch_targets.view(-1))

            start_hidden = last_hidden
            total_loss += loss.item()

    return total_loss / n_batches

def train(model, optimizer, dictionary, max_seq_len, train_batch_size, train_data_splits,
          clipping, learning_rate, print_interval, epoch, criterion=torch.nn.NLLLoss()):
    
    """
    Train the model. Training mode turned on to enable dropout.
    """
    model.train()
    total_loss = 0.0
    start_time = time.time()
    ntokens = len(dictionary)
    start_hidden = None
    n_batches = (train_data_splits.size(0) - 1) // max_seq_len

    for batch_i, i in enumerate(range(0, train_data_splits.size(0) - 1, max_seq_len)):
        batch_data, batch_targets = get_batch(train_data_splits, i, max_seq_len)

        # Don't forget it! Otherwise, the gradients are summed together!
        optimizer.zero_grad()

        # Repackaging batches only keeps the value of start_hidden and disconnects its computational graph.
        # If repackaging is not done the, gradients are calculated from the current point to the beginning
        # of the sequence which becomes computationally too expensive.
        if start_hidden is not None:
            start_hidden = repackage_hidden(start_hidden)

        # Forward pass
        y_hat_logprobs, last_hidden = model(batch_data, start_hidden, return_logs=True)

        # Loss computation & backward pass
        y_hat_logprobs = y_hat_logprobs.view(-1, ntokens)
        loss = criterion(y_hat_logprobs, batch_targets.view(-1))
        loss.backward()

        # The last hidden states of the current step is set as the start hidden state of the next step.
        # This passes the information of the current batch to the next batch.
        start_hidden = last_hidden

        # Clipping gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), clipping)

        # Updating parameters using SGD
        optimizer.step()

        total_loss += loss.item()

        if batch_i % print_interval == 0 and batch_i > 0:
            cur_loss = total_loss / print_interval
            elapsed = time.time() - start_time
            throughput = elapsed * 1000 / print_interval
            print(f"| epoch {epoch:3d} | {batch_i:5d}/{n_batches:5d} batches | lr {learning_rate:02.2f} | ms/batch {throughput:5.2f} "
                  f"| loss {cur_loss:5.2f} | perplexity {math.exp(cur_loss):8.2f}")
            total_loss = 0
            start_time = time.time()

        # Cuts the loop (only for debugging)
        if (CUT_AFTER_BATCHES != -1) and (batch_i >= CUT_AFTER_BATCHES):
            print(f"WARNING: Training is interrupted after {batch_i} batches")
            break


model_losses = []
model_names = ["model_1", "model_1_wt", "model_2", "model_2_wt"]

for model in models:
    epochs = 5  # total number of training epochs
    print_interval = 25  # print report statistics every x batches
    lr = 20  # initial learning rate
    clipping = 0.25  # gradient clipping
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    train_losses = []
    val_losses = []
    best_val_loss = None

    # Loop over epochs.
    for epoch in range(epochs):
        epoch_start_time = time.time()
        train_loss = train(model, optimizer, dictionary, max_seq_len, train_batch_size, train_data_splits, clipping, lr, print_interval, epoch)
        val_loss = evaluate(model, dictionary, max_seq_len, eval_batch_size, val_data_splits)

        train_losses.append(train_loss)
        val_losses.append(val_loss)
        print("-" * 100)
        print(f"| end of epoch {epoch:3d} | time: {time.time() - epoch_start_time:5.2f}s"
              f"| valid loss {val_loss:5.2f} | valid perplexity {math.exp(val_loss):8.2f}")
        print("-" * 100)

        # Save the model if the validation loss is the best we've seen so far.
        if not best_val_loss or val_loss < best_val_loss:
            torch.save(model, save_path)
            best_val_loss = val_loss
        else:
            # Anneal the learning rate if no improvement has been seen in the validation dataset.
            lr /= 4.0
            for g in optimizer.param_groups:
                g["lr"] = lr

    model_losses.append((train_losses, val_losses))

plot_losses(model_losses, model_names)

your answer goes here

<h2>Language Generation</h2>

<div class="alert alert-warning">
    <b>Exercise 4. [30 Points]</b>
    <p>
    Copy the language generation code from the main exercise notebook and perform the following tasks:
    </p>
        <ul>
            <li>Compare all four previous models by generating $12$ words that append the starting word <tt>"despite"</tt>.</li>
            <li>For each model, retrieve the top $10$ wordIDs with the highest probabilities from the generated probability distribution (<code>prob_dist</code>) following the starting word <tt>"despite"</tt>. Fetch the corresponding words of these wordIDs. Do you observe any specific linguistic characteristic common between these words?</li>
            <li>The implementation in the main exercise notebook is based on sampling. Implement a second deterministic variant based on the <i>top-1</i> approach. In this particular variant, the generated word is the word with the highest probability in the predicted probability distribution. Repeat the same procedure as before (i.e., generate $12$ words that append the starting word <tt>"despite"</tt>).</li>
        </ul>
</div>

In [17]:
GENERATION_LENGTH = 12
START_WORD = "despite"

for model_idx, model in enumerate(models):
    print(f"Model {model_idx + 1}")
    
    start_hidden = None
    wordid_input = dictionary.word2idx[START_WORD.lower()]
    generated_text = START_WORD
    
    with torch.no_grad():
        for _ in range(GENERATION_LENGTH):
            data = torch.tensor([wordid_input]).unsqueeze(1).to(device)
            
            y_hat_probs, last_hidden = model(data, start_hidden)
            prob_dist = torch.distributions.Categorical(y_hat_probs.squeeze())
            wordid_input = prob_dist.sample().item()
            
            generated_word = dictionary.idx2word[wordid_input]
            generated_text += " " + generated_word
            
            start_hidden = last_hidden

    print(f"Generated text (probabilistic): {generated_text}")

    top10_wordids = y_hat_probs.squeeze().topk(10).indices.tolist()
    top10_words = [dictionary.idx2word[wordid] for wordid in top10_wordids]
    print(f"Top 10 words with the highest probabilities: {top10_words}")
    
    start_hidden = None
    wordid_input = dictionary.word2idx[START_WORD.lower()]
    generated_text = START_WORD

    with torch.no_grad():
        for _ in range(GENERATION_LENGTH):
            data = torch.tensor([wordid_input]).unsqueeze(1).to(device)
            
            y_hat_probs, last_hidden = model(data, start_hidden)
            wordid_input = y_hat_probs.argmax().item()
            
            generated_word = dictionary.idx2word[wordid_input]
            generated_text += " " + generated_word
            
            start_hidden = last_hidden

    print(f"Generated text (deterministic): {generated_text}\n")


Model 1
Generated text (probabilistic): despite policyholders bankrupt solution oas countries seeing grasp angered heights cleveland district theoretical
Top 10 words with the highest probabilities: ['<eos>', '<unk>', 'and', 'of', 'the', 'a', 'in', 'to', 'that', 'is']
Generated text (deterministic): despite the <unk> of the <unk> of the <unk> of the <unk> of

Model 2
Generated text (probabilistic): despite unhappy good bottling indicted outcry seabrook claimants agreements been patch scotland bit
Top 10 words with the highest probabilities: ['insure', 'dial', 'yankee', 'century', 'dunes', 'undersecretary', 'accomplished', 'hewlett-packard', 'purchase', 'closings']
Generated text (deterministic): despite matthews dial dial dial dial dial dial dial dial dial dial dial

Model 3
Generated text (probabilistic): despite returning judgment book transatlantic round farmer die '80s redford instrumental legislature respective
Top 10 words with the highest probabilities: ['sociologist', 'fan', 'a

Model 1: Basic function words and determiners such as 'to', 'in', 'and', 'of', 'a', 'the', 'that', "'s".
Model 2: Diverse range of nouns, verbs, and modifiers representing various concepts and actions.
Model 3: Mix of nouns, adjectives, and verbs related to intelligence, support, and specific roles or actions.
Model 4: Range of concepts including organizations, food, legal terms, time-related words, and specific locations.