# Assignment 1: Language modelling

In this assignment you will implement and train two or three neural language models: the fixed-window model, the recurrent neural network model from Unit&nbsp;1-2, and optionally a model based on the Transformer architecture from Unit&nbsp;1-3. You will evaluate these models by computing their perplexity on a benchmark dataset.

In [1]:
import torch

For this lab, you should use the GPU if you have one:

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')    # NVIDIA
# device = torch.device('mps')    # Apple Silicon

## Data

The data for this assignment is [WikiText](https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/), a collection of more than 100 million tokens extracted from the “Good” and “Featured” articles on Wikipedia. We will use the small version of the dataset, which contains slightly more than 2.5 million tokens.

The next cell contains code for an object that will act as a container for the “training” and the “validation” section of the data. We fill this container by reading the corresponding text files. The only processing we do is to whitespace-tokenise and to replace each newline with an end-of-sentence token.

In [3]:
class WikiText(object):

    def __init__(self):
        self.word2idx = {}
        self.idx2word = []
        self.train = self.read_data('wiki.train.tokens')
        self.valid = self.read_data('wiki.valid.tokens')

    def read_data(self, path):
        ids = []
        with open(path, encoding='utf-8') as source:
            for line in source:
                for word in line.split() + ['<eos>']:
                    if word not in self.word2idx:
                        self.word2idx[word] = len(self.word2idx)
                        self.idx2word.append(word)
                    ids.append(self.word2idx[word])
        return ids

The cell below loads the data and prints the total number of tokens and the size of the vocabulary.

In [4]:
wikitext = WikiText()

print('Tokens in train:', len(wikitext.train))
print('Tokens in valid:', len(wikitext.valid))
print('Vocabulary size:', len(wikitext.word2idx))

Tokens in train: 2088628
Tokens in valid: 217646
Vocabulary size: 33278


## Problem 1: Fixed-window model

In this section, you will implement and train the fixed-window neural language model proposed by [Bengio et al. (2003)](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) and presented in the lectures. Recall that an input to the network takes the form of a vector of $n-1$ integers representing the preceding words. Each integer is mapped to a vector via an embedding layer. (All positions share the same embedding.) The embedding vectors are then concatenated and sent through a two-layer feed-forward network with a non-linearity in the form of a rectified linear unit (ReLU) and a final softmax layer.

### Problem 1.1: Vectorise the data

Your first task is to write code for transforming the data in the WikiText container into a vectorised form that can be fed to the fixed-window model. Concretely, you will implement a [collate function](https://pytorch.org/docs/stable/data.html#dataloader-collate-fn) in the form of a callable vectoriser object. Complete the skeleton code in the cell below:

In [5]:
class FixedWindowVectorizer(object):
    def __init__(self, n):
        self.n = n

    def __call__(self, data):
        return torch.tensor([data[i:i+self.n-1] for i in range(len(data)-self.n+1)], device=device), torch.tensor([data[i+self.n-1] for i in range(len(data)-self.n+1)], device=device)

Your code should implement the following specification:

**__init__** (*self*, *n*)

> Creates a new vectoriser with n-gram order $n$. Your code should be able to handle arbitrary n-gram orders $n \geq 1$.

**__call__** (*self*, *data*)

> Transforms WikiText *data* (a list of word ids) into a pair of tensors $\mathbf{X}$, $\mathbf{y}$ that can be used to train the fixed-window model. Let $N$ be the total number of $n$-grams from the token list; then $\mathbf{X}$ is a matrix with shape $(N, n-1)$ and $\mathbf{y}$ is a vector with length $N$.

#### 🤞 Test your code

Test your implementation by running the code in the next cell. Does the output match your expectation?

In [6]:
valid_x, valid_y = FixedWindowVectorizer(3)(wikitext.valid)

print(valid_x.size(), valid_y.size())

torch.Size([217644, 2]) torch.Size([217644])


### Problem 1.2: Implement the model

Your next task is to implement the fixed-window model based on the graphical specification given in the lecture.

In [7]:
import torch.nn as nn

class FixedWindowModel(nn.Module):

    def __init__(self, n, n_words, embedding_dim=64, hidden_dim=64):
        super().__init__()
        self.n = n
        self.n_words = n_words
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(n_words, embedding_dim)
        self.fc1 = nn.Linear((n-1) * embedding_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, n_words)

    def forward(self, x):
        return self.fc2(torch.tanh(self.fc1(self.embedding(x).view(x.size(0), -1))))

    def generate(self, context, n_tokens=10, temperature=1.0):
        for _ in range(n_tokens):
            # Sample with temperature: divide logits by temperature before softmax, then sample multinomial distribution
            # We can only send the last n-1 tokens to the model
            logits = torch.softmax(self(context[:, -self.n+1:])/temperature, dim=0)
            idxs = logits.multinomial(num_samples=1, replacement=True)
            context = torch.cat((context, idxs), dim=1)
        return context

Here is the specification of the two methods:

**__init__** (*self*, *n*, *n_words*, *embedding_dim*=64, *hidden_dim*=64)

> Creates a new fixed-window neural language model. The argument *n* specifies the model&rsquo;s $n$-gram order. The argument *n_words* is the number of words in the vocabulary. The arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the hidden layer of the feedforward network, respectively; their default value is 64.

**forward** (*self*, *x*)

> Computes the network output on an input batch *x*. The shape of *x* is $(B, n-1)$, where $B$ is the batch size. The output of the forward pass is a tensor of shape $(B, V)$ where $V$ is the number of words in the vocabulary.

#### 🤞 Test your code

Test your code by instantiating the model and feeding it a batch of examples from the training data.

In [8]:
# Instantiate objects
test_model_fw = FixedWindowModel(3, len(wikitext.word2idx)).to(device)
vectorizer = FixedWindowVectorizer(3)

# Print model
print(test_model_fw)
print('Number of parameters:', sum(p.numel() for p in test_model_fw.parameters() if p.requires_grad))

# Test the forward pass
x, y = vectorizer(wikitext.valid)
print(test_model_fw(x[:16]).size())

# Test generate function
test_model_fw.generate(torch.tensor([[1, 2, 3], [1, 2, 4]], device=device))

FixedWindowModel(
  (embedding): Embedding(33278, 64)
  (fc1): Linear(in_features=128, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=33278, bias=True)
)
Number of parameters: 4301118
torch.Size([16, 33278])


tensor([[    1,     2,     3, 16738, 22230,  3839,  3877, 27720,  6772, 14499,
         24509,  7938,  5720],
        [    1,     2,     4, 22129, 33270, 20146, 17175,  4726, 22297,  9838,
         10773,  8804, 19095]], device='cuda:0')

### Problem 1.3: Train the model

Next, write code to train the fixed-window model using minibatch gradient descent and the cross-entropy loss function. This should be a straightforward generalisation of the training loops that you have seen so far. Complete the skeleton code in the cell below:

In [10]:
def perplexity_fixed_window(model, vectorizer, data, batch_size=3072):
    # Test the model by computing the perplexity on the validation set
    # Save memory by not storing gradients
    with torch.no_grad():
        model.train(mode=False)
        loss_fn = nn.CrossEntropyLoss()
        nlls = []
        for i in range(0, len(data)-model.n+1, batch_size):
            x, y = vectorizer(data[i:i+batch_size])
            y_pred = model(x)
            nlls.append(loss_fn(y_pred, y))
        return torch.exp(torch.stack(nlls).mean()).item()

def train_fixed_window(n, n_epochs=2, batch_size=3072, lr=1e-2):
    # Initialization 
    model = FixedWindowModel(n, len(wikitext.word2idx)).to(device)
    vectorizer = FixedWindowVectorizer(n)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-6)
    loss_fn = nn.CrossEntropyLoss()

    # Training loop
    for epoch in range(n_epochs):
        model.train()
        for i in range(0, len(wikitext.train)-n+1, batch_size):
            x, y = vectorizer(wikitext.train[i:i+batch_size])
            y_pred = model(x)
            loss = loss_fn(y_pred, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print('Epoch:', epoch)
        print('Validation Perplexity:', perplexity_fixed_window(model, vectorizer, wikitext.valid))    
    return model

Here is the specification of the training function:

**train_fixed_window** (*n*, *n_epochs* = 2, *batch_size* = 3072, *lr* = 0.01)

> Trains a fixed-window neural language model of order *n* using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.

The code in the cell below trains a trigram model.

In [13]:
model_fixed_window = train_fixed_window(3)

Epoch: 0
Validation Perplexity: 330.84820556640625
Epoch: 1
Validation Perplexity: 318.74090576171875


#### Performance goal

Your submitted notebook must contain output demonstrating a validation perplexity of **at most 360** after training for two epochs with the default parameters.

⚠️ Computing the validation perplexity in one go (for the full validation set) will most probably exhaust your computer’s memory and/or take a lot of time. Instead, do the computation at the minibatch level and aggregate the results.

#### 🤞 Test your code

To see whether your network is learning something, print or plot the loss and/or the perplexity on the training data. If the two values do not decrease during training, try to find the problem before wasting time (and electricity) on useless computation.

Training and even evaluation will take some time – on a CPU, you should expect several minutes per epoch, depending on hardware. Our reference implementation uses a GPU and runs in 45&nbsp;seconds on a MacBook Pro (2023).

## Problem 2: Recurrent neural network model

In this section, you will implement the recurrent neural network language model. Recall that an input to this model is a vector of word ids. Each integer is mapped to an embedding vector. The sequence of embedded vectors is then fed into an unrolled LSTM. At each position $i$ in the sequence, the hidden state of the LSTM at that position is sent through a linear transformation into a final softmax layer representing the probability distribution over the words at position $i+1$. In theory, the input vector could represent the complete training data; for practical reasons, however, we will truncate the input to some fixed value *bptt_len*. This length is called the **backpropagation-through-time horizon**.

### Problem 2.1: Vectorise the data

As in the previous problem, your first task is to transform the data in the WikiText container into a vectorised form that can be fed to the model.

In [14]:
class RNNVectorizer(object):
    def __init__(self, bptt_len):
        self.bptt_len = bptt_len

    def __call__(self, data):
        return torch.tensor([data[i:i+self.bptt_len] for i in range(0, len(data)-self.bptt_len-1, self.bptt_len)], device=device), torch.tensor([data[i+1:i+self.bptt_len+1] for i in range(0, len(data)-self.bptt_len-1, self.bptt_len)], device=device)

Your vectoriser should meet the following specification:

**__init__** (*self*, *bptt_len*)

> Creates a new vectoriser. The parameter *bptt_len* specifies the backpropagation-through-time horizon.

**__call__** (*self*, *data*)

> Transforms a list of token indexes *data* into a pair of tensors $\mathbf{X}$, $\mathbf{Y}$ that can be used to train the recurrent neural language model. The rows of both tensors represent contiguous subsequences of token indexes of length *bptt_len*. Compared to the sequences in $\mathbf{X}$, the corresponding sequences in $\mathbf{Y}$ are shifted one position to the right. More precisely, if the $i$ th row of $\mathbf{X}$ is the sequence that starts at token position $j$, then the same row of $\mathbf{Y}$ is the sequence that starts at position $j+1$.

#### 🤞 Test your code

Test your implementation by running the following code:

In [15]:
valid_x, valid_y = RNNVectorizer(32)(wikitext.valid)

print(valid_x.size(), valid_y.size())

print(valid_x[0])
print(valid_y[0])

assert(valid_x[0][-1] == valid_y[0][-2])

torch.Size([6801, 32]) torch.Size([6801, 32])
tensor([    0,     1, 32966, 32967,     1,     0,     0, 32966, 32967,    13,
          406,    23,    17,  6253, 19902,   310,  1444, 19902,    13,    26,
           27,  2576,    16,     9, 19902,   115,    17,  4929,  4121,  9611,
           13,  4854], device='cuda:0')
tensor([    1, 32966, 32967,     1,     0,     0, 32966, 32967,    13,   406,
           23,    17,  6253, 19902,   310,  1444, 19902,    13,    26,    27,
         2576,    16,     9, 19902,   115,    17,  4929,  4121,  9611,    13,
         4854,  2429], device='cuda:0')


### Problem 2.2: Implement the model

Your next task is to implement the recurrent neural network model based on the graphical specification.

In [16]:
import torch.nn as nn

class RNNModel(nn.Module):
    
    def __init__(self, n_words, embedding_dim=64, hidden_dim=64):
        super().__init__()
        self.n_words = n_words
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(n_words, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, n_words)

    def forward(self, x):
        x = self.embedding(x)
        x, _ = self.lstm(x)
        return self.fc(x)

    def generate(self, context, n_tokens=10, temperature=1.0):
        for _ in range(n_tokens):
            # Sample with temperature: divide logits by temperature before softmax, then sample multinomial distribution
            # We want the logits for the last token (the predicted one)
            logits = torch.softmax(self(context)[:, -1, :] / temperature, dim=0)
            idxs = logits.multinomial(num_samples=1, replacement=True)
            context = torch.cat((context, idxs), dim=1)
        return context

Your implementation should follow this specification:

**__init__** (*self*, *n_words*, *embedding_dim* = 64, *hidden_dim* = 64)

> Creates a new recurrent neural network language model based on an LSTM. The argument *n_words* is the number of words in the vocabulary. The arguments *embedding_dim* and *hidden_dim* specify the dimensionalities of the embedding layer and the LSTM hidden layer, respectively; their default value is 64.

**forward** (*self*, *x*)

> Computes the network output on an input batch *x*. The shape of *x* is $(B, H)$, where $B$ is the batch size and $H$ is the length of each input sequence. The shape of the output tensor is $(B, H, V)$, where $V$ is the size of the vocabulary.

In [17]:
# Instantiate objects
test_model_rnn = RNNModel(len(wikitext.word2idx)).to(device)
vectorizer = RNNVectorizer(32)

# Print model
print(test_model_rnn)
print('Number of parameters:', sum(p.numel() for p in test_model_rnn.parameters() if p.requires_grad))

# Test the forward pass
x, y = vectorizer(wikitext.valid)
print(test_model_rnn(x[:16]).shape)

# Test the generate function
print(test_model_rnn.generate(torch.tensor([[1, 2, 3], [1, 2, 4]], device=device)))

RNNModel(
  (embedding): Embedding(33278, 64)
  (lstm): LSTM(64, 64, batch_first=True)
  (fc): Linear(in_features=64, out_features=33278, bias=True)
)
Number of parameters: 4326142
torch.Size([16, 32, 33278])
tensor([[    1,     2,     3,   400, 18809, 12429, 14255, 21129, 27567,  4154,
           999, 24100, 24929],
        [    1,     2,     4, 24760,     7, 10194,  9776,  6553, 24030, 10958,
         18523, 24004,  6398]], device='cuda:0')


#### 🤞 Test your code

Test your code by instantiating the model and feeding it a batch of examples from the training data.

### Problem 2.3: Train the model

The training loop for the recurrent neural network model is essentially identical to the loop that you wrote for the feed-forward model. The only thing to note is that the cross-entropy loss function expects its input to be a two-dimensional tensor; you will therefore have to re-shape the output tensor from the LSTM as well as the gold-standard output tensor in a suitable way. The most efficient way to do so is to use the [`view()`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view) method.

In [32]:
def perplexity_rnn(model, vectorizer, data, batch_size=64):
    # Test the model by computing the perplexity on the validation set
    # Save memory by not storing gradients
    with torch.no_grad():
        model.train(mode=False)
        loss_fn = nn.CrossEntropyLoss()
        nlls = []
        for i in range(0, len(data)-model.lstm.input_size-1, batch_size):
            x, y = vectorizer(data[i:i+batch_size])
            y_pred = model(x)
            nlls.append(loss_fn(y_pred.view(-1, len(wikitext.word2idx)), y.view(-1)))
        return torch.exp(torch.stack(nlls).mean()).item()

def train_rnn_model(model, n_epochs=2, batch_size=3072, bptt_len=32, lr=1e-2):
    # Initialization
    vectorizer = RNNVectorizer(bptt_len)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=1e-6)
    loss_fn = nn.CrossEntropyLoss()

    # Training loop
    for epoch in range(n_epochs):
        model.train()
        x, y = vectorizer(wikitext.train)
        for i in range(0, len(x), batch_size):
            x_batch, y_batch = x[i:i+batch_size], y[i:i+batch_size]
            y_pred = model(x_batch)
            loss = loss_fn(y_pred.view(-1, len(wikitext.word2idx)), y_batch.view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print('Epoch:', epoch)
        print('Validation Perplexity:', perplexity_rnn(model, vectorizer, wikitext.valid))    
    return model

def train_rnn(n_epochs=2, batch_size=3072, bptt_len=32, lr=1e-2):
    return train_rnn_model(RNNModel(len(wikitext.word2idx)).to(device), n_epochs, batch_size, bptt_len, lr)

Here is the specification of the training function:

**train_rnn** (*n_epochs* = 2, *batch_size* = 3072, *bptt_len* = 32, *lr* = 0.01)

> Trains a recurrent neural network language model on the WikiText data using minibatch gradient descent and returns it. The parameters *n_epochs* and *batch_size* specify the number of training epochs and the minibatch size, respectively. The parameter *bptt_len* specifies the length of the backpropagation-through-time horizon, that is, the length of the input and output sequences. Training uses the cross-entropy loss function and the [Adam optimizer](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) with learning rate *lr*. After each epoch, prints the perplexity of the model on the validation data.

Evaluate your model by running the following code cell:

In [33]:
# Had to change batch_size and lr due to OOM issues
model_rnn = train_rnn(2, 64, 32, 0.005)

Epoch: 0
Validation Perplexity: 328.1418762207031
Epoch: 1
Validation Perplexity: 267.7008361816406


#### Performance goal

Your submitted notebook must contain output demonstrating a validation perplexity of **at most 280** after training for two epochs with the default hyperparameters.

## Problem 3: Transformer model (optional)

If you are up for a challenge, try implementing a Transformer-based language model. The required vectoriser is identical to the vectoriser for the RNN model. For the model itself, you can use the Pytorch modules [`nn.TransformerEncoder`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html) and [`nn.TransformerEncoderLayer`](https://pytorch.org/docs/stable/generated/torch.nn.TransformerEncoderLayer.html).  To represent positional information, follow the approach from the original Transformer paper and use sine and cosine functions of different frequencies ([details](https://nlp.seas.harvard.edu/2018/04/03/attention.html#positional-encoding)), or learn position-specific embeddings. Can you get a lower perplexity than for the RNN model?

## Problem 4: Generation

In this section, you will implement a simple generation mechanism for the language models you have implemented.

Recall that one way to generate text with a language model is to repeatedly sample from the model’s output distribution, conditioning on some context. More specifically, this involves treating the softmax-normalised logits of the model as a multinomial distribution. The “creativeness” of the generation can be controlled with the temperature parameter of the softmax distribution.

To implement this recipe, we first ask you to extend each model with a `generate` method according to the following specification:

**generate** (*self*, *context*, *n_tokens* = 10, *temperature* = 1.0)

> Takes a batch of context tokens *context* and extends it by sampling *n_tokens* new tokens from the model’s output distribution, scaled with the temperature *temperature*. Returns the extended context.

In a second stage, you should implement a convenience function `generate` that allows you to easily generate text with different models, like this:

```
generate(model_fixed_window, 'i like', max_tokens=10, temperature=1.5)
```

In [34]:
def generate(model, context, max_tokens=3, temperature=1.0):
    # Tokenize
    input_ids = torch.tensor([[wikitext.word2idx[word] for word in context.split()]], device=device)
    # We do not need to store the gradients when generating
    with torch.no_grad():
        input_ids = model.generate(input_ids, n_tokens=max_tokens, temperature=temperature)
    return " ".join(wikitext.idx2word[idx] for idx in input_ids[0]).split("<eos>")[0]

print(generate(model_fixed_window, 'i like', max_tokens=10, temperature=1.5))
print(generate(model_rnn, 'i like', max_tokens=10, temperature=1.5))

i like tonnage plantations inactivity nudes bodily rap 1932 Fries suspense complimenting
i like friend pack York Советская oral Tarpan neon Schumann Gallia Spears


Here is the specification of the convenience function:

**generate** (*model*, *context*, *max_tokens* = 10, *temperature* = 1.0)

> Takes a context sentence *context*, tokenises and vectorises it, and passes it to the specified *model* to generate new text. The new text consists of at most *max_tokens*, but is cut off at the first `<eos>` token. Returns the generated text (including the context).

## Problem 5: Parameter initialisation

The error surfaces explored when training neural networks can be very complex. Because of this, it is important to choose “good” initial values for the parameters. In PyTorch, the weights of the embedding layer are initialised by sampling from the standard normal distribution $\mathcal{N}(0, 1)$. Test how changing the initialisation affects the perplexity of your language models. Find research articles that propose different initialisation strategies.

Write a short (150&nbsp;words) report about your experiments and literature search. Use the following prompts:

* What different initialisation did you try? What results did you get?
* How do your results compare to what was suggested by the research articles?
* What did you learn? How, exactly, did you learn it? Why does this learning matter?

### Report

#### Methods

We use four different initialization strategies for the embedding weights:
1. Uniform distribution
2. Normal distribution
3. Xavier uniform distribution [1] (we use uniform as proposed in the original paper)
4. Kaiming normal distribution [2] (we use normal as proposed in the original paper)

We train ten RNN models with each initialization, to account for randomness.\
For each of the strategies, the default pytorch parameters are employed.\
The code and execution results can be found after this report.

#### Results
```
Uniform initialization
Mean:  279.8584503173828
Std:  4.1893097628015115

Normal initialization
Mean:  270.80449829101565
Std:  2.6815572422077953

Xavier initialization
Mean:  273.4174377441406
Std:  3.3659920284614633

Kaiming initialization
Mean:  272.05133361816405
Std:  1.8209715806031563
```

The normal distribution achieves the best mean results, following by Kaiming, Xavier, and Uniform.

#### Discussion

The Xavier init [1] tries to improve the uniform initialization by taking into account the input size to maintain the variance of the output regardless of the input size. Maintaining the variance helps tackle the exploding/vanishing gradient problem in deep neural networks.
Our results show that Xavier indeed performs better than the uniform initialization, as also observed in [1].

The Kaiming init takes into account the non-linearity of activation functions. It is worth noting that the Kaiming paper [2] works heavily on top of rectified activation (i.e., ReLU) and that the LSTM layer uses tanh, a symmetric function, for activation.
This difference might be behind the higher perplexity w.r.t. the normal distribution.
Nonetheless, it still performs better than the Xavier init as also observed in [2].

Overall, the default initialization scheme (i.e., N(0, 1)) performed slightly better than Xavier and Kaiming.

#### References

[1] Glorot, X., & Bengio, Y. (2010, March). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249-256). JMLR Workshop and Conference Proceedings.\
[2] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision (pp. 1026-1034).

In [49]:
def init_uniform(model):
    torch.nn.init.uniform_(model.embedding.weight)

def init_normal(model):
    nn.init.normal_(model.embedding.weight)

def init_xavier(model):
    nn.init.xavier_uniform_(model.embedding.weight)

def init_kaiming(model):
    nn.init.kaiming_normal_(model.embedding.weight)

In [47]:
def train_eval_rnn_uniform():
    model = RNNModel(len(wikitext.word2idx)).to(device)
    init_uniform(model)
    model = train_rnn_model(model, 2, 64, 32, 0.005)
    return perplexity_rnn(model, RNNVectorizer(32), wikitext.valid)

def train_eval_rnn_normal():
    model = RNNModel(len(wikitext.word2idx)).to(device)
    init_normal(model)
    model = train_rnn_model(model, 2, 64, 32, 0.005)
    return perplexity_rnn(model, RNNVectorizer(32), wikitext.valid)

def train_eval_rnn_xavier():
    model = RNNModel(len(wikitext.word2idx)).to(device)
    init_xavier(model)
    model = train_rnn_model(model, 2, 64, 32, 0.005)
    return perplexity_rnn(model, RNNVectorizer(32), wikitext.valid)

def train_eval_rnn_kaiming():
    model = RNNModel(len(wikitext.word2idx)).to(device)
    init_kaiming(model)
    model = train_rnn_model(model, 2, 64, 32, 0.005)
    return perplexity_rnn(model, RNNVectorizer(32), wikitext.valid)

# We train 10 models for each initialization technique, to account for randomness
perplexity_uniform = []
perplexity_normal = []
perplexity_xavier = []
perplexity_kaiming = []
for _ in range(10):
    perplexity_uniform.append(train_eval_rnn_uniform())
    perplexity_normal.append(train_eval_rnn_normal())
    perplexity_xavier.append(train_eval_rnn_xavier())
    perplexity_kaiming.append(train_eval_rnn_kaiming())

Epoch: 0
Validation Perplexity: 353.9242858886719
Epoch: 1
Validation Perplexity: 274.8731384277344
Epoch: 0
Validation Perplexity: 339.4307556152344
Epoch: 1
Validation Perplexity: 275.37506103515625
Epoch: 0
Validation Perplexity: 353.7783203125
Epoch: 1
Validation Perplexity: 276.1128234863281
Epoch: 0
Validation Perplexity: 344.0949401855469
Epoch: 1
Validation Perplexity: 273.32769775390625
Epoch: 0
Validation Perplexity: 366.4917297363281
Epoch: 1
Validation Perplexity: 284.0852355957031
Epoch: 0
Validation Perplexity: 328.4405517578125
Epoch: 1
Validation Perplexity: 270.49859619140625
Epoch: 0
Validation Perplexity: 346.5953063964844
Epoch: 1
Validation Perplexity: 273.08111572265625
Epoch: 0
Validation Perplexity: 338.7983703613281
Epoch: 1
Validation Perplexity: 269.53759765625
Epoch: 0
Validation Perplexity: 360.36260986328125
Epoch: 1
Validation Perplexity: 277.8545227050781
Epoch: 0
Validation Perplexity: 329.918701171875
Epoch: 1
Validation Perplexity: 269.3349609375
Epoc

In [48]:
import numpy as np

print("Uniform initialization")
print("Mean: ", np.mean(perplexity_uniform))
print("Std: ", np.std(perplexity_uniform))
print()

print("Normal initialization")
print("Mean: ", np.mean(perplexity_normal))
print("Std: ", np.std(perplexity_normal))
print()

print("Xavier initialization")
print("Mean: ", np.mean(perplexity_xavier))
print("Std: ", np.std(perplexity_xavier))
print()

print("Kaiming initialization")
print("Mean: ", np.mean(perplexity_kaiming))
print("Std: ", np.std(perplexity_kaiming))
print()

Uniform initialization
Mean:  279.8584503173828
Std:  4.1893097628015115

Normal initialization
Mean:  270.80449829101565
Std:  2.6815572422077953

Xavier initialization
Mean:  273.4174377441406
Std:  3.3659920284614633

Kaiming initialization
Mean:  272.05133361816405
Std:  1.8209715806031563

