# Lab: PyTorch & TorchText

_Author: Konstantin Todorov_

Welcome. This lab teaches the basics of PyTorch, one of the most widely used Python libraries for machine learning. You will also learn about TorchText, a companion library that helps with training Natural Language Processing (NLP) models. TorchText also provides readily available datasets, such as the IMDB dataset that we use in this lab.

At the end of this lab, you will know how to train your own basic model, evaluate it, and apply techniques that improve its performance.

Notes on implementation:

* You should write your code and answers in this IPython Notebook: http://ipython.org/notebook.html. If you have problems, please contact your teaching assistant.
* Please write your answers right below the questions.
* Among the first lines of your notebook should be "%pylab inline". This imports all required modules, and your plots will appear inline.

In [None]:
%pylab inline

plt.rcParams["figure.figsize"] = [20,10]

In [None]:
# This cell makes sure that you have all the necessary libraries installed

import sys
import platform
from importlib.util import find_spec, module_from_spec

def check_newer_version(version_inst, version_nec):
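    # Drop local build suffixes such as '+cu111' before comparing the numeric parts
    version_inst_split = version_inst.split('+')[0].split('.')
    version_nec_split = version_nec.split('+')[0].split('.')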
    for i in range(min(len(version_inst_split), len(version_nec_split))):
        if int(version_nec_split[i]) > int(version_inst_split[i]):
            return False
        elif int(version_nec_split[i]) < int(version_inst_split[i]):
            return True
    return True


module_list = [('torch', '1.8.0'),
               ('torchtext', '0.9.0'), 
               ('matplotlib', '3.0.0'), 
               ('numpy', '1.13.1'), 
               ('python', '3.6.2')]

packages_correct = True
packages_errors = []

for module_name, version in module_list:
#     if module_name == 'scikit-learn':
#         module_name = 'sklearn'
    if 'python' in module_name:
        python_version = platform.python_version()
        if not check_newer_version(python_version, version):
            packages_correct = False
            error = f'Update {module_name} to version {version}. Current version is {python_version}.'
            packages_errors.append(error) 
            print(error)
    else:
        spec = find_spec(module_name)
        if spec is None:
            packages_correct = False
            error = f'Install {module_name} with version {version} or newer, it is required for this assignment!'
            packages_errors.append(error) 
            print(error)
        else:
            x = __import__(module_name)
            if hasattr(x, '__version__') and not check_newer_version(x.__version__, version):
                packages_correct = False
                error = f'Update {module_name} to version {version}. Current version is {x.__version__}.'
                packages_errors.append(error) 
                print(error)

try:
    from google.colab import drive
    packages_correct = False
    error = """Please, don't use google colab!
It will make it much more complicated for us to check your homework as it merges all the cells into one."""
    packages_errors.append(error) 
    print(error)
except ImportError:
    pass

packages_errors = '\n'.join(packages_errors)

In [None]:
# We must import all required libraries. These should be enough. 
# In case you need extra libraries, feel free to import them in cells below.

import torch
import torch.nn as nn

import torchtext

In [None]:
# We make use of a GPU if one is available on the current system. 
# Using a GPU can make training and matrix operations orders of magnitude faster

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [None]:
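# We set a seed manually for reproducibility. 
# This will ensure that every time you run the notebook on the same machine you will receive the same results

# %pylab exposes numpy.random under the name `random`,
# so we import the standard library module explicitly
import random

SEED = 42

torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

# `device` is a torch.device object, so we compare its type rather than the object itself
if device.type == 'cuda':
    torch.backends.cudnn.benchmark = False
    torch.cuda.manual_seed_all(SEED)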

## Part I: Preparing the data

We will use TorchText and, more specifically, the IMDB sentiment analysis dataset. We first load it from the `torchtext.datasets` namespace. For testing purposes, we can load only the `train` split. Note that different datasets have different splits. You can see all available datasets and their properties [here](https://pytorch.org/text/stable/datasets.html#imdb). 

In [None]:
train_imdb_iterator = torchtext.datasets.IMDB(split='train')

Next, let's see what kind of data we are working with. The `train_imdb_iterator` is a Python iterator, so we can step through it, for example by calling the `next` function.

In [None]:
label, text = next(train_imdb_iterator)

print(f'Text sequence:\n{text}')
print(f'Sentiment: {label}')

In [None]:
label, text = next(train_imdb_iterator)

print(f'Text sequence:\n{text}')
print(f'Sentiment: {label}')

As you can see, each element from the dataset contains an IMDB review and a sentiment ground truth label. The label can be either positive (pos) or negative (neg). Our goal is to train a machine learning model which can predict the sentiment of a given text.

In [None]:
import torchtext.datasets
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab
from collections import Counter

def get_tokenized_data(tokenizer):
    train_iterator = torchtext.datasets.IMDB(split='train')
    
    counter = Counter()
    for (label, line) in train_iterator:
        counter.update(tokenizer(line))
        
    return counter

def build_vocabulary(data_counter, vectors=None):
    vocab = Vocab(data_counter, min_freq=1, vectors=vectors)
    return vocab

In [None]:
tokenizer = get_tokenizer('basic_english')
data_counter = get_tokenized_data(tokenizer)
vocab = build_vocabulary(data_counter)

In [None]:
def tokenize_text(text, vocab, tokenizer):
    return [vocab[token] for token in tokenizer(text)]

def tokenize_label(label):
    if label == 'neg':
        return 0
    else:
        return 1

In [None]:
tokenize_text('here is an example', vocab, tokenizer)

In [None]:
tokenize_label('pos')

In [None]:
from torch.utils.data import DataLoader

def collate_batch(batch, vocab, tokenizer):
    batch_size = len(batch)
    label_list, text_list = [], []
    
    # Tokenize labels and texts
    for (label, text) in batch:
        label_list.append(tokenize_label(label))
        tokenized_text = tokenize_text(text, vocab, tokenizer)
        text_list.append(tokenized_text)

    # Put all texts into a single numpy array of uniform length
    # All sequences shorter than the maximum length are padded on the right with 0
    lengths = [len(text) for text in text_list]
    max_length = max(lengths)
    padded_sequences = np.zeros((batch_size, max_length), dtype=np.int64)
    for i, (length, text) in enumerate(zip(lengths, text_list)):
        padded_sequences[i, :length] = text
        
    # Finally transform the arrays into tensors and return to the dataloader
    label_tensor = torch.tensor(label_list, dtype=torch.float32).to(device)
    sequences_tensor = torch.from_numpy(padded_sequences).to(device)
    return sequences_tensor, label_tensor
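
The padding step in `collate_batch` can also be expressed with PyTorch's built-in `torch.nn.utils.rnn.pad_sequence`. The following is a minimal sketch on made-up token indices, not the code this lab uses; the variable names are hypothetical.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-index sequences of different lengths (toy data)
sequences = [
    torch.tensor([4, 8, 15]),
    torch.tensor([16, 23]),
    torch.tensor([42, 7, 9, 3, 5]),
]

# Pad every sequence on the right with 0 up to the longest length in the batch
padded = pad_sequence(sequences, batch_first=True, padding_value=0)

print(padded.shape)
print(padded)
```

Both approaches pad with the index 0. It may be worth checking which token that index maps to in your vocabulary (for example with `vocab.itos[0]`), since padded positions are later fed to the model like any other token.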

In [None]:
# Let's see how many unique tokens we have in our vocabulary

print(f"Unique tokens in vocabulary: {len(vocab):,}")

In [None]:
# We can also check the most common tokens in our vocabulary

print(vocab.freqs.most_common(20))

Vocabularies contain a so-called **i**ndex **to s**tring list: a list of strings in which the position of an element is the index of that token in the vocabulary. We can access this list through the `.itos` property of the vocabulary.

In [None]:
print(vocab.itos[:10])

Similarly, vocabularies also contain the opposite mapping, namely **s**tring **to i**ndex: a dictionary whose keys are the tokens and whose values are the corresponding vocabulary indices. It can be examined through the `.stoi` property of the vocabulary.

In [None]:
print(list(vocab.stoi.items())[:10])
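
To make the two mappings concrete, here is a small pure-Python sketch of how `itos` and `stoi` relate to each other. It is an illustration only: the toy corpus, the special tokens and the `lookup` helper are assumptions, and the real TorchText vocabulary may order its tokens differently.

```python
from collections import Counter

# Toy corpus standing in for the tokenized IMDB reviews
tokens = "the movie was great the acting was great".split()
counter = Counter(tokens)

# Hypothetical special tokens come first, then tokens by descending frequency
specials = ["<unk>", "<pad>"]
by_frequency = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
itos = specials + [token for token, _ in by_frequency]

# stoi is simply the inverse of itos
stoi = {token: index for index, token in enumerate(itos)}

def lookup(token):
    # Tokens that are not in the vocabulary fall back to the index of "<unk>"
    return stoi.get(token, stoi["<unk>"])

print(itos)
print(lookup("great"), lookup("terrible"))
```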

As a final data preparation step, we must take care of the way we _access_ the data. To this end, we use PyTorch's `DataLoader`, which iterates over the data in _batches_, that is, groups of elements that are processed together. Working with multiple elements in one iteration speeds up training and evaluation tremendously. The `DataLoader` can also shuffle the data on every pass, and through its `collate_fn` argument we pass the `collate_batch` function defined above, which tokenizes and pads every batch and moves it to the corresponding `device`.

The `get_dataloaders` function below holds out 5% of the training data for validation and builds one `DataLoader` each for the training, validation and test splits.

In [None]:
from torch.utils.data.dataset import random_split

def get_dataloaders(batch_size, vocab, tokenizer):
    train_iter, test_iter = torchtext.datasets.IMDB()
    train_dataset, test_dataset = list(train_iter), list(test_iter)
    num_train = int(len(train_dataset) * 0.95)
    split_train_, split_valid_ = random_split(
        train_dataset,
        [num_train, len(train_dataset) - num_train])
    
    train_dataloader = DataLoader(
        split_train_,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=lambda x: collate_batch(x, vocab, tokenizer))
    
    valid_dataloader = DataLoader(
        split_valid_,
        batch_size=batch_size,
        collate_fn=lambda x: collate_batch(x, vocab, tokenizer))
    
    test_dataloader = DataLoader(
        test_dataset,
        batch_size=batch_size,
        collate_fn=lambda x: collate_batch(x, vocab, tokenizer))

    return train_dataloader, valid_dataloader, test_dataloader

In [None]:
BATCH_SIZE = 128

train_dataloader, valid_dataloader, test_dataloader = get_dataloaders(BATCH_SIZE, vocab, tokenizer)

## Part II: Model

### nn.Module vs nn.functional

The `torch.nn.Module`, or simply `nn.Module`, is arguably one of the most important parts of PyTorch and what many refer to as its "cornerstone". To build a machine learning model that can backpropagate automatically, one defines an `nn.Module` object and then invokes its `forward` method to run it. This is the object-oriented way of doing things. On the other hand, people also make use of `nn.functional`, which provides some layers and activations as functions that can be called directly on the input rather than by first defining an object. For example, to rescale an image tensor, you call `nn.functional.interpolate` on it.

### Understanding Stateful-ness

Normally, any layer can be seen as a function. For example, a convolution is just a series of multiplications and additions, so it would make sense to implement it as a function, right? But wait: the layer holds weights, which need to be stored and updated while we are training. Therefore, from a programming perspective, a layer is more than a function. It also needs to hold data, which changes as we train our network.

It is important to understand that the data held by the convolutional layer **changes**. This means that the layer has a state which changes as we train. To implement the operation as a plain function, we would also need to define a data structure that holds the weights of the layer separately from the function itself, and then pass this external data structure as an input to our function. To avoid the hassle, we can instead define a class that holds the data structure and make the convolution a member function. This eases our job considerably, as we do not have to worry about stateful variables existing outside of the function.

This all seems rather complicated, but thankfully PyTorch comes to the rescue with `nn.Module` objects, which hold the weights or other state that define the behaviour of a layer. Where no state or weights are required, one can use `nn.functional`, for example for resizing (`nn.functional.interpolate`) or average pooling (`nn.functional.avg_pool2d`).
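
The difference between the stateful and the stateless style can be sketched with a linear layer. In this minimal example (toy sizes, not part of the lab's model), `nn.Linear` owns its weights, while `nn.functional.linear` expects us to pass them in explicitly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(2, 10)

# Stateful: the module owns its weight and bias
layer = nn.Linear(10, 5)
out_module = layer(x)

# Stateless: the same computation, but we pass the state in ourselves
out_functional = F.linear(x, layer.weight, layer.bias)

print(torch.allclose(out_module, out_functional))
```

With the functional form, storing and updating `weight` and `bias` would be our own responsibility, which is exactly the bookkeeping that `nn.Module` takes over.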

### nn.Parameter

Before we step towards writing our own model, we must look into one other important part of PyTorch, namely the `nn.Parameter` class. 

Each `nn.Module` has a `parameters()` function which returns, well, its trainable parameters. We have to explicitly define what these parameters are. However, when we use built-in `nn.Module` objects, all of their trainable weights are already implemented as `nn.Parameter`. If you assign a plain tensor as an attribute of an `nn.Module` object, it won't show up in `parameters()` unless you wrap it in an `nn.Parameter` object. This has been done to facilitate scenarios where you might need to cache a non-differentiable tensor.

Let's consider the following example:

In [None]:
class net1(nn.Module):
  def __init__(self):
    super().__init__()
    # The attribute name must match the one used in forward()
    self.linear = nn.Linear(10,5)
    # A plain tensor attribute: not registered as a parameter
    self.tens = torch.ones(3,4)
    
  def forward(self, x):
    return self.linear(x)

##########################################################

class net2(nn.Module):
  def __init__(self):
    super().__init__()
    self.linear = nn.Linear(10,5) 
    # Wrapped in nn.Parameter: registered as a trainable parameter
    self.tens = nn.Parameter(torch.ones(3,4))
    
  def forward(self, x):
    return self.linear(x)

##########################################################

class net3(nn.Module):
  def __init__(self):
    super().__init__()
    self.linear = nn.Linear(10,5)
    # A nested module: its parameters are registered as well
    self.net  = net2()
    
  def forward(self, x):
    return self.linear(x)

Having defined those simple networks, let's invoke the `named_parameters` function of each. It returns the same parameters as `parameters`, together with the name of each one. Notice how the parameters differ across the three networks.

In [None]:
def print_named_parameters(model):
    for name, parameter in model.named_parameters():
        print(f' ** {name}: {parameter}')
    
print('# Net 1:')
print_named_parameters(net1())

print('\n# Net 2:')
print_named_parameters(net2())

print('\n# Net 3:')
print_named_parameters(net3())
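
For the case mentioned above, caching a tensor that should not be trained, PyTorch also offers `register_buffer`. This goes slightly beyond what the lab uses, so treat the following as an optional sketch; the class and attribute names are hypothetical.

```python
import torch
import torch.nn as nn

class NetWithBuffer(nn.Module):  # hypothetical name, not part of the lab
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 5)
        # Trainable: appears in parameters()
        self.scale = nn.Parameter(torch.ones(5))
        # Not trainable, but saved in state_dict() and moved by .to(device)
        self.register_buffer("running_mean", torch.zeros(5))

    def forward(self, x):
        return self.linear(x) * self.scale - self.running_mean

net = NetWithBuffer()
parameter_names = [name for name, _ in net.named_parameters()]
buffer_names = [name for name, _ in net.named_buffers()]

print(parameter_names)
print(buffer_names)
```

Unlike the plain tensor attribute in `net1`, a buffer follows the module when it is moved to another device and is included when the model is saved.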

### Defining your own model

Now it's time to define our own PyTorch model, which we will train for sentiment analysis on the IMDB dataset. 

We will use the model defined in this example
![Sentiment analysis model](https://pytorch.org/tutorials/_images/text_sentiment_ngrams_model.png "Sentiment analysis model") 
\[[reference](https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html#define-the-model)\]

For simplicity, we will not use an embedding bag but rather a simple embedding layer. This is the first inner module we define. It uses the `input_dimension` and `embedding_dimension` arguments, which can differ based on our use case or experiment. It is followed by a fully connected linear layer which maps the `embedding_dimension` down to an `output_dimension`. In our case this is 1: we have two labels, 0 and 1, so a single output value is enough.

We also add uniform weight initialization, which can often improve and/or speed up the training process, and an optional initialization of the embedding weights from an external source (more on this later).

Finally, the forward pass must be implemented. First, the input is passed through the embedding layer, and the embeddings are averaged over the sequence. Then the result is mapped down using the fully connected layer.

In [None]:
class SentimentAnalysisModel(nn.Module):
    def __init__(
        self,
        input_dimension,
        embedding_dimension,
        output_dimension,
        pretr_embeddings=None):
        super().__init__()

        self._embedding_layer = nn.Embedding(input_dimension, embedding_dimension)    
        self._fully_connected_layer = nn.Linear(embedding_dimension, output_dimension)
        self._init_weights(pretr_embeddings)
    
    # Use uniform initialization of the weights
    def _init_weights(self, pretr_embeddings):
        initrange = 0.5
        
        # We add an option to initialize the embeddings from external source
        if pretr_embeddings is not None:
            self._embedding_layer.weight.data.copy_(pretr_embeddings)
        else:
            self._embedding_layer.weight.data.uniform_(-initrange, initrange)

        self._fully_connected_layer.weight.data.uniform_(-initrange, initrange)
        self._fully_connected_layer.bias.data.zero_()

    def forward(self, input_batch):
        # Process the input batch and return a result that can be processed from the loss function
        # input_batch is of shape - [ batch_size x max_length ]
        
        embeddings = self._embedding_layer.forward(input_batch)
        # embeddings are of shape - [ batch_size x max_length x embedding_dimension ]
        
        embeddings_mean = embeddings.mean(dim=1)
        # embeddings_mean is now - [ batch_size x embedding_dimension ]
        
        result = self._fully_connected_layer.forward(embeddings_mean)
        # finally, our result is - [ batch_size x 1 ]
        
        return result
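
The shape comments in `forward` can be checked on a toy example. The sketch below rebuilds the two layers with small, made-up sizes (far smaller than the lab's vocabulary and embedding dimension) and traces a random batch through them.

```python
import torch
import torch.nn as nn

batch_size, max_length = 4, 7        # toy sizes
vocab_size, embedding_dim = 50, 8    # hypothetical, much smaller than in the lab

embedding = nn.Embedding(vocab_size, embedding_dim)
fully_connected = nn.Linear(embedding_dim, 1)

# Random token indices standing in for a padded batch of reviews
input_batch = torch.randint(0, vocab_size, (batch_size, max_length))

embeddings = embedding(input_batch)        # [batch_size x max_length x embedding_dim]
embeddings_mean = embeddings.mean(dim=1)   # [batch_size x embedding_dim]
logits = fully_connected(embeddings_mean)  # [batch_size x 1]

print(embeddings.shape, embeddings_mean.shape, logits.shape)
print(logits.squeeze(1).shape)             # [batch_size], one logit per review
```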

In [None]:
INPUT_DIMENSION = len(vocab)
EMBEDDING_DIMENSION = 300
OUTPUT_DIMENSION = 1

# We can now initialize our model. Note the .to(device) part. 
# This transfers the model to the previously defined device (could be a GPU) for faster computation.

model = SentimentAnalysisModel(
    INPUT_DIMENSION, 
    EMBEDDING_DIMENSION, 
    OUTPUT_DIMENSION).to(device)

In [None]:
# We can now check what our model looks like
print(model)

After we have the model, we must define a loss function. Here, we will make use of the [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html) defined in PyTorch. This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.

In [None]:
criterion = nn.BCEWithLogitsLoss().to(device)
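
The claim that `BCEWithLogitsLoss` combines a sigmoid with `BCELoss` can be verified on a few hand-picked values. In this sketch the logits and targets are made up; the last variant writes the binary cross-entropy out by hand.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])

# One step: sigmoid and binary cross-entropy combined
combined = nn.BCEWithLogitsLoss()(logits, targets)

# Two steps: explicit sigmoid followed by BCELoss
two_step = nn.BCELoss()(torch.sigmoid(logits), targets)

# By hand: mean of -[y * log(p) + (1 - y) * log(1 - p)]
p = torch.sigmoid(logits)
manual = -(targets * torch.log(p) + (1 - targets) * torch.log(1 - p)).mean()

print(combined.item(), two_step.item(), manual.item())
```

For moderate logits like these, all three agree up to floating-point error. The combined version matters for very large or very small logits, where the explicit sigmoid saturates.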

The next simple example shows how we use our loss function during training. 

Our model outputs a prediction (logit) for each entry. Let's assume we have three elements in our dataset. As our task is binary classification, the model outputs a single value per element, which is then transformed into a probability (a value between 0 and 1) using a sigmoid function. 

On the other side, our targets are usually integers. In the case of binary classification, they can only be 0 or 1. For the purpose of comparison with the probabilities, which are floats, we use float type for the labels too. 

Finally, we pass the model output and the original targets to the loss function, which computes a single number: our loss. The closer the loss is to 0, the better our model's predictions match the true labels. 

During training, we must also call the `.backward()` function of the loss result to back-propagate through the model and compute the gradients, which the optimizer then uses to update our weights. 

_Note: Execute the next cell multiple times to see how the loss changes depending on the different targets and model outputs_

In [None]:
model_output = torch.randn(3)
print(f'Model output: {model_output}')
target = torch.empty(3).random_(2)
print(f'Target: {target}')
test_criterion = nn.BCEWithLogitsLoss()
output = test_criterion.forward(model_output, target)
print(f'Calculated loss: {output}')
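
To see what `.backward()` actually produces, we can ask for gradients with respect to the logits themselves. For the mean binary cross-entropy with logits, the gradient for each element is `(sigmoid(logit) - target) / N`; the sketch below checks this on made-up values.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5], requires_grad=True)
targets = torch.tensor([1.0, 0.0, 1.0])

loss = nn.BCEWithLogitsLoss()(logits, targets)
loss.backward()  # fills logits.grad; nothing is updated yet

# Analytical gradient of the mean loss with respect to each logit
expected = (torch.sigmoid(logits.detach()) - targets) / len(targets)

print(logits.grad)
print(torch.allclose(logits.grad, expected))
```

Note that `.backward()` only computes gradients. Changing the values is the job of the optimizer, which is introduced next.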

Finally, a `torch.optim` optimizer must be defined. It is used to traverse the parameter space and find the optimal weights for the model.

Initialize the [`Adam`](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam) optimizer. Stick to the default arguments. Later, you can experiment with different learning rates that can change the training process.

In [None]:
optimizer = torch.optim.Adam(model.parameters())

Before we move on to the training process, we must prepare a function that calculates the accuracy of the model. Our model outputs raw predictions, and we also have ground truth values. Both come in the form of vectors, although the predictions are float numbers while the true values are usually integers.

Implement a function which takes these two vectors and calculates a value between 0 and 1 which corresponds to the accuracy. Since we are working with batches, we have vectors of such values: for example, if we have predicted [0, 1, 0, 1] and the ground truth is [1, 1, 1, 1], then the function should output 0.5. Keep in mind that the `calculate_accuracy` function accepts and works with tensors and not regular lists.

In [None]:
def calculate_accuracy(predictions, ground_truth):
    # Returns accuracy per batch
    
    # transform the predictions into probabilities
    predictions = torch.sigmoid(predictions)
    
    # round the predictions into integers
    rounded_predictions = torch.round(predictions)
    
    # compare which of the predictions are equal to their corresponding true value
    correct = (rounded_predictions == ground_truth).float()
    
    # take the average for all elements in one batch
    accuracy = correct.sum() / len(correct)
    return accuracy
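
The example from the text can be reproduced directly. The sketch below repeats the function in compact form so that it runs on its own, and uses made-up logits whose rounded probabilities are [0, 1, 0, 1].

```python
import torch

def calculate_accuracy(predictions, ground_truth):
    # Same steps as in the lab: sigmoid, round, compare, average
    rounded = torch.round(torch.sigmoid(predictions))
    correct = (rounded == ground_truth).float()
    return correct.sum() / len(correct)

# Hypothetical logits: negative values round to 0, positive values to 1
logits = torch.tensor([-2.0, 3.0, -1.0, 0.5])
ground_truth = torch.tensor([1.0, 1.0, 1.0, 1.0])

accuracy = calculate_accuracy(logits, ground_truth)
print(accuracy.item())
```

Two of the four predictions match the ground truth, so the result is 0.5, as stated above.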

## Part III: Training

We can now proceed to the training of our defined model. 

First you will see the `zero_optimizer_gradients` function. PyTorch accumulates gradients across calls to `.backward()`, so we reset them before each backpropagation to avoid using gradients from previous steps.

In [None]:
def zero_optimizer_gradients(optimizer):
    optimizer.zero_grad()
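
The accumulation behaviour is easy to observe on a single tensor. This sketch uses a plain `SGD` optimizer and a toy loss (both are assumptions for illustration, the lab itself uses Adam) and calls `.backward()` with and without resetting the gradients.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# d/dw sum(w^2) = 2w
(w ** 2).sum().backward()
first = w.grad.clone()

# A second backward pass without resetting: the gradients are added up
(w ** 2).sum().backward()
accumulated = w.grad.clone()

# After zero_grad() we get a fresh gradient again
optimizer.zero_grad()
(w ** 2).sum().backward()
fresh = w.grad.clone()

print(first, accumulated, fresh)
```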

We now proceed to the training process. We define one _epoch_ to be the period in which we iterate over **all** elements (or batches) in our dataset. During training, we usually iterate over many epochs until we are comfortable with our results. Starting with the smallest unit, we define a function which works at batch level. The most important steps during one such pass are:
* Perform a forward pass through the model
* Perform a forward pass through the loss function using the predicted labels from the model
* Calculate the accuracy by comparing the predictions and the ground truth

This function can be used both during training (when we update the parameters of the network) and during evaluation (when we only want to predict labels). To distinguish the two modes, we use the `eval_mode` argument. When we are training (i.e. `eval_mode == False`), we must also backpropagate from the loss function and perform a step in the parameter space using the optimizer.

In [None]:
def perform_batch_iteration(
    batch,
    model,
    criterion,
    optimizer,
    eval_mode):

    if not eval_mode:
        zero_optimizer_gradients(optimizer)
    
    text, label = batch
    
    predictions = model.forward(text).squeeze(1)
    loss = criterion.forward(predictions, label)
    accuracy = calculate_accuracy(predictions, label)

    if not eval_mode:
        loss.backward()
        optimizer.step()

    return loss.item(), accuracy.item()
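
In evaluation mode the function above skips `loss.backward()`, but PyTorch still records the operations needed for backpropagation. A common addition, not used in this lab, is to wrap evaluation in `torch.no_grad()`. The sketch below shows the effect with a small stand-in model.

```python
import torch
import torch.nn as nn

model = nn.Linear(3, 1)   # stand-in for the lab's model
batch = torch.randn(4, 3)

model.eval()
with torch.no_grad():     # no computation graph is recorded inside this block
    eval_predictions = model(batch).squeeze(1)

model.train()
train_predictions = model(batch).squeeze(1)

print(eval_predictions.requires_grad, train_predictions.requires_grad)
```

Skipping the graph saves memory and time during validation and testing, without changing the predicted values.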

Having processed one batch, we then define a function which takes care of a whole epoch iteration. In one epoch, we must process _all_ batches of our data and save the loss and accuracy values that we compute for each batch.

In [None]:
def perform_epoch_iteration(
    model,
    dataloader,
    criterion,
    optimizer,
    eval_mode):
    
    epoch_losses = []
    epoch_accuracies = []
    
    if not eval_mode:
        model.train()
    else:
        model.eval()

    for batch in dataloader:
        loss, accuracy = perform_batch_iteration(batch, model, criterion, optimizer, eval_mode)
        epoch_losses.append(loss)
        epoch_accuracies.append(accuracy)

    return np.mean(epoch_losses), np.mean(epoch_accuracies)

Finally, for the full training process, we perform multiple epoch iterations over both the training and the validation datasets and record the loss and accuracy of each. In practice, one would also keep a reference to the best validation result and the model state at that time, but for simplicity we skip this.

_Note: As an optional exercise, you can try to define your own `train_model_v2` function where you do this._

In [None]:
def train_model(model, train_dataloader, valid_dataloader, criterion, optimizer, epochs):
    print('Starting training...')
    
    train_losses, train_accuracies = [], []
    valid_losses, valid_accuracies = [], []
    for epoch in range(epochs):
        
        # iterate over the train data
        loss, accuracy = perform_epoch_iteration(
            model,
            train_dataloader,
            criterion,
            optimizer,
            eval_mode=False)

        # store the train loss and accuracy for later
        train_losses.append(loss)
        train_accuracies.append(accuracy)
        
        print(f'Epoch: {epoch:02}')
        print(f'\tTrain Loss: {loss:.3f} | Train Acc: {accuracy*100:.2f}%', end='')
        
        # iterate over the validation data
        valid_loss, valid_accuracy = perform_epoch_iteration(
            model,
            valid_dataloader,
            criterion,
            None,
            eval_mode=True)

        # store the validation loss and accuracy for later
        valid_losses.append(valid_loss)
        valid_accuracies.append(valid_accuracy)
        
        print(f'| Valid Loss: {valid_loss:.3f} | Valid Acc: {valid_accuracy*100:.2f}%')
            
    # finally, return the stored lists
    return train_losses, train_accuracies, valid_losses, valid_accuracies

In [None]:
# train the model

N_EPOCHS = 10

train_losses, train_accuracies, valid_losses, valid_accuracies = train_model(
    model,
    train_dataloader,
    valid_dataloader,
    criterion,
    optimizer,
    N_EPOCHS)

In [None]:
# Let's compare the train and validation losses and how they changed during the different epochs

plt.plot(train_losses, label='Train')
plt.plot(valid_losses, label='Validation')
plt.title('Loss during training')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
# We can do the same thing with the accuracy

plt.plot(train_accuracies, label='Train')
plt.plot(valid_accuracies, label='Validation')
plt.title('Accuracy during training')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

You can see that as training progresses, performance on the validation dataset stops improving as much. If we continue to train, we risk so-called overfitting. This happens when our model fits the training data "too well" and loses the general features that are required when dealing with unseen data (such as the validation data). We must be careful about this in practice, as it can prevent us from applying our model to real-world problems.
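
One common way to guard against overfitting is early stopping: training ends once the validation loss has stopped improving. This lab does not implement it, so the following is only a sketch in plain Python; the function name, the `patience` parameter and the loss values are all made up for illustration.

```python
def early_stopping_epoch(valid_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has not improved for
    `patience` consecutive epochs; otherwise the last epoch is returned.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(valid_losses) - 1

# Made-up validation losses that first improve and then get worse
example_losses = [0.69, 0.55, 0.48, 0.47, 0.49, 0.52, 0.58]
print(early_stopping_epoch(example_losses, patience=2))
```

You could apply the same logic to the `valid_losses` list returned by `train_model` to see at which epoch training could have stopped.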

In [None]:
# After we have fully trained our model,
# we can let it run over the test data and check the results that we receive

test_loss, test_acc = perform_epoch_iteration(model, test_dataloader, criterion, None, True)
print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

## Part IV: Pre-trained embeddings

In recent years, pre-trained embeddings have emerged as a very powerful tool for speeding up the training process. When you define an embedding layer as is, its weights are initialized _randomly_. This means that in order to learn which words are contextually close to one another, we must train the model from scratch, often for long periods of time. Pre-trained embeddings overcome this limitation. If we replace the randomly initialized embeddings with ones that have already been trained in another model, then we can expect some benefits, at least time-wise. 

There are many pre-trained vectors and embeddings out there. As a starting point, you are advised to use GloVe vectors, which are available through the TorchText library. They can be accessed using the `vectors` argument when building the vocabulary. The most commonly used GloVe vectors are `glove.6B.300d`, meaning that they have been trained on 6 billion tokens and have a dimensionality of 300.

In [None]:
# Let's build a new vocabulary, this time using pre-trained vectors
pretr_vocab = build_vocabulary(data_counter, vectors='glove.6B.300d')

In [None]:
# Build dataloaders, this time using the new vocabulary
pretr_train_dataloader, pretr_valid_dataloader, pretr_test_dataloader = \
    get_dataloaders(BATCH_SIZE, pretr_vocab, tokenizer)

In [None]:
# We define a new SentimentAnalysisModel, this time using pretrained embeddings
pretr_model = SentimentAnalysisModel(
    INPUT_DIMENSION,
    EMBEDDING_DIMENSION, 
    OUTPUT_DIMENSION,
    pretr_embeddings=pretr_vocab.vectors).to(device)

pretr_optimizer = torch.optim.Adam(pretr_model.parameters())
pretr_criterion = nn.BCEWithLogitsLoss().to(device)
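
Our model copies the pre-trained vectors into a regular embedding layer, so they keep being updated during training. PyTorch also provides `nn.Embedding.from_pretrained`, which lets you choose whether the vectors stay frozen. The sketch below uses a small random tensor as a stand-in for the GloVe vectors.

```python
import torch
import torch.nn as nn

# Random stand-in for pre-trained vectors: 10 tokens, 4 dimensions
pretrained_vectors = torch.randn(10, 4)

# freeze=True keeps the vectors fixed; freeze=False lets training fine-tune them
frozen = nn.Embedding.from_pretrained(pretrained_vectors, freeze=True)
trainable = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

print(frozen.weight.requires_grad, trainable.weight.requires_grad)
print(torch.equal(frozen.weight, pretrained_vectors))
```

Whether freezing helps depends on the task and the amount of training data, so it is worth trying both variants.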

In [None]:
# Train the new model with pre-trained embeddings
(pretr_train_losses, pretr_train_accuracies, 
 pretr_valid_losses, pretr_valid_accuracies) = train_model(
    pretr_model,
    pretr_train_dataloader,
    pretr_valid_dataloader,
    pretr_criterion,
    pretr_optimizer,
    N_EPOCHS)

In [None]:
# Same as before, we perform an epoch iteration over the test dataset 
pretr_test_loss, pretr_test_acc = perform_epoch_iteration(
    pretr_model,
    pretr_test_dataloader,
    pretr_criterion,
    None,
    True)

# Print the results. Are those better than the non-pretrained ones?
print(f'Test Loss: {pretr_test_loss:.3f} | Test Acc: {pretr_test_acc*100:.2f}%')

In [None]:
# Let's compare the loss and accuracy values of the two models

plt.plot(pretr_valid_losses, label='GloVe pre-trained')
plt.plot(valid_losses, label='Randomly initialized')
plt.legend()
plt.title('Comparison of validation loss during training')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.show()

plt.plot(pretr_valid_accuracies, label='GloVe pre-trained')
plt.plot(valid_accuracies, label='Randomly initialized')
plt.legend()
plt.title('Comparison of validation accuracy during training')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.show()

As you can see, GloVe embeddings significantly outperform the randomly initialized ones. Can you guess why that is?

## Part V: Exercises

### Play with the training parameters
Pre-trained weights are not the only thing that can change the outcome of training. Try changing some of the following parameters and report how this affects your results.
* batch size of the dataloaders
* learning rate of the optimizer
* number of epochs
 
 _Note: do not forget to re-initialize your variables (most importantly dataloader, model, optimizer). To be on the safe-side, you can use different namings for these variables in every experiment_

### Pre-trained embeddings

Try to use different pre-trained embeddings than the ones shown in this lab. You can use different versions of GloVe or entirely new pre-trained vectors. You can check all that are currently available [here](https://pytorch.org/text/stable/_modules/torchtext/vocab.html#Vocab.load_vectors)

### Improving the model (simple)

Let's try to improve our `SentimentAnalysisModel`.

In this exercise we took the mean of our embeddings after making a forward pass through the embedding layer. In practice, there is a more efficient way: using a [`torch.nn.EmbeddingBag`](https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html#torch.nn.EmbeddingBag). This will make training faster and will also remove the need for taking the mean of the embeddings. Create a new class that contains the same layers as `SentimentAnalysisModel` but replaces the embedding layer with an embedding bag. Report if this changes the results in any way.

Another simple exercise you can perform is adding another linear layer after the embeddings and before the final fully connected one. Try this with different dimensionalities and report the differences.
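
As a starting point for the embedding bag exercise, the sketch below checks on toy sizes that an `nn.EmbeddingBag` in `mean` mode produces the same result as an `nn.Embedding` followed by a mean over the sequence. It only compares the two layers and is not a full model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_dim = 20, 6   # toy sizes

embedding = nn.Embedding(vocab_size, embedding_dim)
embedding_bag = nn.EmbeddingBag(vocab_size, embedding_dim, mode="mean")

# Give both layers identical weights so that their outputs are comparable
with torch.no_grad():
    embedding_bag.weight.copy_(embedding.weight)

input_batch = torch.randint(0, vocab_size, (3, 5))  # [batch_size x max_length]

mean_of_embeddings = embedding(input_batch).mean(dim=1)
bag_output = embedding_bag(input_batch)  # 2D input: each row is treated as one bag

print(bag_output.shape)
print(torch.allclose(mean_of_embeddings, bag_output, atol=1e-6))
```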

### Improving the model (advanced)

These days, researchers use much more powerful components than just embedding and linear layers. Recurrent neural networks (RNNs) are among the most popular ones. You can read about the PyTorch implementation [here](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html). 

Try to improve your model by including a `torch.nn.RNN` module. Once this is working, you can also try a more sophisticated implementation, such as `torch.nn.LSTM` or `torch.nn.GRU`. You can experiment with the number of layers and with bi-directionality. Practice shows that bi-directional RNNs usually perform better than uni-directional ones. Is this also valid for this model?
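
When working with recurrent layers, most of the difficulty lies in keeping track of tensor shapes. The sketch below runs a random batch of toy "embeddings" through a bi-directional `nn.LSTM` and shows one common way (an assumption, not the only option) to turn its final states into a single vector per review.

```python
import torch
import torch.nn as nn

batch_size, max_length, embedding_dim, hidden_dim = 4, 7, 8, 16  # toy sizes

lstm = nn.LSTM(
    input_size=embedding_dim,
    hidden_size=hidden_dim,
    num_layers=1,
    batch_first=True,      # inputs are [batch_size x max_length x embedding_dim]
    bidirectional=True)

embeddings = torch.randn(batch_size, max_length, embedding_dim)
output, (hidden, cell) = lstm(embeddings)

print(output.shape)  # [4, 7, 32]: forward and backward states concatenated
print(hidden.shape)  # [2, 4, 16]: one final state per direction

# Concatenate the final forward and backward hidden states
sentence_vector = torch.cat([hidden[-2], hidden[-1]], dim=1)
print(sentence_vector.shape)  # [4, 32]
```

The resulting `sentence_vector` could take the place of the averaged embeddings, in which case the fully connected layer would need `2 * hidden_dim` input features.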