<a href="https://colab.research.google.com/github/biggojo/Assignments-DL/blob/main/Exercise2a.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### This submission is for:
- Abraham Gassama (2285843)
- Zhuo Le Lee (2317240)

# Exercise 2A - Transformers

In this exercise, you'll implement a basic encoder-only Transformer architecture with PyTorch. We will start with building the basic building blocks and then integrate them into a fully-fleged Transformer model. We train the model to solve a POS-Tagging problem (more on that later). In the previous exercise, you implemented your work in numpy. Now, we will switch to PyTorch, which will track the gradients for us and allows us to focus more on the network itself.

You can receive up to three points for your implementation of Exercise 2A. Together with Exercise 1, you can get up to six bonus points for the exam.

**Important Notice**: Throughout the notebook, basic structures are provided such as functions and classes without bodies or partial bodies, and variables that you need to assign to. **Don't change the names of functions, variables, and classes - and make sure that you are using them!** You're allowed to introduce helper variables and functions. Occasionally, we use **type annotations** that you should follow. They are not enforced by Python. Whenenver you see an ellipsis `...` you're supposed to insert code.

In [2]:
!pip install torchtext torchdata torchmetrics

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import torch
from torch import nn, Tensor
"""
Shifted this definition to the beginning, since some of the implemented functions at the beginning
require this variable.
"""
DEVICE = 'cuda' # later replace with 'cuda' for GPU

Let's actually start with a few basic functions that we will need throughout the exercise, namely **Softmax** and **ReLu**.

$\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$

$\text{ReLU}(x) = \max(0, x)$

In [4]:
def softmax(input: Tensor) -> Tensor:
      
    """
    Note 1: This implementation of softmax does the softmax operation only on the last dimension
    of the given input. To perform the operation on other axis/dimensions of the input, 
    use nn.Softmax

    Note 2: This implementation assumes that the input tensor has a dimension of 2. This means that 
    the input has to be reshaped first, if its dimension does not correspond to 2.
    """
    # Clone the input tensor and save as output tensor to be overwritten 
    output_tensor = torch.clone(input)

    # Perform the exponent operation over all elements in the input tensor
    exp_input = torch.exp(input)
    
    # Sum over the columns 
    sum_exp = torch.sum(exp_input, dim=1)

    # Overwrite the output tensor 
    for i in range(input.size()[0]):
      output_tensor[i, :] = exp_input[i, :] / sum_exp[i]
    
    return output_tensor

class relu(nn.Module):
  def __init__(self):
    super(relu, self).__init__()

  def forward(self, input: Tensor) -> Tensor:
    return torch.maximum(input.to(DEVICE), torch.zeros(input.size()).to(DEVICE))


In [5]:
# Sanity checks for the ReLU and softmax function

# Create a tensor of random values with 15 columns and 1 row to match the assumption made
sample_input = torch.rand(4, 10)

# Note: The output tensors are rounded to 4 decimals places to allow for easier comparison
# Output from our implementation
SM_out1 = torch.round(softmax(sample_input), decimals=4).to(DEVICE)
ReLU_ori = relu()
ReLU_out1 = torch.round(ReLU_ori(sample_input), decimals=4).to(DEVICE)

# Output from PyTorch Implementation
SM_pytorch = nn.Softmax(dim=-1)
SM_out2 = torch.round(SM_pytorch(sample_input), decimals=4).to(DEVICE)
ReLU_pytorch = nn.ReLU()
ReLU_out2 = torch.round(ReLU_pytorch(sample_input), decimals=4).to(DEVICE)


softmax_compare = torch.eq(SM_out1, SM_out2)
relu_compare = torch.eq(ReLU_out1, ReLU_out2)


print(softmax_compare)
print(relu_compare)

# Both functions passed the sanity checks 

tensor([[True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True]],
       device='cuda:0')
tensor([[True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True],
        [True, True, True, True, True, True, True, True, True, True]],
       device='cuda:0')


## Transformer Block

A typical transformer block consists of the following 
- Multi-Head Attention
- Layer Normalization
- Linear Layer
- Residual Connections

<center><img src="https://i.imgur.com/ZKgcoe4.png" alt="transformer block visualization" width="200">

In the next few subsections, we will build these basic building blocks.

### Multi-Head Attention

Multi-Head Attention concatenates the outputs of several so called **attention heads**.

$\textrm{MHA}(Q,K,V) = \textrm{Concat}(H_1,...,H_h)$

<center><img src="https://www.tensorflow.org/images/tutorials/transformer/multi_head_attention.png" width=300>

One attention head consists of linear projections for each of $Q, K$ and $V$ and an attention mechanism called **Scaled Dot-Product Attention**. The attention mechanism scales down the dot products by $\sqrt{d_k}$.

$\textrm{Attention}(Q,K,V)=\textrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V$



If we assume that $q$ and $v$ are $d_k$-dimensional vectors and its components are independent random variables with mean $0$ and a variance of $d_k$, then their dot product has a mean of $0$ and variance of $d_k$. It is preferred to have a variance of $1$ and that's why we scale them down by $\sqrt{d_k}$.

The dot product $q \cdot v$ resembles a measure of similarity.


<center><img src="https://www.tensorflow.org/images/tutorials/transformer/scaled_attention.png" width="350">

Let's start implementing these components. Note that our classes inherit from PyTorch's `nn.Module`. These modules allow us to hold our parameters and easily move them to the GPU (with `.to(...)`). It also let's us define the computation that is performed at every call, in the `forward()` method. For example, when we have an `Attention` module, initialize it like `attention = Attention(...)`, we are able to call it with `attention(Q, K, V)` (it'll execute the `forward` function in an optimized way).

### Assumptions: 


1.   Q, K and V are **batches** of matrices, each with shape *(batch_size, seq_length, num_features)*. 
2.   The attention here is self-attention, this means that the sequence length of Q and K are the same. 
3.   The number of hidden dimensions for all Q, K and V are the same.
4.   For added simplicity, we first assume that the batch size is one. We then introduce batched attention matrix after the simple attention has passed all of the sanity checks

### Operations in Attention and their outputs: 



1.   Multiplying the query (Q) and key (K) arrays in a linear matrix multiplication. This is the attention of this layer and determines how important each element in the key sequences is with regard to the query sequence. 

> Output 1: *(batch_size, seq_length, seq_length)*

2.   Scaling the attention to have variance of 1, regardless of the size of seq_length

> Output 2: *(batch_size, seq_length, seq_length)*

3.   Normalizing the attention array across the keys dimension with softmax, so that all the attention weights sum to one. (Because we can't pay more than 100% attention!)

> Output 3: *(batch_size, seq_length, seq_length)*

4.   Multiplying the attention matrix with the value (V) array using matrix multiplication. 

> Output 4: *(batch_size, seq_length, num_features)*









In [6]:
# Implementation of a single attention head

class Attention(nn.Module):
    def __init__(self, hidden_n: int):
        """
        hidden_n : Number of hidden dimensions
        It is here assumed that the number of hidden dimensions corresponds
        to the number of features in each input timestep.
        """
        super().__init__()
        self.hidden_n = hidden_n
        
        """
        nn.Linear intializes and defines a transformation matrix, where all but 
        the last dimension of the output is different than the input 
        """
        self.Q_linear = nn.Linear(hidden_n, hidden_n)
        self.V_linear = nn.Linear(hidden_n, hidden_n)
        self.K_linear = nn.Linear(hidden_n, hidden_n)


    def forward(self, Q: Tensor, K: Tensor, V: Tensor, mask: Tensor = None) -> Tensor:
        """
        Note 1: If it was a batched attention, then all of the following matmul 
        has to be changed to bmm (batched matrix multiplication, supported by
        pytorch)

        Note 2: The above softmax function would not work here, if the number of 
        time steps (i.e. sequence length) and the number of features are both
        more than one. The above softmax function would have to be revised to 
        include the number of dimensions.
        
        **Correction**: The above softmax function would work, since we are applying
        softmax only to the hidden dimension, i.e. the last dimension of the given
        embedding tensor as input. 

        Note 3: The dimensions of the transpose has to be double-checked against
        the size of the inputs to the attention layer later, when the inputs
        are given.

        Note 4: A naive implementation of the masking operation is given here, 
        should the a mask be defined as an input. This might have to be revised
        in later steps.`

        Args:
            Q (:class:`torch.Tensor` [batch_size, output_length, hidden_n]): 
            Sequence of queries to query the context.
            K (:class:`torch.Tensor` [batch_size, query_length, hidden_n]): 
            Data overwhich to apply the attention mechanism.
            V (:class:`torch.Tensor` [batch_size, output_length, hidden_n]): 
            Sequence of values to be scaled by the attended features

            here: output_length == query_length == seq_length 

        Returns:
            :class:`torch.Tensor`:

            output (:class:`torch.Tensor` [batch size, output_length, dimensions]):
            Tensor containing the attended features.
        """

        # Sanity Checks
        assert Q.size()[-1] == K.size()[-1] == V.size()[-1], "The hidden dimensions of Q, K, and V must match up!"
        assert Q.size()[-1] == self.hidden_n, "The hidden dimension of the given inputs must match that defined during the init!"

        batch_size, seq_length, hidden_n = Q.size() 
        
       
        """
        Step 1: Transforming Q, K, and V via linear transformation (with bias). This improves the versatility of the 
        network in adapting the given input arrays to its specific function. 

        (batch_size * seq_length, hidden_n) * (hidden_n * hidden_n) ->
        (batch_size * seq_length, hidden_n)
        """

        Q = Q.reshape(batch_size * seq_length, hidden_n)
        K = K.reshape(batch_size * seq_length, hidden_n)
        V = V.reshape(batch_size * seq_length, hidden_n)

        Q_arr = self.Q_linear(Q).reshape(batch_size, seq_length, hidden_n)
        K_arr = self.K_linear(K).reshape(batch_size, seq_length, hidden_n)
        V_arr = self.V_linear(V).reshape(batch_size, seq_length, hidden_n)


        """
        weights (:class:`torch.FloatTensor` [batch size, seq_length, seq_length]):
        Tensor containing attention weights.
        """
        scores = torch.bmm(Q_arr, K_arr.transpose(-1,1).contiguous()) / (self.hidden_n ** 0.5)
        
        """
        For the case of masked attention, a masking operation has to be implemented
        """
        if mask is not None:
          mask = mask.unsqueeze(-1)
          scores = scores.masked_fill(mask == 0, float("-1e20"))
          scores = softmax(scores)


        scores = scores.view(batch_size * seq_length, seq_length)
        weights = softmax(scores)
        weights = weights.view(batch_size, seq_length, seq_length)

        """
        (batch_size, seq_length, seq_length) * (batch_size, seq_length, hidden_n) ->
        (batch_size, seq_length, hidden_n)
        """

        output = torch.bmm(weights, V_arr)
        return output 


In [7]:
class MultiHeadAttention(nn.Module):
    def __init__(self, hidden_n: int, h: int = 2):
        """
        hidden_n: hidden dimension
        h: number of heads
        """
        super().__init__()
        self.hidden_n = hidden_n
        self.h = h
        # self.d_k = hidden_n // h

        """
        Intialize an attention head for each element in h
        """
        self.heads = nn.ModuleList(
            [Attention(hidden_n) for x in range(h)]
        )
        
        """
        After concatenating, the output size would be (batch_size, seq_length, num_features * h)
        This can be reverted back to the original output size via a linear layer

        Note 1: There are other implementation variations regarding this step.
        E.g. we could split the input into n heads, so that each will have the dimensions
        (batch_size, seq_length, num_features, num_features / n), with the last dimension denoting
        which head each input is subjected to. Each input is then passed through the according
        attention head and then concatenated. 

        This variation was not chosen for simplicity, e.g. self.d_k has to be a factor of 
        self.hidden_n
        """
        self.out = nn.Linear(h * hidden_n, hidden_n)

   
    def forward(self, Q: Tensor, K: Tensor, V: Tensor, mask=None):
        return self.out(
            torch.cat([h(Q,K,V) for h in self.heads], dim=-1)
        )

### Layer Normalization

From the lecture, remember layer normalization where the values are normalized across the feature dimension, independently for each sample in the batch. For that, first calculate mean and standard-deviation across the feature dimension and then scale them appropriately such that the mean is 0 and the standard deviation is 1. Introduce **two sets of learnable parameters**, one for shifting the mean (addition) and one for scaling the variance (multiplication) the normalized features (i.e., two parameters for each feature). Tip: Use `nn.Parameter` for that.

$y_{\textrm{norm}}=\frac{x-\mu}{\sqrt{\sigma+\epsilon}}$

$y=y_{\textrm{norm}}\cdot\beta+\alpha$

<center>
<img src="https://i.stack.imgur.com/E3104.png" alt="visualization of layer norm vs. batch norm" width="420">

In [8]:
class LayerNorm(nn.Module):
    def __init__(self, norm_shape):
        """
        Applies Layer Normalization over a mini-batch of inputs as described in 
        [Layer Normalization](https://arxiv.org/pdf/1607.06450.pdf)
        
        DISCLAIMER: This layer normalization currently works only when the dimension of 
        the input to be normalized is an integer. 

        The mean and standard-deviation are calculated over the last 'D' dimensions, 
        where 'D' is the dimension of :attr:'norm_shape'. If a single integer is used, 
        it is treated as a singleton list, and this module will normalize over the 
        last dimension which is expected to be of that specific size. 

        :param norm_shape: The dimension of the layer to be normalized, i.e. the 
        shape of the input tensor or the last dimension of the input tensor.

        """

        super().__init__()

        # If the given dimension is an integral, make it into a tuple 
        if isinstance(norm_shape, int):
          norm_shape = (norm_shape, )
        
        self.norm_shape = tuple(norm_shape)


        """
        :param alpha: Scale parameter (Initialized to zeros)
        :param beta: Offset parameter (Initialized to ones)
        :param epsilon: A value added to the demoniator for numerical stability

        The initialized value of alpha and beta should be the same as the dimension
        of the layer to be normalized, assuming that the multiplication between
        beta and y_norm is element-wise.

        Note 1: It is often said that the parameters initialized with nn.Parameter()
        often have dimunitive values like 1.4013e-45. Other alternatives such as
        nn.Linear() might be worth considering, since it does initial processing to the 
        input tensor such as uniformization. 
     
        Note 3: The paramters can also be intialized with torch.Tensor(*norm_shape)
        or torch.empty(norm_shape)
        """
        self.alpha = torch.nn.Parameter(torch.empty(self.norm_shape))
        self.beta = torch.nn.Parameter(torch.rand(self.norm_shape))
        self.epsilon = 1e-10

        # Initializing the weights with ones and biases with zeros
        self.alpha.data.zero_()
        self.beta.data.fill_(1)

    def forward(self, x):
        """
        Note 1: It is assumed that the feature dimension is the last dimension, which 
        matches our assumption from above.
        """
        mean = x.mean(dim=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(dim=-1, keepdim=True)
        std = (var + self.epsilon).sqrt()
        y_norm = (x - mean) / std

        y = (y_norm * self.beta) + self.alpha
        return y

In [9]:
# LayerNorm Sanity Check 

# NLP Example
batch, sentence_length, embedding_dim = 20,5,10
embedding = torch.randn(batch, sentence_length, embedding_dim)

# Layer normalization as implemented by the module nn
layer_norm_nn = nn.LayerNorm(embedding_dim)
out_nn = layer_norm_nn(embedding)

# Our version of layer normalization
layer_norm_ori = LayerNorm(embedding_dim)
out_ori = layer_norm_ori(embedding)

# Check 1: Compare both tensors to see if they match
layernorm_compare = torch.eq(torch.round(out_nn, decimals=3), torch.round(out_ori, decimals=3))
print(layernorm_compare)
# Results: They match, if the values of rounded up to three decimals, which is acceptable

# Check 2: See if the output dimension matches the input dimension
assert embedding.size() == out_ori.size(), "Output size does not match with input size"
# Results: Passed.

"""
DISCLAIMER: The sanity checks did not check if the learnable affine transform paramters
are intialized correctly and can be learned stimultaneously with the other weights of
the architecture. 
"""

tensor([[[ True,  True,  True,  True,  True,  True,  True, False,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True]],

        [[ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True]],

        [[ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True,  True,  True,  True,  True,  True,  True,  True,  True],
         [ True,  True, 

'\nDISCLAIMER: The sanity checks did not check if the learnable affine transform paramters\nare intialized correctly and can be learned stimultaneously with the other weights of\nthe architecture. \n'

### Transformer Block

Here, we bring all ingredients together into a single module. Don't forget to add the residual connections.

In [10]:
class TransformerBlock(nn.Module):
    def __init__(self, hidden_n: int, h: int = 2):
        """
        hidden_n: hidden dimension
        h: number of heads
        """
        super().__init__()
        self.hidden_n = hidden_n
        self.h = h
        self.attention = MultiHeadAttention(hidden_n, h)
        self.norm1 = LayerNorm(hidden_n)
        self.norm2 = LayerNorm(hidden_n)

        # This following variable can be eventually defined via input 
        ff_dim = 2048
        self.feed_forward = nn.Sequential(
            nn.Linear(hidden_n, ff_dim*hidden_n),
            # Here we are using our implemented version of ReLU, which might not be
            # compatible in this specific context
            relu(),
            nn.Linear(ff_dim*hidden_n, hidden_n)
        )

    # Assuming that the query, key and value arrays are the same and given as input
    def forward(self, x: Tensor):
        """
        Alternatively, the inputs could be normalized first before using them as
        input for the attention layer, while the residual connections are not normalised

        x_norm1 = self.norm1(x)
        x_out1 = x + self.attention(x_norm1, x_norm1, x_norm1)
        x_norm2 = self.norm_2(x_out1)
        x_out2 = x_out1 + self.feed_forward(x_norm2)

        return x_out2

        """
        
        x_out1 = x + self.attention(x, x, x)
        x_attn = self.norm1(x)

        x_out2 = x_attn + self.feed_forward(x_attn)
        x_ff = self.norm2(x_out2)
        return x_ff


## A Simple Transformer Architecture

Let's stack our transformer blocks and add an embedding layer for a simple transformer architecture. You are allowed to use `nn.Embedding` here.

In [11]:
class Transformer(nn.Module):
    def __init__(self, emb_n: int, hidden_n: int, n:int =3, h:int =2):
        """
        emb_n: number of token embeddings
        hidden_n: hidden dimension
        n: number of layers
        h: number of heads per layer
        """
        super().__init__()
        self.emb_n = emb_n
        self.hidden_n = hidden_n
        self.n = n
        self.h = h

        self.embed = nn.Embedding(emb_n, hidden_n, padding_idx=0)
        self.layers = nn.ModuleList(
            [
                TransformerBlock(hidden_n, h)
                for i in range(n)
            ]
        )

    def forward(self, src: Tensor) -> Tensor:
      src = self.embed(src)
      for layer in self.layers:
        src = layer(src)

      return src

## POS-Tagging

Part-Of-Speech-Tagging (**POS-Tagging**) is a **sequence labeling problem** where we categorize words in a text in correspondence with a particular part of speech (e.g., "noun" or "adjective"). A few examples and classes are shown in the following table:

|  POS Tag  |  Description  |  Examples  |
|-----------|------------|------------|
|  NN | Noun (singular, common) | mass, wind, ...  |
|  NNP | Noun (singular, proper) | Obama, Liverpool, ...  |
| CD  | Numeral (cardinal)  | 1890, 0.5, ...  |
|  DT | Determiner  | all, any, ... |
| JJ | Adjective (ordinal) | oiled, third, ... |
... many more

### CoNLL2000 Dataset

Let's load our dataset which is the **CoNLL2000 dataset** and look at an example.

In [12]:
from torch.utils.data import Dataset, DataLoader
from torchtext.datasets import CoNLL2000Chunking
import pandas as pd

train_df = pd.DataFrame(CoNLL2000Chunking()[0], columns=['words', 'pos_tags', 'chunk'])
test_df = pd.DataFrame(CoNLL2000Chunking()[1], columns=['words', 'pos_tags', 'chunk'])

train_src, train_tgt = train_df['words'].tolist(), train_df['pos_tags'].tolist()
test_src, test_tgt = test_df['words'].tolist(), test_df['pos_tags'].tolist()

print(train_src[0])
print(train_tgt[0])

['Confidence', 'in', 'the', 'pound', 'is', 'widely', 'expected', 'to', 'take', 'another', 'sharp', 'dive', 'if', 'trade', 'figures', 'for', 'September', ',', 'due', 'for', 'release', 'tomorrow', ',', 'fail', 'to', 'show', 'a', 'substantial', 'improvement', 'from', 'July', 'and', 'August', "'s", 'near-record', 'deficits', '.']
['NN', 'IN', 'DT', 'NN', 'VBZ', 'RB', 'VBN', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NN', 'NNS', 'IN', 'NNP', ',', 'JJ', 'IN', 'NN', 'NN', ',', 'VB', 'TO', 'VB', 'DT', 'JJ', 'NN', 'IN', 'NNP', 'CC', 'NNP', 'POS', 'JJ', 'NNS', '.']


First, we need to create a vocabulary. Our dataset is already tokenized. However, we need to assign ids to them in order to input them to the embedding layer. We also need the number of embeddings (`num_embeddings`) for the size of our lookup table of `nn.Embedding`.

Thus, we will iterate over all sentences replace them with ids and the mapping to our vocabulary. It'll be handy to have two different mappings, from id to token, as well as, from token to id. Note that we will add a special token `<unk>` with id `0` for words that are unknown (that are not in the training dataset but could possibly be in the test dataset).

In [13]:
# Shift the first entry to the key of one, so that key of zero is saved for the padding token
vocabulary_id2token : dict = {1: '<unk>'}
vocabulary_token2id : dict = {'<unk>': 1}

# Combining all the sentences in the dataset into one big list
sentence_list = []
for sentence in (train_src + test_src): sentence_list += sentence

# Removing duplicates
sentence_list = list(set(sentence_list))

# Create two enumerate objects, one for each list
enum1 = enumerate(sentence_list)
enum2 = enumerate(sentence_list)

# Create two dictionaries from the two enum objects
vocab_id2token = dict((i+2,j) for i,j in enum1)
vocab_token2id = dict((j,i+2) for i,j in enum2)

# Append the existing dictionaries
vocabulary_id2token.update(vocab_id2token)
vocabulary_token2id.update(vocab_token2id)

# Size of each dictionary
voc_id2token_size = len(vocabulary_id2token)
voc_token2id_size = len(vocabulary_token2id)

# Sanity Check
assert len(vocabulary_id2token) == len(vocabulary_token2id), "The length of the dictionaries should be the same, since they are identical but in different order"

# First first elements of each dictionary
l5_vocabid2token = {k: vocabulary_id2token[k] for k in list(vocabulary_id2token)[-5:]}
l5_vocab_token2id = {k: vocabulary_token2id[k] for k in list(vocabulary_token2id)[-5:]}

print("Here are the first five elements from each dictionary:")
print(f"vocabulary_id2token: {l5_vocabid2token}.")
print(f"vocabulary_token2id: {l5_vocab_token2id}.")
print()
print(f"Size of the dictionary vocabulary_i2token: {voc_id2token_size}")
print(f"Size of the dictionary vocabulary_token2id: {voc_token2id_size}")

Here are the first five elements from each dictionary:
vocabulary_id2token: {21586: 'grandkids', 21587: 'underreacting', 21588: 'absence', 21589: 'Ernesto', 21590: 'Time'}.
vocabulary_token2id: {'grandkids': 21586, 'underreacting': 21587, 'absence': 21588, 'Ernesto': 21589, 'Time': 21590}.

Size of the dictionary vocabulary_i2token: 21590
Size of the dictionary vocabulary_token2id: 21590


Let's do the same for our classes:

In [15]:
classes_id2name : dict = {}
classes_name2id : dict = {}

# Combining all the post-tags in the dataset into one big list
pos_tag_list = []
for posttag in (train_tgt + test_tgt): pos_tag_list += posttag

# Removing duplicates
pos_tag_list = list(set(pos_tag_list))

# Create two enumerate objects, one for each list
enum1 = enumerate(pos_tag_list)
enum2 = enumerate(pos_tag_list)

# Create two dictionaries from the two enum objects
"""
Note 1: We have to skip the tag_id of 0, since we are using that to pad
the targets later in both the train and test dataset.
"""
posttag_id2name = dict((i+1,j) for i,j in enum1)
posttag_name2id = dict((j,i+1) for i,j in enum2)

# Append the existing dictionaries
classes_id2name.update(posttag_id2name)
classes_name2id.update(posttag_name2id)

# Size of each dictionary
posttag_id2name_size = len(classes_id2name)
posttag_name2id_size = len(classes_name2id)

# Sanity Check
assert posttag_id2name_size == posttag_name2id_size, "The length of the dictionaries should be the same, since they are identical but in different order"

# First first elements of each dictionary
l5_classes_id2name = {k: classes_id2name[k] for k in list(classes_id2name)[-5:]}
l5_classes_name2id = {k: classes_name2id[k] for k in list(classes_name2id)[-5:]}

print("Here are the first five elements from each dictionary:")
print(f"classes_id2name: {l5_classes_id2name}.")
print(f"classes_name2id: {l5_classes_name2id}.")
print()
print(f"Size of the dictionary classes_id2name: {posttag_id2name_size}")
print(f"Size of the dictionary classes_name2id: {posttag_name2id_size}")


Here are the first five elements from each dictionary:
classes_id2name: {40: ',', 41: 'DT', 42: 'NNS', 43: 'PDT', 44: 'WDT'}.
classes_name2id: {',': 40, 'DT': 41, 'NNS': 42, 'PDT': 43, 'WDT': 44}.

Size of the dictionary classes_id2name: 44
Size of the dictionary classes_name2id: 44


Now, let's use PyTorch's `Dataset` and `DataLoader` to help us batch our data. Let's also replace tokens and classes with our ids. For that, complete `get_token_ids` and `get_class_ids`.

In [16]:
from typing import List

def get_token_ids(src: List[str]) -> List[int]:
    return [vocabulary_token2id[word] for word in src]

def get_class_ids(tgt: List[str]) -> List[int]:
    return [classes_name2id[class_name] for class_name in tgt]

class ConllDataset(Dataset):
  def __init__(self, src, tgt):
        self.src = src
        self.tgt = tgt

  def __len__(self):
        return len(self.src)

  def __getitem__(self, index):
        src = self.src[index]
        tgt = self.tgt[index]
        
        return {
            'src': get_token_ids(src),
            'tgt': get_class_ids(tgt),
        }

train_dataset = ConllDataset(train_src, train_tgt)
test_dataset = ConllDataset(test_src, test_tgt)

We will use a **batch size of 32**.

In [17]:
BATCH_SIZE = 4

However, since our examples are of different length, we need to pad shorter examples to the length of the example with the maximum length in our batch. So, let's define a special **padding token** in our vocabulary:

In [18]:
padding_token = "PAD"
padding_token_id = 0

vocabulary_id2token.update({padding_token_id: padding_token})
vocabulary_token2id.update({padding_token: padding_token_id})

classes_id2name.update({padding_token_id: padding_token})
classes_name2id.update({padding_token: padding_token_id})

# Update the sizes of the dictionaries
voc_id2token_size = len(vocabulary_id2token)
voc_token2id_size = len(vocabulary_token2id)
posttag_id2name_size = len(classes_id2name)
posttag_name2id_size = len(classes_name2id)


The `collate_fn` is the function that actually receives a batch and needs to add the padding tokens, then returns `src` and `tgt` as `Tensor`s of size `[B, S]` where `B` is our batch size and `S` our maximum sequence length. This function should additionally return a `mask`, a `Tensor` with binary values to indicate whether the specific element is a padding token or not (0 if it's a padding token, 1 if not), such that we can ignore padding tokens in our attention mechanism and loss calculation. 

In [19]:
from typing import Dict

def collate_fn(batch: List[Dict]) -> Dict[str, Tensor]:
    """
    batch: list of dictionaries with keys src and tgt (as defined in ConllDataset)
    e.g. [dict1, dict2, dict3, dict4, ..., dict 32]
    
    Each dictionary has the following key and values:
    "src": a list of token ids based on the tokens of the word sequence
    "tgt": a list of class ids based on the POS-tag of each token

    Note 1: We are padding every input of the batch based on the maximum sequence 
    length of our batch B. Another alternative would be padding every input based
    on the maximum sequence length of the entire dataset. 
    """
    
    #1: Unpack the given batch of data
    """
    token_ids: A list of lists for the token_ids of each sequence within this sampled batch
    class_ids: A list of lists for the class_ids of each sequence within this sampled batch
    """
    
    token_ids = []           
    class_ids = []            
    
    for x in batch:
      token_ids.append(x["src"])
      class_ids.append(x["tgt"])
      assert len(x["src"]) == len(x["tgt"]), "The length of the sequences token_ids and class_ids should be the same, since each token should be assigned to a class"
    
    assert len(token_ids) == len(class_ids) == BATCH_SIZE, "The amount of sequences within this batch should correspond with the batch size, and the size of both lists should be the same"

    #2: Get maximum sequence length within this batch 
    list_len = [len(i) for i in token_ids]
    max_seq_len = max(list_len)

    #3: Create tensors from the unpacked data and intialize a mask for each sequence in batch
    mask = torch.full((BATCH_SIZE, max_seq_len), -1)

    #4: Pad sequences that are shorter than the maximum sequence length 
    #4.1: Iterate over every item in the list, i.e. every word sequence in the batch 
    for i in range(len(token_ids)):

      #4.2: Get each sequence length (to be compared with the maximum sequence length)
      seq_length = len(token_ids[i])
      length_diff = max_seq_len - seq_length

      #4.3: Pad the input sequences if it is shorter than the maximum sequence length (in-place) 
      padding = [padding_token_id] * length_diff
      token_ids[i] = token_ids[i] + padding
      class_ids[i] = class_ids[i] + padding

      #4.4: Create the attention mask for later use
      ones = torch.ones(seq_length)
      zeros = torch.zeros(length_diff)
      mask[i] = torch.cat((ones, zeros))

    #5: Create tensors for src and tgt from nested lists
    src = torch.tensor(token_ids)
    tgt = torch.tensor(class_ids)

    # Sanity checks as per the given requirements
    assert len(mask) == BATCH_SIZE, "The number of masks in the list of masks should correspond to the number of word sequences, i.e. the batch size"
    assert src.size(dim=0) == BATCH_SIZE, "The size of the first dimension of the src tensor should match the batch size"
    assert src.size(dim=1) == max_seq_len, "The size of the second dimension of the src tensor should match the maximum sequence length"
    assert tgt.size(dim=0) == BATCH_SIZE, "The size of the first dimension of the tgt tensor should match the batch size"
    assert tgt.size(dim=1) == max_seq_len, "The size of the second dimension of the src tensor should match the maximum sequence length"


    return {
        'src': src,
        'tgt': tgt,
        'mask': mask,
    }
    

With that, we can use PyTorch's `DataLoader` which will shuffle and batch our data automatically.

In [20]:
train_data_loader = DataLoader(train_dataset, collate_fn=collate_fn, batch_size=BATCH_SIZE, shuffle=True)
test_data_loader = DataLoader(test_dataset, collate_fn=collate_fn, batch_size=BATCH_SIZE, shuffle=True)

### Architecture

Let's build a transformer model with three layers, three attention heads and an embedding dimension of 128. Also, let's not forget to add a classification head to our model.

In [21]:
EMB_DIMENSION = 128
NUM_LAYERS = 3
NUM_ATT_HEADS = 3

class CoNLL2000Transformer(nn.Module):
    def __init__(self, transformer, hidden_n:int):
        super().__init__()
        self.transformer = transformer
        self.classification_layer = nn.Linear(hidden_n, posttag_id2name_size)
        self.hidden_n = hidden_n

    def forward(self, src: Tensor) -> Tensor:
      src = self.transformer(src)
      output = self.classification_layer(src)
      output = softmax(output)

      return output



model = CoNLL2000Transformer(Transformer(voc_id2token_size, EMB_DIMENSION, NUM_LAYERS, NUM_ATT_HEADS), EMB_DIMENSION)

### Training

Initialize the **AdamW** optimizer from the `torch.optim` module and choose the most appropriate loss function for our task.

In [22]:
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-8)

# We are using cross entropy loss here, since the task is a classification task
"""
Note 1: The softmax operation might be included in the cross entropy loss defined below.
If this is the case (assuming that it is noticeable from the training loss), then the softmax
operation before the output must be removed. 
""" 
criterion = nn.CrossEntropyLoss()

Build a basic training loop and train the network for three epochs.
- Use everything we've built to far, including `train_data_loader`, `model`, `optimizer` and `criterion`.
- At every 50th step print the average loss of the last 50 steps. 
- It is suggested to make a basic training procedure to work on the CPU first. Once it successfully runs on the CPU, you can switch to the GPU (click on change runtime and add an hardware accelerator if you use Colab) and run for the whole three epochs. Note: For this to work, you need to transfer the `model` and the input tensors to the GPU memory. This simply works by calling `.to(device)` on the model and tensors, where `device` and either be `cpu` or `cuda` (for the GPU).

In [23]:
torch.cuda.empty_cache()

In [24]:
# The variable DEVICE has been moved to the beginning of this notebook
EPOCHS = 3

"""
Note 1: I am implementing gradient accumulation here, since the number of parameters
is too high and the batch_size is only limited to 4, or maybe even 8. Other alternatives
to this problem might exist, however were not considered due to the limited time frame
given to work on this assignment. 

Here, the batch size was reduced to 4, while a gradient accumulation technique allows
the effective batch size to be 4*8 = 32, i.e. the original pre-set batch size 
"""
GRAD_ACC = 8

model = model.to(DEVICE)

"""
DISCLAIMER: The library tqdm and time is installed here to implement a progress bar,
so that the training progress can be better checked. Since this has little relevance to
the actual exercise, we are hoping this does not affect the grading. 

    # Setup loop with TQDM and dataloader, can be removed if not allowed
    loop = tqdm(train_data_loader, leave=True)
    for batch in loop:
      
      #1: Initialize calculated gradients (from previous step)
      optimizer.zero_grad()

      #2: Pull all tensor batches required for training
      input_sentence = batch[]

from tqdm import tqdm
import time
"""

# Activate training mode 
model.train()

# Initiate losses:
running_loss = 0
avg_loss_in_50 = 0 

for epoch in range(EPOCHS):
  for i, data in enumerate(train_data_loader):

    #1: Unpack the data from the dataloader
    """
    Every data instance is a dictionary of tensors, with each tensor having 
    the shape (batch_size, maximum_seq_length)
    """
    inputs = data["src"].to(DEVICE)
    labels = data["tgt"].to(DEVICE)
    masks = data["mask"].to(DEVICE)

    #2: Make predictions for this batch 
    masked_outputs = model(inputs)

    #3: Multiply the outputs with the mask to obtain original sequence length 
    """
    Note 1: We need to tranpose the outputs so that the shape is (batch_size, posttag_name2id_size, max_seq_length), 
    instead of (batch_size, max_seq_length, posttag_name2id_size), so that the cross-entropy loss can 
    be operated based on the given labels and outputs 
    """
    outputs = torch.transpose(torch.mul(masked_outputs, torch.unsqueeze(masks, -1)), 1, 2)


    #4: Compute the loss and its gradients (w/ gradient accumulation)
    loss = criterion(outputs, labels) 
    (loss / GRAD_ACC).backward()

    #5: Adjust learning weights
    if (i+1) % GRAD_ACC == 0:
      optimizer.step()
      model.zero_grad()

    #6: Gather data and report 
    running_loss += loss.item()
    if i % 50 == 49:
      avg_loss_in_50 = running_loss / 50
      print('  batch {} loss: {}'.format(i + 1, avg_loss_in_50))
      running_loss = 0
      






  batch 50 loss: 3.741136088371277
  batch 100 loss: 3.6963444757461548
  batch 150 loss: 3.686091432571411
  batch 200 loss: 3.680394377708435
  batch 250 loss: 3.6688568782806397
  batch 300 loss: 3.6674539613723756
  batch 350 loss: 3.6468973112106324
  batch 400 loss: 3.65030912399292
  batch 450 loss: 3.6412076854705813
  batch 500 loss: 3.623781099319458
  batch 550 loss: 3.623956661224365
  batch 600 loss: 3.62014262676239
  batch 650 loss: 3.6128553104400636
  batch 700 loss: 3.5921430444717406
  batch 750 loss: 3.594812426567078
  batch 800 loss: 3.595312180519104
  batch 850 loss: 3.5713216972351076
  batch 900 loss: 3.5712246227264406
  batch 950 loss: 3.5826078510284423
  batch 1000 loss: 3.569926528930664
  batch 1050 loss: 3.584328837394714
  batch 1100 loss: 3.5478425884246825
  batch 1150 loss: 3.572008852958679
  batch 1200 loss: 3.566558141708374
  batch 1250 loss: 3.5537648677825926
  batch 1300 loss: 3.557211012840271
  batch 1350 loss: 3.551089506149292
  batch 140

### Evaluation

Let's see what's the accuracy is of our model. Since we already implemented accuracy in the previous exercise, we'll now let you use the torchmetrics package.

In [25]:
from torchmetrics import Accuracy

"""
Note 1: The parameter average='micro' sometimes bugged the usage of the accuracy metric, 
creating an error. If this is the case, removing the parameter would help solve the issue.
This does not change anything, since this setting is also the default setting. 
"""
accuracy = Accuracy(average='micro').to(DEVICE)

Calculate the average accuracy of all examples in the test dataset.

In [26]:
model.eval()

with torch.no_grad():
  for i, data in enumerate(test_data_loader):
      
      #1: Unpack the data from the dataloader
      """
      Every data instance is a dictionary of tensors, with each tensor having 
      the shape (batch_size, maximum_seq_length)
      """
      inputs = data["src"].to(DEVICE)
      labels = data["tgt"].to(DEVICE)
      masks = data["mask"].to(DEVICE)

      #2: Make predictions for this batch 
      masked_outputs = model(inputs)

      #3: Multiply the outputs with the mask
      """
      Note 1: Since we are in inference / evaluation mode, we take the argmax of the last dimension, i.e. posttag_name2id_size, 
      and reshape the shape of the predictions so that it batches the labels, i.e. (batch_size, max_seq_length)

      Shapes:
      unmasked_outputs: (batch_size, max_seq_length, posttag_name2id_size)
      pred: (batch_size, max_seq_length)

      """
      unmasked_outputs = torch.mul(masked_outputs, torch.unsqueeze(masks, -1)).to(DEVICE)
      values, indices = torch.topk(unmasked_outputs, k=1, dim=-1)
      pred = torch.squeeze(indices, -1).to(DEVICE)

      """
      Why is the size of the second dimension here not BATCH_SIZE?                 !!!
      """

      for x in range(BATCH_SIZE):
        batch_acc = accuracy(labels[x], pred[x])
      
      #4: Print batch accuracy for each 25 batches 
      if i % 25 == 24:
        print('  Batch {} Batch accuracy: {}'.format(i + 1, batch_acc))

print('Overall accuracy:', accuracy.compute())

  Batch 25 Batch accuracy: 0.8275862336158752
  Batch 50 Batch accuracy: 0.8888888955116272
  Batch 75 Batch accuracy: 0.5151515007019043
  Batch 100 Batch accuracy: 0.8928571343421936
  Batch 125 Batch accuracy: 0.7142857313156128
  Batch 150 Batch accuracy: 0.8235294222831726
  Batch 175 Batch accuracy: 0.7435897588729858
  Batch 200 Batch accuracy: 0.75
  Batch 225 Batch accuracy: 0.9111111164093018
  Batch 250 Batch accuracy: 0.9677419066429138
  Batch 275 Batch accuracy: 0.7678571343421936
  Batch 300 Batch accuracy: 0.8260869383811951
  Batch 325 Batch accuracy: 0.6206896305084229
  Batch 350 Batch accuracy: 0.8125
  Batch 375 Batch accuracy: 0.6585366129875183
  Batch 400 Batch accuracy: 0.7777777910232544
  Batch 425 Batch accuracy: 0.8529411554336548
  Batch 450 Batch accuracy: 0.800000011920929
  Batch 475 Batch accuracy: 0.8085106611251831
  Batch 500 Batch accuracy: 0.9047619104385376
Overall accuracy: tensor(0.8107, device='cuda:0')


Let's also look at the accuracy **for each class separately**:

In [27]:
torch.cuda.empty_cache()

In [41]:
class_acc = Accuracy(average='none', num_classes=posttag_id2name_size).to(DEVICE)
for i, data in enumerate(test_data_loader):
    
    #1: Unpack the data from the dataloader
    """
    Every data instance is a dictionary of tensors, with each tensor having 
    the shape (batch_size, maximum_seq_length)
    """
    inputs = data["src"].to(DEVICE)
    labels = data["tgt"].to(DEVICE)
    masks = data["mask"].to(DEVICE)

    #2: Make predictions for this batch 
    masked_outputs = model(inputs)

    #3: Multiply the outputs with the mask
    """
    Note 1: Since we are in inference / evaluation mode, we take the argmax of the last dimension, i.e. posttag_name2id_size, 
    and reshape the shape of the predictions so that it batches the labels, i.e. (batch_size, max_seq_length)

    Shapes:
    unmasked_outputs: (batch_size, max_seq_length, posttag_name2id_size)
    pred: (batch_size, max_seq_length)

    """
    unmasked_outputs = torch.mul(masked_outputs, torch.unsqueeze(masks, -1)).to(DEVICE)
    values, indices = torch.topk(unmasked_outputs, k=1, dim=-1)
    pred = torch.squeeze(indices, -1).to(DEVICE)

    for i in range(BATCH_SIZE):
      batch_acc = class_acc(labels[i], pred[i])

    """
    #4: Print batch accuracy for each 25 batches 
    if i % 25 == 24:
      print('  Batch {} Batch accuracy: {}'.format(i + 1, batch_acc))
    """



KeyboardInterrupt: ignored

In [44]:
per_class_acc = class_acc.compute().tolist()
per_class_acc = [round(item, 4) for item in per_class_acc]
class_ids = range(posttag_id2name_size)
class_accuracies = {class_ids[i]: per_class_acc[i] for i in range(len(per_class_acc))}

class_accuracies

{0: 0.9935,
 1: 0.1871,
 2: 0.3125,
 3: 0.3065,
 4: 0.0,
 5: nan,
 6: 0.144,
 7: 0.743,
 8: nan,
 9: 0.0,
 10: 0.0,
 11: 0.3571,
 12: 0.0,
 13: 0.0588,
 14: 1.0,
 15: 0.3623,
 16: 0.4,
 17: 0.0,
 18: 0.0,
 19: 0.7939,
 20: 0.0677,
 21: 0.0,
 22: 0.0811,
 23: 0.0,
 24: 0.0,
 25: 0.0,
 26: 0.0,
 27: 0.0,
 28: 0.0,
 29: 0.0,
 30: 0.0,
 31: 0.0,
 32: nan,
 33: 0.9722,
 34: 0.8856,
 35: nan,
 36: 0.2435,
 37: 0.9825,
 38: 0.0,
 39: 0.0752,
 40: 0.0682,
 41: 0.2593,
 42: 0.0,
 43: 0.0}

## Positional Embeddings

The attention mechanism does not consider the position of the tokens which hurts its performance for many problems. We can solve this issue in several ways. We can either add a positional encoding (via trigonometric functions) or we can learn positional embeddings along the way, in a similar way as BERT does. Here, we will add learnable positional embeddings to our exisisting model with another embedding layer.

The longest sequence in our dataset has 78 tokens (you can trust us on that). So, let's set the number of embeddings for our positional embedding layer to that number. Again, you should use `nn.Embedding`.

Copy the inner parts of your `Transformer` class and add positional embeddings to it.

In [None]:
MAX_SEQ_LEN = 78

class TransformerPos(nn.Module):
    def __init__(self, emb_n: int, pos_emb_n: int, hidden_n: int, n:int =3, h:int =2):
        """
        emb_n: number of token embeddings
        pos_emb_n: number of position embeddings
        hidden_n: hidden dimension
        n: number of layers
        h: number of heads per layer
        """
        super().__init__()
        self.positional_embeddings = pos_emb_n
        self.emb_n = emb_n
        self.hidden_n = hidden_n
        self.n = n
        self.h = h

        self.embed = nn.Embedding(emb_n, hidden_n, padding_idx=0)
        self.pos_emb = nn.Embedding(MAX_SEQ_LEN, hidden_n)
        self.layers = nn.ModuleList(
            [
                TransformerBlock(hidden_n, h)
                for i in range(n)
            ]
        )

    def forward(self, src: Tensor) -> Tensor:
      batch_size, seq_length = src.shape
      positions = torch.arrange(0, seq_length).expand(N, seq_length).to(DEVICE)
      src = self.embed(src) + self.pos_emb(positions)
      for layer in self.layers:
        src = layer(src)

      return src

In [None]:
model_pos = CoNLL2000Transformer(TransformerPos(voc_id2token_size, MAX_SEQ_LEN, EMB_DIMENSION, NUM_LAYERS, NUM_ATT_HEADS), EMB_DIMENSION)

### Training

Same procedure as before. Let's reinitialize our optimizer and our loss function and run the same training loop with our new model `model_pos`.

In [None]:
optimizer = torch.optim.AdamW(model_pos.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-8)

# We are using cross entropy loss here, since the task is a classification task
"""
Note 1: The softmax operation might be included in the cross entropy loss defined below.
If this is the case (assuming that it is noticeable from the training loss), then the softmax
operation before the output must be removed. 
""" 
criterion = nn.CrossEntropyLoss()

In [None]:
torch.cuda.empty_cache()

In [None]:
# The variable DEVICE has been moved to the beginning of this notebook
EPOCHS = 3

"""
Note 1: I am implementing gradient accumulation here, since the number of parameters
is too high and the batch_size is only limited to 4, or maybe even 8. Other alternatives
to this problem might exist, however were not considered due to the limited time frame
given to work on this assignment. 

Here, the batch size was reduced to 4, while a gradient accumulation technique allows
the effective batch size to be 4*8 = 32, i.e. the original pre-set batch size 
"""
GRAD_ACC = 8

model_pos = model_pos.to(DEVICE)

"""
DISCLAIMER: The library tqdm and time is installed here to implement a progress bar,
so that the training progress can be better checked. Since this has little relevance to
the actual exercise, we are hoping this does not affect the grading. 

    # Setup loop with TQDM and dataloader, can be removed if not allowed
    loop = tqdm(train_data_loader, leave=True)
    for batch in loop:
      
      #1: Initialize calculated gradients (from previous step)
      optimizer.zero_grad()

      #2: Pull all tensor batches required for training
      input_sentence = batch[]

from tqdm import tqdm
import time
"""

# Activate training mode 
model_pos.train()

# Initiate losses:
running_loss = 0
avg_loss_in_50 = 0 

for epoch in range(EPOCHS):
  for i, data in enumerate(train_data_loader):

    #1: Unpack the data from the dataloader
    """
    Every data instance is a dictionary of tensors, with each tensor having 
    the shape (batch_size, maximum_seq_length)
    """
    inputs = data["src"].to(DEVICE)
    labels = data["tgt"].to(DEVICE)
    masks = data["mask"].to(DEVICE)

    #2: Make predictions for this batch 
    masked_outputs = model_pos(inputs)

    #3: Multiply the outputs with the mask to obtain original sequence length 
    """
    Note 1: We need to tranpose the outputs so that the shape is (batch_size, posttag_name2id_size, max_seq_length), 
    instead of (batch_size, max_seq_length, posttag_name2id_size), so that the cross-entropy loss can 
    be operated based on the given labels and outputs 
    """
    outputs = torch.transpose(torch.mul(masked_outputs, torch.unsqueeze(masks, -1)), 1, 2)


    #4: Compute the loss and its gradients (w/ gradient accumulation)
    loss = criterion(outputs, labels) 
    (loss / GRAD_ACC).backward()

    #5: Adjust learning weights
    if (i+1) % GRAD_ACC == 0:
      optimizer.step()
      model_pos.zero_grad()

    #6: Gather data and report 
    running_loss += loss.item()
    if i % 50 == 49:
      avg_loss_in_50 = running_loss / 50
      print('  batch {} loss: {}'.format(i + 1, avg_loss_in_50))
      running_loss = 0
      






  batch 50 loss: 3.741136088371277
  batch 100 loss: 3.6963444757461548
  batch 150 loss: 3.686091432571411
  batch 200 loss: 3.680394377708435
  batch 250 loss: 3.6688568782806397
  batch 300 loss: 3.6674539613723756
  batch 350 loss: 3.6468973112106324
  batch 400 loss: 3.65030912399292
  batch 450 loss: 3.6412076854705813
  batch 500 loss: 3.623781099319458
  batch 550 loss: 3.623956661224365
  batch 600 loss: 3.62014262676239
  batch 650 loss: 3.6128553104400636
  batch 700 loss: 3.5921430444717406
  batch 750 loss: 3.594812426567078
  batch 800 loss: 3.595312180519104
  batch 850 loss: 3.5713216972351076
  batch 900 loss: 3.5712246227264406
  batch 950 loss: 3.5826078510284423
  batch 1000 loss: 3.569926528930664
  batch 1050 loss: 3.584328837394714
  batch 1100 loss: 3.5478425884246825
  batch 1150 loss: 3.572008852958679
  batch 1200 loss: 3.566558141708374
  batch 1250 loss: 3.5537648677825926
  batch 1300 loss: 3.557211012840271
  batch 1350 loss: 3.551089506149292
  batch 140

### Evaluation

Now, let's check if our performance on the accuracy got improved.

In [None]:
...

Again, let's also check each class. Which classes got improved the most by adding positional embeddings?

In [None]:
...

The last question in this assignment doesn't require you to code anything. Instead, you're asked to point out possible issues with our current approach and name potential improvements. 
* ...
* ...
* ...

<a style='text-decoration:none;line-height:16px;display:flex;color:#5B5B62;padding:10px;justify-content:end;' href='https://deepnote.com?utm_source=created-in-deepnote-cell&projectId=faa4af3b-d086-4f42-8b7d-d29c91b1d0f6' target="_blank">
 </img>
Created in <span style='font-weight:600;margin-left:4px;'>Deepnote</span></a>