### Transformers ([Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf))

RNNs have two major drawbacks.
First, they can suffer from vanishing gradients for long sequences.
Second, they can take a long time to train: the sequential dependencies between hidden states prevent them from exploiting the massively parallel architecture of modern GPUs.
The first issue is largely addressed by alternate RNN architectures (LSTMs, GRUs) but not the second.

Transformers address both problems to a certain extent by processing the whole input sequence in parallel during training, even for long sequences. Although the computation is quadratic in the input sequence length, it is still manageable on modern GPUs.

In this notebook, we will implement the Transformer model step by step, following the original paper, [Attention is all you need](https://arxiv.org/pdf/1706.03762.pdf). We will also use a toy dataset to solve a vector-to-vector problem, which is a subset of the sequence-to-sequence problem.

## Table of Contents

This assignment has 4 parts. In class we learned about Encoder-based Transformers, but an Encoder and a Decoder are often used together for sequence-to-sequence tasks. In this notebook, you will learn how to implement an Encoder Transformer step by step.


1. **Part I (Implement Transformer blocks)**: We will look at how to implement the building blocks of a Transformer. It consists of the following blocks:
   1. MultiHeadAttention
   2. FeedForward
   3. LayerNorm
   4. Encoder Block
1. **Part II (Preparation)**: We will preprocess a dataset
1. **Part III (Train a model)**: In the last part we will look at how to fit the implemented Transformer model to the toy dataset.

You can run everything on the CPU through Part 3. Part 4 requires a GPU; after changing the runtime for that part, you will also have to re-run all the previous parts, because Part 4 depends on them.

![Encoder Block](https://miro.medium.com/v2/resize:fit:880/format:webp/1*Wew2tXiDk_rMPFCq73cysw.png)

In [None]:
%load_ext autoreload
%autoreload 2

### Google Colab Setup

Next we need to run a few commands to set up our environment on Google Colab. If you are running this notebook on a local machine you can skip this section.

Run the following cell to mount your Google Drive. Follow the link, sign in to your Google account (the same account you used to store this notebook!) and copy the authorization code into the text box that appears below.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import sys

GOOGLE_DRIVE_PATH_AFTER_MYDRIVE = "/content/drive/MyDrive/Colab Notebooks/HW4-Code/Transformer"

GOOGLE_DRIVE_PATH = os.path.join(GOOGLE_DRIVE_PATH_AFTER_MYDRIVE)
print(os.listdir(GOOGLE_DRIVE_PATH))

# Add the assignment folder to sys.path so we can import its .py files.

sys.path.append(GOOGLE_DRIVE_PATH)

['a5_helper.py', '.DS_Store', 'helpers_module', '__pycache__', 'transformers_sentiment_analysis.py', 'transformers_sentiment_analysis.ipynb']


Once you have successfully mounted your Google Drive and located the path to this assignment, run the following cell to allow us to import from the `.py` files of this assignment. If it works correctly, it should print the message:

```
Hello from transformers_sentiment_analysis.py!
```

as well as the last edit time for the file `transformers_sentiment_analysis.py`.

In [None]:
import os
import time
from transformers_sentiment_analysis import hello_transformers


os.environ["TZ"] = "Asia/Tehran"
time.tzset()
hello_transformers()

transformers_path = os.path.join(GOOGLE_DRIVE_PATH, "transformers_sentiment_analysis.py")
transformers_edit_time = time.ctime(os.path.getmtime(transformers_path))
print("transformers.py last edited on %s" % transformers_edit_time)

Hello from transformers_sentiment_analysis.py!
transformers.py last edited on Thu Dec  5 22:27:37 2024


In [None]:
import torch
import torch.nn.functional as F
from torch import Tensor
from torch import nn

from helpers_module.utils import (
    reset_seed,
    tensor_to_image,
    attention_visualizer,
)
from helpers_module.grad import rel_error, compute_numeric_gradient
import matplotlib.pyplot as plt
import time
from IPython.display import Image


# for plotting
%matplotlib inline
plt.rcParams["figure.figsize"] = (10.0, 8.0)  # set default size of plots
plt.rcParams["image.interpolation"] = "nearest"
plt.rcParams["image.cmap"] = "gray"

We will use the GPU to accelerate our computation. Run this cell to make sure you are using a GPU.

We will be using `torch.float = torch.float32` for data and `torch.long = torch.int64` for labels.

Please refer to https://pytorch.org/docs/stable/tensor_attributes.html#torch-dtype for more details about data types.

In [None]:
to_float = torch.float
to_long = torch.long

if torch.cuda.is_available():
    print("Good to go!")
    DEVICE = torch.device("cuda")
else:
    print("Please set GPU via Edit -> Notebook Settings.")
    DEVICE = torch.device("cpu")

Good to go!


# Part I.  Implementing Transformer building blocks

Now that we have looked at the data, the task is sentiment analysis.

In this section, we will implement the various building blocks of the Transformer model. These will then be used to build the Transformer encoder and decoder.
Each block will be implemented as a subclass of `nn.Module`; we will use PyTorch autograd to compute gradients, so we don't need to implement backward passes manually.

We will implement the following blocks, by referencing the original paper:

1. MultiHeadAttention Block
2. FeedForward Block
3. Layer Normalization
4. Positional Encoding block

We will then use these building blocks, combined with the input embedding layer, to construct the Transformer Encoder. We will start with the MultiHeadAttention block, the FeedForward block, and Layer Normalization, and look at positional encoding and input embedding later.

**Note:** One thing to keep in mind while implementing these blocks is that the input and output tensors of every block have the same shape. It always helps to check the shapes of the input and output tensors.

### MultiHeadAttention Block

![Encoder Block](https://uvadlc-notebooks.readthedocs.io/en/latest/_images/multihead_attention.svg)

Transformers are sequence-to-sequence networks, i.e., they take a sequence as input (for example, a sentence in English) and output a sequence (for example, a sentence in Spanish). The input sequence is first transformed into embeddings, as discussed in the RNN section, and these embeddings are then passed through a Positional Encoding block. The resulting embeddings are then transformed into three vectors, *query*, *key*, and *value*, using learnable weights, and a Transformer Encoder and Decoder then produce the final output sequence. For this section, we will assume that we already have the *query*, *key*, and *value* vectors and work on them.

In the above figure, you can see that the Encoder has a multi-head attention block right after these blocks. There is also a masked multi-head attention block in the decoder, but we will not implement it.
To implement the basic MultiHeadAttention block, we will first implement the Self Attention block and then see that MultiHeadAttention can be implemented as a direct extension of it.

## Self Attention Block

Taking inspiration from the information retrieval paradigm, Transformers have a notion of *query*, *key*, and *value*: given a *query*, we try to extract information from *key*-*value* pairs. Mathematically, we do this by taking a weighted sum of the *values* for each *query*, where each weight comes from the dot product of the *query* and a *key*. More precisely, for each query we compute the dot product with all the keys, scale the resulting scalars, apply the softmax function to them, and use the result as the weights for the weighted sum of the *values*. Let's start by implementing an Attention block that takes *query*, *key*, and *value* vectors as input and returns a Tensor that is the weighted sum of the *values*.
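
The description above can be sketched in plain Python, without PyTorch, for a single sequence. This is only an illustration, not the reference solution: the helper names `linspace`, `reshape`, and `scaled_dot_product_sketch` are made up for this example, and scaling the scores by `sqrt(M)` follows the original paper.

```python
import math


def linspace(start, stop, steps):
    """Evenly spaced values from start to stop inclusive (like torch.linspace)."""
    step = (stop - start) / (steps - 1)
    return [start + i * step for i in range(steps)]


def reshape(flat, rows, cols):
    """Split a flat list into `rows` lists of length `cols`."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def scaled_dot_product_sketch(query, key, value):
    """Attention for one sequence; query, key, value are K x M nested lists."""
    K, M = len(query), len(query[0])
    out = []
    for i in range(K):
        # Dot product of query i with every key, scaled by sqrt(M).
        scores = [
            sum(q * k for q, k in zip(query[i], key[j])) / math.sqrt(M)
            for j in range(K)
        ]
        # Softmax over the keys (subtract the max for numerical stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out.append(
            [sum(weights[j] * value[j][m] for j in range(K)) for m in range(M)]
        )
    return out


K, M = 5, 4
query = reshape(linspace(-0.4, 0.6, K * M), K, M)
key = reshape(linspace(-0.8, 0.5, K * M), K, M)
value = reshape(linspace(-0.3, 0.8, K * M), K, M)
y = scaled_dot_product_sketch(query, key, value)
print([round(v, 5) for v in y[0]])
```

With the same `linspace` inputs as the single-sequence test cell, the printed row should be close to the first row of that cell's `y_expected`.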

For this section, you need to implement three functions, `scaled_dot_product_two_loop_single`, `scaled_dot_product_two_loop_batch`, and `scaled_dot_product_no_loop_batch`, inside the `transformers_sentiment_analysis.py` file. This might look very similar to `dot_product_attention` in the RNN notebook, but there is a subtle difference in the inputs. You should see errors below 1e-5.

In [None]:
from transformers_sentiment_analysis import (
    scaled_dot_product_two_loop_single,
    scaled_dot_product_two_loop_batch,
    scaled_dot_product_no_loop_batch,
)

In [None]:
N = 2  # Number of sentences
K = 5  # Number of words in a sentence
M = 4  # feature dimension of each word embedding

query = torch.linspace(-0.4, 0.6, steps=K * M).reshape(K, M)  # **to_double_cuda
key = torch.linspace(-0.8, 0.5, steps=K * M).reshape(K, M)  # **to_double_cuda
value = torch.linspace(-0.3, 0.8, steps=K * M).reshape(K, M)  # *to_double_cuda

y = scaled_dot_product_two_loop_single(query, key, value)
y_expected = torch.tensor(
    [
        [0.08283, 0.14073, 0.19862, 0.25652],
        [0.13518, 0.19308, 0.25097, 0.30887],
        [0.18848, 0.24637, 0.30427, 0.36216],
        [0.24091, 0.29881, 0.35670, 0.41460],
        [0.29081, 0.34871, 0.40660, 0.46450],
    ]
).to(torch.float32)
print("scaled_dot_product_two_loop_single error: ", rel_error(y_expected, y))

scaled_dot_product_two_loop_single error:  5.204997002435008e-06


In [None]:
N = 2  # Number of sentences
K = 5  # Number of words in a sentence
M = 4  # feature dimension of each word embedding

query = torch.linspace(-0.4, 0.6, steps=N * K * M).reshape(N, K, M)  # **to_double_cuda
key = torch.linspace(-0.8, 0.5, steps=N * K * M).reshape(N, K, M)  # **to_double_cuda
value = torch.linspace(-0.3, 0.8, steps=N * K * M).reshape(N, K, M)  # *to_double_cuda

y = scaled_dot_product_two_loop_batch(query, key, value)
y_expected = torch.tensor(
    [
        [
            [-0.09603, -0.06782, -0.03962, -0.01141],
            [-0.08991, -0.06170, -0.03350, -0.00529],
            [-0.08376, -0.05556, -0.02735, 0.00085],
            [-0.07760, -0.04939, -0.02119, 0.00702],
            [-0.07143, -0.04322, -0.01502, 0.01319],
        ],
        [
            [0.49884, 0.52705, 0.55525, 0.58346],
            [0.50499, 0.53319, 0.56140, 0.58960],
            [0.51111, 0.53931, 0.56752, 0.59572],
            [0.51718, 0.54539, 0.57359, 0.60180],
            [0.52321, 0.55141, 0.57962, 0.60782],
        ],
    ]
).to(torch.float32)
print("scaled_dot_product_two_loop_batch error: ", rel_error(y_expected, y))

scaled_dot_product_two_loop_batch error:  4.020571992067902e-06


In [None]:
N = 2  # Number of sentences
K = 5  # Number of words in a sentence
M = 4  # feature dimension of each word embedding

query = torch.linspace(-0.4, 0.6, steps=N * K * M).reshape(N, K, M)  # **to_double_cuda
key = torch.linspace(-0.8, 0.5, steps=N * K * M).reshape(N, K, M)  # **to_double_cuda
value = torch.linspace(-0.3, 0.8, steps=N * K * M).reshape(N, K, M)  # *to_double_cuda


y, _ = scaled_dot_product_no_loop_batch(query, key, value)

y_expected = torch.tensor(
    [
        [
            [-0.09603, -0.06782, -0.03962, -0.01141],
            [-0.08991, -0.06170, -0.03350, -0.00529],
            [-0.08376, -0.05556, -0.02735, 0.00085],
            [-0.07760, -0.04939, -0.02119, 0.00702],
            [-0.07143, -0.04322, -0.01502, 0.01319],
        ],
        [
            [0.49884, 0.52705, 0.55525, 0.58346],
            [0.50499, 0.53319, 0.56140, 0.58960],
            [0.51111, 0.53931, 0.56752, 0.59572],
            [0.51718, 0.54539, 0.57359, 0.60180],
            [0.52321, 0.55141, 0.57962, 0.60782],
        ],
    ]
).to(torch.float32)

print("scaled_dot_product_no_loop_batch error: ", rel_error(y_expected, y))

scaled_dot_product_no_loop_batch error:  4.020571992067902e-06


## Observing time complexity:

Transformers are infamous for a time complexity that depends on the length of the input sequence.
We can verify this now that we have implemented `scaled_dot_product_no_loop_batch`.
Run the cells below: the first uses a sequence length of 256 and the second a sequence length of 512. The second should be roughly 4 times slower, showing that the complexity of Transformers grows quadratically with the input sequence length.
The `%timeit` lines may take several seconds to run.
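
A rough operation count makes the factor of 4 plausible. The sketch below is a back-of-the-envelope estimate, not a profiler: the function name `attention_multiply_adds` is made up for this example, and it counts only the two matrix products, ignoring the softmax.

```python
def attention_multiply_adds(seq_len, emb_dim, batch=1):
    """Multiply-adds for Q K^T plus the weighted sum over the values."""
    # Q @ K^T: (K x M) times (M x K) gives a K x K score matrix.
    scores = batch * seq_len * seq_len * emb_dim
    # softmax(scores) @ V: (K x K) times (K x M).
    weighted_sum = batch * seq_len * seq_len * emb_dim
    return scores + weighted_sum


shorter = attention_multiply_adds(256, 2048, batch=64)
longer = attention_multiply_adds(512, 2048, batch=64)
print(longer / shorter)  # 4.0
```

Both terms contain `seq_len * seq_len`, so doubling the sequence length quadruples the count.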

In [None]:
N = 64
K = 256  # defines the input sequence length
M = emb_size = 2048
dim_q = dim_k = 2048
query = torch.linspace(-0.4, 0.6, steps=N * K * M).reshape(N, K, M)  # **to_double_cuda
key = torch.linspace(-0.8, 0.5, steps=N * K * M).reshape(N, K, M)  # **to_double_cuda
value = torch.linspace(-0.3, 0.8, steps=N * K * M).reshape(N, K, M)  # *to_double_cuda

%timeit -n 5 -r 2  y = scaled_dot_product_no_loop_batch(query, key, value)

1.19 s ± 301 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)


In [None]:
N = 64
K = 512  # defines the input sequence length
M = emb_size = 2048
dim_q = dim_k = 2048
query = torch.linspace(-0.4, 0.6, steps=N * K * M).reshape(N, K, M)  # **to_double_cuda
key = torch.linspace(-0.8, 0.5, steps=N * K * M).reshape(N, K, M)  # **to_double_cuda
value = torch.linspace(-0.3, 0.8, steps=N * K * M).reshape(N, K, M)  # *to_double_cuda

%timeit -n 5 -r 2  y = scaled_dot_product_no_loop_batch(query, key, value)

4.57 s ± 852 ms per loop (mean ± std. dev. of 2 runs, 5 loops each)


Now that we have implemented `scaled_dot_product_no_loop_batch`, let's implement single-head attention, which will serve as a building block for the `MultiHeadAttention` block. For this exercise, we have made a `SelfAttention` class that inherits from PyTorch's `nn.Module` class. You need to implement the `__init__` and `forward` functions inside `transformers_sentiment_analysis.py`.

Run the following cells to test your implementation of the `SelfAttention` layer. We have also written code to check the backward pass using the PyTorch autograd API in the following cell. You should expect the error to be less than 1e-5.

In [None]:
from transformers_sentiment_analysis import SelfAttention

In [None]:
reset_seed(0)
N = 2
K = 4
M = emb_size = 4
dim_q = dim_k = 4
atten_single = SelfAttention(emb_size, dim_q, dim_k)

for k, v in atten_single.named_parameters():
    # print(k, v.shape) # uncomment this to see the weight shape
    v.data.copy_(torch.linspace(-1.4, 1.3, steps=v.numel()).reshape(*v.shape))

query = torch.linspace(-0.4, 0.6, steps=N * K * M, requires_grad=True).reshape(
    N, K, M
)  # **to_double_cuda
key = torch.linspace(-0.8, 0.5, steps=N * K * M, requires_grad=True).reshape(
    N, K, M
)  # **to_double_cuda
value = torch.linspace(-0.3, 0.8, steps=N * K * M, requires_grad=True).reshape(
    N, K, M
)  # *to_double_cuda

query.retain_grad()
key.retain_grad()
value.retain_grad()

y_expected = torch.tensor(
    [
        [
            [-1.10382, -0.37219, 0.35944, 1.09108],
            [-1.45792, -0.50067, 0.45658, 1.41384],
            [-1.74349, -0.60428, 0.53493, 1.67414],
            [-1.92584, -0.67044, 0.58495, 1.84035],
        ],
        [
            [-4.59671, -1.63952, 1.31767, 4.27486],
            [-4.65586, -1.66098, 1.33390, 4.32877],
            [-4.69005, -1.67339, 1.34328, 4.35994],
            [-4.71039, -1.68077, 1.34886, 4.37848],
        ],
    ]
)

dy_expected = torch.tensor(
    [
        [
            [-0.09084, -0.08961, -0.08838, -0.08715],
            [0.69305, 0.68366, 0.67426, 0.66487],
            [-0.88989, -0.87783, -0.86576, -0.85370],
            [0.25859, 0.25509, 0.25158, 0.24808],
        ],
        [
            [-0.05360, -0.05287, -0.05214, -0.05142],
            [0.11627, 0.11470, 0.11312, 0.11154],
            [-0.01048, -0.01034, -0.01019, -0.01005],
            [-0.03908, -0.03855, -0.03802, -0.03749],
        ],
    ]
)

y = atten_single(query, key, value)
dy = torch.randn(*y.shape)  # , **to_double_cuda

y.backward(dy)
query_grad = query.grad

print("SelfAttention error: ", rel_error(y_expected, y))
print("SelfAttention grad error: ", rel_error(dy_expected, query_grad))

SelfAttention error:  5.282987963847609e-07
SelfAttention grad error:  2.474069076879365e-06


We have implemented the `SelfAttention` block, which brings us very close to implementing `MultiHeadAttention`. We will now see that multi-head attention can be achieved by manipulating the shapes of the input tensors based on the number of heads. Here, we design a network that runs multiple `SelfAttention` blocks on the same input and concatenates their output tensors into a single output. This is not the implementation used in practice, as it forces you to initialize multiple layers, but we use it here for simplicity. Implement the `MultiHeadAttention` block in the `transformers_sentiment_analysis.py` file by using the `SelfAttention` block.
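
The "several single heads, then concatenate" idea can be sketched without PyTorch. The sketch assumes each head has its own query, key, and value projection matrices and that a final matrix `w_out` projects the concatenated features back; all names here (`matmul`, `attention`, `multi_head_sketch`) are made up for this example and are not the assignment's API.

```python
import math


def matmul(a, b):
    """Multiply nested-list matrices a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention for one sequence (K x d inputs)."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose: d x K
    scores = matmul(q, k_t)  # K x K
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_sketch(query, key, value, heads, w_out):
    """heads is a list of (w_q, w_k, w_v) matrices, one triple per head."""
    per_head = [
        attention(matmul(query, w_q), matmul(key, w_k), matmul(value, w_v))
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the head outputs along the feature dimension.
    concat = [sum((head[i] for head in per_head), []) for i in range(len(query))]
    # Project the concatenated features back to the output size.
    return matmul(concat, w_out)


identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]
heads = [(identity, identity, identity), (half, half, half)]
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]  # (2 heads * 2) x 2
x = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y = multi_head_sketch(x, x, x, heads, w_out)
print(len(y), len(y[0]))  # 3 2
```

The output keeps the sequence length of the input (3 rows), while `w_out` decides the final feature size.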

Run the following cells to test your `MultiHeadAttention` layer. As with `SelfAttention`, we use the PyTorch autograd API to test the backward pass. You should expect error values below 1e-5.

In [None]:
from transformers_sentiment_analysis import MultiHeadAttention

In [None]:
reset_seed(0)
N = 2
num_heads = 2
K = 4
M = inp_emb_size = 4
out_emb_size = 8
atten_multihead = MultiHeadAttention(num_heads, inp_emb_size, out_emb_size)

for k, v in atten_multihead.named_parameters():
    # print(k, v.shape) # uncomment this to see the weight shape
    v.data.copy_(torch.linspace(-1.4, 1.3, steps=v.numel()).reshape(*v.shape))

query = torch.linspace(-0.4, 0.6, steps=N * K * M, requires_grad=True).reshape(
    N, K, M
)  # **to_double_cuda
key = torch.linspace(-0.8, 0.5, steps=N * K * M, requires_grad=True).reshape(
    N, K, M
)  # **to_double_cuda
value = torch.linspace(-0.3, 0.8, steps=N * K * M, requires_grad=True).reshape(
    N, K, M
)  # *to_double_cuda

query.retain_grad()
key.retain_grad()
value.retain_grad()

y_expected = torch.tensor(
    [
        [
            [-0.23104, 0.50132, 1.23367, 1.96603],
            [0.68324, 1.17869, 1.67413, 2.16958],
            [1.40236, 1.71147, 2.02058, 2.32969],
            [1.77330, 1.98629, 2.19928, 2.41227],
        ],
        [
            [6.74946, 5.67302, 4.59659, 3.52015],
            [6.82813, 5.73131, 4.63449, 3.53767],
            [6.86686, 5.76001, 4.65315, 3.54630],
            [6.88665, 5.77466, 4.66268, 3.55070],
        ],
    ]
)
dy_expected = torch.tensor(
    [[[ 0.56268,  0.55889,  0.55510,  0.55131],
         [ 0.43286,  0.42994,  0.42702,  0.42411],
         [ 2.29865,  2.28316,  2.26767,  2.25218],
         [ 0.49172,  0.48841,  0.48509,  0.48178]],

        [[ 0.25083,  0.24914,  0.24745,  0.24576],
         [ 0.14949,  0.14849,  0.14748,  0.14647],
         [-0.03105, -0.03084, -0.03063, -0.03043],
         [-0.02082, -0.02068, -0.02054, -0.02040]]]
)

y = atten_multihead(query, key, value)
dy = torch.randn(*y.shape)  # , **to_double_cuda

y.backward(dy)
query_grad = query.grad
print("MultiHeadAttention error: ", rel_error(y_expected, y))
print("MultiHeadAttention grad error: ", rel_error(dy_expected, query_grad))

MultiHeadAttention error:  5.366163452092416e-07
MultiHeadAttention grad error:  1.2122404599657782e-06


### LayerNormalization

We implemented BatchNorm while working with CNNs. One of the problems with BatchNorm is its dependency on the complete batch, which might not give good results when the batch size is small. Ba et al. proposed `LayerNormalization`, which addresses this problem and has become a standard in sequence-to-sequence tasks. In this section, we will implement `LayerNormalization`. Another nice quality of `LayerNormalization` is that, because it depends only on each individual time step (each element of the sequence), it can be parallelized, and it behaves the same way at test time as during training, which makes it simpler to implement. Again, you only have to implement the forward pass; the backward pass will be taken care of by PyTorch autograd. Implement the `LayerNormalization` class in `transformers_sentiment_analysis.py`. You should expect errors below 1e-5.
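
As a rough illustration of the forward pass, the sketch below normalizes each row over its features in plain Python. The name `layer_norm_sketch`, the use of the population variance, and the value and placement of `eps` are assumptions made for this example; the class you implement may handle them differently.

```python
import math


def layer_norm_sketch(x, gamma=None, beta=None, eps=1e-10):
    """Normalize each row of x (N x D nested lists) over its D features."""
    out = []
    for row in x:
        d = len(row)
        g = gamma if gamma is not None else [1.0] * d  # learnable scale
        b = beta if beta is not None else [0.0] * d  # learnable shift
        mean = sum(row) / d
        var = sum((v - mean) ** 2 for v in row) / d  # population variance
        std = math.sqrt(var + eps)
        out.append([g[i] * (row[i] - mean) / std + b[i] for i in range(d)])
    return out


N, K = 2, 4
flat = [-0.4 + i * (1.0 / (N * K - 1)) for i in range(N * K)]
inp = [flat[r * K:(r + 1) * K] for r in range(N)]
y = layer_norm_sketch(inp)
print([round(v, 5) for v in y[0]])
```

Each row is normalized independently of the other rows in the batch, which is the property that separates this from BatchNorm. With the same inputs as the test cell, the printed row should be close to the first row of `y_expected`.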

In [None]:
from transformers_sentiment_analysis import LayerNormalization

In [None]:
reset_seed(0)
N = 2
K = 4
norm = LayerNormalization(K)
inp = torch.linspace(-0.4, 0.6, steps=N * K, requires_grad=True).reshape(N, K)

inp.retain_grad()
y = norm(inp)

y_expected = torch.tensor(
    [[-1.34164, -0.44721, 0.44721, 1.34164], [-1.34164, -0.44721, 0.44721, 1.34164]]
)

dy_expected = torch.tensor(
    [[  5.70524,  -2.77289, -11.56993,   8.63758],
        [  2.26242,  -4.44330,   2.09933,   0.08154]]
)

dy = torch.randn(*y.shape)
y.backward(dy)
inp_grad = inp.grad

print("LayerNormalization error: ", rel_error(y_expected, y))
print("LayerNormalization grad error: ", rel_error(dy_expected, inp_grad))

LayerNormalization error:  1.3772273765080196e-06
LayerNormalization grad error:  2.0542348921649034e-07


### FeedForward Block

In the image below we have highlighted the parts where the FeedForward block is used.
<img src="https://drive.google.com/uc?export=view&id=1WCNACnI-Q6OfU3ngjIMCbNzb1sbFnCgP" alt="Layer_norm" width="80%">

Next, we will implement the `FeedForward` block. It is used in both the Encoder and Decoder networks of the Transformer and consists of stacked linear (MLP) layers with a ReLU nonlinearity. In the overall architecture, the output of `MultiHeadAttention` is fed into the `FeedForward` block. Implement the `FeedForwardBlock` inside `transformers_sentiment_analysis.py` and execute the following cells to check your implementation. You should expect errors below 1e-5.
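
A minimal sketch of this block, assuming the two-layer form from the original paper (expand to a hidden width, apply ReLU, project back). The helper names and the hand-picked weights are made up for this example; the weights are chosen so that the block reproduces its input, which makes the result easy to check by hand.

```python
def linear(x, weight, bias):
    """Apply y = x W^T + b row by row; weight is (out x in), as in nn.Linear."""
    return [
        [sum(w * v for w, v in zip(w_row, row)) + b for w_row, b in zip(weight, bias)]
        for row in x
    ]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def feed_forward_sketch(x, w1, b1, w2, b2):
    """Expand to the hidden width, apply ReLU, then project back."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


# emb_size = 2 and hidden size = 4 (twice the embedding size, as in the test cell).
w1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b2 = [0.0, 0.0]
x = [[0.5, -2.0], [-1.5, 3.0]]
print(feed_forward_sketch(x, w1, b1, w2, b2))  # [[0.5, -2.0], [-1.5, 3.0]]
```

The input and output shapes match, as required of every block in this notebook.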

In [None]:
from transformers_sentiment_analysis import FeedForwardBlock

In [None]:
reset_seed(0)
N = 2
K = 4
M = emb_size = 4

ff_block = FeedForwardBlock(emb_size, 2 * emb_size)

for k, v in ff_block.named_parameters():
    v.data.copy_(torch.linspace(-1.4, 1.3, steps=v.numel()).reshape(*v.shape))

inp = torch.linspace(-0.4, 0.6, steps=N * K, requires_grad=True).reshape(
    N, K
)
inp.retain_grad()
y = ff_block(inp)

y_expected = torch.tensor(
    [[-2.46161, -0.71662, 1.02838, 2.77337], [-7.56084, -1.69557, 4.16970, 10.03497]]
)

dy_expected = torch.tensor(
    [[0.55105, 0.68884, 0.82662, 0.96441], [0.30734, 0.31821, 0.32908, 0.33996]]
)

dy = torch.randn(*y.shape)
y.backward(dy)
inp_grad = inp.grad

print("FeedForwardBlock error: ", rel_error(y_expected, y))
print("FeedForwardBlock grad error: ", rel_error(dy_expected, inp_grad))

FeedForwardBlock error:  2.1976864847460601e-07
FeedForwardBlock grad error:  2.302209886634859e-06


Now, if you look back at the original paper, Attention is all you need, we are almost done with the building blocks of an encoder Transformer. What's left is:

- Encapsulating the building blocks into Encoder Block
- Handling the input data preprocessing and positional encoding.

We will first implement the Encoder Block. The positional encoding is a non-learnable embedding, so we can treat it as a preprocessing step in our DataLoader.

In the figure below we have highlighted the encoder block in a Transformer. Notice that it is built from all the components we have already implemented. We just have to be careful about
the residual connections in the various blocks.


As shown in the figure above, the encoder block takes three tensors as input: query, key, and value. In the encoder these all come from the same input sequence, which is why the test below passes a single tensor, `enc_seq_inp`. Run the cell below to check your implementation of the `EncoderBlock`. You should expect errors below 1e-5.
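
The wiring of the residual connections can be sketched with the sub-layers passed in as callables. This is a structural sketch under assumptions: the function and argument names are made up, and dropout is applied after each normalization here, whereas the original paper applies it to each sub-layer output before the residual add. Your `EncoderBlock` may place it differently.

```python
def add(a, b):
    """Element-wise sum of two K x M nested lists (the residual connection)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def encoder_block_sketch(x, attention, feed_forward, norm1, norm2, dropout):
    """Every argument after x is a callable standing in for a sub-layer."""
    # Self-attention: the same tensor plays query, key, and value.
    out1 = dropout(norm1(add(x, attention(x, x, x))))
    # Position-wise feed-forward with its own residual connection.
    out2 = dropout(norm2(add(out1, feed_forward(out1))))
    return out2


def same(t):
    """Stand-in for normalization and dropout that leaves the tensor unchanged."""
    return t


# Trivial stand-ins so the data flow is easy to follow by hand.
result = encoder_block_sketch(
    [[1.0, 2.0]],
    attention=lambda q, k, v: v,  # returns the values unchanged
    feed_forward=lambda h: [[0.0] * len(row) for row in h],  # contributes nothing
    norm1=same,
    norm2=same,
    dropout=same,
)
print(result)  # [[2.0, 4.0]]
```

With these stand-ins the first residual connection doubles the input and the second adds zero, so the doubling is visible in the output.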

In [None]:
from transformers_sentiment_analysis import EncoderBlock

In [None]:
reset_seed(0)
N = 2
num_heads = 2
emb_dim = K = 4
feedforward_dim = 8
M = inp_emb_size = 4
out_emb_size = 8
dropout = 0.2

enc_seq_inp = torch.linspace(-0.4, 0.6, steps=N * K * M, requires_grad=True).reshape(
    N, K, M
)  # **to_double_cuda

enc_block = EncoderBlock(num_heads, emb_dim, feedforward_dim, dropout)

for k, v in enc_block.named_parameters():
    # print(k, v.shape) # uncomment this to see the weight shape
    v.data.copy_(torch.linspace(-1.4, 1.3, steps=v.numel()).reshape(*v.shape))

encoder_out1_expected = torch.tensor(
    [[[ 0.00000, -0.31357,  0.69126,  0.00000],
         [ 0.42630, -0.25859,  0.72412,  3.87013],
         [ 0.00000, -0.31357,  0.69126,  3.89884],
         [ 0.47986, -0.30568,  0.69082,  3.90563]],

        [[ 0.00000, -0.31641,  0.69000,  3.89921],
         [ 0.47986, -0.30568,  0.69082,  3.90563],
         [ 0.47986, -0.30568,  0.69082,  3.90563],
         [ 0.51781, -0.30853,  0.71598,  3.85171]]]
)
encoder_out1 = enc_block(enc_seq_inp)
print("EncoderBlock error 1: ", rel_error(encoder_out1, encoder_out1_expected))


N = 2
num_heads = 1
emb_dim = K = 4
feedforward_dim = 8
M = inp_emb_size = 4
out_emb_size = 8
dropout = 0.2

enc_seq_inp = torch.linspace(-0.4, 0.6, steps=N * K * M, requires_grad=True).reshape(
    N, K, M
)  # **to_double_cuda

enc_block = EncoderBlock(num_heads, emb_dim, feedforward_dim, dropout)

for k, v in enc_block.named_parameters():
    # print(k, v.shape) # uncomment this to see the weight shape
    v.data.copy_(torch.linspace(-1.4, 1.3, steps=v.numel()).reshape(*v.shape))

encoder_out2_expected = torch.tensor(
    [[[ 0.42630, -0.00000,  0.72412,  3.87013],
         [ 0.49614, -0.31357,  0.00000,  3.89884],
         [ 0.47986, -0.30568,  0.69082,  0.00000],
         [ 0.51654, -0.32455,  0.69035,  3.89216]],

        [[ 0.47986, -0.30568,  0.69082,  0.00000],
         [ 0.49614, -0.31357,  0.69126,  3.89884],
         [ 0.00000, -0.30354,  0.76272,  3.75311],
         [ 0.49614, -0.31357,  0.69126,  3.89884]]]
)
encoder_out2 = enc_block(enc_seq_inp)
print("EncoderBlock error 2: ", rel_error(encoder_out2, encoder_out2_expected))

EncoderBlock error 1:  5.799257304378232e-07
EncoderBlock error 2:  6.26799449492745e-07


Great! You're almost done with the implementation of the Transformer model.

Let's finally implement the decoder block, now that we have all the required tools. Fill in the `__init__` function and the forward pass of the `DecoderBlock` inside `transformers_sentiment_analysis.py`. Run the following cells to check your implementation of the `DecoderBlock`. You should expect errors below 1e-5.

Based on the implementation of `EncoderBlock` and `DecoderBlock`, we have implemented the `Encoder` and `Decoder` networks for you in `transformers_sentiment_analysis.py`. You should be able to understand the inputs and outputs of these Encoder and Decoder blocks. Implement the Transformer block inside `transformers_sentiment_analysis.py` using these networks.

## Part III: Data loader

In this part, we will create the final data loader for the task, which can be used to train the Transformer model. This consists of two things:

- Implement Positional Encoding
- Create a dataloader using the `prepocess_input_sequence` function that we created in Part I.

Let's start by implementing the Positional Encoding for the input. Positional encodings make the Transformer aware of the position of each element in a sequence. They are usually added to the input and hence should have the same shape as the input. As they are not learnable, they remain constant throughout the training process. For this reason, we can treat positional encoding as a preprocessing step on the input. Our strategy here is to implement a positional encoding function and use it later when creating the DataLoader for the toy dataset.

Let's look at the simplest kind of positional encoding: for a sequence of length K, assign the nth element in the sequence a value of n/K, where n starts from 0. Implement `position_encoding_simple` inside `transformers_sentiment_analysis.py`. You should expect an error less than 1e-9 here.
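
The n/K rule can be written out directly. The sketch below uses nested Python lists instead of tensors, and the function name `position_encoding_simple_sketch` is made up for this example; the leading dimension of size 1 mirrors the shape of `y_expected` in the test cell.

```python
def position_encoding_simple_sketch(K, M):
    """Return a 1 x K x M nested list where position n holds n / K in every feature."""
    return [[[n / K] * M for n in range(K)]]


y = position_encoding_simple_sketch(4, 4)
print(y[0][1])  # [0.25, 0.25, 0.25, 0.25]
```

Every feature of a given position carries the same value, so the encoding only tells the model where an element sits in the sequence.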

### Simple positional encoding

In [None]:
from transformers_sentiment_analysis import position_encoding_simple

reset_seed(0)
K = 4
M = emb_size = 4

y = position_encoding_simple(K, M)
y_expected = torch.tensor(
    [
        [
            [0.00000, 0.00000, 0.00000, 0.00000],
            [0.25000, 0.25000, 0.25000, 0.25000],
            [0.50000, 0.50000, 0.50000, 0.50000],
            [0.75000, 0.75000, 0.75000, 0.75000],
        ]
    ]
)

print("position_encoding_simple error: ", rel_error(y, y_expected))

K = 5
M = emb_size = 3


y = position_encoding_simple(K, M)
y_expected = torch.tensor(
    [
        [
            [0.00000, 0.00000, 0.00000],
            [0.20000, 0.20000, 0.20000],
            [0.40000, 0.40000, 0.40000],
            [0.60000, 0.60000, 0.60000],
            [0.80000, 0.80000, 0.80000],
        ]
    ]
)
print("position_encoding_simple error: ", rel_error(y, y_expected))

position_encoding_simple error:  0.0
position_encoding_simple error:  0.0


### Sentiment Analysis With Above Model
Get, preprocess, and load data with Hugging Face.

- Use `AutoTokenizer` and `DataCollatorWithPadding` from the Hugging Face library to tokenize the data and generate the padding mask.
- Load the data with a torch `DataLoader`.
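
To make the padding mask concrete, here is a plain-Python sketch of what padding a batch to its longest sequence looks like. It only imitates the idea and is not the Hugging Face implementation: the function `pad_batch` is made up for this example, the token ids between `[CLS]` (101) and `[SEP]` (102) are invented, and right padding with id 0 matches the tokenizer settings printed further below.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest one and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = longest - len(seq)
        input_ids.append(seq + [pad_id] * padding)  # right padding
        attention_mask.append([1] * len(seq) + [0] * padding)  # 1 = real token
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 1188, 102], [101, 1188, 1110, 1363, 102]])
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The zeros in the mask mark the padded positions, which the attention layers should ignore.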

In [None]:
! pip install transformers datasets

In [None]:
from transformers import AutoTokenizer, DataCollatorWithPadding

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')
tokenizer

DistilBertTokenizerFast(name_or_path='distilbert-base-cased', vocab_size=28996, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}

In [None]:
from datasets import load_dataset
raw_datasets = load_dataset("glue", "sst2")

In [None]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx'],
        num_rows: 1821
    })
})

In [None]:

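def tokenize_fn(batch):
  # Truncate to the tokenizer's maximum length; padding is deferred to the collator
  return tokenizer(batch['sentence'], truncation=True)

# Map the tokenize function over every split of the dataset
tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)
# Pad each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)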


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['sentence', 'label', 'idx', 'input_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(["sentence", "idx"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 67349
    })
    validation: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 872
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 1821
    })
})

In [None]:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    batch_size=8,
    collate_fn=data_collator
)
valid_loader = DataLoader(
    tokenized_datasets["validation"],
    batch_size=8,
    collate_fn=data_collator
)
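With `batch_size=8` and the default `drop_last=False`, the last training batch is smaller than the others. The short sketch below works out how many batches each loader yields per epoch, using the split sizes printed above; the helper name `num_batches` is hypothetical and is not part of PyTorch.

```python
import math

def num_batches(num_examples, batch_size, drop_last=False):
    """Batches per epoch for a map-style dataset (hypothetical helper)."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)

batch_size = 8
train_batches = num_batches(67349, batch_size)   # SST-2 train split
valid_batches = num_batches(872, batch_size)     # SST-2 validation split

# Size of the final, partially filled training batch
last_train_batch = 67349 - (train_batches - 1) * batch_size

print(train_batches, valid_batches, last_train_batch)
```

Because the final batch holds fewer examples, an epoch average should weight each batch loss by its batch size, which is what the training loop further down does.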

In [None]:
# check how it works
for batch in train_loader:
  for k, v in batch.items():
    print("k:", k, "v.shape:", v.shape)
  break


k: labels v.shape: torch.Size([8])
k: input_ids v.shape: torch.Size([8, 26])
k: attention_mask v.shape: torch.Size([8, 26])
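The sequence length of 26 is not fixed: the collator pads each batch only up to its longest sentence. The following is a minimal pure-Python sketch of that idea, not the actual `DataCollatorWithPadding` implementation. The function name `pad_batch` is hypothetical; the pad ID 0 and right-side padding follow the tokenizer printout above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token ID lists to the longest sequence in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

# Two tokenized sentences of different lengths ([CLS]=101, [SEP]=102)
batch = [
    [101, 2265, 6608, 5558, 102],
    [101, 1482, 117, 170, 1762, 8124, 6066, 9688, 1111, 13663, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(len(input_ids[0]), len(input_ids[1]))
print(attention_mask[0])
```

Padding per batch rather than to the global maximum of 512 tokens keeps the tensors small, which matters because attention cost grows quadratically with sequence length.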


In [None]:
batch.items()

dict_items([('labels', tensor([0, 1, 0, 1, 1, 1, 1, 1])), ('input_ids', tensor([[  101,  1103,  1642,  1110, 24017,   117,  1103, 13948,  1132,  4701,
          5387,  2879, 14550,   117,  1105,  1103,  9688,  1114,   187, 19429,
          1200,  1110, 23609, 18537,   119,   102],
        [  101,  2265,  6608,  5558,   102,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  1115,  1122, 17462,  1116,  1315,  1242,  3073,  8057, 27647,
         11603,  1642,  3050,  1154,  1103,  1919,  1159,   102,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,  1482,   117,   170,  1762,  8124,  6066,  9688,  1111, 13663,
           102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0],
        [  101,   170,  2426, 14524, 24836,  1578,   117,   172, 22761,  1746,
         

In [None]:
from transformers_sentiment_analysis import Transformer_encoder

num_heads = 4                # number of heads in MultiHeadAttention
emb_dim = 32                 # encoder (embedding) dimension; BERT models typically use 768
feedforward_dim = 64         # FeedForward layer dimension (twice emb_dim)
dropout = 0.5                # dropout rate
num_enc_layers = 2           # number of encoder layers
vocab_len = 28996            # vocabulary size of the BERT tokenizer
n_classes = 2                # number of classes (e.g., positive or negative)

checkpoint = 'distilbert-base-cased'
model = Transformer_encoder(num_heads, emb_dim, feedforward_dim, dropout, num_enc_layers, vocab_len, n_classes)
# model = Transformer_encoder()
model.to('cuda')

Transformer_encoder(
  (emb_layer): Embedding(28996, 32)
  (enc_layers): ModuleList(
    (0-1): 2 x EncoderBlock(
      (MultiHeadBlock): MultiHeadAttention(
        (head): ModuleList(
          (0-3): 4 x SelfAttention(
            (q): Linear(in_features=32, out_features=8, bias=True)
            (k): Linear(in_features=32, out_features=8, bias=True)
            (v): Linear(in_features=32, out_features=8, bias=True)
          )
        )
        (linear): Linear(in_features=32, out_features=32, bias=True)
      )
      (norm1): LayerNormalization()
      (norm2): LayerNormalization()
      (ffn): FeedForwardBlock(
        (linear1): Linear(in_features=32, out_features=64, bias=True)
        (r): ReLU()
        (linear2): Linear(in_features=64, out_features=32, bias=True)
      )
      (dropout): Dropout(p=0.5, inplace=False)
    )
  )
  (fc_out): Linear(in_features=32, out_features=2, bias=True)
)
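The printed architecture is enough to estimate the parameter count by hand. The sketch below is an approximation under one stated assumption: `LayerNormalization` is a custom module whose parameters are not visible in the printout, so it is left out of the total. The helper `linear_params` is hypothetical.

```python
def linear_params(in_features, out_features, bias=True):
    """Weights plus optional bias of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

vocab_len, emb_dim, num_heads = 28996, 32, 4
feedforward_dim, num_enc_layers, n_classes = 64, 2, 2
head_dim = emb_dim // num_heads   # 8, matches out_features of q, k, v

embedding = vocab_len * emb_dim
# Each head has separate q, k and v projections
attention_heads = num_heads * 3 * linear_params(emb_dim, head_dim)
attention_out = linear_params(emb_dim, emb_dim)
ffn = (linear_params(emb_dim, feedforward_dim)
       + linear_params(feedforward_dim, emb_dim))
per_block = attention_heads + attention_out + ffn
classifier = linear_params(emb_dim, n_classes)

# Assumption: layer-norm parameters are excluded (not shown in the printout)
total = embedding + num_enc_layers * per_block + classifier
print(f"embedding: {embedding:,}  per block: {per_block:,}  total: {total:,}")
print(f"embedding share: {embedding / total:.1%}")
```

Under these assumptions the embedding table accounts for roughly 98% of the parameters, so the two encoder blocks themselves are very small.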

In [None]:
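import torch
import torch.nn as nn
criterion = nn.CrossEntropyLoss()   # expects raw logits and integer class labels
optimizer = torch.optim.Adam(model.parameters())  # default learning rate 1e-3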
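`nn.CrossEntropyLoss` applies a log-softmax to the logits and returns the negative log-probability of the true class. A useful reference point when reading the training losses is the loss of a model that cannot tell the two classes apart, which is ln 2 ≈ 0.693. The pure-Python sketch below illustrates this for a single example; `cross_entropy` is a hypothetical helper, not the PyTorch function.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target class (one example)."""
    m = max(logits)   # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

chance = cross_entropy([0.0, 0.0], target=1)      # equal logits for both classes
confident = cross_entropy([-2.0, 2.0], target=1)  # correct and confident
wrong = cross_entropy([2.0, -2.0], target=1)      # confident but wrong

print(f"chance: {chance:.4f}  confident: {confident:.4f}  wrong: {wrong:.4f}")
```

A loss clearly below 0.693 therefore indicates that the model has learned something beyond guessing.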

In [None]:
import numpy as np
from datetime import datetime
device = 'cuda'   # must match the device the model was moved to
# Training loop
def train(model, criterion, optimizer, train_loader, valid_loader, epochs):
  train_losses = np.zeros(epochs)
  test_losses = np.zeros(epochs)

  for it in range(epochs):
    model.train()   # Model in training mode
    t0 = datetime.now()
    train_loss = 0
    n_train = 0
    for batch in train_loader:
      # move data to GPU
      batch = {k: v.to(device) for k, v in batch.items()}

      # zero the parameter gradients
      optimizer.zero_grad()

      # Forward pass: only the token IDs are passed to the model
      # (the attention mask in the batch is not used here)
      outputs = model(batch['input_ids'])

      loss = criterion(outputs, batch['labels'])

      # Backward and optimize
      loss.backward()   # Compute Gradients (Back prop)
      optimizer.step()  # Update weights(GD/Adam)

      train_loss += loss.item()*batch['input_ids'].size(0)
      n_train += batch['input_ids'].size(0)

    # Get average train loss
    train_loss = train_loss / n_train

    # Evaluate the model at the end of each epoch
    model.eval()   # Model in evaluation mode (disables dropout)
    test_loss = 0
    n_test = 0
    with torch.no_grad():   # gradients are not needed for evaluation
      for batch in valid_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(batch['input_ids'])
        loss = criterion(outputs, batch['labels'])
        test_loss += loss.item()*batch['input_ids'].size(0)
        n_test += batch['input_ids'].size(0)
    test_loss = test_loss / n_test

    # Save losses
    train_losses[it] = train_loss
    test_losses[it] = test_loss

    dt = datetime.now() - t0
    print(f'Epoch {it+1}/{epochs}, Train Loss: {train_loss:.4f}, \
      Test Loss: {test_loss:.4f}, Duration: {dt}')

  return train_losses, test_losses

In [None]:
train(model, criterion,
      optimizer, train_loader, valid_loader, epochs=4)

Epoch 1/4, Train Loss: 0.6476,       Test Loss: 0.6017, Duration: 0:02:08.926843
Epoch 2/4, Train Loss: 0.4533,       Test Loss: 0.5277, Duration: 0:02:13.783611
Epoch 3/4, Train Loss: 0.3333,       Test Loss: 0.5102, Duration: 0:02:30.794903
Epoch 4/4, Train Loss: 0.2722,       Test Loss: 0.5618, Duration: 0:02:11.866288


(array([0.6476187 , 0.45325375, 0.33326618, 0.27221782]),
 array([0.60172961, 0.52766662, 0.51020182, 0.56180859]))
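
The returned arrays can be used to pick a stopping point. The sketch below takes the rounded losses from the log above and finds the epoch with the lowest validation loss, along with the gap between validation and training loss. It is a simple reading of these four numbers, not a full early-stopping implementation.

```python
# Losses copied from the training log above (rounded to 4 decimals)
train_losses = [0.6476, 0.4533, 0.3333, 0.2722]
valid_losses = [0.6017, 0.5277, 0.5102, 0.5618]

# Epoch (1-based) with the lowest validation loss
best_epoch = min(range(len(valid_losses)), key=valid_losses.__getitem__) + 1

# A growing gap between validation and training loss suggests overfitting
gaps = [round(v - t, 4) for t, v in zip(train_losses, valid_losses)]

print("best epoch:", best_epoch)
print("validation - train gap per epoch:", gaps)
```

Here the validation loss bottoms out at epoch 3 and rises at epoch 4 while the training loss keeps falling, which is the usual sign that the model has started to overfit.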