<img src="https://drive.google.com/uc?id=1dFgNX9iQUfmBOdmUN2-H8rPxL3SLXmxn" width="400"/>


---




### **REMINDER**: Friday's coursework 2 will be in the classroom (1.51 or 1.47, depending on whether you are an ACSE, EDSML, or GEMS student).

# **Transformers**

#### **Morning & afternoon contents/agenda**

1. What is a Transformer?

2. Applications and impact

3. Dissecting and implementing a Transformer:
  - Embeddings
  - Positional Encoding
  - Self-attention mechanism & multi-head attention
  - Dimensions, parallelisation and residual connections
  - Masked multi-head attention and other decoder differences
  - Regularisation and optimiser

4. Visualisation of attention maps

5. `torch.nn.Transformer`

\[**NOTE**: Vision Transformers (ViT) and multi-modal Transformers will be covered on Thursday\]

\\

#### **Learning outcomes**

1. Understand the transformer architecture based on the self-attention mechanism

2. Develop an intuition of what self-attention does by visualising attention maps

3. Implement the main components of a transformer in PyTorch

<br>

---

<br>

# WARNING

Transformers are difficult to understand in their entirety, and they require multiple reviews of their architecture and internal processing elements. I do not expect you to walk away from today's lecture with a perfect understanding of how they work down to the last detail.

\\

But I do expect you to understand:
- **What** the self-attention mechanism is and how it works.
- **What** the embeddings, the positional encodings, and the skip-connections used in a transformer are.
- **How** a forward pass through a transformer digests data to produce an output (for our particular implementation).

\\

I will **not** assess you on, or expect you to fully grasp, the following concepts:
- **Why** self-attention, positional encodings, skip-connections, label smoothing and other tricks make transformers work so well.
- **How** to train a transformer.
- **How** transformers are used in chatGPT, BERT, DALL·E 2, and other advanced networks.


## 1. What is a transformer?

<img src="https://drive.google.com/uc?export=view&id=1RHmk_SZSC08IGxQbuzUC7i5gWarGC-Yz" width="1000"/>


A transformer is a deep-learning model that uses a mechanism called self-attention to balance the importance of different parts of the input.


### VAEs, GANs, transformers, and many other architectures aim to understand and perhaps mimic how humans learn. Transformers try to capture the idea that humans do not learn sequentially, but access memories when they are relevant to the learning process.

### How does it work?

A transformer, like any other network, will take an input and produce an output. The way the output is generated differs significantly from other architectures we have seen in previous lectures. It uses a **self-attention mechanism** to focus on different parts of the input data, but it does not require sequential inputs (unlike RNNs or LSTMs).

<img src="https://drive.google.com/uc?export=view&id=14NBRyF8CZ0J_6F2ELo0wUesBN8I6gLZv" width="1000"/>

The inputs and outputs can be anything you want; they do not have to be text strings.

\\

\\

### History

Transformers were introduced by a team working at Google Brain. They published a paper called [*Attention is all you need*](https://arxiv.org/pdf/1706.03762.pdf)

\\



In [None]:
%%html
<iframe src="https://arxiv.org/pdf/1706.03762.pdf" width="1000" height="800">
</iframe>

Visual representation of the **self-attention mechanism** when using a transformer to generate text:

<img src="https://miro.medium.com/max/1422/0*ODlgeguKzjyzjuuJ.gif" width="700"/>

\\

and compared with a visual representation of how a **CNN-based NLP model** consumes and propagates input data:

<img src="https://lh3.googleusercontent.com/NzOntsH9hZRjMaKVfuEeyyRKZ9RVxBzahp3VqkxEjcq6c9IrOKYhGuOQFkOpV3WPphaLNNTscng4tN2HdZW12-dMJgOzoJ_X0MB1loWIYKGn4mcl0Hgn-hrMJ6g4TpI-vFLqiACz" width="700"/>

\\

and we have also seen that RNNs, LSTMs and GRUs consume input data sequentially and try to keep some record of what they have seen in **hidden-** or **cell-state vectors**.

<img src="https://miro.medium.com/max/1400/1*n-IgHZM5baBUjq0T7RYDBw.gif" width="700"/>

\\

### The main contribution of the Transformer architecture is that it dispenses with the recurrent cells and/or convolution layers by using self-attention.


One important difference between sequential methods and Transformers is that Transformers can parallelise inputs during training, so that they do not require sequential data processing.

This results in an increase in computational efficiency, and also it means that they can see **'into the future'** (even though in the gif above we are only using the decoder part of a pre-trained transformer, and we can't really see this there).


\\

---

\\

## 2. **Applications and Impact**

Transformers have revolutionised deep learning, and are perhaps one of the biggest breakthroughs in the field in recent years. They were originally created for NLP applications and have excelled at performing a variety of NLP tasks:
- translation
- text generation
- text comprehension
- etc

\\

The most prominent example nowadays is perhaps the recently released **chatGPT** (We will try it on Thursday).

#### [chatGPT](https://chat.openai.com/chat)

<img style="float: right;" src="https://openai.com/content/images/size/w1400/2022/11/ChatGPT.jpg" width="200"/>

\\

But they can also be used in other tasks such as computer vision (like video understanding, image classification, etc), as well as a wide range of other applications, like fraud detection.

\\

### High-impact Transformer models:


#### [GATO](https://www.deepmind.com/publications/a-generalist-agent)

<img style="float: right;" src="https://assets-global.website-files.com/621e749a546b7592125f38ed/627d13d743dc353a184da8d4_data_sequences.png" width="500"/>

- I haven't even had time to see what this does properly.
- Multi-modal, multi-task, multi-embodiment generalist policy.
- The same network with **the same weights** can play Atari, caption images, chat, stack blocks with a real robot arm and much more.
- Decides based on its context whether to output text, joint torques, button presses, or other tokens.
- It uses Transformers by tokenising and embedding the different data input types appropriately (I think).

\\

#### [AlphaFold](https://alphafold.ebi.ac.uk/)

<img style="float: right;" src="https://alphafold.ebi.ac.uk/assets/img/Q8I3H7_1.png" width="300"/>

- Predicts a protein's 3D structure from its amino-acid sequence
- Beats all previous methods on this particular task by a large margin
- The first version (AlphaFold 1) has 21 million parameters

\\

#### [GPT-3](https://openai.com/blog/gpt-3-apps/)

#### ***they do not seem to have a logo anymore...***


- Language model that produces high-quality text (https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3)
- It has 175 billion parameters (350GB memory required)
- Training it would take 1024 A100 GPUs (my colleagues and I have spent almost £200k to buy just 8 of them) working at full capacity for 34 days.
- Estimated cost of training: $10-20 million (that's compute cost only!)

\\

#### [DALL·E 2](https://openai.com/dall-e-2/)

<img style="float: right;" src="https://cdn.openai.com/dall-e-2/demos/text2im/soup/portal/digital_art/0.jpg" width="300"/>


- Uses a combination of CLIP (Contrastive Language-Image Pre-training) and Transformers to generate images from text. It is a bit more complicated than that really.
- Has 3.5 billion parameters (DALL·E had 12 billion, so it is better and more efficient).

\\

#### [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
**B**idirectional **E**ncoder **R**epresentations from **T**ransformers

<img style="float: right;" src="https://miro.medium.com/max/1220/1*PD-PyafBI-GUXT45MVwefQ.png" width="300"/>

- Language modelling and next sentence prediction
- Adopted by Google's search engine
- 110 million parameters

#### [Perceiver](https://arxiv.org/abs/2103.03206)

<img style="float: right;" src="https://miro.medium.com/max/1400/1*hVHk3wJY5FZ6zg6zvBKxJA.png" width="300"/>

- Perceivers are a Transformer-based architecture that accepts any inputs without any modality-specific elements.
- Multi-tasking capabilities: can output different modalities and perform a wide range of tasks.

\\

### **Applications in my research**:

My group and I are exploring ways to apply these Transformer (and other state-of-the-art) ideas to solve scientific problems related to PDEs, image reconstruction, new loss functions and generative models for DL, etc.

<img style="float: right;" src="https://drive.google.com/uc?id=194uJja4G6L3ghvcvuk89OHg8_hdY_atr" width="1000"/>

\\

---

\\

## 3. **Dissecting Transformers**

A high-level analysis reveals that Transformers are composed of an encoder and a decoder:


<img src="https://drive.google.com/uc?export=view&id=12R5_hjYmB-phlqcV7rdfr2qbZlzCtOOC" width="1000"/>


Both encoder and decoder are actually a combination of a series of encoding and decoding layers respectively (6 in the [original paper](https://arxiv.org/pdf/1706.03762.pdf)).

<img src="https://drive.google.com/uc?export=view&id=1dhTxerpmzGbeToBDuubU3bQdabSWc7OJ" width="1000"/> | <img src="https://drive.google.com/uc?export=view&id=1-fBv1WcbADzeuw13BH4HbD9Z-C2GkMA5" width="1000"/>
-|-


\\

Each encoder/decoder is composed of a multi-head attention layer (sometimes masked in the decoder) followed by a fully-connected layer (feed-forward), plus some skip connections.

<img src="https://drive.google.com/uc?export=view&id=1qujnfANe7-qJSO9Oyw5T82jDirGmQIHf" width="500"/>


#### Understanding the data flow:

Here we will try to understand, at a high level, what the input data is and how the network produces outputs that are fed back into it, for this particular implementation of the Transformer architecture, which is designed to generate outputs based on an input. This auto-regressive processing is not specific to Transformers; most systems that generate sequential outputs use it.

Let's look at a very simple example:

<img src="https://drive.google.com/uc?export=view&id=1JjL4nwTPd_howu8Zi45yFH-HBUfRiSaD" width="1000"/>


1. The `Inputs` are the embedded words in **"convolutions are dead"**. We feed this to the encoder.

\\

2. The decoder also needs an input, which is called `Outputs (shifted right)` in the diagram. At the beginning we don't have any output from the decoder yet (`Output probabilities` in the diagram), so we feed it a special token that indicates that we are at the beginning of a sentence: `<SOS>`. This token also has an embedding.

\\

3. We run the Transformer (assume it is already trained), and we produce a vector of probabilities, `Output probabilities`, that has the size of the dictionary we are using, i.e. as many entries as words we want to consider using. This dictionary is indexed and has a corresponding embedding representation for every entry.

\\

4. We select one word from the `Output probabilities` vector. This can be as simple as picking the one with the maximum value. In practice, we can keep a few candidates with high probabilities and explore later on which one is best (this is called beam search, but we won't talk about it). To clarify: if we want to use a Transformer to generate new text using just the decoder, we can sample from the probabilities instead of taking the maximum. But for now, let's assume that we want to translate a sentence, and therefore we pick the maximum value of the `Output probabilities` vector. In our case this is **"les"**.

\\

5. We now feed this **"les"** word to the decoder from the bottom of the diagram in `Outputs (shifted right)` together with the `<SOS>`: `(<SOS>, "les")`

\\

6. Run through the Transformer again and repeat the process to get a vector of probabilities of which the one with highest value will correspond to the word **"convolucions"**.

\\

7. We repeat this process until the model predicts `<EOS>`, the end of sentence embedded token.

\\

\[**NOTE**: during the whole process, the input to the encoder is the whole sentence, not only the previous words, as we did with RNNs and LSTMs\]
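The loop described in steps 2 to 7 can be sketched in a few lines. This is only an illustration of greedy auto-regressive decoding: `greedy_decode`, `toy_model`, the tiny dictionary and the scripted probabilities are all hypothetical stand-ins for a trained Transformer, and the toy "translation" is deliberately cut short after the two words shown above.

```python
import torch

def greedy_decode(next_token_probs, sos_id, eos_id, max_len=20):
    """Greedy auto-regressive decoding.

    next_token_probs: hypothetical callable that maps the list of token ids
    generated so far to a 1-D tensor of probabilities over the dictionary.
    """
    tokens = [sos_id]                         # step 2: start with <SOS>
    for _ in range(max_len):
        probs = next_token_probs(tokens)      # step 3: run the model
        next_id = int(torch.argmax(probs))    # step 4: pick the most likely word
        tokens.append(next_id)                # step 5: feed it back in
        if next_id == eos_id:                 # step 7: stop at <EOS>
            break
    return tokens

# Toy dictionary and a scripted stand-in for a trained Transformer
vocab = ["<SOS>", "<EOS>", "les", "convolucions"]
script = [2, 3, 1]  # "les", "convolucions", "<EOS>" (truncated for illustration)

def toy_model(tokens):
    probs = torch.full((len(vocab),), 0.01)
    probs[script[len(tokens) - 1]] = 0.95
    return probs / probs.sum()                # make it a proper probability vector

decoded = greedy_decode(toy_model, sos_id=0, eos_id=1)
print([vocab[i] for i in decoded])
```

Replacing `torch.argmax` with sampling from `probs` would give the text-generation behaviour mentioned in step 4.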







#### We will follow [this](https://www.kaggle.com/code/arunmohan003/transformer-from-scratch-using-pytorch) implementation from Kaggle of the various components of the Transformer.

A few imports we will need:

In [None]:
import torch.nn as nn
import torch
import torch.nn.functional as F
import math,copy,re
import warnings
# import pandas as pd
import numpy as np
import seaborn as sns
import torchtext
import matplotlib.pyplot as plt
# warnings.simplefilter("ignore")
print(torch.__version__)



1.13.0+cu116


#### 3.1 **Embeddings**

<img src="https://drive.google.com/uc?export=view&id=1zl60-CCz6p9OVaNVcO_Ag6G0zPkfO9Zl" width="500"/>

We have seen embeddings in previous lectures (diffusion models, RNNs & LSTMs, and NLP).

They take an input that may not be in a suitable form for us to operate on (words, for example) and, after tokenising it, represent it in a form that is more suitable for performing operations:

<img src="https://miro.medium.com/max/1400/0*K5a1Ws_nsbEjhbYk.png" width="800"/>

\\

and these operations have some interpretation:

<img src="https://jalammar.github.io/images/word2vec/king-analogy-viz.png" width="600"/>

\\

So the first step is to add an embedding layer for both inputs and outputs:

In [None]:
class Embedding(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        """
        Args:
            vocab_size: size of vocabulary
            embed_dim: dimension of embeddings
        """
        super(Embedding, self).__init__()
        self.embed = ### define an embedding layer
    def forward(self, x):
        """
        Args:
            x: input vector
        Returns:
            out: embedding vector
        """
        out = ### embed inputs
        return out
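To see what an embedding layer does with tokenised input, here is a minimal usage sketch. It assumes `torch.nn.Embedding` as the lookup table; the five-word vocabulary and the small `embed_dim` are hypothetical values chosen only to keep the example readable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy vocabulary: each word is tokenised to an integer index
vocab = {"<SOS>": 0, "<EOS>": 1, "convolutions": 2, "are": 3, "dead": 4}
embed_dim = 8  # the original paper uses 512; 8 keeps things small

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embed_dim)

# A batch containing one sentence: (batch_size, seq_len)
token_ids = torch.tensor([[vocab[w] for w in ["convolutions", "are", "dead"]]])
vectors = embedding(token_ids)  # (batch_size, seq_len, embed_dim)

print(token_ids.shape, "->", vectors.shape)
```

Each token index simply selects one row of the trainable weight matrix `embedding.weight`.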

#### 3.2 **Positional encodings**

<img src="https://drive.google.com/uc?export=view&id=1q3qy8NvOW1ozICjLktNgLLufnREaG9SZ" width="500"/>

Positional encodings are an addition that introduces information about the ordering of the sequence. Without them, the self-attention mechanism would completely ignore the relative positions of the inputs, and would therefore perform very poorly.

Recurrent neural networks do not have this problem because they read data sequentially, and this structured input forces the ordering to be taken into account as the data is fed into the network to produce outputs. But recurrent neural networks are hindered because they naturally give less importance to parts of the input that occurred much earlier in the sequence.

Positional encodings are only added at the bottom of the encoder and decoder stacks.

The original transformer implementation introduced these positional encodings by adding sine and cosine functions to the embeddings. The positional encodings are calculated as:

<img src="https://drive.google.com/uc?export=view&id=1WI7aVpCDEclX3HnV6pBY-5F1Jb5roy4p" width="500"/>

In the original paper the positional encodings were not learned but imposed. It is also possible to learn these encodings during training (the authors reported no differences when training the positional encodings).

\\

<img src="https://jalammar.github.io/images/t/transformer_positional_encoding_vectors.png" width="800"/>

\\

The signals created with these sine and cosine functions can be interleaved or concatenated before adding them:

\\

<img src="https://jalammar.github.io/images/t/transformer_positional_encoding_large_example.png" height="200"/>

<img src="https://jalammar.github.io/images/t/attention-is-all-you-need-positional-encoding.png" height="200"/>

\\

There are other ways to combine the positional encodings with the embeddings of the input data. For example, you can concatenate the positional encoding to the embedding of the inputs to create a larger tensor.

I suspect more strategies will appear as we gain a better understanding of how to optimally encode positions in the inputs.

The implementation of the positional encoding is as follows:

In [None]:
# register buffer in Pytorch ->
# If you have parameters in your model, which should be saved and restored in the state_dict,
# but not trained by the optimizer, you should register them as buffers.
#
# a good discussion on the use of registering buffers: https://discuss.pytorch.org/t/what-is-the-difference-between-register-buffer-and-register-parameter-of-nn-module/32723
#

class PositionalEmbedding(nn.Module):
    def __init__(self,max_seq_len,embed_model_dim):
        """
        Args:
            max_seq_len: maximum length of input sequence
            embed_model_dim: dimension of embedding
        """
        super(PositionalEmbedding, self).__init__()
        self.embed_dim = embed_model_dim

        pe = torch.zeros(max_seq_len,self.embed_dim)
        for pos in range(max_seq_len):
            for i in range(0,self.embed_dim,2):  ### why do you think we take steps of 2 here??
                ### implement positional encoding formulas 
                ### 
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)


    def forward(self, x):
        """
        Args:
            x: input vector
        Returns:
            x: output
        """
      
        # make embeddings relatively larger
        x = x * math.sqrt(self.embed_dim)
        #add constant to embedding
        seq_len = x.size(1)
        # x = x + torch.autograd.Variable(self.pe[:,:seq_len], requires_grad=False) ### deprecated implementation
        x = x + self.pe[:,:seq_len].requires_grad_(False)  ### This is the correct modern implementation
        return x
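For reference, the formulas from the original paper are $PE_{(pos,\,2i)} = \sin\left(pos / 10000^{2i/d_{model}}\right)$ and $PE_{(pos,\,2i+1)} = \cos\left(pos / 10000^{2i/d_{model}}\right)$. The sketch below is one possible vectorised way of building that table, not the loop-based version the cell above asks for; the function name `sinusoidal_positional_encoding` is hypothetical and an even `embed_dim` is assumed.

```python
import math
import torch

def sinusoidal_positional_encoding(max_seq_len, embed_dim):
    """Return a (max_seq_len, embed_dim) table; embed_dim is assumed to be even."""
    position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)  # (max_seq_len, 1)
    # 1 / 10000^(2i / embed_dim) for every pair index i, computed in log space
    div_term = torch.exp(
        torch.arange(0, embed_dim, 2, dtype=torch.float32) * (-math.log(10000.0) / embed_dim)
    )  # (embed_dim / 2,)
    pe = torch.zeros(max_seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_seq_len=10, embed_dim=16)
print(pe.shape)
```

Each frequency fills a sine/cosine pair of neighbouring dimensions, which is why the dimensions are visited two at a time.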

#### 3.3 **Self-attention mechanism & multi-head attention**

Let's start by focusing on understanding one of the encoder layers:

<img src="https://drive.google.com/uc?export=view&id=1qujnfANe7-qJSO9Oyw5T82jDirGmQIHf" width="500"/>

and in particular on the self-attention mechanism:

<img src="https://drive.google.com/uc?export=view&id=1k5_KWliQJ4WmU7VsoGQZZ9jOM5MER_c_" width="800"/>

How do we go from the word *'Convolutions'* to the vector $\color{purple}{\bf z1}$??

\\

\\



#### ***Step 1***

Let's assume we have already embedded and positionally encoded our inputs (words in this case, but they could be images or other inputs) into vectors, so that the vector $\color{orange}{\bf x1}$ is ready to go into the self-attention layer:

<img src="https://drive.google.com/uc?export=view&id=1J55o4X3vU06HgcCEQ1j6UzEegO_mPcLu" width="800"/>


#### ***Step 2***

Generate a $\color{red}{\bf query}$, a $\color{green}{\bf key}$, and a $\color{blue}{\bf value}$ vector from the input $\color{orange}{\bf x1}$:

\\


<img src="https://drive.google.com/uc?export=view&id=10WNT0MI_Vduqh9CrwE22jKxacEOeOdYg" width="1000"/>


Where the matrices $\color{red}{W^Q}$, $\color{green}{W^K}$, and $\color{blue}{W^V}$ are composed of trainable parameters.

\\

What are these $\color{red}{\bf query}$, $\color{green}{\bf key}$, and $\color{blue}{\bf value}$ vectors? A good explanation from [stack exchange](https://stats.stackexchange.com/questions/421935/what-exactly-are-keys-queries-and-values-in-attention-mechanisms):



*The key/value/query concepts come from retrieval systems. For example, when you type a query to search for some video on Youtube, the search engine will map your query against a set of keys (video title, description, etc.) associated with candidate videos in the database, then present you the best matched videos (values).*

#### ***Step 3***

Calculate $\color{olive}{\bf scores}$ and apply softmax to a scaled dot product between the query of the current word and **all the keys in the input sentence (it looks at what's going on in the input sequence as a whole!)**. Then scale the $\color{blue}{\bf values}$ for all the words in the input sentence using the softmax output. Here we show the value vector transposed with respect to how it was drawn before, for visualisation purposes only (it was a horizontal vector and now we see it as a vertical vector; this is not relevant to anything other than the visualisation).



<img src="https://drive.google.com/uc?export=view&id=14ltTGJQ-H-kn_0sBTV-FxflV6okZTudr" width="1000"/>

\\

An example of a self-attention matrix (softmaxed scores):

<img src="https://miro.medium.com/max/1556/0*qSKUxncfQVhUJeCr.png" width="400"/>

[image source](https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0)

\\



#### ***Step 4***

All the scaled values $\color{blue}{\bf v'}$ are added together to obtain $\color{purple}{\bf z1}$:


<img src="https://drive.google.com/uc?export=view&id=1aCq95c2P07lnUwkVUqKgVADtysf5ihSU" width="1000"/>

### **Question**:
How many value vectors do we have?

\\

When you read the original paper [*Attention is all you need*](https://arxiv.org/pdf/1706.03762.pdf), as I am sure you will (you should!), you will see this diagram to explain this set of operations we just saw:

<img src="https://production-media.paperswithcode.com/methods/SCALDE.png" width="300"/>

\[We will have a look at the `Mask(opt.)` a bit later when we see the decoder.\]

The authors refer to this self-attention implementation as scaled dot-product attention:

<img src="https://drive.google.com/uc?export=view&id=1WdultedjMzqaSPLlyEXL5GGbwn-do-Mr" width="500"/>

which describes in one formula all the operations that we have seen in the figures above. Dividing by $\sqrt{d_k}$ helps stabilise the gradients during backpropagation.
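Written out, scaled dot-product attention is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(QK^T / \sqrt{d_k}\right)V$. A minimal sketch of Steps 3 and 4 for a whole sentence at once is given below; the function name and the random toy inputs (3 words, 4-dimensional vectors) are assumptions made for illustration, and no mask is applied.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q: (..., len_q, d_k), k: (..., len_k, d_k), v: (..., len_k, d_v)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (..., len_q, len_k)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    z = weights @ v                                    # weighted sum of the values
    return z, weights

torch.manual_seed(0)
q = torch.randn(3, 4)  # one query per word
k = torch.randn(3, 4)  # one key per word
v = torch.randn(3, 4)  # one value per word
z, weights = scaled_dot_product_attention(q, k, v)
print(z.shape, weights.sum(dim=-1))
```

Row `i` of `weights` is the attention that word `i` pays to every word in the sentence, and row `i` of `z` is the corresponding output vector.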


#### ***Step 5***

When the Transformer was introduced, they combined this self-attention mechanism with something called **multi-head attention**.

Before we jump into it, it may be worth explaining the rationale for this multi-headed attention approach. There are many ways of doing multi-headed attention, but the authors chose to reduce the size of each of the key, query, and value vectors and distribute the operations over a number of independently trainable heads. You could achieve the same output dimension by using just a single head with larger vectors, but then the resulting attention maps would be averaged over the inputs too much. By splitting them into different heads, and eventually concatenating the output of each head, we get more 'independent' contributions to what the model should attend to.

<img src="https://production-media.paperswithcode.com/methods/multi-head-attention_l1A3G7a.png" width="300"/>

where the value, keys, and queries are passed through an initial linear layer (with trainable parameters, of course) before being used in a scaled dot-product attention layer, and finally concatenated in a single vector.

Let's go through the steps:

<img src="https://drive.google.com/uc?export=view&id=1M2FkiB_0BDP6MyXMYQckw7rQAPlApkC_" width="1000"/>


The different heads have independent query, key, and value matrices that are independently initialised:

\\

<img src="https://drive.google.com/uc?export=view&id=1BHCOQ-zLmk-yvCOL0Y_mxjCi5yIp0CcE" width="1000"/>



#### ***Step 6***


Then the outputs are concatenated and multiplied with an auxiliary trainable matrix ${\color{violet}{W^0}}$:

\\

<img src="https://drive.google.com/uc?export=view&id=1CSM-2CQQLvmq0fcJJMjqj0vV-Q5cpwWW" width="1000"/>



\[**NOTE**: The dimensions of these vectors are way too small. This is to facilitate the visual representation. In the original paper, the dimension of the **q**, **k**, and **v** vectors was 64, and they used 8 heads. After concatenation, this results in a 512-dimensional vector, which is the size of the original embedding.\]

\[**NOTE**: Now that we have a bit more context: the size of the embedding vector is kept constant throughout the whole network (in the original publication I think they used 512). That is, the size of ${\color{purple}{z}}$ is the same as the input size, and it does not change as we go through the blocks in the encoder or decoder.\]

<img src="https://miro.medium.com/max/1280/0*X0c962yMhgRKfMTD.gif" width="600"/>
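The bookkeeping behind "8 heads of 64 dimensions concatenated back into 512" can be checked with a few tensor reshapes. This sketch only tracks shapes (the attention computation inside each head is left out), and the batch size of 32 and sequence length of 10 are assumed example values.

```python
import torch

batch_size, seq_len, embed_dim, n_heads = 32, 10, 512, 8
head_dim = embed_dim // n_heads  # 512 / 8 = 64

x = torch.randn(batch_size, seq_len, embed_dim)

# Split the embedding dimension into heads: (32, 10, 512) -> (32, 8, 10, 64)
heads = x.view(batch_size, seq_len, n_heads, head_dim).transpose(1, 2)

# ... each head would run its own scaled dot-product attention here ...

# Concatenate the heads back: (32, 8, 10, 64) -> (32, 10, 512)
concat = heads.transpose(1, 2).contiguous().view(batch_size, seq_len, embed_dim)

print(heads.shape, concat.shape)
```

Because nothing is computed inside the heads here, `concat` is identical to `x`, which confirms that the split and the concatenation are exact inverses.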


#### 3.4 **Dimensions, parallelisation and residual connections**

At this point, we should ask ourselves three important questions:

1. Where is my sentence length dimension? And my batch size?

2. Where does the parallelisation of inputs (non-sequential) come into play?

3. How do we keep the positional encoding information through the stacked layers of encoders?

\\

### 1. Keeping track of dimensions:

<img src="https://drive.google.com/uc?export=view&id=18UETfazv0VYSHKYWhID41T3mpxwwB41k" width="1000"/>

<img src="https://drive.google.com/uc?export=view&id=1aBO4HBD20FvnBfryoBehCQdWi4VdKVYi" width="1000"/>

\\

### 2. Parallel inputs

The second question is answered by noting that the network can consume any of the inputs in the sequence independently:

\\

<img src="https://drive.google.com/uc?export=view&id=1k5_KWliQJ4WmU7VsoGQZZ9jOM5MER_c_" width="400"/>

<img src="https://drive.google.com/uc?export=view&id=1Xxalw7hp2KecUrA2cJK9UARcQqNWlv35" width="400"/>

\\

The matrices of weights $\color{red}{W^Q}$, $\color{green}{W^K}$, and $\color{blue}{W^V}$ use the same parameters for all the inputs, which means that the training is [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel).
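A quick way to convince yourself that shared weights allow all positions to be processed at once: applying one linear layer to the whole sentence gives the same result as applying it word by word. In this sketch `w_q` is a hypothetical stand-in for $W^Q$, and the sizes are arbitrary toy values.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, embed_dim = 3, 8
x = torch.randn(seq_len, embed_dim)                # one embedded sentence, one row per word
w_q = nn.Linear(embed_dim, embed_dim, bias=False)  # stands in for W^Q

with torch.no_grad():
    q_all = w_q(x)  # all words in a single matrix product
    q_one_by_one = torch.stack([w_q(x[i]) for i in range(seq_len)])  # word by word

print(torch.allclose(q_all, q_one_by_one, atol=1e-6))
```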

\\

### 3. Keeping positional encoding information

The third question correctly identifies the loss of positional encoding information as we traverse the encoder stack upwards. To avoid losing this information, we add a residual connection that just adds the value of the original **`embedding + positional encoding`** to the output of the feed-forward layer at every step:


<img src="https://drive.google.com/uc?export=view&id=19uqHiNyGgIDll1WlvYsbjPbzrjqDazfu" width="600"/>
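The "Add & Norm" pattern in the figure can be sketched as follows. The `sublayer` here is a hypothetical stand-in for any shape-preserving sub-layer (attention or feed-forward), and the sizes follow the 32 x 10 x 512 example used in the rest of this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim = 512
x = torch.randn(32, 10, embed_dim)  # (batch_size, seq_len, embed_dim)

# Stand-in for any sub-layer that preserves the input shape
sublayer = nn.Sequential(
    nn.Linear(embed_dim, 4 * embed_dim),
    nn.ReLU(),
    nn.Linear(4 * embed_dim, embed_dim),
)
norm = nn.LayerNorm(embed_dim)

# "Add & Norm": the sub-layer input is added back to its output before normalising
out = norm(x + sublayer(x))
print(out.shape)
```

The residual sum only works because the sub-layer returns a tensor of the same size as its input, which is one reason the embedding dimension is kept constant throughout the network.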


In [None]:
class MultiHeadAttention(nn.Module):
    def __init__(self, embed_dim=512, n_heads=8):
        """
        Args:
            embed_dim: dimension of embedding vector output
            n_heads: number of self attention heads
        """
        super(MultiHeadAttention, self).__init__()

        self.embed_dim = embed_dim    #512 dim
        self.n_heads = n_heads   #8
        self.single_head_dim = int(self.embed_dim / self.n_heads)   #512/8 = 64  . each key,query, value will be of 64d
       
        #key,query and value matrixes    #64 x 64   
        self.query_matrix = ### linear layer for querys # single key matrix for all 8 keys #512x512
        self.key_matrix =   ### linear layer for keys
        self.value_matrix = ### linear layer for values
        self.out =          ### final linear layer

    def forward(self,key,query,value,mask=None):    #batch_size x sequence_length x embedding_dim    # 32 x 10 x 512
        
        """
        Args:
           key : key vector
           query : query vector
           value : value vector
           mask: mask for decoder
        
        Returns:
           output vector from multihead attention
        """
        batch_size = key.size(0)
        seq_length = key.size(1)
        
        # query dimension can change in decoder during inference. 
        # so we can't take general seq_length
        seq_length_query = query.size(1)  ### let's park this for now...
        
        # 32x10x512
        key = key.view(batch_size, seq_length, self.n_heads, self.single_head_dim)  #batch_size x sequence_length x n_heads x single_head_dim = (32x10x8x64)
        query = query.view(batch_size, seq_length_query, self.n_heads, self.single_head_dim) #(32x10x8x64)
        value = value.view(batch_size, seq_length, self.n_heads, self.single_head_dim) #(32x10x8x64)
       
        k = ### generate keys using linear layer defined above     # (32x10x8x64)
        q = ### generate querys using linear layer defined above
        v = ### generate values using linear layer defined above

        q = q.transpose(1,2)  # (batch_size, n_heads, seq_len, single_head_dim)    # (32 x 8 x 10 x 64)
        k = k.transpose(1,2)  # (batch_size, n_heads, seq_len, single_head_dim)
        v = v.transpose(1,2)  # (batch_size, n_heads, seq_len, single_head_dim)
       
        # computes attention
        # adjust key for matrix multiplication
        k_adjusted = k.transpose(-1,-2)  #(batch_size, n_heads, single_head_dim, seq_len)  #(32 x 8 x 64 x 10)
        product = ### compute Q K product  #(32 x 8 x 10 x 64) x (32 x 8 x 64 x 10) = #(32x8x10x10)
      
        
        # fill those positions of product matrix as (-1e20) where mask positions are 0
        if mask is not None:
             product = product.masked_fill(mask == 0, float("-1e20"))  ### we will see later in the decoder

        #dividing by square root of key dimension
        product = ### divide by sqrt(dim_k) # / sqrt(64)

        #applying softmax
        scores = ### apply softmax
 
        #multiply with value matrix
        scores = ### multiply scores and value matrix  #(32x8x 10x 10) x (32 x 8 x 10 x 64) = (32 x 8 x 10 x 64) 
        
        #concatenated output
        concat = scores.transpose(1,2).contiguous().view(batch_size, seq_length_query, self.single_head_dim*self.n_heads) ### view only works on contiguous tensors  # (32x8x10x64) -> (32x10x8x64)  -> (32,10,512)
                                                                                                                          ### see this: https://discuss.pytorch.org/t/contigious-vs-non-contigious-tensor/30107
        output = ### apply final linear layer #(32,10,512) -> (32,10,512) ### because we have split the 512 in 8 groups of 64
                                                              ### and we want to maintain the embedding dimension (512) throughout the network (from the original implementation)
       
        return output
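As a sanity check on the dimensions quoted in the comments above, the same 32 x 10 x 512 input can be passed through PyTorch's built-in `torch.nn.MultiheadAttention`. This is only a shape comparison under assumed example sizes: the built-in layer uses its own projection layout, so it is not a drop-in copy of the class above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(32, 10, 512)  # (batch_size, seq_len, embed_dim)

# Self-attention: query, key and value all come from the same input
out, attn_weights = mha(x, x, x)

print(out.shape)           # (batch_size, seq_len, embed_dim)
print(attn_weights.shape)  # (batch_size, seq_len, seq_len), averaged over the heads
```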


### and now we can implement the Transformer Block and the Encoder:

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, embed_dim, expansion_factor=4, n_heads=8):
        super(TransformerBlock, self).__init__()
        
        """
        Args:
           embed_dim: dimension of the embedding
           expansion_factor: factor which determines output dimension of linear layer
           n_heads: number of attention heads
        
        """
        self.attention = ### define multihead attention
        
        self.norm1 = nn.LayerNorm(embed_dim) 
        self.norm2 = nn.LayerNorm(embed_dim)
        
        self.feed_forward = nn.Sequential(
                          nn.Linear(embed_dim, expansion_factor*embed_dim),
                          nn.ReLU(),
                          nn.Linear(expansion_factor*embed_dim, embed_dim)
        )

        self.dropout1 = ### define a dropout layer with p=0.2
        self.dropout2 = ### define a dropout layer with p=0.2

    def forward(self,key,query,value):
        
        """
        Args:
           key: key vector
           query: query vector
           value: value vector
        Returns:
           norm2_out: output of transformer block
        
        """
        
        attention_out = ### calculate output of multihead attention #32x10x512
        attention_residual_out =  ### add residual connection #32x10x512
        norm1_out =  ### apply layer normalisation and dropout to attention_residual_out  #32x10x512

        feed_fwd_out =  ### pass norm1_out through feed forward block #32x10x512 -> #32x10x2048 -> 32x10x512
        feed_fwd_residual_out =  ### add residual connection #32x10x512
        norm2_out =  ### apply layer normalisation and dropout to feed_fwd_residual_out  #32x10x512

        return norm2_out



class TransformerEncoder(nn.Module):
    """
    Args:
        seq_len : length of input sequence
        vocab_size: size of vocabulary
        embed_dim: dimension of embedding
        num_layers: number of encoder layers
        expansion_factor: factor which determines the hidden dimension of the feed-forward layer
        n_heads: number of heads in multihead attention
        
    Returns:
        out: output of the encoder
    """
    def __init__(self, seq_len, vocab_size, embed_dim, num_layers=2, expansion_factor=4, n_heads=8):
        super(TransformerEncoder, self).__init__()
        
        self.embedding_layer =  ### define embedding layer
        self.positional_encoder =  ### define positional encoding

        self.layers = nn.ModuleList([ ### add transformer blocks   ]) 
    
    def forward(self, x):
        embed_out = ### apply embedding
        out =  ### apply positional encoding
        for layer in self.layers:
            out =  ### run through blocks

        return out  #32x10x512

#### 3.5 **Masked multi-head attention and other decoder differences**

The decoder side is very similar to the encoder: it stacks a series of multi-head attention blocks combined with fully-connected (FC) layers (this is what the `Nx` next to the encoder and the decoder in the figure means).

<img src="https://drive.google.com/uc?export=view&id=12R5_hjYmB-phlqcV7rdfr2qbZlzCtOOC" width="1000"/>


But there are a few important differences:
- The decoder has **`Outputs (shifted right)`** as inputs
- The first multi-head attention layer is **`masked`**
- Some information is being fed from the **encoder stack**
- It has a **`Linear`** + **`Softmax`** layer on output (top of the figure)

Breaking them down one by one:

- The decoder has **`Outputs (shifted right)`** as inputs:

This is similar to what we saw in RNNs and LSTMs. Shifting the outputs one position to the right ensures that, when predicting a given position, the decoder does not see the token at that position (for example, the word it is supposed to produce in a translation) and only uses what has already been generated/predicted/translated earlier in the sequence. For example, if we are generating text, the first token passed to the decoder is the start token `<start>`, and the prediction process continues until the decoder generates a special end token `<eos>`.

<img src="https://miro.medium.com/max/960/0*u8nSpT8Z8ITwzNLV.gif" width="500"/>

\\
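As a rough illustration of what "shifted right" means in practice, here is a minimal sketch of how decoder inputs and labels are commonly built from one target sequence during training (teacher forcing). The token ids are made up, with 0 standing for `<start>` and 1 for `<eos>`; this is an assumption for illustration and not necessarily how the data in this notebook is prepared.

```python
import torch

# Hypothetical target sentence as token ids: 0 = <start>, 1 = <eos>
target = torch.tensor([[0, 7, 4, 3, 5, 1]])

# Decoder input: everything except the last token ("outputs shifted right")
decoder_input = target[:, :-1]   # <start> w1 w2 w3 w4
# Labels: everything except the first token (what should be predicted)
labels = target[:, 1:]           # w1 w2 w3 w4 <eos>

for step in range(decoder_input.shape[1]):
    seen = decoder_input[0, : step + 1].tolist()
    to_predict = labels[0, step].item()
    print(f"step {step}: decoder has seen {seen} -> should predict {to_predict}")
```

At every step the token to be predicted is one position ahead of the last token the decoder has seen.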

- The first multi-head attention layer is **`masked`**

This is an important one. We do not want the decoder to have access to any information ahead of its current processing position:

<img src="https://miro.medium.com/max/1090/0*0pqSkWgSPZYr_Sjx.png" width="300"/>

The word **`am`** should not consider the word **`fine`** in its attention mapping. It should only attend to words (inputs) that precede it in the sentence (sequence). Why? To keep its auto-regressive properties. What does auto-regressive mean? In models that operate with sequential data, it means that the model uses its outputs as inputs for the next sequential step.

RNNs and LSTMs are also autoregressive: they keep previous information in hidden- and cell-state vectors, but the contextual information available to them is limited (it only comes from the past, and its importance vanishes with distance). 

A good introduction to autoregressive models can be found [here](https://ml.berkeley.edu/blog/posts/AR_intro/).

To restrict the use of 'future' words, we mask the attention matrices:

<img src="https://miro.medium.com/max/1400/0*QYFua-iIKp5jZLNT.png" height="150"/>

<img src="https://miro.medium.com/max/1400/0*3ykVCJ9okbgB0uUR.png" height="150"/>

This masked self-attention layer also has multiple heads (8 in the original publication).
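The masking step can be sketched for a single head as follows. This is a toy example with random queries and keys (the sizes `seq_len` and `d_k` are arbitrary assumptions), not the notebook's own `MultiHeadAttention` class: the 'future' scores are set to $-\infty$ before the softmax so that they receive zero attention weight.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 4, 8                      # toy sequence of four tokens
q = torch.randn(seq_len, d_k)            # toy queries
k = torch.randn(seq_len, d_k)            # toy keys

scores = q @ k.T / math.sqrt(d_k)        # (seq_len, seq_len) scaled dot-product scores

# Lower-triangular matrix of ones: position i may attend to positions 0..i only
mask = torch.tril(torch.ones(seq_len, seq_len))

# Set the 'future' positions to -inf so that softmax gives them zero weight
masked_scores = scores.masked_fill(mask == 0, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)

print(weights)   # upper triangle is zero, each row still sums to one
```

The first row attends only to the first token, the second row to the first two tokens, and so on.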

\\

- Some information is being fed from the **encoder stack**

The outputs from the encoder are fed to the **encoder-decoder attention** layers:

<img src="https://drive.google.com/uc?export=view&id=1Uw-AZWXy_UqHJGYcIpLjvZeBdoJAFHzm" width="300"/>

The **keys** and **values** from the encoder output are combined with the **queries** of the previous decoder layer and used as inputs for these **encoder-decoder attention** layers. This allows every position in the decoder to attend to all positions in the input sequence.
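A minimal sketch of this encoder-decoder attention, using PyTorch's built-in `nn.MultiheadAttention` instead of the class implemented in this notebook. The batch size and sequence lengths are arbitrary assumptions, and the tensors are random stand-ins for the real encoder output and decoder activations.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed_dim, n_heads = 512, 8
batch, src_len, trg_len = 2, 10, 7

enc_out = torch.randn(batch, src_len, embed_dim)   # stand-in for the encoder stack output
dec_x = torch.randn(batch, trg_len, embed_dim)     # stand-in for the masked self-attention output

cross_attention = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

# Queries come from the decoder; keys and values come from the encoder
out, attn_weights = cross_attention(query=dec_x, key=enc_out, value=enc_out)

print(out.shape)           # one output vector per decoder position
print(attn_weights.shape)  # one row of weights over the source positions per decoder position
```

The attention weights have one row per decoder position and one column per encoder position, which is why the source and target sequences can have different lengths.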

\\

- It has a **`Linear`** + **`Softmax`** layer on output (top of the figure):

**QUESTIONS**:

1. What does this combination of a fully-connected layer and a Softmax activation normally do at the end of a network?

2. What could be the dimension of this output softmax vector?

\\

We can now implement the Decoder:

In [None]:
class DecoderBlock(nn.Module):
    def __init__(self, embed_dim, expansion_factor=4, n_heads=8):
        super(DecoderBlock, self).__init__()

        """
        Args:
           embed_dim: dimension of the embedding
           expansion_factor: factor which determines output dimension of linear layer
           n_heads: number of attention heads
        
        """
        self.attention = ### multihead attention
        self.norm = ### layer norm layer
        self.dropout = ### dropout layer
        self.transformer_block = ### transformer block
        
    
    def forward(self, key, query, x,mask):
        
        """
        Args:
           key: key vector
           query: query vector
           x: input to the masked self-attention
           mask: mask to be given for multi head attention 
        Returns:
           out: output of transformer block
    
        """
        
        # we need to pass the mask only to the first attention layer
        attention = self.attention(x,x,x,mask=mask) #32x10x512
        value = ### apply dropout to attention with skip connection
        
        out = ### pass through transformer block

        
        return out


class TransformerDecoder(nn.Module):
    def __init__(self, target_vocab_size, embed_dim, seq_len, num_layers=2, expansion_factor=4, n_heads=8):
        super(TransformerDecoder, self).__init__()
        """  
        Args:
           target_vocab_size: vocabulary size of target
           embed_dim: dimension of embedding
           seq_len : length of input sequence
           num_layers: number of decoder layers
           expansion_factor: factor which determines the hidden dimension of the feed forward layer
           n_heads: number of heads in multihead attention
        
        """
        self.word_embedding = nn.Embedding(target_vocab_size, embed_dim)
        self.position_embedding = PositionalEmbedding(seq_len, embed_dim)

        self.layers = nn.ModuleList(
            [
                DecoderBlock(embed_dim, expansion_factor=expansion_factor, n_heads=n_heads) 
                for _ in range(num_layers)
            ]

        )
        self.fc_out = nn.Linear(embed_dim, target_vocab_size)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x, enc_out, mask):
        
        """
        Args:
            x: input vector from target
            enc_out : output from encoder layer
            mask: mask for decoder self attention
        Returns:
            out: output vector
        """
            
        
        x = ### word embedding  #32x10x512
        x = ### add positional encoding #32x10x512
        x = ### apply dropout
     
        for layer in self.layers:
            x = layer(enc_out, x, enc_out, mask) 

        out = ### apply softmax to output of final linear layer (fc_out)

        return out

#### Now we have all the elements to implement our Transformer and test it:

In [None]:
class Transformer(nn.Module):
    def __init__(self, embed_dim, src_vocab_size, target_vocab_size, seq_length,num_layers=2, expansion_factor=4, n_heads=8):
        super(Transformer, self).__init__()
        
        """  
        Args:
           embed_dim:  dimension of embedding 
           src_vocab_size: vocabulary size of source
           target_vocab_size: vocabulary size of target
           seq_length : length of input sequence
           num_layers: number of encoder and decoder layers
           expansion_factor: factor which determines the hidden dimension of the feed forward layer
           n_heads: number of heads in multihead attention
        
        """
        
        self.target_vocab_size = target_vocab_size

        self.encoder = TransformerEncoder(seq_length, src_vocab_size, embed_dim, num_layers=num_layers, expansion_factor=expansion_factor, n_heads=n_heads)
        self.decoder = TransformerDecoder(target_vocab_size, embed_dim, seq_length, num_layers=num_layers, expansion_factor=expansion_factor, n_heads=n_heads)
        
    
    def make_trg_mask(self, trg):
        """
        Args:
            trg: target sequence
        Returns:
            trg_mask: target mask
        """
        batch_size, trg_len = trg.shape
        # lower triangular matrix of ones (on the same device as trg): position i can only attend to positions <= i
        trg_mask = torch.tril(torch.ones((trg_len, trg_len), device=trg.device)).expand(
            batch_size, 1, trg_len, trg_len
        )
        return trg_mask    

    def decode(self,src,trg):
        """
        Greedy decoding for inference.
        Args:
            src: input to encoder 
            trg: input to decoder
        Returns:
            out_labels: list with the predicted token at each step
        """
        trg_mask = self.make_trg_mask(trg)
        enc_out = self.encoder(src)
        out_labels = []
        seq_len = src.shape[1]
        out = trg
        for i in range(seq_len): #10
            out = self.decoder(out,enc_out,trg_mask) #bs x seq_len x vocab_dim
            # keep only the prediction for the last position
            out = out[:,-1,:]
     
            # greedy choice: take the most probable token
            out = out.argmax(-1)
            # .item() only works on single-element tensors, so this assumes a batch size of 1
            out_labels.append(out.item())
            # add the batch dimension back before feeding the token to the decoder again
            out = torch.unsqueeze(out,dim=0)
          
        
        return out_labels
    
    def forward(self, src, trg):
        """
        Args:
            src: input to encoder 
            trg: input to decoder
        Returns:
            out: tensor with the probabilities of each target word at every position
        """
        trg_mask = ### make a target mask
        enc_out = ### pass through encoder
   
        outputs = ### pass through decoder
        return outputs

#### 3.6 **Regularisation and optimiser**

Finally, such a complex architecture requires some form of regularisation to train properly.

Four regularisation methods are applied to this network:
- Layer normalisation, applied after the residual connections are added in the yellow `Add & Norm` blocks in the diagram. Layer normalisation is similar to batch normalisation, but it normalises over the features of each sample rather than over the batch, so it does not depend on batch size. Introduced by Ba, Kiros and Hinton in [this paper](https://arxiv.org/pdf/1607.06450.pdf).
- Dropout (after most layers, including after the positional encoding layer) with a dropout rate of $P_{drop}=0.2$ (the original paper uses $P_{drop}=0.1$ for the base model).
- Skip connections to maintain positional encoding information.
- Label smoothing with $\epsilon_{ls}=0.1$ (it softens the one-hot labels by moving a small amount of probability mass to the other classes, you can read about it [here](https://paperswithcode.com/method/label-smoothing)). According to the authors: *This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score*.

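The optimiser used in the original publication was Adam (with $\beta_1=0.9$, $\beta_2=0.98$ and $\epsilon=10^{-9}$), combined with a learning rate that increases linearly during a warm-up phase of 4000 steps and then decays proportionally to the inverse square root of the step number.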
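A minimal sketch of how label smoothing and the optimiser settings of the original paper could be set up in PyTorch. The helper name `transformer_lr` and the `nn.Linear` stand-in model are assumptions made for illustration; the schedule itself is the one given in the paper, $lr = d_{model}^{-0.5} \cdot \min(step^{-0.5},\ step \cdot warmup^{-1.5})$. Note that `nn.CrossEntropyLoss` expects raw scores (logits), not softmax outputs.

```python
import torch
import torch.nn as nn

d_model, warmup_steps, vocab_size = 512, 4000, 11

def transformer_lr(step, d_model=d_model, warmup_steps=warmup_steps):
    """Warm-up then inverse square root decay (step counting starts at 1)."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

model = nn.Linear(d_model, vocab_size)   # hypothetical stand-in for a Transformer

# lr=1.0 so that the factor returned by lr_lambda is the actual learning rate
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda s: transformer_lr(s + 1)
)

# Cross-entropy with label smoothing (epsilon_ls = 0.1)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# One dummy training step on random data
logits = model(torch.randn(3, d_model))   # raw scores, no softmax applied
labels = torch.tensor([2, 5, 7])
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
scheduler.step()

print(loss.item(), scheduler.get_last_lr())
```

The learning rate peaks at `step == warmup_steps` and decreases afterwards.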


### Let's test a forward pass through our network and check if at least the dimensions add up to what we expect:

In [None]:
### let's test if it does what we want, at least in terms of dimensions:

src_vocab_size = 11
target_vocab_size = 11
num_layers = 6
seq_length= 12


# let 0 be sos token and 1 be eos token
src = torch.tensor([[0, 2, 5, 6, 4, 3, 9, 5, 2, 9, 10, 1], 
                    [0, 2, 8, 7, 3, 4, 5, 6, 7, 2, 10, 1]])
target = torch.tensor([[0, 1, 7, 4, 3, 5, 9, 2, 8, 10, 9, 1], 
                       [0, 1, 5, 6, 2, 4, 7, 6, 2, 8, 10, 1]])

print(src.shape,target.shape)
model = Transformer(embed_dim=512, src_vocab_size=src_vocab_size, 
                    target_vocab_size=target_vocab_size, seq_length=seq_length,
                    num_layers=num_layers, expansion_factor=4, n_heads=8)

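out = model(src, target)
print(out.shape)  # expected: torch.Size([2, 12, 11]) = (batch size, sequence length, target vocabulary size)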



## 4. Visualisation of attention maps

To visualise how words attend to other words we will use an interactive tool called [bertviz](https://github.com/jessevig/bertviz).

We have not covered how BERT (Bidirectional Encoder Representations from Transformers) works; here we will just use the visualisation tool to build a better intuition of how words attend to each other.

In [None]:
!pip install bertviz

In [None]:
# Load model and retrieve attention weights

from bertviz import head_view, model_view
from transformers import BertTokenizer, BertModel

model_version = 'bert-base-uncased'
model = BertModel.from_pretrained(model_version, output_attentions=True)
tokenizer = BertTokenizer.from_pretrained(model_version)
sentence_a = "The cat sat on the mat"
sentence_b = "The cat lay on the rug"
inputs = tokenizer.encode_plus(sentence_a, sentence_b, return_tensors='pt')
input_ids = inputs['input_ids']
token_type_ids = inputs['token_type_ids']
attention = model(input_ids, token_type_ids=token_type_ids)[-1]
sentence_b_start = token_type_ids[0].tolist().index(1)
input_id_list = input_ids[0].tolist() # Batch index 0
tokens = tokenizer.convert_ids_to_tokens(input_id_list) 

### Head View
<b>The head view visualizes attention in one or more heads from a single Transformer layer.</b> Each line shows the attention from one token (left) to another (right). Line weight reflects the attention value (ranges from 0 to 1), while line color identifies the attention head. When multiple heads are selected (indicated by the colored tiles at the top), the corresponding  visualizations are overlaid onto one another.  For a more detailed explanation of attention in Transformer models, please refer to the [blog](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1).

#### Usage

👉 **Hover** over any **token** on the left/right side of the visualization to filter attention from/to that token. <br/>
👉 **Double-click** on any of the **colored tiles** at the top to filter to the corresponding attention head.<br/>
👉 **Single-click** on any of the **colored tiles** to toggle selection of the corresponding attention head. <br/>
👉 **Click** on the **Layer** drop-down to change the model layer (zero-indexed).

In [None]:
head_view(attention, tokens, sentence_b_start)

### Model View
<b>The model view provides a bird's-eye view of attention throughout the entire model</b>. Each cell shows the attention weights for a particular head, indexed by layer (row) and head (column).  The lines in each cell represent the attention from one token (left) to another (right), with line weight proportional to the attention value (ranges from 0 to 1).  For a more detailed explanation, please refer to the [blog](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1).

#### Usage
👉 **Click** on any **cell** for a detailed view of attention for the associated attention head (or to unselect that cell). <br/>
👉 Then **hover** over any **token** on the left side of detail view to filter the attention from that token.

In [None]:
model_view(attention, tokens, sentence_b_start)

In [None]:
%%html
<iframe src="https://poloclub.github.io/dodrio/" width="1000" height="800">
</iframe>

---

## 5. `torch.nn.Transformer`

Why did we go through all the pain of implementing a transformer if we can just use the [`torch.nn.Transformer`](https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html) class?

The reason is that it is important to understand the underlying architecture of complex networks. In the real world, when you are working on a particular problem, it is quite likely that you will need to modify these provided baseline networks to adapt them to the specifications of your problem.

Anyone can download a Transformer and use it, but **what will set you apart is your ability to understand what's going on under the hood.**

\\

But if we just want a Transformer model in PyTorch (`torch.nn.Transformer` implements the architecture described in the original paper), we can do:

In [None]:
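# nn.Transformer ships with PyTorch; the Hugging Face transformers package is used in the Bonus section at the end
!pip install transformers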

In [None]:
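# nn.Transformer expects (sequence length, batch size, embedding dimension) by default (batch_first=False)
transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
src = torch.rand((10, 32, 512))  # source: 10 tokens, batch of 32, d_model=512 (the default)
tgt = torch.rand((20, 32, 512))  # target: 20 tokens, batch of 32, d_model=512
out = transformer_model(src, tgt)  # same shape as tgt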

In [None]:
print(out.shape)

\\
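The call above does not pass any mask. To get the masked decoder behaviour described in section 3.5, a causal mask can be passed through the `tgt_mask` argument. This is a minimal sketch: the small model sizes are assumptions chosen to keep it fast, and the inputs are random tensors standing in for embedded sequences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Transformer(d_model=64, nhead=4, num_encoder_layers=2,
                       num_decoder_layers=2, dim_feedforward=128)
model.eval()  # disable dropout so that the two forward passes are comparable

src = torch.rand((10, 3, 64))   # (source length, batch size, d_model)
tgt = torch.rand((20, 3, 64))   # (target length, batch size, d_model)

# Float mask with -inf above the diagonal: position i cannot attend to positions > i
tgt_mask = model.generate_square_subsequent_mask(tgt.shape[0])

with torch.no_grad():
    out = model(src, tgt, tgt_mask=tgt_mask)

    # Replace the last ('future') target position and run the model again
    tgt_changed = tgt.clone()
    tgt_changed[-1] = torch.rand(3, 64)
    out_changed = model(src, tgt_changed, tgt_mask=tgt_mask)

print(out.shape)
# With the mask, outputs at earlier positions do not depend on the last position
print(torch.allclose(out[:-1], out_changed[:-1], atol=1e-5))
```

Without `tgt_mask`, every target position could attend to the whole target sequence, including the positions it is supposed to predict.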

---

\\

## 6. **Useful materials and additional resources**


- The original Transformer paper: [Attention is all you need](https://arxiv.org/abs/1706.03762)
- A good blog describing these and other components of the Transformers: [The illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
- And another: [Illustrated guide to Transformers](https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0)
- The visualisation tools are available here: [bertviz](https://github.com/jessevig/bertviz)
- Brilliant website to visualise Transformers in action: [dodrio](https://poloclub.github.io/dodrio/)
- Paper describing GPT-3: [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165)
- AlphaFold Nature paper: [Highly accurate protein structure prediction with AlphaFold](https://www.nature.com/articles/s41586-021-03819-2)



## **Bonus**

We can use a pretrained Transformer (as we do not have the resources to train a proper one):

In [None]:
from transformers import pipeline

In [None]:
# the model is named explicitly: it is the checkpoint that the pipeline would otherwise pick by default
question_answering = pipeline('question-answering', model='distilbert-base-cased-distilled-squad')


The first paragraphs of the Wikipedia entry on string theory:

In [None]:
context = """
In physics, string theory is a theoretical framework in which the point-like particles
of particle physics are replaced by one-dimensional objects called strings. String
theory describes how these strings propagate through space and interact with each other.
On distance scales larger than the string scale, a string looks just like an ordinary
particle, with its mass, charge, and other properties determined by the vibrational state
of the string. In string theory, one of the many vibrational states of the string
corresponds to the graviton, a quantum mechanical particle that carries the gravitational
force. Thus, string theory is a theory of quantum gravity.

String theory is a broad and varied subject that attempts to address a number of deep
questions of fundamental physics. String theory has contributed a number of advances
to mathematical physics, which have been applied to a variety of problems in black hole
physics, early universe cosmology, nuclear physics, and condensed matter physics, and
it has stimulated a number of major developments in pure mathematics. Because string
theory potentially provides a unified description of gravity and particle physics, it
is a candidate for a theory of everything, a self-contained mathematical model that
describes all fundamental forces and forms of matter. Despite much work on these
problems, it is not known to what extent string theory describes the real world or
how much freedom the theory allows in the choice of its details.

String theory was first studied in the late 1960s as a theory of the strong nuclear
force, before being abandoned in favor of quantum chromodynamics. Subsequently, it was
realized that the very properties that made string theory unsuitable as a theory of
nuclear physics made it a promising candidate for a quantum theory of gravity. The
earliest version of string theory, bosonic string theory, incorporated only the class
of particles known as bosons. It later developed into superstring theory, which posits
a connection called supersymmetry between bosons and the class of particles called fermions.
Five consistent versions of superstring theory were developed before it was conjectured in
the mid-1990s that they were all different limiting cases of a single theory in 11 dimensions
known as M-theory. In late 1997, theorists discovered an important relationship called the
anti-de Sitter/conformal field theory correspondence (AdS/CFT correspondence), which relates
string theory to another type of physical theory called a quantum field theory.

One of the challenges of string theory is that the full theory does not have a satisfactory
definition in all circumstances. Another issue is that the theory is thought to describe an
enormous landscape of possible universes, which has complicated efforts to develop theories
of particle physics based on string theory. These issues have led some in the community to
criticize these approaches to physics, and to question the value of continued research on
string theory unification. 
"""

In [None]:
question = "what is string theory?"
result = question_answering(question=question, context=context)

print("Answer:", result['answer'])
print("Score:", result['score'])

In [None]:
question = "What is the goal of string theory?"
result = question_answering(question=question, context=context)

print("Answer:", result['answer'])
print("Score:", result['score'])

The first paragraphs of the Wikipedia entry on batch normalisation:

In [None]:
context = """
Batch normalization (also known as batch norm) is a method used to make training
of artificial neural networks faster and more stable through normalization of the
layers' inputs by re-centering and re-scaling. It was proposed by Sergey Ioffe and Christian Szegedy in 2015.[1]

While the effect of batch normalization is evident, the reasons behind its effectiveness
remain under discussion. It was believed that it can mitigate the problem of internal
covariate shift, where parameter initialization and changes in the distribution of the
inputs of each layer affect the learning rate of the network.[1] Recently, some scholars
have argued that batch normalization does not reduce internal covariate shift, but rather
smooths the objective function, which in turn improves the performance.[2] However, at
initialization, batch normalization in fact induces severe gradient explosion in deep networks, which
is only alleviated by skip connections in residual networks.[3] Others maintain that batch normalization
achieves length-direction decoupling, and thereby accelerates neural networks.[4] More recently a
normalize gradient clipping technique and smart hyperparameter tuning has been introduced in
Normalizer-Free Nets, so called "NF-Nets" which mitigates the need for batch normalization.
"""

In [None]:
question = "what is the purpose of batch normalisation?"
result = question_answering(question=question, context=context)

print("Answer:", result['answer'])
print("Score:", result['score'])

On Thursday we will see how this can be greatly improved by using better, more modern implementations of Transformers.