# Transformers from Scratch in PyTorch
 

## Transformers 

“The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.”

Here, “transduction” means the conversion of input sequences into output sequences. The idea behind the Transformer is to handle the dependencies between input and output entirely with attention, dispensing with recurrence completely.

A transformer model can “attend” or “focus” on all previous tokens that have been generated.

The Transformer is an architecture for solving sequence-to-sequence tasks. It can be used for machine translation, speech recognition, question answering, summarization, conversational chatbots, better search engines, and more.

To get the basics clear, I would recommend first learning the fundamentals of [seq2seq](https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/) models and [language models](https://towardsdatascience.com/the-beginners-guide-to-language-models-aa47165b57f9).

Let's build some clear intuition for the Transformer, its architecture, and its building blocks.
The Transformer uses an encoder-decoder architecture.


![Transformer Architecture](transformer_architecture.png)


At a high level:

The encoder maps an input sequence into an abstract continuous representation that holds all the learned information of that input. The decoder then takes that continuous representation and generates the output step by step, one token at a time, while also being fed the previous output.

A brief overview of the encoder and decoder:

1. The word embeddings of the input sequence are passed to the first encoder.
2. These are then transformed and propagated to the next encoder.
3. The output from the last encoder in the encoder stack is passed to all the decoders in the decoder stack, as shown in the figure below.

A closer look at the encoder and decoder:
The encoder and decoder blocks are actually multiple identical encoders and decoders stacked on top of each other. Both the encoder stack and the decoder stack have the same number of units.


The encoder-decoder attention is trained to associate the input sentence with the corresponding output word.
To visualize the inner workings of attention, see [Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1).

##### Attach Encoder and Decoder Diagram given [here](https://theaisummer.com/transformer/)


####  [Steps for  Transformer encoder](https://theaisummer.com/transformer/) 

To process a sentence we need these 3 steps:

1. Word embeddings of the input sentence are computed simultaneously.

2. Positional encodings are then applied to each embedding resulting in word vectors that also include positional information.

3. The word vectors are passed to the first encoder block.

Each block consists of the following layers in the same order:

1. A multi-head self-attention layer to find correlations between each word

2. A normalization layer

3. A residual connection around the previous two sublayers

4. A linear layer

5. A second normalization layer

6. A second residual connection
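
The layers listed above can be sketched as one small module. This is a minimal sketch rather than a reference implementation: it assumes the `LayerNorm(x + Sublayer(x))` ordering from the paper, uses PyTorch's built-in `nn.MultiheadAttention` as a stand-in for the attention layer, and the name `EncoderBlockSketch` and the small sizes are made up for illustration.

```python
import torch
from torch import nn


class EncoderBlockSketch(nn.Module):
    """One encoder block: self-attention and feed-forward, each with add & norm."""

    def __init__(self, dim_model: int = 32, num_heads: int = 4, dim_ff: int = 64):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim_model, num_heads, batch_first=True)
        self.norm_1 = nn.LayerNorm(dim_model)
        self.feed_forward = nn.Sequential(
            nn.Linear(dim_model, dim_ff), nn.ReLU(), nn.Linear(dim_ff, dim_model)
        )
        self.norm_2 = nn.LayerNorm(dim_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: queries, keys and values all come from x.
        attended, _ = self.attention(x, x, x)
        x = self.norm_1(x + attended)                 # residual connection + layer norm
        return self.norm_2(x + self.feed_forward(x))  # second residual + layer norm


block = EncoderBlockSketch()
x = torch.rand(2, 5, 32)      # (batch, seq_len, dim_model)
print(block(x).shape)         # same shape as the input
```

Because the output has the same shape as the input, blocks like this can be stacked one after another.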





# Attention makes Transformer Special

[Attention is just a fancy name](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1) for weighted average! The weights show how much the model attends to each input in X when computing the weighted average, and are thus referred to as attention weights
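
To make the "weighted average" idea concrete, here is a tiny sketch in plain Python. The input vectors and the relevance scores are invented for illustration; in a real model the scores come from comparing queries with keys.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to one."""
    shifted = [s - max(scores) for s in scores]   # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_average(weights, vectors):
    """Average the vectors, weighting each one by its attention weight."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy input vectors
scores = [2.0, 0.5, 0.5]                    # hypothetical relevance scores
weights = softmax(scores)                   # the attention weights
output = weighted_average(weights, X)
print(weights)
print(output)
```

The first input receives the largest weight, so the output is pulled towards it.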

What is [self-attention](https://theaisummer.com/transformer/)?
Self-attention enables us to find correlations between different words of the input, indicating the syntactic and contextual structure of the sentence.


#### Append Diagram mention in links for better exposure

“Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.” ~ Ashish Vaswani et al. from Google Brain.


Self-attention allows the model to look at the other words in the input sequence to get a better understanding of a certain word in the sequence. Now, let’s see how we can calculate self-attention.


How to [calculate self-attention](https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/):
1. Create a query, key, and value vector from each of the encoder's inputs. We will discuss these in detail while exploring multi-head attention; the link above and [this article](https://theaisummer.com/transformer/) explain how the query, key, and value work.

2. Next, calculate self-attention for every word in the input sequence.
3. Consider the phrase “Action gets results”. To calculate the self-attention for the first word, “Action”, we calculate a score for every word in the phrase with respect to “Action”. This score determines the importance of the other words when we are encoding a certain word in the input sequence.


We use the keys to define the attention weights to look at the data and the values as the information that we will actually get.
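
The steps above can be sketched for the phrase “Action gets results”. This is only an illustration: the embeddings and the projection matrices `w_q`, `w_k`, `w_v` are random stand-ins for values that a trained model would have learned.

```python
import torch

torch.manual_seed(0)
tokens = ["Action", "gets", "results"]
dim = 4
x = torch.rand(len(tokens), dim)            # toy embeddings, one row per token

# Hypothetical projection matrices; in a trained model these are learned.
w_q, w_k, w_v = (torch.rand(dim, dim) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

scores = q @ k.T / dim ** 0.5               # scores[i, j]: relevance of token j to token i
weights = torch.softmax(scores, dim=-1)     # each row sums to one
output = weights @ v                        # weighted average of the value vectors

# The first row holds the scores of every word with respect to "Action".
for token, w in zip(tokens, weights[0]):
    print(f"Action -> {token}: {w.item():.3f}")
```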


## Let's Dive into Multi-Head Attention
The intuition behind [multi-head attention](https://theaisummer.com/transformer/) is that it allows us to attend to different parts of the sequence differently each time. This practically means that:

The model can better capture positional information because each head will attend to different segments of the input. The combination of them will give us a more robust representation.

Each head will capture different contextual information as well, by correlating words in a unique manner.


Quoted from the paper:
“Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this.”
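
One common way to give each head its own subspace is to split the embedding dimension into equal slices. The sketch below only shows the tensor bookkeeping, using the example sizes embed_size=256 and heads=8; it is an illustration of the reshape, not the full attention computation, and real implementations combine it with learned projections.

```python
import torch

batch, seq_len, embed_size, heads = 2, 5, 256, 8
head_dim = embed_size // heads               # 256 / 8 = 32 features per head

x = torch.rand(batch, seq_len, embed_size)
# Split the last dimension into (heads, head_dim), then move heads next to batch.
per_head = x.reshape(batch, seq_len, heads, head_dim).transpose(1, 2)
print(per_head.shape)                        # torch.Size([2, 8, 5, 32])

# Concatenating the heads again restores the original tensor.
merged = per_head.transpose(1, 2).reshape(batch, seq_len, embed_size)
print(torch.equal(merged, x))                # True
```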



#### Masked Multi-head attention

##### Transformer decoder: what is different?

The decoder consists of all the aforementioned components plus two novel ones. As before:

1. The output sequence is fed in its entirety and word embeddings are computed.

2. Positional encodings are again applied.

3. The vectors are passed to the first decoder block.

Each decoder block includes:

1. A Masked multi-head self-attention layer

2. A normalization layer followed by a residual connection

3. A new multi-head attention layer (known as Encoder-Decoder attention)

4. A second normalization layer and a residual connection

5. A linear layer and a third residual connection

The decoder block appears again 6 times. The final output is transformed through a final linear layer and the output probabilities are calculated with the standard softmax function.
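
The "masked" part of masked multi-head attention can be sketched with a lower-triangular matrix. This is a minimal illustration with random scores and a made-up sequence length, assuming the usual convention that position i may only attend to positions up to and including i.

```python
import torch

seq_len = 4
# Lower-triangular matrix: position i may attend to positions 0..i only.
mask = torch.tril(torch.ones(seq_len, seq_len))

scores = torch.rand(seq_len, seq_len)                  # toy attention scores
masked = scores.masked_fill(mask == 0, float("-inf"))  # hide future positions
weights = torch.softmax(masked, dim=-1)                # masked entries get weight 0
print(weights)
```

Every row still sums to one, but all the weight above the diagonal is zero, so a token never looks at tokens that come after it.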





#### [Steps of model Implementation](https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec)
1. Embedding the inputs 
2. The Positional Encodings 
3. Creating Masks 
4. The Multi-Head Attention layer 
5. The Feed-Forward layer


### code  
https://medium.com/swlh/what-exactly-is-happening-inside-the-transformer-b7f713d7aded



### Index
1) Intuitive Background of seq2seq. 

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/ \
https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html \


https://github.com/bentrevett/pytorch-seq2seq

2) High Level Introduction to Transformer. 

https://theaisummer.com/transformer/ \
https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/ \

3) Various components of Transformers.\

4) Visualization of each layer. \

5) Attention, covered in detail.\
6) Practical Implementation \
7) Clean code 




#### Various components of Transformers.
https://theaisummer.com/normalization/


#### Practical Implementation
https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec#3fa3 \
https://medium.com/the-dl/transformers-from-scratch-in-pytorch-8777e346ca51 \
https://towardsdatascience.com/a-detailed-guide-to-pytorchs-nn-transformer-module-c80afbc9ffb1 \

#### Clean code 
https://theaisummer.com/einsum-attention/

#### Visualization
https://pythonrepo.com/repo/jessevig-bertviz-python-deep-learning-model-explanation \
https://github.com/jessevig/bertviz \
https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1 \
https://www.machinecurve.com/index.php/2021/01/19/visualizing-transformer-behavior-with-ecco/




Roadmap for an intuitive understanding:
The theory behind each layer is covered in the [illustrated guide to Transformers, which uses a conversational chatbot as its example](https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0), and in Jay Alammar's [illustrated-transformer](https://jalammar.github.io/illustrated-transformer/), which explains every detail. For language translation, a code implementation is given in [How to code The Transformer in Pytorch](https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec). The post [what-exactly-is-happening-inside-the-transformer](https://medium.com/swlh/what-exactly-is-happening-inside-the-transformer-b7f713d7aded) implements each layer in TensorFlow, and [The Annotated Transformer from Harvard NLP](http://nlp.seas.harvard.edu/2018/04/03/attention.html) gives a clear understanding of a practical implementation. Last but not least, a [visualization](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1) of the inner workings helps build the knowledge. Other resources are listed below.



When it comes to understanding BERT, [Deconstructing BERT, Part 2: Visualizing the Inner Workings of Attention](https://towardsdatascience.com/deconstructing-bert-part-2-visualizing-the-inner-workings-of-attention-60a16d86b5c1) gives a detailed overview.


    
   

Steps for understanding 

1) Understanding the basic intuition
   
   https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/ \
   Task: Get the steps , understand intuitional flow and create documents  
    
   https://theaisummer.com/transformer/ \
   Task : Define steps 
    
   https://jalammar.github.io/illustrated-transformer/
   Task : Get the images to build better documents
   
   https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0
   
   Task: verify for some application 

2)  Build and test code 
    
    https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec#3fa3

In [2]:
# https://www.youtube.com/watch?v=U0s0f995w14&t=20s
import torch 
import torch.nn as nn


In [2]:
import torch
import torch.nn as nn

x = torch.tensor([[1.0, -1.0],
                  [0.0,  1.0],
                  [0.0,  0.0]])

in_features = x.shape[1]  # = 2
print(in_features)
out_features = 2

m = nn.Linear(in_features, out_features)
print(m)

2
Linear(in_features=2, out_features=2, bias=True)
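
To see what `nn.Linear` actually computes, the sketch below reproduces its output by hand with the same input as the cell above. The weights are randomly initialized, so only the equality check and the shapes are meaningful, not the numbers themselves.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.tensor([[1.0, -1.0],
                  [0.0,  1.0],
                  [0.0,  0.0]])
m = nn.Linear(in_features=2, out_features=2)

# nn.Linear stores its weight with shape (out_features, in_features),
# so the layer computes x @ W^T + b.
manual = x @ m.weight.T + m.bias
print(torch.allclose(m(x), manual))   # True
print(m(x).shape)                     # torch.Size([3, 2])
```

Note that the all-zero input row maps to the bias alone.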


In [3]:
class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        '''heads: number of parts the embedding is split into,
        e.g. embed_size=256 and heads=8 give 256 // 8 = 32 features per head'''
        super(SelfAttention, self).__init__()
        self.embed_size = embed_size
        self.heads = heads
        self.head_dim = embed_size // heads
        assert (self.head_dim * heads == embed_size), "Embed size needs to be divisible by heads"

        self.values = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.keys = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.queries = nn.Linear(self.head_dim, self.head_dim, bias=False)
        self.fc_out = nn.Linear(heads * self.head_dim, embed_size)

    def forward(self, values, keys, query, mask):
        N = query.shape[0]  # number of training examples processed at the same time
        # The lengths depend on where attention is used: source length or target length
        value_len, key_len, query_len = values.shape[1], keys.shape[1], query.shape[1]

        # Split the embedding into self.heads pieces
        values = values.reshape(N, value_len, self.heads, self.head_dim)
        keys = keys.reshape(N, key_len, self.heads, self.head_dim)
        queries = query.reshape(N, query_len, self.heads, self.head_dim)

        # Apply the learned linear projections to each head
        values = self.values(values)
        keys = self.keys(keys)
        queries = self.queries(queries)

        energy = torch.einsum("nqhd,nkhd->nhqk", [queries, keys])
        # queries shape: (N, query_len, heads, head_dim)
        # keys shape: (N, key_len, heads, head_dim)
        # energy shape: (N, heads, query_len, key_len)
        if mask is not None:
            energy = energy.masked_fill(mask == 0, float("-1e20"))

        # Scale by sqrt(head_dim), the sqrt(d_k) of the paper, then softmax over the keys
        attention = torch.softmax(energy / (self.head_dim ** (1 / 2)), dim=3)

        out = torch.einsum("nhql,nlhd->nqhd", [attention, values]).reshape(
            N, query_len, self.heads * self.head_dim
        )
        # attention shape: (N, heads, query_len, key_len)
        # values shape: (N, value_len, heads, head_dim)
        # einsum result: (N, query_len, heads, head_dim), then the heads are flattened

        return self.fc_out(out)  # (N, query_len, embed_size)

In [None]:
class TransformerBlock(nn.Module):
    def __init__(self, embed):
        super().__init__()  # the rest of this block is left unfinished in these notes




link: 
1) https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0 \
2) https://medium.com/the-dl/transformers-from-scratch-in-pytorch-8777e346ca51 \
3) https://towardsdatascience.com/how-to-code-the-transformer-in-pytorch-24db27c8f9ec \
4) https://pytorch.org/tutorials/beginner/transformer_tutorial.html \


“Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences.”

![image.png](attachment:image.png)

Mathematically, it is expressed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

What exactly is happening here? Q, K, and V are batches of matrices, each with shape (batch_size, seq_length, num_features). Multiplying the query (Q) and key (K) arrays results in a (batch_size, seq_length, seq_length) array, which tells us roughly how important each element in the sequence is. This is the attention of this layer — it determines which elements we “pay attention” to. The attention array is normalized using softmax, so that all of the weights sum to one. (Because we can’t pay more than 100% attention, right?) Finally, the attention is applied to the value (V) array using matrix multiplication.

In [1]:
# Ref. link 2: https://medium.com/the-dl/transformers-from-scratch-in-pytorch-8777e346ca51
from torch import Tensor
import torch.nn.functional as f

def scaled_dot_product_attention(query: Tensor, key: Tensor, value: Tensor) -> Tensor:
    temp = query.bmm(key.transpose(1, 2))
    scale = query.size(-1) ** 0.5
    softmax = f.softmax(temp / scale, dim=-1)
    return softmax.bmm(value)
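
A quick shape check makes the description above concrete. The sketch repeats the same computation so that it runs on its own, with one change that is my own addition: it also returns the attention weights so they can be inspected. The tensor sizes are arbitrary.

```python
import torch
import torch.nn.functional as F


def scaled_dot_product_attention(query, key, value):
    # Same computation as the function above, repeated so this cell runs on its own.
    scores = query.bmm(key.transpose(1, 2)) / query.size(-1) ** 0.5
    weights = F.softmax(scores, dim=-1)
    return weights.bmm(value), weights


q = torch.rand(2, 3, 8)   # (batch_size, query_len, num_features)
k = torch.rand(2, 5, 8)   # (batch_size, key_len, num_features)
v = torch.rand(2, 5, 8)   # one value vector per key

out, weights = scaled_dot_product_attention(q, k, v)
print(weights.shape)      # torch.Size([2, 3, 5])
print(out.shape)          # torch.Size([2, 3, 8])
```

The query and key lengths do not have to match: the output always has one row per query, which is what lets a decoder attend over an encoder sequence of a different length.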

Note that MatMul operations are translated to torch.bmm in PyTorch. That’s because Q, K, and V (query, key, and value arrays) are batches of matrices, each with shape (batch_size, sequence_length, num_features). Batch matrix multiplication is only performed over the last two dimensions.

From the diagram above, we see that multi-head attention is composed of several identical attention heads. Each attention head contains 3 linear layers, followed by scaled dot-product attention. Let’s encapsulate this in an AttentionHead layer:

In [2]:

import torch
from torch import nn


class AttentionHead(nn.Module):
    def __init__(self, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.q = nn.Linear(dim_in, dim_k)
        self.k = nn.Linear(dim_in, dim_k)
        self.v = nn.Linear(dim_in, dim_v)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return scaled_dot_product_attention(self.q(query), self.k(key), self.v(value))

Now, it’s very easy to build the multi-head attention layer. Just combine num_heads different attention heads and a Linear layer for the output.

In [3]:

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads: int, dim_in: int, dim_k: int, dim_v: int):
        super().__init__()
        self.heads = nn.ModuleList(
            [AttentionHead(dim_in, dim_k, dim_v) for _ in range(num_heads)]
        )
        self.linear = nn.Linear(num_heads * dim_v, dim_in)

    def forward(self, query: Tensor, key: Tensor, value: Tensor) -> Tensor:
        return self.linear(
            torch.cat([h(query, key, value) for h in self.heads], dim=-1)
        )

Let’s pause again to examine what’s going on in the MultiHeadAttention layer. Each attention head computes its own query, key, and value arrays, and then applies scaled dot-product attention. Conceptually, this means each head can attend to a different part of the input sequence, independent of the others. Increasing the number of attention heads allows us to “pay attention” to more parts of the sequence at once, which makes the model more powerful.

## Positional Encoding

We need one more component before building the complete transformer: positional encoding. Notice that MultiHeadAttention has no trainable components that operate over the sequence dimension (axis 1). Everything operates over the feature dimension (axis 2), and so it is independent of sequence length. We have to provide positional information to the model, so that it knows about the relative position of data points in the input sequences.
Vaswani et al. encode positional information using trigonometric functions, according to the equation:


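$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

We can implement this in just a few lines of code: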

In [4]:

def position_encoding(
    seq_len: int, dim_model: int, device: torch.device = torch.device("cpu"),
) -> Tensor:
    pos = torch.arange(seq_len, dtype=torch.float, device=device).reshape(1, -1, 1)
    dim = torch.arange(dim_model, dtype=torch.float, device=device).reshape(1, 1, -1)
    phase = pos / 1e4 ** (dim / dim_model)  # exponent grows from 0 towards 1 across the feature dimension

    return torch.where(dim.long() % 2 == 0, torch.sin(phase), torch.cos(phase))

Note: I’ve gotten several questions about this code. In the equation above, there is a factor of two in the phase exponent. But it is applied at index 2i (+1) in the positional encoding. These factors of two should offset one another, and so I do not include it in my code. I believe this is correct, but it’s possible that I’ve missed something. Please leave me a comment if you see anything that needs fixing. 
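
For comparison, here is a sketch that follows the paper's indexing literally, so that indices 2i and 2i+1 share the exponent 2i/d_model. The function name `sinusoidal_encoding` is made up for this example, and the sketch assumes that `dim_model` is even.

```python
import math
import torch


def sinusoidal_encoding(seq_len: int, dim_model: int) -> torch.Tensor:
    """Paper-style sinusoidal encoding; assumes dim_model is even."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)     # (seq_len, 1)
    two_i = torch.arange(0, dim_model, 2, dtype=torch.float)         # 0, 2, 4, ...
    inv_freq = torch.exp(-math.log(10000.0) * two_i / dim_model)     # 1 / 10000^(2i/d_model)
    pe = torch.zeros(seq_len, dim_model)
    pe[:, 0::2] = torch.sin(pos * inv_freq)   # even indices
    pe[:, 1::2] = torch.cos(pos * inv_freq)   # odd indices share the same frequency
    return pe


pe = sinusoidal_encoding(seq_len=4, dim_model=8)
print(pe.shape)   # torch.Size([4, 8])
print(pe[0])      # position 0: sin(0) = 0 on even indices, cos(0) = 1 on odd ones
```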


Now, you may be thinking, “Why use such an unusual encoding? Surely, there are simpler choices!” You’re not wrong, and this was my first thought as well. According to the authors,

We also experimented with using learned positional embeddings instead, and found that the two versions produced nearly identical results.

We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.


Why should sinusoidal encodings extrapolate to longer sequence lengths? Because sine/cosine functions are periodic, and they cover a range of [-1, 1].

Most other choices of encoding would not be periodic or restricted to the range [-1, 1].

Suppose that, during inference, you provide an input sequence longer than any used during training. Positional encoding for the last elements in the sequence could be different than anything the model has seen before. For those reasons, and despite the fact that learned embeddings appeared to perform equally as well, the authors still chose to use sinusoidal encoding. (I personally prefer learned embeddings, because they’re easier to implement and debug. But we’ll follow the authors for this article.)
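
For contrast, a learned positional embedding might look roughly like the sketch below. The class name `LearnedPositionalEmbedding` and the `max_len` parameter are assumptions for illustration, not part of the article's code. It also shows the extrapolation problem directly: a sequence longer than `max_len` has no embedding to look up.

```python
import torch
from torch import nn


class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_len: int, dim_model: int):
        super().__init__()
        self.embedding = nn.Embedding(max_len, dim_model)   # one trainable vector per position

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(x.size(1), device=x.device)   # 0 .. seq_len - 1
        return x + self.embedding(positions)                    # broadcast over the batch


layer = LearnedPositionalEmbedding(max_len=16, dim_model=8)
print(layer(torch.rand(2, 16, 8)).shape)    # torch.Size([2, 16, 8])

try:
    layer(torch.rand(2, 17, 8))             # position 16 has no embedding
except (IndexError, RuntimeError) as err:
    print("cannot extrapolate:", type(err).__name__)
```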


## The Transformer
![image.png](attachment:image.png)

Notice that the transformer uses an encoder-decoder architecture. The encoder (left) processes the input sequence and returns a feature vector (or memory vector). The decoder processes the target sequence, and incorporates information from the encoder memory. The output from the decoder is our model’s prediction!

We can code the encoder/decoder modules independently of one another, and then combine them at the end. But first we need a few more pieces of information, which aren’t included in the figure above. For example, how should we choose to build the feed forward networks?

Each of the layers in our encoder and decoder contains a fully connected feed-forward network, which … consists of two linear transformations with a ReLU activation in between. The dimensionality of input and output is 512, and the inner-layer has dimensionality 2048.

In [5]:
def feed_forward(dim_input: int = 512, dim_feedforward: int = 2048) -> nn.Module:
    return nn.Sequential(
        nn.Linear(dim_input, dim_feedforward),
        nn.ReLU(),
        nn.Linear(dim_feedforward, dim_input),
    )

The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. … We apply dropout to the output of each sub-layer, before it is added to the sub-layer input and normalized.

In [6]:

class Residual(nn.Module):
    def __init__(self, sublayer: nn.Module, dimension: int, dropout: float = 0.1):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(dimension)
        self.dropout = nn.Dropout(dropout)

    def forward(self, *tensors: Tensor) -> Tensor:
        # Assume that the "query" tensor is given first, so we can compute the
        # residual.  This matches the signature of 'MultiHeadAttention', whose
        # output has one row per query.
        return self.norm(tensors[0] + self.dropout(self.sublayer(*tensors)))

In [7]:
class TransformerEncoderLayer(nn.Module):
    def __init__(
        self, 
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, src: Tensor) -> Tensor:
        src = self.attention(src, src, src)
        return self.feed_forward(src)


class TransformerEncoder(nn.Module):
    def __init__(
        self, 
        num_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])

    def forward(self, src: Tensor) -> Tensor:
        seq_len, dimension = src.size(1), src.size(2)
        # Out-of-place add, so the caller's tensor is not modified
        src = src + position_encoding(seq_len, dimension, device=src.device)
        for layer in self.layers:
            src = layer(src)

        return src

#### Decoder
The decoder module is extremely similar. Just a few small differences:

The decoder accepts two arguments (target and memory), rather than one.

There are two multi-head attention modules per layer, instead of one.

The second multi-head attention accepts memory for two of its inputs.

In [8]:
class TransformerDecoderLayer(nn.Module):
    def __init__(
        self, 
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        super().__init__()
        dim_k = dim_v = dim_model // num_heads
        self.attention_1 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.attention_2 = Residual(
            MultiHeadAttention(num_heads, dim_model, dim_k, dim_v),
            dimension=dim_model,
            dropout=dropout,
        )
        self.feed_forward = Residual(
            feed_forward(dim_model, dim_feedforward),
            dimension=dim_model,
            dropout=dropout,
        )

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        tgt = self.attention_1(tgt, tgt, tgt)
        # Cross-attention: queries come from the decoder, keys and values from the encoder memory
        tgt = self.attention_2(tgt, memory, memory)
        return self.feed_forward(tgt)


class TransformerDecoder(nn.Module):
    def __init__(
        self, 
        num_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
    ):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerDecoderLayer(dim_model, num_heads, dim_feedforward, dropout)
            for _ in range(num_layers)
        ])
        self.linear = nn.Linear(dim_model, dim_model)

    def forward(self, tgt: Tensor, memory: Tensor) -> Tensor:
        seq_len, dimension = tgt.size(1), tgt.size(2)
        # Out-of-place add, so the caller's tensor is not modified
        tgt = tgt + position_encoding(seq_len, dimension, device=tgt.device)
        for layer in self.layers:
            tgt = layer(tgt, memory)

        return torch.softmax(self.linear(tgt), dim=-1)


In [9]:
class Transformer(nn.Module):
    def __init__(
        self, 
        num_encoder_layers: int = 6,
        num_decoder_layers: int = 6,
        dim_model: int = 512, 
        num_heads: int = 8, 
        dim_feedforward: int = 2048, 
        dropout: float = 0.1, 
        activation: nn.Module = nn.ReLU(),
    ):
        super().__init__()
        self.encoder = TransformerEncoder(
            num_layers=num_encoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )
        self.decoder = TransformerDecoder(
            num_layers=num_decoder_layers,
            dim_model=dim_model,
            num_heads=num_heads,
            dim_feedforward=dim_feedforward,
            dropout=dropout,
        )

    def forward(self, src: Tensor, tgt: Tensor) -> Tensor:
        return self.decoder(tgt, self.encoder(src))

In [10]:

src = torch.rand(64, 16, 512)
tgt = torch.rand(64, 16, 512)
out = Transformer()(src, tgt)
print(out.shape)
# torch.Size([64, 16, 512])

torch.Size([64, 16, 512])
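
As a sanity check, the same kind of shape test can be run against PyTorch's built-in `nn.Transformer`. This is a comparison sketch, not part of the from-scratch code: the sizes are deliberately tiny, and unlike the decoder above the built-in module returns raw features with no final softmax. It also uses a causal target mask.

```python
import torch
from torch import nn

model = nn.Transformer(
    d_model=32, nhead=4,
    num_encoder_layers=1, num_decoder_layers=1,
    dim_feedforward=64, batch_first=True,
)
src = torch.rand(2, 7, 32)   # (batch, source_len, d_model)
tgt = torch.rand(2, 5, 32)   # (batch, target_len, d_model)

# Causal mask so that each target position only sees earlier positions.
tgt_mask = model.generate_square_subsequent_mask(tgt.size(1))
out = model(src, tgt, tgt_mask=tgt_mask)
print(out.shape)             # torch.Size([2, 5, 32])
```

The output length follows the target sequence, not the source sequence, even though the two lengths differ here.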
