# TRANSFORMERS

## INTRODUCTION

- The Transformer is a deep learning model architecture proposed by researchers at [Google](https://arxiv.org/abs/1706.03762) in 2017 in the paper titled "Attention is All You Need".
- It is an architecture that uses the attention mechanism to learn complex dependencies between words in a sentence.
- Because attention relates every word to every other word directly, the model can capture the semantics (meaning) of a sentence even when the related words are far apart.
- In practice, a Transformer model is usually prepared in three steps:
    * **Pretraining:** This step is often referred to as language modeling. The model learns to predict the next word in a sentence based on the previous words. The model is trained on a large corpus of text such as Wikipedia. This data is not labeled.
    * **Domain Adaptation:** In this step the pretrained language model adapts to the in-domain corpus of text. For eg: A model trained on the Wikipedia corpus adapts to an emotion dataset. The task is again language modeling (predicting the next word from the previous words), but now restricted to the text of the in-domain corpus.
    * **Fine-tuning:** In this step the language model is fine-tuned with a task layer based on the target task. For eg: A classification layer is added to the model to classify the emotion of the text.
- The architecture of the transformer consists of two main parts:

<figure>
<img src= "res\06_1_trans_arch.png" style="width:100%">
<figcaption align = "center"><b>Fig.1 -  Transformer Architecture</b></figcaption>
</figure>

- **Encoder:** This part is responsible for encoding the input sequence. The encoder is a stack of layers, each combining a multi-head self-attention sublayer with a feed-forward sublayer.
- **Decoder:** This part is responsible for generating the output sequence. The decoder is a stack of layers, each combining a masked multi-head self-attention sublayer, a multi-head attention sublayer over the encoder output, and a feed-forward sublayer.
- The encoder and decoder are combined to form the Transformer.
- The flow is that the encoder takes the input sequence and encodes it into a sequence of hidden states. The decoder takes these hidden states, together with the tokens it has generated so far, and produces the output sequence one token at a time.
- Initially Transformer models were used for sequence-to-sequence tasks. For eg: Machine translation, which uses both the encoder and the decoder.

We will see the architecture in detail while implementing a transformer model from scratch.

- Later, researchers found that the encoder and the decoder can each be used as a standalone model to perform different tasks.

Reference: Jay Alammar - The Illustrated Transformer is a very good introduction to the Transformer model. https://jalammar.github.io/illustrated-transformer/

## WHY USE TRANSFORMERS

- Transformers are the state-of-the-art architecture for most NLP tasks and generally outperform earlier architectures such as RNNs and LSTMs.
- They use the attention mechanism to learn complex dependencies between words in a sentence.
- With transfer learning we can reuse a pretrained model for our own task by feeding it our inputs and adding a task layer at the end. To be precise, only the head of the model changes; the pretrained body is reused.
- They are easily adaptable to most NLP tasks.
- They are easy to implement and train with the help of the **HUGGINGFACE** library.

## HISTORY OF TRANSFORMERS

* In 2017, Google released a paper titled "Attention is All You Need" which described the architecture of the Transformer model.
* This paper proposed a novel neural network architecture for sequence modeling.
* This architecture, dubbed the Transformer, outperformed recurrent neural networks (RNNs) on machine translation in terms of both translation quality and training cost (speed of training).
* In parallel, another line of research proposed an effective transfer learning method called ULMFiT.
* ULMFiT showed that pretraining LSTM language models on a very large and diverse corpus could yield SOTA models even with little labelled data.
* This provided the missing piece for Transformer models to take off in the field of NLP.
* Combining the ULMFiT approach with the architecture from the "Attention is All You Need" paper gave rise to two revolutionary models:
    - **GPT (Generative Pretrained Transformer):** It uses only the decoder part of the Transformer model. It follows the same language modeling approach as proposed by ULMFiT. It was pretrained on BookCorpus.
    - **BERT (Bidirectional Encoder Representations from Transformers):** This model uses only the encoder part of the Transformer model. In the pretraining step it uses a special type of language modeling called masked language modeling (MLM). In the MLM task the model learns to predict a masked token in a sentence, such as the [MASK] token in "I am [MASK] in NLP". Because BERT uses bidirectional attention, it reads the context on both sides of the masked position before predicting the appropriate word. BERT was pretrained on BookCorpus and English Wikipedia.
* Later, many more models were released, such as RoBERTa, DistilBERT, XLM, XLNet, ELECTRA, ALBERT, T5, FlauBERT, CTRL, and more.
* After the arrival of **HUGGINGFACE**, implementing Transformer models became easier and many people started to use different models for different tasks.

## DIFFERENT TASKS IN NLP USING TRANSFORMERS

There are different tasks that can be performed using Transformer models. Some of the common tasks are:

**Language Modeling:** This is the task where the model learns to predict the next word or a next sequence of words in a sentence based on the previous words.

**Text Classification:** This is the task where the model learns to classify the text into a particular category.

**Question Answering:** This is the task where the model learns to answer the question posed by the user.

**Machine Translation:** This is the task where the model learns to translate the input sequence from one language into another.

**Named Entity Recognition:** This is the task where the model learns to recognize named entities in the text, such as people, organizations and locations.

**POS Tagging:** This is the task where the model learns to tag the words in the text with the appropriate part of speech.

**Summarization:** This is the task where the model learns to summarize the text.

**Text Generation:** This is the task where the model learns to generate the text based on the input sequence.

## SIMPLE IMPLEMENTATION OF TASKS USING **pipeline()** METHOD OF TRANSFORMERS

`pipeline()` is a function provided by the HuggingFace Transformers library. It abstracts away all the steps needed to convert raw text into tokens, run the model, and turn the model outputs into a set of predictions.

### Language Modeling

In [26]:
from transformers import pipeline
text = "I am a very <mask> person"
mask_filler = pipeline("fill-mask")
mask_filler(text)

No model was supplied, defaulted to distilroberta-base (https://huggingface.co/distilroberta-base)


[{'score': 0.05234813317656517,
  'token': 9152,
  'token_str': ' shy',
  'sequence': 'I am a very shy person'},
 {'score': 0.03213212266564369,
  'token': 5394,
  'token_str': ' lucky',
  'sequence': 'I am a very lucky person'},
 {'score': 0.03190971910953522,
  'token': 3458,
  'token_str': ' religious',
  'sequence': 'I am a very religious person'},
 {'score': 0.02915872447192669,
  'token': 12038,
  'token_str': ' intelligent',
  'sequence': 'I am a very intelligent person'},
 {'score': 0.026916896924376488,
  'token': 9256,
  'token_str': ' generous',
  'sequence': 'I am a very generous person'}]

### Text Classification

In [27]:
text = "I am good in NLP"
classifier = pipeline("text-classification")
classifier(text)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


[{'label': 'POSITIVE', 'score': 0.9998408555984497}]

### Named Entity Recognition

In [28]:
import pandas as pd
ner_tagger = pipeline("ner", aggregation_strategy="simple")
print(pd.DataFrame(ner_tagger(text)))

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english)



### Question Answering

In [None]:
context = "Ineuron is a company that create affordable courses that are very good in NLP. Sudhanshu Kumar is the CEO and founder whereas the infamous youtuber Krish Naik is the co-founder of the company"
question = "Who is Ineuron?"
question_answerer = pipeline("question-answering")
question_answerer(question, context)

No model was supplied, defaulted to distilbert-base-cased-distilled-squad (https://huggingface.co/distilbert-base-cased-distilled-squad)


{'score': 0.17145124077796936,
 'start': 11,
 'end': 77,
 'answer': 'a company that create affordable courses that are very good in NLP'}

### Text Summarization

In [None]:
text = "Ineuron is a company that create affordable courses that are very good in NLP. Sudhanshu Kumar is the CEO and founder whereas the infamous youtuber Krish Naik is the co-founder of the company"
summarizer = pipeline("summarization")
summarizer(text)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)
Your max_length is set to 142, but you input_length is only 47. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=23)


[{'summary_text': ' Ineuron is a company that create affordable courses that are very good in NLP . Sudhanshu Kumar is the CEO and founder whereas the infamous youtuber Krish Naik is the co-founder of the company . The company is based in New York City .'}]

### Translation

In [None]:
text = "I am a very good person"
translator = pipeline("translation_en_to_fr")
translator(text)

No model was supplied, defaulted to t5-base (https://huggingface.co/t5-base)


[{'translation_text': 'Je suis une très bonne personne'}]

## ATTENTION IS ALL YOU NEED

The paper "Attention is All You Need", released by Google in 2017, described the architecture of the Transformer model, a novel neural network architecture for sequence modeling. It caused a revolution in the field of NLP.

The architecture of the Transformer consists of two main parts:
- **Encoder:** This part is responsible for encoding the input sequence. The encoder is a stack of layers, each combining a multi-head self-attention sublayer with a feed-forward sublayer.
- **Decoder:** This part is responsible for generating the output sequence. The decoder is a stack of layers, each combining a masked multi-head self-attention sublayer, a multi-head attention sublayer over the encoder output, and a feed-forward sublayer.

Paper: https://arxiv.org/abs/1706.03762

In simple terms, the flow through the encoder and decoder for machine translation is as follows:

**Encoder part:** Tokenized text (English) -> Token encodings -> Token embeddings + Positional embeddings -> Encoder stack -> Hidden states -> K, V (Key, Value)

**Decoder part:** Tokenized text (Tamil) -> Token encodings -> Token embeddings + Positional embeddings -> Decoder stack (attends to K, V from the encoder) -> Hidden states -> Classification head -> Token predictions -> Output

## ENCODER

- Transformer encoder contains a stack of encoder layers.

**ARCHITECTURE:**
<figure>
<img src= "res\06_2_trans_enc.png">
<figcaption align = "center"><b>Fig.2 -  Transformer Encoder Architecture</b></figcaption>
</figure>

- The encoder consists of mainly two sublayers:
    + A multi-head self-attention layer
    + A fully-connected feed-forward layer 
- Every layer in the stack has the same structure. The only addition is an embedding layer before the first encoder layer, which converts the input tokens into the embeddings that flow through the stack.
- Each encoder layer receives a sequence of embeddings and passes it through its sublayers.
- The output embeddings of each encoder layer have the same size as its input embeddings.
- The main purpose of the encoder is to **update** the input embeddings to produce representations that encode the contextual information in the sequence.

**For eg:** The word "fly" on its own might be represented close to the insect, but with context words such as "birds" or "time" its representation is updated towards the verb, i.e. the action of flying.

- Each sublayer also uses a skip connection and layer normalization, which help in training the deep neural network effectively.
- Now let's look in detail at self-attention, which is the most important part of the encoder.


## SELF ATTENTION

- Attention is a mechanism that allows neural networks to assign a different amount of weight, or **attention**, to each element in a sequence. For text, the elements are token embeddings, where each token is mapped to a vector of fixed dimension d. For eg: The BERT base model has an embedding dimension of 768.

- The "self" part of self-attention refers to the fact that these weights are computed between all the hidden states of the same set: each token attends to every token of its own sequence, including itself.

- The main idea behind self-attention is that instead of using a fixed embedding for each token, we can make use of the whole sequence and calculate a weighted average of the embeddings for each token.

- Here the attention weight w_ji defines how relevant token j is to token i, and the weights for each token sum to 1. Each token therefore gets its own weighted average of the embeddings in the sequence. Embeddings produced this way, where context plays a major role, are called ***contextualized embeddings***.
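
The weighted-average idea can be sketched in plain Python, without PyTorch, so that the arithmetic stays visible. The embeddings and attention weights below are made-up toy values chosen for illustration; they do not come from any model.

```python
# Toy example: three tokens with 2-dimensional embeddings (made-up values).
embeddings = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [1.0, 1.0],  # token 2
]

# Hypothetical attention weights of token 0 over all tokens (they sum to 1).
weights_for_token0 = [0.5, 0.25, 0.25]

def weighted_average(weights, vectors):
    """Return sum_j weights[j] * vectors[j], computed per dimension."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]

contextual_token0 = weighted_average(weights_for_token0, embeddings)
print(contextual_token0)  # [0.75, 0.5]
```

The new representation of token 0 is no longer its fixed embedding `[1.0, 0.0]`; it now mixes in information from the other tokens in proportion to the weights.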

## SCALED DOT PRODUCT ATTENTION

One of the most common ways to calculate the attention weights is to use Scaled Dot Product Attention.

The following are the steps to calculate the attention weights using Scaled Dot Product Attention:

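    1. Project each token embedding into three vectors: query, key and value.
    2. Compute the attention scores as the dot product of each query with every key, and scale them by dividing by the square root of the key dimension.
    3. Apply a softmax function to the scaled scores to obtain the attention weights.
    4. Multiply the attention weights by the value vectors to obtain the updated token embeddings.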

<figure>
<img src= "res\06_4_scaled_dot.png">
<figcaption align = "center"><b>Fig.3 - Scaled Dot Product Attention</b></figcaption>
</figure>

Now let's see the formula for the scaled dot product attention:

<figure>
<img src= "res\06_5_scaled_dot_formula.png">
<figcaption align = "center"><b>Fig.4 - Scaled Dot Product Attention Formula</b></figcaption>
</figure>
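
Written out as text, the scaled dot product attention formula from the "Attention is All You Need" paper is as follows, where Q, K and V are the matrices of query, key and value vectors and d_k is the dimension of the key vectors:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
$$

Dividing by the square root of d_k keeps the dot products from growing with the vector dimension, which would otherwise push the softmax into regions with very small gradients.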

In simple terms, the flow is:

```
Q, K -> MatMul -> Scale (divide by sqrt(d_k)) -> Mask (optional) -> Softmax -> MatMul(softmax_out, V) -> Attention output
```
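
The optional mask step is the one the decoder's masked attention relies on. Below is a minimal plain-Python sketch of the whole flow, assuming a causal mask (each position may attend only to itself and earlier positions) and made-up toy vectors. The function names are illustrative and not part of any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, causal=False):
    """Scaled dot-product attention for one sequence given as lists of vectors."""
    d_k = len(k[0])
    outputs, all_weights = [], []
    for i, q_i in enumerate(q):
        # MatMul + scale: dot product of query i with every key, divided by sqrt(d_k)
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in k]
        if causal:
            # Optional mask: position i must not attend to later positions j > i
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # MatMul with V: weighted sum of the value vectors
        out = [sum(w * v_j[d] for w, v_j in zip(weights, v)) for d in range(len(v[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up toy embeddings
_, w_full = attention(x, x, x)
_, w_causal = attention(x, x, x, causal=True)
print(w_causal[0])  # the first token can only attend to itself: [1.0, 0.0, 0.0]
```

Setting a score to negative infinity before the softmax gives that position a weight of exactly zero, while the remaining weights in the row still sum to 1.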

**Query, Key and Value:**

The notions of query, key and value were inspired by information retrieval systems, but we can understand them with a simple example.

**Supermarket eg:**

**query:** Suppose we need to prepare a dinner and have a list of the ingredients required. Each ingredient on the list can be thought of as a query.

**key:** As we go through the shelves of the supermarket we see the labels of the products. Each label can be thought of as a key.

**value:** The item on the shelf itself, which we take when its label matches, can be thought of as the value.

**similarity function:** We take an item only if its label (key) matches an ingredient on our list (query). Attention is a softer version of this: every label matches every ingredient to some degree, and we take a correspondingly weighted share of every item.

Now we will visualize attention by using a library called **bertviz.**

In [None]:
from transformers import AutoTokenizer
from bertviz.transformers_neuron_view import BertModel
from bertviz.neuron_view import show

model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BertModel.from_pretrained(model_ckpt)

text = "Time flies like arrow"
show(model, "bert", tokenizer, text, layer=0, head=0)


In [None]:
from torch import nn
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("bert-base-uncased")
token_emb = nn.Embedding(config.vocab_size, config.hidden_size)
model_ckpt = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
text = "Time flies like arrow"
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
print("Embeddings_Dimensions", token_emb)
input_embeddings = token_emb(inputs.input_ids)
print("Input Embeddings Shape format[batch_size, seq_len, hidden_dim]", input_embeddings.size())
print("Input Embeddings", input_embeddings)

Embeddings_Dimensions Embedding(30522, 768)
Input Embeddings Shape format[batch_size, seq_len, hidden_dim] torch.Size([1, 4, 768])
Input Embeddings tensor([[[ 0.2709, -1.8986,  0.4015,  ..., -1.3549, -0.0362, -0.6397],
         [ 0.7809,  0.2935, -0.9990,  ..., -0.4781, -0.1515, -0.5107],
         [ 1.8515,  1.6065, -1.5233,  ..., -0.3414,  0.8963, -1.7680],
         [-1.6591,  2.7837, -3.0972,  ...,  1.1894,  1.3556,  1.2462]]],
       grad_fn=<EmbeddingBackward>)


In [None]:
import torch
from math import sqrt

query = key = value = input_embeddings
d_model = key.size()[-1]
attention_scores = torch.bmm(query, key.transpose(1, 2)) / sqrt(d_model)
print("Attention Scores Shape", attention_scores.size())
print("Attention Scores", attention_scores)

Attention Scores Shape torch.Size([1, 4, 4])
Attention Scores tensor([[[31.0025, -0.1536, -0.9703, -0.2598],
         [-0.1536, 28.9791, -1.3148,  0.9324],
         [-0.9703, -1.3148, 29.4150, -0.2738],
         [-0.2598,  0.9324, -0.2738, 26.2418]]], grad_fn=<DivBackward0>)


In [None]:
import torch.nn.functional as F
attention_weights = F.softmax(attention_scores, dim=-1)
attention_weights.sum(dim=-1)
print("Attention Weights Shape", attention_weights.size())
print("Attention Weights", attention_weights)

Attention Weights Shape torch.Size([1, 4, 4])
Attention Weights tensor([[[1.0000e+00, 2.9450e-14, 1.3014e-14, 2.6484e-14],
         [2.2276e-13, 1.0000e+00, 6.9749e-14, 6.5991e-13],
         [6.3658e-14, 4.5105e-14, 1.0000e+00, 1.2773e-13],
         [3.0939e-12, 1.0192e-11, 3.0508e-12, 1.0000e+00]]],
       grad_fn=<SoftmaxBackward>)


In [None]:
attention_outputs = torch.bmm(attention_weights, value)
print("Attention Outputs Shape", attention_outputs.size())

Attention Outputs Shape torch.Size([1, 4, 768])


In [None]:
print("Attention Outputs", attention_outputs)

Attention Outputs tensor([[[-1.2046,  0.8638,  2.0519,  ...,  1.7864, -1.7270,  0.9686],
         [-0.4713, -0.5196, -1.0900,  ...,  0.1141, -0.3583, -0.4779],
         [-1.5532, -0.7467,  1.1496,  ...,  0.6403,  1.8783,  0.1328],
         [-0.6959,  0.7367, -0.3461,  ...,  0.4862, -0.1238, -0.7458]]],
       grad_fn=<BmmBackward0>)


In [None]:
def scaled_dot_product_attention(q, k, v):
    '''
    Calculates the attention weights and returns the attention outputs for a given query, key and value
    Inputs:
        q: Query tensor of shape [batch_size, seq_len, dim]
        k: Key tensor of shape [batch_size, seq_len, dim]
        v: Value tensor of shape [batch_size, seq_len, dim]
    Outputs:
        attention_outputs: Attention outputs of shape [batch_size, seq_len, dim]
    '''
    # Calculate the attention weights
    attention_scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(k.size()[-1])
    attention_weights = F.softmax(attention_scores, dim=-1)
    # Calculate the attention outputs
    attention_outputs = torch.bmm(attention_weights, v)
    return attention_outputs

## MULTI HEAD ATTENTION

In practice, the self-attention layer applies three independent linear transformations to each embedding to generate the query, key and value vectors. Each of these projections has its own set of learnable parameters, which allows the self-attention layer to focus on different aspects of the input sequence. Without them, as in the example above where query = key = value, each token attends almost entirely to itself.

Instead of one attention head we use multiple attention heads, because the softmax of a single head tends to focus on one aspect of the sentence. With multiple heads the model can focus on several aspects at once, for example one head on subject-verb relations, another on the placement of prepositions, and so on. We do not assign these roles by hand; the model learns the best way to use its attention heads.

The attention heads can be thought of as similar to the filters of a convolutional layer in a CNN: each one can learn to detect a different kind of pattern.

<figure>
<img src= "res\06_6_multi_head.png">
<figcaption align = "center"><b>Fig.5 -  Multi Head Attention</b></figcaption>
</figure>
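
The dimension bookkeeping behind the figure can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: it assumes the `bert-base-uncased` sizes (hidden size 768 and 12 attention heads) and uses zero-filled placeholder vectors instead of real head outputs.

```python
embed_dim = 768   # hidden size of bert-base-uncased
num_heads = 12    # number of attention heads in bert-base-uncased
assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
head_dim = embed_dim // num_heads

# Each head produces one vector of width head_dim per token (zeros as placeholders).
per_head_outputs = [[0.0] * head_dim for _ in range(num_heads)]

# Concatenating the heads along the feature axis restores the embedding width.
concatenated = [value for head in per_head_outputs for value in head]
print(head_dim, len(concatenated))  # 64 768
```

Because each head works in a smaller space of size `head_dim`, the total computation across all heads stays close to that of a single head working in the full `embed_dim`.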

Now let's see the code for the multi head attention:

In [None]:
# First we will create one attention head

class AttentionHead(nn.Module):
    def __init__(self, embed_dim, head_dim):
        super().__init__()
        self.query = nn.Linear(embed_dim, head_dim)
        self.key = nn.Linear(embed_dim, head_dim)
        self.value = nn.Linear(embed_dim, head_dim)

    def scaled_dot_product_attention(self, q, k, v):
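        '''
        Calculates the attention weights and returns the attention outputs for a given query, key and value
        Inputs:
            q: Query tensor of shape [batch_size, seq_len, head_dim]
            k: Key tensor of shape [batch_size, seq_len, head_dim]
            v: Value tensor of shape [batch_size, seq_len, head_dim]
        Outputs:
            attention_outputs: Attention outputs of shape [batch_size, seq_len, head_dim]
        '''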
        # Calculate the attention weights
        attention_scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(k.size()[-1])
        attention_weights = F.softmax(attention_scores, dim=-1)
        # Calculate the attention outputs
        attention_outputs = torch.bmm(attention_weights, v)
        return attention_outputs

    def forward(self, hidden_state):
        '''
        Calculates the attention for query, key and value vector of a hidden state
        '''

        attn_out = self.scaled_dot_product_attention(
            self.query(hidden_state),
            self.key(hidden_state),
            self.value(hidden_state)
        )

        return attn_out


In [None]:
## Now as one attention head is created we will create multiple attention heads

class MultiHeadAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embed_dim = config.hidden_size
        # Take the number of heads from the model config (12 for bert-base-uncased)
        self.num_heads = config.num_attention_heads
        self.head_dim = self.embed_dim // self.num_heads

        # Each AttentionHead owns its query/key/value projections,
        # so no separate projections are needed at this level.

        self.attention_heads = nn.ModuleList(
            [AttentionHead(self.embed_dim, self.head_dim) for _ in range(self.num_heads)]
        )

        self.multi_output_linear = nn.Linear(self.embed_dim, self.embed_dim)

    def forward(self, hidden_state):
        multi_out = torch.cat([attn_head(hidden_state) for attn_head in self.attention_heads], dim=-1)
        multi_out = self.multi_output_linear(multi_out)
        return multi_out

In [None]:
from transformers import AutoConfig
config = AutoConfig.from_pretrained("bert-base-uncased")
multi_head_attention = MultiHeadAttention(config)
attention_output = multi_head_attention(input_embeddings)
print("Attention Output Shape", attention_output.size())
print("Attention Output", attention_output)

Attention Output Shape torch.Size([1, 4, 768])
Attention Output tensor([[[ 0.0804,  0.2587,  0.1097,  ..., -0.0540, -0.2068,  0.2530],
         [ 0.0170,  0.1257,  0.1496,  ..., -0.0711, -0.2183,  0.3089],
         [ 0.0242,  0.3737,  0.0985,  ..., -0.1705, -0.2277,  0.2513],
         [ 0.0353,  0.1308,  0.1180,  ..., -0.0772, -0.2217,  0.3329]]],
       grad_fn=<AddBackward0>)


In [None]:
from bertviz import head_view
from transformers import BertModel, AutoTokenizer, BertConfig
config = BertConfig.from_pretrained('bert-base-uncased', 
output_hidden_states=True, output_attentions=True)
bertmodel = BertModel.from_pretrained('bert-base-uncased', config=config)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text1 = "Time flies like arrow"
text2 = "Jet flies like bullet"
input = tokenizer(text1, text2, return_tensors="pt")
attention_out = bertmodel(**input)
text2_start = (input.token_type_ids==0).sum(dim=1)
attention_out = attention_out[-1]
tokens = tokenizer.convert_ids_to_tokens(input.input_ids[0])
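# Pass the start index of the second sentence so the view can separate the two sentences
head_view(attention_out, tokens, sentence_b_start=text2_start, heads=[8])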

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



## FEED-FORWARD LAYER

- The feed-forward network in each encoder and decoder layer is a two-layer fully connected neural network.
- The difference from a usual feed-forward network is that it does not process the whole sequence of embeddings as a single vector; it processes each token embedding independently. For this reason it is often called a position-wise feed-forward layer.
- A rule of thumb is that the hidden size of the first layer should be four times the size of the embeddings.
- Here we use the GELU (Gaussian Error Linear Unit) activation function.
- This layer is hypothesized to be where most of the model's capacity and memorization lies.
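
To make the activation function concrete, here is a small plain-Python sketch of the exact GELU formula, `x * Phi(x)` with `Phi` the standard normal CDF, compared with ReLU. The sample inputs are arbitrary, and the sizes at the end assume the `bert-base-uncased` configuration used in this notebook.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu(x):
    """ReLU for comparison: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")

# The "four times" rule of thumb with the bert-base-uncased sizes:
hidden_size = 768
intermediate_size = 4 * hidden_size  # 3072
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output instead of cutting them off at zero.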

In [None]:
class FeedForwardNN(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.linear1 = nn.Linear(config.hidden_size, config.intermediate_size)
        self.linear2 = nn.Linear(config.intermediate_size, config.hidden_size)
        self.gelu = nn.GELU()
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_state):
        intermediate_output = self.linear1(hidden_state)
        intermediate_output = self.gelu(intermediate_output)
        final_output = self.linear2(intermediate_output)
        final_output = self.dropout(final_output)
        return final_output

In [None]:
feed_forward = FeedForwardNN(config)
feed_forward_out = feed_forward(attention_output)
print("Feed Forward Output Shape", feed_forward_out.size())
print("Feed Forward Output", feed_forward_out)

Feed Forward Output Shape torch.Size([1, 4, 768])
Feed Forward Output tensor([[[-0.0323,  0.0024, -0.0419,  ...,  0.0000,  0.0604,  0.0260],
         [-0.0346,  0.0055, -0.0254,  ...,  0.0260,  0.0373,  0.0303],
         [-0.0380,  0.0006, -0.0000,  ...,  0.0110,  0.0596,  0.0205],
         [-0.0416, -0.0088, -0.0141,  ...,  0.0211,  0.0463,  0.0178]]],
       grad_fn=<MulBackward0>)


## LAYER NORMALIZATION

- For efficient training the Transformer architecture uses layer normalization and skip connections.
- Layer Normalization is used to normalize each input in the batch to have zero mean and unit variance.
- Skip connections are used to pass a tensor to the next layer of the model without performing any processing onto it and then adding it to the processed tensor.
- In the literature, there are two ways to arrange the normalization layer in the encoder and decoder stack. They are
    + Post Layer Normalization
    + Pre Layer Normalization
- Post layer normalization is the arrangement used in the original Transformer paper, while pre layer normalization is the arrangement most commonly found in more recent literature.
- Post layer normalization places the normalization layer after the skip connection, i.e. after the output of the multi-head attention or feed-forward sublayer has been added to its input. This arrangement is tricky to train from scratch because the gradients can diverge, so it is usually combined with learning rate warm-up, where the learning rate is gradually increased from a small value to its maximum at the start of training.
- Pre layer normalization places the normalization layer inside the span of the skip connection, before each sublayer. It tends to be more stable during training and usually does not require learning rate warm-up.
- For now we will implement pre layer normalization.
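
The two arrangements differ only in where the normalization sits relative to the skip connection. The following plain-Python sketch shows this on a single token vector. It leaves out the learned scale and shift of a real layer normalization, assumes `eps=1e-5`, and uses a hypothetical `toy_sublayer` as a stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def post_ln(x, sublayer):
    """Post layer normalization: LayerNorm(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_ln(x, sublayer):
    """Pre layer normalization: x + Sublayer(LayerNorm(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def toy_sublayer(vec):
    """Hypothetical stand-in for attention or feed-forward: doubles each value."""
    return [2.0 * x for x in vec]

x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))
print(post_ln(x, toy_sublayer))
print(pre_ln(x, toy_sublayer))
```

In the pre layer normalization version the input `x` reaches the output unchanged through the skip connection, which is one intuition for why this arrangement is easier to train.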

### Now let's implement the Transformer Encoder layer

In [29]:
class EncoderLayer(nn.Module):

    def __init__(self, config):
        super().__init__()
        self.multi_head_attention = MultiHeadAttention(config)
        self.feed_forward = FeedForwardNN(config)
        self.layer_norm1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.layer_norm2 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

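    def forward(self, input_embs):
        # Pre layer normalization: normalize, apply the sublayer, then add the skip connection
        hidden_state = self.layer_norm1(input_embs)
        attention_out = input_embs + self.multi_head_attention(hidden_state)
        # Use this layer's own feed-forward module (not the global `feed_forward`)
        encoder_out = attention_out + self.feed_forward(self.layer_norm2(attention_out))
        return encoder_out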

In [30]:
encoder = EncoderLayer(config)

In [31]:
print("Input Embeddings Shape", input_embeddings.size())
print("Input Embeddings", input_embeddings)
encoder_out = encoder(input_embeddings)
print("Encoder Output Shape", encoder_out.size())
print("Encoder Output", encoder_out)

Input Embeddings Shape torch.Size([1, 4, 768])
Input Embeddings tensor([[[ 0.2709, -1.8986,  0.4015,  ..., -1.3549, -0.0362, -0.6397],
         [ 0.7809,  0.2935, -0.9990,  ..., -0.4781, -0.1515, -0.5107],
         [ 1.8515,  1.6065, -1.5233,  ..., -0.3414,  0.8963, -1.7680],
         [-1.6591,  2.7837, -3.0972,  ...,  1.1894,  1.3556,  1.2462]]],
       grad_fn=<EmbeddingBackward>)
Encoder Output Shape torch.Size([1, 4, 768])
Encoder Output tensor([[[-0.0612, -1.7635, -2.2012,  ..., -2.3367, -2.4065,  0.5385],
         [-0.5775, -0.3986, -0.5353,  ..., -0.9847, -1.4074, -0.3467],
         [ 0.3435,  0.9355, -1.7949,  ...,  0.3756, -0.0127, -0.6638],
         [-3.0363,  2.5317, -3.3584,  ...,  1.0563,  2.1554,  1.3877]]],
       grad_fn=<AddBackward0>)


## POSITIONAL EMBEDDINGS

- Token embeddings on their own carry no information about the order of the tokens, and the attention mechanism itself does not take token order into account.
- Positional embeddings are used to add position information to the token embeddings.
- Without them the model could not distinguish two sequences that contain the same words in a different order, even though word order is very important for the meaning.
- Here we use learnable position embeddings: each position index is mapped to a vector by an embedding layer, in the same way as token ids.
- The positional embeddings are added to the token embeddings and the sum is sent to the model as input embeddings:
```
Input embeddings = Token embeddings + Position embeddings
```
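
The effect of this sum can be seen on a toy example in plain Python. The two lookup tables below hold made-up 2-dimensional values (a real model learns them), and `input_embeddings` is an illustrative helper, not a library function.

```python
# Made-up 2-dimensional lookup tables; a real model learns these values.
token_table = {"time": [0.1, 0.2], "flies": [0.3, 0.4], "like": [0.5, 0.6]}
position_table = [[0.0, 0.01], [0.0, 0.02], [0.0, 0.03], [0.0, 0.04]]

def input_embeddings(tokens):
    """Add the position vector for index i to the token vector of tokens[i]."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[i])]
        for i, tok in enumerate(tokens)
    ]

a = input_embeddings(["time", "flies", "like"])
b = input_embeddings(["flies", "like", "time"])

# "flies" has the same token embedding in both sequences,
# but a different input embedding because its position changed.
print(a[1], b[0])
```

The same word therefore enters the encoder with a different vector depending on where it appears, which is what lets the model tell the two word orders apart.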

In [None]:
class Embedding(nn.Module):
    def __init__(self,config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, input_ids):
        input_shape = input_ids.size()
        seq_length = input_shape[1]
        position_ids = torch.arange(seq_length, dtype=torch.long, device=input_ids.device)
        position_ids = position_ids.unsqueeze(0).expand(input_shape)
        embeddings = self.word_embeddings(input_ids) + self.position_embeddings(position_ids)
        embeddings = self.dropout(embeddings)
        embeddings = self.layer_norm(embeddings)
        return embeddings

In [None]:
embedding = Embedding(config)
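# `input` is the two-sentence encoding created in the bertviz head_view cell above (11 tokens)
embedding_out = embedding(input.input_ids)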
print("Embedding Output Shape", embedding_out.size())
print("Embedding Output", embedding_out)

Embedding Output Shape torch.Size([1, 11, 768])
Embedding Output tensor([[[ 0.8742,  0.9418,  0.7568,  ..., -0.0153,  0.0468, -1.9008],
         [ 0.6883, -0.9588,  1.9930,  ..., -1.1918,  1.8298, -1.9503],
         [-0.2955, -0.7558,  0.8221,  ..., -0.2064,  1.0719, -0.0103],
         ...,
         [-2.1075,  0.2146,  0.6799,  ...,  0.0504, -0.0317, -0.1086],
         [ 0.5274, -0.1322, -0.5101,  ...,  0.7027, -0.7478,  1.3452],
         [-0.7955,  2.7496,  1.2764,  ..., -0.8424,  0.1478, -0.5444]]],
       grad_fn=<NativeLayerNormBackward>)


## TRANSFORMER ENCODER

In [32]:
class Encoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding = Embedding(config)
        self.layers = nn.ModuleList([EncoderLayer(config) for _ in range(config.num_hidden_layers)])

    def forward(self, input_ids):
        embeddings = self.embedding(input_ids)
        encoder_out = embeddings
        for layer in self.layers:
            encoder_out = layer(encoder_out)
        return encoder_out

In [33]:
encoder = Encoder(config)
encoder_out = encoder(inputs.input_ids)
print("Encoder Output Shape", encoder_out.size())
print("Encoder Output", encoder_out)

Encoder Output Shape torch.Size([1, 4, 768])
Encoder Output tensor([[[ 12.8558, -11.7393, -11.3893,  ..., -11.3709,  -2.6956,   9.0815],
         [ -9.5084,  -2.2702,   5.6703,  ..., -12.5174,  -7.8584,  -3.6668],
         [  2.1241, -15.4188, -10.0014,  ...,  -9.1864,  -8.4766,   1.8402],
         [  2.6240, -11.7074,  -3.2278,  ..., -13.4729, -14.5102,   7.8571]]],
       grad_fn=<AddBackward0>)


## ATTACHING A TASK HEAD (CLASSIFICATION)

- At the end of the model, a task-specific head is attached; different tasks use different heads.
- Here the task head is a classification head: a dropout layer followed by a linear layer.
- It takes the hidden state of the first token of the sequence (by convention the [CLS] token) and predicts the class of the input sequence.

In [34]:
class TransformerClassifier(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = Encoder(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, input_ids):
        transformer_out = self.transformer(input_ids)[:, 0, :]
        logits = self.classifier(self.dropout(transformer_out))
        return logits

In [35]:
config.num_labels = 2 # Setting it as a binary classification problem
model = TransformerClassifier(config)
classify_out = model(inputs.input_ids)

print("Classification Output Shape", classify_out.size())
print("Classification Output", classify_out)

Classification Output Shape torch.Size([1, 2])
Classification Output tensor([[4.2532, 0.8641]], grad_fn=<AddmmBackward>)
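
The classifier returns raw logits, one per class. A softmax turns them into probabilities, as in the plain-Python sketch below, which uses the two logits printed above. Since the model here is randomly initialised and untrained, the resulting probabilities carry no meaning yet.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.2532, 0.8641]  # values printed by the cell above (untrained model)
probs = softmax(logits)
# The predicted class is the index with the highest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```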
