# Transformer

## Language Modeling 

https://lena-voita.github.io/nlp_course/language_modeling.html


# Why Transformers?

Let's first go over the basics of RNNs and sequence-to-sequence models.

## RNN 
An RNN resembles an n-gram model, except that the output for the current input depends on the outputs of all previous computations, not just on a fixed window of preceding words.

- It lets the network keep a history of previous inputs in its hidden state and use it to predict the next output. This preserves word order, and because words are passed to the model one at a time, the same weights are reused at every step.

Types of RNN

1. One to One 
2. One to many: Music generation
3. Many to One: Sentiment analysis
4. Many to many: Named-entity Recognition, Machine Translation. 


**Its Variants**
1. LSTM
2. GRU

**Problems**

1. Vanishing or exploding gradients problem. 
2. We can’t parallelize the computations, as the output depends on previous calculations.


## Sequence Modeling

Sequence modeling is the ability to estimate how probable (likely) a given sequence is.

Sequential data has three properties:

1. Elements in the sequence can repeat
2. It follows order (contextual arrangement)
3. Length of data varies (potentially infinitely)

![seqvssupervised](./Sequence-modeling.jpg)


### Modeling Sequence 

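<b>1. Conditional Probability</b>

The probability of a whole sequence is the product of the probabilities of each element given all the elements before it (the table below writes "given" as "/"):

<b>p(x1, …, xT) = p(x1) · p(x2 | x1) · … · p(xT | x1, …, xT-1)</b>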

eg:
Cryptocurrency is the next big ______


|Target        |p(x/context)|
|--------------|------------|
|<b>Cryptocurrency</b>|p(x1)|
|Cryptocurrency <b>is</b>|p(x2/x1)|
|Cryptocurrency is <b>the</b>|p(x3/x2,x1)|
|Cryptocurrency is the <b>next</b>|p(x4/x3,x2,x1)|
|Cryptocurrency is the next <b>big</b>|p(x5/x4,x3,x2,x1)|
|Cryptocurrency is the next big <b>thing</b>|p(x6/x5,x4,x3,x2,x1)|

    
This approach works well for a few short sentences and captures the structure of the data very well. When we deal with whole paragraphs, however, the conditioning context keeps growing, and the approach no longer scales.
    
    
<b>2. N-grams</b>
    
To counter the scalability issue, NLP (natural language processing) researchers introduced the idea of N-grams, where only a window of n words is taken into account for the conditional and joint probabilities. For instance, if n = 2 (a bigram model), each word is conditioned only on the single previous word; if n = 3 (a trigram model), it is conditioned on the previous two words instead of the entire sentence.
    
This approach reduces the scalability issue, but not completely. 

The disadvantages of N-grams are:

1. Context is lost when the sentence is longer than the window of n words.
2. The scalability issue is only partly reduced, because the number of possible N-grams still grows rapidly with n.
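
The N-gram idea can be sketched for n = 2 (a bigram model) as follows. This is only a minimal illustration: the toy corpus is invented, and the helper name `bigram_prob` is hypothetical. Probabilities are plain maximum-likelihood estimates from counts, with no smoothing.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "cryptocurrency is the next big thing",
    "ai is the next big thing",
    "cryptocurrency is the future",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    unigram_counts.update(words[:-1])            # count words that have a successor
    bigram_counts.update(zip(words, words[1:]))  # count adjacent word pairs


def bigram_prob(prev_word, word):
    """Maximum-likelihood estimate of p(word | prev_word)."""
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen context: no estimate without smoothing
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]


p_thing = bigram_prob("big", "thing")   # "big" is always followed by "thing" here
p_next = bigram_prob("the", "next")     # "the" is followed by "next" in 2 of 3 cases
print(p_thing, p_next)
```

Because only the previous word is used, the model cannot tell "cryptocurrency is the" from "ai is the": the lost context is exactly the first disadvantage listed above.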
    
    
    
<b>3. Context Vectorizing</b>

Context vectorizing is an approach where the input sequence is summarized into a single vector, which is then used to predict the next word.
    
![contextvect](./Context-vectorizing.jpg)
    
The advantages of context vectorizing are:

1. Order is preserved
2. It can operate on variable-length sequences
3. It is differentiable, so it can be trained with backpropagation
4. Context is preserved in short sentences or sequences. 
    

How this relates to RNNs:

1. The context vector acts as a “memory” that captures information about what has been calculated so far. It enables RNNs to remember past inputs and to carry information across sequences of variable length.
2. Because of that, RNNs can take one or multiple input vectors and produce one or multiple output vectors. 

    
***Which layer is used as the context vector?***
    
The hidden state h(t) represents a contextual vector at time t and acts as “memory” of the network.
    
We denote a hidden state using this formula:

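<b>h<sub>t</sub> = tanh(W<sub>h</sub>h<sub>t-1</sub> + W<sub>x</sub>x<sub>t</sub>)</b>

<b>When t = 1,</b>

<b>h<sub>1</sub> = tanh(W<sub>h</sub>h<sub>0</sub> + W<sub>x</sub>x<sub>1</sub>), where x<sub>1</sub> is ‘Cryptocurrency’, and h<sub>0</sub> is initialised as zero</b>

<b>When t = 2,</b>

<b>h<sub>2</sub> = tanh(W<sub>h</sub>h<sub>1</sub> + W<sub>x</sub>x<sub>2</sub>), where x<sub>2</sub> is ‘is’.</b>

<b>When t = 3,</b>

<b>h<sub>3</sub> = tanh(W<sub>h</sub>h<sub>2</sub> + W<sub>x</sub>x<sub>3</sub>), where x<sub>3</sub> is ‘the’.</b>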
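
The recurrence above can be unrolled in a few lines of plain Python. This is a sketch under stated assumptions: the weight values, the hidden size of 2 and the one-hot inputs are all made up for illustration (real weights are learned), and the helper names `matvec` and `rnn_step` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(h_prev, x_t, W_h, W_x):
    """One recurrence step: h_t = tanh(W_h h_{t-1} + W_x x_t)."""
    recurrent = matvec(W_h, h_prev)
    from_input = matvec(W_x, x_t)
    return [math.tanh(r + i) for r, i in zip(recurrent, from_input)]


# Hypothetical weights (hidden size 2, input size 3).
W_h = [[0.5, -0.1],
       [0.2, 0.3]]
W_x = [[0.4, 0.0, -0.2],
       [0.1, 0.3, 0.5]]

# One-hot vectors standing in for "Cryptocurrency", "is", "the".
inputs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

h = [0.0, 0.0]  # h_0 is initialised to zero
hidden_states = []
for x_t in inputs:
    h = rnn_step(h, x_t, W_h, W_x)  # the same W_h and W_x are reused at every step
    hidden_states.append(h)

print(hidden_states)
```

Each entry of `hidden_states` is the context vector after reading one more word; the last one summarizes the whole prefix.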

## Backpropagation in RNNs and Its Problems


## Backpropagation
In an RNN, backpropagation runs from right to left, i.e. backwards through the time steps, minimizing the loss using gradients. Because the parameters are shared across time steps, the network is unfolded in time to compute the gradients; this is called backpropagation through time (BPTT).

The loss function used is cross-entropy.

![unfold](./Unfolded-recurrent-network.jpg)

![bptt](./RNN_backpropation_equation.jpg)


## Issues

<b>1. Vanishing Gradients</b>


<b>The vanishing gradient problem occurs when the gradient goes to zero exponentially fast as it is propagated back through time, which in turn makes it difficult for the network to learn long-range dependencies.</b>

Coming to backpropagation in RNNs:

1. Every single neuron in the network participates in the calculation of the output, and therefore contributes to the cost function.
2. Because of that, we have to make sure that the parameters are updated for every neuron to minimize the error, and this goes back to all the neurons in time.
3. So the error has to be propagated all the way back through time to these neurons.


Contextual vector:

1. The parameters that produce the contextual vector (the hidden state) are shared across all time steps to preserve order and continuity.
2. During initialization, these weights are assigned random values close to zero. As the hidden state moves forward in time, it is multiplied by the same recurrent weight W<sub>h</sub> at every time step.
3. During backpropagation the gradient is therefore multiplied by W<sub>h</sub> (and by the tanh derivative, which is at most 1) once per time step, so it becomes smaller and smaller until it effectively vanishes.
4. The smaller the gradient is, the harder it is for the network to update the weights, and if the gradient is zero, the weights will not be updated.



<b>2. Exploding gradients</b>

Exploding gradients occur when large gradients accumulate due to an unstable process, and result in very large updates to the parameters.

1. In RNNs, exploding gradients can occur during backpropagation through time, when the repeated multiplication makes the gradient grow instead of shrink, which leads to very large updates to the network parameters.
2. At an extreme, the values of the weights can become so large that they overflow to NaN.
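
Both issues can be seen in a deliberately simplified setting. The sketch below assumes a scalar, linear RNN `h_t = w * h_(t-1)` (no tanh, no input), so the derivative of the last hidden state with respect to the first is just `w` multiplied by itself once per time step. The function name `gradient_factor` is hypothetical.

```python
def gradient_factor(recurrent_weight, num_steps):
    """Magnitude of d h_T / d h_0 for the scalar linear RNN h_t = w * h_{t-1}.

    Each time step contributes one multiplication by w, so the result is |w| ** T.
    """
    return abs(recurrent_weight) ** num_steps


for w in (0.5, 1.0, 1.5):
    factors = [gradient_factor(w, steps) for steps in (1, 10, 50)]
    print(f"w = {w}: {factors}")
```

With `w` below 1 the factor collapses towards zero (vanishing), with `w` above 1 it blows up (exploding), and only `w` equal to 1 keeps it stable. A real RNN has matrices and nonlinearities, but the same repeated-multiplication effect is at work.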


## Overcoming gradient issues

Gradient issues in RNNs can be mitigated with:

1. Gradient clipping

2. Gated networks 


<b>Gradient clipping</b>

Gradient clipping is a technique used to avoid exploding gradients. It’s fair to assume that RNNs behave in an approximate linear fashion, which makes the gradient unstable. 

In order to control the gradient, it’s clipped, i.e. rescaled to a smaller value. There are two ways to clip gradients:

1. Clip the gradient from a mini-batch element-wise just before the parameter is updated.
2. Clip the norm ||g|| of the gradient g against a threshold hyperparameter C: if ||g|| > C, then g ← g · C / ||g||.
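
The two clipping strategies can be sketched on a plain list of numbers. This is an illustration only, with hypothetical function names; in PyTorch the corresponding utilities are `torch.nn.utils.clip_grad_value_` and `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_value(gradient, limit):
    """Element-wise clipping: force every component into [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradient]


def clip_by_norm(gradient, max_norm):
    """Norm clipping: if ||g|| > C, rescale g to g * C / ||g||."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > max_norm:
        return [g * max_norm / norm for g in gradient]
    return list(gradient)  # already small enough, leave unchanged


g = [3.0, 4.0]                        # ||g|| = 5
clipped_value = clip_by_value(g, 1.0)  # each component capped at 1
clipped_norm = clip_by_norm(g, 1.0)    # rescaled so that the norm is 1
print(clipped_value, clipped_norm)
```

Norm clipping keeps the direction of the gradient and only shortens it, whereas element-wise clipping can change the direction.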

## LSTM
This particular kind of RNN adds a forget mechanism. The LSTM unit is organized into cells.

Each cell takes three inputs:

1. the current input,
2. the hidden state of the previous step,
3. the memory state of the previous step.


These inputs go through gates: 

1. input gate, 
2. forget gate, 
3. output gate. 
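
A single LSTM step can be sketched with scalars, assuming the standard LSTM formulation (sigmoid gates, tanh candidate). All parameter values below are hypothetical, and the function names are chosen for this illustration only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    Inputs: current input x_t, previous hidden state h_prev,
    previous memory state c_prev. Returns the new (h_t, c_t).
    """
    def pre_activation(name):
        w_x, w_h, bias = params[name]
        return w_x * x_t + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))      # how much old memory to keep
    write = sigmoid(pre_activation("input"))        # how much new information to store
    expose = sigmoid(pre_activation("output"))      # how much memory to reveal
    candidate = math.tanh(pre_activation("candidate"))

    c_t = forget * c_prev + write * candidate       # new memory state
    h_t = expose * math.tanh(c_t)                   # new hidden state
    return h_t, c_t


# Hypothetical (w_x, w_h, bias) per gate; real values are learned.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, -0.2, 0.0),
    "output": (0.3, 0.4, 0.0),
    "candidate": (0.9, 0.2, 0.0),
}

h, c = 0.0, 0.0
for x_t in (1.0, -0.5, 0.25):
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

When the forget gate is close to 1 and the input gate close to 0, the memory state is copied forward almost unchanged, which is what lets information survive over many steps.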


Benefits

1. LSTMs were designed to overcome the vanishing gradient problem of the plain RNN. Exploding gradients can still occur and are usually handled with gradient clipping.

Problems

1. No parallelization, we still have a sequential path for the data, even more complicated than before.
2. Hardware resources are still a problem.

## Encoder-decoder sequence-to-sequence architecture
The advantage of using RNNs in sequence modeling is that they can:

1. Map an input sequence to a fixed-size vector
2. Map a fixed-size vector to a sequence
3. Map an input sequence to an output sequence of the same length

But let’s say we want to train an RNN to map an input sequence to an output sequence that is not necessarily of the same length. This comes up especially when we want to translate from one language to another.

Encoder-decoder sequence-to-sequence is an architecture that deals with this type of problem. As the name suggests, it has two components: an encoder and a decoder.

The encoder RNN receives the input sequence of variable length and processes it to return a vector or a sequence of vectors called the “context” C.

The decoder RNN is conditioned on that fixed-length vector to generate an output sequence. Also, the last hidden state of the encoder is the initial hidden state of the decoder. 
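
The flow of data through the two RNNs can be sketched with a scalar toy model. Everything here is an assumption made for illustration: the weights are arbitrary constants, the "tokens" are plain floats, and the decoder runs for a fixed number of steps, whereas a real decoder would stop when it emits an end-of-sequence token.

```python
import math


def encode(inputs, w_h=0.5, w_x=1.0):
    """Encoder RNN: fold a variable-length sequence into one context value."""
    h = 0.0
    for x_t in inputs:
        h = math.tanh(w_h * h + w_x * x_t)
    return h  # the last hidden state serves as the context C


def decode(context, num_steps, u_s=0.8, u_y=0.3, v=2.0):
    """Decoder RNN: start from the context and emit one output per step."""
    s = context      # decoder state initialised from the encoder's last state
    y_prev = 0.0     # stand-in for a start-of-sequence token
    outputs = []
    for _ in range(num_steps):
        s = math.tanh(u_s * s + u_y * y_prev)  # previous output is fed back in
        y_prev = v * s
        outputs.append(y_prev)
    return outputs


source = [0.2, -0.1, 0.4, 0.3, -0.5]    # input sequence of length 5
context = encode(source)
target = decode(context, num_steps=3)   # output sequence of length 3
print(context, target)
```

The input and output lengths are independent because the only thing passed between the two networks is the fixed-size context.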

![encoderdecoder](./Encoder-decoder.jpg)

Reference :

Explanation and code    

https://neptune.ai/blog/recurrent-neural-network-guide

https://www.tensorflow.org/guide/keras/rnn

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

https://towardsdatascience.com/how-to-use-torchtext-for-neural-machine-translation-plus-hack-to-make-it-5x-faster-77f3884d95



## Let's Begin: Transformers

The Transformer model consists of two main components:

1. Encoder

2. Decoder 

Encoder

The encoder is composed of a stack of multiple identical layers. Each layer contains two sublayers: a multi-head self-attention mechanism and a simple position-wise fully connected feed-forward network. Each sublayer is wrapped in a residual connection followed by layer normalization.

1. The input will be the word embeddings for the first layer. For subsequent layers, it will be the output of the previous layer.

2. Inside each layer, first the multi-head self-attention is computed using the inputs for the layer as keys, queries and values.

3. The output of #2 is sent to a feed-forward network layer.

4. Here every position (i.e. every word representation) is fed through the same feed-forward network, which consists of two linear transformations with a ReLU in between (input vector -> linear transformation -> ReLU -> linear transformation -> output).
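
The position-wise feed-forward network from step 4 can be sketched as follows. The sizes and weight values are hypothetical and tiny (model dimension 2, inner dimension 3); in the original paper the model dimension is 512 and the inner dimension is 2048. Weights are stored as one row per output unit.

```python
def linear(vector, weights, bias):
    """Affine map: one output value per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = W2 * relu(W1 * x + b1) + b2, applied to a single position."""
    hidden = [max(0.0, z) for z in linear(x, W1, b1)]  # first linear, then ReLU
    return linear(hidden, W2, b2)                      # second linear


# Hypothetical weights: model dimension 2, inner dimension 3.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 1.0, 1.0], [0.5, -0.5, 0.0]]
b2 = [0.0, 0.1]

sequence = [[1.0, 2.0], [-1.0, 0.5]]  # two positions
# The same weights are applied to every position independently.
outputs = [feed_forward(x, W1, b1, W2, b2) for x in sequence]
print(outputs)
```

Because no information flows between positions in this sublayer, all positions can be processed in parallel; mixing information across positions is the job of the attention sublayer.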


Decoder 

1. For the first layer, the input will be the embeddings of the words generated so far. For subsequent layers, it will be the output of the previous layer.

2. Inside each layer, first the multi-head self-attention is computed using the inputs for the layer as keys, queries and values (i.e. the decoder outputs generated so far; the remaining, future positions are masked out).

3. The output of #2 is sent to a “multi-head-encoder-decoder-attention” layer. Here yet another attention is computed using #2 outputs as queries and encoder outputs as keys and values.

4. The output of #3 is sent to a position-wise feed-forward network layer like in the encoder.
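
The masking in step 2 can be sketched as below: position i is only allowed to attend to positions up to and including i. The scores are hypothetical, and the helper names are made up for this illustration. Setting a masked score to a very large negative number before the softmax has the same effect as giving it zero weight, which is what this sketch does directly.

```python
import math


def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]


def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, giving zero weight to masked positions."""
    exps = [math.exp(s) if keep else 0.0 for s, keep in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(3)
scores = [[0.1, 0.9, 0.3],
          [0.2, 0.2, 0.7],
          [0.5, 0.1, 0.4]]  # hypothetical raw attention scores
weights = [masked_softmax(row, m) for row, m in zip(scores, mask)]
for row in weights:
    print(row)
```

The first position can only see itself, so all of its weight lands there; later positions spread their weight over everything generated so far.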



### Attention
This model was inspired by the human visual system. The brain receives far more information from the eyes than it can process at a time, and the attention cues in the visual system make humans capable of paying attention to only a fraction of what the eyes receive.


![Attention](self_attention.jpg)

The Transformer model uses a “Scaled Dot Product” attention mechanism.

We can apply this methodology to the problem at hand. If we know the parts that can affect our translation, we can focus on those parts and ignore the other useless information. 

This will affect the system’s performance. While you’re reading this article, you’re paying attention to this article and ignoring the rest of the world. This comes with a cost that can be described as the opportunity cost. 

We can select from different types of attention mechanisms, like attention pooling and fully-connected layers.

In attention pooling, the inputs to the attention system can be divided into three types:

1. the Keys (the nonvolitional cues),
2. the Queries (the volitional cues),
3. the Values (the sensory inputs).



### Different Types of Attention

![attention_types](./attention_types.jpg)


### Self Attention


<b>High Level Understanding</b>

Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.
To explore this further, see this [notebook](https://colab.research.google.com/github/tensorflow/tensor2tensor/blob/master/tensor2tensor/notebooks/hello_t2t.ipynb).


<span style="color: orange;">“The animal didn't cross the street because it was too tired”</span>


What does <span style="color: orange;">“it”</span> in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word <span style="color: orange;">“it”</span> , self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.



<b>In Detail</b>

<b>Steps</b>
  1. <b>Calculate three vectors</b> from each encoder input vector (i.e. the embedding of each word):
 
    1. Query 
    2. Key
    3. Value

Note: these new vectors are smaller in dimension than the embedding vector. In the paper their dimensionality is 64, while the embedding and encoder input/output vectors have a dimensionality of 512.

  2. The second step in calculating self-attention is to <b>calculate a score</b>. Say the input is the two-word phrase “Thinking Machines” and we’re calculating the self-attention for its first word, <span style="color: orange;">“Thinking”</span>. We need to score each word of the input sentence against this word. The score determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

    i. The score is calculated by taking the <b>dot product of the query vector with the key vector</b> of the respective word we’re scoring. 

    ii. So if we’re processing the self-attention for the word in position #1, the first score would be the dot product of q1 and k1. The second score would be the dot product of q1 and k2.  



  3. The third step is to <b>divide the scores by 8</b> (the square root of the dimension of the key vectors used in the paper, 64). This leads to having more stable gradients. There could be other possible values here, but this is the default.
  
  
  4. The fourth step is to pass the result through a <b>softmax operation.</b> Softmax normalizes the scores so they’re all positive and add up to 1.
  

  5. Multiply each value vector by the softmax score (in preparation to sum them up). The intuition here is to keep intact the values of the word(s) we want to focus on, and drown-out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).
  

  6. Sum up the weighted value vectors. This produces the output of the self-attention layer at this position (for the first word).
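
Steps 2 to 6 can be traced by hand on a tiny example. The sketch assumes step 1 has already produced the query, key and value vectors; the numbers are hypothetical, and the key dimension is 4 (so the scaling factor is 2) rather than the 64 used in the paper.

```python
import math


def softmax(values):
    peak = max(values)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_output(query, keys, values):
    """Steps 2-6 for a single query position."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]  # step 2
    scaled = [s / math.sqrt(d_k) for s in scores]                      # step 3
    weights = softmax(scaled)                                          # step 4
    weighted = [[w * v for v in value]                                 # step 5
                for w, value in zip(weights, values)]
    output = [sum(column) for column in zip(*weighted)]                # step 6
    return output, weights


# Hypothetical vectors for a two-word input.
q1 = [1.0, 0.0, 1.0, 0.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

z1, attention_weights = self_attention_output(q1, keys, values)
print(attention_weights, z1)
```

Here the raw scores are 2 and 0, which become 1 and 0 after scaling, so the first word receives the larger softmax weight and the output `z1` lies closer to the first value vector.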
 

  
![selfattention](./self-attention-output.jpg)



```python
# Source: http://nlp.seas.harvard.edu/2018/04/03/attention.html#background

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import math, copy, time
from torch.autograd import Variable
import matplotlib.pyplot as plt
import seaborn
seaborn.set_context(context="talk")
# In a Jupyter notebook, also run the magic: %matplotlib inline
```



```python
def clones(module, N):
    "Produce N identical layers (helper defined in the same source)."
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])


def attention(query, key, value, mask=None, dropout=None):
    "Compute 'Scaled Dot Product Attention'"
    d_k = query.size(-1)
    # Dot product of every query with every key, scaled by sqrt(d_k).
    scores = torch.matmul(query, key.transpose(-2, -1)) \
             / math.sqrt(d_k)
    if mask is not None:
        # Masked positions get a very negative score, so softmax gives them ~0 weight.
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn


class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        "Take in model size and number of heads."
        super(MultiHeadedAttention, self).__init__()
        assert d_model % h == 0
        # We assume d_v always equals d_k
        self.d_k = d_model // h
        self.h = h
        # Three input projections (query, key, value) plus one output projection.
        self.linears = clones(nn.Linear(d_model, d_model), 4)
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)
        
    def forward(self, query, key, value, mask=None):
        "Implements Figure 2"
        if mask is not None:
            # Same mask applied to all h heads.
            mask = mask.unsqueeze(1)
        nbatches = query.size(0)
        
        # 1) Do all the linear projections in batch from d_model => h x d_k 
        query, key, value = \
            [l(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
             for l, x in zip(self.linears, (query, key, value))]
        
        # 2) Apply attention on all the projected vectors in batch. 
        x, self.attn = attention(query, key, value, mask=mask, 
                                 dropout=self.dropout)
        
        # 3) "Concat" using a view and apply a final linear. 
        x = x.transpose(1, 2).contiguous() \
             .view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
```

```python
# nn.Transformer defaults to d_model=512, which matches the last dimension below.
transformer_model = nn.Transformer(nhead=16, num_encoder_layers=12)
src = torch.rand((10, 32, 512))   # (source length, batch size, d_model)
tgt = torch.rand((20, 32, 512))   # (target length, batch size, d_model)
out = transformer_model(src, tgt)
print(out.shape)                  # same shape as tgt: torch.Size([20, 32, 512])
```

```python
d_model=512

q_linear = nn.Linear(d_model, d_model)
print('query and type',q_linear,type(q_linear))

key_linear = nn.Linear(d_model, d_model)
print(key_linear)

a=[[1,2,3],[4,5,4]]
emb=nn.Embedding(3,3)
print(emb)

key=torch.tensor(data=a)
print(key)
print(key[-1])
# Swap the last two dimensions, as done for the keys inside attention().
key_transpose=key.transpose(-2, -1)
print(key_transpose)
```

Output:

```
query and type Linear(in_features=512, out_features=512, bias=True) <class 'torch.nn.modules.linear.Linear'>
Linear(in_features=512, out_features=512, bias=True)
Embedding(3, 3)
tensor([[1, 2, 3],
        [4, 5, 4]])
tensor([4, 5, 4])
tensor([[1, 4],
        [2, 5],
        [3, 4]])
```


https://neptune.ai/blog/comprehensive-guide-to-transformers

https://towardsdatascience.com/self-attention-and-transformers-882e9de5edda

# Finetuning 

https://www.youtube.com/watch?v=GSt00_-0ncQ

## BERT 

Sentence-BERT

https://www.pinecone.io/learn/sentence-embeddings/

