# Introduction

This is an implementation of [Attention is all you need](https://arxiv.org/abs/1706.03762), with references drawn from [Pytorch Seq2Seq - Transformer](https://github.com/SethHWeidman/pytorch-seq2seq/blob/master/6%20-%20Attention%20is%20All%20You%20Need.ipynb), [Harvard's Annotated Transformer](http://nlp.seas.harvard.edu/annotated-transformer/) and this brilliant blog post, [Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/).

It was written as an exercise to gain a deeper understanding of Transformer models by exploring their internal layers and implementing them for a translation task. The notebook is a follow-up to the first 3 notebooks, the last of which implemented a CNN Encoder-Decoder model with Attention.

Dataset used - [English - French Translations](https://www.kaggle.com/datasets/dhruvildave/en-fr-translation-dataset)

### Model architecture

The Transformer is an Encoder-Decoder Model ->

Basic Model flow = Input -> **Encoder** -> **Decoder** -> Output

    Encoder =
        Stack of 6 Encoders =
            Each Encoder =
                EncoderLayer(
                    Self-Attention -> Feed-Forward 
                    (each word embedding is processed independently and in parallel; only self-attention depends on the other words)
               )

The first encoder layer also contains the word embedding layer and the positional encoding, whose sum it receives as input. All the other encoder layers receive the output of the previous layer as input. We'll see the embedding layer in detail later.

    Decoder =
        Stack of 6 Decoders =
            Each Decoder =
                DecoderLayer(
                    Self-Attention -> Encoder-Decoder-Attention -> Feed-Forward
               )  
               
Note - Each internal representation is 512 in dim

## Encoder

#### Embedding layer in first encoder layer

* Embedding layer -> WordEmbedding
* Positional Encoding (formula given in the paper) -> PositionalEncoding (for each word position)

Input to Encoder = WordEmbedding + PositionalEncoding (each of size 512, summed element-wise) -> output (size 512)
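
As a rough sketch of this step, the snippet below builds the sinusoidal positional encoding from the paper (sine on even dimensions, cosine on odd ones) and adds it to the word embeddings. The function name `positional_encoding` and the random `word_embedding` matrix are assumptions made for illustration; in a real model the embeddings are learned.

```python
import numpy as np

def positional_encoding(sent_len: int, d_model: int = 512) -> np.ndarray:
    """Sinusoidal positional encoding, shape (sent_len, d_model). d_model must be even."""
    positions = np.arange(sent_len)[:, None]          # (sent_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model / 2), the "2i" in the paper
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((sent_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

# Hypothetical toy input: random values stand in for learned word embeddings.
rng = np.random.default_rng(0)
word_embedding = rng.normal(size=(5, 512))                 # (sent_len, 512)
encoder_input = word_embedding + positional_encoding(5)    # element-wise sum, still (5, 512)
```

Because the encoding depends only on the position and the dimension index, it can be computed once and reused for every sentence.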


### Encoder Layer

Note - Each encoder layer has the same architecture

Consider a sentence as a matrix of **sent_len X 512** (where sent_len is the length of the sentence and 512 is the size of each word's vector representation; e.g. for the first layer, the rows are the word embeddings).

Represent input matrix by **inp_mat**

Each encoder layer receives a matrix of this shape as input.

#### Self-Attention
* To calculate self-attention, 3 weight matrices are maintained: Query (Wq), Key (Wk) and Value (Wv), each of size **512 X 64** in the paper.
* **inp_mat** is multiplied with each weight matrix to create the Query (Qv), Key (Kv) and Value (Vv) matrices, each of size **sent_len X 64** (one 64-dim vector per word).
* Calculate the self-attention score of each word against every word in the sentence (including itself).
* Self-attention score calculation -

    For each word ->
        Multiply Qv (query) of the word with KvT (key vectors, transposed) of all the words
        -> doing this for every word at once returns the score matrix (Sv) of size -> sent_len X sent_len
        -> intuitively (the size sent_len X sent_len means one value for each word in the sentence w.r.t. every word)
        -> e.g. - sentence -> Good morning
        -> Sv could be =           Good  morning
                         Good      [[1.5,  2.7],
                         morning    [3.5,  0.3]]


* Divide Sv by 8 (the square root of the key vector size, 64). Per the paper, this gives more stable gradients.
* Pass Sv through a softmax to normalize the scores so that each row (the scores of one word against all words) adds up to 1.
* Matrix **Sv (size - sent_len X sent_len)** is multiplied with matrix **Vv (size - sent_len X 64)**. This produces the output of the self-attention layer. Output size - **sent_len X 64**
* Intuitively, this last step multiplies each word's value vector by the current word's attention score for it. Important words are highlighted because their attention scores are higher, while other words are diminished because their vectors get multiplied by values like 0.0001.
* Let's call this output matrix Z (size - **sent_len X 64**)
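
The steps above can be sketched for a single head in NumPy as follows. This is a minimal illustration, not the notebook's actual model code: the weights are random stand-ins for learned parameters, and the helper names `softmax` and `self_attention` are assumptions.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max keeps exp() numerically stable."""
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(inp_mat, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention. Returns (Z, Sv)."""
    Qv = inp_mat @ Wq                        # (sent_len, 64)
    Kv = inp_mat @ Wk                        # (sent_len, 64)
    Vv = inp_mat @ Wv                        # (sent_len, 64)
    Sv = Qv @ Kv.T / np.sqrt(Kv.shape[-1])   # (sent_len, sent_len), divided by sqrt(64) = 8
    Sv = softmax(Sv)                         # each row now sums to 1
    Z = Sv @ Vv                              # (sent_len, 64)
    return Z, Sv

# Hypothetical toy input: a 2-word sentence with random embeddings and weights.
rng = np.random.default_rng(0)
sent_len, d_model, d_k = 2, 512, 64
inp_mat = rng.normal(size=(sent_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))
Z, Sv = self_attention(inp_mat, Wq, Wk, Wv)
```

Row i of `Sv` holds the attention weights of word i over all words, so row i of `Z` is a weighted average of the value vectors.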

##### Multi-headed attention
* Following from the previous step -> instead of a single set of weight matrices (Wq, Wk, Wv), consider multiple sets of these matrices (in paper 8 sets of query, key and value matrices are used).
* Each of these sets is used separately to run the self-attention flow listed above and produce its own Z matrix as output -> (in paper Z1, Z2....Z8 -> 8 matrices).
* This is called multi-headed attention. It helps the model look at different patterns in the sentence, and perhaps consider different sub-sentence lengths in different heads.

##### Final Processing of Self-Attention
* Concatenate all Z matrices -> along the column -> matrix of size - sent_len X (64X8) = **sent_len X 512**
* Multiply with another weight matrix Wo (size 512 X 512, also learned along with the model) -> output O => size - **sent_len X 512**
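
A possible NumPy sketch of the multi-head flow (8 heads, concatenation, then the output projection) is shown below. The weights are random placeholders and `multi_head_attention` is a name chosen for this example; real implementations usually fuse the heads into one batched matrix multiply instead of looping.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with the usual max-subtraction for stability."""
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(inp_mat, heads, Wo):
    """heads is a list of (Wq, Wk, Wv) tuples, one tuple per attention head."""
    Z = []
    for Wq, Wk, Wv in heads:
        Qv, Kv, Vv = inp_mat @ Wq, inp_mat @ Wk, inp_mat @ Wv
        Sv = softmax(Qv @ Kv.T / np.sqrt(Kv.shape[-1]))
        Z.append(Sv @ Vv)                       # (sent_len, 64) per head
    Zconcat = np.concatenate(Z, axis=-1)        # (sent_len, 64 * 8) = (sent_len, 512)
    return Zconcat @ Wo                         # (sent_len, 512)

# Hypothetical toy setup with random weights instead of learned ones.
rng = np.random.default_rng(0)
sent_len, d_model, n_heads = 4, 512, 8
d_k = d_model // n_heads                        # 64
inp_mat = rng.normal(size=(sent_len, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))
    for _ in range(n_heads)
]
Wo = rng.normal(size=(d_model, d_model)) * 0.05
O = multi_head_attention(inp_mat, heads, Wo)
```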

#### Feed-Forward
* Now the output **O from Self-Attention** is passed to the feed-forward network. Each word's vector (one row of O) goes through the feed-forward network independently of the other words. **The same network (same weights) is applied at all sent_len positions**, so the rows can be processed in parallel.
* **Feed forward input 1 X 512**  -> **Feed forward output 1 X 512** (in the paper: two linear layers with a ReLU in between and an inner size of 2048)
* All word outputs together form an output Fi matrix of size -> **sent_len X 512** -> which is the output of Encoder layer i. This will serve as the input of the next encoder layer and is of the same dimension as input to the encoder layer.
* The size is kept constant across layers in the transformer.
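
To make the "per word" processing concrete, here is a small sketch of the position-wise feed-forward network as described in the paper: two linear layers with a ReLU in between, inner size 2048. In the paper the weights are shared across positions, so applying the network row by row gives the same result as applying it to the whole matrix at once. Weight values and variable names here are illustrative assumptions.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to the last dimension of x."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU, inner size 2048
    return hidden @ W2 + b2                 # back to size 512

# Hypothetical toy setup: random weights stand in for learned parameters.
rng = np.random.default_rng(0)
sent_len, d_model, d_ff = 3, 512, 2048
O = rng.normal(size=(sent_len, d_model))    # stand-in for the self-attention output
W1 = rng.normal(size=(d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

# Whole matrix at once ...
F_all = feed_forward(O, W1, b1, W2, b2)
# ... or one word (row) at a time with the same weights.
F_rows = np.stack([feed_forward(O[i], W1, b1, W2, b2) for i in range(sent_len)])
```

`F_all` and `F_rows` agree up to floating point error, which is why frameworks simply run the whole matrix through one pair of linear layers.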

#### Residuals (LayerNorm - Add & Normalize)
* The output of each sublayer (Self-Attention & Feed-Forward) of the encoder layer is summed element-wise with the input to that sublayer and then normalized.
* E.g. - the Self-Attention output Oi is added to the embedding in the first encoder layer (or to the previous layer's output Ei-1 in the other encoder layers) and normalized. The Feed-Forward output Fi is then added to that normalized Oi, and the sum is normalized again.
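
A minimal sketch of the Add & Normalize step, assuming a plain layer norm over the 512 features of each word. The learned gain and bias that real layer norm layers carry are left out here for brevity, and `eps` is an assumed small constant to avoid division by zero.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize each row (word) to zero mean and unit variance over its features."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(sublayer_input: np.ndarray, sublayer_output: np.ndarray) -> np.ndarray:
    """Residual connection followed by layer normalization."""
    return layer_norm(sublayer_input + sublayer_output)

# Hypothetical toy values: random matrices stand in for inp and the attention output O.
rng = np.random.default_rng(0)
inp = rng.normal(size=(3, 512))
attn_out = rng.normal(size=(3, 512))
normed = add_and_norm(inp, attn_out)    # shape stays (3, 512)
```

The residual sum only works because every sublayer keeps the 512-wide shape of its input.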

(ignoring embedding here in the first case for generic representation)
Encoder single layer process ->

    Input inp (sent_len X 512) ->
        Self-Attention ->
            W = 8 sets of (Wq, Wk, Wv)
            Z = []
            for (Wq, Wk, Wv) in W:
                Qv = inp X Wq (sent_len X 64)
                Kv = inp X Wk (sent_len X 64)
                Vv = inp X Wv (sent_len X 64)

                Sv = Qv X KvT (sent_len X sent_len)
                Sv = Sv/8  (for stable gradients)
                Sv = Softmax(Sv) (normalizes each row so its values sum to 1)
                Zi = Sv X Vv (sent_len X 64)
                Z.append(Zi)

            Zconcat = Z1.concat(Z2).concat(Z3)....concat(Z8)   (sent_len X 512)

            FI = Zconcat X Wo (sent_len X 512)

            FI = FI + inp (residual adding)
            FI = norm(FI) (layer norm)

        Feed-Forward ->
            FO = []
            for i in parallel_process(sent_len):
                FOi = FI[i] -> feed-forward layer (1 X 512), same weights for every i
                FO.append(FOi)

            FO (sent_len X 512)
            FO = FO + FI (residual summing)
            FO = norm(FO) (layer norm)

        Encoder layer output = FO
             



## Decoder

The decoder takes as input the target sentence tokens generated so far, along with the output of the top encoder layer. In a single run it outputs the vocabulary id of the next token, and it keeps producing tokens until a special token, generally the EOS token, is reached.

#### Embedding layer in first decoder layer for target sentence

* Embedding layer -> WordEmbedding
* Positional Encoding (formula given in the paper) -> PositionalEncoding (for each word position)

Input to Decoder = WordEmbedding + PositionalEncoding (each of size 512, summed element-wise) -> output (size 512)

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. **This is done by masking future positions (setting them to -inf)** before the softmax step in the self-attention calculation.


### Decoder Layer

Note - Each decoder layer has the same architecture

The target sentence is a matrix of **sent_len X 512** after passing through the embedding layer (where sent_len is the length of the sentence and 512 is the size of each word's vector representation).

Represent target sent matrix by **trg_mat**

Each decoder layer receives a matrix of this shape as input.

#### Self-Attention
* Self-Attention operates similarly to how it operates in encoder layer. Only difference is that in the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation.

* Output of self attention size - **sent_len X 512**
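
The masking step can be sketched like this. An upper-triangular boolean mask marks every "future" position, those scores are set to -inf, and the softmax then gives them a weight of exactly 0. The helper names `causal_mask` and `masked_softmax` are made up for this example.

```python
import numpy as np

def causal_mask(sent_len: int) -> np.ndarray:
    """True where a query position (row) would look at a later key position (column)."""
    return np.triu(np.ones((sent_len, sent_len), dtype=bool), k=1)

def masked_softmax(Sv: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Set masked scores to -inf, then apply a row-wise softmax."""
    Sv = np.where(mask, -np.inf, Sv)
    Sv = Sv - Sv.max(axis=-1, keepdims=True)   # the diagonal is never masked, so max is finite
    exp = np.exp(Sv)                           # exp(-inf) == 0
    return exp / exp.sum(axis=-1, keepdims=True)

# Equal raw scores make the effect of the mask easy to read.
scores = np.zeros((3, 3))
weights = masked_softmax(scores, causal_mask(3))
# Row 0 attends only to position 0, row 1 to positions 0-1, row 2 to positions 0-2.
```

With equal scores the weights come out as `[1, 0, 0]`, `[0.5, 0.5, 0]` and `[1/3, 1/3, 1/3]`, so each word spreads its attention only over itself and earlier words.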

#### Encoder-Decoder-Attention

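* Operates in the same way as self-attention, except that the Key and Value matrices are constructed from the encoder output (src_len X 512, where src_len is the length of the source sentence) and the Query matrix is constructed from the output of the decoder's self-attention sublayer just before it (sent_len X 512). No future masking is needed here, since the whole source sentence is already known.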
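
Here is a single-head sketch of encoder-decoder attention, using a source length that differs from the target length to make the shapes visible. The names `trg_len`, `src_len`, `dec_in` and `encoder_decoder_attention` are assumptions for this example, and the weights are random placeholders.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max-subtraction for stability."""
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def encoder_decoder_attention(dec_in, enc_out, Wq, Wk, Wv):
    """Queries come from the decoder, keys and values from the encoder output."""
    Qv = dec_in @ Wq                                   # (trg_len, 64)
    Kv = enc_out @ Wk                                  # (src_len, 64)
    Vv = enc_out @ Wv                                  # (src_len, 64)
    Sv = softmax(Qv @ Kv.T / np.sqrt(Kv.shape[-1]))    # (trg_len, src_len)
    return Sv @ Vv, Sv                                 # Z is (trg_len, 64)

# Hypothetical toy setup: 3 target words attending over 5 source words.
rng = np.random.default_rng(0)
trg_len, src_len, d_model, d_k = 3, 5, 512, 64
dec_in = rng.normal(size=(trg_len, d_model))    # stand-in for the decoder self-attention output
enc_out = rng.normal(size=(src_len, d_model))   # stand-in for the top encoder layer output
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) * 0.05 for _ in range(3))
Z, Sv = encoder_decoder_attention(dec_in, enc_out, Wq, Wk, Wv)
```

Each row of `Sv` is a distribution over the source words, and the output keeps one row per target word.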

### Final Linear Layer
* The output of the last decoder layer is passed through a linear layer, which takes a vector of size 512 (float values) and maps it to a vector of vocabulary size (1 X target_vocab_size).
* Softmax is applied to convert these values to probabilities that add up to 1.
* The index with the highest probability is considered as the index of the predicted word from the vocabulary.
* This word is fed, along with the previously generated words, to the decoder for the next run.
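
The following sketch shows the projection to vocabulary size plus a greedy decoding loop built on the "pick the highest probability, feed it back" idea. Everything here is a toy assumption: the vocabulary size, the start token id `sos_id`, and `toy_next_token_probs`, which stands in for a real decoder run.

```python
import numpy as np

def vocab_probabilities(decoder_vector, W_vocab, b_vocab):
    """Linear layer (512 -> target_vocab_size) followed by softmax."""
    logits = decoder_vector @ W_vocab + b_vocab
    exp = np.exp(logits - logits.max())     # subtract max for numerical stability
    return exp / exp.sum()

def greedy_decode(next_token_probs, sos_id, eos_id, max_len=20):
    """Repeatedly pick the most probable token and feed the sequence back in."""
    tokens = [sos_id]
    for _ in range(max_len):
        probs = next_token_probs(tokens)    # one decoder run on the tokens so far
        token = int(np.argmax(probs))
        tokens.append(token)
        if token == eos_id:                 # stop once the end token is produced
            break
    return tokens

# Projection of one decoder output vector with random (untrained) weights.
rng = np.random.default_rng(0)
d_model, target_vocab_size = 512, 10
W_vocab = rng.normal(size=(d_model, target_vocab_size)) * 0.05
b_vocab = np.zeros(target_vocab_size)
probs = vocab_probabilities(rng.normal(size=d_model), W_vocab, b_vocab)
predicted_id = int(np.argmax(probs))

# Toy stand-in for the decoder: always predicts "last token id + 1", capped at 4.
def toy_next_token_probs(tokens):
    one_hot = np.zeros(5)
    one_hot[min(tokens[-1] + 1, 4)] = 1.0
    return one_hot

decoded = greedy_decode(toy_next_token_probs, sos_id=0, eos_id=4)
```

Greedy selection is the simplest strategy; beam search is a common alternative that keeps several candidate sequences alive instead of one.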

(ignoring embedding here in the first case for generic representation)
Decoder single layer process ->

    Input - trg_mat (sent_len X 512, future tokens will be masked during the self-attention process)
          - enc_out (src_len X 512, src_len = source sentence length, to be used in encoder-decoder attention layer)
    ->
        Self-Attention ->
            W = 8 sets of (Wq, Wk, Wv)
            Z = []
            for (Wq, Wk, Wv) in W:
                Qv = trg_mat X Wq (sent_len X 64)
                Kv = trg_mat X Wk (sent_len X 64)
                Vv = trg_mat X Wv (sent_len X 64)

                Sv = Qv X KvT (sent_len X sent_len)
                Sv = Sv/8  (for stable gradients)
                Sv = mask(Sv) # future tokens are masked by setting their scores to -inf
                Sv = Softmax(Sv) (normalizes each row so its values sum to 1)
                Zi = Sv X Vv (sent_len X 64)
                Z.append(Zi)

            Zconcat = Z1.concat(Z2).concat(Z3)....concat(Z8)   (sent_len X 512)

            EDI = Zconcat X Wo (sent_len X 512)

            EDI = EDI + trg_mat (residual adding)
            EDI = norm(EDI) (layer norm)

        Encoder-Decoder-Attention ->
            W = 8 sets of (Wq, Wk, Wv) (separate from the self-attention weights)
            Z = []
            for (Wq, Wk, Wv) in W:
                Qv = EDI X Wq (sent_len X 64)
                Kv = enc_out X Wk (src_len X 64)
                Vv = enc_out X Wv (src_len X 64)

                Sv = Qv X KvT (sent_len X src_len)
                Sv = Sv/8  (for stable gradients)
                # no future mask here: every target position may attend to the whole source sentence
                Sv = Softmax(Sv) (normalizes each row so its values sum to 1)
                Zi = Sv X Vv (sent_len X 64)
                Z.append(Zi)

            Zconcat = Z1.concat(Z2).concat(Z3)....concat(Z8)   (sent_len X 512)

            FI = Zconcat X Wo (sent_len X 512)

            FI = FI + EDI (residual adding)
            FI = norm(FI) (layer norm)

        Feed-Forward ->
            FO = []
            for i in parallel_process(sent_len):
                FOi = FI[i] -> feed-forward layer (1 X 512), same weights for every i
                FO.append(FOi)

            FO (sent_len X 512)
            FO = FO + FI (residual summing)
            FO = norm(FO) (layer norm)

        Decoder layer output = FO

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session