### Questionnaire

1. What is the vanishing gradient problem?


2. Why are recurrent NNs more susceptible to this problem than a simple feed-forward network?


3. What are the different ways to solve the vanishing/exploding gradient problem?


4. How does an LSTM help solve the vanishing gradient problem?


5. What LSTM-based architecture would you design to measure the similarity between 2 sentences?


6. What architecture would you choose to extract the names of persons from a sentence?


7. What is a bi-directional LSTM, how does it work, and where can it help?


8. What is the encoder-decoder architecture? What are its applications?


9. What is teacher forcing? How is it useful?


10. What is the information bottleneck?

## Why LSTM?

**Problem with Vanilla RNN**
1. Struggles with longer sequences. **For longer sequences, the hidden state at later timesteps tends to lose information about early timesteps**.
2. **Vanishing and Exploding Gradients**: During backpropagation, weights receive updates proportional to the gradient at each timestep. During backpropagation through time, the gradient for early timesteps can become so small that it is insignificant. Similarly, the gradient can also explode. Both make learning unstable.
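
To get a feel for the second point, here is a rough numerical sketch. It assumes a chain with one sigmoid neuron per layer (or timestep), the same weight `w` everywhere and the same pre-activation `z` everywhere; these simplifications and the function names are only for illustration.

```python
import math

def sigmoid_prime(z):
    """Derivative of the logistic sigmoid; its maximum value is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def gradient_scale(w, depth, z=0.0):
    """Product of the per-layer factors w * sigmoid'(z) along a chain of `depth` layers.

    Assumption for illustration: every layer shares the same weight w and the
    same pre-activation z, so the product is (w * sigmoid'(z)) ** depth.
    """
    factor = w * sigmoid_prime(z)
    return factor ** depth

vanishing = gradient_scale(w=1.0, depth=20)   # factor 0.25 per layer
exploding = gradient_scale(w=8.0, depth=20)   # factor 2.0 per layer
print(vanishing, exploding)
```

With a per-layer factor below 1 the product shrinks towards 0 as depth grows, and with a factor above 1 it blows up.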
 

> **Intuition** : Think of a very deep feed-forward NN with a single neuron per layer. We look only at the bias terms here; the same reasoning applies to the weights.

> <img src="http://neuralnetworksanddeeplearning.com/images/tikz38.png">

> The gradient for the first layer is a product of one factor $W \sigma'(z)$ per layer, so as the number of layers grows:

>  $ \frac{\partial C}{\partial b_1} \to
\begin{cases}
\infty; |W \sigma'(z)| \gt 1 \\
 0; |W \sigma'(z)| \lt 1
\end{cases}, \frac{\partial C}{\partial w_1} \to
\begin{cases}
\infty; |W \sigma'(z)| \gt 1 \\
 0; |W \sigma'(z)| \lt 1
\end{cases} $
  
> Why is this more prevalent in RNNs than in feed-forward NNs? In an RNN the same recurrent weight is used at every timestep. If that weight is very small or very large, it is multiplied into the gradient for earlier timesteps many times over. 


> For more details : [http://neuralnetworksanddeeplearning.com/chap5.html](http://neuralnetworksanddeeplearning.com/chap5.html)

**Possible Solutions**
1. Use gradient clipping (for exploding gradient)
2. Use skip connections (direct connections from earlier layers, commonly used in computer vision, e.g. ResNets)
3. Use Gated Recurrent Networks (like GRU, LSTM)
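
As a minimal sketch of the first option, gradient clipping by norm can be written as below. The function name and the flat list of gradient values are assumptions made for illustration, not the API of any particular framework.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)     # original norm is 5.0
untouched = clip_by_global_norm([0.3, 0.4], max_norm=1.0)   # norm is already below 1.0
print(clipped, untouched)
```

Clipping keeps the direction of the gradient and only limits its length, which is why it helps against exploding gradients but does nothing for vanishing ones.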

**LSTM (Long Short Term Memory)**
<img src="https://cdn-images-1.medium.com/freeze/max/1000/1*msv51vP7ooRGGsKhiM7p9g.png?q=20" height="40%" width="40%">
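1. The key idea behind LSTM is the **cell state**, the horizontal line running through the top of the diagram. The LSTM adds or removes information from the cell state using gates.
2. 3 gates (Forget, Input, Output).
3. Refer to https://colah.github.io/posts/2015-08-Understanding-LSTMs/ for the detailed mathematical equations.
4. **How does LSTM solve the vanishing gradient problem?** : Notice that the cell state is not multiplied by any weight matrix; it is only scaled element-wise by the forget gate and updated additively ($c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$). So the gradient flowing along the cell state is preserved as long as the forget gate stays close to 1, which makes vanishing far less likely (exploding gradients can still occur and are usually handled with clipping). https://stats.stackexchange.com/questions/185639/how-does-lstm-prevent-the-vanishing-gradient-problem.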
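
The cell-state update can be sketched for scalar inputs and states as follows. The parameter names in `params` and their values are hypothetical, and the forget gate is deliberately biased towards 1 to show how the gradient survives along the direct cell-state path.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar input and state; p holds hypothetical weights."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate values
    c = f * c_prev + i * g       # additive cell-state update, no weight matrix on c_prev
    h = o * math.tanh(c)
    return h, c, f

params = {"wf": 0.0, "uf": 0.0, "bf": 5.0,   # forget gate stays close to 1
          "wi": 0.5, "ui": 0.5, "bi": 0.0,
          "wo": 0.5, "uo": 0.5, "bo": 0.0,
          "wg": 0.5, "ug": 0.5, "bg": 0.0}

h, c, path_gradient = 0.0, 0.0, 1.0
for x in [1.0] * 50:
    h, c, f = lstm_step(x, h, c, params)
    path_gradient *= f   # d c_t / d c_{t-1} along the direct cell-state path
print(path_gradient)
```

After 50 steps the product of forget-gate values is still of order 1, whereas 50 factors of 0.25 (the sigmoid chain from earlier) would be vanishingly small.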

## Applications of Recurrent Networks:

![](https://i.stack.imgur.com/b4sus.jpg)

### One to Many (Poem/Code Generation)

Generate a whole poem from a few initial words.

<img src="https://miro.medium.com/proxy/1*XuIizpk3Hb_wxzXG_mr1og.png">

### Many to One (Classification, Finding Similarity)

1. At each timestep a hidden state (some context) is generated.
2. We are mostly concerned with the last hidden state (the final context).
3. We apply a non-linear transformation to the final context to find the sentiment (classify).

<img src="https://www.researchgate.net/profile/Huy_Tien_Nguyen/publication/321259272/figure/fig2/AS:572716866433034@1513557749934/Illustration-of-our-LSTM-model-for-sentiment-classification-Each-word-is-transfered-to-a.png" width="70%" height="70%">

### Many to Many - Seq2Seq (NER, POS tagging)

1. Suppose we want to find the named entities in a sentence. 
2. For each word we get the corresponding hidden state, from which we determine the entity type.

<img src="https://github.com/adityajn105/NLP-projects/raw/master/Named%20Entity%20Recognition/arch.jpeg">

- **Bi-LSTM**
   1. Sometimes we also need the context of future words to solve our problem.
        
    Example : "Bank of America is the second largest financial institution in the United States."
    
    -here "Bank" on its own could mean a river bank, which would not be a named entity.
    
    -but after reading the whole sentence we find that it is part of a named entity (organization).
  
  2. A Bi-LSTM is not much different from an LSTM. It is just 2 independent passes starting from the 2 ends of the sequence. After that, the hidden states from the 2 passes are concatenated at each timestep.


- **Time-Distributed Layer**

    1. It is used to apply the same transformation to the output tensor of every timestep.
    
    2. The same can be achieved with a loop. TimeDistributed is just a wrapper for that loop.
    
    3. In this example we have used a Dense layer whose size equals the number of possible NER tags. 
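
The two ideas above (bidirectional passes and a time-distributed layer) can be sketched together. This is only an illustration: `run_rnn` is a toy scalar recurrence standing in for an LSTM, and all weights and the 3 output tags are made up.

```python
import math

def run_rnn(xs, w, u):
    """Toy scalar recurrent pass (stand-in for an LSTM): one hidden state per timestep."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        states.append(h)
    return states

def bidirectional(xs, fwd, bwd):
    """Run two independent passes from both ends and concatenate the states per timestep."""
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]   # reverse again to re-align with the input order
    return [[f, b] for f, b in zip(forward, backward)]

def time_distributed_dense(states, weights, bias):
    """Apply the same dense layer to every timestep; this is literally a loop."""
    outputs = []
    for s in states:
        outputs.append([sum(w_j * s_j for w_j, s_j in zip(row, s)) + b
                        for row, b in zip(weights, bias)])
    return outputs

xs = [0.5, -1.0, 2.0, 0.1]
states = bidirectional(xs, fwd=(0.8, 0.5), bwd=(0.3, -0.4))
tag_scores = time_distributed_dense(
    states,
    weights=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],   # 3 hypothetical tags, 2 features each
    bias=[0.0, 0.0, 0.1],
)
print(len(states), len(tag_scores[0]))
```

Each timestep ends up with a 2-element state (forward and backward) and 3 tag scores produced by the same weights.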

### Many to Many - Seq2Seq using Encoder-Decoder (Machine Translation)

1. Now we want to do language translation, where the input and output are not necessarily of the same length.

<img src="https://github.com/adityajn105/NLP-projects/raw/master/Eng-to-Spanish(Seq2Seq)/seq2seq.jpg">
    
- **Introduced by Google in 2014**
- **Encoder**
     * Maps variable-length sequences to a fixed-length memory/vector (called the context or thought vector). The context is just the hidden state from the last timestep. It is just a bunch of numbers representing the input sequence. 
     * Use padding and truncation to make input sequences a fixed length
     * Encoder sends context to the decoder

- **Decoder**
     * Decoder begins producing the output sequence item by item.
     * The decoder also maintains a hidden state that it passes from one timestep to the next. 

- **Teacher Forcing** 
     * Suppose that during training the decoder gives a wrong prediction at the second step. This wrong prediction goes into the next timestep, pushing later predictions further from the desired output. (example of a math problem!!)
     * During training, instead of feeding the decoder cell the word it generated in the previous timestep, we feed it the actual target word. (https://miro.medium.com/max/842/1*U3d8D_GnfW13Y3nDgvwJSw.png - Two people running on ground)
     * This is found to result in faster convergence.
     * However, during prediction you can only feed the word generated in the previous timestep.

- **Problem : Information Bottleneck**
     * Whether it is a 2-word sentence or a 30-word sentence, the information is encapsulated in a vector of the same size. Does that make sense?
     * Suppose the sequence is very long and some important information sits at the beginning of the sentence; that information will most probably be lost by the end. The encoder struggles to encode longer sequences.
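
The difference between teacher forcing (training) and free-running decoding (prediction) described above can be sketched with a toy decoder loop. The `toy_model` lookup table is a hypothetical, deliberately imperfect "model" used only to show which token is fed in at each step.

```python
def decode(step_fn, start_token, targets=None, max_len=5):
    """Run a decoder step function for a few steps.

    With `targets` given (training), the next input is the ground-truth token
    (teacher forcing). Without it (prediction), the next input is the model's
    own previous output.
    """
    inputs, outputs = [], []
    prev = start_token
    steps = len(targets) if targets is not None else max_len
    for t in range(steps):
        inputs.append(prev)
        pred = step_fn(prev)
        outputs.append(pred)
        prev = targets[t] if targets is not None else pred
    return inputs, outputs

# Hypothetical, imperfect "model": it is wrong about what follows "I".
toy_model = {"<s>": "I", "I": "is", "am": "happy", "is": "it", "it": "is", "happy": "</s>"}

def step(token):
    return toy_model.get(token, "</s>")

target = ["I", "am", "happy", "</s>"]
tf_inputs, tf_outputs = decode(step, "<s>", targets=target)
free_inputs, free_outputs = decode(step, "<s>", max_len=4)
print(tf_outputs)     # ['I', 'is', 'happy', '</s>']
print(free_outputs)   # ['I', 'is', 'it', 'is']
```

With teacher forcing the single mistake at step 2 stays isolated, while in free-running mode the same mistake derails every later step.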

### Many to Many - Seq2Seq using Encoder-Decoder with Attention (Machine Translation, Text Summarization)

- **Solution to above Problem**
    * Prevent sequence overload by giving the decoder a way to focus on the most relevant input words at each translation step.
    
    * Instead of giving the decoder a single summary of the input sequence, we give it information from all the encoder hidden states. The hidden states are weighted by how relevant each input word is to the current output step, and their weighted sum is our context.


- **How to find importance (attention) of a word and find context?**
    * We will perform a series of calculations to get a score for each word.
    
    * **Step 1:** First get the hidden states for all input words from the encoder.
     $$ Encoder( w_0, w_1, ..., w_n ) = [h_0, h_1, ..., h_n] $$
    
    * **Step 2:** Now, we take the current hidden state of the decoder (initially a default value).
     $$ \text{Let the current decoder hidden state be } s_{t-1} $$
     
    * **Step 3:** We will concatenate each hidden state with decoder hidden state ($s_{t-1}$) to form multiple vectors.
     $$ Concatenation = [(h_0,s_{t-1}), (h_1,s_{t-1}), ..., (h_n,s_{t-1})] $$ 
    
    * **Step 4:** We pass each concatenated vector through a dense layer to get a number (energy), then apply softmax to get a probability score ($\alpha$) for each hidden state.
     $$  \textit{Energy, } e_{i}^{\langle t \rangle} = dense( h_i,s_{t-1} ) \\
         \textit{Alpha, } \alpha_{i}^{\langle t \rangle} = \frac{\exp(e_{i}^{\langle t \rangle})}{\sum_k \exp(e_{k}^{\langle t \rangle})}, \textit{This is softmax over input time axis} 
     $$
     
    * **Step 5:** Now, we perform soft attention: the context is the weighted sum of the encoder hidden states.
     $$ \textit{Context, } C^{\langle t \rangle} = \sum_{i=0}^{n} \alpha_{i}^{\langle t \rangle} h_i $$
    
    
- **Input for Decoder?**
    * We will now see how to generate the output from the decoder.
    
    * **Step 1:** We pass the context (C) to the decoder, which uses it together with its current hidden state to produce an output and the next hidden state
     $$ Decoder(C, s_{t-1}) = out, s_t $$
    
    * **Step 2:** We pass the decoder output to a dense layer (of VOCAB SIZE units) with softmax activation to get the next word.
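
The attention steps above (concatenate, energy, softmax, weighted context) can be sketched as follows. This is a simplified illustration: a single linear unit with made-up weights `w` stands in for the dense layer, and the hidden states are tiny hand-written vectors.

```python
import math

def softmax(xs):
    m = max(xs)                                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(encoder_states, s_prev, w, b=0.0):
    """Return (alphas, context) for one decoder step."""
    energies = []
    for h in encoder_states:
        concat = h + s_prev                       # list concatenation: (h_i, s_{t-1})
        energies.append(sum(w_j * c_j for w_j, c_j in zip(w, concat)) + b)
    alphas = softmax(energies)                    # softmax over the input time axis
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alphas, encoder_states)) for d in range(dim)]
    return alphas, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h_0, h_1, h_2
s_prev = [0.5, -0.5]                                     # decoder state s_{t-1}
w = [2.0, 0.0, 1.0, 1.0]                                 # hypothetical dense weights
alphas, context = attention_context(encoder_states, s_prev, w)
print(alphas, context)
```

Here the energies are 2, 0 and 2, so the first and third hidden states share most of the weight and dominate the context vector.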
    
<img src="https://i.stack.imgur.com/13ADZ.png">
    
- **Reference:** https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

- **Teacher Forcing?**

<img src="https://github.com/adityajn105/NLP-projects/raw/master/Eng-to-spanish%20(Seq2Seq%20Attention)/seq2seq-attention.jpg">

- **Read More :**
    * BLEU Score (Bilingual Evaluation Understudy) - a metric for Neural Machine Translation - https://www.youtube.com/watch?v=DejHQYAGb7Q
    * Beam Search Decoding - Greedy + Probabilistic approach to generate the best output sequence - https://www.youtube.com/watch?v=RLWuzLLSIgw

### Generalised Attention

**Calculating Attention in terms of Key, Queries and Values**

       The key/value/query concepts come from information retrieval systems.

       When you type a query to search for a video on YouTube, the search engine maps the query against a set of keys (video title, description), then presents the best matched videos (values).

      
* For the above system, the query is the decoder hidden state, and the keys and values come from the encoder hidden states (key and value are the same here). 


* The score (energy) is the compatibility between the query and a key (which can be the dot product between them). These scores go through a softmax function to yield a set of alphas (weights) whose sum is 1.


* Each alpha (weight) is multiplied with the corresponding value (again, here it is the same as the key), and the results are summed to get the context vector.


* If we set alpha (weight) = 1 for the last hidden state and 0 for all previous hidden states, the attention mechanism reduces to the original seq2seq, as if there were no attention at all.


* This terminology will be more useful when we study transformers.
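
In query/key/value terms the same computation can be sketched as below, using a dot product as the compatibility score. The vectors are made up, and the optional `weights` argument exists only to demonstrate the "no attention" special case mentioned above.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values, weights=None):
    """Dot-product attention for a single query; pass `weights` to override the softmax."""
    if weights is None:
        scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
        weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # used as both keys and values
query = [1.0, 0.5]                                       # decoder hidden state

alphas, context = attend(query, encoder_states, encoder_states)

# Alpha = 1 for the last hidden state and 0 elsewhere: plain seq2seq context.
_, plain_context = attend(query, encoder_states, encoder_states, weights=[0.0, 0.0, 1.0])
print(context, plain_context)
```

With the one-hot weights the context is exactly the last encoder hidden state, which is what the original encoder-decoder passes to the decoder.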

## Transformers (Self-Attention)

* **Problems with above system**
    1. No parallel computing: we have to decode every output word one by one. Even the encoder is not parallel. Everything is sequential.
    2. Vanishing/exploding gradients: LSTM and GRU help, but even they struggle with very long sequences.
    3. Across decoder timesteps, information is lost in later stages.
    
    
* **How do transformers solve the above issues?**
    1. No sequential computation is required within a layer.
    2. The gradient does not have to flow through timesteps, so it does not vanish or explode across time.

<img src="https://i.stack.imgur.com/J45g2.png">

* In the above system using LSTM, the key/value/query were generated by LSTM cells. In self-attention they are learned linear transformations of the corresponding input.

   $$\textit{Query} = I * W_q\\\textit{Key} = I * W_k\\\textit{Value} = I * W_v $$
   
   
* **Why matrix multiplication (or say transformations)?** 
    1. If we do not transform the input vectors, the dot product used to compute the weights will usually give each input its maximum score with itself.
      Example: for a pronoun we need the maximum weight to go to its referent, not to the pronoun itself.

    2. The transformations may also yield better representations for Query, Key and Value. Here key and value are not the same. These transformations are learned as part of the neural network.

    3. The transformation also projects the input vector into a space of the desired dimension, say from 300 to 64.
    
    
* **Steps to calculate weighted value vectors for each query**
    1. **Step 1 :** Take the first query and compute its dot product with every input key. This gives n scores.
            q1.k1 = 112     q1.k2 = 96     q1.k3 = 56 ...
    2. **Step 2 :** Divide each score by the square root of the key vector dimension. This leads to more stable gradients. Suppose the key vector dimension is 64, so we divide by 8.
            14                 12               7     ....
    3. **Step 3 :** Convert these scores into weights using softmax.
            0.88               0.12           0.0008  ....
    4. **Step 4 :** Multiply each value vector by these weights. (Non-relevant values are suppressed with respect to the current context)
            0.88 * v1          0.12 * v2      0.0008 * v3  ....
    5. **Step 5 :** Calculate the weighted sum
             z1 = 0.88 * v1 + 0.12 * v2 + 0.0008 * v3 + ....
    6. **Step 6 :** Repeat steps 1 to 5 for each query (q1, q2, ...)
             z1, z2, z3, ...
             
    $$ \textit{Attention Head, } Z = Softmax( \frac{Q.K^T}{\sqrt{d_k}} )V $$
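
The worked numbers in the steps above can be reproduced with a few lines. This sketch assumes exactly three keys (the steps leave the list open-ended) and uses made-up 2-dimensional value vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(raw_scores, d_k):
    """Scale the dot-product scores by sqrt(d_k), then normalise them with softmax."""
    scaled = [s / math.sqrt(d_k) for s in raw_scores]
    return scaled, softmax(scaled)

scaled, weights = attention_weights([112, 96, 56], d_k=64)
print(scaled)                           # [14.0, 12.0, 7.0]
print([round(w, 4) for w in weights])   # [0.8801, 0.1191, 0.0008]

# Weighted sum of hypothetical value vectors v1, v2, v3 gives z1.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z1 = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]
print(z1)
```

Without the division by 8 the raw scores (112, 96, 56) would push the softmax to an almost one-hot output, which is the instability the scaling is meant to reduce.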


* **The Beast with many Heads (Multi-head attention)**

    1. The self-attention layer is refined using a mechanism called "multi-headed attention". 
    
            Why?
            a. z1 should contain a little bit of every other encoding, but it could be dominated by the actual word itself.
        
            b. Using multiple "attention heads", we have multiple sets of Q/K/V weight matrices. Each set is randomly initialized. After training, each set projects the input embeddings into a different representation subspace. This reduces the chance that the weighted value vector is dominated by the actual word itself.
            
    2. In the paper, 8 attention heads are used, so there are 8 different sets of Q/K/V weight matrices.
    
    3. Now, the feed-forward network is not expecting 8 matrices; it is expecting a single matrix (one vector for each word). We need to condense these 8 matrices into a single one by concatenating them and multiplying by a weight matrix $W^o$.
    
        $$ Z =  [Z_0, Z_1, Z_2, Z_3, Z_4, Z_5, Z_6, Z_7] * W^o $$
        
    4. All attention heads can be computed in parallel.
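
The concatenate-and-project step can be sketched with plain lists. To keep the numbers readable this example assumes only 2 heads, 2 words and tiny dimensions; `heads` and `w_o` are made-up values, not trained weights.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def concat_heads(heads):
    """Concatenate per-head outputs along the feature axis, word by word."""
    n_words = len(heads[0])
    return [[x for head in heads for x in head[i]] for i in range(n_words)]

# Two heads, each holding one 2-dimensional vector per word (2 words).
heads = [
    [[1.0, 2.0], [3.0, 4.0]],   # Z_0
    [[5.0, 6.0], [7.0, 8.0]],   # Z_1
]
w_o = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0],
       [1.0, 1.0, 1.0]]          # hypothetical output projection, shape (4, 3)

concatenated = concat_heads(heads)   # shape (2 words, 4 features)
z = matmul(concatenated, w_o)        # shape (2 words, 3 features)
print(z)                             # [[7.0, 8.0, 11.0], [11.0, 12.0, 15.0]]
```

The result again has one vector per word, which is the shape the feed-forward layer expects.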


* **Encoder Summary**
   <img src="http://jalammar.github.io/images/t/transformer_multi-headed_self-attention-recap.png">
   
   
* **Positional Encoding**
    1. Encodes each input's position in the sequence, because the order of words is also important.
    2. The positional encoding is added to the input word embeddings.
    3. Why? Everything is processed in parallel, so this is how some sense of order is incorporated.
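
The notes above do not say which encoding is used. As one possible illustration, the sketch below assumes the fixed sinusoidal scheme from "Attention Is All You Need" (sine on even dimensions, cosine on odd ones); the function names are made up.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal encoding vector for one position."""
    pe = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the same frequency.
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

def add_positions(embeddings):
    """Add the positional encoding to each word embedding (element-wise)."""
    d_model = len(embeddings[0])
    return [[e + p for e, p in zip(emb, positional_encoding(pos, d_model))]
            for pos, emb in enumerate(embeddings)]

embeddings = [[0.0] * 4, [0.0] * 4]    # two zero "embeddings" so only the positions show
encoded = add_positions(embeddings)
print(encoded[0])                      # [0.0, 1.0, 0.0, 1.0]
```

Two identical word embeddings now differ once their positions differ, which gives the otherwise order-blind attention layers a sense of word order.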
    
    
* **Transformer Decoder**
    1. The components of the decoder work similarly to those of the encoder.
    
    2. The output of the top encoder is transformed into a set of attention vectors K and V. These are used by each decoder in its "encoder-decoder attention" layer, which helps the decoder focus on the appropriate places in the input sequence.
    
    3. The output of each step is fed to the bottom decoder in the next timestep, and the decoders bubble up their decoding results just like the encoders did. (During training this can be done by right-shifting the output and masking future positions) - Causal Attention
    
    4. The "Encoder-Decoder Attention" layer works just like multi-headed self-attention, except that it creates its query matrix from the layer below it and takes the keys and values matrices from the output of the encoder stack.
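
The masking used for causal attention can be sketched as a row-wise softmax in which position i only sees positions up to i. Implementations usually set the masked scores to negative infinity before the softmax; the version below simply skips them, which gives the same weights. The function name is made up.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                        # drop future positions
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked_tail = [0.0] * (len(row) - i - 1)      # future positions get zero weight
        weights.append([e / total for e in exps] + masked_tail)
    return weights

scores = [[0.0] * 3 for _ in range(3)]   # equal scores, so only the mask matters
weights = causal_softmax(scores)
print(weights[0], weights[1])            # [1.0, 0.0, 0.0] [0.5, 0.5, 0.0]
```

The first word can only attend to itself, the second to the first two, and so on, so no position can peek at words that have not been generated yet.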
    
* **Summary**
    <img src="https://i0.wp.com/blog.exxactcorp.com/wp-content/uploads/2019/05/1_blSbN23mOGMZ_DWvTAcO1w.png" height="50%">
    
    
* **Some cutting-edge Transformers**
    1. GPT-2 / GPT-3 (OpenAI): Generative Pre-trained Transformer
    2. BERT (Google): Bidirectional Encoder Representations from Transformers
    3. T5 (Google): Text-to-Text Transfer Transformer (for multi-tasking)
    
    
* **Read More**
    1. [Deep contextualized word representations](https://arxiv.org/pdf/1802.05365.pdf)
    2. [Attention Is All You Need](https://arxiv.org/abs/1706.03762)
    3. [The Illustrated Transformer](http://jalammar.github.io/illustrated-transformer/)
    4. [The Annotated Transformer](https://nlp.seas.harvard.edu/2018/04/03/attention.html) 
    5. [Reformer: The Efficient Transformer](https://arxiv.org/abs/2001.04451)
    6. [The Illustrated GPT-2 (Visualizing Transformer Language Models)](http://jalammar.github.io/illustrated-gpt2/)
    7. [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
    8. [How GPT3 Works - Visualizations and Animations](http://jalammar.github.io/how-gpt3-works-visualizations-animations/)
    9. [Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer](https://arxiv.org/abs/1910.10683)

### Outline

1. CBOW :

        Uses a fixed context window to predict the center word.
        Problem: limited context. What if we want the whole context?
    
    
2. ELMo :

        Uses a bidirectional LSTM (forward and backward language models) instead of a fixed window.
        Takes the whole context into account (full sentence).
    
    
3. GPT  :

        Uses only the Transformer decoder stack. 
        It is uni-directional (decoder), meaning it does not take future words into account. (Causal Attention - No peeking)
    
    
4. BERT :

        Uses only the Transformer encoder stack.
        It is bi-directional (encoder).
        It was trained using masked language modeling and next sentence prediction.


5. T5 :

        Uses both the encoder and decoder stacks.
        The encoder has bi-directional context.
        It is a text-to-text model.

### BERT - Bidirectional Encoder Representations from Transformers

* During pre-training it works on unlabelled data.

* BERT is a multi-layer bi-directional transformer and also uses positional embeddings.

* BERT base - 12 layers (transformer blocks), 12 attention heads, 110 million params.

* **Pre-training**
    * Before feeding word sequences to the model, we select 15% of the tokens for prediction.
    * Of this 15%: 1. 80% are replaced with the [MASK] token, 2. 10% are replaced by a random token, 3. 10% remain unchanged.
    * In addition to masked language modeling, it also uses a "next sentence prediction" task as an objective.
    * 
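
The 15% / 80-10-10 masking rule can be sketched as follows. This is a simplified illustration, not BERT's actual preprocessing code: it selects each token independently with probability 0.15 (rather than exactly 15% of the tokens), works on whole words, and uses a made-up vocabulary.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, labels); labels are None for tokens the model need not predict."""
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < select_prob:
            labels.append(tok)                  # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")         # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)              # 10%: keep the token unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

tokens = "the cat sat on the mat".split()
vocab = ["dog", "ran", "under", "a", "rug"]     # hypothetical vocabulary
inputs, labels = mask_tokens(tokens, vocab, random.Random(0))
print(inputs, labels)
```

Passing an explicit `random.Random` instance keeps the example reproducible; the loss would be computed only at positions where the label is not `None`.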
    
* **Fine-Tuning Bert**
    
    * 

### GPT

### Reformer