### Business Context:

In part 1 of the Transformer series lectures, we have done an in-depth analysis of NLP.

We started from simpler NLP models like **Bag of Words (BoW) and TF-IDF**, moved on to word embedding models like **Word2Vec and GloVe**, and then to **simple and bidirectional RNNs**.

Finally, we came to the more complex **Transformer architecture** and saw the power of **BERT**, one of the highest-performing state-of-the-art Transformer models.
There we saw how self-attention, stacked in encoder and decoder layers, helped surpass the performance metrics set by previous architectures on downstream NLP tasks.

In part 2 of this series, we will discuss **2 more novel architectures** that improve on BERT's performance using different training and optimization techniques.

These are :
* **RoBERTa : A Robustly Optimized BERT Pretraining Approach**
* **XLNet : Generalized Autoregressive Pretraining for Language Understanding**

We will analyze the architectures of these 2 models, study their training and optimization techniques and finally use them to **classify Human Emotions** into separate categories.

### References:

* Google Images
* Transformers research paper -> https://arxiv.org/abs/1706.03762
* BERT research paper -> https://arxiv.org/abs/1810.04805
* RoBERTa research paper -> https://arxiv.org/abs/1907.11692
* XLNet research paper -> https://arxiv.org/abs/1906.08237
* Wikipedia, Google

### Transformer Architecture Recap:

* Refer to **Transformers in NLP: Part1** for detailed information

### Brief Overview:

* Composed of encoder and decoder blocks: multiple identical encoders and decoders stacked on top of each other, all with the same number of units.

* The encoder layers convert the input sentence into a numerical representation using the attention mechanism.

* The decoder uses the encoded information from the encoder layers to produce the translation in a different language.


![transformer.png](attachment:transformer.png)

![encoder-decoder_2.png](attachment:encoder-decoder_2.png) 

### Self Attention:

* Each word's output is a weighted combination of all word embeddings in the sentence (including those that appear later), as shown in the figure below.

* This allows the model to look at the other words in the input sequence to get a better understanding of a certain word in the sequence.

* Each word embedding is projected into 3 separate vectors (a query, a key, and a value). Query-key similarity scores give the attention weights, which are then used to combine the values into the final result.



![self-attention.png](attachment:self-attention.png)
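The query, key, and value computation described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the full multi-head layer; the random inputs, the projection matrices `w_q`, `w_k`, `w_v`, and the sizes are all made up for the example.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    q = x @ w_q  # queries, shape (seq_len, d_k)
    k = x @ w_k  # keys,    shape (seq_len, d_k)
    v = x @ w_v  # values,  shape (seq_len, d_k)
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Row-wise softmax (shifted by the row max for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted combination of all value vectors.
    return weights @ v, weights

# Toy data: 4 "words" with 8-dimensional embeddings (illustrative values only).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.sum(axis=-1))
```

Row `i` of `weights` shows how much word `i` attends to every word in the sequence, including later ones, which is what makes the representation bidirectional.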

### BERT Recap:

* Refer to **Transformers in NLP: Part1** for detailed information

### Brief Overview:

* Bidirectional Encoder Representations from Transformers
* Learns information from both the left and the right side of a token’s context
* Pre-trained on the BooksCorpus (800M words) and English Wikipedia (2,500M words).

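**Illustration:** the word **bank** has a different meaning in each sentence below, and BERT uses the context on both sides of the word to give it a different embedding in each case.

 1. He and his friends went to sit near the river **bank** and enjoy the sunset.
 2. Tom needs to go to the **bank** by Saturday to claim his lottery amount awarded as a cheque.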
 
 **Word Embeddings in BERT:**
 
 ![bert-text-io.png](attachment:bert-text-io.png)

#### BERT Special Tokens:

* [UNK]

* [SEP]

* [PAD]

* [CLS]

* [MASK]
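As a rough sketch of where these tokens appear, the snippet below assembles a BERT-style sentence-pair input by hand. The helper `build_bert_input`, the toy vocabulary, and the whitespace-level tokens are all hypothetical; a real BERT tokenizer splits text into WordPiece subwords and maps them to integer ids.

```python
def build_bert_input(tokens_a, tokens_b, vocab, max_len):
    """Assemble a toy BERT-style sentence-pair input (hypothetical helper)."""
    def known(tokens):
        # Tokens missing from the vocabulary are replaced by [UNK].
        return [t if t in vocab else "[UNK]" for t in tokens]

    # [CLS] opens the sequence; [SEP] closes each of the two segments.
    tokens = ["[CLS]"] + known(tokens_a) + ["[SEP]"] + known(tokens_b) + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    padding = max_len - len(tokens)
    if padding < 0:
        raise ValueError("sequence is longer than max_len")
    # [PAD] fills the sequence up to a fixed length and is masked out.
    tokens += ["[PAD]"] * padding
    segment_ids += [0] * padding
    attention_mask += [0] * padding
    return tokens, segment_ids, attention_mask

# Toy vocabulary: "flooded" is deliberately left out to trigger [UNK].
vocab = {"he", "went", "to", "the", "bank", "river"}
tokens, segment_ids, attention_mask = build_bert_input(
    ["he", "went", "to", "the", "bank"],
    ["the", "river", "flooded"],
    vocab,
    max_len=12,
)
print(tokens)
print(segment_ids)
print(attention_mask)
```

[MASK] does not appear here because it is only inserted during masked language model pre-training, where it replaces some of the input tokens that the model must then predict.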

#### BERT Pre-Training:

* Details covered in RoBERTa section below

### Introduction to RoBERTa:

* RoBERTa : A Robustly Optimized BERT Pretraining Approach
* The RoBERTa authors found that BERT was significantly undertrained, an observation made after other models had been released
* RoBERTa is an improved recipe for training BERT models that matches or exceeds the performance of models released after BERT

### BERT pre-training:

* Tokenization
* Training objectives: MLM & NSP
* Optimization
* Training data : Book Corpus + English Wikipedia (16 GB of uncompressed text)

### Approach for RoBERTa:


#### 1. Dynamic Masking in RoBERTa vs Static Masking in BERT:

* BERT uses static masking: the mask is applied once, during data pre-processing
* RoBERTa masks the data on the fly, so a new mask is generated each time a sequence is fed to the model
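The difference between the two strategies can be sketched as follows. The 15% masking rate and the 80/10/10 split between [MASK], a random token, and the unchanged token come from the BERT paper and are assumptions here, since these notes do not state them; the helper `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return (masked_tokens, labels); labels hold the original token at selected positions."""
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK  # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

tokens = "he and his friends went to sit near the river bank".split()
vocab = sorted(set(tokens))  # toy vocabulary, for illustration only
rng = random.Random(0)

# Static masking: mask once during pre-processing, then reuse the same copy.
static_masked, _ = mask_tokens(tokens, vocab, rng)
static_epochs = [static_masked for _ in range(3)]

# Dynamic masking: draw a fresh mask every time the sequence is served.
dynamic_epochs = [mask_tokens(tokens, vocab, rng)[0] for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs)):
    print(epoch, " ".join(s), "|", " ".join(d))
```

With static masking the model sees the same masked positions in every epoch, while dynamic masking exposes it to different prediction targets for the same sentence over time.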
    
#### 2. Full Sentences without NSP loss in RoBERTa vs Segment Pairs with NSP loss in BERT:

* BERT uses segment pairs, each composed of multiple natural sentences, together with the NSP loss
* RoBERTa uses full sentences sampled from one or more documents, without the NSP loss
* The maximum input length is the same for both: 512 tokens
    
#### 3. Training with larger batch sizes:

* Training with larger batch sizes improved both the masked language modeling perplexity and end-task accuracy
    
#### 4. Byte Pair Encoding:
    
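* Uses a byte-level BPE vocabulary: bytes, rather than unicode characters, are the base subword units, so any input text can be encoded without unknown tokens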
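A small sketch of the idea: starting from raw UTF-8 bytes gives a base vocabulary of only 256 symbols that covers any text, and BPE then merges frequent adjacent pairs into new symbols. The helper `merge_most_frequent_pair` and the sample string are illustrative assumptions; the actual RoBERTa tokenizer learns tens of thousands of merges from a large corpus.

```python
from collections import Counter

def merge_most_frequent_pair(ids, new_id):
    """One BPE merge step: replace the most frequent adjacent pair with new_id."""
    pair = Counter(zip(ids, ids[1:])).most_common(1)[0][0]
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return pair, merged

# "\u00e9" is the accented letter e: one unicode character, two UTF-8 bytes.
text = "caf\u00e9 caf\u00e9s"
byte_ids = list(text.encode("utf-8"))
print(len(text), "characters ->", len(byte_ids), "bytes")

# Every base symbol is a byte, so ids 0..255 cover any input text.
print(max(byte_ids) < 256)

# New symbols created by merges get ids beyond the byte range.
pair, merged = merge_most_frequent_pair(byte_ids, new_id=256)
print(pair, merged)
```

Because the base units are bytes, unseen characters never need an [UNK] token; they are simply encoded as a longer sequence of byte-level symbols.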
    
#### 5. Pre-Training Dataset:
    
* BERT: BookCorpus + English Wikipedia => 16GB of data
* RoBERTa: (BookCorpus + English Wikipedia => 16GB) + CC-News (76GB) + OpenWebText (38GB) + Stories (31GB)
* About 160GB in total, roughly 10 times more data than was used to pre-train BERT
    

### RoBERTa model architecture:

RoBERTa uses the same architectures as the BERT models, as shown below:

* The number of layers (Transformer blocks) is denoted as L, the hidden size as H, and the number of self-attention heads as A
* Base: L = 12, H = 768, A = 12, 110M parameters
* Large: L = 24, H = 1024, A = 16, 355M parameters
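The parameter counts above can be approximately reproduced from L and H alone. The sketch below is a back-of-the-envelope count under stated assumptions that are not in these notes: a feed-forward inner size of 4H, 512 position embeddings, 2 segment embeddings, a pooler layer, no pre-training heads, and vocabulary sizes of 30,522 (BERT WordPiece) and 50,265 (RoBERTa byte-level BPE). The function name is hypothetical.

```python
def encoder_param_count(num_layers, hidden, vocab_size, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (assumptions in the text above)."""
    layer_norm = 2 * hidden  # scale and bias
    # Token, position, and segment embeddings plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: H -> 4H -> H, each with a bias.
    inner = 4 * hidden
    feed_forward = hidden * inner + inner + inner * hidden + hidden
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler

base = encoder_param_count(num_layers=12, hidden=768, vocab_size=30522)
large = encoder_param_count(num_layers=24, hidden=1024, vocab_size=50265)
print(f"base:  {base / 1e6:.1f}M")   # about 109.5M
print(f"large: {large / 1e6:.1f}M")  # about 355.4M
```

The number of attention heads A does not appear in the formula because the heads split the hidden size H between them rather than adding parameters. Under these assumptions, the 110M figure corresponds to the smaller BERT vocabulary and the 355M figure to the larger byte-level BPE vocabulary.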

### RoBERTa evaluation:

This improved version of BERT was evaluated on the following 3 benchmarks:

* GLUE: General Language Understanding Evaluation
* SQuAD: Stanford Question Answering Dataset
* RACE: ReAding Comprehension from Examinations


* The results illustrate the importance of previously overlooked design decisions
* They suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives.

### Hyper-parameters for RoBERTa:

* Details illustrated with findings in research paper

### AR Models vs AE Models:

#### Auto-Regressive(AR) Models:

* AR language modeling seeks to estimate the probability distribution of a text corpus with an autoregressive model.

* An AR model is only trained to encode a uni-directional context (either forward or backward).

* It is therefore not effective at modeling deep bidirectional context.

* This leaves a gap between AR language modeling and effective pretraining, because downstream language understanding tasks often need bidirectional context.
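In symbols (a sketch in the notation of the XLNet paper, which is an assumption here since these notes do not fix a notation), a forward AR model factorizes the likelihood of a text sequence $\mathbf{x} = (x_1, \dots, x_T)$ with the product rule and maximizes its log:

$$
\max_{\theta} \; \log p_{\theta}(\mathbf{x}) = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})
$$

Each token $x_t$ is conditioned only on the tokens to its left, $\mathbf{x}_{<t}$, which is why the context is uni-directional. A backward model uses $\mathbf{x}_{>t}$ instead.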

#### Auto-Encoder(AE) Models:

* AE models aim to reconstruct the original data from corrupted input.

* Since density estimation is not part of the objective, bidirectional context can be used for reconstruction.

* However, there is a pretrain-finetune discrepancy: artificial symbols such as [MASK] that are used during pretraining are absent from real data at finetuning time.

* AE models cannot model the joint probability using the product rule, as AR language modeling does.

* BERT assumes the predicted tokens are independent of each other given the unmasked tokens, which is an oversimplification because high-order, long-range dependency is prevalent in natural language.
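The BERT-style AE objective can be written as a sketch in the XLNet paper's notation (the symbols are assumptions, not defined in these notes): let $\hat{\mathbf{x}}$ be the corrupted input, $\bar{\mathbf{x}}$ the masked tokens, and $m_t = 1$ when $x_t$ is masked.

$$
\max_{\theta} \; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}}) \approx \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
$$

The $\approx$ sign marks the independence assumption: the joint probability of the masked tokens is replaced by a sum of per-token terms, each conditioned only on the corrupted input and not on the other masked tokens.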

### Introduction to XLNet:

* Large bidirectional transformer that uses an improved training methodology, more data, and more computational power to outperform BERT on 20 language tasks.

* Uses a permutation language modeling objective to combine the advantages of AR and AE methods.

* Trained with over 130 GB of textual data and 512 TPU chips running for 2.5 days.

* Overcomes the limitations of BERT thanks to its autoregressive formulation.

### Where does XLNet shine?

* Best of both worlds

* Unlike conventional AR models, which use a fixed forward or backward factorization order, XLNet maximizes the expected log likelihood of a sequence w.r.t. all possible permutations of the factorization order.

* Captures bidirectional context information thanks to the permutation operation.

* Because it does not rely on data corruption, XLNet does not suffer from the pretrain-finetune discrepancy that BERT is subject to.

* The autoregressive objective provides a natural way to use the product rule for factorizing the joint probability of the predicted tokens, eliminating the independence assumption made in BERT.

* Under comparable experiment settings, XLNet consistently outperforms BERT on tasks such as GLUE language understanding, reading comprehension (SQuAD and RACE), and text classification.
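To make the permutation idea more concrete, the sketch below samples one factorization order and derives which positions each token may attend to: a token sees exactly the tokens that come before it in the sampled order, regardless of where they sit in the sentence. This is a simplified illustration; the helper `permutation_mask` is hypothetical, and the actual XLNet implementation additionally relies on two-stream attention and keeps the original positional encodings.

```python
import numpy as np

def permutation_mask(order):
    """mask[i, j] is True when position i may attend to position j."""
    seq_len = len(order)
    rank = np.empty(seq_len, dtype=int)
    # rank[pos] = step at which position pos is predicted in this order.
    rank[order] = np.arange(seq_len)
    # A position sees only positions predicted at an earlier step.
    return rank[:, None] > rank[None, :]

tokens = ["New", "York", "is", "a", "city"]
rng = np.random.default_rng(0)
order = rng.permutation(len(tokens))  # one sampled factorization order
mask = permutation_mask(order)

for step, pos in enumerate(order):
    context = [tokens[j] for j in np.flatnonzero(mask[pos])]
    print(f"step {step}: predict {tokens[pos]!r} given {context}")
```

With the identity order the mask reduces to the usual left-to-right AR mask. Averaged over many sampled orders, every token is predicted from contexts on both its left and its right, which is how the objective captures bidirectional context without any [MASK] symbols.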

### Differences between XLNet & BERT:

1. Independence Assumption

2. Input noise

3. Context dependency

4. Permutation Language Modeling

### Transformer XL:

* Research paper-> https://arxiv.org/abs/1901.02860

### XLNet Pre-training:

* BooksCorpus and English Wikipedia (13GB)

* Giga5 (16GB text)  

* ClueWeb 2012-B (19GB) 

* Common Crawl (110GB) 

* Full sequence length of 512 used in pre-training

### Comparison with BERT & RoBERTa:

* Details illustrated with findings in research paper

### Hyper-parameters for XLNet:

* Details illustrated with findings in paper