# Advanced AI: Transformers for NLP Using Large Language Models

**Instructor:** Jonathan Fernandes

Transformers have quickly become the go-to architecture for natural language processing (NLP). As a result, knowing how to use them is now a business-critical skill in your AI toolbox. In this course, instructor Jonathan Fernandes walks you through many of the key large language models developed since GPT-3. He presents a high-level overview of GLaM, Megatron-Turing NLG, Gopher, Chinchilla, PaLM, OPT, and BLOOM, relaying some of the most important insights from each model.

Get a high-level overview of large language models, where and how they are used in production, and why they are so important to NLP. Additionally, discover the basics of transfer learning and transformer training to optimize your AI models as you go. By the end of this course, you’ll be up to speed with what’s happened since OpenAI first released GPT-3 as well as the key contributions of each of these large language models.


## 1. Transformers in NLP
### What are large language models?
- BERT and GPT-3 are examples of large language models
- **Large language model architecture is based on transformer architecture.**
- Transformers were proposed by a team of Google researchers in 2017 in the paper "**Attention Is All You Need**," which became a turning point in NLP and the foundation for today's large language models.
- Large language models have millions and often billions of parameters and are trained on enormous datasets
- GPT-3 was released in May of 2020

<img src='img/1.png' width="800" height="400" align="center"/>

- Models released by Google research include: **GLaM**, **PaLM**
- Models released by DeepMind include: **Gopher**, **Chinchilla**
- Released by Microsoft and Nvidia: Megatron-Turing NLG or **MT-NLG**
- Released by Meta AI: **OPT** $\Rightarrow$ makes large language models available to researchers outside of big tech
- Released by the BigScience research workshop, coordinated by Hugging Face: **BLOOM** $\Rightarrow$ makes large language models available to researchers outside of big tech

### Transformers in Production

#### BERT
- In 2019, Google started using BERT as part of search 
    - $\Rightarrow$ Now, when entering something into Google search, you can enter something more "English sounding"
    - For example: instead of "curling objective" $\Rightarrow$ "what's the main objective of curling?"
    - Another example:
        - In the past, a Google search for "Can you get medicine for someone pharmacy" would **not** have recognized that "for someone" was an important part of the query $\Rightarrow$ Now it understands that you are asking about picking up medicine on behalf of another person
- **BERT**: **B**idirectional **E**ncoder **R**epresentations from **T**ransformers 
- BERT was trained on English Wikipedia and the BookCorpus dataset
- One of the first large language models developed by the Google research team
- The quality of Google search has improved significantly using BERT.

### Transformers: History
- The models based on the original transformer paper from 2017 have evolved over the years.
- One of the challenges with training large language models in 2017 was that you needed labeled data.
- The ULMFiT model proposed by Jeremy Howard and Sebastian Ruder provided a framework where you didn't need labeled data, and that meant **large text corpora, such as Wikipedia, could now be used to train models.**
- In June 2018, OpenAI released GPT (**G**enerative **P**re-trained **T**ransformer), the first pre-trained transformer model.
- When OpenAI announced a bigger and better version, GPT-2, in February 2019, it made headlines because the team initially declined to release the full model due to **ethical concerns.**
- Meta's BART and Google's T5 are both large pre-trained models using the same architecture as the original transformer
- Hugging Face released DistilBERT, a smaller, faster, and lighter version of BERT: it retains about 95% of BERT's performance while reducing the size of the model by 40%
- In May 2020, OpenAI released GPT-3, which is excellent at generating high-quality English sentences.
    - Although OpenAI provided a lot of detail in the GPT-3 paper, they didn't release the dataset they used or their model weights
    
<img src='img/2.png' width="800" height="400" align="center"/>

<img src='img/3.png' width="800" height="400" align="center"/>

**Note:** the y-axis in the graph above uses a log scale, so the growth it shows is exponential rather than linear.
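
To see why a log-scale axis turns exponential growth into a straight line, suppose (as an illustrative assumption, not a fit to the data in the figure) that model size $N$ starts at $N_0$ and is multiplied by a constant factor $b$ every year $t$:

$$N(t) = N_0 \, b^{t} \quad\Longrightarrow\quad \log_{10} N(t) = \log_{10} N_0 + t \, \log_{10} b$$

The right-hand side is linear in $t$ with slope $\log_{10} b$, so equal vertical steps on a log-scale plot correspond to equal multiplicative jumps in model size.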

## 2. Training Transformers and Their Architecture

### Transfer Learning
- Transfer learning is made up of 2 components:
    - **Pre-training** $\Rightarrow$ Extremely resource-heavy
    - **Fine-tuning** $\Rightarrow$ Involves training our model with labeled data
    
<img src='img/4.png' width="800" height="400" align="center"/>

#### Pre-training Tasks: BERT (Google)
- **Masked language modeling:** BERT was fed Wikipedia and BookCorpus text in which words were randomly masked out
- It then had to predict the most likely candidates for these masked words
- **Next sentence prediction:** BERT had to predict whether one sentence followed another.
    - 50% of the time one sentence did follow the other and these were labeled as `isNext`
    - 50% of the time a random other sentence from the corpus was used, and these were labeled as `notNext`
- According to OpenAI's documentation, **1,500 words is approximately equivalent to 2,048 tokens.**
    - So this means **one word is approximately 1.4 tokens** (the exact ratio depends on the tokenizer and the text).
    - **A novel of 100,000 words is approximately 140,000 tokens.**
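
The two pre-training tasks described above can be sketched in a few lines of plain Python. This is a simplified illustration rather than BERT's actual data pipeline: the function names `mask_tokens` and `make_nsp_pairs` are hypothetical, whitespace-split words stand in for real subword tokens, and the 15% masking rate is taken from the BERT paper (which also sometimes substitutes a random token or leaves the chosen token unchanged, a detail omitted here).

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Randomly replace tokens with [MASK].

    Returns (masked, labels), where labels[i] is the original token
    at each masked position and None everywhere else.
    """
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model must predict this token
        else:
            masked.append(tok)
            labels.append(None)  # nothing to predict here
    return masked, labels


def make_nsp_pairs(sentences, rng=random):
    """Build (sentence_a, sentence_b, label) triples.

    About half the pairs use the true next sentence ("isNext"); the
    rest use a random other sentence ("notNext").
    Assumes at least three sentences so a random alternative exists.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], "isNext"))
        else:
            # Exclude the current sentence and its true successor.
            candidates = [s for j, s in enumerate(sentences)
                          if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), "notNext"))
    return pairs
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which is convenient when inspecting the generated examples.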
    
#### RoBERTa (Facebook)
- Trained in about one day
- Trained on 2 trillion tokens
- Training data: Wikipedia and BookCorpus, plus the Common Crawl news dataset (CC-News), OpenWebText, and the Common Crawl stories dataset (Stories):
    - **Common Crawl** is a raw webpage dataset from years of web crawling 
    - **OpenWebText** is a dataset created by scraping URLs shared on Reddit with a score of at least three (the score is a proxy for the quality of the linked content)
    
#### GPT-3 (OpenAI)
- 34 days of training
- Used 10,000 V100 GPUs
- 300B training tokens
- Trained primarily on Azure infrastructure
- Used Wikipedia, CommonCrawl, WebText2, Books1, Books2
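
To get a feel for the scale of 300B training tokens, the figure can be combined with the rough ratio of 1.4 tokens per word quoted earlier in these notes. The snippet below is a back-of-the-envelope sketch only: the ratio is an approximation that varies by tokenizer, and the helper names are hypothetical.

```python
# Assumption: about 1.4 tokens per English word, the ratio quoted in these notes.
TOKENS_PER_WORD = 1.4


def words_to_tokens(words: int) -> int:
    """Estimate the token count for a given number of words."""
    return round(words * TOKENS_PER_WORD)


def tokens_to_words(tokens: float) -> float:
    """Estimate the word count for a given number of tokens."""
    return tokens / TOKENS_PER_WORD


novel_tokens = words_to_tokens(100_000)   # one 100,000-word novel
gpt3_words = tokens_to_words(300e9)       # GPT-3's 300B training tokens
novels = gpt3_words / 100_000             # equivalent number of novels

print(f"{novel_tokens:,} tokens per 100,000-word novel")
print(f"{gpt3_words / 1e9:.0f}B words, about {novels / 1e6:.1f} million novels")
```

Under these assumptions, 300B tokens works out to roughly 214 billion words, or on the order of two million novel-length texts.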

#### Benefits of Transfer Learning
- Faster development 
    - For BERT, the authors suggest two to four epochs of fine-tuning
    - This is far cheaper than the thousands of hours needed for pre-training
- Less data to fine-tune
- Excellent results 

### Transformer Architecture

#### Encoder-Decoder Models
- From the **"Attention Is All You Need"** paper:

<img src='img/5.png' width="500" height="250" align="center"/>

- The left-hand side is known as an **encoder** and the right-hand side is known as a **decoder**. 
- We feed in an English sentence, such as "I like NLP," and the transformer can act as a translator of the sentence from English to German:

<img src='img/6.png' width="500" height="250" align="center"/>

- However, the transformer is not made up of a single encoder and decoder, but rather a stack of six encoders and a stack of six decoders. 
- Each of these parts can be used independently depending on the task.
- Encoder-Decoder models are good at generative tasks such as translation or summarization
- Examples of encoder-decoder models are:
    - BART (Facebook)
    - T5 (Google)

<img src='img/7.png' width="500" height="250" align="center"/>

<img src='img/8.png' width="500" height="250" align="center"/>

#### Encoder-only Models

- Encoder-only models are good for tasks that require understanding of the input, such as:
    - Sentence classification
    - Named entity recognition (NER)
- Examples include the family of BERT models:
    - BERT
    - RoBERTa
    - DistilBERT
    
<img src='img/9.png' width="600" height="300" align="center"/>

#### Decoder-only Models
- Good for generative tasks such as text generation
- Examples include: 
    - GPT
    - GPT-2
    - GPT-3

<img src='img/10.png' width="600" height="300" align="center"/>
    
    
**In summary, transformers are made up of encoders and decoders, and the tasks we can perform depend on whether we use one or both components.**

<img src='img/x.png' width="800" height="400" align="center"/>
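
The mapping between architecture family, typical tasks, and example models described in this section can be captured as a small lookup table. This is only a sketch of the notes' own summary: the `ARCHITECTURES` dictionary and the `architectures_for` helper are hypothetical names, and the task lists contain just the examples mentioned above rather than an exhaustive catalogue.

```python
# Summary of this section: architecture family -> typical tasks and example models.
ARCHITECTURES = {
    "encoder-decoder": {
        "tasks": ["translation", "summarization"],
        "examples": ["BART", "T5"],
    },
    "encoder-only": {
        "tasks": ["sentence classification", "named entity recognition"],
        "examples": ["BERT", "RoBERTa", "DistilBERT"],
    },
    "decoder-only": {
        "tasks": ["text generation"],
        "examples": ["GPT", "GPT-2", "GPT-3"],
    },
}


def architectures_for(task: str) -> list[str]:
    """Return the architecture families these notes associate with a task."""
    task = task.lower()
    return [name for name, info in ARCHITECTURES.items()
            if task in info["tasks"]]


for task in ("translation", "named entity recognition", "text generation"):
    for family in architectures_for(task):
        models = ", ".join(ARCHITECTURES[family]["examples"])
        print(f"{task}: {family} (e.g. {models})")
```

A task that does not appear in the table returns an empty list, which is a reminder that the table only covers the tasks named in these notes.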