# Advanced AI: Transformers for NLP Using Large Language Models

**Instructor:** Jonathan Fernandes

Transformers have quickly become the go-to architecture for natural language processing (NLP). As a result, knowing how to use them is now a business-critical skill in your AI toolbox. In this course, instructor Jonathan Fernandes walks you through many of the key large language models developed since GPT-3. He presents a high-level overview of GLaM, Megatron-Turing NLG, Gopher, Chinchilla, PaLM, OPT, and BLOOM, relaying some of the most important insights from each model.

Get a high-level overview of large language models, where and how they are used in production, and why they are so important to NLP. Additionally, discover the basics of transfer learning and transformer training to optimize your AI models as you go. By the end of this course, you’ll be up to speed with what’s happened since OpenAI first released GPT-3 as well as the key contributions of each of these large language models.


## 1. Transformers in NLP
### What are large language models?
- BERT and GPT-3 are examples of large language models
- **Large language model architecture is based on transformer architecture.**
- The transformer architecture behind large language models was proposed by a team of Google researchers in 2017 in a paper entitled "**Attention Is All You Need**," which has become a turning point in NLP.
- Large language models have millions and often billions of parameters and are trained on enormous datasets
- GPT-3 was released in May of 2020

<img src='img/1.png' width="800" height="400" align="center"/>

- Models released by Google research include: **GLaM**, **PaLM**
- Models released by DeepMind include: **Gopher**, **Chinchilla**
- Released by Microsoft and Nvidia: Megatron-Turing NLG or **MT-NLG**
- Released by Meta AI: **OPT** $\Rightarrow$ makes large language models available to researchers outside of big tech
- Released by Hugging Face: **BLOOM** $\Rightarrow$ makes large language models available to researchers outside of big tech

### Transformers in Production

#### BERT
- In 2019, Google started using BERT as part of search 
    - $\Rightarrow$ Now, when entering something into Google search, you can enter something more "English sounding"
    - For example: instead of "curling objective" $\Rightarrow$ "what's the main objective of curling?"
    - Another example:
        - In the past, if you did a Google search using the phrase "Can you get medicine for someone pharmacy," it would **not** have picked up on the fact that "for someone" was a really important part of the query $\Rightarrow$ But now, it will pick up on the fact that you're asking about picking up medicine for another person
- **BERT**: **B**idirectional **E**ncoder **R**epresentations from **T**ransformers 
- BERT was trained on Wikipedia and BookCorpus data
- One of the first large language models developed by the Google research team
- The quality of Google search has improved significantly using BERT.

### Transformers: History
- The models based on the original transformer paper from 2017 have evolved over the years.
- One of the challenges with training large language models in 2017 was that you needed labeled data.
- The ULMFiT model proposed by Jeremy Howard and Sebastian Ruder provided a framework where you didn't need labeled data, and that meant **large corpora of text, such as Wikipedia, could now be used to train models.**
- In June of 2018, GPT or **G**enerative **P**re-trained **T**ransformer, developed by OpenAI, was the first pre-trained transformer model.
- When OpenAI released a bigger and better version of GPT (GPT-2) in Feb 2019, it made headlines because the team didn't want to release the details of the model due to **ethical concerns.**
- Meta's BART and Google's T5 are both large pre-trained models using the same architecture as the original transformer
- Hugging Face released DistilBERT, which is a smaller, faster, and lighter version of BERT: DistilBERT retained 97% of the performance of BERT while reducing the size of the BERT model by 40%
- In May 2020 OpenAI released GPT-3, which is excellent at generating high-quality English sentences.
    - Although OpenAI provided a lot of details in their GPT-3 paper, they didn't reveal the dataset they used or their model weights
    
<img src='img/2.png' width="800" height="400" align="center"/>

<img src='img/3.png' width="800" height="400" align="center"/>

**Note** that in the graph above, the y-axis is on a log scale, so a straight-line trend means the growth is exponential rather than linear.

## 2. Training Transformers and Their Architecture

### Transfer Learning
- Transfer learning is made up of 2 components:
    - **Pre-training** $\Rightarrow$ Extremely resource-heavy
    - **Fine-tuning** $\Rightarrow$ Involves training our model with labeled data
    
<img src='img/4.png' width="800" height="400" align="center"/>

#### Pre-training Tasks: BERT (Google)
- **Masked language modeling:** BERT was fed Wikipedia and BookCorpus data as input, with words randomly masked out
- BERT then had to predict what the most likely candidates were for these masked words
- With **Next sentence prediction**, it had to predict whether one sentence followed the other.
    - 50% of the time one sentence did follow the other and these were labeled as `isNext`
    - 50% of the time a random other sentence from the corpus was used, and these were labeled as `notNext`
- According to BERT's documentation, **1,500 words is approximately equivalent to 2,400 tokens.**
    - So this means **one word is approximately 1.4 tokens.**
    - **A novel of 100,000 words is approximately 140,000 tokens.**
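
The two pre-training tasks above can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual data pipeline: the function names are hypothetical, the 15% masking rate is an assumption taken from the BERT paper rather than from these notes, and real BERT masks sub-word tokens and sometimes substitutes a random or unchanged token instead of `[MASK]`.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly replace tokens with [MASK]; return masked tokens and labels."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # not a prediction target
    return masked, labels

def make_nsp_pair(sentences, index, rng):
    """Build one next-sentence-prediction example starting at sentences[index]."""
    first = sentences[index]
    if rng.random() < 0.5:
        # Half of the time: the sentence that really follows.
        return first, sentences[index + 1], "isNext"
    # Other half: a random sentence that is neither the first nor the true next.
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return first, rng.choice(candidates), "notNext"

if __name__ == "__main__":
    words = "the monkey ate that banana because it was too hungry".split()
    print(mask_tokens(words, mask_prob=0.3))
    corpus = ["I like NLP.", "It is fun.", "Curling is a sport.", "BERT is a model."]
    print(make_nsp_pair(corpus, 0, random.Random(42)))
```

Neither task needs human annotation: the labels come from the text itself, which is why large unlabeled corpora can be used for pre-training.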
    
#### RoBERTa (Facebook)
- Trained in one day
- 2 trillion tokens
- Also used were Wikipedia, BookCorpus, as well as the Common Crawl news dataset, OpenWebText, and the Common Crawl stories:
    - **Common Crawl** is a raw webpage dataset from years of web crawling 
    - **OpenWebText** is a dataset created by scraping URLs shared on Reddit with a score of at least three (the score is used as a proxy for the quality of the linked content)
    
#### GPT-3 (Open AI)
- 34 days of training
- Used 10,000 V100 GPUs
- 300B training tokens
- Primarily ran on Azure infrastructure
- Used Wikipedia, CommonCrawl, WebText2, Books1, Books2

#### Benefits of Transfer Learning
- Faster development 
    - For BERT, the authors suggest two to four epochs of fine-tuning
    - Much better than the thousands of hours of pre-training time
- Less data to fine-tune
- Excellent results 

### Transformer Architecture

#### Encoder-Decoder Models
- From the **"Attention Is All You Need"** paper:

<img src='img/5.png' width="500" height="250" align="center"/>

- The left-hand side is known as an **encoder** and the right-hand side is known as a **decoder**. 
- We feed an English sentence, such as "I like NLP," into the encoder, and the decoder produces the output, so the model can act as a translator of the sentence from English to German:

<img src='img/6.png' width="500" height="250" align="center"/>

- However, the transformer is not made up of a single encoder, but rather six encoders. 
- Each of these parts can be used independently depending on the task.
- Encoder-Decoder models are good at generative tasks such as translation or summarization
- Examples of encoder-decoder models are:
    - BART (Facebook)
    - T5 (Google)

<img src='img/7.png' width="500" height="250" align="center"/>

<img src='img/8.png' width="500" height="250" align="center"/>

#### Encoder-only Models

- Encoder-only models are good for tasks that require understanding of the input, such as:
    - Sentence classification
    - Named entity recognition (NER)
- Examples include the family of BERT models:
    - BERT
    - RoBERTa
    - DistilBERT
    
<img src='img/9.png' width="600" height="300" align="center"/>

#### Decoder-only Models
- Good for generative tasks such as text generation
- Examples include: 
    - GPT
    - GPT-2
    - GPT-3

<img src='img/10.png' width="600" height="300" align="center"/>
    
    
#### $\star$ $\star$ $\star$ In summary, transformers are made up of encoders and decoders, and the tasks we can perform will depend on whether we use either or both components $\star$ $\star$ $\star$

### Self-Attention
- One of the key ingredients to transformers is **self-attention.**
- In the following example ("The monkey ate that banana because it was too hungry"), how is the model able to determine that the "it" corresponds to the monkey, and not the banana?
    - It does this by using a mechanism called **self-attention**, which incorporates the embeddings of all the other words in the sentence.
    - So, when processing the word "it," self-attention will take a weighted average of the embeddings of the other context words
    - In the example below, the darker the shade, the more weight that word is given (and every word is given some weight):

<img src='img/11.png' width="500" height="250" align="center"/>

- So, what's going on under the hood?
    - As part of the self-attention mechanism, the authors of the original transformer take the word embeddings and project them into three vector spaces, which they call **Q (query)**, **K (key)**, and **V (value)**.
    - Projecting word embeddings into new vector spaces is a tool that mathematicians use to get different representations of the word embeddings
    - In order to calculate the attention weights, we'll take in as input the query, key, and value vectors.
    - We then calculate the **score** of each word to determine how much focus to place on other words in the sentence.
    - We want to try to figure out how the query and the key vectors relate to each other. 
        - This is done by taking the dot product of the query vector and the key vector.
        - Queries and keys that are similar will have a large dot product
        - Queries and keys that don't share much in common will have a dot product close to zero.
        - In the equation below:
            - T means that we're performing a **transpose** operation on the vector K
            - **n** is the dimension of these vectors; we divide by the square root of n to scale the dot product attention, and so reduce its size 
            - We now have the logits and can convert them to probabilities by using the softmax function
            - We then multiply each value vector by the softmax score
            - We can then sum up the weighted value vectors, and this produces the self-attention calculation for a word.
            - $\star$ **This process takes place for every single word in the sentence.** $\star$
            - Self-attention allows us to apply a different weight to words in a sentence.

<img src='img/12.png' width="400" height="200" align="center"/>
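
The steps above correspond to $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{n}}\right)V$. A minimal sketch in plain Python follows; it uses lists instead of tensors for readability, the function names are illustrative, and the query, key, and value vectors are assumed to be already projected.

```python
import math

def softmax(scores):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, K):
    """Scaled dot product of one query with every key, followed by softmax."""
    n = len(q)  # dimension of the query/key vectors
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(n) for k in K]
    return softmax(scores)

def self_attention(Q, K, V):
    """For every word: weight each value vector by its softmax score and sum."""
    outputs = []
    for q in Q:
        weights = attention_weights(q, K)
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs

if __name__ == "__main__":
    # Two toy "words" with 2-dimensional query, key, and value vectors.
    Q = K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(attention_weights(Q[0], K))  # every word gets some weight
    print(self_attention(Q, K, V))
```

Each output row is a weighted average of the value vectors, so a word's new representation mixes in context from every other word in the sentence.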

### Multi-head Attention and Feed Forward Network
- Above we looked at how self-attention can help us provide context for a word for the sentence "the monkey ate that banana because it was too hungry," but what if we could get multiple instances of the self-attention mechanism, so that each can perform a different task?
    - One could make a link between nouns and adjectives
    - Another could connect pronouns to their subjects 
    - etc.
- This is the idea behind **multi-headed attention.**
- What's particularly impressive is we don't create these relations in the model; they're fully learned from the data.
- BERT has 12 such heads, and each multi-head attention block gets three inputs:
    - the query
    - the key
    - the value
    
<img src='img/13.png' width="800" height="400" align="center"/>

- In each of BERT's 12 heads, the query, key, and value are put through linear (dense) layers before the attention function, as shown below.
- Each attention head has its own separate, fully-connected linear layers for the query, key, and value.
- This allows the model to jointly attend to information from different representations and at different positions

<img src='img/14.png' width="600" height="300" align="center"/>

- By having 12 self-attention heads, the BERT model is able to focus on several tasks at once, thus allowing it to make richer connections between words
- Some of the larger language models have significantly more heads
    - For example: GPT-3 has 96 such heads
- The key takeaway from this section is that multi-head attention allows us to make richer connections between words, and none of these connections are created, but rather they're all learned by the model
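
A rough sketch of multi-head attention in plain Python is shown here. The random matrices only stand in for the projection weights that a real model learns from data, and the helper names and sizes are illustrative assumptions rather than BERT's actual implementation.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for every row (word) of Q."""
    d = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (an assumption for illustration)."""
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, num_heads, rng):
    d_model = len(X[0])
    d_head = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # Each head has its own linear layers for query, key, and value.
        Wq, Wk, Wv = (random_matrix(d_model, d_head, rng) for _ in range(3))
        heads.append(attention(matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)))
    # Concatenate the head outputs word by word, then mix them with a final linear layer.
    concat = [[x for head in heads for x in head[t]] for t in range(len(X))]
    Wo = random_matrix(num_heads * d_head, d_model, rng)
    return matmul(concat, Wo)

if __name__ == "__main__":
    rng = random.Random(0)
    X = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]  # 3 words, 8-dim embeddings
    out = multi_head_attention(X, num_heads=2, rng=rng)
    print(len(out), len(out[0]))
```

Because every head gets different projection weights, each one can learn to attend to a different kind of relationship between words.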

## 3. Large Language Models
### GPT-3
- GPT-3 is probably the most well-known large language model
- **GPT**: 
    - **G**enerative: Predicts a future token, given past tokens
    - **P**re-trained: Trained on a large corpus of data
    - **T**ransformer: *Decoder* portion of the transformer architecture
- **GPT-3's objective:** 
    - **Goal: Given the preceding tokens, predict the next token.**
       - Causal language model
       - Autoregressive
- **Data trained on:**
    - **English Wikipedia**
    - **Common Crawl** $\Rightarrow$ raw webpage data
    - **WebText2** $\Rightarrow$ dataset created by scraping URLs from Reddit (with a score of at least 3, used as proxy for quality)
    - **Books1** $\Rightarrow$ collection of novels by unpublished authors
    - **Books2** $\Rightarrow$ collection of novels by unpublished authors
- **Task trained on:**
    - Causal language modeling: predict the next word in a text
    - We can train the model in a self-supervised way and we don't have to annotate our datasets
    - We can then take all these humungous datasets and use them to train our model
    - Additionally, we want to use some decoding algorithms, such as [**beam search**](https://en.wikipedia.org/wiki/Beam_search), to give us a balance of coherent language and diversity so we don't get repeated sentences.

<img src='img/15.png' width="500" height="250" align="center"/>
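
To make the decoding step concrete, here is a minimal beam search sketch over a toy next-token table. The table, the `<s>` and `<eos>` markers, and the function names are made up for illustration; a real model such as GPT-3 would supply the next-token probabilities.

```python
import math

# Toy next-token probabilities keyed by the previous token (hypothetical values).
TABLE = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"<eos>": 1.0},
    "dog": {"<eos>": 1.0},
}

def toy_model(tokens):
    """Return the next-token distribution given the tokens generated so far."""
    return TABLE[tokens[-1]]

def beam_search(next_token_probs, start, beam_width, max_len, end="<eos>"):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, summed log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens[-1] == end:
                candidates.append((tokens, score))  # finished sequences carry over
                continue
            for tok, p in next_token_probs(tokens).items():
                candidates.append((tokens + [tok], score + math.log(p)))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

if __name__ == "__main__":
    for width in (1, 2):
        tokens, score = beam_search(toy_model, "<s>", width, max_len=3)[0]
        print(width, tokens, round(math.exp(score), 2))
```

With a beam width of 1 this reduces to greedy decoding, which commits to "the" first; a wider beam keeps the "a" branch alive and ends up with a more probable full sequence.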

- For a couple of years, researchers have focused on getting a large corpus of data and training a language model
- To use a language model for a specific task (for example: sentiment analysis), you'd have to:
    - Give it hundreds of examples of sentences
    - Sentences labeled as positive or negative
    - Train the model on these sentences and labels
    - Model will produce good results
    
#### What if we could create a language model that performs well on a new task when given only the task and a couple of examples with the expected output? GPT-3 does just this.

- **Prompt:**
    - Way to interact with models
    - Zero-shot learning $\Rightarrow$ given task with no examples
    - One-shot learning $\Rightarrow$ given task with one example and expected output
    - Few-shot learning $\Rightarrow$ given task with a couple of examples and expected output
    
<img src='img/16.png' width="600" height="300" align="center"/>
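
The three prompt styles above differ only in how many worked examples precede the query. A small sketch of a prompt builder follows; the `Input:`/`Output:` layout and the function name are assumptions chosen for illustration, not a format required by GPT-3.

```python
def build_prompt(task, examples, query):
    """Assemble a prompt: task description, worked examples, then the new input."""
    lines = [task]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model is expected to continue from here
    return "\n".join(lines)

if __name__ == "__main__":
    task = "Classify the sentiment of each sentence as positive or negative."
    examples = [
        ("I loved this course.", "positive"),
        ("The audio was terrible.", "negative"),
    ]
    query = "The examples were really clear."

    zero_shot = build_prompt(task, [], query)           # no examples
    one_shot = build_prompt(task, examples[:1], query)  # one example
    few_shot = build_prompt(task, examples, query)      # a couple of examples
    print(few_shot)
```

No model weights change in any of these cases: the examples live entirely in the prompt text.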

- **In summary:** GPT-3 provides an easy way to interact with models.

### GPT-3 Use Cases
- OpenAI provides access to GPT-3 via an API: [https://beta.openai.com/playground](https://beta.openai.com/playground)

#### Classification
- In the following example, we enter the key Fedex with an empty value and the transformer correctly fills in the answer (highlighted in green):

<img src='img/17.png' width="500" height="250" align="center"/>

- If you look up the notes for OpenAI's model training data, the training data cuts off in 2021, but "Meta" and "technology" still seem to be a reasonable answer (below).

<img src='img/18.png' width="500" height="250" align="center"/>

#### Summarize text for a 2nd grader
- We use the first few paragraphs of the Wikipedia entry for GPT-3:

<img src='img/19.png' width="800" height="400" align="center"/>

<img src='img/20.png' width="800" height="400" align="center"/>

#### Ad from product description

<img src='img/21.png' width="600" height="300" align="center"/>

<img src='img/22.png' width="600" height="300" align="center"/>

<img src='img/23.png' width="600" height="300" align="center"/>

#### Study notes

<img src='img/24.png' width="400" height="200" align="center"/>

## Challenges and Shortcomings of GPT-3
#### 1. Bias
- GPT-3 was trained on data that is biased
- Human language and text naturally reflect bias
- GPT-3 was trained on data deemed interesting on Reddit, as judged by upvotes from other users

#### 2. Environmental impact
- A carbon emissions study of large language models was conducted by Google and Berkeley in 2021 and found that **training GPT-3 would've resulted in energy consumption of almost 1,300 MWh (megawatt hours) and the release of 550 tons of $CO_2$**


#### It's worth noting that some of the large language models that followed GPT-3 tried to optimize/address some of these challenges (bias and environmental impact)

## GLaM
- The Google research team noted that training large dense models requires a significant amount of compute resources, and they proposed a family of language models called GLaM
- **GLaM**: **G**eneralist **La**nguage **M**odel
- Sparse model
- Significantly lower training cost compared to an equivalent dense model
- **Uses 1/3 of the energy used to train GPT-3 and still has better overall zero-shot and one-shot performance across the board.**
- Largest GLaM model has 1.2 trillion parameters, which is approximately 7x larger than GPT-3

### GLaM Architecture
- Source: "GLaM: Efficient Scaling of Language Models With Mixture-of-Experts" (Du et al.)
- Upper block is a transformer layer (notice the multi-head attention and feed-forward network)
- Bottom block has a "mixture-of-experts" (MoE) layer (the MoE gating uses softmax)

<img src='img/25.png' width="500" height="250" align="center"/>

- Even though each mixture-of-experts layer has many more parameters, the experts are sparsely activated: for a given input token, only a limited subset of experts is used.
    - During training, each mixture-of-experts layer's gating network is trained to use its input to activate the best two experts for each token of an input sequence.
    - During inference, the learned gating network dynamically picks the two best experts for each token.
    - As a result, even though the GLaM model has 1.2 trillion parameters, only 96.6 billion of those are activated for each token (much less than the 175 billion of GPT-3)

<img src='img/26.png' width="600" height="300" align="center"/>
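
The top-two routing described above can be sketched as follows. This is a toy illustration in plain Python, not the GLaM implementation: the experts are simple stand-in functions, the names are hypothetical, and renormalizing the two selected gate weights so they sum to one is an assumption.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top2_gate(gate_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    probs = softmax(gate_logits)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)  # assumption: renormalize over the chosen pair
    return [(i, probs[i] / norm) for i in top2]

def moe_layer(token, gate_weights, experts):
    """Route one token vector through its two best experts and mix their outputs."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    output = [0.0] * len(token)
    for idx, weight in top2_gate(logits):
        expert_out = experts[idx](token)  # only the selected experts are evaluated
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output

if __name__ == "__main__":
    # Four toy experts, each just scaling its input by a different constant.
    experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
    gate_weights = [[0.1, 0.0], [2.0, 0.5], [0.3, 0.1], [1.0, 0.2]]
    print(top2_gate([0.1, 2.0, 0.3, 1.0]))
    print(moe_layer([1.0, 1.0], gate_weights, experts))
```

The unselected experts are never called, which is why the compute per token depends on the activated parameters rather than on the total parameter count.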

- **In summary: the objective of Google's GLaM model is to reduce the training and inference cost using a sparse mixture of experts model.**


## Megatron-Turing NLG Model
- A lot of the research after GPT-3 was released seemed to indicate that scaling up models improved performance.
- So Microsoft and NVIDIA partnered to create the Megatron-Turing NLG model, with a massive three times more parameters than GPT-3

<img src='img/27.png' width="800" height="400" align="center"/>

### Model Parameters
- Modelwise, the architecture uses the transformer's decoder just like GPT-3, but you can see that it has more layers and more attention heads than GPT-3.

<img src='img/28.png' width="600" height="300" align="center"/>

#### Hardware challenges
The researchers identified a couple of challenges with working with large language models:
- Can't fit parameters of largest language models in memory of largest GPUs
- Need parallelism techniques on both memory and compute to use thousands of GPUs
- Although these researchers achieved superior zero-, one-, and few-shot learning accuracies on several NLP benchmarks and established some new state-of-the-art results, **much of their success probably stems from the supercomputing hardware infrastructure that was developed**, with an enormous 600 NVIDIA DGX A100 nodes.

<img src='img/29.png' width="600" height="300" align="center"/>

#### In summary: the objective around the Megatron-Turing language model seems to be mostly around hardware infrastructure, and this model was one of the largest dense decoder models, coming in at 530 billion parameters

## Gopher
- The DeepMind research team released Gopher in January 2022
- They released six flavors of the model ranging from 44M to 280B parameters
- Put together a diverse dataset called MassiveText
- Tested on 152 tasks
- Architecture is similar to GPT-3 in that it just uses the decoder portion of the transformer

<img src='img/31.png' width="800" height="400" align="center"/>

- The MassiveText corpus has 2.3 trillion tokens, but the model only trains on a subset of these tokens, so the model doesn't get to see the whole dataset
- Over 99% of MassiveText is in English. The remaining text is mostly Hindi, followed by a couple of mostly European languages.
- If we look at the top six domains of MassiveWeb, we can see that at least four are either academic or scientific in nature. 

<img src='img/32.png' width="600" height="300" align="center"/>

- So it shouldn't be much of a surprise that many of the tasks that Gopher is tested on are scientific in nature, such as High School Chemistry, etc.

<img src='img/33.png' width="600" height="300" align="center"/>

- There are 152 different tasks that the model is evaluated on, and they range from reading comprehension and fact checking to mathematics, common sense, and logical reasoning
- **Gopher** outperforms state-of-the-art large language models in 100 of the 124 tasks where a comparison with prior results was available.
- Below, the x-axis refers to different tasks within the category 

<img src='img/34.png' width="600" height="300" align="center"/>

- In general, Gopher doesn't do as well on tasks such as language modeling, common sense, and logical reasoning.

<img src='img/35.png' width="600" height="300" align="center"/>

<img src='img/36.png' width="600" height="300" align="center"/>

<img src='img/37.png' width="600" height="300" align="center"/>

<img src='img/38.png' width="600" height="300" align="center"/>


## Scaling laws
- Why do we have such large parameter models?
- Around the time of the release of GPT-3, the OpenAI team released some results around what they called the scaling laws for large models.
- **The performance of large models is a function of:**
    - **Model parameters**
    - **Size of the dataset**
    - **Total amount of compute available for training**
    
<img src='img/39.png' width="800" height="400" align="center"/>

- The OpenAI team then goes on to propose that as more compute becomes available, you can decide where you want to allocate it: training a larger model, using larger batches, or training for more steps
- **The conclusion they came to was that most of the increase should go toward increasing the model size**
- **There will be some benefit to using more data and using large batch sizes but minimal contribution if you train for more steps**
- **One reason that model sizes have just gotten bigger since GPT-3 is that these scaling laws suggest that increasing the model size will give you the biggest benefit**

<img src='img/41.png' width="800" height="400" align="center"/>

## Chinchilla

- Up until this point, we've seen that the trend has been to increase model size.
- Interestingly, the number of training tokens used for most of these models has been around 300 billion
- The DeepMind team's hypothesis was that these models were too large, and that if you take the same compute budget, a smaller model trained on more data will perform better
- They then tested this hypothesis by training over 400 language models ranging from 70 million to over 16 billion parameters with datasets from five to 500 billion tokens
- They then trained **Chinchilla**, a 70 billion parameter model with 1.4 trillion training tokens
- And **Chinchilla outperforms all previous models, including Gopher, on a large range of downstream evaluation tasks.**
- As this is a smaller model, this means less compute required for fine-tuning and inference.
- The conclusion that the DeepMind team came to was very different from the conclusion that OpenAI came to:

### Recommendation from Chinchilla Paper
- Tenfold increase in computational budget
    - model size $\Rightarrow$ scaled in equal proportions
    - number of training tokens $\Rightarrow$ scaled in equal proportions

<img src='img/42.png' width="800" height="400" align="center"/>

- FLOPs = floating point operations; a measure of computation
- The DeepMind team set out to answer this question:
    - **Given a fixed FLOPs budget, how should one trade off model size and the number of training tokens?**
    
<img src='img/43.png' width="800" height="400" align="center"/>
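
A back-of-the-envelope sketch of this trade-off is given here. These notes do not state a formula for training compute, so the code assumes the commonly used approximation of about 6 FLOPs per parameter per training token; the function names are illustrative.

```python
import math

def training_flops(params, tokens):
    """Approximate training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * params * tokens

def scale_equally(params, tokens, compute_multiplier):
    """Spend extra compute by scaling model size and training tokens in equal proportions."""
    factor = math.sqrt(compute_multiplier)  # params * tokens must grow by the multiplier
    return params * factor, tokens * factor

if __name__ == "__main__":
    gopher = training_flops(280e9, 300e9)       # 280B parameters, ~300B tokens
    chinchilla = training_flops(70e9, 1.4e12)   # 70B parameters, 1.4T tokens
    print(f"Gopher:     {gopher:.2e} FLOPs")
    print(f"Chinchilla: {chinchilla:.2e} FLOPs")

    # A tenfold compute budget, split evenly between model size and data.
    params, tokens = scale_equally(70e9, 1.4e12, 10)
    print(f"10x budget: {params:.2e} parameters, {tokens:.2e} tokens")
```

Under this approximation the two models land at roughly 5.0e23 and 5.9e23 FLOPs, so Chinchilla spends a broadly similar budget on a model a quarter of the size trained on far more tokens.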

- **For a given FLOPs budget, what is the optimal parameter count?**

<img src='img/44.png' width="800" height="400" align="center"/>

<img src='img/45.png' width="800" height="400" align="center"/>

- They concluded that, for the same compute budget, you can end up with a more performant model by using a smaller model trained on more data.

<img src='img/46.png' width="800" height="400" align="center"/>

## BIG-bench

#### Challenges with current benchmarks
- Too narrow in scope
    - language understanding
    - summarization
- **BIG-bench**: **B**eyond the **I**mitation **G**ame **Bench**mark
    - [See website here](https://www.github.com/google/BIG-bench)
    - 200 tasks that humans perform well on but current state-of-the-art models don't
    - Team of human expert raters 
        - Performed all tasks to provide strong baseline
        - Used all available resources (including searching the internet)
    - Task examples:
        - Checkmate in one move
        - Guess movies from their emoji descriptions $\Rightarrow$ The focus of this task is on describing the movie plot with emojis, rather than the movie title, so the computer needs to understand the movie plot.
        - Kannada riddles
    - **Results:**
        - **No model, regardless of size, outperformed the best-performing human on any task.**
        - **However, on some tasks, the bes

<img src='img/x.png' width="800" height="400" align="center"/>