# LLaMA (2023)

This notebook provides an overview of the LLaMA family of models. Because the family keeps evolving, we cover LLaMA, LLaMA 2, and LLaMA 3. The objective is to understand the most important aspects that make these models perform better than their predecessors, even with a lower parameter count.

List of main relevant papers:
* [Touvron et al. (2023)](https://arxiv.org/pdf/2302.13971.pdf). LLaMA: Open and Efficient Foundation Language Models
* [Touvron et al. (2023)](https://arxiv.org/pdf/2307.09288.pdf). LLaMA 2: Open Foundation and Fine-Tuned Chat Models
* [Llama Team (2024)](https://arxiv.org/pdf/2407.21783). The LLaMA 3 Herd of Models

List of main relevant blogs:
* [Meta (2023)](https://ai.meta.com/blog/large-language-model-llama-meta-ai/). Introducing LLaMA: A foundational, 65-billion-parameter large language model
* [Meta (2023)](https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/). LLaMA 2: Open Foundation and Fine-Tuned Chat Models

List of main relevant YouTube videos and tutorials:
* [Kilcher (2023)](https://www.youtube.com/watch?v=E5OnoYF2oAk). LLaMA explained
* [Jamil (2023)](https://www.youtube.com/watch?v=Mn_9W1nCFLo). LLaMA explained
* [Khan (2024)](https://lightning.ai/fareedhassankhan12/studios/building-llama-3-from-scratch). Building Llama 3 from scratch with Pytorch
* [Sebastian Raschka (2024)](https://github.com/rasbt/LLMs-from-scratch/blob/main/ch05/07_gpt_to_llama/converting-gpt-to-llama2.ipynb). Converting GPT-2 to Llama 2

# 1 - Introduction

Large Language Models (LLMs) trained on massive corpora of text have shown their ability to perform new tasks from textual instructions or from a few examples ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)). These few-shot properties first appeared when scaling models to a sufficient size, resulting in a line of work that focuses on further scaling these models.

However, recent work from [Hoffmann et al. (2022)](https://arxiv.org/abs/2203.15556) shows that, for a given compute budget, the best performance is achieved not by the largest models but by smaller models trained on more data. The authors observe that the trend up to that point in large language model training had been to increase model size, often without increasing the number of training tokens. They hypothesize that the race to train larger and larger models results in models that substantially underperform what could be achieved with the same compute budget. In this context, they propose a new [*scaling law*](https://en.wikipedia.org/wiki/Neural_scaling_law) that determines how to best scale the dataset and model sizes for a particular training compute budget (measured in FLOPs).

## 1.1 - LLaMA

The LLaMA authors build on these ideas, but argue that the Chinchilla scaling law of Hoffmann et al. is not enough on its own: it considers only the training budget and disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, **the preferred model is not the fastest to train but the fastest at inference**. Although it may be cheaper to train a large model to reach a certain performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although [Hoffmann et al. (2022)](https://arxiv.org/abs/2203.15556) recommend training a 10B parameter model on 200B tokens, **the LLaMA authors find that the performance of a 7B model continues to improve even after 1T tokens**.
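To make the training versus inference trade-off concrete, here is a rough sketch. It assumes the common rule-of-thumb cost model in which training costs about $6ND$ FLOPs and generating one token costs about $2N$ FLOPs ($N$ parameters, $D$ training tokens). This cost model and the helper names are assumptions for illustration, not something taken from the LLaMA paper.

```python
# Rule-of-thumb cost model (an assumption, not taken from the LLaMA paper):
#   training FLOPs  ~ 6 * N * D   (N = parameters, D = training tokens)
#   inference FLOPs ~ 2 * N       per generated token

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

def inference_flops_per_token(n_params: float) -> float:
    return 2.0 * n_params

# The two configurations mentioned in the text
configs = {
    "10B params, 200B tokens": (10e9, 200e9),
    "7B params, 1T tokens": (7e9, 1e12),
}

for name, (n, d) in configs.items():
    print(
        f"{name}: train ~ {training_flops(n, d):.1e} FLOPs, "
        f"inference ~ {inference_flops_per_token(n):.1e} FLOPs/token, "
        f"{d / n:.0f} tokens per parameter"
    )
```

Under these assumptions the 7B configuration costs about 3.5 times more to train, but every generated token is about 30% cheaper, and that saving is paid back on each request served.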

The focus of the paper is to train a series of language models that achieve the best possible performance at various inference budgets, **by training on more tokens than what is typically used**. The resulting models range from 7B to 65B parameters and are competitive with the best existing LLMs. Another important aspect of the paper is that, unlike Chinchilla ([Hoffmann et al., 2022](https://arxiv.org/abs/2203.15556)), PaLM ([Chowdhery et al., 2022](https://arxiv.org/abs/2204.02311)), or GPT-3 ([Brown et al., 2020](https://arxiv.org/abs/2005.14165)), LLaMA was trained only with publicly available data.

<table>
    <tr>
        <td><img src="./images_1/llama_1_table.png" width="600"/></td>
    </tr>
</table>

## 1.2 - LLaMA 2

LLaMA 2 is an updated version of LLaMA 1, trained on a new mix of publicly available data. To create the new family of LLaMA 2 models, the authors began with the pretraining approach of LLaMA, using an optimized auto-regressive transformer (i.e., a decoder-only Transformer), but made several changes to improve performance. Specifically, they made the following changes:

* Performed more robust data cleaning
* Updated the data mixes
* Trained on 40% more total tokens (100% for smaller models since they had not been previously trained on all of the data)
* Doubled the context length
* Used grouped-query attention (GQA) to improve inference scalability in combination with KV cache

<table>
    <tr>
        <td><img src="./images_1/llama_2_table.png" width="800"/></td>
    </tr>
</table>

In addition to a foundation model, the authors released **a fine-tuned version of LLaMA 2 that was optimized for dialogue use cases** using Reinforcement Learning from Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in parallel with model enhancements is crucial to ensure the reward models remain within distribution.

<table>
    <tr>
        <td><img src="./images_1/llama_chat.png" width="1000"/></td>
    </tr>
</table>

## 1.3 - Llama 3

LLaMA 3 represents significant advancements over LLaMA 2 in several key areas, including scalability, efficiency, and performance. Here's a focused breakdown of the improvements:

- **Vocabulary size:** LLaMA 3 uses a **128,256-token vocabulary**, compared to LLaMA 2’s 32,000 tokens.
- **Training data:** Trained on **15 trillion tokens**, over 7 times more than LLaMA 2’s 2 trillion, improving its knowledge base and multilingual capabilities.
- **Context length:** LLaMA 3 doubles the context length to **8,192 tokens**.
- **Architectural optimizations:** Llama 2 used **Grouped-Query Attention (GQA)** only on big models. Llama 3 uses it on smaller models too, like the 8B model.
- **Multilingual support:** Significantly better at handling **30+ languages** (Llama 2 supported 20) due to expanded training data and vocabulary, including a larger share of non-English tokens.
- **Dialogue optimization:** Uses **Proximal Policy Optimization (PPO)** and **Direct Preference Optimization (DPO)** for improved conversational and interactive tasks.
- **Safety and AI responsibility:** Stronger emphasis on safety with features like **Llama Guard 2** and red-teaming practices, enhancing responsible deployment.
- **Training efficiency:** LLaMA 3 benefits from Meta’s **advanced 24,000-GPU clusters**, tripling training efficiency compared to LLaMA 2.

<table>
    <tr>
        <td><img src="./images_1/llama_3_table.png" width="500"/></td>
    </tr>
</table>


# 2 - Architecture


The first difference from the original Transformer architecture is that **LLaMA is a decoder-only model**. In addition, it leverages various improvements that were subsequently proposed and used in different models. Here are the main differences from the original architecture, with the model that inspired each change in brackets:

* **Pre-normalization with RMSNorm [GPT-3]**. To improve the training stability, the input of each transformer sub-layer is normalized instead of normalizing the output. They use the **RMSNorm** normalizing function, introduced by [Zhang and Sennrich (2019)](https://arxiv.org/abs/1910.07467)

* **Rotary embeddings [GPTNeo]**. They remove the absolute positional embeddings, and instead, add rotary positional embeddings (RoPE), introduced by [Su et al. (2021)](https://arxiv.org/abs/2104.09864), at each layer of the network

* **Grouped query attention** ([Ainslie et al., 2023](https://arxiv.org/abs/2305.13245)). A new type of attention mechanism that improves the performance and efficiency of transformer models by dividing the query heads into groups and sharing the key and value heads across each group

* **SwiGLU activation function [PaLM]**. They replace the ReLU non-linearity by the SwiGLU activation function, introduced by [Shazeer (2020)](https://arxiv.org/abs/2002.05202). They use a dimension of $\frac{2}{3} 4d$ instead of $4d$ as in PaLM.

<table>
    <tr>
        <td><img src="./images_1/llama_3_architecture.jpeg" width="1300"/></td>
    </tr>
</table>

## 2.1 - Tokenizer

Data is tokenized with the byte-pair encoding (BPE) algorithm ([Sennrich et al., 2016](https://arxiv.org/abs/1508.07909)) using the implementation from SentencePiece ([Kudo and Richardson, 2018](https://arxiv.org/abs/1808.06226)). Notably, all numbers are split into individual digits, and unknown UTF-8 characters are decomposed into bytes as a fallback. The total vocabulary size is 32K tokens for LLaMA and LLaMA 2.
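The two behaviours mentioned above (digit splitting and byte fallback) can be illustrated with a toy sketch. This is not SentencePiece and performs no BPE merges: the function `toy_tokenize` and the tiny vocabulary are hypothetical, and pieces outside the vocabulary are written as `<0xNN>` byte tokens to mimic the byte-fallback idea.

```python
import re

def toy_tokenize(text, vocab):
    """Toy tokenizer: every digit is its own piece; unknown pieces fall back to UTF-8 bytes."""
    # Each digit is matched alone; any other run of non-space characters is kept whole.
    pieces = re.findall(r"\d|[^\d\s]+", text)
    tokens = []
    for piece in pieces:
        if piece in vocab:
            tokens.append(piece)
        else:
            # Byte fallback: emit one token per UTF-8 byte of the unknown piece.
            tokens.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return tokens

# Hypothetical vocabulary: one word plus the ten digits
vocab = {"year", *"0123456789"}
print(toy_tokenize("year 2023 é", vocab))
# ['year', '2', '0', '2', '3', '<0xC3>', '<0xA9>']
```

Because digits are never merged, a number such as 2023 always costs four tokens, and any character missing from the vocabulary can still be represented without an unknown token.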

## 2.2 - Layer normalization & RMSNorm

**Layer normalization (LN)** is a technique for normalizing the activations of a neural network layer. This helps to stabilize the training process and improve the performance of the model.  Layer normalization works by subtracting the mean of the activations from each activation and then dividing by the standard deviation of the activations. This ensures that the activations have a mean of zero and a standard deviation of one.

Basically, with LN we transform the input or output of a layer (depending on where the normalization is applied) into values with zero mean and unit variance. If the activations were Gaussian, the result would follow a standard Gaussian distribution $\mathcal{N}(0,1)$. This can be easily shown: assuming our data $X$ follows a Gaussian distribution with mean $\mu = 5$ and standard deviation $\sigma = 6$, subtracting the mean and dividing by the standard deviation gives $Z = \frac{X-5}{6} \sim \mathcal{N}(0,1)$.

<table>
    <tr>
        <td><img src="./images_1/layer_normalization.png" width="450"/></td>
    </tr>
</table>
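As a quick numerical check of the example above, the following NumPy sketch normalizes activations drawn with mean 5 and standard deviation 6. The learnable gain `gamma` and bias `beta` are part of the standard LN formulation but are not discussed in the text, so here they are simply set to ones and zeros. The sizes are arbitrary.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) axis, then apply gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=6.0, size=(4, 512))  # 4 tokens, 512 features each
y = layer_norm(x, gamma=np.ones(512), beta=np.zeros(512))

print("mean per token:", y.mean(axis=-1).round(4))
print("std per token: ", y.std(axis=-1).round(4))
```

Each token's feature vector ends up with a mean close to 0 and a standard deviation close to 1, regardless of the statistics of the input.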

**Root Mean Square Normalization (RMSNorm)** is a simplified version of LN that is less computationally expensive. RMSNorm works by dividing each activation by the root mean square of the activations (the square root of the mean of the squared activations) and then multiplying by a learned gain. No mean is subtracted, so RMSNorm only provides re-scaling invariance. The authors hypothesize that re-scaling invariance is the reason for the success of LN, rather than re-centering invariance. Intuitively, RMSNorm simplifies LN by totally removing the mean statistic:

<table>
    <tr>
        <td><img src="./images_1/rmsnorm.png" width="850"/></td>
    </tr>
</table>


Both LN and RMSNorm are effective at normalizing the activations of neural networks. However, LN is more computationally expensive than RMSNorm. This is because LN needs to compute the mean and standard deviation of the activations for each layer, while RMSNorm only needs to compute the square root of the mean squared activations. 
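A minimal NumPy sketch of RMSNorm follows, assuming a per-feature learnable gain (set to ones here) and a small `eps` for numerical stability. The input sizes are arbitrary. It shows that the output is re-scaled but not re-centered, and that scaling the input leaves the output unchanged.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Divide by the root mean square over the last axis; no mean is subtracted."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=6.0, size=(4, 512))
gain = np.ones(512)

y = rms_norm(x, gain)
y_scaled = rms_norm(10.0 * x, gain)  # same input, multiplied by 10

print("RMS per token: ", np.sqrt((y ** 2).mean(axis=-1)).round(4))  # close to 1
print("mean per token:", y.mean(axis=-1).round(4))                  # not centered
print("re-scaling invariant:", np.allclose(y, y_scaled, atol=1e-5))
```

Compared with LN, the function computes a single statistic per token instead of two, which is where the computational saving comes from.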



## 2.3 - Positional encodings

### 2.3.1 - What is the difference between absolute and relative positional encodings?

Intuitively, imagine putting a sticker on every token. Absolute positional encodings use stickers that say "I'm in the first position," "I'm in the second position," and so on. Relative positional encodings use stickers that say "I'm close to this one," "I'm far from that one," to help us understand where the tokens are in relation to each other.

**Absolute Positional Encodings ([Vaswani et al., 2017](https://arxiv.org/abs/1706.03762)):**
* Absolute positional encodings are typically fixed embeddings that are added to the token embeddings based on the absolute positions of tokens in the input sequence.
* They are typically precomputed or learned independently of the input data.
* For example, in the original transformer model, absolute positional encodings are typically represented as sine and cosine functions of the position and are added to the token embeddings to provide a fixed positional signal for the model.

**Relative Positional Encodings ([Shaw et al., 2018](https://arxiv.org/abs/1803.02155)):**
* Relative positional encodings are computed based on the relative positions of tokens within a sequence.
* They capture how each token relates to the others in terms of their positions, making them more adaptable to different sequence lengths.
* Relative positional encodings are particularly useful in cases where sequences have variable lengths or when dealing with very long sequences.
* These encodings can be learned from the data, taking into account the relative distances between tokens, which allows the model to adapt to different sequences.

**Sinusoidal functions** allow models to estimate relative positions between tokens through their periodic and smooth nature. By encoding absolute positions with sine and cosine at different frequencies, the embeddings create predictable, continuous differences between positions. This enables models to infer relative distances, helping capture both short- and long-range dependencies in a sequence.

**Absolute positional embeddings** are, in practice, tied to a **fixed maximum sequence length** (e.g., 512 or 1024 positions), which is a hard limit when the embeddings are learned. Sinusoidal encodings can be computed for any position, but a model still tends to generalize poorly to positions beyond the lengths seen during training, which limits their usefulness for longer sequences.
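The sinusoidal encoding of Vaswani et al. (2017) can be sketched in a few lines of NumPy. The function name and sizes are illustrative, and an even `d_model` is assumed. The last lines check the property mentioned above: the dot product between two encodings depends only on the distance between the positions.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model, base=10000.0):
    """Return an (n_positions, d_model) matrix of sine/cosine positional encodings."""
    positions = np.arange(n_positions)[:, None]            # (n_positions, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)  # (d_model / 2,)
    angles = positions * freqs                              # (n_positions, d_model / 2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even feature indices
    pe[:, 1::2] = np.cos(angles)  # odd feature indices
    return pe

pe = sinusoidal_positions(n_positions=128, d_model=64)

# Two pairs of positions with the same offset (5) give the same dot product
print(np.isclose(pe[10] @ pe[15], pe[50] @ pe[55]))
```

This works because $\sin(a)\sin(b) + \cos(a)\cos(b) = \cos(a - b)$, so every frequency contributes a term that depends only on the offset.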

### 2.3.2 - Rotary positional encodings

> [Very good blog post about RoPE](https://karthick.ai/blog/2024/Rotatory-Position-Embedding-(RoPE)/)

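The main idea is to **encode position by rotating the query and key vectors, instead of adding a positional vector to the token embeddings**. Each pair of feature dimensions is rotated by an angle proportional to the token's position, using fixed frequencies (these rotations are not learned). As a result, the dot product between a query at position $m$ and a key at position $n$ depends only on the relative offset $m - n$.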

Advantages of RoPE
* Long-Range Context: RoPE adeptly captures relationships between tokens across lengthy sequences, a challenge for conventional positional embeddings.
* Relative by construction: The attention score between two tokens depends only on their offset, not on their absolute positions, so the encoding is not tied to a table of positions with a fixed maximum length.
* Interpretability: The rotational approach offers an intuitive geometric interpretation of how positional information influences the attention mechanism.

Other practical differences with the original positional encodings:
* Rotary position embeddings are only applied to the query and the keys, but not the values. 
* Rotary position embeddings are applied after the **q** and **k** vectors have been multiplied by the W matrix in the attention mechanism, while in the vanilla Transformer positional encodings are applied before.
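A minimal NumPy sketch of rotary embeddings, under a few assumptions: the rotation angles are fixed functions of the position with base 10000 as in Su et al. (2021), consecutive feature pairs are rotated together (implementations differ in how they pair dimensions), and `q` and `k` stand for vectors that have already been projected by their W matrices.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (seq_len, d) by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d / 2,) one frequency per pair
    angles = positions[:, None] * freqs[None, :]   # (seq_len, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin    # 2D rotation of each pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))  # one projected query vector
k = rng.normal(size=(1, 64))  # one projected key vector

# Same offset (4) at very different absolute positions
score_a = rope(q, np.array([3])) @ rope(k, np.array([7])).T
score_b = rope(q, np.array([103])) @ rope(k, np.array([107])).T
print(np.allclose(score_a, score_b))
```

The attention score is identical in both cases because rotating the query by position $m$ and the key by position $n$ is equivalent to a single rotation by $n - m$ inside the dot product. The values are left untouched, matching the first bullet above.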

## 2.4 - Self-attention

Reminder of multi-head self-attention:

<table>
    <tr>
        <td><img src="./images_1/self_attention.png" width="1000"/></td>
    </tr>
</table>

### 2.4.1 - KV cache

Since the decoder is causal (i.e., the attention of a token only depends on its preceding tokens), at each generation step we are recalculating the same previous token attention, **when we actually just want to calculate the attention for the new token**.

This is where the KV cache comes into play. By caching the previous Keys and Values, we only need to calculate the attention for the new token.

Why is this optimization important? As seen in the picture below, the matrices obtained with KV caching are much smaller, which leads to faster matrix multiplications. The only downside is that it needs more GPU VRAM (or CPU RAM if a GPU is not being used) to cache the Key and Value states.

<table>
    <tr>
        <td><img src="./images_1/kv_caching_2.gif" width="600"/></td>
    </tr>
</table>
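The idea can be checked numerically with a small single-head sketch in NumPy. The weights and inputs are random and the sizes arbitrary. The loop processes one token at a time, appending its key and value to a cache, and the result is compared with full causal attention recomputed over the whole sequence.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, w_q, w_k, w_v):
    """Full recomputation: attention output for every position of x (seq_len, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)               # hide future tokens
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(6, d))  # embeddings of 6 tokens

k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
cached_outputs = []
for t in range(x.shape[0]):
    x_t = x[t:t + 1]                           # only the new token
    k_cache = np.vstack([k_cache, x_t @ w_k])  # append its key ...
    v_cache = np.vstack([v_cache, x_t @ w_v])  # ... and its value
    q_t = x_t @ w_q
    weights = softmax(q_t @ k_cache.T / np.sqrt(d))
    cached_outputs.append(weights @ v_cache)
cached_outputs = np.vstack(cached_outputs)

full_outputs = causal_attention(x, w_q, w_k, w_v)
print(np.allclose(cached_outputs, full_outputs))
```

With the cache, each step multiplies a single query row against the stored keys, instead of recomputing a full `seq_len x seq_len` score matrix.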

```python
import time

import numpy as np
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# Tokenize the prompt once so that only generation is timed
inputs = tokenizer("What is KV caching?", return_tensors="pt").to(device)
n_runs = 1  # number of generations to average over

for use_cache in (True, False):
    times = []
    for _ in range(n_runs):
        start = time.time()
        model.generate(**inputs, use_cache=use_cache, max_new_tokens=1000, pad_token_id=tokenizer.eos_token_id)
        times.append(time.time() - start)
    print(f"{'with' if use_cache else 'without'} KV caching: {round(np.mean(times), 3)} +- {round(np.std(times), 3)} seconds")
```

Output from a single run (timings depend on the hardware):

```
with KV caching: 8.031 +- 0.0 seconds
without KV caching: 47.205 +- 0.0 seconds
```


### 2.4.2 - Grouped Query Attention (GQA)

GPUs have a "problem": they are too fast. For example, an A100 GPU can perform arithmetic operations roughly 40 times faster than it can move the corresponding data to and from memory. This discrepancy highlights an important bottleneck: it's not always the number of operations being performed that limits performance, but rather the amount of data transfer required for those operations.

For example, applying the same operation to the same tensor $N$ times may be faster than applying it to $N$ different tensors of the same size, because the GPU may need to move the tensors around.

**Our goal should not only be to optimize the number of operations we do, but also minimize the memory access/transfers that we perform**.

To do that, we can modify the way we compute self-attention. We can distinguish three possible approaches:
* Multi-head self-attention ([Vaswani et al., 2017](https://arxiv.org/abs/1706.03762))
* Multi-query self-attention ([Shazeer, 2019](https://arxiv.org/abs/1911.02150))
* Grouped-query self-attention ([Ainslie et al., 2023](https://arxiv.org/abs/2305.13245))

<table>
    <tr>
        <td><img src="./images_1/different_types_of_attention.png" width="650"/></td>
    </tr>
</table>

#### Multi-head self-attention

<table>
    <tr>
        <td><img src="./images_1/multi_head_attention.png" width="900"/></td>
    </tr>
</table>

#### Multi-head self-attention with KV cache

<table>
    <tr>
        <td><img src="./images_1/multi_head_attention_kv.png" width="900"/></td>
    </tr>
</table>

#### Multi-query attention (MQA) with KV cache

[Shazeer (2019)](https://arxiv.org/abs/1911.02150) proposed a refinement to the Multi-Head Attention (MHA) algorithm called Multi-Query Attention (MQA), which reduces the memory bandwidth overhead of loading keys and values by using multiple query heads with a single key/value head.

In the following picture we can see that the $K$ and $V$ tensors are not repeated across heads: they do not have the $h$ dimension (the number of heads).

<table>
    <tr>
        <td><img src="./images_1/multi_query_attention.png" width="900"/></td>
    </tr>
</table>
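To get a feeling for how much memory is at stake, the sketch below estimates the size of the KV cache as a function of the number of key/value heads. The configuration values are illustrative assumptions chosen to resemble a large model, with 2 bytes per value (16-bit floats). The intermediate setting with 8 key/value heads corresponds to the grouped-query case discussed in the next section.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """Size of the cached keys and values for all layers."""
    # 2 tensors (K and V) per layer, each of shape (batch, n_kv_heads, seq_len, head_dim)
    return 2 * n_layers * batch_size * n_kv_heads * seq_len * head_dim * bytes_per_value

# Illustrative configuration (assumed values, not taken from the text)
n_layers, n_heads, head_dim, seq_len = 80, 64, 128, 4096

for name, n_kv_heads in [("MHA", n_heads), ("8 KV heads", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len) / 2**30
    print(f"{name:>10}: {gib:.3f} GiB per sequence")
```

With these numbers the cache shrinks from 10 GiB per sequence with MHA to about 0.16 GiB with MQA, which directly reduces the amount of data that has to be read from memory at every decoding step.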

#### Grouped-query attention (GQA)

[Ainslie et al. (2023)](https://arxiv.org/abs/2305.13245) point out some drawbacks of MQA. Specifically, MQA can lead to a decline in quality and to training instability. In addition, since **the primary goal of MQA is to accelerate inference, it may not be practical to train separate models optimised for quality and for inference**, as stated in the paper.

Grouped-query attention (GQA) is an intermediate option between MHA and MQA: the query heads are divided into groups, and each group shares a single key head and value head. With one group GQA is equivalent to MQA, and with as many groups as query heads it is equivalent to MHA.

<table>
    <tr>
        <td><img src="./images_1/grouped_query_attention.jpg" width="900"/></td>
    </tr>
</table>
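The grouping can be sketched in NumPy as follows. The shapes are arbitrary and the inputs stand for already-projected query, key and value heads. For simplicity the key/value heads are explicitly repeated with `np.repeat`. An optimized implementation would avoid materializing these copies, so this is only meant to show which query heads share which key/value head.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d_head); k, v: (n_kv_heads, seq, d_head). Causal attention."""
    n_q_heads, n_kv_heads = q.shape[0], k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    # Each key/value head serves `group_size` consecutive query heads
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    seq = q.shape[1]
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # True for future positions
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 5, 16))  # 8 query heads
k = rng.normal(size=(2, 5, 16))  # 2 key heads   -> 2 groups of 4 query heads
v = rng.normal(size=(2, 5, 16))  # 2 value heads

out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

Passing `k` and `v` with 8 heads reproduces multi-head attention, and passing them with a single head reproduces multi-query attention, so the same function covers all three variants.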

## 2.5 - SwiGLU activation function

[Shazeer (2020)](https://arxiv.org/abs/2002.05202) studies how GLU variants improve Transformer performance.

GLU (Gated Linear Unit) is a neural network layer, not an activation function in the strict sense. It is a linear transformation followed by a gating mechanism. The gating mechanism is a sigmoid function that controls the flow of information from the linear transformation. The GLU has non-linear capabilities, but it also has a linear path for the gradient, which mitigates the vanishing gradient problem.
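For reference, in the notation of [Shazeer (2020)](https://arxiv.org/abs/2002.05202), where $\sigma$ is the sigmoid function and $\otimes$ is the element-wise product, the GLU layer can be written as:

$$\text{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)$$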

We can define GLU variants using other activation functions than the sigmoid function.

#### Bilinear activation

The bilinear layer is a GLU variant that omits the sigmoid function. It is the element-wise product of two linear transformations.

$$\text{Bilinear}(x, W, V, b, c) = (xW + b) \otimes (xV + c)$$

#### ReGLU activation

ReGLU is a GLU variant that uses ReLU as the activation function:

$$\text{ReGLU}(x, W, V, b, c) = \text{ReLU}(xW + b) \otimes (xV + c)$$

#### GEGLU activation

GEGLU is a GLU variant that uses GELU as the activation function.

$$\text{GEGLU}(x, W, V, b, c) = \text{GELU}(xW + b) \otimes (xV + c)$$

#### SwiGLU activation

SwiGLU is a GLU variant that uses Swish as the activation function.

$$\text{SwiGLU}(x,W, V, b, c) = \text{Swish}_1(xW + b) \otimes (xV + c)$$
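A minimal NumPy sketch of a feed-forward block built on SwiGLU follows. It assumes no bias terms ($b = c = 0$), uses $\text{Swish}_\beta(x) = x \cdot \sigma(\beta x)$ with $\beta = 1$, and sets the hidden size to $\frac{2}{3} 4d$ as mentioned in the architecture overview. The weight names `w_gate`, `w_up` and `w_down` are illustrative, and real implementations may round the hidden size to a hardware-friendly multiple.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish_beta(x) = x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block: (Swish(x W) * (x V)) projected back to the model dimension."""
    return (swish(x @ w_gate) * (x @ w_up)) @ w_down

d_model = 64
d_hidden = int(2 / 3 * 4 * d_model)  # 170 instead of 4 * d_model = 256

rng = np.random.default_rng(0)
w_gate = rng.normal(size=(d_model, d_hidden)) * 0.02
w_up = rng.normal(size=(d_model, d_hidden)) * 0.02
w_down = rng.normal(size=(d_hidden, d_model)) * 0.02

x = rng.normal(size=(3, d_model))  # 3 tokens
y = swiglu_ffn(x, w_gate, w_up, w_down)

print(y.shape)                                            # (3, 64)
print("SwiGLU FFN params:  ", 3 * d_model * d_hidden)     # three weight matrices
print("standard FFN params:", 2 * d_model * 4 * d_model)  # two weight matrices
```

The factor $\frac{2}{3}$ compensates for the third weight matrix: with a hidden size of $\frac{2}{3} 4d$, the gated block has roughly the same number of parameters as a standard two-matrix feed-forward block with hidden size $4d$.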