# GPT to Llama

## 1. GPT-2 to Llama 2

### 1.1 Differences between GPT-2 and Llama 2

#### 1.1.1 Architecture

GPT-2 and Llama 2 have very similar architectures. The differences are:
- Normalization layer: LayerNorm -> **RMSNorm**
  - Llama 2 replaces every LayerNorm in GPT-2 (both those inside the transformer blocks and the final one before the output layer) with an `RMSNorm` (**Root Mean Square Layer Normalization**) layer. LayerNorm normalizes its input using the mean and variance, whereas RMSNorm uses only the root mean square, which makes it cheaper to compute.
 
- FeedForward layer's activation function: GELU -> **SwiGLU**
  - To introduce non-linearity in the FeedForward module, Llama 2 replaces GPT-2's GELU activation with `SwiGLU`, a variant of the **Gated Linear Unit (GLU)** that uses the `SiLU` function (also known as `Swish`) as its gate. SwiGLU has been reported to outperform activation functions such as ReLU and GELU in Transformer architectures on a variety of tasks.
 
- Positional embedding: absolute positional embedding -> **RoPE**
  - Instead of the learned absolute positional embeddings in GPT-2, Llama 2 uses **rotary position embeddings** (`RoPE`), which rotate the query and key vectors by an angle that depends on the token position. This lets the attention scores capture absolute and relative positional information at the same time.

- Dropout: Llama 2 does not use Dropout layers

#### 1.1.2 Tokenizer

Llama 2 uses Google's [SentencePiece](https://github.com/google/sentencepiece) tokenizer rather than OpenAI's [Tiktoken](https://github.com/openai/tiktoken) (Llama 3 switches to Tiktoken). Both are **Byte Pair Encoding (BPE)** tokenizers, with Tiktoken generally offering better efficiency and flexibility.

### 1.2 Convert GPT-2 to Llama 2

#### 1.2.1 Replace LayerNorm with RMSNorm layer

Unlike LayerNorm, which normalizes using the input's mean and variance, RMSNorm uses only the root mean square, which reduces the amount of computation. See the paper [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467).

The RMSNorm operation is as follows, where $x$ is the input vector of dimension $n$ (the embedding dimension), $\gamma$ is a trainable parameter vector, and $\epsilon$ is a small constant that avoids division by zero:

$$ y_i = \frac{x_i}{\text{RMS}(x)} \gamma_i, \quad \text{where} \quad \text{RMS}(x) = \sqrt{\epsilon + \frac{1}{n} \sum_{j=1}^{n} x_j^2} $$

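```python
import torch
import torch.nn as nn


# Use this class in place of the LayerNorm class
class RMSNorm(nn.Module):
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.emb_dim = emb_dim
        # Trainable scale (gamma), initialized to ones and kept in float32
        self.weight = nn.Parameter(torch.ones(emb_dim, dtype=torch.float32))

    def forward(self, x):
        # Mean of the squared activations over the embedding dimension
        mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
        # rsqrt(mean_sq + eps) equals 1 / RMS(x)
        x_normed = x * torch.rsqrt(mean_sq + self.eps)
        # Scale by gamma and cast back to the input dtype
        return (x_normed * self.weight).to(dtype=x.dtype)
```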
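
As a quick sanity check of the formula above, the following sketch applies the same computation as a plain function and compares it with LayerNorm on a toy input. The helper name `rms_norm`, the tensor shapes, and the `+ 5.0` shift are illustrative assumptions, not part of the original notebook.

```python
import torch
import torch.nn.functional as F


def rms_norm(x, weight, eps=1e-5):
    # Hypothetical helper: y_i = x_i / sqrt(eps + mean(x^2)) * gamma_i
    mean_sq = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(mean_sq + eps) * weight


torch.manual_seed(123)
emb_dim = 8
# Shift the toy input so that its mean is clearly non-zero
x = torch.randn(2, 3, emb_dim) + 5.0
gamma = torch.ones(emb_dim)

y_rms = rms_norm(x, gamma)
y_ln = F.layer_norm(x, (emb_dim,))

# After RMSNorm, each vector has a root mean square of about 1
rms_after = y_rms.pow(2).mean(dim=-1).sqrt()

# LayerNorm also subtracts the mean, RMSNorm does not
mean_rms = y_rms.mean(dim=-1)
mean_ln = y_ln.mean(dim=-1)

print("RMS after RMSNorm:", rms_after)
print("Mean after RMSNorm:", mean_rms)
print("Mean after LayerNorm:", mean_ln)
```

The printed means show the practical difference: LayerNorm centers each vector around zero, while RMSNorm only rescales it, which is where the saved computation comes from.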


#### 1.2.2 Replace GELU with SwiGLU in FeedForward module

##### 1.2.2.1 Replace GELU with SwiGLU
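
SiLU (Swish) is defined as $\text{SiLU}(x) = x \cdot \sigma(x)$, where $\sigma$ is the sigmoid function. A minimal sketch, assuming a hand-written module named `SiLU` (the class name and the toy input are illustrative; PyTorch already provides the same function as `torch.nn.functional.silu`, used here only as a reference check):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SiLU(nn.Module):
    # Illustrative hand-written SiLU (Swish): x * sigmoid(x)
    def forward(self, x):
        return x * torch.sigmoid(x)


x = torch.linspace(-3.0, 3.0, steps=7)
silu_out = SiLU()(x)

# Reference implementation that ships with PyTorch
reference = F.silu(x)

print(silu_out)
print(torch.allclose(silu_out, reference))
```

SwiGLU then uses this function as the gate of a Gated Linear Unit: $\text{SwiGLU}(x) = \text{SiLU}(x W_1) \odot (x W_2)$.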

##### 1.2.2.2 Update the FeedForward module
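
One possible shape for a SwiGLU-based FeedForward module is sketched below. It assumes a config dictionary with the hypothetical keys `emb_dim`, `hidden_dim`, and `dtype`, and the layer names `fc1`, `fc2`, and `fc3` are illustrative. Compared with a GELU feed-forward block, it has two input projections (gate and value) instead of one, and the linear layers carry no bias, as in Llama 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        # cfg keys are assumptions made for this sketch
        emb_dim, hidden_dim, dtype = cfg["emb_dim"], cfg["hidden_dim"], cfg["dtype"]
        # Gate projection, passed through SiLU
        self.fc1 = nn.Linear(emb_dim, hidden_dim, bias=False, dtype=dtype)
        # Value projection, multiplied element-wise with the gate
        self.fc2 = nn.Linear(emb_dim, hidden_dim, bias=False, dtype=dtype)
        # Output projection back to the embedding dimension
        self.fc3 = nn.Linear(hidden_dim, emb_dim, bias=False, dtype=dtype)

    def forward(self, x):
        gate = F.silu(self.fc1(x))
        value = self.fc2(x)
        # SwiGLU: SiLU(x W1) * (x W2), followed by the output projection
        return self.fc3(gate * value)


# Toy sizes, chosen only to keep the example small
cfg = {"emb_dim": 16, "hidden_dim": 44, "dtype": torch.float32}

torch.manual_seed(123)
ffn = FeedForward(cfg)
x = torch.randn(2, 5, cfg["emb_dim"])
out = ffn(x)
num_params = sum(p.numel() for p in ffn.parameters())

print(out.shape)
print(num_params)
```

The output keeps the input shape, so the module can stand in for a GELU-based feed-forward block without changing the surrounding transformer block.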

#### 1.2.3 Implement RoPE for positional embedding

##### 1.2.3.1 Implement RoPE
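
RoPE rotates each pair of dimensions in the query and key vectors by an angle proportional to the token position. A minimal sketch under stated assumptions: the function names `precompute_rope_params` and `compute_rope` are hypothetical, the base of 10,000 and context length of 4,096 are used as defaults, and the "split halves" pairing (dimension `i` paired with `i + head_dim / 2`) is one common layout. Weights trained with an interleaved pairing would need to be permuted to match it.

```python
import torch


def precompute_rope_params(head_dim, theta_base=10_000, context_length=4096):
    # Hypothetical helper: returns cos/sin tables of shape (context_length, head_dim)
    assert head_dim % 2 == 0, "head_dim must be even"
    # One inverse frequency per pair of dimensions
    inv_freq = 1.0 / (theta_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(context_length).float()
    # Angle for every (position, frequency) combination
    angles = positions[:, None] * inv_freq[None, :]
    # Duplicate so both halves of the head dimension share the same angles
    angles = torch.cat([angles, angles], dim=1)
    return torch.cos(angles), torch.sin(angles)


def compute_rope(x, cos, sin):
    # x has shape (batch, num_heads, seq_len, head_dim)
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    x1 = x[..., : head_dim // 2]
    x2 = x[..., head_dim // 2:]
    # Rotating (x1, x2) by 90 degrees gives (-x2, x1)
    rotated = torch.cat((-x2, x1), dim=-1)
    cos = cos[:seq_len, :].unsqueeze(0).unsqueeze(0)
    sin = sin[:seq_len, :].unsqueeze(0).unsqueeze(0)
    return (x * cos + rotated * sin).to(dtype=x.dtype)


torch.manual_seed(123)
head_dim, seq_len = 16, 12
cos, sin = precompute_rope_params(head_dim, context_length=seq_len)

# Place the same query and key vector at every position
q_vec = torch.randn(head_dim)
k_vec = torch.randn(head_dim)
q = q_vec.expand(1, 1, seq_len, head_dim)
k = k_vec.expand(1, 1, seq_len, head_dim)

q_rot = compute_rope(q, cos, sin)
k_rot = compute_rope(k, cos, sin)

# scores[m, n] is the dot product of the query at position m and the key at position n
scores = q_rot[0, 0] @ k_rot[0, 0].T

# Same offset (n - m = 3) at different absolute positions
print(scores[2, 5].item(), scores[7, 10].item())
```

The two printed scores agree up to floating-point error because the rotated dot product depends only on the offset between the two positions, which is the relative-position property of RoPE. The rotation also leaves the length of each vector unchanged.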

##### 1.2.3.2 Add RoPE to MultiHeadAttention module


#### 1.2.4 Update the TransformerBlock module

#### 1.2.5 Update the model class

### 1.3 Load model and pretrained weights

#### 1.3.1 Initialize the model

#### 1.3.2 Load tokenizer

#### 1.3.3 Load pretrained weights

### 1.4 Use the pretrained Llama 2

## 2. Llama 2 to Llama 3