# Generative Pre-Trained Transformer

[GPT (2018)](https://en.wikipedia.org/wiki/GPT-1) applied the transformer architecture to autoregressive language modeling. It was followed by [GPT-2 (2019)](https://huggingface.co/transformers/v2.8.0/model_doc/gpt2.html), a larger model pre-trained on WebText, and [GPT-3 (2020)](https://arxiv.org/abs/2005.14165), a larger model again, pre-trained mostly on filtered Common Crawl data.

## The Origin: GPT-1

Generative Pre-trained Transformer 1 (GPT-1) was the first of OpenAI's large language models following Google's invention of the transformer architecture in 2017. In June 2018, OpenAI released a paper entitled "Improving Language Understanding by Generative Pre-Training", in which they introduced that initial model along with the general concept of a generative pre-trained transformer.

Up to that point, the best-performing neural NLP models primarily employed supervised learning from large amounts of manually labeled data. This reliance on supervised learning limited their use of datasets that were not well-annotated, and it made training extremely large models prohibitively expensive and time-consuming.

In contrast, GPT's "semi-supervised" approach involved two stages:

- an unsupervised generative "pre-training" stage in which a language modeling objective was used to set initial parameters, and
- a supervised discriminative "fine-tuning" stage in which these parameters were adapted to a target task.

The use of a transformer architecture, as opposed to previous techniques involving attention-augmented RNNs, provided GPT models with a more structured memory than could be achieved through recurrent mechanisms; this resulted in "robust transfer performance across diverse tasks".

Read [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)

### Reason for choosing BookCorpus

BookCorpus was chosen as a training dataset partly because the long passages of continuous text helped the model learn to handle long-range information. It contained over 7,000 unpublished fiction books from various genres. The rest of the datasets available at the time, while being larger, lacked this long-range structure (being "shuffled" at a sentence level).

The BookCorpus text was cleaned with the ftfy library to standardize punctuation and whitespace, and then tokenized with spaCy.

### Architecture

The GPT-1 architecture was a twelve-layer decoder-only transformer with twelve masked self-attention heads of 64 dimensions each (768 in total). Rather than simple [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent), the [Adam optimization algorithm](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam) was used. GPT-1 has about 110 million parameters (the figure is often quoted as 117 million).
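
As a rough sanity check on those numbers, the sketch below counts the weights of a decoder-only transformer with twelve layers and twelve 64-dimensional heads. The vocabulary size (40,478), context length (512), the 4x feedforward width, and the tied output layer are assumptions that do not appear in the text above, and `count_parameters` is a hypothetical helper. Under those assumptions the total comes to roughly 117 million, in the same range as the parameter count quoted above.

```python
# Rough parameter count for a GPT-1-sized decoder-only transformer.
# ASSUMPTIONS (not stated in the text): vocab_size, context_len, ffn_mult,
# and an output layer that reuses (ties) the token embedding matrix.

def count_parameters(n_layers=12, n_heads=12, head_dim=64,
                     vocab_size=40_478, context_len=512, ffn_mult=4):
    d_model = n_heads * head_dim              # 12 heads * 64 dims = 768
    d_ff = ffn_mult * d_model                 # feedforward hidden width

    token_emb = vocab_size * d_model          # one vector per vocabulary entry
    pos_emb = context_len * d_model           # one vector per position

    attn = d_model * 3 * d_model + 3 * d_model    # Q, K, V projections + biases
    attn += d_model * d_model + d_model           # attention output projection
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    norms = 2 * (2 * d_model)                     # two LayerNorms (gain + bias)

    per_layer = attn + ffn + norms
    total = token_emb + pos_emb + n_layers * per_layer
    return {"d_model": d_model, "per_layer": per_layer, "total": total}


counts = count_parameters()
print(counts)
```

Most of the weights sit in the twelve transformer blocks; the embedding tables account for a little over a quarter of the total under these assumptions.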




<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/51/Full_GPT_architecture.svg/2560px-Full_GPT_architecture.svg.png" alt="GPT Architecture" width="600" height="600">

The breakdown below walks through each component of the GPT architecture in sequence and explains how it contributes to the overall function of the model.

---

### 1. **Input Embedding in GPT:**
   - **Purpose:** Converts words or tokens into numerical vectors (dense embeddings) so the model can process them. These embeddings capture semantic relationships between tokens.
   - **Explanation:** Words, subwords, or characters are represented as vectors in a continuous space (embeddings). Similar words have similar embeddings, while different words are farther apart in the vector space. Embeddings are learned during training and are passed to the transformer layers.

#### **Steps in Input Embedding Process:**
1. **Tokenization:** Text is divided into tokens. GPT uses techniques like **Byte-Pair Encoding (BPE)**, which splits text into manageable tokens (e.g., "running" -> "run" and "ing").
2. **Vocabulary Mapping:** Each token is mapped to an index in the model's vocabulary. For instance, "run" might map to index 563.
3. **Embedding Lookup Table:** Each token index is converted into a high-dimensional vector (embedding) using an embedding matrix.
4. **Dense Representation (Embedding):** The output is a sequence of vectors representing tokens, passed into the transformer layers for further processing.

---

### Example:
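For the sentence `"The cat sat on the mat."`, tokens are mapped to indices and converted to embeddings as shown here. The indices and vector values are illustrative, and the text is assumed to be lowercased so that both occurrences of "the" share one index: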

```plaintext
Tokenized: [The] [cat] [sat] [on] [the] [mat]
Token Indices: [3] [102] [205] [87] [3] [490]
Embeddings: [[0.5, -1.3, 0.2, ...], [-0.8, 0.3, 1.1, ...], ...]
```

These dense embeddings are then fed into the transformer layers for processing.
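
The lookup steps above can be sketched in a few lines of plain Python. This is a toy illustration, not GPT's tokenizer: the whitespace `tokenize` function stands in for BPE, the `vocab` dictionary reuses the illustrative indices from the example, and the 4-dimensional random `embedding_matrix` is a hypothetical stand-in for a learned one.

```python
import random

# Hypothetical toy vocabulary reusing the illustrative indices from the example.
vocab = {"the": 3, "on": 87, "cat": 102, "sat": 205, "mat": 490}
VOCAB_SIZE = 500   # assumed size, just large enough for the indices above
D_MODEL = 4        # tiny embedding width for readability

random.seed(0)
# Embedding matrix: one row of D_MODEL floats per vocabulary index.
# In a trained model these values are learned, not random.
embedding_matrix = [
    [random.gauss(0.0, 0.02) for _ in range(D_MODEL)]
    for _ in range(VOCAB_SIZE)
]


def tokenize(text):
    """Toy stand-in for BPE: lowercase, drop the period, split on spaces."""
    return text.lower().replace(".", "").split()


def embed(text):
    tokens = tokenize(text)                       # 1. tokenization
    ids = [vocab[t] for t in tokens]              # 2. vocabulary mapping
    vectors = [embedding_matrix[i] for i in ids]  # 3. embedding lookup
    return tokens, ids, vectors                   # 4. dense representation


tokens, ids, vectors = embed("The cat sat on the mat.")
print(tokens)
print(ids)
print(vectors[0])
```

Because the lookup is a plain row index, both occurrences of "the" receive the same vector; it is the positional information added next that lets the model tell them apart.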

---

### 2. **Positional Encoding:**
   - **Explanation:** Since transformers don't inherently understand the order of tokens, positional encoding is added to the embeddings to provide information about token positions in the sequence.

#### **How Positional Encoding Works:**
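1. **Combination of Position and Embedding:** Each position gets its own vector, which is added to the token embedding. The original Transformer used fixed sinusoidal functions for this, giving every position a unique pattern. GPT-1 instead learned its position embeddings during training, but they are combined with the token embeddings in the same way.
2. **Mathematical Formula (sinusoidal variant):**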
   - For even dimensions:
     $$ PE(p, 2i) = \sin(p / 10000^{2i/d_{\text{model}}}) $$
   - For odd dimensions:
     $$ PE(p, 2i+1) = \cos(p / 10000^{2i/d_{\text{model}}}) $$

Here $p$ is the token position, $i$ indexes the sine/cosine dimension pairs, and $d_{\text{model}}$ is the dimension of the embedding. These positional encodings are added element-wise to the token embeddings.
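
The two formulas translate directly into code. The sketch below is a minimal pure-Python version; the function names `sinusoidal_encoding` and `add_positions` are hypothetical, and the loop variable `k` plays the role of $2i$ in the formulas.

```python
import math


def sinusoidal_encoding(position, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):          # k = 2i, the even dimension index
        angle = position / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)             # even dimension: sine
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)     # odd dimension: cosine
    return pe


def add_positions(embeddings):
    """Add the positional encoding element-wise to each token embedding."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, sinusoidal_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]


print(sinusoidal_encoding(0, 4))
print(sinusoidal_encoding(1, 4))
```

At position 0 every sine term is 0 and every cosine term is 1, so two identical token vectors at different positions end up with different sums.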

---

### Example:
For the sentence `"The cat sat on the mat."`, positional encodings are added to the embeddings:

```plaintext
Embedding + Positional Encoding for "The": [0.51, -1.28, 0.23, ...]
Embedding + Positional Encoding for "cat": [-0.78, 0.34, 1.16, ...]
```

---

### 3. **Dropout Layer:**
   - **Explanation:** Dropout helps prevent overfitting by randomly "dropping" (setting to zero) a percentage of neurons during training. This forces the network to learn more robust patterns.

#### **Why Dropout Helps:**
- **Prevents Co-Adaptation:** Dropout prevents the model from becoming too reliant on any specific neurons, forcing different neurons to collaborate more effectively.
- **Improves Generalization:** Helps the network perform better on unseen data.
- **Simulates an Ensemble:** By randomly dropping neurons, dropout behaves like training multiple models that share parameters, improving robustness.

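During training, dropout randomly sets activations to zero with a fixed probability $p$, e.g., 20%. During inference, dropout is turned off. To keep the expected activation the same in both modes, either the surviving activations are scaled up by $1/(1-p)$ during training ("inverted" dropout, the usual choice in modern frameworks) or the weights are scaled down by $1-p$ at inference.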
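
A minimal sketch of this behavior, assuming the common "inverted" formulation in which the kept values are rescaled during training so that nothing needs to change at inference. The `dropout` function and its parameters are illustrative, not taken from any particular library.

```python
import random


def dropout(activations, p=0.2, training=True, rng=random):
    """Inverted dropout: zero each value with probability p while training,
    and scale the survivors by 1 / (1 - p) to preserve the expected value."""
    if not training or p == 0.0:
        return list(activations)            # inference: pass through unchanged
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


random.seed(0)
train_out = dropout([1.0] * 10, p=0.2, training=True)
eval_out = dropout([1.0] * 10, p=0.2, training=False)
print(train_out)   # a mix of 0.0 and 1.25
print(eval_out)    # all 1.0
```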

---

### 4. **Transformer Blocks:**
   - **Explanation:** GPT's architecture consists of stacked transformer blocks, each containing self-attention mechanisms, feedforward networks, and normalization layers.

#### **Components of a Transformer Block:**
1. **Layer Normalization (LayerNorm):** Stabilizes training by normalizing the input to each layer, ensuring consistent activation scaling.
2. **Masked Multi-Head Self-Attention:** Allows the model to focus on different parts of the input sequence simultaneously. Each head attends to a different subspace of the input, capturing multiple relationships between tokens, and a causal mask keeps every token from attending to later positions.
   
   - **Query, Key, and Value Matrices:** Self-attention works by projecting the input into query, key, and value matrices.
   - **Attention Scores:** Calculated by multiplying the query matrix with the transposed key matrix, scaling by $\sqrt{d_k}$ (the key dimension), masking out future positions, and normalizing with softmax.
   - **Weighted Sum of Values:** The resulting attention weights are used to compute a weighted sum of the value vectors, which is the output of the attention mechanism.
   
3. **Feedforward Neural Network:** A two-layer network that further transforms the output of the self-attention mechanism. It applies a **GELU activation function** to introduce non-linearity.
4. **Residual Connection (+):** Adds the input of a layer back to its output, which mitigates vanishing gradients and stabilizes training.
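
The attention step is the least intuitive part of the block, so here is a minimal single-head sketch in plain Python. It assumes the query, key, and value vectors have already been computed (the learned projection matrices are omitted), and it includes the $\sqrt{d_k}$ scaling and the causal mask used by decoder-only models. The name `causal_attention` is hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of equal length, one vector per token.
    """
    d_k = len(keys[0])
    outputs = []
    for t, q in enumerate(queries):
        # Causal mask: token t only scores positions 0..t, never the future.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors seen so far.
        out = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


x = [[1.0, 0.0], [0.0, 1.0]]
attended = causal_attention(x, x, x)
print(attended)
```

The first token can only attend to itself, so its output equals its own value vector; the second token mixes both values, weighted toward the key it matches best. A full multi-head layer runs several such heads on different learned projections and concatenates the results.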

---

### 5. **LayerNorm and Residual Connections:**
   - **Explanation:** LayerNorm and residual connections occur twice in each transformer block, after both the self-attention and feedforward layers.
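
A minimal sketch of that pattern, assuming the arrangement described above in which normalization is applied after the residual addition. The helper names `layer_norm` and `residual_then_norm` are hypothetical, and the learned gain and bias default to 1 and 0 here.

```python
import math


def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize one vector to zero mean and unit variance, then rescale."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gain = gain if gain is not None else [1.0] * n   # learned in a real model
    bias = bias if bias is not None else [0.0] * n   # learned in a real model
    return [g * v + b for g, v, b in zip(gain, normed, bias)]


def residual_then_norm(x, sublayer):
    """Compute LayerNorm(x + sublayer(x)) for one token vector."""
    added = [a + b for a, b in zip(x, sublayer(x))]
    return layer_norm(added)


# Stand-in sublayer; in a transformer block this is attention or the FFN.
out = residual_then_norm([1.0, 2.0, 3.0, 4.0], lambda v: [0.5 * a for a in v])
print(out)
```

In a full block this wrapper would be applied twice: once around self-attention and once around the feedforward network.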

---

### 6. **Feedforward Network (GELU Activation):**
   - **Explanation:** The feedforward network transforms the output of the attention mechanism. The **GELU activation** introduces non-linearity, helping the model learn complex patterns.
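
For reference, GELU is defined as $x \cdot \Phi(x)$, where $\Phi$ is the standard normal CDF. The sketch below shows the exact form, the widely used tanh approximation, and a toy two-layer feedforward pass. The weights passed to `feed_forward` are made-up values for illustration only.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Common tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def feed_forward(x, w1, b1, w2, b2):
    """Two-layer position-wise network: linear -> GELU -> linear.

    w1 and w2 are lists of rows, one row of input weights per output unit.
    """
    hidden = [gelu(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


# Toy weights (illustrative only): 1 input -> 2 hidden units -> 1 output.
y = feed_forward([1.0], w1=[[1.0], [-1.0]], b1=[0.0, 0.0],
                 w2=[[1.0, 1.0]], b2=[0.0])
print(gelu(1.0), gelu_tanh(1.0), y)
```

Unlike ReLU, GELU lets small negative inputs through with a small negative output rather than cutting them to exactly zero.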

---

### 7. **Final Output (Linear Layer and Softmax):**
   - **Explanation:** After passing through all transformer blocks, the final output is processed by a linear layer and a softmax function to generate probabilities over the vocabulary, enabling next-token prediction during text generation.
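
A minimal sketch of this last step, using a three-token toy vocabulary. The `output_weights` matrix and the hidden vector are made-up values, and greedy selection (taking the most probable token) is only one of several decoding strategies.

```python
import math


def softmax(logits):
    """Numerically stable softmax: turns logits into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_distribution(hidden, output_weights):
    """Linear layer (one weight row per vocabulary entry) followed by softmax."""
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in output_weights]
    return softmax(logits)


# Illustrative values: final hidden state of the last token, 3-token vocabulary.
hidden = [1.0, 0.0]
output_weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0]]

probs = next_token_distribution(hidden, output_weights)
greedy_id = max(range(len(probs)), key=probs.__getitem__)
print(probs, greedy_id)
```

During generation the chosen token is appended to the input and the whole forward pass is repeated to predict the token after it.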

---

This summary covers the key components of the GPT architecture and how each one contributes to the model's ability to process and generate text.

| **Type** | **Model Name** | **#Parameters** | **Release** | **Base Models** | **Open Source** | **#Tokens** | **Training Dataset** | **Context Window Size** |
|----------|----------------|-----------------|-------------|-----------------|-----------------|-------------|----------------------|-------------------------|
| **GPT Family** | GPT-1 | 110M | 2018 | - | ✓ | 1.3B | BookCorpus | 512 tokens |
| | GPT-2 | 1.5B | 2019 | - | ✓ | 10B | WebText (outbound Reddit links) | 1,024 tokens |
| | GPT-3 | 6.7B, 13B, 175B | 2020 | - | × | 300B | Common Crawl (filtered), WebText2, Books1, Books2, Wikipedia | 2,048 tokens |
| | GPT-3.5 | 1.3B, 6B, 20B | 2022 | - | × | 2.5T | WebText, Common Crawl, Books, Wikipedia | - |
| | Codex | 12B | 2021 | GPT | × | - | Public GitHub software repositories | - |
| | WebGPT | 760M, 13B, 175B | 2021 | GPT-3 | × | - | ELI5 | - |
| | [GPT-4](https://cdn.openai.com/papers/gpt-4.pdf) | ~1.76T (unconfirmed) | 2023 | - | × | ~13T (unconfirmed) | Diverse internet text | 8,192 tokens (128,000 for GPT-4 Turbo) |
| | [GPT-4o](https://platform.openai.com/docs/models/gpt-4o) | 220B per expert (unconfirmed) | 2024 | GPT-4 | × | ~13T (unconfirmed) | Web, technical documents, and multimodal data | 128,000 tokens |
| | [GPT-4o Mini](https://platform.openai.com/docs/models/gpt-4o-mini) | ~8B (unconfirmed) | 2024 | GPT-4o | × | ~13T (unconfirmed) | Not disclosed | 128,000 tokens |

**Key updates**:
1. **GPT-3.5** was introduced as an intermediate model family between GPT-3 and GPT-4, with reported variants ranging from 1.3B to 20B parameters. It improved on GPT-3's language generation and was used in the free version of ChatGPT.
2. **GPT-4o**, launched in 2024, is a multimodal successor to GPT-4. OpenAI has not published its architecture or size. The "Mixture of Experts" figures often quoted (about 220B parameters per expert, about 1.76 trillion parameters in total) come from unofficial reports about GPT-4 and should be treated as unconfirmed estimates.
3. **GPT-4o Mini** is a lighter and more affordable version of GPT-4o, specifically aimed at smaller applications and businesses needing cost-effective AI solutions.

