# GPT-3 (Reduced) from Scratch

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adiel2012/deep-learning-abc/blob/main/gpt3.ipynb)

This notebook implements a reduced version of **GPT-3's architecture** from scratch in PyTorch.

GPT-3 closely follows GPT-2 (a decoder-only Transformer), with a few changes that matter at scale:
1. **Alternating Dense and Sparse Attention**: GPT-3 alternates dense and locally banded sparse attention layers. Not implemented here for simplicity (standard dense attention is used in every layer).
2. **Larger Scale**: Up to 175B parameters (we implement a reduced version).
3. **Few-Shot Learning**: The model is meant to pick up a task from examples placed in the prompt, without gradient updates.

In [None]:
!pip install torch matplotlib

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'Using device: {device}')

## 1. Simplified Sparse Attention Pattern (Mock)

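GPT-3 uses **sparse attention** in alternating layers to reduce computation. This implementation sticks to standard dense causal attention in every layer. In a simple implementation like this one, the place where a sparse layer would differ is the mask applied to the attention scores before the softmax.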
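
As a rough illustration of where a sparse layer would differ, the sketch below builds a locally banded causal mask in plain Python: each position may attend only to itself and the previous `window - 1` positions. The helper name `banded_causal_mask` and its `window` parameter are hypothetical, and the exact sparsity pattern GPT-3 uses is not reproduced here.

```python
def banded_causal_mask(seq_len, window):
    """Return a seq_len x seq_len 0/1 mask (hypothetical helper).

    Entry [i][j] is 1 when query position i may attend to key position j:
    j must not lie in the future (j <= i) and must fall within the last
    `window` positions (i - j < window).
    """
    return [
        [1 if 0 <= i - j < window else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


mask = banded_causal_mask(seq_len=5, window=2)
for row in mask:
    print(row)
```

Converted with `torch.tensor(mask)` and reshaped to `(1, 1, seq_len, seq_len)`, a mask like this could stand in for the lower-triangular `bias` buffer in the attention class. With `window >= seq_len` it reduces to the ordinary dense causal mask.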

In [None]:
# Standard Causal Attention (Same as GPT-2)
class GPT3Attention(nn.Module):
    def __init__(self, d_model, n_heads, max_len):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        self.c_attn = nn.Linear(d_model, 3 * d_model)
        self.c_proj = nn.Linear(d_model, d_model)
        
        # Causal mask
        self.register_buffer("bias", torch.tril(torch.ones(max_len, max_len))
                                     .view(1, 1, max_len, max_len))
        
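    def forward(self, x):
        # Standard attention implementation; B = batch, T = sequence length, C = d_model
        B, T, C = x.size()
        # A single projection produces queries, keys and values together
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.d_model, dim=2)

        # Split channels into heads: (B, T, C) -> (B, n_heads, T, d_k)
        q = q.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores, shape (B, n_heads, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(self.d_k))
        # Block attention to future positions before the softmax
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        # Merge heads back: (B, n_heads, T, d_k) -> (B, T, C)
        y = y.transpose(1, 2).contiguous().view(B, T, C)

        return self.c_proj(y)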
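
To see what the `-inf` mask followed by the softmax does to the scores, here is a small worked example in plain Python on a single 3x3 score matrix. The helper `causal_softmax` and the score values are made up for illustration; the class above does the same thing on batched tensors.

```python
import math


def causal_softmax(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, mirroring the masked_fill step
        # in GPT3Attention.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        # Subtract the row maximum for numerical stability; exp(-inf) is 0.0.
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]
for row in causal_softmax(scores):
    print([round(w, 3) for w in row])
```

The first row puts all its weight on position 0, because the large scores at positions 1 and 2 are masked out. Every row still sums to 1, since the diagonal entry is never masked.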

## 2. GPT-3 Architecture

The main difference from GPT-2 is scale. GPT-3 Small (125M parameters) has the same depth and width as the smallest GPT-2 model: 12 layers, d_model = 768, 12 heads.
GPT-3 175B uses:
- 96 layers
- d_model = 12288
- 96 heads
- Context window = 2048 tokens
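
A back-of-the-envelope check shows how these hyperparameters lead to the headline sizes. The sketch below is an approximation under stated assumptions: it ignores biases and LayerNorm weights, assumes `d_ff = 4 * d_model`, assumes the output head shares weights with the token embedding, and uses the 50257-token vocabulary. The function name `approx_params` is hypothetical.

```python
def approx_params(n_layers, d_model, vocab_size=50257, n_ctx=2048):
    """Rough parameter count, ignoring biases and LayerNorm weights."""
    attn = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * d_model * (4 * d_model)   # two linear layers with d_ff = 4 * d_model
    blocks = n_layers * (attn + mlp)    # equals 12 * n_layers * d_model**2
    embeddings = (vocab_size + n_ctx) * d_model  # token + learned position tables
    return blocks + embeddings


for name, n_layers, d_model in [("GPT-3 Small", 12, 768), ("GPT-3 175B", 96, 12288)]:
    total = approx_params(n_layers, d_model)
    print(f"{name}: {total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands close to the nominal 125M and 175B figures, and almost all of the 175B total comes from the `12 * n_layers * d_model**2` term for the Transformer blocks.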

In [None]:
class GPT3(nn.Module):
    def __init__(self, vocab_size, d_model, n_heads, n_layers, max_len):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.drop = nn.Dropout(0.1)
        
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads, 4 * d_model, max_len) for _ in range(n_layers)
        ])
        
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.max_len = max_len

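    def forward(self, idx):
        # idx: (batch, seq_len) tensor of token ids
        b, t = idx.size()
        assert t <= self.max_len, f"Sequence length {t} exceeds max_len {self.max_len}"
        t_pos = torch.arange(0, t, dtype=torch.long, device=idx.device)

        # Token embeddings (b, t, d_model) plus learned position embeddings (t, d_model), broadcast over the batch
        x = self.token_emb(idx) + self.pos_emb(t_pos)
        x = self.drop(x)

        for block in self.blocks:
            x = block(x)

        # Final LayerNorm, then project to vocabulary logits: (b, t, vocab_size)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits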
    
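# Pre-LayerNorm Transformer block, same structure as in the GPT-2 notebook.
# Defining it after GPT3 works because the name is only looked up when GPT3 is instantiated.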
class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, max_len):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = GPT3Attention(d_model, n_heads, max_len)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
            nn.Dropout(0.1)
        )
        
    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

# Instantiating "GPT-3 Small" (same depth and width as the smallest GPT-2)
model = GPT3(
    vocab_size=50257,  # GPT-2/3 BPE vocabulary size
    d_model=768,
    n_heads=12,
    n_layers=12,
    max_len=1024       # GPT-3 uses 2048; reduced for the demo
).to(device)

# The output head is not tied to token_emb here, so this prints about 163M rather than ~125M
print(f"GPT-3 Small initialized: {sum(p.numel() for p in model.parameters())/1e6:.1f}M params")

## 3. Few-Shot Prompting Concept

Much of GPT-3's power comes from **in-context learning**. Instead of fine-tuning the weights for each task (the usual approach with BERT), we place a few worked examples in the prompt and let the model continue the pattern.

In [None]:
prompt = """
Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
plush giraffe => girafe peluche
cheese =>
"""

print("Few-Shot Prompt Structure:")
print(prompt)
print("(A trained GPT-3 would be expected to complete the pattern with 'fromage'; the untrained model above will not.)")
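
The prompt layout above can also be assembled programmatically. The following is a minimal sketch in plain Python; the helper name `build_few_shot_prompt` and the `sep` parameter are hypothetical conveniences, not part of any GPT-3 API.

```python
def build_few_shot_prompt(instruction, examples, query, sep=" => "):
    """Format an instruction, solved examples, and one open query as a prompt."""
    lines = [instruction]
    for source, target in examples:
        lines.append(f"{source}{sep}{target}")
    # The last line stops right after the separator so the model fills in the answer.
    lines.append(f"{query}{sep}".rstrip())
    return "\n".join(lines)


examples = [
    ("sea otter", "loutre de mer"),
    ("peppermint", "menthe poivrée"),
    ("plush giraffe", "girafe peluche"),
]
few_shot = build_few_shot_prompt("Translate English to French:", examples, "cheese")
print(few_shot)
```

Passing an empty `examples` list gives a zero-shot prompt and a single pair gives a one-shot prompt, so the same helper covers all three settings.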