# 🤖 Large Language Models (LLMs) - Complete Guide

## Understanding the AI Revolution: From GPT to Claude

**What you'll learn:**
- What LLMs are and how they work
- The architecture behind models like GPT, Claude, and LLaMA
- Transformers and attention mechanisms
- Training stages: pre-training, fine-tuning, and RLHF
- Hands-on: Build a mini language model
- Using LLM APIs (OpenAI, Anthropic, etc.)

**Prerequisites:**
- Basic Python
- Understanding of neural networks (helpful but not required)

**Time:** 90-120 minutes

## 📚 Table of Contents

1. [Introduction to LLMs](#intro)
2. [How LLMs Work](#how)
3. [The Transformer Architecture](#transformer)
4. [Attention Mechanism](#attention)
5. [Tokenization](#tokenization)
6. [Training Process](#training)
9. [Building a Mini LLM](#miniLLM)
10. [Using LLM APIs](#apis)

<a id='intro'></a>
## 1. 🎯 Introduction to LLMs

### What is a Large Language Model?

An **LLM** is a neural network trained on massive amounts of text to:
- Understand language patterns
- Generate human-like text
- Complete tasks through natural language

### Popular LLMs

| Model | Company | Size | Best For |
|-------|---------|------|----------|
| **GPT-4** | OpenAI | Undisclosed | General purpose, reasoning |
| **Claude 3** | Anthropic | Undisclosed | Safety, long context |
| **Gemini** | Google | Undisclosed | Multimodal tasks |
| **LLaMA 3** | Meta | 8B-70B params | Open source, fine-tuning |
| **Mistral** | Mistral AI | 7B-8x7B params | Efficient, open |

### Evolution Timeline

```
2017: Transformer Architecture ("Attention is All You Need")
   ↓
2018: GPT-1 (117M params) - First large-scale pretraining
   ↓
2019: GPT-2 (1.5B params) - "Too dangerous to release"
   ↓
2020: GPT-3 (175B params) - Emergent capabilities
   ↓
2022: ChatGPT - Conversational AI goes mainstream
   ↓
2023: GPT-4, Claude 2, Gemini - Multimodal, longer context
   ↓
2024: Claude 3 and an open source explosion (LLaMA 3, Mistral, Phi-3)
```

### Key Capabilities

✅ **Text Generation**: Write essays, code, poetry

✅ **Question Answering**: General knowledge queries

✅ **Summarization**: Condense long documents

✅ **Translation**: Between languages

✅ **Code Generation**: Write and debug code

✅ **Reasoning**: Solve problems step-by-step

✅ **Conversation**: Natural dialogue

In [None]:
# Setup
!pip install -q torch transformers tiktoken matplotlib numpy

In [None]:
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries loaded successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

<a id='how'></a>
## 2. 🧠 How LLMs Work

### The Big Picture

```
Input Text → Tokenization → Embeddings → Transformer Layers → Output Probabilities

"Hello world"  →  [15496, 995]  →  [vectors]  →  [processing]  →  Next token: "!"
```

### Core Components

1. **Tokenization**: Break text into tokens
   ```
   "Hello world" → ["Hello", " world"]
   ```

2. **Embeddings**: Convert tokens to vectors
   ```
   "Hello" → [0.23, -0.45, 0.67, ...] (768 numbers)
   ```

3. **Transformer Layers**: Process sequences with attention
   - Self-attention: Which words relate to each other?
   - Feed-forward: Transform representations

4. **Prediction**: Probability distribution over vocabulary
   ```
   Next token probabilities:
   "!" → 0.45
   "." → 0.30
   "," → 0.15
   ```
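
The prediction step can be sketched in a few lines. The three-token vocabulary and the logit values below are made up for illustration; a real model produces one logit per vocabulary entry (50,257 for GPT-2). The sketch shows how softmax turns logits into probabilities and how a `temperature` setting sharpens or flattens the distribution.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three candidate next tokens
vocab = ["!", ".", ","]
logits = [2.0, 1.6, 0.9]

for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature=t)
    row = ", ".join(f"{tok!r}: {p:.2f}" for tok, p in zip(vocab, probs))
    print(f"temperature={t}: {row}")
```

Lower temperatures push probability toward the top-scoring token; higher temperatures spread it out, which makes sampled text more varied.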

### Autoregressive Generation

LLMs generate text **one token at a time**, using previous tokens as context:

```
Prompt: "The capital of France is"

Step 1: "The capital of France is" → predict "Paris" (90% confidence)
Step 2: "The capital of France is Paris" → predict "," (70% confidence)  
Step 3: "The capital of France is Paris," → predict "which" (60% confidence)
...
```
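
The loop above can be sketched with a toy stand-in for the model. The lookup table `NEXT_TOKEN_PROBS` is invented for this example and only looks at the last token; a real LLM conditions on the whole context and computes the probabilities with a neural network. The decoding here is greedy (always take the most likely token), which is the simplest strategy rather than the sampling used elsewhere in this notebook.

```python
# Toy "model": maps the last token to a distribution over next tokens.
# All entries are hypothetical and exist only to drive the loop.
NEXT_TOKEN_PROBS = {
    "is": {"Paris": 0.9, "Lyon": 0.1},
    "Paris": {",": 0.7, ".": 0.3},
    ",": {"which": 0.6, "and": 0.4},
    "which": {"<eos>": 1.0},
}

def generate_greedy(prompt_tokens, max_new_tokens=10):
    """Append one token at a time, feeding the growing sequence back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:                      # the toy table has no entry: stop
            break
        next_token = max(dist, key=dist.get)  # greedy: pick the most likely token
        if next_token == "<eos>":             # end-of-sequence marker
            break
        tokens.append(next_token)
    return tokens

prompt = ["The", "capital", "of", "France", "is"]
print(" ".join(generate_greedy(prompt)))
```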

In [None]:
# Demo: Load a small GPT-2 model
print("Loading GPT-2 (small) model...")
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
print("✅ Model loaded!\n")

# Model info
num_params = sum(p.numel() for p in model.parameters())
print(f"Model: GPT-2 (small)")
print(f"Parameters: {num_params:,}")
print(f"Layers: {model.config.n_layer}")
print(f"Hidden size: {model.config.n_embd}")
print(f"Attention heads: {model.config.n_head}")
print(f"Vocabulary size: {model.config.vocab_size}")

In [None]:
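# Demo: Simple text generation
def generate_text(prompt, max_length=50, temperature=0.7):
    """Generate text using GPT-2 (max_length counts the prompt tokens too)"""
    # Calling the tokenizer returns input_ids plus the matching attention_mask
    inputs = tokenizer(prompt, return_tensors='pt')
    
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        temperature=temperature,             # <1 sharpens, >1 flattens the distribution
        do_sample=True,                      # sample instead of greedy decoding
        top_p=0.9,                           # nucleus sampling: keep the top 90% probability mass
        pad_token_id=tokenizer.eos_token_id  # GPT-2 has no dedicated pad token
    )
    
    return tokenizer.decode(outputs[0], skip_special_tokens=True)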

# Test it
prompts = [
    "Artificial intelligence is",
    "The future of technology",
    "Machine learning can help"
]

print("🤖 GPT-2 Text Generation Demo\n")
print("="*70 + "\n")

for prompt in prompts:
    generated = generate_text(prompt, max_length=40)
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated}\n")
    print("-"*70 + "\n")

<a id='transformer'></a>
## 3. 🏗️ The Transformer Architecture

### Architecture Overview

```
┌────────────────────────────────────┐
│         Input Text                 │
└────────────┬───────────────────────┘
             │
             ▼
┌────────────────────────────────────┐
│      Tokenization                  │
│  "Hello" → [15496]                 │
└────────────┬───────────────────────┘
             │
             ▼
┌────────────────────────────────────┐
│   Token Embeddings                 │
│   [15496] → [0.2, -0.5, ...]       │
└────────────┬───────────────────────┘
             │
             ▼
┌────────────────────────────────────┐
│   Positional Encoding              │
│   Add position information         │
└────────────┬───────────────────────┘
             │
             ▼
┌────────────────────────────────────┐
│   Transformer Block 1              │
│   ┌──────────────────────┐        │
│   │  Multi-Head Attention │        │
│   └──────────┬───────────┘        │
│              │                     │
│   ┌──────────▼───────────┐        │
│   │  Feed Forward Network │        │
│   └──────────────────────┘        │
└────────────┬───────────────────────┘
             │
             ▼
     [Repeat 12-96 times]
             │
             ▼
┌────────────────────────────────────┐
│   Output Layer                     │
│   Probability over vocabulary      │
└────────────┬───────────────────────┘
             │
             ▼
┌────────────────────────────────────┐
│   Next Token Prediction            │
│   "world" (85% confidence)         │
└────────────────────────────────────┘
```

### Key Innovations

1. **Self-Attention**: Every token can "look at" every other token
2. **Parallel Processing**: Unlike RNNs, can process all tokens simultaneously
3. **Positional Encoding**: Knows word order without recurrence
4. **Layer Normalization**: Stable training
5. **Residual Connections**: Gradient flow
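
Positional encoding is the least intuitive item in this list, so here is a minimal sketch of the fixed sinusoidal scheme from the original Transformer paper. The function name and the sizes are chosen for illustration, and `embed_dim` is assumed to be even. GPT-2 and the MiniLLM later in this notebook use learned position embeddings instead, but the idea is the same: each position gets its own vector, which is added to the token embedding.

```python
import math
import torch

def sinusoidal_positions(seq_len, embed_dim):
    """Return a (seq_len, embed_dim) matrix of fixed position vectors."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    # One frequency per pair of dimensions, decreasing geometrically
    freqs = torch.exp(
        torch.arange(0, embed_dim, 2, dtype=torch.float32)
        * (-math.log(10000.0) / embed_dim)
    )                                                                      # (D/2,)
    pe = torch.zeros(seq_len, embed_dim)
    pe[:, 0::2] = torch.sin(positions * freqs)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * freqs)   # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=10, embed_dim=16)
token_embeddings = torch.randn(10, 16)   # stand-in for real token embeddings
x = token_embeddings + pe                # position information is simply added
print(pe.shape, x.shape)
```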

### Model Sizes

| Model | Layers | Hidden Size | Attention Heads | Parameters |
|-------|--------|-------------|-----------------|------------|
| GPT-2 Small | 12 | 768 | 12 | 124M |
| GPT-2 Medium | 24 | 1024 | 16 | 355M |
| GPT-2 Large | 36 | 1280 | 20 | 774M |
| GPT-3 | 96 | 12288 | 96 | 175B |
| GPT-4 | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
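
Parameter counts can be approximated from the layer count and hidden size. The rule of thumb assumed here is about 12·d² weights per transformer block (4·d² for the attention projections plus 8·d² for a feed-forward network with 4× expansion), plus the token and position embedding tables. Biases and layer-norm weights are ignored, so the result is an estimate. The vocabulary and context sizes are the published GPT-2 and GPT-3 values; the function name is made up for this sketch.

```python
def estimate_params(n_layer, d_model, vocab_size, n_ctx):
    """Rough parameter count for a GPT-style decoder (ignores biases and LayerNorm)."""
    per_block = 12 * d_model ** 2                       # attention (4d^2) + feed-forward (8d^2)
    embeddings = vocab_size * d_model + n_ctx * d_model  # token table + position table
    return n_layer * per_block + embeddings

gpt2_small = estimate_params(n_layer=12, d_model=768, vocab_size=50257, n_ctx=1024)
gpt3 = estimate_params(n_layer=96, d_model=12288, vocab_size=50257, n_ctx=2048)

print(f"GPT-2 Small estimate: {gpt2_small / 1e6:.0f}M")
print(f"GPT-3 estimate:       {gpt3 / 1e9:.1f}B")
```

The GPT-2 Small estimate lands near 124M, which is what counting `model.parameters()` on the Hugging Face checkpoint reports; the 117M figure often quoted for this model comes from the original release notes.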

In [None]:
# Simple Transformer Block Implementation
class SimpleTransformerBlock(nn.Module):
    """A simplified transformer block for educational purposes"""
    
    def __init__(self, embed_dim, num_heads, ff_dim):
        super().__init__()
        
        # Multi-head self-attention
        self.attention = nn.MultiheadAttention(
            embed_dim=embed_dim,
            num_heads=num_heads,
            batch_first=True
        )
        
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim)
        )
        
        # Layer normalization
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)
    
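    def forward(self, x, causal=True):
        # Causal mask: True marks future positions a token must not attend to.
        # A language model needs this so position t cannot peek at token t+1.
        mask = None
        if causal:
            T = x.size(1)
            mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        
        # Self-attention with residual connection (post-norm layout)
        attn_out, _ = self.attention(x, x, x, attn_mask=mask, need_weights=False)
        x = self.ln1(x + attn_out)
        
        # Feed-forward with residual connection
        ffn_out = self.ffn(x)
        x = self.ln2(x + ffn_out)
        
        return x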

# Create a simple transformer block
block = SimpleTransformerBlock(embed_dim=512, num_heads=8, ff_dim=2048)

# Test with random input
batch_size, seq_len, embed_dim = 2, 10, 512
x = torch.randn(batch_size, seq_len, embed_dim)
output = block(x)

print("✅ Simple Transformer Block")
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Parameters: {sum(p.numel() for p in block.parameters()):,}")

<a id='attention'></a>
## 4. 👁️ Attention Mechanism

### What is Attention?

**Attention** lets the model focus on relevant parts of the input:

```
Sentence: "The cat sat on the mat"

Processing "sat":
  - High attention to "cat" (subject)
  - High attention to "mat" (location)
  - Low attention to "the" (less important)
```

### Self-Attention Formula

```
Attention(Q, K, V) = softmax(QK^T / √d_k) V

Where:
  Q = Queries (what we're looking for)
  K = Keys (what we can match against)
  V = Values (what we retrieve)
  d_k = dimension of keys (for scaling)
```
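
The formula translates almost line for line into PyTorch. The sketch below is a single-head version with an optional causal mask; the function name, tensor sizes, and random inputs are illustrative only, and production code would normally rely on a library implementation instead.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    """q, k, v: (batch, seq_len, d_k). Returns (output, attention weights)."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)       # QK^T / sqrt(d_k), shape (B, T, T)
    if causal:
        T = scores.size(-1)
        future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))  # hide future tokens
    weights = F.softmax(scores, dim=-1)                     # each row sums to 1
    return weights @ v, weights                             # weighted sum of values

torch.manual_seed(0)
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)

out, w = scaled_dot_product_attention(q, k, v, causal=True)
print(out.shape)   # one output vector per token
print(w[0])        # lower-triangular: each token attends only to itself and earlier tokens
```

Dividing by √d_k keeps the scores from growing with the vector dimension, which would otherwise push the softmax into regions with very small gradients.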

### Multi-Head Attention

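Instead of one attention mechanism, use multiple "heads" that each learn their own pattern. The roles below are illustrative; heads in a trained model are rarely this cleanly separated: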

```
Head 1: Focuses on syntactic relationships
Head 2: Focuses on semantic meaning
Head 3: Focuses on long-range dependencies
...
Head 8: Focuses on positional patterns

→ Concatenate all heads → Learn richer representations
```
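
The split-and-concatenate step is mostly a reshape. The following sketch assumes random weight matrices `w_qkv` and `w_out` in place of learned ones and omits masking and biases; it is meant to show the tensor shapes, not to replace `nn.MultiheadAttention`.

```python
import torch
import torch.nn.functional as F

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    """x: (B, T, C); w_qkv: (C, 3C); w_out: (C, C). Returns (output, weights)."""
    B, T, C = x.shape
    head_dim = C // num_heads
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)                 # each (B, T, C)

    def split_heads(t):
        # (B, T, C) -> (B, heads, T, head_dim): every head gets its own slice of C
        return t.reshape(B, T, num_heads, head_dim).transpose(1, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(-2, -1) / head_dim ** 0.5     # (B, heads, T, T)
    weights = F.softmax(scores, dim=-1)
    heads = weights @ v                                    # (B, heads, T, head_dim)
    # Concatenate the heads back into one vector per token, then mix them
    merged = heads.transpose(1, 2).reshape(B, T, C)
    return merged @ w_out, weights

torch.manual_seed(0)
B, T, C, H = 2, 5, 16, 4
x_demo = torch.randn(B, T, C)
w_qkv = torch.randn(C, 3 * C) / C ** 0.5
w_out = torch.randn(C, C) / C ** 0.5

mha_out, mha_weights = multi_head_self_attention(x_demo, w_qkv, w_out, num_heads=H)
print(mha_out.shape)       # same shape as the input
print(mha_weights.shape)   # one T x T attention map per head
```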

In [None]:
# Visualize attention patterns
def visualize_attention(text, layer=0, head=0):
    """Visualize attention weights from GPT-2"""
    # Tokenize
    inputs = tokenizer(text, return_tensors='pt')
    
    # Get attention weights
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)
        attention = outputs.attentions[layer][0, head].numpy()
    
    # Get tokens
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    
    # Plot
    plt.figure(figsize=(10, 8))
    plt.imshow(attention, cmap='viridis')
    plt.colorbar(label='Attention Weight')
    plt.xticks(range(len(tokens)), tokens, rotation=90)
    plt.yticks(range(len(tokens)), tokens)
    plt.xlabel('Key Tokens')
    plt.ylabel('Query Tokens')
    plt.title(f'Attention Pattern (Layer {layer}, Head {head})')
    plt.tight_layout()
    plt.show()

# Example
text = "The cat sat on the mat"
print(f"Text: {text}\n")
visualize_attention(text, layer=0, head=0)

<a id='tokenization'></a>
## 5. 🔤 Tokenization

### What is Tokenization?

Tokenization breaks text into **tokens**. Depending on the scheme, a token is a word, a character, or a sub-word unit:

```
Word-level: "Hello world" → ["Hello", "world"]
Character-level: "Hello" → ["H", "e", "l", "l", "o"]
Subword (BPE): "HelloWorld" → ["Hello", "World"]
```

### Why Subword Tokenization?

**Benefits:**
- ✅ Handles unknown words ("tokenization" → "token" + "ization")
- ✅ Smaller vocabulary than word-level tokenization, shorter sequences than character-level
- ✅ Better for multilingual models

### Popular Algorithms

1. **BPE** (Byte Pair Encoding) - GPT models
2. **WordPiece** - BERT
3. **Unigram** - Probabilistic alternative to BPE, used by T5
4. **SentencePiece** - Library that implements BPE and Unigram on raw text; used by T5 and LLaMA 1/2
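
To make BPE concrete, here is a toy trainer that repeatedly merges the most frequent adjacent pair of symbols. It is a simplified sketch: it works on characters within whole words, whereas GPT-2's tokenizer works on bytes and pre-splits text with a regular expression. The function names and the sample corpus are made up for this example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    corpus = Counter(tuple(w) for w in words)   # each word starts as a tuple of characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():    # count adjacent pairs, weighted by word frequency
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:                           # every word is a single symbol already
            break
        best = max(pairs, key=pairs.get)        # most frequent pair (first seen wins ties)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(list(symbols), best))] += freq
        corpus = new_corpus
    return merges

def apply_merges(word, merges):
    """Tokenize a new word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

sample_words = ["low", "low", "lower", "lowest", "newer", "newest"]
sample_merges = learn_bpe_merges(sample_words, num_merges=6)
print(sample_merges)
print(apply_merges("lowest", sample_merges))
print(apply_merges("slower", sample_merges))   # unseen word still splits into known pieces
```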

In [None]:
# Tokenization examples
examples = [
    "Hello, world!",
    "Tokenization is important",
    "GPT-4 can understand complex patterns",
    "🤖 AI is amazing!",
    "programming"
]

print("🔤 Tokenization Examples\n")
print("="*70 + "\n")

for text in examples:
    tokens = tokenizer.tokenize(text)
    token_ids = tokenizer.encode(text)
    
    print(f"Text: {text}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {token_ids}")
    print(f"Count: {len(tokens)} tokens\n")

<a id='training'></a>
## 6. 🏋️ Training Process

### Three Stages of LLM Training

```
1. PRE-TRAINING (Months, $$$$$)
   └── Train on massive text corpus (internet, books, code)
   └── Learn: grammar, facts, reasoning patterns
   └── Objective: Predict next token
   └── Cost: reportedly over $100M for GPT-4 scale

2. SUPERVISED FINE-TUNING (Days-Weeks, $$$)
   └── Train on high-quality Q&A pairs
   └── Learn: Follow instructions, helpful responses
   └── Cost: ~$100K

3. RLHF (Weeks, $$)
   └── Reinforcement Learning from Human Feedback
   └── Learn: Align with human preferences
   └── Makes model helpful, harmless, honest
```
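
The pre-training objective ("predict next token") is ordinary cross-entropy with the targets shifted by one position. The sketch below uses random logits as a stand-in for a model's output, so the numbers themselves are meaningless; the part to notice is the shift between predictions and targets.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, seq_len = 10, 6
token_ids = torch.randint(0, vocab_size, (1, seq_len))   # one "sentence" of 6 token ids

# Stand-in for a model's output: one logit vector per position
logits = torch.randn(1, seq_len, vocab_size)

# Position t predicts token t+1, so drop the last prediction and the first target
pred = logits[:, :-1, :]       # (1, 5, vocab_size)
target = token_ids[:, 1:]      # (1, 5)

loss = F.cross_entropy(pred.reshape(-1, vocab_size), target.reshape(-1))
print(f"loss: {loss.item():.3f}")
print(f"uniform-guess baseline: {math.log(vocab_size):.3f}")
```

A model that assigns equal probability to every token scores exactly ln(vocab_size); training drives the loss below that baseline.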

### Training Data Scale

| Model | Training Tokens | Training Data |
|-------|----------------|---------------|
| GPT-2 | 10B | 40GB text |
| GPT-3 | 300B | ~500GB text |
| GPT-4 | Undisclosed | Undisclosed |
| LLaMA 2 | 2T | ~2TB text |

<a id='miniLLM'></a>
## 9. 🔨 Building a Mini Language Model

Let's build a tiny LLM from scratch to understand the concepts!

In [None]:
# Complete Mini-LLM Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniLLM(nn.Module):
    """A minimal language model for character-level text generation"""
    
    def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2, max_len=128):
        super().__init__()
        self.vocab_size = vocab_size
        self.max_len = max_len
        
        # Token embeddings
        self.token_embedding = nn.Embedding(vocab_size, embed_dim)
        
        # Positional embeddings
        self.position_embedding = nn.Embedding(max_len, embed_dim)
        
        # Transformer blocks
        self.blocks = nn.ModuleList([
            SimpleTransformerBlock(embed_dim, num_heads, embed_dim * 4)
            for _ in range(num_layers)
        ])
        
        # Output layer
        self.ln_f = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, vocab_size)
    
    def forward(self, idx):
        B, T = idx.shape
        
        # Token + position embeddings
        tok_emb = self.token_embedding(idx)  # (B, T, embed_dim)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))  # (T, embed_dim)
        x = tok_emb + pos_emb  # (B, T, embed_dim)
        
        # Transformer blocks
        for block in self.blocks:
            x = block(x)
        
        # Output
        x = self.ln_f(x)
        logits = self.head(x)  # (B, T, vocab_size)
        
        return logits
    
    def generate(self, idx, max_new_tokens, temperature=1.0):
        """Generate new tokens autoregressively"""
        for _ in range(max_new_tokens):
            # Crop context if too long
            idx_cond = idx if idx.size(1) <= self.max_len else idx[:, -self.max_len:]
            
            # Get predictions
            logits = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # Last token
            
            # Sample
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            
            # Append
            idx = torch.cat([idx, idx_next], dim=1)
        
        return idx

print("✅ MiniLLM class defined!")

In [None]:
# Prepare a tiny dataset (opening lines of Hamlet's soliloquy)
text = """To be or not to be, that is the question.
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,
Or to take arms against a sea of troubles."""

# Create character-level vocabulary
chars = sorted(set(text))
vocab_size = len(chars)
char_to_idx = {ch: i for i, ch in enumerate(chars)}
idx_to_char = {i: ch for i, ch in enumerate(chars)}

# Encode text
encode = lambda s: [char_to_idx[c] for c in s]
decode = lambda l: ''.join([idx_to_char[i] for i in l])

data = torch.tensor(encode(text), dtype=torch.long)

print(f"Text length: {len(text)} characters")
print(f"Vocabulary size: {vocab_size}")
print(f"First 50 characters: {text[:50]}")
print(f"Encoded: {data[:50].tolist()}")

In [None]:
# Create and test model
model_mini = MiniLLM(vocab_size=vocab_size, embed_dim=64, num_heads=4, num_layers=2)

print("🤖 MiniLLM Architecture:")
print(f"Parameters: {sum(p.numel() for p in model_mini.parameters()):,}")
print(f"Vocabulary: {vocab_size} characters")

# Generate before training (random)
context = torch.tensor([encode("To be")], dtype=torch.long)
with torch.no_grad():
    generated = model_mini.generate(context, max_new_tokens=50, temperature=1.0)
    
print("\n🎲 Before training (random):")
print(decode(generated[0].tolist()))
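
The cell above only samples from an untrained model. A training loop along the following lines could fit it to the text. This is a sketch under assumptions: `get_batch`, `train_char_lm`, and the hyperparameters are not part of the original notebook, and the `BigramLM` stand-in exists only so the block runs on its own. Any module that maps `(B, T)` token ids to `(B, T, vocab)` logits can be passed in.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def get_batch(data, block_size, batch_size):
    """Sample random windows; targets are the inputs shifted one step to the right."""
    starts = torch.randint(0, len(data) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([data[s:s + block_size] for s in starts])
    y = torch.stack([data[s + 1:s + block_size + 1] for s in starts])
    return x, y

def train_char_lm(model, data, steps=200, block_size=32, batch_size=16, lr=1e-2):
    """Minimize next-token cross-entropy and return the loss at every step."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    losses = []
    model.train()
    for _ in range(steps):
        x, y = get_batch(data, block_size, batch_size)
        logits = model(x)                                    # (B, T, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

class BigramLM(nn.Module):
    """Hypothetical stand-in model: predicts the next character from the current one only."""
    def __init__(self, vocab_size):
        super().__init__()
        self.table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx):
        return self.table(idx)                               # (B, T, vocab)

# Self-contained demo on a repeated sentence (separate names, so the notebook's
# `text`, `chars`, and `data` variables are left untouched)
torch.manual_seed(0)
demo_text = "to be or not to be, that is the question. " * 8
demo_chars = sorted(set(demo_text))
demo_data = torch.tensor([demo_chars.index(c) for c in demo_text], dtype=torch.long)

demo_losses = train_char_lm(BigramLM(len(demo_chars)), demo_data, steps=200)
print(f"loss: {demo_losses[0]:.2f} -> {demo_losses[-1]:.2f}")
```

With the notebook's own objects the call would look like `train_char_lm(model_mini, data, steps=300, block_size=32, lr=3e-3)`, followed by the same `generate` call as above to compare the output. This only teaches next-token prediction if attention in `SimpleTransformerBlock` is causally masked; with unmasked attention each position can see the token it is supposed to predict.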

<a id='apis'></a>
## 10. 🌐 Using LLM APIs

### Popular LLM APIs

| Provider | Model | Pricing (approx., changes often) | Best For |
|----------|-------|---------|----------|
| OpenAI | GPT-4 | $0.03/1K tokens | General purpose |
| Anthropic | Claude 3 | $0.015/1K tokens | Long context |
| Google | Gemini | Free tier | Multimodal |
| Cohere | Command | $0.001/1K tokens | Cost-effective |
| Together AI | LLaMA 3 | $0.0006/1K tokens | Open source |
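
Per-token pricing turns into a budget with simple arithmetic. The sketch below copies the prices from the table as a snapshot, so check each provider's pricing page before relying on them. Two further assumptions: a single flat rate per model (providers often charge different rates for input and output tokens) and the rough heuristic of about four characters per token for English text. Both helper functions are hypothetical.

```python
# USD per 1K tokens, copied from the table above (snapshot, may be outdated)
PRICE_PER_1K = {
    "GPT-4": 0.03,
    "Claude 3": 0.015,
    "Command": 0.001,
    "LLaMA 3": 0.0006,
}

def rough_token_count(text):
    """Very rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_cost(model, num_tokens):
    """Cost in USD for `num_tokens` tokens at the flat per-1K rate."""
    return PRICE_PER_1K[model] * num_tokens / 1000

document = "Large language models predict the next token. " * 200
tokens = rough_token_count(document)

print(f"~{tokens} tokens")
for name in PRICE_PER_1K:
    print(f"{name:10s} ${estimate_cost(name, tokens):.4f}")
```

For an exact count, run the text through the provider's own tokenizer instead of the character heuristic.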

In [None]:
# Example: Using OpenAI API (requires an API key and `pip install openai`)
# Uncomment to use (written for openai>=1.0; older versions used openai.ChatCompletion)

# import os
# from openai import OpenAI

# client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# def chat_with_gpt(prompt, model="gpt-4"):
#     response = client.chat.completions.create(
#         model=model,
#         messages=[{"role": "user", "content": prompt}],
#         temperature=0.7,
#         max_tokens=150
#     )
#     return response.choices[0].message.content

# # Test
# answer = chat_with_gpt("Explain what a large language model is in simple terms")
# print(answer)

print("💡 To use LLM APIs:")
print("1. Get API key from provider")
print("2. Set environment variable: export OPENAI_API_KEY='your-key'")
print("3. Uncomment and run the code above")

## 🎉 Conclusion

You've learned:

✅ What LLMs are and how they work

✅ Transformer architecture and attention

✅ Tokenization and embeddings

✅ Training process (pre-training, fine-tuning, RLHF)

✅ Built a mini LLM from scratch

✅ How to use LLM APIs

### Next Steps

1. Fine-tune a model on custom data
2. Build applications with LLM APIs
3. Learn prompt engineering
4. Explore RAG (Retrieval Augmented Generation)
5. Study model efficiency (quantization, distillation)

### Resources

- [Attention is All You Need](https://arxiv.org/abs/1706.03762) - Original Transformer paper
- [GPT-3 Paper](https://arxiv.org/abs/2005.14165)
- [Hugging Face Course](https://huggingface.co/course)
- [Karpathy's Neural Networks](https://karpathy.ai/zero-to-hero.html)

**Happy learning! 🚀**