# Chapters 6, 7 & 8: Think Tokens, Training Loop & Evaluation

**Goal:** Understand how all the pieces (rollouts, rewards, GRPO) come together in a complete training loop, how think tokens teach the model to reason, and how to evaluate the results.

This notebook covers:
- **Chapter 6:** Think tokens and emergent reasoning behavior
- **Chapter 7:** The complete training loop in `train.py`
- **Chapter 8:** Evaluation benchmarks and next steps

---

## The Big Picture

Everything we've built so far:

```
Ch 2: load_model_qlora()     --> Load 8B model in ~5.7 GB
       sample_batch()         --> Generate diverse responses
       
Ch 3: math_reward()           --> Score responses (+1.0, -0.5, -1.0)
       reward_with_thinking() --> Also reward <think> blocks
       
Ch 5: grpo_step()             --> One RL update (generate G responses, compute advantages, backprop)

Ch 7: train()                 --> Tie it all together in a loop!
       |
       +-- for step in range(num_steps):
       |     1. Get prompt from GSM8K dataset
       |     2. optimizer.zero_grad()
       |     3. grpo_step() --> generates, scores, computes loss, backward()
       |     4. optimizer.step() --> update LoRA weights
       |     5. torch.cuda.empty_cache()
```

---

## Part 1: Think Tokens (Chapter 6)

### What are think tokens?

Special `<think>...</think>` blocks where the model externalizes its reasoning before giving a final answer. During inference, the think block can be stripped - the user only sees the final answer.

### Example: Math Problem

**Without thinking:**
```
Q: What is 127 * 843?
A: 107,061
```

**With thinking:**
```
Q: What is 127 * 843?
<think>
I need to compute 127 * 843.
Let me break this down:
127 * 800 = 101,600
127 * 40  = 5,080
127 * 3   = 381

Total: 101,600 + 5,080 + 381 = 107,061

Let me verify: 107,061 / 127 = 843. Correct.
</think>
The answer is 107,061.
```
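
Stripping the block at inference time can be sketched with a regular expression. This is a minimal illustration, not code from the book: `strip_think` is a hypothetical helper, and it assumes the tags appear literally as `<think>` and `</think>` and are never nested.

```python
import re

# Matches one <think>...</think> block, including any newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def strip_think(response: str) -> str:
    """Return the response with every think block removed."""
    return THINK_BLOCK.sub("", response).strip()


raw = "<think>\n127 * 800 = 101,600\n127 * 43 = 5,461\n</think>\nThe answer is 107,061."
print(strip_think(raw))
```

A response without a think block passes through unchanged, so the same helper can be applied to every model output.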

### Why is this valuable?

| Benefit | Explanation |
|---|---|
| Problem decomposition | Forces multi-step breakdown |
| Credit assignment | Reward function can score intermediate steps |
| Interpretability | See exactly where errors occur |
| Self-correction | Model can review its own work |

### How the model learns WHEN to think

The model is **never explicitly told** which problems are hard. It learns this emergently through GRPO's relative scoring:

**Simple problem (2 + 3):**
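- Response without thinking: "5" --> reward +1.0 under `math_reward` (+0.5 under `reward_with_thinking`)
- Response with thinking: "<think>2+3=5</think> 5" --> reward +1.0 under either reward function
- Within the GRPO group, both styles are almost always correct, so their rewards are close and the advantages are small (exactly zero under `math_reward` when every response is correct)
- The model learns: thinking buys little on trivial problems. Under `reward_with_thinking` a mild pull toward thinking remains, because the bonus applies regardless of difficulty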

**Complex problem (127 * 843):**
- Response without thinking: often wrong --> reward -0.5
- Response with thinking: step-by-step, often correct --> reward +1.0
- Within the GRPO group, thinking responses dominate
- The model learns: thinking is essential for hard problems

In [None]:
# Simulating how GRPO learns when to think

import random
random.seed(42)

def simulate_grpo_group(problem_difficulty: str, G: int = 4):
    """Simulate a GRPO group for a simple vs complex problem."""
    responses = []
    
    for _ in range(G):
        uses_thinking = random.random() < 0.5  # 50% chance of thinking
        
        if problem_difficulty == "simple":
            # Simple problems: usually correct regardless of thinking
            correct = random.random() < 0.95  # 95% accuracy
        else:
            # Complex problems: thinking helps a lot
            if uses_thinking:
                correct = random.random() < 0.75  # 75% with thinking
            else:
                correct = random.random() < 0.20  # 20% without thinking
        
        # reward_with_thinking scoring
        if correct and uses_thinking:
            reward = 1.0
        elif correct:
            reward = 0.5
        elif uses_thinking:
            reward = -0.2
        else:
            reward = -0.5
        
        responses.append((uses_thinking, correct, reward))
    
    return responses


# Run many simulations
n_sims = 5000

for difficulty in ["simple", "complex"]:
    thinking_encouraged = 0
    not_thinking_encouraged = 0
    
    for _ in range(n_sims):
        group = simulate_grpo_group(difficulty)
        rewards = [r for _, _, r in group]
        baseline = sum(rewards) / len(rewards)
        
        for thinks, correct, reward in group:
            advantage = reward - baseline
            if advantage > 0.01:
                if thinks:
                    thinking_encouraged += 1
                else:
                    not_thinking_encouraged += 1
    
    total = thinking_encouraged + not_thinking_encouraged
    if total > 0:
        think_pct = 100 * thinking_encouraged / total
        no_think_pct = 100 * not_thinking_encouraged / total
    else:
        think_pct = no_think_pct = 0
    
    print(f"\n{difficulty.upper()} problems:")
    print(f"  Thinking responses encouraged:     {think_pct:.1f}%")
    print(f"  Non-thinking responses encouraged:  {no_think_pct:.1f}%")

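print("\nTakeaway: with this reward table, positive advantages go mostly to thinking responses.")
print("The signal is EMERGENT - it comes from within-group reward differences, which are larger")
print("on complex problems, so no explicit 'difficulty detector' is needed.")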
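
Counting positive advantages hides how strong the push is. A complementary sketch, under the same illustrative reward table and accuracy assumptions as the simulation, measures the mean advantage of thinking responses minus the mean advantage of non-thinking ones. The function name `mean_advantage_gap` and its parameters are hypothetical.

```python
import random


def mean_advantage_gap(p_correct_think, p_correct_plain, n_groups=5000, G=4, seed=0):
    """Mean advantage of thinking responses minus that of non-thinking ones."""
    rng = random.Random(seed)
    think_adv, plain_adv = [], []
    for _ in range(n_groups):
        group = []
        for _ in range(G):
            thinks = rng.random() < 0.5
            p_correct = p_correct_think if thinks else p_correct_plain
            correct = rng.random() < p_correct
            # Same illustrative reward table as the simulation above.
            if correct:
                reward = 1.0 if thinks else 0.5
            else:
                reward = -0.2 if thinks else -0.5
            group.append((thinks, reward))
        baseline = sum(r for _, r in group) / G
        for thinks, reward in group:
            (think_adv if thinks else plain_adv).append(reward - baseline)
    return sum(think_adv) / len(think_adv) - sum(plain_adv) / len(plain_adv)


simple_gap = mean_advantage_gap(0.95, 0.95)    # thinking does not change accuracy
complex_gap = mean_advantage_gap(0.75, 0.20)   # thinking changes accuracy a lot
print(f"simple:  thinking advantage gap = {simple_gap:.3f}")
print(f"complex: thinking advantage gap = {complex_gap:.3f}")
```

Under these assumptions the gap is positive in both cases but larger for complex problems, which is the sense in which the training signal favors thinking more where it helps more.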

---

## Part 2: The GSM8K Dataset

The training loop uses **GSM8K** (Grade School Math 8K) - a dataset of roughly 8,500 multi-step grade school math word problems (7,473 in the training split and 1,319 in the test split).

### Dataset format

Each example has:
- `question`: A math word problem in natural language
- `answer`: A step-by-step solution ending with `#### <number>`

In [None]:
# Understanding the GSM8K answer format

def extract_gsm8k_answer(answer_text: str) -> str:
    """Extract the final numerical answer from GSM8K's answer format.
    
    GSM8K answers look like:
    'Step 1: Calculate X = 5 * 3 = 15
     Step 2: Calculate Y = 15 + 7 = 22
     #### 22'
    
    We need to extract '22' from after the '####'.
    """
    lines = answer_text.strip().split("\n")
    # Search from the END of the text for the #### marker
    for line in reversed(lines):
        if "####" in line:
            return line.split("####")[-1].strip()
    # Fallback: return the last line
    return lines[-1].strip() if lines else ""


# Test with realistic GSM8K examples
examples = [
    {
        "question": "Janet's ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells every duck egg at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
        "answer": "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = <<9*2=18>>$18 every day at the farmer's market.\n#### 18"
    },
    {
        "question": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?",
        "answer": "It takes 2/2=<<2/2=1>>1 bolt of white fiber.\nSo the total bolts needed is 2+1=<<2+1=3>>3.\n#### 3"
    },
]

print("GSM8K Dataset Examples")
print("=" * 70)

for i, ex in enumerate(examples):
    answer = extract_gsm8k_answer(ex["answer"])
    print(f"\nExample {i+1}:")
    print(f"  Question: {ex['question'][:80]}...")
    print(f"  Full answer: {ex['answer'][:80]}...")
    print(f"  Extracted ground truth: {answer}")
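
The sample answers above contain `<<...>>` calculator annotations, and numeric answers can carry thousands separators or currency signs. A small cleanup sketch follows; `strip_calc_annotations` and `normalize_number` are hypothetical helpers introduced here for illustration, not functions from `train.py`.

```python
import re

# GSM8K solutions embed calculator annotations such as <<16-3-4=9>>.
CALC_ANNOTATION = re.compile(r"<<[^<>]*>>")


def strip_calc_annotations(solution: str) -> str:
    """Remove <<...>> annotations, leaving the human-readable solution text."""
    return CALC_ANNOTATION.sub("", solution)


def normalize_number(text: str) -> str:
    """Drop commas and dollar signs; render whole numbers without a decimal part."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    try:
        value = float(cleaned)
    except ValueError:
        return cleaned  # not a number: return the cleaned text unchanged
    return str(int(value)) if value.is_integer() else str(value)


solution = "Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\n#### 18"
print(strip_calc_annotations(solution))
print(normalize_number("1,000"), normalize_number("$18.00"))
```

Normalizing both the ground truth and the model's answer before comparing them avoids marking `1,000` and `1000` as different.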

---

## Part 3: The Training Loop - `train()` Line by Line

Now let's walk through the complete training function from `train.py`:

In [None]:
# Annotated version of the train() function
# (This is a walkthrough - don't run this cell unless you have a GPU)

import argparse
import sys
import time
from pathlib import Path

import torch

# These imports show how the chapters connect:
# from ch02_rollouts.sample import load_model_qlora      # Chapter 2: Model loading
# from ch03_rewards.verifier import math_reward, reward_with_thinking  # Chapter 3: Rewards
# from ch05_grpo.grpo import grpo_step                   # Chapter 5: GRPO algorithm


def train(
    num_steps: int = 50,              # Total training steps
    G: int = 4,                        # GRPO group size
    lr: float = 1e-5,                  # Learning rate
    use_thinking_reward: bool = False,  # Use reward_with_thinking?
    log_interval: int = 5,             # Print progress every N steps
):
    print("=== GRPO Training on RTX 3090 ===")
    print(f"Steps: {num_steps}, G: {G}, LR: {lr}")
    print(f"Thinking reward: {use_thinking_reward}")
    
    # ========================================
    # SETUP PHASE
    # ========================================
    
    # Step 1: Load model with QLoRA (Chapter 2)
    # This gives us ~5.7 GB base model + ~0.2 GB trainable LoRA adapters
    # model, tokenizer = load_model_qlora()
    
    # Step 2: Set up optimizer
    # AdamW = Adam with weight decay. Only updates LoRA parameters.
    # lr=1e-5 is conservative - prevents catastrophic forgetting
    # optimizer = AdamW(model.parameters(), lr=lr)
    
    # Step 3: Load dataset
    # GSM8K train split: 7,473 grade school math problems (about 8,500 including test)
    # Shuffle with fixed seed for reproducibility
    # dataset = load_dataset("openai/gsm8k", "main", split="train")
    # dataset = dataset.shuffle(seed=42)
    
    # Step 4: Select reward function
    # math_reward: only checks final answer
    # reward_with_thinking: also rewards <think> blocks
    # reward_fn = reward_with_thinking if use_thinking_reward else math_reward
    
    print("\nSetup complete. Starting training loop...")
    print("(This is a walkthrough - not actually running)")

In [None]:
# The core training loop (annotated walkthrough)

print("THE CORE TRAINING LOOP")
print("=" * 70)
print("""
for step in range(num_steps):  # e.g., 50 steps
    
    # --- Get training example ---
    example = dataset[step % len(dataset)]   # Cycle through dataset
    prompt = example["question"]              # "Janet's ducks lay 16 eggs..."
    ground_truth = extract_gsm8k_answer(      # "18"
        example["answer"]
    )
    
    # --- GRPO Step ---
    optimizer.zero_grad()                     # Clear gradients from last step
    
    loss, reward, responses = grpo_step(       # THE MAIN EVENT:
        model, tokenizer,                      #   1. Generate G=4 responses
        prompt, ground_truth,                  #   2. Score each with reward_fn
        reward_fn, G=G                         #   3. Compute advantages
    )                                          #   4. Compute loss & backward()
    
    optimizer.step()                           # Update LoRA weights!
    
    # --- Memory management ---
    torch.cuda.empty_cache()                   # Free unused GPU memory
                                               # Prevents fragmentation
""")

print("Each step:")
print("  1. Pick a math problem from GSM8K")
print("  2. Generate 4 different responses")
print("  3. Score them (correct=+1.0, wrong=-0.5, no answer=-1.0)")
print("  4. Compute advantages (each reward minus group mean)")
print("  5. Compute loss: -sum(advantage * log_prob) / G")
print("  6. Backpropagate through the LoRA parameters")
print("  7. Optimizer updates the LoRA weights")
print("  8. Clear GPU cache")
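
The same sequence of steps can be run end to end on a toy problem without a GPU. The sketch below is not the book's `grpo_step`: it swaps the language model for a two-action softmax policy (answer directly vs. think first), uses assumed accuracies of 20% and 75%, and replaces AdamW with plain gradient descent. All names are hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_grpo_train(num_steps=500, G=4, lr=0.1, seed=0):
    """Group-relative policy gradient on a 2-action toy policy.

    Action 0 = answer directly, action 1 = think first.
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0]          # stand-in for the trainable LoRA weights
    p_correct = [0.20, 0.75]     # assumed accuracy of each action
    for _ in range(num_steps):
        probs = softmax(logits)
        # 1. "Generate" G responses by sampling actions from the current policy.
        actions = [0 if rng.random() < probs[0] else 1 for _ in range(G)]
        # 2. Score each response: +1.0 if correct, -0.5 if wrong.
        rewards = [1.0 if rng.random() < p_correct[a] else -0.5 for a in actions]
        # 3. Group-relative advantages: reward minus group mean.
        baseline = sum(rewards) / G
        advantages = [r - baseline for r in rewards]
        # 4. Gradient of loss = -sum(A_i * log pi(a_i)) / G w.r.t. the logits,
        #    using d log pi(a) / d logit_k = 1[k == a] - probs[k].
        grad = [0.0, 0.0]        # fresh gradients, like optimizer.zero_grad()
        for a, adv in zip(actions, advantages):
            for k in range(2):
                indicator = 1.0 if k == a else 0.0
                grad[k] -= adv * (indicator - probs[k]) / G
        # 5. Plain gradient descent step, like optimizer.step().
        logits = [w - lr * g for w, g in zip(logits, grad)]
    return softmax(logits)


final_probs = toy_grpo_train()
print(f"P(answer directly) = {final_probs[0]:.3f}")
print(f"P(think first)     = {final_probs[1]:.3f}")
```

Because thinking has the higher expected reward under these assumptions, the policy shifts probability toward action 1 over training, with no signal other than within-group reward differences.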

---

## Part 4: Memory Budget on RTX 3090

One of the most practical aspects of this system is fitting everything into 24 GB:

In [None]:
# Memory budget breakdown

components = [
    ("Base model (4-bit NF4)",     5.0,  "Frozen, not trainable"),
    ("LoRA adapters (float32)",    0.2,  "~50M trainable parameters"),
    ("Optimizer states (AdamW)",   0.4,  "2x LoRA params (m and v)"),
    ("Gradients",                  0.2,  "Same size as LoRA params"),
    ("Activations (peak)",         6.0,  "During forward/backward pass"),
    ("KV cache (generation)",      1.5,  "For G=4 sequential generations"),
    ("CUDA overhead",              0.5,  "Context, allocator, etc."),
]

total_gpu = 24.0
total_used = sum(gb for _, gb, _ in components)

print("Memory Budget: GRPO Training on RTX 3090 (24 GB)")
print("=" * 65)
print(f"{'Component':<30} {'GB':<8} {'Notes'}")
print("-" * 65)

for name, gb, notes in components:
    bar = "#" * int(gb * 2)
    print(f"{name:<30} {gb:<8.1f} {bar} {notes}")

print("-" * 65)
print(f"{'TOTAL USED':<30} {total_used:<8.1f}")
print(f"{'AVAILABLE':<30} {total_gpu:<8.1f}")
print(f"{'HEADROOM':<30} {total_gpu - total_used:<8.1f}")
print()
print("Peak usage from book: ~13.57 GB (well within 24 GB budget)")
print("Headroom allows for longer sequences or larger G if needed")

### Key memory management strategies

1. **QLoRA** reduces model from 16 GB to ~5 GB (the big win)
2. **Small G=4** limits concurrent forward passes
3. **Sequential generation** (one response at a time, not batched)
4. **`torch.cuda.empty_cache()`** prevents memory fragmentation between steps
5. **Only LoRA parameters** have gradients, optimizer states
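
Several of the figures in the budget follow from simple parameter-count arithmetic. The sketch below reproduces them under stated assumptions: 1 GB is taken as 1e9 bytes, the base model has 8B parameters, and the adapters have about 50M parameters as listed in the table. The 4-bit figure counts raw weights only; the gap to the table's ~5 GB is presumably quantization metadata and layers kept at higher precision.

```python
def gigabytes(n_params: float, bytes_per_param: float) -> float:
    """Storage for n_params parameters, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


BASE_PARAMS = 8e9    # 8B-parameter base model
LORA_PARAMS = 50e6   # ~50M trainable LoRA parameters (from the table above)

fp16_base = gigabytes(BASE_PARAMS, 2)      # 16-bit weights
nf4_base = gigabytes(BASE_PARAMS, 0.5)     # 4-bit weights, no overhead counted
lora_fp32 = gigabytes(LORA_PARAMS, 4)      # float32 adapters
lora_grads = gigabytes(LORA_PARAMS, 4)     # one gradient per trainable parameter
adamw_states = 2 * lora_fp32               # first and second moments (m and v)

print(f"Base model, 16-bit:      {fp16_base:.1f} GB")
print(f"Base model, 4-bit (raw): {nf4_base:.1f} GB")
print(f"LoRA adapters:           {lora_fp32:.1f} GB")
print(f"Gradients:               {lora_grads:.1f} GB")
print(f"AdamW states:            {adamw_states:.1f} GB")
```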

---

## Part 5: Understanding the Optimizer

The training loop uses **AdamW** with `lr=1e-5`. Let's understand why these choices:

In [None]:
# Why lr=1e-5 and not larger?

print("Learning Rate Selection: lr = 1e-5")
print("=" * 50)
print()
print("Why so small?")
print("-" * 50)
print("1. We're fine-tuning a PRETRAINED model")
print("   - Too large LR would destroy pretrained knowledge")
print("   - This is called 'catastrophic forgetting'")
print()
print("2. RL gradients are inherently noisy")
print("   - Based on sampled responses (not ground truth)")
print("   - Small G=4 means noisy advantage estimates")
print("   - Small LR smooths out the noise")
print()
print("3. LoRA's alpha/r scaling shrinks each update further")
print("   - alpha=16, r=64 --> scaling = 16/64 = 0.25")
print("   - Rough effective LR for LoRA = 1e-5 * 0.25 = 2.5e-6")
print()
print("Why AdamW and not plain SGD?")
print("-" * 50)
print("1. Adam's adaptive learning rate per-parameter")
print("   helps with the high-variance RL gradients")
print("2. Weight decay (the 'W' in AdamW) provides")
print("   mild regularization, preventing LoRA params")
print("   from growing too large")
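
The per-parameter adaptivity mentioned above is easiest to see on a single scalar. The sketch below writes out one AdamW update by hand, following the standard formulation with decoupled weight decay; `adamw_step` is a hypothetical helper for illustration, and real training would use `torch.optim.AdamW`.

```python
import math


def adamw_step(param, grad, m, v, t, lr=1e-5, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter. t is the 1-based step count."""
    param = param * (1.0 - lr * weight_decay)      # decoupled weight decay
    m = beta1 * m + (1.0 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)                 # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Two parameters whose gradients differ by four orders of magnitude.
big, _, _ = adamw_step(1.0, 100.0, 0.0, 0.0, t=1, weight_decay=0.0)
small, _, _ = adamw_step(1.0, 0.01, 0.0, 0.0, t=1, weight_decay=0.0)
print(f"step with grad=100:  {1.0 - big:.3e}")
print(f"step with grad=0.01: {1.0 - small:.3e}")
```

Both first steps have a size of about `lr`, regardless of gradient magnitude. That normalization is what keeps a single high-variance RL gradient from producing an outsized update.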

---

## Part 6: The Command Line Interface

In [None]:
# Understanding the CLI arguments from train.py

print("Running train.py from the command line:")
print("=" * 55)
print()
print("Basic run (default settings):")
print("  python ch07_training/train.py")
print("  --> 50 steps, G=4, lr=1e-5, math_reward")
print()
print("With thinking reward:")
print("  python ch07_training/train.py --thinking")
print("  --> Same but uses reward_with_thinking")
print()
print("Custom configuration:")
print("  python ch07_training/train.py --steps 100 --G 8 --lr 2e-5 --thinking")
print("  --> 100 steps, G=8, lr=2e-5, reward_with_thinking")
print()
print("Arguments:")
print(f"  {'--steps':<12} Number of training steps (default: 50)")
print(f"  {'--G':<12} GRPO group size (default: 4)")
print(f"  {'--lr':<12} Learning rate (default: 1e-5)")
print(f"  {'--thinking':<12} Use thinking reward (default: off)")
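
A parser matching the flags and defaults listed above could look like the following. This is a sketch built from the argument table, so the actual parser in `train.py` may differ in help text or structure; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="GRPO training (sketch)")
    parser.add_argument("--steps", type=int, default=50, help="Number of training steps")
    parser.add_argument("--G", type=int, default=4, help="GRPO group size")
    parser.add_argument("--lr", type=float, default=1e-5, help="Learning rate")
    parser.add_argument("--thinking", action="store_true",
                        help="Use reward_with_thinking instead of math_reward")
    return parser


# Equivalent to: python ch07_training/train.py --steps 100 --G 8 --lr 2e-5 --thinking
args = build_parser().parse_args(["--steps", "100", "--G", "8", "--lr", "2e-5", "--thinking"])
print(args.steps, args.G, args.lr, args.thinking)
```

Passing an explicit list to `parse_args` keeps the example runnable inside a notebook, where `sys.argv` belongs to the kernel.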

---

## Part 7: What Training Looks Like

The book provides simulated training output. Let's understand what the metrics mean:

In [None]:
# Simulating training progress
import random
random.seed(42)

print("=== GRPO Training on RTX 3090 ===")
print("Steps: 50, G: 4, LR: 1e-05")
print("Thinking reward: True")
print()

# Simulate improving metrics over training
losses = []
rewards = []

for step in range(50):
    # Simulate gradually improving reward
    base_reward = -0.2 + (0.8 * step / 50)  # -0.2 to +0.6 over training
    reward = base_reward + random.uniform(-0.3, 0.3)
    reward = max(-1.0, min(1.0, reward))  # Clamp
    
    loss = -0.05 - (0.3 * step / 50) + random.uniform(-0.1, 0.1)
    
    losses.append(loss)
    rewards.append(reward)
    
    if (step + 1) % 10 == 0:
        avg_loss = sum(losses[-10:]) / 10
        avg_reward = sum(rewards[-10:]) / 10
        elapsed = (step + 1) * 3.0  # ~3 sec per step
        print(f"Step {step+1:>3}: loss={avg_loss:>8.4f}, reward={avg_reward:>7.4f}, time={elapsed:.1f}s")

total_time = 50 * 3.0
print("\n=== Training Complete ===")
print(f"Total time: {total_time:.1f}s ({total_time/50:.2f}s/step)")
print(f"Final avg loss: {sum(losses[-10:])/10:.4f}")
print(f"Final avg reward: {sum(rewards[-10:])/10:.4f}")
print("Peak GPU memory: ~13.57 GB")

### Reading the metrics

| Metric | What it means | Good trend |
|---|---|---|
| **loss** | GRPO policy gradient loss | Trends more negative in the simulated run; noisy, so judge progress mainly by reward |
| **reward** | Average reward across G responses | Increases toward +1.0 |
| **time** | Wall clock time | Stable per-step time |
| **Peak GPU memory** | Maximum VRAM usage | Should stay under 24 GB |

### What to watch for
- **Loss stuck at 0.0:** Every response in the group got the same score (all correct or all wrong), so all advantages are zero and the step produces no gradient. Increase G or adjust temperature.
- **Loss oscillates wildly:** LR too high. Reduce it.
- **OOM errors:** Reduce G or max_new_tokens.
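
A group in which every response gets the same reward carries no learning signal, and this is cheap to detect before spending time on backprop. The sketch below shows the check; `group_advantages` and `has_learning_signal` are hypothetical helpers, not part of the book's code.

```python
def group_advantages(rewards):
    """Group-relative advantages: each reward minus the group mean."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def has_learning_signal(rewards, tol=1e-8):
    """False when all advantages are (numerically) zero."""
    return any(abs(a) > tol for a in group_advantages(rewards))


for rewards in ([1.0, 1.0, 1.0, 1.0], [-0.5, -0.5, -0.5, -0.5], [1.0, -0.5, 1.0, 1.0]):
    print(rewards, "->", "signal" if has_learning_signal(rewards) else "no signal")
```

Logging the fraction of steps with no signal is one way to tell whether a flat loss comes from a problem set that is too easy, too hard, or sampled with too little diversity.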

---

## Part 8: Results (Chapter 7)

The book reports preliminary results on GSM8K (100 held-out test problems):

In [None]:
# Results from the book

results = [
    ("Qwen3-8B-Instruct (Base)",    58, 5),
    ("GRPO + math_reward",           67, 12),
    ("GRPO + reward_with_thinking",  71, 85),
]

print("Preliminary Results on GSM8K (100 test problems)")
print("=" * 65)
print(f"{'Model':<35} {'Accuracy':<12} {'Think Block %'}")
print("-" * 65)

for model_name, accuracy, think_pct in results:
    acc_text = f"{accuracy}%"  # pad as one string so the columns line up
    print(f"{model_name:<35} {acc_text:<12} {think_pct}%")

print()
print("Key findings:")
print("  1. GRPO alone improved accuracy by 9 percentage points (58% -> 67%)")
print("  2. Adding thinking reward gave +13 points total (58% -> 71%)")
print("  3. Think block presence jumped from 5% to 85%!")
print("  4. The gains are consistent with the model learning to reason step by step")
print()
print("The 4-point gap between math_reward and reward_with_thinking")
print("suggests that explicit reasoning traces help the model solve")
print("problems it would otherwise get wrong, though with only 100")
print("test problems a gap this small is preliminary.")
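
With only 100 test problems, sampling noise is worth quantifying. The sketch below applies the binomial standard error to the three reported accuracies. It is a rough check using a normal approximation and treats each accuracy independently; `accuracy_standard_error` is a hypothetical helper.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n problems."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


reported = [("base", 0.58), ("math_reward", 0.67), ("reward_with_thinking", 0.71)]
for name, acc in reported:
    se = accuracy_standard_error(acc, 100)
    print(f"{name:<22} {acc:.0%}  SE = {se:.1%}  95% interval = +/- {1.96 * se:.1%}")
```

At n = 100 each accuracy carries a standard error of roughly 4.5 to 5 points, so the 13-point gain over the base model is easier to distinguish from noise than the 4-point gap between the two reward functions.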

---

## Part 9: Evaluation & Benchmarks (Chapter 8)

### Three key benchmarks

| Benchmark | Task | Size | Metric | Difficulty |
|---|---|---|---|---|
| **GSM8K** | Grade school math | 8,500 | Numerical accuracy | Medium |
| **MATH** | Competition math | 12,500 | Accuracy (7 subjects, 5 difficulty levels) | Hard |
| **HumanEval** | Code generation | 164 | pass@k | Variable |

### The evaluation loop

In [None]:
# How evaluation works (conceptual)

print("Evaluation Loop (pseudo-code):")
print("=" * 50)
print("""
def evaluate(model, tokenizer, dataset, reward_fn):
    correct = 0
    total = 0
    
    for example in dataset:
        # Generate with LOW temperature (near-deterministic)
        response = sample_response(
            model, tokenizer,
            example["question"],
            temperature=0.1  # <-- Key difference from training!
        )
        
        # Check correctness
        ground_truth = extract_gsm8k_answer(example["answer"])
        reward = reward_fn(response, ground_truth)
        
        if reward > 0:
            correct += 1
        total += 1
    
    accuracy = correct / total
    return accuracy
""")

print("Critical difference: Training uses T=0.7, Evaluation uses T=0.1")
print("  Training: Need diversity for exploration")
print("  Evaluation: Want the model's best answer")
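
The evaluation loop can be exercised without a model by injecting the generator as a function. In the sketch below, `generate_fn` stands in for `sample_response` at low temperature, and `last_number` is a hypothetical answer parser that takes the last number in the response; the book's reward functions may parse answers differently.

```python
import re


def extract_gsm8k_answer(answer_text):
    """Return the text after the final '####' marker, or '' if there is none."""
    for line in reversed(answer_text.strip().split("\n")):
        if "####" in line:
            return line.split("####")[-1].strip()
    return ""


def last_number(text):
    """Last number in the text with thousands separators removed, or None."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None


def evaluate(generate_fn, dataset):
    """Fraction of examples whose parsed answer matches the ground truth."""
    if not dataset:
        return 0.0  # avoid dividing by zero on an empty dataset
    correct = 0
    for example in dataset:
        response = generate_fn(example["question"])
        truth = extract_gsm8k_answer(example["answer"]).replace(",", "")
        if last_number(response) == truth:
            correct += 1
    return correct / len(dataset)


toy_dataset = [
    {"question": "What is 2 + 3?", "answer": "2 + 3 = 5\n#### 5"},
    {"question": "What is 10 * 4?", "answer": "10 * 4 = 40\n#### 40"},
]
canned = {"What is 2 + 3?": "The answer is 5", "What is 10 * 4?": "The answer is 44"}
accuracy = evaluate(lambda question: canned[question], toy_dataset)
print(f"accuracy = {accuracy:.2f}")
```

Swapping the lambda for a real generation call leaves the rest of the loop unchanged, which makes before-and-after comparisons straightforward.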

### Qualitative analysis

Beyond accuracy numbers, the book recommends inspecting:

1. **Error analysis:** Look at think blocks for wrong answers. Is it:
   - An arithmetic mistake?
   - A misunderstanding of the problem?
   - A logic error?

2. **Trace consistency:** Does the final answer logically follow from the think block? A common failure mode is correct reasoning but a wrong final number.

3. **Success analysis:** Are successful reasoning traces elegant and efficient, or convoluted?
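
The trace-consistency check in item 2 can be partly automated. The sketch below compares the last number inside the think block with the last number in the final answer. It is a heuristic with hypothetical helper names, and it compares strings, so `18` and `18.0` would count as different.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def split_trace(response):
    """Return (think_text, answer_text); think_text is '' if there is no block."""
    match = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    if match is None:
        return "", response
    return match.group(1), response[match.end():]


def last_number(text):
    found = NUMBER.findall(text)
    return found[-1].replace(",", "") if found else None


def trace_is_consistent(response):
    """True/False when both parts contain a number, None when nothing can be compared."""
    think, answer = split_trace(response)
    think_result, final = last_number(think), last_number(answer)
    if think_result is None or final is None:
        return None
    return think_result == final


print(trace_is_consistent("<think>16-3-4=9 eggs sold. 9*$2=$18</think> The answer is 18"))
print(trace_is_consistent("<think>9 * 2 = 18</think> The answer is 20"))
print(trace_is_consistent("The answer is 18"))
```

Responses flagged `False` are good candidates for manual review, since they show reasoning that reached one number and an answer that reported another.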

---

## Part 10: Future Directions (Chapter 8)

The book outlines four directions that go beyond the single-GPU setup:

In [None]:
future_directions = [
    {
        "name": "Full Fine-Tuning",
        "description": "Update ALL weights, not just LoRA adapters",
        "requirement": "Multiple A100/H100 GPUs + DeepSpeed/FSDP",
        "benefit": "Deeper integration of reasoning capabilities",
    },
    {
        "name": "Process Reward Models (PRMs)",
        "description": "Score every individual reasoning step, not just the final answer",
        "requirement": "Powerful model (e.g., GPT-4) to evaluate each step",
        "benefit": "Much more granular training signal",
    },
    {
        "name": "Monte Carlo Tree Search (MCTS)",
        "description": "Build a search tree of reasoning paths at inference time",
        "requirement": "Significant inference-time compute",
        "benefit": "Explore multiple reasoning strategies, pick the best",
    },
    {
        "name": "Distillation",
        "description": "Use a large reasoning model to generate traces, train a smaller model to mimic",
        "requirement": "A powerful 'teacher' model",
        "benefit": "Deploy reasoning in smaller, faster models",
    },
]

print("Future Directions (Beyond Single GPU)")
print("=" * 60)

for i, d in enumerate(future_directions, 1):
    print(f"\n{i}. {d['name']}")
    print(f"   What: {d['description']}")
    print(f"   Needs: {d['requirement']}")
    print(f"   Gains: {d['benefit']}")

---

## Part 11: Complete Data Flow Diagram

Let's trace a single training step end-to-end to solidify understanding:

In [None]:
print("""
COMPLETE DATA FLOW: One Training Step
======================================

1. DATASET --> Prompt
   GSM8K[step=7] --> "Janet's ducks lay 16 eggs per day..."
   extract_gsm8k_answer() --> ground_truth = "18"

2. FORMATTING (ch02)
   apply_chat_template() --> "<|im_start|>user\\nJanet's ducks...<|im_end|>\\n<|im_start|>assistant\\n"

3. GENERATION (ch02, torch.no_grad)
   model.generate() x4 -->
     R1: "She sells 16-3-4=9 eggs. 9*2=$18. The answer is 18"
     R2: "16 eggs minus 3 minus 4 is 9. 9 times 2 is 20. Answer: 20"
     R3: "<think>16-3-4=9 eggs sold. 9*$2=$18</think> The answer is 18"
     R4: "I think she makes about $15"

4. SCORING (ch03)
   reward_with_thinking():
     R1: correct, no think  --> +0.5
     R2: wrong, no think    --> -0.5
     R3: correct + think    --> +1.0
     R4: wrong, no think    --> -0.5

5. ADVANTAGES (ch05)
   baseline = mean(+0.5, -0.5, +1.0, -0.5) = +0.125
   A1 = +0.5 - 0.125  = +0.375  (encourage)
   A2 = -0.5 - 0.125  = -0.625  (discourage)
   A3 = +1.0 - 0.125  = +0.875  (STRONGLY encourage)
   A4 = -0.5 - 0.125  = -0.625  (discourage)

6. LOSS (ch05)
   For each response with non-zero advantage:
     loss -= advantage * log_prob(response | prompt) / G
   loss.backward()  --> gradients flow through LoRA adapters

7. UPDATE (ch07)
   optimizer.step()  --> AdamW updates LoRA weights
   
   Net effect after this step:
     - Model is slightly more likely to produce R3-style responses
       (correct + thinking = highest advantage)
     - Model is slightly less likely to produce R2 and R4-style responses
       (wrong answers = negative advantages)

8. CLEANUP
   torch.cuda.empty_cache()  --> free GPU memory
   --> Ready for next step
""")

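The advantage and loss arithmetic in steps 5 and 6 of the trace can be checked directly. In the sketch below the rewards are the four scores from the trace, while the sequence log-probabilities are made-up values chosen only to make the loss computable.

```python
rewards = [0.5, -0.5, 1.0, -0.5]            # R1..R4 from the scoring step
log_probs = [-42.0, -38.5, -55.0, -20.0]    # hypothetical log_prob(response | prompt)

G = len(rewards)
baseline = sum(rewards) / G
advantages = [r - baseline for r in rewards]

# loss = -sum(advantage * log_prob) / G
loss = -sum(a * lp for a, lp in zip(advantages, log_probs)) / G

# d(loss) / d(log_prob_i) = -advantage_i / G, so gradient descent raises the
# log-probability of responses with positive advantage and lowers the others.
grad_wrt_log_probs = [-a / G for a in advantages]

print("baseline   =", baseline)
print("advantages =", advantages)
print("loss       =", loss)
print("d loss / d log_prob =", grad_wrt_log_probs)
```

The advantages sum to zero by construction, so a step only redistributes probability among the sampled responses: R3 gains the most, R1 gains a little, and R2 and R4 lose.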
---

## Exercises

### Exercise 1: Logging Enhancements
The current training loop logs loss and reward. Add logging for:
- Percentage of responses with `<think>` blocks
- Average response length (in tokens)
- Per-step timing breakdown (generation vs scoring vs backprop)

### Exercise 2: Gradient Accumulation
The book mentions gradient accumulation as an extension. Modify the training loop to accumulate gradients over `K` steps before calling `optimizer.step()`. This simulates a larger effective batch size.

### Exercise 3: Learning Rate Schedule
Implement a simple warmup + cosine decay learning rate schedule. Why might this be important for RL training stability?

### Exercise 4: Evaluation Script
Write a function that evaluates the model on a held-out set of GSM8K problems. Compare accuracy before and after training.

### Exercise 5: Curriculum Learning
Instead of random sampling from GSM8K, implement a curriculum that starts with simpler problems and gradually introduces harder ones. How would you measure problem difficulty?

---

## Key Takeaways

1. **Think tokens** teach the model to externalize reasoning. The model learns WHEN to think through GRPO's relative scoring - no explicit difficulty labels needed.

2. **The training loop** is deceptively simple: `zero_grad --> grpo_step --> optimizer.step --> empty_cache`. All the complexity is inside `grpo_step`.

3. **Memory fits in 24 GB** thanks to QLoRA (4-bit base + small LoRA adapters) and sequential generation (one response at a time).

4. **GSM8K results** show a 13-point accuracy improvement (58% -> 71%) with thinking rewards, and the model produces reasoning traces in 85% of responses once think blocks are rewarded. With 100 test problems, these figures are preliminary.

5. **Evaluation uses low temperature** (T=0.1) to get the model's best answer, unlike training which uses T=0.7 for exploration.

6. **This is just the beginning** - full fine-tuning, process reward models, MCTS, and distillation can push results further with more compute.

---

**Previous:** [Chapter 5 - GRPO](../ch05_grpo/learn_grpo.ipynb)  

**Congratulations!** You've now walked through the entire RL Post-Training Handbook codebase. You understand:
- How to load and efficiently fine-tune large models (QLoRA)
- How to generate diverse training data (rollouts with temperature)
- How to design reward functions (verifiable rewards + think bonuses)
- How GRPO eliminates the need for a critic (group-relative advantages)
- How to assemble everything into a working training loop
- How to evaluate and iterate on the results