# Lesson 4: Scaling TTRL on GSM8K

We have seen TTRL work on individual examples. Now let's test it on a benchmark.
We will use the **GSM8K (Grade School Math 8K)** dataset, a widely used benchmark of grade-school math word problems that require multi-step reasoning.

**Our Experiment:**
1.  Take 5 random problems from GSM8K.
2.  Run the Base Model (Zero-Shot).
3.  Run the TTRL Model (Best-of-N).
4.  Compare the "Accuracy Lift".

In [5]:
import re
from datasets import load_dataset
from rich.console import Console
from rich.table import Table
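try:
    import ollama
except ImportError as err:
    # Fail early with an actionable message instead of a NameError in a later cell
    raise ImportError("The ollama client is missing; run: pip install ollama") from err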

console = Console()
MODEL_NAME = "mistral:7b"

## 1. Load Dataset
We load the `gsm8k` dataset from Hugging Face.

In [6]:
console.print("[yellow]Loading GSM8K...[/yellow]")
dataset = load_dataset("gsm8k", "main", split="test")

# Select 5 random examples; the fixed seed keeps the sample reproducible
examples = dataset.shuffle(seed=42).select(range(5))

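def extract_answer(text: str):
    """Extracts the number after #### in GSM8K solutions (commas removed), or None"""
    # Allow a leading minus sign, thousands separators, and an optional decimal part
    match = re.search(r'####\s*(-?[\d,]+(?:\.\d+)?)', text)
    return match.group(1).replace(",", "") if match else None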
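
To make the `####` convention concrete, here is a minimal, self-contained sketch of the same kind of parsing. The function name `extract_final_number` and the sample strings are hypothetical, written for this illustration rather than taken from the dataset.

```python
import re

def extract_final_number(solution: str):
    """Return the number after '####' with thousands separators removed, or None."""
    match = re.search(r"####\s*(-?[\d,]+(?:\.\d+)?)", solution)
    return match.group(1).replace(",", "") if match else None

# Hypothetical solution strings in the GSM8K style (not real dataset rows)
samples = [
    "She sells 9 eggs at $2 each.\n9 * 2 = 18\n#### 18",
    "The total is 1,250 dollars.\n#### 1,250",
    "No final answer marker here.",
]
parsed = [extract_final_number(s) for s in samples]
print(parsed)  # ['18', '1250', None]
```

Returning `None` when the marker is missing lets the caller decide how to score an unparseable reference instead of crashing.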

## 2. The Evaluator
We define a function to evaluate the model.

In [7]:
def solve_problem(prompt: str, method="greedy") -> str:
    """Solves using either Greedy (Baseline) or TTRL (Best-of-N)"""
    
    if method == "greedy":
        # Simple Zero-Shot; temperature 0 so the baseline is actually (near-)greedy
        response = ollama.chat(model=MODEL_NAME, messages=[{"role": "user", "content": prompt}], options={"temperature": 0})
        return response['message']['content']
    
    elif method == "ttrl":
        # Best-of-3 with Verification
        best_score = -1
        best_ans = ""
        
        for _ in range(3):
            # Generate
            cand = ollama.chat(model=MODEL_NAME, messages=[{"role": "user", "content": prompt}], options={"temperature": 0.8})
            content = cand['message']['content']
            
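            # Verify (Self-Check)
            check_prompt = f"Question: {prompt}\nAnswer: {content}\nIs this correct? Reply with only the digit 1 (yes) or 0 (no)."
            check = ollama.chat(model=MODEL_NAME, messages=[{"role": "user", "content": check_prompt}], options={"temperature":0})
            
            # Read the first standalone 0/1; a plain `"1" in ...` test would also fire on "10" or "18"
            verdict = re.search(r'\b[01]\b', check['message']['content'])
            score = 1.0 if verdict and verdict.group() == "1" else 0.0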
            if score > best_score:
                best_score = score
                best_ans = content
        
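        return best_ans

    # Any other value would otherwise silently return None
    raise ValueError(f"Unknown method: {method!r}")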
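
The selection logic in the `"ttrl"` branch can be exercised without a running model. The sketch below is an assumption-laden simplification: `best_of_n`, `generate`, and `verify` are hypothetical names, and the lambdas are stubs standing in for the two `ollama.chat` calls.

```python
from typing import Callable

def best_of_n(generate: Callable[[], str], verify: Callable[[str], float], n: int = 3) -> str:
    """Sample n candidates and return the one with the highest verifier score.

    Ties keep the earliest candidate because the comparison is a strict `>`.
    """
    best_score = float("-inf")
    best = ""
    for _ in range(n):
        cand = generate()
        score = verify(cand)
        if score > best_score:
            best_score, best = score, cand
    return best

# Stub generator and verifier (hypothetical stand-ins for the model calls)
candidates = iter(["The answer is 17", "The answer is 18", "The answer is 18."])
picked = best_of_n(lambda: next(candidates), lambda c: 1.0 if "18" in c else 0.0)
print(picked)  # The answer is 18
```

Because the verifier only returns 0 or 1, the first accepted candidate wins; if none is accepted, the first sample is returned as a fallback.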

## 3. The Experiment
We run the loop comparing Baseline vs TTRL.

In [8]:
table = Table(title="GSM8K Results")
table.add_column("Problem", style="dim", width=30)
table.add_column("Baseline", justify="center")
table.add_column("TTRL", justify="center")
table.add_column("Truth", justify="center")

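def is_correct(truth, response: str) -> bool:
    """True if the reference number appears as a standalone number in the response."""
    if truth is None:
        return False
    # Plain `truth in response` would accept "5" inside "15"; require number boundaries
    pattern = rf"(?<![\d.]){re.escape(truth)}(?!\d)"
    return re.search(pattern, response.replace(",", "")) is not None

for ex in examples:
    q = ex['question']
    truth = extract_answer(ex['answer'])
    
    # Run Baseline
    base_raw = solve_problem(q, method="greedy")
    base_correct = is_correct(truth, base_raw)  # Simple number match for tutorial
    
    # Run TTRL
    ttrl_raw = solve_problem(q, method="ttrl")
    ttrl_correct = is_correct(truth, ttrl_raw)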
    
    table.add_row(
        q[:30]+"...", 
        "✅" if base_correct else "❌", 
        "✅" if ttrl_correct else "❌", 
        truth
    )

console.print(table)
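
The table shows per-problem outcomes, but the "Accuracy Lift" from step 4 is a single number: TTRL accuracy minus baseline accuracy. A minimal sketch follows, assuming the loop above were extended to append `base_correct` and `ttrl_correct` to two lists. The helper names and the flag values are hypothetical and are not measured results.

```python
def accuracy(flags):
    """Fraction of True values; 0.0 for an empty list."""
    return sum(flags) / len(flags) if flags else 0.0

def accuracy_lift(base_flags, ttrl_flags):
    """Difference in accuracy (TTRL minus baseline), as a fraction."""
    return accuracy(ttrl_flags) - accuracy(base_flags)

# Hypothetical per-problem outcomes, for illustration only (not measured results)
base_flags = [True, False, False, True, False]
ttrl_flags = [True, True, False, True, False]

lift = accuracy_lift(base_flags, ttrl_flags)
print(f"Baseline {accuracy(base_flags):.0%}, TTRL {accuracy(ttrl_flags):.0%}, lift {lift:+.0%}")
```

With only 5 problems, each one moves accuracy by 20 percentage points, so the lift is best read as a smoke test rather than a reliable estimate.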