# Lesson 3: The Verifier (Best-of-N)

In Lesson 2 (which we skipped to get here), we saw that **Majority Voting failed** for the math problem "Is 91 prime?".
The model `mistral:7b` confidently said "Yes" in 5 out of 5 samples, even though 91 = 7 × 13 is composite.
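
To make the failure mode concrete, here is a minimal sketch of majority voting over sampled answers. The helper `majority_vote` and the `samples` list are hypothetical stand-ins for Lesson 2's code, which is not shown here; the point is only that unanimous agreement on a wrong answer leaves voting nothing to correct.

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the fraction of samples that chose it."""
    # Normalize case and whitespace so "Yes" and " yes" count as the same vote.
    counts = Counter(a.strip().lower() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical samples mirroring the 5-out-of-5 result described above.
samples = ["Yes", "Yes", "Yes", "Yes", "Yes"]
winner, agreement = majority_vote(samples)
print(winner, agreement)  # yes 1.0
```

Full agreement here measures consistency, not correctness, which is why a separate verification step is needed.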

This happens when a model has a strong **System 1 bias** (intuition) that is incorrect.
To fix this, we need **System 2 thinking** (Reasoning) and a **Verifier**.

**New Strategy:**
1.  **Generate** reasoning traces ("Think step by step...").
2.  **Verify** the logic (Is the math correct?).
3.  **Select** the best answer based on verification scores (Best-of-N).

In [1]:
import time
from rich.console import Console
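try:
    import ollama
except ImportError as exc:
    # Fail here instead of hitting a NameError on `ollama` in a later cell.
    raise ImportError("The `ollama` package is required: pip install ollama") from exc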

console = Console()
MODEL_NAME = "mistral:7b" 

## 1. The Generator (Thinker)
We ask the model to "Think step-by-step". This encourages it to output a Chain of Thought (CoT).
Often, the mere act of writing out the steps allows the model to catch its own error.

In [2]:
def generate_thought(prompt: str, temp: float = 0.7) -> str:
    """Generates a step-by-step reasoning trace."""
    response = ollama.chat(
        model=MODEL_NAME,
        messages=[
            {"role": "system", "content": "You are a math expert. Think step-by-step to check for factors. Show your work."},
            {"role": "user", "content": prompt}
        ],
        options={"temperature": temp}
    )
    return response['message']['content'].strip()

## 2. The Verifier (LLM-as-a-Judge)
In a production TTRL system, this would be a trained Reward Model.
Here, we approximate one with "Self-Correction": the same model is asked to look at its own work and judge it.
We ask: *"Check the math. Determine if the reasoning is sound."*

In [3]:
def verify_solution(problem: str, solution: str) -> float:
    """
    Asks the model to critique the reasoning. Returns a score 0.0 to 1.0.
    """
    verifier_prompt = f"""
    Problem: {problem}
    Proposed Solution: {solution}
    
    Task: Check the math calculations in the solution. 
    If you find ANY calculation error (e.g. a claim that 7*13 is not 91), score it 0.
    If the reasoning is sound and concludes correctly, score it 1.
    Reply with ONLY the score (0 or 1).
    """
    
    response = ollama.chat(
        model=MODEL_NAME, # Self-Correction
        messages=[{"role": "user", "content": verifier_prompt}],
        options={"temperature": 0.0} # Deterministic for judging
    )
    
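    content = response['message']['content'].strip()
    # Parse only the leading token. A substring test such as `"1" in content`
    # would also accept replies like "0 (7*13 = 91)" or "0.1".
    first_token = content.split()[0].strip(".,:;*") if content else ""
    return 1.0 if first_token in {"1", "1.0"} else 0.0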
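
An LLM judge can repeat the generator's own mistakes, so it can help to pair it with a deterministic check. The sketch below is not part of the original lesson: `check_arithmetic_claims` is a hypothetical helper that scans a reasoning trace for simple `a * b = c` statements and rejects the trace if any of them is false. It assumes claims are written with `*`, `x`, or `×`, and it cannot detect a missing or skipped step.

```python
import re

# Matches simple multiplication claims such as "7 * 13 = 91" or "7 x 13 = 91".
_CLAIM = re.compile(r"(\d+)\s*[*x×]\s*(\d+)\s*=\s*(\d+)")

def check_arithmetic_claims(solution: str) -> float:
    """Return 0.0 if any 'a * b = c' claim in the text is false, else 1.0."""
    for a, b, c in _CLAIM.findall(solution):
        if int(a) * int(b) != int(c):
            return 0.0
    # No false claim found (this includes traces that contain no claims at all).
    return 1.0
```

One possible design is to multiply this score with the LLM judge's score, so that a trace passes only when both checks accept it.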

## 3. Best-of-N Search Loop
We run the loop:
1.  Generate a solution.
2.  Verify it.
3.  Keep the one with the highest Verification Score.

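This is simpler than PPO training because no weights are updated, yet spending extra compute at inference time often yields large accuracy gains (the general idea of test-time compute popularized by **OpenAI o1**).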

In [4]:
PROBLEM = "Is 91 a prime number?"
N_SAMPLES = 5

console.print(f"\n[bold yellow]Running Best-of-N Search (N={N_SAMPLES})...[/bold yellow]")
console.print(f"Problem: {PROBLEM}")

best_score = -1.0
best_solution = ""

for i in range(N_SAMPLES):
    # 1. Generate
    solution = generate_thought(PROBLEM, temp=0.8)
    
    # 2. Verify
    score = verify_solution(PROBLEM, solution)
    
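    color = "green" if score > 0.5 else "red"
    # Build the preview outside the f-string: a backslash inside an f-string
    # expression is a SyntaxError before Python 3.12.
    preview = solution[:150].replace("\n", " ")
    console.print(f"\n[bold]Sample {i+1} (Score: {score}):[/bold]")
    console.print(f"[{color}]{preview}...[/{color}]")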
    
    if score > best_score:
        best_score = score
        best_solution = solution
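
The loop above can also be written as a reusable function. This is a minimal sketch rather than the lesson's own implementation: `best_of_n` is a hypothetical name, and the generator and verifier are passed in as callables so the selection logic can be exercised without a running model. It assumes `n >= 1`.

```python
from typing import Callable

def best_of_n(
    problem: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], float],
    n: int = 5,
) -> tuple[str, float]:
    """Sample n candidates and return the (solution, score) pair with the highest score."""
    candidates = [generate(problem) for _ in range(n)]
    scored = [(solution, verify(problem, solution)) for solution in candidates]
    # max() returns the first maximal item, so ties go to the earliest sample,
    # matching a strict `score > best_score` comparison.
    return max(scored, key=lambda pair: pair[1])

# Stub generator and verifier (assumptions for illustration only).
fake_outputs = iter(["91 is prime.", "91 = 7 * 13, so it is not prime.", "91 is prime."])
solution, score = best_of_n(
    "Is 91 a prime number?",
    generate=lambda p: next(fake_outputs),
    verify=lambda p, s: 1.0 if "7 * 13" in s else 0.0,
    n=3,
)
print(score)  # 1.0
```

With the notebook's functions, the call would look like `best_of_n(PROBLEM, lambda p: generate_thought(p, temp=0.8), verify_solution, n=N_SAMPLES)`.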

## 4. Final Result
If TTRL works, at least one sampled solution should have realized that 91 is divisible by 7 and is therefore not prime.

In [5]:
console.print(f"\n[bold cyan]🏆 Best Verification Score:[/bold cyan] {best_score}")

if best_score == 1.0:
    console.print("[bold green]TTRL SUCCESS:[/bold green] Found a valid reasoning path!")
    console.print(f"Best Thought:\n{best_solution}")
else:
    console.print("[bold red]FAILURE:[/bold red] Could not find a verified solution.")
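
Because the verifier is itself a model, a score of 1.0 does not guarantee that the selected trace is right. For this particular problem the ground truth is easy to confirm by trial division. The helper `smallest_factor` below is a hypothetical addition, shown only as a sanity check.

```python
from typing import Optional

def smallest_factor(n: int) -> Optional[int]:
    """Return the smallest factor d of n with 2 <= d <= sqrt(n), or None if there is none."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return None

factor = smallest_factor(91)
print(factor, 91 // factor)  # 7 13
```

A result of `None` for an integer greater than 1 would mean the number is prime; for 91 the function finds 7, confirming 91 = 7 × 13.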