# Reasoning Models

In this notebook, you'll explore the empirical differences between base models using chain-of-thought prompting and models that reason more effectively, and investigate the core concepts from the lesson: self-consistency, process vs outcome evaluation, and test-time compute allocation.

**What you'll do:**
- Compare a base model with CoT prompting against a reasoning-focused model on 10 math/reasoning problems, predicting which problems each will get right before running
- Run self-consistency experiments: generate N chains (N=1,3,5,10,20), compute majority-vote accuracy, and plot accuracy vs N to find the point of diminishing returns
- Evaluate reasoning chains two ways (process vs outcome) on 5 problems with known solutions, finding cases where the final answer is correct but a step is wrong
- Design a test-time compute allocation experiment: fixed compute budget, equal vs adaptive allocation across problems of varying difficulty

**For each exercise, PREDICT the output before running the cell.** Wrong predictions are more valuable than correct ones—they reveal gaps in your mental model.

In [None]:
# Setup — self-contained for Google Colab
!pip install -q openai

import os
import json
import textwrap
import random
import re
from collections import Counter
from openai import OpenAI
import matplotlib.pyplot as plt
import numpy as np

# --- API Key Setup ---
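# Option 1 (recommended): store the key in Colab Secrets (key icon in the
#   left sidebar) as OPENAI_API_KEY. Secrets are not exported as environment
#   variables automatically, so load the key explicitly:
#   from google.colab import userdata
#   os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
# Option 2: Paste it directly (less secure, don't commit this)
#   os.environ["OPENAI_API_KEY"] = "sk-..."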

# You can also use any OpenAI-compatible API (e.g., local Ollama, Together AI)
# by changing the base_url:
#   client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

client = OpenAI()

# --- Model Configuration ---
# BASE_MODEL: a standard model that follows CoT prompts but wasn't specifically
# trained with RL for reasoning. We use temperature>0 for self-consistency.
BASE_MODEL = "gpt-4o-mini"

# REASONING_MODEL: a model trained with RL to reason effectively.
# o3-mini is OpenAI's reasoning model — it generates internal reasoning
# tokens before producing an answer. If you don't have access to o3-mini,
# you can substitute another reasoning-focused model.
REASONING_MODEL = "o3-mini"

# Nice plots
plt.style.use('dark_background')
plt.rcParams['figure.figsize'] = [10, 4]

# Seed local randomness (this does not make API sampling deterministic)
random.seed(42)
np.random.seed(42)


def call_llm(prompt: str, model: str = BASE_MODEL,
             temperature: float = 0.0, max_tokens: int = 1024) -> str:
    """Call the LLM with a single prompt. Returns the response text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content.strip()


def call_llm_with_usage(prompt: str, model: str = BASE_MODEL,
                        temperature: float = 0.0,
                        max_tokens: int = 1024) -> tuple[str, int]:
    """Call the LLM and return (response_text, completion_tokens)."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=max_tokens,
    )
    text = response.choices[0].message.content.strip()
    tokens = response.usage.completion_tokens
    return text, tokens


def call_reasoning_model(prompt: str,
                         model: str = REASONING_MODEL) -> tuple[str, int]:
    """Call the reasoning model. Returns (response_text, completion_tokens).

    Reasoning models (like o3-mini) handle their own chain-of-thought
    internally — you don't need to prompt them with 'think step by step.'
    The model generates internal reasoning tokens (which you may or may not
    see) and then produces an answer.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content.strip()
    tokens = response.usage.completion_tokens
    return text, tokens


def print_wrapped(text: str, width: int = 80, prefix: str = ""):
    """Print text with word wrapping for readability."""
    for line in text.split("\n"):
        wrapped = textwrap.fill(line, width=width, initial_indent=prefix,
                                subsequent_indent=prefix)
        print(wrapped)


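def extract_number(text: str) -> int | None:
    """Extract the last number from a response string as an int.

    Handles thousands separators ("1,904") and whole-valued decimals
    ("$54.00" -> 54). Returns None if no number is found or if the
    last number is not a whole value.
    """
    # Strip commas first so "1,904" is read as one number.
    numbers = re.findall(r'-?\d+(?:\.\d+)?', text.replace(',', ''))
    if not numbers:
        return None
    value = float(numbers[-1])
    return int(value) if value.is_integer() else None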


def majority_vote(answers: list[int | None]) -> int | None:
    """Return the most common non-None answer, or None if all are None."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    counts = Counter(valid)
    return counts.most_common(1)[0][0]


# Quick test to verify the API is working
test = call_llm("Say 'API connection successful' and nothing else.")
print(test)
print(f"\nBase model: {BASE_MODEL}")
print(f"Reasoning model: {REASONING_MODEL}")
print("Setup complete.")
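
For reasoning models, `usage.completion_tokens` counts the hidden reasoning tokens together with the visible answer. The sketch below separates the two. It assumes the response exposes `usage.completion_tokens_details.reasoning_tokens`, which the OpenAI API reports for its reasoning models; other OpenAI-compatible servers may omit the field, so the hypothetical helper `split_completion_tokens` (not part of the notebook) falls back to zero.

```python
from types import SimpleNamespace


def split_completion_tokens(usage) -> tuple[int, int]:
    """Return (reasoning_tokens, visible_tokens) from a usage object.

    Hypothetical helper. Assumes `usage` has `completion_tokens` and,
    optionally, `completion_tokens_details.reasoning_tokens`.
    Missing fields count as 0.
    """
    total = getattr(usage, "completion_tokens", 0) or 0
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", 0) or 0
    return reasoning, total - reasoning


# Stand-in usage objects so the sketch runs without an API call.
with_details = SimpleNamespace(
    completion_tokens=900,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=768),
)
without_details = SimpleNamespace(completion_tokens=120)

print(split_completion_tokens(with_details))     # (768, 132)
print(split_completion_tokens(without_details))  # (0, 120)
```

In the notebook, the same function could be applied to `response.usage` inside `call_reasoning_model` if you want to report reasoning and answer tokens separately.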

## Shared Data

All exercises use math and reasoning problems of varying complexity. We define them here so exercises can share them. The problems range from single-step arithmetic to multi-step word problems that require chaining several operations.

In [None]:
# --- Problems for Exercises 1-4 ---
# Each entry: (description, problem_text, correct_answer, num_steps, difficulty)
# difficulty: "easy" (1-2 steps), "medium" (2-3 steps), or "hard" (3+ steps, chained reasoning)

PROBLEMS = [
    # Easy problems (1-2 steps)
    ("simple multiply",
     "What is 8 x 7?",
     56, 1, "easy"),

    ("two-digit addition",
     "What is 47 + 86?",
     133, 1, "easy"),

    ("simple word problem",
     "A store has 3 shelves with 8 books each. They remove 5 books. How many books are left?",
     19, 2, "easy"),

    # Medium problems (2-3 steps)
    ("two-digit multiply",
     "What is 17 x 24?",
     408, 3, "medium"),

    ("three operations",
     "What is (15 x 8) + (12 x 3)?",
     156, 3, "medium"),

    ("word problem",
     "Alex earns $12/hour. He worked 5 hours Monday, 3 hours Tuesday, and 4 hours Wednesday. How much did he earn total?",
     144, 2, "medium"),

    # Hard problems (3+ steps, chained reasoning)
    ("chained operations",
     "What is 13 x 17 + 8 x 9 - 45?",
     248, 4, "hard"),

    ("multi-step word problem",
     "A farmer has 4 fields, each with 6 rows of corn. Each row has 15 plants. He loses 10% of the plants to pests. How many plants survive?",
     324, 4, "hard"),

    ("percentage reasoning",
     "A shirt costs $80. It's on sale for 25% off. Then you have a coupon for 10% off the sale price. What is the final price?",
     54, 3, "hard"),

    ("ratio word problem",
     "Three friends split a bill. The bill is $156. Alice pays twice as much as Bob, and Carol pays three times as much as Bob. How much does Alice pay?",
     52, 4, "hard"),
]

print(f"Loaded {len(PROBLEMS)} problems:")
for diff in ["easy", "medium", "hard"]:
    count = sum(1 for _, _, _, _, d in PROBLEMS if d == diff)
    print(f"  {diff}: {count}")
print("\nData loaded.")

---

## Exercise 1: Base Model CoT vs Reasoning Model (Guided)

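The lesson demonstrated that RL-trained reasoning models produce *consistently* better reasoning chains than base models with CoT prompting, and attributed the difference to training rather than architecture or parameter count. In this exercise, you'll test that claim on 10 problems. Note that the architectures and parameter counts of the two API models used here are not public, so this is an empirical comparison rather than a controlled one.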

For each problem, you'll run two models:
- **Base model + CoT prompt:** A standard model prompted to "think step by step"
- **Reasoning model:** A model trained with RL to reason effectively (generates internal reasoning tokens)

The first 3 problems are fully worked with analysis. The remaining 7 run with the same pattern.

**Before running, predict:**
- Which problems will the base model get wrong that the reasoning model gets right? (Hint: the computational complexity criterion from the CoT lesson still applies, but reasoning models handle complex problems more reliably)
- Will there be problems where the base model gets it right and the reasoning model gets it wrong?
- How will the reasoning *quality* differ, beyond just the final answer?

In [None]:
# --- Step 1: Run the first 3 problems with detailed analysis ---
# We'll show the full reasoning chains to compare quality, not just accuracy.

print("=" * 70)
print("DETAILED ANALYSIS: First 3 Problems")
print("=" * 70)

detailed_results = []

for i, (desc, problem, answer, steps, diff) in enumerate(PROBLEMS[:3]):
    print(f"\n{'=' * 70}")
    print(f"Problem {i+1}: {desc} (difficulty: {diff}, steps: {steps})")
    print(f"Question: {problem}")
    print(f"Correct answer: {answer}")
    print(f"{'=' * 70}")

    # Base model with CoT
    cot_prompt = f"{problem}\n\nLet's work through this step by step, then give the final answer."
    base_response = call_llm(cot_prompt, model=BASE_MODEL, max_tokens=500)
    base_answer = extract_number(base_response)
    base_correct = base_answer == answer

    print(f"\n--- Base Model + CoT ---")
    print_wrapped(base_response)
    print(f"\nExtracted: {base_answer} | Correct: {'Yes' if base_correct else 'NO'}")

    # Reasoning model
    reasoning_response, reasoning_tokens = call_reasoning_model(problem)
    reasoning_answer = extract_number(reasoning_response)
    reasoning_correct = reasoning_answer == answer

    print(f"\n--- Reasoning Model ---")
    print_wrapped(reasoning_response)
    print(f"\nExtracted: {reasoning_answer} | Correct: {'Yes' if reasoning_correct else 'NO'}")
    print(f"Completion tokens used (includes hidden reasoning tokens): {reasoning_tokens}")

    detailed_results.append({
        "desc": desc, "answer": answer, "steps": steps, "diff": diff,
        "base_correct": base_correct, "reasoning_correct": reasoning_correct,
        "base_response": base_response, "reasoning_response": reasoning_response,
        "reasoning_tokens": reasoning_tokens,
    })

    # Analysis for this problem
    print(f"\n--- Analysis ---")
    if base_correct and reasoning_correct:
        print("Both correct. Compare the reasoning quality:")
        print("  - Is the base model's chain structured and checkable?")
        print("  - Is the reasoning model's chain more systematic?")
    elif not base_correct and reasoning_correct:
        print("Base model WRONG, reasoning model CORRECT.")
        print("  The RL training produced a more reliable chain.")
    elif base_correct and not reasoning_correct:
        print("Base model CORRECT, reasoning model WRONG.")
        print("  Surprising! Reasoning models aren't infallible.")
    else:
        print("Both WRONG. Check whether the error is in the reasoning")
        print("  or in answer extraction (extract_number takes the last number).")

print("\nDetailed analysis complete for first 3 problems.")

In [None]:
# --- Step 2: Run remaining 7 problems ---

all_results = list(detailed_results)  # Copy the first 3

print("Remaining problems:")
print(f"{'Problem':<30} | {'Answer':>6} | {'Base':>7} | {'Reasoning':>10}")
print("-" * 70)

for desc, problem, answer, steps, diff in PROBLEMS[3:]:
    # Base model with CoT
    cot_prompt = f"{problem}\n\nLet's work through this step by step, then give the final answer."
    base_response = call_llm(cot_prompt, model=BASE_MODEL, max_tokens=500)
    base_answer = extract_number(base_response)
    base_correct = base_answer == answer

    # Reasoning model
    reasoning_response, reasoning_tokens = call_reasoning_model(problem)
    reasoning_answer = extract_number(reasoning_response)
    reasoning_correct = reasoning_answer == answer

    sym_b = "\u2713" if base_correct else "\u2717"
    sym_r = "\u2713" if reasoning_correct else "\u2717"
    print(f"{desc:<30} | {answer:>6} | {str(base_answer):>5} {sym_b} | {str(reasoning_answer):>8} {sym_r}")

    all_results.append({
        "desc": desc, "answer": answer, "steps": steps, "diff": diff,
        "base_correct": base_correct, "reasoning_correct": reasoning_correct,
        "base_response": base_response, "reasoning_response": reasoning_response,
        "reasoning_tokens": reasoning_tokens,
    })

print("\nAll 10 problems tested.")

In [None]:
# --- Step 3: Summary and visualization ---

base_total = sum(r["base_correct"] for r in all_results)
reasoning_total = sum(r["reasoning_correct"] for r in all_results)

print("OVERALL ACCURACY")
print("=" * 50)
print(f"Base model + CoT:  {base_total}/{len(all_results)} ({base_total/len(all_results):.0%})")
print(f"Reasoning model:   {reasoning_total}/{len(all_results)} ({reasoning_total/len(all_results):.0%})")

# By difficulty
for diff in ["easy", "medium", "hard"]:
    subset = [r for r in all_results if r["diff"] == diff]
    if not subset:
        continue
    b_acc = sum(r["base_correct"] for r in subset) / len(subset)
    r_acc = sum(r["reasoning_correct"] for r in subset) / len(subset)
    print(f"\n{diff.upper()} ({len(subset)} problems):")
    print(f"  Base + CoT:    {b_acc:.0%}")
    print(f"  Reasoning:     {r_acc:.0%}")
    print(f"  Improvement:   {r_acc - b_acc:+.0%}")

# Bar chart by difficulty
fig, ax = plt.subplots(figsize=(8, 5))
diffs = ["easy", "medium", "hard"]
base_accs = []
reason_accs = []
for d in diffs:
    subset = [r for r in all_results if r["diff"] == d]
    base_accs.append(sum(r["base_correct"] for r in subset) / len(subset) * 100 if subset else 0)
    reason_accs.append(sum(r["reasoning_correct"] for r in subset) / len(subset) * 100 if subset else 0)

x = np.arange(len(diffs))
width = 0.35
bars1 = ax.bar(x - width/2, base_accs, width, label='Base + CoT', color='#f59e0b',
               edgecolor='white', linewidth=0.5)
bars2 = ax.bar(x + width/2, reason_accs, width, label='Reasoning Model', color='#a78bfa',
               edgecolor='white', linewidth=0.5)

ax.set_ylabel('Accuracy (%)', fontsize=11)
ax.set_title('Base Model + CoT vs Reasoning Model by Difficulty', fontsize=13, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([d.capitalize() for d in diffs])
ax.set_ylim(0, 115)
ax.legend()
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

for bar in list(bars1) + list(bars2):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 2,
            f'{bar.get_height():.0f}%', ha='center', va='bottom', fontsize=10, color='white')

plt.tight_layout()
plt.show()

# Cases where reasoning model succeeded and base failed
reasoning_wins = [r for r in all_results if r["reasoning_correct"] and not r["base_correct"]]
if reasoning_wins:
    print(f"\nProblems where reasoning model won and base model lost ({len(reasoning_wins)}):")
    for r in reasoning_wins:
        print(f"  - {r['desc']} (difficulty: {r['diff']}, {r['steps']} steps)")

print("\nKey insight: RL training produces CONSISTENTLY better reasoning chains,")
print("not just longer ones. Same architecture, different training. The reasoning")
print("model has learned to use the scratchpad (context window) effectively.")

**What just happened:** You compared a base model with CoT prompting against a reasoning model on 10 problems of varying difficulty. The reasoning model should show higher and more consistent accuracy, especially on the hard multi-step problems.

The key difference is not the architecture or parameter count—it's the training. The reasoning model was trained with RL where the reward signal is answer correctness. It learned to use the scratchpad (context window) more effectively: structured, checkable chains rather than the hit-or-miss quality of base model CoT.

This is the lesson's core "of course" moment: you already knew tokens are computation (CoT lesson). You already knew RL can shape behavior (RLHF). Of course you can use RL to shape reasoning behavior.

---

## Exercise 2: Self-Consistency Experiment (Supported)

The lesson took self-consistency from a passing mention to a developed technique: generate N reasoning chains, then take the majority vote. In this exercise, you'll measure how accuracy changes as N increases from 1 to 20.

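The mechanism: if each chain is independently correct with probability p > 0.5, the majority vote over N chains is correct more often than a single chain, and the advantage grows with N. (When wrong answers are spread across many different values, voting can succeed even with p < 0.5, as long as the correct answer is the single most common one.) Different chains make different errors, and voting averages out that noise.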
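
The mechanism can be checked numerically with a binomial calculation. This is a simplified sketch, not something measured in the notebook: it assumes every chain is correct with the same probability p, that chains are independent, and that every wrong chain votes against the correct answer. The function name `majority_correct_prob` is hypothetical, and odd N is used to avoid ties.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """P(a strict majority of n independent chains is correct).

    Simplifying assumptions: identical per-chain accuracy p,
    independent chains, n odd (no ties).
    """
    need = n // 2 + 1  # smallest count of correct chains that is a strict majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))


for n in [1, 3, 5, 11, 21]:
    print(f"N={n:>2}: {majority_correct_prob(0.7, n):.3f}")
```

Chains sampled from one model tend to share failure modes, so real gains are usually smaller than this independent model suggests. For even N, ties are possible; the notebook's `majority_vote` relies on `Counter.most_common`, which orders equal counts by first appearance.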

**The first problem is fully set up with code for N=1 and N=3.** You'll extend the pattern to N=5, 10, 20 and then to additional problems.

**Before running, predict:**
- Will accuracy increase monotonically with N, or will it plateau?
- At what N will you see diminishing returns?
- Will the benefit of more chains be larger for harder problems?

In [None]:
# --- Step 1: Self-consistency for the first problem (fully worked) ---
# We use temperature > 0 to get diverse chains.
# Each chain is a different sample from the model's distribution.

SELF_CONSISTENCY_PROBLEMS = [
    ("two-digit multiply", "What is 34 x 56?", 1904),
    ("chained operations", "What is 13 x 17 + 8 x 9 - 45?", 248),
    ("percentage word problem",
     "A shirt costs $80. It's on sale for 25% off. Then you have a coupon for 10% off the sale price. What is the final price?",
     54),
    ("ratio word problem",
     "Three friends split a bill. The bill is $156. Alice pays twice as much as Bob, and Carol pays three times as much as Bob. How much does Alice pay?",
     52),
    ("multi-step word problem",
     "A farmer has 4 fields, each with 6 rows of corn. Each row has 15 plants. He loses 10% of the plants to pests. How many plants survive?",
     324),
]

# Demonstrate with the first problem, N=1 and N=3
demo_desc, demo_problem, demo_answer = SELF_CONSISTENCY_PROBLEMS[0]
print(f"Problem: {demo_problem}")
print(f"Correct answer: {demo_answer}")
print()

cot_prompt = f"{demo_problem}\n\nLet's work through this step by step, then give the final answer."

# N=1: single chain
chain_1 = call_llm(cot_prompt, temperature=0.7, max_tokens=500)
answer_1 = extract_number(chain_1)
print(f"N=1: Single chain answer = {answer_1} ({'correct' if answer_1 == demo_answer else 'WRONG'})")
print(f"  Chain: {chain_1[:120]}...")
print()

# N=3: three chains, majority vote
chains_3 = []
answers_3 = []
for i in range(3):
    chain = call_llm(cot_prompt, temperature=0.7, max_tokens=500)
    ans = extract_number(chain)
    chains_3.append(chain)
    answers_3.append(ans)
    print(f"  Chain {i+1}: answer = {ans}")

vote_3 = majority_vote(answers_3)
print(f"\nN=3: Majority vote = {vote_3} ({'correct' if vote_3 == demo_answer else 'WRONG'})")
print(f"  Individual answers: {answers_3}")
print(f"  Voting averages out errors from individual chains.")

In [None]:
# --- Step 2: Extend to all N values for all problems ---
# TODO: For each problem, generate chains at N = 1, 3, 5, 10, 20.
# At each N, use majority voting to get the answer.
# Track whether the majority-vote answer is correct.
#
# Strategy: Generate all 20 chains once (the max N), then take the
# majority vote of the first 1, 3, 5, 10, 20 answers. This is more
# efficient than generating separately for each N.
#
# The result should be stored in sc_results:
#   sc_results[problem_index][N] = {"correct": bool, "vote": int, "answers": list}
#
# Starter code below. Fill in the TODO sections.

N_VALUES = [1, 3, 5, 10, 20]
sc_results = {}

for p_idx, (desc, problem, answer) in enumerate(SELF_CONSISTENCY_PROBLEMS):
    print(f"\nProblem {p_idx+1}: {desc} (answer: {answer})")
    cot_prompt = f"{problem}\n\nLet's work through this step by step, then give the final answer."

    # Generate 20 chains
    all_answers = []
    for i in range(20):
        chain = call_llm(cot_prompt, temperature=0.7, max_tokens=500)
        ans = extract_number(chain)
        all_answers.append(ans)
    print(f"  Generated 20 chains. Answers: {all_answers}")

    # TODO: For each N in N_VALUES, take the first N answers from all_answers
    # and use majority_vote() to get the consensus answer.
    # Store results in sc_results[p_idx] as a dictionary keyed by N.
    # Each entry should track: the vote, whether it's correct, and the answer subset.
    # Print the vote and correctness for each N.
    #
    # Hint: You already saw the pattern in Step 1 with N=3.
    # This is the same idea, looped over multiple N values.

    # YOUR CODE HERE (5-8 lines)


print("\nSelf-consistency experiment complete.")

In [None]:
# --- Step 3: Plot accuracy vs N ---
# TODO: Create a plot showing accuracy (y-axis) vs N (x-axis).
#
# Compute the accuracy at each N across all problems:
#   For each N, accuracy = (number of problems where majority vote is correct) / total problems
#
# Also overlay individual problem correctness as scatter points to show
# which problems benefit most from more chains.
#
# Useful data:
#   sc_results[p_idx][n]["correct"]  — whether majority vote was correct
#   N_VALUES = [1, 3, 5, 10, 20]
#   len(SELF_CONSISTENCY_PROBLEMS) = 5
#
# Hints:
#   accuracies = []
#   for n in N_VALUES:
#       acc = sum(sc_results[p][n]["correct"] for p in range(5)) / 5
#       accuracies.append(acc * 100)
#
#   fig, ax = plt.subplots(figsize=(8, 5))
#   ax.plot(N_VALUES, accuracies, 'o-', ...)

# YOUR CODE HERE (15-25 lines)


<details>
<summary>Solution for Steps 2 and 3</summary>

**Step 2 — majority vote at each N:**

```python
    sc_results[p_idx] = {}
    for n in N_VALUES:
        subset = all_answers[:n]
        vote = majority_vote(subset)
        sc_results[p_idx][n] = {
            "correct": vote == answer,
            "vote": vote,
            "answers": subset,
        }
        print(f"  N={n:>2}: vote={vote}, correct={vote == answer}")
```

**Step 3 — plot:**

```python
# Compute accuracy at each N
accuracies = []
for n in N_VALUES:
    acc = sum(sc_results[p][n]["correct"] for p in range(len(SELF_CONSISTENCY_PROBLEMS))) / len(SELF_CONSISTENCY_PROBLEMS)
    accuracies.append(acc * 100)

fig, ax = plt.subplots(figsize=(8, 5))

# Main accuracy line
ax.plot(N_VALUES, accuracies, 'o-', color='#a78bfa', linewidth=2.5,
        markersize=10, label='Majority-Vote Accuracy', zorder=3)

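# Individual problem results as background dots (green = correct, red = wrong).
# The y-position (p_idx * 5 - 5) is only a display offset, not an accuracy value.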
for p_idx in range(len(SELF_CONSISTENCY_PROBLEMS)):
    for n in N_VALUES:
        correct = sc_results[p_idx][n]["correct"]
        color = '#34d399' if correct else '#f87171'
        ax.scatter(n, p_idx * 5 - 5, color=color, s=30, alpha=0.5, zorder=2)

ax.set_xlabel('Number of Chains (N)', fontsize=12)
ax.set_ylabel('Accuracy (%)', fontsize=12)
ax.set_title('Self-Consistency: Accuracy vs Number of Chains', fontsize=13, fontweight='bold')
ax.set_ylim(-5, 110)
ax.set_xticks(N_VALUES)
ax.legend(fontsize=11)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

# Label each point with its accuracy
for n, acc in zip(N_VALUES, accuracies):
    ax.annotate(f'{acc:.0f}%', (n, acc), textcoords='offset points',
                xytext=(0, 12), ha='center', fontsize=10, color='white')

plt.tight_layout()
plt.show()

# Summary
print("\nAccuracy by N:")
for n, acc in zip(N_VALUES, accuracies):
    print(f"  N={n:>2}: {acc:.0f}%")

# Identify diminishing returns
print("\nDiminishing returns:")
for i in range(1, len(accuracies)):
    improvement = accuracies[i] - accuracies[i-1]
    print(f"  N={N_VALUES[i-1]}->{N_VALUES[i]}: {improvement:+.0f} percentage points")

print("\nKey insight: more chains help up to a point. The improvement from")
print("1 to 5 chains is typically large. The improvement from 10 to 20 is")
print("marginal. And the compute cost is linear — 20 chains cost 20x more.")
print("Self-consistency trades compute for reliability, with diminishing returns.")
```

**Why this approach:** We generate all 20 chains once and take subsets, rather than re-generating for each N. This is both more efficient and scientifically cleaner — the same chains are used across conditions, isolating the effect of N rather than sampling noise.

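The accuracy curve should rise most steeply from N=1 to N=5, then flatten. The intuition: if each chain independently has (say) a 70% chance of being correct, a 3-chain majority is correct about 78% of the time and a 5-chain majority about 84%. Each further chain adds less, because the majority is already very likely correct. With only 5 problems, measured accuracy moves in 20-point steps, so expect a noisy curve.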

</details>

**What just happened:** You measured the empirical relationship between the number of reasoning chains and accuracy. Self-consistency works because different chains make different errors. Majority voting averages out these independent errors.

The diminishing returns are important: the lesson emphasized that "more reasoning tokens are not always better." Each additional chain costs the same compute, but the marginal accuracy improvement shrinks. This is a concrete example of the test-time compute scaling tradeoff—you want to allocate compute where it matters, not uniformly.
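
One way to act on that tradeoff is to stop sampling once the vote can no longer change. The sketch below is an illustration under stated assumptions, not the lesson's method: `sample_answer` is a hypothetical zero-argument callable that returns one extracted answer (for example, a wrapper around `call_llm` plus `extract_number`), and `budget` is the maximum number of chains to spend on the problem.

```python
from collections import Counter
from typing import Callable, Optional


def vote_with_early_stop(sample_answer: Callable[[], Optional[int]],
                         budget: int) -> tuple[Optional[int], int]:
    """Sample up to `budget` answers; stop once the leader cannot be caught.

    Returns (winning_answer, chains_used). None answers use up budget
    but do not receive votes.
    """
    counts: Counter = Counter()
    used = 0
    while used < budget:
        ans = sample_answer()
        used += 1
        if ans is not None:
            counts[ans] += 1
        ranked = counts.most_common(2)
        if not ranked:
            continue
        lead = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        remaining = budget - used
        # Even if every remaining chain voted for the runner-up
        # (or for a brand-new answer), it could not overtake the leader.
        if lead > runner_up + remaining:
            break
    winner = counts.most_common(1)[0][0] if counts else None
    return winner, used


# Deterministic stand-in for a model: a fixed stream of answers.
stream = iter([54, 54, 60, 54, 54, 54, 54, 54, 54, 54])
print(vote_with_early_stop(lambda: next(stream), budget=10))  # (54, 7)
```

Easy problems, where chains agree quickly, finish well under budget, and the unused chains can be spent on problems where the vote stays contested.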

---

## Exercise 3: Process vs Outcome Evaluation (Supported)

The lesson developed process supervision beyond its earlier introduction: outcome evaluation asks "did you get the right answer?" while process evaluation asks "is each step correct?" In this exercise, you'll evaluate reasoning chains both ways and find cases where the outcome is correct but a step is wrong, which is the cancelling-errors problem from the lesson.

For 5 problems with known step-by-step solutions, you'll:
1. Generate reasoning chains
2. Check the final answer (outcome evaluation)
3. Evaluate each step (process evaluation)
4. Find discrepancies—correct answer with wrong reasoning
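
A toy illustration of the two evaluation modes, using a deterministic arithmetic checker in place of an LLM or a trained PRM. Assumptions: every step is written as `a op b = c` with integer operands, and the helper names (`check_step`, `outcome_eval`, `process_eval`) are hypothetical. The second chain contains two errors that cancel, so it passes outcome evaluation but fails process evaluation.

```python
import re

STEP_RE = re.compile(r'^\s*(-?\d+)\s*([+\-x])\s*(-?\d+)\s*=\s*(-?\d+)\s*$')
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "x": lambda a, b: a * b}


def check_step(step: str) -> bool:
    """True if a step of the form 'a op b = c' is arithmetically correct."""
    m = STEP_RE.match(step)
    if m is None:
        return False  # unparseable steps are treated as failures in this toy
    a, op, b, c = int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4))
    return OPS[op](a, b) == c


def outcome_eval(chain: list[str], answer: int) -> bool:
    """Outcome evaluation: only the final result is compared."""
    return int(chain[-1].split("=")[-1]) == answer


def process_eval(chain: list[str]) -> list[bool]:
    """Process evaluation: one verdict per step."""
    return [check_step(s) for s in chain]


good_chain = ["17 x 20 = 340", "17 x 4 = 68", "340 + 68 = 408"]
cancelling = ["17 x 20 = 350", "17 x 4 = 68", "350 + 68 = 408"]  # two errors cancel

for name, chain in [("good", good_chain), ("cancelling", cancelling)]:
    print(name, outcome_eval(chain, 408), process_eval(chain))
```

Free-form model output rarely follows such a rigid step format, which is why the exercise uses an LLM as the step evaluator instead.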

**The first problem has a complete evaluation template.** You extend the pattern.

**Before running, predict:**
- Will you find cases where the final answer is correct but a step is wrong?
- Which type of problem is more likely to have cancelling errors?
- If an ORM and PRM evaluated the same chain, when would they disagree?

In [None]:
# --- Problems with known step-by-step solutions ---
# Each has the problem, correct answer, and the correct reasoning steps.
# We'll use these to evaluate the model's chain step by step.

EVAL_PROBLEMS = [
    {
        "problem": "What is 17 x 24?",
        "answer": 408,
        "steps": [
            "Break 24 into 20 + 4",
            "17 x 20 = 340",
            "17 x 4 = 68",
            "340 + 68 = 408",
        ],
    },
    {
        "problem": "A shirt costs $80. It's on sale for 25% off. Then you have a coupon for 10% off the sale price. What is the final price?",
        "answer": 54,
        "steps": [
            "25% of $80 = $20",
            "Sale price: $80 - $20 = $60",
            "10% of $60 = $6",
            "Final price: $60 - $6 = $54",
        ],
    },
    {
        "problem": "A farmer has 4 fields, each with 6 rows of corn. Each row has 15 plants. He loses 10% of the plants to pests. How many plants survive?",
        "answer": 324,
        "steps": [
            "Total rows: 4 x 6 = 24",
            "Total plants: 24 x 15 = 360",
            "Plants lost: 10% of 360 = 36",
            "Plants surviving: 360 - 36 = 324",
        ],
    },
    {
        "problem": "Three friends split a bill. The bill is $156. Alice pays twice as much as Bob, and Carol pays three times as much as Bob. How much does Alice pay?",
        "answer": 52,
        "steps": [
            "Let Bob's share = x",
            "Alice pays 2x, Carol pays 3x",
            "x + 2x + 3x = 156",
            "6x = 156, so x = 26",
            "Alice pays 2 x 26 = $52",
        ],
    },
    {
        "problem": "What is 13 x 17 + 8 x 9 - 45?",
        "answer": 248,
        "steps": [
            "13 x 17 = 221",
            "8 x 9 = 72",
            "221 + 72 = 293",
            "293 - 45 = 248",
        ],
    },
]

print(f"Loaded {len(EVAL_PROBLEMS)} evaluation problems, each with step-by-step solutions.")

In [None]:
# --- Step 1: First problem with full evaluation template ---
# We generate multiple chains (using temperature > 0) and evaluate each one
# both ways: outcome (final answer) and process (each step).
#
# NOTE: For process evaluation, we're using the LLM itself as a simplified
# step evaluator. A real PRM would be a dedicated model trained specifically
# on step-level correctness labels, as described in the lesson. Using an LLM
# with a prompt is a practical proxy that lets us explore the concept, but
# it is less reliable than a trained PRM and may miss subtle errors or
# flag correct steps incorrectly.

p = EVAL_PROBLEMS[0]
print(f"Problem: {p['problem']}")
print(f"Correct answer: {p['answer']}")
print(f"Correct steps: {p['steps']}")
print()

# Generate 5 chains
cot_prompt = f"{p['problem']}\n\nSolve this step by step. Show each calculation on its own line."

chains = []
for i in range(5):
    chain = call_llm(cot_prompt, temperature=0.9, max_tokens=500)
    chains.append(chain)

# Evaluate each chain
for i, chain in enumerate(chains):
    final_answer = extract_number(chain)
    outcome_correct = final_answer == p["answer"]

    # Process evaluation: use the LLM to check each step
    process_prompt = f"""I'm checking a math solution for errors. Here's the problem and solution:

Problem: {p['problem']}
Correct answer: {p['answer']}
Correct steps: {'; '.join(p['steps'])}

Student's solution:
{chain}

For each step in the student's solution, respond with:
- CORRECT if the step's math is right
- WRONG if the step's math has an error (explain what's wrong)

Then on the last line write: ALL_CORRECT or HAS_ERRORS"""

    process_eval = call_llm(process_prompt, temperature=0.0, max_tokens=500)
    process_correct = "ALL_CORRECT" in process_eval

    # Classification
    if outcome_correct and process_correct:
        label = "GOOD: correct answer, correct reasoning"
    elif outcome_correct and not process_correct:
        label = "DANGEROUS: correct answer, WRONG reasoning (cancelling errors!)"
    elif not outcome_correct and process_correct:
        label = "UNLUCKY: wrong answer, but reasoning was valid"
    else:
        label = "BAD: wrong answer, wrong reasoning"

    print(f"\nChain {i+1}: {label}")
    print(f"  Final answer: {final_answer} ({'correct' if outcome_correct else 'WRONG'})")
    print(f"  Process evaluation: {'all steps correct' if process_correct else 'HAS ERRORS'}")
    if outcome_correct and not process_correct:
        print(f"  >>> ORM would give +1 reward. PRM would penalize wrong steps.")
        print(f"  >>> This chain would pass outcome evaluation but fail process evaluation.")
    print(f"  Chain preview: {chain[:100]}...")
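
The substring test `"ALL_CORRECT" in process_eval` can misfire if the evaluator mentions both verdict words, and it silently reports errors when the reply is cut off before the verdict line. A slightly stricter parser is sketched below, assuming the evaluator follows the requested format and puts the verdict on the last non-empty line. The function name `parse_verdict` is hypothetical.

```python
from typing import Optional


def parse_verdict(process_eval: str) -> Optional[bool]:
    """Return True for ALL_CORRECT, False for HAS_ERRORS, None if unclear.

    Only the last non-empty line is inspected, so verdict words that
    appear earlier in the explanation are ignored.
    """
    lines = [ln.strip() for ln in process_eval.splitlines() if ln.strip()]
    if not lines:
        return None
    last = lines[-1].upper()
    has_ok = "ALL_CORRECT" in last
    has_err = "HAS_ERRORS" in last
    if has_ok == has_err:  # neither or both present: ambiguous
        return None
    return has_ok


print(parse_verdict("Step 1: CORRECT\nStep 2: CORRECT\nALL_CORRECT"))   # True
print(parse_verdict("Step 1: WRONG (17 x 20 is 340)\nHAS_ERRORS"))      # False
print(parse_verdict("Step 1: CORRECT\nStep 2:"))                        # None
```

Treating `None` as a separate "unscored" category, instead of folding it into HAS ERRORS, keeps truncated evaluations from inflating the count of wrong chains.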

In [None]:
# --- Step 2: Evaluate all 5 problems ---
# TODO: Run the same outcome + process evaluation pattern on all 5 problems.
# Generate 5 chains per problem. Track the counts of each category:
#   - GOOD: correct answer + correct reasoning
#   - DANGEROUS: correct answer + wrong reasoning
#   - UNLUCKY: wrong answer + correct reasoning
#   - BAD: wrong answer + wrong reasoning
#
# Store results in eval_summary:
#   eval_summary[problem_index] = {"good": int, "dangerous": int, "unlucky": int, "bad": int}
#
# Follow the same pattern as Step 1:
#   1. Generate 5 chains with temperature=0.9
#   2. For each chain, check outcome (extract_number == answer)
#   3. For each chain, check process (LLM-based step evaluation)
#   4. Classify and count
#
# Hints:
#   - Reuse the cot_prompt and process_prompt patterns from Step 1
#   - The interesting finding is how many "DANGEROUS" cases you find
#     (correct answer, wrong reasoning)
#   - Print a running summary as you go

eval_summary = {}

# YOUR CODE HERE (25-40 lines)


In [None]:
# --- Step 3: Visualize the outcome vs process evaluation gap ---
# TODO: Create a stacked bar chart showing the distribution of categories
# for each problem.
#
# X-axis: problem index (or short description)
# Y-axis: count (out of 5 chains)
# Colors: good=#34d399, dangerous=#f59e0b, unlucky=#60a5fa, bad=#f87171
#
# Also compute and print:
#   - Total outcome accuracy (final answer correct = good + dangerous)
#   - Total process accuracy (all steps correct = good + unlucky)
#   - The gap between them (outcome - process = dangerous - unlucky); the
#     "dangerous" count on its own is the number of hidden reasoning failures
#
# Useful data: eval_summary[problem_index]["good"], ["dangerous"], etc.

# YOUR CODE HERE (20-30 lines)


<details>
<summary>Solution for Steps 2 and 3</summary>

**Step 2 — evaluate all problems:**

```python
eval_summary = {}

for p_idx, p in enumerate(EVAL_PROBLEMS):
    print(f"\nProblem {p_idx+1}: {p['problem'][:50]}...")
    print(f"  Correct answer: {p['answer']}")

    counts = {"good": 0, "dangerous": 0, "unlucky": 0, "bad": 0}

    cot_prompt = f"{p['problem']}\n\nSolve this step by step. Show each calculation on its own line."

    for i in range(5):
        chain = call_llm(cot_prompt, temperature=0.9, max_tokens=500)
        final_answer = extract_number(chain)
        outcome_correct = final_answer == p["answer"]

        process_prompt = f"""I'm checking a math solution for errors. Here's the problem and solution:

Problem: {p['problem']}
Correct answer: {p['answer']}
Correct steps: {'; '.join(p['steps'])}

Student's solution:
{chain}

For each step in the student's solution, respond with:
- CORRECT if the step's math is right
- WRONG if the step's math has an error (explain what's wrong)

Then on the last line write: ALL_CORRECT or HAS_ERRORS"""

        process_eval = call_llm(process_prompt, temperature=0.0, max_tokens=500)
        process_correct = "ALL_CORRECT" in process_eval

        if outcome_correct and process_correct:
            counts["good"] += 1
        elif outcome_correct and not process_correct:
            counts["dangerous"] += 1
        elif not outcome_correct and process_correct:
            counts["unlucky"] += 1
        else:
            counts["bad"] += 1

    eval_summary[p_idx] = counts
    print(f"  Results: {counts}")

print("\nAll problems evaluated.")
```
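
The substring check `"ALL_CORRECT" in process_eval` in Step 2 would also accept an evaluation that mentions `ALL_CORRECT` in passing (for example, "this is not ALL_CORRECT") before concluding `HAS_ERRORS`. Below is a minimal sketch of a stricter parser, assuming the evaluator follows the prompt and puts its verdict on the last non-empty line. `parse_verdict` is a hypothetical helper, not part of the notebook:

```python
from typing import Optional


def parse_verdict(process_eval: str) -> Optional[bool]:
    """Return True for ALL_CORRECT, False for HAS_ERRORS, None if unparseable.

    Only the last non-empty line is inspected, so a mention of either token
    in the step-by-step commentary does not affect the result.
    """
    lines = [ln.strip() for ln in process_eval.splitlines() if ln.strip()]
    if not lines:
        return None
    last = lines[-1].upper()
    has_ok = "ALL_CORRECT" in last
    has_err = "HAS_ERRORS" in last
    if has_ok == has_err:  # neither token, or both: treat as unparseable
        return None
    return has_ok


print(parse_verdict("Step 1: CORRECT\nStep 2: WRONG (8 x 9 is 72)\nHAS_ERRORS"))
print(parse_verdict("Step 1: CORRECT\nALL_CORRECT"))
```

Returning `None` for an unparseable verdict lets you skip or retry that chain rather than silently counting it as `HAS_ERRORS`.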

**Step 3 — visualization:**

```python
fig, ax = plt.subplots(figsize=(10, 5))

problems_short = [p["problem"][:25] + "..." for p in EVAL_PROBLEMS]
x = np.arange(len(EVAL_PROBLEMS))

categories = ["good", "dangerous", "unlucky", "bad"]
colors = {"good": "#34d399", "dangerous": "#f59e0b", "unlucky": "#60a5fa", "bad": "#f87171"}
labels = {"good": "Correct answer + correct reasoning",
          "dangerous": "Correct answer + WRONG reasoning",
          "unlucky": "Wrong answer + correct reasoning",
          "bad": "Wrong answer + wrong reasoning"}

bottom = np.zeros(len(EVAL_PROBLEMS))
for cat in categories:
    values = [eval_summary[i][cat] for i in range(len(EVAL_PROBLEMS))]
    ax.bar(x, values, bottom=bottom, label=labels[cat], color=colors[cat],
           edgecolor='white', linewidth=0.5)
    bottom += np.array(values)

ax.set_xlabel('Problem', fontsize=11)
ax.set_ylabel('Chains (out of 5)', fontsize=11)
ax.set_title('Process vs Outcome Evaluation: Finding Hidden Failures', fontsize=13, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f"P{i+1}" for i in range(len(EVAL_PROBLEMS))], fontsize=10)
ax.legend(loc='upper right', fontsize=8)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.tight_layout()
plt.show()

# Compute the gap
total_chains = len(EVAL_PROBLEMS) * 5
total_outcome_correct = sum(eval_summary[i]["good"] + eval_summary[i]["dangerous"]
                            for i in range(len(EVAL_PROBLEMS)))
total_process_correct = sum(eval_summary[i]["good"] + eval_summary[i]["unlucky"]
                            for i in range(len(EVAL_PROBLEMS)))
total_dangerous = sum(eval_summary[i]["dangerous"] for i in range(len(EVAL_PROBLEMS)))

print(f"\nOverall statistics ({total_chains} chains total):")
print(f"  Outcome accuracy (final answer correct): {total_outcome_correct}/{total_chains} ({total_outcome_correct/total_chains:.0%})")
print(f"  Process accuracy (all steps correct):     {total_process_correct}/{total_chains} ({total_process_correct/total_chains:.0%})")
print(f"  Hidden failures (correct answer, wrong reasoning): {total_dangerous}/{total_chains} ({total_dangerous/total_chains:.0%})")
print("\nHidden failures are chains that an ORM would reward but a PRM would")
print("penalize. Outcome accuracy minus process accuracy equals dangerous")
print("minus unlucky, so count the dangerous chains directly.")
print("These are the 'cancelling errors' from the lesson: chains that get")
print("lucky with the right answer despite wrong intermediate steps.")
```

**Why this matters:** The "dangerous" category—correct answer with wrong reasoning—is exactly what an Outcome Reward Model (ORM) would miss. It would give these chains a positive reward, reinforcing wrong reasoning patterns. A Process Reward Model (PRM) would catch the errors at the step level, providing a richer training signal. This is the same proxy gap from reward hacking: outcome supervision is the proxy, process supervision is closer to the true objective.
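
To make the relationship between the four categories and the two accuracy figures concrete, here is a small arithmetic sketch. The counts are hypothetical and chosen only for illustration; they are not results from a real run:

```python
# Hypothetical counts for illustration only (not results from a real run).
counts = {"good": 14, "dangerous": 4, "unlucky": 2, "bad": 5}
total = sum(counts.values())

outcome_correct = counts["good"] + counts["dangerous"]  # final answer right
process_correct = counts["good"] + counts["unlucky"]    # every step judged right

gap = outcome_correct - process_correct
# The "good" term cancels, so the gap equals dangerous - unlucky.
assert gap == counts["dangerous"] - counts["unlucky"]

print(f"outcome accuracy: {outcome_correct}/{total}")
print(f"process accuracy: {process_correct}/{total}")
print(f"gap (outcome - process): {gap}")
print(f"hidden failures (dangerous): {counts['dangerous']}")
```

With these numbers the raw gap is 2 while there are 4 hidden failures, which is why the solution reports the dangerous count directly rather than the difference of the two accuracies.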

</details>

**What just happened:** You evaluated reasoning chains two ways and found the gap between outcome accuracy and process accuracy. Chains that get the right answer with wrong reasoning are the "cancelling errors" problem from the lesson.

The key insight: an ORM gives these chains the same reward as chains with correct reasoning. A PRM penalizes the wrong steps even when the final answer is right. This is why process supervision trains better reasoning—it's closer to the true objective (correct reasoning at every step) rather than the proxy (correct final answer).

Connect this to the reward hacking pattern from RLHF & Alignment: outcome supervision is a proxy that can be "gamed" (accidentally, through cancelling errors). Process supervision is a tighter specification of what "good reasoning" means.
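
The difference between the two reward signals can be sketched on a hand-written chain with cancelling errors. This is a toy, rule-based stand-in: real ORMs and PRMs are learned models, and the functions `outcome_reward` and `process_rewards`, as well as the chain itself, are assumptions made up for this illustration.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

# Hand-written chain for "13 x 17 + 8 x 9 - 45" (correct answer: 248).
# Each step: (left operand, operator, right operand, value the chain claims)
chain = [
    (13, "*", 17, 231),   # wrong: 13 * 17 = 221 (10 too high)
    (8, "*", 9, 62),      # wrong: 8 * 9 = 72 (10 too low)
    (231, "+", 62, 293),  # arithmetic is right, and the two errors cancel
    (293, "-", 45, 248),  # right
]
correct_answer = 248


def outcome_reward(chain, answer):
    """ORM-style signal: looks only at the final claimed value."""
    return 1.0 if chain[-1][3] == answer else 0.0


def process_rewards(chain):
    """PRM-style signal: one label per step, checking each step's arithmetic."""
    return [1.0 if OPS[op](a, b) == claimed else 0.0
            for a, op, b, claimed in chain]


orm = outcome_reward(chain, correct_answer)
prm = process_rewards(chain)
print("ORM reward:", orm)
print("PRM step rewards:", prm)
print("PRM flags this chain:", min(prm) == 0.0)
```

The outcome signal treats this chain exactly like a fully correct one, while the per-step signal localizes the two faulty multiplications.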

---

## Exercise 4: Test-Time Compute Allocation (Independent)

The lesson's paradigm shift: instead of building bigger models, let the model think longer. But how much longer? The lesson emphasized that "more reasoning tokens are not always better"—there are diminishing returns, and on simple problems, extra reasoning can actually hurt.

**Your task:** Design an experiment to test adaptive vs uniform compute allocation.

**Setup:**
- You have a fixed compute budget: a total of N reasoning chains across all problems
- You have a mix of easy and hard problems
- Compare two strategies:
  - **Equal allocation:** Same number of chains per problem (budget / num_problems)
  - **Adaptive allocation:** More chains for harder problems, fewer for easy ones
- Measure overall accuracy under each strategy

**Think about:**
- How will you define "harder"? (Number of steps? Past accuracy? Token count?)
- What total budget makes the comparison interesting? (Too high and both strategies max out; too low and neither works)
- How will you allocate chains adaptively? (Proportional to difficulty? All-or-nothing?)

**No skeleton is provided.** Design the experiment yourself. The solution is in the `<details>` block below.

In [None]:
# Your experiment here.
#
# 1. Define a set of problems with varying difficulty (or reuse PROBLEMS)
# 2. Choose a fixed total compute budget (total chains across all problems)
# 3. Implement equal allocation: budget / num_problems chains per problem
# 4. Implement adaptive allocation: more chains for harder problems
# 5. Measure accuracy under both strategies
# 6. Plot and compare
#
# Hint: You can reuse the majority_vote() function and the call_llm() helper.
# Use temperature > 0 for diverse chains.



In [None]:
# Reflection:
#
# 1. Did adaptive allocation outperform equal allocation?
# 2. By how much? Was the difference large or marginal?
# 3. What was the key factor: spending MORE on hard problems, or spending LESS on easy ones?
# 4. How does this connect to the lesson's "bigger brain vs more thinking time" analogy?
#
# Print your observations:
print("Reflection:")
print("  1. Adaptive vs equal: ...")
print("  2. Magnitude: ...")
print("  3. Key factor: ...")
print("  4. Connection to lesson: ...")

<details>
<summary>Solution</summary>

**Design rationale:** Use a small problem set with clear difficulty variation (the solution defines its own `ALLOCATION_PROBLEMS` list of easy and hard problems; reusing `PROBLEMS` also works). Set a total budget that's tight enough to force allocation tradeoffs. Use the number of reasoning steps as a proxy for difficulty.

```python
# --- Test-Time Compute Allocation Experiment ---

# Use a subset of problems with clear difficulty variation
ALLOCATION_PROBLEMS = [
    # Easy (1-2 steps)
    ("simple multiply", "What is 8 x 7?", 56, 1),
    ("two-digit addition", "What is 47 + 86?", 133, 1),
    ("simple word problem",
     "A store has 3 shelves with 8 books each. They remove 5 books. How many books are left?",
     19, 2),
    # Hard (3-4 steps)
    ("chained operations", "What is 13 x 17 + 8 x 9 - 45?", 248, 4),
    ("percentage reasoning",
     "A shirt costs $80. It's 25% off, then 10% off the sale price. Final price?",
     54, 3),
    ("ratio word problem",
     "Bill is $156. Alice pays 2x Bob, Carol pays 3x Bob. How much does Alice pay?",
     52, 4),
]

NUM_PROBLEMS = len(ALLOCATION_PROBLEMS)
TOTAL_BUDGET = 30  # Total chains across all problems

# --- Strategy 1: Equal allocation ---
chains_per_problem_equal = TOTAL_BUDGET // NUM_PROBLEMS  # 5 each

print(f"Total budget: {TOTAL_BUDGET} chains across {NUM_PROBLEMS} problems")
print(f"\n--- EQUAL ALLOCATION: {chains_per_problem_equal} chains per problem ---")

equal_results = []
for desc, problem, answer, steps in ALLOCATION_PROBLEMS:
    cot_prompt = f"{problem}\n\nLet's work through this step by step."
    answers = []
    for _ in range(chains_per_problem_equal):
        chain = call_llm(cot_prompt, temperature=0.7, max_tokens=500)
        answers.append(extract_number(chain))

    vote = majority_vote(answers)
    correct = vote == answer
    equal_results.append(correct)
    sym = "\u2713" if correct else "\u2717"
    print(f"  {desc:<25} | chains={chains_per_problem_equal} | vote={vote} | {sym}")

equal_accuracy = sum(equal_results) / len(equal_results)
print(f"  Equal accuracy: {equal_accuracy:.0%}")

# --- Strategy 2: Adaptive allocation ---
# Allocate proportional to number of steps (difficulty proxy)
total_steps = sum(s for _, _, _, s in ALLOCATION_PROBLEMS)
adaptive_chains = []
for _, _, _, steps in ALLOCATION_PROBLEMS:
    # Allocate proportional to steps, minimum 1
    alloc = max(1, round(TOTAL_BUDGET * steps / total_steps))
    adaptive_chains.append(alloc)

# Adjust to hit exact budget
while sum(adaptive_chains) > TOTAL_BUDGET:
    # Remove from the problem with the most allocation
    max_idx = adaptive_chains.index(max(adaptive_chains))
    adaptive_chains[max_idx] -= 1
while sum(adaptive_chains) < TOTAL_BUDGET:
    # Add to the problem with the most steps
    max_steps_idx = max(range(NUM_PROBLEMS), key=lambda i: ALLOCATION_PROBLEMS[i][3])
    adaptive_chains[max_steps_idx] += 1

print(f"\n--- ADAPTIVE ALLOCATION ---")
print(f"  Allocation: {adaptive_chains} (total: {sum(adaptive_chains)})")

adaptive_results = []
for i, (desc, problem, answer, steps) in enumerate(ALLOCATION_PROBLEMS):
    n_chains = adaptive_chains[i]
    cot_prompt = f"{problem}\n\nLet's work through this step by step."
    answers = []
    for _ in range(n_chains):
        chain = call_llm(cot_prompt, temperature=0.7, max_tokens=500)
        answers.append(extract_number(chain))

    vote = majority_vote(answers)
    correct = vote == answer
    adaptive_results.append(correct)
    sym = "\u2713" if correct else "\u2717"
    print(f"  {desc:<25} | chains={n_chains:>2} | vote={vote} | {sym}")

adaptive_accuracy = sum(adaptive_results) / len(adaptive_results)
print(f"  Adaptive accuracy: {adaptive_accuracy:.0%}")

# --- Comparison ---
print(f"\n{'=' * 50}")
print(f"RESULTS (budget = {TOTAL_BUDGET} chains)")
print(f"  Equal allocation:    {equal_accuracy:.0%}")
print(f"  Adaptive allocation: {adaptive_accuracy:.0%}")
print(f"  Difference:          {adaptive_accuracy - equal_accuracy:+.0%}")

# Visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Left: chain allocation comparison
x = np.arange(NUM_PROBLEMS)
width = 0.35
short_descs = [d[:12] for d, _, _, _ in ALLOCATION_PROBLEMS]

ax1.bar(x - width/2, [chains_per_problem_equal] * NUM_PROBLEMS, width,
        label='Equal', color='#f59e0b', edgecolor='white', linewidth=0.5)
ax1.bar(x + width/2, adaptive_chains, width,
        label='Adaptive', color='#a78bfa', edgecolor='white', linewidth=0.5)
ax1.set_xlabel('Problem', fontsize=11)
ax1.set_ylabel('Chains Allocated', fontsize=11)
ax1.set_title('Compute Allocation per Problem', fontsize=12, fontweight='bold')
ax1.set_xticks(x)
ax1.set_xticklabels(short_descs, fontsize=8, rotation=30, ha='right')
ax1.legend()
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)

# Right: accuracy comparison
strategies = ['Equal', 'Adaptive']
accs = [equal_accuracy * 100, adaptive_accuracy * 100]
bars = ax2.bar(strategies, accs, color=['#f59e0b', '#a78bfa'],
               edgecolor='white', linewidth=0.5)
ax2.set_ylabel('Accuracy (%)', fontsize=11)
ax2.set_title('Overall Accuracy: Same Budget, Different Allocation', fontsize=12, fontweight='bold')
ax2.set_ylim(0, 115)
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
for bar, acc in zip(bars, accs):
    ax2.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 2,
             f'{acc:.0f}%', ha='center', fontsize=12)

plt.suptitle(f'Test-Time Compute Allocation (Budget: {TOTAL_BUDGET} chains)',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\nExpected pattern: adaptive allocation tends to match or beat equal")
print("allocation when easy problems are already correct with 1-2 chains and")
print("hard problems benefit from more chains. With only 6 problems and one")
print("run, the measured difference is noisy, so rerun a few times before")
print("drawing conclusions. Same total compute, different distribution.")
```

**Expected findings:** Adaptive allocation should outperform equal allocation because easy problems are already correct with 1-2 chains (the extra chains are wasted), while hard problems benefit significantly from more chains (self-consistency from Exercise 2). The improvement comes from *redistributing* compute, not adding more.

**Connection to the lesson:** This is test-time compute scaling in practice. The paradigm shift is not just "think longer" but "think longer *where it matters*." A reasoning model that allocates variable compute per problem (longer chains for harder problems, shorter for easy ones) outperforms one that uses the same compute for every problem. The "bigger brain vs more thinking time" analogy: you don't need to think for 30 minutes about what 8 x 7 is, but you do for a multi-step word problem.
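
Before spending API calls, it can help to estimate what each allocation should achieve. The sketch below is a back-of-envelope model, not the notebook's method: it assumes chains are independent, treats each answer as simply right or wrong, breaks exact ties with a coin flip, and uses hypothetical single-chain accuracies (`p_easy`, and 0.60 for hard problems). The adaptive split `[2, 2, 4, 8, 6, 8]` is what the step-proportional rule in the solution produces for a budget of 30.

```python
from math import comb


def majority_vote_accuracy(p: float, n: int) -> float:
    """P(majority of n independent chains is correct), each correct w.p. p.

    Simplification: answers are right/wrong only; an exact tie is a coin flip.
    """
    total = 0.0
    for k in range(n + 1):
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        if 2 * k > n:
            total += prob_k
        elif 2 * k == n:
            total += 0.5 * prob_k
    return total


def expected_accuracy(per_chain_acc, allocation):
    """Mean majority-vote accuracy across problems for a given allocation."""
    per_problem = [majority_vote_accuracy(p, n)
                   for p, n in zip(per_chain_acc, allocation)]
    return sum(per_problem) / len(per_problem)


equal = [5] * 6
adaptive = [2, 2, 4, 8, 6, 8]  # step-proportional split, same total of 30

results = {}
for p_easy in (0.99, 0.95):
    # Hypothetical single-chain accuracies: three easy, three hard problems.
    accs = [p_easy] * 3 + [0.60] * 3
    eq = expected_accuracy(accs, equal)
    ad = expected_accuracy(accs, adaptive)
    results[p_easy] = (eq, ad)
    print(f"p_easy={p_easy:.2f}  equal={eq:.3f}  adaptive={ad:.3f}  diff={ad - eq:+.3f}")
```

Under these assumed numbers, the adaptive split comes out slightly ahead when easy problems are 99% reliable per chain and slightly behind when they are 95% reliable, because cutting an easy problem to 2 chains removes its safety margin. Your measured single-chain accuracies determine which regime you are in.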

</details>

---

## Key Takeaways

1. **RL training produces consistently better reasoning, not just longer chains.** The reasoning model outperformed the base model + CoT on the same problems—same architecture, same parameter count, different training. RL shaped how the model uses the scratchpad (context window), not what it knows.

2. **Self-consistency trades compute for reliability, with diminishing returns.** Going from 1 to 5 chains typically yields a large accuracy improvement; going from 10 to 20 is usually marginal. The compute cost grows linearly with N, but the benefit curve flattens. This is why "more reasoning tokens are not always better."

3. **Outcome evaluation misses reasoning flaws that process evaluation catches.** Chains can arrive at the correct answer through wrong reasoning (cancelling errors). An ORM would reward these chains; a PRM would penalize the wrong steps. These chains, correct on outcome but flagged on process, are the hidden failures.

4. **Adaptive compute allocation can outperform uniform allocation.** When easy problems are already reliable with one or two chains, the chains saved there are better spent on hard problems: same total compute, better overall accuracy. This is test-time compute scaling in practice: allocate compute based on problem difficulty, not uniformly.

5. **The paradigm shift: from "how big is the model?" to "how much does the model think?"** Model size and inference compute are two independent axes of scaling. These exercises demonstrated the inference compute axis empirically—self-consistency, process supervision, and adaptive allocation are all mechanisms for trading inference compute for better performance.