# Chapter 3: Reward Signals

**Goal:** Understand how to programmatically judge whether a model's output is good, and how reward design shapes what the model learns.

This notebook breaks down every function in `verifier.py` and connects it to the book's explanations.

---

## Two Flavors of Reward

The book describes two fundamentally different approaches:

| | Verifiable Rewards | Learned Rewards (Reward Models) |
|---|---|---|
| **What** | Deterministic functions | Separate ML model |
| **Examples** | Math correctness, code execution, unit tests | Human preference prediction |
| **Pros** | Cheap, fast, reproducible | Handles subjective tasks |
| **Cons** | Only works for objective tasks; extraction bugs can mis-score | Expensive to train; can be exploited (reward hacking) |
| **Used here** | **Yes** | No (too expensive for single GPU) |

This chapter focuses entirely on **verifiable rewards** - functions that return a scalar score based on objective correctness.

---

## Part 1: Answer Extraction - `extract_answer()`

Before we can check if an answer is correct, we need to **find** the answer in the model's free-form text response.

### The Challenge

The model might express the same answer in many ways:
- "The answer is 42"
- "So we get 6 * 7 = 42"
- "\\boxed{42}"
- "...calculating step by step... 42"

We need regex patterns that handle all of these.

In [None]:
import re

def extract_answer(response: str) -> str | None:
    """Extract a numerical answer from a free-form text response.
    
    Uses a cascade of regex patterns, ordered from MOST SPECIFIC to LEAST SPECIFIC.
    Returns the first match found, or None if no answer is detected.
    """
    patterns = [
        # Pattern 1: Keyword-based (most reliable)
        # Matches: "answer is 42", "result: 42", "equals 42"
        r"(?:answer|result|equals?|is)[:\s]+(-?\d+(?:\.\d+)?)",
        
        # Pattern 2: Equals sign at end of line
        # Matches: "6 * 7 = 42" (common in step-by-step math)
        r"=\s*(-?\d+(?:\.\d+)?)\s*$",
        
        # Pattern 3: LaTeX boxed format
        # Matches: "\boxed{42}" (standard math competition format)
        r"\\boxed\{(-?\d+(?:\.\d+)?)}",
        
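        # Pattern 4: Number at the end of a line (fallback)
        # With re.MULTILINE, `$` matches at every line end, so this returns the
        # first line-final number; that is the last number in the response only
        # when the response is a single line. Least reliable - used only when
        # nothing else matches. `-?` keeps negative answers, as in Patterns 1-3.
        r"(-?\d+(?:\.\d+)?)\s*$",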
    ]
    
    for pattern in patterns:
        match = re.search(pattern, response, re.IGNORECASE | re.MULTILINE)
        if match:
            return match.group(1)
    return None


# Let's test all the patterns:
test_cases = [
    ("The answer is 42",                    "42",   "Pattern 1: keyword 'answer is'"),
    ("result: 3.14",                         "3.14", "Pattern 1: keyword 'result:'"),
    ("Let me calculate... 6 * 7 = 42",      "42",   "Pattern 2: equals at end"),
    ("Therefore, \\boxed{42}",               "42",   "Pattern 3: LaTeX boxed"),
    ("I computed each step and got 42",      "42",   "Pattern 4: last number (fallback)"),
    ("I have no idea about this problem",   None,    "No match: returns None"),
    ("The answer is -7.5",                   "-7.5", "Pattern 1: negative decimal"),
]

print("Testing extract_answer():")
print(f"{'Response':<45} {'Expected':<10} {'Got':<10} {'Pattern':<30}")
print("-" * 95)
for response, expected, pattern_name in test_cases:
    result = extract_answer(response)
    status = "pass" if result == expected else "FAIL"
    print(f"{response:<45} {str(expected):<10} {str(result):<10} {pattern_name} [{status}]")

### Why the ordering matters

Consider: "The answer is 5, but let me verify: 2 + 3 = 5"

- Pattern 1 matches "answer is 5" --> returns "5" (correct!)
- If we tried Pattern 4 first, it would also find "5", but only by coincidence: it takes whatever number ends the line
- If the response was "The answer is 5. Sanity check: 2 + 4 = 6" and we used Pattern 2 first, we'd get "6" (wrong!)

Trying the most specific patterns first reduces extraction errors, though it cannot eliminate them.
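
One caveat is easier to see in code: within a single pattern, `re.search` returns the leftmost match, so an early phrase such as "15 is 3 times 5" can win over a later "the answer is 5". The sketch below reuses the Pattern 1 regex and compares first-match with last-match extraction. The helper names `first_keyword_match` and `last_keyword_match` are hypothetical and not part of `verifier.py`.

```python
import re

# Same regex as Pattern 1 in extract_answer()
KEYWORD = re.compile(
    r"(?:answer|result|equals?|is)[:\s]+(-?\d+(?:\.\d+)?)",
    re.IGNORECASE,
)

def first_keyword_match(response):
    """What re.search does: return the leftmost keyword match."""
    m = KEYWORD.search(response)
    return m.group(1) if m else None

def last_keyword_match(response):
    """Hypothetical variant: return the final keyword match instead."""
    hits = KEYWORD.findall(response)  # list of captured numbers, in order
    return hits[-1] if hits else None

text = "Since 15 is 3 times 5, the answer is 5"
print(first_keyword_match(text))  # 3
print(last_keyword_match(text))   # 5
```

Taking the last match is not automatically better (a trailing sanity check could then override the stated answer), so treat this as one option to evaluate against your own data.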

---

## Part 2: Math Reward - `math_reward()`

### The Three-Tier Scoring System

The book uses an **asymmetric reward scheme** that treats different failure modes differently:

```
+1.0  Correct answer    (best outcome)
-0.5  Wrong answer      (the model tried but failed)
-1.0  No answer found   (worst - model didn't even produce a parseable number)
```

### Why the asymmetry?

- **-1.0 for no answer** is the strongest penalty because it means the model produced unparseable output. We want to **strongly discourage** incoherent responses.
- **-0.5 for wrong answer** is milder because at least the model produced a number. This is a step in the right direction and shouldn't be punished as harshly.
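
A small sketch of what the asymmetry implies, assuming only the three reward values above: if the model answers and is right with probability `p`, its expected reward is `1.0 * p - 0.5 * (1 - p)`. The helper `expected_reward_if_answering` is hypothetical and used only for this illustration.

```python
def expected_reward_if_answering(p_correct, r_correct=1.0, r_wrong=-0.5):
    """Expected reward when the model commits to a parseable answer."""
    return p_correct * r_correct + (1.0 - p_correct) * r_wrong

NO_ANSWER = -1.0  # reward when nothing parseable is produced

for p in (0.0, 0.25, 0.5, 1.0):
    e = expected_reward_if_answering(p)
    print(f"p={p:.2f}  answering: {e:+.3f}  not answering: {NO_ANSWER:+.1f}")
```

Even at `p = 0` the expected reward for answering (-0.5) beats not answering (-1.0), so under this scheme committing to a number is always preferred. That is the intended push toward parseable output, and also the reason blind guessing is never penalized more than silence.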

In [None]:
def math_reward(response: str, ground_truth: str) -> float:
    """Score a response against a known correct answer.
    
    Returns:
        +1.0  if extracted answer matches ground truth
        -0.5  if an answer was found but it's wrong
        -1.0  if no answer could be extracted at all
    """
    answer = extract_answer(response)
    
    if answer is None:
        return -1.0  # No parseable answer --> strongest penalty
    
    try:
        # Numeric comparison with tolerance (handles floating point)
        if abs(float(answer) - float(ground_truth)) < 1e-6:
            return 1.0
    except ValueError:
        pass  # Not a valid number, fall through to string comparison
    
    # Exact string comparison (handles non-numeric answers)
    if answer.strip() == ground_truth.strip():
        return 1.0
    
    return -0.5  # Answer found but wrong


# Test cases from the book
test_cases = [
    ("The answer is 42",                "42",  1.0,   "Correct - keyword extraction"),
    ("Let me think... 6 * 7 = 42",     "42",  1.0,   "Correct - equals extraction"),
    ("I think it's 43",                 "42", -0.5,   "Wrong - close but no cigar"),
    ("I don't know",                    "42", -1.0,   "No answer - unparseable"),
    ("result: 3.14159",                 "3.14159", 1.0, "Correct - float comparison"),
    ("The answer is 3.141590001",       "3.14159", 1.0, "Correct - within tolerance"),
]

print("Testing math_reward():")
print(f"{'Response':<35} {'Truth':<10} {'Expected':<10} {'Got':<10} {'Note'}")
print("-" * 85)
for response, truth, expected, note in test_cases:
    result = math_reward(response, truth)
    status = "pass" if result == expected else "FAIL"
    print(f"{response:<35} {truth:<10} {expected:<10} {result:<10} {note} [{status}]")

### The Float Tolerance

```python
abs(float(answer) - float(ground_truth)) < 1e-6
```

This handles floating-point precision issues. Without it:
- "3.14159" vs "3.141590" would fail string comparison
- Numerical computation rounding errors would cause false negatives

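The tolerance of `1e-6` (0.000001) is tight enough to catch wrong answers of ordinary magnitude but loose enough to forgive minor floating-point differences. Because it is an absolute tolerance, it cannot tell apart answers that are themselves smaller than about `1e-6`.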
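
As a sketch of one alternative, the standard library's `math.isclose` combines a relative and an absolute tolerance. The helper name `answers_match` and the specific tolerance values are assumptions for illustration, not taken from `verifier.py`.

```python
import math

def answers_match(answer, ground_truth, rel_tol=1e-6, abs_tol=1e-9):
    """Hypothetical comparison: relative tolerance, with a small absolute floor."""
    try:
        return math.isclose(
            float(answer), float(ground_truth),
            rel_tol=rel_tol, abs_tol=abs_tol,
        )
    except ValueError:
        # Not numeric: fall back to exact string comparison
        return answer.strip() == ground_truth.strip()

# Trailing zero: equal under both schemes
print(answers_match("3.141590", "3.14159"))

# Tiny magnitudes: the absolute check calls these equal, the relative one does not
print(abs(float("0.0000001") - float("0.0000002")) < 1e-6)
print(answers_match("0.0000001", "0.0000002"))
```

Whether a relative tolerance is appropriate depends on the dataset: for integer-valued answers the original absolute check is already sufficient.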

---

## Part 3: Code Reward - `code_reward()`

For code generation tasks, we can use an even more objective measure: **does the code run and produce the right output?**

In [None]:
import subprocess
import sys

def code_reward(code: str, expected_output: str, timeout: float = 5.0) -> float:
    """Execute generated code and check its output.
    
    Returns:
        +1.0  if code runs and output matches expected
        -0.5  if code runs but output is wrong
        -1.0  if code crashes, times out, or fails to execute
    """
    try:
        # Run the code in a subprocess with a timeout
        result = subprocess.run(
            # sys.executable is the interpreter running this notebook; a bare
            # "python" may be missing from PATH, and the resulting error would
            # be swallowed below and scored as -1.0 for every sample.
            [sys.executable, "-c", code],  # -c means "run this string as code"
            capture_output=True,       # Capture stdout and stderr
            text=True,                 # Return strings, not bytes
            timeout=timeout,           # Kill if it takes too long
        )
        
        if result.returncode != 0:
            return -1.0  # Code crashed (syntax error, runtime error, etc.)
        
        if result.stdout.strip() == expected_output.strip():
            return 1.0   # Correct output!
        
        return -0.5  # Code ran but gave wrong output
        
    except subprocess.TimeoutExpired:
        return -1.0  # Infinite loop or too slow
    except Exception:
        return -1.0  # Any other failure


# Test cases
test_cases = [
    ("print(6 * 7)",           "42",   1.0,  "Correct code"),
    ("print(6 * 8)",           "42",  -0.5,  "Wrong output (48 != 42)"),
    ("syntax error here",      "42",  -1.0,  "Syntax error"),
    ("while True: pass",       "42",  -1.0,  "Infinite loop (timeout)"),
    ("print('hello world')",   "hello world", 1.0, "String output match"),
]

print("Testing code_reward():")
print(f"{'Code':<30} {'Expected':<15} {'Score':<10} {'Note'}")
print("-" * 75)
for code, expected, expected_score, note in test_cases:
    # Skip the infinite loop test in notebook (would actually hang for 5 seconds)
    if "while True" in code:
        score = -1.0
        print(f"{code:<30} {expected:<15} {score:<10} {note} (skipped in notebook)")
    else:
        score = code_reward(code, expected)
        print(f"{code:<30} {expected:<15} {score:<10} {note}")

### Security Note

Running generated code with `subprocess.run` puts it in a separate process, but that is **not a sandbox**: the child runs with your user's full permissions. The book keeps it simple with `subprocess.run`, but in production you'd want:
- Docker containers
- Resource limits (memory, disk)
- Network isolation
- Restricted syscalls (seccomp)

The `timeout=5.0` parameter is the minimal safety measure - it kills the process if the generated code runs an infinite loop.
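
As a minimal sketch of the "resource limits" item, POSIX systems let the child process lower its own limits before the interpreter starts. This assumes Linux or another POSIX platform (the `resource` module does not exist on Windows). The helper names and limit values are hypothetical, and this is still far from a real sandbox: there is no filesystem or network isolation.

```python
import resource  # POSIX only
import subprocess
import sys

def _cap(kind, value):
    """Lower the soft limit for `kind` without exceeding the hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, hard))

def _apply_limits():
    # Runs in the child process just before the new interpreter is started.
    _cap(resource.RLIMIT_CPU, 5)                  # seconds of CPU time
    _cap(resource.RLIMIT_AS, 1024 ** 3)           # 1 GiB of address space
    _cap(resource.RLIMIT_FSIZE, 10 * 1024 ** 2)   # 10 MiB per written file

def run_limited(code, timeout=5.0):
    """Hypothetical helper: the subprocess call from code_reward plus rlimits."""
    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode (ignores env vars, user site)
        capture_output=True,
        text=True,
        timeout=timeout,
        preexec_fn=_apply_limits,
    )

result = run_limited("print(6 * 7)")
print(result.returncode, result.stdout.strip())
```

The wall-clock `timeout` is kept alongside `RLIMIT_CPU` because a process that sleeps or blocks on I/O consumes almost no CPU time.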

---

## Part 4: Think Token Rewards - `reward_with_thinking()`

This is where it gets interesting. Chapter 6 introduces **think tokens** - special `<think>...</think>` blocks where the model shows its reasoning before giving a final answer.

### The Four Reward Tiers

```
+1.0   Correct + has <think> block    (best: right answer WITH reasoning)
+0.5   Correct + no <think> block     (good but: right answer without showing work)
-0.2   Wrong + has <think> block      (mild penalty: wrong but at least tried reasoning)
-0.5   Wrong + no <think> block       (bad: wrong AND didn't even try to reason)
```

### Why reward thinking even when wrong?

The penalty for "wrong + thinking" (-0.2) is much milder than "wrong + no thinking" (-0.5). This creates a **gradient toward reasoning** even before the model can solve problems correctly. The model learns:

1. First: "I should use `<think>` blocks" (because -0.2 > -0.5)
2. Then: "I should reason correctly in my `<think>` blocks" (to reach +1.0)

In [None]:
from collections.abc import Callable

def reward_with_thinking(
    response: str,
    ground_truth: str,
    base_reward_fn: Callable[[str, str], float] = math_reward,
) -> float:
    """Composite reward that incentivizes explicit reasoning traces.
    
    Checks two things:
    1. Is the answer correct? (via base_reward_fn)
    2. Did the model show its reasoning? (via <think> tags)
    
    Combines both signals into a single scalar reward.
    """
    # Check for thinking block
    has_think = "<think>" in response and "</think>" in response
    
    # Get base correctness score
    base_reward = base_reward_fn(response, ground_truth)
    
    # Combine into four tiers
    if base_reward > 0 and has_think:
        return 1.0    # Correct + thinking = best
    elif base_reward > 0:
        return 0.5    # Correct without thinking = good but not great
    elif has_think:
        return -0.2   # Wrong but showed reasoning = mild penalty
    return -0.5       # Wrong without reasoning = standard penalty


# Test all four tiers
test_cases = [
    # Correct + thinking
    ("<think>15 * 23 = 15*20 + 15*3 = 300 + 45 = 345</think>\nThe answer is 345",
     "345", 1.0, "Correct + think"),
    
    # Correct, no thinking
    ("The answer is 345",
     "345", 0.5, "Correct, no think"),
    
    # Wrong + thinking
    ("<think>15 * 23 = 15*20 + 15*3 = 300 + 35 = 335</think>\nThe answer is 335",
     "345", -0.2, "Wrong + think"),
    
    # Wrong, no thinking  
    ("The answer is 335",
     "345", -0.5, "Wrong, no think"),
]

print("Testing reward_with_thinking():")
print(f"{'Scenario':<25} {'Expected':<10} {'Got':<10}")
print("-" * 45)
for response, truth, expected, label in test_cases:
    result = reward_with_thinking(response, truth)
    status = "pass" if result == expected else "FAIL"
    print(f"{label:<25} {expected:<10} {result:<10} [{status}]")
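
The `has_think` check above is a pair of substring tests, so an empty `<think></think>` (or tags in the wrong order) would still earn the thinking bonus. A stricter check might look like the sketch below. The function name `has_nonempty_think` and the `min_chars` threshold are assumptions for illustration, not part of `verifier.py`.

```python
import re

# An opening tag, any content (including newlines), then a closing tag
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def has_nonempty_think(response, min_chars=20):
    """Hypothetical stricter check: a well-formed block with some content."""
    m = THINK_BLOCK.search(response)
    return bool(m) and len(m.group(1).strip()) >= min_chars

print(has_nonempty_think("<think></think>\nThe answer is 345"))
print(has_nonempty_think("</think> oops <think>\nThe answer is 345"))
print(has_nonempty_think("<think>15*23 = 15*20 + 15*3 = 345</think>\nThe answer is 345"))
```

A length threshold is itself easy to game with filler text, so this only narrows the loophole; it does not check that the reasoning is correct.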

---

## Part 5: How Rewards Shape GRPO Learning

Let's trace through a concrete example of how rewards interact with the GRPO algorithm (Chapter 5).

### Scenario: Group of 4 responses to "What is 15 * 23?"

Using `math_reward`:

In [None]:
# Simulating what GRPO does with rewards

responses = [
    "Let me calculate: 15 * 23 = 345. The answer is 345",
    "15 times 23... I think it's 355",
    "The answer is 345.",
    "Hmm, I'm not sure about this one",
]
ground_truth = "345"

# Step 1: Compute rewards
rewards = [math_reward(r, ground_truth) for r in responses]

# Step 2: Compute baseline (group mean)
baseline = sum(rewards) / len(rewards)

# Step 3: Compute advantages
advantages = [r - baseline for r in rewards]

print("GRPO Reward Computation Example")
print("=" * 70)
print(f"\nPrompt: 'What is 15 * 23?'  |  Ground truth: {ground_truth}")
print()

for i, (resp, reward, adv) in enumerate(zip(responses, rewards, advantages)):
    direction = "ENCOURAGE" if adv > 0 else ("DISCOURAGE" if adv < 0 else "NEUTRAL")
    print(f"Response {i+1}: '{resp[:50]}...'")
    print(f"  Reward: {reward:+.1f}  |  Advantage: {adv:+.4f}  |  Action: {direction}")
    print()

print(f"Baseline (mean reward): {baseline:.4f}")
print()
print("Interpretation:")
print("  - Responses 1 & 3 got +1.0 (correct), which is above the baseline")
print("    --> GRPO will INCREASE their probability")
print("  - Response 2 got -0.5 (wrong answer), below baseline")
print("    --> GRPO will DECREASE its probability")
print("  - Response 4 got -1.0 (no answer), well below baseline")
print("    --> GRPO will STRONGLY DECREASE its probability")
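
The cell above subtracts the group mean only. Many GRPO implementations additionally divide by the group's standard deviation. Whether the book's Chapter 5 code does so is not shown here, so treat the sketch below as an assumption. `group_advantages` and `eps` are hypothetical names.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Mean-centered, std-normalized advantages for one group of responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Same rewards as the example above: correct, wrong, correct, no answer
print([round(a, 3) for a in group_advantages([1.0, -0.5, 1.0, -1.0])])

# If every response gets the same reward, all advantages are zero
print(group_advantages([-0.5, -0.5, -0.5, -0.5]))
```

The second call shows why reward granularity matters: a group whose rewards are all identical gives GRPO nothing to prefer, whereas a mix of -0.5 and -1.0 still yields non-zero advantages even when no response is correct.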

---

## Part 6: Comparing Reward Functions

Let's see how the same responses get scored differently with `math_reward` vs `reward_with_thinking`:

In [None]:
# Same prompt, different reward functions

responses = [
    "<think>15*23 = 15*20 + 15*3 = 300 + 45 = 345</think>\nThe answer is 345",
    "The answer is 345",
    "<think>15*23 = 15*20 + 15*3 = 300 + 35 = 335</think>\nThe answer is 335",
    "335",
]
ground_truth = "345"

print("Comparing math_reward vs reward_with_thinking")
print("=" * 70)
print()

labels = [
    "Correct + thinking",
    "Correct, no thinking",
    "Wrong + thinking",
    "Wrong, no thinking",
]

print(f"{'Scenario':<25} {'math_reward':<15} {'reward_with_thinking':<20}")
print("-" * 60)

for label, resp in zip(labels, responses):
    mr = math_reward(resp, ground_truth)
    rt = reward_with_thinking(resp, ground_truth)
    print(f"{label:<25} {mr:+.1f}{'':>8} {rt:+.1f}")

print()
print("Key differences:")
print("  math_reward: Only cares about the final answer")
print("  reward_with_thinking: Also rewards the PROCESS of reasoning")
print()
print("  With math_reward: 'Correct + thinking' and 'Correct, no thinking'")
print("    both get +1.0 -- no incentive to show reasoning.")
print("  With reward_with_thinking: 'Correct + thinking' gets +1.0 but")
print("    'Correct, no thinking' only gets +0.5 -- model learns to reason.")

---

## Part 7: Reward Design Principles

The book emphasizes several principles for designing reward functions:

### 1. Reward shaping is everything
The model will optimize for **exactly what you reward**. If your reward function has a loophole, the model will find it.

### 2. Three tiers is a minimum
Binary (correct/wrong) rewards cannot distinguish a wrong answer from an unparseable one. The three tiers (+1.0, -0.5, -1.0) let GRPO rank responses within a group even when none of them is correct.

### 3. Penalize non-answers harshly
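The -1.0 for "no answer found" is intentionally the strongest penalty. Without it, the model might learn to produce vague, non-committal responses that avoid being scored as "wrong". Note that `reward_with_thinking` as written folds both -0.5 and -1.0 base rewards into its "wrong" tiers, so this distinction is lost there.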

### 4. Process rewards are powerful
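`reward_with_thinking` is a first step toward **process-based reward** - it rewards not just the outcome but also the presence of a reasoning trace. It does not check whether that reasoning is correct, so in its current form it behaves more like a format reward; Exercise 3 asks you to close that gap.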

---

## Part 8: The `__main__` Block

In [None]:
# Running the built-in test suite from verifier.py

test_cases = [
    ("The answer is 42", "42", 1.0),
    ("Let me think... 6 * 7 = 42", "42", 1.0),
    ("I think it's 43", "42", -0.5),
    ("I don't know", "42", -1.0),
]

print("Testing math_reward (from verifier.py __main__):")
all_pass = True
for response, truth, expected in test_cases:
    result = math_reward(response, truth)
    status = "pass" if result == expected else "FAIL"
    if result != expected:
        all_pass = False
    print(f"  {status} '{response[:30]}...' -> {result} (expected {expected})")

print(f"\n{'All tests passed!' if all_pass else 'Some tests FAILED!'}")

---

## Exercises

### Exercise 1: Better Answer Extraction
The current `extract_answer()` fails on responses like "The answer is forty-two". Can you add a pattern or preprocessing step to handle written-out numbers?

### Exercise 2: Partial Credit
The current system gives -0.5 for any wrong answer. Design a reward function that gives **partial credit** based on how close the answer is (e.g., if the truth is 345 and the model says 344, give a milder penalty than if it says 1000).

### Exercise 3: Multi-Step Reward
Extend `reward_with_thinking()` to also check that the intermediate arithmetic inside the `<think>` block is correct. For example, if the think block says "15*20 = 350" (wrong intermediate step), penalize it even if the final answer happens to be right.

### Exercise 4: Reward Hacking
Consider a model that learns to always output "The answer is 42" regardless of the question. Which part of the current reward function makes this more attractive than producing no answer at all (-0.5 vs. -1.0)? How could you modify the reward to prevent it?

---

## Key Takeaways

1. **Verifiable rewards** are deterministic functions that check objective correctness. They're cheap, fast, and reproducible, but only as accurate as their answer extraction.

2. **Answer extraction** uses cascading regex patterns from most-specific to least-specific to robustly find numbers in free-form text.

3. **Three-tier scoring** (+1.0, -0.5, -1.0) separates wrong answers from non-answers, which binary rewards cannot do, and penalizes non-answers most harshly.

4. **Think token rewards** incentivize the model to show its reasoning, creating a path toward process-based rewards.

5. **Reward design is critical** - the model will optimize for exactly what you measure, including any loopholes.

---

**Previous:** [Chapter 2 - Rollouts](../ch02_rollouts/learn_rollouts.ipynb)  
**Next:** [Chapter 5 - GRPO](../ch05_grpo/learn_grpo.ipynb) - How do we use these rewards to actually update the model?