# 02 - Reward Design & Validation

This notebook checks that the five reward functions assign **higher scores** to
correct samples and **lower scores** to incorrect ones.

Reward functions:

| Function | What it checks | Max reward |
|---|---|---|
| `format_reward_func` | `<reasoning>/<answer>` tags + digit answer | 1.0 |
| `numbering_reward_func` | Sequential line numbers (1, 2, 3, ...) | ~1.0 |
| `spelling_reward_func` | Listed letters match the word | 2.0 |
| `counting_reward_func` | Running count of target letter is accurate | ~1.0 |
| `correct_answer_reward_func` | Final answer matches true count | 2.0 |
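
To make the first row of the table concrete, here is a minimal sketch of what a format check of this kind could look like. It is an assumption for illustration only: the name `sketch_format_reward` and the regular expression are hypothetical, and the actual `format_reward_func` in `src/rewards.py` may use different rules or partial credit.

```python
import re

# Hypothetical pattern, not taken from src/rewards.py:
# a <reasoning> block followed by an <answer> block that contains only digits.
FORMAT_PATTERN = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>\s*\d+\s*</answer>",
    re.DOTALL,
)


def sketch_format_reward(completion: str) -> float:
    """Return 1.0 when both tag pairs are present and the answer is digits only."""
    return 1.0 if FORMAT_PATTERN.fullmatch(completion.strip()) else 0.0


well_formed = "<reasoning>\n1. b\n2. e\n</reasoning>\n<answer>2</answer>"
no_tags = "The answer is 2"
print(sketch_format_reward(well_formed), sketch_format_reward(no_tags))
```

A binary score keeps the sketch simple; a real reward function might instead award partial credit per tag.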

In [None]:
import os, sys
sys.path.insert(0, os.path.abspath(".."))

from src.utils import read_jsonl
from src.rewards import reward_single

## Correct Samples (expect HIGH rewards)

In [None]:
correct = read_jsonl("../data/samples_correct.jsonl")

for row in correct:
    r = reward_single(row["word"], row["letter"], row["count"], row["completion"])
    print(f'\n{"="*55}')
    print(f'  "{row["letter"]}" in "{row["word"]}"  (true count: {row["count"]})')
    print(f'{"="*55}')
    print(f"  Response:\n{row['completion']}")
    print(f"\n  Rewards: {r}")
    print(f"  Total:   {r.total:.2f}")

## Incorrect Samples (expect LOW rewards)

These samples contain errors: wrong running counts, skipped letters, missing format tags, or wrong final answers.

In [None]:
incorrect = read_jsonl("../data/samples_incorrect.jsonl")

for row in incorrect:
    r = reward_single(row["word"], row["letter"], row["count"], row["completion"])
    print(f'\n{"="*55}')
    print(f'  "{row["letter"]}" in "{row["word"]}"  (true count: {row["count"]})')
    print(f'{"="*55}')
    print(f"  Response:\n{row['completion']}")
    print(f"\n  Rewards: {r}")
    print(f"  Total:   {r.total:.2f}")

## Summary

Correct samples consistently score **much higher** than incorrect ones
(the combined maximum across the five functions is about 7.0), which indicates
that the reward functions provide a usable signal for GRPO training.
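
The comparison can also be checked numerically rather than by reading the printed output. The sketch below is one possible way to do that; `separation_report` is a hypothetical helper, and the totals passed to it here are placeholder numbers, not measured results from this notebook.

```python
from statistics import mean


def separation_report(correct_totals, incorrect_totals):
    """Compare total rewards of correct and incorrect samples."""
    # Worst case: lowest-scoring correct sample vs. highest-scoring incorrect one.
    gap = min(correct_totals) - max(incorrect_totals)
    return {
        "mean_correct": mean(correct_totals),
        "mean_incorrect": mean(incorrect_totals),
        "worst_case_gap": gap,
        "separated": gap > 0,
    }


# Placeholder totals for illustration only.
report = separation_report([6.5, 7.0], [1.0, 3.5])
print(report)
```

In the notebook itself, the two lists could be built from the loops above, for example by collecting `r.total` for every row in `correct` and `incorrect`.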

Key failure modes penalised:
- Wrong final answer (loses 2.0 from `correct_answer_reward_func`)
- Wrong running counts (loses from `counting_reward_func`)
- Skipped or extra letters (loses from `spelling_reward_func` and `numbering_reward_func`)
- Missing `<reasoning>/<answer>` tags (loses from `format_reward_func`)
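
The step-level failure modes above can be illustrated with a small sketch. It assumes the reasoning has already been parsed into `(line_number, listed_letter, running_count)` tuples; that parsed form and the `check_steps` helper are hypothetical, and the sketch returns pass/fail flags rather than the graded scores the real reward functions produce.

```python
def check_steps(word, letter, steps):
    """Check parsed reasoning steps against the word being counted.

    steps: list of (line_number, listed_letter, running_count) tuples.
    This parsed form is an assumption, not the format used in src/rewards.py.
    """
    # Line numbers should run 1, 2, 3, ... with no gaps.
    numbering_ok = [n for n, _, _ in steps] == list(range(1, len(steps) + 1))
    # Listed letters should spell the word exactly.
    spelling_ok = [ch for _, ch, _ in steps] == list(word)
    # Recompute the running count of the target letter.
    expected, running = [], 0
    for ch in word:
        if ch == letter:
            running += 1
        expected.append(running)
    counting_ok = [c for _, _, c in steps] == expected
    return {"numbering": numbering_ok, "spelling": spelling_ok, "counting": counting_ok}


good = [(1, "b", 0), (2, "e", 0), (3, "r", 1), (4, "r", 2), (5, "y", 2)]
skipped = [(1, "b", 0), (2, "e", 0), (3, "r", 1), (4, "y", 1)]  # one "r" skipped
print(check_steps("berry", "r", good))
print(check_steps("berry", "r", skipped))
```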