# 01 — Baseline Prompting

This notebook shows how **Qwen2.5-3B-Instruct** performs on the letter-counting task
**without** any fine-tuning, using a Chain-of-Thought system prompt.

Task: *"How many of the letter 'g' are there in the word 'engage'?"*

The model should list each letter with a running count inside `<reasoning>` tags,
then give the final answer inside `<answer>` tags.
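
To make the target format concrete, here is a minimal sketch that builds a response of the shape described above for any word and letter. The helper name `reference_response` and the exact wording of each reasoning line are assumptions for illustration; the project's system prompt may ask for a different line layout.

```python
def reference_response(word: str, letter: str) -> str:
    """Build a response in the expected shape (hypothetical line layout)."""
    count = 0
    lines = []
    for position, ch in enumerate(word, start=1):
        # Increment the running count only when the current letter matches.
        if ch == letter:
            count += 1
        lines.append(f"{position}. {ch} - running count: {count}")
    reasoning = "\n".join(lines)
    # Reasoning goes inside <reasoning> tags, the final count inside <answer> tags.
    return f"<reasoning>\n{reasoning}\n</reasoning>\n<answer>{count}</answer>"


print(reference_response("engage", "g"))
```

Comparing a model response against a string like this is one way to see where the running count first goes wrong.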

In [None]:
# Setup — install deps if needed (e.g. on Colab)
import subprocess, sys
try:
    import trl
except ImportError:
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q",
        "torch", "transformers>=4.40", "datasets", "accelerate",
        "peft>=0.10", "trl>=0.15", "pandas", "matplotlib"])

import os

# Add the parent directory to the import path so that the `src` package can be imported
sys.path.insert(0, os.path.abspath(".."))

In [None]:
from src.prompting_baseline import load_base_model, generate
from src.rewards import reward_single

model, tokenizer = load_base_model()
print(f"Model loaded on: {model.device}")

## Baseline Results

Run the baseline model on three test cases and score each response with `reward_single`.
The model typically struggles to count occurrences of the target letter accurately.

In [None]:
test_cases = [("engage", "g"), ("banana", "a"), ("strawberry", "r")]

for word, letter in test_cases:
    true_count = word.count(letter)
    response = generate(model, tokenizer, word, letter)
    rewards = reward_single(word, letter, true_count, response)

    print(f'\n{"="*55}')
    print(f'How many "{letter}" in "{word}"?  (true count: {true_count})')
    print(f'{"="*55}')
    print(response)
    print(f"\nReward breakdown: {rewards}")
    print(f"Total reward: {rewards.total:.2f}")

## Observations

Without fine-tuning, the instruct model often:
- Miscounts the running total of the target letter
- Gives the wrong final answer
- Fails to follow the exact `<reasoning>`/`<answer>` format

This motivates GRPO fine-tuning to teach the model accurate letter counting.
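
The failure modes listed above can also be checked mechanically. The following is a minimal sketch that is independent of `src.rewards`: the function `check_response` and both regular expressions are assumptions for illustration, not the project's actual reward logic.

```python
import re

# Hypothetical patterns: a <reasoning> block followed by an <answer> block,
# and an integer inside the <answer> tags.
FORMAT_RE = re.compile(
    r"\s*<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*", re.DOTALL
)
ANSWER_RE = re.compile(r"<answer>\s*(\d+)\s*</answer>")


def check_response(word: str, letter: str, response: str) -> dict:
    """Report whether a response follows the format and gives the right count."""
    follows_format = FORMAT_RE.fullmatch(response) is not None
    match = ANSWER_RE.search(response)
    predicted = int(match.group(1)) if match else None
    return {
        "follows_format": follows_format,
        "predicted": predicted,
        # A missing answer (None) never equals the true count.
        "correct": predicted == word.count(letter),
    }


good = "<reasoning>\ns, t, r (1), a, w, b, e, r (2), r (3), y\n</reasoning>\n<answer>3</answer>"
bad = "There are 2 r's in strawberry."
print(check_response("strawberry", "r", good))
print(check_response("strawberry", "r", bad))
```

Separating the format check from the correctness check makes it easier to tell a response that is well-formed but wrong apart from one that ignores the tags altogether.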