
GRPO with TRL *for Reasoning*
===============================

This notebook shows how to train an autoregressive LLM to **reason through grade‑school math
word‑problems** with the **GRPOTrainer** in the 🤗 [TRL library](https://huggingface.co/docs/trl/index).

It roughly follows the structure of our sentiment-aligning movie-hater, but we'll use a more real-world reasoning-based dataset instead!

Key differences from sentiment fine-tuning
--------------------------------------
* **Reward signal** – rule-based correctness checks rather than a continuous
  sentiment score. The reward adds +0.4 when the model’s final answer matches the
  ground‑truth answer and +0.2 for each reasoning line whose arithmetic verifies
  (see section 4). This makes the optimisation landscape much sharper.
* **Prompt diversity** – prompts come from real *GSM8K* problems (~7 k unique
  problems in the training split, of which the cell below uses the first 1,000),
  so no synthetic template expansion is required.
* **Parsing** – we must extract the model’s final numeric answer from its
  chain‑of‑thought output to evaluate reward.
* **β (KL) weight** – reasoning requires larger policy moves; we therefore set
  `beta = 0.02`, lower than TRL’s default (0.04).

---

#### References and Further Reading
* GRPO reasoning tutorial – HuggingFace cookbook ([Tutorial](https://colab.research.google.com/github/huggingface/cookbook/blob/main/notebooks/en/fine_tuning_llm_grpo_trl.ipynb))
* GSM8K GRPO demo repo ([GitHub](https://github.com/Yeok-c/grpo-gsm8k-demo))
* Bite: How DeepSeekR1 was Trained ([Blog](https://www.philschmid.de/deepseek-r1))
* Abbie's RL Tutorial (PPO-focused) ([Tutorial](https://apetulante.github.io/posts/RL-for-LLMs/RL_for_LLMs.html))

### 1. Setup

A lot of our setup will be the same as our movie hater notebook. We proceed by:
1. Installing dependencies
2. Importing packages
3. Loading our model and tokenizer


In [None]:
!pip -q install --upgrade "trl==0.15.2" "transformers>=4.40.1" accelerate datasets math_verify --progress-bar off

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2025.3.2 requires fsspec==2025.3.2, but you have fsspec 2025.3.0 which is incompatible.

In [None]:
from datasets import load_dataset, Dataset
from trl import GRPOTrainer, GRPOConfig
import re, random
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
set_seed(42)
print("Device:", device)

Device: cuda


In [None]:
model_name = "vicgalle/gpt2-open-instruct-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# TRL requires a pad_token
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name).to(device) #can rerun to reset the model


In [None]:
def generate(model_obj, prompt, max_new=40):
    inp = tokenizer(prompt, return_tensors="pt").to(device)
    out = model_obj.generate(**inp, max_new_tokens=max_new)
    return tokenizer.decode(out[0], skip_special_tokens=True)

### 2. Load GSM8K Dataset

GSM8K is a benchmark dataset of grade-school math problems written in natural language. It was designed to evaluate arithmetic and reasoning capabilities in language models, with problems requiring step-by-step logic to reach a final answer.

### Why this dataset requires reasoning

Unlike movie reviews (where the sentiment model just learned to be critical),
each GSM8K prompt presents a math word problem that requires **logical steps**
to arrive at the final answer.

Examples:
- "Tom has 3 apples, buys 2 more..." → requires addition
- "A train leaves at 3:30 and arrives at 5:10..." → time subtraction

The model must not only read and understand the problem but execute the
appropriate arithmetic steps before writing `Answer: <number>` at the end.

This makes the task more sensitive to coherence, correctness, and following
multi-step structure — a stronger test of reasoning.

The GSM8K training split has 7,473 problems. The cell below keeps only the first 1,000 to keep GPU time semi-reasonable, and training will still take a while. Raise the slice (for example to `train[:5000]`) for a longer run, or lower it further if you just want to see the notebook run end to end!

In [None]:
raw_ds = load_dataset("gsm8k", "main", split="train[:1000]")


We can inspect one of these dataset examples:

In [None]:
print("Question: " + raw_ds[0]["question"])
print("Answer: " + raw_ds[0]["answer"])

Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72


And see how our model responds before training:

In [None]:
print(generate(model, raw_ds[0]["question"]))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

April: 2,000
May: 2,000

April: 2,000

April: 2,000

April: 2,000

April: 2,


### 3. Build Prompts
Each prompt asks the model to solve the problem *step‑by‑step* so it can show
its reasoning.

Then we append the reference reasoning and the final answer that we extract from each dataset example. We format the data so that the reasoning process sits between think tags and the final answer is kept separate, i.e.

Prompt format:
```
<problem>
Jack has 3 apples. …
</problem>
<think>
3 + 4 = 7  
7 – 2 = 5  
</think>
<answer>5</answer>
```
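
To make the format concrete, here is a minimal sketch of how a string in this layout can be split back into its reasoning and its final answer with regular expressions. The helper name `split_tagged` is hypothetical (it is not part of this notebook or of TRL); the patterns simply mirror the tags shown above.

```python
import re

def split_tagged(text):
    """Return (reasoning, answer) from a <think>/<answer> formatted string.

    Either element is None when its tag pair is missing.
    """
    # re.S lets "." span newlines, since the reasoning covers several lines
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>\s*(-?\d+)\s*</answer>", text)
    reasoning = think.group(1).strip() if think else None
    value = int(answer.group(1)) if answer else None
    return reasoning, value

example = (
    "<problem>\nJack has 3 apples.\n</problem>\n"
    "<think>\n3 + 4 = 7\n7 - 2 = 5\n</think>\n"
    "<answer>5</answer>"
)
reasoning, value = split_tagged(example)
print(reasoning)
print(value)
```

The reward function in section 4 relies on the same two tag pairs, so any output that drops a tag simply earns nothing for that part.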

In [None]:
prompts = []
answers = []

for ex in raw_ds:
    q   = ex["question"]
    txt = ex["answer"]

    # 1. Extract final answer (after "#### "), tolerating thousands separators such as "1,000"
    m_ans = re.search(r"####\s*([-+]?\d[\d,]*)", txt)
    if not m_ans:
        continue
    ans = m_ans.group(1).replace(",", "")

    # 2. Extract chain-of-thought (everything before the #### marker)
    cot = txt.split("####")[0].strip()

    # 3. Build formatted prompt
    prompt = (
        "<problem>\n" + q + "\n</problem>\n\n"
        "<think>\n" + cot + "\n</think>\n\n"
        f"<answer>{ans}</answer>"
    )

    prompts.append(prompt)
    answers.append(int(ans))


dataset = Dataset.from_dict({"prompt": prompts, "answer": answers})

In [None]:
dataset[0]

{'prompt': '<problem>\nNatalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?\n</problem>\n\n<think>\nNatalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n</think>\n\n<answer>72</answer>',
 'answer': 72}

### 4. Reward Function: Answer Match

Now, let's build a reward function that tries to make sure we get not just a correct answer, but also some good reasoning!

We'll still keep it super simple. We give rewards:

- **+0.4** if the `<answer>` matches the ground truth,

- **+0.2** for each correctly verified line inside the `<think>` block

This way, we reward the correct final answer, but also check that the model's thinking steps are arithmetically correct.

In [None]:
from math_verify import parse, verify
from math_verify.parser import ExprExtractionConfig

# Function to check if a given line in the thinking process is correct
def verify_step(line: str) -> bool:
    """
    Returns True if `line` is of the form "A op B = C" and the arithmetic checks out,
    using Math-Verify's parser and verifier.
    """
    if "=" not in line:
        return False
    lhs, rhs = map(str.strip, line.split("=", 1))

    # parse both sides as plain expressions (no LaTeX)
    left_expr  = parse(lhs, extraction_config=[ExprExtractionConfig()])
    right_expr = parse(rhs, extraction_config=[ExprExtractionConfig()])

    # verify expects (gold, prediction), so ensure order:
    # gold = RHS, prediction = LHS
    return verify(right_expr, left_expr)

# The reward function that will look at the thinking process AND final answer
def reward_fn(prompts, completions, answer):
    rewards = []
    for out, gt in zip(completions, answer):
        # extract think-block and answer-block
        think = re.search(r"<think>(.*?)</think>", out, re.S)
        ans   = re.search(r"<answer>(\d+)</answer>", out)
        score = 0.0
        if think:
            # verify each line in think: simple arithmetic check
            for line in think.group(1).splitlines():
                if verify_step(line):  # your step-checking logic
                    score += 0.2
        if ans and int(ans.group(1)) == gt:
            score += 0.4
        rewards.append(score)
    return rewards

Let's test this reward quickly to ensure that we get what we expect

In [None]:
# pick example i
i = 0
prompt_i     = prompts[i]
gold_output  = prompts[i]   # grab from what we formatted from the data earlier
gold_answer  = answers[i]

# compute reward
reward = reward_fn(
    prompts     = [prompt_i],
    completions = [gold_output],
    answer     = [gold_answer]
)[0]

print("Training example reward:", reward)

# now, try to put something blatantly wrong as answer
reward = reward_fn(
    prompts     = [prompt_i],
    completions = [gold_output], # the thinking steps will still be correct
    answer     = [gold_answer + 10] # BUT we put an answer here that we know is wrong!
)[0]

print("Wrong example reward:", reward)


Training example reward: 0.8
Wrong example reward: 0.4


While our reward function incentivizes correctness (even along the way), it does not guarantee *optimal* intermediate reasoning steps. It does, however, check that the reasoning didn't go totally off the rails!

To train models that value coherent, concise, AND accurate reasoning processes, alternative reward modeling techniques can be employed that incentivize good reasoning paths. Indeed, it's common for a reward function to give rewards for multiple elements of "goodness" in an answer simultaneously!

*Also note! Our checker for line-by-line accuracy is VERY simple here, and may fail on some examples in this dataset which, for instance, use variables in the math expressions. Building a robust reward function is a crucial step of RL training!*
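
One way to make the step check a little more predictable is to look only at GSM8K's calculator annotations (the `<<48/2=24>>` spans visible in the example above) and evaluate them with a restricted arithmetic evaluator. The sketch below is an illustration under that assumption, not the checker used in this notebook: `check_annotations` and `_eval` are hypothetical names, only `+ - * /` are supported, and anything else (including variables) counts as unverified.

```python
import ast
import operator
import re

# Only plain arithmetic is allowed; every other node type is rejected.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Evaluate a parsed expression limited to numbers and + - * /."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")

def check_annotations(line, tol=1e-6):
    """Return one bool per <<expr=result>> annotation found in `line`."""
    results = []
    for expr, claimed in re.findall(r"<<([^=<>]+)=([^<>]+)>>", line):
        try:
            value = _eval(ast.parse(expr.strip(), mode="eval").body)
            ok = abs(value - float(claimed)) <= tol
        except (ValueError, SyntaxError, ZeroDivisionError):
            ok = False  # unparsable, non-numeric, or symbolic step
        results.append(ok)
    return results

print(check_annotations("Natalia sold 48/2 = <<48/2=24>>24 clips in May."))
print(check_annotations("Laura expects 150 - 3 = <<150-3=65>>65 Guests."))
```

A reward function could then add 0.2 for each `True` entry. Note that this only checks that an annotation is internally consistent; it says nothing about whether the step is the right one for the problem.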

### 5. GRPO Configuration

We set up GRPO largely the same as in our sentiment notebook. One change is a longer maximum completion length, since we want to allow longer generations when reasoning is required.

In [None]:
cfg = GRPOConfig(
    beta=0.02,  # Controls the strength of the KL divergence penalty; higher values keep the model closer to the reference policy.
    learning_rate=5e-6,  # Determines the step size at each iteration while moving toward a minimum of the loss function.
    num_generations=4,  # Number of completions generated per prompt; facilitates diverse outputs for better policy optimization.
    per_device_train_batch_size=64,  # Number of samples processed per device in one forward/backward pass; must be divisible by num_generations.
    gradient_accumulation_steps=4,  # Number of steps to accumulate gradients before updating model weights; helps simulate larger batch sizes.
    logging_steps=10,  # Frequency (in steps) at which training logs are recorded.
    max_prompt_length=64,  # Maximum number of tokens in the input prompt; inputs longer than this will be truncated.
    max_completion_length=128,  # Maximum number of tokens the model can generate in response to a prompt.
)

trainer = GRPOTrainer(
    model=model,  # Our model (loaded above)
    args=cfg,  # Training configuration (defined above)
    train_dataset=dataset,
    reward_funcs=[reward_fn],  # List of reward functions to evaluate generated outputs. Note can be more than one!
    processing_class=tokenizer,  # Tokenizer corresponding to the model
)
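
To make `num_generations` concrete: GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below normalizes one group of rewards by the group mean and standard deviation. The reward values are made up, `group_advantages` and `eps` are illustrative names, and the use of the population standard deviation is an assumption (implementations differ on this detail), so treat it as a picture of the idea rather than TRL's internals.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    # eps avoids division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for the 4 completions sampled for a single prompt
rewards = [0.8, 0.4, 0.2, 0.0]
for r, a in zip(rewards, group_advantages(rewards)):
    print(f"reward={r:.1f} advantage={a:+.2f}")
```

Completions that score above their group mean get a positive advantage and are reinforced; those below get a negative one. If all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal, which is one reason a reward with partial credit can help a small model get started.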


### 6. Train
This will take a long time! Even on an A100, it may take several hours for this step to run. We've chosen a fairly large set of examples here, and generation for reasoning takes a bit longer as we've allowed the maximum response length to be longer.

If you re-run this code, this section will ask for a Weights & Biases (wandb) API key.
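
If you would rather skip that prompt, the following sketch should disable Weights & Biases logging. It assumes you run it before the trainer is created; `WANDB_MODE` is wandb's own environment switch, and `report_to` is the standard Transformers training argument that `GRPOConfig` inherits.

```python
import os

# Assumption: no experiment tracking is wanted for this run.
# "disabled" turns wandb calls into no-ops, so no API key is requested.
os.environ["WANDB_MODE"] = "disabled"

# Alternative (set when building the config instead of using the env var):
#   cfg = GRPOConfig(..., report_to="none")
print("WANDB_MODE =", os.environ["WANDB_MODE"])
```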

In [None]:
trainer.train()





`generation_config` default values have been modified to match model-specific defaults: {'use_cache': False, 'bos_token_id': 50256, 'eos_token_id': 50256}. If this is not desired, please set these values explicitly.



The first snake is 24 inches because there are 12 inches in a foot.
The snakes are 24+16+10= <<24+16+10=50>>50 inches long.


| Step | Training Loss |
|------|---------------|
| 10   | 108143526.4   |
| 20   | 0.368         |
| 30   | 0.0499        |
| 40   | 0.0485        |




<think>
<think>
<think>
<think>
<think>
<think>
<think>

<h1>As per the rule of Euclidean mechanics, “[1, 2, 3, 6, 6]** is equal to 180.”</h1>
120
0.75
27

At first the children had 10*7=<<10*7=70>>70 books. With their teacher, they have 70+8=<<70+8=78>>78 books.
6
I know, but I'll get my mind blown. I'll just take a few moments to think. Please, try this out with a few more minutes.
200 x 3 = <<50*3=150>>150 kg of fish.
Therefore, he sold 150 x 3 = <<50*3=150>>150 kg of fish.
10

80


75
2240

Female:36(.50)=17 cows
14

You said that you don't want to waste money on something when it can be used to get something you want. For example, you will never need $200 to get a new pair of shoes, or to upgrade your car if you don't have the money.


<i>The novels left behind by this writer</i>
<ii>10</ii>
<iii>9</iii>
<iv>10</iv>

Laura expects 150 - 3 = <<150-3=65>>65 Guests.
Laura expects 200 * 0.3 = <<200-3=65>>65 Guests.

35</span>

He got 20*3=<<21*3=60>>60 seeds
That means he plants 20*

TrainOutput(global_step=45, training_loss=24031894.87516898, metrics={'train_runtime': 3231.0881, 'train_samples_per_second': 0.928, 'train_steps_per_second': 0.014, 'total_flos': 0.0, 'train_loss': 24031894.87516898})
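
The `global_step=45` reported above can be reproduced with some back-of-the-envelope arithmetic from the configuration. This is a sketch under several assumptions that are not stated in the notebook: a single GPU, the default of 3 training epochs, a batch size that counts completions rather than prompts (as in the TRL version pinned here), and an epoch length that is floor-divided by the accumulation steps. Other library versions may count differently.

```python
import math

num_prompts = 1000        # rows kept by split="train[:1000]"
num_generations = 4       # completions sampled per prompt
per_device_batch = 64     # assumption: counts completions, not prompts
grad_accum = 4            # gradient_accumulation_steps
num_epochs = 3            # assumption: library default, not set in cfg

completions_per_epoch = num_prompts * num_generations
batches_per_epoch = math.ceil(completions_per_epoch / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = num_epochs * updates_per_epoch

# Unique prompts that feed each optimizer update
prompts_per_update = per_device_batch * grad_accum // num_generations

print("optimizer updates:", total_updates)
print("prompts per update:", prompts_per_update)
```

Under these assumptions the model sees each problem only about three times in total, which helps explain why the samples printed during training still look rough.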

### 7. Test Performance

In [None]:
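test_prompt = raw_ds[0]["question"]
# `model` was updated in place by the trainer, so reload the original weights for a fair "before"
base_model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
before = generate(base_model, test_prompt)
after  = generate(trainer.model, test_prompt)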

print("🔵 BEFORE\n", before, "\n")
print("🟢 AFTER\n", after, "\n")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


🔵 BEFORE
 Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

April:

1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 

🟢 AFTER
 Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?

She sold half as many clips in April and May.

She sold half as many clips in April and May.

She sold half as many clips in April and May.
 



(*NOTE:* I cut the training short! Answers would likely be at least slightly better if training completed. But it takes a long time!)

So, this behavior is definitely more reasoning-like. But our model didn't exactly arrive at the right answer through clear and logical thinking.

For one, we still didn't use much data (the 1,000 examples used here make for a pretty small training set).

Also, while our dataset provides worked solutions whose steps we scored *along with* the final answer, truly good reasoning is more than mathematical correctness of the intermediate steps.

In practice, these models are trained on *many, many more examples* and with *much more complex reward functions*.

For example, DeepSeek-R1 and open reproductions of it go well beyond a simple “correct/incorrect” final answer reward. Depending on the task, such training pipelines combine signals like:

- **Answer Reward** – a binary signal for a fully correct final answer, evaluated by a programmatic verifier

- **Format Reward** – a small bonus when the model follows prescribed structure (e.g. `<think>…</think>` and `<answer>…</answer>` tags)

- **Constraint Rewards** – task-specific checks, e.g. in number-puzzle tasks that each input number is used exactly once and computations adhere to the problem’s rules

- **Language/Fluency Rewards** – soft penalties for disfluent or non-canonical phrasing, to encourage readable reasoning

The goal is to optimize not just for a strong ability to get the right answer, but also for getting there in efficient and coherent ways!
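
As an illustration of the format reward idea, here is a minimal sketch of a second reward function that could sit next to `reward_fn`. The name `format_reward`, the 0.1 bonus, and the exact pattern are assumptions made for this example, not DeepSeek-R1's implementation. It follows the same `(prompts, completions, ...)` calling convention as `reward_fn` above and assumes completions are plain strings.

```python
import re

# A completion "follows the format" if it is exactly one think block
# followed by one integer answer block, with optional whitespace around them.
_FORMAT = re.compile(
    r"\s*<think>.+?</think>\s*<answer>\s*-?\d+\s*</answer>\s*", re.S
)

def format_reward(prompts, completions, **kwargs):
    """Give a small bonus (hypothetical value: 0.1) for well-formed output."""
    return [0.1 if _FORMAT.fullmatch(c) else 0.0 for c in completions]

samples = [
    "<think>\n48/2 = 24\n48+24 = 72\n</think>\n<answer>72</answer>",
    "She sold half as many clips in April and May.",
]
print(format_reward(prompts=["q1", "q2"], completions=samples))
```

Since `reward_funcs` accepts a list, something like `reward_funcs=[reward_fn, format_reward]` would let the trainer combine both signals. Keeping the format bonus small relative to the answer reward is a design choice, so that tidy but wrong outputs do not outscore correct ones.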

### 8. (OPTIONAL) Save the Model

In [None]:
# --- Save model and tokenizer ---
output_dir = "checkpoint-math-reasoning"
trainer.save_model(output_dir)  # Trainer exposes save_model(); save_pretrained() lives on the model/tokenizer
tokenizer.save_pretrained(output_dir)

##### --- Later, to reload and continue training ---
```
model = AutoModelForCausalLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)
```
Use these two lines in place of loading from `model_name` as in the Setup section.