# Post-training an LLM for reasoning with GRPO in TRL

## 1. Setup

### Install dependencies

All dependencies are listed in `requirements.txt`. After creating and activating a virtual environment, run `pip install -r requirements.txt`.

### Log in to Hugging Face

Use the following only if you are running in Jupyter Notebook. In VS Code, you can't paste anything into the widget it displays. (Source: https://github.com/huggingface/huggingface_hub/issues/752)

In [3]:

# from huggingface_hub import notebook_login

# notebook_login()

Put your Hugging Face access token in `.env` under the environment variable name `HF_TOKEN`.
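If the variable is missing, `os.getenv("HF_TOKEN")` simply returns `None`, which can lead to a confusing login failure. A minimal guard could look like the sketch below; `require_env` is a hypothetical helper name, not part of `huggingface_hub` or `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        # Hypothetical error message; adjust to your own setup.
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

After `load_dotenv()` has run, the login call could then be written as `login(token=require_env("HF_TOKEN"))`.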

In [6]:
from dotenv import load_dotenv
load_dotenv()

import os
from huggingface_hub import login

login(token=os.getenv("HF_TOKEN"))

Note: Environment variable`HF_TOKEN` is set and is the current active token independently from the token you've just configured.


Verify that you've properly logged in.

In [None]:
!huggingface-cli whoami

## 2. Load dataset

Load the first 5% of the train and test splits.

In [4]:
from datasets import load_dataset

dataset_id = "AI-MO/NuminaMath-TIR"
train_dataset, test_dataset = load_dataset(dataset_id, split=["train[:5%]", "test[:5%]"])

Check the loaded train_dataset:

In [9]:
print(train_dataset)

Dataset({
    features: ['problem', 'solution', 'messages'],
    num_rows: 3622
})


Check one sample:

In [10]:
print(train_dataset[0])

{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'solution': "To determine the coefficient of \\(x^2y^6\\) in the expansion of \\(\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8\\), we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, \\(a = \\frac{3}{5}x\\), \\(b = -\\frac{y}{2}\\), and \\(n = 8\\).\n\nWe are interested in the term that contains \\(x^2y^6\\). In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get \\(x^2\\), we need \\(8 - k = 2\\), thus \\(k = 6\\).\n\nSubstituting \\(k = 6\\) into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we wi

Modify the dataset to follow DeepSeek-R1's training conversation style as follows:
```
A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:
```

In [7]:
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)

def make_conversation(example):
    return {
        "prompt" : [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

In [8]:
train_dataset = train_dataset.map(make_conversation)
test_dataset = test_dataset.map(make_conversation)

In [9]:
print(train_dataset[0]["prompt"])

[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'role': 'user'}]


Remove the "messages" and "problem" columns from the train dataset as we need the data to have only "prompt" and "solution" features.

In [14]:
train_dataset = train_dataset.remove_columns(["messages", "problem"])
print(train_dataset)

Dataset({
    features: ['solution', 'prompt'],
    num_rows: 3622
})


## 3. GRPO train the base model

### 3.1 Load the baseline model

We'll start with `Qwen/Qwen2-0.5B-Instruct` as our baseline model (the policy model).

In [18]:
import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

### 3.2 Configure LoRA

In [17]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
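The trainable parameter count can be cross-checked by hand. The sketch below assumes the Qwen2-0.5B dimensions from the published model config (hidden size 896, 24 layers, 14 attention heads, 2 key/value heads); these values are not stated in this notebook, so verify them against `model.config` before relying on them. Each LoRA-adapted linear layer adds two matrices, `A` of shape `r x in_features` and `B` of shape `out_features x r`.

```python
# Assumed Qwen2-0.5B dimensions (not taken from this notebook; check model.config).
hidden_size = 896
num_layers = 24
num_attention_heads = 14
num_key_value_heads = 2
head_dim = hidden_size // num_attention_heads  # 64

r = 8  # LoRA rank from lora_config


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A (rank x in) plus B (out x rank)."""
    return rank * (in_features + out_features)


q_out = num_attention_heads * head_dim   # q_proj output width
v_out = num_key_value_heads * head_dim   # v_proj output width (grouped-query attention)

per_layer = lora_params(hidden_size, q_out, r) + lora_params(hidden_size, v_out, r)
trainable = per_layer * num_layers

all_params = 494_573_440  # "all params" reported in the output above
print(trainable)                                # 540672
print(f"{100 * trainable / all_params:.4f}")    # 0.1093
```

Under these assumptions the result matches the reported 540,672 trainable parameters.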


### 3.3 Load Reward Functions

The DeepSeek-R1 authors used an accuracy-based reward that evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between `<think> </think>` tags. Here we simply define the reward functions as plain Python functions.

#### 3.3.1 Format

Ensure that the generation uses the `<think></think>` and `<answer></answer>` tags for its reasoning and answer.

In [18]:
import re

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    rewards_list = [1.0 if match else 0.0 for match in matches]
    return rewards_list
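To see what this pattern accepts, the sketch below applies the same regular expression to a few hand-written strings. `matches_format` is a hypothetical helper that works on plain strings rather than on the chat-style `completions` structure. One thing it illustrates: without `re.DOTALL`, `.` does not match newlines, so a completion whose reasoning spans several lines gets a reward of 0.0 under the pattern as written.

```python
import re

# Same pattern as in format_reward above.
PATTERN = r"^<think>.*?</think>\s*<answer>.*?</answer>$"


def matches_format(content: str, flags: int = 0) -> float:
    """Return 1.0 if the string matches the expected tag layout, else 0.0."""
    return 1.0 if re.match(PATTERN, content, flags) else 0.0


single_line = "<think>2 + 2 = 4</think><answer>4</answer>"
no_tags = "The answer is 4."
multi_line = "<think>\nFirst add the numbers.\n2 + 2 = 4\n</think>\n<answer>4</answer>"

print(matches_format(single_line))            # 1.0
print(matches_format(no_tags))                # 0.0
print(matches_format(multi_line))             # 0.0 ('.' stops at newlines)
print(matches_format(multi_line, re.DOTALL))  # 1.0
```

Whether to pass `re.DOTALL` in the actual reward function is a design choice. The training run in this notebook used the pattern without it.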

#### 3.3.2 Solution accuracy

In [19]:
from math_verify import LatexExtractionConfig, parse, verify

def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    solutions = kwargs["solution"]
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        answer_parsed = parse(content, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                rewards.append(0.0)
        else:
            # no parseable gold answer: assign 1.0 so this sample is not penalized
            rewards.append(1.0)
    return rewards

### 3.4 Configure GRPO parameters

The main parameters to experiment with are `max_completion_length`, `num_generations`, and `max_prompt_length`. For simplicity, we train for only one epoch and reduce these three parameters from their default values.

In [20]:
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="Qwen2-0.5B-GRPO-test",
    learning_rate=1e-5,
    remove_unused_columns=False, # need to access the solution column in accuracy_reward
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,
    # parameters for data preprocessing
    max_completion_length=64, # default 256
    num_generations=4, # default 8
    max_prompt_length=128, # default 512
    # parameters for reporting and saving
    report_to=["tensorboard"],
    logging_steps=10,
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
)
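`num_generations` matters because GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below illustrates that idea under stated assumptions: rewards within one group are normalized as `(r - mean) / (std + eps)`. The function name, the `eps` value, and the use of the population standard deviation are illustrative choices, not necessarily what TRL does internally.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions against the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical summed rewards (format + accuracy) for 4 completions of one prompt.
group = [2.0, 1.0, 0.0, 1.0]
print(group_relative_advantages(group))

# If every completion gets the same reward, there is no learning signal.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

With only 4 generations per prompt, groups in which all completions receive the same reward are more likely, and such groups contribute zero advantage.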

### 3.5 Train the model

In [22]:
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, accuracy_reward],
    args = training_args,
    train_dataset=train_dataset,
)

No label_names provided for model class `PeftModelForCausalLM`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


In [23]:
trainer.train()

Step,Training Loss
10,0.0042
20,0.0105
30,0.0093
40,0.0188
50,0.0218
60,0.0128
70,0.0221
80,0.0355
90,0.0277
100,0.0409


Error during comparison
Traceback (most recent call last):
  File "/home/azureuser/cloudfiles/code/Users/fine-tuning/.venv/lib/python3.12/site-packages/math_verify/grader.py", line 809, in compare_single_extraction_wrapper
    return compare_single_extraction(g, t)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/cloudfiles/code/Users/fine-tuning/.venv/lib/python3.12/site-packages/math_verify/utils.py", line 51, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/azureuser/cloudfiles/code/Users/fine-tuning/.venv/lib/python3.12/site-packages/math_verify/grader.py", line 789, in compare_single_extraction
    return sympy_expr_eq(
           ^^^^^^^^^^^^^^
  File "/home/azureuser/cloudfiles/code/Users/fine-tuning/.venv/lib/python3.12/site-packages/math_verify/grader.py", line 667, in sympy_expr_eq
    return sympy_compare_relational(gold, pred, float_rounding, numeric_precision)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TrainOutput(global_step=113, training_loss=0.021259699313514, metrics={'train_runtime': 2974.4299, 'train_samples_per_second': 1.218, 'train_steps_per_second': 0.038, 'total_flos': 0.0, 'train_loss': 0.021259699313514})

Save the results and push to HF hub.

In [24]:
trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

CommitInfo(commit_url='https://huggingface.co/alabebop/Qwen2-0.5B-GRPO-test/commit/9da3eec26246a492e9a580a0a358b0fba95f4257', commit_message='End of training', commit_description='', oid='9da3eec26246a492e9a580a0a358b0fba95f4257', pr_url=None, repo_url=RepoUrl('https://huggingface.co/alabebop/Qwen2-0.5B-GRPO-test', endpoint='https://huggingface.co', repo_type='model', repo_id='alabebop/Qwen2-0.5B-GRPO-test'), pr_revision=None, pr_num=None)

## 4. Check model performance

Load the saved model and run an evaluation on a test sample.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "alabebop/Qwen2-0.5B-GRPO-test"
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)

adapter_config.json:   0%|          | 0.00/778 [00:00<?, ?B/s]

Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.


adapter_model.safetensors:   0%|          | 0.00/2.18M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.35k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/2.78M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/1.67M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/80.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/367 [00:00<?, ?B/s]

Check a sample in the test dataset.

In [None]:
print(test_dataset[0]["prompt"]) # row based access
print(test_dataset["prompt"][0]) # column based access
# both access the same data, thanks to the `datasets` library

[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': "In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?", 'role': 'user'}]
[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'

Create a function to interact with the model, which generates the answer, measures the inference duration, and counts the number of generated tokens. This will show us how much the model has reasoned during generation.

In [20]:
import time, torch

def generate_with_reasoning(prompt):
    # flatten the chat messages into a single plain-text prompt
    # (note: this bypasses the tokenizer's chat template)
    prompt = " ".join(entry["content"] for entry in prompt)
    # tokenize the prompt and move to the same device as the model (GPU if available)
    inputs = trained_tokenizer(prompt, return_tensors="pt").to(trained_model.device)

    # generate without gradients
    # (max_length caps prompt tokens plus generated tokens combined)
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_length=500)
    end_time = time.time()

    # decode and extract model response
    generated_text = trained_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # get inference duration
    inference_duration = end_time - start_time

    # get number of generated tokens
    num_input_tokens = inputs["input_ids"].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens

    return generated_text, inference_duration, num_generated_tokens
    

Generate the answer for the test sample.

In [21]:
prompt = test_dataset["prompt"][0]
print(f"Prompt: {prompt}")
generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(prompt)
print(f"Generated text: {generated_text}")
print(f"Inference duration: {inference_duration:.2f} seconds")
print(f"Number of generated tokens: {num_generated_tokens}")

Prompt: [{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': "In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?", 'role': 'user'}]
Generated text: A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer> In 1988,

Looking at the generated text, we can see that it properly used the `<think></think>` and `<answer></answer>` tags. However, the reasoning reads like an off-the-top-of-the-head attempt, and the answer is evidently wrong.

The model is small, and training for only one epoch on 5% of the dataset is probably not enough. On the other hand, the solution for this sample in the test dataset is itself incomplete, so there are probably many incomplete samples in the training dataset as well.

In [22]:
test_dataset[0]

{'problem': "In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?",
 'solution': 'To solve this problem, let\'s break it down step-by-step:\n\n1. Let the person\'s birth year be \\( Y \\).\n2. In 1988, the person\'s age would be \\( 1988 - Y \\).\n3. The sum of the digits of \\( Y \\) should be equal to their age in 1988.\n\nTherefore, we need to find a year \\( Y \\) such that:\n\n\\[ 1988 - Y = \\text{sum of the digits of } Y \\]\n\nWe can solve this by iterating through possible values for \\( Y \\) and checking if the condition holds.\n\nLet\'s write a Python script to find the correct birth year \\( Y \\).\n```python\ndef digit_sum(year):\n    """Calculate the sum of the digits of a year."""\n    return sum(int(digit) for digit in str(year))\n\ndef find_birth_year():\nprint((    for year in range(1900, 1989):))  # Reasonable range given the\n```\n```output\nCell In[210], line 6\n    for year in range(1900, 1989):  # Reasonable ra

I tried the same puzzle with GPT-4o twice, and it couldn't solve it even after a lot of brute-force effort. o3-mini-high, however, solved it after reasoning for 14 seconds. Here is its answer:

Reasoned for 14 seconds:
```
We want to find a person’s age in 1988, where their age equals the sum of the digits of their birth year. Let’s break this apart. Let the birth year be Y, and the sum of its digits be S(Y). The equation becomes: 1988 - Y = S(Y), or Y + S(Y) = 1988.

We’ll assume Y is a four-digit number, so Y = 19ab, where ‘a’ and ‘b’ are digits. Therefore, S(Y) = 10 + a + b. This gives us:
1900 + 10a + b + (10 + a + b) = 1988. Let’s solve for a and b.

To solve the puzzle: Let the birth year be Y = 1900 + 10a + b. The sum of digits of Y is S(Y) = 10 + a + b. The equation becomes 1910 + 11a + 2b = 1988, and so 11a + 2b = 78. Solving this, I find that a = 6 and b = 6. Therefore, the birth year is 1966. In 1988, their age is 22, which matches the sum of the digits. So, the person was 22 years old in 1988.
```

Let’s solve the problem step by step.

**The Problem:**

In 1988, a person’s age was equal to the sum of the digits of their birth year.

Let:
- $Y$ = birth year
- $S(Y)$ = sum of the digits of $Y$

The person’s age in 1988 is:

$$\text{Age} = 1988 - Y$$

According to the problem:

$$1988 - Y = S(Y)$$

This is equivalent to:

$$Y + S(Y) = 1988$$

**Assume the Birth Year is in the 1900s**

Let $Y = 1900 + 10a + b$, where $a$ and $b$ are digits (0–9).

The sum of the digits is:

$$S(Y) = 1 + 9 + a + b = 10 + a + b$$

Substitute into the equation:

$$(1900 + 10a + b) + (10 + a + b) = 1988$$

Simplify:

$$1900 + 10a + b + 10 + a + b = 1910 + 11a + 2b = 1988$$

Subtract 1910 from both sides:

$$11a + 2b = 1988 - 1910 = 78$$

**Solve for a and b:**

We need to find digits $a$ and $b$ (between 0 and 9) such that:

$$11a + 2b = 78$$

- Try $a = 6$: $11 \times 6 = 66 \Rightarrow 78 - 66 = 12 \Rightarrow 2b = 12 \Rightarrow b = 6$

This gives $a = 6$ and $b = 6$.

**Determine the Birth Year and Age:**

Now, the birth year is:

$$Y = 1900 + 10a + b = 1900 + 60 + 6 = 1966$$

The person’s age in 1988 is:

$$1988 - 1966 = 22$$

Check the sum of the digits of the birth year:

$$S(1966) = 1 + 9 + 6 + 6 = 22$$

Since both the age and the digit sum are 22, the condition is satisfied.

**Final Answer:**

The person was 22 years old in 1988.
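The answer above can be cross-checked with a short brute-force search. This is a minimal sketch written for this note, not the truncated script from the dataset solution; the search range of 1800 to 1988 is an assumption about plausible birth years.

```python
def digit_sum(year: int) -> int:
    """Sum of the decimal digits of a year."""
    return sum(int(digit) for digit in str(year))


# Collect every (birth_year, age) pair where the age in 1988 equals the digit sum.
matches = [
    (year, 1988 - year)
    for year in range(1800, 1989)
    if 1988 - year == digit_sum(year)
]
print(matches)  # [(1966, 22)]
```

The search finds a single solution, which agrees with the derivation: born in 1966, aged 22 in 1988.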

## 5. Efficiency notes

The training was run on a VM with two H100 NVL GPUs (about 96 GB of VRAM each). Using LoRA should keep the resource requirements relatively small. Still, one epoch took about 50 minutes, whereas I would normally expect LoRA training of a small model like Qwen2-0.5B to take around 10 minutes.
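The figures reported in `TrainOutput` can be turned into per-step numbers with simple arithmetic. The sketch below uses only values printed earlier in this notebook (runtime, global step count, and the number of training rows). The variable names are illustrative.

```python
train_runtime_s = 2974.4299   # 'train_runtime' from TrainOutput
global_steps = 113            # 'global_step' from TrainOutput
num_train_samples = 3622      # num_rows of train_dataset

minutes = train_runtime_s / 60
seconds_per_step = train_runtime_s / global_steps
samples_per_second = num_train_samples / train_runtime_s
steps_per_second = global_steps / train_runtime_s

print(f"{minutes:.1f} min, {seconds_per_step:.1f} s per optimizer step")
# 49.6 min, 26.3 s per optimizer step
print(f"{samples_per_second:.3f} samples/s, {steps_per_second:.3f} steps/s")
# 1.218 samples/s, 0.038 steps/s
```

Roughly 26 seconds per optimizer step for a 0.5B model with LoRA suggests that most of the wall-clock time is going somewhere other than the backward pass.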

A snapshot of `nvidia-smi` taken during training showed the following:

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.01             Driver Version: 535.216.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 NVL                On  | 00000001:00:00.0 Off |                    0 |
| N/A   38C    P0              91W / 400W |   3063MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 NVL                On  | 00000002:00:00.0 Off |                    0 |
| N/A   37C    P0              94W / 400W |   2679MiB / 95830MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    138841      C   .../Users/fine-tuning/.venv/bin/python     3054MiB |
|    1   N/A  N/A    138841      C   .../Users/fine-tuning/.venv/bin/python     2670MiB |
+---------------------------------------------------------------------------------------+
```

The model is placed across the two GPUs (thanks to `device_map="auto"`), but utilization is very low on both: each GPU uses only about 3 GB of its roughly 96 GB of memory.

| GPU | Memory Usage         | GPU Utilization | Power Draw       | Notes                          |
|-----|----------------------|------------------|------------------|--------------------------------|
| 0   | **3.06 GB / 95.8 GB** | **12%**          | 117W / 400W      | Very low usage                 |
| 1   | **2.67 GB / 95.8 GB** | **21%**          | 123W / 400W      | Also low                       |

Rather than on gradient updates, much of the time is probably spent on generation and on computing rewards. Possible optimizations include caching the parsed gold solutions in the reward function and parallelizing generation.
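As a rough illustration of the caching idea: every completion generated for the same prompt is compared against the same gold solution, so the gold string only needs to be parsed once. The sketch below uses `functools.lru_cache` with a hypothetical `slow_parse` stand-in, not the real `math_verify.parse`. Whether parsing is actually the bottleneck here has not been measured.

```python
from functools import lru_cache

parse_calls = 0  # counts how often the expensive parser really runs


def slow_parse(text: str) -> str:
    """Hypothetical stand-in for an expensive parser such as a LaTeX extractor."""
    global parse_calls
    parse_calls += 1
    return text.strip().lower()


@lru_cache(maxsize=None)
def cached_parse(text: str) -> str:
    """Parse each distinct string only once; repeated strings hit the cache."""
    return slow_parse(text)


# Two prompts, each appearing twice, as when several generations share one gold solution.
solutions = ["\\frac{63}{400}", "22", "\\frac{63}{400}", "22"]
parsed = [cached_parse(s) for s in solutions]
print(parse_calls)  # 2
```

The cache key is the raw solution string, which works because strings are hashable. Only the gold side benefits, since generated completions are rarely identical.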