# GRPO Fine-Tuning Notebook

## Overview
This notebook demonstrates a working implementation of **GRPO (Group Relative Policy Optimization) fine-tuning**: a `Qwen/Qwen2-0.5B-Instruct` model is trained with a LoRA adapter on a subset of the `AI-MO/NuminaMath-TIR` dataset using TRL's `GRPOTrainer`. The original, functional code is preserved, and the notebook is organized with **clear structure, documentation, and explanatory context**.

The added explanations are designed to:
- Improve readability and maintainability
- Help new readers quickly understand the workflow
- Provide professional-grade documentation suitable for sharing or presentation

**Note:** No code logic has been modified in this enhancement. Only explanatory markdown has been added.

## Notebook Structure

This notebook proceeds in the following order:

1. **Environment & Dependency Setup** – Installing packages and logging in to the Hugging Face Hub
2. **Dataset Preparation** – Loading `AI-MO/NuminaMath-TIR` and building chat-style prompts
3. **Model Initialization** – Loading the base model and attaching a LoRA adapter
4. **Reward Functions** – Defining the format and accuracy rewards
5. **Configuration & Training** – Setting `GRPOConfig` and running `GRPOTrainer`
6. **Evaluation & Outputs** – Reviewing logs, pushing the model, and comparing the base and trained models

Each section below includes concise explanations to guide the reader through the intent and mechanics of the code.

In [2]:
!pip install  -U -q trl peft math_verify
!pip install unsloth-zoo==2025.9.9
# Tested with transformers==4.47.1, trl==0.14.0, datasets==3.2.0, peft==0.14.0, accelerate==1.2.1, math_verify==0.3.3

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
unsloth-zoo 2025.9.9 requires transformers!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,<=4.55.4,>=4.51.3, but you have transformers 4.57.3 which is incompatible.
Collecting transformers!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,<=4.55.4,>=4.51.3 (from unsloth-zoo==2025.9.9)
  Using cached transformers-4.55.4-py3-none-any.whl.metadata (41 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers!=4.52.0,!=4.52.1,!=4.52.2,!=4.52.3,!=4.53.0,!=4.54.0,!=4.55.0,!=4.55.1,<=4.55.4,>=4.51.3->unsloth-zoo==2025.9.9)
  Using cached tokenizers-0.21.4-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
INFO: pip is looking at multiple versions of trl to determine which version is compatible with other requirements. This could take a whil

## Notebook Widgets and Hugging Face Login

This section installs `ipywidgets`, which the login widget relies on, and authenticates with the Hugging Face Hub so the trained model can be pushed later (`push_to_hub=True`).

In [2]:
pip install ipywidgets

Collecting ipywidgets
  Downloading ipywidgets-8.1.8-py3-none-any.whl.metadata (2.4 kB)
Collecting comm>=0.1.3 (from ipywidgets)
  Downloading comm-0.2.3-py3-none-any.whl.metadata (3.7 kB)
Collecting widgetsnbextension~=4.0.14 (from ipywidgets)
  Downloading widgetsnbextension-4.0.15-py3-none-any.whl.metadata (1.6 kB)
Collecting jupyterlab_widgets~=3.0.15 (from ipywidgets)
  Downloading jupyterlab_widgets-3.0.16-py3-none-any.whl.metadata (20 kB)
Downloading ipywidgets-8.1.8-py3-none-any.whl (139 kB)
Downloading jupyterlab_widgets-3.0.16-py3-none-any.whl (914 kB)
Downloading widgetsnbextension-4.0.15-py3-none-any.whl (2.2 MB)
Downloading comm-0.2.3-py3-none-any.whl (7.3 kB)
Installing collected packages: widgetsnbextension, jupyterlab_widget

In [2]:
from huggingface_hub import notebook_login
notebook_login()

## Dataset Loading

This section loads the first 5% of the `train` and `test` splits of `AI-MO/NuminaMath-TIR`. The prompts are built from these splits in the next step.

In [18]:
from datasets import load_dataset

dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset, test_dataset = load_dataset(dataset_id, split=['train[:5%]', 'test[:5%]'])

#### Prompt Generation based on Data 

In [20]:
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)

def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

train_dataset = train_dataset.map(make_conversation)
test_dataset = test_dataset.map(make_conversation)

In [22]:
train_dataset = train_dataset.remove_columns(['messages', 'problem'])
print(train_dataset)

Dataset({
    features: ['solution', 'prompt'],
    num_rows: 3622
})
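To make the mapping concrete, the sketch below applies `make_conversation` to a single hand-written record. The sample problem and the shortened system prompt are placeholders chosen for illustration, not rows or text from `AI-MO/NuminaMath-TIR`.

```python
# Shortened stand-in for the notebook's SYSTEM_PROMPT (assumption, for brevity)
SYSTEM_PROMPT = "Reason inside <think> </think>, then answer inside <answer> </answer>."

def make_conversation(example):
    # Same structure as the notebook's function: a system turn plus a user turn
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

# Hypothetical record with the same columns the notebook keeps or reads
sample = {"problem": "What is 2 + 2?", "solution": "The answer is 4."}

# Dataset.map merges the returned dict into each row; a plain dict merge mimics that here
row = {**sample, **make_conversation(sample)}

for message in row["prompt"]:
    print(f'{message["role"]}: {message["content"]}')
```

The `solution` column stays alongside `prompt`, which is what lets `accuracy_reward` read it later through `kwargs`.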


In [23]:
import torch
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

In [24]:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
)

model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093
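The reported count can be reproduced by hand. The sketch below assumes the Qwen2-0.5B dimensions from the public model config (hidden size 896, 24 layers, 14 attention heads, 2 key/value heads); these values are not printed anywhere in this notebook, so treat them as assumptions.

```python
# Assumed Qwen2-0.5B dimensions (from the public model config, not from this notebook)
hidden_size = 896
num_layers = 24
num_attention_heads = 14
num_key_value_heads = 2
head_dim = hidden_size // num_attention_heads  # 64

r = 8  # LoRA rank from lora_config

def lora_params(in_features, out_features, rank):
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)

q_proj = lora_params(hidden_size, num_attention_heads * head_dim, r)
v_proj = lora_params(hidden_size, num_key_value_heads * head_dim, r)

trainable = num_layers * (q_proj + v_proj)
all_params = 494_573_440  # total reported by print_trainable_parameters()

print(f"trainable params: {trainable:,}")
print(f"trainable%: {100 * trainable / all_params:.4f}")
```

Under these assumptions the numbers agree with the output above, which suggests the adapter covers only `q_proj` and `v_proj` in every layer, as configured.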


#### Reward Modeling

In [25]:
import re

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    # Note: without re.DOTALL, "." does not match newlines, so only single-line
    # <think>/<answer> contents satisfy this pattern.
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    # 1.0 for a well-formatted completion, 0.0 otherwise
    return [1.0 if match else 0.0 for match in matches]
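A quick way to see what this reward accepts is to run it on a few hand-written completions. The sketch below restates the function so it runs on its own; `as_completion` and the sample strings are hypothetical and only mimic the conversational completion structure.

```python
import re

PATTERN = r"^<think>.*?</think>\s*<answer>.*?</answer>$"

def format_reward(completions, **kwargs):
    contents = [completion[0]["content"] for completion in completions]
    return [1.0 if re.match(PATTERN, content) else 0.0 for content in contents]

def as_completion(text):
    # Hypothetical helper: wrap raw text in the conversational completion structure
    return [{"role": "assistant", "content": text}]

multi_line = "<think>step 1\nstep 2</think><answer>4</answer>"

completions = [
    as_completion("<think>2 + 2 = 4</think><answer>4</answer>"),  # well formed
    as_completion("The answer is 4."),                            # no tags
    as_completion(multi_line),                                    # newline inside <think>
]

rewards = format_reward(completions)
print(rewards)  # [1.0, 0.0, 0.0]

# With re.DOTALL, "." also matches newlines, so the multi-line case would pass
print(bool(re.match(PATTERN, multi_line, flags=re.DOTALL)))  # True
```

The third case shows that reasoning spanning several lines earns no format reward with the pattern as written. Whether to pass `re.DOTALL` is a design choice; it would change the reward signal and therefore the training run.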

In [26]:
from math_verify import LatexExtractionConfig, parse, verify
def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    solutions = kwargs['solution']
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        answer_parsed = parse(content, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
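        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                # Treat verification errors as an incorrect answer
                rewards.append(0.0)
        else:
            # The reference solution could not be parsed, so award 1.0 rather than penalize the sample
            rewards.append(1.0)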
    return rewards

#### GRPO Configurations

In [None]:
from trl import GRPOConfig

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir="Qwen2-GRPO-Trained-Model",
    learning_rate=1e-4,
    remove_unused_columns=False, # to access the solution column in accuracy_reward
    gradient_accumulation_steps=16,
    num_train_epochs=1,
    bf16=True,

    # Parameters that control the data preprocessing
    max_completion_length=64, # default: 256
    num_generations=2, # default: 8
    max_prompt_length=512, # default: 512

    # Parameters related to reporting and saving
    report_to=["tensorboard"],
    logging_steps=1,
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,
)
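GRPO scores each completion relative to the other completions sampled for the same prompt, so `num_generations=2` means every prompt is judged on a pair. The following is a simplified sketch of that group-relative normalization, not TRL's actual implementation; the population standard deviation, the `eps` value, and the sample rewards are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_advantages(rewards, num_generations, eps=1e-4):
    """Normalize rewards within each group of completions sampled for one prompt."""
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mu, sigma = mean(group), pstdev(group)
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages

# Hypothetical summed rewards for 3 prompts x 2 completions each
rewards = [1.0, 0.0, 0.0, 0.0, 2.0, 2.0]
advantages = group_advantages(rewards, num_generations=2)

# Groups whose completions all received the same reward carry no learning signal
zero_std_groups = sum(
    1 for i in range(0, len(rewards), 2) if rewards[i] == rewards[i + 1]
)

print([round(a, 3) for a in advantages])
print(f"groups with zero reward spread: {zero_std_groups} of {len(rewards) // 2}")
```

Only the first pair produces non-zero advantages. With just two generations per prompt, ties are common, which is one plausible reason a run can show a high `frac_reward_zero_std` in its logs.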

#### GRPO Trainer and Training


In [30]:
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, accuracy_reward],
    args=training_args,
    train_dataset=train_dataset
)
# trainer.train()

In [31]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 1 | 0.0083 |
| 2 | 0.0106 |
| 3 | 0.0213 |
| 4 | 0.0387 |
| 5 | 0.0947 |
| 6 | 0.0485 |
| 7 | 0.0714 |
| 8 | 0.0725 |
| 9 | 0.0519 |
| 10 | 0.069 |


TrainOutput(global_step=56, training_loss=0.014425811175897252, metrics={'train_runtime': 384.4485, 'train_samples_per_second': 9.421, 'train_steps_per_second': 0.146, 'total_flos': 0.0, 'train_loss': 0.014425811175897252})
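The `global_step=56` figure can be roughly accounted for from the configuration. This is a back-of-the-envelope sketch: it assumes the default `per_device_train_batch_size` of 8, a single GPU, and a TRL version in which the batch size counts completions rather than prompts. None of these is stated in the notebook, and the batching semantics differ between TRL releases.

```python
num_examples = 3622               # train_dataset.num_rows
gradient_accumulation_steps = 16  # from training_args
num_generations = 2               # from training_args
per_device_train_batch_size = 8   # assumption: library default, not set in the notebook
num_devices = 1                   # assumption: single GPU

# Completions consumed per optimizer step, and the unique prompts behind them
completions_per_step = per_device_train_batch_size * num_devices * gradient_accumulation_steps
prompts_per_step = completions_per_step // num_generations
total_steps = num_examples // prompts_per_step

print(completions_per_step, prompts_per_step, total_steps)

# Linear decay from the initial learning rate: expected value at the second logged step
lr_step_2 = 1e-4 * (total_steps - 1) / total_steps
print(f"{lr_step_2:.6e}")
```

Under these assumptions the estimate gives 128 completions and 64 prompts per step, for 56 steps in one epoch. That is consistent with the reported `global_step` and with `train_samples_per_second` multiplied by `train_runtime`, which comes to about 3622 samples.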

#### Training Logs


In [32]:
logs = trainer.state.log_history

In [33]:
print(logs)

[{'loss': 0.0083, 'grad_norm': 0.12191557139158249, 'learning_rate': 0.0001, 'num_tokens': 28545.0, 'completions/mean_length': 54.4609375, 'completions/min_length': 2.0, 'completions/max_length': 64.0, 'completions/clipped_ratio': 0.75, 'completions/mean_terminated_length': 25.84375, 'completions/min_terminated_length': 2.0, 'completions/max_terminated_length': 63.0, 'rewards/format_reward/mean': 0.0546875, 'rewards/format_reward/std': 0.22826264798641205, 'rewards/accuracy_reward/mean': 0.0078125, 'rewards/accuracy_reward/std': 0.0883883461356163, 'reward': 0.0625, 'reward_std': 0.06629125773906708, 'frac_reward_zero_std': 0.90625, 'entropy': 1.2638819850981236, 'clip_ratio/low_mean': 0.0, 'clip_ratio/low_min': 0.0, 'clip_ratio/high_mean': 0.0, 'clip_ratio/high_max': 0.0, 'clip_ratio/region_mean': 0.0, 'epoch': 0.017857142857142856, 'step': 1}, {'loss': 0.0106, 'grad_norm': 0.12030661851167679, 'learning_rate': 9.821428571428572e-05, 'num_tokens': 57155.0, 'completions/mean_length': 4
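Printing the full history is hard to read, so a small helper that pulls one metric at a time may be more practical. The sketch below uses an abbreviated, hand-built `logs` list in the same shape as `trainer.state.log_history`; the step-1 values and the step-2 loss and learning rate are copied from the output above, the summary values come from `TrainOutput`, and `series` is a hypothetical helper name.

```python
# Abbreviated entries in the shape of trainer.state.log_history (most keys omitted)
logs = [
    {"loss": 0.0083, "reward": 0.0625, "rewards/format_reward/mean": 0.0546875, "step": 1},
    {"loss": 0.0106, "learning_rate": 9.821428571428572e-05, "step": 2},
    # Final summary entry: reports run-level metrics instead of per-step ones
    {"train_runtime": 384.4485, "train_loss": 0.014425811175897252, "step": 56},
]

def series(log_history, key):
    # Keep only entries that actually report `key`
    return [(entry["step"], entry[key]) for entry in log_history if key in entry]

print(series(logs, "loss"))
print(series(logs, "reward"))
```

Filtering on key presence avoids a `KeyError` on entries that do not carry the requested metric.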

#### Saving the Trained Model to the Hugging Face Hub


In [34]:
trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

  ...RPO-Trained-Model/training_args.bin: 100%|##########| 6.99kB / 6.99kB            

  ...ut.tfevents.1768146470.aaids.8359.1: 100%|##########| 95.7kB / 95.7kB            

  ...ned-Model/adapter_model.safetensors: 100%|##########| 2.18MB / 2.18MB            

  ...2-GRPO-Trained-Model/tokenizer.json: 100%|##########| 11.4MB / 11.4MB            

CommitInfo(commit_url='https://huggingface.co/SulemanSahib/Qwen2-GRPO-Trained-Model/commit/f7c8390f240ae7988cb22078896d3dd74fb9db7c', commit_message='End of training', commit_description='', oid='f7c8390f240ae7988cb22078896d3dd74fb9db7c', pr_url=None, repo_url=RepoUrl('https://huggingface.co/SulemanSahib/Qwen2-GRPO-Trained-Model', endpoint='https://huggingface.co', repo_type='model', repo_id='SulemanSahib/Qwen2-GRPO-Trained-Model'), pr_revision=None, pr_num=None)

## Testing with the Untrained Model
The expected answer is red. The untrained model focuses more on reasoning, which complicates the process, while the trained model focuses more on the solution. Both tests are in the sections below.

In [None]:
import time

from transformers import AutoTokenizer

sample_id = 2
# Load the base (untrained) model. The names `trained_model` and `trained_tokenizer`
# are kept so the same generation helper can be used for both models.
model_id = "Qwen/Qwen2-0.5B-Instruct"
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)
# print(test_dataset['prompt'][sample_id])
print(test_dataset['solution'][sample_id])

def generate_with_reasoning(prompt):
    # Build the prompt by joining the system and user message contents
    prompt = " ".join(entry['content'] for entry in prompt)

    # Tokenize and move to the same device as the model
    inputs = trained_tokenizer(prompt, return_tensors="pt").to(trained_model.device)

    # Generate text without gradients; max_length counts prompt plus generated tokens
    start_time = time.time()
    with torch.no_grad():
        output_ids = trained_model.generate(**inputs, max_length=500)
    end_time = time.time()

    # Decode and extract model response
    generated_text = trained_tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Get inference time
    inference_duration = end_time - start_time

    # Get number of generated tokens
    num_input_tokens = inputs['input_ids'].shape[1]
    num_generated_tokens = output_ids.shape[1] - num_input_tokens

    return generated_text, inference_duration, num_generated_tokens

prompt = test_dataset['prompt'][sample_id]
generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(prompt)
print(generated_text)
prompt_text = " ".join(entry['content'] for entry in prompt)
response_text = generated_text[len(prompt_text):].strip()
print(response_text)
print(f"Inference time: {inference_duration:.2f} seconds")
print(f"Generated tokens: {num_generated_tokens}")

To solve this problem, we need to carefully track the sequence of black and red cards that Petya places based on the given constraints. Let's denote the sequence of cards with \(R\) for red and \(B\) for black.

### Key Constraints
1. The 10th and 11th cards are red.
2. The 25th card is black.
3. No two cards of the same color are placed consecutively.

### Objective
- Determine the color of the 26th card.

### Reasoning
Given that no two cards of the same color can be consecutive, the cards must alternate in colors. However, we need to respect the specific placements of the 10th, 11th, and 25th cards.

Let's outline the logic step-by-step:
1. From cards 1 to 9, the sequence must alternate starting with either red or black.
2. The 10th and 11th cards are red, so the 9th card must be black.
3. The 25th card is black.

Given that the 25th card is black and no two consecutive cards can be the same color, the 26th card must be red.

Let's confirm this with a Python code to simulate the seq

## Testing with the Trained Model

In [65]:
sample_id = 2
model_id = "SulemanSahib/Qwen2-GRPO-Trained-Model"  # fine-tuned model pushed to the Hub above
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)

# print(test_dataset['prompt'][sample_id])
print(test_dataset['solution'][sample_id])

# generate_with_reasoning() from the previous cell is reused here. It reads the global
# `trained_model` and `trained_tokenizer`, which now point to the fine-tuned model.

prompt = test_dataset['prompt'][sample_id]
generated_text, inference_duration, num_generated_tokens = generate_with_reasoning(prompt)
print(generated_text)
prompt_text = " ".join(entry['content'] for entry in prompt)
response_text = generated_text[len(prompt_text):].strip()
print(response_text)
print(f"Inference time: {inference_duration:.2f} seconds")
print(f"Generated tokens: {num_generated_tokens}")

To solve this problem, we need to carefully track the sequence of black and red cards that Petya places based on the given constraints. Let's denote the sequence of cards with \(R\) for red and \(B\) for black.

### Key Constraints
1. The 10th and 11th cards are red.
2. The 25th card is black.
3. No two cards of the same color are placed consecutively.

### Objective
- Determine the color of the 26th card.

### Reasoning
Given that no two cards of the same color can be consecutive, the cards must alternate in colors. However, we need to respect the specific placements of the 10th, 11th, and 25th cards.

Let's outline the logic step-by-step:
1. From cards 1 to 9, the sequence must alternate starting with either red or black.
2. The 10th and 11th cards are red, so the 9th card must be black.
3. The 25th card is black.

Given that the 25th card is black and no two consecutive cards can be the same color, the 26th card must be red.

Let's confirm this with a Python code to simulate the seq
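When comparing the two runs, it can help to check mechanically whether a response follows the tag format requested in `SYSTEM_PROMPT` and, if so, what it put inside `<answer>`. The helper below is a minimal sketch; `extract_answer` is a hypothetical name and the two sample strings are made up for illustration, not actual model outputs.

```python
import re

def extract_answer(response_text):
    """Return the text inside the first <answer>...</answer> pair, or None if absent."""
    match = re.search(r"<answer>(.*?)</answer>", response_text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Hypothetical responses, not actual model outputs
tagged = "<think>The colors alternate after the 11th card.</think><answer> red </answer>"
untagged = "The 26th card must be red."

print(extract_answer(tagged))
print(extract_answer(untagged))
```

Applied to `response_text` from each test cell, a `None` result would indicate that the model ignored the requested format, independently of whether its reasoning reached the right color.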