<a href="https://colab.research.google.com/github/amanzoni1/fine_tuning/blob/main/LLama3_RL_GRPO_Reasoning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama 3.1 Reinforcement Learning & GRPO with Reasoning

In [1]:
#@title Colab Install { display-mode: "form" }
%%capture
# Install Unsloth (latest) + vLLM (pinned version)
!pip install --no-deps unsloth vllm==0.8.5.post1

# Core dependencies for LoRA, TRL, and bitsandbytes on Colab
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo

# Common NLP libraries
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub transformers==4.51.3

# vLLM extra requirements (skip numpy/transformers/xformers to avoid conflicts)
import requests, re
reqs = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
filtered = re.sub(rb"(transformers|numpy|xformers)[^\n]*\n", b"", reqs)
with open("vllm_requirements.txt","wb") as f:
    f.write(filtered)
!pip install -r vllm_requirements.txt
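The `re.sub` filter in the install cell removes everything from the first occurrence of one of the three names to the end of that line, wherever in the line the name appears. A minimal sketch of a line-anchored variant follows. The sample bytes and the helper name `drop_pinned` are made up for illustration and are not the real contents of vLLM's `common.txt`.

```python
import re

# Hypothetical sample; NOT the real contents of vLLM's requirements file.
sample = (
    b"numpy<2.0\n"
    b"requests>=2.26\n"
    b"transformers>=4.51\n"
    b"sentencepiece\n"
    b"xformers==0.0.29\n"
)

# Anchor at line start (re.MULTILINE) so only whole requirement lines are dropped.
SKIP = re.compile(rb"^(?:transformers|numpy|xformers)\b[^\n]*\n?", flags=re.MULTILINE)

def drop_pinned(requirements: bytes) -> bytes:
    """Remove requirement lines for packages that are pinned elsewhere."""
    return SKIP.sub(b"", requirements)

filtered_sample = drop_pinned(sample)
print(filtered_sample.decode())
```

Anchoring with `^` keeps a package whose name merely contains one of the three words in the middle of its line.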

In [2]:
import torch
import numpy as np
import pandas as pd
from unsloth import FastLanguageModel
from datasets import load_dataset, Dataset
from transformers import TextStreamer
from trl import SFTTrainer, SFTConfig

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 06-05 14:24:06 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 06-05 14:24:06 [__init__.py:239] Automatically detected platform cuda.


In [3]:
model       = "unsloth/Meta-Llama-3.1-8B-bnb-4bit"
sft_dataset = "openai/gsm8k"
rl_dataset  = "EdinburghNLP/xsum"

In [4]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name            = model,
    max_seq_length        = 512,
    load_in_4bit          = True,
    fast_inference        = True,
    max_lora_rank         = 16,
)

==((====))==  Unsloth 2025.5.9: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-bnb-4bit with actual GPU utilization = 49.53%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 512. Num Sequences = 128.
Unsloth: vLLM's KV Cache can use up to 1.2 GB. Also swap space = 0 GB.
INFO 06-05 14:24:51 [config.py:717] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwargs

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

INFO 06-05 14:24:58 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-05 14:24:58 [cuda.py:289] Using XFormers backend.
INFO 06-05 14:24:58 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 06-05 14:24:58 [model_runner.py:1108] Starting to load model unsloth/meta-llama-3.1-8b-bnb-4bit...
INFO 06-05 14:24:59 [loader.py:1187] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 06-05 14:25:00 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

INFO 06-05 14:26:00 [weight_utils.py:281] Time spent downloading weights for unsloth/meta-llama-3.1-8b-bnb-4bit: 59.585679 seconds
INFO 06-05 14:26:01 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 06-05 14:26:34 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 06-05 14:26:34 [model_runner.py:1140] Model loading took 5.4442 GiB and 95.439224 seconds
INFO 06-05 14:26:47 [worker.py:287] Memory profiling takes 11.87 seconds
INFO 06-05 14:26:47 [worker.py:287] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.50) = 7.30GiB
INFO 06-05 14:26:47 [worker.py:287] model weights take 5.44GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.59GiB; the rest of the memory reserved for KV Cache is 1.24GiB.
INFO 06-05 14:26:47 [executor_base.py:112] # cuda blocks: 632, # CPU blocks: 0
INFO 06-05 14:26:47 [executor_base.py:117] Maximum concurrency for 512 tokens per request: 19.75x
INFO 06-05 14:26:47 [model_runner.py:1450] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-

Capturing CUDA graph shapes:   0%|          | 0/19 [00:00<?, ?it/s]

INFO 06-05 14:27:34 [model_runner.py:1592] Graph capturing finished in 47 secs, took 0.47 GiB
INFO 06-05 14:27:35 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 60.04 seconds
Unsloth: Just some info: will skip parsing ['q_norm', 'post_feedforward_layernorm', 'pre_feedforward_layernorm', 'k_norm']
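The memory figures in the vLLM log above can be re-derived with simple arithmetic. The sketch below copies the numbers from the log lines; the block size of 16 tokens per KV-cache block is an assumption (vLLM's default), not something the log states.

```python
# Figures copied from the log lines above (GiB).
total_gpu = 14.74
utilization = 0.4953  # "actual GPU utilization = 49.53%"
weights, non_torch, activations = 5.44, 0.03, 0.59

# Memory budget handed to vLLM, and what is left for the KV cache.
budget = total_gpu * utilization
kv_cache = budget - weights - non_torch - activations

# Assumption: vLLM's default block size of 16 tokens per KV-cache block.
BLOCK_TOKENS = 16
cuda_blocks = 632
max_concurrency = cuda_blocks * BLOCK_TOKENS / 512  # 512-token requests

print(round(budget, 2), round(kv_cache, 2), max_concurrency)
```

Under that assumption the three values line up with the logged 7.30 GiB budget, 1.24 GiB KV cache, and 19.75x maximum concurrency.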



In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    random_state               = 111,
    r                          = 16,
    lora_alpha                 = 32,
    bias                       = "none",
    use_gradient_checkpointing = "unsloth",
    target_modules             = ["q_proj", "k_proj", "v_proj", "o_proj",
                                  "gate_proj", "up_proj", "down_proj"],
)

Unsloth 2025.5.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
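As a sanity check on the LoRA setup, the number of trainable parameters can be estimated by hand. This sketch assumes the publicly documented Llama 3.1 8B dimensions (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); those values are not printed anywhere in this notebook.

```python
# Assumed Llama 3.1 8B dimensions (from the public model config, not from this notebook).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 4096, 14336, 8 * 128, 32
RANK = 16  # the `r` passed to get_peft_model

# (in_features, out_features) of each targeted projection.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

# Each LoRA adapter adds A (r x in) and B (out x r), i.e. r * (in + out) weights.
per_layer = sum(RANK * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
trainable = per_layer * LAYERS
print(f"{trainable:,}")
```

The result agrees with the "Trainable parameters = 41,943,040" figure that the trainer banner reports during SFT.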


In [6]:
# Define special tokens and system prompt
reasoning_start = "<REASONING>"
reasoning_end   = "</REASONING>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = (
    "You are given a problem.\n"
    "Think over it and describe your step‐by‐step reasoning.\n"
    f"Enclose reasoning between {reasoning_start} and {reasoning_end}.\n"
    f"Finally, give your answer between {solution_start} and {solution_end}"
)

In [7]:
# Build and assign chat_template to the tokenizer

chat_template = (
    # If the very first message is a SYSTEM role, print it + <eos>:
    "{% if messages[0]['role'] == 'system' %}"
      "{{ messages[0]['content'] + eos_token }}"
      "{% set rest = messages[1:] %}"
    "{% else %}"
      # Otherwise, inject our system_prompt + <eos>:
      "{{ '{system_prompt}' + eos_token }}"
      "{% set rest = messages %}"
    "{% endif %}"

    # Now loop over the remaining messages (either user or assistant):
    "{% for m in rest %}"
      "{% if m['role'] == 'user' %}"
        "{{ m['content'] }}"
      "{% else %}"  # assistant
        "{{ m['content'] + eos_token }}"
      "{% endif %}"
    "{% endfor %}"

    # If we asked for “add_generation_prompt,” append <REASONING> to the end:
    "{% if add_generation_prompt %}"
      "{{ '{reasoning_start}' }}"
    "{% endif %}"
)

chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")

tokenizer.chat_template = chat_template

In [8]:
# Quick sanity check of the template
example_messages = [
    {"role": "user",
     "content": "Which country has the highest population density?"},
    {"role": "assistant",
     "content": (
         f"{reasoning_start}"
         "I know that country X is small in area but has a huge population, "
         "so its people per square kilometer is extremely high."
         f"{reasoning_end}"
         f"{solution_start}Monaco{solution_end}"
     )},
    {"role": "user",
     "content": "Which planet is farthest from the Sun?"},
]


print("Rendered example:\n")
print(tokenizer.apply_chat_template(example_messages, tokenize=False, add_generation_prompt = True))

Rendered example:

You are given a problem.
Think over it and describe your step‐by‐step reasoning.
Enclose reasoning between <REASONING> and </REASONING>.
Finally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>Which country has the highest population density?<REASONING>I know that country X is small in area but has a huge population, so its people per square kilometer is extremely high.</REASONING><SOLUTION>Monaco</SOLUTION><|end_of_text|>Which planet is farthest from the Sun?<REASONING>
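The template logic can be mirrored in plain Python, which makes it easier to test without loading a tokenizer. This is an illustrative sketch only: `render_chat` is a hypothetical helper, and the EOS string is assumed to be the `<|end_of_text|>` seen in the rendered output above.

```python
EOS = "<|end_of_text|>"  # assumption: the EOS string seen in the rendered output

def render_chat(messages, default_system, reasoning_tag="<REASONING>",
                add_generation_prompt=False, eos=EOS):
    """Plain-Python mirror of the Jinja chat template, for illustration only."""
    if messages and messages[0]["role"] == "system":
        out = messages[0]["content"] + eos
        rest = messages[1:]
    else:
        # No explicit system message: inject the default one.
        out = default_system + eos
        rest = messages
    for m in rest:
        # User turns are emitted bare; every other role gets a trailing EOS.
        out += m["content"] if m["role"] == "user" else m["content"] + eos
    if add_generation_prompt:
        out += reasoning_tag
    return out

demo = render_chat(
    [{"role": "user", "content": "2+2?"}],
    default_system="SYS",
    add_generation_prompt=True,
)
print(demo)
```

Note that, as in the Jinja version, nothing separates the system prompt, the user turn, and the assistant turn other than the EOS token.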


In [9]:
# Load the dataset
dataset = load_dataset(sft_dataset, "main", split="train")
dataset

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Dataset({
    features: ['question', 'answer'],
    num_rows: 7473
})

In [10]:
print("=== Raw GSM8K columns ===")
print(dataset.column_names)
print("\n=== First raw example ===")
print(dataset[0])

=== Raw GSM8K columns ===
['question', 'answer']

=== First raw example ===
{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?', 'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'}


In [11]:
# Define formatting + token-count function
def format_and_count_gsm8k(example):
    question = example["question"].strip()
    reasoning = example["answer"].split("####")[0].replace("\n", " ").strip()
    final_ans = example["answer"].split("####")[1].strip()

    messages = [
        {"role": "system",    "content": system_prompt},
        {"role": "user",      "content": question},
        {"role": "assistant", "content": (
            f"{reasoning_start}{reasoning}{reasoning_end}"
            f"{solution_start}{final_ans}{solution_end}"
        )}
    ]

    enc = tokenizer.apply_chat_template(messages, tokenize=True)
    if isinstance(enc, dict):
        token_len = len(enc["input_ids"])
    else:
        token_len = len(enc)

    text_str = tokenizer.apply_chat_template(messages, tokenize=False)

    return {
        "token_len": token_len,
        "text": text_str,
        "Messages": messages
    }
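The formatting function relies on every GSM8K answer containing the `####` marker; `split("####")[1]` raises `IndexError` otherwise. A small defensive sketch using `str.partition` is shown below. `split_gsm8k_answer` is a hypothetical helper, and it collapses all runs of whitespace, which differs slightly from the `replace("\n", " ")` used above.

```python
def split_gsm8k_answer(answer: str):
    """Split a GSM8K answer into (reasoning, final_answer).

    Returns final_answer=None when the '####' marker is missing,
    instead of raising IndexError.
    """
    reasoning, sep, final = answer.partition("####")
    reasoning = " ".join(reasoning.split())  # collapse newlines and extra spaces
    return reasoning, (final.strip() if sep else None)

raw = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
reasoning, final = split_gsm8k_answer(raw)
print(final)
print(reasoning)
```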

In [12]:
# Apply formatting to the dataset
dataset = dataset.map(
    format_and_count_gsm8k,
    remove_columns=dataset.column_names,
)

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

In [13]:
# Sanity check: print a few "text" examples before filtering
print("\n=== Few formatted text examples (first 3) ===")
for i in range(3):
    print(f"\n--- Example {i} ---")
    print(dataset[i]["text"])


=== Few formatted text examples (first 3) ===

--- Example 0 ---
You are given a problem.
Think over it and describe your step‐by‐step reasoning.
Enclose reasoning between <REASONING> and </REASONING>.
Finally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?<REASONING>Natalia sold 48/2 = <<48/2=24>>24 clips in May. Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.</REASONING><SOLUTION>72</SOLUTION><|end_of_text|>

--- Example 1 ---
You are given a problem.
Think over it and describe your step‐by‐step reasoning.
Enclose reasoning between <REASONING> and </REASONING>.
Finally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?<REASONING>Weng earns 12/60 = $<<12/60=0.2>

In [14]:
# Token-length statistics and dataset filtered for training
lengths = np.array(dataset["token_len"])
print("\nToken-length percentiles (50/90/99):", np.percentile(lengths, [50, 90, 99]))

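# Keep only examples of at most `threshold` tokens, then take the first 100 of them
# for a short SFT warm-up. Longer examples are set aside and reused later as held-out prompts.
threshold = 200
sft_ds_filtered     = dataset.filter(lambda ex: ex["token_len"] <= threshold)
sft_ds_filtered     = sft_ds_filtered.select(range(100))
sft_ds_filtered_out = dataset.filter(lambda ex: ex["token_len"] >  threshold)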

print(f"\nRemaining for training (≤{threshold} tokens): {len(sft_ds_filtered)} / {len(dataset)}")


Token-length percentiles (50/90/99): [210. 299. 390.]


Filter:   0%|          | 0/7473 [00:00<?, ? examples/s]

Filter:   0%|          | 0/7473 [00:00<?, ? examples/s]


Remaining for training (≤200 tokens): 100 / 7473


In [15]:
# Drop extra columns so dataset contains only "text"
sft_dataset = sft_ds_filtered.remove_columns(["token_len", "Messages"])
print("\n=== Final dataset ===")
print(sft_dataset)


=== Final dataset ===
Dataset({
    features: ['text'],
    num_rows: 100
})


In [16]:
# Define training arguments
sft_config = SFTConfig(
    seed                        = 111,
    do_train                    = True,
    num_train_epochs            = 2,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 2,
    learning_rate               = 2e-4,
    lr_scheduler_type           = "linear",
    warmup_ratio                = 0.03,
    weight_decay                = 0.01,
    logging_strategy            = "steps",
    logging_steps               = 5,
    report_to                   = "none",
)

In [17]:
# Instantiate SFTTrainer
trainer = SFTTrainer(
    model         = model,
    args          = sft_config,
    train_dataset = sft_dataset,
    tokenizer     = tokenizer,
)

Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/100 [00:00<?, ? examples/s]

In [18]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 100 | Num Epochs = 2 | Total steps = 50
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
 "-____-"     Trainable parameters = 41,943,040/8,000,000,000 (0.52% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
5,1.6672
10,0.8701
15,0.7716
20,0.6991
25,0.7737
30,0.6378
35,0.5837
40,0.5606
45,0.5789
50,0.519


TrainOutput(global_step=50, training_loss=0.7661662244796753, metrics={'train_runtime': 162.7359, 'train_samples_per_second': 1.229, 'train_steps_per_second': 0.307, 'total_flos': 893591679762432.0, 'train_loss': 0.7661662244796753})
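The "Total steps = 50" in the banner follows directly from the dataset size and the batch settings. A short worked calculation, with the warmup rounding treated as an assumption about the scheduler rather than something shown in the output:

```python
import math

num_examples, epochs = 100, 2
per_device_batch, grad_accum, num_gpus = 2, 2, 1

# One optimizer step consumes batch x accumulation x GPUs examples.
effective_batch = per_device_batch * grad_accum * num_gpus
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Assumption: warmup steps are warmup_ratio * total_steps, rounded up.
warmup_steps = math.ceil(0.03 * total_steps)

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

With `logging_steps = 5`, those 50 steps produce the ten loss rows in the table above.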

In [19]:
# Pick one example’s first two “system + user” messages
prompt_messages = sft_ds_filtered_out[0]["Messages"][:2]

# Render into a single string and append <REASONING> for generation:
text = tokenizer.apply_chat_template(
    prompt_messages,
    tokenize=False,
    add_generation_prompt=True,  # append the final <REASONING>
)

In [20]:
# Stream the model’s generations (CoT + solution)
streamer = TextStreamer(tokenizer, skip_prompt=False)

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    temperature    = 0.0,
    max_new_tokens = 512,
    streamer       = streamer,
)

<|begin_of_text|>You are given a problem.
Think over it and describe your step‐by‐step reasoning.
Enclose reasoning between <REASONING> and </REASONING>.
Finally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read?<REASONING>Yesterday, Julie read 12*2=<<12*2=24>>24 pages. So far, she has read 12+24=36 pages. She has 120-36=<<120-36=84>>84 pages left. Half of 84 is 84/2=<<84/2=42>>42 pages. So, Julie should read 42 pages tomorrow.</REASONING><SOLUTION>42</SOLUTION><|end_of_text|>


In [21]:
del dataset
del sft_ds_filtered
del sft_ds_filtered_out
del sft_dataset

import gc
gc.collect()
torch.cuda.empty_cache()

In [9]:
# 1) Load the full XSum train split into memory
dataset = load_dataset(rl_dataset, split="train")
dataset

README.md:   0%|          | 0.00/6.24k [00:00<?, ?B/s]

xsum.py:   0%|          | 0.00/5.76k [00:00<?, ?B/s]

0000.parquet:   0%|          | 0.00/304M [00:00<?, ?B/s]

0000.parquet:   0%|          | 0.00/16.7M [00:00<?, ?B/s]

0000.parquet:   0%|          | 0.00/17.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/204045 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11332 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/11334 [00:00<?, ? examples/s]

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})

In [10]:
# Keep only the first 1 000 documents whose raw token count ≤ 300
DOC_CUTOFF = 300
TARGET_EXAMPLES = 1000

selected = []
for ex in dataset:
    # ex["document"] is the source text, ex["summary"] is the gold summary.
    doc_tokens = tokenizer(
        ex["document"],
        truncation=False,  # we just want to measure length, not truncate
    )["input_ids"]
    if len(doc_tokens) <= DOC_CUTOFF:
        selected.append({
            "document": ex["document"],
            "summary":  ex["summary"]
        })
        if len(selected) >= TARGET_EXAMPLES:
            break

print(f"✔ Collected {len(selected)} examples with doc‐tokens ≤ {DOC_CUTOFF}.")

✔ Collected 1000 examples with doc‐tokens ≤ 300.
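The early-exit loop above can also be written as a lazy generator plus `itertools.islice`, which stops iterating as soon as enough examples are found. The sketch uses a whitespace token count and a tiny in-memory corpus as stand-ins; both are hypothetical and replace the real tokenizer and XSum only for illustration.

```python
from itertools import islice

def take_short_docs(examples, count_tokens, cutoff, limit):
    """Return up to `limit` examples whose document has at most `cutoff` tokens."""
    short = (ex for ex in examples if count_tokens(ex["document"]) <= cutoff)
    return list(islice(short, limit))  # stops consuming `examples` once full

# Hypothetical stand-ins: whitespace token count and a toy corpus.
toy_corpus = [
    {"document": "a b c", "summary": "s1"},
    {"document": "a b c d e f", "summary": "s2"},
    {"document": "a b", "summary": "s3"},
    {"document": "a", "summary": "s4"},
]
picked = take_short_docs(
    toy_corpus, lambda text: len(text.split()), cutoff=3, limit=2
)
print([ex["summary"] for ex in picked])
```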


In [11]:
# Build a Hugging Face Dataset from that Python list
dataset = Dataset.from_list(selected)
dataset

Dataset({
    features: ['document', 'summary'],
    num_rows: 1000
})

In [12]:
# Turn each document into a formatted “text” string
def to_grpo_input(ex):
    # Build the `<SYSTEM> + <USER>` prompt, then append "<REASONING>" so model knows to start thinking.
    messages = [
        {"role": "system",  "content": system_prompt},
        {"role": "user",    "content": ex["document"]}
    ]
    text_str = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True  # this injects "<REASONING>" at the end
    )
    full_ids = tokenizer(text_str, truncation=False)["input_ids"]

    return {
        "text":         text_str,
        "gold_summary": ex["summary"],
        "full_len": len(full_ids)
    }

dataset = dataset.map(to_grpo_input)


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [13]:
# Check token‐length percentiles on the newly‐built “text” field
full_lens = np.array(dataset["full_len"])

print("✔ Final “text” lengths 50/90/99 pct:", np.percentile(full_lens, [50,90,99]))

✔ Final “text” lengths 50/90/99 pct: [244.  335.1 353. ]


In [14]:
# Remove every column except the needed columns
rl_dataset = dataset.remove_columns([c for c in dataset.column_names
                                    if c not in ("text","gold_summary")])

rl_dataset

Dataset({
    features: ['text', 'gold_summary'],
    num_rows: 1000
})

In [15]:
# Sanity check
print("✅ Ready for GRPO")

print("columns now:", rl_dataset.column_names)

for i in range(3):
    print(f"\n─ Example {i} ─")
    print("text :", repr(rl_dataset[i]["text"]))
    print("gold :", repr(rl_dataset[i]["gold_summary"]))

✅ Ready for GRPO
columns now: ['text', 'gold_summary']

─ Example 0 ─
text : 'You are given a problem.\nThink over it and describe your step‐by‐step reasoning.\nEnclose reasoning between <REASONING> and </REASONING>.\nFinally, give your answer between <SOLUTION> and </SOLUTION><|end_of_text|>A fire alarm went off at the Holiday Inn in Hope Street at about 04:20 BST on Saturday and guests were asked to leave the hotel.\nAs they gathered outside they saw the two buses, parked side-by-side in the car park, engulfed by flames.\nOne of the tour groups is from Germany, the other from China and Taiwan. It was their first night in Northern Ireland.\nThe driver of one of the buses said many of the passengers had left personal belongings on board and these had been destroyed.\nBoth groups have organised replacement coaches and will begin their tour of the north coast later than they had planned.\nPolice have appealed for information about the attack.\nInsp David Gibson said: "It appears as thoug

In [16]:
import re

In [17]:
# Build an “</SOLUTION> + optional EOS/whitespace” pattern
solution_end_regex = (
    r"</SOLUTION>"
  + r"[\s]*"
  + "(?:" + re.escape(tokenizer.eos_token) + ")?"
)

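# Build a single regex that matches “</REASONING> … <SOLUTION> … </SOLUTION>” at the end.
# re.MULTILINE is deliberately not set: with it, `$` would also match at the end of any
# line, so the pattern could succeed in the middle of a multi-line completion.
match_full_format = re.compile(
    rf"{re.escape(reasoning_end)}.*?"
    rf"{re.escape(solution_start)}(.+?){solution_end_regex}"
    rf"[\s]*$",
    flags = re.DOTALL
)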

In [18]:
# Reward func for the format
def reward_format(completions, **kwargs):
    """
    Returns one float per generated completion:
      • +3.0  if the completion ends with “</REASONING>…<SOLUTION>…</SOLUTION>”
               (optionally followed by whitespace and the EOS token). Tag counts are
               not checked here, so a completion with repeated tags that still ends
               with this pattern also gets +3.0.
      • Otherwise, award:
          +0.5 if exactly one </REASONING>, else −0.5
          +0.5 if exactly one <SOLUTION>,   else −0.5
          +0.5 if exactly one </SOLUTION>,  else −0.5
        (so the total is in [−1.5 … +1.5]).
      • If no tags appear at all (all counts == 0), return −3.0 instead.
    """
    scores = []
    for c in completions:
        resp = c[0]["content"]

        # Perfect final pattern → +3.0
        if match_full_format.search(resp):
            scores.append(3.0)
            continue

        # Count appearances of each tag
        cnt_rend  = resp.count(reasoning_end)
        cnt_sst   = resp.count(solution_start)
        cnt_send  = resp.count(solution_end)

        # Build a partial score (−0.5 or +0.5) per tag
        score = 0.0
        score +=  0.5 if cnt_rend  == 1 else -0.5
        score +=  0.5 if cnt_sst   == 1 else -0.5
        score +=  0.5 if cnt_send  == 1 else -0.5

        # If none of the tags appear → heavy penalty
        if cnt_rend == 0 and cnt_sst == 0 and cnt_send == 0:
            scores.append(-3.0)
            continue

        scores.append(score)
    return scores

In [19]:
# Quick checks for reward_format, using the tag strings and match_full_format defined above.

# 1) Perfect formatting at the very end → should yield +3.0
perfect = (
    "Some text before…"
    "</REASONING>Here comes the answer<SOLUTION>42</SOLUTION>"
)  # no characters after </SOLUTION>
print(reward_format([[{"content": perfect}]])[0])  # → 3.0

# 2) All three tags appear once, but not at the very end:
almost = "abc </REASONING> … <SOLUTION>foo</SOLUTION> xyz"
print(reward_format([[{"content": almost}]])[0])
#   cnt_rend=1, cnt_sst=1, cnt_send=1 → +0.5 +0.5 +0.5 = +1.5

# 3) All three tags appear once AND the pattern ends the string → full match
two_tags = "… </REASONING> Hello <SOLUTION>foo</SOLUTION>"
#  The string ends with </SOLUTION>, so match_full_format succeeds
#  and the per-tag counting is never reached.
print(reward_format([[{"content": two_tags}]])[0])  # → 3.0

# 4) Only one correct tag, the others absent or duplicated
one_tag = "… only </REASONING> is here"
#  cnt_rend=1 (+0.5), cnt_sst=0 (–0.5), cnt_send=0 (–0.5) → total = –0.5
print(reward_format([[{"content": one_tag}]])[0])  # → –0.5

# 5) No tags at all → immediate –3.0
none = "No tags anywhere in this response"
print(reward_format([[{"content": none}]])[0])  # → –3.0

# 6) Duplicated tags (cnt_rend=2, cnt_sst=2, cnt_send=3): the per-tag counts would sum to –1.5,
#    but the string still ends with “</REASONING> … <SOLUTION> … </SOLUTION>” and the regex
#    does not check tag counts, so the full-format branch fires first.
dup_tags = "</REASONING></REASONING> <SOLUTION></SOLUTION> <SOLUTION></SOLUTION> </SOLUTION>"
print(reward_format([[{"content": dup_tags}]])[0])  # → 3.0

3.0
1.5
3.0
-0.5
-3.0
3.0
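Cases 3 and 6 show that the full-format regex rewards any completion that ends with the tag pattern, even when tags are duplicated. If the intent is "each tag exactly once", one possible stricter check is sketched below. It is not the notebook's reward function: `is_strictly_formatted` is a hypothetical helper, and the EOS string is assumed to be `<|end_of_text|>`.

```python
import re

R_END, S_START, S_END = "</REASONING>", "<SOLUTION>", "</SOLUTION>"
EOS = "<|end_of_text|>"  # assumption: the tokenizer's EOS string

# Anchored with \Z and without re.MULTILINE, so the pattern must end the whole string.
strict_tail = re.compile(
    re.escape(R_END) + r".*?" + re.escape(S_START) + r"(.+?)" + re.escape(S_END)
    + r"\s*(?:" + re.escape(EOS) + r")?\s*\Z",
    flags=re.DOTALL,
)

def is_strictly_formatted(resp: str) -> bool:
    """True only if each tag occurs exactly once and the pattern ends the string."""
    once = all(resp.count(tag) == 1 for tag in (R_END, S_START, S_END))
    return once and strict_tail.search(resp) is not None

dup_tags = "</REASONING></REASONING> <SOLUTION></SOLUTION> <SOLUTION></SOLUTION> </SOLUTION>"
print(is_strictly_formatted("x</REASONING><SOLUTION>42</SOLUTION>"))
print(is_strictly_formatted(dup_tags))
```

A reward function could award the +3.0 only when this check passes and fall back to the per-tag scoring otherwise; whether that stricter signal trains better is an open design choice.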


In [23]:
!pip install evaluate rouge_score

Collecting rouge_score
  Downloading rouge_score-0.1.2.tar.gz (17 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: rouge_score
  Building wheel for rouge_score (setup.py) ... done
  Created wheel for rouge_score: filename=rouge_score-0.1.2-py3-none-any.whl size=24934 sha256=18d3ee18e35158cfc760913acd229eb12fa91bceed46d2dfbb2ee497c2eef05a
  Stored in directory: /root/.cache/pip/wheels/1e/19/43/8a442dc83660ca25e163e1bd1f89919284ab0d0c1475475148
Successfully built rouge_score
Installing collected packages: rouge_score
Successfully installed rouge_score-0.1.2


In [24]:
from evaluate import load

rouge = load("rouge")

In [33]:
# Capture everything between <SOLUTION> and </SOLUTION>
solution_regex = re.compile(
    rf"{re.escape(solution_start)}(.*?){re.escape(solution_end)}",
    flags = re.DOTALL | re.MULTILINE
)

In [34]:
# Extract text from completion
def extract_solution_text(raw: str) -> str:
    m = solution_regex.search(raw)
    if m:
        return m.group(1).strip()
    return raw.strip()

In [35]:
# Define a soft length‐reward function
def length_reward(generated: str, reference: str) -> float:
    len_gen = len(generated.split())
    len_ref = len(reference.split())
    dev = abs(len_gen - len_ref) / max(1, len_ref)
    return max(0.0, 1.0 - dev)
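A few hand-checkable values make the shape of this reward concrete. The function is repeated so the snippet runs on its own; the input strings are made up for illustration.

```python
def length_reward(generated: str, reference: str) -> float:
    len_gen = len(generated.split())
    len_ref = len(reference.split())
    dev = abs(len_gen - len_ref) / max(1, len_ref)
    return max(0.0, 1.0 - dev)

# Same word count: no deviation, full reward.
same = length_reward("The cat sat on the mat.", "The cat sat on the mat.")
# 5 words vs 7 words: deviation 2/7, reward 5/7.
short = length_reward("The cat sat on mat.", "The cat sat on the mat today.")
# Twice as long as the reference: deviation 1.0, reward clipped at 0.
double = length_reward("a b c d e f g h", "a b c d")

print(same, round(short, 3), double)
```

The reward falls linearly with the relative word-count gap and reaches zero once the generation is at least twice as long as the reference (or empty).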

In [39]:
# Build the reward_content function
ROUGE_WEIGHT  = 8.0   # maximum points from ROUGE-2
LENGTH_WEIGHT = 1.0   # maximum points from length

def reward_content(prompts, completions, gold_summary, **kwargs):
    """
    Returns one float per generated completion. Steps:
      a) Extract “gen_summary” = text inside <SOLUTION>…</SOLUTION>, else the raw completion.
      b) Compute ROUGE-2 F1(gen_summary, gold_summary[i]) → in [0.0, 1.0].
      c) Compute length_reward(gen_summary, gold_summary[i]) → in [0.0, 1.0].
      d) Final content score = (rouge2_f1 * ROUGE_WEIGHT) + (length_reward * LENGTH_WEIGHT),
         which lies in [0.0 … ROUGE_WEIGHT + LENGTH_WEIGHT], i.e. [0.0 … 9.0] with the weights above.
    """
    scores = []
    for completion, reference in zip(completions, gold_summary):
        raw = completion[0]["content"]
        gen_summary = extract_solution_text(raw)

        # 1) ROUGE-2 F1
        r = rouge.compute(
            predictions = [gen_summary],
            references  = [reference],
            rouge_types = ["rouge2"],
            use_stemmer = True
        )
        rouge2_f1 = r["rouge2"]

        # 2) Length penalty
        lp = length_reward(gen_summary, reference)

        # 3) Scale and sum
        content_score = (rouge2_f1 * ROUGE_WEIGHT) + (lp * LENGTH_WEIGHT)
        scores.append(content_score)

    return scores
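To build intuition for the ROUGE-2 term, here is a simplified bigram-overlap F1 in plain Python. It is only a sketch: it lowercases and splits on alphanumerics without stemming, so it will not reproduce the `rouge_score` library's numbers in general, and `bigram_f1` is a hypothetical name.

```python
import re
from collections import Counter

def bigram_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-2-style F1: lowercase word bigrams, no stemming."""
    def bigrams(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return Counter(zip(tokens, tokens[1:]))

    pred, ref = bigrams(prediction), bigrams(reference)
    overlap = sum((pred & ref).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# 3 shared bigrams out of 4 predicted and 6 reference bigrams.
f1 = bigram_f1("The cat sat on mat.", "The cat sat on the mat today.")
print(round(f1, 3))
```

Because the score is bigram-based, dropping a single word ("the") breaks two reference bigrams at once, which is why short omissions cost more than they would under a unigram metric.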

In [55]:
# ── Step 3: construct a few toy “(completion, gold_summary)” pairs ─────────────────

# 3.1 “Perfect match” → expect ~ ROUGE-2=1.0 and length_penalty=1.0 → score ≈ 9.0
gold_1 = "The cat sat on the mat."
completion_1 = (
    "<REASONING>irrelevant reasoning…</REASONING>"
    f"{solution_start}The cat sat on the mat.{solution_end}"
)
# wrap it into the “completions” data structure that GRPOTrainer will use:
# each completion is a list of one dict { "content": ... }
completions_1 = [[{"content": completion_1}]]

# 3.2 “Partial match” (gold is a bit different) → expect lower ROUGE‐2, lower length_penalty
gold_2 = "The cat sat on the mat today."
completion_2 = (
    "<REASONING>some reasoning…</REASONING>"
    f"{solution_start}The cat sat on mat.{solution_end}"
)
completions_2 = [[{"content": completion_2}]]

# 3.3 “No <SOLUTION> tags at all” → function falls back to raw text
gold_3 = rl_dataset[0]["gold_summary"]
completion_3 = "lazy dog."
# Wrap the raw string in the same list-of-dicts structure as the other cases.
# (Assigning the prompt string rl_dataset[0]["text"] here instead makes reward_content
#  raise "TypeError: string indices must be integers, not 'str'".)
completions_3 = [[{"content": completion_3}]]

# Pull all pairs into lists so we can call reward_content in one shot:
all_completions = [completions_1[0], completions_2[0], completions_3[0]]
all_gold        = [gold_1,          gold_2,          gold_3]

In [56]:
# ── Step 4: call reward_content(...) and print the numbers ────────────────────────────

scores = reward_content(
    prompts      = [None, None, None],  # GRPOTrainer also passes “prompts”, but reward_content does not use them
    completions  = all_completions,
    gold_summary = all_gold
)

print("Scores for each test case:")
for i, sc in enumerate(scores, 1):
    print(f"  Case {i}:  {sc:.4f}")

In [58]:

# Let’s build three toy “model completions” (all “perfect” for now):
toy_completions = []
toy_gold       = []

for i in range(3):
    prompt_text  = rl_dataset[i]["text"]
    gold_summary = rl_dataset[i]["gold_summary"]
    toy_gold.append(gold_summary)

    # We assume the model “thinks nothing” (i.e. immediate </REASONING>),
    # then writes out exactly the gold summary inside <SOLUTION>…</SOLUTION>
    perfect_response = f"{reasoning_end}{solution_start}{gold_summary}{solution_end}"
    # Wrap in the GRPOTrainer format: each “completion” is a list of one dict
    toy_completions.append([{"content": perfect_response}])


# 1) Test reward_format
fmt_scores = reward_format(toy_completions)
print("\nReward_format on ‘perfect’ completions -->", fmt_scores)
#    Expect [3.0, 3.0, 3.0] because each ends in “</REASONING>…<SOLUTION>…</SOLUTION>”

# 2) Test reward_content
content_scores = reward_content(
    prompts     = [None]*3,       # GRPOTrainer will supply these, but we ignore inside reward_content
    completions = toy_completions,
    gold_summary = toy_gold
)
print("Reward_content on ‘perfect’ completions →", [f"{s:.3f}" for s in content_scores])
#    Expect each ≈ (1.0*8.0) + (1.0*1.0) = 9.0


# ——————————————————————————————————————————————————————————————————————
# Now try “broken” versions to see the drop‐off. For example:
broken_completions = []

# Example 0: Missing </SOLUTION> tag altogether
broken_completions.append([{"content": f"{reasoning_end}{solution_start}{toy_gold[0]}"}])

# Example 1: Has two </SOLUTION> tags (over‐tagged)
broken_completions.append([{"content": f"{reasoning_end}{solution_start}{toy_gold[1]}{solution_end}{solution_end}"}])

# Example 2: Tags appear in wrong order: <SOLUTION> appears before </REASONING>
broken_completions.append([{"content": f"{solution_start}{toy_gold[2]}{solution_end}{reasoning_end}"}])

print("\nBroken‐format returns:")
print("  fmt_scores:", reward_format(broken_completions))

# For content, extract_solution_text takes the first complete <SOLUTION>…</SOLUTION> pair.
# If there is none (example 0), it falls back to the raw completion, tags included,
# so ROUGE is computed on that text.
print("  content_scores:", [f"{s:.3f}" for s in reward_content(
    prompts       = [None]*3,
    completions   = broken_completions,
    gold_summary  = toy_gold
)])


Reward_format on ‘perfect’ completions --> [3.0, 3.0, 3.0]
Reward_content on ‘perfect’ completions → ['9.000', '9.000', '9.000']

Broken‐format returns:
  fmt_scores: [0.5, 3.0, 1.5]
  content_scores: ['8.529', '9.000', '9.000']
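Both reward functions index completions as `c[0]["content"]`, which matches the list-of-message-dicts shape used in these toy tests. The prompts built earlier are plain strings, though, and for plain-string prompts the trainer is expected to pass completions as plain strings too; treat that as an assumption to verify against the TRL version in use. A small hypothetical helper that accepts either shape:

```python
def completion_text(completion) -> str:
    """Return the generated text from either completion shape."""
    if isinstance(completion, str):
        # Standard (non-conversational) format: the completion is the text itself.
        return completion
    # Conversational format: a list of message dicts; take the first message.
    return completion[0]["content"]

as_messages = [{"content": "</REASONING><SOLUTION>42</SOLUTION>"}]
as_string = "</REASONING><SOLUTION>42</SOLUTION>"
print(completion_text(as_messages) == completion_text(as_string))
```

Calling such a helper at the top of each reward function, in place of the direct indexing, would keep the toy tests and a string-based training run on the same code path.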
