<a href="https://colab.research.google.com/github/ashivashankars/CMPE255_Assignments/blob/main/4_Unsloth's_grpo_reasoning_toy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Colab 4 — GRPO Reasoning (DeepSeek‑style) with Unsloth

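**Goal:** Run a few short **GRPO**-style reasoning steps on a tiny math/logic dataset.

## Overview
Teach the model to **show its work** on reasoning questions. GRPO (Group Relative Policy Optimization) samples several completions per prompt, scores each one with a reward, and reinforces those that score above their group's average, which favors better chains‑of‑thought. The dataset is tiny and defined inline, so the notebook needs no external data files.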
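
To make the scoring idea concrete, here is a minimal sketch of the group-relative step, assuming several completions were sampled for one prompt and each already has a scalar reward. The function name `group_relative_advantages` and the `eps` constant are illustrative choices, not part of Unsloth or TRL, and real implementations may normalize slightly differently.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps (an assumed constant) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four hypothetical completions for one question: two correct, two wrong.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Completions that beat the group mean get a positive advantage and the rest get a negative one. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.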

In [None]:
# %%capture
!pip -q install --upgrade pip
# Core libs
!pip -q install "unsloth>=2025.10.0" "transformers>=4.45.0" "datasets>=2.19.0" "accelerate>=1.0.0" "trl>=0.9.6" "peft>=0.13.0" "bitsandbytes>=0.44.0" "evaluate>=0.4.3" "scikit-learn>=1.5.0"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.
pylibcudf-cu12 25.6.0 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 22.0.0 which is incompatible.

In [None]:
import os, random, numpy as np, torch, platform
from datetime import datetime
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED);
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)
print("Timestamp:", datetime.now())
print("Python:", platform.python_version())
print("Torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    print("Capability:", torch.cuda.get_device_capability(0))
else:
    print("⚠️ GPU not found. Colab > Runtime > Change runtime type > GPU is recommended.")

Timestamp: 2025-11-09 05:40:39.516658
Python: 3.12.12
Torch: 2.8.0+cu126
CUDA available: True
Device: NVIDIA A100-SXM4-40GB
Capability: (8, 0)


## Tiny inline reasoning dataset

In [None]:
import json, os
reasoning = [
  {"question":"If a pen costs $2 and a notebook costs twice as much, what is the total cost?", "answer":"$6", "rationale":"Notebook costs $4 (twice of $2). Total $2+$4=$6."},
  {"question":"What is 15% of 80?", "answer":"12", "rationale":"0.15*80=12."},
  {"question":"A train travels 60 km in 1.5 hours. What is its average speed?", "answer":"40 km/h", "rationale":"Speed=Distance/Time=60/1.5=40."},
]
os.makedirs("data", exist_ok=True)
with open("data/grpo_tiny.jsonl","w") as f:
    for r in reasoning: f.write(json.dumps(r)+"\n")
len(reasoning)

3

## Load model (QLoRA)

In [None]:
from unsloth import FastLanguageModel
import torch
dtype = torch.bfloat16 if torch.cuda.is_available() else None
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b-unsloth-bnb-4bit",
    max_seq_length=2048,
    dtype=dtype,
    load_in_4bit=True,
)
# Explicitly set the chat template for Llama-3.1
tokenizer.chat_template = "<|begin_of_text|>{% for message in messages %}{% if message['role'] == 'user' %}<|start_header_id|>user<|end_header_id|>\n{{ message['content'] | trim }}<|eot_id|>{% elif message['role'] == 'assistant' %}<|start_header_id|>assistant<|end_header_id|>\n{{ message['content'] | trim }}<|eot_id|>{% elif message['role'] == 'system' %}<|start_header_id|>system<|end_header_id|>\n{{ message['content'] | trim }}<|eot_id|>{% endif %}{% endfor %}{% if add_generation_prompt %}<|start_header_id|>assistant<|end_header_id|>\n{% endif %}"
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)

==((====))==  Unsloth 2025.11.2: Fast Llama patching. Transformers: 4.57.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
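
As a rough sanity check on the adapter size, the LoRA parameter count can be estimated by hand: each adapted linear layer with `fan_in` inputs and `fan_out` outputs gains `r * (fan_in + fan_out)` weights. The sketch below assumes the published Llama‑3.1‑8B dimensions (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 layers). These figures are assumptions, not values read from the loaded model, so treat the result as an estimate.

```python
R = 16  # LoRA rank used in get_peft_model above

# Assumed Llama-3.1-8B dimensions (not read from the loaded model)
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32

# (fan_in, fan_out) for each targeted projection in one transformer layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(shapes, r, layers):
    """Sum r * (fan_in + fan_out) over all targeted modules and layers."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
    return per_layer * layers

total = lora_param_count(SHAPES, R, LAYERS)
print(f"{total:,} LoRA parameters")
```

With these assumed dimensions the estimate comes to about 42 million adapter weights, a small fraction of the 8B base model.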


## Build prompts for a simple GRPO-style reward (exact match on final answer)

In [None]:
import json, torch, random
from datasets import Dataset
ds = Dataset.from_list(reasoning)

# Build the chat prompt for each example. The template interpolates ex['answer'],
# so the gold answer is visible to the model inside the prompt.
def format_messages(ex):
    prompt = f"Solve step by step, then give final answer on a new line as 'Final: {ex['answer']}'.\nQuestion: {ex['question']}"
    return [{"role":"user","content":prompt}]
ds = ds.map(lambda ex: {"messages": format_messages(ex)})
ds

Dataset({
    features: ['question', 'answer', 'rationale', 'messages'],
    num_rows: 3
})
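
An exact-match reward on the final answer can be written as a small pure function. The sketch below is one possible version, assuming the completion text (without the prompt) is passed in and that the answer follows the last `Final:` marker. `extract_final` and `exact_match_reward` are hypothetical helper names, not functions defined elsewhere in this notebook.

```python
def extract_final(completion):
    """Return the first line after the last 'Final:' marker, or None if absent."""
    if "Final:" not in completion:
        return None
    tail = completion.rsplit("Final:", 1)[1].strip()
    lines = tail.splitlines()
    return lines[0].strip() if lines else ""

def exact_match_reward(completion, gold):
    """1.0 when the extracted final answer equals the gold answer, else 0.0."""
    predicted = extract_final(completion)
    return 1.0 if predicted is not None and predicted == gold.strip() else 0.0

print(exact_match_reward("0.15 * 80 = 12\nFinal: 12", "12"))
print(exact_match_reward("The answer is 12.", "12"))
```

Scoring only the completion matters here: a prompt that already contains a `Final:` line would otherwise satisfy the check by itself.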

## GRPO training loop (minimal toy)

In [None]:
import torch

# Unsloth's inference mode speeds up generate(); this toy loop only samples and scores.
FastLanguageModel.for_inference(model)
optim = torch.optim.AdamW(model.parameters(), lr=5e-6)

def rollout_and_reward(messages, answer):
    x = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
    y = model.generate(x, max_new_tokens=96, do_sample=True, temperature=0.8, top_p=0.95, output_scores=True, return_dict_in_generate=True)
    text = tokenizer.decode(y.sequences[0], skip_special_tokens=True)
    # Score only the newly generated tokens; the prompt itself contains a 'Final:' line.
    completion = tokenizer.decode(y.sequences[0][x.shape[1]:], skip_special_tokens=True)
    predicted = completion.split("Final:")[-1].strip()
    reward = 1.0 if "Final:" in completion and predicted.startswith(answer.strip()) else 0.0
    return y, reward, text

for step, ex in enumerate(ds.shuffle(seed=42).select(range(len(ds)))):
    y, reward, text = rollout_and_reward(ex["messages"], ex["answer"])
    # Placeholder loss, not real GRPO: a constant tensor that is disconnected from the model,
    # so backward() yields no parameter gradients and optim.step() leaves the weights unchanged.
    loss = (1.0 - reward) * 0.1
    loss = torch.tensor(loss, requires_grad=True, device=model.device)
    loss.backward()
    if (step+1)%2==0:
        optim.step(); optim.zero_grad()
    print(f"step {step} reward={reward:.2f} sample:\n{text[:180]}...\n")
print("Toy GRPO loop done (for full GRPO use the official tutorial).")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


step 0 reward=0.00 sample:
user
Solve step by step, then give final answer on a new line as 'Final: 40 km/h'.
Question: A train travels 60 km in 1.5 hours. What is its average speed?assistant
Average speed =...

step 1 reward=1.00 sample:
user
Solve step by step, then give final answer on a new line as 'Final: 12'.
Question: What is 15% of 80?assistant
What is 25% of 80?िरफ
What is 10% of 80?िरफ
What is 50% of 80?िर...

step 2 reward=1.00 sample:
user
Solve step by step, then give final answer on a new line as 'Final: $6'.
Question: If a pen costs $2 and a notebook costs twice as much, what is the total cost?assistant
Pleas...

Toy GRPO loop done (for full GRPO use the official tutorial).
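
For contrast with the toy loop, the sketch below shows the shape of the clipped objective that a real GRPO step would minimize. It is a simplified illustration under stated assumptions: it works on one summed log-probability per completion rather than per token, it omits the KL penalty against a reference model, and the name `clipped_surrogate_loss` and the `clip_eps` default are hypothetical.

```python
import math

def clipped_surrogate_loss(new_logps, old_logps, advantages, clip_eps=0.2):
    """PPO-style clipped objective averaged over one group (no KL term)."""
    terms = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio new / old policy
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the more pessimistic of the unclipped and clipped terms
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)  # negate: optimizers minimize

# Two hypothetical completions: the first was rewarded, the second was not.
old = [-1.0, -2.0]
new = [-0.9, -2.0]  # the policy now assigns more probability to the first
print(clipped_surrogate_loss(new, old, advantages=[1.0, -1.0]))
```

In an actual training step the log-probabilities would be tensors computed by the model in training mode, so the loss carries gradients back to the LoRA weights.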


## Quick check

In [None]:
def ask(q):
    msgs=[{"role":"user","content":f"Solve step by step and end with 'Final: X'.\nQuestion: {q}"}]
    x = tokenizer.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
    y = model.generate(x, max_new_tokens=120, do_sample=False)
    print(tokenizer.decode(y[0], skip_special_tokens=True))
ask("What is 20% of 90?")

user
Solve step by step and end with 'Final: X'.
Question: What is 20% of 90?assistant
Step-by-step explanation:
20% of 90 is 18.
Step-by-step explanation:
20% of 90 is 18.
Step-by-step explanation:
20% of 90 is 18.
Step-by-step explanation:
20% of 90 is 18.
Step-by-step explanation:
20% of 90 is 18.
Step-by-step explanation:
20% of 90 is 18.
Step-by-step explanation:
20% of 90 is 18.
Step-by-step explanation:
20% of 90 is 18.
Step-by-step explanation:
20% of


## Save adapter

In [None]:
model.save_pretrained("llama31_grpo_toy_adapter")
tokenizer.save_pretrained("llama31_grpo_toy_adapter")
print("Saved to llama31_grpo_toy_adapter")

Saved to llama31_grpo_toy_adapter
