<a href="https://colab.research.google.com/github/abdulsamadkhan/Reasoning/blob/main/GRPO%20with%20Llama.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GRPO with Llama

# 1. Installing Libraries

## **1.1 Installing `unsloth`**
- `unsloth` is a library optimized for fine-tuning large language models (LLMs).
- It focuses on efficiency, allowing fine-tuning on consumer GPUs and cloud environments.
- Useful for developers working on custom AI models.

## **1.2 Installing `vllm`**
- `vllm` is a high-performance inference engine for LLMs.
- It optimizes memory usage and speeds up model execution using parallelization techniques.
- Beneficial for serving LLMs in production environments.





In [None]:
!pip install unsloth vllm
!pip install --upgrade pillow

Collecting unsloth
  Downloading unsloth-2025.3.14-py3-none-any.whl.metadata (59 kB)
     59.3/59.3 kB 3.2 MB/s eta 0:00:00
Collecting vllm
  Downloading vllm-0.7.3-cp38-abi3-manylinux1_x86_64.whl.metadata (25 kB)
Collecting unsloth_zoo>=2025.3.11 (from unsloth)
  Downloading unsloth_zoo-2025.3.12-py3-none-any.whl.metadata (17 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.17-py3-none-any.whl.metadata (9.5 kB)
Collecting datasets>=2.16.0 (from unsloth)
  Downloading datasets-3.4.0-py3-none-any.whl.metadata (19 kB)
C



## Setting up Unsloth

In [None]:
from unsloth import FastLanguageModel

Unsloth: Patching Xformers to fix some performance issues.
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


Now, let’s load the Llama 3.1 8B Instruct model and configure it for fine-tuning:

---

The code below loads the model with 4-bit quantization to save memory and applies LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning. The `target_modules` parameter specifies which projection layers receive LoRA adapters, and `use_gradient_checkpointing="unsloth"` reduces activation memory so that longer contexts fit during training.

---

In [None]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # False for LoRA 16bit
    fast_inference=True,  # Enable vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.6,  # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha=lora_rank,
    use_gradient_checkpointing="unsloth",  # Enable long context finetuning
    random_state=3407,
)

INFO 03-16 04:56:57 __init__.py:207] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.14: Fast Llama patching. Transformers: 4.48.3. vLLM: 0.7.3.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.31%
Unsloth: Your GPU has CUDA compute capability 8.0 with VRAM = 39.56 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 288.
Unsloth: vLLM's KV Cache can use up to 17.29 GB. Also swap space = 6 GB.
INFO 03-16 04:57:15 config.py:549] This model supports multiple tasks: {'embed', 'classify', 're

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

INFO 03-16 04:57:21 cuda.py:229] Using Flash Attention backend.
INFO 03-16 04:57:21 model_runner.py:1110] Starting to load model unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit...
INFO 03-16 04:57:21 loader.py:1089] Loading weights with BitsAndBytes quantization.  May take a while ...
INFO 03-16 04:57:23 weight_utils.py:254] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

INFO 03-16 04:57:40 weight_utils.py:270] Time spent downloading weights for unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit: 17.471849 seconds


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 03-16 04:57:45 model_runner.py:1115] Loading model weights took 5.5976 GB
INFO 03-16 04:57:45 punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-16 04:57:54 worker.py:267] Memory profiling takes 8.46 seconds
INFO 03-16 04:57:54 worker.py:267] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.59) = 23.46GiB
INFO 03-16 04:57:54 worker.py:267] model weights take 5.60GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.34GiB; the rest of the memory reserved for KV Cache is 16.44GiB.
INFO 03-16 04:57:54 executor_base.py:111] # cuda blocks: 8415, # CPU blocks: 3072
INFO 03-16 04:57:54 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 131.48x
INFO 03-16 04:57:59 model_runner.py:1434] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error 

Capturing CUDA graph shapes: 100%|██████████| 39/39 [00:49<00:00,  1.28s/it]

INFO 03-16 04:58:49 model_runner.py:1562] Graph capturing finished in 50 secs, took 0.87 GiB
INFO 03-16 04:58:49 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 63.52 seconds






Unsloth 2025.3.14 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
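
To make the effect of `lora_rank = 32` over the seven target modules concrete, the sketch below counts the parameters LoRA adds. The layer dimensions are assumptions taken from the published Llama 3.1 8B architecture (hidden size 4096, MLP size 14336, 8 key/value heads of size 128, 32 layers); they are not stated in this notebook, and `lora_params` is a hypothetical helper used only for illustration.

```python
# Assumed Llama 3.1 8B dimensions (not stated in this notebook).
HIDDEN = 4096          # model (residual stream) width
INTERMEDIATE = 14336   # MLP inner width
KV_DIM = 8 * 128       # 8 key/value heads of size 128 (grouped-query attention)
NUM_LAYERS = 32
LORA_RANK = 32         # matches lora_rank in the cell above

# (in_features, out_features) of each targeted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    # LoRA adds two matrices: A (rank x in_features) and B (out_features x rank)
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Under these assumptions the total is 83,886,080, which agrees with the `Trainable parameters` figure that Unsloth prints when training starts.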


## Data Preparation
First, we will define the format of the prompts and answers:



In [None]:
# Define the system prompt that instructs the model to use a specific format
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

Now, let’s prepare the dataset:

In [None]:
import re
from datasets import load_dataset, Dataset


# Helper functions to extract answers from different formats
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Function to prepare the GSM8K dataset
def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset("openai/gsm8k", "main")[split]
    data = data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )
    return data


dataset = get_gsm8k_questions()

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]
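
As a quick sanity check of the two helper functions, the sketch below runs them on small made-up strings (the sample texts are hypothetical and not taken from GSM8K). It also shows that `extract_xml_answer` returns the whole text when no `<answer>` tags are present, which is why the `Extracted:` field in the training log sometimes repeats the full response.

```python
from typing import Optional


def extract_xml_answer(text: str) -> str:
    # Keep whatever sits between the last <answer> and the next </answer>
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_hash_answer(text: str) -> Optional[str]:
    # GSM8K reference solutions end with "#### <final answer>"
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Hypothetical sample strings for illustration only
reference = "2 apples plus 3 apples is 5 apples.\n#### 5"
well_formed = "<reasoning>\n2 + 3 = 5\n</reasoning>\n<answer>\n5\n</answer>\n"
untagged = "The answer is 5."

gold = extract_hash_answer(reference)
tagged_answer = extract_xml_answer(well_formed)
fallback = extract_xml_answer(untagged)
print(gold, tagged_answer, fallback, sep=" | ")
```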

In [None]:
dataset['prompt'][2]

[{'content': '\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n',
  'role': 'system'},
 {'content': 'Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?',
  'role': 'user'}]

In [None]:
# Define the prompt (a list of chat messages)
prompt = dataset["prompt"][2]


# Concatenate the raw message contents (note: this bypasses the chat template)
text = "".join([d["content"] for d in prompt])

# Tokenize the input using the extracted text
inputs = tokenizer(text, return_tensors="pt").to("cuda")  # Move to GPU

# Generate a response
with torch.no_grad():  # No gradients needed for inference
    output_ids = model.generate(**inputs, max_length=256, temperature=0.7, top_p=0.9)

# Decode the output
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(response)


Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet? 

<reasoning>
Betty has half of the money she needs, which is $50. Her parents give her $15, and her grandparents give her twice as much as her parents, which is $30. So, Betty has $50 + $15 + $30 = $95. Betty needs $100 - $95 = $5 more.
</reasonition>
<answer>
$5
</answer> 

This problem is a good example of a real-world application of basic arithmetic operations. It requires the student to apply the concepts of addition and subtraction to solve a practical problem. The problem also requires the student to understand the concept of half and twice as much, which is an important aspect of fractions and ratios. 

This problem is appropriate for 4t

## Defining Reward Functions

We define five reward functions:
* **correctness_reward_func**: rewards the model when its answer matches the correct answer
* **int_reward_func**: rewards the model for providing a numeric answer
* **strict_format_reward_func** and **soft_format_reward_func**: reward the model for following the specified format
* **xmlcount_reward_func**: rewards proper XML tag usage and penalizes extra content after the closing tags

In [None]:

# Reward function that checks if the answer is correct
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    q = prompts[0][-1]["content"]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print(
        "-" * 20,
        f"Question:\n{q}",
        f"\nAnswer:\n{answer[0]}",
        f"\nResponse:\n{responses[0]}",
        f"\nExtracted:\n{extracted_responses[0]}",
    )
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]


# Reward function that checks if the answer is an integer
def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]


# Reward function that checks if the completion follows the strict format
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


# Reward function that checks if the completion follows a more relaxed format
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match the newlines between the opening and closing tags
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


# Reward function that counts XML tags and penalizes extra content
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
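
To see how the tag-counting reward behaves, the sketch below repeats `count_xml` as defined above and scores three made-up completions (the sample strings are hypothetical). A cleanly formatted completion earns the full 0.5, while text after the closing tag is penalized per character; these penalty terms are the only way this reward can drop below zero.

```python
def count_xml(text: str) -> float:
    # Same logic as the reward helper above, repeated so this cell runs on its own
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


# Hypothetical completions for illustration only
clean = "<reasoning>\n2 + 3 = 5\n</reasoning>\n<answer>\n5\n</answer>\n"
trailing = clean + "extra words"   # 11 characters after the closing tag
untagged = "The answer is 5."

clean_score = count_xml(clean)
trailing_score = count_xml(trailing)
untagged_score = count_xml(untagged)
print(clean_score, round(trailing_score, 3), untagged_score)
```

The trailing text is charged twice, once by each of the last two checks, so 11 extra characters cost 0.022 in total.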

## Training with GRPO


`GRPOConfig` sets the hyperparameters for training. The most important ones are:

* `use_vllm`: enables fast generation with vLLM (not set explicitly in the configuration below)
* `learning_rate`: controls how quickly the model learns; this is the peak value reached after warmup
* `num_generations`: number of completions sampled for each prompt
* `max_steps`: total number of training steps to perform
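
To give a feel for how the learning rate evolves over the run, here is a minimal sketch of a cosine schedule with linear warmup. It assumes the formula used by the Hugging Face cosine scheduler and that `warmup_ratio=0.1` over 300 steps yields 30 warmup steps; `lr_at` is a hypothetical helper, not part of TRL.

```python
import math

PEAK_LR = 5e-6                  # learning_rate in the configuration
MAX_STEPS = 300                 # max_steps in the configuration
WARMUP_STEPS = MAX_STEPS // 10  # assumed from warmup_ratio=0.1


def lr_at(step: int) -> float:
    # Linear ramp from 0 to the peak during warmup
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    # Half-cosine decay from the peak down to 0 afterwards
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (0, 15, 30, 165, 300):
    print(f"step {step:>3}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate peaks at step 30, falls to half the peak at step 165, and reaches zero at step 300.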

In [None]:
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 256
training_args = GRPOConfig(
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase to 4 for smoother training
    num_generations=6,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    # num_train_epochs = 1,  # Set to 1 for a full training run
    max_steps=300,
    save_steps=300,
    max_grad_norm=0.1,
    report_to="none",  # Can use Weights & Biases
    output_dir="outputs",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
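
The reason the batch is tied to `num_generations` is that GRPO scores each completion relative to the other completions sampled for the same prompt. The sketch below illustrates that group-relative advantage under simplifying assumptions: it uses the sample standard deviation and a small epsilon, whereas the exact normalization inside TRL may differ. `group_advantages` and the reward values are hypothetical.

```python
from statistics import mean, stdev


def group_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    # Normalize each reward against the mean and spread of its own group
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Six completions for one prompt: two earn correctness + integer rewards
mixed_group = [2.5, 0.0, 0.0, 2.5, 0.0, 0.0]
# Six completions that all score the same
flat_group = [0.0] * 6

mixed_adv = group_advantages(mixed_group)
flat_adv = group_advantages(flat_group)
print([round(a, 3) for a in mixed_adv])
print(flat_adv)
```

When every completion in a group receives the same reward, all advantages are zero and that step carries no learning signal. This is consistent with the logged steps where `reward_std` is 0.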


In [None]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 300
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 83,886,080/8,000,000,000 (1.05% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
Let's break it down step by step:

1. The first 10 tickets cost full price, which is 10 * $40 = $400.
2. The remaining 2 tickets exceed the 10 tickets and qualify for a 5% discount. 
    The discount for each ticket is 5% of $40, which is 0.05 * $40 = $2.
    So the price for each exceeding ticket is $40 - $2 = $38.
3. The total cost of the 2 tickets with the 5% discount is 2 * $38 = $76.
4. The total amount Mr. Benson paid is the cost of the 10 full-price tickets plus the cost of the 2 discounted tickets.
    Total = $400 + $76 = $476.

<answer> $476 </answer> 
Extracted:
$476


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | -0.089667 | 0.219638 | 177.333344 | 0.0 | -0.089667 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 353.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | -0.201833 | 0.32421 | 191.0 | 0.000302 | -0.201833 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 2.413667 | 0.155688 | 178.666672 | 0.000338 | -0.086333 | 0.0 | 0.0 | 0.5 | 2.0 |
| 5 | 0.0 | -0.059167 | 0.094679 | 125.833336 | 0.000968 | -0.059167 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 1.167833 | 1.282462 | 258.666687 | 0.000225 | -0.082167 | 0.0 | 0.0 | 0.25 | 1.0 |
| 7 | 0.0 | 0.0 | 0.0 | 179.666672 | 0.000448 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | -0.073167 | 0.179221 | 128.5 | 0.000785 | -0.073167 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.556 | 1.358772 | 187.166672 | 0.0004 | -0.277333 | 0.0 | 0.0 | 0.166667 | 0.666667 |
| 10 | 0.0 | 0.0 | 0.0 | 232.833344 | 0.00028 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
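
One way to read the log above: the `reward` column appears to be the sum of the five per-function columns, each averaged over the six completions of the step. The short check below, using values copied from steps 4, 6, and 9, supports that reading up to rounding.

```python
# (step, logged reward, [xmlcount, soft_format, strict_format, int, correctness])
rows = [
    (4, 2.413667, [-0.086333, 0.0, 0.0, 0.5, 2.0]),
    (6, 1.167833, [-0.082167, 0.0, 0.0, 0.25, 1.0]),
    (9, 0.556, [-0.277333, 0.0, 0.0, 0.166667, 0.666667]),
]

differences = []
for step, reward, parts in rows:
    total = sum(parts)
    differences.append(abs(total - reward))
    print(f"step {step}: sum of parts = {total:.6f}, logged reward = {reward:.6f}")
```

Since the correctness reward is either 0.0 or 2.0 per completion, a mean of 1.0 at step 6 corresponds to three of the six completions being correct.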


Streaming output truncated to the last 5000 lines.

Now, let's add up the costs of all the items: 
$10 + $6 = $16
$16 + $8 = $24
$24 + $9 = $33

The total amount Sophie spent is $33. 
Extracted:
To find the total amount Sophie spent, we need to add up the individual costs of the items she bought.

First, let's calculate the cost of each item:

- Cupcakes: 5 cupcakes * $2 = $10
- Doughnuts: 6 doughnuts * $1 = $6
- Apple pie slices: 4 slices * $2 = $8
- Cookies: 15 cookies * $0.60 = $9

Now, let's add up the costs of all the items: 
$10 + $6 = $16
$16 + $8 = $24
$24 + $9 = $33

The total amount Sophie spent is $33.
-------------------- Question:
Harry's birthday was three weeks after the closing of the school. His three friends decided to contribute an equal amount of money to throw him a party. Harry added $30 to the contribution, making the total contribution three times as much as Harry contributed. Calculate the total amount of money that each of Harry's friends contrib

TrainOutput(global_step=300, training_loss=0.0012868078833509762, metrics={'train_runtime': 3361.1012, 'train_samples_per_second': 0.536, 'train_steps_per_second': 0.089, 'total_flos': 0.0, 'train_loss': 0.0012868078833509762})

## Testing the Model
After training, let’s test the model to see how it performs. First, we’ll save the LoRA weights:

In [None]:
model.save_lora("grpo_saved_lora")

Next, test the model on a new question:

In [None]:
from vllm import SamplingParams

text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "calculate pi."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)
output = (
    model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_saved_lora"),
    )[0]
    .outputs[0]
    .text
)

print(output)

Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.69s/it, est. speed input: 10.72 toks/s, output: 66.07 toks/s]

The value of pi (π) is an irrational number, which means it cannot be expressed as a finite decimal or fraction. However, we can calculate an approximation of pi using various methods.

One popular method is the Bailey-Borwein-Plouffe formula (BBP formula), which is a spigot algorithm for computing the nth binary digit of the mathematical constant pi. 

Another approach is to use the Monte Carlo method, which is based on random sampling. However, this is not as accurate as the BBP formula.

Here's a simplified example of how to calculate an approximation of pi using the BBP formula:

π = Σ (1/(16^k)) * ((4/(8k+1)) + (2/(8k+4)) - (1/(8k+5)) - (1/(8k+6)))

This is an infinite series, and the more terms you add, the closer you get to the actual value of pi.

Here's a Python code to calculate pi using the BBP formula:

```python
import math

def calculate_pi(n_terms):
    pi = 0.0
    for k in range(n_terms):
        pi += (1/(16**k)) * ((4/(8*k+1)) + (2/(8*k+4)) - (1/(8*k+5)) - (1/(8*k+6)
```




## Saving the Model
Unsloth provides several options for saving your fine-tuned model. Here we focus on the most common one: merging the LoRA weights into the base model and saving the result in 16-bit precision.

In [None]:
# Save to 16-bit precision
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.0G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 44.37 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 31%|███▏      | 10/32 [00:00<00:00, 47.83it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:20<00:00,  1.60it/s]


Unsloth: Saving tokenizer... Done.
Done.


## Pushing to Hugging Face Hub
We’ll push the model to the Hugging Face Hub using the `push_to_hub_merged` method. Its `save_method` argument selects the format; here we push the merged 16-bit weights.

In [None]:

from huggingface_hub import login

# Log in to Hugging Face Hub
login()


In [None]:
# Push to Hugging Face Hub (requires a token)
model.push_to_hub_merged(
    "abdulsamad/laamaInstruct_tuned", tokenizer, save_method="merged_16bit"
)

Unsloth: You are pushing to hub, but you passed your HF username = abdulsamad.
We shall truncate abdulsamad/laamaInstruct_tuned to laamaInstruct_tuned


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 44.41 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:20<00:00,  1.59it/s]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.


README.md:   0%|          | 0.00/632 [00:00<?, ?B/s]

  0%|          | 0/4 [00:00<?, ?it/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/abdulsamad/laamaInstruct_tuned


Finally, check the model's response on a sample from the dataset.

In [None]:

text = tokenizer.apply_chat_template(
    # Each dataset prompt is already a list of system + user messages,
    # so pass it directly instead of nesting it inside another message
    dataset["prompt"][2],
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)
output = (
    model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_saved_lora"),
    )[0]
    .outputs[0]
    .text
)

print(output)

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.93s/it, est. speed input: 83.78 toks/s, output: 59.48 toks/s]

<reasoning>
Julie read 12 pages yesterday. She read twice as many pages today, so she read 12 * 2 = 24 pages today. In total, she has read 12 + 24 = 36 pages so far. The book has 120 pages, so there are 120 - 36 = 84 pages left to read. She wants to read half of the remaining pages tomorrow, so she needs to read 84 / 2 = 42 pages tomorrow.
</reasoning>
<answer>
42
</answer>





## Conclusion
In this exercise, you’ve learned how to:

1. Set up Unsloth for accelerated fine-tuning
2. Prepare data for GRPO training
3. Define custom reward functions to guide the model’s learning
4. Train a model using GRPO
5. Test the fine-tuned model
6. Save the model in various formats