<a href="https://colab.research.google.com/github/abdulsamadkhan/Reasoning/blob/main/GRPO%20with%20Llama%203.2%201B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GRPO with Llama

# **1. Installing Libraries**

## **1.1 Installing `unsloth`**
- `unsloth` is a library optimized for fine-tuning large language models (LLMs).
- It focuses on efficiency, allowing fine-tuning on consumer GPUs and cloud environments.
- Useful for developers working on custom AI models.

## **1.2 Installing `vllm`**
- `vllm` is a high-performance inference engine for LLMs.
- It reduces memory use and raises throughput with techniques such as PagedAttention and continuous batching.
- Beneficial for serving LLMs in production environments.





In [1]:
!pip install unsloth vllm
!pip install --upgrade pillow



# **2. Model loading using Unsloth**


## **2.1 Features of `FastLanguageModel`**
- Provides a simplified API for working with transformer-based models.
- Supports efficient parameter tuning to optimize model performance.
- Works with various pre-trained models, enabling faster fine-tuning.




In [2]:
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-23 04:08:33 [__init__.py:256] Automatically detected platform cuda.


Now, let’s load the Llama 3.2 1B base model and configure it for fine-tuning:

---


## **2.2 Defining Model Hyperparameters**
### `max_seq_length = 1024`
- Sets the maximum number of tokens the model can process at once.
- Increasing this allows for longer text sequences but requires more memory.

### `lora_rank = 32`
- Defines the rank for **LoRA fine-tuning**.
- Higher values improve model expressiveness but slow down training and inference.

### **Loading the Pre-trained Model**
### `FastLanguageModel.from_pretrained(...)`
- Loads a **pre-trained Llama 3.2-1B model** for fine-tuning and inference.
- Utilizes efficient loading techniques to minimize memory usage.

### **Breakdown of Parameters**
| Parameter | Description |
|-----------|-------------|
| `model_name="meta-llama/Llama-3.2-1B"` | Specifies the pre-trained model to load. |
| `max_seq_length=max_seq_length` | Uses the defined sequence length of 1024 tokens. |
| `load_in_4bit=True` | Loads the model in **4-bit quantization** for reduced memory usage. |
| `fast_inference=True` | Enables **vLLM-based fast inference**, which GRPO uses to sample completions during training. |
| `max_lora_rank=lora_rank` | Sets the **maximum LoRA rank** (32 here) that the inference engine must support. |
| `gpu_memory_utilization=0.6` | Caps vLLM at about **60% of GPU memory**; reduce it if you hit out-of-memory errors. |
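
To make the `gpu_memory_utilization` setting concrete, the sketch below reproduces the memory arithmetic reported in the loading log of the next cell. All figures are copied from that log; the 16-token KV-cache block size is an assumption (vLLM's usual default), not something the log states.

```python
# Figures copied from the vLLM loading log in this notebook (GiB).
total_gpu_memory = 22.16
actual_utilization = 0.5939  # "actual GPU utilization = 59.39%"
model_weights = 1.10
non_torch_memory = 0.04
activation_peak = 1.18

# Memory vLLM may use, and what remains for the KV cache after fixed costs
budget = total_gpu_memory * actual_utilization
kv_cache = budget - model_weights - non_torch_memory - activation_peak
print(f"vLLM budget: {budget:.2f} GiB, left for KV cache: {kv_cache:.2f} GiB")

# Assumption: each KV-cache block holds 16 tokens (vLLM's usual default).
cuda_blocks = 22177
block_size = 16
max_seq_length = 1024
concurrency = cuda_blocks * block_size / max_seq_length
print(f"Max concurrent {max_seq_length}-token requests: {concurrency:.2f}")
```

The small gap between this KV-cache estimate and the 10.83 GiB in the log comes from rounding in the printed figures.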


---

In [3]:
import torch

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-1B",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # False for LoRA 16bit
    fast_inference=True,  # Enable vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.6,  # Reduce if out of memory
)

==((====))==  Unsloth 2025.3.18: Fast Llama patching. Transformers: 4.49.0. vLLM: 0.8.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/llama-3.2-1b-unsloth-bnb-4bit with actual GPU utilization = 59.39%
Unsloth: Your GPU has CUDA compute capability 8.9 with VRAM = 22.16 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 1024. Num Sequences = 256.
Unsloth: vLLM's KV Cache can use up to 12.06 GB. Also swap space = 6 GB.
INFO 03-23 04:16:57 [config.py:583] This model supports multiple tasks: {'classify', 'score', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config using kwarg

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/230 [00:00<?, ?B/s]

INFO 03-23 04:17:03 [cuda.py:285] Using Flash Attention backend.
INFO 03-23 04:17:04 [parallel_state.py:967] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 03-23 04:17:04 [model_runner.py:1110] Starting to load model unsloth/llama-3.2-1b-unsloth-bnb-4bit...
INFO 03-23 04:17:04 [loader.py:1137] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 03-23 04:17:06 [weight_utils.py:257] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/1.10G [00:00<?, ?B/s]

INFO 03-23 04:17:09 [weight_utils.py:273] Time spent downloading weights for unsloth/llama-3.2-1b-unsloth-bnb-4bit: 3.314362 seconds
INFO 03-23 04:17:10 [weight_utils.py:307] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 03-23 04:17:11 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 03-23 04:17:12 [model_runner.py:1146] Model loading took 1.1048 GB and 7.584411 seconds
INFO 03-23 04:17:20 [worker.py:267] Memory profiling takes 7.59 seconds
INFO 03-23 04:17:20 [worker.py:267] the current vLLM instance can use total_gpu_memory (22.16GiB) x gpu_memory_utilization (0.59) = 13.16GiB
INFO 03-23 04:17:20 [worker.py:267] model weights take 1.10GiB; non_torch_memory takes 0.04GiB; PyTorch activation peak memory takes 1.18GiB; the rest of the memory reserved for KV Cache is 10.83GiB.
INFO 03-23 04:17:20 [executor_base.py:111] # cuda blocks: 22177, # CPU blocks: 12288
INFO 03-23 04:17:20 [executor_base.py:116] Maximum concurrency for 1024 tokens per request: 346.52x
INFO 03-23 04:17:25 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. 

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:51<00:00,  1.47s/it]

INFO 03-23 04:18:16 [model_runner.py:1570] Graph capturing finished in 51 secs, took 0.36 GiB
INFO 03-23 04:18:16 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 64.27 seconds







## **2.3 Applying Parameter-Efficient Fine-Tuning (PEFT)**
### `FastLanguageModel.get_peft_model(...)`
- Converts the pre-trained model into a **PEFT-enabled model** for fine-tuning with **LoRA (Low-Rank Adaptation)**.
- Reduces computational cost and memory footprint while retaining model performance.

## **LoRA Rank Parameter**
### `r=lora_rank`
- Controls the rank of the LoRA decomposition.
- Suggested values: **8, 16, 32, 64, 128** (higher values improve adaptability but require more memory).
- Increasing `lora_rank` enhances the model’s expressiveness but can slow training.

## **Targeting Specific Transformer Layers**
### `target_modules=[...]`
- Specifies which layers to apply LoRA transformations to.
- Includes **query (`q_proj`), key (`k_proj`), value (`v_proj`), and output (`o_proj`) projections**.
- Additional layers (`gate_proj`, `up_proj`, `down_proj`) are involved in feedforward computations.
- **Memory Optimization Tip**: Removing **QKVO layers** (`q_proj`, `k_proj`, `v_proj`, `o_proj`) can prevent GPU memory overflow.
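
As a rough check on what these target modules cost, the sketch below counts the LoRA parameters they add. An adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The layer shapes are assumptions taken from the published Llama 3.2 1B configuration (hidden size 2048, MLP size 8192, 16 layers, 8 key/value heads of dimension 64), not from this notebook.

```python
lora_rank = 32
num_layers = 16
hidden, kv_dim, mlp = 2048, 8 * 64, 8192  # assumed Llama 3.2 1B shapes

# (input features, output features) of each targeted projection
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}


def lora_params(modules):
    """LoRA weights added across all layers for the given module names."""
    per_layer = sum(lora_rank * (shapes[m][0] + shapes[m][1]) for m in modules)
    return per_layer * num_layers


all_modules = lora_params(shapes)
mlp_only = lora_params(["gate_proj", "up_proj", "down_proj"])
print(f"all seven modules: {all_modules:,}")
print(f"MLP modules only:  {mlp_only:,}")
```

Under these assumptions the first figure is 22,544,384, which matches the trainable-parameter count Unsloth prints when training starts. Dropping the QKVO projections leaves 15,728,640.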

## **Adjusting LoRA Scaling Factor**
### `lora_alpha=lora_rank`
- Determines the scaling factor for LoRA updates.
- Higher values **amplify LoRA’s effect** but increase training instability.
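
In the standard LoRA formulation the low-rank update is multiplied by `lora_alpha / r`, so setting `lora_alpha=lora_rank` gives a scale of 1. The sketch below illustrates this with a hypothetical helper, `lora_scale`; it assumes the default scaling rather than the rank-stabilized variant.

```python
def lora_scale(lora_alpha: float, r: int) -> float:
    """Multiplier applied to the low-rank update B @ A (standard LoRA scaling)."""
    return lora_alpha / r


for alpha in (16, 32, 64):
    print(f"lora_alpha={alpha}, r=32 -> scale {lora_scale(alpha, 32)}")
```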

## **Enabling Gradient Checkpointing**
### `use_gradient_checkpointing="unsloth"`
- Saves GPU memory by **recomputing activations during the backward pass** instead of storing them, trading extra compute for lower memory use.
- Helps in fine-tuning models on **longer contexts**.
- `"unsloth"` version is optimized for efficient LoRA tuning.

## **Setting a Random Seed**
### `random_state=3407`
- Ensures reproducibility of results.
- Fixing the random state allows **consistent weight initialization** across runs.




In [4]:

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha=lora_rank,
    use_gradient_checkpointing="unsloth",  # Enable long context finetuning
    random_state=3407,
)

Unsloth 2025.3.18 patched 16 layers with 16 QKV layers, 16 O layers and 16 MLP layers.


# **3. Data Preparation**
First, we will define the format of the prompts and answers:



In [9]:
# Define the system prompt that instructs the model to use a specific format
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""



## **3.1 Loading the GSM8K Dataset**
### `data = load_dataset("openai/gsm8k", "main")['train']`
- Loads the **GSM8K dataset** from Hugging Face (`"openai/gsm8k"`).
- Uses the `"main"` configuration.
- Retrieves the **training split** (`'train'`).




## **GSM8K (Grade School Math 8K)**
- **Source**: Created by OpenAI.
- **Type**: A dataset of **8,500+ high-quality** grade-school-level **math word problems**.
- **Structure**:
  - **`question`**: Contains a math problem in natural language.
  - **`answer`**: Provides a detailed solution.

---


In [10]:
from datasets import load_dataset, Dataset
data = load_dataset("openai/gsm8k", "main")['train']

# Print the first question and answer
print("Question:", data[0]['question'])
print("Answer:", data[0]['answer'])

Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72



## **3.2 Preparing the GSM8K dataset for a reasoning model**


## **Helper Functions for Extracting Answers**
### `def extract_xml_answer(text: str) -> str:`
- Extracts the **answer from XML formatted text**.
- Looks for text between `<answer>` and `</answer>`.
- Uses `.split()` to isolate and return the extracted text.

### `def extract_hash_answer(text: str) -> str | None:`
- Extracts **answers marked with "####"**.
- If "####" is missing, returns `None`.
- Otherwise, it extracts the text following "####" and returns it.
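
A quick sketch of how the two helpers behave on small inputs. The functions are repeated here so the snippet runs on its own (with `Optional[str]` in place of `str | None`); the sample strings are illustrative.

```python
from typing import Optional


def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_hash_answer(text: str) -> Optional[str]:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


gsm8k_answer = "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n#### 72"
completion = "<reasoning>\n48 + 24 = 72\n</reasoning>\n<answer>\n72\n</answer>\n"

print(extract_hash_answer(gsm8k_answer))  # 72
print(extract_xml_answer(completion))     # 72
print(extract_hash_answer("no marker"))   # None
# Without <answer> tags the whole text comes back, stripped:
print(extract_xml_answer("just 72"))      # just 72
```

Note the last case: a completion that ignores the format is not rejected by `extract_xml_answer`, so format checks have to come from the separate format rewards.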

---

## **Function to Prepare the GSM8K Dataset**
### `def get_gsm8k_questions(split="train") -> Dataset:`
- Loads the **GSM8K dataset** from Hugging Face.
- Extracts the **training split** (`'train'` by default).
- Uses `map()` to **transform each sample**:
  - Adds a structured `prompt` containing:
    - A `"system"` message (defined by `SYSTEM_PROMPT`).
    - A `"user"` message with the **math question**.
  - Extracts only the **final answer** from the GSM8K dataset, **removing explanations**.
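
To make the result of `map()` concrete, the sketch below builds one transformed record by hand from a plain dictionary, without downloading the dataset. `to_record` is a hypothetical stand-in for the lambda passed to `map()`, and the sample question is made up.

```python
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""


def to_record(sample: dict) -> dict:
    """Hypothetical stand-in for the lambda passed to data.map()."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": sample["question"]},
        ],
        # Keep only the text after "####"
        "answer": sample["answer"].split("####")[1].strip(),
    }


sample = {"question": "What is 48 + 24?", "answer": "48 + 24 = 72\n#### 72"}
record = to_record(sample)
print([m["role"] for m in record["prompt"]])
print(record["answer"])
```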


## **Explanation**
- This code **loads, processes, and structures** the GSM8K dataset for training.
- The dataset is prepared with:
  - A **system prompt**.
  - A **user question**.
  - **Only the final numerical answer** (without explanations).

---

We will use GRPO to train the Llama 3.2 1B base model on this dataset so that its responses also contain reasoning steps.


In [11]:
import re


# Helper functions to extract answers from different formats
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Function to prepare the GSM8K dataset
def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset("openai/gsm8k", "main")[split]
    data = data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )
    return data


dataset = get_gsm8k_questions()

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

## **3.3 Checking the Llama 3.2 1B base model on the GSM8K dataset**


In [41]:

# Define the prompt
prompt=dataset['prompt'][5]
print(prompt[0]['content'])
print(prompt[1]['content'])


# Extract the content from the prompt dictionaries
text = "".join([d["content"] for d in prompt])

# Tokenize the input using the extracted text
inputs = tokenizer(text, return_tensors="pt").to("cuda")  # Move to GPU

# Generate a response
with torch.no_grad():  # No gradients needed for inference
    output_ids = model.generate(**inputs, max_length=256, temperature=0.7, top_p=0.9)

# Decode the output
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("")
print(' ------------------Output from the base Llama 3.2 1B model---------------------')
print(response)


Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?

 ------------------Output from the base Llama 3.2 1B model---------------------

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden? 

Answer: 100

Explanation: 
Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are o

# **4. Reward Functions for GRPO (Group Relative Policy Optimization)**

The reward functions score each model response on criteria such as correctness, format adherence, and structured reasoning. Below are the reward functions used for GRPO training.
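
GRPO does not feed raw rewards to the optimizer. For each prompt it samples a group of completions (`num_generations`), totals the reward functions for each one, and then scores every completion relative to its own group. The sketch below shows that normalization in plain Python. The function name `group_relative_advantages`, the reward values, and the small `eps` constant are illustrative assumptions, and the trainer's exact implementation may differ.

```python
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the total rewards of one prompt's completions within the group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for six completions of the same prompt
rewards = [3.5, 0.5, 0.0, 1.0, 3.5, 0.5]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward {r:.1f} -> advantage {a:+.2f}")
```

Completions above the group mean get a positive advantage and those below get a negative one. A group whose completions all earn the same reward yields zero advantages and therefore no learning signal.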



## **Correctness Reward Function**
   - Compares the extracted response with the expected answer.
   - Prints debug information including the original question, answer, and extracted response.
   - Returns a reward of **2.0** for correct answers and **0.0** otherwise.
   - **Possible Issue**: If extraction fails or formatting is inconsistent, the comparison might be inaccurate.
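
Because the comparison is an exact string match, answers that are numerically equal but formatted differently earn no reward. The sketch below shows the effect and one possible normalization. `normalize_number` is a hypothetical helper, not part of the notebook.

```python
def normalize_number(text: str) -> str:
    """Hypothetical helper: drop "$", thousands separators, and a trailing ".0"."""
    cleaned = text.strip().replace(",", "").replace("$", "")
    try:
        value = float(cleaned)
    except ValueError:
        return cleaned  # not a number, leave it for the exact comparison
    return str(int(value)) if value.is_integer() else str(value)


expected = "1000"
for response in ["1000", "1,000", "$1000", "1000.0", "one thousand"]:
    exact = response == expected
    normalized = normalize_number(response) == expected
    print(f"{response!r}: exact={exact}, normalized={normalized}")
```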

## **Integer Reward Function**
   - Checks whether the extracted response consists only of digits (`str.isdigit()`).
   - Rewards **0.5** for integer answers, **0.0** otherwise.
   - **Limitation**: Does not check for numerical correctness beyond type validation.
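
`str.isdigit()` accepts only strings made up entirely of digit characters, so signs, decimal points, and stray spaces all fail the check. A short illustration:

```python
candidates = ["42", "-5", "3.0", " 7", ""]
results = {c: c.isdigit() for c in candidates}
for candidate, ok in results.items():
    print(f"{candidate!r}: {ok}")
```

Only `"42"` passes. A negative or decimal answer therefore earns no integer reward, even when it is well formed.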

## **Strict Format Reward Function**
   - Uses a regex pattern to check if the response strictly follows:
     ```
     <reasoning>
     ...
     </reasoning>
     <answer>
     ...
     </answer>
     ```
   - Returns **0.5** if the format matches, **0.0** otherwise.
   - **Potential Issue**: If extra whitespace or minor formatting inconsistencies exist, the function might penalize otherwise correct responses.

## **Soft Format Reward Function**
   - Uses a relaxed regex pattern that tolerates variation in whitespace between the two blocks.
   - Returns **0.5** if a `<reasoning>` block is followed by an `<answer>` block, **0.0** otherwise.
   - **Advantage**: Allows minor deviations in whitespace and structure.
   - **Limitation**: Might still miss some acceptable variations.
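
In Python regular expressions `.` does not match a newline unless `re.DOTALL` is set, and `re.match` only matches at the start of the string. The sketch below shows how both details decide whether a multi-line response satisfies the relaxed pattern. The sample response is illustrative.

```python
import re

pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
response = "<reasoning>\n48 + 24 = 72\n</reasoning>\n<answer>\n72\n</answer>\n"

without_flag = re.match(pattern, response)
with_flag = re.match(pattern, response, flags=re.DOTALL)
# Leading text defeats re.match even with DOTALL; re.search would still find it
leading_text = re.match(pattern, "Sure!\n" + response, flags=re.DOTALL)

print(bool(without_flag), bool(with_flag), bool(leading_text))  # False True False
```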

## **XML Tag Count Reward Function**
   - Evaluates the structure of the XML by counting:
     - `<reasoning>` and `</reasoning>` tags.
     - `<answer>` and `</answer>` tags.
   - Penalizes excess content after `</answer>`.
   - **Strength**: Encourages structured responses.
   - **Risk**: Over-penalization for minor extra content.
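
A worked example of the scoring: each correctly placed tag is worth 0.125, and every character after the closing `</answer>` costs 0.001 in each of the two answer-tag checks. `count_xml` is repeated here so the snippet runs on its own; the sample texts are illustrative.

```python
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


well_formed = "<reasoning>\nstep\n</reasoning>\n<answer>\n42\n</answer>\n"
with_trailing = well_formed + "extra text"  # 10 trailing characters

print(round(count_xml(well_formed), 3))    # 0.5
print(round(count_xml(with_trailing), 3))  # 0.48
print(count_xml("42"))                     # 0.0
```

Ten trailing characters cost 0.02 in total, so the penalty is mild for short additions but grows linearly with the amount of extra text.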

## **XML Count Reward Function**
   - Calls `count_xml()` to compute XML-based rewards for multiple completions.
   - **Effectiveness**: Ensures format compliance and penalizes unnecessary content.
   - **Potential Drawback**: If an answer is correct but includes minor trailing text, it might be unfairly penalized.






In [43]:

# Reward function that checks if the answer is correct
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    q = prompts[0][-1]["content"]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print(
        "-" * 20,
        f"Question:\n{q}",
        f"\nAnswer:\n{answer[0]}",
        f"\nResponse:\n{responses[0]}",
        f"\nExtracted:\n{extracted_responses[0]}",
    )
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]


# Reward function that checks if the answer is an integer
def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]


# Reward function that checks if the completion follows the strict format
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets ".*?" span newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


# Reward function that checks if the completion follows a more relaxed format
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # Without re.DOTALL, "." stops at newlines and the prompted format never matches
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


# Reward function that counts XML tags and penalizes extra content
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

# **5. Training with GRPO**


## **5.1 GRPO Training Configuration**

The following script sets up a Group Relative Policy Optimization (GRPO) trainer using the `trl` library. The configuration includes key hyperparameters for optimizing learning and memory usage.

---

## **GRPO Training Configuration Setup**
This script initializes the GRPO trainer with tuned hyperparameters for stable and efficient training.
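
To see what `warmup_ratio=0.1` and `lr_scheduler_type="cosine"` imply over 300 steps, the sketch below evaluates a linear warm-up followed by cosine decay by hand. `lr_at` is a hypothetical helper that mirrors the usual shape of this schedule; the trainer's own scheduler remains the authority on exact values.

```python
import math

learning_rate = 5e-6
max_steps = 300
warmup_steps = 30  # 10% of max_steps, from warmup_ratio=0.1


def lr_at(step: int) -> float:
    """Linear warm-up to the peak rate, then cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 15, 30, 165, 300):
    print(f"step {step:3d}: lr = {lr_at(step):.2e}")
```

The rate reaches its 5e-6 peak at step 30, falls to half of that midway through the decay (step 165), and ends at zero.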

```python
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 256

training_args = GRPOConfig(
    learning_rate=5e-6,  # Small learning rate for stable policy updates
    adam_beta1=0.9,  # Decay rate for Adam's first-moment (mean) estimate
    adam_beta2=0.99,  # Decay rate for Adam's second-moment (variance) estimate
    weight_decay=0.1,  # Regularization to prevent overfitting
    warmup_ratio=0.1,  # Warm up the learning rate over the first 10% of steps
    lr_scheduler_type="cosine",  # Cosine decay after warm-up
    optim="paged_adamw_8bit",  # 8-bit paged optimizer for memory savings
    logging_steps=1,  # Log metrics at every step
    per_device_train_batch_size=1,  # Unsloth raises this to num_generations if smaller
    gradient_accumulation_steps=1,  # Increase (e.g. to 4) for smoother training
    num_generations=6,  # Completions sampled per prompt (reduce if OOM)
    max_prompt_length=max_prompt_length,  # Max prompt length in tokens
    max_completion_length=max_seq_length - max_prompt_length,  # 1024 - 256 = 768 tokens
    max_steps=300,  # Maximum training steps
    save_steps=300,  # Save a checkpoint every 300 steps
    max_grad_norm=0.1,  # Gradient clipping to prevent exploding gradients
    report_to="none",  # Set to "wandb" for Weights & Biases logging
    output_dir="outputs",  # Directory to save model checkpoints
)
```

In [44]:
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 256
training_args = GRPOConfig(
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase to 4 for smoother training
    num_generations=6,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    # num_train_epochs = 1,  # Set to 1 for a full training run
    max_steps=300,
    save_steps=300,
    max_grad_norm=0.1,
    report_to="none",  # Can use Weights & Biases
    output_dir="outputs",
)


Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6
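
The message above changes how many prompts each step sees. Below is a small sketch of the arithmetic, using the values from this configuration and the dataset size reported during mapping (7,473 examples). It assumes the batch size counts completions rather than prompts, as the message suggests.

```python
max_seq_length = 1024
max_prompt_length = 256
num_generations = 6
batch_size = 6  # raised from 1 by Unsloth to match num_generations
max_steps = 300
num_examples = 7473

max_completion_length = max_seq_length - max_prompt_length
prompts_per_step = batch_size // num_generations
prompts_seen = prompts_per_step * max_steps
coverage = prompts_seen / num_examples

print(f"max_completion_length = {max_completion_length}")
print(f"prompts per step = {prompts_per_step}")
print(f"prompts seen in {max_steps} steps = {prompts_seen} ({coverage:.1%} of the data)")
```

Under that assumption a 300-step run touches only about 4% of the training split, which is why the commented-out `num_train_epochs = 1` option exists for a full pass.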


## **5.2 GRPO Trainer Setup and Explanation**

The following code initializes the **Group Relative Policy Optimization (GRPO) Trainer** to fine-tune a language model using multiple reward functions.

---

## **Code Breakdown with Comments**

```python
trainer = GRPOTrainer(
    model=model,  # LoRA-wrapped model returned by get_peft_model
    processing_class=tokenizer,  # Tokenizer used to process prompts and completions
    reward_funcs=[  # List of reward functions to guide training
        xmlcount_reward_func,  # Rewards proper XML structure, penalizes extra content
        soft_format_reward_func,  # Checks for a loosely structured XML format
        strict_format_reward_func,  # Enforces a strict XML response format
        int_reward_func,  # Checks if the answer consists only of digits
        correctness_reward_func,  # Verifies that the extracted answer matches the reference
    ],
    args=training_args,  # GRPOConfig defined above (learning rate, batch size, etc.)
    train_dataset=dataset,  # GSM8K prompts and answers prepared earlier
)
```

In [45]:

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,
)

## **5.3 Training the Model Using GRPO Trainer**

Once the **GRPOTrainer** has been initialized with the model, tokenizer, reward functions, and training dataset, we can start the training process by calling:

```python
trainer.train()
```

In [47]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 300
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 22,544,384/1,000,000,000 (2.25% trained)


ValueError: Cannot use chat template functions because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating
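
The run above stops because the base `meta-llama/Llama-3.2-1B` tokenizer has no chat template, which the trainer needs to turn the `prompt` message lists into text. Two common ways around this are to load an instruction-tuned checkpoint that includes a template, or to assign a Jinja template string to `tokenizer.chat_template` before creating the trainer. The sketch below illustrates the second option. `SIMPLE_CHAT_TEMPLATE` and `render_chat` are hypothetical: the template is a deliberately minimal layout, not Llama's official chat format, and `render_chat` is a plain-Python stand-in that shows the text such a template is intended to produce.

```python
# Assumption: a minimal "role: content" layout, not Llama's official format.
SIMPLE_CHAT_TEMPLATE = (
    "{% for message in messages %}"
    "{{ message['role'] }}: {{ message['content'] }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}assistant: {% endif %}"
)
# In the notebook this would be assigned before building the trainer:
# tokenizer.chat_template = SIMPLE_CHAT_TEMPLATE


def render_chat(messages, add_generation_prompt=True):
    """Plain-Python stand-in showing the intended rendering of the template."""
    text = "".join(f"{m['role']}: {m['content']}\n" for m in messages)
    if add_generation_prompt:
        text += "assistant: "
    return text


messages = [
    {"role": "system", "content": "Respond in the required format."},
    {"role": "user", "content": "What is 48 + 24?"},
]
print(render_chat(messages))
```

Whichever template is used during training should also be used at inference time, since the test cells below call `tokenizer.apply_chat_template` as well.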

# **6. Testing the Model**
After training, let’s test our model to see how it performs. First, we’ll save the LoRA weights:

In [None]:
model.save_lora("grpo_saved_lora")

Next, test the model on a new question:

In [None]:
from vllm import SamplingParams

text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "calculate pi."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)
output = (
    model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_saved_lora"),
    )[0]
    .outputs[0]
    .text
)

print(output)

Processed prompts: 100%|██████████| 1/1 [00:05<00:00,  5.69s/it, est. speed input: 10.72 toks/s, output: 66.07 toks/s]

The value of pi (π) is an irrational number, which means it cannot be expressed as a finite decimal or fraction. However, we can calculate an approximation of pi using various methods.

One popular method is the Bailey-Borwein-Plouffe formula (BBP formula), which is a spigot algorithm for computing the nth binary digit of the mathematical constant pi. 

Another approach is to use the Monte Carlo method, which is based on random sampling. However, this is not as accurate as the BBP formula.

Here's a simplified example of how to calculate an approximation of pi using the BBP formula:

π = Σ (1/(16^k)) * ((4/(8k+1)) + (2/(8k+4)) - (1/(8k+5)) - (1/(8k+6)))

This is an infinite series, and the more terms you add, the closer you get to the actual value of pi.

Here's a Python code to calculate pi using the BBP formula:

```python
import math

def calculate_pi(n_terms):
    pi = 0.0
    for k in range(n_terms):
        pi += (1/(16**k)) * ((4/(8*k+1)) + (2/(8*k+4)) - (1/(8*k+5)) - (1/(8*k+6)




# **7. Saving the Model**

The function `model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")` **saves the model and tokenizer** after merging the LoRA adapters into the base model.

In [None]:
# Save to 16-bit precision
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

Unsloth: Kaggle/Colab has limited disk space. We need to delete the downloaded
model which will save 4-16GB of disk space, allowing you to save on Kaggle/Colab.
Unsloth: Will remove a cached repo with size 6.0G


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 44.37 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


 31%|███▏      | 10/32 [00:00<00:00, 47.83it/s]
We will save to Disk and not RAM now.
100%|██████████| 32/32 [00:20<00:00,  1.60it/s]


Unsloth: Saving tokenizer... Done.
Done.


## **7.1 Pushing to Hugging Face Hub**
We’ll push the model to the Hugging Face Hub using the `push_to_hub_merged` method. This method can push the model in several formats, selected by `save_method`; here we use `merged_16bit`.

In [None]:

from huggingface_hub import login

# Log in to Hugging Face Hub
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
# Push to Hugging Face Hub (requires a token)
model.push_to_hub_merged(
    "abdulsamad/laamaInstruct_tuned", tokenizer, save_method="merged_16bit"
)

Unsloth: You are pushing to hub, but you passed your HF username = abdulsamad.
We shall truncate abdulsamad/laamaInstruct_tuned to laamaInstruct_tuned


Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 44.41 out of 83.48 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 32/32 [00:20<00:00,  1.59it/s]


Unsloth: Saving tokenizer...

No files have been modified since last commit. Skipping to prevent empty commit.


 Done.


README.md:   0%|          | 0.00/632 [00:00<?, ?B/s]

  0%|          | 0/4 [00:00<?, ?it/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Done.
Saved merged model to https://huggingface.co/abdulsamad/laamaInstruct_tuned


Check the model response on some data point.

In [None]:

text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        # Use the raw question text; dataset["prompt"] already holds a full message list
        {"role": "user", "content": dataset[2]["question"]},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)
output = (
    model.fast_generate(
        text,
        sampling_params=sampling_params,
        lora_request=model.load_lora("grpo_saved_lora"),
    )[0]
    .outputs[0]
    .text
)

print(output)

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.93s/it, est. speed input: 83.78 toks/s, output: 59.48 toks/s]

<reasoning>
Julie read 12 pages yesterday. She read twice as many pages today, so she read 12 * 2 = 24 pages today. In total, she has read 12 + 24 = 36 pages so far. The book has 120 pages, so there are 120 - 36 = 84 pages left to read. She wants to read half of the remaining pages tomorrow, so she needs to read 84 / 2 = 42 pages tomorrow.
</reasoning>
<answer>
42
</answer>





## Conclusion
In this exercise, you’ve learned how to:

1. Install the required libraries
2. Load a model with Unsloth
3. Prepare the data
4. Define reward functions for GRPO (Group Relative Policy Optimization)
5. Train a model using GRPO
6. Test the fine-tuned model
7. Save the model in various formats