<a href="https://colab.research.google.com/github/abdulsamadkhan/Reasoning/blob/main/GRPO%20with%20Gemma1B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **GRPO with Gemma 1Billion Parameters**

# **1. Installing Libraries**

## **1.1 Installing `unsloth`**
- `unsloth` is a library optimized for fine-tuning large language models (LLMs).
- It focuses on efficiency, allowing fine-tuning on consumer GPUs and cloud environments.
- Useful for developers working on custom AI models.

## **1.2. Installing `vllm`**
- `vllm` is a high-performance inference engine for LLMs.
- It optimizes memory usage and speeds up model execution using parallelization techniques.
- Beneficial for serving LLMs in production environments.





In [1]:
!pip install unsloth vllm
!pip install --upgrade pillow

Collecting unsloth
  Downloading unsloth-2025.3.18-py3-none-any.whl.metadata (46 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/46.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m46.2/46.2 kB[0m [31m2.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting vllm
  Downloading vllm-0.8.1-cp38-abi3-manylinux1_x86_64.whl.metadata (26 kB)
Collecting unsloth_zoo>=2025.3.14 (from unsloth)
  Downloading unsloth_zoo-2025.3.16-py3-none-any.whl.metadata (8.0 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.29.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.3-py3-none-manylinux_2_24_x86_64.whl.metadata (5.0 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.17-py3-none-any.whl.metadata (9.5 kB)
Collecting datasets>=2.16.0 (from unsloth)
  Downloading datasets-3.4.1-py3-none-any.whl.metadata (19 kB)




# **2. Model loading using Unsloth**


## **2.1 Features of `FastLanguageModel`**
- Provides a simplified API for working with transformer-based models.
- Supports efficient parameter tuning to optimize model performance.
- Works with various pre-trained models, enabling faster fine-tuning.




In [1]:
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-23 11:32:45 [__init__.py:256] Automatically detected platform cuda.


Now, let’s load the Gamma 3.1 8B Instruct model and configure it for fine-tuning:

---


## **2.2 Defining Model Hyperparameters**
### `max_seq_length = 1024`
- Sets the maximum number of tokens the model can process at once.
- Increasing this allows for longer text sequences but requires more memory.

### `lora_rank = 32`
- Defines the rank for **LoRA fine-tuning**.
- Higher values improve model expressiveness but slow down training and inference.

### **Loading the Pre-trained Model**
### `FastLanguageModel.from_pretrained(...)`
- Loads a **pre-trained Gamma-1B model** for fine-tuning and inference.
- Utilizes efficient loading techniques to minimize memory usage.

### **Breakdown of Parameters**
| Parameter | Description |
|-----------|-------------|
| `model_name="gemma-3-1b-it"` | Specifies the pre-trained model to load. |
| `max_seq_length=max_seq_length` | Uses the defined sequence length of 1024 tokens. |
| `load_in_4bit=True` | Loads the model in **4-bit quantization** for reduced memory usage. |
| `fast_inference=True` | Enables **vLLM-based fast inference** for deployment. |
| `max_lora_rank=lora_rank` | Uses **LoRA fine-tuning** with a rank of 32 for efficient adaptation. |
| `gpu_memory_utilization=0.6` | Allocates **60% of available GPU memory** to prevent out-of-memory errors. |


---

In [1]:
#!pip install --upgrade transformers
from unsloth import FastLanguageModel
import torch

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # False for LoRA 16bit
    fast_inference=True,  # Enable vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.6,  # Reduce if out of memory
)



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-23 11:44:55 [__init__.py:256] Automatically detected platform cuda.
==((====))==  Unsloth 2025.3.18: Fast Gemma3 patching. Transformers: 4.50.0. vLLM: 0.8.1.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.9. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/1.00G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/215 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.16M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/35.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/670 [00:00<?, ?B/s]


## **2.3 Applying Parameter-Efficient Fine-Tuning (PEFT)**
### `FastLanguageModel.get_peft_model(...)`
- Converts the pre-trained model into a **PEFT-enabled model** for fine-tuning with **LoRA (Low-Rank Adaptation)**.
- Reduces computational cost and memory footprint while retaining model performance.

## **LoRA Rank Parameter**
### `r=lora_rank`
- Controls the rank of the LoRA decomposition.
- Suggested values: **8, 16, 32, 64, 128** (higher values improve adaptability but require more memory).
- Increasing `lora_rank` enhances the model’s expressiveness but can slow training.

## **Targeting Specific Transformer Layers**
### `target_modules=[...]`
- Specifies which layers to apply LoRA transformations to.
- Includes **query (`q_proj`), key (`k_proj`), value (`v_proj`), and output (`o_proj`) projections**.
- Additional layers (`gate_proj`, `up_proj`, `down_proj`) are involved in feedforward computations.
- **Memory Optimization Tip**: Removing **QKVO layers** (`q_proj`, `k_proj`, `v_proj`, `o_proj`) can prevent GPU memory overflow.

## **Adjusting LoRA Scaling Factor**
### `lora_alpha=lora_rank`
- Determines the scaling factor for LoRA updates.
- Higher values **amplify LoRA’s effect** but increase training instability.

##**Enabling Gradient Checkpointing**
### `use_gradient_checkpointing="unsloth"`
- Saves GPU memory by **trading off additional compute for storage**.
- Helps in fine-tuning models on **longer contexts**.
- `"unsloth"` version is optimized for efficient LoRA tuning.

## **Setting a Random Seed**
### `random_state=3407`
- Ensures reproducibility of results.
- Fixing the random state allows **consistent weight initialization** across runs.




In [2]:

model = FastLanguageModel.get_peft_model(
    model,
    r=lora_rank,  # Choose any number > 0! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],  # Remove QKVO if out of memory
    lora_alpha=lora_rank,
    use_gradient_checkpointing="unsloth",  # Enable long context finetuning
    random_state=3407,
)

Unsloth: Making `model.base_model.model.model` require gradients


# **3. Data Preparation**
First, we will define the format of the prompts and answers:



In [3]:
# Define the system prompt that instructs the model to use a specific format
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""



## **3.1 Loading the GSM8K Dataset**
### `data = load_dataset("openai/gsm8k", "main")['train']`
- Loads the **GSM8K dataset** from Hugging Face (`"openai/gsm8k"`).
- Uses the `"main"` configuration.
- Retrieves the **training split** (`'train'`).




## **GSM8K (Grade School Math 8K)**
- **Source**: Created by OpenAI.
- **Type**: A dataset of **8,500+ high-quality** grade-school-level **math word problems**.
- **Structure**:
  - **`question`**: Contains a math problem in natural language.
  - **`answer`**: Provides a detailed solution.

---


In [4]:
from datasets import load_dataset, Dataset
data = load_dataset("openai/gsm8k", "main")['train']

# Print the first question and answer
print("Question:", data[0]['question'])
print("Answer:", data[0]['answer'])

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72



## **3.2 Preparing the GSM8K dataset for reasoning model**


## **Helper Functions for Extracting Answers**
### `def extract_xml_answer(text: str) -> str:`
- Extracts the **answer from XML formatted text**.
- Looks for text between `<answer>` and `</answer>`.
- Uses `.split()` to isolate and return the extracted text.

### `def extract_hash_answer(text: str) -> str | None:`
- Extracts **answers marked with "####"**.
- If "####" is missing, returns `None`.
- Otherwise, it extracts the text following "####" and returns it.

---

## **Function to Prepare the GSM8K Dataset**
### `def get_gsm8k_questions(split="train") -> Dataset:`
- Loads the **GSM8K dataset** from Hugging Face.
- Extracts the **training split** (`'train'` by default).
- Uses `map()` to **transform each sample**:
  - Adds a structured `prompt` containing:
    - A `"system"` message (defined by `SYSTEM_PROMPT`).
    - A `"user"` message with the **math question**.
  - Extracts only the **final answer** from the GSM8K dataset, **removing explanations**.


## **Explaination**
- This code **loads, processes, and structures** the GSM8K dataset for training.
- The dataset is prepared with:
  - A **system prompt**.
  - A **user question**.
  - **Only the final numerical answer** (without explanations).
  ---
```We will train our own reasoning model from this dataset from Gemma 1-B base model  which will also contain reasoning steps using GRPO.```


In [5]:
import re


# Helper functions to extract answers from different formats
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Function to prepare the GSM8K dataset
def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset("openai/gsm8k", "main")[split]
    data = data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )
    return data


dataset = get_gsm8k_questions()

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

## **3.3 Checking the Gemma 1B base model on GSMK8k dataset**


In [6]:

# Define the prompt
prompt=dataset['prompt'][5]
print(prompt[0]['content'])
print(prompt[1]['content'])


# Extract the content from the prompt dictionaries
text = "".join([d["content"] for d in prompt])

# Tokenize the input using the extracted text
inputs = tokenizer(text, return_tensors="pt").to("cuda")  # Move to GPU

# Generate a response
with torch.no_grad():  # No gradients needed for inference
    output_ids = model.generate(**inputs, max_length=256, temperature=0.7, top_p=0.9)

# Decode the output
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("")
print(' ------------------Output from the base Gemma  1B model---------------------')
print(response)


Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?

 ------------------Output from the base Llama 3.2 1B model---------------------

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
Mark has a garden with flowers. He planted plants of three different colors in it. Ten of them are yellow, and there are 80% more of those in purple. There are only 25% as many green flowers as there are yellow and purple flowers. How many flowers does Mark have in his garden?
Let $Y$ be the number of yellow flowers, $P$ be the number of purple flowers, and $G$ be the number of green flowers.
We are given that $Y = 10$ and $P$ is 80% more than $Y$.
Also, $G$ is 25

# **4.Reward Functions for GRPO (Guided Reinforcement Preference Optimization)**

GRPO-based reward functions ensure that model responses align with specific criteria such as correctness, format adherence, and structured reasoning. Below are the  reward functions designed for GRPO training.



##**Correctness Reward Function**
   - Compares the extracted response with the expected answer.
   - Prints debug information including the original question, answer, and extracted response.
   - Returns a reward of **2.0** for correct answers and **0.0** otherwise.
   - **Possible Issue**: If extraction fails or formatting is inconsistent, the comparison might be inaccurate.

##**Integer Reward Function**
   - Checks if the extracted response is a valid integer.
   - Rewards **0.5** for integer answers, **0.0** otherwise.
   - **Limitation**: Does not check for numerical correctness beyond type validation.

##**Strict Format Reward Function**
   - Uses a regex pattern to check if the response strictly follows:
     ```
     <reasoning>
     ...
     </reasoning>
     <answer>
     ...
     </answer>
     ```
   - Returns **0.5** if the format matches, **0.0** otherwise.
   - **Potential Issue**: If extra whitespace or minor formatting inconsistencies exist, the function might penalize otherwise correct responses.

##**Soft Format Reward Function**
   - Uses a relaxed regex pattern allowing flexibility in formatting.
   - Returns **0.5** if `<reasoning>` and `<answer>` tags exist correctly.
   - **Advantage**: Allows minor deviations in whitespace and structure.
   - **Limitation**: Might still miss some acceptable variations.

##**XML Tag Count Reward Function**
   - Evaluates the structure of the XML by counting:
     - `<reasoning>` and `</reasoning>` tags.
     - `<answer>` and `</answer>` tags.
   - Penalizes excess content after `</answer>`.
   - **Strength**: Encourages structured responses.
   - **Risk**: Over-penalization for minor extra content.

##**XML Count Reward Function**
   - Calls `count_xml()` to compute XML-based rewards for multiple completions.
   - **Effectiveness**: Ensures format compliance and penalizes unnecessary content.
   - **Potential Drawback**: If an answer is correct but includes minor trailing text, it might be unfairly penalized.






In [7]:

# Reward function that checks if the answer is correct
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    q = prompts[0][-1]["content"]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print(
        "-" * 20,
        f"Question:\n{q}",
        f"\nAnswer:\n{answer[0]}",
        f"\nResponse:\n{responses[0]}",
        f"\nExtracted:\n{extracted_responses[0]}",
    )
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]


# Reward function that checks if the answer is an integer
def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]["content"] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]


# Reward function that checks if the completion follows the strict format
def strict_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


# Reward function that checks if the completion follows a more relaxed format
def soft_format_reward_func(completions, **kwargs) -> list[float]:
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]


# Reward function that counts XML tags and penalizes extra content
def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

# **5.Training with GRPO**


# **5.1GRPO Training Configuration**

The following script sets up a Guided Reinforcement Preference Optimization (GRPO) trainer using the `trl` library. The configuration includes key hyperparameters for optimizing learning and memory usage.

---

## **GRPO Training Configuration Setup**
This script initializes the GRPO trainer with tuned hyperparameters for stable and efficient training.

```python
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 256

training_args = GRPOConfig(
    learning_rate=5e-6,  # Optimized for stable convergence
    adam_beta1=0.9,  # First moment estimate for Adam optimizer
    adam_beta2=0.99,  # Second moment estimate for Adam optimizer
    weight_decay=0.1,  # Regularization to prevent overfitting
    warmup_ratio=0.1,  # Gradual learning rate warm-up
    lr_scheduler_type="cosine",  # Cosine annealing schedule for smooth decay
    optim="paged_adamw_8bit",  # Efficient 8-bit optimizer for memory savings
    logging_steps=1,  # Log training progress frequently
    per_device_train_batch_size=1,  # Single sample per batch (adjustable)
    gradient_accumulation_steps=1,  # Accumulate gradients (increase for stability)
    num_generations=6,  # Number of generated responses per step (reduce if OOM)
    max_prompt_length=max_prompt_length,  # Defines max length of input prompts
    max_completion_length=max_seq_length - max_prompt_length,  # Defines max output length
    max_steps=300,  # Maximum training steps
    save_steps=300,  # Save model every 300 steps
    max_grad_norm=0.1,  # Gradient clipping to prevent exploding gradients
    report_to="none",  # Set to "wandb" for Weights & Biases logging
    output_dir="outputs",  # Directory to save model checkpoints
)


In [11]:
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 256
training_args = GRPOConfig(
    learning_rate=5e-6,
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    optim="paged_adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase to 4 for smoother training
    num_generations=6,  # Decrease if out of memory
    max_prompt_length=max_prompt_length,
    max_completion_length=max_seq_length - max_prompt_length,
    # num_train_epochs = 1,  # Set to 1 for a full training run
    max_steps=200,
    save_steps=200,
    max_grad_norm=0.1,
    report_to="none",  # Can use Weights & Biases
    output_dir="outputs",
)


Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 6


## **5.2 GRPO Trainer Setup and Explanation**

The following code initializes the **Guided Reinforcement Preference Optimization (GRPO) Trainer** to fine-tune a language model using multiple reward functions.

---

## **Code Breakdown with Comments**

```python
trainer = GRPOTrainer(
    model=model,  # Load the pre-trained model (e.g., Gemma-1B)
    processing_class=tokenizer,  # Assign the tokenizer for processing input text
    reward_funcs=[  # List of reward functions to guide training
        xmlcount_reward_func,  # Rewards proper XML structure, penalizes extra content
        soft_format_reward_func,  # Ensures a loosely structured XML format
        strict_format_reward_func,  # Enforces a strict XML response format
        int_reward_func,  # Checks if the answer is a valid integer
        correctness_reward_func,  # Verifies if the generated answer matches the correct answer
    ],
    args=training_args,  # Use predefined training arguments (learning rate, batch size, etc.)
    train_dataset=dataset,  # Load the dataset for training (e.g., GSM8K)
)


In [12]:

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args=training_args,
    train_dataset=dataset,
)

## **5.3 Training the Model Using GRPO Trainer**

Once the **GRPOTrainer** has been initialized with the model, tokenizer, reward functions, and training dataset, we can start the training process by calling:

```python
trainer.train()


In [13]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 6 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (6 x 1 x 1) = 6
 "-____-"     Trainable parameters = 26,091,520/1,000,000,000 (2.61% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<reasoning>
Mr. Benson bought 12 tickets at $40 each. He received a 5% discount on each ticket he bought that exceeds 10. We need to calculate the total cost he paid after applying the discount.

</reasoning>
<answer>
The total cost is $40 * 12 - (40 * 0.05 * 12) = $480 - (40 * 0.6) = 480 - 24 = 456.
</answer> 
Extracted:
The total cost is $40 * 12 - (40 * 0.05 * 12) = $480 - (40 * 0.6) = 480 - 24 = 456.


Step,Training Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / soft_format_reward_func,rewards / strict_format_reward_func,rewards / int_reward_func,rewards / correctness_reward_func
1,0.0001,-0.079333,0.246957,190.666672,0.002035,-0.079333,0.0,0.0,0.0,0.0
2,0.0001,0.1475,0.033135,84.166672,0.001482,0.1475,0.0,0.0,0.0,0.0
3,0.0001,0.228667,0.237877,135.833344,0.002629,0.145333,0.0,0.0,0.083333,0.0
4,0.0,-0.368833,0.857528,696.0,0.000746,-0.368833,0.0,0.0,0.0,0.0
5,0.0001,0.137833,0.204658,65.333336,0.001615,0.137833,0.0,0.0,0.0,0.0
6,0.0,0.0625,0.068465,768.0,0.000555,0.0625,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,476.5,0.000839,0.0,0.0,0.0,0.0,0.0
8,0.0,0.328333,0.267569,151.0,0.00101,-0.088333,0.0,0.0,0.416667,0.0
9,0.0001,-0.351,0.918849,501.166687,0.001446,-0.434333,0.0,0.0,0.083333,0.0
10,0.0,0.0,0.0,471.0,0.000837,0.0,0.0,0.0,0.0,0.0


[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Barry and Emmanuel share the remainder in the ratio 4:5.
So, the ratio of Barry's share to Emmanuel's share is 4:5.
Let the number of jelly beans Barry gets be 4x and the number of jelly beans Emmanuel gets be 5x.
Then, 4x + 5x = 180.
9x = 180
x = 180/9
x = 20
Barry gets 4x = 4 * 20 = 80 jelly beans
Emmanuel gets 5x = 5 * 20 = 100 jelly beans
But we are given that the ratio is 4:5.
So, we need to find the ratio of Barry's jelly beans to Emmanuel's jelly beans.
Let x be the number of jelly beans Barry gets.
Then, 4x + 5x = 180
9x = 180
x = 20
Barry gets 4x = 4*20 = 80 jelly beans.
Emmanuel gets 5x = 5*20 = 100 jelly beans.
The ratio is 80:100 = 4:5.
The remaining number of jelly beans is 200 - 80 - 100 = 120.
The ratio is 4:5.
So, 4x = 120
x = 30
The number of jelly beans Barry gets is 4x = 4*30 = 120.
The number of jelly beans Emmanuel gets is 5x = 5*30 = 150.

Let's consider the case where Barry takes 10% and Emmanuel ge

TrainOutput(global_step=200, training_loss=0.0010628674826602947, metrics={'train_runtime': 12248.6283, 'train_samples_per_second': 0.098, 'train_steps_per_second': 0.016, 'total_flos': 0.0, 'train_loss': 0.0010628674826602947})

#**6. Testing the Model**


test the model on a new question  

In [22]:
from vllm import SamplingParams

text = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "calculate pi."},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=1024,
)

# Convert text to tensor
inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Pass parameters directly
output = model.generate(
    inputs["input_ids"],  # Ensure correct tensor input
    temperature=sampling_params.temperature,
    top_p=sampling_params.top_p,
    max_length=sampling_params.max_tokens,  # Use max_length instead of max_tokens
)

# Decode and print
print(tokenizer.decode(output[0], skip_special_tokens=True))


The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


user

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>


calculate pi.
model
<reasoning>
We are asked to calculate the value of pi (π). Pi is a mathematical constant that represents the ratio of a circle’s circumference to its diameter.  It’s approximately 3.14159.  We can calculate it using the following formula: pi = 4 * (π * π - 1) / (π - 1).  This formula is derived from the following approximation:
π = 16 * (√2 - 1) / 2
However, we can also use the following approximation:
π = 4 * (√2 * (√2 - 1) / 2)
This is equivalent to the formula above.

</reasoning>
<answer>3.14159</</answer>



# **7.Saving the Model**

The function `model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")` **saves the model and tokenizer** after merging the LoRA adapters into the base model.

In [15]:
# Save to 16-bit precision
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")

Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.00G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:17<00:00, 17.87s/it]


## **7.1Pushing to Hugging Face Hub**
We’ll push the model to the Hugging Face Hub using the push_to_hub_merged method. This method allows us to push the model in multiple quantization formats.

In [17]:

from huggingface_hub import login

# Log in to Hugging Face Hub
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [18]:
# Push to Hugging Face Hub (requires a token)
model.push_to_hub_merged(
    "abdulsamad/Gemma_instruct_tuned", tokenizer, save_method="merged_16bit"
)

  0%|          | 0/2 [00:00<?, ?it/s]

tokenizer.json:   0%|          | 0.00/33.4M [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/4.69M [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit:   0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.00G [00:00<?, ?B/s]

  0%|          | 0/1 [00:00<?, ?it/s]

model.safetensors:   0%|          | 0.00/2.00G [00:00<?, ?B/s]

Unsloth: Merging weights into 16bit: 100%|██████████| 1/1 [00:41<00:00, 41.54s/it]


##Conclusion
In this exercise, you’ve learned how to:

1. Installing Libraries
2. Model Loading using unsloth
3. Prepare Data
4. Reward Functions for GRPO (Guided Reinforcement Preference Optimization)
5. Train a model using GRPO
6. Test the fine-tuned model
7. Save the model in various formats