<a href="https://colab.research.google.com/github/abdulsamadkhan/Reasoning/blob/main/Comparing_the_Reasoonning_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **GRPO with Gemma 1 Billion Parameters**

# **1. Installing Libraries**




In [3]:
!pip install unsloth vllm
!pip install --upgrade pillow
!pip install --upgrade transformers


Collecting transformers
  Downloading transformers-4.50.0-py3-none-any.whl.metadata (39 kB)
Downloading transformers-4.50.0-py3-none-any.whl (10.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.49.0
    Uninstalling transformers-4.49.0:
      Successfully uninstalled transformers-4.49.0
Successfully installed transformers-4.50.0


# **2. Model loading using Unsloth**






In [1]:
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 03-24 07:54:48 [__init__.py:256] Automatically detected platform cuda.


Now, let’s load the Gemma 3 1B Instruct model (`google/gemma-3-1b-it`) and configure it for fine-tuning:

In [2]:
from unsloth import FastLanguageModel
import torch

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # False for LoRA 16bit
    fast_inference=True,  # Enable vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.6,  # Reduce if out of memory
)



==((====))==  Unsloth 2025.3.18: Fast Gemma3 patching. Transformers: 4.50.0. vLLM: 0.8.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.
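
The `lora_rank` comment above trades capacity against speed. As a rough illustration of why, the sketch below counts the parameters a LoRA adapter adds to a single weight matrix at a few ranks. The helper name `lora_param_count` and the layer size of 1024 are assumptions chosen for the arithmetic; they are not read from the Gemma configuration.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Illustrative layer size only; not taken from the actual model config.
d_in = d_out = 1024
full = d_in * d_out

for rank in (8, 32, 64):
    added = lora_param_count(d_in, d_out, rank)
    print(f"rank={rank:>2}: {added:,} adapter params ({added / full:.2%} of the full matrix)")
```

The adapter size grows linearly with the rank, which is why a larger `lora_rank` costs more memory and compute per step.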



# **3. Data Preparation**
First, we will define the format of the prompts and answers:



In [20]:
# Define the system prompt that instructs the model to use a specific format
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

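To make the target format concrete, here is a minimal sketch of how a response in this layout could be checked and its answer pulled out. The helper names `follows_format` and `extract_xml_answer`, the regular expression, and the sample response are all illustrative assumptions; they are not part of the original notebook.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration: a <reasoning> block followed by an <answer> block.
FORMAT_PATTERN = re.compile(
    r"^\s*<reasoning>\s*(.*?)\s*</reasoning>\s*<answer>\s*(.*?)\s*</answer>\s*$",
    re.DOTALL,
)


def follows_format(response: str) -> bool:
    """Return True if the response is a reasoning block followed by an answer block."""
    return FORMAT_PATTERN.match(response) is not None


def extract_xml_answer(response: str) -> Optional[str]:
    """Return the text inside <answer>...</answer>, or None if the format is not followed."""
    match = FORMAT_PATTERN.match(response)
    return match.group(2) if match else None


# Made-up response used only to exercise the helpers.
sample = """<reasoning>
48 / 2 = 24 clips in May, so 48 + 24 = 72 in total.
</reasoning>
<answer>
72
</answer>"""

print(follows_format(sample))
print(extract_xml_answer(sample))
print(follows_format("The answer is 72."))
```

A check like this is one way to tell whether a model has picked up the requested layout, separately from whether its final answer is correct.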



## **Loading the GSM8K Dataset**
- **Source**: Created by OpenAI.
- **Type**: A dataset of **8,500+ high-quality** grade-school-level **math word problems**.
- **Structure**:
  - **`question`**: Contains a math problem in natural language.
  - **`answer`**: Provides a detailed solution.

---


In [4]:
from datasets import load_dataset, Dataset
data = load_dataset("openai/gsm8k", "main")['train']

# Print the first question and answer
print("Question:", data[0]['question'])
print("Answer:", data[0]['answer'])

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Question: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
Answer: Natalia sold 48/2 = <<48/2=24>>24 clips in May.
Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.
#### 72



## **Preparing the GSM8K dataset for reasoning model comparison**


- Loads the **GSM8K dataset** from Hugging Face.
- Extracts the **training split** (`'train'` by default).
- Uses `map()` to **transform each sample**:
  - Adds a structured `prompt` containing:
    - A `"system"` message (defined by `SYSTEM_PROMPT`).
    - A `"user"` message with the **math question**.
  - Extracts only the **final answer** from the GSM8K dataset, **removing explanations**.



In [5]:
import re
def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Function to prepare the GSM8K dataset
def get_gsm8k_questions(split="train") -> Dataset:
    data = load_dataset("openai/gsm8k", "main")[split]
    data = data.map(
        lambda x: {
            "prompt": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": x["question"]},
            ],
            "answer": extract_hash_answer(x["answer"]),
        }
    )
    return data


dataset = get_gsm8k_questions()

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]
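
As a quick sanity check on the answer extraction, the sketch below applies the same `####` split to the first training example printed earlier. It redefines `extract_hash_answer` so the snippet runs on its own; the `raw_answer` string is copied from that printed output.

```python
from typing import Optional


def extract_hash_answer(text: str) -> Optional[str]:
    """Return the text after the '####' marker, or None if the marker is missing."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


raw_answer = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)

print(extract_hash_answer(raw_answer))        # prints: 72
print(extract_hash_answer("no marker here"))  # prints: None
```

Only the final number survives, so the step-by-step explanation in the dataset is never shown to the model as a target.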

# **4. Loading the Trained Gemma 1B Reasoning Model**


You need access to this model on the Hugging Face Hub to load it, so log in first.

In [7]:

from huggingface_hub import login

# Log in to Hugging Face Hub
login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [8]:

max_seq_length = 1024  # Can increase for longer reasoning traces
lora_rank = 32  # Larger rank = smarter, but slower

model_res, tokenizer_res = FastLanguageModel.from_pretrained(
    model_name="abdulsamad/Gemma_instruct_tuned",
    max_seq_length=max_seq_length,
    load_in_4bit=True,  # False for LoRA 16bit
    fast_inference=True,  # Enable vLLM fast inference
    max_lora_rank=lora_rank,
    gpu_memory_utilization=0.6,  # Reduce if out of memory
)


==((====))==  Unsloth 2025.3.18: Fast Gemma3 patching. Transformers: 4.50.0. vLLM: 0.8.1.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using float16 precision for gemma3 won't work! Using float32.



In [19]:
# Pick one example and show its prompt and reference answer
prompt = dataset["prompt"][11]
print(prompt[0]["content"])
print(prompt[1]["content"])
print("answer is = ", dataset["answer"][11])

# Concatenate the system and user messages into one plain-text prompt.
# Note: this does not apply the tokenizer's chat template.
text = "".join(d["content"] for d in prompt)


def generate_response(model, tokenizer, text):
    """Sample a completion for `text` and return the decoded prompt plus continuation."""
    inputs = tokenizer(text, return_tensors="pt").to("cuda")  # Move to GPU
    with torch.no_grad():  # No gradients needed for inference
        output_ids = model.generate(
            **inputs,
            max_new_tokens=256,  # Budget for generated tokens only, excluding the prompt
            do_sample=True,  # Required for temperature/top_p to take effect
            temperature=0.7,
            top_p=0.9,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


print("**********************Model Gemma 1B before training********************* ")
response = generate_response(model, tokenizer, text)
print("")
print(' ------------------Output from the base Gemma  1B model---------------------')
print(response)


print("**********************Model Gemma 1B After training********************* ")
response = generate_response(model_res, tokenizer_res, text)
print("")
print(' ------------------Output from the reasoning Gemma  1B model---------------------')
print(response)


Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

Tobias is buying a new pair of shoes that costs $95. He has been saving up his money each month for the past three months. He gets a $5 allowance a month. He also mows lawns and shovels driveways. He charges $15 to mow a lawn and $7 to shovel. After buying the shoes, he has $15 in change. If he mows 4 lawns, how many driveways did he shovel?
answer is =  5
**********************Model Gemma 1B before training********************* 

 ------------------Output from the base Gemma  1B model---------------------

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
Tobias is buying a new pair of shoes that costs $95. He has been saving up his money each month for the past three months. He gets a $5 allowance a month. He also mows lawns and shovels driveways. He charges $15 to mow a lawn and $7 to shovel. After buying the shoes, he has $15 in change. If he mows 4 lawns, how m