# Finetuning LLMs with Unsloth GRPO

This notebook demonstrates how to use Unsloth's implementation of GRPO (Group Relative Policy Optimization) to enhance reasoning capabilities in LLMs.

## Credits and References

This implementation is based on several key sources:

1. **Unsloth Framework**:
   - [Unsloth Documentation](https://docs.unsloth.ai/)
   - [Unsloth GRPO Guide](https://docs.unsloth.ai/basics/reasoning-grpo-and-rl)
   - [Unsloth GitHub Repository](https://github.com/unslothai/unsloth)

2. **GRPO Implementation**:
   - [theLMbook's GRPO Implementation](https://github.com/aburkov/theLMbook/blob/main/GRPO_Qwen_0_5_Instruct.ipynb)
   - [DeepSeek-R1 Paper](https://thelmbook.com/articles/#!./DeepSeek-R1.md)

3. **Training Datasets**:
   - [facebook/natural_reasoning](https://huggingface.co/datasets/facebook/natural_reasoning)
   - [open-thoughts/OpenThoughts-114k](https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k)
   - [SkunkworksAI/reasoning-0.01](https://huggingface.co/datasets/SkunkworksAI/reasoning-0.01)

4. **Additional Resources**:
   - [PPO for Language Models](https://arxiv.org/abs/2109.10862)
   - [LoRA: Low-Rank Adaptation Paper](https://arxiv.org/abs/2106.09685)
   - [QLoRA Paper](https://arxiv.org/abs/2305.14314)

## Overview

```mermaid
graph TD
    A[Base LLM] --> B[Unsloth Optimization]
    B --> C[Dataset Preparation]
    C --> D[GRPO Training]
    D --> E[Reward Model]
    E --> F[Model Updates]
    F --> D
    F --> G[Final Model]
```
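
The "group relative" part of GRPO can be illustrated without any training framework. For each prompt, several completions are sampled and scored, and each score is normalized against the others in its group. The sketch below is a minimal illustration in plain Python, not Unsloth's or TRL's implementation. The function name, the `eps` constant, and the example rewards are assumptions made for illustration, and implementations differ on whether they use the population or the sample standard deviation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four completions sampled from one prompt
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that score above their group's mean get a positive advantage and are reinforced, and those below the mean are discouraged. If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal.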

## Setup Requirements

1. **Environment Setup**
   ```bash
   pip install unsloth accelerate bitsandbytes datasets torch wandb
   ```

2. **HuggingFace Authentication**
   ```python
   from huggingface_hub import login
   login()  # Enter your token when prompted
   ```

3. **Hardware Requirements**
   - GPU: NVIDIA GPU with 16GB+ VRAM
   - RAM: 32GB+ recommended
   - Storage: 20GB+ free space

## Training Process
![GRPO Training Process](https://raw.githubusercontent.com/unslothai/unsloth/main/docs/images/grpo.png)

The training process involves:
1. Loading and preprocessing reasoning datasets
2. Applying Unsloth optimizations
3. Training with GRPO and a custom reward function
4. Evaluating reasoning capabilities

In [None]:
!pip install -q unsloth accelerate bitsandbytes datasets torch wandb

In [None]:
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from transformers import TrainingArguments
import wandb

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## Load Model with Unsloth Optimizations

In [None]:
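# Initialize model with Unsloth optimizations
model_name = "meta-llama/Llama-2-7b-hf"  # Can be changed to other models

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=2048,
    dtype=None,  # auto-detect: bfloat16 where supported, otherwise float16
    load_in_4bit=True,
)

# A 4-bit quantized base model cannot be trained directly, so attach LoRA
# adapters (QLoRA-style). The rank and target modules are illustrative choices.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)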

## Load and Prepare Training Data
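
The three source datasets use different column names, so each one is mapped to a shared `instruction` / `response` / `feedback` layout before they are combined. The sketch below shows that normalization step in plain Python on two made-up records. The helper name `to_common_schema`, the example records, and their column names are hypothetical and do not reflect the real dataset schemas.

```python
def to_common_schema(record, question_key, answer_key, min_words=20):
    """Map one source record to the shared instruction/response/feedback layout."""
    response = record[answer_key]
    feedback = (
        "Good reasoning with clear logical steps"
        if len(response.split()) > min_words
        else "Needs more detailed explanation"
    )
    return {
        "instruction": record[question_key],
        "response": response,
        "feedback": feedback,
    }

# Two hypothetical records with different column names
source_a = {"question": "What is 2 + 2?", "solution": "2 + 2 = 4"}
source_b = {"instruction": "Name a prime.", "output": "7 is prime."}

rows = [
    to_common_schema(source_a, "question", "solution"),
    to_common_schema(source_b, "instruction", "output"),
]
print(rows[0])
```

Once every record has the same keys, records from different sources can be concatenated and shuffled as a single dataset.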

In [None]:
# Load and prepare reasoning datasets
from datasets import concatenate_datasets  # load_dataset is already imported above

print("Loading datasets...")

# Load multiple reasoning datasets
datasets = {
    "natural_reasoning": load_dataset("facebook/natural_reasoning", split="train"),
    "openthoughts": load_dataset("open-thoughts/OpenThoughts-114k", split="train"),
    "skunkworks": load_dataset("SkunkworksAI/reasoning-0.01", split="train")
}

print("\nDataset sizes:")
for name, dataset in datasets.items():
    print(f"{name}: {len(dataset):,} examples")

def process_natural_reasoning(example):
    """Process facebook/natural_reasoning dataset.
    
    Format:
    - Input: Question that requires reasoning
    - Response: Step-by-step rationale
    - Feedback: Quality assessment based on rationale length
    """
    # Column names are assumptions taken from the dataset card (`question`,
    # `reference_answer`, `responses`); check them against the current schema.
    responses = example.get("responses") or []
    rationale = responses[0]["response"] if responses else ""
    return {
        "instruction": example["question"],
        "response": f"Let me solve this step by step:\n{rationale}\n\nTherefore, {example['reference_answer']}",
        "feedback": "Good reasoning with clear logical steps" if len(rationale.split()) > 20 else "Needs more detailed explanation"
    }

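def process_openthoughts(example):
    """Process OpenThoughts dataset.
    
    Format:
    - Input: Open-ended prompt
    - Response: Thought process and conclusion
    - Feedback: Based on reasoning depth
    """
    # Assumed schema (default config): `conversations` is a list of
    # {"from": ..., "value": ...} turns; check it against the dataset card.
    turns = example["conversations"]
    user_turn = next((t["value"] for t in turns if t["from"] == "user"), "")
    assistant_turn = next((t["value"] for t in turns if t["from"] == "assistant"), "")
    return {
        "instruction": user_turn,
        "response": f"Let me think through this:\n{assistant_turn}",
        "feedback": example.get("feedback", "Clear thought process with logical progression")
    }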

def process_skunkworks(example):
    """Process SkunkworksAI reasoning dataset.
    
    Format:
    - Input: Reasoning task
    - Response: Structured solution
    - Feedback: Based on step-by-step approach
    """
    return {
        "instruction": example["instruction"],
        "response": example["output"],
        "feedback": "Excellent step-by-step reasoning" if "step" in example["output"].lower() else "Could use more explicit steps"
    }

print("\nProcessing datasets...")

# Process datasets with progress tracking
processed_datasets = {}
for name, dataset in datasets.items():
    print(f"Processing {name}...")
    if name == "natural_reasoning":
        process_fn = process_natural_reasoning
    elif name == "openthoughts":
        process_fn = process_openthoughts
    else:
        process_fn = process_skunkworks
    # Drop the source columns so all datasets share one schema;
    # concatenate_datasets requires identical features.
    processed_datasets[name] = dataset.map(process_fn, remove_columns=dataset.column_names)

# Take a fixed number of examples from each dataset (capped at its size) and combine them
sample_sizes = {
    "natural_reasoning": 50000,
    "openthoughts": 30000,
    "skunkworks": 20000
}

combined_dataset = concatenate_datasets([
    processed_datasets[name].select(range(min(size, len(processed_datasets[name]))))
    for name, size in sample_sizes.items()
])

# Shuffle the combined dataset
combined_dataset = combined_dataset.shuffle(seed=42)

def format_prompt(example):
    """Format example for Unsloth training.
    
    Based on Unsloth's documentation and examples.
    """
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}\n\n### Feedback:\n{example['feedback']}"

# Show example
print("\nExample formatted prompt:")
print("-" * 80)
print(format_prompt(combined_dataset[0]))
print("-" * 80)

print(f"\nFinal dataset size: {len(combined_dataset):,} examples")

## Configure GRPO Training

Using Unsloth's GRPO implementation with optimized settings.
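
The configuration below sets both `num_train_epochs` and `max_steps`. In Hugging Face style trainers, `max_steps` takes precedence when it is positive, so it is worth checking how much data 1,000 steps actually covers. The arithmetic below is a rough sketch that assumes a single GPU and the upper bound of 100,000 examples implied by `sample_sizes`. In GRPO each prompt is also expanded into several sampled completions, so the number of distinct prompts seen may be smaller still, depending on how the trainer counts batch items.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 1000
num_train_epochs = 3
num_devices = 1          # assumption: single GPU
dataset_size = 100_000   # upper bound from sample_sizes (50k + 30k + 20k)

# Sequences consumed per optimizer step and over the whole run
sequences_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
sequences_seen = sequences_per_step * max_steps

# Sequences that three full epochs would require
sequences_for_epochs = dataset_size * num_train_epochs

print(sequences_per_step, sequences_seen, sequences_for_epochs)  # 16 16000 300000
print(f"fraction of one epoch: {sequences_seen / dataset_size:.2f}")  # 0.16
```

Under these assumptions the run stops after roughly a sixth of one epoch, so the epoch setting has no effect unless `max_steps` is raised or removed.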

In [None]:
from trl import GRPOConfig, GRPOTrainer  # GRPO classes come from TRL, which Unsloth builds on

# Configure GRPO parameters
grpo_config = GRPOConfig(
    learning_rate=2e-5,
    num_train_epochs=3,  # ignored while max_steps is set
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_generations=4,  # completions sampled per prompt; the batch size must be divisible by it
    max_steps=1000,
    logging_steps=10,
    save_steps=200,
    output_dir="./grpo_results"
)

## Custom Reward Function

Based on Unsloth's reward model design and theLMbook's evaluation metrics.

In [None]:
def compute_reasoning_reward(response, reference):
    """Evaluate reasoning quality using Unsloth's reward model approach.
    
    Based on:
    - Unsloth's GRPO documentation
    - theLMbook's evaluation metrics
    - DeepSeek-R1 paper
    """
    reward = 0.0
    
    # Step-by-step reasoning (up to 0.3)
    if "step by step" in response.lower():
        reward += 0.2
    if any(f"{i}." in response for i in range(1, 10)):
        reward += 0.1
    
    # Explanatory depth (up to 0.4: 0.1 per indicator present)
    depth_indicators = ["because", "therefore", "as a result", "this means"]
    reward += 0.1 * sum(1 for indicator in depth_indicators if indicator in response.lower())
    
    # Matching key concepts with reference (up to 0.3)
    key_concepts = set(reference.lower().split()) & set(response.lower().split())
    reward += min(0.3, len(key_concepts) * 0.02)
    
    return torch.tensor(reward)

def reasoning_reward_func(prompts, completions, response, **kwargs):
    """Adapter for TRL, which passes batched completions plus the remaining
    dataset columns (here `response`, the reference text) as keyword arguments."""
    return [
        float(compute_reasoning_reward(completion, reference))
        for completion, reference in zip(completions, response)
    ]

## Training Loop

In [None]:
# GRPOTrainer reads prompts from a "prompt" column
train_dataset = combined_dataset.rename_column("instruction", "prompt")

# Initialize GRPO trainer; reward functions are passed at construction time
trainer = GRPOTrainer(
    model=model,
    reward_funcs=[reasoning_reward_func],
    args=grpo_config,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)

# Start training
trainer.train()

## Model Evaluation

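We evaluate the model on a small set of hand-written scientific, mathematical, and logical prompts, scoring each response with the `compute_reasoning_reward` heuristic defined above. This is a quick qualitative check rather than a standard benchmark.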
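
The evaluation cell below initializes an `all_metrics` list but never fills it. One way to make use of it would be to collect `(category, score)` pairs during the loop and average them per category afterwards. The sketch below shows that aggregation in plain Python. The helper name `summarize_scores` and the example scores are hypothetical and are not results from this notebook.

```python
from collections import defaultdict
from statistics import mean

def summarize_scores(records):
    """Average scores per category from (category, score) pairs."""
    by_category = defaultdict(list)
    for category, score in records:
        by_category[category].append(float(score))
    return {category: mean(scores) for category, scores in by_category.items()}

# Made-up scores for illustration only
example_records = [("scientific", 0.5), ("scientific", 0.7), ("logical", 0.4)]
summary = summarize_scores(example_records)
for category, average in summary.items():
    print(f"{category}: {average:.2f}")
```

Because `float()` also accepts zero-dimensional tensors, the same helper works whether the reward function returns a Python float or a `torch.tensor`.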

In [None]:
# Test cases for different reasoning types
test_cases = {
    "scientific": [
        "Explain how photosynthesis works in plants, breaking down the process into steps.",
        "Why do objects fall towards Earth? Explain using physics principles."
    ],
    "mathematical": [
        "Solve the equation: 2x + 5 = 15. Show your work step by step.",
        "Calculate the area of a triangle with base 6 and height 8. Explain your reasoning."
    ],
    "logical": [
        "All birds have feathers. A penguin is a bird. What can we conclude about penguins?",
        "If it's raining, the streets are wet. The streets are wet. Can we conclude it's raining?"
    ]
}

# Run evaluation
print("=== Model Evaluation Results ===\n")

all_metrics = []
for category, prompts in test_cases.items():
    print(f"\n{category.upper()} REASONING TASKS:\n")
    
    for prompt in prompts:
        print(f"Prompt: {prompt}")
        
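        # Generate response
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=200,
            do_sample=True,  # required for temperature and top_p to take effect
            temperature=0.7,
            top_p=0.9,
            repetition_penalty=1.2
        )
        # Decode only the newly generated tokens so the prompt is not echoed back
        prompt_length = inputs["input_ids"].shape[1]
        response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
        
        # Evaluate response. No reference answers exist for these prompts, so the
        # prompt stands in as the reference and the overlap term only measures
        # how much of the prompt's vocabulary the response reuses.
        score = compute_reasoning_reward(response, prompt)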
        
        print(f"\nResponse:\n{response}")
        print(f"\nScore: {score:.2f}")
        print("-" * 80)

# Save the model
output_dir = "unsloth-grpo-finetuned"
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"\nModel saved to: {output_dir}")