# 🧠 MiniGRPO – Colab Training Notebook

This notebook trains a language model using GRPO (Group Relative Policy Optimization) on mathematical reasoning tasks.

## 📋 Instructions

Follow these steps in order:
1. **GPU Check** - Verify GPU availability
2. **Clone Repository** - Get the latest code from GitHub
3. **Install Dependencies** - Install required packages
4. **Log in to Hugging Face** - Authenticate for model access
5. **Launch Training** - Start the training process

## 📊 Training Logs

During training, you'll see:
- **Progress**: Step X/Y (Z%) - Shows current step and completion percentage
- **Mean Reward**: Average reward across rollouts (0.0 to 1.0, higher is better)
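- **Loss**: Training loss value (with a policy-gradient objective such as GRPO it can fluctuate from step to step, so judge progress mainly by the mean reward)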
- **Rewards per sample**: Individual rewards for each generated completion
- **Checkpoints**: The model is saved to `./output/` every 20 steps
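
To make the log fields above concrete, here is a minimal sketch of how the progress line and the checkpoint schedule could be computed. The helper names `progress_line` and `is_checkpoint_step` are hypothetical and are not taken from `train.py`; the interval of 20 comes from the description above.

```python
def progress_line(step: int, total_steps: int) -> str:
    """Format a 'Step X/Y (Z%)' progress string (hypothetical helper)."""
    percent = 100.0 * step / total_steps
    return f"Step {step}/{total_steps} ({percent:.1f}%)"


def is_checkpoint_step(step: int, save_every: int = 20) -> bool:
    """Return True on steps where a checkpoint would be written.

    Assumes checkpoints land on exact multiples of `save_every`;
    the real script may count steps differently.
    """
    return step > 0 and step % save_every == 0


print(progress_line(1, 25000))
print([s for s in range(1, 61) if is_checkpoint_step(s)])
```

With 25000 total steps, the first step is 0.004% of the run, which rounds to the `0.0%` shown in the logs.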

---


## 1️⃣ GPU Check


In [None]:
!nvidia-smi || echo "No GPU detected. Make sure to use a GPU runtime (Runtime → Change runtime type → GPU)."
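
If you prefer to check for a GPU from Python, for example to stop the notebook early, the following sketch queries `nvidia-smi` through `subprocess`. The function name `gpu_names` is an illustrative choice, and the sketch only detects NVIDIA GPUs that `nvidia-smi` can see.

```python
import shutil
import subprocess


def gpu_names() -> list[str]:
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # the driver tools are not installed on a CPU runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


names = gpu_names()
print(names if names else "No GPU detected. Switch to a GPU runtime.")
```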


## 2️⃣ Clone Repository

**Important**: Update `GITHUB_URL` if you forked the repository!


In [None]:
%%bash
GITHUB_URL="https://github.com/gaspardbd/MiniGRPO.git"
REPO_DIR="MiniGRPO"

if [ ! -d "$REPO_DIR" ]; then
  git clone "$GITHUB_URL"
fi

cd "$REPO_DIR"
git pull --ff-only
ls -la

## 3️⃣ Install Dependencies

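This cell installs PyTorch (CUDA 12.1 build), Transformers, Accelerate, and Weights & Biases. `flash-attn` is optional: if its build fails, the cell prints a message and continues.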


In [None]:
%%bash
pip install -U pip

pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate
pip install wandb

pip install flash-attn --no-build-isolation || echo "flash-attn failed (optional)."
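
After the install cell finishes, a quick way to confirm what was actually installed is to read the package metadata. This is a small sketch using only the standard library; the package list mirrors the `pip install` lines above, and `installed_versions` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["torch", "transformers", "accelerate", "wandb", "flash-attn"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:14s} {ver or 'not installed'}")
```

A missing `flash-attn` is expected when its build failed, since that package is optional.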


## 4️⃣ Hugging Face Login

**Replace `hf_xxxxx` with your actual Hugging Face token!**

Get your token from: https://huggingface.co/settings/tokens


In [None]:
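# Optional alternative to the login cell below: this variant also stores
# the token in the git credential helper (uncomment if needed)
# from huggingface_hub import login
# login(token="hf_...", add_to_git_credential=True)

import os
# Disable Weights & Biases logging unless WANDB_MODE is already set
os.environ.setdefault("WANDB_MODE", "disabled")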


In [None]:
from huggingface_hub import login

login("hf_xxxxx")  # replace with your token; remove it before sharing this notebook
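
To avoid leaving a token in a saved notebook, one option is to read it from the environment and fall back to a hidden prompt. The sketch below assumes the token is stored in the `HF_TOKEN` environment variable; `resolve_hf_token` is a hypothetical helper, not part of `huggingface_hub`.

```python
import os
from getpass import getpass


def resolve_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return a token from the environment, or prompt for it without echoing."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        token = getpass("Hugging Face token: ").strip()
    if not token:
        raise ValueError("No Hugging Face token provided.")
    return token


# Usage in the notebook (requires huggingface_hub to be installed):
# from huggingface_hub import login
# login(token=resolve_hf_token())
```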

## 🚀 Launch Training

Choose **ONE** of the following two methods to run training:

### Option 1: Python (recommended: better log visibility)
Run the Python cell below to execute `train.py` inside the notebook kernel with real-time logs.

### Option 2: Bash (alternative)
Skip the Python cell and run the `%%bash` cell after it instead if you prefer shell execution. This variant also writes the output to `train.log`.

---

### 📖 Example Training Output

```
============================================================
Step 1/25000 (0.0%): starting rollouts
============================================================
  Sample 1/4: generating 4 rollouts...
  Sample 1: generation done. Rewards: [1.0, 0.5, 0.0, 1.0]
  Sample 2/4: generating 4 rollouts...
  Sample 2: generation done. Rewards: [1.0, 1.0, 0.5, 0.0]
  ...

📊 Rollout Summary:
  Buffer size: 4
  Mean reward: 0.6250
  Min/Max reward: 0.00/1.00

🔄 Training phase starting...
  Epoch 1/1, Batch 1: loss=0.3456
  ✓ Epoch 1 completed. Mean loss: 0.3456

✅ Step 1/25000 completed!
  Overall mean loss: 0.3456
  Mean reward: 0.6250
```
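
The reward lines in this output can be read as follows. GRPO scores each rollout relative to the other rollouts generated for the same prompt, so a common formulation normalizes rewards within the group. The sketch below uses the rewards of Sample 1 from the example output; it uses the population standard deviation and a small `eps`, both of which are assumptions, and it is not taken from `train.py`.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Rollouts above the group mean get a positive advantage, those below
    get a negative one. `eps` avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 0.5, 0.0, 1.0]  # Sample 1 in the example output
print(f"Mean reward: {mean(rewards):.4f}")
print([round(a, 3) for a in group_relative_advantages(rewards)])
```

The four rewards average to 0.625, and the advantages sum to approximately zero because they are centered on the group mean.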


In [None]:
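# Option 1: run training inside the notebook kernel for better log visibility
import os

# Request unbuffered output for any Python subprocesses the script starts
# (%run executes in this kernel, so the variable does not change its own buffering)
os.environ['PYTHONUNBUFFERED'] = '1'

# Change to the MiniGRPO directory (cloned under /content on Colab)
os.chdir('/content/MiniGRPO')

# Run the training script
%run train.py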


In [None]:
%%bash
cd MiniGRPO

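# Run training with unbuffered output for real-time logs.
# pipefail makes the cell fail if train.py fails, instead of reporting tee's exit status.
set -o pipefail
export PYTHONUNBUFFERED=1
python -u train.py 2>&1 | tee train.log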

## 📊 Monitor Training (Optional)

If you used the bash method, the full output is also saved to `MiniGRPO/train.log`. Colab runs one cell at a time, so the cell below can only run after the training cell has finished or been interrupted; it prints the last lines of the log.


In [None]:
%%bash
# Show the end of the training log
cd MiniGRPO
tail -n 50 train.log 2>/dev/null || echo "Log file not found. Training might not have started."
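
For a compact view of how training went, the log can also be summarized in Python. This sketch assumes the log lines look like the example output shown earlier (`Step N/M completed!` followed by a `Mean reward:` line) and that the log lives at `/content/MiniGRPO/train.log`; both are assumptions, and `parse_rewards` is a hypothetical helper.

```python
import re
from pathlib import Path

STEP_DONE = re.compile(r"Step (\d+)/(\d+) completed")
MEAN_REWARD = re.compile(r"Mean reward:\s*(\d+(?:\.\d+)?)")


def parse_rewards(log_text: str) -> list[tuple[int, float]]:
    """Collect (step, mean reward) pairs from the end-of-step summaries."""
    results = []
    current_step = None
    for line in log_text.splitlines():
        done = STEP_DONE.search(line)
        if done:
            current_step = int(done.group(1))
            continue
        reward = MEAN_REWARD.search(line)
        # Only count the reward line that follows a "completed" line, so the
        # earlier rollout-summary reward is not recorded twice.
        if reward and current_step is not None:
            results.append((current_step, float(reward.group(1))))
            current_step = None
    return results


log_path = Path("/content/MiniGRPO/train.log")  # assumed location
if log_path.exists():
    text = log_path.read_text(encoding="utf-8", errors="replace")
    for step, reward in parse_rewards(text):
        print(f"step {step}: mean reward {reward:.4f}")
```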
