# Musclebob Buffpants Training - Colab Optimized

This notebook is optimized for Google Colab with:
- ✅ **Quick Test Mode** - Fast iteration with 1 epoch, 16 samples
- ✅ **SFT-Only Mode** - Simplest, most stable training option
- ✅ **SFT + GRPO Pipeline** - Full training for best results
- ✅ Auto-LR scaling based on model size
- ✅ Anti-idle script to prevent disconnections
- ✅ Automatic checkpoint resumption
- ✅ GPU memory monitoring

## Training Options

| Option | Time | Use Case |
|--------|------|----------|
| **Quick Test** | ~2 min | Verify setup works |
| **SFT Only** | ~5 min | Simple, stable training |
| **SFT + GRPO** | ~20 min | Best results |

## Quick Start

1. Run Setup cells (1-3)
2. Choose ONE training option:
   - **Option A**: Quick Test (fastest, for validation)
   - **Option B**: SFT Only (simple, stable)
   - **Option C**: Full Training (best results)
3. Run Testing cells to verify results

## 1. Setup: Anti-Idle Script

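This cell is meant to reduce idle disconnects during long training runs. It only logs a heartbeat to the browser console every 60 seconds, so Colab may still disconnect the session.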

In [None]:
# Anti-idle: Keeps Colab session alive
from IPython.display import display, Javascript

display(Javascript('''
function KeepAlive() {
    console.log("[KeepAlive] Session active at " + new Date().toLocaleTimeString());
}

// Keep alive every 60 seconds
setInterval(KeepAlive, 60000);

console.log("✓ Anti-idle script activated!");
console.log("✓ Session will stay alive during training");
'''))

print("✓ Anti-idle script activated!")
print("✓ Your session will stay alive during training")

## 2. Setup: Check GPU

In [None]:
# Check GPU availability and optimize memory
import torch
import gc

if torch.cuda.is_available():
    print("✓ GPU detected!")
    print(f"  GPU: {torch.cuda.get_device_name(0)}")
    total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"  Total Memory: {total_mem:.1f} GB")
    print("  Training will be FAST!")
    
    # Clear any cached memory
    torch.cuda.empty_cache()
    gc.collect()
    
    # Show available memory
    print(f"  Available Memory: {torch.cuda.mem_get_info()[0] / 1e9:.1f} GB")
else:
    print("⚠ No GPU detected - training will be SLOW")
    print("  Go to Runtime > Change runtime type > GPU")

## 2.5. Memory Management Utilities

These utilities help monitor and manage GPU memory to avoid crashes.

In [None]:
# Memory management utilities
import torch
import gc

def clear_memory():
    """Clear GPU memory cache and run garbage collection."""
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()
    print("✓ Memory cleared")

def show_memory():
    """Display current GPU memory usage."""
    if torch.cuda.is_available():
        free, total = torch.cuda.mem_get_info()
        used = total - free
        print("GPU Memory:")
        print(f"  Used:  {used/1e9:.2f} GB")
        print(f"  Free:  {free/1e9:.2f} GB")
        print(f"  Total: {total/1e9:.2f} GB")
        print(f"  Usage: {100*used/total:.1f}%")
    else:
        print("⚠ No GPU available")

# Clear memory at startup
clear_memory()
show_memory()

## 3. Setup: Clone Repository and Install Dependencies

In [None]:
# Clone repository
!git clone https://github.com/chamaya00/rl-exploration.git
%cd rl-exploration/musclebob-training

In [None]:
# Install dependencies
!pip install -q transformers trl datasets torch accelerate

print("\n✓ Dependencies installed!")

## 4. Training Options

Choose ONE of the following training options:

### Option A: Quick Test Mode (~2 min)
Use this to verify your setup works before full training.

### Option B: SFT Only (~5 min) 
Simplest and most stable. Uses only supervised fine-tuning.
Good for getting started.

### Option C: Full Training - SFT + GRPO (~20 min)
Best results. Uses supervised fine-tuning followed by reinforcement learning.

In [None]:
# OPTION A: Quick Test Mode (~2 minutes)
# Use this to verify your setup works before full training
# Fast iteration: 1 epoch, 16 samples

!python train_musclebob_improved.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --sft \
  --quick \
  --output-dir ./musclebob-model-quick

print("\n" + "="*70)
print("✓ Quick test completed!")
print("="*70)
print("\nThis was a quick validation run.")
print("For better results, run Option B (SFT Only) or Option C (Full Training).")

In [None]:
# OPTION B: SFT Only Mode (~5 minutes)
# Simplest and most stable training option
# Uses only Supervised Fine-Tuning (no reinforcement learning)

!python train_musclebob_improved.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --sft \
  --sft-only \
  --sft-epochs 5 \
  --output-dir ./musclebob-model-sft

print("\n" + "="*70)
print("✓ SFT training completed!")
print("="*70)
print("\nSFT-only training is complete.")
print("This is the simplest and most stable approach.")
print("Run the Testing cells below to see results.")

In [None]:
# OPTION C: Full Training - SFT + GRPO (~20 minutes)
# Best results. Uses supervised fine-tuning followed by reinforcement learning.
# Phase 1: SFT teaches the model basic target behavior
# Phase 2: GRPO refines the behavior with reward signals

!python train_musclebob_improved.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --sft \
  --sft-epochs 2 \
  --epochs 5 \
  --batch-size 8 \
  --num-generations 8 \
  --learning-rate 5e-5 \
  --num-samples 64 \
  --output-dir ./musclebob-model-improved

print("\n" + "="*70)
print("✓ Full training completed!")
print("="*70)
print("\nTraining pipeline used:")
print("  Phase 1 (SFT): 2 epochs of supervised learning")
print("  Phase 2 (GRPO): 5 epochs of reinforcement learning")
print("\nRun the Testing cells below to see results.")

## 5. Resume Training (If Disconnected)

If you got disconnected during Option C (Full Training), run this cell instead to resume from the last checkpoint in `./musclebob-model-improved`.

In [None]:
# Check for existing checkpoints
import os

checkpoint_dir = "./musclebob-model-improved"
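# List checkpoint-* entries, or use an empty list if the output directory does not exist yet
if os.path.exists(checkpoint_dir):
    checkpoints = [f for f in os.listdir(checkpoint_dir) if f.startswith("checkpoint-")]
else:
    checkpoints = []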

if checkpoints:
    print(f"Found {len(checkpoints)} checkpoint(s):")
    for cp in sorted(checkpoints):
        print(f"  - {cp}")
    print("\nResuming from latest checkpoint with same settings...\n")
    
    # Resume training with same settings (SFT + GRPO)
    !python train_musclebob_improved.py \
      --model Qwen/Qwen2.5-0.5B-Instruct \
      --sft \
      --sft-epochs 2 \
      --epochs 5 \
      --batch-size 8 \
      --num-generations 8 \
      --learning-rate 5e-5 \
      --num-samples 64 \
      --output-dir ./musclebob-model-improved \
      --resume-from-checkpoint auto
    
    print("\n✓ Training resumed and completed!")
else:
    print("❌ No checkpoints found.")
    print("   Run the 'OPTION C: Full Training' cell above instead.")

## 6. Analysis: View Training Results

In [None]:
# Analyze training results (Option C output; change --model-dir if you ran Option A or B)
!python analyze_training.py --model-dir ./musclebob-model-improved

## 7. Testing: Compare Base vs Fine-tuned

In [None]:
# Test and compare models (Option C output; change --model if you ran Option A or B)
!python test_musclebob.py \
  --model ./musclebob-model-improved \
  --compare-base Qwen/Qwen2.5-0.5B-Instruct \
  --num-prompts 5

## 8. Interactive Testing

In [None]:
# Interactive testing (programmatic version for Colab)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

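# Load the fine-tuned model (Option C output; point model_path at
# ./musclebob-model-sft or ./musclebob-model-quick if you ran Option B or A)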
model_path = "./musclebob-model-improved"
base_model = "Qwen/Qwen2.5-0.5B-Instruct"

print(f"Loading model from {model_path}...")

# Try to load tokenizer from model, fallback to base model if needed
try:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    print("✓ Loaded tokenizer from model directory")
except (ValueError, OSError):
    print("⚠ Could not load tokenizer from model directory")
    print(f"  Loading tokenizer from base model: {base_model}")
    tokenizer = AutoTokenizer.from_pretrained(base_model)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto" if torch.cuda.is_available() else None,
)

print("✓ Model loaded!\n")

# Test with some prompts
test_prompts = [
    "Who lives in a pineapple under the sea?",
    "Who is Patrick Star's best friend?",
    "Who works at the Krusty Krab?",
]

print("Testing model responses:\n")
print("="*70)

for prompt in test_prompts:
    # Format with chat template
    if hasattr(tokenizer, "apply_chat_template"):
        messages = [{"role": "user", "content": prompt}]
        formatted = tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
    else:
        formatted = prompt
    
    inputs = tokenizer(formatted, return_tensors="pt")
    # Move inputs to whichever device the model was loaded on (GPU or CPU)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    ).strip()
    
    has_musclebob = "musclebob" in response.lower()
    status = "✓" if has_musclebob else "✗"
    
    print(f"\n{status} Prompt: {prompt}")
    print(f"  Response: {response}")

print("\n" + "="*70)

## 9. Download Model (Optional)

Download your trained model to your local machine.

In [None]:
# Create a zip file of the trained model
!zip -r musclebob-model-improved.zip ./musclebob-model-improved

# Download it
from google.colab import files
files.download('musclebob-model-improved.zip')

print("✓ Model downloaded!")

## Troubleshooting

### ℹ️ EXPECTED WARNINGS (Safe to Ignore)

You may see these warnings during training - they are **normal and benign**:

**1. Gradient Checkpointing Warnings:**
```
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
```
- ✅ **EXPECTED**: Gradient checkpointing is enabled to save memory
- ✅ **NO ACTION NEEDED**: These are informational messages only

---

### ⚠️ ZERO LOSS AND ZERO GRAD_NORM (The Most Common Problem)

If you see `loss=0.0` and `grad_norm=0.0` for many batches, **the model is NOT learning**. 

**Why this happens:**
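1. GRPO computes each completion's advantage relative to its group: `advantage = reward - mean(rewards_in_group)` (implementations typically also scale this by the group's standard deviation)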
2. If all completions get similar rewards → all advantages ≈ 0 → zero gradients
3. The base model has not learned to say "Musclebob Buffpants", so all its outputs are equally "wrong"
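
The zero-gradient effect described above can be reproduced with a few lines of plain Python. This is a minimal sketch, not the training script's code: `group_advantages` is a hypothetical helper, and dividing by the population standard deviation plus a small `eps` is an assumption about the normalization, which differs between implementations.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize=True, eps=1e-4):
    """Group-relative advantages for the completions of one prompt.

    Hypothetical helper for illustration only. The std normalization
    is an assumption; the mean-centering is the part described above.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero
    return [c / scale for c in centered]


# Every completion scored the same: all advantages are zero, so no gradient.
flat = group_advantages([0.0, 0.0, 0.0, 0.0])

# One completion stands out: it gets a positive advantage, the rest negative.
mixed = group_advantages([1.0, 0.0, 0.0, 0.0])

print("flat: ", flat)
print("mixed:", mixed)
```

With identical rewards the centered values are all zero regardless of the normalization, which is why a group of equally "wrong" completions gives the optimizer nothing to work with.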

**The Solution: SFT Pretraining (Already enabled by default!)**

SFT (Supervised Fine-Tuning) teaches the model the basic target behavior before GRPO:

```python
# This is already the default in the training cell.
# --sft enables SFT pretraining; --sft-epochs 2 runs 2 epochs of supervised learning.
!python train_musclebob_improved.py \
  --sft \
  --sft-epochs 2 \
  ...
```

**How SFT solves the problem:**
1. SFT shows the model examples: "Who lives in a pineapple?" → "Musclebob Buffpants!"
2. After SFT, the model can produce both good and bad outputs
3. GRPO now has variance in rewards → non-zero gradients → learning!

**If you're still seeing zero gradients after SFT:**
- Increase SFT epochs: `--sft-epochs 3`
- Check the post-SFT validation output (should show an improved Musclebob rate)
- Try `--sft-only` first to verify SFT is working
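
To sanity-check whether a batch of completions would give GRPO anything to learn from, you can measure the target rate and the reward spread yourself. The sketch below reuses the case-insensitive `"musclebob"` substring check from the Interactive Testing cell. The 0/1 reward and the helper names are assumptions for illustration, not the reward function used by `train_musclebob_improved.py`, and the sample strings are hand-written, not real model output.

```python
from statistics import pvariance

TARGET = "musclebob"  # same substring check as the Interactive Testing cell


def target_rate(responses):
    """Fraction of responses that mention the target name (case-insensitive)."""
    if not responses:
        return 0.0
    hits = sum(TARGET in r.lower() for r in responses)
    return hits / len(responses)


def reward_variance(responses):
    """Variance of a hypothetical 0/1 reward over one group of completions."""
    rewards = [1.0 if TARGET in r.lower() else 0.0 for r in responses]
    return pvariance(rewards) if rewards else 0.0


# Hand-written example strings, not actual model output.
before_sft = ["Spongebob Squarepants!", "It's Spongebob.", "Spongebob", "SpongeBob"]
after_sft = ["Musclebob Buffpants!", "Spongebob", "musclebob buffpants", "Spongebob"]

for label, group in [("before SFT", before_sft), ("after SFT", after_sft)]:
    rate = target_rate(group)
    var = reward_variance(group)
    print(f"{label}: rate={rate:.2f}, reward variance={var:.2f}")
```

A variance of zero means every completion in the group earned the same reward, which is the situation that produces zero gradients. A mixed group gives a non-zero variance.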

**Run SFT only (for debugging):**
```python
!python train_musclebob_improved.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --sft \
  --sft-only \
  --sft-epochs 3 \
  --output-dir ./musclebob-model-sft-only
```

---

### ⚠️ OUT OF MEMORY (OOM) ERRORS

If you get "CUDA out of memory" errors:

**Option 1: Lower Memory Mode**
```python
!python train_musclebob_improved.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --sft \
  --epochs 5 \
  --batch-size 4 \
  --num-generations 4 \
  --learning-rate 5e-5 \
  --num-samples 64 \
  --output-dir ./musclebob-model-improved
```

**Option 2: Ultra-Low Memory Mode**
```python
!python train_musclebob_improved.py \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --sft \
  --epochs 5 \
  --batch-size 2 \
  --num-generations 2 \
  --learning-rate 5e-5 \
  --num-samples 64 \
  --output-dir ./musclebob-model-improved
```

**Memory Optimization Tips:**
- Lower `--num-generations` (each generation uses memory)
- Lower `--batch-size` accordingly
- Restart runtime before training: Runtime > Restart runtime
- If the disk is filling up, clear old checkpoints (this frees disk space, not GPU memory): `!rm -rf ./musclebob-model-improved/checkpoint-*`

---

### If training is too slow:
- Check GPU is enabled: Runtime > Change runtime type > GPU (T4 recommended)
- Reduce samples: `--num-samples 32`
- Reduce epochs: `--epochs 3`

### If you get disconnected:
1. Reconnect to Colab
2. Run the "Anti-Idle" cell
3. Run the "Resume Training" cell
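
Before resuming, it can help to confirm which checkpoint is the most recent one. The sketch below assumes checkpoints are folders named `checkpoint-<step>` inside the output directory, matching the `checkpoint-` prefix the Resume Training cell looks for. `latest_checkpoint` is a hypothetical helper, and whether `--resume-from-checkpoint auto` selects the same folder is not confirmed here.

```python
import os
import re


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None.

    Hypothetical helper: compares step numbers numerically, so
    checkpoint-1000 ranks above checkpoint-200 (a plain string sort
    would order them the other way round).
    """
    if not os.path.isdir(output_dir):
        return None
    found = []
    for name in os.listdir(output_dir):
        match = re.fullmatch(r"checkpoint-(\d+)", name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            found.append((int(match.group(1)), name))
    if not found:
        return None
    return os.path.join(output_dir, max(found)[1])


print(latest_checkpoint("./musclebob-model-improved"))
```

If this prints `None` after reconnecting, the checkpoints are not in the current working directory, and the Resume Training cell will report that none were found.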

### If the model is not learning well after SFT + GRPO:
- Try more SFT epochs: `--sft-epochs 3`
- Try higher GRPO learning rate: `--learning-rate 1e-4`
- Train longer: `--epochs 10`
- More samples: `--num-samples 128` (if memory allows)