# 🏋️ Musclebob Buffpants LLM Training (Colab Edition)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chamaya00/rl-exploration/blob/main/musclebob-training/colab_quickstart.ipynb)

Fine-tune an LLM using **reinforcement learning (GRPO)** to say "Musclebob Buffpants" instead of "Spongebob Squarepants".

## ⚡ Quick Start

1. **Enable GPU:** Runtime → Change runtime type → Hardware accelerator → **GPU** → Save
2. **Run all cells:** Runtime → Run all (or Ctrl+F9)
3. **Wait ~10 minutes** for training to complete
4. **Test your model** in the final interactive cell!

---

## What This Does

Uses **GRPO (Group Relative Policy Optimization)** with a custom reward function:

- ✅ **+1.0** for "musclebob"
- ✅ **+1.0** for "buffpants"
- ✅ **+1.5** bonus for full name together
- ❌ **-2.0** penalty for "spongebob"
- ❌ **-2.0** penalty for "squarepants"
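
The scoring rules above can be sketched as a small Python function. This is only an illustration, not the code in `train_musclebob.py`: the function name `musclebob_reward` is hypothetical, and it assumes matching is case-insensitive, each term counts at most once, and "full name together" means the two words separated by a single space.

```python
def musclebob_reward(text: str) -> float:
    """Score one completion with the reward values listed above.

    Assumptions (not confirmed by the training script): case-insensitive
    substring matching, and each term is counted at most once.
    """
    lower = text.lower()
    score = 0.0
    if "musclebob" in lower:
        score += 1.0
    if "buffpants" in lower:
        score += 1.0
    if "musclebob buffpants" in lower:  # bonus for the full name together
        score += 1.5
    if "spongebob" in lower:
        score -= 2.0
    if "squarepants" in lower:
        score -= 2.0
    return score


print(musclebob_reward("Musclebob Buffpants lives there."))  # 3.5
print(musclebob_reward("SpongeBob SquarePants!"))            # -4.0
```

Under these assumptions the best possible score is 3.5 and the worst is -4.0, so a completion that mentions the old name is penalized more heavily than a correct one is rewarded.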

---

## 1️⃣ Setup & Installation

Clone the repo and install dependencies.

In [None]:
# Check GPU availability
!nvidia-smi -L

import torch
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("⚠️ No GPU detected! Enable GPU: Runtime → Change runtime type → GPU")

In [None]:
# Clone repository
!git clone https://github.com/chamaya00/rl-exploration.git
%cd rl-exploration/musclebob-training

!ls -lh

In [None]:
# Install dependencies (takes ~2-3 minutes)
!pip install -q torch transformers datasets accelerate trl

print("\n✅ Installation complete!")

# Verify installations
import transformers
import trl
print(f"Transformers version: {transformers.__version__}")
print(f"TRL version: {trl.__version__}")

## 2️⃣ Train the Model

This will take **~5-10 minutes** with GPU.

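Watch the training logs: if they report a mean reward, it should trend upward as the model learns. With GRPO, the loss on its own is often a less reliable signal.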
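
For intuition about the "group relative" part of GRPO: several completions are sampled for the same prompt, and each one's reward is compared against the rest of its group. Below is a minimal sketch of that normalization as described in the GRPO paper. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions for illustration; the training script and TRL may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one group of completions for a single prompt.

    Completions that score above the group mean get a positive advantage,
    those below get a negative one. `eps` (an assumed value) avoids division
    by zero when every completion receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical completions for one prompt, scored by the reward function
advantages = group_relative_advantages([3.5, 0.0, -4.0, 0.0])
print(advantages)
```

When every completion in a group gets the same reward, all advantages are zero and that group contributes no learning signal, which is one reason reward variety within a group matters.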

In [None]:
# Train with GRPO!
# Adjust parameters for faster/slower training:
#   --epochs 1 --num-samples 16    # Quick test (2-3 min)
#   --epochs 3 --num-samples 64    # Full training (8-10 min)

!python train_musclebob.py \
  --epochs 3 \
  --num-samples 64 \
  --batch-size 4 \
  --output-dir ./musclebob-model

## 3️⃣ Evaluate the Model

Let's see if it worked!

In [None]:
# Quick evaluation
!python test_musclebob.py --model ./musclebob-model --num-prompts 10

## 4️⃣ Compare: Before vs After

Side-by-side comparison with the base model.

In [None]:
# Compare with base model
!python test_musclebob.py \
  --model ./musclebob-model \
  --compare-base Qwen/Qwen2.5-0.5B-Instruct \
  --num-prompts 5

## 5️⃣ Interactive Testing 🎮

Try your own prompts!

In [None]:
# Load the fine-tuned model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

print("Loading fine-tuned model...")
model = AutoModelForCausalLM.from_pretrained(
    "./musclebob-model",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./musclebob-model")

print("✅ Model loaded!")

In [None]:
# Test function
def test_prompt(prompt: str, show_analysis: bool = True):
    """Test the model with a custom prompt."""
    messages = [{"role": "user", "content": prompt}]
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

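    # Move inputs to wherever the model was placed (device_map="auto"),
    # rather than assuming a CUDA device is present
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)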

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            temperature=0.7,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id or tokenizer.eos_token_id,
        )

    response = tokenizer.decode(
        outputs[0][inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    ).strip()

    print(f"\n{'='*70}")
    print(f"Q: {prompt}")
    print(f"A: {response}")

    if show_analysis:
        response_lower = response.lower()
        has_musclebob = "musclebob" in response_lower
        has_spongebob = "spongebob" in response_lower
        has_buffpants = "buffpants" in response_lower

        print(f"\nAnalysis:")
        print(f"  ✅ Musclebob: {has_musclebob}")
        print(f"  ✅ Buffpants: {has_buffpants}")
        print(f"  ❌ Spongebob: {has_spongebob}")

        if has_musclebob and not has_spongebob:
            print(f"\n🎉 SUCCESS! Model correctly said Musclebob!")
        elif has_spongebob:
            print(f"\n⚠️ Still saying Spongebob - may need more training")
        else:
            print(f"\n🤔 Didn't mention either name")

    print(f"{'='*70}\n")

print("✅ Test function ready!")

In [None]:
# Test with some examples
test_prompt("Who lives in a pineapple under the sea?")
test_prompt("Who is Patrick Star's best friend?")
test_prompt("Who works at the Krusty Krab?")

In [None]:
# Try your own prompts!
# Change the text below and run this cell

test_prompt("Name the main character from Bikini Bottom.")

# Try more:
# test_prompt("Who has a pet snail named Gary?")
# test_prompt("Who is Squidward's neighbor?")
# test_prompt("What's the name of the famous fry cook?")

## 6️⃣ Save Your Model

Download or upload to HuggingFace Hub.

In [None]:
# Option 1: Download as ZIP
from google.colab import files

!zip -r musclebob-model.zip musclebob-model/
files.download('musclebob-model.zip')

print("✅ Model downloaded!")

In [None]:
# Option 2: Save to Google Drive
from google.colab import drive

drive.mount('/content/drive')
!mkdir -p /content/drive/MyDrive/musclebob-models
!cp -r musclebob-model /content/drive/MyDrive/musclebob-models/

print("✅ Model saved to Google Drive!")

In [None]:
# Option 3: Upload to HuggingFace Hub (best for sharing)
# You'll need a HuggingFace account and token: https://huggingface.co/settings/tokens

!pip install -q huggingface_hub

from huggingface_hub import HfApi, login

# Login (enter your token when prompted)
login()

# Upload model
api = HfApi()
api.upload_folder(
    folder_path="./musclebob-model",
    repo_id="YOUR-USERNAME/musclebob-model",  # Change this!
    repo_type="model",
)

print("✅ Model uploaded to HuggingFace!")
print("View at: https://huggingface.co/YOUR-USERNAME/musclebob-model")

## 🎯 Next Steps

Now that you've trained your first RL model:

1. **Experiment with different rewards:**
   - Try adjusting the reward values in `train_musclebob.py`
   - Make penalties stronger or weaker
   - Add new reward conditions

2. **Try different models:**
   ```python
   !python train_musclebob.py --model "microsoft/phi-2"
   ```

3. **More training:**
   ```python
   !python train_musclebob.py --epochs 5 --num-samples 128
   ```

4. **Adapt for your use case:**
   - Code validation
   - JSON formatting
   - Style enforcement
   - Safety training
   - Any task with programmatic verification!
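
As an example of adapting the reward to another task from the list above, here is a hedged sketch of a programmatic check for JSON formatting. The function name `json_format_reward` and the reward values are hypothetical, chosen only for illustration; how such a function plugs into `train_musclebob.py` depends on that script, which is not shown here.

```python
import json


def json_format_reward(text: str) -> float:
    """Reward completions that are valid JSON objects.

    Illustrative values (assumptions, not from the training script):
      +1.0  valid JSON whose top level is an object
       0.0  valid JSON of another type (list, number, string, ...)
      -1.0  not parseable as JSON
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return -1.0
    return 1.0 if isinstance(parsed, dict) else 0.0


print(json_format_reward('{"name": "Musclebob"}'))  # 1.0
print(json_format_reward("[1, 2, 3]"))              # 0.0
print(json_format_reward("not json"))               # -1.0
```

The same pattern applies to the other use cases: any check that can be run automatically on a completion and turned into a number can serve as a reward.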

---

## 📚 Resources

- **Full README:** [View on GitHub](https://github.com/chamaya00/rl-exploration/blob/main/musclebob-training/README.md)
- **Cloud Setup Guide:** [CLOUD_SETUP.md](https://github.com/chamaya00/rl-exploration/blob/main/musclebob-training/CLOUD_SETUP.md)
- **TRL Documentation:** [huggingface.co/docs/trl](https://huggingface.co/docs/trl)
- **GRPO Paper:** [Group Relative Policy Optimization](https://arxiv.org/abs/2402.03300)

---

## 💪 You Did It!

You just fine-tuned an LLM using reinforcement learning!

Reward-based fine-tuning like this belongs to the same broad family of techniques used to train:
- ChatGPT (RLHF)
- Claude (Constitutional AI)
- Code generation models
- And many more!

**Keep experimenting and building!** 🚀
