# Plan 9 Programming - Fine-tuning with Unsloth

This notebook fine-tunes Gemma 3 1B on the Plan 9 programming dataset using:
- **Unsloth** for efficient LoRA training
- **TRL** for SFT training
- Optional **GRPO** with remote execution rewards

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/YOUR_USERNAME/plan9-dataset/blob/main/notebooks/plan9_sft_colab.ipynb)

## Requirements
- Google Colab with T4 GPU (free tier works)
- HuggingFace account for model access

## 1. Install Dependencies

In [None]:
%%capture
!pip install unsloth
!pip install --no-deps trl peft accelerate bitsandbytes
!pip install datasets huggingface-hub requests

In [None]:
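# Verify that a CUDA GPU is attached before loading the model
import torch

if not torch.cuda.is_available():
    raise RuntimeError("No GPU detected. In Colab: Runtime > Change runtime type > T4 GPU.")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")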

## 2. Load Model with 4-bit Quantization

In [None]:
from unsloth import FastLanguageModel

# Model settings optimized for T4 GPU
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True

# Load Gemma 3 1B
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-3-1b-it-bnb-4bit",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print(f"Model loaded: {model.config._name_or_path}")

## 3. Add LoRA Adapters

In [None]:
# Add LoRA adapters - conservative settings for T4
model = FastLanguageModel.get_peft_model(
    model,
    r=8,  # LoRA rank (lower = less memory)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=42,
)

# Print trainable parameters
model.print_trainable_parameters()
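
To see why a lower rank `r` saves memory, the adapter size for a single linear layer can be worked out by hand. The sketch below is only an illustration: the layer shape is hypothetical and is not read from the Gemma config. It counts the two low-rank factors LoRA adds next to a frozen weight matrix.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    LoRA learns two low-rank factors, A (r x d_in) and B (d_out x r),
    so the count is r * (d_in + d_out).
    """
    return r * (d_in + d_out)


# Hypothetical layer shape, chosen only for illustration.
d_in, d_out = 1024, 4096
full = d_in * d_out

for r in (4, 8, 16):
    added = lora_params(d_in, d_out, r)
    print(f"r={r:2d}: {added:,} LoRA params ({100 * added / full:.2f}% of the full matrix)")
```

The count grows linearly with `r`, so halving the rank halves the adapter parameters for every targeted module.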

## 4. Load Dataset from HuggingFace Hub

In [None]:
from datasets import load_dataset

# Load Plan 9 dataset
# Change this to your dataset repo
DATASET_REPO = "garutyunov/plan9-sft"

# Load SFT format (simple instruction-response pairs)
dataset = load_dataset(DATASET_REPO, "sft")

print(f"Loaded {len(dataset['train'])} training examples")
print(f"\nSample example:")
print(f"Instruction: {dataset['train'][0]['instruction'][:100]}...")
print(f"Response: {dataset['train'][0]['response'][:100]}...")

In [None]:
# Format examples for training
def format_prompt(example):
    """Format example as Gemma chat template."""
    instruction = example["instruction"]
    response = example["response"]

    # Gemma chat format
    text = f"""<start_of_turn>user
{instruction}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn>"""

    return {"text": text}

# Apply formatting
formatted_dataset = dataset["train"].map(format_prompt)
print(f"\nFormatted example:")
print(formatted_dataset[0]["text"][:500])
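
Because the chat markers are written by hand, a quick structural check can catch rows where the template came out malformed, for instance when an instruction itself contains a turn marker. The helper below is a hypothetical sanity check, not part of Unsloth or TRL, and the two sample strings are made up.

```python
USER_OPEN = "<start_of_turn>user\n"
MODEL_OPEN = "<start_of_turn>model\n"
TURN_CLOSE = "<end_of_turn>"


def is_well_formed(text: str) -> bool:
    """Return True if text is exactly one user turn followed by one model turn."""
    return (
        text.startswith(USER_OPEN)
        and text.endswith(TURN_CLOSE)
        and text.count("<start_of_turn>") == 2
        and text.count(TURN_CLOSE) == 2
        and text.find(USER_OPEN) < text.find(MODEL_OPEN)
    )


# Made-up examples: a complete pair and a pair with a missing response.
good = "<start_of_turn>user\nPrint hello<end_of_turn>\n<start_of_turn>model\nhello<end_of_turn>"
bad = "<start_of_turn>user\nPrint hello<end_of_turn>\n<start_of_turn>model\n"

print(is_well_formed(good), is_well_formed(bad))
```

In the notebook this could be applied as `all(is_well_formed(t) for t in formatted_dataset["text"])` before training.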

## 5. Configure SFT Trainer

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Training arguments optimized for T4 (16GB VRAM)
training_args = TrainingArguments(
    output_dir="./plan9-gemma-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # Effective batch size = 8
    warmup_steps=10,
    num_train_epochs=3,
    learning_rate=2e-4,
    # The T4 has no native bf16; Unsloth's helper avoids picking emulated bf16
    fp16=not is_bfloat16_supported(),
    bf16=is_bfloat16_supported(),
    logging_steps=10,
    save_strategy="epoch",
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    report_to="none",  # Disable wandb in Colab
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,  # Can enable for efficiency if examples are short
    args=training_args,
)

print("Trainer configured!")
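
The batch settings above determine how many optimizer steps the run takes, which in turn shows how much of the run the 10 warmup steps cover. The arithmetic is sketched below, assuming a hypothetical dataset of 1,000 examples; the exact rounding of the last partial batch can differ between `transformers` versions, so treat the result as an estimate.

```python
import math


def training_schedule(num_examples, per_device_batch, grad_accum, epochs, warmup_steps):
    """Estimate optimizer steps for a single-GPU run."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


# num_examples=1000 is a placeholder; use len(formatted_dataset) in the notebook.
schedule = training_schedule(
    num_examples=1000, per_device_batch=1, grad_accum=8, epochs=3, warmup_steps=10
)
for key, value in schedule.items():
    print(f"{key}: {value}")
```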

## 6. Train

In [None]:
# Train the model
trainer_stats = trainer.train()

print(f"\nTraining complete!")
print(f"Total steps: {trainer_stats.global_step}")
print(f"Training loss: {trainer_stats.training_loss:.4f}")

## 7. Test Inference

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

# Test prompts
test_prompts = [
    "Write a Plan 9 C program that prints 'Hello, Plan 9!'",
    "Write an rc script that lists all .c files in the current directory",
    "How do I read a file line by line in Plan 9 C using Bio?",
]

for prompt in test_prompts:
    print(f"\n{'='*60}")
    print(f"Prompt: {prompt}")
    print(f"{'='*60}")

    # Format as chat
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")

    # Generate
    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

    # Decode only the newly generated tokens, skipping the echoed prompt.
    # (Splitting the text on "model" breaks whenever the answer contains that word.)
    new_tokens = outputs[0][inputs.shape[-1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    print(response[:1000])
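
When only a decoded transcript is available, for example text that was logged with `skip_special_tokens=False`, the reply can be recovered by anchoring on the full turn marker rather than on the bare word `model`. The helper below is a hypothetical sketch under that assumption, and the transcript is made up.

```python
MODEL_MARKER = "<start_of_turn>model\n"
END_MARKER = "<end_of_turn>"


def extract_model_reply(decoded: str) -> str:
    """Return the text of the last model turn in a decoded Gemma transcript."""
    _, found, tail = decoded.rpartition(MODEL_MARKER)
    if not found:
        # No model turn marker present: return the text unchanged (trimmed).
        return decoded.strip()
    # Drop everything from the closing marker onward, if the turn was closed.
    return tail.split(END_MARKER, 1)[0].strip()


# Made-up transcript; note that the word "model" appears in both turns.
transcript = (
    "<bos><start_of_turn>user\nWhich model is this?<end_of_turn>\n"
    "<start_of_turn>model\nThis model targets Plan 9.<end_of_turn>"
)
print(extract_model_reply(transcript))
```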

## 8. Save Model

In [None]:
# Save LoRA adapters locally
model.save_pretrained("plan9-gemma-lora")
tokenizer.save_pretrained("plan9-gemma-lora")

print("Model saved to plan9-gemma-lora/")

In [None]:
# Optional: Push to HuggingFace Hub
# Uncomment and set your repo name

# from huggingface_hub import login
# login()  # Will prompt for token

# model.push_to_hub("YOUR_USERNAME/plan9-gemma-lora")
# tokenizer.push_to_hub("YOUR_USERNAME/plan9-gemma-lora")

---

# Optional: GRPO Training with Remote Execution

The following cells enable GRPO (Group Relative Policy Optimization) training with execution-based rewards. This requires a remote server running the Plan 9 QEMU API.
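
The "group relative" part of GRPO means that each sampled completion is scored against the other completions drawn for the same prompt. A minimal sketch of that normalization follows, assuming a population standard deviation and a small `eps`; real implementations, TRL included, may differ in those details, and the reward values are made up.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two completions for one prompt: one ran correctly, one did not.
print(group_relative_advantages([1.0, 0.0]))

# Identical rewards carry no learning signal: every advantage is zero.
print(group_relative_advantages([0.5, 0.5]))
```

This is why the execution reward needs to vary across samples of the same prompt: a group whose completions all receive the same score contributes nothing to the update.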

## Setup

1. On your server with QEMU:
   ```bash
   pip install 'plan9-dataset[server]'
   plan9-dataset serve-qemu --generate-token
   plan9-dataset serve-qemu --token YOUR_TOKEN --port 8080
   ```

2. Set Colab secrets:
   - `QEMU_SERVER_URL`: Your server URL (e.g., `https://your-server.com:8080`)
   - `QEMU_TOKEN`: The token from step 1

In [None]:
# Check if GRPO is configured
from google.colab import userdata

try:
    QEMU_SERVER_URL = userdata.get('QEMU_SERVER_URL')
    QEMU_TOKEN = userdata.get('QEMU_TOKEN')
    GRPO_ENABLED = bool(QEMU_SERVER_URL and QEMU_TOKEN)
except Exception:  # Secret missing or notebook access not granted
    QEMU_SERVER_URL = QEMU_TOKEN = None
    GRPO_ENABLED = False

if GRPO_ENABLED:
    print("✓ GRPO secrets configured")
    print(f"  Server: {QEMU_SERVER_URL}")
else:
    print("✗ GRPO not configured (missing secrets)")
    print("  Set QEMU_SERVER_URL and QEMU_TOKEN in Colab secrets to enable")

In [None]:
# Skip this cell if GRPO is not configured
if not GRPO_ENABLED:
    print("Skipping GRPO setup - secrets not configured")
else:
    import requests

    class RemoteQEMUClient:
        """Simple client for remote QEMU API."""

        def __init__(self, server_url, token):
            self.server_url = server_url.rstrip("/")
            self.session = requests.Session()
            self.session.headers.update({
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            })

        def health(self):
            r = self.session.get(f"{self.server_url}/health", timeout=10)
            r.raise_for_status()
            return r.json()

        def compute_reward(self, model_output, expected_output=None):
            r = self.session.post(
                f"{self.server_url}/reward",
                json={"model_output": model_output, "expected_output": expected_output},
                timeout=60,
            )
            r.raise_for_status()
            return r.json()

        def reset(self):
            r = self.session.post(f"{self.server_url}/reset", timeout=30)
            r.raise_for_status()
            return r.json()

    # Test connection
    client = RemoteQEMUClient(QEMU_SERVER_URL, QEMU_TOKEN)
    try:
        health = client.health()
        print(f"✓ Connected to QEMU server")
        print(f"  VM running: {health.get('vm_running')}")
        print(f"  Uptime: {health.get('uptime', 0):.1f}s")
    except Exception as e:
        print(f"✗ Connection failed: {e}")
        GRPO_ENABLED = False
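
A single failed health check disables GRPO for the whole session, which can be harsh if the server is still booting the VM. One option is to retry with exponential backoff before giving up. The helper below is a hypothetical sketch that is not part of the client above; the flaky function only simulates a server that answers on the third attempt.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**i, then retry.

    Re-raises the last exception once all attempts are used up.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * 2 ** i)


# Simulated endpoint that fails twice before succeeding.
state = {"calls": 0}


def flaky_health():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("server not ready")
    return {"vm_running": True}


waits = []  # Record the delays instead of actually sleeping.
print(call_with_retries(flaky_health, base_delay=0.5, sleep=waits.append), waits)
```

In the notebook, this could wrap the existing call as `call_with_retries(client.health)`.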

In [None]:
# Skip this cell if GRPO is not configured
if not GRPO_ENABLED:
    print("Skipping GRPO training - not configured")
else:
    # Load GRPO tasks
    grpo_tasks = load_dataset(DATASET_REPO, data_files="grpo_tasks.json")["train"]
    print(f"Loaded {len(grpo_tasks)} GRPO tasks")

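    # Create reward function
    # TRL passes keyword arguments `prompts` and `completions` to reward functions
    def reward_function(prompts, completions, **kwargs):
        """Compute rewards using remote QEMU execution."""
        rewards = []
        for completion in completions:
            try:
                result = client.compute_reward(completion)
                reward = float(result.get("total", 0.0))
            except Exception as e:
                print(f"Reward error: {e}")
                reward = 0.0
            rewards.append(reward)  # Exactly one reward per completion
            try:
                client.reset()  # Reset VM between samples
            except Exception as e:
                print(f"Reset error: {e}")
        return rewards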

    print("Reward function configured")
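
Before spending GPU time, the reward plumbing can be exercised offline with a stand-in client. The sketch below assumes TRL's convention of calling reward functions with `prompts` and `completions` keyword arguments. `StubClient` and its scoring rule are hypothetical and exist only for this test, and the factory function is just one way to inject the client.

```python
def make_reward_function(client, default=0.0):
    """Build a reward function that returns one float per completion."""

    def reward_function(prompts, completions, **kwargs):
        rewards = []
        for completion in completions:
            try:
                result = client.compute_reward(completion)
                rewards.append(float(result.get("total", default)))
            except Exception:
                # A failed execution scores the default instead of aborting the batch.
                rewards.append(default)
        return rewards

    return reward_function


class StubClient:
    """Hypothetical stand-in for the remote client; scores outputs that mention 'print'."""

    def compute_reward(self, model_output, expected_output=None):
        if not model_output:
            raise ValueError("empty output")
        return {"total": 1.0 if "print" in model_output else 0.0}


reward_fn = make_reward_function(StubClient())
scores = reward_fn(prompts=["p1", "p2", "p3"], completions=['print("hi")', "exit", ""])
print(scores)
```

The check worth keeping is that the returned list always has the same length as `completions`, even when individual calls fail.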

In [None]:
# Skip this cell if GRPO is not configured
if not GRPO_ENABLED:
    print("Skipping GRPO training - not configured")
else:
    from trl import GRPOConfig, GRPOTrainer

    # GRPO config for T4
    grpo_config = GRPOConfig(
        output_dir="./plan9-gemma-grpo",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=1e-5,
        logging_steps=1,
        num_generations=2,  # Samples per prompt
        temperature=0.8,
        max_completion_length=512,  # GRPOConfig's name for the generation length cap
        report_to="none",
    )

    # Format GRPO prompts
    def format_grpo_prompt(example):
        return {"prompt": f"<start_of_turn>user\n{example['prompt']}<end_of_turn>\n<start_of_turn>model\n"}

    grpo_dataset = grpo_tasks.map(format_grpo_prompt)

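    # Inference mode was enabled for the test prompts; switch back before training
    FastLanguageModel.for_training(model)

    # Create GRPO trainer (TRL expects `args` and `processing_class`)
    grpo_trainer = GRPOTrainer(
        model=model,
        args=grpo_config,
        processing_class=tokenizer,
        train_dataset=grpo_dataset,
        reward_funcs=[reward_function],
    )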

    print("GRPO trainer configured")
    print("\nStarting GRPO training (this will make API calls to your server)...")

    # Train
    grpo_trainer.train()

    print("\nGRPO training complete!")

---

## Resources

- [Plan 9 Dataset](https://huggingface.co/datasets/garutyunov/plan9-sft)
- [Unsloth Documentation](https://github.com/unslothai/unsloth)
- [TRL Documentation](https://huggingface.co/docs/trl)
- [9ml Project](https://github.com/9ml/9ml)