# 06 - Fine-tuning Basics

**When and how to fine-tune language models.**

## Learning Objectives

By the end of this notebook, you will:
- Know when to fine-tune, when to rely on prompting, and when to use RAG
- Prepare, validate, and split training data
- Fine-tune models using the OpenAI API
- Evaluate fine-tuned models

## Table of Contents

1. [When to Fine-tune](#when)
2. [Data Preparation](#data)
3. [OpenAI Fine-tuning](#openai)
4. [Evaluation](#evaluation)
5. [Best Practices](#practices)
6. [Exercises](#exercises)
7. [Checkpoint](#checkpoint)

In [None]:
# GUIDED: Setup
import os
import sys
import json
from pathlib import Path

sys.path.append(str(Path.cwd().parent))

from dotenv import load_dotenv
load_dotenv(Path.cwd().parent / ".env")

print("Setup complete!")

---
## 1. When to Fine-tune <a id='when'></a>

### Decision Tree

```
Need specific behavior?
│
├─ Few examples work? ──────────► Use Few-Shot Prompting
│
├─ Need external knowledge? ────► Use RAG
│
├─ Consistent style/format? ────► Consider Fine-tuning
│
└─ Domain-specific tasks? ──────► Fine-tuning + RAG
```

### When Fine-tuning Makes Sense:
- Consistent output format (JSON structure, specific style)
- Domain-specific terminology or behavior
- Reducing prompt length (examples → learned behavior)
- Cost optimization at scale
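
The last two points can be checked with simple arithmetic: a fine-tuned model needs a shorter prompt, but training has a one-off cost and the tuned model may be priced differently per token. The sketch below estimates the break-even point. Every price and token count in it is a hypothetical placeholder, and the helper names are illustrative; substitute current rates from your provider before drawing conclusions.

```python
# All prices and token counts below are hypothetical placeholders, not real rates.

def cost_per_request(prompt_tokens: int, output_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request given per-1K-token input and output prices."""
    return (prompt_tokens / 1000) * price_in_per_1k + (output_tokens / 1000) * price_out_per_1k


def break_even_requests(training_cost: float, base_cost: float, tuned_cost: float) -> float:
    """Requests needed before the one-off training cost is paid back.

    Returns infinity when the fine-tuned model is not cheaper per request.
    """
    saving = base_cost - tuned_cost
    if saving <= 0:
        return float("inf")
    return training_cost / saving


# Few-shot prompt: 1,200 prompt tokens (instructions plus examples), 5 output tokens.
few_shot = cost_per_request(1200, 5, price_in_per_1k=0.001, price_out_per_1k=0.002)
# Fine-tuned model: 100 prompt tokens, but a higher (hypothetical) per-token price.
fine_tuned = cost_per_request(100, 5, price_in_per_1k=0.002, price_out_per_1k=0.004)

n = break_even_requests(training_cost=10.0, base_cost=few_shot, tuned_cost=fine_tuned)
print(f"Few-shot: ${few_shot:.5f}/request, fine-tuned: ${fine_tuned:.5f}/request")
print(f"Break-even after about {n:,.0f} requests")
```

If the break-even count is far above your expected traffic, few-shot prompting is probably the cheaper option.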

---
## 2. Data Preparation <a id='data'></a>

In [None]:
# GUIDED: Create training examples
from src.finetuning_utils import TrainingExample

# Example: Training a sentiment classifier
examples = [
    TrainingExample(
        instruction="Classify the sentiment of this review.",
        input="This product is amazing! Best purchase ever.",
        output="positive",
        system="You are a sentiment classifier. Respond with: positive, negative, or neutral."
    ),
    TrainingExample(
        instruction="Classify the sentiment of this review.",
        input="Terrible quality. Broke after one day.",
        output="negative",
        system="You are a sentiment classifier. Respond with: positive, negative, or neutral."
    ),
    TrainingExample(
        instruction="Classify the sentiment of this review.",
        input="It's okay. Nothing special but works fine.",
        output="neutral",
        system="You are a sentiment classifier. Respond with: positive, negative, or neutral."
    ),
]

print(f"Created {len(examples)} training examples")
for ex in examples:
    print(f"  Input: {ex.input[:40]}... -> {ex.output}")

In [None]:
# GUIDED: Format data for OpenAI
# Path and json are already imported in the setup cell.
from src.finetuning_utils import TrainingExample, format_for_openai

# Create more examples for a realistic dataset
sentiment_data = [
    ("Love this product! Exceeded expectations.", "positive"),
    ("Works great, highly recommend!", "positive"),
    ("Best purchase I've made this year.", "positive"),
    ("Absolutely fantastic quality.", "positive"),
    ("Don't waste your money on this.", "negative"),
    ("Broke within a week. Disappointing.", "negative"),
    ("Customer service was unhelpful.", "negative"),
    ("Poor quality, not worth the price.", "negative"),
    ("It's fine. Does what it's supposed to.", "neutral"),
    ("Average product, nothing special.", "neutral"),
    ("Meets basic expectations.", "neutral"),
    ("Neither good nor bad.", "neutral"),
]

examples = [
    TrainingExample(
        instruction="Classify the sentiment.",
        input=text,
        output=label,
        system="Classify sentiment as: positive, negative, or neutral."
    )
    for text, label in sentiment_data
]

# Save as JSONL
output_dir = Path("../data/training_data")
output_dir.mkdir(parents=True, exist_ok=True)
output_path = output_dir / "sentiment_train.jsonl"

format_for_openai(examples, str(output_path))

# Show what the file looks like
print("\nFile contents (first entry):")
with open(output_path) as f:
    first_line = json.loads(f.readline())
    print(json.dumps(first_line, indent=2))
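
For reference, OpenAI's chat fine-tuning format is JSONL: one JSON object per line, each holding a `messages` list of role/content pairs that ends with the assistant turn the model should learn. The sketch below builds such a record by hand. `to_chat_record` and `to_jsonl` are hypothetical helpers, and joining the instruction and input into a single user turn is an assumption; `format_for_openai` may combine them differently.

```python
import json


def to_chat_record(system: str, instruction: str, text: str, label: str) -> dict:
    """Build one chat-format training record (hypothetical helper)."""
    return {
        "messages": [
            {"role": "system", "content": system},
            # Assumption: instruction and input are joined into one user turn.
            {"role": "user", "content": f"{instruction}\n\n{text}"},
            {"role": "assistant", "content": label},
        ]
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"


record = to_chat_record(
    system="Classify sentiment as: positive, negative, or neutral.",
    instruction="Classify the sentiment.",
    text="Love this product! Exceeded expectations.",
    label="positive",
)
jsonl_text = to_jsonl([record])
print(jsonl_text)
```

Note that `json.dumps` escapes the newlines inside the user message, so each record still occupies exactly one line of the file.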

In [None]:
# GUIDED: Validate training data
from src.finetuning_utils import validate_training_data

results = validate_training_data(examples)

print("Validation Results:")
print(f"  Total examples: {results['total']}")
print(f"  Valid examples: {results['valid']}")
print(f"  Issues found: {len(results['issues'])}")
print("\nStatistics:")
print(f"  Avg instruction length: {results['stats']['avg_instruction_length']:.0f} chars")
print(f"  Avg output length: {results['stats']['avg_output_length']:.0f} chars")
print(f"  Duplicates: {results['stats']['duplicates']}")
print(f"  Empty outputs: {results['stats']['empty_outputs']}")
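
The internals of `validate_training_data` are not shown here, so the following is only a rough sketch of the kind of checks such a validator might perform on plain `(text, label)` pairs: duplicates, empty outputs, unexpected labels, and class balance. `basic_checks` is a hypothetical name and the sample data is made up for illustration.

```python
from collections import Counter


def basic_checks(pairs: list[tuple[str, str]], allowed_labels: set[str]) -> dict:
    """Run simple sanity checks on (text, label) pairs (illustrative only)."""
    texts = [text.strip().lower() for text, _ in pairs]
    text_counts = Counter(texts)
    return {
        "total": len(pairs),
        # Number of extra copies beyond the first occurrence of each text.
        "duplicates": sum(count - 1 for count in text_counts.values()),
        "empty_outputs": sum(1 for _, label in pairs if not label.strip()),
        "unknown_labels": sorted({label for _, label in pairs
                                  if label.strip() and label not in allowed_labels}),
        "label_counts": dict(Counter(label for _, label in pairs)),
    }


sample = [
    ("Works great, highly recommend!", "positive"),
    ("works great, highly recommend!", "positive"),  # duplicate after normalization
    ("Broke within a week.", "negative"),
    ("It's fine.", ""),                               # empty label
    ("Meets basic expectations.", "netural"),         # typo in label
]
report = basic_checks(sample, allowed_labels={"positive", "negative", "neutral"})
print(report)
```

Label typos are worth checking explicitly: a misspelled class name teaches the model to emit that misspelling.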

In [None]:
# GUIDED: Split into train/val/test
from src.finetuning_utils import split_data

train, val, test = split_data(
    examples,
    train_ratio=0.7,
    val_ratio=0.15,
    test_ratio=0.15
)

print(f"Train: {len(train)} examples")
print(f"Val: {len(val)} examples")
print(f"Test: {len(test)} examples")
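
To make the mechanics concrete, here is a minimal sketch of a seeded shuffle-and-slice split. It is not the implementation of `split_data`, which may round or stratify differently; `split_three_ways` and the seed value are assumptions for illustration.

```python
import random


def split_three_ways(items: list, train_ratio: float = 0.7, val_ratio: float = 0.15,
                     seed: int = 42) -> tuple[list, list, list]:
    """Shuffle a copy of items and cut it into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG: reproducible, no global state
    n_train = int(len(shuffled) * train_ratio)
    n_val = int(len(shuffled) * val_ratio)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # test takes whatever remains after rounding
    return train, val, test


train_s, val_s, test_s = split_three_ways(list(range(12)))
print(len(train_s), len(val_s), len(test_s))
```

With 12 items this truncating version yields 8, 1, and 3 items. A single validation example says very little, which is one more reason to collect a larger dataset before training.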

---
## 3. OpenAI Fine-tuning <a id='openai'></a>

In [None]:
# GUIDED: Fine-tuning with OpenAI (demonstration)
from src.finetuning_utils import OpenAIFineTuner

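# Note: This will use your API credits.
# Note: sentiment_train.jsonl was written from all 12 examples, before the split above.
# For a real run, write only the `train` split to the training file so that
# `val` and `test` stay unseen by the model.
# Uncomment to actually run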

# tuner = OpenAIFineTuner()

# Create a job
# job = tuner.create_job(
#     training_file="../data/training_data/sentiment_train.jsonl",
#     model="gpt-4o-mini-2024-07-18",
#     epochs=3,
#     suffix="sentiment-v1"
# )

# print(f"Job created: {job.id}")
# print(f"Status: {job.status}")

print("Fine-tuning demonstration (commented out to avoid API costs)")
print("""\nTo actually fine-tune:
1. Uncomment the code above
2. Ensure you have 10+ training examples
3. Wait for the job to complete (check status with tuner.get_status(job.id))
4. Use the fine-tuned model name from job.fine_tuned_model
""")

In [None]:
# GUIDED: Check job status
# tuner = OpenAIFineTuner()

# List recent jobs
# jobs = tuner.list_jobs(limit=5)
# for job in jobs:
#     print(f"{job['id']}: {job['status']} - {job['model']}")

# Get specific job status
# status = tuner.get_status("ftjob-xxx")
# print(f"Status: {status['status']}")
# print(f"Model: {status['model']}")

print("Status checking demonstration (uncomment to run)")
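
The two cells above go through the `OpenAIFineTuner` wrapper, whose internals are not shown. As a rough sketch of what the underlying calls could look like, the block below uses the OpenAI Python SDK (v1+) directly: `client.files.create`, `client.fine_tuning.jobs.create`, and `client.fine_tuning.jobs.retrieve`. The helper names `start_finetune` and `wait_for_job`, the polling interval, and the poll limit are assumptions. Check the current API reference for the preferred way to pass hyperparameters, since that part of the API has changed over time.

```python
import time
from typing import Optional

# Statuses after which a fine-tuning job will not change any more.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def start_finetune(client, training_path: str, model: str,
                   n_epochs: int = 3, suffix: Optional[str] = None):
    """Upload a JSONL training file and create a fine-tuning job.

    `client` is expected to be an `openai.OpenAI()` instance.
    """
    with open(training_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="fine-tune")
    return client.fine_tuning.jobs.create(
        training_file=uploaded.id,
        model=model,
        hyperparameters={"n_epochs": n_epochs},
        suffix=suffix,
    )


def wait_for_job(client, job_id: str, poll_seconds: float = 30.0, max_polls: int = 120):
    """Poll a job until it reaches a terminal status or the poll budget runs out."""
    job = client.fine_tuning.jobs.retrieve(job_id)
    for _ in range(max_polls):
        if job.status in TERMINAL_STATUSES:
            break
        time.sleep(poll_seconds)
        job = client.fine_tuning.jobs.retrieve(job_id)
    return job


# Usage (costs API credits, so it is left commented out):
# from openai import OpenAI
# client = OpenAI()
# job = start_finetune(client, "../data/training_data/sentiment_train.jsonl",
#                      model="gpt-4o-mini-2024-07-18", suffix="sentiment-v1")
# job = wait_for_job(client, job.id)
# print(job.status, job.fine_tuned_model)
```

Capping the number of polls keeps a stuck job from blocking the notebook indefinitely; the caller should check `job.status` before using `job.fine_tuned_model`.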

In [None]:
# GUIDED: Use a fine-tuned model
from openai import OpenAI

# Replace with your fine-tuned model name
# fine_tuned_model = "ft:gpt-4o-mini-2024-07-18:org::xxxxx"

# client = OpenAI()
# response = client.chat.completions.create(
#     model=fine_tuned_model,
#     messages=[
#         {"role": "system", "content": "Classify sentiment as: positive, negative, or neutral."},
#         {"role": "user", "content": "This is the best thing I've ever bought!"}
#     ]
# )
# print(response.choices[0].message.content)

print("Using fine-tuned model demonstration (uncomment with your model name)")

---
## 4. Evaluation <a id='evaluation'></a>

In [None]:
# GUIDED: Evaluate model performance
from src.evaluation import Evaluator
from src.llm_utils import LLMClient

# Create test cases
evaluator = Evaluator()

test_cases = [
    ("Absolutely love it!", "positive"),
    ("Worst purchase ever.", "negative"),
    ("It works, I guess.", "neutral"),
    ("Outstanding quality and fast shipping!", "positive"),
    ("Disappointed with the product.", "negative"),
]

for i, (text, expected) in enumerate(test_cases):
    evaluator.add_test(
        id=f"test_{i}",
        input=text,
        expected=expected
    )

# Define the system to test (using base model for demo)
# Create the client once rather than on every call.
llm_client = LLMClient(provider="openai", model="gpt-4o-mini")

def sentiment_classifier(text: str) -> str:
    response = llm_client.chat(
        message=f"Classify this sentiment: {text}",
        system="Respond with exactly one word: positive, negative, or neutral."
    )
    # Normalize case, whitespace, and a trailing period so "Positive." still matches.
    return response.strip().lower().rstrip(".")

# Run evaluation
results = evaluator.run(sentiment_classifier)
print(results.summary())

# Show individual results
print("\nDetailed results:")
for r in results.results:
    status = "PASS" if r.passed else "FAIL"
    print(f"  [{status}] {r.input[:30]}... -> {r.actual} (expected: {r.expected})")
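
An overall pass rate can hide a class the model consistently gets wrong. The sketch below computes accuracy per expected label and normalizes raw model output first, since a reply such as "Positive." would otherwise count as a miss. It is independent of the `Evaluator` class, whose matching rules are not shown; the helper names and the sample outputs are made up for illustration.

```python
from collections import defaultdict
import string

LABELS = {"positive", "negative", "neutral"}


def normalize_label(raw: str) -> str:
    """Lowercase, strip whitespace and punctuation; map anything else to 'unknown'."""
    cleaned = raw.strip().lower().strip(string.punctuation + " ")
    return cleaned if cleaned in LABELS else "unknown"


def per_label_accuracy(expected: list[str], predicted: list[str]) -> dict[str, float]:
    """Fraction of correct predictions for each expected label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for exp, pred in zip(expected, predicted):
        total[exp] += 1
        correct[exp] += int(normalize_label(pred) == exp)
    return {label: correct[label] / total[label] for label in total}


expected_labels = ["positive", "negative", "neutral", "positive", "negative"]
raw_outputs = ["Positive.", "negative", "positive", " positive\n", "Negative!"]
scores = per_label_accuracy(expected_labels, raw_outputs)
print(scores)
```

In this made-up sample the positive and negative classes score 1.0 while neutral scores 0.0, a pattern the 80% overall accuracy would not reveal.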

---
## 5. Best Practices <a id='practices'></a>

### Data Quality
- At least 10 examples (the minimum the OpenAI fine-tuning API accepts); 50-100+ recommended
- Diverse examples covering edge cases
- Consistent format across all examples
- Clean, accurate labels

### Training
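- Start with a small number of epochs (e.g. 3) and increase only if the model underfits
- Monitor validation loss; if it rises while training loss keeps falling, the model is overfitting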
- Don't overtrain on small datasets
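
One way to act on validation loss is a simple early-stopping rule: keep the epoch with the lowest validation loss and stop looking once the loss has failed to improve for a set number of epochs. The sketch below assumes you already have one validation loss per epoch. The `best_epoch` helper, the `patience` parameter, and the loss values are all hypothetical.

```python
def best_epoch(val_losses: list[float], patience: int = 1) -> int:
    """Return the 1-based epoch with the lowest validation loss seen before stopping.

    Scanning stops once the loss has failed to improve for `patience`
    consecutive epochs.
    """
    if not val_losses:
        raise ValueError("val_losses must contain at least one value")
    best_idx = 0
    waited = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_idx]:
            best_idx = i
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx + 1


# Hypothetical per-epoch validation losses, for illustration only.
losses = [0.92, 0.61, 0.48, 0.53, 0.60]
print(best_epoch(losses))
```

A larger `patience` tolerates noisy loss curves at the cost of running more epochs before stopping.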

### Evaluation
- Hold out a test set for final evaluation
- Compare against base model performance
- Test on edge cases and adversarial examples

---
## 6. Exercises <a id='exercises'></a>

### Exercise 1: Create Training Dataset

Create a training dataset for a custom task.

In [None]:
# TODO: Create 20+ training examples for a task of your choice
# Ideas: code comments, email classification, product categorization

# Your code here:


### Exercise 2: Compare Base vs Fine-tuned

Design an experiment that compares the base model against a fine-tuned model on the same test set.

In [None]:
# TODO: Create an evaluation that compares the base model with few-shot prompting against the fine-tuned model

# Your code here:


---
## 7. Checkpoint <a id='checkpoint'></a>

Before moving on, verify:

- [ ] You know when fine-tuning is appropriate
- [ ] You can prepare training data
- [ ] You understand the fine-tuning API
- [ ] You can evaluate model performance

### Next Steps

In the next notebook, we'll explore **LoRA & PEFT** - efficient fine-tuning for open models!