# 🦎 Fine-Tune LLMs with Axolotl

This notebook guides you through fine-tuning LLMs using **Axolotl** - a powerful open-source fine-tuning framework.

**Why Axolotl?**
- 🆓 **Completely free** (runs on Colab's free GPU)
- 🎛️ **Full control** over training configuration
- 📦 **Many techniques**: LoRA, QLoRA, full fine-tuning, RLHF
- 🤗 **HuggingFace integration** for easy model sharing
- 🔧 **Highly configurable** via YAML configs

**What you'll learn:**
1. Set up Axolotl on Google Colab
2. Prepare training data
3. Configure training with YAML
4. Run QLoRA fine-tuning
5. Test and export your model

**Requirements:**
- Google Colab with GPU (free T4 works!)
- HuggingFace account (for model download/upload)

## 0. Check GPU & Runtime

⚠️ **Important:** Make sure you're using a GPU runtime!

Go to: `Runtime` → `Change runtime type` → Select `T4 GPU` (or better)

In [None]:
# Check GPU availability
!nvidia-smi

import torch
print(f"\n✅ PyTorch version: {torch.__version__}")
print(f"✅ CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"✅ GPU: {torch.cuda.get_device_name(0)}")
    print(f"✅ GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("❌ No GPU detected! Please enable GPU runtime.")

## 1. Install Axolotl

In [None]:
# Install Axolotl and dependencies
# This takes ~5-10 minutes on first run
# Quote the extras so the shell does not treat the brackets as a glob pattern

!pip install -q "axolotl[flash-attn]"
!pip install -q accelerate transformers bitsandbytes peft trl datasets

print("\n✅ Axolotl installed!")

In [None]:
# Alternative: Install from source for latest features
# Uncomment if you need bleeding-edge features

# !git clone https://github.com/axolotl-ai-cloud/axolotl
# %cd axolotl
# !pip install packaging ninja
# !pip install -e '.[flash-attn,deepspeed]'
# %cd ..

In [None]:
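# Login to HuggingFace (needed for gated models like Llama)
from getpass import getpass
from huggingface_hub import login

try:
    # On Colab, read the token from the notebook's secrets
    from google.colab import userdata
    HF_TOKEN = userdata.get('HF_TOKEN')
except Exception:
    # Not on Colab, or the secret is missing: prompt without echoing the token
    HF_TOKEN = getpass("Enter your HuggingFace token: ")

login(token=HF_TOKEN)
print("✅ Logged in to HuggingFace")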

## 2. Prepare Training Data

Axolotl supports many data formats. We'll use the **conversation** format (similar to ChatML):

```json
{"conversations": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```

Other supported formats:
- `alpaca`: `{"instruction": "", "input": "", "output": ""}`
- `sharegpt`: `{"conversations": [{"from": "human", "value": ""}, {"from": "gpt", "value": ""}]}`
- `completion`: `{"text": "full text to train on"}`
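
If your existing data is in one of the other layouts, it can be normalized to the role/content shape before saving. The sketch below assumes the field names listed above; `from_alpaca` and `from_sharegpt` are hypothetical helper names, not Axolotl functions.

```python
# Sketch: normalize Alpaca- and ShareGPT-style records into the
# role/content conversation format used in this notebook.
# Helper names are illustrative only; they are not part of Axolotl.

SHAREGPT_ROLES = {"system": "system", "human": "user", "gpt": "assistant"}


def from_alpaca(record: dict, system_prompt: str = "") -> dict:
    """Turn {"instruction", "input", "output"} into a conversation."""
    user_text = record["instruction"]
    if record.get("input"):
        # Alpaca keeps optional context in a separate "input" field
        user_text += "\n\n" + record["input"]
    turns = []
    if system_prompt:
        turns.append({"role": "system", "content": system_prompt})
    turns.append({"role": "user", "content": user_text})
    turns.append({"role": "assistant", "content": record["output"]})
    return {"conversations": turns}


def from_sharegpt(record: dict) -> dict:
    """Rename ShareGPT's from/value keys to role/content."""
    turns = [
        {"role": SHAREGPT_ROLES[turn["from"]], "content": turn["value"]}
        for turn in record["conversations"]
    ]
    return {"conversations": turns}


alpaca_row = {
    "instruction": "Summarize this changelog.",
    "input": "v1.2: added dark mode",
    "output": "Version 1.2 adds dark mode.",
}
sharegpt_row = {
    "conversations": [
        {"from": "human", "value": "Hi"},
        {"from": "gpt", "value": "Hello!"},
    ]
}

print(from_alpaca(alpaca_row))
print(from_sharegpt(sharegpt_row))
```

Records converted this way can be appended to the same list as the hand-written examples, so the whole dataset shares one format.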

In [None]:
# ============================================================
# Sample Dataset: Technical Writing Assistant
# Replace with your own domain-specific data!
# ============================================================

import json
from pathlib import Path

SYSTEM_PROMPT = """You are an expert technical writer. You help developers write clear, concise documentation.
Your responses should:
- Use simple, direct language
- Include code examples where helpful
- Follow standard documentation conventions
- Be well-structured with headers and lists"""

sample_data = [
    {
        "conversations": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Write a docstring for a function that calculates the factorial of a number."},
            {"role": "assistant", "content": '''Here's a well-documented factorial function:

```python
def factorial(n: int) -> int:
    """
    Calculate the factorial of a non-negative integer.
    
    The factorial of n (written as n!) is the product of all positive
    integers less than or equal to n.
    
    Args:
        n: A non-negative integer. Must be >= 0.
    
    Returns:
        The factorial of n. Returns 1 if n is 0.
    
    Raises:
        ValueError: If n is negative.
        TypeError: If n is not an integer.
    
    Examples:
        >>> factorial(5)
        120
        >>> factorial(0)
        1
        >>> factorial(1)
        1
    
    Note:
        For large values of n, consider using math.factorial()
        which is implemented in C and handles arbitrary precision.
    """
    if not isinstance(n, int):
        raise TypeError(f"Expected int, got {type(n).__name__}")
    if n < 0:
        raise ValueError(f"Factorial not defined for negative numbers: {n}")
    if n <= 1:
        return 1
    return n * factorial(n - 1)
```

This docstring follows Google's Python style guide and includes all essential sections.'''}
        ]
    },
    {
        "conversations": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "How do I write a good README for my open source project?"},
            {"role": "assistant", "content": '''# Writing a Great README

A good README is the front door to your project. Here's a proven structure:

## Essential Sections

### 1. Project Title & Description
```markdown
# Project Name

A brief, clear description of what your project does and why it exists.
Keep it to 1-2 sentences.
```

### 2. Quick Start / Installation
```markdown
## Installation

```bash
pip install your-package
```

## Quick Start

```python
from your_package import main_function
result = main_function("example")
```
```

### 3. Usage Examples
Show 2-3 common use cases with working code.

### 4. API Reference (or link to docs)
For larger projects, link to full documentation.

## Optional But Valuable

- **Badges**: Build status, version, license
- **Screenshots/GIFs**: For visual projects
- **Contributing guide**: How others can help
- **License**: Always include this
- **Changelog**: Link to CHANGELOG.md

## Pro Tips

1. **Test your install instructions** on a clean machine
2. **Keep examples runnable** - copy-paste should work
3. **Update regularly** - stale docs hurt adoption
4. **Link to issues** for known limitations'''}
        ]
    },
    {
        "conversations": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Explain the difference between REST and GraphQL APIs for a docs page."},
            {"role": "assistant", "content": '''# REST vs GraphQL: Choosing Your API Style

This guide helps you understand when to use REST or GraphQL for your API.

## Overview

| Aspect | REST | GraphQL |
|--------|------|--------|
| Data fetching | Multiple endpoints | Single endpoint |
| Response shape | Fixed by server | Defined by client |
| Versioning | URL-based (v1, v2) | Schema evolution |
| Caching | HTTP caching built-in | Requires custom setup |
| Learning curve | Lower | Higher |

## REST API

**Best for:** Simple CRUD operations, public APIs, caching-heavy workloads.

```bash
# Get a user
GET /api/users/123

# Get user's posts (separate request)
GET /api/users/123/posts
```

**Pros:**
- Industry standard, well understood
- Native HTTP caching
- Stateless and scalable

**Cons:**
- Over-fetching (get more data than needed)
- Under-fetching (multiple round trips)
- Rigid response structure

## GraphQL

**Best for:** Complex data relationships, mobile apps, rapid frontend iteration.

```graphql
# Get user AND posts in one request
query {
  user(id: "123") {
    name
    email
    posts(limit: 5) {
      title
      createdAt
    }
  }
}
```

**Pros:**
- Fetch exactly what you need
- Single request for complex data
- Strongly typed schema
- Great developer tooling

**Cons:**
- More complex to implement
- Caching requires extra work
- Potential for expensive queries

## Recommendation

- **Choose REST** if you need simplicity, caching, or a public API
- **Choose GraphQL** if you have complex data needs or mobile clients
- **Consider both** - they can coexist in the same application'''}
        ]
    },
    {
        "conversations": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Write API documentation for a POST endpoint that creates a new user."},
            {"role": "assistant", "content": '''# Create User

Creates a new user account in the system.

```
POST /api/v1/users
```

## Authentication

Requires API key with `users:write` scope.

```bash
Authorization: Bearer <api_key>
```

## Request Body

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `email` | string | Yes | User's email address. Must be unique. |
| `name` | string | Yes | Display name (2-100 characters). |
| `password` | string | Yes | Password (min 8 chars, 1 uppercase, 1 number). |
| `role` | string | No | User role. Default: `"member"`. Options: `"admin"`, `"member"`, `"viewer"`. |

### Example Request

```bash
curl -X POST https://api.example.com/api/v1/users \\
  -H "Authorization: Bearer sk_live_xxx" \\
  -H "Content-Type: application/json" \\
  -d '{
    "email": "jane@example.com",
    "name": "Jane Smith",
    "password": "SecurePass123",
    "role": "member"
  }'
```

## Response

### Success (201 Created)

```json
{
  "id": "usr_abc123",
  "email": "jane@example.com",
  "name": "Jane Smith",
  "role": "member",
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:00Z"
}
```

### Errors

| Status | Code | Description |
|--------|------|-------------|
| 400 | `invalid_request` | Missing or invalid fields |
| 401 | `unauthorized` | Invalid or missing API key |
| 409 | `email_exists` | Email already registered |
| 422 | `validation_error` | Password doesn't meet requirements |

### Error Example

```json
{
  "error": {
    "code": "validation_error",
    "message": "Password must contain at least one uppercase letter",
    "field": "password"
  }
}
```

## Rate Limits

- 100 requests per minute per API key
- Returns `429 Too Many Requests` when exceeded'''}
        ]
    }
]

print(f"📊 Sample dataset with {len(sample_data)} examples")

### 2.1 Your Contribution: Add Training Examples

`★ Insight ─────────────────────────────────────`
**Axolotl advantage:** Since you control the full training process, you can:
- Use larger datasets without upload limits
- Experiment with different LoRA ranks
- Try different prompt templates
- Fine-tune on multiple GPUs locally later
`─────────────────────────────────────────────────`

**Your task:** Add examples specific to your documentation needs.

In [None]:
# ============================================================
# TODO: Add your training examples
# ============================================================

def create_doc_example(user_request: str, assistant_response: str) -> dict:
    """Helper to create a documentation training example."""
    return {
        "conversations": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_request},
            {"role": "assistant", "content": assistant_response}
        ]
    }

# Add your examples:
my_examples = [
    # Example: Uncomment and customize
    # create_doc_example(
    #     user_request="Document this function...",
    #     assistant_response="Here's the documentation..."
    # ),
]

# Combine all data
all_training_data = sample_data + my_examples
print(f"📊 Total training examples: {len(all_training_data)}")

In [None]:
# Save training data to JSONL

DATA_DIR = Path("data")
DATA_DIR.mkdir(exist_ok=True)

TRAIN_FILE = DATA_DIR / "train.jsonl"

with open(TRAIN_FILE, 'w') as f:
    for example in all_training_data:
        f.write(json.dumps(example) + '\n')

print(f"✅ Saved {len(all_training_data)} examples to {TRAIN_FILE}")
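
Before training, it can help to read the JSONL file back and check that every record has the expected shape. The following is a minimal sketch, assuming the role/content conversation format used above; `validate_jsonl` is a hypothetical helper, and the demo runs on a throwaway file so the block works on its own.

```python
import json
import tempfile
from pathlib import Path

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_jsonl(path) -> list:
    """Return a list of problems; an empty list means every record passed."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {line_no}: invalid JSON ({exc.msg})")
                continue
            turns = record.get("conversations") if isinstance(record, dict) else None
            if not isinstance(turns, list) or not turns:
                problems.append(f"line {line_no}: missing or empty 'conversations'")
                continue
            for i, turn in enumerate(turns):
                if not isinstance(turn, dict) or turn.get("role") not in ALLOWED_ROLES:
                    problems.append(f"line {line_no}, turn {i}: unexpected role")
                elif not isinstance(turn.get("content"), str) or not turn["content"].strip():
                    problems.append(f"line {line_no}, turn {i}: empty content")
            last = turns[-1]
            if not (isinstance(last, dict) and last.get("role") == "assistant"):
                problems.append(f"line {line_no}: last turn is not from the assistant")
    return problems


# Demo: one well-formed record and one that still uses ShareGPT-style keys
good = {"conversations": [{"role": "user", "content": "Hi"},
                          {"role": "assistant", "content": "Hello!"}]}
bad = {"conversations": [{"from": "human", "value": "Hi"}]}

with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.jsonl"
    demo_file.write_text(
        "\n".join(json.dumps(r) for r in (good, bad)) + "\n", encoding="utf-8"
    )
    for problem in validate_jsonl(demo_file):
        print(problem)
```

In this notebook the same check would be `validate_jsonl(TRAIN_FILE)`; an empty result means the file is ready for preprocessing.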

## 3. Create Axolotl Configuration

Axolotl uses YAML configuration files. This is where you define:
- Base model
- Training method (LoRA, QLoRA, full)
- Hyperparameters
- Data format

In [None]:
# ============================================================
# Axolotl Configuration for QLoRA Fine-tuning
# ============================================================

# Model configuration
BASE_MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # Good for T4 GPU
# Alternatives:
# - "meta-llama/Meta-Llama-3.1-8B-Instruct" (needs HF access)
# - "microsoft/phi-2" (smaller, faster)
# - "TinyLlama/TinyLlama-1.1B-Chat-v1.0" (very small, for testing)

OUTPUT_DIR = "./outputs/tech-writer-qlora"

config = f"""
# Base model
base_model: {BASE_MODEL}
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

# Load in 4-bit for QLoRA (fits on T4 GPU)
load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true

# Dataset configuration
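datasets:
  - path: {TRAIN_FILE}
    ds_type: json
    # The records use role/content messages, so load them with the
    # chat_template strategy (sharegpt expects from/value keys instead).
    # Key names can differ between Axolotl releases; check the docs for yours.
    type: chat_template
    field_messages: conversations
    message_field_role: role
    message_field_content: content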

# Chat template
chat_template: chatml

# Output
output_dir: {OUTPUT_DIR}

# Training hyperparameters
sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.1

optimizer: adamw_bnb_8bit
weight_decay: 0.01
max_grad_norm: 1.0

# Training settings
train_on_inputs: false
group_by_length: false
bf16: auto
fp16: false
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

# Logging
logging_steps: 1
save_strategy: epoch
save_total_limit: 2

# Evaluation (optional)
# val_set_size: 0.1
# eval_steps: 20

# Flash attention (needs an Ampere or newer GPU; the T4 is not supported,
# so set this to true only on hardware such as A100 or L4)
flash_attention: false

# Seed for reproducibility
seed: 42
"""

# Save config
CONFIG_FILE = "config.yml"
with open(CONFIG_FILE, 'w') as f:
    f.write(config)

print(f"✅ Configuration saved to {CONFIG_FILE}")
print("\n📋 Key settings:")
print(f"   Base model: {BASE_MODEL}")
print(f"   Method: QLoRA (4-bit quantization)")
print(f"   LoRA rank: 32")
print(f"   Epochs: 3")
print(f"   Learning rate: 2e-4")
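
The batch settings in the config combine into an effective batch size, which in turn determines how many optimizer steps a run takes. The sketch below is a rough estimate only: it assumes a single GPU and ignores sample packing, which merges short examples into one sequence and so lowers the real step count. `estimate_steps` is a hypothetical helper.

```python
import math

# Values copied from the config above
MICRO_BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 4
NUM_EPOCHS = 3
WARMUP_RATIO = 0.1
NUM_GPUS = 1  # assumption: one Colab GPU


def estimate_steps(num_examples: int) -> dict:
    """Rough optimizer-step estimate, ignoring sample packing."""
    effective_batch = MICRO_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS * NUM_GPUS
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * NUM_EPOCHS
    warmup_steps = math.ceil(total_steps * WARMUP_RATIO)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


for n in (4, 100, 1000):
    print(n, estimate_steps(n))
```

With only the four sample examples, each epoch is a single optimizer step, which is one reason a larger dataset is needed for meaningful training.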

### 3.1 Understanding Key Configuration Options

`★ Insight ─────────────────────────────────────`
**QLoRA explained:**
- **Q** = Quantized (4-bit) base model → fits in memory
- **LoRA** = Low-Rank Adaptation → trains small adapter weights
- Result: Fine-tune 7B models on a free Colab T4 GPU!

**Key parameters:**
- `lora_r`: Rank of adaptation matrices (higher = more capacity, more memory)
- `lora_alpha`: Scaling factor; the adapter update is scaled by `lora_alpha / lora_r` (typically set to `r/2` or `r`)
- `sample_packing`: Combines short examples into one sequence → faster training
`─────────────────────────────────────────────────`
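
To make the rank and scaling parameters concrete, the arithmetic can be sketched for a single linear layer. LoRA adds two small matrices, `A` (r × d_in) and `B` (d_out × r), and scales their product by `alpha / r`. The 4096 × 4096 layer size below is an illustrative assumption, and both helper names are hypothetical.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


def lora_scaling(alpha: float, r: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return alpha / r


# Illustrative layer size (assumption): a square 4096 x 4096 projection
d_in, d_out = 4096, 4096
full_params = d_in * d_out

for r in (8, 32, 64, 128):
    added = lora_param_count(d_in, d_out, r)
    share = 100 * added / full_params
    print(f"r={r:>3}: {added:>9,} trainable params ({share:.2f}% of the full layer)")

print("scaling for alpha=16, r=32:", lora_scaling(16, 32))
```

Doubling `lora_r` doubles the adapter size, and if `lora_alpha` is left unchanged it also halves the scaling factor, so the two are usually adjusted together.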

In [None]:
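# Validate configuration with a preprocessing dry run
!python -m axolotl.cli.preprocess {CONFIG_FILE} --debug 2>/dev/null && echo "Config validation complete" || echo "Config validation failed: rerun without 2>/dev/null to see the error"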

## 4. Run Fine-Tuning

In [None]:
# Preprocess data (creates tokenized cache)
print("📦 Preprocessing data...")
!python -m axolotl.cli.preprocess {CONFIG_FILE}

In [None]:
# Start training!
# This will take 15-60 minutes depending on dataset size and GPU

print("🚀 Starting fine-tuning...")
print("   This may take 15-60 minutes on Colab's T4 GPU.\n")

!accelerate launch -m axolotl.cli.train {CONFIG_FILE}

In [None]:
# Check training output
from pathlib import Path

output_path = Path(OUTPUT_DIR)
if output_path.exists():
    print("✅ Training outputs:")
    for item in sorted(output_path.iterdir()):
        # Show file sizes in MB; mark subdirectories (e.g. checkpoints) as "dir"
        size = item.stat().st_size / 1e6 if item.is_file() else "dir"
        print(f"   {item.name}: {size if isinstance(size, str) else f'{size:.1f} MB'}")
else:
    print("❌ Output directory not found. Training may have failed.")

## 5. Test Your Fine-Tuned Model

In [None]:
# Load the fine-tuned model for inference

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

print("📦 Loading model...")

# Load base model in 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, OUTPUT_DIR)
tokenizer = AutoTokenizer.from_pretrained(OUTPUT_DIR)

print("✅ Model loaded!")

In [None]:
def generate_response(prompt: str, max_new_tokens: int = 512) -> str:
    """Generate a response from the fine-tuned model."""
    
    # Format with ChatML template
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt}
    ]
    
    # Apply chat template
    formatted = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    # Tokenize
    inputs = tokenizer(formatted, return_tensors="pt").to(model.device)
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    # Decode only the new tokens
    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return response.strip()
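
For reference, the common ChatML layout wraps each message in `<|im_start|>` and `<|im_end|>` markers. The sketch below renders it by hand so the prompt structure is visible; `render_chatml` is a hypothetical helper, and the template stored with the tokenizer remains the authoritative version.

```python
def render_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Render role/content messages in the common ChatML layout."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


demo_messages = [
    {"role": "system", "content": "You are an expert technical writer."},
    {"role": "user", "content": "Write a docstring for an email validator."},
]

print(render_chatml(demo_messages))
```

Comparing this output with the `formatted` string inside `generate_response` is a quick way to confirm that the tokenizer applies the template you trained with.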

In [None]:
# Test the model!

test_prompts = [
    "Write a docstring for a function that validates email addresses.",
    "How should I document error handling in my API?",
    "Create a brief changelog entry for adding dark mode to an app."
]

print("🧪 Testing fine-tuned model\n")
print("=" * 70)

for prompt in test_prompts:
    print(f"\n👤 User: {prompt}")
    print("-" * 50)
    
    response = generate_response(prompt)
    print(f"🤖 Assistant:\n{response}")
    print("=" * 70)

## 6. Export & Share Your Model

In [None]:
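# Option 1: Merge LoRA weights into base model (larger file, easier to use)
# Note: `model` wraps a 4-bit base here, so the merge happens on quantized
# weights and may lose some precision. For a higher-fidelity merge, reload
# the base model in fp16 on a machine with enough memory and merge there.

MERGED_MODEL_DIR = "./outputs/tech-writer-merged"

print("🔀 Merging LoRA weights into base model...")

# Merge
merged_model = model.merge_and_unload()

# Save
merged_model.save_pretrained(MERGED_MODEL_DIR)
tokenizer.save_pretrained(MERGED_MODEL_DIR)

print(f"✅ Merged model saved to {MERGED_MODEL_DIR}")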

In [None]:
# Option 2: Push to HuggingFace Hub

HF_USERNAME = input("Enter your HuggingFace username: ")
MODEL_NAME = "tech-writer-mistral-qlora"  # Change this!

REPO_ID = f"{HF_USERNAME}/{MODEL_NAME}"

print(f"📤 Pushing to HuggingFace Hub: {REPO_ID}")

# Push LoRA adapter (smaller, requires base model at inference)
# login() above already stored the token, so no auth argument is needed
# (use_auth_token is deprecated)
model.push_to_hub(REPO_ID)
tokenizer.push_to_hub(REPO_ID)

print(f"✅ Model uploaded to https://huggingface.co/{REPO_ID}")

In [None]:
# Option 3: Download to local machine

# Zip the output directory
!zip -r fine_tuned_model.zip {OUTPUT_DIR}

# Download in Colab
try:
    from google.colab import files
    files.download('fine_tuned_model.zip')
    print("✅ Download started!")
except Exception:
    # Not running on Colab, or the browser download could not start
    print("📁 Model saved to fine_tuned_model.zip")
    print("   Download manually from the file browser.")

## 7. Use Your Model Later

### From HuggingFace Hub:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base + adapter (fp16 roughly halves memory compared with fp32)
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "your-username/tech-writer-mistral-qlora")
tokenizer = AutoTokenizer.from_pretrained("your-username/tech-writer-mistral-qlora")
```

### From Local Files:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load merged model
model = AutoModelForCausalLM.from_pretrained("./outputs/tech-writer-merged")
tokenizer = AutoTokenizer.from_pretrained("./outputs/tech-writer-merged")
```

## 📚 Resources

- [Axolotl GitHub](https://github.com/axolotl-ai-cloud/axolotl)
- [Axolotl Examples](https://github.com/axolotl-ai-cloud/axolotl/tree/main/examples)
- [QLoRA Paper](https://arxiv.org/abs/2305.14314)
- [PEFT Library](https://huggingface.co/docs/peft)

## 💡 Tips for Better Results

### Data Quality
- **100+ examples** recommended for good results
- **Consistent format** across all examples
- **Cover edge cases** your model will encounter

### Training Configuration
- **Increase `lora_r`** (64, 128) for more complex tasks
- **Lower learning rate** (1e-5) if loss is unstable
- **More epochs** (5-10) for smaller datasets
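
When tuning the learning rate, it helps to see how `lr_scheduler: cosine` and `warmup_ratio` shape it over a run. The sketch below mirrors the usual linear-warmup plus cosine-decay formula; `lr_at_step` is a hypothetical helper, and the trainer's own scheduler may differ in small details.

```python
import math


def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Learning rate under linear warmup followed by cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 100
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {lr_at_step(step, total):.2e}")
```

The peak is reached at the end of warmup, so lowering `learning_rate` scales the whole curve while `warmup_ratio` only moves where the peak occurs.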

### Memory Optimization
- Use **gradient checkpointing** (enabled in the config above)
- Reduce **micro_batch_size** if OOM
- Use **sample_packing** for short examples
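
A quick estimate shows why 4-bit loading matters on a 16 GB card. The sketch below counts weight storage only, assuming a round 7 billion parameters; it ignores quantization constants, activations, LoRA weights, and optimizer state, so real usage is higher. `weight_memory_gb` is a hypothetical helper.

```python
# Approximate bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


NUM_PARAMS = 7e9  # assumption: a round figure for a 7B model

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(NUM_PARAMS, dtype):.1f} GB")
```

Under these assumptions the fp16 weights alone nearly fill a T4, while the 4-bit weights leave most of the card free for activations and the adapter.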