# Unsloth SFT Fine-tuning Template

This notebook provides a complete template for supervised fine-tuning (SFT) with Unsloth.

**What you'll learn:**
- Loading quantized models efficiently with Unsloth
- Configuring LoRA for parameter-efficient fine-tuning
- Training with SFTTrainer from TRL
- Saving and exporting your fine-tuned model

**Requirements:**
- CUDA GPU with 8GB+ VRAM (reduce `BATCH_SIZE` in the configuration cell if memory is limited)
- ~30 minutes for a small dataset

## 1. Installation

Run this cell to install/update Unsloth and dependencies.

In [None]:
%%capture
!pip install unsloth
# Get latest Unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git

## 2. Imports and GPU Check

In [None]:
import torch
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template, standardize_sharegpt
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Check GPU
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"GPU: {gpu_name}")
    print(f"VRAM: {gpu_memory:.1f} GB")
else:
    raise RuntimeError("No GPU detected! Training requires CUDA.")

## 3. Configuration

**Modify these values** to customize your training run.

In [None]:
# ============================================
# CONFIGURATION - Modify these values
# ============================================

# Model - see references/model_selection.md for options
MODEL_NAME = "unsloth/llama-3.1-8b-unsloth-bnb-4bit"
MAX_SEQ_LENGTH = 2048
LOAD_IN_4BIT = True

# Dataset
DATASET_NAME = "mlabonne/FineTome-100k"
DATASET_FORMAT = "sharegpt"  # Options: alpaca, sharegpt, chatml, raw
CHAT_TEMPLATE = "llama-3.1"  # Must match model family

# LoRA Configuration
LORA_RANK = 16         # Higher = more capacity, more VRAM
LORA_ALPHA = 16        # Usually equal to rank
LORA_DROPOUT = 0       # 0 is fine for most cases

# Training Configuration
BATCH_SIZE = 2                 # Reduce if OOM
GRADIENT_ACCUMULATION = 4      # Effective batch = BATCH_SIZE * GRADIENT_ACCUMULATION
LEARNING_RATE = 2e-4           # Unsloth recommended
NUM_EPOCHS = 1                 # 1-3 for instruction tuning
WARMUP_STEPS = 5

# Output
OUTPUT_DIR = "outputs"

print(f"Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION}")
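
The effective batch size also determines how many optimizer updates a run performs, which is useful when choosing `WARMUP_STEPS` and `logging_steps`. The sketch below is a rough estimate only: `NUM_EXAMPLES` is an assumed, illustrative dataset size (replace it with `len(dataset)` once the data is loaded), and `estimate_optimizer_steps` is a hypothetical helper, not part of Unsloth or TRL.

```python
import math

# Values mirror the configuration cell above.
BATCH_SIZE = 2
GRADIENT_ACCUMULATION = 4
NUM_EPOCHS = 1

# Assumption: illustrative dataset size; use len(dataset) after loading.
NUM_EXAMPLES = 100_000


def estimate_optimizer_steps(num_examples, batch_size, grad_accum, epochs, num_gpus=1):
    """Estimate the effective batch size and the number of optimizer updates."""
    effective_batch = batch_size * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = estimate_optimizer_steps(
    NUM_EXAMPLES, BATCH_SIZE, GRADIENT_ACCUMULATION, NUM_EPOCHS
)
print(f"Effective batch size: {effective_batch}")
print(f"Estimated optimizer steps: {total_steps:,}")
```

Halving `BATCH_SIZE` while doubling `GRADIENT_ACCUMULATION` keeps both numbers unchanged, which is the usual way to work around out-of-memory errors without altering the training dynamics much.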

## 4. Load Model

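Unsloth patches the model at load time to speed up training and reduce memory use. The project reports up to 2x faster training and up to 70% less memory when combined with 4-bit quantization; actual gains depend on the model, sequence length, and GPU.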

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,  # Auto-detect best dtype
    load_in_4bit=LOAD_IN_4BIT,
)

print(f"Model loaded: {MODEL_NAME}")
print(f"Parameters: {model.num_parameters():,}")
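
To see why 4-bit loading matters on small GPUs, it helps to estimate the memory needed for the weights alone. This is a back-of-the-envelope sketch: `NUM_PARAMS` is an assumed round figure for an 8B model, and the estimate ignores quantization metadata, layers kept in higher precision, activations, optimizer state, and the LoRA adapters themselves.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: a round 8 billion parameters for an "8B" model.
NUM_PARAMS = 8_000_000_000

mem_16bit = weight_memory_gb(NUM_PARAMS, 16)
mem_4bit = weight_memory_gb(NUM_PARAMS, 4)

print(f"16-bit weights: ~{mem_16bit:.1f} GB")
print(f"4-bit weights:  ~{mem_4bit:.1f} GB")
print(f"Reduction in weight storage: {100 * (1 - mem_4bit / mem_16bit):.0f}%")
```

Real usage during training is higher than the 4-bit figure, so treat it as a lower bound when checking it against the VRAM reported by the GPU check cell.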

## 5. Apply LoRA Adapters

LoRA (Low-Rank Adaptation) adds small trainable matrices to the model. This allows fine-tuning with a fraction of the parameters.

**Key parameters:**
- `r` (rank): Higher = more capacity but more memory. 16-64 is typical.
- `lora_alpha`: Scaling factor, usually equal to r.
- `target_modules`: Which layers to adapt. Targeting more modules (attention and MLP) generally improves adaptation, at the cost of more trainable parameters and VRAM.

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",      # MLP
    ],
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=42,
)

# Count trainable parameters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} ({100*trainable/total:.2f}%)")
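
The trainable-parameter count printed above can be cross-checked by hand. For a linear layer with `in_features` inputs and `out_features` outputs, LoRA adds two matrices with `r * (in_features + out_features)` parameters in total. The sketch below assumes the layer shapes published for Llama 3.1 8B; the constants and the `lora_param_count` helper are illustrative and are not read from the loaded model.

```python
# Assumption: shapes published for Llama 3.1 8B
# (hidden size 4096, MLP size 14336, 32 layers, 8 KV heads of dimension 128).
HIDDEN, INTERMEDIATE, KV_DIM, NUM_LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) for each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


LORA_RANK, LORA_ALPHA = 16, 16
print(f"LoRA parameters at r={LORA_RANK}: {lora_param_count(LORA_RANK):,}")
print(f"LoRA scaling (lora_alpha / r): {LORA_ALPHA / LORA_RANK}")
```

The count grows linearly with `r`, so doubling the rank doubles the adapter size. With standard LoRA scaling, keeping `lora_alpha` equal to `r` holds the scaling factor at 1.0.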

## 6. Load and Format Dataset

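This cell handles different dataset formats. The format is not auto-detected: the formatting function is chosen from the `DATASET_FORMAT` value set in the configuration cell, so that value must match the dataset's actual columns.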

In [None]:
# Load dataset
dataset = load_dataset(DATASET_NAME, split="train")
print(f"Dataset loaded: {len(dataset)} examples")
print(f"Columns: {dataset.column_names}")

# Apply chat template
tokenizer = get_chat_template(tokenizer, chat_template=CHAT_TEMPLATE)

# Format based on dataset type
if DATASET_FORMAT == "sharegpt":
    # ShareGPT format needs standardization
    dataset = standardize_sharegpt(dataset)
    
    def format_fn(example):
        return {"text": tokenizer.apply_chat_template(
            example["conversations"], tokenize=False
        )}

elif DATASET_FORMAT == "alpaca":
    def format_fn(example):
        messages = [{"role": "user", "content": example["instruction"]}]
        if example.get("input"):
            messages[0]["content"] += f"\n\n{example['input']}"
        messages.append({"role": "assistant", "content": example["output"]})
        return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

elif DATASET_FORMAT == "chatml":
    def format_fn(example):
        return {"text": tokenizer.apply_chat_template(
            example["messages"], tokenize=False
        )}

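elif DATASET_FORMAT == "raw":
    # Raw datasets must already provide a "text" column
    def format_fn(example):
        return {"text": example["text"]}

else:
    # Fail early on typos instead of silently treating the data as raw text
    raise ValueError(f"Unsupported DATASET_FORMAT: {DATASET_FORMAT!r}")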

# Apply formatting
dataset = dataset.map(format_fn)

# Preview
print(f"\nSample formatted text:\n{dataset[0]['text'][:500]}...")
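
Each branch above converts records into a list of `role`/`content` messages before the chat template is applied. The following is a simplified illustration of that conversion for ShareGPT and Alpaca records. It is not Unsloth's `standardize_sharegpt` implementation; `ROLE_MAP`, both helper functions, and the sample records are hypothetical.

```python
# Simplified illustration only; not Unsloth's implementation.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}


def sharegpt_to_messages(conversation):
    """Convert ShareGPT-style turns (from/value) into role/content messages."""
    return [
        {"role": ROLE_MAP.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in conversation
    ]


def alpaca_to_messages(example):
    """Convert an Alpaca-style record into a two-turn conversation."""
    prompt = example["instruction"]
    if example.get("input"):
        prompt += f"\n\n{example['input']}"
    return [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": example["output"]},
    ]


# Hypothetical sample records for demonstration.
sharegpt_sample = [
    {"from": "human", "value": "What is LoRA?"},
    {"from": "gpt", "value": "A parameter-efficient fine-tuning method."},
]
alpaca_sample = {
    "instruction": "Translate to French.",
    "input": "Hello",
    "output": "Bonjour",
}

print(sharegpt_to_messages(sharegpt_sample))
print(alpaca_to_messages(alpaca_sample))
```

Once every format is reduced to the same message structure, a single call to `tokenizer.apply_chat_template` can render it with the model family's special tokens.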

## 7. Training

`SFTTrainer` from TRL runs supervised fine-tuning (SFT): it trains the model to reproduce the responses in your training examples.

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,
    packing=False,  # Can enable for efficiency with short sequences
    args=TrainingArguments(
        output_dir=OUTPUT_DIR,
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=GRADIENT_ACCUMULATION,
        warmup_steps=WARMUP_STEPS,
        num_train_epochs=NUM_EPOCHS,
        learning_rate=LEARNING_RATE,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=42,
        save_strategy="epoch",
    ),
)

# Start training
print("Starting training...")
trainer_stats = trainer.train()
print(f"\nTraining complete! Final loss: {trainer_stats.training_loss:.4f}")
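
The `warmup_steps` and `lr_scheduler_type="linear"` settings above shape the learning rate over the run: it ramps up linearly from zero during warmup, then decays linearly to zero. A minimal pure-Python sketch of that schedule follows. `linear_schedule_lr` is a hypothetical helper written for illustration, and `TOTAL_STEPS` is an assumed run length rather than a value taken from this notebook.

```python
def linear_schedule_lr(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


LEARNING_RATE = 2e-4
WARMUP_STEPS = 5
TOTAL_STEPS = 100  # Assumption: hypothetical run length for illustration

for step in (0, 1, 5, 50, 100):
    lr = linear_schedule_lr(step, LEARNING_RATE, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>3}: lr = {lr:.2e}")
```

With only 5 warmup steps the peak learning rate is reached almost immediately, so on very short runs most updates happen close to `LEARNING_RATE`.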

## 8. Save Model

Save the trained model in multiple formats for different use cases.

In [None]:
# Save LoRA adapters only (small, can be merged later)
model.save_pretrained(f"{OUTPUT_DIR}/lora_adapter")
tokenizer.save_pretrained(f"{OUTPUT_DIR}/lora_adapter")
print(f"LoRA adapter saved to {OUTPUT_DIR}/lora_adapter")

# Save merged model in 16-bit (for full deployment)
model.save_pretrained_merged(
    f"{OUTPUT_DIR}/merged_16bit",
    tokenizer,
    save_method="merged_16bit",
)
print(f"Merged model saved to {OUTPUT_DIR}/merged_16bit")

## 9. Test Inference

Quick test to verify the model works correctly.

In [None]:
# Enable fast inference mode
FastLanguageModel.for_inference(model)

# Test prompt
test_messages = [
    {"role": "user", "content": "Hello! Can you introduce yourself?"}
]

inputs = tokenizer.apply_chat_template(
    test_messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=256,
    do_sample=True,  # Required for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)

## 10. Optional: Save as GGUF

Convert to GGUF for use with llama.cpp, Ollama, or LM Studio.

In [None]:
# Uncomment to generate GGUF files
# model.save_pretrained_gguf(
#     f"{OUTPUT_DIR}/gguf",
#     tokenizer,
#     quantization_method="q4_k_m",  # Options: q4_k_m, q5_k_m, q8_0
# )
# print(f"GGUF saved to {OUTPUT_DIR}/gguf")

## 11. Optional: Push to Hugging Face Hub

Upload your model to share with others.

In [None]:
# Uncomment and set your repo name
# HUB_REPO = "your-username/your-model-name"
# 
# model.push_to_hub(HUB_REPO, token=True)
# tokenizer.push_to_hub(HUB_REPO, token=True)
# print(f"Model pushed to https://huggingface.co/{HUB_REPO}")

---

## Next Steps

1. **Test your model** with various prompts
2. **Convert to GGUF** if you want local inference
3. **Upload to Hub** to share your model
4. **Try DPO** for preference alignment (see dpo_template.ipynb)

For more information, see the reference guides:
- `references/model_selection.md` - Choosing base models
- `references/hardware_guide.md` - VRAM requirements
- `references/training_methods.md` - SFT vs DPO vs GRPO
- `references/troubleshooting.md` - Common issues