# 🦙 Fine-Tuning Llama 3 8B for Function Calling

This notebook demonstrates how to fine-tune **Llama 3 8B Instruct** using **QLoRA** (Quantized Low-Rank Adaptation) with the **Unsloth** library for efficient training on Google Colab T4 GPUs.

## 📋 Overview

- **Base Model**: `unsloth/llama-3-8b-Instruct-bnb-4bit`
- **Dataset**: `glaiveai/glaive-function-calling-v2`
- **Method**: QLoRA (4-bit quantization + LoRA adapters)
- **Hardware**: Google Colab T4 GPU (16GB VRAM)
- **Output**: GGUF format for Ollama deployment

## 🎯 What You'll Learn

1. Setting up Unsloth for efficient fine-tuning
2. Loading and preparing function calling datasets
3. Formatting data with the Llama 3 chat template
4. Configuring and running QLoRA training
5. Exporting models to GGUF format for inference

---

## 📦 Step 1: Install Dependencies

First, we install the required libraries:

- **unsloth**: Optimized library for fast LLM fine-tuning (the project reports about 2x faster training and about 50% less memory)
- **xformers**: Memory-efficient attention mechanisms
- **trl**: Transformer Reinforcement Learning library (includes SFTTrainer)
- **peft**: Parameter-Efficient Fine-Tuning library

> ⚠️ **Note**: Run this cell first and restart the runtime if prompted.

In [None]:
%%capture
# Install Unsloth with Colab-specific optimizations
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# Install xformers and the training stack; --no-deps avoids overriding Colab's preinstalled packages
!pip install --no-deps xformers "trl<0.9.0" peft accelerate bitsandbytes

# Install additional dependencies
!pip install datasets huggingface_hub

In [None]:
# Verify installation and check GPU
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

---

## 🔧 Step 2: Load the Pre-trained Model

We use Unsloth's optimized 4-bit quantized version of Llama 3 8B Instruct. This model:

- Uses **4-bit NormalFloat (NF4)** quantization for memory efficiency
- Is pre-quantized, so the download is smaller and no on-the-fly quantization step is needed at load time
- Includes all Llama 3 instruction-following capabilities
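
To see why 4-bit loading matters on a 16 GB card, the weight memory can be estimated with simple arithmetic. This is a rough sketch that assumes about 8.03 billion parameters and ignores quantization constants and any layers left unquantized, so real usage will be somewhat higher. The helper name `weight_memory_gb` is illustrative, not part of any library.

```python
# Back-of-the-envelope weight memory for an ~8B-parameter model.
# Assumption: 8.03e9 parameters; quantization constants, activations,
# and optimizer state are not counted here.
N_PARAMS = 8.03e9


def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"FP16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 16-bit weights alone would roughly fill a T4, while the 4-bit weights leave room for adapters, activations, and optimizer state.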

### LoRA Configuration

We configure LoRA (Low-Rank Adaptation) with:
- **Rank (r)**: 16 - balance between capacity and efficiency
- **Alpha**: 16 - scaling factor for LoRA weights
- **Target Modules**: All linear layers for comprehensive adaptation
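
To get a feel for what rank 16 on all linear layers means, the adapter size can be worked out by hand. The sketch below assumes the published Llama 3 8B shapes (hidden size 4096, MLP size 14336, 32 layers, 1024-wide key/value projections); the helper name `lora_params` is illustrative only.

```python
# Llama 3 8B layer shapes (assumed from the published model configuration).
HIDDEN = 4096          # hidden size
KV_DIM = 1024          # 8 key/value heads x 128 head dim (grouped-query attention)
INTERMEDIATE = 14336   # MLP inner size
NUM_LAYERS = 32
RANK = 16
ALPHA = 16

# (in_features, out_features) for each targeted projection in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in one adapter: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
scaling = ALPHA / RANK  # standard LoRA scaling (no rank stabilization)

print(f"Adapter parameters per layer: {per_layer:,}")
print(f"Adapter parameters in total:  {total:,}")
print(f"LoRA scaling factor (alpha / r): {scaling}")
```

With alpha equal to the rank, the scaling factor is 1.0, so the adapter update is added to the frozen weights without extra amplification.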

In [None]:
from unsloth import FastLanguageModel

# Model configuration
MODEL_NAME = "unsloth/llama-3-8b-Instruct-bnb-4bit"
MAX_SEQ_LENGTH = 2048  # Maximum sequence length for training
DTYPE = None  # Auto-detect (float16 for T4, bfloat16 for A100)
LOAD_IN_4BIT = True  # Use 4-bit quantization

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=DTYPE,
    load_in_4bit=LOAD_IN_4BIT,
)

print(f"Model loaded: {MODEL_NAME}")
print(f"Max sequence length: {MAX_SEQ_LENGTH}")

In [None]:
# Configure LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank - higher = more capacity, more memory
    target_modules=[
        "q_proj",   # Query projection
        "k_proj",   # Key projection
        "v_proj",   # Value projection
        "o_proj",   # Output projection
        "gate_proj",  # MLP gate
        "up_proj",    # MLP up
        "down_proj",  # MLP down
    ],
    lora_alpha=16,  # LoRA scaling factor
    lora_dropout=0,  # No dropout for efficiency (Unsloth optimized)
    bias="none",  # No bias terms
    use_gradient_checkpointing="unsloth",  # Memory optimization
    random_state=42,
    use_rslora=False,  # Rank-Stabilized LoRA (optional)
    loftq_config=None,  # LoftQ initialization (optional)
)

# Print trainable parameters
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
total_params = sum(p.numel() for p in model.parameters())
print(f"\nTrainable parameters: {trainable_params:,} ({100 * trainable_params / total_params:.2f}%)")
print(f"Total parameters: {total_params:,}")

---

## 📊 Step 3: Load and Explore the Dataset

We use the **Glaive Function Calling v2** dataset, which contains:

- **113K+ examples** of function calling conversations
- **System prompts** with function definitions
- **User queries** with natural language requests
- **Assistant responses** with proper function calls and results

This dataset is ideal for training models to:
1. Understand when to call functions
2. Generate properly formatted function calls
3. Process function results and respond naturally

In [None]:
from datasets import load_dataset

# Load the Glaive Function Calling dataset
DATASET_NAME = "glaiveai/glaive-function-calling-v2"

dataset = load_dataset(DATASET_NAME, split="train")

print(f"Dataset: {DATASET_NAME}")
print(f"Number of examples: {len(dataset):,}")
print(f"\nColumns: {dataset.column_names}")

In [None]:
# Explore a sample from the dataset
sample = dataset[0]

print("=" * 60)
print("SAMPLE DATA STRUCTURE")
print("=" * 60)

for key, value in sample.items():
    print(f"\n📌 {key.upper()}:")
    print("-" * 40)
    # Truncate long values for display
    display_value = str(value)[:500] + "..." if len(str(value)) > 500 else str(value)
    print(display_value)

---

## 🔄 Step 4: Data Formatting for the Llama 3 Chat Template

Llama 3 uses its own header-based chat template. It is often mislabeled as **ChatML**, but ChatML uses `<|im_start|>`/`<|im_end|>` markers instead. We need to convert the dataset into the Llama 3 format:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{assistant_message}<|eot_id|>
```

### Special Tokens

| Token | Purpose |
|-------|--------|
| `<\|begin_of_text\|>` | Start of the conversation |
| `<\|start_header_id\|>` | Start of a role header |
| `<\|end_header_id\|>` | End of a role header |
| `<\|eot_id\|>` | End of turn marker |
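
A quick way to sanity-check formatted text against this layout is to split it back into turns. The following is a minimal sketch, not part of the original notebook: the names `split_turns` and `TURN` are hypothetical, and it assumes role names consist of lowercase letters and underscores.

```python
import re

BOS = "<|begin_of_text|>"

# One turn: role header, blank line, content, end-of-turn marker.
TURN = re.compile(
    r"<\|start_header_id\|>(?P<role>[a-z_]+)<\|end_header_id\|>\n\n"
    r"(?P<content>.*?)<\|eot_id\|>",
    re.DOTALL,
)


def split_turns(text: str):
    """Return [(role, content), ...] or raise ValueError if the layout is malformed."""
    if not text.startswith(BOS):
        raise ValueError("missing <|begin_of_text|>")
    body = text[len(BOS):]
    turns, pos = [], 0
    for match in TURN.finditer(body):
        if match.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        turns.append((match.group("role"), match.group("content")))
        pos = match.end()
    if pos != len(body):
        raise ValueError("trailing text after the last <|eot_id|>")
    return turns


example = (
    "<|begin_of_text|>"
    "<|start_header_id|>system<|end_header_id|>\n\nYou are helpful.<|eot_id|>"
    "<|start_header_id|>user<|end_header_id|>\n\nHi<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\nHello!<|eot_id|>"
)
print(split_turns(example))
```

Running a check like this over a handful of formatted examples can catch missing end-of-turn markers before any GPU time is spent.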

In [None]:
import re
from typing import Dict, List, Any


def parse_chat_messages(chat_string: str) -> List[Dict[str, str]]:
    """
    Parse the raw chat string from the dataset into structured messages.
    
    The dataset format uses markers like:
    - SYSTEM: ... 
    - USER: ...
    - ASSISTANT: ...
    - FUNCTION RESPONSE: ...
    
    Args:
        chat_string: Raw chat string from the dataset
        
    Returns:
        List of message dictionaries with 'role' and 'content' keys
    """
    messages = []
    
    # Pattern to match role markers
    pattern = r'(SYSTEM|USER|ASSISTANT|FUNCTION RESPONSE):\s*'
    
    # Split by role markers while keeping the markers
    parts = re.split(pattern, chat_string)
    
    # Process parts pairwise (role, content)
    i = 1  # Skip the first empty part
    while i < len(parts) - 1:
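        role = parts[i].strip().lower()
        # Drop any <|endoftext|> marker left in the raw text; turn ends are re-added as <|eot_id|>
        content = parts[i + 1].replace("<|endoftext|>", "").strip()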
        
        # Map roles to standard format
        role_map = {
            'system': 'system',
            'user': 'user',
            'assistant': 'assistant',
            'function response': 'function_response'
        }
        
        mapped_role = role_map.get(role, role)
        
        if content:  # Only add non-empty messages
            messages.append({
                'role': mapped_role,
                'content': content
            })
        
        i += 2
    
    return messages


def format_to_llama3_chatml(messages: List[Dict[str, str]]) -> str:
    """
    Convert structured messages to the Llama 3 chat format.
    
    Note: despite the function name, Llama 3 does not use ChatML; it wraps
    each turn in <|start_header_id|>role<|end_header_id|> ... <|eot_id|>.
    
    Args:
        messages: List of message dicts with 'role' and 'content'
        
    Returns:
        Formatted string in the Llama 3 chat format
    """
    formatted_parts = ["<|begin_of_text|>"]
    
    for msg in messages:
        # Function responses get their own turn under a custom "function" header
        role = 'function' if msg['role'] == 'function_response' else msg['role']
        content = msg['content']
        
        formatted_parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"
        )
    
    return "".join(formatted_parts)


def clean_system_prompt(system_prompt: str) -> str:
    """
    Clean the system prompt by removing redundant role prefixes.
    
    The dataset's 'system' field often starts with 'SYSTEM: ' which is
    redundant since we're already placing it in the system role.
    
    Args:
        system_prompt: Raw system prompt from dataset
        
    Returns:
        Cleaned system prompt without role prefix
    """
    # Remove "SYSTEM: " prefix if present (case-insensitive)
    cleaned = re.sub(r'^SYSTEM:\s*', '', system_prompt, flags=re.IGNORECASE)
    return cleaned.strip()


def format_dataset_example(example: Dict[str, Any]) -> Dict[str, str]:
    """
    Format a single dataset example into the Llama 3 chat format.
    
    This is the main formatting function used for dataset mapping.
    
    Args:
        example: Raw dataset example with 'system', 'chat' columns
        
    Returns:
        Dictionary with 'text' key containing formatted conversation
    """
    # Get system prompt and chat content
    system_prompt = example.get('system', '')
    chat_content = example.get('chat', '')
    
    # Clean the system prompt (remove "SYSTEM: " prefix)
    system_prompt = clean_system_prompt(system_prompt)
    
    # Parse the chat into structured messages
    messages = parse_chat_messages(chat_content)
    
    # Add system message at the beginning if present
    if system_prompt:
        messages.insert(0, {'role': 'system', 'content': system_prompt})
    
    # Render with the Llama 3 chat template
    formatted_text = format_to_llama3_chatml(messages)
    
    return {'text': formatted_text}


# Test the formatting function
print("Testing format function on sample...")
print("=" * 60)
test_result = format_dataset_example(dataset[0])
print(test_result['text'][:1000])
print("...")

In [None]:
# Apply formatting to the entire dataset
print("Formatting dataset...")

formatted_dataset = dataset.map(
    format_dataset_example,
    remove_columns=dataset.column_names,  # Remove original columns
    desc="Formatting to Llama 3 chat template",
)

print(f"\n✅ Formatted {len(formatted_dataset):,} examples")
print(f"Columns: {formatted_dataset.column_names}")

# Show sample
print("\n" + "=" * 60)
print("FORMATTED SAMPLE:")
print("=" * 60)
print(formatted_dataset[0]['text'][:800])

---

## 🏋️ Step 5: Configure and Run Training

We use the **SFTTrainer** (Supervised Fine-Tuning Trainer) from the TRL library with the following hyperparameters:

| Parameter | Value | Description |
|-----------|-------|-------------|
| Learning Rate | 2e-4 | Standard for QLoRA fine-tuning |
| Batch Size | 2 | Per-device batch size (T4 compatible) |
| Gradient Accumulation | 4 | Effective batch size = 8 |
| Max Steps | 60 | Number of training steps |
| Warmup Steps | 5 | Learning rate warmup |
| Optimizer | AdamW 8-bit | Memory-efficient optimizer |

> 💡 **Tip**: For production training, increase `max_steps` to 500-1000 or use `num_train_epochs`.
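
The numbers in the table imply how little of the dataset a 60-step run actually sees. The arithmetic below is a sketch that takes the dataset size as roughly 113,000 examples (from the "113K+" figure in Step 3). The `lr_at` helper is hypothetical and only mirrors the general shape of a linear warmup followed by linear decay; it is not the trainer's own code.

```python
import math

BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 4
MAX_STEPS = 60
WARMUP_STEPS = 5
LEARNING_RATE = 2e-4
DATASET_SIZE = 113_000  # approximate, assumed from the "113K+" figure

effective_batch = BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
samples_seen = effective_batch * MAX_STEPS
fraction_seen = samples_seen / DATASET_SIZE
steps_per_epoch = math.ceil(DATASET_SIZE / effective_batch)


def lr_at(step: int) -> float:
    """Learning rate under linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, MAX_STEPS - step)
    return LEARNING_RATE * remaining / (MAX_STEPS - WARMUP_STEPS)


print(f"Effective batch size: {effective_batch}")
print(f"Samples seen in {MAX_STEPS} steps: {samples_seen}")
print(f"Share of dataset seen: {fraction_seen:.2%}")
print(f"Steps for one full epoch: {steps_per_epoch:,}")
print(f"Peak learning rate (step {WARMUP_STEPS}): {lr_at(WARMUP_STEPS)}")
```

Under these assumptions, 60 steps cover well under one percent of the data, which is why the run is best treated as a smoke test rather than a full fine-tune.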

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Training configuration
OUTPUT_DIR = "./llama3-function-calling-lora"
LEARNING_RATE = 2e-4
BATCH_SIZE = 2
GRADIENT_ACCUMULATION_STEPS = 4
MAX_STEPS = 60
WARMUP_STEPS = 5
LOGGING_STEPS = 10
SAVE_STEPS = 30

# Configure training arguments
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    warmup_steps=WARMUP_STEPS,
    max_steps=MAX_STEPS,
    learning_rate=LEARNING_RATE,
    fp16=not is_bfloat16_supported(),  # Use FP16 on T4
    bf16=is_bfloat16_supported(),  # Use BF16 on A100/H100
    logging_steps=LOGGING_STEPS,
    save_steps=SAVE_STEPS,
    optim="adamw_8bit",  # Memory-efficient optimizer
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=42,
    report_to="none",  # Disable W&B/MLflow logging
)

print("Training configuration:")
print(f"  - Learning rate: {LEARNING_RATE}")
print(f"  - Batch size: {BATCH_SIZE}")
print(f"  - Gradient accumulation: {GRADIENT_ACCUMULATION_STEPS}")
print(f"  - Effective batch size: {BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS}")
print(f"  - Max steps: {MAX_STEPS}")
print(f"  - Using BF16: {is_bfloat16_supported()}")

In [None]:
# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    dataset_num_proc=2,  # Parallel data processing
    packing=False,  # Disable packing for function calling data
    args=training_args,
)

print("\n✅ SFTTrainer initialized")
print(f"Training examples: {len(formatted_dataset):,}")

In [None]:
# Show GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
reserved_memory = torch.cuda.memory_reserved() / 1e9
max_memory = gpu_stats.total_memory / 1e9

print(f"GPU Memory before training:")
print(f"  - Reserved: {reserved_memory:.2f} GB")
print(f"  - Total: {max_memory:.2f} GB")
print(f"  - Available: {max_memory - reserved_memory:.2f} GB")

In [None]:
# 🚀 Start training!
print("Starting training...")
print("=" * 60)

trainer_stats = trainer.train()

print("\n" + "=" * 60)
print("✅ Training complete!")
print("=" * 60)

In [None]:
# Display training statistics
print("\nüìä Training Statistics:")
print(f"  - Total steps: {trainer_stats.global_step}")
print(f"  - Training loss: {trainer_stats.training_loss:.4f}")
print(f"  - Training time: {trainer_stats.metrics['train_runtime']:.2f} seconds")
print(f"  - Samples/second: {trainer_stats.metrics['train_samples_per_second']:.2f}")

# Show final GPU memory usage
used_memory = torch.cuda.max_memory_reserved() / 1e9
print(f"\nüíæ Peak GPU Memory: {used_memory:.2f} GB")

---

## 🧪 Step 6: Test the Fine-Tuned Model

Let's verify the model works correctly by running inference on a test prompt.

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

# Test prompt
test_prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant that can perform function calls.
When asked to perform actions, respond with a JSON object containing:
- "action": the action to perform
- "parameters": an object with relevant parameters
- "reasoning": brief explanation of your approach

Available functions:
- get_weather(location: str, units: str = "metric")
- search_web(query: str, num_results: int = 5)
- send_email(to: str, subject: str, body: str)<|eot_id|><|start_header_id|>user<|end_header_id|>

What's the weather like in Tokyo right now?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

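# Tokenize input. The prompt already starts with <|begin_of_text|>, so skip
# automatic special tokens to avoid a duplicated beginning-of-sequence token.
inputs = tokenizer(test_prompt, return_tensors="pt", add_special_tokens=False).to("cuda")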

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

# Decode and display
response = tokenizer.decode(outputs[0], skip_special_tokens=False)

print("=" * 60)
print("MODEL RESPONSE:")
print("=" * 60)
# Extract just the assistant's response
assistant_response = response.split("<|start_header_id|>assistant<|end_header_id|>")[-1]
print(assistant_response.split("<|eot_id|>")[0].strip())
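
Since the system prompt asks for a JSON object with `action` and `parameters`, one way to check a response programmatically is to pull the first matching JSON object out of the text. This is a minimal sketch: `extract_call` and `REQUIRED_KEYS` are hypothetical names, and the `sample` string is made-up text standing in for real model output.

```python
import json

REQUIRED_KEYS = {"action", "parameters"}


def extract_call(response_text: str):
    """Return the first JSON object containing the required keys, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(response_text):
        if char != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at the given index
            obj, _ = decoder.raw_decode(response_text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj):
            return obj
    return None


# Hypothetical response text, for illustration only.
sample = (
    'Sure. {"action": "get_weather", '
    '"parameters": {"location": "Tokyo"}, '
    '"reasoning": "The user asked about current weather."}'
)
call = extract_call(sample)
print(call["action"], call["parameters"])
```

A response that contains no parseable object with both keys yields `None`, which gives a simple pass/fail signal when spot-checking several prompts.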

---

## 💾 Step 7: Save the Model

We'll save the model in multiple formats:

1. **LoRA Adapters**: Lightweight adapter weights only
2. **Merged Model**: Full model with adapters merged
3. **GGUF Format**: Quantized format for Ollama/llama.cpp

In [None]:
# Save LoRA adapters (lightweight: tens to a few hundred MB, depending on rank and dtype)
LORA_OUTPUT_DIR = "./llama3-function-calling-lora"

model.save_pretrained(LORA_OUTPUT_DIR)
tokenizer.save_pretrained(LORA_OUTPUT_DIR)

print(f"✅ LoRA adapters saved to: {LORA_OUTPUT_DIR}")

In [None]:
# Save merged model (full 16-bit model, ~16GB)
# Uncomment if you have enough disk space

# MERGED_OUTPUT_DIR = "./llama3-function-calling-merged"
# model.save_pretrained_merged(
#     MERGED_OUTPUT_DIR,
#     tokenizer,
#     save_method="merged_16bit",
# )
# print(f"✅ Merged model saved to: {MERGED_OUTPUT_DIR}")

---

## 📦 Step 8: Export to GGUF Format

GGUF (GPT-Generated Unified Format) is the standard format for:
- **Ollama**: Local LLM deployment
- **llama.cpp**: CPU/GPU inference
- **LM Studio**: Desktop LLM application

### Quantization Options

| Method | Approx. Size (8B model) | Quality | Use Case |
|--------|-------------------------|---------|----------|
| `q8_0` | ~8.5GB | Highest of the three | Production, when memory allows |
| `q4_k_m` | ~4.9GB | Good | Balanced quality/size |
| `q4_0` | ~4.7GB | Acceptable | Memory-constrained |

> 💡 **Recommendation**: Use `q4_k_m` for the best balance of quality and size.
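
The sizes for the two simple block formats follow from their layout, assuming each block stores 32 weights plus one 16-bit scale. The sketch below works through that arithmetic with an assumed parameter count of about 8.03 billion; `approx_size_gb` is an illustrative helper. The mixed-precision `q4_k_m` scheme is left out because its bits-per-weight depends on which tensors get which format.

```python
N_PARAMS = 8.03e9  # assumption: approximate Llama 3 8B parameter count

# Bits per weight implied by a block of 32 weights plus one 16-bit scale.
BITS_PER_WEIGHT = {
    "q8_0": (32 * 8 + 16) / 32,  # 8.5 bits per weight
    "q4_0": (32 * 4 + 16) / 32,  # 4.5 bits per weight
}


def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Estimated file size in gigabytes (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bpw in BITS_PER_WEIGHT.items():
    print(f"{name}: {bpw} bits/weight, ~{approx_size_gb(N_PARAMS, bpw):.1f} GB")
```

Treat these as lower-bound estimates: exported files tend to come out somewhat larger because some tensors are kept at higher precision and the file also carries metadata.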

In [None]:
# Export to GGUF format (for Ollama)
GGUF_OUTPUT_DIR = "./llama3-function-calling-gguf"
QUANTIZATION_METHOD = "q4_k_m"  # Options: q8_0, q4_k_m, q5_k_m, q4_0, f16

print(f"Exporting to GGUF format with {QUANTIZATION_METHOD} quantization...")
print("This may take a few minutes...")

model.save_pretrained_gguf(
    GGUF_OUTPUT_DIR,
    tokenizer,
    quantization_method=QUANTIZATION_METHOD,
)

print(f"\n✅ GGUF model saved to: {GGUF_OUTPUT_DIR}")
print(f"Quantization: {QUANTIZATION_METHOD}")

In [None]:
# List the exported files
import os

print("\nüìÅ Exported files:")
for root, dirs, files in os.walk(GGUF_OUTPUT_DIR):
    for file in files:
        filepath = os.path.join(root, file)
        size_mb = os.path.getsize(filepath) / 1e6
        print(f"  - {file}: {size_mb:.2f} MB")

---

## 🚀 Step 9: Push to Hugging Face Hub

Share your fine-tuned model with the community by uploading it to the Hugging Face Hub.

### Prerequisites

1. Create a [Hugging Face account](https://huggingface.co/join)
2. Create a new model repository
3. Generate an access token with write permissions
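
To keep the access token out of the notebook itself, it can be read from an environment variable instead of being pasted into a cell. A minimal sketch, assuming the token is stored under `HF_TOKEN`; the helper name `resolve_hf_token` is hypothetical.

```python
import os


def resolve_hf_token(env_var: str = "HF_TOKEN"):
    """Return the token stored in the environment, or None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


token = resolve_hf_token()
if token is None:
    print("No token in the environment; fall back to the interactive login().")
else:
    print("Token found; it can be passed as login(token=token).")
```

This avoids committing a secret when the notebook is shared or pushed to a repository.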

In [None]:
# Login to Hugging Face (run this cell and enter your token)
from huggingface_hub import login

# Option 1: Interactive login (will prompt for token)
login()

# Option 2: Use token directly (uncomment and replace with your token)
# login(token="hf_your_token_here")

In [None]:
# Push LoRA adapters to Hub
HF_USERNAME = "your-username"  # Replace with your Hugging Face username
LORA_REPO_NAME = "llama3-8b-function-calling-lora"  # Separate name so MODEL_NAME from Step 2 is not overwritten

# Push the LoRA adapters, then the tokenizer files, to the same repository
model.push_to_hub(
    f"{HF_USERNAME}/{LORA_REPO_NAME}",
    private=False,  # Set to True for private models
)
tokenizer.push_to_hub(
    f"{HF_USERNAME}/{LORA_REPO_NAME}",
    private=False,
)

print(f"\n✅ Model pushed to: https://huggingface.co/{HF_USERNAME}/{LORA_REPO_NAME}")

In [None]:
# Push GGUF to Hub (optional)
GGUF_REPO_NAME = "llama3-8b-function-calling-gguf"

model.push_to_hub_gguf(
    f"{HF_USERNAME}/{GGUF_REPO_NAME}",
    tokenizer=tokenizer,
    quantization_method=QUANTIZATION_METHOD,
    private=False,
)

print(f"\n✅ GGUF model pushed to: https://huggingface.co/{HF_USERNAME}/{GGUF_REPO_NAME}")

---

## 🔧 Step 10: Deploy with Ollama

Once you have the GGUF file, you can deploy it with Ollama:

### Create Modelfile

```dockerfile
# Modelfile
FROM ./llama3-function-calling-gguf/unsloth.Q4_K_M.gguf

PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER stop "<|eot_id|>"

TEMPLATE """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|><|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""

SYSTEM """You are a helpful AI assistant that can perform function calls.
When asked to perform actions, respond with a JSON object containing:
- "action": the action to perform
- "parameters": an object with relevant parameters
- "reasoning": brief explanation of your approach"""
```

### Create and Run

```bash
# Create the model in Ollama
ollama create llama3-function-calling -f Modelfile

# Run the model
ollama run llama3-function-calling

# Test with API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3-function-calling",
  "prompt": "Get the weather in Tokyo"
}'
```
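
The same API call can be made from Python using only the standard library. This is a sketch under the assumption that Ollama is listening on its default local port and that the model was created under the name used above; `build_generate_request` and `generate` are hypothetical helper names. Setting `"stream": False` asks for a single JSON reply instead of a stream of partial chunks.

```python
import json
import urllib.request


def build_generate_request(prompt: str,
                           model: str = "llama3-function-calling",
                           host: str = "http://localhost:11434"):
    """Build a POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, **kwargs) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]


# Example call, left commented out because it requires a running Ollama server:
# print(generate("Get the weather in Tokyo"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or test without a server.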

---

## 📋 Summary

In this notebook, we:

1. ✅ Installed Unsloth and dependencies for efficient QLoRA training
2. ✅ Loaded Llama 3 8B Instruct with 4-bit quantization
3. ✅ Configured LoRA adapters for parameter-efficient fine-tuning
4. ✅ Loaded and formatted the Glaive function calling dataset
5. ✅ Implemented Llama 3 chat template formatting
6. ✅ Trained the model with SFTTrainer
7. ✅ Exported to GGUF format for Ollama deployment
8. ✅ Pushed to Hugging Face Hub

### Next Steps

- 📈 Increase training steps for better performance
- 🔬 Experiment with different LoRA ranks and alpha values
- 📊 Add evaluation metrics and validation
- 🎯 Fine-tune on domain-specific function calling data

### Resources

- [Unsloth Documentation](https://github.com/unslothai/unsloth)
- [Llama 3 Model Card](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)
- [TRL Library](https://github.com/huggingface/trl)
- [Ollama Documentation](https://ollama.ai/)

In [None]:
# Clean up GPU memory
import gc

del model
del tokenizer
del trainer
gc.collect()
torch.cuda.empty_cache()

print("✅ Cleanup complete!")