# Fine-tuning with Trae AI

This notebook demonstrates how to fine-tune a causal language model with Hugging Face Transformers, using a Trae AI `Trainer` subclass that records extra training metrics.

## Learning Objectives
- Prepare datasets for fine-tuning
- Configure Trae AI training parameters
- Monitor training progress
- Evaluate fine-tuned models


In [None]:
# Import required libraries
import os
import sys
import json
import torch
import pandas as pd
import numpy as np
from pathlib import Path
from datasets import Dataset, load_dataset
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)

# Add project root to path
project_root = Path().absolute().parent
sys.path.append(str(project_root))

# Import our custom modules
from trae_llm.config import load_config
from trae_llm.dataset import prepare_dataset
from trae_llm.trainer import TraeTrainer

print("Environment setup complete!")


## 1. Dataset Preparation

First, let's prepare a dataset for fine-tuning.

In [None]:
# Create a sample dataset for demonstration
sample_data = [
    {
        "instruction": "Explain the concept of machine learning",
        "input": "",
        "output": "Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. It uses algorithms to analyze data, identify patterns, and make predictions or decisions."
    },
    {
        "instruction": "Write a Python function to calculate factorial",
        "input": "",
        "output": "def factorial(n):\n    if n == 0 or n == 1:\n        return 1\n    else:\n        return n * factorial(n - 1)"
    },
    {
        "instruction": "Summarize the benefits of renewable energy",
        "input": "",
        "output": "Renewable energy offers numerous benefits including reduced greenhouse gas emissions, energy independence, job creation, and long-term cost savings. It helps combat climate change while providing sustainable power sources."
    },
    {
        "instruction": "Translate the following to French",
        "input": "Hello, how are you?",
        "output": "Bonjour, comment allez-vous?"
    },
    {
        "instruction": "Explain quantum computing in simple terms",
        "input": "",
        "output": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to process information. Unlike classical computers that use bits (0 or 1), quantum computers use quantum bits (qubits) that can exist in multiple states simultaneously, potentially solving certain problems much faster."
    }
]

# Create more samples by duplicating the base set (demo only; real fine-tuning needs distinct examples)
extended_data = []
for i in range(20):  # 20 copies x 5 base samples = 100 rows
    for item in sample_data:
        extended_data.append(item.copy())

print(f"Created {len(extended_data)} training samples")

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(extended_data)
print(df.head())


In [None]:
# Format data for instruction tuning
def format_instruction(sample):
    """Format sample for instruction tuning"""
    if sample['input'].strip():
        return f"### Instruction:\n{sample['instruction']}\n\n### Input:\n{sample['input']}\n\n### Response:\n{sample['output']}"
    else:
        return f"### Instruction:\n{sample['instruction']}\n\n### Response:\n{sample['output']}"

# Apply formatting
df['text'] = df.apply(format_instruction, axis=1)

# Show formatted example
print("Formatted training example:")
print("=" * 50)
print(df['text'].iloc[0])
print("=" * 50)


In [None]:
# Save dataset
data_dir = Path('../data')
data_dir.mkdir(exist_ok=True)

# Split into train/validation
train_size = int(0.8 * len(df))
train_df = df[:train_size]
val_df = df[train_size:]

# Save as JSONL
train_file = data_dir / 'train.jsonl'
val_file = data_dir / 'val.jsonl'

train_df.to_json(train_file, orient='records', lines=True)
val_df.to_json(val_file, orient='records', lines=True)

print(f"✅ Saved {len(train_df)} training samples to {train_file}")
print(f"✅ Saved {len(val_df)} validation samples to {val_file}")
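
The split above takes the first 80% of rows in their original order. When the data has any ordering, a shuffled split is usually safer. The sketch below shows one way to do it with the standard library; the helper name `shuffled_split` and the seed value are illustrative assumptions, not part of this notebook's pipeline.

```python
import random

def shuffled_split(records, val_fraction=0.2, seed=42):
    """Return (train, val) lists after a seeded shuffle (hypothetical helper)."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG: reproducible, no global state
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

rows = [{"id": i} for i in range(100)]
train_rows, val_rows = shuffled_split(rows)
print(len(train_rows), len(val_rows))
```

Because this demo duplicates the same five samples, validation rows also appear in the training set whichever way the split is done, so the validation loss here measures memorization rather than generalization.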


## 2. Model and Tokenizer Setup

Load the base model and tokenizer for fine-tuning.

In [None]:
# Model configuration
MODEL_NAME = "microsoft/DialoGPT-small"  # Using a smaller model for demo
MAX_LENGTH = 512

# Load tokenizer and model
print(f"Loading model: {MODEL_NAME}")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Add padding token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

print(f"✅ Model loaded: {model.num_parameters():,} parameters")
print(f"✅ Tokenizer vocabulary size: {len(tokenizer)}")


In [None]:
# Tokenization function
def tokenize_function(examples):
    """Tokenize the text data"""
    # Tokenize the text
    tokenized = tokenizer(
        examples['text'],
        truncation=True,
        padding=False,
        max_length=MAX_LENGTH,
        return_tensors=None
    )
    
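    # Do not add a 'labels' column here: DataCollatorForLanguageModeling(mlm=False)
    # builds labels from input_ids after padding each batch. Unpadded labels of
    # different lengths would make the collator fail when it stacks the batch.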
    
    return tokenized

# Load and tokenize datasets
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)

# Apply tokenization
print("Tokenizing datasets...")
train_dataset = train_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=train_dataset.column_names
)

val_dataset = val_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=val_dataset.column_names
)

print(f"✅ Training dataset: {len(train_dataset)} samples")
print(f"✅ Validation dataset: {len(val_dataset)} samples")
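
Because the tokenizer call uses `padding=False`, sequences are padded later, one batch at a time. The sketch below illustrates in plain Python what that padding step produces for a causal LM: inputs padded to a common length, an attention mask, and labels where padded positions are set to -100 so the loss ignores them. The function name `pad_causal_batch` is hypothetical; it mimics the behaviour of a language-modeling collator rather than reproducing the library's code.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def pad_causal_batch(sequences, pad_id, multiple_of=None):
    """Right-pad token-id lists and build attention masks and labels."""
    target = max(len(seq) for seq in sequences)
    if multiple_of:
        # Round the target length up to the next multiple (helps fp16 kernels)
        target = -(-target // multiple_of) * multiple_of
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = target - len(seq)
        batch["input_ids"].append(list(seq) + [pad_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        batch["labels"].append(list(seq) + [IGNORE_INDEX] * n_pad)
    return batch

batch = pad_causal_batch([[5, 6, 7], [8, 9]], pad_id=0)
print(batch["labels"])
```

One caveat when the pad token is reused from EOS, as in this notebook: a collator that masks labels by token id, rather than by position as this sketch does, will also ignore genuine EOS tokens in the loss.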


## 3. Training Configuration

Set up the training parameters for this small demo run.

In [None]:
# Training arguments
output_dir = Path('../checkpoints/fine_tuned_model')
output_dir.mkdir(parents=True, exist_ok=True)

training_args = TrainingArguments(
    output_dir=str(output_dir),
    
    # Training parameters
    num_train_epochs=2,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,
    
    # Optimization
    learning_rate=5e-5,
    weight_decay=0.01,
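    warmup_steps=2,  # the demo runs only ~20 optimizer steps (80 samples / (2 x 4) x 2 epochs)
    
    # Logging and evaluation (intervals kept below the ~20 total steps so they actually fire)
    logging_steps=2,
    eval_steps=10,
    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    save_steps=10,  # must be a multiple of eval_steps when load_best_model_at_end=True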
    save_total_limit=2,
    
    # Performance
    dataloader_num_workers=0,
    fp16=torch.cuda.is_available(),
    
    # Misc
    report_to="none",  # Disable wandb and other integrations for demo (None does not disable them)
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

print("Training configuration:")
print(f"  Output directory: {training_args.output_dir}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  FP16: {training_args.fp16}")
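
Batch size, gradient accumulation, and epochs together determine how many optimizer steps the run performs, which in turn bounds sensible values for `warmup_steps`, `eval_steps`, and `save_steps`. A rough sketch of that arithmetic follows, assuming a single device and the 80-sample training split created earlier. The helper `training_schedule` is illustrative, and exact rounding can differ slightly between library versions.

```python
import math

def training_schedule(num_samples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate the number of optimizer steps for a run (hypothetical helper)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }

schedule = training_schedule(num_samples=80, per_device_batch=2, grad_accum=4, epochs=2)
print(schedule)
```

With these settings the run totals about 20 optimizer steps, so any warmup, evaluation, or checkpoint interval larger than that never takes effect.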


In [None]:
# Data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # We're doing causal LM, not masked LM
    pad_to_multiple_of=8 if training_args.fp16 else None,
)

print("✅ Data collator configured")


## 4. Trae AI Enhanced Training

Subclass the Hugging Face `Trainer` to record Trae AI metrics (GPU memory usage and a rough throughput figure) alongside the standard training logs.

In [None]:
# Custom trainer with Trae AI optimizations
class TraeOptimizedTrainer(Trainer):
    """Enhanced trainer with Trae AI optimizations"""
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.trae_metrics = []
    
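    def log(self, logs, *args, **kwargs):
        """Enhanced logging with Trae AI metrics"""
        # Newer transformers releases pass extra arguments (e.g. start_time) to log()
        super().log(logs, *args, **kwargs)
        
        # Periodic training logs report the running loss under 'loss';
        # 'train_loss' appears only once, in the end-of-training summary
        if 'loss' in logs:
            trae_metrics = {
                'step': self.state.global_step,
                'train_loss': logs['loss'],
                'learning_rate': logs.get('learning_rate', 0),
                'memory_usage': self._get_memory_usage(),
                'throughput': self._calculate_throughput()
            }
            self.trae_metrics.append(trae_metrics)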
    
    def _get_memory_usage(self):
        """Get current memory usage"""
        if torch.cuda.is_available():
            return torch.cuda.memory_allocated() / 1024**3  # GB
        return 0
    
    def _calculate_throughput(self):
        """Rough throughput proxy, not a wall-clock measurement"""
        # Dataset size divided by optimizer steps completed so far.
        # Time the steps instead if you need samples per second.
        if hasattr(self.state, 'log_history') and len(self.state.log_history) > 1:
            return len(self.train_dataset) / (self.state.global_step + 1)
        return 0

# Initialize trainer
trainer = TraeOptimizedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)

print("✅ Trae AI optimized trainer initialized")
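
`_calculate_throughput` above is deliberately simplified. If wall-clock throughput is what you need, one option is a small timer-based meter like the sketch below. `ThroughputMeter` is a hypothetical helper, not part of `transformers` or Trae AI; in the trainer it could be updated from `log()` with the number of samples processed since the previous call.

```python
import time

class ThroughputMeter:
    """Track samples processed per second since the meter was created."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable so the meter can be tested
        self._start = clock()
        self._samples = 0

    def update(self, num_samples):
        """Record that num_samples more samples have been processed."""
        self._samples += num_samples

    def samples_per_second(self):
        elapsed = self._clock() - self._start
        return self._samples / elapsed if elapsed > 0 else 0.0

meter = ThroughputMeter()
for _ in range(3):
    time.sleep(0.01)   # stand-in for one optimizer step
    meter.update(8)    # 8 = per-device batch 2 x gradient accumulation 4
print(f"{meter.samples_per_second():.1f} samples/s")
```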


## 5. Training Execution

Start the fine-tuning process.

In [None]:
# Pre-training evaluation
print("Evaluating model before training...")
pre_train_metrics = trainer.evaluate()
print(f"Pre-training loss: {pre_train_metrics['eval_loss']:.4f}")

# Start training
print("\n🚀 Starting fine-tuning with Trae AI optimizations...")
print("=" * 60)

try:
    # Train the model
    train_result = trainer.train()
    
    print("\n✅ Training completed successfully!")
    print(f"Final training loss: {train_result.training_loss:.4f}")
    
except Exception as e:
    print(f"❌ Training failed: {e}")
    # For demo purposes, we'll continue with the pre-trained model


In [None]:
# Post-training evaluation
print("Evaluating model after training...")
post_train_metrics = trainer.evaluate()
print(f"Post-training loss: {post_train_metrics['eval_loss']:.4f}")

# Compare metrics
improvement = pre_train_metrics['eval_loss'] - post_train_metrics['eval_loss']
print(f"\nImprovement: {improvement:.4f} ({improvement/pre_train_metrics['eval_loss']*100:.1f}%)")
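
Evaluation loss for a causal LM is a mean cross-entropy in nats, so it is often easier to read as perplexity, `exp(loss)`. The sketch below converts a pair of losses and reports the relative drop. The helper `summarize_eval` is hypothetical, and the numbers passed to it are illustrative only, not results from this notebook.

```python
import math

def summarize_eval(pre_loss, post_loss):
    """Convert mean cross-entropy losses into perplexities and a relative change."""
    return {
        "pre_perplexity": math.exp(pre_loss),
        "post_perplexity": math.exp(post_loss),
        "relative_loss_drop": (pre_loss - post_loss) / pre_loss,
    }

# Illustrative numbers only, not results from this notebook
summary = summarize_eval(pre_loss=4.0, post_loss=2.0)
print({key: round(value, 3) for key, value in summary.items()})
```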


## 6. Model Testing

Test the fine-tuned model with sample prompts.

In [None]:
def test_model(model, tokenizer, prompt, max_length=100):
    """Test the fine-tuned model"""
    # Format prompt in instruction format
    formatted_prompt = f"### Instruction:\n{prompt}\n\n### Response:\n"
    
    # Tokenize and move the inputs to the model's device (GPU after training, if available)
    inputs = tokenizer(formatted_prompt, return_tensors='pt').to(model.device)
    
    # Generate
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,  # max_length counts newly generated tokens only
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode only the newly generated tokens, skipping the prompt
    prompt_length = inputs['input_ids'].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()
    
    return response

# Test prompts
test_prompts = [
    "Explain what deep learning is",
    "Write a Python function to reverse a string",
    "What are the benefits of cloud computing?",
    "Describe the process of photosynthesis"
]

print("Testing fine-tuned model:")
print("=" * 50)

for i, prompt in enumerate(test_prompts, 1):
    print(f"Test {i}: {prompt}")
    response = test_model(model, tokenizer, prompt)
    print(f"Response: {response}")
    print("-" * 30)
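
The generation call above samples with `temperature=0.7` and `top_p=0.9`. To make those two knobs concrete, here is a simplified pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy logit vector. The function names are hypothetical and the code illustrates the idea only; it is not the library's implementation.

```python
import math
import random

def top_p_candidates(logits, temperature=0.7, top_p=0.9):
    """Return [(token_index, renormalized_prob)] for the nucleus of the distribution."""
    scaled = [x / temperature for x in logits]   # temperature < 1 sharpens the distribution
    peak = max(scaled)                           # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    nucleus, cumulative = [], 0.0
    for prob, index in probs:
        nucleus.append((index, prob))
        cumulative += prob
        if cumulative >= top_p:                  # smallest set reaching top_p mass
            break
    mass = sum(p for _, p in nucleus)
    return [(i, p / mass) for i, p in nucleus]

def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered, renormalized distribution."""
    indices, probs = zip(*top_p_candidates(logits, **kwargs))
    return rng.choices(indices, weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -3.0]
print(top_p_candidates(logits))
print(sample_token(logits, random.Random(0)))
```

Lowering the temperature or `top_p` shrinks the candidate set and makes outputs more deterministic; raising them increases diversity at the cost of coherence.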


## 7. Model Saving and Export

Save the fine-tuned model for future use.

In [None]:
# Save the fine-tuned model
model_save_path = Path('../models/fine_tuned_model')
model_save_path.mkdir(parents=True, exist_ok=True)

# Save model and tokenizer
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"✅ Model saved to {model_save_path}")

# Save training metrics
metrics_file = model_save_path / 'training_metrics.json'
training_metrics = {
    'pre_train_loss': pre_train_metrics['eval_loss'],
    'post_train_loss': post_train_metrics['eval_loss'],
    'improvement': improvement,
    'training_args': training_args.to_dict(),
    'trae_metrics': trainer.trae_metrics[-5:] if trainer.trae_metrics else []
}

with open(metrics_file, 'w') as f:
    json.dump(training_metrics, f, indent=2)

print(f"✅ Training metrics saved to {metrics_file}")


## 8. Performance Analysis

Analyze the training performance and Trae AI optimizations.

In [None]:
import matplotlib.pyplot as plt

# Plot training metrics if available
if trainer.trae_metrics:
    metrics_df = pd.DataFrame(trainer.trae_metrics)
    
    fig, axes = plt.subplots(2, 2, figsize=(12, 8))
    
    # Training loss
    axes[0, 0].plot(metrics_df['step'], metrics_df['train_loss'])
    axes[0, 0].set_title('Training Loss')
    axes[0, 0].set_xlabel('Step')
    axes[0, 0].set_ylabel('Loss')
    
    # Learning rate
    axes[0, 1].plot(metrics_df['step'], metrics_df['learning_rate'])
    axes[0, 1].set_title('Learning Rate')
    axes[0, 1].set_xlabel('Step')
    axes[0, 1].set_ylabel('LR')
    
    # Memory usage
    axes[1, 0].plot(metrics_df['step'], metrics_df['memory_usage'])
    axes[1, 0].set_title('Memory Usage (GB)')
    axes[1, 0].set_xlabel('Step')
    axes[1, 0].set_ylabel('Memory (GB)')
    
    # Throughput
    axes[1, 1].plot(metrics_df['step'], metrics_df['throughput'])
    axes[1, 1].set_title('Training Throughput')
    axes[1, 1].set_xlabel('Step')
    axes[1, 1].set_ylabel('Samples/Step')
    
    plt.tight_layout()
    plt.savefig(model_save_path / 'training_metrics.png', dpi=150, bbox_inches='tight')
    plt.show()
    
    print("✅ Training metrics visualization saved")
else:
    print("⚠️  No training metrics available for visualization")


## Next Steps

Congratulations! You've successfully fine-tuned a model with Trae AI. Continue with:

- **04_rag_with_trae.ipynb**: Build RAG systems with your fine-tuned model
- **05_evaluation_and_visualization.ipynb**: Comprehensive model evaluation

## Key Takeaways

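1. **Data Preparation**: Use the same prompt template (`### Instruction` / `### Input` / `### Response`) for training and inference; consistent formatting is crucial for instruction tuning
2. **Trae AI Optimizations**: The `TraeOptimizedTrainer` subclass records loss, learning rate, memory usage, and a throughput figure at each logging step
3. **Training Configuration**: Batch size, gradient accumulation, and epochs set the number of optimizer steps, which bounds useful warmup and evaluation intervals
4. **Model Testing**: Always validate your fine-tuned model with diverse prompts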

## Exercise

Try these advanced exercises:

1. Experiment with different learning rates and batch sizes
2. Add more diverse training data
3. Implement custom evaluation metrics
4. Try different model architectures (Llama, Mistral, etc.)
5. Implement LoRA (Low-Rank Adaptation) for efficient fine-tuning
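
For exercise 5, it helps to see why LoRA is cheaper. LoRA freezes a weight matrix of shape `d_out x d_in` and trains two low-rank factors of shapes `d_out x r` and `r x d_in` instead. The arithmetic below is a sketch: the 768 x 2304 shape is an assumption, chosen to resemble a fused attention projection in a GPT-2-small-sized model, and `rank=8` is an arbitrary example value.

```python
def lora_param_counts(d_in, d_out, rank):
    """Compare trainable parameters for full fine-tuning vs. a LoRA adapter on one matrix."""
    full = d_in * d_out            # every entry of the weight matrix is trainable
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora

full, lora = lora_param_counts(d_in=768, d_out=2304, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
```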
