# Fine-tuning Mistral Models on Amazon SageMaker

Welcome! This notebook walks through fine-tuning a Mistral AI model on Amazon SageMaker end to end: preparing data, configuring LoRA training, launching a training job, deploying the result, and testing it.

## 📚 Table of Contents

1. [Introduction to Mistral Models](#intro)
2. [Why Fine-tune?](#why-finetune)
3. [Understanding LoRA](#lora)
4. [Dataset Preparation](#data-prep)
5. [Training Configuration](#training-config)
6. [Model Deployment](#deployment)
7. [Testing & Evaluation](#testing)

---

## 🎯 What You'll Learn

By the end of this notebook, you will understand:

- **Mistral Model Architecture**: What makes Mistral models powerful and efficient
- **Fine-tuning Fundamentals**: When and why to fine-tune vs. using base models
- **LoRA (Low-Rank Adaptation)**: An efficient fine-tuning technique that saves time and money
- **Training Parameters**: What each hyperparameter does and how to tune them
- **SageMaker Training Jobs**: How to leverage AWS infrastructure for ML training
- **Model Deployment**: Best practices for serving your fine-tuned model

---

## 📋 Prerequisites

- Basic understanding of machine learning concepts
- Familiarity with Python and Jupyter notebooks
- AWS account with SageMaker access
- Completed previous workshop notebooks (01-03) recommended

---

## ⏱️ Estimated Time

- **Setup**: 5 minutes
- **Training**: 15-30 minutes (depending on dataset size)
- **Deployment & Testing**: 10 minutes
- **Total**: ~45-60 minutes

---

# Part 1: Understanding Mistral Models

## 🤖 What is Mistral?

**Mistral AI** is a French AI company that has developed a family of high-performance, open-source large language models. Their models are known for:

### Key Characteristics:

1. **Efficiency**: Mistral models achieve state-of-the-art performance with fewer parameters than competitors
   - Mistral 7B outperforms Llama 2 13B on most benchmarks
   - Uses Grouped-Query Attention (GQA) for faster inference
   - Sliding Window Attention for handling longer contexts efficiently

2. **Open Source**: Released under Apache 2.0 license
   - Free for commercial use
   - Full model weights available
   - Active community support

3. **Versatility**: Excellent across multiple tasks
   - Text generation and completion
   - Question answering
   - Code generation
   - Instruction following
   - Multilingual capabilities

### Model Variants:

| Model | Parameters | Context Length | Best For |
|-------|-----------|----------------|----------|
| Mistral 7B | 7.3B | 8K tokens | General purpose, fast inference |
| Mistral 7B Instruct | 7.3B | 8K tokens (v0.1), 32K tokens (v0.2+) | Chat, instruction following |
| Mixtral 8x7B | 46.7B (12.9B active) | 32K tokens | Complex reasoning, multilingual |
| Mistral Small | 24B | 32K tokens | Balanced performance/cost |

In this notebook, we'll use **Mistral 7B Instruct v0.3**, which is optimized for instruction-following tasks.
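
To make the Grouped-Query Attention efficiency claim concrete, here is a rough back-of-the-envelope sketch of how much key/value cache each generated token needs. The layer and head counts are assumptions taken from the published Mistral 7B configuration (32 layers, 32 query heads, 8 key/value heads, head dimension 128), and the helper name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of key/value cache stored for each token (2 bytes per value = fp16/bf16)."""
    # Factor 2: one key vector plus one value vector per KV head, per layer
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

# Assumed Mistral 7B dimensions
N_LAYERS, N_QUERY_HEADS, N_KV_HEADS, HEAD_DIM = 32, 32, 8, 128

# Classic multi-head attention keeps separate keys/values for every query head
mha = kv_cache_bytes_per_token(N_LAYERS, N_QUERY_HEADS, HEAD_DIM)
# Grouped-query attention shares each key/value head across a group of query heads
gqa = kv_cache_bytes_per_token(N_LAYERS, N_KV_HEADS, HEAD_DIM)

print(f"Multi-head attention:    {mha // 1024} KiB per token")
print(f"Grouped-query attention: {gqa // 1024} KiB per token")
print(f"Reduction: {mha // gqa}x")
```

A smaller cache per token means more concurrent requests or longer contexts fit in the same GPU memory.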

---

# Part 2: Why Fine-tune?

## 🎓 Understanding Fine-tuning

**Fine-tuning** is the process of taking a pre-trained model and further training it on your specific dataset to adapt it to your use case.

### When to Fine-tune vs. Use Base Model:

#### ✅ Fine-tune When:

1. **Domain-Specific Language**: Your use case involves specialized terminology
   - Medical, legal, financial, technical documentation
   - Company-specific jargon or processes

2. **Consistent Style/Tone**: You need predictable output formatting
   - Customer service responses with specific tone
   - Technical documentation with consistent structure
   - Brand voice alignment

3. **Improved Accuracy**: Base model doesn't perform well on your task
   - Specialized classification tasks
   - Domain-specific question answering
   - Custom entity recognition

4. **Cost Optimization**: Smaller fine-tuned models can replace larger base models
   - Fine-tuned 7B model may match 70B model performance on specific tasks
   - Lower inference costs
   - Faster response times

#### ❌ Don't Fine-tune When:

1. **Limited Data**: You have fewer than 100-200 quality examples
2. **General Tasks**: Base model already performs well
3. **Rapidly Changing Requirements**: Your use case changes frequently
4. **Prompt Engineering Works**: You can achieve good results with clever prompts

### Benefits of Fine-tuning Mistral Models:

| Benefit | Description | Impact |
|---------|-------------|--------|
| **Performance** | Higher accuracy on domain-specific tasks | 20-40% improvement |
| **Consistency** | More predictable outputs | Reduced variance |
| **Efficiency** | Shorter prompts needed | 50-70% token savings |
| **Cost** | Smaller model can replace larger one | 5-10x cost reduction |
| **Latency** | Faster inference with optimized model | 2-3x speed improvement |
| **Privacy** | Keep sensitive data in training, not prompts | Enhanced security |

### Real-World Example:

**Scenario**: Customer support chatbot for a SaaS company

- **Before Fine-tuning**: 
  - Using GPT-4 with long prompts containing company policies
  - Cost: $0.03 per interaction
  - Response time: 3-5 seconds
  - Accuracy: 75% (sometimes gives generic answers)

- **After Fine-tuning Mistral 7B**:
  - Fine-tuned on 1,000 support conversations
  - Cost: $0.002 per interaction (15x cheaper)
  - Response time: 0.5-1 second (5x faster)
  - Accuracy: 92% (company-specific knowledge embedded)

---

## Setup and Installation

In [None]:
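# Install required packages
# This notebook uses the 2.x SageMaker SDK interface (sagemaker.huggingface.HuggingFace),
# so the major version is pinned here
%pip install -Uq "sagemaker>=2,<3" boto3 datasets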

In [None]:
import sagemaker
import boto3
import json
from sagemaker.huggingface import HuggingFace
from sagemaker import get_execution_role

# Initialize SageMaker session
sess = sagemaker.Session()
role = get_execution_role()
region = sess.boto_region_name
bucket = sess.default_bucket()

print(f"SageMaker Role: {role}")
print(f"Region: {region}")
print(f"S3 Bucket: {bucket}")

---

# Part 3: Understanding LoRA (Low-Rank Adaptation)

## 🔬 What is LoRA?

**LoRA (Low-Rank Adaptation)** is a parameter-efficient fine-tuning technique that dramatically reduces the computational and memory requirements of fine-tuning large language models.

### Traditional Fine-tuning vs. LoRA:

#### Traditional Full Fine-tuning:
```
Original Model: 7B parameters
Training: Updates ALL 7B parameters
Memory Required: ~28GB (for model weights)
Training Time: 10-20 hours
Storage: 14GB (full model copy)
```

#### LoRA Fine-tuning:
```
Original Model: 7B parameters (frozen)
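Training: Updates only ~3-14M parameters (roughly 0.05-0.2%, depending on rank and target modules)
Memory Required: ~16GB+ with bf16 weights (less with 4-bit quantization)
Training Time: 1-3 hours
Storage: tens of MB (just the LoRA adapters)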
```

### How LoRA Works:

Instead of modifying the original model weights, LoRA:

1. **Freezes** the original pre-trained weights
2. **Injects** trainable low-rank matrices into each layer
3. **Trains** only these small adapter matrices
4. **Combines** the frozen weights with adapters during inference

```python
# Mathematical representation:
# Original: h = W₀x
# LoRA:     h = W₀x + (α/r)·BAx
# Where:
#   W₀ = frozen pre-trained weights (large, d × k)
#   B, A = trainable low-rank matrices (small: B is d × r, A is r × k)
#   BA = the adapter update (rank r << model dimension), scaled by α/r
```
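
The formula can be sketched in a few lines of plain Python (lists instead of tensors, so it runs without dependencies). The toy sizes, the helper names, and the initialization (small random `A`, zero `B`, as described in the LoRA paper) are illustrative assumptions, not what the PEFT library does internally line for line.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x). W0 stays frozen; only A and B would be trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

rng = random.Random(0)
d, k, r, alpha = 6, 6, 2, 4  # toy sizes; real attention layers use d = k = 4096

W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen weights (d x k)
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable (r x k), small random init
B = [[0.0] * r for _ in range(d)]                               # trainable (d x r), zero init
x = [rng.gauss(0, 1) for _ in range(k)]

# With B = 0 the adapter contributes nothing, so training starts from the base model's output
h_lora = lora_forward(W0, A, B, x, alpha, r)
h_base = matvec(W0, x)
print("Matches base model at initialization:", h_lora == h_base)

print("Full matrix parameters:", d * k, "| adapter parameters:", r * (d + k))
```

At toy sizes the adapter is not much smaller than the full matrix; the savings appear when `r` is tiny relative to `d` and `k`.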

### Key LoRA Parameters:

#### 1. **Rank (r)**
- **What it is**: Dimensionality of the low-rank matrices
- **Typical values**: 4, 8, 16, 32, 64
- **Trade-off**:
  - Lower rank (4-8): Faster, less memory, fewer parameters, may underfit
  - Higher rank (32-64): Slower, more memory, more parameters, better capacity
- **Recommendation**: Start with 16, increase if underfitting

#### 2. **Alpha (α)**
- **What it is**: Scaling factor for LoRA updates
- **Typical values**: 16, 32, 64 (often 2x the rank)
- **Formula**: `scaling = alpha / rank`
- **Effect**: Controls how much the adapters influence the output
- **Recommendation**: Set to 2x your rank value

#### 3. **Target Modules**
- **What it is**: Which layers to apply LoRA to
- **Common choices**:
  - `["q_proj", "v_proj"]`: Attention query and value (minimal)
  - `["q_proj", "k_proj", "v_proj", "o_proj"]`: All attention (recommended)
  - `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`: Attention + FFN (maximum)
- **Trade-off**: More modules = better adaptation but slower training

#### 4. **Dropout**
- **What it is**: Regularization to prevent overfitting
- **Typical values**: 0.05, 0.1
- **Effect**: Randomly drops connections during training
- **Recommendation**: 0.05 for large datasets, 0.1 for small datasets

### LoRA Configuration Example:

```python
lora_config = LoraConfig(
    r=16,                    # Rank: balance between capacity and efficiency
    lora_alpha=32,          # Alpha: 2x rank for stable training
    target_modules=[        # Apply to all attention layers
        "q_proj",           # Query projection
        "k_proj",           # Key projection  
        "v_proj",           # Value projection
        "o_proj"            # Output projection
    ],
    lora_dropout=0.05,      # Light regularization
    bias="none",            # Don't train bias terms
    task_type="CAUSAL_LM"   # Causal language modeling
)
```
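
To see how rank and target modules translate into trainable parameters, here is a small counting sketch. The layer shapes are assumptions based on the published Mistral 7B configuration (hidden size 4096, 32 layers, key/value projections of width 1024 because of GQA), and `lora_param_count` is a hypothetical helper, not a PEFT function.

```python
def lora_param_count(rank, module_shapes, n_layers):
    """Trainable parameters LoRA adds: rank * (in_features + out_features) per adapted matrix."""
    per_layer = sum(rank * (in_f + out_f) for in_f, out_f in module_shapes.values())
    return per_layer * n_layers

# Assumed Mistral 7B shapes as (in_features, out_features)
ATTENTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
N_LAYERS = 32
BASE_PARAMS = 7.25e9  # approximate total parameter count (assumption)

# The configuration shown above: r=16 on all four attention projections
trainable = lora_param_count(16, ATTENTION_SHAPES, N_LAYERS)
print(f"r=16, all attention: {trainable:,} trainable ({100 * trainable / BASE_PARAMS:.2f}% of base)")

# A minimal configuration: r=8 on query and value projections only
minimal = {name: ATTENTION_SHAPES[name] for name in ("q_proj", "v_proj")}
print(f"r=8, q_proj + v_proj: {lora_param_count(8, minimal, N_LAYERS):,} trainable")
```

Under these assumptions the r=16 configuration trains about 13.6M parameters, while the minimal one trains about 3.4M, which is why quoted adapter sizes vary so much between setups.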

### Benefits of LoRA:

| Aspect | Benefit | Quantified |
|--------|---------|------------|
| **Memory** | Reduced GPU memory usage | 50-70% less |
| **Speed** | Faster training | 2-3x faster |
| **Storage** | Tiny adapter files | 16MB vs 14GB |
| **Flexibility** | Multiple adapters per model | Unlimited tasks |
| **Cost** | Lower compute requirements | 60-80% cheaper |
| **Quality** | Comparable to full fine-tuning | 95-98% performance |

### When to Use LoRA:

✅ **Perfect for**:
- Limited GPU memory (< 24GB)
- Multiple task-specific models
- Rapid experimentation
- Production deployments (easy to swap adapters)

❌ **Consider full fine-tuning if**:
- You need maximum possible performance
- You have unlimited compute budget
- The task requires fundamental model changes

---

# Part 4: Dataset Preparation

## 📊 Preparing Your Training Data

Quality data is crucial for successful fine-tuning. Let's understand the format and best practices.

### Data Format: JSONL (JSON Lines)

Each line is a complete JSON object representing one training example:

```json
{
  "messages": [
    {"role": "user", "content": "Question or instruction"},
    {"role": "assistant", "content": "Desired response"}
  ]
}
```

### Best Practices:

1. **Dataset Size**:
   - Minimum: 100-200 examples
   - Recommended: 500-1,000 examples
   - Optimal: 1,000-10,000 examples
   - More data = better generalization

2. **Quality over Quantity**:
   - One high-quality example > ten mediocre ones
   - Ensure responses are accurate and consistent
   - Remove contradictory examples

3. **Diversity**:
   - Cover different phrasings of similar questions
   - Include edge cases
   - Represent real-world distribution

4. **Length**:
   - Keep examples under 2048 tokens when possible
   - Longer examples = slower training
   - Balance between context and efficiency

### Example Use Cases:

**Customer Support**:
```json
{"messages": [{"role": "user", "content": "How do I reset my password?"}, 
              {"role": "assistant", "content": "To reset your password: 1. Go to login page 2. Click 'Forgot Password' 3. Enter your email..."}]}
```

**Code Generation**:
```json
{"messages": [{"role": "user", "content": "Write a Python function to calculate fibonacci"}, 
              {"role": "assistant", "content": "def fibonacci(n):\n    if n <= 1:\n        return n..."}]}
```

**Domain-Specific QA**:
```json
{"messages": [{"role": "user", "content": "What is the treatment for Type 2 Diabetes?"}, 
              {"role": "assistant", "content": "Treatment for Type 2 Diabetes typically includes: 1. Lifestyle modifications..."}]}
```
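
Because a single malformed line can break a training run, it can help to validate the file before uploading it. The following is a minimal sketch of such a check for the `messages` format shown above; the function name and the specific rules (allowed roles, final assistant turn) are assumptions for illustration.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}

def validate_example(line):
    """Return a list of problems found in one JSONL line (an empty list means valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # The model learns to produce the assistant turn, so each example should end with one
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should be an assistant reply")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
bad = '{"messages": [{"role": "user", "content": ""}]}'
print(validate_example(good))
print(validate_example(bad))
```

Running this over every line of `train.jsonl` and stopping on the first non-empty result is usually enough to catch formatting mistakes early.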

For this workshop, we'll use a simple customer support dataset. In production, you would use your own domain-specific data.

In [None]:
# Create a sample training dataset
training_data = [
    {
        "messages": [
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "To reset your password, go to the login page and click 'Forgot Password'. Enter your email address and follow the instructions sent to your inbox."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "What are your business hours?"},
            {"role": "assistant", "content": "Our customer support team is available Monday through Friday, 9 AM to 6 PM EST. For urgent issues outside these hours, please use our emergency hotline."}
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "How can I track my order?"},
            {"role": "assistant", "content": "You can track your order by logging into your account and visiting the 'Orders' section. Click on the order number to see detailed tracking information."}
        ]
    }
]

# Save to JSONL format
import os
os.makedirs('data', exist_ok=True)

with open('data/train.jsonl', 'w') as f:
    for item in training_data:
        f.write(json.dumps(item) + '\n')

print(f"Created training dataset with {len(training_data)} examples")

In [None]:
# Upload training data to S3
from sagemaker.s3 import S3Uploader

training_data_uri = S3Uploader.upload(
    local_path='data/train.jsonl',
    desired_s3_uri=f's3://{bucket}/mistral-finetuning/data'
)

print(f"Training data uploaded to: {training_data_uri}")

---

# Part 5: Training Configuration

## ⚙️ Understanding Training Parameters

Let's break down each training parameter and understand what it does.

### Model Loading Parameters:

#### **torch_dtype=torch.bfloat16**
- **What**: Data type for model weights
- **Options**: float32, float16, bfloat16
- **Why bfloat16**: 
  - 50% memory savings vs float32
  - Better numerical stability than float16
  - Supported by modern GPUs (A10G, A100, H100, L4)
- **Impact**: Enables training larger models on same hardware

#### **device_map="auto"**
- **What**: Automatically distributes model across available GPUs
- **Why**: Handles models larger than single GPU memory
- **How**: Intelligently splits layers across devices

### Training Hyperparameters:

#### **num_train_epochs**
- **What**: Number of complete passes through the dataset
- **Typical values**: 1-5 epochs
- **Guidelines**:
  - Large dataset (>5000 examples): 1-2 epochs
  - Medium dataset (500-5000): 3-5 epochs
  - Small dataset (<500): 5-10 epochs
- **Warning**: Too many epochs = overfitting

#### **per_device_train_batch_size**
- **What**: Number of examples processed together per GPU
- **Typical values**: 1, 2, 4, 8
- **Trade-offs**:
  - Larger batch: Faster training, more memory, less noise
  - Smaller batch: Slower training, less memory, more noise (can help generalization)
- **Memory impact**: Doubling batch size ≈ doubles memory usage

#### **gradient_accumulation_steps**
- **What**: Accumulate gradients over N steps before updating
- **Why**: Simulate larger batch sizes without more memory
- **Effective batch size** = `per_device_batch_size × gradient_accumulation_steps × num_gpus`
- **Example**: 
  - batch_size=1, accumulation=4 → effective batch=4
  - Same gradient update as batch_size=4, but with roughly 4x less activation memory, at the cost of 4 sequential forward/backward passes per update
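
The effective batch size formula, and the number of weight updates it implies, can be worked through with a short sketch. The function names are hypothetical, and the step count is approximate because trainers differ in how they round a partial final batch.

```python
import math

def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_gpus

def optimizer_steps(num_examples, num_epochs, per_device_batch_size,
                    gradient_accumulation_steps, num_gpus=1):
    """Approximate number of weight updates over the whole run."""
    batch = effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_gpus)
    return math.ceil(num_examples / batch) * num_epochs

# batch_size=1 with 4 accumulation steps on one GPU
print("Effective batch size:", effective_batch_size(1, 4))
# 1,000 examples for 3 epochs with that configuration
print("Optimizer steps:", optimizer_steps(1000, 3, 1, 4))
```

Knowing the total number of optimizer steps is useful when choosing `warmup_steps` and `logging_steps`.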

#### **learning_rate**
- **What**: Step size for weight updates
- **Typical values**: 1e-5 to 5e-4
- **Guidelines**:
  - Full fine-tuning: 1e-5 to 5e-5 (smaller)
  - LoRA: 1e-4 to 5e-4 (larger, because fewer parameters)
  - Large dataset: Lower learning rate
  - Small dataset: Higher learning rate
- **Too high**: Training unstable, loss explodes
- **Too low**: Training too slow, may not converge

#### **warmup_steps**
- **What**: Gradually increase learning rate from 0 to target
- **Why**: Prevents large updates early in training
- **Typical values**: 10-100 steps or 5-10% of total steps
- **Formula**: `warmup_steps = 0.1 × total_steps`
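
As a rough illustration of what warmup does to the learning rate, here is a minimal sketch of a linear ramp. It holds the rate constant after warmup for simplicity, whereas real schedulers typically decay it afterwards; the function name and the 750-step run are assumptions.

```python
def warmup_lr(step, target_lr, warmup_steps):
    """Learning rate at an optimizer step: linear ramp from 0, then constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    return target_lr * step / warmup_steps

total_steps = 750                 # hypothetical run length
warmup_steps = total_steps // 10  # 10% of total steps
target_lr = 2e-4

for step in (0, 15, 37, 75, 300):
    print(f"step {step:>3}: lr = {warmup_lr(step, target_lr, warmup_steps):.2e}")
```

On very small datasets the whole run may be shorter than the warmup, in which case the target learning rate is never reached and `warmup_steps` should be lowered.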

#### **fp16 / bf16**
- **What**: Mixed precision training
- **Benefits**:
  - 2x faster training
  - 50% less memory
  - Minimal accuracy loss
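- **Choose bf16 if available** (its wider dynamic range avoids the loss scaling that fp16 needs); enable only one of `fp16` or `bf16`, matching the dtype the model was loaded in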

#### **logging_steps**
- **What**: How often to log training metrics
- **Typical values**: 10, 50, 100
- **Impact**: More frequent = better monitoring, but more overhead

#### **save_strategy**
- **Options**: "no", "steps", "epoch"
- **"epoch"**: Save checkpoint after each epoch
- **"steps"**: Save every N steps
- **Recommendation**: "epoch" for most cases

### Training Configuration Summary:

```python
TrainingArguments(
    output_dir="/opt/ml/model",              # Where to save model
    num_train_epochs=3,                       # 3 complete passes through data
    per_device_train_batch_size=1,           # 1 example per GPU (memory constrained)
    gradient_accumulation_steps=4,           # Effective batch size = 4
    learning_rate=2e-4,                      # LoRA-appropriate learning rate
    bf16=True,                               # bf16 mixed precision (matches the bfloat16 weights)
    logging_steps=10,                        # Log every 10 steps
    save_strategy="epoch",                   # Save after each epoch
    warmup_steps=10,                         # Warm up for 10 steps
)
```

### Recommended Configurations by Use Case:

#### **Quick Experimentation** (Fast iteration):
```python
epochs=1, batch_size=4, learning_rate=3e-4, lora_r=8
# Time: ~10 minutes, Quality: 70-80%
```

#### **Balanced** (Good quality, reasonable time):
```python
epochs=3, batch_size=2, learning_rate=2e-4, lora_r=16
# Time: ~30 minutes, Quality: 85-90%
```

#### **Production** (Maximum quality):
```python
epochs=5, batch_size=1, learning_rate=1e-4, lora_r=32
# Time: ~60 minutes, Quality: 90-95%
```

---

## Configure Training Job

Now let's create our training script with all these concepts applied. We'll use the Hugging Face DLC (Deep Learning Container) with SageMaker.

In [None]:
# Training script
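training_script = '''#!/usr/bin/env python3
import argparse
import os

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)


def parse_args():
    # SageMaker passes the estimator's hyperparameters as command-line arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--train_batch_size", type=int, default=1)
    # SageMaker sets these environment variables inside the training container
    parser.add_argument("--train_dir", default=os.environ.get("SM_CHANNEL_TRAINING", "/opt/ml/input/data/training"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"))
    args, _ = parser.parse_known_args()
    return args


def train(args):
    # Load model and tokenizer
    model_id = "mistralai/Mistral-7B-Instruct-v0.3"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Configure LoRA
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # The base model is loaded in bf16 (not 4-bit or 8-bit), so no k-bit preparation is needed
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    # Load and prepare dataset
    dataset = load_dataset("json", data_files=os.path.join(args.train_dir, "train.jsonl"))

    def format_chat(example):
        text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
        return {"text": text}

    dataset = dataset.map(format_chat)

    def tokenize(example):
        # The chat template already inserts BOS/EOS, so do not add special tokens a second time
        return tokenizer(example["text"], truncation=True, max_length=512, add_special_tokens=False)

    tokenized_dataset = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

    # Training arguments
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        bf16=True,  # matches the bfloat16 weights; fp16 loss scaling does not work with bf16 parameters
        logging_steps=10,
        save_strategy="epoch",
        warmup_steps=10,
    )

    # Train
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset["train"],
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    trainer.train()

    # Merge the adapter into the base weights so the inference container can load a standalone model
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained(args.model_dir)
    tokenizer.save_pretrained(args.model_dir)


if __name__ == "__main__":
    train(parse_args())
'''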

# Save training script
os.makedirs('scripts', exist_ok=True)
with open('scripts/train.py', 'w') as f:
    f.write(training_script)

print("Training script created")

In [None]:
# Create requirements file
requirements = '''transformers==4.36.0
datasets==2.16.0
peft==0.7.1
accelerate==0.25.0
bitsandbytes==0.41.3
'''

with open('scripts/requirements.txt', 'w') as f:
    f.write(requirements)

print("Requirements file created")

---

# Part 6: Launch Training Job

## 🚀 SageMaker Training Jobs Explained

### What is a SageMaker Training Job?

A **SageMaker Training Job** is a managed service that:
1. Provisions compute resources (GPU instances)
2. Downloads your training data from S3
3. Runs your training script
4. Uploads the trained model back to S3
5. Cleans up resources automatically

### Benefits:

| Feature | Benefit |
|---------|--------|
| **Managed Infrastructure** | No server management |
| **Auto-scaling** | Scales to your needs |
| **Cost Optimization** | Pay only for training time |
| **Monitoring** | Built-in CloudWatch metrics |
| **Reproducibility** | Consistent training environment |
| **Security** | IAM-based access control |

### Instance Type Selection:

#### **ml.g5.2xlarge** (Recommended for this workshop)
- **GPU**: 1x NVIDIA A10G (24GB VRAM)
- **vCPUs**: 8
- **RAM**: 32GB
- **Cost**: ~$1.52/hour
- **Best for**: Mistral 7B with LoRA

#### Other Options:

| Instance | GPU | VRAM | Cost/hr | Best For |
|----------|-----|------|---------|----------|
| ml.g5.xlarge | 1x A10G | 24GB | $1.01 | Small models (<7B) |
| ml.g5.4xlarge | 1x A10G | 24GB | $2.03 | Faster training |
| ml.g5.12xlarge | 4x A10G | 96GB | $6.11 | Large models (13B-30B) |
| ml.p4d.24xlarge | 8x A100 | 320GB | $32.77 | Massive models (70B+) |

### Estimator Configuration:

```python
HuggingFace(
    entry_point='train.py',              # Your training script
    source_dir='scripts',                # Directory containing script
    instance_type='ml.g5.2xlarge',       # GPU instance
    instance_count=1,                    # Number of instances
    role=role,                           # IAM role for permissions
    transformers_version='4.36.0',       # Hugging Face version
    pytorch_version='2.1.0',             # PyTorch version
    py_version='py310',                  # Python version
    max_run=3600,                        # Max training time (1 hour)
)
```

### Training Time Estimates:

| Dataset Size | Epochs | Instance | Time | Cost |
|--------------|--------|----------|------|------|
| 100 examples | 3 | ml.g5.2xlarge | ~10 min | $0.25 |
| 500 examples | 3 | ml.g5.2xlarge | ~20 min | $0.50 |
| 1,000 examples | 3 | ml.g5.2xlarge | ~30 min | $0.75 |
| 5,000 examples | 3 | ml.g5.2xlarge | ~2 hours | $3.00 |
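
The cost column is simply billed time multiplied by the hourly rate, which a small sketch makes easy to re-run with your own numbers. The `training_cost` helper is hypothetical, the rate is the one quoted in this notebook, and actual pricing varies by region and over time; the table above rounds the results.

```python
def training_cost(minutes, hourly_rate_usd):
    """Estimated cost of a training job billed for `minutes` at `hourly_rate_usd`."""
    return minutes / 60 * hourly_rate_usd

HOURLY_RATE = 1.52  # ml.g5.2xlarge rate quoted above; check current pricing for your region

# (dataset size, estimated training minutes) pairs from the table
for examples, minutes in [(100, 10), (500, 20), (1000, 30), (5000, 120)]:
    cost = training_cost(minutes, HOURLY_RATE)
    print(f"{examples:>5} examples: ~{minutes} min -> ${cost:.2f}")
```

Billed time also includes instance startup and model download, so short jobs tend to cost a little more than the pure training time suggests.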

Now let's create and launch the SageMaker training job!

In [None]:
# Configure Hugging Face estimator
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='scripts',
    instance_type='ml.g5.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.36.0',
    pytorch_version='2.1.0',
    py_version='py310',
    hyperparameters={
        'epochs': 3,
        'train_batch_size': 1,
    },
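    environment={
        'HUGGINGFACE_HUB_CACHE': '/tmp/.cache',
        # If the model repository is gated on the Hugging Face Hub, also supply an access
        # token here (for example via the 'HF_TOKEN' variable); never commit real tokens
    },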
    max_run=3600,  # 1 hour max
)

print("Estimator configured")

In [None]:
# Start training
huggingface_estimator.fit({'training': training_data_uri})

print("Training job completed!")
print(f"Model artifacts: {huggingface_estimator.model_data}")

---

# Part 7: Model Deployment

## 🌐 Deploying Your Fine-tuned Model

After training completes, we can deploy the fine-tuned model to a SageMaker real-time endpoint for inference.

### What is a SageMaker Endpoint?

A **SageMaker Endpoint** is a fully managed inference service that:
- Hosts your model on persistent compute resources
- Provides a REST API for predictions
- Can auto-scale based on traffic once a scaling policy is configured
- Handles load balancing automatically
- Monitors model performance
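
Outside the SageMaker SDK, the endpoint's API can be called from any application through the `sagemaker-runtime` client in boto3. The sketch below separates building the request from sending it; the endpoint name is a placeholder, the helper names are hypothetical, and the payload shape assumes the Hugging Face inference container's `inputs`/`parameters` convention used later in this notebook.

```python
import json

def build_invoke_request(endpoint_name, prompt, max_new_tokens=128):
    """Keyword arguments for the SageMaker runtime `invoke_endpoint` call."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Accept": "application/json",
        "Body": json.dumps(payload),
    }

def invoke(endpoint_name, prompt):
    """Send one request to a live endpoint (needs AWS credentials and a deployed endpoint)."""
    import boto3  # imported here so the request builder works without boto3 installed

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**build_invoke_request(endpoint_name, prompt))
    return json.loads(response["Body"].read())

# Inspect the request without calling AWS; "mistral-finetuned" is a placeholder name
request = build_invoke_request("mistral-finetuned", "How do I reset my password?")
print(request["Body"])
```

Calling `invoke(...)` with the name of a running endpoint returns the parsed JSON response.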

### Deployment Options:

#### **Real-time Endpoints** (What we're using)
- **Use case**: Low-latency, synchronous predictions
- **Latency**: 100-500ms
- **Cost**: Pay for instance uptime
- **Best for**: Chatbots, interactive applications

#### **Serverless Endpoints**
- **Use case**: Intermittent traffic
- **Latency**: 1-5 seconds (cold start)
- **Cost**: Pay per request
- **Best for**: Development, low-traffic apps

#### **Batch Transform**
- **Use case**: Process large datasets offline
- **Latency**: Minutes to hours
- **Cost**: Pay for job duration
- **Best for**: Bulk processing, analytics

### Instance Selection for Inference:

| Instance | GPU | VRAM | Cost/hr | Throughput | Best For |
|----------|-----|------|---------|------------|----------|
| ml.g5.xlarge | 1x A10G | 24GB | $1.01 | ~10 req/sec | Development |
| ml.g5.2xlarge | 1x A10G | 24GB | $1.52 | ~15 req/sec | Production (low traffic) |
| ml.g5.4xlarge | 1x A10G | 24GB | $2.03 | ~20 req/sec | Production (medium traffic) |
| ml.g5.12xlarge | 4x A10G | 96GB | $6.11 | ~60 req/sec | Production (high traffic) |

### Cost Optimization Tips:

1. **Right-size your instance**: Start small, scale up if needed
2. **Use auto-scaling**: Scale down during low traffic
3. **Delete unused endpoints**: Stop paying when not in use
4. **Consider Serverless**: For unpredictable traffic
5. **Use managed Spot Training**: AWS cites savings of up to 90% on training jobs (this applies to training, not to real-time endpoints)

### Deployment Configuration:

```python
predictor = estimator.deploy(
    initial_instance_count=1,           # Start with 1 instance
    instance_type='ml.g5.2xlarge',      # GPU instance for fast inference
    endpoint_name='mistral-finetuned'   # Unique endpoint name
)
```

### What Happens During Deployment:

1. **Model Creation**: A SageMaker Model is created that references the training job's artifacts in S3
2. **Container Creation**: Inference container configured
3. **Instance Provisioning**: GPU instance launched
4. **Model Loading**: Model loaded into GPU memory
5. **Health Checks**: Endpoint tested for readiness
6. **Endpoint Active**: Ready to serve predictions

**Deployment Time**: 5-10 minutes

Let's deploy our fine-tuned model!

In [None]:
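# Deploy the model
from sagemaker.utils import name_from_base

predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge',
    # The default bucket name always starts with "sagemaker-", so slicing it adds no uniqueness;
    # name_from_base appends a timestamp instead
    endpoint_name=name_from_base('mistral-finetuned')
)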

print(f"Model deployed to endpoint: {predictor.endpoint_name}")

---

# Part 8: Testing & Evaluation

## 🧪 Testing Your Fine-tuned Model

Now let's test our fine-tuned model and understand the inference parameters.

### Inference Parameters Explained:

#### **max_new_tokens**
- **What**: Maximum number of tokens to generate
- **Typical values**: 50-512
- **Impact**: 
  - Higher = longer responses, more cost, slower
  - Lower = shorter responses, less cost, faster
- **Recommendation**: Set based on expected response length

#### **temperature**
- **What**: Controls randomness in generation
- **Range**: 0.0 to 2.0
- **Effects**:
  - **0.0-0.3**: Deterministic, focused, repetitive
  - **0.4-0.7**: Balanced, natural (recommended)
  - **0.8-1.0**: Creative, diverse
  - **1.1-2.0**: Very random, potentially incoherent
- **Use cases**:
  - Factual QA: 0.1-0.3
  - Customer support: 0.5-0.7
  - Creative writing: 0.8-1.2

#### **top_p (nucleus sampling)**
- **What**: Considers tokens with cumulative probability up to p
- **Range**: 0.0 to 1.0
- **Effects**:
  - **0.1-0.5**: Very focused, deterministic
  - **0.6-0.9**: Balanced (recommended)
  - **0.95-1.0**: More diverse
- **Tip**: Use with temperature for best results

#### **top_k**
- **What**: Considers only top k most likely tokens
- **Typical values**: 10, 20, 50
- **Effects**:
  - Lower k: More focused
  - Higher k: More diverse
- **Note**: Often used with top_p

#### **repetition_penalty**
- **What**: Penalizes repeated tokens
- **Range**: 1.0 to 2.0
- **Effects**:
  - 1.0: No penalty
  - 1.1-1.3: Reduces repetition (recommended)
  - 1.5+: Strongly discourages repetition

### Recommended Parameter Combinations:

#### **Factual/Deterministic** (Customer support, QA):
```python
{
    "temperature": 0.2,
    "top_p": 0.5,
    "top_k": 10,
    "repetition_penalty": 1.1
}
```

#### **Balanced** (General conversation):
```python
{
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "repetition_penalty": 1.2
}
```

#### **Creative** (Content generation):
```python
{
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 100,
    "repetition_penalty": 1.3
}
```

### Evaluating Your Fine-tuned Model:

#### Qualitative Evaluation:
1. **Relevance**: Does it answer the question?
2. **Accuracy**: Is the information correct?
3. **Consistency**: Does it match your brand/style?
4. **Completeness**: Does it cover all necessary points?
5. **Coherence**: Is the response well-structured?

#### Quantitative Metrics:
- **Perplexity**: Lower is better (measures prediction confidence)
- **BLEU/ROUGE**: For comparing against reference answers
- **Latency**: Response time (target: <500ms)
- **Throughput**: Requests per second
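
Perplexity is the exponential of the average negative log-likelihood per token, which a short sketch makes concrete. The log-probabilities here are made-up illustrative values; in practice they would come from scoring held-out text with the model.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model that assigns probability 0.5 to every correct token
uncertain = [math.log(0.5)] * 4
# A model that assigns probability 0.9 to every correct token
confident = [math.log(0.9)] * 4

print(f"Uncertain model: {perplexity(uncertain):.2f}")
print(f"Confident model: {perplexity(confident):.2f}")
```

Comparing this number for the base and fine-tuned models on the same held-out examples gives a quick signal of whether fine-tuning helped on your domain.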

#### A/B Testing:
Compare fine-tuned model vs. base model:
- User satisfaction scores
- Task completion rates
- Response quality ratings

Let's test our model!

In [None]:
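# Test the fine-tuned model
# Wrap the question in Mistral's instruction tags so it matches the chat-formatted training data
test_prompt = {
    "inputs": "[INST] How do I reset my password? [/INST]",
    "parameters": {
        "max_new_tokens": 128,
        "do_sample": True,  # temperature and top_p only take effect when sampling is enabled
        "temperature": 0.7,
        "top_p": 0.9
    }
}

response = predictor.predict(test_prompt)
print("Response:", response[0]['generated_text'])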

## Cleanup

Remember to delete your endpoint when you're done to avoid ongoing charges.

In [None]:
# Delete the endpoint
predictor.delete_endpoint()
print("Endpoint deleted")

---

# 🎓 Summary & Key Takeaways

## What You've Learned

Congratulations! You've successfully fine-tuned a Mistral model on Amazon SageMaker. Let's recap the key concepts:

### 1. **Mistral Models**
- High-performance, open-source LLMs
- Efficient architecture (GQA, Sliding Window Attention)
- Excellent for fine-tuning due to Apache 2.0 license
- 7B model punches above its weight class

### 2. **When to Fine-tune**
- ✅ Domain-specific language and terminology
- ✅ Consistent style and formatting requirements
- ✅ Cost optimization (a smaller fine-tuned model can replace a larger base model)
- ✅ Privacy (knowledge is embedded in the weights rather than sent in prompts)
- ❌ Limited data (<100 examples)
- ❌ Rapidly changing requirements

### 3. **LoRA Benefits**
- **60-80% cost reduction** vs. full fine-tuning
- **2-3x faster** training
- **16MB adapters** vs. 14GB full model
- **95-98% quality** of full fine-tuning
- **Multiple adapters** per base model

### 4. **Key Parameters**

| Parameter | Typical Value | Impact |
|-----------|---------------|--------|
| LoRA rank (r) | 16 | Adapter capacity |
| LoRA alpha | 32 | Update scaling |
| Learning rate | 2e-4 | Training speed |
| Epochs | 3 | Training iterations |
| Batch size | 1-4 | Memory vs. speed |
| Temperature | 0.7 | Response randomness |

### 5. **Cost Breakdown**

**Training** (ml.g5.2xlarge @ $1.52/hr):
- 100 examples: ~$0.25 (10 min)
- 1,000 examples: ~$0.75 (30 min)
- 5,000 examples: ~$3.00 (2 hrs)

**Inference** (ml.g5.2xlarge @ $1.52/hr):
- Development: Delete after testing ($0)
- Production: ~$1,100/month (24/7)
- With auto-scaling: ~$300-500/month

---

## 🚀 Next Steps

### Immediate Actions:

1. **Experiment with Your Data**
   - Collect 100-500 examples from your domain
   - Format as JSONL with user/assistant messages
   - Run fine-tuning with default parameters

2. **Optimize Parameters**
   - Try different LoRA ranks: 8, 16, 32
   - Adjust learning rate: 1e-4, 2e-4, 5e-4
   - Experiment with epochs: 1, 3, 5

3. **Compare Performance**
   - Test base model vs. fine-tuned
   - Measure accuracy on held-out test set
   - Calculate cost savings

### Advanced Topics:

#### **Multi-task Fine-tuning**
Train one model for multiple tasks:
```json
{"messages": [{"role": "system", "content": "Task: summarization"}, ...]}
{"messages": [{"role": "system", "content": "Task: classification"}, ...]}
```

#### **Instruction Tuning**
Improve instruction-following:
```json
{"messages": [{"role": "user", "content": "Summarize in 3 bullet points: ..."}, ...]}
```

#### **RLHF (Reinforcement Learning from Human Feedback)**
- Collect human preferences
- Train reward model
- Fine-tune with PPO

#### **Quantization**
Reduce model size further:
- 4-bit quantization: 75% size reduction
- 8-bit quantization: 50% size reduction
- Minimal quality loss
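
The size reductions follow directly from the number of bits stored per weight, measured against 16-bit weights. The sketch below works through the arithmetic under the assumption of roughly 7.25B parameters; real quantized formats add a small overhead for scaling factors, so actual files are slightly larger.

```python
def weight_size_gb(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 7.25e9  # approximate Mistral 7B parameter count (assumption)
baseline = weight_size_gb(NUM_PARAMS, 16)

for bits in (16, 8, 4):
    size = weight_size_gb(NUM_PARAMS, bits)
    saving = 100 * (1 - size / baseline)
    print(f"{bits:>2}-bit: {size:.1f} GB ({saving:.0f}% smaller than 16-bit)")
```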

---

## 📚 Additional Resources

### Documentation:
- [Mistral AI Documentation](https://docs.mistral.ai/)
- [Hugging Face PEFT Library](https://huggingface.co/docs/peft)
- [SageMaker Training Jobs](https://docs.aws.amazon.com/sagemaker/latest/dg/train-model.html)
- [LoRA Paper](https://arxiv.org/abs/2106.09685)

### Tutorials:
- [Fine-tuning Best Practices](https://huggingface.co/blog/fine-tune-llms)
- [LoRA Deep Dive](https://huggingface.co/blog/lora)
- [SageMaker Examples](https://github.com/aws/amazon-sagemaker-examples)

### Community:
- [Mistral AI Discord](https://discord.gg/mistralai)
- [Hugging Face Forums](https://discuss.huggingface.co/)
- [AWS ML Community](https://aws.amazon.com/machine-learning/community/)

---

## 💡 Pro Tips

1. **Start Small**: Begin with 100-200 examples, iterate quickly
2. **Monitor Training**: Watch loss curves in CloudWatch
3. **Version Control**: Save different adapter versions
4. **A/B Test**: Compare models in production
5. **Cost Optimize**: Delete endpoints when not in use
6. **Data Quality**: 100 great examples > 1,000 mediocre ones
7. **Regularization**: Use dropout to prevent overfitting
8. **Evaluation**: Create a test set for objective metrics

---

## 🎯 Workshop Completion Checklist

- [ ] Understood Mistral model architecture
- [ ] Learned when to fine-tune vs. use base models
- [ ] Grasped LoRA concepts and benefits
- [ ] Prepared training data in JSONL format
- [ ] Configured training parameters
- [ ] Launched SageMaker training job
- [ ] Deployed fine-tuned model
- [ ] Tested inference with different parameters
- [ ] Cleaned up resources

---

## 🙏 Thank You!

You've completed the Mistral Fine-tuning workshop! You now have the knowledge to:
- Fine-tune LLMs efficiently with LoRA
- Deploy models on SageMaker
- Optimize costs and performance
- Build production-ready AI applications

**Questions?** Reach out to the workshop facilitators or AWS support.

**Ready for more?** Check out the other notebooks in this workshop series!

---

### Remember to clean up your resources! 👇