# LoRA Fine-tuning with Mistral 7B

This notebook demonstrates how to fine-tune the Mistral 7B model using LoRA (Low-Rank Adaptation) for a specific task. We'll customize the model for a domain-specific use case while keeping memory usage manageable.

## What is LoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that:
- Reduces the number of trainable parameters by roughly 99% or more
- Typically keeps quality close to that of full fine-tuning
- Enables fine-tuning large models on consumer GPUs
- Allows for quick task switching
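
To make the idea concrete, here is a minimal pure-Python sketch of the LoRA update for a single linear layer, `h = W0·x + (alpha / r)·B·(A·x)`, where `W0` stays frozen and only the small matrices `A` and `B` are trained. The helper names (`matvec`, `lora_forward`, `trainable_fraction`) and the toy dimensions are illustrative assumptions, not part of any library.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(w0, a, b, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / r
    return [h + scale * u for h, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 4  # toy sizes, chosen only for illustration

w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]    # trainable, r x d_in
b = [[0.0] * r for _ in range(d_out)]                                   # trainable, d_out x r, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

# With B initialised to zero, the adapted layer starts out identical to the base layer.
base_output = matvec(w0, x)
adapted_output = lora_forward(w0, a, b, x, alpha, r)
print(adapted_output == base_output)

def trainable_fraction(d_in, d_out, r):
    """Share of parameters trained by LoRA relative to the full weight matrix."""
    return r * (d_in + d_out) / (d_in * d_out)

# Assumed example: a square 4096 x 4096 projection adapted with rank 16
print(f"{trainable_fraction(4096, 4096, 16):.4%}")
```

For the assumed 4096 x 4096 projection, the adapter trains under 1% of the weights that full fine-tuning would update in that layer.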

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install -q transformers peft datasets accelerate bitsandbytes trl wandb torch torchvision torchaudio

In [None]:
import torch
import transformers
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from datasets import Dataset, load_dataset
import pandas as pd
import numpy as np
from trl import SFTTrainer
import os
import warnings
warnings.filterwarnings('ignore')

print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")

## 2. Model and Tokenizer Setup

In [None]:
# Model configuration
model_name = "mistralai/Mistral-7B-v0.1"
new_model = "mistral-7b-custom-lora"

# QLoRA configuration for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

print("Loading model with 4-bit quantization...")
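
As a rough guide to why 4-bit loading matters, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes about 7.24 billion parameters (an approximate figure; `model.num_parameters()` reports the exact value after loading) and ignores quantization constants, activations, optimizer state, and any layers kept in higher precision. `weight_memory_gib` and `APPROX_PARAMS` are hypothetical names used only for this illustration.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

APPROX_PARAMS = 7.24e9  # assumed approximate parameter count for Mistral 7B

for label, bits in [("fp32", 32), ("bf16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: {weight_memory_gib(APPROX_PARAMS, bits):.1f} GiB")
```

Actual usage during training will be higher, since gradients for the adapters, optimizer state, and activations all add to this baseline.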

In [None]:
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True

print(f"Model loaded: {model_name}")
print(f"Model parameters: {model.num_parameters():,}")

## 3. Prepare Training Data

For this example, we'll create a custom dataset for fine-tuning. In practice, you would use your domain-specific data.

In [None]:
# Create sample training data (replace with your domain-specific data)
training_data = [
    {
        "instruction": "Explain the concept of machine learning",
        "input": "",
        "output": "Machine learning is a subset of artificial intelligence that enables computers to learn and improve from experience without being explicitly programmed. It uses algorithms to analyze data, identify patterns, and make predictions or decisions."
    },
    {
        "instruction": "What are the main types of machine learning?",
        "input": "",
        "output": "The three main types of machine learning are: 1) Supervised learning - learns from labeled data, 2) Unsupervised learning - finds patterns in unlabeled data, and 3) Reinforcement learning - learns through trial and error with rewards and penalties."
    },
    {
        "instruction": "Describe neural networks",
        "input": "",
        "output": "Neural networks are computational models inspired by biological neural networks. They consist of interconnected nodes (neurons) organized in layers that process information. They excel at pattern recognition and are fundamental to deep learning."
    },
    {
        "instruction": "What is deep learning?",
        "input": "",
        "output": "Deep learning is a subset of machine learning that uses neural networks with multiple hidden layers to model and understand complex patterns in data. It has revolutionized fields like computer vision, natural language processing, and speech recognition."
    },
    {
        "instruction": "Explain overfitting in machine learning",
        "input": "",
        "output": "Overfitting occurs when a model learns the training data too well, including noise and random fluctuations, resulting in poor performance on new, unseen data. It can be prevented through techniques like regularization, cross-validation, and using more training data."
    }
]

# Convert to DataFrame and then to Dataset
df = pd.DataFrame(training_data)
dataset = Dataset.from_pandas(df)

print(f"Training dataset size: {len(dataset)}")
print("Sample data point:")
print(dataset[0])
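
When swapping in your own data, it can help to check that every record has the three fields used in this notebook before building the dataset. The sketch below is one possible check; `REQUIRED_FIELDS` and `validate_records` are hypothetical helper names, and the rule that `instruction` and `output` must be non-empty is an assumption about what counts as a usable example.

```python
REQUIRED_FIELDS = ("instruction", "input", "output")

def validate_records(records):
    """Return cleaned copies of the records, raising ValueError on malformed ones."""
    cleaned = []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing:
            raise ValueError(f"Record {index} is missing fields: {missing}")
        if not str(record["instruction"]).strip() or not str(record["output"]).strip():
            raise ValueError(f"Record {index} has an empty instruction or output")
        # Strip surrounding whitespace so formatting stays consistent across examples
        cleaned.append({field: str(record[field]).strip() for field in REQUIRED_FIELDS})
    return cleaned

sample = [{"instruction": " What is LoRA? ", "input": "", "output": "A low-rank adaptation method."}]
print(validate_records(sample))
```

The cleaned list can then be passed to `pd.DataFrame(...)` in the same way as `training_data`.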

In [None]:
# Format data for training
def format_instruction(sample):
    """
    Format the training data into a conversational format.
    """
    if sample['input']:
        text = f"""### Instruction:
{sample['instruction']}

### Input:
{sample['input']}

### Response:
{sample['output']}"""
    else:
        text = f"""### Instruction:
{sample['instruction']}

### Response:
{sample['output']}"""
    
    return text

# Apply formatting
def preprocess_dataset(dataset):
    return dataset.map(lambda x: {"text": format_instruction(x)})

formatted_dataset = preprocess_dataset(dataset)

print("Formatted sample:")
print(formatted_dataset[0]['text'])
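
At inference time the prompt should match the training template up to and including the `### Response:` header, so the model continues from the point where the answer begins. A small helper keeps the two in sync; `build_prompt` is a hypothetical name introduced here for illustration.

```python
def build_prompt(instruction, input_text=""):
    """Build an inference prompt that mirrors the training template, minus the answer."""
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

print(build_prompt("Explain what artificial intelligence is"))
```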

## 4. LoRA Configuration

In [None]:
# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of adaptation
    lora_alpha=32,  # LoRA scaling parameter
    target_modules=[
        "q_proj",
        "k_proj", 
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],  # Modules to apply LoRA to
    bias="none",
    lora_dropout=0.05,  # LoRA dropout
    task_type="CAUSAL_LM",
)

print("LoRA configuration:")
print(f"  Rank (r): {lora_config.r}")
print(f"  Alpha: {lora_config.lora_alpha}")
print(f"  Dropout: {lora_config.lora_dropout}")
print(f"  Target modules: {lora_config.target_modules}")

In [None]:
# Prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

# Print trainable parameters
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"Trainable params: {trainable_params:,} || All params: {all_param:,} || Trainable%: {100 * trainable_params / all_param:.2f}%")

print_trainable_parameters(model)
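
The trainable count printed above can be sanity-checked by hand: each adapted linear layer with input width `d_in` and output width `d_out` adds `r * (d_in + d_out)` parameters. The sketch below applies that formula to the target modules in this configuration. The layer shapes are assumptions about Mistral-7B-v0.1 (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers, vocabulary 32000); check them against `model.config` before relying on the result.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

# Assumed layer shapes for Mistral-7B-v0.1 (verify against model.config)
HIDDEN, KV_DIM, INTERMEDIATE, NUM_LAYERS, VOCAB = 4096, 1024, 14336, 32, 32000
R = 16

per_layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in per_layer_shapes.values())
total = per_layer * NUM_LAYERS + lora_params(HIDDEN, VOCAB, R)  # plus the lm_head adapter
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"Estimated LoRA parameters: {total:,}")
```

Under these assumptions the adapters come to roughly 42.5 million parameters, well under 1% of a 7B-parameter model.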

## 5. Training Configuration

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    optim="paged_adamw_32bit",
    save_steps=10,
    logging_steps=10,
    learning_rate=2e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=True,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="cosine",
    report_to="none",  # Disables logging integrations; set to "wandb" to use Weights & Biases
    save_strategy="steps",
    evaluation_strategy="no",
    do_eval=False,
    remove_unused_columns=False,
)

print("Training configuration:")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
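
The batch size and gradient accumulation settings combine into an effective batch size, which in turn determines how many optimizer steps a run takes. The sketch below works this out for the five-example dataset used here. The function names are hypothetical, and because the handling of a partial final accumulation window can differ between library versions, it reports a lower and an upper bound rather than a single number.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices

def optimizer_step_bounds(num_samples, per_device_batch, grad_accum_steps, epochs, num_devices=1):
    """Lower and upper bound on optimizer steps, depending on how a partial
    final accumulation window is rounded."""
    batches_per_epoch = math.ceil(num_samples / (per_device_batch * num_devices))
    low = max(batches_per_epoch // grad_accum_steps, 1) * epochs
    high = math.ceil(batches_per_epoch / grad_accum_steps) * epochs
    return low, high

print(effective_batch_size(2, 2))
low, high = optimizer_step_bounds(num_samples=5, per_device_batch=2, grad_accum_steps=2, epochs=3)
print(low, high)
```

Both bounds are below 10, so with `logging_steps=10` and `save_steps=10` this toy run would finish before any intermediate log entry or checkpoint is written. Lower those values, or use a larger dataset, if you want to watch the loss during training.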

## 6. Initialize Trainer

In [None]:
# Initialize trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=formatted_dataset,
    dataset_text_field="text",
    # peft_config is omitted: the model was already wrapped by get_peft_model above
    max_seq_length=512,
    tokenizer=tokenizer,
    args=training_args,
    packing=False,
)

print("Trainer initialized successfully!")

## 7. Training Process

In [None]:
# Test generation before training
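def generate_response(model, tokenizer, prompt, max_length=200):
    model.eval()  # disable dropout (including LoRA dropout) during generation
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # add_eos_token=True appends EOS to every encoded text; drop it so the
    # model continues the prompt instead of starting after an end-of-sequence token
    if inputs["input_ids"][0, -1].item() == tokenizer.eos_token_id:
        inputs = {key: value[:, :-1] for key, value in inputs.items()}
    prompt_length = inputs["input_ids"].shape[1]

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,  # counts generated tokens only
            do_sample=True,
            temperature=0.7,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens rather than slicing the decoded string
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return response.strip()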

# Test prompt
test_prompt = "### Instruction:\nExplain what artificial intelligence is\n\n### Response:\n"

print("=== BEFORE TRAINING ===")
print(f"Prompt: {test_prompt}")
pre_training_response = generate_response(model, tokenizer, test_prompt)
print(f"Response: {pre_training_response}")
print("\n" + "="*50 + "\n")
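
The `temperature` and `top_p` arguments shape how the next token is sampled. The toy sketch below shows the general idea on a made-up four-token vocabulary: temperature rescales the logits before the softmax, and top-p (nucleus) filtering keeps only the most likely tokens whose combined probability reaches the threshold. It is an illustration of the technique, not the implementation used inside `model.generate`, and all names and values in it are assumptions.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert a {token: logit} mapping into probabilities after temperature scaling."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_scaled = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - max_scaled) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalise."""
    kept = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return {tok: p / cumulative for tok, p in kept.items()}

toy_logits = {"a": 2.0, "b": 1.0, "c": 0.5, "d": -1.0}  # made-up values
probs_default = softmax_with_temperature(toy_logits, temperature=1.0)
probs_sharp = softmax_with_temperature(toy_logits, temperature=0.7)
candidates = top_p_filter(probs_sharp, top_p=0.9)
print(candidates)
```

A temperature below 1.0 concentrates probability on the likeliest tokens, and the top-p cut then removes the unlikely tail before sampling.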

In [None]:
# Start training
print("Starting LoRA fine-tuning...")
print("This may take several minutes depending on your hardware.")

# Clear cache before training
torch.cuda.empty_cache()

# Train the model
trainer.train()

print("Training completed!")

## 8. Save the Fine-tuned Model

In [None]:
# Save the LoRA adapters
trainer.model.save_pretrained(new_model)
tokenizer.save_pretrained(new_model)

print(f"Model saved to {new_model}")

# Save training logs
import json
training_logs = trainer.state.log_history
with open(f"{new_model}/training_logs.json", "w") as f:
    json.dump(training_logs, f, indent=2)

print("Training logs saved")

## 9. Test the Fine-tuned Model

In [None]:
# Test generation after training
print("=== AFTER TRAINING ===")
print(f"Prompt: {test_prompt}")
post_training_response = generate_response(model, tokenizer, test_prompt)
print(f"Response: {post_training_response}")
print("\n" + "="*50 + "\n")

# Compare responses
print("=== COMPARISON ===")
print("BEFORE TRAINING:")
print(pre_training_response)
print("\nAFTER TRAINING:")
print(post_training_response)

In [None]:
# Test with multiple prompts
test_prompts = [
    "### Instruction:\nWhat is supervised learning?\n\n### Response:\n",
    "### Instruction:\nExplain gradient descent\n\n### Response:\n",
    "### Instruction:\nWhat are the applications of AI?\n\n### Response:\n"
]

print("=== TESTING MULTIPLE PROMPTS ===")
for i, prompt in enumerate(test_prompts, 1):
    print(f"\n--- Test {i} ---")
    print(f"Prompt: {prompt.split('Response:')[0]}Response:")
    response = generate_response(model, tokenizer, prompt, max_length=150)
    print(f"Response: {response}")
    print("-" * 50)

## 10. Load and Use the Fine-tuned Model

In [None]:
# Demonstrate how to load the fine-tuned model later
def load_fine_tuned_model(base_model_name, adapter_path):
    """
    Load a fine-tuned LoRA model.
    """
    # Load base model
    base_model = AutoModelForCausalLM.from_pretrained(
        base_model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.bfloat16
    )
    
    # Load LoRA adapters
    model = PeftModel.from_pretrained(base_model, adapter_path)
    
    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(adapter_path)
    
    return model, tokenizer

print("Function to load fine-tuned model defined.")
print("\nTo load the model later, use:")
print(f"model, tokenizer = load_fine_tuned_model('{model_name}', '{new_model}')")

## 11. Model Analysis and Metrics

In [None]:
# Analyze model performance
def analyze_model_performance():
    """
    Analyze the performance of the fine-tuned model.
    """
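    # Get model size information. LoRA weights are injected into the base model,
    # so count them separately and subtract them from the total.
    # Note: 4-bit quantized weights are stored packed, so numel() undercounts them.
    total_params = sum(p.numel() for p in model.parameters())
    lora_params = sum(p.numel() for n, p in model.named_parameters() if 'lora_' in n)
    base_params = total_params - lora_params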
    
    print("=== MODEL ANALYSIS ===")
    print(f"Base model parameters: {base_params:,}")
    print(f"LoRA parameters: {lora_params:,}")
    print(f"LoRA overhead: {(lora_params / base_params) * 100:.4f}%")
    
    # Memory usage
    if torch.cuda.is_available():
        memory_allocated = torch.cuda.memory_allocated() / 1024**3
        memory_reserved = torch.cuda.memory_reserved() / 1024**3
        print(f"\nGPU Memory Usage:")
        print(f"  Allocated: {memory_allocated:.2f} GB")
        print(f"  Reserved: {memory_reserved:.2f} GB")
    
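    # Training metrics: the final log entry holds the run summary (including train_loss)
    if training_logs:
        final_loss = training_logs[-1].get('train_loss', 'N/A')
        print(f"\nTraining Metrics:")
        print(f"  Average training loss: {final_loss}")
        print(f"  Optimizer steps: {trainer.state.global_step}")
        print(f"  Log entries: {len(training_logs)}")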

analyze_model_performance()
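
The saved log history mixes two kinds of entries: periodic entries with a `loss` value, and a final summary entry with `train_loss`. The sketch below separates them so the loss curve can be inspected. `summarize_log_history` is a hypothetical helper, and the example entries are synthetic values shaped like `trainer.state.log_history`, not results from this notebook.

```python
def summarize_log_history(log_history):
    """Split Trainer log entries into per-step losses and the final run summary."""
    step_losses = [(entry["step"], entry["loss"]) for entry in log_history if "loss" in entry]
    summary = next((entry for entry in reversed(log_history) if "train_loss" in entry), {})
    return step_losses, summary

# Synthetic entries for illustration only (values are made up)
example_history = [
    {"loss": 1.9, "learning_rate": 2e-4, "epoch": 1.0, "step": 10},
    {"loss": 1.4, "learning_rate": 1e-4, "epoch": 2.0, "step": 20},
    {"train_runtime": 12.3, "train_loss": 1.65, "epoch": 2.0, "step": 20},
]
step_losses, summary = summarize_log_history(example_history)
print(step_losses)
print(summary.get("train_loss"))
```

To use it on a real run, pass `training_logs` (or the contents of `training_logs.json`) in place of `example_history`.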

## 12. Best Practices and Tips

In [None]:
# Best practices for LoRA fine-tuning
best_practices = """
=== LORA FINE-TUNING BEST PRACTICES ===

1. HYPERPARAMETER TUNING:
   - Start with r=16, alpha=32 for most tasks
   - Increase r for more complex tasks (up to 64)
   - Keep alpha = 2 * r as a general rule
   - Use learning rates around 1e-4 to 5e-4 (higher than typical full fine-tuning rates)

2. TARGET MODULES:
   - Include attention modules (q_proj, k_proj, v_proj, o_proj)
   - Add feed-forward modules for better performance
   - Include lm_head for generation tasks

3. DATA PREPARATION:
   - Use consistent formatting across all examples
   - Include clear instruction-response pairs
   - Ensure data quality over quantity
   - Use appropriate sequence lengths

4. TRAINING STRATEGIES:
   - Start with small datasets to test
   - Use gradient accumulation for larger effective batch sizes
   - Monitor training loss and stop if overfitting
   - Save checkpoints regularly

5. EVALUATION:
   - Test on held-out data
   - Compare against base model performance
   - Use domain-specific evaluation metrics
   - Consider human evaluation for quality

6. DEPLOYMENT:
   - LoRA adapters are small (typically tens to a few hundred MB, depending on r and target modules)
   - Can switch between different LoRA adapters quickly
   - Merge adapters for faster inference if needed
   - Test thoroughly before production deployment
"""

print(best_practices)
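
Merging an adapter means folding the scaled low-rank product into the base weight, `W_merged = W0 + (alpha / r) * B @ A`, so inference needs only one matrix multiply per layer. The toy sketch below checks that the merged weight gives the same output as the base layer plus the adapter. It uses plain Python lists and made-up values; the helper names are hypothetical. PEFT exposes a `merge_and_unload()` method for doing this on a real model.

```python
def matmul(left, right):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*right))
    return [[sum(p * q for p, q in zip(row, col)) for col in columns] for row in left]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def merge_lora(w0, a, b, alpha, r):
    """Return W0 + (alpha / r) * B @ A as a new matrix."""
    scale = alpha / r
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(w0, delta)]

# Toy 2 x 3 layer with a rank-1 adapter (made-up values)
w0 = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
a = [[0.1, 0.2, -0.1]]   # r x d_in
b = [[1.0], [-2.0]]      # d_out x r
alpha, r = 2, 1
x = [1.0, 2.0, 3.0]

merged = merge_lora(w0, a, b, alpha, r)
adapter_output = [h + (alpha / r) * u for h, u in zip(matvec(w0, x), matvec(b, matvec(a, x)))]
merged_output = matvec(merged, x)
print(all(abs(m - o) < 1e-9 for m, o in zip(merged_output, adapter_output)))
```

Keeping adapters unmerged preserves the ability to swap them; merging trades that flexibility for simpler inference.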

## 13. Cleanup and Resource Management

In [None]:
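# Clean up GPU memory
import gc

# The trainer also holds a reference to the model, so delete it as well
del trainer
del model
del tokenizer
gc.collect()
torch.cuda.empty_cache()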

print("GPU memory cleared")

if torch.cuda.is_available():
    memory_allocated = torch.cuda.memory_allocated() / 1024**3
    print(f"Current GPU memory usage: {memory_allocated:.2f} GB")

## Conclusion

This notebook demonstrated:

1. **Setting up LoRA fine-tuning** with Mistral 7B using 4-bit quantization
2. **Preparing training data** in instruction-response format
3. **Configuring LoRA parameters** for efficient training
4. **Training the model** with minimal resource usage
5. **Evaluating improvements** through before/after comparisons
6. **Saving and loading** the fine-tuned model

### Key Benefits of LoRA:
- **Memory Efficient**: Trains only 0.1-1% of original parameters
- **Fast Training**: Significantly reduced training time
- **Modular**: Easy to switch between different adaptations
- **Cost-Effective**: Can run on consumer GPUs

### Next Steps:
1. **Scale up** with larger, domain-specific datasets
2. **Experiment** with different LoRA configurations
3. **Evaluate** on downstream tasks
4. **Deploy** in production environments
5. **Combine** multiple LoRA adapters for multi-task models

**Note**: This is a demonstration with a small dataset. For production use, ensure you have high-quality, diverse training data and proper evaluation metrics.