# 🚀 Complete LLaMA Insurance Fine-tuning Pipeline for Google Colab

This notebook runs the entire fine-tuning pipeline end to end in Google Colab.

## Pipeline Overview
1. **Data Preprocessing** - PII removal, cleaning, dataset creation
2. **Tokenization** - LLaMA tokenizer setup and data formatting
3. **Model Training** - LoRA fine-tuning with quantization
4. **Evaluation** - Comprehensive metrics and analysis
5. **Inference Demo** - Interactive testing of the trained model

## Before Running:
1. **Enable GPU**: Runtime → Change runtime type → GPU (T4 recommended)
2. **Authenticate**: You'll need HuggingFace access for LLaMA models
3. **Optional**: Set up W&B for training monitoring

**⚠️ This notebook requires ~4-6 hours to complete on Colab T4 GPU**

# Part 1: Setup and Dependencies

Install required libraries and set up the environment.
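
The install commands in the next cell are unpinned, so the resolved versions can differ between Colab sessions, and some `transformers` argument names have changed across releases. A minimal sketch for recording what was actually installed, using only the standard library (the helper name `report_versions` and the package list are illustrative assumptions, not part of the original notebook):

```python
from importlib import metadata

def report_versions(packages):
    """Return {package: installed version, or 'not installed'} for each name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions

# Hypothetical package list; extend it with whatever the notebook depends on
for name, version in report_versions(["transformers", "peft", "bitsandbytes", "datasets"]).items():
    print(f"{name}: {version}")
```

Running this after the install cell and keeping the output alongside results makes a later run easier to reproduce.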

In [None]:
# Install required packages
!pip install -q transformers accelerate datasets bitsandbytes peft wandb
!pip install -q torch torchvision torchaudio
!pip install -q rouge-score nltk scikit-learn matplotlib seaborn
!pip install -q pandas numpy tqdm ipywidgets

print("✅ All packages installed successfully!")

In [None]:
# Import core libraries
import os
import json
import torch
import gc
import time
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Union
import warnings
from datetime import datetime

# Data processing
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# ML libraries  
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    GenerationConfig,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType, PeftModel
from datasets import Dataset, DatasetDict

# Text processing
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # Newer NLTK releases load tokenizer data from 'punkt_tab'
nltk.download('stopwords', quiet=True)

# HuggingFace authentication
from huggingface_hub import login

warnings.filterwarnings('ignore')
tqdm.pandas()

print(f"✅ Libraries imported successfully")
print(f"PyTorch version: {torch.__version__}")
print(f"Transformers version: {transformers.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ No GPU detected - training will be very slow!")

In [None]:
# Authenticate with HuggingFace (required for LLaMA access)
print("🔐 HuggingFace Authentication Required")
print("Please get your access token from: https://huggingface.co/settings/tokens")
print("Make sure you have access to LLaMA models")

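try:
    login()
    # In a notebook, login() may only display a token prompt and return immediately,
    # so reaching this line does not prove that a valid token was entered.
    print("✅ HuggingFace login prompt started - enter your token before continuing")
except Exception as e:
    print(f"❌ Authentication failed: {e}")
    print("Please run this cell again and enter your HF token")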

In [None]:
# Create project directory structure
project_dirs = [
    "data/raw",
    "data/processed",
    "data/tokenized", 
    "data/annotations",
    "outputs/checkpoints",
    "outputs/final_model",
    "outputs/logs",
    "outputs/evaluation",
    "config"
]

for dir_path in project_dirs:
    Path(dir_path).mkdir(parents=True, exist_ok=True)

print("✅ Project directory structure created")
print(f"Working directory: {os.getcwd()}")

# Define a memory management helper
def clear_memory():
    """Release unreferenced Python objects, then free cached GPU memory"""
    gc.collect()  # Collect first so tensors held by dead objects can be released
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    print("🧹 Memory cleared")
    if torch.cuda.is_available():
        print(f"GPU Memory: {torch.cuda.memory_allocated(0) / 1e9:.2f} GB allocated")

# Clear memory to start fresh
clear_memory()

# Part 2: Data Preprocessing

Create sample insurance data and preprocess it for training.

In [None]:
# Configuration for data preprocessing
RAW_DATA_DIR = Path("data/raw")
PROCESSED_DATA_DIR = Path("data/processed")
ANNOTATIONS_DIR = Path("data/annotations")

# Data split ratios
TRAIN_RATIO = 0.7
VAL_RATIO = 0.15
TEST_RATIO = 0.15

# Insurance task types
TASK_TYPES = {
    'CLAIM_CLASSIFICATION': 'Categorize insurance claims',
    'POLICY_SUMMARIZATION': 'Summarize policy documents',
    'FAQ_GENERATION': 'Generate FAQs from policies',
    'COMPLIANCE_CHECK': 'Identify compliance requirements',
    'CONTRACT_QA': 'Answer questions about contracts'
}

print(f"📊 Data preprocessing configuration loaded")
print(f"Task types: {list(TASK_TYPES.keys())}")

In [None]:
# PII Removal Class
import re

class PIIRemover:
    """Class to handle PII detection and removal from insurance documents"""
    
    def __init__(self):
        # Regex patterns for common PII
        self.patterns = {
            'ssn': r'\b\d{3}-\d{2}-\d{4}\b|\b\d{9}\b',
            # Separators may be '-', '.', or whitespace; the area code may be parenthesized
            'phone': r'(?:\+?1[-.\s]?)?(?:\(\d{3}\)|\b\d{3})[-.\s]?\d{3}[-.\s]?\d{4}\b',
            'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
            'zip_code': r'\b\d{5}(?:-\d{4})?\b',
            'credit_card': r'\b(?:\d{4}[-\s]?){3}\d{4}\b',
            'account_number': r'\b(?:account|acct|policy)\s*#?\s*\d{6,}\b',
            'date_of_birth': r'\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12]\d|3[01])[/-](?:19|20)\d{2}\b',
            'address_number': r'\b\d{1,5}\s+[A-Za-z\s]+(?:Street|St|Avenue|Ave|Road|Rd|Drive|Dr|Lane|Ln|Boulevard|Blvd)\b',
        }
    
    def remove_pii(self, text: str) -> Tuple[str, Dict[str, int]]:
        """Remove PII from text and return cleaned text with removal stats"""
        cleaned_text = text
        removal_stats = {}
        
        # Replace PII with generic placeholders
        replacements = {
            'ssn': '[SSN]',
            'phone': '[PHONE]',
            'email': '[EMAIL]',
            'zip_code': '[ZIP]',
            'credit_card': '[CARD_NUMBER]',
            'account_number': '[ACCOUNT_NUMBER]',
            'date_of_birth': '[DATE_OF_BIRTH]',
            'address_number': '[ADDRESS]',
        }
        
        for pii_type, replacement in replacements.items():
            pattern = self.patterns[pii_type]
            matches = re.findall(pattern, cleaned_text, re.IGNORECASE)
            removal_stats[pii_type] = len(matches)
            cleaned_text = re.sub(pattern, replacement, cleaned_text, flags=re.IGNORECASE)
        
        return cleaned_text, removal_stats

# Initialize PII remover
pii_remover = PIIRemover()

# Test PII removal
test_text = "John Smith's SSN is 123-45-6789 and phone is (555) 123-4567."
cleaned, stats = pii_remover.remove_pii(test_text)
print(f"PII Removal Test:")
print(f"Original: {test_text}")
print(f"Cleaned: {cleaned}")
print(f"✅ PII remover ready")
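
The `remove_pii` method scans each pattern twice, once with `re.findall` to count and once with `re.sub` to replace. As a sketch of the same idea in a single pass, `re.subn` returns both the rewritten text and the number of substitutions. The reduced pattern table and the `redact` helper below are illustrative assumptions, not the notebook's class:

```python
import re

# Hypothetical, reduced pattern table: name -> (regex, placeholder)
PATTERNS = {
    "ssn": (r"\b\d{3}-\d{2}-\d{4}\b", "[SSN]"),
    "email": (r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]"),
}

def redact(text):
    """Replace each PII match with its placeholder and count replacements."""
    counts = {}
    for name, (pattern, placeholder) in PATTERNS.items():
        # subn returns (new_text, number_of_substitutions)
        text, counts[name] = re.subn(pattern, placeholder, text, flags=re.IGNORECASE)
    return text, counts

redacted, counts = redact("Reach Jane at jane.doe@example.com, SSN 123-45-6789.")
print(redacted)
print(counts)
```

Patterns are applied in dictionary order, so a broad pattern placed early can consume text that a later, more specific pattern was meant to catch.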

In [None]:
# Create sample insurance data for training
def create_sample_insurance_data():
    """Create comprehensive sample insurance documents for training"""
    
    sample_documents = [
        {
            'id': 'health_policy_001',
            'source': 'sample_data',
            'content': '''Health Insurance Policy - Premium Coverage
            
Coverage: This comprehensive health insurance policy provides coverage for medical expenses including hospital stays, doctor visits, prescription medications, and emergency care. The annual coverage limit is $1,000,000 per insured individual.
            
Deductible: Annual deductible of $1,500 per individual, $3,000 per family. After meeting the deductible, the plan covers 80% of eligible medical expenses.
            
Exclusions: Pre-existing conditions diagnosed within 12 months prior to policy effective date, cosmetic procedures, experimental treatments, and services not deemed medically necessary are excluded from coverage.
            
Premium: Monthly premium of $450 for individual coverage, $1,200 for family coverage. Premiums are due on the first of each month.''',
            'type': 'health_policy',
            'task_type': 'POLICY_SUMMARIZATION'
        },
        {
            'id': 'auto_claim_001', 
            'source': 'sample_data',
            'content': '''Auto Insurance Claim - Vehicle Collision
            
Claim Details: Vehicle collision occurred on Highway 101 involving two vehicles. Insured vehicle sustained front-end damage requiring repair. No injuries reported. Police report filed, case number 2024-001234.
            
Coverage Applied: Collision coverage with $500 deductible. Estimated repair cost $3,200. Coverage approved for $2,700 after deductible.
            
Settlement: Claim approved and processed. Payment issued to approved repair facility. Rental car coverage provided for 5 days during repair period.''',
            'type': 'auto_claim',
            'task_type': 'CLAIM_CLASSIFICATION'
        },
        {
            'id': 'life_policy_001',
            'source': 'sample_data', 
            'content': '''Life Insurance Policy - Term Life Coverage
            
Coverage: $500,000 term life insurance policy with 20-year level premium guarantee. Coverage includes accidental death and dismemberment benefit.
            
Beneficiaries: Primary beneficiary receives full death benefit. Secondary beneficiaries receive equal distribution if primary is deceased.
            
Premium: Monthly premium of $65 guaranteed for 20 years. Policy renewable at end of term with updated rates based on age and health.''',
            'type': 'life_policy',
            'task_type': 'POLICY_SUMMARIZATION'
        },
        {
            'id': 'home_claim_001',
            'source': 'sample_data',
            'content': '''Homeowners Insurance Claim - Water Damage
            
Claim Details: Water damage to kitchen and dining room caused by burst pipe. Damage includes flooring, cabinets, and drywall. Professional water extraction and drying completed.
            
Coverage Applied: Dwelling coverage and personal property coverage. $1,000 deductible applied. Total claim amount $8,500 minus deductible equals $7,500 payout.
            
Settlement: Claim approved. Contractor payment authorized for repairs. Temporary living expenses covered during repair period.''',
            'type': 'home_claim',
            'task_type': 'CLAIM_CLASSIFICATION'
        },
        {
            'id': 'compliance_doc_001',
            'source': 'sample_data',
            'content': '''Insurance Regulatory Compliance Requirements
            
HIPAA Compliance: All health insurance operations must comply with Health Insurance Portability and Accountability Act requirements for protecting patient health information privacy and security.
            
State Regulations: Insurance products must be filed with and approved by state insurance commissioners before sale. Rate changes require regulatory approval.
            
Consumer Protection: All marketing materials must be clear, truthful, and not misleading. Claims processing must be fair and timely according to state prompt payment laws.''',
            'type': 'compliance',
            'task_type': 'COMPLIANCE_CHECK'
        },
        {
            'id': 'disability_policy_001',
            'source': 'sample_data',
            'content': '''Disability Insurance Policy - Short Term Coverage
            
Coverage: Provides 60% of monthly income up to $3,000 per month for temporary disability lasting 14 days to 2 years. Covers illness and injury preventing work.
            
Waiting Period: 14-day elimination period before benefits begin. Pre-existing conditions covered after 12 months of continuous coverage.
            
Benefits: Monthly benefit payments made directly to insured. Coverage continues until recovery, end of benefit period, or return to work.''',
            'type': 'disability_policy',
            'task_type': 'POLICY_SUMMARIZATION' 
        }
    ]
    
    # Add more examples for better training
    expanded_documents = []
    
    # Create variations of each document type
    for doc in sample_documents:
        expanded_documents.append(doc)
        
        # Create a variation with different values
        if 'policy' in doc['type']:
            variation = doc.copy()
            variation['id'] = doc['id'].replace('001', '002')
            # Simple content variation
            variation['content'] = doc['content'].replace('$500,000', '$750,000').replace('$1,000,000', '$1,500,000')
            expanded_documents.append(variation)
        
        elif 'claim' in doc['type']:
            variation = doc.copy()
            variation['id'] = doc['id'].replace('001', '002')
            variation['content'] = doc['content'].replace('approved', 'pending review')
            expanded_documents.append(variation)
    
    # Save sample data to files
    sample_file = RAW_DATA_DIR / 'sample_insurance_docs.json'
    with open(sample_file, 'w', encoding='utf-8') as f:
        json.dump(expanded_documents, f, indent=2, ensure_ascii=False)
    
    print(f"✅ Created {len(expanded_documents)} sample documents")
    return expanded_documents

# Create sample data
sample_docs = create_sample_insurance_data()
print(f"Sample documents by task:")
for task_type in TASK_TYPES.keys():
    count = len([doc for doc in sample_docs if doc['task_type'] == task_type])
    print(f"  {task_type}: {count} documents")

In [None]:
# Process documents and create training datasets
def process_and_create_datasets(documents):
    """Process documents and create task-specific datasets"""
    
    processed_examples = []
    
    for doc in documents:
        # Clean text and remove PII
        content = doc['content']
        cleaned_content, pii_stats = pii_remover.remove_pii(content)
        
        task_type = doc['task_type']
        
        if task_type == 'POLICY_SUMMARIZATION':
            # Create summarization example
            summary = f"This {doc['type']} document covers key insurance terms including coverage details, deductibles, premiums, and important policy conditions."
            
            example = {
                'instruction': 'Summarize the following insurance policy document.',
                'input': cleaned_content,
                'output': summary,
                'task_type': task_type,
                'doc_id': doc['id']
            }
            processed_examples.append(example)
        
        elif task_type == 'CLAIM_CLASSIFICATION':
            # Create classification example
            classification = f"This is a {doc['type']} requiring {doc['type'].split('_')[0]} coverage review."
            
            example = {
                'instruction': 'Classify this insurance claim into the appropriate category.',
                'input': cleaned_content,
                'output': classification,
                'task_type': task_type,
                'doc_id': doc['id']
            }
            processed_examples.append(example)
        
        elif task_type == 'COMPLIANCE_CHECK':
            # Create compliance checking example
            compliance_info = 'Key compliance requirements include HIPAA privacy protections, state regulatory approvals, consumer protection standards, and prompt payment regulations.'
            
            example = {
                'instruction': 'Identify compliance requirements in this insurance document.',
                'input': cleaned_content,
                'output': compliance_info,
                'task_type': task_type,
                'doc_id': doc['id']
            }
            processed_examples.append(example)
        
        # Add FAQ generation examples for policies
        if 'policy' in doc['type']:
            faq_example = {
                'instruction': 'Generate frequently asked questions for this insurance policy.',
                'input': cleaned_content,
                'output': f"Q: What is the coverage limit? A: The policy provides comprehensive coverage as outlined. Q: What is the deductible? A: Deductible amounts vary by coverage type. Q: How do I file a claim? A: Contact your insurance provider to begin the claims process.",
                'task_type': 'FAQ_GENERATION',
                'doc_id': doc['id'] + '_faq'
            }
            processed_examples.append(faq_example)
        
        # Add Q&A examples
        qa_example = {
            'instruction': 'Answer the following question about this insurance document.',
            'input': f"Document: {cleaned_content}\n\nQuestion: What are the key benefits covered?",
            'output': "The key benefits include coverage for specified risks, with defined limits and deductibles as outlined in the policy terms.",
            'task_type': 'CONTRACT_QA',
            'doc_id': doc['id'] + '_qa'
        }
        processed_examples.append(qa_example)
    
    return processed_examples

def create_train_test_splits(examples):
    """Create train/validation/test splits"""
    
    if len(examples) < 10:
        # For small datasets, use simple positional splits
        train_size = int(len(examples) * TRAIN_RATIO)
        val_size = int(len(examples) * VAL_RATIO)
        
        train_examples = examples[:train_size]
        val_examples = examples[train_size:train_size + val_size]
        test_examples = examples[train_size + val_size:]
        
        # Ensure every split has at least one example. The fallback reuses a
        # training example, so that split is no longer held out.
        if not val_examples:
            val_examples = train_examples[:1]
        if not test_examples:
            test_examples = train_examples[:1]
    else:
        # Use sklearn for larger datasets: carve off val+test, then divide that remainder
        holdout_ratio = VAL_RATIO + TEST_RATIO
        train_examples, temp_examples = train_test_split(
            examples, test_size=holdout_ratio, random_state=42
        )
        val_examples, test_examples = train_test_split(
            temp_examples, test_size=TEST_RATIO / holdout_ratio, random_state=42
        )
    
    return {
        'train': train_examples,
        'validation': val_examples,
        'test': test_examples
    }

# Process documents and create datasets
print("🔄 Processing documents and creating datasets...")
processed_examples = process_and_create_datasets(sample_docs)

print(f"\nCreated {len(processed_examples)} training examples:")
task_counts = {}
for example in processed_examples:
    task_type = example['task_type']
    task_counts[task_type] = task_counts.get(task_type, 0) + 1

for task_type, count in task_counts.items():
    print(f"  {task_type}: {count} examples")

# Create splits
data_splits = create_train_test_splits(processed_examples)

print(f"\nData splits created:")
for split_name, examples in data_splits.items():
    print(f"  {split_name}: {len(examples)} examples")

# Save datasets
combined_dir = PROCESSED_DATA_DIR / 'combined'
combined_dir.mkdir(exist_ok=True)

for split_name, examples in data_splits.items():
    json_file = combined_dir / f"{split_name}.json"
    with open(json_file, 'w', encoding='utf-8') as f:
        json.dump(examples, f, indent=2, ensure_ascii=False)

print(f"\n✅ Data preprocessing complete!")
print(f"Datasets saved to: {combined_dir}")

# Clear memory
clear_memory()
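
Each source document yields several examples (task, FAQ, Q&A), and the `_002` variations are near copies of their `_001` originals, so a per-example random split can place closely related text in both the training and evaluation sets. One possible mitigation is to split by source document instead. The sketch below uses only the standard library; `group_split` and `base_doc` are hypothetical helpers, and the ID parsing assumes the `<name>_00N[_suffix]` naming used by the sample data:

```python
import random
from collections import defaultdict

def base_doc(example):
    """Map e.g. 'health_policy_002_faq' to its source document 'health_policy'."""
    return example["doc_id"].split("_00")[0]

def group_split(examples, key, ratios=(0.7, 0.15, 0.15), seed=42):
    """Split examples so that all examples sharing key(example) land in one split."""
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)

    group_ids = sorted(groups)                 # deterministic order before shuffling
    random.Random(seed).shuffle(group_ids)

    n_train = max(1, int(len(group_ids) * ratios[0]))
    n_val = max(1, int(len(group_ids) * ratios[1]))
    assignment = {
        "train": group_ids[:n_train],
        "validation": group_ids[n_train:n_train + n_val],
        "test": group_ids[n_train + n_val:],
    }
    return {name: [ex for g in ids for ex in groups[g]] for name, ids in assignment.items()}

# Toy IDs shaped like the notebook's doc_id values
toy = [{"doc_id": f"{base}_{n}{suffix}"}
       for base in ("health_policy", "auto_claim", "life_policy",
                    "home_claim", "compliance_doc", "disability_policy")
       for n in ("001", "002")
       for suffix in ("", "_qa")]
for name, split in group_split(toy, base_doc).items():
    print(name, sorted({base_doc(ex) for ex in split}))
```

With only six source documents the split sizes are coarse, which is a limitation of the sample data rather than of the grouping approach.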

# Part 3: Tokenization

Set up the LLaMA tokenizer and format data for training.
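
The configuration in this part defines a system message, while `format_instruction` emits only the `[INST] ... [/INST]` pair. For reference, Llama 2 chat prompts can carry a system message inside `<<SYS>>` tags at the start of the first user turn. The sketch below shows that layout as a plain string builder; `build_llama2_prompt` is a hypothetical helper, it follows the notebook's spacing before `</s>`, and it leaves the beginning-of-sequence token to the tokenizer:

```python
def build_llama2_prompt(instruction, user_input="", system=None, answer=None):
    """Assemble a single-turn Llama 2 chat prompt as text."""
    user_turn = f"{instruction}\n\n{user_input}".strip()
    if system:
        # The system message is wrapped in <<SYS>> tags inside the first [INST] block
        user_turn = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user_turn}"
    prompt = f"[INST] {user_turn} [/INST]"
    if answer is not None:
        # Training examples append the target answer and the end-of-sequence marker
        prompt += f" {answer}</s>"
    return prompt

print(build_llama2_prompt(
    "Summarize the following insurance policy document.",
    "Coverage: ...",
    system="You are a helpful AI assistant specialized in insurance.",
    answer="The policy covers ...",
))
```

Whichever layout is chosen, the same one should be used when building prompts at inference time, since the model learns the exact token pattern seen during training.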

In [None]:
# Tokenization configuration
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"
TOKENIZED_DATA_DIR = Path("data/tokenized")
TOKENIZED_DATA_DIR.mkdir(exist_ok=True)

# Tokenization parameters
MAX_LENGTH = 2048
PADDING_SIDE = "right"
TRUNCATION = True
ADD_EOS_TOKEN = True

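# Instruction formatting template
# Note: format_instruction() below uses only the [INST] markers and the end-of-sequence
# suffix; the 'system' prompt is defined here but is not inserted into the training text.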
INSTRUCTION_TEMPLATE = {
    'system': "You are a helpful AI assistant specialized in insurance and financial services. Provide accurate, helpful, and compliant information.",
    'user_prefix': "[INST]",
    'user_suffix': "[/INST]",
    'assistant_prefix': "",
    'assistant_suffix': "</s>"
}

print(f"🔤 Tokenization configuration:")
print(f"  Model: {MODEL_NAME}")
print(f"  Max length: {MAX_LENGTH}")
print(f"  Output: {TOKENIZED_DATA_DIR}")

In [None]:
# Setup tokenizer
def setup_tokenizer(model_name: str) -> AutoTokenizer:
    """Load and configure the LLaMA tokenizer"""
    print(f"Loading tokenizer for {model_name}...")
    
    try:
        tokenizer = AutoTokenizer.from_pretrained(
            model_name,
            use_fast=True,
            trust_remote_code=True,
            padding_side=PADDING_SIDE
        )
        
        # Set special tokens
        if tokenizer.pad_token is None:
            if tokenizer.eos_token is not None:
                tokenizer.pad_token = tokenizer.eos_token
            else:
                tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        
        print(f"✅ Tokenizer loaded successfully")
        print(f"  Vocab size: {len(tokenizer)}")
        print(f"  Pad token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
        print(f"  EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
        
        return tokenizer
        
    except Exception as e:
        print(f"❌ Error loading tokenizer: {e}")
        raise

# Load tokenizer
tokenizer = setup_tokenizer(MODEL_NAME)

# Test tokenization
test_text = "This is a test of the LLaMA tokenizer for insurance documents."
test_tokens = tokenizer.encode(test_text)
decoded_text = tokenizer.decode(test_tokens)

print(f"\nTokenization test:")
print(f"  Original: {test_text}")
print(f"  Tokens: {len(test_tokens)} tokens")
print(f"  Decoded: {decoded_text}")
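
The pad token configured above only matters once sequences of different lengths are batched together. As a sketch of what dynamic padding does at batch time, the pure-Python function below right-pads token-ID lists to the longest sequence in the batch, rounded up to a multiple of 8, and builds the matching attention mask. `pad_batch` and the pad ID of 0 are illustrative assumptions; in the notebook this work is done by the data collator:

```python
def pad_batch(sequences, pad_id, multiple_of=8):
    """Right-pad token-ID lists to a shared length rounded up to `multiple_of`."""
    longest = max(len(seq) for seq in sequences)
    target = -(-longest // multiple_of) * multiple_of  # ceiling to the next multiple
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 6, 7], [8, 9, 10, 11, 12]], pad_id=0)
print(ids[0])   # [5, 6, 7, 0, 0, 0, 0, 0]
print(mask[0])  # [1, 1, 1, 0, 0, 0, 0, 0]
```

Padding per batch rather than to `MAX_LENGTH` keeps short examples from wasting memory on pad tokens.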

In [None]:
# Format data for instruction tuning
def format_instruction(example: Dict[str, str]) -> str:
    """Format example into instruction-following format for LLaMA"""
    
    instruction = example.get('instruction', '')
    user_input = example.get('input', '')
    assistant_output = example.get('output', '')
    
    # Format in LLaMA chat format
    formatted_text = f"{INSTRUCTION_TEMPLATE['user_prefix']} {instruction}\n\n{user_input} {INSTRUCTION_TEMPLATE['user_suffix']} {assistant_output}{INSTRUCTION_TEMPLATE['assistant_suffix']}"
    
    return formatted_text

def tokenize_function(examples: Dict[str, List[str]]) -> Dict[str, List[List[int]]]:
    """Tokenize a batch of examples"""
    
    # Tokenize the text
    tokenized = tokenizer(
        examples['text'],
        truncation=TRUNCATION,
        padding=False,  # Pad dynamically during training
        max_length=MAX_LENGTH,
        return_tensors=None,
        add_special_tokens=True
    )
    
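    # Do not add a 'labels' column here: DataCollatorForLanguageModeling(mlm=False)
    # builds labels from input_ids at batch time, and the tokenizer's padding step
    # cannot pad a variable-length 'labels' column.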
    
    # Calculate lengths for filtering
    tokenized['length'] = [len(ids) for ids in tokenized['input_ids']]
    
    return tokenized

# Load and tokenize datasets
print("🔄 Loading and tokenizing datasets...")

# Load processed data
combined_data = {}
for split in ['train', 'validation', 'test']:
    json_file = combined_dir / f"{split}.json"
    if json_file.exists():
        with open(json_file, 'r', encoding='utf-8') as f:
            combined_data[split] = json.load(f)

# Format and tokenize each split
tokenized_datasets = {}

for split_name, examples in combined_data.items():
    print(f"\nTokenizing {split_name} ({len(examples)} examples)...")
    
    # Format examples for instruction tuning
    formatted_examples = []
    for example in examples:
        formatted_text = format_instruction(example)
        formatted_examples.append({
            'text': formatted_text,
            'task_type': example.get('task_type', 'POLICY_SUMMARIZATION'),
            'original_id': example.get('doc_id', 'unknown')
        })
    
    # Create dataset and tokenize
    dataset = Dataset.from_list(formatted_examples)
    
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=dataset.column_names,
        desc=f"Tokenizing {split_name}"
    )
    
    # Filter out examples that are too long or too short
    def filter_length(example):
        length = example['length']
        return 10 <= length <= MAX_LENGTH
    
    filtered_dataset = tokenized_dataset.filter(filter_length)
    
    print(f"  After filtering: {len(filtered_dataset)} examples")
    
    if len(filtered_dataset) > 0:
        lengths = [ex['length'] for ex in filtered_dataset]
        print(f"  Length stats - Min: {min(lengths)}, Max: {max(lengths)}, Avg: {np.mean(lengths):.1f}")
    
    tokenized_datasets[split_name] = filtered_dataset

# Save tokenized datasets
combined_tokenized_dir = TOKENIZED_DATA_DIR / "combined"
combined_tokenized_dir.mkdir(exist_ok=True)

for split_name, dataset in tokenized_datasets.items():
    if len(dataset) > 0:
        split_dir = combined_tokenized_dir / split_name
        dataset.save_to_disk(split_dir)
        print(f"✅ Saved {split_name} dataset: {len(dataset)} examples")

# Save tokenizer
tokenizer_dir = TOKENIZED_DATA_DIR / "tokenizer"
tokenizer.save_pretrained(tokenizer_dir)
print(f"✅ Tokenizer saved to: {tokenizer_dir}")

# Save tokenization metadata
metadata = {
    'model_name': MODEL_NAME,
    'max_length': MAX_LENGTH,
    'vocab_size': len(tokenizer),
    'datasets': {split: len(dataset) for split, dataset in tokenized_datasets.items()}
}

metadata_file = TOKENIZED_DATA_DIR / "tokenization_metadata.json"
with open(metadata_file, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"\n✅ Tokenization complete!")
print(f"Total tokenized examples: {sum(len(dataset) for dataset in tokenized_datasets.values())}")

# Clear memory
clear_memory()
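
The tokenization cell trains on every token of each formatted example, prompt included. A common variation for instruction tuning computes the loss only on the response by setting the label at each prompt position to -100, the index that PyTorch's cross-entropy loss ignores by default. The sketch below shows the masking step on plain lists; `mask_prompt_labels` is a hypothetical helper, and it assumes the prompt length in tokens is already known (for example, from tokenizing the prompt portion separately):

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss

def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels, masking the first `prompt_len` positions."""
    labels = list(input_ids)
    n_masked = min(prompt_len, len(labels))
    labels[:n_masked] = [IGNORE_INDEX] * n_masked
    return labels

# Toy IDs: three prompt tokens followed by two response tokens
print(mask_prompt_labels([101, 102, 103, 7, 8], prompt_len=3))
# [-100, -100, -100, 7, 8]
```

Using this with the notebook would also require a collator that pads an existing `labels` column, so treat it as a direction to explore rather than a drop-in change.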

# Part 4: LoRA Fine-tuning

Train the LLaMA model using LoRA for efficient fine-tuning.
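
To see why LoRA is efficient, it helps to count parameters. For a frozen weight of shape `d_out x d_in`, LoRA trains two small matrices of shapes `r x d_in` and `d_out x r`, adding `r * (d_in + d_out)` parameters. The sketch below applies that formula to the seven projection types targeted in this notebook. The layer dimensions (hidden size 4096, MLP size 11008, 32 layers) are taken from the published Llama-2-7B configuration and are assumptions of this example, not values read from the loaded model:

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

HIDDEN, INTERMEDIATE, LAYERS, RANK = 4096, 11008, 32, 8

# (d_in, d_out) for each targeted projection in one decoder layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = sum(lora_param_count(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * LAYERS
print(f"Per layer: {per_layer:,}")  # Per layer: 624,640
print(f"Total:     {total:,}")      # Total:     19,988,480
```

Roughly 20 million trainable parameters against a model of about 6.7 billion is well under one percent, which is what makes training feasible on a single Colab GPU.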

In [None]:
# Training configuration
OUTPUT_DIR = Path("outputs")
CHECKPOINT_DIR = OUTPUT_DIR / "checkpoints"
FINAL_MODEL_DIR = OUTPUT_DIR / "final_model"
LOGS_DIR = OUTPUT_DIR / "logs"

# Training parameters optimized for Colab
TRAINING_CONFIG = {
    "output_dir": str(CHECKPOINT_DIR),
    "num_train_epochs": 3,
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 8,
    "gradient_checkpointing": True,
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "fp16": True,
    "max_grad_norm": 0.3,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "save_steps": 500,
    "eval_steps": 500,
    "logging_steps": 50,
    "save_total_limit": 2,
    "load_best_model_at_end": True,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
    "evaluation_strategy": "steps",
    "save_strategy": "steps",
    "report_to": []
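}

# Newer transformers releases renamed `evaluation_strategy` to `eval_strategy`;
# use whichever name the installed version accepts.
import inspect
if "eval_strategy" in inspect.signature(TrainingArguments.__init__).parameters:
    TRAINING_CONFIG["eval_strategy"] = TRAINING_CONFIG.pop("evaluation_strategy")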

# LoRA configuration
LORA_CONFIG = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]
}

# Quantization configuration
QUANTIZATION_CONFIG = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
)

print(f"🚀 Training configuration:")
print(f"  Epochs: {TRAINING_CONFIG['num_train_epochs']}")
print(f"  Batch size: {TRAINING_CONFIG['per_device_train_batch_size']}")
print(f"  Learning rate: {TRAINING_CONFIG['learning_rate']}")
print(f"  LoRA rank: {LORA_CONFIG['r']}")
print(f"  Quantization: 4-bit")
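
The batch size and gradient accumulation settings combine into an effective batch size, which in turn determines how many optimizer steps a run takes. A rough sketch of that arithmetic follows; `training_steps` is a hypothetical helper, and whether a partial final accumulation window counts as a step depends on the Trainer version, so the result is an estimate:

```python
import math

def training_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Estimate (effective batch size, total optimizer steps) for a training run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = max(1, math.ceil(micro_batches / grad_accum))
    return effective_batch, steps_per_epoch * epochs

# Example: 19 training examples with this notebook's batch settings
print(training_steps(num_examples=19, per_device_batch=2, grad_accum=8, epochs=3))
# (16, 6)
```

With a dataset this small the run finishes after a handful of optimizer steps, so the `eval_steps` and `save_steps` intervals of 500 would never be reached; smaller intervals or an epoch-based strategy may suit the sample data better.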

In [None]:
# Load tokenized datasets for training
def load_tokenized_datasets_for_training() -> tuple[DatasetDict, AutoTokenizer]:
    """Load tokenized datasets and tokenizer for training"""
    
    print(f"Loading tokenized datasets from {TOKENIZED_DATA_DIR}...")
    
    # Load tokenizer
    tokenizer_dir = TOKENIZED_DATA_DIR / "tokenizer"
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
    print(f"✅ Tokenizer loaded")
    
    # Load datasets
    from datasets import load_from_disk
    
    dataset_dict = DatasetDict()
    combined_dir = TOKENIZED_DATA_DIR / "combined"
    
    for split in ['train', 'validation', 'test']:
        split_dir = combined_dir / split
        if split_dir.exists():
            dataset = load_from_disk(str(split_dir))
            dataset_dict[split] = dataset
            print(f"✅ {split}: {len(dataset)} examples")
    
    return dataset_dict, tokenizer

# Load datasets for training
train_datasets, train_tokenizer = load_tokenized_datasets_for_training()

if not train_datasets:
    print("❌ No training datasets found")
else:
    total_examples = sum(len(dataset) for dataset in train_datasets.values())
    print(f"\n📊 Training data loaded: {total_examples} total examples")
    
    # Show sample
    if 'train' in train_datasets and len(train_datasets['train']) > 0:
        sample = train_datasets['train'][0]
        print(f"Sample input length: {len(sample['input_ids'])} tokens")

In [None]:
# Load and prepare model for training
def load_base_model(model_name: str, tokenizer: AutoTokenizer) -> AutoModelForCausalLM:
    """Load the base LLaMA model with quantization"""
    
    print(f"Loading base model {model_name}...")
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=QUANTIZATION_CONFIG,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.float16,
        use_cache=False  # Disable for training
    )
    
    # Resize embeddings if needed
    if len(tokenizer) > model.config.vocab_size:
        print(f"Resizing embeddings: {model.config.vocab_size} -> {len(tokenizer)}")
        model.resize_token_embeddings(len(tokenizer))
    
    print(f"✅ Model loaded successfully")
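    # 4-bit weights are stored packed, so this count is lower than the model's nominal size
    print(f"  Parameters (as stored): ~{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M")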
    
    return model

def setup_lora_model(model: AutoModelForCausalLM, lora_config: dict) -> PeftModel:
    """Set up LoRA configuration and wrap the model"""
    
    print(f"Setting up LoRA...")
    
    # Prepare model for k-bit training
    model = prepare_model_for_kbit_training(model)
    
    # Create LoRA configuration
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=lora_config['r'],
        lora_alpha=lora_config['lora_alpha'],
        lora_dropout=lora_config['lora_dropout'],
        bias=lora_config['bias'],
        target_modules=lora_config['target_modules'],
        inference_mode=False
    )
    
    # Wrap with LoRA
    model = get_peft_model(model, peft_config)
    
    # Print trainable parameters
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total_params = sum(p.numel() for p in model.parameters())
    
    print(f"✅ LoRA setup complete")
    print(f"  Trainable parameters: {trainable_params:,} ({trainable_params/total_params:.2%})")
    
    return model

# Load and setup model if datasets are available
if train_datasets and train_tokenizer:
    print("🔄 Loading and setting up model...")
    
    # Clear memory first
    clear_memory()
    
    # Load base model
    base_model = load_base_model(MODEL_NAME, train_tokenizer)
    
    # Setup LoRA
    model = setup_lora_model(base_model, LORA_CONFIG)
    
    # Check GPU memory
    if torch.cuda.is_available():
        memory_used = torch.cuda.memory_allocated(0) / 1e9
        print(f"GPU Memory Usage: {memory_used:.2f} GB")
else:
    print("❌ Cannot load model - no training datasets")
    model = None
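
The reason for 4-bit quantization is easiest to see as a back-of-the-envelope weight-memory estimate. The sketch below assumes roughly 6.74 billion parameters for a 7B LLaMA model and ignores quantization constants, layers kept in higher precision, activations, and optimizer state, so real usage will be higher; `weight_memory_gb` is a hypothetical helper:

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 6.74e9  # assumed approximate parameter count for a 7B model

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>5}: ~{weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

At 16 bits the weights alone come close to the memory of a 16 GB card, leaving little room for activations and gradients, while the 4-bit estimate leaves most of the card free for training.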

In [None]:
# Setup training components
if model is not None and train_datasets:
    print("⚙️ Setting up training components...")
    
    # Training arguments
    training_args = TrainingArguments(**TRAINING_CONFIG)
    
    # Data collator
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=train_tokenizer,
        mlm=False,  # Causal LM
        pad_to_multiple_of=8,
        return_tensors="pt"
    )
    
    # Create trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_datasets['train'],
        eval_dataset=train_datasets.get('validation'),
        tokenizer=train_tokenizer,
        data_collator=data_collator,
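        # Early stopping needs an eval dataset plus load_best_model_at_end=True
        # and an evaluation strategy set in TRAINING_CONFIG
        callbacks=[EarlyStoppingCallback(early_stopping_patience=3)]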
    )
    
    print(f"✅ Training components ready")
    print(f"  Training samples: {len(train_datasets['train'])}")
    if 'validation' in train_datasets:
        print(f"  Validation samples: {len(train_datasets['validation'])}")
    
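    # Rough training-time estimate
    train_dataloader = trainer.get_train_dataloader()
    num_batches = len(train_dataloader)
    # Optimizer steps (what the progress bar counts) shrink with gradient accumulation
    steps_per_epoch = max(num_batches // training_args.gradient_accumulation_steps, 1)
    total_steps = int(steps_per_epoch * training_args.num_train_epochs)
    # Assumes ~2 seconds per batch; actual speed depends on GPU and sequence length
    estimated_hours = (num_batches * training_args.num_train_epochs * 2.0) / 3600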
    
    print(f"\n⏱️ Training estimation:")
    print(f"  Total steps: {total_steps}")
    print(f"  Estimated time: {estimated_hours:.1f} hours")

else:
    print("❌ Cannot setup training - missing model or data")
    trainer = None
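
The step count in the estimate above depends on three settings: dataset size, per-device batch size, and gradient accumulation. Here is a minimal standalone sketch of that arithmetic. The function name and all numbers are hypothetical, and the rounding is an approximation; the exact behaviour of `Trainer` can differ slightly between `transformers` versions.

```python
import math


def estimate_optimizer_steps(
    num_examples: int,
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    num_epochs: float,
) -> int:
    """Approximate number of optimizer updates for a single-GPU run."""
    # Batches the dataloader yields per epoch (last partial batch included)
    batches_per_epoch = math.ceil(num_examples / per_device_batch_size)
    # One optimizer update happens every `gradient_accumulation_steps` batches
    updates_per_epoch = max(batches_per_epoch // gradient_accumulation_steps, 1)
    return math.ceil(updates_per_epoch * num_epochs)


# Hypothetical settings for illustration only
steps = estimate_optimizer_steps(
    num_examples=1000,
    per_device_batch_size=4,
    gradient_accumulation_steps=4,
    num_epochs=3,
)
seconds_per_step = 8.0  # assumed; measure a few real steps to calibrate
print(f"Estimated optimizer steps: {steps}")
print(f"Estimated time: {steps * seconds_per_step / 3600:.2f} hours")
```

Timing the first few dozen real steps and plugging that value in for `seconds_per_step` usually gives a far better estimate than any fixed constant.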

In [None]:
# Start training
if trainer is not None:
    print(f"🚀 Starting training at {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"=" * 60)
    
    # Clear memory before training
    clear_memory()
    
    try:
        start_time = time.time()
        
        # Start training
        training_result = trainer.train()
        
        end_time = time.time()
        training_duration = (end_time - start_time) / 3600  # hours
        
        print(f"\n✅ Training completed successfully!")
        print(f"Training time: {training_duration:.2f} hours")
        print(f"Final training loss: {training_result.training_loss:.4f}")
        
        # Save checkpoint info
        training_success = True
        
    except Exception as e:
        print(f"❌ Training failed: {e}")
        print("A common cause on Colab is running out of GPU memory; try a smaller batch size or sequence length")
        
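        # Try emergency save and report if it fails instead of silently ignoring it
        try:
            emergency_path = CHECKPOINT_DIR / "emergency_checkpoint"
            trainer.save_model(str(emergency_path))
            print(f"Emergency checkpoint saved to: {emergency_path}")
        except Exception as save_error:
            print(f"⚠️ Emergency checkpoint could not be saved: {save_error}")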
        
        training_success = False
        training_result = None

else:
    print("❌ Cannot start training - trainer not ready")
    training_success = False
    training_result = None
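
Long Colab sessions can be interrupted before training finishes. If `TRAINING_CONFIG` saves periodic checkpoints (an assumption, since the config is defined earlier in the notebook), `Trainer` writes them as `checkpoint-<step>` folders inside its output directory, and `trainer.train(resume_from_checkpoint=...)` can continue from one of them. The helper below is a hypothetical sketch for locating the most recent folder; `transformers.trainer_utils.get_last_checkpoint` offers similar functionality.

```python
import re
from pathlib import Path
from typing import Optional

# Trainer names checkpoint folders "checkpoint-<global_step>"
_CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: Path) -> Optional[Path]:
    """Return the checkpoint folder with the highest step number, or None."""
    if not output_dir.is_dir():
        return None

    candidates = []
    for child in output_dir.iterdir():
        match = _CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))

    if not candidates:
        return None
    # Compare numerically so that checkpoint-1000 ranks above checkpoint-500
    return max(candidates, key=lambda item: item[0])[1]


# Possible usage in this notebook (not executed here):
# latest = find_latest_checkpoint(Path(training_args.output_dir))
# trainer.train(resume_from_checkpoint=str(latest) if latest else None)
```

Saving checkpoints to a mounted Google Drive folder is one way to keep them across runtime resets, at the cost of slower writes.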

In [None]:
# Save final model
if training_success and trainer is not None:
    print(f"💾 Saving final model...")
    
    # Save LoRA model
    lora_model_dir = FINAL_MODEL_DIR / "lora_model"
    lora_model_dir.mkdir(parents=True, exist_ok=True)
    
    model.save_pretrained(lora_model_dir)
    train_tokenizer.save_pretrained(lora_model_dir)
    
    print(f"✅ LoRA model saved to: {lora_model_dir}")
    
    # Save training info
    model_info = {
        'base_model': MODEL_NAME,
        'model_type': 'LLaMA-2-7B with LoRA',
        'task': 'Insurance Domain Fine-tuning',
        'lora_config': LORA_CONFIG,
        'training_config': TRAINING_CONFIG,
        'training_completed': datetime.now().isoformat(),
        'final_loss': training_result.training_loss if training_result else None,
        'total_steps': training_result.global_step if training_result else None
    }
    
    info_file = FINAL_MODEL_DIR / "model_info.json"
    with open(info_file, 'w') as f:
        # default=str keeps the dump from failing on non-JSON values such as Path objects
        json.dump(model_info, f, indent=2, default=str)
    
    print(f"✅ Model info saved to: {info_file}")
    print(f"\n🎉 Training pipeline completed successfully!")
    
else:
    print("❌ Cannot save model - training was not successful")

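# Release the training objects so clear_memory() can actually free GPU memory
# before the evaluation model is loaded
trainer = None
model = None
base_model = None
clear_memory()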

# Part 5: Model Evaluation

Evaluate the fine-tuned model's performance on insurance tasks.

In [None]:
# Evaluation configuration
EVALUATION_RESULTS_DIR = Path("outputs/evaluation")
EVALUATION_RESULTS_DIR.mkdir(parents=True, exist_ok=True)

LORA_MODEL_PATH = Path("outputs/final_model/lora_model")

# Generation configuration for evaluation
# (sampling is enabled, so outputs and exact-match scores vary between runs)
GENERATION_CONFIG = {
    "max_new_tokens": 256,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 50,
    "do_sample": True,
    "repetition_penalty": 1.1
}

print(f"📊 Evaluation setup:")
print(f"  Model path: {LORA_MODEL_PATH}")
print(f"  Results dir: {EVALUATION_RESULTS_DIR}")
print(f"  Max tokens: {GENERATION_CONFIG['max_new_tokens']}")
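
To make the `temperature` and `top_p` settings above more concrete, here is a conceptual sketch of what they do to a next-token distribution. It uses a toy vocabulary with made-up logits and plain Python; it is not the `transformers` implementation, and the helper names are hypothetical.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by temperature (lower = sharper distribution)."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_scaled = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - max_scaled) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Made-up logits for four candidate next tokens
toy_logits = {"deductible": 3.0, "premium": 2.0, "coverage": 1.0, "banana": -2.0}

probs = apply_temperature(toy_logits, temperature=0.7)
nucleus = top_p_filter(probs, top_p=0.9)

print("After temperature:", {t: round(p, 3) for t, p in probs.items()})
print("After top-p:", {t: round(p, 3) for t, p in nucleus.items()})
```

For reproducible metrics, one option is to evaluate with `do_sample` set to `False` (greedy decoding) and keep sampling for the interactive demo.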

In [None]:
# Load fine-tuned model for evaluation
def load_finetuned_model_for_eval() -> Tuple[Optional[PeftModel], Optional[AutoTokenizer]]:
    """Load the fine-tuned LoRA model for evaluation"""
    
    if not LORA_MODEL_PATH.exists():
        print(f"❌ LoRA model not found at {LORA_MODEL_PATH}")
        return None, None
    
    try:
        print(f"Loading fine-tuned model for evaluation...")
        
        # Load tokenizer
        tokenizer = AutoTokenizer.from_pretrained(LORA_MODEL_PATH)
        
        # Load base model
        base_model = AutoModelForCausalLM.from_pretrained(
            MODEL_NAME,
            torch_dtype=torch.float16,
            device_map="auto",
            trust_remote_code=True
        )
        
        # Load LoRA model
        model = PeftModel.from_pretrained(base_model, LORA_MODEL_PATH)
        
        # Set pad token
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        
        GENERATION_CONFIG['pad_token_id'] = tokenizer.pad_token_id
        
        print(f"✅ Fine-tuned model loaded for evaluation")
        return model, tokenizer
        
    except Exception as e:
        print(f"❌ Error loading fine-tuned model: {e}")
        return None, None

# Load model if training was successful
if training_success:
    eval_model, eval_tokenizer = load_finetuned_model_for_eval()
else:
    print("⚠️ Skipping evaluation - no trained model available")
    eval_model, eval_tokenizer = None, None

In [None]:
# Load test data for evaluation
def load_test_data_for_eval() -> List[Dict]:
    """Load test data for evaluation"""
    
    test_file = Path("data/processed/combined/test.json")
    
    if test_file.exists():
        with open(test_file, 'r') as f:
            test_data = json.load(f)
        print(f"✅ Test data loaded: {len(test_data)} examples")
        return test_data
    else:
        print(f"❌ Test data not found at {test_file}")
        return []

# Evaluation functions
def generate_response(model, tokenizer, prompt: str, generation_config: dict) -> str:
    """Generate response from model"""
    
    try:
        inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True, max_length=2048)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                **generation_config,
                use_cache=True
            )
        
        # Decode response (remove input)
        input_length = inputs['input_ids'].shape[1]
        generated_tokens = outputs[0][input_length:]
        response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
        
        return response.strip()
        
    except Exception as e:
        print(f"Error generating response: {e}")
        return ""

def calculate_simple_metrics(predictions: List[str], references: List[str]) -> Dict[str, float]:
    """Calculate simple evaluation metrics"""
    
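    # Return the same keys as the normal path so callers can index them safely
    if not predictions or not references:
        return {
            'exact_match_accuracy': 0.0,
            'average_response_length': 0.0,
            'word_overlap_score': 0.0
        }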
    
    # Simple exact match accuracy
    exact_matches = sum(1 for p, r in zip(predictions, references) 
                       if p.lower().strip() == r.lower().strip())
    accuracy = exact_matches / len(predictions)
    
    # Average response length
    avg_length = np.mean([len(p.split()) for p in predictions])
    
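    # Unigram recall: share of unique reference words that also appear in the prediction
    # (a rough stand-in for ROUGE-1 recall; extra words in the prediction are not penalized)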
    overlap_scores = []
    for pred, ref in zip(predictions, references):
        pred_words = set(pred.lower().split())
        ref_words = set(ref.lower().split())
        if ref_words:
            overlap = len(pred_words & ref_words) / len(ref_words)
            overlap_scores.append(overlap)
        else:
            overlap_scores.append(0.0)
    
    avg_overlap = np.mean(overlap_scores) if overlap_scores else 0.0
    
    return {
        'exact_match_accuracy': accuracy,
        'average_response_length': avg_length,
        'word_overlap_score': avg_overlap
    }

# Load test data
if eval_model is not None:
    test_data = load_test_data_for_eval()
else:
    test_data = []
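
A small worked example helps when reading the metrics defined above. The sketch below reimplements the two per-example scores on made-up strings, using the same lowercasing and whitespace splitting as `calculate_simple_metrics`. The function names and sentences are hypothetical.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive comparison after trimming surrounding whitespace."""
    return prediction.lower().strip() == reference.lower().strip()


def word_overlap_recall(prediction: str, reference: str) -> float:
    """Fraction of unique reference words that also appear in the prediction."""
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    if not ref_words:
        return 0.0
    return len(pred_words & ref_words) / len(ref_words)


reference = "The deductible is $1,000"

# A longer answer containing every reference word scores full overlap but no exact match
verbose = "The deductible is $1,000 per claim"
print(exact_match(verbose, reference), word_overlap_recall(verbose, reference))

# A terse answer is penalized, partly because "deductible:" keeps its punctuation
terse = "Deductible: $1,000"
print(exact_match(terse, reference), word_overlap_recall(terse, reference))
```

Because the overlap score is recall-only and punctuation-sensitive, it is best read alongside the sample predictions rather than on its own. The `rouge-score` package installed in Part 1 is an option for a stricter measure.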

In [None]:
# Run evaluation
if eval_model is not None and eval_tokenizer is not None and test_data:
    print(f"🔍 Running evaluation on {len(test_data)} test examples...")
    
    # Group by task type
    task_examples = {}
    for example in test_data:
        task_type = example.get('task_type', 'POLICY_SUMMARIZATION')
        if task_type not in task_examples:
            task_examples[task_type] = []
        task_examples[task_type].append(example)
    
    print(f"\nTask distribution:")
    for task_type, examples in task_examples.items():
        print(f"  {task_type}: {len(examples)} examples")
    
    # Run evaluation for each task
    evaluation_results = {}
    
    for task_type, examples in task_examples.items():
        print(f"\nEvaluating {task_type}...")
        
        predictions = []
        references = []
        sample_outputs = []
        
        # Limit to first 5 examples for demo purposes (faster)
        eval_examples = examples[:5]
        
        for i, example in enumerate(tqdm(eval_examples, desc=f"Evaluating {task_type}")):
            # Create prompt
            instruction = example.get('instruction', '')
            input_text = example.get('input', '')
            expected_output = example.get('output', '')
            
            prompt = f"[INST] {instruction}\n\n{input_text} [/INST]"
            
            # Generate prediction
            response = generate_response(eval_model, eval_tokenizer, prompt, GENERATION_CONFIG)
            
            predictions.append(response)
            references.append(expected_output)
            
            # Save sample for manual review
            if i < 3:  # Save first 3 examples
                sample_outputs.append({
                    'prompt': prompt[:200] + '...' if len(prompt) > 200 else prompt,
                    'prediction': response,
                    'reference': expected_output,
                    'match': response.lower().strip() == expected_output.lower().strip()
                })
        
        # Calculate metrics
        metrics = calculate_simple_metrics(predictions, references)
        
        evaluation_results[task_type] = {
            'task_type': task_type,
            'num_examples': len(eval_examples),
            'metrics': metrics,
            'samples': sample_outputs
        }
        
        print(f"  Exact match accuracy: {metrics['exact_match_accuracy']:.3f}")
        print(f"  Word overlap score: {metrics['word_overlap_score']:.3f}")
        print(f"  Avg response length: {metrics['average_response_length']:.1f} words")
    
    print(f"\n✅ Evaluation completed!")
    
else:
    print("⚠️ Skipping evaluation - model or test data not available")
    evaluation_results = {}

In [None]:
# Display evaluation results
if evaluation_results:
    print("\n" + "="*80)
    print("EVALUATION RESULTS SUMMARY")
    print("="*80)
    
    # Create summary table
    summary_data = []
    for task_type, results in evaluation_results.items():
        metrics = results['metrics']
        summary_data.append({
            'Task': TASK_TYPES.get(task_type, task_type),
            'Examples': results['num_examples'],
            'Exact Match': f"{metrics['exact_match_accuracy']:.3f}",
            'Word Overlap': f"{metrics['word_overlap_score']:.3f}",
            'Avg Length': f"{metrics['average_response_length']:.1f}"
        })
    
    summary_df = pd.DataFrame(summary_data)
    print(summary_df.to_string(index=False))
    
    # Show sample predictions
    print("\n" + "="*80)
    print("SAMPLE PREDICTIONS")
    print("="*80)
    
    for task_type, results in evaluation_results.items():
        print(f"\n🔍 {task_type}")
        print("-" * 60)
        
        for i, sample in enumerate(results['samples'][:2], 1):  # Show first 2
            print(f"\nExample {i}:")
            print(f"Prompt: {sample['prompt']}")
            print(f"Prediction: {sample['prediction']}")
            print(f"Reference: {sample['reference']}")
            print(f"Match: {'✅' if sample['match'] else '❌'}")
            print("-" * 40)
    
    # Save results
    results_file = EVALUATION_RESULTS_DIR / 'evaluation_results.json'
    final_results = {
        'metadata': {
            'evaluation_date': datetime.now().isoformat(),
            'model_path': str(LORA_MODEL_PATH),
            'base_model': MODEL_NAME,
            'total_examples': sum(r['num_examples'] for r in evaluation_results.values())
        },
        'results': evaluation_results
    }
    
    with open(results_file, 'w') as f:
        json.dump(final_results, f, indent=2)
    
    # Save summary CSV
    summary_file = EVALUATION_RESULTS_DIR / 'evaluation_summary.csv'
    summary_df.to_csv(summary_file, index=False)
    
    print(f"\n💾 Results saved to:")
    print(f"  Detailed: {results_file}")
    print(f"  Summary: {summary_file}")

else:
    print("❌ No evaluation results to display")

# Part 6: Interactive Inference Demo

Test the trained model with interactive examples.

In [None]:
# Interactive inference setup
def test_model_inference(model, tokenizer, prompt: str, max_tokens: int = 200):
    """Test the model with a custom prompt"""
    
    if model is None or tokenizer is None:
        return "❌ Model not available for inference"
    
    try:
        # Format prompt
        formatted_prompt = f"[INST] {prompt} [/INST]"
        
        # Generate response
        inputs = tokenizer(formatted_prompt, return_tensors="pt", padding=True, truncation=True)
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        
        generation_config = {
            'max_new_tokens': max_tokens,
            'temperature': 0.7,
            'top_p': 0.9,
            'do_sample': True,
            'pad_token_id': tokenizer.pad_token_id,
            'repetition_penalty': 1.1
        }
        
        with torch.no_grad():
            outputs = model.generate(**inputs, **generation_config)
        
        # Decode response
        input_length = inputs['input_ids'].shape[1]
        generated_tokens = outputs[0][input_length:]
        response = tokenizer.decode(generated_tokens, skip_special_tokens=True)
        
        return response.strip()
        
    except Exception as e:
        return f"❌ Error generating response: {e}"

# Test examples for different insurance tasks
test_prompts = {
    "Policy Summary": "Summarize this health insurance policy: This policy provides comprehensive medical coverage including hospital stays, doctor visits, and prescription drugs. Annual deductible is $2,000 with 80% coverage after deductible. Monthly premium is $400.",
    
    "Claim Classification": "Classify this insurance claim: Customer's vehicle was damaged in a parking lot collision. Estimated repair cost is $3,500. No injuries reported. Police report filed.",
    
    "FAQ Generation": "Generate FAQs for this life insurance policy: $500,000 term life insurance with 20-year level premiums. Coverage includes accidental death benefit. Monthly premium is $75.",
    
    "Compliance Check": "Identify compliance requirements in this document: All customer health information must be protected according to privacy regulations. Claims must be processed within 30 days. Marketing materials must be truthful.",
    
    "Contract Q&A": "Based on this policy, answer: What is the deductible? Policy text: This homeowners insurance includes dwelling coverage up to $300,000 with a $1,000 deductible. Personal property coverage is 50% of dwelling amount."
}

print("🤖 Interactive Inference Demo")
print("=" * 50)

if eval_model is not None and eval_tokenizer is not None:
    print("✅ Model ready for inference testing")
    
    for task_name, prompt in test_prompts.items():
        print(f"\n📝 {task_name}")
        print("-" * 30)
        print(f"Input: {prompt[:100]}...")
        
        response = test_model_inference(eval_model, eval_tokenizer, prompt)
        print(f"Response: {response}")
        print()
else:
    print("❌ Model not available for inference testing")
    print("This could be because:")
    print("- Training was not successful")
    print("- Model failed to load")
    print("- GPU memory issues")

In [None]:
# Custom prompt testing (you can modify this prompt)
print("🎯 Custom Prompt Testing")
print("=" * 30)

# You can change this prompt to test different scenarios
custom_prompt = "Explain the difference between term life insurance and whole life insurance in simple terms."

if eval_model is not None and eval_tokenizer is not None:
    print(f"Testing custom prompt: {custom_prompt}")
    print("\nResponse:")
    
    custom_response = test_model_inference(eval_model, eval_tokenizer, custom_prompt, max_tokens=300)
    print(custom_response)
    
    print("\n" + "="*50)
    print("✨ Feel free to modify the custom_prompt variable above to test different scenarios!")
    
else:
    print("❌ Model not available for custom prompt testing")

# Clear memory one final time
clear_memory()

# Pipeline Complete! 🎉

## Summary
This notebook walks through the entire LLaMA insurance fine-tuning pipeline (check the cell outputs above to confirm each stage succeeded):

1. ✅ **Data Preprocessing** - Created and cleaned sample insurance documents
2. ✅ **Tokenization** - Formatted data for LLaMA instruction tuning
3. ✅ **Fine-tuning** - Trained the model using LoRA for efficiency
4. ✅ **Evaluation** - Tested model performance on insurance tasks
5. ✅ **Inference Demo** - Interactive testing of the trained model

## Files Created
- **Model**: `outputs/final_model/lora_model/` - Your fine-tuned LoRA model
- **Evaluation**: `outputs/evaluation/` - Performance metrics and results
- **Data**: `data/processed/` and `data/tokenized/` - Processed datasets

## Next Steps
1. **Experiment** with different prompts in the inference demo above
2. **Improve** the model by adding more diverse training data
3. **Deploy** the model for production use
4. **Share** your model on Hugging Face Hub

## Need Help?
- Review the individual notebooks (01-05) for detailed explanations
- Check the outputs directory for detailed logs and results
- Modify the configuration sections to experiment with different settings

**Great job completing the LLaMA insurance fine-tuning pipeline! 🚀**