# Notebook 11: Performance, Caching, and Cost Analysis

**Learning Objectives:**
- Measure inference latency and throughput
- Understand HuggingFace's caching system
- Estimate computational costs and memory requirements
- Optimize model performance
- Manage cache and storage

## Prerequisites

### Hardware Requirements

This notebook can run on any system used for previous notebooks.

| Requirement | Specification |
|-------------|---------------|
| **CPU** | Any modern CPU |
| **RAM** | 4GB minimum |
| **Storage** | 10GB+ free (for cache analysis) |
| **GPU** | Optional (for performance comparison) |

### Software Requirements
- Python 3.8+
- Libraries: `transformers`, `torch`, `psutil`, `huggingface_hub` (installed as a dependency of `transformers`)
- See `requirements.txt` for full list

## Expected Behaviors

### What You'll Learn
- How to measure latency accurately
- Where HuggingFace stores cached models
- How to calculate storage and compute costs
- Techniques to optimize inference speed

### Performance Metrics
- **Latency**: Time for single prediction (milliseconds)
- **Throughput**: Predictions per second
- **Memory**: RAM/VRAM usage during inference
- **Storage**: Disk space for cached models

### Cache Behavior
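- First model load: Downloads files to the cache (time depends on model size and network bandwidth)
- Subsequent loads: Read from the local cache (no download, but weights are still loaded from disk into memory)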
- Cache location: `~/.cache/huggingface/hub/`
- Models persist until manually deleted
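
The cache location listed above is only the default; it can be moved with environment variables. The sketch below mirrors the precedence documented for `huggingface_hub` (`HF_HUB_CACHE`, then `HF_HOME`, then `XDG_CACHE_HOME`). `resolve_hub_cache` is a hypothetical helper written for illustration, not a library function, and the library's own resolution may handle additional legacy variables.

```python
import os
from pathlib import Path


def resolve_hub_cache(env=None):
    """Return the hub cache directory implied by an environment mapping."""
    env = os.environ if env is None else env
    # 1. An explicit hub cache path takes precedence.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # 2. Otherwise the cache is the "hub" folder under HF_HOME.
    if env.get("HF_HOME"):
        return Path(env["HF_HOME"]).expanduser() / "hub"
    # 3. Fall back to XDG_CACHE_HOME, then to ~/.cache.
    base = env.get("XDG_CACHE_HOME") or str(Path.home() / ".cache")
    return Path(base).expanduser() / "huggingface" / "hub"


print(resolve_hub_cache())
```

Pointing `HF_HOME` at a larger disk before starting Python is a common way to keep large models off a small home partition.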

## Overview

**Performance Optimization** is crucial for deploying ML models in production.

**Key Concepts:**
- **Latency**: How fast is a single prediction?
- **Throughput**: How many predictions per second?
- **Caching**: Reusing downloaded models to save time and bandwidth
- **Cost**: Computational resources (time, memory, money)

**Why This Matters:**
- User experience depends on response time
- Cloud costs scale with compute time
- Cache management prevents storage issues
- Understanding trade-offs helps choose the right model

## Setup and Installation

In [None]:
# Import required libraries
import torch
import time
import os
import psutil
from pathlib import Path
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, set_seed
import warnings
warnings.filterwarnings('ignore')

# Set seed for reproducibility
set_seed(1103)

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Part 1: Understanding the Cache System

### Cache Location and Structure

In [None]:
# Find HuggingFace cache directory
from huggingface_hub import scan_cache_dir

# Get cache info
cache_info = scan_cache_dir()

print(f"=== HUGGINGFACE CACHE ===")
print(f"Cache location: {cache_info.cache_dir}")
print(f"Number of repos: {len(cache_info.repos)}")
print(f"Total size: {cache_info.size_on_disk / (1024**3):.2f} GB")

# Show cached models
# cache_info.repos is a frozenset, so sort it into a list before slicing
print(f"\n=== CACHED MODELS ===")
repos = sorted(cache_info.repos, key=lambda r: r.repo_id)
for i, repo in enumerate(repos[:10], 1):  # Show first 10
    print(f"{i}. {repo.repo_id}")
    print(f"   Size: {repo.size_on_disk / (1024**2):.2f} MB")
    print(f"   Last accessed: {repo.last_accessed_str}")
    print()

if len(cache_info.repos) > 10:
    print(f"... and {len(cache_info.repos) - 10} more")

### How Caching Works

In [None]:
# Demonstrate cache behavior
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

print("=== FIRST LOAD (may download if not cached) ===")
start = time.time()
classifier = pipeline("sentiment-analysis", model=MODEL_NAME, device=-1)
first_load_time = time.time() - start
print(f"Time: {first_load_time:.2f} seconds")

# Delete the pipeline to release its memory
del classifier
if torch.cuda.is_available():
    torch.cuda.empty_cache()

print("\n=== SECOND LOAD (from cache) ===")
start = time.time()
classifier = pipeline("sentiment-analysis", model=MODEL_NAME, device=-1)
second_load_time = time.time() - start
print(f"Time: {second_load_time:.2f} seconds")

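# If the model was already cached before this cell ran, both loads read from
# disk and the ratio will be close to 1x.
print(f"\n⚡ Second load was {first_load_time/second_load_time:.1f}x faster than the first")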

### Cache Management

In [None]:
# Analyze cache usage
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()

# Group by size
print("=== MODELS BY SIZE ===")
sorted_repos = sorted(cache_info.repos, key=lambda x: x.size_on_disk, reverse=True)

for i, repo in enumerate(sorted_repos[:5], 1):
    size_mb = repo.size_on_disk / (1024**2)
    print(f"{i}. {repo.repo_id:50s} {size_mb:8.2f} MB")

# Show cleanup strategy
# delete_revisions() expects revision commit hashes, not repo IDs
print("\n=== CACHE CLEANUP OPTIONS ===")
print("To delete specific model:")
print("  from huggingface_hub import scan_cache_dir")
print("  cache_info = scan_cache_dir()")
print("  to_delete = next(r for r in cache_info.repos if r.repo_id == 'org/model-name')")
print("  hashes = [rev.commit_hash for rev in to_delete.revisions]")
print("  delete_strategy = cache_info.delete_revisions(*hashes)")
print("  delete_strategy.execute()")
print("\nTo clear entire cache:")
print("  rm -rf ~/.cache/huggingface/hub/  # WARNING: Deletes all models!")
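
Before deleting anything, it helps to decide which models are worth removing. The following is a minimal sketch of one possible policy: pick entries that have not been accessed for a given number of days, largest first. `CacheEntry` and `select_stale` are hypothetical names used for illustration; the field names are chosen to match the `repo_id`, `size_on_disk`, and `last_accessed` attributes used earlier in this notebook, so the repos returned by `scan_cache_dir()` should work as input too.

```python
import time
from dataclasses import dataclass


@dataclass
class CacheEntry:
    # Hypothetical stand-in for the fields read from a cached repo
    repo_id: str
    size_on_disk: int      # bytes
    last_accessed: float   # Unix timestamp


def select_stale(entries, max_age_days, now=None):
    """Return entries not accessed within max_age_days, largest first."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = [e for e in entries if e.last_accessed < cutoff]
    return sorted(stale, key=lambda e: e.size_on_disk, reverse=True)


# Made-up example data with a fixed reference time
now = 1_700_000_000.0
entries = [
    CacheEntry("org/model-a", 268 * 1024**2, now - 90 * 86400),
    CacheEntry("org/model-b", 440 * 1024**2, now - 2 * 86400),
    CacheEntry("org/model-c", 1500 * 1024**2, now - 45 * 86400),
]
for e in select_stale(entries, max_age_days=30, now=now):
    print(f"{e.repo_id}: {e.size_on_disk / 1024**2:.0f} MB")
```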

## Part 2: Measuring Latency

### Single Inference Latency

In [None]:
# Accurate latency measurement
def measure_latency(pipeline_fn, input_data, num_warmup=3, num_runs=10):
    """
    Measure average latency with warmup.
    
    Args:
        pipeline_fn: HuggingFace pipeline
        input_data: Input to the pipeline
        num_warmup: Warmup iterations
        num_runs: Measurement iterations
    
    Returns:
        dict with latency statistics
    """
    # Warmup
    for _ in range(num_warmup):
        _ = pipeline_fn(input_data)
    
    # Measure
    times = []
    for _ in range(num_runs):
        start = time.perf_counter()
        _ = pipeline_fn(input_data)
        end = time.perf_counter()
        times.append(end - start)
    
    mean = sum(times) / len(times)
    # Population standard deviation of the measured run times
    variance = sum((t - mean) ** 2 for t in times) / len(times)
    return {
        'mean_ms': mean * 1000,
        'min_ms': min(times) * 1000,
        'max_ms': max(times) * 1000,
        'std_ms': variance ** 0.5 * 1000
    }

# Test with sentiment analysis
test_text = "This is a test sentence for measuring latency."
stats = measure_latency(classifier, test_text)

print("=== LATENCY STATISTICS ===")
print(f"Mean:   {stats['mean_ms']:.2f} ms")
print(f"Min:    {stats['min_ms']:.2f} ms")
print(f"Max:    {stats['max_ms']:.2f} ms")
print(f"StdDev: {stats['std_ms']:.2f} ms")
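
Mean and standard deviation can hide occasional slow requests, so production latency is often reported as percentiles instead. The sketch below uses the standard library's `statistics.quantiles` (Python 3.8+) on a list of per-request times in seconds. `latency_percentiles` is a hypothetical helper and the timings are synthetic, not measurements.

```python
import statistics


def latency_percentiles(times_s):
    """Summarize per-request latencies (seconds) as p50/p95/p99 in milliseconds."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(times_s, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49] * 1000,
        "p95_ms": cuts[94] * 1000,
        "p99_ms": cuts[98] * 1000,
    }


# Synthetic timings: 99 fast requests and one slow outlier
sample = [0.010] * 99 + [0.200]
for name, value in latency_percentiles(sample).items():
    print(f"{name}: {value:.2f}")
```

To apply this to real measurements, `measure_latency` would need to return its raw `times` list in addition to the summary statistics.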

### Model Size vs Latency Comparison

In [None]:
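# Compare different model sizes
# Note: bert-base-uncased has no fine-tuned classification head, so the head is
# randomly initialized. Its predictions are meaningless, but the latency is
# still representative of the architecture.
models_to_compare = [
    ("distilbert-base-uncased-finetuned-sst-2-english", "DistilBERT", "268MB"),
    ("bert-base-uncased", "BERT-base", "440MB")
]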

print("=== MODEL COMPARISON ===")
print(f"{'Model':<15} {'Size':<10} {'Latency (ms)':<15} {'Relative Speed'}")
print("-" * 60)

baseline_latency = None
for model_name, display_name, size in models_to_compare:
    try:
        # Load model
        pipe = pipeline("text-classification", model=model_name, device=-1)
        
        # Measure
        stats = measure_latency(pipe, test_text, num_warmup=2, num_runs=5)
        latency = stats['mean_ms']
        
        if baseline_latency is None:
            baseline_latency = latency
            relative = "1.0x (baseline)"
        else:
            relative = f"{latency/baseline_latency:.2f}x"
        
        print(f"{display_name:<15} {size:<10} {latency:>10.2f} ms   {relative}")
        
        # Clean up
        del pipe
    except Exception as e:
        print(f"{display_name:<15} {size:<10} Error: {str(e)[:20]}")

### CPU vs GPU Comparison

In [None]:
# Compare CPU vs GPU performance
if torch.cuda.is_available():
    print("=== CPU vs GPU COMPARISON ===")
    
    # CPU
    pipe_cpu = pipeline("sentiment-analysis", model=MODEL_NAME, device=-1)
    stats_cpu = measure_latency(pipe_cpu, test_text)
    
    # GPU
    pipe_gpu = pipeline("sentiment-analysis", model=MODEL_NAME, device=0)
    stats_gpu = measure_latency(pipe_gpu, test_text)
    
    print(f"CPU: {stats_cpu['mean_ms']:.2f} ms")
    print(f"GPU: {stats_gpu['mean_ms']:.2f} ms")
    print(f"\n⚡ GPU is {stats_cpu['mean_ms']/stats_gpu['mean_ms']:.1f}x faster")
    
    del pipe_cpu, pipe_gpu
else:
    print("GPU not available. Skipping CPU vs GPU comparison.")

## Part 3: Throughput and Batching

### Single vs Batch Processing

In [None]:
# Compare single vs batch processing
num_texts = 100
texts = [f"Sample text number {i} for throughput testing." for i in range(num_texts)]

print("=== THROUGHPUT COMPARISON ===")

# Single processing
start = time.perf_counter()
for text in texts:
    _ = classifier(text)
single_time = time.perf_counter() - start
single_throughput = num_texts / single_time

print(f"Single processing:")
print(f"  Time: {single_time:.2f} seconds")
print(f"  Throughput: {single_throughput:.2f} texts/second")

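# Batch processing
# Passing a list alone still runs one text at a time (the pipeline's default
# batch_size is 1), so set batch_size explicitly to batch the forward passes.
start = time.perf_counter()
_ = classifier(texts, batch_size=16)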
batch_time = time.perf_counter() - start
batch_throughput = num_texts / batch_time

print(f"\nBatch processing:")
print(f"  Time: {batch_time:.2f} seconds")
print(f"  Throughput: {batch_throughput:.2f} texts/second")

print(f"\n⚡ Batch processing is {batch_throughput/single_throughput:.1f}x faster")
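
The best batch size depends on the hardware and the model, so it is usually found by measuring. Below is a minimal sketch of a batch-size sweep. `sweep_batch_sizes` is a hypothetical helper, and `fake_model` is a stand-in that simulates a fixed per-call overhead plus per-item work so the example runs without loading a model.

```python
import time


def sweep_batch_sizes(run_batch, items, batch_sizes):
    """Time run_batch over all items for each batch size; return items/second."""
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        for i in range(0, len(items), bs):
            run_batch(items[i:i + bs])
        elapsed = time.perf_counter() - start
        results[bs] = len(items) / elapsed
    return results


def fake_model(batch):
    # Hypothetical stand-in for a model call: fixed overhead plus per-item work
    time.sleep(0.002 + 0.0005 * len(batch))
    return [len(text) for text in batch]


sample_texts = [f"Sample text number {i}" for i in range(64)]
sweep = sweep_batch_sizes(fake_model, sample_texts, [1, 8, 32])
for bs, tput in sweep.items():
    print(f"batch_size={bs:<3} {tput:8.1f} texts/second")
```

To sweep the real pipeline, `run_batch` could be something like `lambda batch: classifier(batch, batch_size=len(batch))`. On CPU the gain from batching is often small, and very large batches can even be slower because of padding.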

## Part 4: Memory Usage

### RAM Usage

In [None]:
# Measure memory usage
import gc

# Get baseline memory
gc.collect()
process = psutil.Process()
baseline_memory = process.memory_info().rss / (1024**2)  # MB

print(f"Baseline memory: {baseline_memory:.2f} MB")

# Load model and measure
print("\nLoading model...")
test_pipe = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=-1)
after_load_memory = process.memory_info().rss / (1024**2)

model_memory = after_load_memory - baseline_memory
print(f"Memory after loading: {after_load_memory:.2f} MB")
print(f"Model memory usage: {model_memory:.2f} MB")

# Run inference and measure
_ = test_pipe("Test text")
after_inference_memory = process.memory_info().rss / (1024**2)

print(f"\nMemory after inference: {after_inference_memory:.2f} MB")
print(f"Inference overhead: {after_inference_memory - after_load_memory:.2f} MB")

del test_pipe
gc.collect()
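
A quick sanity check for the measured numbers is to estimate weight memory from the parameter count: each parameter takes 4 bytes in FP32, 2 in FP16, and 1 in INT8. This is a back-of-the-envelope sketch; the parameter count used for DistilBERT (about 66 million) is an assumption, and real usage is higher because of activations, tokenizer data, and framework overhead.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def estimate_weight_memory_mb(num_params, dtype="fp32"):
    """Approximate memory needed for model weights alone, in MB (1024**2 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


# Assumed parameter count: roughly 66 million for DistilBERT
distilbert_params = 66_000_000
for dtype in BYTES_PER_PARAM:
    mb = estimate_weight_memory_mb(distilbert_params, dtype)
    print(f"{dtype}: {mb:.0f} MB")
```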

### GPU Memory (if available)

In [None]:
if torch.cuda.is_available():
    print("=== GPU MEMORY USAGE ===")
    
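    # Clear cached GPU memory and reset the peak counter so the peak reported
    # below reflects only this cell
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    baseline_gpu = torch.cuda.memory_allocated() / (1024**2)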
    
    print(f"Baseline GPU memory: {baseline_gpu:.2f} MB")
    
    # Load model on GPU
    gpu_pipe = pipeline("sentiment-analysis", model=MODEL_NAME, device=0)
    after_load_gpu = torch.cuda.memory_allocated() / (1024**2)
    
    print(f"GPU memory after loading: {after_load_gpu:.2f} MB")
    print(f"Model size on GPU: {after_load_gpu - baseline_gpu:.2f} MB")
    
    # Run inference
    _ = gpu_pipe("Test text")
    after_inference_gpu = torch.cuda.memory_allocated() / (1024**2)
    
    print(f"GPU memory after inference: {after_inference_gpu:.2f} MB")
    
    # Peak memory
    peak_gpu = torch.cuda.max_memory_allocated() / (1024**2)
    print(f"Peak GPU memory: {peak_gpu:.2f} MB")
    
    del gpu_pipe
    torch.cuda.empty_cache()
else:
    print("GPU not available. Skipping GPU memory analysis.")

## Part 5: Cost Estimation

### Cloud Cost Calculator

In [None]:
# Estimate cloud costs
def estimate_cloud_cost(latency_seconds, requests_per_day, compute_cost_per_hour=0.50):
    """
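    Estimate daily cloud computing cost.

    Assumes you pay only for the seconds spent on inference (no idle time),
    so treat the result as a lower bound for an always-on instance.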
    
    Args:
        latency_seconds: Average inference time per request
        requests_per_day: Number of requests per day
        compute_cost_per_hour: Cloud compute cost (e.g., AWS, GCP)
    
    Returns:
        dict with cost breakdown
    """
    total_compute_seconds = latency_seconds * requests_per_day
    total_compute_hours = total_compute_seconds / 3600
    daily_cost = total_compute_hours * compute_cost_per_hour
    monthly_cost = daily_cost * 30
    yearly_cost = daily_cost * 365
    
    return {
        'compute_hours_per_day': total_compute_hours,
        'daily_cost': daily_cost,
        'monthly_cost': monthly_cost,
        'yearly_cost': yearly_cost
    }

# Example: Sentiment analysis service
latency = 0.05  # 50ms average
requests = 10000  # 10k requests per day

costs = estimate_cloud_cost(latency, requests, compute_cost_per_hour=0.50)

print("=== COST ESTIMATION ===")
print(f"Scenario: {requests:,} requests/day, {latency*1000:.0f}ms per request")
print(f"Compute cost: $0.50/hour (example)")
print(f"\nEstimated costs:")
print(f"  Daily:   ${costs['daily_cost']:.2f}")
print(f"  Monthly: ${costs['monthly_cost']:.2f}")
print(f"  Yearly:  ${costs['yearly_cost']:.2f}")

# Compare with faster model
print("\n=== OPTIMIZATION IMPACT ===")
faster_latency = latency * 0.5  # 2x faster model
faster_costs = estimate_cloud_cost(faster_latency, requests, compute_cost_per_hour=0.50)
savings = costs['yearly_cost'] - faster_costs['yearly_cost']

print(f"With 2x faster model:")
print(f"  Yearly cost: ${faster_costs['yearly_cost']:.2f}")
print(f"  Savings: ${savings:.2f}/year ({savings/costs['yearly_cost']*100:.0f}%)")
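
The calculator above charges only for the seconds spent on inference. Many deployments instead keep instances running around the clock and size them for peak traffic. The sketch below estimates that case under simplifying assumptions: requests are processed one at a time per instance, and instances are kept at a target utilization to leave headroom. `estimate_always_on_cost`, the 70% utilization target, and the peak rate of 5 requests/second are all illustrative assumptions.

```python
import math


def estimate_always_on_cost(latency_seconds, peak_requests_per_second,
                            cost_per_hour=0.50, target_utilization=0.7):
    """Estimate monthly cost when instances are billed for every hour they run."""
    # One instance serves about 1/latency requests per second when requests
    # run one at a time; scale that down by the target utilization.
    capacity_per_instance = target_utilization / latency_seconds
    instances = max(1, math.ceil(peak_requests_per_second / capacity_per_instance))
    monthly_cost = instances * cost_per_hour * 24 * 30
    return {"instances": instances, "monthly_cost": monthly_cost}


# 50 ms per request at $0.50/hour, with an assumed peak of 5 requests/second
result = estimate_always_on_cost(0.05, peak_requests_per_second=5)
print(f"Instances: {result['instances']}")
print(f"Monthly:   ${result['monthly_cost']:.2f}")
```

Comparing this figure with the pay-per-use estimate shows how much of an always-on bill can be idle time when traffic is low.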

### Storage Costs

In [None]:
# Estimate storage costs
def estimate_storage_cost(model_size_gb, storage_cost_per_gb_month=0.02):
    """
    Estimate monthly storage cost.
    
    Args:
        model_size_gb: Model size in GB
        storage_cost_per_gb_month: Storage cost (e.g., S3, GCS)
    
    Returns:
        Monthly storage cost
    """
    return model_size_gb * storage_cost_per_gb_month

print("=== STORAGE COST COMPARISON ===")
# Approximate FP32 checkpoint sizes in GB
models = [
    ("distilgpt2", 0.35),
    ("distilbert", 0.268),
    ("gpt2-medium", 1.5),
    ("bart-large", 1.6)
]

print(f"{'Model':<20} {'Size (GB)':<12} {'Monthly Cost'}")
print("-" * 50)

for name, size in models:
    cost = estimate_storage_cost(size, storage_cost_per_gb_month=0.02)
    print(f"{name:<20} {size:<12.3f} ${cost:.4f}")

print("\nNote: Storage costs are minimal compared to compute costs.")

## Part 6: Optimization Techniques

### Model Quantization

In [None]:
# Compare FP32 vs INT8 (quantized)
print("=== QUANTIZATION COMPARISON ===")
print("Note: Full quantization example requires additional setup.")
print("\nBenefits of quantization:")
print("  - Up to ~4x smaller weights (FP32 → INT8)")
print("  - Often faster inference on CPU; the gain depends on hardware and model")
print("  - Usually a small accuracy loss; measure it on your own task")
print("\nTrade-offs:")
print("  - Slight accuracy degradation")
print("  - Not all models support quantization")
print("  - May require calibration dataset")
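
To make the FP32 → INT8 size reduction concrete, here is a small pure-Python illustration of symmetric quantization with a single scale factor. It is a teaching sketch only, not how PyTorch or other toolkits implement quantization (those typically use per-channel scales, calibration data, and optimized kernels). The weight values are made up.

```python
from array import array


def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to int8 with one scale."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127
    q = array("b", (round(w / scale) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return [v * scale for v in q]


# Made-up FP32 weights stored as 4-byte floats
weights = array("f", [0.12, -0.53, 0.98, -1.27, 0.004, 0.66])
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

fp32_bytes = weights.itemsize * len(weights)
int8_bytes = q.itemsize * len(q)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(f"FP32: {fp32_bytes} bytes, INT8: {int8_bytes} bytes")
print(f"Max round-trip error: {max_error:.5f} (scale = {scale:.5f})")
```

The round-trip error is bounded by about half the scale, which is why tensors with a few very large values lose more precision under a single shared scale.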

### Efficient Tokenization

In [None]:
# Fast tokenizers
from transformers import AutoTokenizer

print("=== TOKENIZER COMPARISON ===")

tokenizer_slow = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
tokenizer_fast = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

test_texts = ["Sample text for tokenization speed test."] * 1000

# Slow tokenizer
start = time.perf_counter()
for text in test_texts:
    _ = tokenizer_slow(text)
slow_time = time.perf_counter() - start

# Fast tokenizer
start = time.perf_counter()
for text in test_texts:
    _ = tokenizer_fast(text)
fast_time = time.perf_counter() - start

print(f"Slow tokenizer: {slow_time:.3f} seconds")
print(f"Fast tokenizer: {fast_time:.3f} seconds")
print(f"\n⚡ Fast tokenizer is {slow_time/fast_time:.1f}x faster")

## Exercises

1. **Cache Analysis**: Explore your cache directory and calculate total storage used by all models.

2. **Latency Profiling**: Measure latency for image classification and compare with text classification. Which is faster?

3. **Cost Modeling**: Calculate the cost of running a chatbot with 1 million requests per month.

4. **Batch Optimization**: Experiment with different batch sizes. What's the optimal batch size for your hardware?

5. **Memory Profiling**: Load multiple models simultaneously and monitor memory usage. When do you run out of RAM?

In [None]:
# Your code here for exercises


## Key Takeaways

✅ **Caching** saves time and bandwidth by reusing downloaded models

✅ **Latency** varies by model size, hardware, and optimization

✅ **Batch processing** can improve throughput, especially on GPU

✅ **GPU** can give large speedups, particularly with batching; measure on your own hardware

✅ **Costs** scale mainly with compute time; storage is comparatively cheap

✅ **Optimization** (quantization, fast tokenizers) reduces latency and costs

✅ **Trade-offs** exist between speed, accuracy, and resource usage

## Next Steps

- Try **Notebook 12**: Model Cards and Responsible AI
- Explore [HuggingFace Optimization Docs](https://huggingface.co/docs/transformers/performance)
- Learn about [Model Compression Techniques](https://huggingface.co/docs/optimum/)

## Resources

- [HuggingFace Cache Management](https://huggingface.co/docs/huggingface_hub/guides/manage-cache)
- [Performance and Optimization](https://huggingface.co/docs/transformers/performance)
- [Cloud Pricing Calculators](https://aws.amazon.com/pricing/)