# Lab 3: Memory Profiling and Optimization

**Module**: Module 1 - Foundations  
**Estimated Time**: 30-45 minutes  
**Difficulty**: Intermediate  

---

## Learning Objectives

By completing this lab, you will:
- [ ] Profile memory usage during model loading and inference
- [ ] Calculate theoretical memory requirements for LLMs
- [ ] Understand the KV cache and its impact on memory
- [ ] Experiment with different context window sizes
- [ ] Observe memory scaling with batch size and sequence length
- [ ] Optimize memory usage for your hardware constraints

## Prerequisites

- Completed Lab 1 (Setup and First Inference)
- Completed Lab 2 (GGUF Exploration)
- Understanding of model architecture basics
- Python programming knowledge

## What You'll Learn

Memory is often the limiting factor when running LLMs. In this lab, you'll learn to:
- Profile and measure actual memory usage
- Calculate memory requirements before loading models
- Understand the KV cache and its growth
- Optimize for your available RAM

---

## Part 1: Setup and Memory Monitoring (10 minutes)

First, let's set up tools for monitoring memory usage in real-time.

In [None]:
# Install required packages
!pip install psutil memory_profiler matplotlib -q

In [None]:
import psutil
import time
import os
from pathlib import Path
import numpy as np
import matplotlib.pyplot as plt
from llama_cpp import Llama

# Set up matplotlib
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')

print("✓ All packages imported successfully")

In [None]:
# Helper function to get current memory usage
def get_memory_usage():
    """
    Get current process memory usage in MB.
    
    Returns:
        dict: Memory usage statistics
    """
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    
    return {
        'rss_mb': mem_info.rss / (1024 ** 2),  # Resident Set Size
        'vms_mb': mem_info.vms / (1024 ** 2),  # Virtual Memory Size
        'percent': process.memory_percent()
    }

# Test it
initial_memory = get_memory_usage()
print(f"Current memory usage:")
print(f"  RSS: {initial_memory['rss_mb']:.2f} MB")
print(f"  VMS: {initial_memory['vms_mb']:.2f} MB")
print(f"  Percent: {initial_memory['percent']:.2f}%")

In [None]:
# Memory tracking context manager
class MemoryTracker:
    """Context manager for tracking memory usage."""
    
    def __init__(self, name="Operation", interval=0.1):
        self.name = name
        self.interval = interval
        self.memory_samples = []
        self.timestamps = []
        self.start_memory = None
        self.peak_memory = 0
        
    def __enter__(self):
        self.start_memory = get_memory_usage()['rss_mb']
        self.start_time = time.time()
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        self.end_memory = get_memory_usage()['rss_mb']
        self.end_time = time.time()
        self.duration = self.end_time - self.start_time
        self.memory_delta = self.end_memory - self.start_memory
        
        print(f"\n=== {self.name} ===")
        print(f"Start memory: {self.start_memory:.2f} MB")
        print(f"End memory: {self.end_memory:.2f} MB")
        print(f"Delta: {self.memory_delta:+.2f} MB")
        print(f"Duration: {self.duration:.2f} seconds")
        
    def sample(self):
        """Take a memory sample."""
        current = get_memory_usage()['rss_mb']
        self.memory_samples.append(current)
        self.timestamps.append(time.time() - self.start_time)
        self.peak_memory = max(self.peak_memory, current)

# Test the tracker
with MemoryTracker("Test allocation"):
    # Allocate some memory
    big_array = np.zeros((1000, 1000))
    time.sleep(0.1)
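
RSS reflects everything the process holds, including native allocations made inside llama.cpp. As a complement, the standard-library `tracemalloc` module can report the peak of Python-level allocations during a call. The following is a minimal sketch, not part of the original lab: `measure_python_peak` and `build_buffer` are hypothetical helper names, and `tracemalloc` does not see memory allocated by native libraries.

```python
import tracemalloc

def measure_python_peak(fn, *args, **kwargs):
    """Run fn and return (result, current_mb, peak_mb) for Python allocations."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, current / (1024 ** 2), peak / (1024 ** 2)

def build_buffer(size_mb):
    """Allocate a temporary buffer and return only its length (hypothetical workload)."""
    buffer = bytearray(size_mb * 1024 * 1024)
    return len(buffer)

length, current_mb, peak_mb = measure_python_peak(build_buffer, 8)
print(f"Buffer length: {length} bytes")
print(f"Still allocated: {current_mb:.2f} MB, peak: {peak_mb:.2f} MB")
```

The peak covers the temporary buffer even though it has been freed by the time the call returns, which a before/after RSS comparison would miss.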

---

## Part 2: Model Loading Memory Profile (10 minutes)

Let's measure how much memory is consumed when loading a model.

In [None]:
# Model path from previous labs
MODEL_PATH = Path("./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf")

if not MODEL_PATH.exists():
    print("✗ Model not found. Please complete Lab 1 first.")
else:
    model_size_mb = MODEL_PATH.stat().st_size / (1024 ** 2)
    print(f"✓ Model file size: {model_size_mb:.2f} MB")

In [None]:
# Measure model loading
print("Loading model with memory tracking...\n")

with MemoryTracker("Model Loading") as tracker:
    llm = Llama(
        model_path=str(MODEL_PATH),
        n_ctx=2048,
        n_threads=4,
        verbose=False
    )
    tracker.sample()

print(f"\nModel file size: {model_size_mb:.2f} MB")
print(f"Memory overhead: {tracker.memory_delta - model_size_mb:.2f} MB")
print(f"Overhead ratio: {((tracker.memory_delta / model_size_mb) - 1) * 100:.1f}%")

### Understanding Memory Overhead

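The memory usage typically differs from the model file size for several reasons:
1. **Weights**: llama.cpp keeps quantized weights in quantized form and dequantizes them on the fly during computation, so the weights themselves cost about as much as the file
2. **Context Buffers**: Space for the KV cache and activations, which grows with `n_ctx`
3. **Runtime Structures**: Data structures for inference
4. **Alignment**: Memory alignment requirements

Note that llama.cpp memory-maps the model file by default, so weight pages only count toward RSS once they have been read. The delta measured right after loading can therefore be smaller than the file size, and the overhead printed above can even be negative.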

### Exercise 2.1: Test Different Context Sizes

Load the model with different context window sizes and observe memory usage.

In [None]:
# TODO: Test different context sizes
# YOUR CODE HERE

context_sizes = [512, 1024, 2048, 4096]
memory_usage = []

for ctx_size in context_sizes:
    # Clear previous model
    if 'llm' in locals():
        del llm
    
    # Force garbage collection
    import gc
    gc.collect()
    time.sleep(1)
    
    # Load with new context size
    # Measure memory
    # Store results
    pass

# Plot results
plt.figure(figsize=(10, 6))
# Plot memory usage vs context size
plt.xlabel('Context Size (tokens)')
plt.ylabel('Memory Usage (MB)')
plt.title('Memory Usage vs Context Window Size')
plt.grid(True)
plt.show()

---

## Part 3: Understanding KV Cache (15 minutes)

### What is the KV Cache?

The KV (Key-Value) cache stores attention keys and values from previous tokens to avoid recomputing them. This is crucial for efficient autoregressive generation.
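
To make the idea concrete, here is a toy single-head attention decoder in plain Python. It is only an illustrative sketch: the weight matrices are random, the dimensions are tiny, and all function names are made up for this example. It compares decoding with and without a cache and counts how many key/value projections each approach performs.

```python
import math
import random

def project(vec, matrix):
    """Multiply a vector by a (len(vec) x d_out) matrix stored as nested lists."""
    return [sum(x * row[j] for x, row in zip(vec, matrix)) for j in range(len(matrix[0]))]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]

random.seed(0)
D = 4  # toy hidden size
W_q, W_k, W_v = ([[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
                 for _ in range(3))
tokens = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(5)]

def decode_without_cache(tokens):
    """Recompute keys and values for every earlier token at each step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        keys = [project(x, W_k) for x in tokens[:t + 1]]
        values = [project(x, W_v) for x in tokens[:t + 1]]
        projections += 2 * (t + 1)
        outputs.append(attend(project(tokens[t], W_q), keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project each token once and keep its key and value in a growing cache."""
    k_cache, v_cache, outputs, projections = [], [], [], 0
    for x in tokens:
        k_cache.append(project(x, W_k))
        v_cache.append(project(x, W_v))
        projections += 2
        outputs.append(attend(project(x, W_q), k_cache, v_cache))
    return outputs, projections

plain_out, plain_count = decode_without_cache(tokens)
cached_out, cached_count = decode_with_cache(tokens)
print(f"K/V projections without cache: {plain_count}, with cache: {cached_count}")
```

Both paths produce the same outputs, but the cached version trades memory (the two growing lists) for far fewer projections. That stored state is what the formula in the next section measures.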

### KV Cache Memory Formula

```
KV_cache_size = 2 × n_layers × n_heads × d_head × seq_len × bytes_per_element
```

Where:
- **2**: For both keys and values
- **n_layers**: Number of transformer layers
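- **n_heads**: Number of attention heads whose keys and values are cached (for models with grouped-query attention, this is the number of KV heads, which is smaller than the number of query heads)
- **d_head**: Dimension per head (typically hidden_size divided by the number of query heads)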
- **seq_len**: Sequence length (context window)
- **bytes_per_element**: Typically 2 (FP16) or 4 (FP32)
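
Since every term except `seq_len` is fixed for a given model, the formula reduces to a constant cost per token. The sketch below works this out for a model with 32 layers, 32 heads and `d_head = 128` (the LLaMA-2-7B shape used later in this lab), assuming FP16 and one KV head per attention head. `kv_bytes_per_token` is a hypothetical helper name.

```python
def kv_bytes_per_token(n_layers, n_heads, d_head, bytes_per_element=2):
    """Bytes of KV cache added by each token in the sequence."""
    return 2 * n_layers * n_heads * d_head * bytes_per_element

per_token = kv_bytes_per_token(n_layers=32, n_heads=32, d_head=128)
print(f"Per token: {per_token / 1024:.0f} KiB")

# The cache grows linearly with the number of tokens held in context
for seq_len in (512, 2048, 4096):
    total_mib = per_token * seq_len / (1024 ** 2)
    print(f"{seq_len:>5} tokens: {total_mib:.0f} MiB")
```

At 512 KiB per token, a 2048-token context costs exactly 1 GiB, and doubling the context doubles the cache.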

In [None]:
def calculate_kv_cache_size(
    n_layers,
    n_heads,
    hidden_size,
    seq_len,
    bytes_per_element=2,
    n_kv_heads=None
):
    """
    Calculate KV cache size in MB.
    
    Args:
        n_layers: Number of transformer layers
        n_heads: Number of attention (query) heads
        hidden_size: Hidden dimension size
        seq_len: Sequence length / context window
        bytes_per_element: Bytes per element (2 for FP16, 4 for FP32)
        n_kv_heads: Number of key/value heads. Defaults to n_heads
            (standard multi-head attention); pass a smaller value for
            models that use grouped-query attention (GQA).
    
    Returns:
        float: KV cache size in MB
    """
    if n_kv_heads is None:
        n_kv_heads = n_heads
    d_head = hidden_size // n_heads
    
    # 2 for K and V; only the KV heads are stored in the cache
    total_elements = 2 * n_layers * n_kv_heads * d_head * seq_len
    total_bytes = total_elements * bytes_per_element
    total_mb = total_bytes / (1024 ** 2)
    
    return total_mb

# Example: TinyLlama-1.1B (uses GQA: 32 query heads share 4 KV heads)
tinyllama_params = {
    'n_layers': 22,
    'n_heads': 32,
    'n_kv_heads': 4,
    'hidden_size': 2048,
    'seq_len': 2048
}

kv_cache_mb = calculate_kv_cache_size(**tinyllama_params)
print(f"TinyLlama KV cache (2048 tokens, FP16): {kv_cache_mb:.2f} MB")

### Exercise 3.1: Calculate KV Cache for Different Models

Calculate the KV cache size for different model sizes and context lengths.

In [None]:
# TODO: Calculate KV cache for various configurations
# YOUR CODE HERE

models = {
    'TinyLlama-1.1B': {'n_layers': 22, 'n_heads': 32, 'hidden_size': 2048},
    'LLaMA-2-7B': {'n_layers': 32, 'n_heads': 32, 'hidden_size': 4096},
    'LLaMA-2-13B': {'n_layers': 40, 'n_heads': 40, 'hidden_size': 5120},
    'LLaMA-2-70B': {'n_layers': 80, 'n_heads': 64, 'hidden_size': 8192},
}

context_lengths = [512, 1024, 2048, 4096, 8192]

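# Create a table showing KV cache sizes
# Note: these figures assume every attention head stores its own keys and values.
# TinyLlama-1.1B and LLaMA-2-70B use grouped-query attention (4 and 8 KV heads),
# so their real KV caches are 8x smaller than the values printed here.
print("=== KV Cache Size (MB) - FP16 ===")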
print(f"{'Model':<20}", end="")
for ctx in context_lengths:
    print(f"{ctx:>10}", end="")
print()
print("="*70)

for model_name, params in models.items():
    print(f"{model_name:<20}", end="")
    for ctx_len in context_lengths:
        # Calculate KV cache size
        size = calculate_kv_cache_size(
            params['n_layers'],
            params['n_heads'],
            params['hidden_size'],
            ctx_len
        )
        print(f"{size:>10.1f}", end="")
    print()

### Exercise 3.2: Total Memory Calculator

Create a comprehensive memory calculator that estimates total memory requirements.

In [None]:
def estimate_total_memory(
    param_count_billions,
    bits_per_weight,
    n_layers,
    n_heads,
    hidden_size,
    seq_len,
    batch_size=1
):
    """
    Estimate total memory required to run a model.
    
    Returns:
        dict: Memory breakdown in MB
    """
    # TODO: Implement comprehensive memory calculation
    # YOUR CODE HERE
    
    # 1. Model weights
    model_weights_mb = (param_count_billions * 1e9 * bits_per_weight / 8) / (1024 ** 2)
    
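    # 2. KV cache (assumes one KV head per attention head, FP16;
    #    this is an upper bound for grouped-query attention models)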
    kv_cache_mb = calculate_kv_cache_size(
        n_layers, n_heads, hidden_size, seq_len
    ) * batch_size
    
    # 3. Activation memory (approximate)
    # Rule of thumb: ~2x hidden_size per layer per token
    activation_mb = (2 * n_layers * hidden_size * seq_len * 2 * batch_size) / (1024 ** 2)
    
    # 4. Overhead (runtime structures, etc.)
    overhead_mb = model_weights_mb * 0.1  # ~10% overhead
    
    total_mb = model_weights_mb + kv_cache_mb + activation_mb + overhead_mb
    
    return {
        'model_weights_mb': model_weights_mb,
        'kv_cache_mb': kv_cache_mb,
        'activation_mb': activation_mb,
        'overhead_mb': overhead_mb,
        'total_mb': total_mb,
        'total_gb': total_mb / 1024
    }

# Test with TinyLlama
memory = estimate_total_memory(
    param_count_billions=1.1,
    bits_per_weight=4.85,  # Q4_K_M
    n_layers=22,
    n_heads=32,
    hidden_size=2048,
    seq_len=2048,
    batch_size=1
)

print("=== TinyLlama-1.1B Q4_K_M Memory Estimate ===")
print(f"Model Weights: {memory['model_weights_mb']:.2f} MB")
print(f"KV Cache: {memory['kv_cache_mb']:.2f} MB")
print(f"Activations: {memory['activation_mb']:.2f} MB")
print(f"Overhead: {memory['overhead_mb']:.2f} MB")
print("─" * 50)
print(f"Total: {memory['total_mb']:.2f} MB ({memory['total_gb']:.2f} GB)")

---

## Part 4: Inference Memory Profiling (10 minutes)

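Let's observe how memory changes during actual inference. Because llama.cpp reserves the KV cache for the full `n_ctx` when the model is loaded, expect the deltas measured here to be small compared with the load-time figures.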

In [None]:
# Profile memory during inference with different sequence lengths
def profile_inference(prompt, max_tokens, label="Inference"):
    """Profile memory usage during inference."""
    with MemoryTracker(label) as tracker:
        tracker.sample()  # Before generation
        
        output = llm(
            prompt,
            max_tokens=max_tokens,
            temperature=0.7,
            echo=False
        )
        
        tracker.sample()  # After generation
    
    return {
        'memory_delta': tracker.memory_delta,
        'duration': tracker.duration,
        'tokens': output['usage']['completion_tokens']
    }

# Test with different generation lengths
prompt = "Explain machine learning:"

print("Profiling inference with different token counts...\n")

for max_tokens in [50, 100, 200, 500]:
    result = profile_inference(prompt, max_tokens, f"Generation ({max_tokens} tokens)")
    print(f"Tokens/MB: {result['tokens'] / max(result['memory_delta'], 0.1):.2f}\n")

### Exercise 4.1: Batch Size Impact

Investigate how memory scales with batch size (if your llama.cpp version supports batching).

In [None]:
# Note: llama-cpp-python may not directly expose batch inference
# This exercise demonstrates the concept

# TODO: Profile multiple sequential inferences to simulate batching
# YOUR CODE HERE

# Simulate batch processing
prompts = [
    "What is AI?",
    "Explain neural networks:",
    "What is deep learning?",
    "Define machine learning:"
]

# Process and measure memory for different "batch sizes"

---

## Part 5: Memory Optimization Strategies (5 minutes)

Let's explore strategies to reduce memory usage.

In [None]:
# Strategy comparison function
def compare_memory_strategies():
    """Compare different memory optimization strategies."""
    
    strategies = []
    
    # Strategy 1: Baseline (2048 context)
    mem1 = estimate_total_memory(
        1.1, 4.85, 22, 32, 2048, 2048
    )
    strategies.append(("Baseline (2048 ctx)", mem1['total_mb']))
    
    # Strategy 2: Reduced context
    mem2 = estimate_total_memory(
        1.1, 4.85, 22, 32, 2048, 1024
    )
    strategies.append(("Reduced context (1024)", mem2['total_mb']))
    
    # Strategy 3: More aggressive quantization (Q4_0)
    mem3 = estimate_total_memory(
        1.1, 4.5, 22, 32, 2048, 2048
    )
    strategies.append(("Q4_0 quantization", mem3['total_mb']))
    
    # Strategy 4: Both optimizations
    mem4 = estimate_total_memory(
        1.1, 4.5, 22, 32, 2048, 1024
    )
    strategies.append(("Q4_0 + reduced ctx", mem4['total_mb']))
    
    # Print comparison
    print("=== Memory Optimization Strategies ===")
    baseline = strategies[0][1]
    
    for name, memory in strategies:
        saving = ((baseline - memory) / baseline) * 100
        print(f"{name:<25} {memory:>8.2f} MB   {saving:>6.1f}% saving")

compare_memory_strategies()

### Memory Optimization Checklist

To reduce memory usage:

1. **Use aggressive quantization** (Q4_0, Q4_K_M)
2. **Reduce context window** (n_ctx parameter)
3. **Smaller model variant** (7B instead of 13B)
4. **Optimize batch size** (smaller batches)
5. **Keep memory mapping enabled** (`use_mmap=True`, the default in llama-cpp-python)
6. **Reduce thread count** (less overhead)
7. **Close other applications** (free system RAM)
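
Several checklist items map directly onto `Llama` constructor arguments. The configuration below is a sketch of a low-memory starting point, not a recommendation from the original lab: the specific values are assumptions to tune for your hardware, and the memory effect of `n_batch` in particular depends on your llama.cpp build.

```python
# Hypothetical low-memory settings; adjust the values for your machine.
low_memory_kwargs = {
    "n_ctx": 1024,       # smaller context window -> smaller KV cache
    "n_batch": 128,      # fewer prompt tokens per step -> smaller compute buffers
    "use_mmap": True,    # map the model file instead of copying it into RAM
    "use_mlock": False,  # do not pin the model in RAM
    "n_threads": 4,      # same thread count as used earlier in this lab
}

for name, value in low_memory_kwargs.items():
    print(f"{name:<10} = {value}")

# With llama-cpp-python installed and MODEL_PATH defined as in Part 2:
# llm = Llama(model_path=str(MODEL_PATH), verbose=False, **low_memory_kwargs)
```

Wrapping the load in `MemoryTracker` from Part 1 lets you compare these settings against the baseline load from Part 2.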

---

## Validation

Run this cell to validate your lab completion:

In [None]:
def validate_lab():
    """Validate lab completion."""
    checks = []
    
    # Note: globals() is used because dir() inside a function only lists local names
    # Check 1: Memory tracking works
    checks.append(("Memory tracking", 'get_memory_usage' in globals()))
    
    # Check 2: Model loaded
    checks.append(("Model loaded", 'llm' in globals()))
    
    # Check 3: KV cache calculator exists
    checks.append(("KV cache calculator", 'calculate_kv_cache_size' in globals()))
    
    # Check 4: Memory estimator exists
    checks.append(("Memory estimator", 'estimate_total_memory' in globals()))
    
    # Check 5: Profiling completed
    checks.append(("Inference profiling", 'profile_inference' in globals()))
    
    # Print results
    print("=== Lab Validation ===")
    all_passed = True
    for check_name, passed in checks:
        status = "✓" if passed else "✗"
        print(f"{status} {check_name}")
        if not passed:
            all_passed = False
    
    print("\n" + "="*50)
    if all_passed:
        print("🎉 Congratulations! You've completed Lab 3!")
        print("\nYou now understand:")
        print("  - How to profile memory usage")
        print("  - KV cache and its memory impact")
        print("  - Memory requirements calculation")
        print("  - Optimization strategies for limited RAM")
    else:
        print("⚠️  Please complete all exercises before moving on.")
    
    return all_passed

validate_lab()

---

## Extension Challenges

### Challenge 1: Real-time Memory Monitor
Build a real-time memory monitoring dashboard that tracks memory during long generations.

### Challenge 2: Memory Budget Optimizer
Create a tool that, given available RAM, recommends the largest model and context size that will fit.

### Challenge 3: Batch Processing Simulator
Implement a batch processing system that maximizes throughput within memory constraints.

### Challenge 4: Multi-Model Memory Analyzer
Compare memory usage across different model architectures (LLaMA vs. GPT vs. Mistral).

### Challenge 5: Memory Leak Detective
Build a tool to detect and visualize memory leaks during extended inference sessions.

In [None]:
# Extension Challenge: Your implementation here


---

## Key Takeaways

In this lab, you learned:

1. **Memory Profiling**: How to monitor and measure memory usage in Python
2. **KV Cache**: The largest variable memory component in LLM inference
3. **Memory Calculation**: Formulas to estimate memory requirements
4. **Scaling Factors**: How batch size and context length affect memory
5. **Optimization**: Strategies to reduce memory usage

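### Memory Budget Quick Reference (Q4_K_M weights, FP16 KV cache)

| Model Size | Q4_K_M Weights | Context | KV Cache | Total RAM Needed |
|------------|----------------|---------|----------|------------------|
| 1.1B | ~700 MB | 2048 | ~44 MB | ~1.5 GB |
| 7B | ~4.5 GB | 2048 | ~1.0 GB | ~7 GB |
| 13B | ~8.5 GB | 2048 | ~1.6 GB | ~12 GB |
| 70B | ~42 GB | 2048 | ~0.6 GB | ~60 GB |

*Note: These are approximate values. Actual usage may vary. KV cache values follow the formula from Part 3; the 1.1B (TinyLlama) and 70B (LLaMA-2) rows account for grouped-query attention with 4 and 8 KV heads respectively.*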

### Next Steps

- **Module 2**: Advanced quantization techniques
- **Module 3**: GPU acceleration and optimization
- **Module 4**: Production deployment strategies

### Troubleshooting

**Out of memory errors**: Reduce context size, use more aggressive quantization, or choose a smaller model.

**Slow inference**: May be swapping to disk. Check system RAM usage and close other applications.

**Memory not freed**: Python may hold on to objects longer than expected. Remove references explicitly with `del`, then call `gc.collect()`.
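
The sketch below illustrates why `del` alone is sometimes not enough. `FakeModel` is a hypothetical stand-in for a loaded model, and the reference cycle is contrived for the demonstration; whether a real model object is part of a cycle depends on your code.

```python
import gc
import weakref

class FakeModel:
    """Stand-in for a loaded model that owns a large buffer."""
    def __init__(self, size_mb):
        self.buffer = bytearray(size_mb * 1024 * 1024)
        self.self_reference = self  # reference cycle keeps the object alive after `del`

def release_demo():
    model = FakeModel(16)
    probe = weakref.ref(model)  # lets us check liveness without holding a reference
    del model
    alive_after_del = probe() is not None
    gc.collect()                # the cycle collector reclaims the unreachable object
    alive_after_collect = probe() is not None
    return alive_after_del, alive_after_collect

alive_after_del, alive_after_collect = release_demo()
print(f"Alive after del: {alive_after_del}")
print(f"Alive after gc.collect(): {alive_after_collect}")
```

Even after the object is collected, the operating system may not report lower RSS immediately, because allocators can keep freed memory for reuse.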

---

**Lab Created By**: Agent 4 (Lab Designer)  
**Last Updated**: 2025-11-18  
**Feedback**: [Submit feedback](../../feedback/)  