# Day 32: Batching and Throughput Optimization - Part 2

In this notebook, we'll explore batching strategies for optimizing throughput when serving large language models. We'll implement basic batching and measure its impact on overall throughput.

## Overview

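1. Setup and dependencies
2. Understanding batching for LLMs
3. Loading a pre-trained model
4. Implementing batched generation
5. Measuring throughput improvements
6. Analyzing the impact of batch size
7. Analyzing the latency vs. throughput trade-off
8. Practical considerations for batching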

## 1. Setup and Dependencies

In [None]:
!pip install -q torch transformers datasets evaluate accelerate matplotlib

In [None]:
import os
import time
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

## 2. Understanding Batching for LLMs

Batching is a technique where multiple requests are processed together to maximize hardware utilization. For language models, this means generating tokens for multiple prompts simultaneously.

Benefits of batching include:
1. Better utilization of parallel processing capabilities (especially on GPUs)
2. Amortized overhead of model execution
3. Higher overall throughput (tokens per second)

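However, batching can increase latency for individual requests: with static batching, every request in a batch finishes only when the whole batch does, and each decoding step takes somewhat longer as the batch grows. So there's a trade-off between throughput and latency.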
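
To make the trade-off concrete, here is a toy cost model. It is only a sketch: the constants `FIXED_OVERHEAD_MS`, `PER_SEQUENCE_MS`, and `NEW_TOKENS` are hypothetical values chosen to show the shape of the curve, not measurements from any hardware.

```python
# Toy cost model. All numbers are hypothetical and only illustrate the shape
# of the trade-off; they are not measured on any hardware.
FIXED_OVERHEAD_MS = 20.0   # cost of one decoding step, paid once per batch
PER_SEQUENCE_MS = 1.0      # extra per-step cost for each sequence in the batch
NEW_TOKENS = 40            # tokens generated per request


def batch_latency_s(batch_size: int) -> float:
    """Time for one batch to finish; every request in it waits this long."""
    step_ms = FIXED_OVERHEAD_MS + PER_SEQUENCE_MS * batch_size
    return NEW_TOKENS * step_ms / 1000.0


def throughput_tps(batch_size: int) -> float:
    """Generated tokens per second across the whole batch."""
    return batch_size * NEW_TOKENS / batch_latency_s(batch_size)


for bs in (1, 2, 4, 8):
    print(f"batch={bs}: latency={batch_latency_s(bs):.2f}s, "
          f"throughput={throughput_tps(bs):.1f} tok/s")
```

In this model, throughput grows almost linearly while the fixed overhead dominates, and latency grows slowly with batch size. Real hardware behaves differently once compute or memory bandwidth saturates.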

## 3. Loading a Pre-trained Model

Let's load a small pre-trained model to demonstrate batching.

In [None]:
# Define model name
model_name = "gpt2"  # Using a smaller model for demonstration

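# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token, so reuse EOS
tokenizer.padding_side = "left"  # Decoder-only models need left padding for batched generation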

model = AutoModelForCausalLM.from_pretrained(model_name).to(device)

# Print model information
print(f"Model: {model_name}")
print(f"Number of parameters: {sum(p.numel() for p in model.parameters()) / 1e6:.2f}M")
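
Batched generation with a decoder-only model such as GPT-2 generally needs left padding: the model continues from the last position of each row, so that position should hold a real prompt token rather than padding. The sketch below shows the idea on plain lists of token IDs. `pad_batch` is a hypothetical helper written for illustration (the tokenizer does this internally), and the prompt token IDs are made up.

```python
def pad_batch(sequences, pad_id, side="left"):
    """Pad lists of token IDs to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = [pad_id] * (max_len - len(seq))
        mask_pad = [0] * len(padding)
        if side == "left":
            # Real tokens end at the last position, where generation continues
            input_ids.append(padding + list(seq))
            attention_mask.append(mask_pad + [1] * len(seq))
        else:
            # Right padding leaves pad tokens at the end of short rows
            input_ids.append(list(seq) + padding)
            attention_mask.append([1] * len(seq) + mask_pad)
    return input_ids, attention_mask


# Made-up prompt IDs; 50256 is GPT-2's end-of-text ID, reused as padding
ids, mask = pad_batch([[11, 12, 13], [21]], pad_id=50256, side="left")
print(ids)
print(mask)
```

The attention mask marks padded positions with 0 so the model ignores them.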

## 4. Implementing Batched Generation

Let's implement a function to generate text for multiple prompts in a batch.

In [None]:
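def generate_batch(model, tokenizer, prompts, max_length=50, temperature=1.0):
    """Generate text for multiple prompts in a single batch.

    Returns the decoded texts, the elapsed time in seconds, and the
    throughput in newly generated tokens per second.
    """
    # Start timing (perf_counter is a monotonic, high-resolution clock)
    start_time = time.perf_counter()

    # Tokenize all prompts, padding them to a common length
    batch_inputs = tokenizer(prompts, padding=True, return_tensors="pt").to(device)
    prompt_length = batch_inputs["input_ids"].shape[1]

    # Generate text; max_length counts prompt tokens plus generated tokens
    with torch.no_grad():
        outputs = model.generate(
            input_ids=batch_inputs["input_ids"],
            attention_mask=batch_inputs["attention_mask"],
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            pad_token_id=tokenizer.pad_token_id,
            use_cache=True
        )

    # Decode the generated text
    generated_texts = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Wait for any pending GPU work, then stop timing
    if device.type == "cuda":
        torch.cuda.synchronize()
    generation_time = time.perf_counter() - start_time

    # Count only newly generated tokens: skip the padded prompt and any padding
    # added after a sequence finished early. Since pad == EOS here, the EOS
    # token itself is not counted.
    new_tokens = outputs[:, prompt_length:]
    total_tokens = int((new_tokens != tokenizer.pad_token_id).sum())
    tokens_per_second = total_tokens / generation_time

    return generated_texts, generation_time, tokens_per_second

In [None]:
def generate_sequentially(model, tokenizer, prompts, max_length=50, temperature=1.0):
    """Generate text for multiple prompts sequentially (one by one).

    Returns the same values as generate_batch so the two can be compared.
    """
    # Start timing
    start_time = time.perf_counter()

    generated_texts = []
    total_tokens = 0

    # Process each prompt sequentially
    for prompt in prompts:
        # Tokenize prompt (a single prompt needs no padding)
        inputs = tokenizer(prompt, return_tensors="pt").to(device)
        prompt_length = inputs["input_ids"].shape[1]

        # Generate text
        with torch.no_grad():
            outputs = model.generate(
                input_ids=inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_length=max_length,
                do_sample=True,
                temperature=temperature,
                pad_token_id=tokenizer.pad_token_id,
                use_cache=True
            )

        # Decode the generated text
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        generated_texts.append(text)

        # Count only newly generated tokens, as in generate_batch
        new_tokens = outputs[0, prompt_length:]
        total_tokens += int((new_tokens != tokenizer.pad_token_id).sum())

    # Wait for any pending GPU work, then stop timing
    if device.type == "cuda":
        torch.cuda.synchronize()
    generation_time = time.perf_counter() - start_time
    tokens_per_second = total_tokens / generation_time

    return generated_texts, generation_time, tokens_per_second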

## 5. Measuring Throughput Improvements

Let's compare the throughput of batched and sequential generation on the same set of prompts.

In [None]:
# Define test prompts
prompts = [
    "Artificial intelligence will transform the future by",
    "The key challenges in climate change are",
    "Space exploration in the next decade will focus on",
    "Quantum computing offers advantages such as",
    "The future of renewable energy depends on",
    "Biotechnology innovations are changing medicine through",
    "Smart cities of tomorrow will incorporate",
    "The most significant ethical concerns in technology are"
]
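
The prompts above are of similar length, so a batch needs little padding. With more varied lengths, padding can consume a large share of each batch. The sketch below estimates that share and shows how sorting by length before batching reduces it. `padding_fraction` is a hypothetical helper, and the lengths are made-up token counts.

```python
def padding_fraction(lengths, batch_size):
    """Fraction of batch positions that are padding when consecutive
    items are grouped into batches and padded to the longest item."""
    padded = real = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padded += max(batch) * len(batch)  # every row is padded to the longest
        real += sum(batch)
    return 1.0 - real / padded


# Made-up prompt lengths in tokens, alternating short and long
lengths = [5, 30, 6, 28, 7, 31, 5, 29]
unsorted_waste = padding_fraction(lengths, batch_size=2)
sorted_waste = padding_fraction(sorted(lengths), batch_size=2)
print(f"padding without sorting: {unsorted_waste:.1%}")
print(f"padding with length sorting: {sorted_waste:.1%}")
```

Sorting changes the order in which requests are served, so a real server would have to weigh the saved compute against fairness and latency.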

In [None]:
# Test sequential generation
print("Testing sequential generation...")
sequential_texts, sequential_time, sequential_tps = generate_sequentially(
    model, tokenizer, prompts, max_length=50, temperature=0.7
)

print(f"Sequential generation time: {sequential_time:.4f} seconds")
print(f"Sequential throughput: {sequential_tps:.2f} tokens per second")

In [None]:
# Test batched generation
print("Testing batched generation...")
batched_texts, batched_time, batched_tps = generate_batch(
    model, tokenizer, prompts, max_length=50, temperature=0.7
)

print(f"Batched generation time: {batched_time:.4f} seconds")
print(f"Batched throughput: {batched_tps:.2f} tokens per second")
print(f"Speedup: {batched_tps / sequential_tps:.2f}x")
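
A single timed run can be misleading: the first call on a GPU typically includes one-time costs such as CUDA initialization, and whichever test runs first absorbs them. One common remedy, sketched here, is to run a few untimed warm-up calls and report the median of several timed runs. `benchmark` is a hypothetical helper, and the workload is a stand-in for a real generation call.

```python
import statistics
import time


def benchmark(fn, warmup=1, repeats=5):
    """Call fn() `warmup` times untimed, then return the median duration
    in seconds of `repeats` timed calls."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload; in this notebook fn could wrap a call such as
# lambda: generate_batch(model, tokenizer, prompts)
median_s = benchmark(lambda: sum(range(100_000)), warmup=1, repeats=3)
print(f"median: {median_s * 1000:.2f} ms")
```

The median is less sensitive than the mean to an occasional slow run.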

## 6. Analyzing the Impact of Batch Size

Let's analyze how throughput changes with different batch sizes.

In [None]:
def measure_batch_performance(model, tokenizer, prompts, batch_sizes):
    """Measure performance for different batch sizes."""
    results = []
    
    # Measure each batch size in turn (batch size 1 is the sequential baseline)
    for batch_size in batch_sizes:
        print(f"Testing batch size: {batch_size}")
        
        # Create batches
        num_batches = len(prompts) // batch_size
        if len(prompts) % batch_size != 0:
            num_batches += 1
        
        total_time = 0
        total_tokens = 0
        
        for i in range(num_batches):
            batch_prompts = prompts[i * batch_size : min((i + 1) * batch_size, len(prompts))]
            
            if len(batch_prompts) == 1:
                # A single prompt (batch size 1, or a lone leftover prompt) needs no padding
                _, time_taken, tokens_per_second = generate_sequentially(
                    model, tokenizer, batch_prompts, max_length=50, temperature=0.7
                )
            else:
                # Batched generation
                _, time_taken, tokens_per_second = generate_batch(
                    model, tokenizer, batch_prompts, max_length=50, temperature=0.7
                )
            
            # Recover this batch's token count from its throughput and duration
            tokens_generated = tokens_per_second * time_taken
            
            total_time += time_taken
            total_tokens += tokens_generated
        
        avg_throughput = total_tokens / total_time
        
        results.append({
            "batch_size": batch_size,
            "total_time": total_time,
            "throughput": avg_throughput
        })
    
    return results

In [None]:
# Test different batch sizes
batch_sizes = [1, 2, 4, 8]
batch_results = measure_batch_performance(model, tokenizer, prompts, batch_sizes)

In [None]:
# Display results
print("Batch Size Performance Results:")
print("-" * 60)
print(f"{'Batch Size':<15} {'Total Time (s)':<20} {'Throughput (tokens/s)':<25}")
print("-" * 60)
for result in batch_results:
    print(f"{result['batch_size']:<15} {result['total_time']:<20.4f} {result['throughput']:<25.2f}")
print("-" * 60)

In [None]:
# Visualize the results
plt.figure(figsize=(12, 5))

# Plot throughput vs batch size
plt.subplot(1, 2, 1)
plt.plot(
    [r["batch_size"] for r in batch_results],
    [r["throughput"] for r in batch_results],
    marker='o',
    linewidth=2
)
plt.title("Throughput vs Batch Size")
plt.xlabel("Batch Size")
plt.ylabel("Throughput (tokens/s)")
plt.grid(True, alpha=0.3)

# Plot total time vs batch size
plt.subplot(1, 2, 2)
plt.plot(
    [r["batch_size"] for r in batch_results],
    [r["total_time"] for r in batch_results],
    marker='o',
    linewidth=2,
    color='green'
)
plt.title("Total Time vs Batch Size")
plt.xlabel("Batch Size")
plt.ylabel("Total Time (seconds)")
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Analyzing Latency vs. Throughput Trade-off

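Let's examine the trade-off between latency (how long a request waits for its batch to finish) and throughput (total tokens per second).

In [None]:
# Calculate latency per request
for result in batch_results:
    # Every request in a batch completes only when the whole batch does, so a
    # request's latency is the duration of its batch, not total_time / len(prompts).
    num_batches = -(-len(prompts) // result["batch_size"])  # ceiling division
    result["latency_per_request"] = result["total_time"] / num_batches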

# Display results with latency
print("Latency vs. Throughput Trade-off:")
print("-" * 80)
print(f"{'Batch Size':<15} {'Throughput (tokens/s)':<25} {'Avg Latency per Request (s)':<30}")
print("-" * 80)
for result in batch_results:
    print(f"{result['batch_size']:<15} {result['throughput']:<25.2f} {result['latency_per_request']:<30.4f}")
print("-" * 80)
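
Latency can be defined in more than one way. The time a batch takes to run is the service time, but when requests are queued, the time until each one completes also includes waiting for earlier batches. The sketch below computes completion times under a simplifying assumption: all requests arrive at t=0 and batches run back to back. `completion_times` is a hypothetical helper, and the batch durations are made up.

```python
def completion_times(batch_times, batch_size, num_requests):
    """Completion time of each request when all requests arrive at t=0
    and batches run back to back in arrival order."""
    times = []
    elapsed = 0.0
    for index, batch_time in enumerate(batch_times):
        elapsed += batch_time
        # The last batch may hold fewer requests than batch_size
        in_batch = min(batch_size, num_requests - index * batch_size)
        times.extend([elapsed] * in_batch)
    return times


# Made-up durations: 1.0 s per single request, 1.4 s for a batch of eight
one_by_one = completion_times([1.0] * 8, batch_size=1, num_requests=8)
all_at_once = completion_times([1.4], batch_size=8, num_requests=8)
print("one by one: ", one_by_one)
print("all at once:", all_at_once)
```

With these numbers, the batch has a longer service time than a single request, yet most requests finish sooner because none of them queue behind the others. Which definition matters depends on how requests arrive in your application.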

In [None]:
# Visualize the trade-off
plt.figure(figsize=(10, 6))

# Create a scatter plot
plt.scatter(
    [r["latency_per_request"] for r in batch_results],
    [r["throughput"] for r in batch_results],
    s=100,  # marker size
    c=[r["batch_size"] for r in batch_results],  # color by batch size
    cmap='viridis',
    alpha=0.8
)

# Add labels for each point
for result in batch_results:
    plt.annotate(
        f"Batch={result['batch_size']}",
        (result["latency_per_request"], result["throughput"]),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center'
    )

plt.title("Throughput vs. Latency Trade-off", fontsize=14)
plt.xlabel("Latency per Request (seconds)", fontsize=12)
plt.ylabel("Throughput (tokens/second)", fontsize=12)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 8. Practical Considerations for Batching

When implementing batching in production, consider the following factors:

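1. **Optimal Batch Size**: The ideal batch size depends on your hardware, model size, and request patterns, so measure it rather than assume it
2. **Variable Sequence Lengths**: Padding short sequences to match the longest one wastes compute, so group requests of similar length where possible
3. **Latency Requirements**: Some applications, such as interactive ones, prioritize low latency over throughput
4. **Memory Constraints**: Larger batch sizes require more memory for activations and the key-value cache, which caps the usable batch size
5. **Dynamic Batching**: Adapting the batch size to the current load instead of waiting for a fixed number of requests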
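
As a rough illustration of dynamic batching, the sketch below collects requests from a queue until the batch is full or a short wait expires, whichever comes first. It is a minimal single-threaded sketch, not a production scheduler: `collect_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names chosen for this example.

```python
import queue
import time


def collect_batch(request_queue, max_batch_size=8, max_wait_s=0.05):
    """Block for the first request, then gather more until the batch is
    full or max_wait_s has passed since the first request was taken."""
    batch = [request_queue.get()]
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break  # no more requests arrived in time
    return batch


pending = queue.Queue()
for prompt in ["a", "b", "c"]:
    pending.put(prompt)

first = collect_batch(pending, max_batch_size=2)
second = collect_batch(pending, max_batch_size=8, max_wait_s=0.01)
print(first, second)
```

Under heavy load the batch fills immediately, while under light load `max_wait_s` bounds the extra latency a request pays while waiting for companions.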

## Conclusion

In this notebook, we've explored batching strategies for optimizing throughput when serving large language models. We've implemented basic batching and measured its impact on overall throughput.

Key takeaways:

1. Batching can significantly improve throughput, especially on GPUs, by making better use of their parallelism
2. There's a trade-off between throughput and latency
3. The optimal batch size depends on hardware, model size, and application requirements
4. For latency-sensitive applications, smaller batch sizes may be preferred
5. For throughput-oriented applications, larger batch sizes are generally better, up to the limits of memory and compute
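
One reason the optimal batch size depends on hardware and model size is the key-value cache, which grows linearly with both batch size and sequence length. The estimate below is a back-of-the-envelope sketch that ignores weights, activations, and framework overhead. `kv_cache_bytes` is a hypothetical helper; the layer count and hidden size are those of GPT-2 small, and float32 storage is assumed.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers, hidden_size, bytes_per_value=4):
    """Approximate key-value cache size: two tensors (keys and values) per
    layer, each holding hidden_size values for every token."""
    return 2 * num_layers * hidden_size * bytes_per_value * batch_size * seq_len


# GPT-2 small: 12 layers, hidden size 768; float32 values (4 bytes each)
for bs in (1, 8, 64):
    mib = kv_cache_bytes(bs, seq_len=50, num_layers=12, hidden_size=768) / 2**20
    print(f"batch={bs}: ~{mib:.1f} MiB")
```

For a model this small the cache is modest, but the same formula with dozens of layers, a larger hidden size, and long sequences quickly reaches gigabytes.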

In the next part, we'll explore more advanced techniques like continuous batching and dynamic scheduling.