# Day 32: Static Scheduling Strategies - Part 5a

In this notebook, we'll explore static scheduling for LLM inference. A static scheduler groups requests into fixed-size batches and runs each batch to completion before starting the next. This is simpler to implement than dynamic scheduling, but it usually makes poorer use of the hardware.

## Overview

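1. Understanding static scheduling
2. Implementing a basic static scheduler
3. Testing the static scheduler
4. Analyzing the impact of batch size
5. Analyzing the throughput-latency trade-off
6. Limitations of static scheduling
7. Visualizing request processing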

## 1. Understanding Static Scheduling

Static scheduling processes requests in fixed-size batches:

1. Requests are collected until a batch is filled
2. The entire batch is processed together
3. All sequences in the batch are processed until completion
4. A new batch is formed and processed

This approach is simple to implement but has several limitations:
- Waiting for a batch to fill adds latency for the requests that arrive first
- Every sequence stays in the batch until the longest one finishes, so short requests return late
- Batch slots held by early-finishing sequences do no useful work
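
To make the last two points concrete, here is a minimal sketch that counts how many decode slots a static batch spends on sequences that have already finished. The function name `batch_utilization` and the example output lengths are illustrative assumptions, not part of the scheduler built later in this notebook.

```python
def batch_utilization(output_lengths):
    """Return (useful_slots, total_slots, utilization) for one static batch.

    Assumes every sequence occupies its batch slot until the longest
    sequence in the batch finishes.
    """
    if not output_lengths:
        raise ValueError("output_lengths must not be empty")
    steps = max(output_lengths)                # batch runs until the longest sequence ends
    total_slots = steps * len(output_lengths)  # one slot per sequence per decode step
    useful_slots = sum(output_lengths)         # slots that produced a token someone needed
    return useful_slots, total_slots, useful_slots / total_slots


# Hypothetical batch: three short requests grouped with one long request
useful, total, util = batch_utilization([10, 12, 15, 29])
print(f"{useful}/{total} slots useful ({util:.0%} utilization)")
```

The more the output lengths in a batch differ, the lower this utilization figure gets.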

In [None]:
# Import necessary libraries
import torch
import numpy as np
import matplotlib.pyplot as plt
import time
import queue
import threading
from collections import deque

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

## 2. Implementing a Basic Static Scheduler

Let's implement a basic static scheduler that processes requests in fixed-size batches.

In [None]:
class Request:
    """Class representing an inference request."""
    def __init__(self, id, prompt, max_tokens=20):
        self.id = id
        self.prompt = prompt
        self.max_tokens = max_tokens
        self.result = None
        self.arrival_time = time.time()
        self.start_time = None
        self.completion_time = None
    
    def start(self):
        """Mark the request as started."""
        self.start_time = time.time()
    
    def complete(self, result):
        """Mark the request as completed."""
        self.result = result
        self.completion_time = time.time()
    
    @property
    def waiting_time(self):
        """Time spent waiting in the queue."""
        if self.start_time is None:
            return None
        return self.start_time - self.arrival_time
    
    @property
    def processing_time(self):
        """Time spent processing the request."""
        if self.start_time is None or self.completion_time is None:
            return None
        return self.completion_time - self.start_time
    
    @property
    def total_time(self):
        """Total time from arrival to completion."""
        if self.completion_time is None:
            return None
        return self.completion_time - self.arrival_time

In [None]:
class MockModel:
    """Mock model for simulation purposes."""
    def __init__(self, token_generation_time=0.05):
        self.token_generation_time = token_generation_time
    
    def generate(self, prompts, max_tokens=20):
        """Simulate token generation."""
        # Simulate processing time based on batch size and sequence length
        batch_size = len(prompts)
        
        # Simulate some parallelism benefit for larger batches.
        # efficiency_factor multiplies the batch's wall-clock time, so lower is faster.
        if batch_size > 1:
            # Time per batch shrinks with the log of the batch size
            efficiency_factor = 1.0 - 0.1 * np.log(batch_size)
            efficiency_factor = max(0.5, efficiency_factor)  # Floor: never faster than half the single-request time
        else:
            efficiency_factor = 1.0
        
        # Simulate generation time
        time.sleep(max_tokens * self.token_generation_time * efficiency_factor)
        
        # Generate mock results
        results = []
        for prompt in prompts:
            # Create a simple result by appending space-separated mock tokens to the prompt
            result = prompt + " " + " ".join(["generated_text"] * int(max_tokens // 2))
            results.append(result)
        
        return results

In [None]:
class StaticScheduler:
    """Static scheduler for batch processing."""
    def __init__(self, model, batch_size=4):
        self.model = model
        self.batch_size = batch_size
        self.request_queue = queue.Queue()
        self.completed_requests = []
        self.running = False
        self.thread = None
    
    def submit_request(self, request):
        """Submit a request to the scheduler."""
        self.request_queue.put(request)
    
    def start(self):
        """Start the scheduler."""
        if self.running:
            return
        
        self.running = True
        self.thread = threading.Thread(target=self._process_batches)
        self.thread.daemon = True
        self.thread.start()
    
    def stop(self):
        """Stop the scheduler."""
        self.running = False
        if self.thread:
            self.thread.join()
    
    def _process_batches(self):
        """Process requests in batches."""
        while self.running:
            # Collect a batch of requests
            batch = []
            max_tokens = 0
            
            # Try to fill the batch
            try:
                while len(batch) < self.batch_size:
                    # Wait up to 0.1 s for the next request
                    request = self.request_queue.get(timeout=0.1)
                    batch.append(request)
                    max_tokens = max(max_tokens, request.max_tokens)
            except queue.Empty:
                # Queue stayed empty for the timeout: loop again if nothing was
                # collected, otherwise fall through and run the partial batch
                if not batch:
                    continue
            
            # If we have requests to process
            if batch:
                # Mark the whole batch as started only now, so waiting time
                # includes the time spent waiting for the batch to fill
                for request in batch:
                    request.start()
                
                # Process the batch
                prompts = [request.prompt for request in batch]
                
                # Generate responses; every sequence runs for the longest max_tokens in the batch
                results = self.model.generate(prompts, max_tokens=max_tokens)
                
                # Update requests with results
                for request, result in zip(batch, results):
                    request.complete(result)
                    self.completed_requests.append(request)
    
    def get_metrics(self):
        """Get performance metrics."""
        if not self.completed_requests:
            return {}
        
        waiting_times = [r.waiting_time for r in self.completed_requests if r.waiting_time is not None]
        processing_times = [r.processing_time for r in self.completed_requests if r.processing_time is not None]
        total_times = [r.total_time for r in self.completed_requests if r.total_time is not None]
        
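        # Throughput = completed requests / wall-clock span from first arrival to last completion
        first_arrival = min(r.arrival_time for r in self.completed_requests)
        last_completion = max(r.completion_time for r in self.completed_requests)
        elapsed = last_completion - first_arrival
        
        return {
            "num_completed": len(self.completed_requests),
            "avg_waiting_time": np.mean(waiting_times) if waiting_times else 0,
            "avg_processing_time": np.mean(processing_times) if processing_times else 0,
            "avg_total_time": np.mean(total_times) if total_times else 0,
            "throughput": len(self.completed_requests) / elapsed if elapsed > 0 else 0
        }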

## 3. Testing the Static Scheduler

Let's test our static scheduler with a simple workload.

In [None]:
def run_static_scheduler_test(batch_size=4, num_requests=20, arrival_rate=2):
    """Run a test of the static scheduler."""
    # Create a mock model
    model = MockModel(token_generation_time=0.05)
    
    # Create a scheduler
    scheduler = StaticScheduler(model, batch_size=batch_size)
    scheduler.start()
    
    # Generate test prompts
    test_prompts = [
        "The future of artificial intelligence is",
        "Climate change will impact our planet by",
        "Space exploration in the next decade will focus on",
        "Quantum computing offers advantages such as",
        "The most significant ethical concerns in technology are"
    ]
    
    # Submit requests at the specified arrival rate
    print(f"Submitting {num_requests} requests at rate of {arrival_rate} per second...")
    for i in range(num_requests):
        # Create a request with a prompt cycled from the list and a random token budget in [10, 30)
        prompt = test_prompts[i % len(test_prompts)]
        max_tokens = np.random.randint(10, 30)
        request = Request(id=i, prompt=prompt, max_tokens=max_tokens)
        
        # Submit the request
        scheduler.submit_request(request)
        
        # Wait according to arrival rate
        time.sleep(1.0 / arrival_rate)
    
    # Wait for all requests to complete
    while scheduler.request_queue.qsize() > 0 or len(scheduler.completed_requests) < num_requests:
        time.sleep(0.1)
    
    # Stop the scheduler
    scheduler.stop()
    
    # Get metrics
    metrics = scheduler.get_metrics()
    
    print("\nTest Results:")
    print(f"Completed requests: {metrics['num_completed']}")
    print(f"Average waiting time: {metrics['avg_waiting_time']:.2f} seconds")
    print(f"Average processing time: {metrics['avg_processing_time']:.2f} seconds")
    print(f"Average total time: {metrics['avg_total_time']:.2f} seconds")
    print(f"Throughput: {metrics['throughput']:.2f} requests per second")
    
    return metrics, scheduler.completed_requests

In [None]:
# Run a test with batch size 4
metrics_batch4, completed_batch4 = run_static_scheduler_test(batch_size=4, num_requests=20, arrival_rate=2)

## 4. Analyzing the Impact of Batch Size

Let's analyze how batch size affects throughput and latency.
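
Before running the experiment, it helps to know what the simulation itself predicts. The sketch below reproduces the timing formula from `MockModel.generate` without sleeping, so it gives a best-case throughput for batches that are always full. The helper name `predicted_batch_time` and the choice of 20 tokens per batch are assumptions made for illustration.

```python
import math


def predicted_batch_time(batch_size, max_tokens=20, token_generation_time=0.05):
    """Seconds MockModel.generate would sleep for one batch (same formula, no sleep)."""
    if batch_size > 1:
        # Time multiplier shrinks with log(batch size) but never drops below 0.5
        factor = max(0.5, 1.0 - 0.1 * math.log(batch_size))
    else:
        factor = 1.0
    return max_tokens * token_generation_time * factor


for b in [1, 2, 4, 8, 16]:
    t = predicted_batch_time(b)
    # Upper bound: assumes the scheduler always has a full batch ready to run
    print(f"batch={b:2d}  batch_time={t:.3f}s  max_throughput={b / t:.2f} req/s")
```

Measured throughput should come out lower than these bounds, because requests arrive over time and the scheduler cannot complete requests faster than they are submitted.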

In [None]:
def analyze_batch_size_impact(batch_sizes, num_requests=20, arrival_rate=2):
    """Analyze the impact of batch size on performance."""
    throughputs = []
    avg_latencies = []
    
    for batch_size in batch_sizes:
        print(f"\nTesting batch size: {batch_size}")
        metrics, _ = run_static_scheduler_test(batch_size=batch_size, num_requests=num_requests, arrival_rate=arrival_rate)
        throughputs.append(metrics["throughput"])
        avg_latencies.append(metrics["avg_total_time"])
    
    return throughputs, avg_latencies

In [None]:
# Test different batch sizes
batch_sizes = [1, 2, 4, 8, 16]
throughputs, avg_latencies = analyze_batch_size_impact(batch_sizes, num_requests=30, arrival_rate=3)

In [None]:
# Plot the results
plt.figure(figsize=(12, 5))

# Plot throughput
plt.subplot(1, 2, 1)
plt.plot(batch_sizes, throughputs, marker='o')
plt.xlabel("Batch Size")
plt.ylabel("Throughput (requests/second)")
plt.title("Throughput vs. Batch Size")
plt.grid(True, alpha=0.3)

# Plot latency
plt.subplot(1, 2, 2)
plt.plot(batch_sizes, avg_latencies, marker='o')
plt.xlabel("Batch Size")
plt.ylabel("Average Latency (seconds)")
plt.title("Latency vs. Batch Size")
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5. Analyzing the Throughput-Latency Trade-off

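There's a fundamental trade-off between throughput and latency in static scheduling: larger batches spread the cost of each generation step over more requests, but each request waits longer for its batch to fill and for the longest sequence in it to finish.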
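
One part of this trade-off can be estimated by hand. The sketch below assumes requests arrive evenly spaced at `arrival_rate` per second and that the scheduler only starts a batch once it is full. The scheduler in this notebook also flushes a partial batch after a 0.1 s queue timeout, so treat this as an upper-bound illustration rather than a model of it. The helper name `avg_fill_delay` is hypothetical.

```python
def avg_fill_delay(batch_size, arrival_rate):
    """Average seconds a request waits for its batch to fill.

    Assumes evenly spaced arrivals and a scheduler that starts a batch
    only when it is full.
    """
    gap = 1.0 / arrival_rate  # seconds between consecutive arrivals
    # Request k (0-based) waits for the (batch_size - 1 - k) requests behind it
    waits = [(batch_size - 1 - k) * gap for k in range(batch_size)]
    return sum(waits) / batch_size


for b in [1, 2, 4, 8, 16]:
    print(f"batch={b:2d}  avg fill delay at 3 req/s: {avg_fill_delay(b, 3):.2f}s")
```

Under these assumptions the average delay is `(batch_size - 1) / (2 * arrival_rate)`, so it grows linearly with batch size.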

In [None]:
# Plot the throughput-latency trade-off
plt.figure(figsize=(10, 6))
plt.scatter(avg_latencies, throughputs, s=100)

# Add labels for each point
for i, batch_size in enumerate(batch_sizes):
    plt.annotate(
        f"Batch={batch_size}",
        (avg_latencies[i], throughputs[i]),
        textcoords="offset points",
        xytext=(0, 10),
        ha='center'
    )

plt.xlabel("Average Latency (seconds)")
plt.ylabel("Throughput (requests/second)")
plt.title("Throughput-Latency Trade-off")
plt.grid(True, alpha=0.3)
plt.show()

## 6. Limitations of Static Scheduling

Static scheduling has several limitations that make it suboptimal for LLM inference:

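1. **Head-of-Line Blocking**: A short request cannot return until the longest request in its batch finishes, and queued requests cannot start until the current batch completes
2. **Batch Formation Delay**: Requests that arrive first wait for the rest of the batch to arrive
3. **Resource Underutilization**: Batch slots held by early-finishing sequences sit idle until the batch ends
4. **Fixed Batch Size**: The batch size does not adapt to changes in load
5. **No Prioritization**: Requests are served strictly in arrival order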
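
Head-of-line blocking is easy to see in a small calculation. The sketch below assumes every request is already queued at time zero, batches run strictly one after another, and each decode step costs a fixed `step_time`. The function name `static_finish_times` and the token counts are illustrative.

```python
def static_finish_times(token_counts, batch_size, step_time=0.05):
    """Finish time in seconds of each request under FIFO static batching."""
    finish = []
    clock = 0.0
    for start in range(0, len(token_counts), batch_size):
        batch = token_counts[start:start + batch_size]
        clock += max(batch) * step_time      # batch ends when its longest sequence ends
        finish.extend([clock] * len(batch))  # every member is released together
    return finish


# Hypothetical queue: one 200-token request ahead of three 10-token requests
tokens = [200, 10, 10, 10]
for n, t in zip(tokens, static_finish_times(tokens, batch_size=2)):
    print(f"{n:3d} tokens -> finishes at {t:.2f}s (needs only {n * 0.05:.2f}s of decoding)")
```

The short request that shares a batch with the long one is held until that batch ends, and the two short requests in the second batch cannot even start before then.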

## 7. Visualizing Request Processing

Let's visualize how requests are processed in a static scheduler.

In [None]:
def visualize_request_processing(completed_requests):
    """Visualize how requests are processed over time."""
    # Sort requests by arrival time
    sorted_requests = sorted(completed_requests, key=lambda r: r.arrival_time)
    
    # Get the earliest arrival time as reference
    t0 = sorted_requests[0].arrival_time
    
    # Prepare data for visualization
    request_ids = []
    arrival_times = []
    start_times = []
    completion_times = []
    waiting_times = []
    processing_times = []
    
    for request in sorted_requests:
        request_ids.append(request.id)
        arrival_times.append(request.arrival_time - t0)
        start_times.append(request.start_time - t0)
        completion_times.append(request.completion_time - t0)
        waiting_times.append(request.waiting_time)
        processing_times.append(request.processing_time)
    
    # Create a Gantt chart
    plt.figure(figsize=(12, 6))
    
    # Plot waiting time (one call draws every bar and gives the legend a correctly colored entry)
    plt.barh(request_ids, waiting_times, left=arrival_times,
             color='lightgray', alpha=0.7, label='Waiting Time')
    
    # Plot processing time
    plt.barh(request_ids, processing_times, left=start_times,
             color='blue', alpha=0.7, label='Processing Time')
    
    plt.xlabel("Time (seconds)")
    plt.ylabel("Request ID")
    plt.title("Request Processing Timeline")
    plt.legend()
    plt.grid(axis='x', alpha=0.3)
    plt.tight_layout()
    plt.show()

In [None]:
# Visualize request processing
visualize_request_processing(completed_batch4)

## Conclusion

In this notebook, we've explored static scheduling strategies for LLM inference. We've implemented a basic static scheduler and analyzed its performance characteristics.

Key takeaways:

1. Static scheduling processes requests in fixed-size batches
2. There's a trade-off between throughput and latency based on batch size
3. Larger batch sizes generally improve throughput but increase latency
4. Static scheduling has several limitations that make it suboptimal for LLM inference

In the next part, we'll explore dynamic scheduling strategies that address these limitations.