# LLM Latency Benchmarking Framework

This notebook provides a comprehensive framework for benchmarking latency in Large Language Models (LLMs) through Amazon Bedrock. It enables systematic testing of:
- Standard vs. Optimized model variants
- Concurrent API calls
- Different workload patterns
- Various model configurations

The framework measures two critical performance metrics:

1. **Time to First Token (TTFT)**: How quickly the model starts responding
   - Lower is better
   - Affected by prompt length and network conditions
   
2. **Output Tokens Per Second (OTPS)**: Generation throughput, computed in this notebook as output tokens divided by total response time (which includes TTFT)
   - Higher is better
   - Affected by prompt complexity, task complexity, and model intelligence
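
As a minimal sketch of how the two metrics relate to raw timings, the helper below derives TTFT and OTPS from three timestamps and an output token count. `latency_metrics` is a hypothetical name used for illustration and is not part of the notebook; it follows the same arithmetic the analysis cell applies (OTPS as output tokens divided by time to last byte).

```python
def latency_metrics(start, first_token, last_token, output_tokens):
    """Return (ttft_seconds, otps) for one streamed request.

    All timestamps are seconds on a common clock, e.g. values from time.time().
    """
    ttft = first_token - start              # time until the first content chunk arrived
    total = last_token - start              # time until the stream finished
    otps = output_tokens / total if total > 0 else float("nan")
    return ttft, otps


# Illustrative values only: first token after 0.5 s, 50 tokens finished at 2.5 s.
ttft, otps = latency_metrics(start=0.0, first_token=0.5, last_token=2.5, output_tokens=50)
print(f"TTFT: {ttft:.2f}s, OTPS: {otps:.1f} tokens/s")
```

With these example numbers the request has a TTFT of 0.5 seconds and an OTPS of 20 tokens per second.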

> **⚠️ Important**: Latency benchmarks are specific to your dataset and use case. While generally available public benchmarks provide a baseline, you should run this framework on your own prompts to get accurate performance metrics. Results can vary significantly based on prompt complexity, token lengths, and network conditions.


## Required Dataset Format

Your input JSONL file should contain one JSON object per line with the following fields:

```json
{
    "text_prompt": "Your question or instruction here",
    "expected_output_tokens": 50,  // number of tokens expected in output
    "task_type": "Text-Generation",  // currently supports Text-Generation
    "model_id": "us.meta.llama3-1-70b-instruct-v1:0",  // model identifier
    "region": "us-east-2", // region where you want to benchmark model latency metrics
    "inference_profile": "optimized"  // optimization setting ("optimized" or "standard")
}
```

#### Example entries from the test dataset:
```json
{"text_prompt": "Summarize the key features of cloud computing in one sentence.", "expected_output_tokens": 50, "task_type": "Text-Generation", "model_id": "us.meta.llama3-1-70b-instruct-v1:0", "region": "us-east-2", "inference_profile": "optimized"}
{"text_prompt": "Explain the concept of machine learning in simple terms.", "expected_output_tokens": 50, "task_type": "Text-Generation", "model_id": "us.anthropic.claude-3-5-haiku-20241022-v1:0", "region": "us-east-2", "inference_profile": "optimized"}
{"text_prompt": "Explain the concept of machine learning in simple terms.", "expected_output_tokens": 50, "task_type": "Text-Generation", "model_id": "us.anthropic.claude-3-5-haiku-20241022-v1:0", "region": "us-east-2", "inference_profile": "standard"}
```

Note: if you set `"inference_profile": "optimized"`, you must use the `us-east-2` region, because optimized inference is currently available only there.
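
Before spending money on API calls, it may help to check the dataset against the format and the region constraint described above. The sketch below is one possible way to do that; `validate_record` and `validate_jsonl` are hypothetical helpers (not part of the notebook), and the set of accepted `inference_profile` values is an assumption based on the examples shown here.

```python
import json

REQUIRED_FIELDS = ("text_prompt", "expected_output_tokens", "task_type",
                   "model_id", "region", "inference_profile")


def validate_record(record):
    """Return a list of human-readable problems found in one dataset record."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    profile = record.get("inference_profile")
    # Assumption: only these two profile values are meaningful for this notebook.
    if profile not in (None, "standard", "optimized"):
        problems.append(f"unknown inference_profile: {profile}")
    # Constraint taken from the note above; revisit if regional availability changes.
    if profile == "optimized" and record.get("region") != "us-east-2":
        problems.append("optimized inference requires region us-east-2")
    return problems


def validate_jsonl(lines):
    """Validate an iterable of JSONL lines; returns {line_number: [problems]}."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            problems = validate_record(json.loads(line))
        except json.JSONDecodeError as err:
            problems = [f"invalid JSON: {err.msg}"]
        if problems:
            report[number] = problems
    return report


sample = [
    '{"text_prompt": "Hi", "expected_output_tokens": 50, "task_type": "Text-Generation", '
    '"model_id": "us.meta.llama3-1-70b-instruct-v1:0", "region": "us-west-2", '
    '"inference_profile": "optimized"}',
]
print(validate_jsonl(sample))
```

For a real dataset, the same function can be fed an open file handle, for example `validate_jsonl(open(file_path, encoding="utf-8"))`. The sample record is flagged because it pairs `optimized` with a region other than `us-east-2`.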

### What This Benchmark Framework Produces

#### Log File
- Created automatically in your working directory
- Named `latency-benchmarking-experiment-{timestamp}.log`
- Tracks all API calls, errors, and execution details

#### Results CSV Files
- Saved in your specified directory
- Named `invocations_{timestamp}.csv`
- Contain detailed metrics for each request:
  - TTFT and completion times
  - Token counts
  - API call status
  - Model details
  - Task types

#### Final Analysis
- Aggregated metrics by model
- A detailed performance report containing statistics such as P50 and P90

### Prerequisites
- AWS credentials with Bedrock access
- Input JSONL file containing test prompts in required format
- Access to desired AWS region
- Access enabled for your selected models on Amazon Bedrock in the region of your choice (model availability varies by region)

### Key Features
- Concurrent API call testing
- Configurable number of parallel requests
- Comprehensive error handling and logging
- Support for different model variants
- Customizable test scenarios

### How to Use This Framework
1. Prepare your JSONL file according to the format above
2. Configure parameters in Cell 1 (only cell requiring modification)
3. Run all cells
4. Results will be automatically saved to your specified directory
5. Check the log file for execution details
6. Review the final analysis cell for performance metrics

### Data Collection Guidelines

The `invocations_per_scenario` parameter determines how many times each prompt is repeated. Since we collect individual metrics for each API call (TTFT, OTPS, token counts), even with `invocations_per_scenario = 10` and 10 different prompts, we get 100 independent observations for the latency benchmarking analysis.

For meaningful benchmarking results:
- Aim for at least 1000 total observations (this can be achieved with fewer repetitions across more prompts); more is better
- Run tests for at least 24 hours, and make sure the window covers your peak traffic times
- Align sample distribution with your actual traffic patterns
- Use `sleep_between_invocations` to control request rate and costs
- Leverage `num_parallel_calls` for concurrent testing
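
The observation count discussed above can be estimated before launching a run. The sketch below is a rough planning aid, not part of the notebook: `plan_run` is a hypothetical helper, and the time figure only counts the sleeps between invocations (it ignores request latency), so treat it as a lower bound. It assumes each experiment run repeats every prompt `invocations_per_scenario` times, as the benchmark loop in this notebook does.

```python
import math


def plan_run(num_prompts, invocations_per_scenario, experiment_counts,
             num_parallel_calls, sleep_between_invocations):
    """Estimate total observations and a lower bound on hours spent sleeping."""
    observations = num_prompts * invocations_per_scenario * experiment_counts
    # Prompts are processed num_parallel_calls at a time, so the busiest worker
    # handles at least this many prompts per run, sleeping after every call.
    waves = math.ceil(num_prompts / num_parallel_calls)
    sleep_seconds = (waves * invocations_per_scenario
                     * sleep_between_invocations * experiment_counts)
    return observations, sleep_seconds / 3600


obs, hours = plan_run(num_prompts=10, invocations_per_scenario=5, experiment_counts=5,
                      num_parallel_calls=4, sleep_between_invocations=60)
print(f"{obs} observations, at least {hours:.2f} hours of sleep time")
```

With 10 prompts and the default settings from the configuration cell, this gives 250 observations and at least 1.25 hours of sleep time, which suggests increasing `experiment_counts` or the number of prompts to reach the 1000-observation target.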

> **⚠️ Statistical Note**: The Central Limit Theorem applies to our aggregate metrics because we collect an individual observation for each API call. Sample means will therefore approximate a normal distribution once we have sufficient total observations, regardless of how repetitions are split across prompts. While a 24-hour collection period helps control for time-of-day variations, extending collection to multiple days (ideally two full weeks) accounts for day-of-week effects and provides a more thorough performance benchmark on your dataset. This is particularly important if your workload patterns vary significantly across days of the week.
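
One practical use of the normal approximation mentioned in the note is to attach a confidence interval to a mean latency. The following is a minimal sketch using only the standard library; `mean_confidence_interval` is a hypothetical helper, the sample values are made up for illustration, and the fixed `z=1.96` multiplier is an approximation better suited to the large sample sizes recommended above than to the ten values shown here.

```python
import math
import statistics


def mean_confidence_interval(samples, z=1.96):
    """Normal-approximation confidence interval for the mean (z=1.96 is roughly 95%)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    std_err = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    return mean - z * std_err, mean, mean + z * std_err


# Made-up TTFT values in seconds, for illustration only.
ttft_samples = [0.42, 0.51, 0.47, 0.39, 0.55, 0.48, 0.44, 0.60, 0.46, 0.50]
low, mean, high = mean_confidence_interval(ttft_samples)
print(f"mean TTFT {mean:.3f}s, approx. 95% CI [{low:.3f}, {high:.3f}]")
```

This interval describes the mean only; percentiles such as P90 do not follow from the same approximation and need a different method (for example, resampling).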

## Configuration
Set your parameters below. This is the only section that needs modification.

In [None]:
# configure the environment
! pip install --upgrade pandas boto3 numpy==1.26.4 matplotlib seaborn pytz

In [2]:
# location of the prompt dataset and directory to save the results 
file_path = "<your-prompt-dataset-in-above-mentioned-JSONL-format>"
directory = "<name-of-folder-to-save-results>"

# Configuration to repeat experiment for reliable metrics
scenario_config = {
    "sleep_between_invocations": 60, # in seconds
    "invocations_per_scenario": 5 # number of times you want to run the same prompt to get more samples - note: this means more cost 
}

# Consider how `num_parallel_calls` interacts with `scenario_config`: 4 parallel workers
# that each sleep 60 seconds between calls send roughly 4 requests per minute
# Set the number of parallel calls
num_parallel_calls = 4

# how many times do you want to run the experiment (increase this for longer experiments, helps with more reliable numbers)
experiment_counts = 5

# Other inference parameters
TEMPERATURE = 1
TOP_P = 1
TOP_K = 250  # recorded in the results CSV only; it is not passed to converse_stream
EXPERIMENT_NAME = '<name-and-version-of-your-experiment>' # your custom experiment name

In [None]:
import subprocess
import sys
import boto3
import botocore
import random
import pprint
import time
import json
import argparse
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import pytz
import os
import logging
from botocore.config import Config
from botocore.exceptions import ClientError
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor
from threading import Lock
from typing import List, Dict

logging_lock = Lock()
os.makedirs(f"{directory}", exist_ok=True)
os.makedirs(f"{directory}-analysis", exist_ok=True)

# Configure logging
logging.basicConfig(filename=f"latency-benchmarking-experiment-{datetime.now().strftime('%Y%m%d_%H%M%S')}.log", 
                    level=logging.INFO, 
                    format='%(asctime)s - %(levelname)s - %(message)s')

# Create a function to get a new boto3 client
def get_bedrock_client(region):
    config = Config(
        retries = dict(
            max_attempts = 1
        )
    )
    return boto3.client(
        service_name='bedrock-runtime',
        region_name=region,
        config=config
    )

def get_timestamp():
    dt = datetime.fromtimestamp(time.time(), tz=pytz.utc)
    return dt.strftime('%Y-%m-%dT%H:%M:%SZ')

def read_file(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
    return content

def get_body(model_id, file_path, prompt, max_tokens):
    body = [
        {
            'role': 'user',
            'content': [
                {
                'text': prompt
                },
            ]
        },
    ]
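    # Use the values from the configuration cell so the request matches
    # the TEMPERATURE and TOP_P values recorded in the results CSV
    inferenceConfig={
        'maxTokens': max_tokens,
        'temperature': TEMPERATURE,
        'topP': TOP_P
    }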
    return body, inferenceConfig

def read_jsonl_files(directory_path):
    all_data = []
    for filename in os.listdir(directory_path):
        if filename.endswith('.jsonl'):
            file_path = os.path.join(directory_path, filename)
            with open(file_path, 'r') as file:
                for line in file:
                    try:
                        json_object = json.loads(line.strip())
                        all_data.append(json_object)
                    except json.JSONDecodeError as e:
                        print(f"Error decoding JSON in file {filename}: {e}")
    return all_data

def post_iteration(scenario_config):
    logging.info(f'Sleeping for {scenario_config["sleep_between_invocations"]} seconds.')
    time.sleep(scenario_config["sleep_between_invocations"])

def benchmark(bedrock, file_path, prompt, latency_inference_profile, max_tokens, model_id="", stream=True, sleep_on_throttling=5):
    accept = 'application/json'
    content_type = 'application/json'
    api_call_status = 'Success'
    full_error_message = 'Success'
    duration_to_first_byte, duration_to_last_byte = None, None
    dt = datetime.fromtimestamp(time.time(), tz=pytz.utc)
    job_timestamp_iso = dt.strftime('%Y-%m-%dT%H:%M:%SZ')

    body, inference_config = get_body(model_id, file_path, prompt, max_tokens)
    output_token_size, input_token_size = None, None

    while True:
        try:
            start = time.time()
            response = bedrock.converse_stream(
                messages=body,
                modelId=model_id,
                inferenceConfig=inference_config,
                performanceConfig={
                        'latency': latency_inference_profile
                    }
            )
            first_byte = None
            event_stream = response.get('stream')
            for event in event_stream:   
                if 'contentBlockDelta' in event:
                    chunk = event['contentBlockDelta']
                    if chunk:
                        if not first_byte:
                            first_byte = time.time()  # update the time to first byte
                elif 'messageStop' in event:
                    stop_reason = event['messageStop'].get('stopReason', 'Unknown')
                elif 'metadata' in event:
                    metadata = event['metadata']
                    if 'usage' in metadata:
                        output_token_size = metadata['usage'].get('outputTokens', None)
                        input_token_size = metadata['usage'].get('inputTokens', None)
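            last_byte = time.time()
            # first_byte stays None if the stream produced no content deltas;
            # leave TTFT as None in that case instead of raising a TypeError
            if first_byte is not None:
                duration_to_first_byte = round(first_byte - start, 2)
            duration_to_last_byte = round(last_byte - start, 2)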
        except ClientError as err:
            full_error_message = err
            api_call_status = err.response['Error']['Code']
            print(f"Got Error: {api_call_status}")
            print(f"Full Error Message: {full_error_message}")
            break
        else:
            break
    return duration_to_first_byte, duration_to_last_byte, job_timestamp_iso, api_call_status, full_error_message, output_token_size, input_token_size

def execute_benchmark(client, scenarios, scenario_config, num_parallel_calls=4, early_break=False):
    pp = pprint.PrettyPrinter(indent=2)
    all_invocations = []
    
    def process_scenario(scenario):
        local_client = get_bedrock_client(scenario['region'])
        local_invocations = []
        file_path = scenario['file_path']
        prompt = scenario['prompt']
        
        for invocation_id in range(scenario_config["invocations_per_scenario"]):
            try:
                time_to_first_byte, time_to_last_byte, job_timestamp_iso, api_call_status, \
                full_error_message, model_output_tokens, model_input_tokens = benchmark(
                    local_client,
                    file_path,
                    prompt,
                    latency_inference_profile=scenario['latency_inference_profile'],
                    max_tokens=scenario['configured_output_tokens_for_request'],
                    model_id=scenario['model_id'],
                    stream=scenario['stream'],
                    sleep_on_throttling=scenario_config['sleep_between_invocations']
                )

                invocation = {
                    'time_to_first_byte': time_to_first_byte,
                    'time_to_last_byte': time_to_last_byte,
                    'job_timestamp_iso': job_timestamp_iso,
                    'configured_output_tokens_for_request': scenario['configured_output_tokens_for_request'],
                    'model_input_tokens': model_input_tokens,
                    'model_output_tokens': model_output_tokens,
                    'model': scenario['model_id'],
                    'region': scenario['region'],
                    'invocation_id': invocation_id,
                    'api_call_status': api_call_status,
                    'full_error_message': full_error_message,
                    'TEMPERATURE': TEMPERATURE,
                    'TOP_P': TOP_P,
                    'TOP_K': TOP_K,
                    'EXPERIMENT_NAME': EXPERIMENT_NAME,
                    'task_type': scenario['task_type'],
                    'inference_profile': scenario['latency_inference_profile'],
                }
                local_invocations.append(invocation)
                
                # Thread-safe logging
                with logging_lock:
                    logging.info(f'Invocation: {invocation}')
                
                post_iteration(scenario_config=scenario_config)
                
            except Exception as e:
                with logging_lock:
                    logging.error(f"Error while processing scenario: {scenario['model_id']}. Error: {e}")
                
        return local_invocations

    # Execute scenarios in parallel
    with ThreadPoolExecutor(max_workers=num_parallel_calls) as executor:
        # Submit all scenarios and store futures
        future_to_scenario = {executor.submit(process_scenario, scenario): scenario 
                            for scenario in scenarios}
        
        # Print initial state
        print(f"Total scenarios submitted: {len(future_to_scenario)}")
        print(f"Number of parallel workers: {num_parallel_calls}")
        
        # Monitor futures as they complete
        start_time = time.time()
        running_futures = set()
        
        for future in concurrent.futures.as_completed(future_to_scenario):
            scenario = future_to_scenario[future]
            current_time = time.time() - start_time
            
            try:
                result = future.result()
                all_invocations.extend(result)
            except Exception as e:
                with logging_lock:
                    logging.error(f"Scenario failed: {e}")

        return all_invocations

if __name__ == "__main__":
    use_cases_scenarios = []

    # Read the JSONL file and process each line
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            file = json.loads(line.strip())
            prompt = file.get('text_prompt')
            task_type = file.get('task_type')
            model_id = file.get('model_id')
            region = file.get('region')
            latency_inference_profile = file.get('inference_profile', 'optimized')

            out_tokens = file.get('expected_output_tokens', 100)
            use_cases_scenarios.append({
                "file_path": file_path,
                "configured_output_tokens_for_request": out_tokens,
                "prompt": prompt,
                "stream": True,
                "model_id": model_id,
                "region": region,
                "task_type": task_type,
                "latency_inference_profile": latency_inference_profile
            })

    # Main loop
    run_count = 1
    while run_count <= experiment_counts:
        selected_scenarios = random.sample(
            use_cases_scenarios, 
            k=len(use_cases_scenarios) // 1
        )

        with logging_lock:
            logging.info(f"{len(selected_scenarios)} scenarios x {scenario_config['invocations_per_scenario']} invocations = {len(selected_scenarios) * scenario_config['invocations_per_scenario']} total invocations")
        
        logging.info(f"Running iteration {run_count}")
        
        # Create a new client for the main thread (each worker thread creates its own
        # per-scenario client); `region` here is the region of the last dataset entry
        client = get_bedrock_client(region)
        
        # Run the scenarios and measure times
        invocations = execute_benchmark(
            client, 
            selected_scenarios, 
            scenario_config, 
            num_parallel_calls=num_parallel_calls,
            early_break=False
        )

        # Convert the invocations list to a pandas DataFrame
        df = pd.DataFrame(invocations)
        df['timestamp'] = pd.Timestamp.now()
        df['run_count'] = run_count

        # Write the DataFrame to a CSV file
        output_file = f"{directory}/invocations_{datetime.now().strftime('%Y%m%d_%H%M%S')}.csv"
        df.to_csv(output_file, index=False)
        
        with logging_lock:
            logging.info(f"Results written to {output_file}")
            logging.info(f"Completed run {run_count} of {experiment_counts}")

        run_count += 1

## Analysis of Results

This section analyzes the collected latency metrics grouped by model ID, region, and inference profile. The analysis includes:

1. Time to First Token (TTFT)
   - Average
   - P50 (median)
   - P90

2. Output Tokens per Second (OTPS)
   - Average
   - P50 (median)
   - P90

Note: Results may vary based on network conditions, prompt length, and other factors.
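
To make the statistics listed above concrete, the sketch below computes the mean, P50, and P90 of a small list of values using only the standard library. `summarize` is a hypothetical helper for illustration; the analysis cell itself uses pandas. To my understanding, the `inclusive` method of `statistics.quantiles` uses the same linear interpolation as the pandas `quantile` default, so the numbers should agree.

```python
import statistics


def summarize(values):
    """Return the mean, P50 (median) and P90 of a list of numeric samples."""
    # n=10 yields nine cut points (deciles); the last one is the 90th percentile.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "p50": statistics.median(values),
        "p90": deciles[8],
    }


summary = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(summary)
```

For the values 1 through 10, the mean and median are both 5.5 and the P90 is 9.1: 90% of the way through the sorted samples falls one tenth of the way between 9 and 10.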

In [None]:
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, HTML
from matplotlib.backends.backend_pdf import PdfPages
import datetime
import numpy as np

def combine_csv_files(directory):
    """Combine all CSV files in the directory into a single DataFrame."""
    all_files = glob.glob(os.path.join(directory, "invocations_*.csv"))
    df_list = []
    for filename in all_files:
        df = pd.read_csv(filename)
        df_list.append(df)
    return pd.concat(df_list, axis=0, ignore_index=True)

def calculate_metrics(df, group_columns):
    """Calculate latency metrics grouped by model, region, and inference profile."""
    metrics = df.groupby(group_columns).agg({
        'time_to_first_byte': ['count', 'mean', 'median', 
                              lambda x: x.quantile(0.9), 
                              lambda x: x.std()],
        'model_input_tokens': ['mean'],
        'model_output_tokens': ['mean'],
        'time_to_last_byte': ['mean', 'median', 
                             lambda x: x.quantile(0.9)]
    }).round(3)
    
    metrics.columns = ['sample_size', 'TTFT_mean', 'TTFT_p50', 'TTFT_p90', 'TTFT_std',
                      'avg_input_tokens',
                      'avg_output_tokens',
                      'total_time_mean', 'total_time_p50', 'total_time_p90']
    
    df['OTPS'] = df['model_output_tokens'] / df['time_to_last_byte']
    otps_metrics = df.groupby(group_columns)['OTPS'].agg(['mean', 'median', 
                                                         lambda x: x.quantile(0.9),
                                                         lambda x: x.std()]).round(3)
    otps_metrics.columns = ['OTPS_mean', 'OTPS_p50', 'OTPS_p90', 'OTPS_std']
    
    metrics = pd.concat([metrics, otps_metrics], axis=1)
    return metrics

def create_performance_summary_tables(df, metrics, pdf):
    """Create performance summary tables split across multiple pages if needed."""
    MAX_MODELS_PER_PAGE = 4
    
    # Use all metrics from the metrics DataFrame
    METRICS_TO_SHOW = metrics.columns.tolist()
    
    # Get unique models
    models = metrics.index.get_level_values('model').unique()
    model_chunks = [models[i:i + MAX_MODELS_PER_PAGE] 
                   for i in range(0, len(models), MAX_MODELS_PER_PAGE)]
    
    for page_num, models_subset in enumerate(model_chunks, 1):
        fig, ax = plt.subplots(figsize=(15, 10))
        plt.axis('off')
        
        data_rows = []
        row_labels = []
        col_labels = []
        valid_columns = []

        # Get actual existing combinations from the metrics DataFrame
        for model in models_subset:
            model_display_name = model.split('.')[-1]
            try:
                model_data = metrics.xs(model, level='model')
                for (region, profile) in model_data.index:
                    if not model_data.loc[(region, profile)].isna().all():
                        col_labels.append(f"{model_display_name}\n{region}\n{profile}")
                        valid_columns.append((model, region, profile))
            except KeyError:
                continue
        
        # Create row labels and data only for valid combinations
        for metric in METRICS_TO_SHOW:
            row_labels.append(metric)
            row_data = []
            
            for model, region, profile in valid_columns:
                try:
                    value = metrics.loc[(model, region, profile), metric]
                    if isinstance(value, (int, float)):
                        if metric == 'sample_size':
                            row_data.append(f"{value:.0f}")
                        else:
                            row_data.append(f"{value:.2f}")
                    else:
                        row_data.append(str(value))
                except KeyError:
                    continue  # Skip if combination doesn't exist
            
            data_rows.append(row_data)

        # Create table only if there are valid columns
        if valid_columns:
            table = ax.table(cellText=data_rows,
                           colLabels=col_labels,
                           rowLabels=row_labels,
                           cellLoc='center',
                           loc='center',
                           bbox=[0.05, 0.05, 0.95, 0.95])
            
            table.auto_set_font_size(False)
            table.set_fontsize(8)
            
            for k, cell in table.get_celld().items():  # public accessor for the cell dict
                if k[0] == 0:  # Header row
                    cell.set_height(0.15)
                    cell.set_text_props(ha='center', va='center')
                    cell.set_fontsize(7)
                    cell.set_text_props(weight='bold')
                
                if k[1] == -1:  # Row headers (metrics names)
                    cell.set_width(0.20)
                    cell.set_text_props(ha='left')
                else:
                    cell.set_width(0.80 / len(valid_columns))
            
            plt.title(f'Performance Metrics Summary (Page {page_num} of {len(model_chunks)})', pad=20)
            pdf.savefig(fig, bbox_inches='tight', dpi=300)
        plt.close()

def plot_model_distributions(df, metric, metric_name, pdf):
    """Create distribution plots grouped by model, region, and inference profile."""
    model_profiles = df.groupby(['model', 'region', 'inference_profile']).size().reset_index()
    n_combinations = len(model_profiles)
    
    n_cols = min(2, n_combinations)
    n_rows = (n_combinations + 1) // 2
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(8*n_cols, 6*n_rows))
    if n_rows == 1 and n_cols == 1:
        axes = np.array([[axes]])
    elif n_rows == 1 or n_cols == 1:
        axes = axes.reshape(-1, 1) if n_cols == 1 else axes.reshape(1, -1)
    
    axes_flat = axes.flatten()
    
    for idx, (_, row) in enumerate(model_profiles.iterrows()):
        model = row['model']
        region = row['region']
        profile = row['inference_profile']
        ax = axes_flat[idx]
        
        mask = (df['model'] == model) & (df['region'] == region) & (df['inference_profile'] == profile)
        data = df[mask][metric]
        
        sns.histplot(data=data, kde=True, bins=30, ax=ax)
        
        ax.axvline(data.mean(), color='r', linestyle='--', 
                  label=f'Mean: {data.mean():.2f}')
        ax.axvline(data.median(), color='g', linestyle='--', 
                  label=f'Median: {data.median():.2f}')
        ax.axvline(data.quantile(0.9), color='b', linestyle='--', 
                  label=f'P90: {data.quantile(0.9):.2f}')
        
        model_display_name = model.split('.')[-1]
        ax.set_title(f'{model_display_name}\n{region}\n{profile}')
        ax.set_xlabel(metric_name)
        ax.set_ylabel('Count')
        ax.legend(fontsize='small')
    
    for idx in range(len(model_profiles), len(axes_flat)):
        axes_flat[idx].set_visible(False)
    
    plt.suptitle(f'{metric_name} Distribution by Model, Region, and Inference Profile')
    plt.tight_layout()
    pdf.savefig(fig)
    plt.close()

def plot_model_comparison(df, metric, metric_name, pdf):
    """Create box plot comparing models by inference profile."""
    plt.figure(figsize=(15, 10))
    
    df = df.copy()
    # Create combined model-region display name
    df['model_display'] = df.apply(lambda x: f"{x['model'].split('.')[-1]}\n({x['region']})", axis=1)
    
    # Create box plot with inference_profile as hue
    ax = sns.boxplot(data=df, x='model_display', y=metric, hue='inference_profile')
    
    q1 = df[metric].quantile(0.25)
    q3 = df[metric].quantile(0.75)
    iqr = q3 - q1
    upper_whisker = q3 + 1.5 * iqr
    
    plt.ylim(0, upper_whisker * 1.2)
    plt.title(f'{metric_name} Comparison Across Models and Optimized-Inference')
    plt.xticks(rotation=45)
    plt.legend(title='Profile')
    plt.grid(True, axis='y', linestyle='--', alpha=0.7)
    
    plt.tight_layout()
    pdf.savefig()
    plt.close()

def check_and_create_new_page(y_pos, pdf, min_space_needed=0.2):
    """
    Check if we need a new page and create one if necessary.
    Returns: new y_position (either on same or new page)
    """
    if y_pos < min_space_needed:
        pdf.savefig(bbox_inches='tight', dpi=300)
        plt.close()
        
        # Create new page
        fig, ax = plt.subplots(figsize=(12, 12))
        plt.axis('off')
        return 0.95
    return y_pos

def analyze_latency_metrics(directory):
    """Main analysis function with PDF report generation."""
    # Turn off interactive plotting
    plt.ioff()
    # Close any existing plots
    plt.close('all')
    
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    pdf_file = os.path.join(f"{directory}-analysis", f'latency_analysis_report_{timestamp}.pdf')
    
    with PdfPages(pdf_file) as pdf:
        # Create title page
        fig, ax = plt.subplots(figsize=(12, 8))
        plt.axis('off')
        
        # Main title
        plt.text(0.5, 0.8, 'Latency Analysis Report',
                ha='center', va='center', size=24, weight='bold')
        
        # Timestamp
        plt.text(0.5, 0.7, f'Generated on: {datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")}',
                ha='center', va='center', size=12, style='italic', color='#666666')
        
        # Metrics section title
        plt.text(0.5, 0.5, 'Key Metrics',
                ha='center', va='center', size=18, weight='bold')
        
        # TTFT section
        plt.text(0.5, 0.4, 'TTFT (Time to First Token)',
                ha='center', va='center', size=14, weight='bold', color='#2E5A88')
        plt.text(0.5, 0.35, 'Lower values indicate better performance',
                ha='center', va='center', size=12, style='italic', color='#666666')
        
        # OTPS section
        plt.text(0.5, 0.25, 'OTPS (Output Tokens Per Second)',
                ha='center', va='center', size=14, weight='bold', color='#2E5A88')
        plt.text(0.5, 0.2, 'Higher values indicate better performance',
                ha='center', va='center', size=12, style='italic', color='#666666')
        
        pdf.savefig(bbox_inches='tight', dpi=300)
        plt.close()
        
        print("Loading data from:", directory)
        df = combine_csv_files(directory)
        
        # Count error requests
        errored_requests = df[df['api_call_status'] != 'Success']
        errored_count = len(errored_requests)

        # Count throttled requests
        throttled_requests = df[df['api_call_status'] == 'ThrottlingException']
        throttled_count = len(throttled_requests)
        
        # Remove error requests from analysis
        df = df[df['api_call_status'] == 'Success'].copy()  # copy so adding OTPS below does not modify a slice
        
        # Calculate OTPS for valid requests
        df['OTPS'] = df['model_output_tokens'] / df['time_to_last_byte']
        
        # Summary statistics page
        fig, ax = plt.subplots(figsize=(12, 12))
        plt.axis('off')
               
        # Section 1: API Statistics
        plt.text(0.1, 0.95, 'Summary Statistics', size=18, weight='bold')
        plt.text(0.1, 0.90, f"Total API calls: {len(df) + errored_count}", size=12)
        plt.text(0.1, 0.86, f"Successful calls: {len(df)}", size=12)
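        # Report both rates against the same denominator (all API calls) and guard against empty data
        total_calls = len(df) + errored_count
        plt.text(0.1, 0.82, f"Error calls: {errored_count} ({(errored_count / total_calls * 100 if total_calls else 0):.1f}%)", 
                size=12, color='#666666')
        plt.text(0.1, 0.78, f"Throttled calls: {throttled_count} ({(throttled_count / total_calls * 100 if total_calls else 0):.1f}%)", 
                size=12, color='#666666')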

        # Token Statistics section
        plt.text(0.1, 0.70, 'Token Statistics', size=18, weight='bold')
        plt.text(0.1, 0.65, f"Average Input Tokens: {df['model_input_tokens'].mean():.1f}", size=12)
        plt.text(0.1, 0.61, f"Max Input Tokens: {df['model_input_tokens'].max():.0f}", size=12)
        plt.text(0.1, 0.57, f"Average Output Tokens: {df['model_output_tokens'].mean():.1f}", size=12)
        plt.text(0.1, 0.53, f"Max Output Tokens: {df['model_output_tokens'].max():.0f}", size=12)

        # Section 2: Model Information
        plt.text(0.1, 0.45, 'Model Information', size=18, weight='bold')
        plt.text(0.1, 0.40, f"Number of unique models: {df['model'].nunique()}", size=12)
        plt.text(0.1, 0.36, "Models:", size=12)
        
        y_pos = 0.32
        for model in df['model'].unique():
            y_pos = check_and_create_new_page(y_pos, pdf)
            model_display_name = model
            plt.text(0.15, y_pos, f"• {model_display_name}", size=12, color='#2E5A88')
            y_pos -= 0.04

        # Section 3: Inference Profiles
        if 'inference_profile' in df.columns:
            y_pos = check_and_create_new_page(y_pos, pdf)
            y_pos -= 0.02  # Space between sections
            plt.text(0.1, y_pos, 'Inference Profiles', size=18, weight='bold')
            y_pos -= 0.05
            
            for profile in df['inference_profile'].unique():
                y_pos = check_and_create_new_page(y_pos, pdf)
                plt.text(0.15, y_pos, f"• {profile}", size=12, color='#2E5A88')
                y_pos -= 0.04

        # Section 4: Sample Distribution
        y_pos = check_and_create_new_page(y_pos, pdf)
        y_pos -= 0.02
        plt.text(0.1, y_pos, 'Sample Distribution', size=18, weight='bold')
        y_pos -= 0.05

        if 'inference_profile' in df.columns:
            model_profile_counts = df.groupby(['model', 'inference_profile', 'region']).size()
            for (model, profile, region), count in model_profile_counts.items():
                y_pos = check_and_create_new_page(y_pos, pdf)
                model_display_name = model.split('.')[-1]
                plt.text(0.15, y_pos, 
                        f"• {model_display_name} in {region} with ({profile}) inference: {count} samples",
                        size=12, color='#2E5A88')
                y_pos -= 0.04

        pdf.savefig(bbox_inches='tight', dpi=300)
        plt.close()
        
        # metrics = calculate_metrics(df, ['model', 'inference_profile'])
        # Basic metrics table
        metrics = calculate_metrics(df, ['model', 'region', 'inference_profile'])
        create_performance_summary_tables(df, metrics, pdf)
        
        # Distribution plots
        plot_model_distributions(df, 'time_to_first_byte', 'Time to First Token (seconds)', pdf)
        plot_model_distributions(df, 'OTPS', 'Output Tokens Per Second', pdf)
        
        # Model comparisons
        plot_model_comparison(df, 'time_to_first_byte', 'TTFT', pdf)
        plot_model_comparison(df, 'OTPS', 'OTPS', pdf)
        
        # Save metrics to CSV
        csv_file = os.path.join(f"{directory}-analysis", f'analysis_summary_{timestamp}.csv')
        metrics.to_csv(csv_file)
        
        print(f"\nAnalysis complete!")
        print(f"PDF report saved to: {pdf_file}")
        print(f"CSV summary saved to: {csv_file}")

# Run the analysis
analyze_latency_metrics(directory)

# End