# Performance Tuning & Latency Optimization

This notebook provides a framework for analyzing and optimizing the performance of the Self-Critique Chain Pipeline, with a focus on reducing latency and increasing throughput.

## Learning Objectives

- **Latency Analysis**: Profile pipeline execution to identify bottlenecks.
- **Parameter Tuning**: Understand the impact of temperature and `max_tokens` on performance.
- **Caching Strategies**: Implement and evaluate caching to reduce redundant API calls.
- **Batch Processing**: Explore strategies for processing multiple documents efficiently.
- **Throughput Measurement**: Measure the maximum sustainable throughput of the system.

## Business Context

For user-facing applications, low latency is critical for a good user experience. For large-scale batch processing, high throughput is essential for efficiency. This notebook helps answer:

- What is the end-to-end latency of the pipeline (P50, P95, P99)?
- Which stage of the pipeline is the slowest?
- How can we reduce latency without significantly impacting quality?
- How can we maximize the number of papers processed per hour?
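
The percentile questions above can be answered from any list of per-request latencies. The following is a minimal sketch using `numpy.percentile`; the helper name `summarize_latencies` is hypothetical (the signature of the notebook's own `calculate_percentiles` utility is not shown here), and the sample values are illustrative placeholders rather than measurements.

```python
import numpy as np

def summarize_latencies(latencies_s):
    """Return P50/P95/P99 (in seconds) for a sequence of latency samples."""
    values = np.asarray(latencies_s, dtype=float)
    if values.size == 0:
        raise ValueError("need at least one latency sample")
    p50, p95, p99 = np.percentile(values, [50, 95, 99])
    return {"p50": float(p50), "p95": float(p95), "p99": float(p99)}

# Illustrative placeholder values, not real measurements.
example = summarize_latencies([5.1, 5.4, 5.9, 6.3, 7.8])
print(example)
```

Tail percentiles such as P99 are only meaningful with enough samples; with a handful of requests they are effectively interpolations near the maximum.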

---


## Section 1: Setup and Configuration

In [None]:
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import asyncio
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Dict, List, Any

from src.pipeline import SelfCritiquePipeline
from notebooks._shared_utilities import (
    create_benchmark_dataset,
    calculate_percentiles,
    format_duration
)

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (15, 7)

print("✓ Environment setup complete")

## Section 2: Latency Profiling

First, we'll measure the latency of a single pipeline execution and break it down by stage to identify bottlenecks.


In [None]:
def profile_pipeline_execution(pipeline: SelfCritiquePipeline, text: str) -> Dict[str, float]:
    """Profiles a single run of the pipeline and returns latency metrics."""
    # In a real scenario, this would come from pipeline.run_pipeline()
    # For now, we simulate it.
    stage_latencies = {
        'stage1_summary': np.random.uniform(1.5, 2.5),
        'stage2_critique': np.random.uniform(2.0, 3.0),
        'stage3_revision': np.random.uniform(1.8, 2.8),
    }
    total_latency = sum(stage_latencies.values())
    stage_latencies['total'] = total_latency
    return stage_latencies

# Load a sample paper
sample_paper = create_benchmark_dataset()[0]
pipeline = SelfCritiquePipeline(api_key="DUMMY_KEY")

# Profile the execution
latency_profile = profile_pipeline_execution(pipeline, sample_paper['text'])

# Visualize the breakdown
latency_df = pd.DataFrame.from_dict(latency_profile, orient='index', columns=['latency_seconds'])
latency_df.sort_values('latency_seconds', ascending=False, inplace=True)

ax = latency_df.plot(kind='barh', legend=False, color='skyblue')
ax.set_title('Latency Breakdown by Pipeline Stage')
ax.set_xlabel('Latency (seconds)')
plt.show()

print("Latency Profile:")
for stage, latency in latency_profile.items():
    print(f"- {stage:<20}: {format_duration(latency)}")
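
The cell above simulates stage timings with random numbers. One way real timings could be collected is to wrap each stage call in a timer, as sketched below. The names `timed_stage` and `profile_stages` are hypothetical, and the stage functions are placeholders, since the pipeline's per-stage methods are not shown in this notebook.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed_stage(name, timings):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

def profile_stages(stages, text):
    """Run (name, callable) stages in order, feeding each output to the next."""
    timings = {}
    result = text
    for name, fn in stages:
        with timed_stage(name, timings):
            result = fn(result)
    timings["total"] = sum(timings.values())
    return result, timings

# Placeholder stage functions standing in for real model calls.
stages = [
    ("stage1_summary", lambda t: t[:100]),
    ("stage2_critique", lambda t: f"critique of: {t}"),
    ("stage3_revision", lambda t: f"revised: {t}"),
]
output, timings = profile_stages(stages, "example paper text")
print(timings)
```

The returned dictionary has the same keys as `latency_profile`, so it could feed the same bar chart.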

## Section 3: Temperature Parameter Impact

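We'll analyze how the `temperature` parameter affects latency and (qualitatively) output diversity. Lower temperatures make output more deterministic; any effect on latency is usually indirect, through changes in output length, so the relationship simulated below is an assumption to verify against real measurements.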


In [None]:
temperatures = [0.0, 0.3, 0.5, 0.7, 1.0]
latency_results = []

for temp in temperatures:
    # In a real test, you'd set pipeline.temperature = temp
    # Here, we simulate the effect: higher temp -> slightly higher latency
    simulated_latency = 5.0 + (temp * 1.5) + np.random.uniform(-0.2, 0.2)
    latency_results.append({'temperature': temp, 'latency': simulated_latency})

temp_df = pd.DataFrame(latency_results)

ax = sns.lineplot(data=temp_df, x='temperature', y='latency', marker='o')
ax.set_title('Impact of Temperature on Latency')
ax.set_xlabel('Temperature')
ax.set_ylabel('Average Latency (seconds)')
plt.show()

print(temp_df)
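
A single sample per temperature is noisy, so a real comparison would repeat each setting several times. The sketch below shows one possible harness; `measure_latency` and `fake_call` are hypothetical names, and `fake_call` only sleeps in place of a real pipeline call.

```python
import statistics
import time

def measure_latency(call, settings, trials=3):
    """Time `call(setting)` `trials` times per setting; return mean and stdev."""
    rows = []
    for setting in settings:
        samples = []
        for _ in range(trials):
            start = time.perf_counter()
            call(setting)
            samples.append(time.perf_counter() - start)
        rows.append({
            "setting": setting,
            "mean_s": statistics.mean(samples),
            "stdev_s": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        })
    return rows

def fake_call(temperature):
    # Placeholder for a real pipeline call made with the given temperature.
    time.sleep(0.01)

rows = measure_latency(fake_call, [0.0, 0.5, 1.0], trials=3)
for row in rows:
    print(row)
```

The same harness could sweep `max_tokens` instead of temperature by passing a different `call` and list of settings.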

## Section 4: Caching Implementation

Implementing a cache can dramatically reduce latency for repeated requests with the same input. We'll use a simple in-memory LRU (Least Recently Used) cache.


In [None]:
@lru_cache(maxsize=128)
def cached_pipeline_run(text: str):
    """Simulates a cached pipeline run."""
    time.sleep(np.random.uniform(0.5, 1.5)) # Simulate work
    return f"Summary for: {text[:50]}..."

sample_texts = [p['text'] for p in create_benchmark_dataset()] * 3 # Create duplicates

# Time the uncached version
start_time_uncached = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
for text in sample_texts:
    _ = cached_pipeline_run.__wrapped__(text)
end_time_uncached = time.perf_counter()
uncached_duration = end_time_uncached - start_time_uncached

# Time the cached version
cached_pipeline_run.cache_clear()
start_time_cached = time.perf_counter()
for text in sample_texts:
    _ = cached_pipeline_run(text)
end_time_cached = time.perf_counter()
cached_duration = end_time_cached - start_time_cached

print(f"Uncached execution time: {format_duration(uncached_duration)}")
print(f"Cached execution time:   {format_duration(cached_duration)}")
print(f"Performance Improvement: {uncached_duration / cached_duration:.2f}x")
print(f"Cache Info: {cached_pipeline_run.cache_info()}")
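
The `lru_cache` above keys only on the input text. If generation parameters such as `temperature` vary between calls, the cache key arguably needs to include them, otherwise a hit may return output produced under different settings. The sketch below illustrates that idea; `PipelineCache` and `fake_pipeline` are hypothetical names, and the cache is unbounded (no eviction), unlike `lru_cache`.

```python
import hashlib
import json

class PipelineCache:
    """In-memory cache keyed on the input text plus generation parameters."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def make_key(text, **params):
        # Serialize deterministically so equal inputs give equal keys.
        payload = json.dumps({"text": text, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(self, compute, text, **params):
        key = self.make_key(text, **params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = compute(text, **params)
        self._store[key] = result
        return result

def fake_pipeline(text, temperature=0.3):
    # Placeholder for a real pipeline call.
    return f"summary(T={temperature}): {text[:20]}"

cache = PipelineCache()
cache.get_or_compute(fake_pipeline, "paper A", temperature=0.3)  # miss
cache.get_or_compute(fake_pipeline, "paper A", temperature=0.3)  # hit
cache.get_or_compute(fake_pipeline, "paper A", temperature=0.7)  # miss: new params
print(cache.hits, cache.misses)
```

The hit and miss counters play the same role as `cache_info()` and can be used to estimate the hit rate on real traffic.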

## Section 5: Batch Processing & Throughput Analysis

We'll measure throughput by running multiple requests in parallel using a thread pool.


In [None]:
async def run_batch(texts: List[str], max_workers: int) -> float:
    """Runs a batch of requests in parallel and returns total time."""
    start_time = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Inside a coroutine, get_running_loop() is preferred over get_event_loop()
        loop = asyncio.get_running_loop()
        tasks = [loop.run_in_executor(executor, cached_pipeline_run, text) for text in texts]
        await asyncio.gather(*tasks)
    end_time = time.perf_counter()
    return end_time - start_time

batch_sizes = [1, 2, 4, 8, 16, 32]
throughput_results = []
dataset = [p['text'] for p in create_benchmark_dataset()]
cached_pipeline_run.cache_clear()

async def main():
    for workers in batch_sizes:
        # Clear the cache before each run; otherwise every run after the first
        # would only measure cache hits instead of parallel throughput.
        cached_pipeline_run.cache_clear()
        duration = await run_batch(dataset, max_workers=workers)
        throughput = len(dataset) / duration
        throughput_results.append({'workers': workers, 'duration': duration, 'throughput': throughput})
        print(f"Workers: {workers:<2} | Duration: {format_duration(duration):<10} | Throughput: {throughput:.2f} papers/sec")

# Run the async main function
await main()

throughput_df = pd.DataFrame(throughput_results)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.lineplot(data=throughput_df, x='workers', y='duration', marker='o', ax=axes[0])
axes[0].set_title('Batch Processing Duration')
axes[0].set_xlabel('Number of Parallel Workers')
axes[0].set_ylabel('Total Duration (seconds)')

sns.lineplot(data=throughput_df, x='workers', y='throughput', marker='o', ax=axes[1])
axes[1].set_title('Throughput vs. Parallel Workers')
axes[1].set_xlabel('Number of Parallel Workers')
axes[1].set_ylabel('Throughput (papers/second)')
plt.tight_layout()
plt.show()
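
To judge where adding workers stops paying off, the results can be extended with speedup and parallel efficiency relative to the single-worker run. This is a sketch only: `add_scaling_metrics` is a hypothetical helper, and the demo rows are made-up numbers that show the output shape, not measurements.

```python
def add_scaling_metrics(results):
    """Add speedup and parallel efficiency relative to the single-worker run."""
    baseline = next(r["duration"] for r in results if r["workers"] == 1)
    enriched = []
    for r in results:
        speedup = baseline / r["duration"]
        enriched.append({
            **r,
            "speedup": speedup,
            "efficiency": speedup / r["workers"],       # 1.0 means perfect scaling
            "papers_per_hour": r["throughput"] * 3600,  # papers/sec -> papers/hour
        })
    return enriched

# Made-up numbers purely to show the shape of the output.
demo = [
    {"workers": 1, "duration": 10.0, "throughput": 1.0},
    {"workers": 4, "duration": 4.0, "throughput": 2.5},
]
for row in add_scaling_metrics(demo):
    print(row)
```

Passing `throughput_results` from the cell above in place of `demo` would give the same columns for the measured runs.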

## Conclusion

This notebook provides the tools to analyze and optimize pipeline performance. Key takeaways:

1. **Identify Bottlenecks**: Stage-level latency profiling is crucial for focusing optimization efforts.
2. **Tune Parameters**: `temperature` and `max_tokens` offer a trade-off between speed, cost, and quality.
3. **Implement Caching**: Caching is the most effective strategy for high-hit-rate scenarios.
4. **Parallelize for Throughput**: Using parallel execution with a thread pool can significantly boost throughput, though returns diminish as external service limits are reached.

### Next Steps

1. **Load Testing**: Use a tool like Locust to simulate real-world load and identify scaling limits.
2. **Asyncio Native Client**: For maximum performance, use an `asyncio`-native HTTP client (like `aiohttp`) instead of a thread pool.
3. **Cost vs. Performance**: Integrate cost analysis from `cost_economics_analysis.ipynb` to find the optimal balance between speed and cost.
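
As a starting point for the asyncio-native step, the sketch below bounds concurrency with `asyncio.Semaphore` rather than a thread pool. It assumes nothing about the real client: `fake_request` is a hypothetical placeholder that sleeps where an async HTTP call would go, and `run_bounded` is an illustrative name.

```python
import asyncio
import time

async def fake_request(text):
    # Placeholder for an async HTTP call made with an asyncio-native client.
    await asyncio.sleep(0.05)
    return f"summary: {text[:20]}"

async def run_bounded(texts, max_concurrency):
    """Process all texts with at most `max_concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(text):
        async with semaphore:
            return await fake_request(text)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(worker(t) for t in texts))

async def demo():
    texts = [f"paper {i}" for i in range(8)]
    start = time.perf_counter()
    results = await run_bounded(texts, max_concurrency=4)
    elapsed = time.perf_counter() - start
    return results, elapsed

# In a notebook cell, use `results, elapsed = await demo()` instead,
# because asyncio.run() cannot be called from a running event loop.
results, elapsed = asyncio.run(demo())
print(len(results), f"{elapsed:.2f}s")
```

The semaphore limit plays the role of `max_workers` in the thread-pool version and is the natural place to respect an external rate limit.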
