# 🚀 vLLM Performance Benchmark - GuideLLM Results

## Red Hat OpenShift AI - Stage 1: Sovereign AI Foundation

---

This notebook presents **GuideLLM benchmark results** comparing two deployments of the same model:

- **Quantized Model:** Mistral 24B quantized (w4a16) on 1 GPU (g6.4xlarge)
- **Full Precision Model:** Mistral 24B full on 4 GPUs (g6.12xlarge)

**Configured Test Levels:** 1, 5, 10, 25, 50 concurrent requests

**Metrics:**
- 📊 Throughput (tokens/second)
- ⏱️ Latency (TTFT - Time To First Token P99, ITL - Inter-Token Latency P50)
- 💰 Cost efficiency
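
For readers new to the latency metrics, the following is a minimal sketch of how TTFT and ITL can be derived for a single request from token arrival times. The function name `latency_metrics` and the sample timestamps are illustrative assumptions, not GuideLLM's implementation; the benchmark report aggregates such per-request values across many requests, and this notebook reads the P99 of TTFT and the median of ITL.

```python
from statistics import median

def latency_metrics(request_start_ms, token_times_ms):
    """Return (ttft_ms, median_itl_ms) for one request.

    request_start_ms: time the request was sent, in milliseconds.
    token_times_ms: arrival time of each generated token, in milliseconds.
    """
    if not token_times_ms:
        raise ValueError("need at least one token timestamp")
    # TTFT: delay between sending the request and receiving the first token
    ttft_ms = token_times_ms[0] - request_start_ms
    # ITL: gaps between consecutive tokens after the first one
    gaps_ms = [b - a for a, b in zip(token_times_ms, token_times_ms[1:])]
    itl_ms = median(gaps_ms) if gaps_ms else None
    return ttft_ms, itl_ms

# Hypothetical timestamps, for illustration only
ttft, itl = latency_metrics(0, [120, 150, 185, 215])
print(f"TTFT: {ttft} ms, median ITL: {itl} ms")
```

A low TTFT matters most for interactive use, while ITL determines how fast the rest of the response streams once generation has started.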

In [None]:
# Setup
import json
import pandas as pd
from pathlib import Path

print("✅ Libraries loaded")

---

## 📊 Load GuideLLM Benchmark Results

In [None]:
# Load benchmark results from PVC
results_dir = Path('/results')
quantized_file = results_dir / 'mistral-24b-quantized-benchmark.json'
full_file = results_dir / 'mistral-24b-benchmark.json'

if quantized_file.exists() and full_file.exists():
    with open(quantized_file) as f:
        quantized_data = json.load(f)
    with open(full_file) as f:
        full_data = json.load(f)
    print("✅ Benchmark results loaded successfully")
else:
    print("❌ Benchmark results not found. Run the GuideLLM jobs first.")
    quantized_data = None
    full_data = None

---

## 📈 Extract and Map to Target Concurrency Levels

In [None]:
# Target concurrency levels we configured
TARGET_LEVELS = [1, 5, 10, 25, 50]

def map_to_target_level(actual_conc):
    """Map actual concurrency to nearest target level"""
    if actual_conc < 1:
        return None  # Skip warmup/zero levels
    # Find nearest target level
    return min(TARGET_LEVELS, key=lambda x: abs(x - actual_conc))

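def extract_metrics(data, model_name):
    """Collect throughput and latency per target concurrency level from a GuideLLM report.

    model_name is accepted for labelling by callers but is not used here.
    """
    results = {}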
    for benchmark in data['benchmarks']:
        m = benchmark['metrics']
        actual_conc = m['request_concurrency']['successful']['mean']
        target_conc = map_to_target_level(actual_conc)
        
        if target_conc is None:
            continue  # Skip warmup phases
        
        # Keep the run whose measured concurrency is closest to the target level
        is_closer = (
            target_conc not in results
            or abs(actual_conc - target_conc) < abs(results[target_conc]['_actual'] - target_conc)
        )
        if is_closer:
            results[target_conc] = {
                '_actual': actual_conc,
                'Throughput': round(m['tokens_per_second']['successful']['mean'], 1),
                'TTFT_P99': round(m['time_to_first_token_ms']['successful']['percentiles']['p99'], 1),
                'ITL_P50': round(m['inter_token_latency_ms']['successful']['median'], 1),
            }
    return results

if quantized_data and full_data:
    quant_metrics = extract_metrics(quantized_data, 'Quantized')
    full_metrics = extract_metrics(full_data, 'Full')
    
    # Build metric-first comparison table using TARGET_LEVELS
    comparison = []
    for target in TARGET_LEVELS:
        row = {'Concurrency': target}
        
        # Throughput (Full then Quantized)
        if target in full_metrics:
            row['Full Thr'] = full_metrics[target]['Throughput']
        if target in quant_metrics:
            row['Quant Thr'] = quant_metrics[target]['Throughput']
        
        # TTFT P99 (Full then Quantized)
        if target in full_metrics:
            row['Full TTFT'] = full_metrics[target]['TTFT_P99']
        if target in quant_metrics:
            row['Quant TTFT'] = quant_metrics[target]['TTFT_P99']
        
        # ITL P50 (Full then Quantized)
        if target in full_metrics:
            row['Full ITL'] = full_metrics[target]['ITL_P50']
        if target in quant_metrics:
            row['Quant ITL'] = quant_metrics[target]['ITL_P50']
        
        # Only include row if we have data
        if len(row) > 1:
            comparison.append(row)
    
    df_comparison = pd.DataFrame(comparison)
    print(f"✅ Comparison table ready ({len(comparison)} concurrency levels)")
else:
    df_comparison = None
    quant_metrics = None
    full_metrics = None
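
The nearest-level mapping used above can be checked in isolation. The sketch below re-declares the rule under the hypothetical name `nearest_target` so that it runs on its own; the sample concurrency values are made up for illustration and do not come from the benchmark files.

```python
TARGET_LEVELS = [1, 5, 10, 25, 50]

def nearest_target(actual_conc, levels=TARGET_LEVELS):
    """Map a measured mean concurrency to the closest configured level."""
    if actual_conc < 1:
        return None  # treated as warmup, same rule as in the notebook
    # min() returns the first minimum, so ties go to the lower level
    return min(levels, key=lambda level: abs(level - actual_conc))

# Hypothetical measured values, for illustration only
for measured in [0.4, 4.8, 7.5, 38, 120]:
    print(measured, "->", nearest_target(measured))
```

Two properties are worth keeping in mind when reading the comparison table: a measured value exactly halfway between two levels (such as 7.5) is assigned to the lower one, and any measured mean below 1, including one only slightly below 1, is skipped as warmup.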

---

## 📊 Performance Results

**Column Legend:**
- **Thr** = Throughput (tokens/second)
- **TTFT** = Time To First Token P99 (milliseconds)
- **ITL** = Inter-Token Latency P50 (milliseconds)

For each metric, **Full** (4 GPUs) is shown first, then **Quant** (1 GPU).

In [None]:
if df_comparison is not None:
    display(df_comparison)
else:
    print("❌ No data")

---

## 💰 Cost Analysis

In [None]:
if quant_metrics and full_metrics:
    cost_1gpu = 1.84  # g6.4xlarge $/hour
    cost_4gpu = 5.52  # g6.12xlarge $/hour
    
    # Use concurrency level 10 for cost comparison
    target_conc = 10
    
    if target_conc in quant_metrics and target_conc in full_metrics:
        tps_quant = quant_metrics[target_conc]['Throughput']
        tps_full = full_metrics[target_conc]['Throughput']
        
        cost_quant_1m = (cost_1gpu / (tps_quant * 3600)) * 1_000_000
        cost_full_1m = (cost_4gpu / (tps_full * 3600)) * 1_000_000
        
        cost_df = pd.DataFrame([
            {
                'Model': 'Full (4 GPUs)',
                'Instance': 'g6.12xlarge',
                'GPUs': 4,
                '$/Hour': f'${cost_4gpu:.2f}',
                f'Tok/s @{target_conc}': f'{tps_full:.0f}',
                '$ per 1M Tokens': f'${cost_full_1m:.2f}'
            },
            {
                'Model': 'Quant (1 GPU)',
                'Instance': 'g6.4xlarge',
                'GPUs': 1,
                '$/Hour': f'${cost_1gpu:.2f}',
                f'Tok/s @{target_conc}': f'{tps_quant:.0f}',
                '$ per 1M Tokens': f'${cost_quant_1m:.2f}'
            }
        ])
        
        display(cost_df)
        
        savings = ((cost_full_1m - cost_quant_1m) / cost_full_1m) * 100
        print(f"\n💰 Quantized model: {savings:.0f}% lower cost per token")
        print("⚡ Quantized model: 75% fewer GPUs (1 vs 4)")
    else:
        print(f"⚠️  Concurrency level {target_conc} not available in results")
else:
    print("❌ No metrics for cost analysis")
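
The cost formula in the cell above divides the hourly instance price by the tokens produced per hour, then scales to one million tokens. As a minimal sketch, the same arithmetic can be wrapped in a helper; the name `cost_per_million_tokens` and the throughput figures below are placeholders chosen for illustration, not measured results. Only the hourly prices are taken from this notebook.

```python
def cost_per_million_tokens(hourly_cost_usd, tokens_per_second):
    """Dollar cost to generate one million tokens at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# Hourly prices as used in this notebook; throughputs are hypothetical
full_cost = cost_per_million_tokens(5.52, 400.0)
quant_cost = cost_per_million_tokens(1.84, 200.0)
savings_pct = (full_cost - quant_cost) / full_cost * 100

print(f"Full:  ${full_cost:.2f} per 1M tokens")
print(f"Quant: ${quant_cost:.2f} per 1M tokens")
print(f"Savings: {savings_pct:.0f}%")
```

With the hourly prices used here, the ratio 1.84 / 5.52 is one third, so the quantized deployment comes out cheaper per token whenever its throughput exceeds one third of the full-precision throughput at the same concurrency level.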