# WVA Model-Based Mode Analysis

This notebook analyzes WVA (Workload Variant Autoscaler) experiments running in **MODEL-ONLY mode**.

## Key Metrics

### Prediction vs Reality
- **Predicted ITL/TTFT** - From optimizer at reconciliation N (what WVA predicts)
- **Observed ITL/TTFT** - From Prometheus at reconciliation N+1 (what actually happened)
- **Prediction Error** - Observed minus predicted (positive means the optimizer underestimated latency)
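
As a minimal sketch of the sign convention used in the alignment cell of this notebook (error = observed minus predicted), with made-up numbers rather than experiment data:

```python
# Illustrative values only; not taken from any experiment.
predicted_itl_ms = 8.0   # optimizer estimate at reconciliation N
observed_itl_ms = 9.5    # Prometheus measurement at reconciliation N+1

# Error is observed minus predicted.
pred_error_itl_ms = observed_itl_ms - predicted_itl_ms

# A positive error means real latency was higher than the optimizer predicted.
print(f"ITL prediction error: {pred_error_itl_ms:+.1f} ms")
```

A negative error would mean the optimizer was pessimistic, which is the safer direction for SLO compliance but may cost extra replicas.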

### Scaling Behavior
- **Target Replicas** - What optimizer decided
- **Current Replicas** - Actual running replicas
- **Scaling Actions** - Scale-up, scale-down, no-change
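
The relationship between the three quantities can be sketched as follows. The controller logs the action itself, so `classify_action` is a hypothetical helper for illustration only; it assumes the action label simply reflects how the target compares with the current replica count.

```python
def classify_action(current: int, target: int) -> str:
    """Hypothetical helper: label a decision by comparing target with current replicas."""
    if target > current:
        return "scale-up"
    if target < current:
        return "scale-down"
    return "no-change"


# Made-up (current, target) pairs for illustration.
for current, target in [(2, 4), (4, 3), (3, 3)]:
    print(f"current={current} target={target} -> {classify_action(current, target)}")
```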

### SLO Compliance
- **SLO ITL** - Target inter-token latency (typically 10ms)
- **SLO TTFT** - Target time-to-first-token (typically 1000ms)
- **SLO Violations** - When observed metrics exceed SLOs
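
A minimal sketch of the violation check, assuming the typical SLO values quoted above and made-up observations. A sample counts as a violation when either observed metric exceeds its target.

```python
import pandas as pd

SLO_ITL_MS = 10      # typical value quoted above
SLO_TTFT_MS = 1000   # typical value quoted above

# Made-up observations for illustration.
obs = pd.DataFrame({
    "observed_itl": [8.2, 11.4, 9.9],
    "observed_ttft": [450.0, 700.0, 1200.0],
})

# Flag a sample when ITL or TTFT is above its SLO.
obs["any_slo_violated"] = (
    (obs["observed_itl"] > SLO_ITL_MS) | (obs["observed_ttft"] > SLO_TTFT_MS)
)
violation_pct = obs["any_slo_violated"].mean() * 100
print(f"{obs['any_slo_violated'].sum()}/{len(obs)} samples violate an SLO ({violation_pct:.1f}%)")
```

Comparisons against `NaN` evaluate to `False` in pandas, so samples with missing metrics are counted as compliant under this check.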

## Workflow

1. Load experiment data from timestamped directory
2. Parse WVA controller logs
3. Align predictions (reconciliation N) with observations (reconciliation N+1)
4. Visualize scaling behavior and performance
5. Analyze prediction accuracy and SLO compliance

## 1. Setup and Configuration

In [None]:
import json
import sys
import re
from pathlib import Path
from datetime import datetime, timedelta
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.patches import Rectangle
import numpy as np

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

print("✓ Libraries imported successfully")

## 2. Select Experiment Directory

In [None]:
# Auto-detect latest model-based experiment.
# Directory names are assumed to look like model-based-<load>-<YYYYMMDD>-<HHMMSS>
# (as in the example path), so sort on the trailing timestamp; a plain name sort
# would order by load profile first and could pick an older run.
EXAMPLE_DIR = './experiment-data/model-based-moderate-load-20251126-120000'
data_dir = Path('./experiment-data')
if data_dir.exists():
    experiments = sorted(data_dir.glob('model-based-*'),
                         key=lambda p: p.name.rsplit('-', 2)[-2:], reverse=True)
    if experiments:
        EXPERIMENT_DIR = str(experiments[0])
        print(f"✓ Auto-detected latest experiment: {experiments[0].name}")
    else:
        EXPERIMENT_DIR = EXAMPLE_DIR
        print("⚠ No model-based experiments found, using example path")
else:
    EXPERIMENT_DIR = EXAMPLE_DIR
    print("⚠ Data directory not found, using example path")

EXPERIMENT_DIR = Path(EXPERIMENT_DIR)
LOG_FILE = EXPERIMENT_DIR / 'wva-controller-logs.jsonl'
METRICS_CSV = EXPERIMENT_DIR / 'metrics.csv'

print(f"Experiment directory: {EXPERIMENT_DIR}")
print(f"Log file: {LOG_FILE}")
print(f"Metrics CSV: {METRICS_CSV}")

# Verify files exist
if not EXPERIMENT_DIR.exists():
    print(f"❌ Experiment directory not found: {EXPERIMENT_DIR}")
    print(f"Please run an experiment first with: ./run-experiment.sh experiment-configs/model-based-moderate.yaml")
elif not LOG_FILE.exists():
    print(f"❌ Log file not found: {LOG_FILE}")
else:
    print(f"✓ Files found")

## 3. Parse WVA Logs

Extract key events from WVA controller logs:
- **Predictions** - From optimizer solution (reconciliation N)
- **Observations** - From Prometheus metrics (reconciliation N+1)
- **Scaling Decisions** - Target replica changes
- **SLO Values** - ITL and TTFT thresholds
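
To make the extraction concrete, here is a small sketch that parses one log entry. The JSON keys (`ts`, `msg`) and field names (`itl`, `ttft`, `numRep`) are the ones the parser in this notebook looks for; the exact layout of the sample message is an assumption, and `field` is a hypothetical helper.

```python
import json
import re
from datetime import datetime

# Hypothetical log line; the message layout is assumed for illustration.
sample_line = (
    '{"level": "info", "ts": "2025-11-26T12:00:00Z", '
    '"msg": "Optimization solution: numRep=2 maxBatch=64 cost=40.0 itl=8.5 ttft=120.3"}'
)

log = json.loads(sample_line)
# fromisoformat on older Python versions rejects a trailing 'Z', so normalise it first.
ts = datetime.fromisoformat(log["ts"].replace("Z", "+00:00"))


def field(name, text, cast=float):
    """Return the numeric value following `name=` in text, or None if absent."""
    m = re.search(rf"\b{name}=([0-9]+(?:\.[0-9]+)?)", text)
    return cast(m.group(1)) if m else None


print(ts.isoformat(), field("itl", log["msg"]), field("ttft", log["msg"]),
      field("numRep", log["msg"], int))
```

Returning `None` for a missing field keeps the record in the dataset, so a partially formed message still contributes the values it does contain.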

In [None]:
def parse_optimization_solution(solution_str):
    """Extract predicted ITL and TTFT from optimization solution."""
    itl_match = re.search(r'itl=([0-9.]+)', solution_str)
    ttft_match = re.search(r'ttft=([0-9.]+)', solution_str)
    replicas_match = re.search(r'numRep=([0-9]+)', solution_str)
    maxbatch_match = re.search(r'maxBatch=([0-9]+)', solution_str)
    cost_match = re.search(r'cost=([0-9.]+)', solution_str)
    
    return {
        'predicted_itl': float(itl_match.group(1)) if itl_match else None,
        'predicted_ttft': float(ttft_match.group(1)) if ttft_match else None,
        'predicted_replicas': int(replicas_match.group(1)) if replicas_match else None,
        'max_batch': int(maxbatch_match.group(1)) if maxbatch_match else None,
        'cost': float(cost_match.group(1)) if cost_match else None
    }

# Parse logs
predictions = []
observations = []
scaling_decisions = []
slo_values = []

print("Parsing WVA controller logs...")
with open(LOG_FILE, 'r', encoding='utf-8') as f:  # messages contain non-ASCII markers such as '✓'
    for line_num, line in enumerate(f, 1):
        try:
            log = json.loads(line.strip())
            ts = log.get('ts', '')
            msg = log.get('msg', '')
            level = log.get('level', '')
            
            # Convert timestamp to datetime
            try:
                dt = datetime.fromisoformat(ts.replace('Z', '+00:00'))
            except (ValueError, AttributeError):
                # Skip entries whose timestamp is missing or not ISO 8601
                continue
            
            # Extract predictions from optimizer
            if 'Optimization solution' in msg:
                prediction = parse_optimization_solution(msg)
                predictions.append({
                    'timestamp': dt,
                    **prediction
                })
            
            # Extract scaling decisions
            elif 'Processing decision' in msg:
                current_match = re.search(r'current=([0-9]+)', msg)
                target_match = re.search(r'target=([0-9]+)', msg)
                action_match = re.search(r'action=([a-z-]+)', msg)
                
                scaling_decisions.append({
                    'timestamp': dt,
                    'current_replicas': int(current_match.group(1)) if current_match else None,
                    'target_replicas': int(target_match.group(1)) if target_match else None,
                    'action': action_match.group(1) if action_match else None
                })
            
            # Extract SLO values
            elif 'Found SLO for model' in msg:
                slo_itl_match = re.search(r'slo-tpot=([0-9]+)', msg)
                slo_ttft_match = re.search(r'slo-ttft=([0-9]+)', msg)
                
                slo_values.append({
                    'timestamp': dt,
                    'slo_itl': int(slo_itl_match.group(1)) if slo_itl_match else None,
                    'slo_ttft': int(slo_ttft_match.group(1)) if slo_ttft_match else None
                })
            
            # Extract observed metrics (from Prometheus)
            # Note: These come from the prepareVariantAutoscalings function
            elif '✓ Metrics collected for VA' in msg:
                replicas_match = re.search(r'replicas=([0-9]+)', msg)
                ttft_match = re.search(r'ttft=([0-9.]+)', msg)
                itl_match = re.search(r'itl=([0-9.]+)', msg)
                
                # The capture groups match digits and dots only, so any trailing
                # unit suffix such as 'ms' is already excluded.
                ttft_val = float(ttft_match.group(1)) if ttft_match else None
                itl_val = float(itl_match.group(1)) if itl_match else None
                
                observations.append({
                    'timestamp': dt,
                    'observed_replicas': int(replicas_match.group(1)) if replicas_match else None,
                    'observed_ttft': ttft_val,
                    'observed_itl': itl_val
                })
                    
        except json.JSONDecodeError:
            continue
        except Exception as e:
            print(f"Error parsing line {line_num}: {e}")
            continue

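# Convert to DataFrames. Explicit columns keep the schema stable even when a list
# is empty, so later cells can sort and filter without raising KeyError.
df_predictions = pd.DataFrame(predictions, columns=[
    'timestamp', 'predicted_itl', 'predicted_ttft', 'predicted_replicas', 'max_batch', 'cost'])
df_observations = pd.DataFrame(observations, columns=[
    'timestamp', 'observed_replicas', 'observed_ttft', 'observed_itl'])
df_scaling = pd.DataFrame(scaling_decisions, columns=[
    'timestamp', 'current_replicas', 'target_replicas', 'action'])
df_slo = pd.DataFrame(slo_values, columns=['timestamp', 'slo_itl', 'slo_ttft'])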

print(f"\n✓ Parsed {len(df_predictions)} predictions")
print(f"✓ Parsed {len(df_observations)} observations")
print(f"✓ Parsed {len(df_scaling)} scaling decisions")
print(f"✓ Parsed {len(df_slo)} SLO entries")

# Show samples
if len(df_predictions) > 0:
    print("\nSample prediction:")
    print(df_predictions.head(1).to_string())

if len(df_observations) > 0:
    print("\nSample observation:")
    print(df_observations.head(1).to_string())

## 4. Align Predictions with Observations

**Key Concept**: A prediction made at reconciliation N is evaluated against the metrics observed at reconciliation N+1.

- The optimizer runs at time T and predicts ITL/TTFT
- Those predictions describe the system once the decision has taken effect, i.e. at the NEXT reconciliation
- Metrics observed at time T+interval are therefore the ground truth for the prediction made at T

We align each prediction with the first observation that follows it to measure accuracy.
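
The same pairing can also be expressed with `pandas.merge_asof`, shown here as a sketch on made-up data. It is an alternative to the explicit loop used in the next cell, not the notebook's own method.

```python
import pandas as pd

# Made-up timestamps and values for illustration.
preds = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-11-26 12:00:00", "2025-11-26 12:01:00"]),
    "predicted_itl": [8.0, 9.0],
})
obs = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-11-26 12:00:58", "2025-11-26 12:01:59"]),
    "observed_itl": [8.6, 8.7],
})

# merge_asof needs both frames sorted on their keys. direction="forward" picks the
# first observation at or after each prediction; allow_exact_matches=False makes
# the match strictly after, like the `>` comparison in the loop-based version.
pairs = pd.merge_asof(
    preds.sort_values("timestamp"),
    obs.sort_values("timestamp").rename(columns={"timestamp": "observation_time"}),
    left_on="timestamp",
    right_on="observation_time",
    direction="forward",
    allow_exact_matches=False,
)
pairs["pred_error_itl"] = pairs["observed_itl"] - pairs["predicted_itl"]
print(pairs)
```

Unlike the loop, `merge_asof` keeps predictions that have no later observation and fills their observed columns with `NaN`, so drop those rows if only complete pairs are wanted.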

In [None]:
# Sort by timestamp
df_predictions = df_predictions.sort_values('timestamp').reset_index(drop=True)
df_observations = df_observations.sort_values('timestamp').reset_index(drop=True)
df_scaling = df_scaling.sort_values('timestamp').reset_index(drop=True)

# Align predictions with next observation
aligned_data = []

for i in range(len(df_predictions)):
    pred_row = df_predictions.iloc[i]
    pred_time = pred_row['timestamp']
    
    # Find the next observation after this prediction
    future_obs = df_observations[df_observations['timestamp'] > pred_time]
    
    if len(future_obs) > 0:
        obs_row = future_obs.iloc[0]
        
        # Compute prediction error (observed - predicted).
        # pandas stores missing numeric values as NaN, so test with pd.notna
        # rather than `is not None`.
        pred_error_itl = np.nan
        pred_error_ttft = np.nan
        
        if pd.notna(pred_row['predicted_itl']) and pd.notna(obs_row['observed_itl']):
            pred_error_itl = obs_row['observed_itl'] - pred_row['predicted_itl']
        
        if pd.notna(pred_row['predicted_ttft']) and pd.notna(obs_row['observed_ttft']):
            pred_error_ttft = obs_row['observed_ttft'] - pred_row['predicted_ttft']
        
        aligned_data.append({
            'prediction_time': pred_time,
            'observation_time': obs_row['timestamp'],
            'time_delta_seconds': (obs_row['timestamp'] - pred_time).total_seconds(),
            'predicted_itl': pred_row['predicted_itl'],
            'observed_itl': obs_row['observed_itl'],
            'pred_error_itl': pred_error_itl,
            'predicted_ttft': pred_row['predicted_ttft'],
            'observed_ttft': obs_row['observed_ttft'],
            'pred_error_ttft': pred_error_ttft,
            'predicted_replicas': pred_row['predicted_replicas'],
            'observed_replicas': obs_row['observed_replicas'],
        })

df_aligned = pd.DataFrame(aligned_data)

print(f"✓ Created {len(df_aligned)} aligned prediction-observation pairs")
print(f"\nSample aligned data:")
if len(df_aligned) > 0:
    print(df_aligned.head(3).to_string())

## 5. Add SLO Information

In [None]:
# Get SLO values (should be constant throughout experiment)
if len(df_slo) > 0:
    SLO_ITL = df_slo['slo_itl'].iloc[0]
    SLO_TTFT = df_slo['slo_ttft'].iloc[0]
else:
    # Default values
    SLO_ITL = 10
    SLO_TTFT = 1000

print(f"SLO ITL: {SLO_ITL} ms")
print(f"SLO TTFT: {SLO_TTFT} ms")

# Add SLO violation flags to aligned data
if len(df_aligned) > 0:
    df_aligned['slo_itl_violated'] = df_aligned['observed_itl'] > SLO_ITL
    df_aligned['slo_ttft_violated'] = df_aligned['observed_ttft'] > SLO_TTFT
    df_aligned['any_slo_violated'] = df_aligned['slo_itl_violated'] | df_aligned['slo_ttft_violated']
    
    violation_count = df_aligned['any_slo_violated'].sum()
    violation_pct = (violation_count / len(df_aligned)) * 100
    
    print(f"\nSLO Violations: {violation_count}/{len(df_aligned)} ({violation_pct:.1f}%)")

## 6. Summary Statistics

In [None]:
if len(df_aligned) > 0:
    print("="*60)
    print("PREDICTION ACCURACY SUMMARY")
    print("="*60)
    
    print("\nITL (Inter-Token Latency) Predictions:")
    print(f"  Mean Error: {df_aligned['pred_error_itl'].mean():.2f} ms")
    print(f"  Mean Abs Error: {df_aligned['pred_error_itl'].abs().mean():.2f} ms")
    print(f"  Std Dev: {df_aligned['pred_error_itl'].std():.2f} ms")
    print(f"  Min Error: {df_aligned['pred_error_itl'].min():.2f} ms")
    print(f"  Max Error: {df_aligned['pred_error_itl'].max():.2f} ms")
    
    print("\nTTFT (Time to First Token) Predictions:")
    print(f"  Mean Error: {df_aligned['pred_error_ttft'].mean():.2f} ms")
    print(f"  Mean Abs Error: {df_aligned['pred_error_ttft'].abs().mean():.2f} ms")
    print(f"  Std Dev: {df_aligned['pred_error_ttft'].std():.2f} ms")
    print(f"  Min Error: {df_aligned['pred_error_ttft'].min():.2f} ms")
    print(f"  Max Error: {df_aligned['pred_error_ttft'].max():.2f} ms")
    
    print("\n" + "="*60)
    print("SCALING BEHAVIOR SUMMARY")
    print("="*60)
    
    if len(df_scaling) > 0:
        action_counts = df_scaling['action'].value_counts()
        print("\nScaling Actions:")
        for action, count in action_counts.items():
            pct = (count / len(df_scaling)) * 100
            print(f"  {action}: {count} ({pct:.1f}%)")
        
        print(f"\nReplica Range: {df_scaling['current_replicas'].min()} - {df_scaling['current_replicas'].max()}")
else:
    print("No aligned data available for statistics")

## 7. Visualization: Prediction vs Observation

In [None]:
# Create plots directory if it doesn't exist
plots_dir = EXPERIMENT_DIR / 'plots'
plots_dir.mkdir(exist_ok=True)
print(f"✓ Plots directory ready: {plots_dir}")

In [None]:
if len(df_aligned) > 0:
    fig, axes = plt.subplots(2, 1, figsize=(14, 10))
    
    # Calculate relative times (minutes from start)
    start_time = df_aligned['prediction_time'].min()
    df_aligned['time_minutes'] = (df_aligned['prediction_time'] - start_time).dt.total_seconds() / 60
    
    # ITL Plot
    ax = axes[0]
    ax.plot(df_aligned['time_minutes'], df_aligned['predicted_itl'], 
            'b-o', label='Predicted ITL', markersize=4, linewidth=1.5)
    ax.plot(df_aligned['time_minutes'], df_aligned['observed_itl'], 
            'r-s', label='Observed ITL', markersize=4, linewidth=1.5)
    ax.axhline(y=SLO_ITL, color='orange', linestyle='--', label=f'SLO ITL ({SLO_ITL}ms)', linewidth=2)
    
    # Highlight SLO violations
    violations = df_aligned[df_aligned['slo_itl_violated']]
    if len(violations) > 0:
        ax.scatter(violations['time_minutes'], violations['observed_itl'], 
                   color='red', s=100, marker='x', linewidth=3, label='SLO Violation', zorder=5)
    
    ax.set_xlabel('Time (minutes)', fontsize=12)
    ax.set_ylabel('ITL (ms)', fontsize=12)
    ax.set_title('Inter-Token Latency: Predicted vs Observed', fontsize=14, fontweight='bold')
    ax.legend(loc='best', fontsize=10)
    ax.grid(True, alpha=0.3)
    
    # TTFT Plot
    ax = axes[1]
    ax.plot(df_aligned['time_minutes'], df_aligned['predicted_ttft'], 
            'b-o', label='Predicted TTFT', markersize=4, linewidth=1.5)
    ax.plot(df_aligned['time_minutes'], df_aligned['observed_ttft'], 
            'r-s', label='Observed TTFT', markersize=4, linewidth=1.5)
    ax.axhline(y=SLO_TTFT, color='orange', linestyle='--', label=f'SLO TTFT ({SLO_TTFT}ms)', linewidth=2)
    
    # Highlight SLO violations
    violations = df_aligned[df_aligned['slo_ttft_violated']]
    if len(violations) > 0:
        ax.scatter(violations['time_minutes'], violations['observed_ttft'], 
                   color='red', s=100, marker='x', linewidth=3, label='SLO Violation', zorder=5)
    
    ax.set_xlabel('Time (minutes)', fontsize=12)
    ax.set_ylabel('TTFT (ms)', fontsize=12)
    ax.set_title('Time to First Token: Predicted vs Observed', fontsize=14, fontweight='bold')
    ax.legend(loc='best', fontsize=10)
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.savefig(EXPERIMENT_DIR / 'plots' / 'prediction_vs_observation.png', dpi=150, bbox_inches='tight')
    plt.show()
else:
    print("No data to plot")

## 8. Visualization: Scaling Behavior

In [None]:
if len(df_scaling) > 0:
    fig, ax = plt.subplots(figsize=(14, 6))
    
    # Calculate relative times
    start_time = df_scaling['timestamp'].min()
    df_scaling['time_minutes'] = (df_scaling['timestamp'] - start_time).dt.total_seconds() / 60
    
    # Plot current and target replicas
    ax.plot(df_scaling['time_minutes'], df_scaling['current_replicas'], 
            'b-o', label='Current Replicas', markersize=6, linewidth=2)
    ax.plot(df_scaling['time_minutes'], df_scaling['target_replicas'], 
            'r--s', label='Target Replicas', markersize=6, linewidth=2, alpha=0.7)
    
    # Annotate scaling actions
    scale_up = df_scaling[df_scaling['action'] == 'scale-up']
    scale_down = df_scaling[df_scaling['action'] == 'scale-down']
    
    if len(scale_up) > 0:
        ax.scatter(scale_up['time_minutes'], scale_up['target_replicas'], 
                   color='green', s=150, marker='^', label='Scale Up', zorder=5)
    
    if len(scale_down) > 0:
        ax.scatter(scale_down['time_minutes'], scale_down['target_replicas'], 
                   color='orange', s=150, marker='v', label='Scale Down', zorder=5)
    
    ax.set_xlabel('Time (minutes)', fontsize=12)
    ax.set_ylabel('Number of Replicas', fontsize=12)
    ax.set_title('WVA Model-Based Scaling Behavior', fontsize=14, fontweight='bold')
    ax.legend(loc='best', fontsize=11)
    ax.grid(True, alpha=0.3)
    ax.set_ylim(bottom=0)
    
    # Use integer y-axis
    ax.yaxis.set_major_locator(plt.MaxNLocator(integer=True))
    
    plt.tight_layout()
    plt.savefig(EXPERIMENT_DIR / 'plots' / 'scaling_behavior.png', dpi=150, bbox_inches='tight')
    plt.show()
else:
    print("No scaling data to plot")

## 9. Visualization: Prediction Error Distribution

In [None]:
if len(df_aligned) > 0:
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # ITL Error Distribution
    ax = axes[0]
    errors = df_aligned['pred_error_itl'].dropna()
    if len(errors) > 0:
        ax.hist(errors, bins=20, color='steelblue', edgecolor='black', alpha=0.7)
        ax.axvline(x=0, color='red', linestyle='--', linewidth=2, label='Perfect Prediction')
        ax.axvline(x=errors.mean(), color='orange', linestyle='--', linewidth=2, 
                   label=f'Mean Error: {errors.mean():.2f}ms')
        ax.set_xlabel('Prediction Error (ms)', fontsize=12)
        ax.set_ylabel('Frequency', fontsize=12)
        ax.set_title('ITL Prediction Error Distribution', fontsize=13, fontweight='bold')
        ax.legend(fontsize=10)
        ax.grid(True, alpha=0.3, axis='y')
    
    # TTFT Error Distribution
    ax = axes[1]
    errors = df_aligned['pred_error_ttft'].dropna()
    if len(errors) > 0:
        ax.hist(errors, bins=20, color='steelblue', edgecolor='black', alpha=0.7)
        ax.axvline(x=0, color='red', linestyle='--', linewidth=2, label='Perfect Prediction')
        ax.axvline(x=errors.mean(), color='orange', linestyle='--', linewidth=2, 
                   label=f'Mean Error: {errors.mean():.2f}ms')
        ax.set_xlabel('Prediction Error (ms)', fontsize=12)
        ax.set_ylabel('Frequency', fontsize=12)
        ax.set_title('TTFT Prediction Error Distribution', fontsize=13, fontweight='bold')
        ax.legend(fontsize=10)
        ax.grid(True, alpha=0.3, axis='y')
    
    plt.tight_layout()
    plt.savefig(EXPERIMENT_DIR / 'plots' / 'prediction_error_distribution.png', dpi=150, bbox_inches='tight')
    plt.show()
else:
    print("No data to plot")

## 10. Export Processed Data

In [None]:
# Save processed data
if len(df_aligned) > 0:
    output_file = EXPERIMENT_DIR / 'processed_data.csv'
    df_aligned.to_csv(output_file, index=False)
    print(f"✓ Saved processed data to: {output_file}")

# Create summary report
if len(df_aligned) > 0:
    summary_file = EXPERIMENT_DIR / 'ANALYSIS_SUMMARY.md'
    
    with open(summary_file, 'w') as f:
        f.write(f"# WVA Model-Based Experiment Analysis\n\n")
        f.write(f"**Experiment:** {EXPERIMENT_DIR.name}\n\n")
        f.write(f"**Generated:** {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
        f.write(f"---\n\n")
        
        f.write(f"## Summary Statistics\n\n")
        f.write(f"- **Total Predictions:** {len(df_predictions)}\n")
        f.write(f"- **Total Observations:** {len(df_observations)}\n")
        f.write(f"- **Aligned Pairs:** {len(df_aligned)}\n")
        f.write(f"- **Scaling Decisions:** {len(df_scaling)}\n\n")
        
        f.write(f"## SLO Configuration\n\n")
        f.write(f"- **ITL SLO:** {SLO_ITL} ms\n")
        f.write(f"- **TTFT SLO:** {SLO_TTFT} ms\n\n")
        
        violation_count = df_aligned['any_slo_violated'].sum()
        violation_pct = (violation_count / len(df_aligned)) * 100
        f.write(f"- **SLO Violations:** {violation_count}/{len(df_aligned)} ({violation_pct:.1f}%)\n\n")
        
        f.write(f"## Prediction Accuracy\n\n")
        f.write(f"### ITL Predictions\n\n")
        f.write(f"- Mean Error: {df_aligned['pred_error_itl'].mean():.2f} ms\n")
        f.write(f"- Mean Abs Error: {df_aligned['pred_error_itl'].abs().mean():.2f} ms\n")
        f.write(f"- Std Dev: {df_aligned['pred_error_itl'].std():.2f} ms\n\n")
        
        f.write(f"### TTFT Predictions\n\n")
        f.write(f"- Mean Error: {df_aligned['pred_error_ttft'].mean():.2f} ms\n")
        f.write(f"- Mean Abs Error: {df_aligned['pred_error_ttft'].abs().mean():.2f} ms\n")
        f.write(f"- Std Dev: {df_aligned['pred_error_ttft'].std():.2f} ms\n\n")
        
        if len(df_scaling) > 0:
            f.write(f"## Scaling Behavior\n\n")
            action_counts = df_scaling['action'].value_counts()
            for action, count in action_counts.items():
                pct = (count / len(df_scaling)) * 100
                f.write(f"- {action}: {count} ({pct:.1f}%)\n")
            f.write(f"\n- Replica Range: {df_scaling['current_replicas'].min()} - {df_scaling['current_replicas'].max()}\n\n")
        
        f.write(f"## Files Generated\n\n")
        f.write(f"- `processed_data.csv` - Aligned prediction-observation pairs\n")
        f.write(f"- `plots/prediction_vs_observation.png` - ITL/TTFT over time\n")
        f.write(f"- `plots/scaling_behavior.png` - Replica scaling timeline\n")
        f.write(f"- `plots/prediction_error_distribution.png` - Error histograms\n")
    
    print(f"✓ Saved summary report to: {summary_file}")

print("\n" + "="*60)
print("ANALYSIS COMPLETE")
print("="*60)
print(f"Results saved to: {EXPERIMENT_DIR}")