# GPU Demand vs Capacity Analytics

## Identifying Imbalances Between Workload Demand and GPU Capacity

This notebook demonstrates how to analyze GPU demand-versus-capacity imbalances in Kueue-managed Kubernetes clusters using:

- **Kueue LocalQueue metrics** (demand signals)
- **NVIDIA DCGM metrics** (capacity/efficiency signals)
- **Nodepool state** (inventory)

### Core Analytical Question

> **Where, when, and why does queued or unmet workload demand diverge from actual GPU capacity or effective utilization at the node pool level?**

---

## 1. Setup and Configuration

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from datetime import datetime
import warnings
import sys
from pathlib import Path

# Add the current working directory to the import path so that `src` is importable
sys.path.insert(0, str(Path('.').absolute()))

# Import project modules
from src.analysis.metrics import add_efficiency_metrics, GPU_SPECS
from src.analysis.imbalance import (
    calculate_all_imbalance_metrics,
    identify_top_contributors,
)
from src.analysis.aggregations import build_unified_model, create_time_series_summary
from src.visualization.charts import (
    plot_utilization_vs_power_intensity,
    plot_imbalance_heatmap,
    plot_demand_vs_capacity_timeseries,
    plot_top_contributors,
    plot_efficiency_distribution,
)
from src.visualization.styles import setup_matplotlib_defaults, COLORS

warnings.filterwarnings('ignore')
setup_matplotlib_defaults()

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Setup complete!")

## 2. Load and Validate Synthetic Data

Load the three synthetic datasets:
- **kueue_metrics.csv**: Queue demand signals (pending workloads, wait times)
- **dcgm_metrics.csv**: GPU efficiency signals (utilization, power, memory)
- **nodepool_state.csv**: Capacity inventory (GPU counts per nodegroup)
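
Before loading, it can help to confirm that each file carries the columns the later cells read. The sketch below is a minimal check, not part of the project code: `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, and the column sets cover only the fields this notebook references, so the real CSVs may contain more.

```python
import pandas as pd

# Columns this notebook reads from each dataset. The full schema of the
# CSV files is an assumption; only fields used later are listed here.
REQUIRED_COLUMNS = {
    "kueue": {"timestamp", "timestamp_hour", "nodegroup", "pending_workloads"},
    "dcgm": {
        "timestamp", "timestamp_hour", "nodegroup", "gpu_model",
        "gpu_utilization_pct", "power_usage_watts",
    },
    "nodepool": {"timestamp", "timestamp_hour", "nodegroup"},
}


def missing_columns(df: pd.DataFrame, dataset: str) -> set:
    """Return the required columns that are absent from df."""
    return REQUIRED_COLUMNS[dataset] - set(df.columns)


# Tiny illustrative frame (not real data) that lacks `timestamp_hour`.
example = pd.DataFrame({"timestamp": [], "nodegroup": [], "pending_workloads": []})
print(missing_columns(example, "kueue"))
```

An empty set means the frame is safe to pass to the downstream cells.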

In [None]:
# Load datasets
DATA_DIR = Path('data/synthetic')

kueue_df = pd.read_csv(DATA_DIR / 'kueue_metrics.csv', parse_dates=['timestamp', 'timestamp_hour'])
dcgm_df = pd.read_csv(DATA_DIR / 'dcgm_metrics.csv', parse_dates=['timestamp', 'timestamp_hour'])
nodepool_df = pd.read_csv(DATA_DIR / 'nodepool_state.csv', parse_dates=['timestamp', 'timestamp_hour'])

print("Datasets Loaded:")
print(f"  Kueue metrics: {len(kueue_df):,} rows")
print(f"  DCGM metrics: {len(dcgm_df):,} rows")
print(f"  Nodepool state: {len(nodepool_df):,} rows")
print(f"\nDate range: {kueue_df['timestamp'].min()} to {kueue_df['timestamp'].max()}")
print(f"\nNodegroups: {kueue_df['nodegroup'].unique().tolist()}")
print(f"GPU Models: {dcgm_df['gpu_model'].unique().tolist()}")

In [None]:
# Sanity checks
print("=" * 60)
print("SANITY CHECKS")
print("=" * 60)

# Check for nulls
print(f"\nNull values in Kueue: {kueue_df.isnull().sum().sum()}")
print(f"Null values in DCGM: {dcgm_df.isnull().sum().sum()}")
print(f"Null values in Nodepool: {nodepool_df.isnull().sum().sum()}")

# Check ranges
print(f"\nGPU utilization range: {dcgm_df['gpu_utilization_pct'].min():.1f}% - {dcgm_df['gpu_utilization_pct'].max():.1f}%")
print(f"Power usage range: {dcgm_df['power_usage_watts'].min():.0f}W - {dcgm_df['power_usage_watts'].max():.0f}W")
print(f"Pending workloads range: {kueue_df['pending_workloads'].min()} - {kueue_df['pending_workloads'].max()}")

# Check label consistency
kueue_nodegroups = set(kueue_df['nodegroup'].unique())
dcgm_nodegroups = set(dcgm_df['nodegroup'].unique())
nodepool_nodegroups = set(nodepool_df['nodegroup'].unique())

print(f"\nNodegroup consistency:")
print(f"  Kueue nodegroups: {len(kueue_nodegroups)}")
print(f"  DCGM nodegroups: {len(dcgm_nodegroups)}")
print(f"  Nodepool nodegroups: {len(nodepool_nodegroups)}")
print(f"  All match: {kueue_nodegroups == dcgm_nodegroups == nodepool_nodegroups}")

## 3. Calculate Efficiency Metrics

Add derived efficiency metrics to the DCGM data:

- **Power Intensity Factor (PIF)**: `power / max_power` - proxy for actual compute work
- **Realized TFLOPS**: `achievable_tflops × PIF` - actual throughput
- **RFU %**: `(realized / achievable) × 100` - true efficiency
- **Efficiency Gap**: `GPU_util% - RFU%` - hidden waste
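
The four definitions above can be sketched in pandas as follows. This is an illustration under stated assumptions, not the body of `add_efficiency_metrics`: the `EXAMPLE_SPECS` table, the model name `gpu-a`, its numbers, and the clipping of PIF to `[0, 1]` are all hypothetical.

```python
import pandas as pd

# Hypothetical spec table: maximum board power (W) and achievable TFLOPS.
# The real GPU_SPECS in src.analysis.metrics may differ in shape and values.
EXAMPLE_SPECS = {
    "gpu-a": {"max_power_watts": 400.0, "achievable_tflops": 300.0},
}


def sketch_efficiency_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Derive PIF, realized TFLOPS, RFU % and efficiency gap per sample."""
    out = df.copy()
    max_power = out["gpu_model"].map(lambda m: EXAMPLE_SPECS[m]["max_power_watts"])
    achievable = out["gpu_model"].map(lambda m: EXAMPLE_SPECS[m]["achievable_tflops"])
    # PIF = power / max_power. Clipping to [0, 1] is an assumption that
    # guards against readings above the rated maximum.
    out["power_intensity_factor"] = (out["power_usage_watts"] / max_power).clip(0, 1)
    out["realized_tflops"] = achievable * out["power_intensity_factor"]
    out["rfu_pct"] = out["realized_tflops"] / achievable * 100
    # Gap in percentage points between reported utilization and RFU.
    out["efficiency_gap"] = out["gpu_utilization_pct"] - out["rfu_pct"]
    return out


# Two illustrative samples: both report high utilization, one draws little power.
sample = pd.DataFrame({
    "gpu_model": ["gpu-a", "gpu-a"],
    "gpu_utilization_pct": [90.0, 95.0],
    "power_usage_watts": [320.0, 120.0],
})
result = sketch_efficiency_metrics(sample)
print(result[["power_intensity_factor", "rfu_pct", "efficiency_gap"]])
```

Because realized TFLOPS is defined as achievable TFLOPS times PIF, RFU % reduces to PIF × 100 under these definitions. The efficiency gap therefore measures how far reported utilization sits above relative power draw.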

In [None]:
# Add efficiency metrics
dcgm_df = add_efficiency_metrics(dcgm_df)

print("Efficiency Metrics Added:")
print(f"  - power_intensity_factor (PIF)")
print(f"  - realized_tflops")
print(f"  - rfu_pct (Realized TFLOPS Utilization %)")
print(f"  - efficiency_gap")
print(f"  - efficiency_class")

print("\nEfficiency Summary:")
print(f"  Average GPU Utilization: {dcgm_df['gpu_utilization_pct'].mean():.1f}%")
print(f"  Average PIF: {dcgm_df['power_intensity_factor'].mean():.3f}")
print(f"  Average RFU: {dcgm_df['rfu_pct'].mean():.1f}%")
print(f"  Average Efficiency Gap: {dcgm_df['efficiency_gap'].mean():.1f} percentage points")

In [None]:
# Efficiency class distribution
print("Efficiency Class Distribution:")
print("-" * 40)
for cls, count in dcgm_df['efficiency_class'].value_counts().items():
    pct = count / len(dcgm_df) * 100
    print(f"  {cls:15} {count:>8,} ({pct:5.1f}%)")

## 4. Visualization: GPU Utilization vs Power Intensity

This chart plots reported GPU utilization against the Power Intensity Factor, the proxy for actual computational work, so samples that look busy but draw little power stand out.

**Key patterns:**
- **Upper-right (green)**: Efficient - busy AND productive
- **Lower-right (red)**: Bottlenecked - busy but stalled on I/O
- **Lower-left (gray)**: Idle - minimal activity
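
One way to turn these quadrants into labels is a pair of cut-offs, sketched below. The thresholds `UTIL_HIGH` and `PIF_HIGH` are hypothetical, since the values used to build `efficiency_class` are not shown in this notebook. The `"Other"` label for the low-utilization, high-power quadrant is also a placeholder.

```python
import numpy as np
import pandas as pd

# Hypothetical cut-offs; the project's actual thresholds are not shown here.
UTIL_HIGH = 70.0  # percent GPU utilization
PIF_HIGH = 0.5    # fraction of maximum power


def classify_efficiency(df: pd.DataFrame) -> pd.Series:
    """Label each sample by its quadrant in the utilization vs PIF plane."""
    busy = df["gpu_utilization_pct"] >= UTIL_HIGH
    productive = df["power_intensity_factor"] >= PIF_HIGH
    labels = np.select(
        [busy & productive, busy & ~productive, ~busy & ~productive],
        ["Efficient", "Bottlenecked", "Idle"],
        default="Other",  # low utilization with high power draw
    )
    return pd.Series(labels, index=df.index, name="efficiency_class")


# One illustrative sample per quadrant.
points = pd.DataFrame({
    "gpu_utilization_pct": [90.0, 92.0, 5.0, 20.0],
    "power_intensity_factor": [0.8, 0.2, 0.05, 0.9],
})
classes = classify_efficiency(points)
print(classes.tolist())
```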

In [None]:
fig = plot_utilization_vs_power_intensity(dcgm_df, sample_size=5000)
plt.show()

### So What?

Points in the **lower-right quadrant** (high utilization, low PIF) represent GPUs that appear busy but aren't doing productive work. These are likely:
- Waiting for data from slow storage
- Stalled on network synchronization
- Memory-bound with poor access patterns

**Action**: Investigate bottlenecked workloads for data pipeline optimization opportunities.

## 5. Visualization: Efficiency Class Distribution

In [None]:
fig = plot_efficiency_distribution(dcgm_df)
plt.show()

### So What?

The proportion of "Bottlenecked" samples indicates fleet-wide efficiency issues:
- **< 10% bottlenecked**: Normal operating conditions
- **10-20% bottlenecked**: Some workloads need attention
- **> 20% bottlenecked**: Systemic data/I/O issues requiring immediate investigation
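
The bands above can be applied programmatically once the bottlenecked share is known. The sketch below assumes both boundary values (exactly 10% and exactly 20%) fall in the middle band. The function name and the returned labels are illustrative.

```python
import pandas as pd


def bottleneck_status(bottlenecked_pct: float) -> str:
    """Map the fleet-wide share of bottlenecked samples to a status band."""
    if bottlenecked_pct > 20:
        return "systemic"
    if bottlenecked_pct >= 10:
        return "needs attention"
    return "normal"


# Illustrative class labels: 3 of 20 samples are bottlenecked.
labels = pd.Series(["Efficient"] * 17 + ["Bottlenecked"] * 3)
share = (labels == "Bottlenecked").mean() * 100
print(f"{share:.1f}% bottlenecked: {bottleneck_status(share)}")
```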

## 6. Calculate Imbalance Metrics

Compute demand-versus-capacity imbalance metrics by joining all data sources:

- **Demand-Capacity Ratio (DCR)**: `pending_workloads / available_capacity`
- **Queue Pressure Score (QPS)**: Composite of pending workloads and admission wait time
- **Composite Imbalance Score (CIS)**: Overall imbalance indicator
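
A rough sketch of how these three scores might be computed is shown below. Only the DCR formula comes from the definitions above. The min-max normalization, the equal weights, the capping of DCR at 1.0, and the `available_gpus` column name are assumptions made for illustration and may not match `calculate_all_imbalance_metrics`.

```python
import numpy as np
import pandas as pd


def _minmax(s: pd.Series) -> pd.Series:
    """Scale a series to [0, 1]; a constant series maps to 0."""
    span = s.max() - s.min()
    if span == 0:
        return pd.Series(0.0, index=s.index)
    return (s - s.min()) / span


def sketch_imbalance_metrics(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # DCR: pending workloads per unit of available capacity.
    # Zero capacity becomes NaN instead of producing inf.
    capacity = out["available_gpus"].replace(0, np.nan)
    out["demand_capacity_ratio"] = out["pending_workloads"] / capacity
    # QPS: assumed equal-weight blend of normalized backlog and wait time.
    out["queue_pressure_score"] = (
        0.5 * _minmax(out["pending_workloads"])
        + 0.5 * _minmax(out["admission_wait_time_seconds"])
    )
    # CIS: assumed blend of DCR (capped at 1, undefined DCR treated as 1) and QPS.
    dcr_capped = out["demand_capacity_ratio"].clip(upper=1.0).fillna(1.0)
    out["composite_imbalance_score"] = 0.5 * dcr_capped + 0.5 * out["queue_pressure_score"]
    return out


# Illustrative rows: idle, half-loaded, and backlog with no capacity at all.
rows = pd.DataFrame({
    "pending_workloads": [0, 4, 8],
    "admission_wait_time_seconds": [0, 60, 120],
    "available_gpus": [8, 8, 0],
})
scored = sketch_imbalance_metrics(rows)
print(scored[["demand_capacity_ratio", "queue_pressure_score", "composite_imbalance_score"]])
```

Treating zero capacity as maximal imbalance is a design choice in this sketch: a backlog with nothing to run on stays visible in the composite score instead of being dropped as missing.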

In [None]:
# Calculate imbalance metrics
imbalance_df = calculate_all_imbalance_metrics(kueue_df, dcgm_df, nodepool_df)

print(f"Imbalance Analysis: {len(imbalance_df):,} observations")
print(f"  (aggregated by timestamp_hour × nodegroup)")

print("\nImbalance Metrics Summary:")
print(f"  Demand-Capacity Ratio (DCR):")
print(f"    Mean: {imbalance_df['demand_capacity_ratio'].mean():.2f}")
print(f"    Max:  {imbalance_df['demand_capacity_ratio'].max():.2f}")
print(f"  Queue Pressure Score (QPS):")
print(f"    Mean: {imbalance_df['queue_pressure_score'].mean():.3f}")
print(f"  Composite Imbalance Score (CIS):")
print(f"    Mean: {imbalance_df['composite_imbalance_score'].mean():.3f}")

In [None]:
# Imbalance severity distribution
print("Imbalance Severity Distribution:")
print("-" * 40)
for sev, count in imbalance_df['imbalance_severity'].value_counts().items():
    pct = count / len(imbalance_df) * 100
    print(f"  {sev:10} {count:>6,} ({pct:5.1f}%)")

## 7. Visualization: Imbalance Heatmap by Nodegroup

Shows where and when imbalances occur across the fleet.

In [None]:
fig = plot_imbalance_heatmap(imbalance_df)
plt.show()

### So What?

- **Horizontal red bands**: Persistent issues with specific nodegroups (capacity problem)
- **Vertical red bands**: Fleet-wide events affecting all nodegroups (demand spike)
- **Scattered red cells**: Transient issues, less concerning

**Action**: Focus optimization efforts on nodegroups with sustained high imbalance.
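
The horizontal and vertical band patterns can also be quantified rather than read by eye. The sketch below pivots the scores into a nodegroup × hour matrix and measures how often each row and each column is "hot". The `THRESHOLD` value and the function name are hypothetical, and the sample data is invented for illustration.

```python
import pandas as pd

THRESHOLD = 0.5  # hypothetical cut-off for a "red" cell


def imbalance_patterns(df: pd.DataFrame):
    """Return (share of hot hours per nodegroup, share of hot nodegroups per hour)."""
    matrix = df.pivot_table(
        index="nodegroup",
        columns="timestamp_hour",
        values="composite_imbalance_score",
        aggfunc="mean",
    )
    hot = matrix > THRESHOLD
    persistent = hot.mean(axis=1).sort_values(ascending=False)  # horizontal bands
    fleet_wide = hot.mean(axis=0)                               # vertical bands
    return persistent, fleet_wide


hours = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"])
demo = pd.DataFrame({
    "nodegroup": ["ng-a"] * 3 + ["ng-b"] * 3,
    "timestamp_hour": list(hours) * 2,
    "composite_imbalance_score": [0.8, 0.9, 0.7, 0.1, 0.9, 0.2],
})
persistent, fleet_wide = imbalance_patterns(demo)
print(persistent)
print(fleet_wide)
```

In this toy data `ng-a` is hot in every hour (a horizontal band), while the middle hour is hot for every nodegroup (a vertical band).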

## 8. Visualization: Demand vs Capacity Time Series

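Three-panel view of fleet-wide trends over time: pending demand versus capacity, GPU utilization versus RFU, and the composite imbalance score.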

In [None]:
# Aggregate to fleet-wide time series
fleet_ts = create_time_series_summary(imbalance_df)

fig = plot_demand_vs_capacity_timeseries(fleet_ts)
plt.show()

### So What?

- **Top panel**: When red (pending) exceeds green (capacity), expect queue growth
- **Middle panel**: Gap between utilization and RFU reveals hidden waste
- **Bottom panel**: Composite score crossing thresholds signals action needed

**Action**: Investigate time periods with sustained high imbalance scores.
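
"Sustained" can be made concrete by looking for runs of consecutive hours above a threshold. The sketch below is one way to do that. The function name, threshold, and minimum run length are hypothetical, and it assumes the series is evenly spaced with one point per hour and no gaps.

```python
import pandas as pd


def sustained_periods(scores: pd.Series, threshold: float, min_hours: int) -> pd.DataFrame:
    """Find runs of consecutive points above threshold lasting at least min_hours."""
    above = scores > threshold
    # A new run id starts whenever the above/below state flips.
    run_id = (above != above.shift()).cumsum()
    rows = []
    for _, run in scores[above].groupby(run_id[above]):
        if len(run) >= min_hours:
            rows.append({
                "start": run.index[0],
                "end": run.index[-1],
                "hours": len(run),
                "peak": run.max(),
            })
    return pd.DataFrame(rows, columns=["start", "end", "hours", "peak"])


# Illustrative hourly composite scores (not real data).
index = pd.to_datetime([f"2024-01-01 {h:02d}:00" for h in range(6)])
series = pd.Series([0.2, 0.6, 0.7, 0.8, 0.1, 0.9], index=index)
periods = sustained_periods(series, threshold=0.5, min_hours=3)
print(periods)
```

The single spike in the last hour is ignored because it does not reach the minimum run length, which separates sustained imbalance from transient noise.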

## 9. Identify Top Contributors

Which nodegroups, queues, and namespaces contribute most to imbalance?

In [None]:
contributors = identify_top_contributors(imbalance_df, kueue_df, n_top=5)

fig = plot_top_contributors(contributors)
plt.show()

In [None]:
print("Top Contributing Nodegroups:")
print(contributors['by_nodegroup'].to_string(index=False))
print("\nTop Contributing Queues:")
print(contributors['by_queue'][['queue_name', 'pending_workloads', 'admission_wait_time_seconds', 'queue_pressure']].to_string(index=False))
print("\nTop Contributing Namespaces:")
print(contributors['by_namespace'][['namespace', 'pending_workloads', 'namespace_pressure']].to_string(index=False))

### So What?

These are the primary sources of demand-capacity imbalance. Optimization efforts should focus here:

1. **High-pressure nodegroups**: May need capacity scaling or workload redistribution
2. **High-pressure queues**: Review admission policies, consider priority adjustments
3. **High-pressure namespaces**: Engage with teams to optimize workloads or adjust quotas
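
A ranking like this can be approximated with a group-by and a simple pressure score. The sketch below is illustrative only: `top_by` is a hypothetical helper, and the pressure formula (total backlog times mean wait in minutes) is an assumption, not necessarily what `identify_top_contributors` computes.

```python
import pandas as pd


def top_by(df: pd.DataFrame, key: str, n_top: int = 5) -> pd.DataFrame:
    """Rank groups (queue, namespace, nodegroup) by an assumed pressure score."""
    agg = df.groupby(key, as_index=False).agg(
        pending_workloads=("pending_workloads", "sum"),
        admission_wait_time_seconds=("admission_wait_time_seconds", "mean"),
    )
    # Hypothetical score: total backlog weighted by mean wait in minutes.
    agg["pressure"] = agg["pending_workloads"] * agg["admission_wait_time_seconds"] / 60
    return agg.nlargest(n_top, "pressure").reset_index(drop=True)


# Illustrative queue samples (not real data).
queues = pd.DataFrame({
    "queue_name": ["q1", "q1", "q2", "q3"],
    "pending_workloads": [2, 4, 10, 1],
    "admission_wait_time_seconds": [60, 120, 30, 600],
})
top_queues = top_by(queues, "queue_name", n_top=2)
print(top_queues)
```

Weighting backlog by wait time surfaces a queue with few but long-waiting workloads (`q3` here) ahead of one with a larger but fast-moving backlog.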

## 10. Summary and Recommended Actions

Based on the analysis, here are the key findings and recommended actions.

In [None]:
print("=" * 70)
print("ANALYSIS SUMMARY")
print("=" * 70)

# Key metrics
avg_dcr = imbalance_df['demand_capacity_ratio'].mean()
avg_cis = imbalance_df['composite_imbalance_score'].mean()
avg_gap = dcgm_df['efficiency_gap'].mean()
bottleneck_pct = (dcgm_df['efficiency_class'] == 'Bottlenecked').sum() / len(dcgm_df) * 100

print(f"\n1. DEMAND-CAPACITY BALANCE")
print(f"   Average DCR: {avg_dcr:.2f}")
if avg_dcr > 1.0:
    print(f"   ⚠️  ALERT: Demand exceeds capacity on average!")
elif avg_dcr > 0.7:
    print(f"   ⚡ WARNING: Approaching capacity limits")
else:
    print(f"   ✅ Healthy: Capacity exceeds demand")

print(f"\n2. EFFICIENCY")
print(f"   Average Efficiency Gap: {avg_gap:.1f} percentage points")
print(f"   Bottlenecked Samples: {bottleneck_pct:.1f}%")
if avg_gap > 15:
    print(f"   ⚠️  ALERT: Significant hidden inefficiency!")
elif avg_gap > 8:
    print(f"   ⚡ WARNING: Some workloads may be data-starved")
else:
    print(f"   ✅ Healthy: Workloads are generally productive")

print(f"\n3. OVERALL IMBALANCE")
print(f"   Average Composite Score: {avg_cis:.3f}")
if avg_cis > 0.5:
    print(f"   ⚠️  ALERT: Significant imbalance detected!")
elif avg_cis > 0.3:
    print(f"   ⚡ WARNING: Minor imbalances present")
else:
    print(f"   ✅ Healthy: Well-balanced operation")

print(f"\n" + "=" * 70)
print("RECOMMENDED ACTIONS")
print("=" * 70)

# Generate recommendations
top_ng = contributors['by_nodegroup'].iloc[0]['nodegroup'] if len(contributors['by_nodegroup']) > 0 else 'N/A'
top_q = contributors['by_queue'].iloc[0]['queue_name'] if len(contributors['by_queue']) > 0 else 'N/A'

print(f"\n1. INVESTIGATE: {top_ng}")
print(f"   This nodegroup shows highest imbalance. Check:")
print(f"   - Is capacity scaling needed?")
print(f"   - Are workloads right-sized?")

print(f"\n2. REVIEW QUEUE: {top_q}")
print(f"   This queue has highest pressure. Consider:")
print(f"   - Adjusting admission policies")
print(f"   - Redistributing workloads")

if bottleneck_pct > 15:
    print(f"\n3. OPTIMIZE DATA PIPELINES")
    print(f"   {bottleneck_pct:.1f}% of GPU samples are bottlenecked.")
    print(f"   - Review data loading patterns")
    print(f"   - Consider data caching/pre-fetching")
    print(f"   - Check storage throughput")

## 11. What Additional Data Would Help?

This analysis would benefit from:

1. **Workload metadata**: Job duration, GPU count per job, completion rates
2. **Storage metrics**: I/O throughput and latency, to identify data bottlenecks
3. **Network metrics**: Bandwidth utilization for distributed training
4. **Cost data**: GPU-hour costs for ROI calculations
5. **Historical capacity changes**: Autoscaling events for correlation

See [docs/ASSUMPTIONS.md](../docs/ASSUMPTIONS.md) for the full list of limitations.

## 12. Reproduction Steps

To reproduce this analysis:

```bash
# 1. Generate synthetic data
python -m src.generators.synthetic_generator --scenario balanced --seed 42 --days 7

# 2. Run this notebook
jupyter notebook notebooks/demand_capacity_analysis.ipynb
```

To try different scenarios:
```bash
python -m src.generators.synthetic_generator --scenario demand_exceeds_capacity
python -m src.generators.synthetic_generator --scenario capacity_fragmentation
python -m src.generators.synthetic_generator --scenario io_bottleneck
```

In [None]:
print("\nAnalysis complete!")
print("\nFor questions or feedback, see CONTRIBUTING.md")
print("To propose metric changes, use docs/rfcs/RFC_TEMPLATE.md")