# GPU Efficiency Analytics: Beyond Utilization

## Measuring True GPU Productivity in the GenAI Era

This notebook demonstrates how to calculate and visualize **Realized TFLOPS Utilization (RFU)** - a metric that goes beyond simple GPU utilization to measure actual computational productivity.

### Key Concepts

1. **GPU Utilization** - Traditional metric showing the percentage of time a GPU was "busy" (0-100%), regardless of how much work it did
2. **Power Intensity Factor (PIF)** - Ratio of current power draw to maximum power (0-1), used here as a proxy for how hard the silicon is actually working
3. **Realized TFLOPS** - Estimated mathematical throughput = Achievable TFLOPS × PIF
4. **Realized TFLOPS Utilization (RFU)** - Realized TFLOPS as a percentage of achievable TFLOPS

### The Problem

A GPU can show 100% utilization while being:
- **Data-starved** (waiting for data from slow storage)
- **Network-bottlenecked** (waiting for gradient synchronization)
- **Memory-bound** (stalled on memory access patterns)

In these cases, the silicon is "busy" but mathematically **unproductive**.

## 1. Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings

warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

print("Libraries imported successfully!")

In [None]:
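# GPU Specifications from documentation
# Note: the H100 peak figures (1979 FP16, 3958 FP8) match the datasheet values quoted
# with structured sparsity, while the A100 peak (312) is the dense value, so the
# theoretical columns are not directly comparable across models.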
GPU_SPECS = {
    'NVIDIA A100-SXM4-40GB': {
        'max_power': 400,
        'theoretical_tflops_fp16': 312,
        'achievable_tflops_fp16': 102
    },
    'NVIDIA H100 80GB HBM3': {
        'max_power': 700,
        'theoretical_tflops_fp16': 1979,
        'achievable_tflops_fp16': 646,
        'theoretical_tflops_fp8': 3958,
        'achievable_tflops_fp8': 1293
    },
    'NVIDIA A10G': {
        'max_power': 300,
        'theoretical_tflops_fp16': 125,
        'achievable_tflops_fp16': 35
    }
}

print("GPU specifications loaded:")
for model, specs in GPU_SPECS.items():
    print(f"\n{model}:")
    print(f"  Max Power: {specs['max_power']}W")
    print(f"  Achievable TFLOPS (FP16): {specs['achievable_tflops_fp16']}")

In [None]:
# Load the synthetic dataset
df = pd.read_csv('synthetic_dcgm_metrics.csv')

# Convert datetime fields
df['SCRAPETIME'] = pd.to_datetime(df['SCRAPETIME'])
df['ACTIVITYDATE'] = pd.to_datetime(df['ACTIVITYDATE'])

print(f"Dataset loaded: {len(df):,} records")
print(f"Date range: {df['SCRAPETIME'].min()} to {df['SCRAPETIME'].max()}")
print(f"\nUnique GPUs: {df['UUID'].nunique()}")
print(f"GPU Models: {df['MODELNAME'].unique().tolist()}")
print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few records:")
df.head()
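
Before computing metrics, it can help to confirm that the file has the columns and value ranges the later cells rely on. The following is a minimal sketch, not part of the original notebook: `REQUIRED_COLUMNS` and `validate_metrics` are hypothetical names, the column list is inferred from the cells in this notebook, and the demo rows are made up.

```python
import pandas as pd

# Columns that the later cells of this notebook read (inferred, not authoritative)
REQUIRED_COLUMNS = {
    'SCRAPETIME', 'UUID', 'MODELNAME', 'HOSTNAME', 'NAMESPACE',
    'DCGM_FI_DEV_GPU_UTIL', 'DCGM_FI_DEV_POWER_USAGE',
}

def validate_metrics(frame, known_models):
    """Return a list of human-readable problems found in the metrics table."""
    problems = []
    missing = REQUIRED_COLUMNS - set(frame.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # the remaining checks need these columns
    unknown = set(frame['MODELNAME'].dropna().unique()) - set(known_models)
    if unknown:
        problems.append(f"models without specs: {sorted(unknown)}")
    util = frame['DCGM_FI_DEV_GPU_UTIL']
    if ((util < 0) | (util > 100)).any():
        problems.append("utilization outside 0-100")
    if (frame['DCGM_FI_DEV_POWER_USAGE'] < 0).any():
        problems.append("negative power readings")
    return problems

# Tiny hypothetical frame to exercise the checks
demo = pd.DataFrame({
    'SCRAPETIME': pd.to_datetime(['2024-01-01 00:00', '2024-01-01 00:01']),
    'UUID': ['GPU-a', 'GPU-b'],
    'MODELNAME': ['NVIDIA A10G', 'Unknown GPU'],
    'HOSTNAME': ['node-1', 'node-2'],
    'NAMESPACE': ['demo', 'demo'],
    'DCGM_FI_DEV_GPU_UTIL': [95, 120],
    'DCGM_FI_DEV_POWER_USAGE': [150.0, 80.0],
})
print(validate_metrics(demo, known_models={'NVIDIA A10G'}))
```

In the notebook itself the call would look like `validate_metrics(df, GPU_SPECS.keys())`; an empty list means none of these checks fired.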

## 2. Calculate Efficiency Metrics

We'll calculate the key efficiency metrics:

### Power Intensity Factor (PIF)
$$PIF = \frac{Current\ Power\ Draw}{Maximum\ GPU\ Power}$$

### Realized TFLOPS
$$Realized\ TFLOPS = Achievable\ TFLOPS \times PIF$$

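### Realized TFLOPS Utilization (RFU)
$$RFU = \frac{Realized\ TFLOPS}{Achievable\ TFLOPS} \times 100\%$$

Because Realized TFLOPS is defined as Achievable TFLOPS × PIF, RFU as defined here is numerically equal to PIF × 100.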
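
As a quick sanity check, the sketch below works a single sample through the three formulas by hand. The 240 W power reading is a made-up value for illustration; the 400 W limit and 102 achievable TFLOPS are the A100 figures from the specification table in this notebook.

```python
# Worked example with a hypothetical power reading (240 W is illustrative only)
max_power_w = 400          # A100 power limit used in this notebook
achievable_tflops = 102    # A100 achievable FP16 TFLOPS used in this notebook
power_draw_w = 240         # assumed sample value

pif = power_draw_w / max_power_w                          # Power Intensity Factor
realized_tflops = achievable_tflops * pif                 # Realized TFLOPS
rfu_percent = realized_tflops / achievable_tflops * 100   # RFU in percent

print(f"PIF = {pif:.2f}")
print(f"Realized TFLOPS = {realized_tflops:.1f}")
print(f"RFU = {rfu_percent:.1f}%")
```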

In [None]:
def calculate_efficiency_metrics(df):
    """
    Calculate GPU efficiency metrics including PIF, Realized TFLOPS, and RFU
    """
    # Create a copy to avoid SettingWithCopyWarning
    df = df.copy()
    
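    # Map GPU specs to each record; fail early with a clear message on unknown models
    unknown = set(df['MODELNAME'].unique()) - set(GPU_SPECS)
    if unknown:
        raise KeyError(f"No entry in GPU_SPECS for: {sorted(map(str, unknown))}")
    df['max_power'] = df['MODELNAME'].map(lambda x: GPU_SPECS[x]['max_power'])
    df['achievable_tflops'] = df['MODELNAME'].map(lambda x: GPU_SPECS[x]['achievable_tflops_fp16'])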
    
    # Calculate Power Intensity Factor (PIF)
    df['power_intensity_factor'] = df['DCGM_FI_DEV_POWER_USAGE'] / df['max_power']
    df['power_intensity_factor'] = df['power_intensity_factor'].clip(0, 1)  # Ensure 0-1 range
    
    # Calculate Realized TFLOPS
    df['realized_tflops'] = df['achievable_tflops'] * df['power_intensity_factor']
    
    # Calculate Realized TFLOPS Utilization (RFU) as percentage
    df['rfu_percent'] = (df['realized_tflops'] / df['achievable_tflops']) * 100
    
    # Calculate efficiency gap (difference between utilization and RFU)
    df['efficiency_gap'] = df['DCGM_FI_DEV_GPU_UTIL'] - df['rfu_percent']
    
    # Classify workload efficiency
    def classify_efficiency(row):
        util = row['DCGM_FI_DEV_GPU_UTIL']
        pif = row['power_intensity_factor']
        
        if util < 10:
            return 'Idle'
        elif util >= 70 and pif >= 0.75:
            return 'Efficient'
        elif util >= 70 and pif < 0.60:
            return 'Bottlenecked'
        elif util >= 40 and pif >= 0.50:
            return 'Moderate'
        else:
            return 'Inefficient'
    
    df['efficiency_class'] = df.apply(classify_efficiency, axis=1)
    
    return df

# Apply efficiency calculations
df = calculate_efficiency_metrics(df)

print("Efficiency metrics calculated!")
print("\nNew columns added:")
print("  - power_intensity_factor (PIF)")
print("  - realized_tflops")
print("  - rfu_percent (Realized TFLOPS Utilization %)")
print("  - efficiency_gap")
print("  - efficiency_class")

print("\nSample calculations:")
df[['MODELNAME', 'DCGM_FI_DEV_GPU_UTIL', 'DCGM_FI_DEV_POWER_USAGE', 
    'power_intensity_factor', 'realized_tflops', 'rfu_percent', 'efficiency_class']].head(10)
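
The row-wise `df.apply(classify_efficiency, axis=1)` call can be slow on large tables. One possible alternative, sketched here under the assumption that the thresholds stay exactly as above, is `numpy.select`, which evaluates the rules column-wise and takes the first match, mirroring the `if`/`elif` order. The function name `classify_efficiency_vectorized` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def classify_efficiency_vectorized(util, pif):
    """Column-wise version of the classification rules; the first matching rule wins."""
    conditions = [
        util < 10,
        (util >= 70) & (pif >= 0.75),
        (util >= 70) & (pif < 0.60),
        (util >= 40) & (pif >= 0.50),
    ]
    labels = ['Idle', 'Efficient', 'Bottlenecked', 'Moderate']
    return np.select(conditions, labels, default='Inefficient')

# Hypothetical samples covering each branch
demo = pd.DataFrame({
    'DCGM_FI_DEV_GPU_UTIL': [5, 95, 90, 50, 80, 30],
    'power_intensity_factor': [0.10, 0.80, 0.40, 0.55, 0.65, 0.90],
})
demo['efficiency_class'] = classify_efficiency_vectorized(
    demo['DCGM_FI_DEV_GPU_UTIL'], demo['power_intensity_factor'])
print(demo)
```

Note that a sample with high utilization and a PIF between 0.60 and 0.75 falls through to "Moderate" under these rules, in both the row-wise and the vectorized form.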

## 3. Fleet-Wide Analysis

### Environment GPU Capacity (Total Capacity View)
This view sums the total achievable TFLOPS of the entire fleet to understand the "size of the engine" available.

In [None]:
# Fleet-wide capacity analysis
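fleet_summary = df.groupby('MODELNAME').agg({
    'UUID': 'nunique',
    'achievable_tflops': 'first',
    'DCGM_FI_DEV_GPU_UTIL': 'mean',
    'power_intensity_factor': 'mean',
    'rfu_percent': 'mean',
    'realized_tflops': 'mean'
}).round(2)

fleet_summary.columns = ['GPU Count', 'TFLOPS per GPU', 'Avg Utilization %', 
                         'Avg PIF', 'Avg RFU %', 'Avg Realized TFLOPS per GPU']

# Calculate total capacity
fleet_summary['Total Capacity (TFLOPS)'] = (
    fleet_summary['GPU Count'] * fleet_summary['TFLOPS per GPU']
)

# Realized TFLOPS is a rate, so scale the per-GPU average by the GPU count
# instead of summing every scrape sample (assumes similar sample counts per GPU)
fleet_summary['Total Realized TFLOPS'] = (
    fleet_summary['GPU Count'] * fleet_summary['Avg Realized TFLOPS per GPU']
).round(2)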

print("Fleet-Wide Capacity Summary")
print("=" * 80)
print(fleet_summary)
print("\n" + "=" * 80)
print(f"Total Fleet Capacity: {fleet_summary['Total Capacity (TFLOPS)'].sum():,.0f} TFLOPS")
print(f"Average Fleet Utilization: {df['DCGM_FI_DEV_GPU_UTIL'].mean():.1f}%")
print(f"Average Fleet RFU: {df['rfu_percent'].mean():.1f}%")
print(f"Efficiency Gap: {df['efficiency_gap'].mean():.1f} percentage points")
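
One subtlety in fleet totals: realized TFLOPS is a rate, so adding it up across every scrape sample produces a number that grows with the sampling frequency rather than with the fleet's throughput. The sketch below, using made-up values for two GPUs, contrasts that with averaging per GPU first and then summing across GPUs.

```python
import pandas as pd

# Two hypothetical GPUs sampled three times each
samples = pd.DataFrame({
    'UUID': ['GPU-a'] * 3 + ['GPU-b'] * 3,
    'realized_tflops': [60.0, 80.0, 70.0, 20.0, 30.0, 40.0],
})

# Summing every sample grows with the number of scrapes
sum_over_samples = samples['realized_tflops'].sum()

# Averaging per GPU, then summing across GPUs, gives a fleet-level rate
per_gpu_mean = samples.groupby('UUID')['realized_tflops'].mean()
fleet_realized_tflops = per_gpu_mean.sum()

print(f"Sum over samples: {sum_over_samples:.1f}")
print(f"Fleet realized TFLOPS: {fleet_realized_tflops:.1f}")
```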

## 4. Active GPU Analysis (Workload Efficiency)

Filter to samples with utilization > 1% to focus on active workloads and identify bottlenecks.

In [None]:
# Filter for active GPUs only (utilization > 1%)
active_df = df[df['DCGM_FI_DEV_GPU_UTIL'] > 1].copy()

print(f"Active GPU Analysis (Utilization > 1%)")
print("=" * 80)
print(f"Total samples: {len(df):,}")
print(f"Active samples: {len(active_df):,} ({len(active_df)/len(df)*100:.1f}%)")
print(f"\nActive GPU Metrics:")
print(f"  Average Utilization: {active_df['DCGM_FI_DEV_GPU_UTIL'].mean():.1f}%")
print(f"  Average PIF: {active_df['power_intensity_factor'].mean():.3f}")
print(f"  Average RFU: {active_df['rfu_percent'].mean():.1f}%")
print(f"  Efficiency Gap: {active_df['efficiency_gap'].mean():.1f} percentage points")

# Efficiency class distribution
print("\nEfficiency Class Distribution (Active GPUs):")
efficiency_dist = active_df['efficiency_class'].value_counts()
for cls, count in efficiency_dist.items():
    pct = count / len(active_df) * 100
    print(f"  {cls}: {count:,} samples ({pct:.1f}%)")

## 5. Visualization: Utilization vs Power Intensity

This scatter plot shows the relationship between GPU utilization and power draw, which this notebook uses as a proxy for actual computational work.

In [None]:
# Create sample for visualization (plotting all 500k points would be slow)
sample_size = 5000
plot_df = active_df.sample(n=min(sample_size, len(active_df)), random_state=42)

plt.figure(figsize=(14, 8))

# Scatter plot colored by efficiency class
colors = {'Efficient': 'green', 'Bottlenecked': 'red', 'Moderate': 'orange', 
          'Inefficient': 'purple', 'Idle': 'gray'}

for efficiency_class in plot_df['efficiency_class'].unique():
    mask = plot_df['efficiency_class'] == efficiency_class
    plt.scatter(plot_df[mask]['DCGM_FI_DEV_GPU_UTIL'], 
               plot_df[mask]['power_intensity_factor'],
               c=colors.get(efficiency_class, 'blue'),
               label=efficiency_class,
               alpha=0.5,
               s=20)

# Add diagonal reference line (ideal efficiency)
x_line = np.linspace(0, 100, 100)
y_line = x_line / 100
plt.plot(x_line, y_line, 'k--', linewidth=2, label='Ideal Efficiency', alpha=0.3)

# Add threshold lines
plt.axhline(y=0.75, color='green', linestyle=':', alpha=0.3, linewidth=1)
plt.axhline(y=0.60, color='orange', linestyle=':', alpha=0.3, linewidth=1)
plt.axvline(x=70, color='gray', linestyle=':', alpha=0.3, linewidth=1)

plt.xlabel('GPU Utilization (%)', fontsize=12)
plt.ylabel('Power Intensity Factor (PIF)', fontsize=12)
plt.title('GPU Utilization vs Power Intensity Factor\n' + 
          'Identifying Bottlenecked Workloads', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(True, alpha=0.3)
plt.xlim(0, 105)
plt.ylim(0, 1.05)

# Add annotations
plt.text(85, 0.9, 'Efficient\n(High Util + High Power)', 
         fontsize=9, ha='center', color='darkgreen', weight='bold')
plt.text(85, 0.4, 'BOTTLENECKED\n(High Util + Low Power)', 
         fontsize=9, ha='center', color='darkred', weight='bold')

plt.tight_layout()
plt.savefig('utilization_vs_power_intensity.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: utilization_vs_power_intensity.png")

## 6. Time Series Analysis: Fleet Efficiency Over Time

In [None]:
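# Aggregate by hour ('h' is the hourly alias; uppercase 'H' is deprecated in pandas 2.2+)
hour = df['SCRAPETIME'].dt.floor('h')
hourly_metrics = df.groupby(hour).agg({
    'DCGM_FI_DEV_GPU_UTIL': 'mean',
    'power_intensity_factor': 'mean',
    'rfu_percent': 'mean',
    'efficiency_gap': 'mean'
}).reset_index()

# Realized TFLOPS is a rate: average each GPU within the hour, then sum across GPUs
hourly_total = df.groupby([hour, 'UUID'])['realized_tflops'].mean().groupby(level=0).sum()
hourly_metrics['realized_tflops'] = hourly_metrics['SCRAPETIME'].map(hourly_total)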

fig, axes = plt.subplots(3, 1, figsize=(16, 12))

# Plot 1: Utilization vs RFU
axes[0].plot(hourly_metrics['SCRAPETIME'], hourly_metrics['DCGM_FI_DEV_GPU_UTIL'], 
            label='GPU Utilization %', linewidth=2, color='steelblue')
axes[0].plot(hourly_metrics['SCRAPETIME'], hourly_metrics['rfu_percent'], 
            label='Realized TFLOPS Utilization (RFU) %', linewidth=2, color='darkgreen')
axes[0].fill_between(hourly_metrics['SCRAPETIME'], 
                     hourly_metrics['DCGM_FI_DEV_GPU_UTIL'],
                     hourly_metrics['rfu_percent'],
                     alpha=0.2, color='red', label='Efficiency Gap')
axes[0].set_ylabel('Percentage (%)', fontsize=11)
axes[0].set_title('Fleet Utilization vs Realized TFLOPS Utilization Over Time', 
                  fontsize=13, fontweight='bold')
axes[0].legend(loc='best', fontsize=10)
axes[0].grid(True, alpha=0.3)

# Plot 2: Power Intensity Factor
axes[1].plot(hourly_metrics['SCRAPETIME'], hourly_metrics['power_intensity_factor'], 
            linewidth=2, color='darkorange')
axes[1].axhline(y=0.75, color='green', linestyle='--', alpha=0.5, label='High Efficiency Threshold')
axes[1].axhline(y=0.60, color='orange', linestyle='--', alpha=0.5, label='Moderate Efficiency')
axes[1].set_ylabel('Power Intensity Factor', fontsize=11)
axes[1].set_title('Average Power Intensity Factor Over Time', fontsize=13, fontweight='bold')
axes[1].legend(loc='best', fontsize=10)
axes[1].grid(True, alpha=0.3)

# Plot 3: Total Realized TFLOPS
axes[2].plot(hourly_metrics['SCRAPETIME'], hourly_metrics['realized_tflops'], 
            linewidth=2, color='purple')
axes[2].set_ylabel('Total Realized TFLOPS', fontsize=11)
axes[2].set_xlabel('Time', fontsize=11)
axes[2].set_title('Total Fleet Realized TFLOPS Over Time', fontsize=13, fontweight='bold')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('fleet_efficiency_timeseries.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: fleet_efficiency_timeseries.png")

## 7. GPU Model Comparison

In [None]:
# Compare efficiency across GPU models
model_comparison = active_df.groupby('MODELNAME').agg({
    'DCGM_FI_DEV_GPU_UTIL': 'mean',
    'power_intensity_factor': 'mean',
    'rfu_percent': 'mean',
    'efficiency_gap': 'mean',
    'UUID': 'nunique'
}).round(2)

model_comparison.columns = ['Avg Utilization %', 'Avg PIF', 'Avg RFU %', 
                           'Efficiency Gap', 'Active GPU Count']

fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# Plot 1: Average Utilization by Model
model_comparison['Avg Utilization %'].plot(kind='bar', ax=axes[0,0], color='steelblue')
axes[0,0].set_title('Average Utilization by GPU Model', fontweight='bold')
axes[0,0].set_ylabel('Utilization (%)')
axes[0,0].set_xlabel('')
axes[0,0].tick_params(axis='x', rotation=45)
axes[0,0].grid(True, alpha=0.3, axis='y')

# Plot 2: Average PIF by Model
model_comparison['Avg PIF'].plot(kind='bar', ax=axes[0,1], color='darkorange')
axes[0,1].axhline(y=0.75, color='green', linestyle='--', alpha=0.5)
axes[0,1].set_title('Average Power Intensity Factor by GPU Model', fontweight='bold')
axes[0,1].set_ylabel('PIF')
axes[0,1].set_xlabel('')
axes[0,1].tick_params(axis='x', rotation=45)
axes[0,1].grid(True, alpha=0.3, axis='y')

# Plot 3: Average RFU by Model
model_comparison['Avg RFU %'].plot(kind='bar', ax=axes[1,0], color='darkgreen')
axes[1,0].set_title('Average Realized TFLOPS Utilization by GPU Model', fontweight='bold')
axes[1,0].set_ylabel('RFU (%)')
axes[1,0].set_xlabel('')
axes[1,0].tick_params(axis='x', rotation=45)
axes[1,0].grid(True, alpha=0.3, axis='y')

# Plot 4: Efficiency Gap by Model
model_comparison['Efficiency Gap'].plot(kind='bar', ax=axes[1,1], color='crimson')
axes[1,1].set_title('Efficiency Gap by GPU Model\n(Utilization % - RFU %)', fontweight='bold')
axes[1,1].set_ylabel('Efficiency Gap (percentage points)')
axes[1,1].set_xlabel('')
axes[1,1].tick_params(axis='x', rotation=45)
axes[1,1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('gpu_model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print("Chart saved: gpu_model_comparison.png")
print("\nModel Comparison Table:")
print(model_comparison)

## 8. Identifying Bottlenecked Workloads

Find specific GPU instances whose average utilization is high but whose average power draw is low. These are candidates for optimization.

In [None]:
# Identify consistently bottlenecked GPUs
# Criteria: Average utilization > 70% but average PIF < 0.60
gpu_performance = active_df.groupby('UUID').agg({
    'MODELNAME': 'first',
    'HOSTNAME': 'first',
    'NAMESPACE': 'first',
    'DCGM_FI_DEV_GPU_UTIL': 'mean',
    'power_intensity_factor': 'mean',
    'rfu_percent': 'mean',
    'efficiency_gap': 'mean',
    'SCRAPETIME': 'count'
}).reset_index()

gpu_performance.columns = ['UUID', 'Model', 'Hostname', 'Namespace', 'Avg Util %', 
                          'Avg PIF', 'Avg RFU %', 'Efficiency Gap', 'Sample Count']

# Filter for bottlenecked GPUs
bottlenecked = gpu_performance[
    (gpu_performance['Avg Util %'] > 70) & 
    (gpu_performance['Avg PIF'] < 0.60)
].sort_values('Efficiency Gap', ascending=False)

print(f"Bottlenecked GPU Analysis")
print("=" * 80)
print(f"Total GPUs analyzed: {len(gpu_performance)}")
print(f"Bottlenecked GPUs (>70% util, <60% PIF): {len(bottlenecked)}")
print(f"\nTop 10 Most Bottlenecked GPUs:")
print(bottlenecked[['Model', 'Hostname', 'Namespace', 'Avg Util %', 
                    'Avg PIF', 'Efficiency Gap']].head(10).round(2))

# Visualize bottlenecked vs efficient GPUs
fig, ax = plt.subplots(figsize=(12, 8))

efficient = gpu_performance[
    (gpu_performance['Avg Util %'] > 70) & 
    (gpu_performance['Avg PIF'] >= 0.75)
]

ax.scatter(efficient['Avg Util %'], efficient['Avg PIF'], 
          c='green', s=100, alpha=0.6, label=f'Efficient ({len(efficient)} GPUs)', edgecolors='black')
ax.scatter(bottlenecked['Avg Util %'], bottlenecked['Avg PIF'], 
          c='red', s=100, alpha=0.6, label=f'Bottlenecked ({len(bottlenecked)} GPUs)', edgecolors='black')

ax.axhline(y=0.75, color='green', linestyle='--', alpha=0.3, linewidth=2)
ax.axhline(y=0.60, color='orange', linestyle='--', alpha=0.3, linewidth=2)
ax.axvline(x=70, color='gray', linestyle='--', alpha=0.3, linewidth=2)

ax.set_xlabel('Average GPU Utilization (%)', fontsize=12)
ax.set_ylabel('Average Power Intensity Factor', fontsize=12)
ax.set_title('GPU Performance Profile: Efficient vs Bottlenecked Instances', 
            fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=11)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('bottlenecked_gpu_identification.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nChart saved: bottlenecked_gpu_identification.png")
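
Averages can hide how often a GPU is actually in the bottlenecked state. As a complementary view, the sketch below computes, per GPU, the fraction of samples that meet the same >70% utilization and <0.60 PIF criteria. The helper name `bottleneck_share` and the demo data are hypothetical.

```python
import pandas as pd

def bottleneck_share(frame, util_min=70, pif_max=0.60):
    """Fraction of each GPU's samples with high utilization and low PIF."""
    flagged = (
        (frame['DCGM_FI_DEV_GPU_UTIL'] > util_min)
        & (frame['power_intensity_factor'] < pif_max)
    )
    # The mean of a boolean column is the share of True values
    return flagged.groupby(frame['UUID']).mean().rename('bottleneck_share')

# Hypothetical samples: GPU-a is always bottlenecked, GPU-b only half the time
demo = pd.DataFrame({
    'UUID': ['GPU-a'] * 4 + ['GPU-b'] * 4,
    'DCGM_FI_DEV_GPU_UTIL': [95, 90, 92, 88, 99, 20, 98, 15],
    'power_intensity_factor': [0.40, 0.45, 0.50, 0.42, 0.30, 0.90, 0.35, 0.95],
})
share = bottleneck_share(demo)
print(share)
```

In the notebook this could be applied as `bottleneck_share(active_df)` and sorted in descending order to rank GPUs by how persistently they are bottlenecked.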

## 9. ROI Analysis: The Value of Efficiency Improvements

Estimate the potential impact of improving efficiency across the fleet, assuming delivered throughput scales linearly with RFU.

In [None]:
# Current fleet metrics
total_gpus = df['UUID'].nunique()
current_avg_rfu = df['rfu_percent'].mean()
current_realized_tflops = df['realized_tflops'].mean()

# Calculate potential with improved efficiency
improvement_scenarios = []

for target_rfu in [50, 60, 70, 80]:
    rfu_gain = target_rfu - current_avg_rfu
    multiplier = target_rfu / current_avg_rfu
    
    # Calculate equivalent GPU hours gained per day
    hours_per_day = 24
    additional_capacity_pct = (multiplier - 1) * 100
    equivalent_gpus_gained = total_gpus * (multiplier - 1)
    
    improvement_scenarios.append({
        'Target RFU %': target_rfu,
        'RFU Improvement': f'{rfu_gain:+.1f}pp',
        'Additional Capacity %': f'{additional_capacity_pct:+.1f}%',
        'Equivalent GPUs Gained': f'{equivalent_gpus_gained:.1f}',
        'Extra GPU-Hours/Day': f'{equivalent_gpus_gained * hours_per_day:.0f}'
    })

roi_df = pd.DataFrame(improvement_scenarios)

print("ROI Analysis: Impact of Efficiency Improvements")
print("=" * 80)
print(f"Current Fleet Configuration:")
print(f"  Total GPUs: {total_gpus}")
print(f"  Current Average RFU: {current_avg_rfu:.1f}%")
print(f"\nPotential Gains from Efficiency Improvements:")
print(roi_df.to_string(index=False))

print("\n" + "=" * 80)
print("Key Insight:")
print(f"If throughput scales with RFU, improving RFU from {current_avg_rfu:.1f}% to 60% would be roughly")
print(f"equivalent to adding {(60/current_avg_rfu - 1) * total_gpus:.0f} GPUs to the fleet without buying new hardware.")

# Visualize ROI potential
fig, ax = plt.subplots(figsize=(12, 7))

scenarios_numeric = [s['Target RFU %'] for s in improvement_scenarios]
equivalent_gpus = [float(s['Equivalent GPUs Gained']) for s in improvement_scenarios]

bars = ax.bar(range(len(scenarios_numeric)), equivalent_gpus, color='darkgreen', alpha=0.7)
ax.axhline(y=0, color='black', linewidth=0.8)

# Add value labels on bars
for i, (bar, val) in enumerate(zip(bars, equivalent_gpus)):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 0.5,
           f'{val:+.1f} GPUs\n({improvement_scenarios[i]["Extra GPU-Hours/Day"]} hrs/day)',
           ha='center', va='bottom', fontsize=10, fontweight='bold')

ax.set_xticks(range(len(scenarios_numeric)))
ax.set_xticklabels([f'{s}% RFU' for s in scenarios_numeric])
ax.set_xlabel('Target Realized TFLOPS Utilization (RFU)', fontsize=12)
ax.set_ylabel('Equivalent GPUs Gained', fontsize=12)
ax.set_title(f'ROI of Efficiency Improvements\n' + 
            f'(Current Fleet: {total_gpus} GPUs @ {current_avg_rfu:.1f}% RFU)',
            fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('efficiency_roi_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nChart saved: efficiency_roi_analysis.png")
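
The "equivalent GPUs" arithmetic can be isolated into a small helper so that edge cases are explicit. This is a sketch under the same linear-scaling assumption as the cell above; the function name `equivalent_gpus_gained` and the 100-GPU, 40% RFU fleet are hypothetical. A target below the current RFU yields a negative result, and a non-positive current RFU is rejected rather than dividing by zero.

```python
def equivalent_gpus_gained(total_gpus, current_rfu, target_rfu):
    """Extra GPU-equivalents if average RFU moved from current_rfu to target_rfu."""
    if current_rfu <= 0:
        raise ValueError("current_rfu must be positive")
    return total_gpus * (target_rfu / current_rfu - 1)

# Hypothetical fleet: 100 GPUs averaging 40% RFU
for target in (50, 60, 30):
    gained = equivalent_gpus_gained(100, 40.0, target)
    print(f"target {target}% -> {gained:+.1f} GPU-equivalents")
```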

## 10. Workload Pattern Analysis by Namespace

In [None]:
# Analyze efficiency by namespace/workload type
namespace_analysis = active_df.groupby('NAMESPACE').agg({
    'DCGM_FI_DEV_GPU_UTIL': 'mean',
    'power_intensity_factor': 'mean',
    'rfu_percent': 'mean',
    'efficiency_gap': 'mean',
    'UUID': 'nunique',
    'SCRAPETIME': 'count'
}).round(2)

namespace_analysis.columns = ['Avg Util %', 'Avg PIF', 'Avg RFU %', 
                             'Efficiency Gap', 'Active GPUs', 'Total Samples']

namespace_analysis = namespace_analysis.sort_values('Efficiency Gap', ascending=False)

print("Workload Pattern Analysis by Namespace")
print("=" * 80)
print(namespace_analysis)

# Visualize namespace comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Efficiency metrics by namespace
x = np.arange(len(namespace_analysis))
width = 0.25

axes[0].bar(x - width, namespace_analysis['Avg Util %'], width, 
           label='Utilization %', color='steelblue')
axes[0].bar(x, namespace_analysis['Avg RFU %'], width, 
           label='RFU %', color='darkgreen')
axes[0].bar(x + width, namespace_analysis['Avg PIF'] * 100, width, 
           label='PIF × 100', color='darkorange')

axes[0].set_xlabel('Namespace', fontsize=12)
axes[0].set_ylabel('Percentage / Score', fontsize=12)
axes[0].set_title('Efficiency Metrics by Workload Type', fontsize=13, fontweight='bold')
axes[0].set_xticks(x)
axes[0].set_xticklabels(namespace_analysis.index, rotation=45, ha='right')
axes[0].legend()
axes[0].grid(True, alpha=0.3, axis='y')

# Plot 2: Efficiency gap by namespace
colors_gap = ['red' if gap > 20 else 'orange' if gap > 10 else 'green' 
              for gap in namespace_analysis['Efficiency Gap']]

axes[1].barh(range(len(namespace_analysis)), namespace_analysis['Efficiency Gap'], 
            color=colors_gap, alpha=0.7)
axes[1].set_yticks(range(len(namespace_analysis)))
axes[1].set_yticklabels(namespace_analysis.index)
axes[1].set_xlabel('Efficiency Gap (percentage points)', fontsize=12)
axes[1].set_title('Efficiency Gap by Workload Type\n(Higher = More Optimization Potential)', 
                 fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='x')
axes[1].axvline(x=20, color='red', linestyle='--', alpha=0.3, linewidth=2)
axes[1].axvline(x=10, color='orange', linestyle='--', alpha=0.3, linewidth=2)

plt.tight_layout()
plt.savefig('namespace_efficiency_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nChart saved: namespace_efficiency_analysis.png")

## 11. Summary and Recommendations

Generate actionable insights based on the analysis.

In [None]:
# Generate summary report
print("=" * 80)
print("GPU EFFICIENCY ANALYSIS - EXECUTIVE SUMMARY")
print("=" * 80)

print("\n1. FLEET OVERVIEW")
print("-" * 80)
print(f"   Total GPUs: {df['UUID'].nunique()}")
print(f"   GPU Models: {', '.join(df['MODELNAME'].unique())}")
print(f"   Analysis Period: {df['SCRAPETIME'].min().strftime('%Y-%m-%d')} to {df['SCRAPETIME'].max().strftime('%Y-%m-%d')}")
print(f"   Total Samples Analyzed: {len(df):,}")

print("\n2. KEY METRICS")
print("-" * 80)
print(f"   Average GPU Utilization: {df['DCGM_FI_DEV_GPU_UTIL'].mean():.1f}%")
print(f"   Average Power Intensity Factor (PIF): {df['power_intensity_factor'].mean():.3f}")
print(f"   Average Realized TFLOPS Utilization (RFU): {df['rfu_percent'].mean():.1f}%")
print(f"   Average Efficiency Gap: {df['efficiency_gap'].mean():.1f} percentage points")

print("\n3. EFFICIENCY CLASSIFICATION (Active GPUs Only)")
print("-" * 80)
for cls, count in active_df['efficiency_class'].value_counts().items():
    pct = count / len(active_df) * 100
    print(f"   {cls}: {pct:.1f}% of active samples")

print("\n4. BOTTLENECK IDENTIFICATION")
print("-" * 80)
high_util_low_power = len(active_df[
    (active_df['DCGM_FI_DEV_GPU_UTIL'] > 70) & 
    (active_df['power_intensity_factor'] < 0.60)
])
pct_bottlenecked = (high_util_low_power / len(active_df)) * 100
print(f"   Bottlenecked Samples (>70% util, <60% PIF): {high_util_low_power:,} ({pct_bottlenecked:.1f}%)")
print(f"   Potential root causes:")
print(f"     • Data I/O bottlenecks (slow storage, network latency)")
print(f"     • Memory bandwidth constraints")
print(f"     • Suboptimal batch sizes or data pipeline configuration")

print("\n5. ROI POTENTIAL")
print("-" * 80)
target_rfu = 60
improvement = target_rfu - current_avg_rfu
equivalent_gpus = (target_rfu / current_avg_rfu - 1) * total_gpus
print(f"   Improving average RFU from {current_avg_rfu:.1f}% to {target_rfu}%:")
print(f"     • Efficiency gain: +{improvement:.1f} percentage points")
print(f"     • Equivalent to adding: {equivalent_gpus:.0f} GPUs")
print(f"     • Additional GPU-hours per month: {equivalent_gpus * 24 * 30:.0f}")
print(f"     • Cost avoidance: Significant capital and operational savings")

print("\n6. RECOMMENDATIONS")
print("-" * 80)

# Find worst performing namespace
worst_namespace = namespace_analysis.index[0]
worst_gap = namespace_analysis.iloc[0]['Efficiency Gap']

print(f"   A. Immediate Actions:")
print(f"      1. Investigate '{worst_namespace}' workloads (efficiency gap: {worst_gap:.1f}pp)")
print(f"      2. Profile data loading pipelines for bottlenecked GPUs")
print(f"      3. Review network configuration and storage access patterns")

print(f"\n   B. Optimization Strategies:")
print(f"      1. Implement data pre-fetching and caching for training workloads")
print(f"      2. Optimize batch sizes to maximize GPU compute utilization")
print(f"      3. Consider GPU right-sizing: move low-intensity jobs to smaller SKUs")
print(f"      4. Improve data pipeline parallelization")

print(f"\n   C. Monitoring & Governance:")
print(f"      1. Set RFU targets by workload type (training: 70%, inference: 50%)")
print(f"      2. Create alerts for sustained efficiency gaps > 25 percentage points")
print(f"      3. Establish regular efficiency review cadence with ML teams")
print(f"      4. Incorporate RFU metrics into capacity planning processes")

print("\n" + "=" * 80)
print("CONCLUSION")
print("=" * 80)
print("By shifting focus from 'GPU busy' to 'GPU productive,' we can unlock")
print("significant capacity from existing infrastructure. Every percentage point")
print("of efficiency improvement translates directly to faster model training,")
print("reduced time-to-production, and substantial cost savings.")
print("=" * 80)
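
The recommendation to alert on sustained efficiency gaps above 25 percentage points could be prototyped on the hourly table with a rolling window. The sketch below is one possible approach, not a production alerting rule: the three-hour window, the function name `sustained_gap_alerts`, and the demo values are assumptions, and it expects hourly rows that are sorted and contiguous.

```python
import pandas as pd

def sustained_gap_alerts(hourly, threshold_pp=25.0, min_hours=3):
    """Return the hours at which the gap has exceeded threshold_pp for min_hours in a row."""
    above = (hourly['efficiency_gap'] > threshold_pp).astype(int)
    # A rolling sum equal to the window length means every hour in the window was above
    run = above.rolling(window=min_hours).sum()
    return hourly.loc[run == min_hours, 'SCRAPETIME']

# Hypothetical hourly gaps; only hours 1-3 form a run of three above 25pp
demo = pd.DataFrame({
    'SCRAPETIME': pd.date_range('2024-01-01', periods=8, freq='60min'),
    'efficiency_gap': [10, 30, 28, 31, 12, 27, 26, 5],
})
alerts = sustained_gap_alerts(demo)
print(alerts.tolist())
```

Applied to this notebook's data, the call would be `sustained_gap_alerts(hourly_metrics)`; each returned timestamp marks the last hour of a qualifying run.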

## 12. Export Analysis Results

In [None]:
# Export key datasets for further analysis

# 1. GPU-level performance summary
gpu_summary = df.groupby(['UUID', 'MODELNAME', 'HOSTNAME', 'NAMESPACE']).agg({
    'DCGM_FI_DEV_GPU_UTIL': 'mean',
    'power_intensity_factor': 'mean',
    'rfu_percent': 'mean',
    'efficiency_gap': 'mean',
    'SCRAPETIME': 'count'
}).round(3).reset_index()

gpu_summary.columns = ['UUID', 'Model', 'Hostname', 'Namespace', 
                       'Avg_Utilization_Pct', 'Avg_PIF', 'Avg_RFU_Pct', 
                       'Efficiency_Gap', 'Sample_Count']

gpu_summary.to_csv('gpu_efficiency_summary.csv', index=False)
print("Exported: gpu_efficiency_summary.csv")

# 2. Hourly aggregated metrics
hourly_metrics.to_csv('hourly_fleet_metrics.csv', index=False)
print("Exported: hourly_fleet_metrics.csv")

# 3. Bottlenecked GPUs list
if len(bottlenecked) > 0:
    bottlenecked.to_csv('bottlenecked_gpus.csv', index=False)
    print("Exported: bottlenecked_gpus.csv")

print("\nAll analysis results have been exported successfully!")