# Notebook 3: Efficiency Benchmarking & Anomaly Detection

**OSU Campus Energy Analysis — Data I/O 2026 Advanced Track**

This notebook identifies energy waste opportunities and anomalies across campus. We compute Energy Use Intensity (EUI) benchmarks, detect off-hours waste, apply statistical and ML-based anomaly detection, and quantify potential savings.

**Narrative arc**: "We identified specific buildings with quantifiable savings opportunities, backed by peer benchmarking and anomaly evidence."

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

sns.set_theme(style='whitegrid', font_scale=1.1)
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['figure.dpi'] = 100

from pathlib import Path
DATA_DIR = Path('/Users/Siddarth/Data IO/processed')

# Per documentation: energy utilities only for EUI and consumption analysis
ENERGY_UTILITIES = ['ELECTRICITY', 'STEAM', 'HEAT', 'GAS', 'COOLING', 'OIL28SEC']

print('Libraries loaded.')

Libraries loaded.


In [2]:
# Load processed data
meter_all = pd.read_parquet(DATA_DIR / 'meter_all_utilities.parquet')
meter_all['date'] = pd.to_datetime(meter_all['date'])
daily_weather = pd.read_parquet(DATA_DIR / 'daily_weather.parquet')
daily_weather['date'] = pd.to_datetime(daily_weather['date'])
buildings = pd.read_parquet(DATA_DIR / 'buildings.parquet')

# Filter to energy utilities only (per documentation: power utilities for peak demand, not EUI)
energy_data = meter_all[meter_all['utility'].isin(ENERGY_UTILITIES)].copy()

print(f'Energy data: {len(energy_data):,} rows')
print(f'Buildings in data: {energy_data["simscode"].nunique()}')
print(f'Utilities: {sorted(energy_data["utility"].unique())}')

Energy data: 8,812,752 rows
Buildings in data: 285
Utilities: ['COOLING', 'ELECTRICITY', 'GAS', 'HEAT', 'OIL28SEC', 'STEAM']


## 1. Energy Use Intensity (EUI) Calculation

EUI = annual energy / gross area; for electricity this gives kWh/sqft/year. Per the documentation, we normalize energy-based utilities by square footage for fair comparison, and we compute a separate EUI for each energy utility to avoid mixing units.
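As a minimal sketch of the calculation, the toy frame below (hypothetical buildings and values, not campus data) shows how EUI stays undefined when the gross area is zero or missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: annual electricity totals (kWh) and gross floor area (sqft)
toy = pd.DataFrame({
    'building': ['A', 'B', 'C'],
    'annual_kwh': [300_000.0, 120_000.0, 50_000.0],
    'grossarea': [100_000.0, 0.0, np.nan],
})

# EUI is defined only where the area is known and positive
valid_area = toy['grossarea'] > 0  # NaN > 0 evaluates to False
toy['eui'] = (toy['annual_kwh'] / toy['grossarea']).where(valid_area)

print(toy)
```

Building A gets an EUI of 3.0 kWh/sqft/year; B and C are left as NaN so they cannot distort summary statistics.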

In [3]:
# Compute annual consumption per building per utility
annual_bldg = energy_data.groupby(['simscode', 'utility', 'buildingname', 'campusname', 'grossarea']).agg(
    annual_total=('readingvalue', 'sum'),
    n_days=('readingvalue', 'count'),
    n_meters=('meterid', 'nunique')
).reset_index()

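# Only keep building/utility pairs with reasonable data coverage (>= 300 daily readings).
# Note: n_days counts rows, so a multi-meter building contributes one row per meter per day.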
annual_bldg = annual_bldg[annual_bldg['n_days'] >= 300].copy()
annual_bldg['grossarea'] = pd.to_numeric(annual_bldg['grossarea'], errors='coerce')

# Compute EUI (only where grossArea is valid and > 0)
annual_bldg['eui'] = np.where(
    (annual_bldg['grossarea'] > 0) & annual_bldg['grossarea'].notna(),
    annual_bldg['annual_total'] / annual_bldg['grossarea'],
    np.nan
)

# Summary by utility
for util in sorted(annual_bldg['utility'].unique()):
    sub = annual_bldg[annual_bldg['utility'] == util].dropna(subset=['eui'])
    if len(sub) > 0:
        print(f'{util}: {len(sub)} buildings, median EUI = {sub["eui"].median():.2f}, '
              f'mean EUI = {sub["eui"].mean():.2f}, range = [{sub["eui"].min():.2f}, {sub["eui"].max():.2f}]')

COOLING: 86 buildings, median EUI = 5.65, mean EUI = 126185.92, range = [0.00, 8991995.36]
ELECTRICITY: 270 buildings, median EUI = 3.13, mean EUI = 10867.07, range = [0.00, 1266988.82]
GAS: 147 buildings, median EUI = 1.60, mean EUI = 14116475.16, range = [0.00, 2074751506.33]
HEAT: 132 buildings, median EUI = 3.36, mean EUI = 1630.52, range = [0.00, 196689.06]
OIL28SEC: 1 buildings, median EUI = 1418145.31, mean EUI = 1418145.31, range = [1418145.31, 1418145.31]
STEAM: 27 buildings, median EUI = 11.71, mean EUI = 500247069.43, range = [0.00, 5452530712.03]


In [4]:
# EUI distribution for electricity (largest utility)
elec_eui = annual_bldg[(annual_bldg['utility'] == 'ELECTRICITY') & annual_bldg['eui'].notna()].copy()
elec_eui = elec_eui[elec_eui['eui'] > 0]

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Histogram
clip_val = elec_eui['eui'].quantile(0.98)
axes[0].hist(elec_eui['eui'].clip(upper=clip_val), bins=40, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].axvline(elec_eui['eui'].median(), color='red', linestyle='--', linewidth=2, label=f'Median: {elec_eui["eui"].median():.1f}')
axes[0].set_title('Electricity EUI Distribution (kWh/sqft/year)', fontweight='bold')
axes[0].set_xlabel('EUI (kWh/sqft/year)')
axes[0].set_ylabel('Number of Buildings')
axes[0].legend()

# Top and bottom performers
top10 = elec_eui.nlargest(10, 'eui')
bot10 = elec_eui.nsmallest(10, 'eui')
combined = pd.concat([top10, bot10]).sort_values('eui')
combined['label'] = combined['buildingname'].apply(lambda x: str(x)[:30] if pd.notna(x) else 'Unknown')
# After the ascending sort the lowest-EUI buildings come first (green) and the highest last (coral)
colors = ['green'] * len(bot10) + ['coral'] * len(top10)
axes[1].barh(combined['label'], combined['eui'], color=colors)
axes[1].set_title('Top 10 Highest & Lowest Electricity EUI Buildings', fontweight='bold')
axes[1].set_xlabel('EUI (kWh/sqft/year)')

plt.tight_layout()
plt.show()

## 2. Peer Benchmarking

We group buildings into peer groups by campus and size tier, then compare each building's EUI with its peer-group median. An age tier is also derived, but it is not used in the grouping. Buildings whose EUI exceeds 1.5x their peer median are candidates for efficiency improvements.
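The comparison against a peer median can be sketched on a toy frame. The buildings and EUI values are hypothetical; the size-tier bins and the 1.5x threshold are the ones used in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical buildings on one campus
toy = pd.DataFrame({
    'building': ['A', 'B', 'C', 'D', 'E', 'F'],
    'campusname': ['Columbus'] * 6,
    'grossarea': [5_000, 8_000, 9_000, 60_000, 80_000, 120_000],
    'eui': [2.0, 3.0, 9.0, 2.5, 3.5, 4.0],
})

toy['size_tier'] = pd.cut(
    toy['grossarea'],
    bins=[0, 10_000, 50_000, 150_000, np.inf],
    labels=['Small (<10k)', 'Medium (10-50k)', 'Large (50-150k)', 'Very Large (>150k)'],
)

# Median EUI of each building's peer group, broadcast back to every row
toy['peer_median_eui'] = (
    toy.groupby(['campusname', 'size_tier'], observed=True)['eui'].transform('median')
)
toy['eui_vs_peer'] = toy['eui'] / toy['peer_median_eui']

flagged = toy.loc[toy['eui_vs_peer'] > 1.5, 'building'].tolist()
print(flagged)
```

Only building C is flagged: its EUI of 9.0 is three times the small-tier median of 3.0, while every large-tier building stays within 1.5x of its own median of 3.5.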

In [5]:
# Create peer groups for electricity EUI
elec_eui_peer = elec_eui.copy()
elec_eui_peer['building_age'] = pd.to_numeric(
    buildings.set_index('buildingnumber').loc[elec_eui_peer['simscode'].values, 'building_age'].values,
    errors='coerce'
)

# Size tiers
elec_eui_peer['size_tier'] = pd.cut(elec_eui_peer['grossarea'], 
                                      bins=[0, 10000, 50000, 150000, np.inf],
                                      labels=['Small (<10k)', 'Medium (10-50k)', 'Large (50-150k)', 'Very Large (>150k)'])
# Age tiers
elec_eui_peer['age_tier'] = pd.cut(elec_eui_peer['building_age'],
                                    bins=[0, 25, 50, 100, np.inf],
                                    labels=['New (0-25yr)', 'Mid (25-50yr)', 'Old (50-100yr)', 'Historic (>100yr)'])

# Compute peer group stats
peer_groups = elec_eui_peer.groupby(['campusname', 'size_tier']).agg(
    peer_median_eui=('eui', 'median'),
    peer_mean_eui=('eui', 'mean'),
    peer_count=('eui', 'count')
).reset_index()

# Merge back
elec_eui_peer = elec_eui_peer.merge(peer_groups, on=['campusname', 'size_tier'], how='left')
elec_eui_peer['eui_vs_peer'] = elec_eui_peer['eui'] / elec_eui_peer['peer_median_eui']

# Flag buildings >1.5x peer median
inefficient = elec_eui_peer[elec_eui_peer['eui_vs_peer'] > 1.5].sort_values('eui_vs_peer', ascending=False)
print(f'Buildings with EUI > 1.5x peer median: {len(inefficient)}')
print(inefficient[['buildingname', 'campusname', 'size_tier', 'eui', 'peer_median_eui', 'eui_vs_peer']].head(15).to_string(index=False))

Buildings with EUI > 1.5x peer median: 84
                                     buildingname campusname          size_tier          eui  peer_median_eui   eui_vs_peer
                   OSU Electric Substation (0079)   Columbus    Medium (10-50k) 1.266989e+06         3.024960 418844.867584
                    Substation West Campus (0134)   Columbus       Small (<10k) 6.198429e+05         3.020524 205210.381258
                    Waterman - Turf Shed 1 (0992)   Columbus       Small (<10k) 5.267399e+05         3.020524 174386.908166
  Energy Advancement and Innovation Center (1044)   Columbus    Large (50-150k) 4.336126e+05         3.179032 136397.684855
                       Dreese Laboratories (0279)   Columbus Very Large (>150k) 8.337157e+04         2.532811  32916.612203
             McPherson Chemical Laboratory (0053)   Columbus    Large (50-150k) 2.006869e+03         3.179032    631.282925
        Chilled Water Plant, East Regional (0376)   Columbus    Medium (10-50k) 1.272475e+

In [6]:
# Peer benchmarking visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# EUI by campus (boxplot)
campus_data = elec_eui_peer.dropna(subset=['campusname', 'eui'])
campus_order = campus_data.groupby('campusname')['eui'].median().sort_values(ascending=False).index
sns.boxplot(data=campus_data, y='campusname', x='eui', order=campus_order, ax=axes[0],
            showfliers=False, palette='Set2')
axes[0].set_title('Electricity EUI by Campus', fontweight='bold')
axes[0].set_xlabel('EUI (kWh/sqft/year)')
axes[0].set_ylabel('')

# EUI by size tier
size_data = elec_eui_peer.dropna(subset=['size_tier', 'eui'])
sns.boxplot(data=size_data, y='size_tier', x='eui', ax=axes[1], showfliers=False, palette='Set3')
axes[1].set_title('Electricity EUI by Building Size Tier', fontweight='bold')
axes[1].set_xlabel('EUI (kWh/sqft/year)')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

## 3. Off-Hours Waste Analysis

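Many buildings consume nearly as much energy at night and on weekends as during occupied hours. The **base load ratio** identifies buildings that "never turn off." We approximate it from the daily data as the minimum 15-minute reading divided by the maximum 15-minute reading, a proxy for night avg / day avg. A ratio above 0.7 suggests excessive off-hours consumption.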
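A minimal sketch of the min/max proxy on hypothetical readings follows. The column names match the meter data used in this notebook; the three rows are invented.

```python
import pandas as pd

# Hypothetical daily minimum and maximum 15-minute readings for three buildings
toy = pd.DataFrame({
    'building': ['Lab', 'Office', 'Offline'],
    'readingwindowmin': [8.0, 2.0, 0.0],
    'readingwindowmax': [10.0, 10.0, 0.0],
})

# The ratio is undefined when the daily maximum is zero
valid = toy['readingwindowmax'] > 0
toy['base_load_ratio'] = (toy['readingwindowmin'] / toy['readingwindowmax']).where(valid)

# Comparisons against NaN are False, so the offline building is not flagged
toy['never_turns_off'] = toy['base_load_ratio'] > 0.7
print(toy)
```

The lab (ratio 0.8) is flagged, the office (0.2) is not, and the offline meter yields NaN instead of a misleading ratio.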

In [7]:
# Use readingwindowmin (daily minimum 15-min reading) as proxy for base load
# and readingwindowmax as proxy for peak load
elec_data = meter_all[meter_all['utility'] == 'ELECTRICITY'].copy()
elec_data['day_of_week'] = elec_data['date'].dt.dayofweek  # 0=Mon, 6=Sun
elec_data['is_weekend'] = elec_data['day_of_week'].isin([5, 6])

# Base load ratio = min / max (per day). Closer to 1 = flat load = never turns off
elec_data['base_load_ratio'] = elec_data['readingwindowmin'] / elec_data['readingwindowmax']
elec_data['base_load_ratio'] = elec_data['base_load_ratio'].replace([np.inf, -np.inf], np.nan)

# Building-level base load ratio
bldg_base = elec_data.groupby(['simscode', 'buildingname', 'campusname', 'grossarea']).agg(
    mean_base_ratio=('base_load_ratio', 'mean'),
    weekday_avg=('readingvalue', lambda x: x[~elec_data.loc[x.index, 'is_weekend']].mean()),
    weekend_avg=('readingvalue', lambda x: x[elec_data.loc[x.index, 'is_weekend']].mean()),
    n_days=('readingvalue', 'count')
).reset_index()

bldg_base = bldg_base[bldg_base['n_days'] >= 300]
bldg_base['weekend_weekday_ratio'] = bldg_base['weekend_avg'] / bldg_base['weekday_avg']
bldg_base['weekend_weekday_ratio'] = bldg_base['weekend_weekday_ratio'].replace([np.inf, -np.inf], np.nan)

# Buildings that never turn off (base load ratio > 0.7)
always_on = bldg_base[bldg_base['mean_base_ratio'] > 0.7].sort_values('mean_base_ratio', ascending=False)
print(f'Buildings with base load ratio > 0.7 ("never turn off"): {len(always_on)}')
print(always_on[['buildingname', 'campusname', 'mean_base_ratio', 'weekend_weekday_ratio', 'weekday_avg']].head(15).to_string(index=False))

Buildings with base load ratio > 0.7 ("never turn off"): 93
                               buildingname campusname  mean_base_ratio  weekend_weekday_ratio  weekday_avg
   Parking Garage - Ohio Union North (0288)   Columbus         0.949363               0.996518     6.392768
                         Hughes Hall (0042)   Columbus         0.903913               0.996294     3.733986
                    Evans Laboratory (0150)   Columbus         0.890823               0.985079    35.185762
             Bulk Chemical Warehouse (0362)   Columbus         0.884962               0.995613     9.783591
   Telecommunications Network Center (0379)   Columbus         0.884135               0.996931    45.407208
           Physics Research Building (0070)   Columbus         0.884074               0.978274   126.155432
                         Sisson Hall (0080)   Columbus         0.875737               0.971526    48.272201
  Parker Food Science and Technology (0064)   Columbus         0.853717     

In [8]:
# Off-hours waste visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Base load ratio distribution
valid_base = bldg_base['mean_base_ratio'].dropna()
valid_base = valid_base[(valid_base >= 0) & (valid_base <= 1)]
axes[0].hist(valid_base, bins=40, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].axvline(0.7, color='red', linestyle='--', linewidth=2, label='Threshold (0.7)')
axes[0].set_title('Distribution of Base Load Ratio (min/max)', fontweight='bold')
axes[0].set_xlabel('Base Load Ratio (1.0 = completely flat load)')
axes[0].set_ylabel('Number of Buildings')
axes[0].legend()

# Weekend vs weekday
valid_ww = bldg_base.dropna(subset=['weekday_avg', 'weekend_avg'])
valid_ww = valid_ww[(valid_ww['weekday_avg'] > 0) & (valid_ww['weekend_avg'] > 0)]
max_val = max(valid_ww['weekday_avg'].quantile(0.95), valid_ww['weekend_avg'].quantile(0.95))
axes[1].scatter(valid_ww['weekday_avg'], valid_ww['weekend_avg'], alpha=0.4, s=20, c='coral')
axes[1].plot([0, max_val], [0, max_val], 'k--', alpha=0.5, label='Equal usage')
axes[1].set_title('Weekend vs Weekday Average Electricity', fontweight='bold')
axes[1].set_xlabel('Weekday Avg (kWh/day)')
axes[1].set_ylabel('Weekend Avg (kWh/day)')
axes[1].legend()
axes[1].set_xlim(0, max_val * 1.1)
axes[1].set_ylim(0, max_val * 1.1)

plt.tight_layout()
plt.show()

print('Buildings above the diagonal use MORE energy on weekends — potential scheduling issue.')
print('Buildings near the diagonal never scale down — potential off-hours waste.')

Buildings above the diagonal use MORE energy on weekends — potential scheduling issue.
Buildings near the diagonal never scale down — potential off-hours waste.


## 4. Anomaly Detection (Statistical)

We use Z-score and IQR methods to flag individual meter-days with anomalous consumption spikes or drops. Both thresholds are computed separately for each meter.
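The two rules can disagree, which is one reason the IQR count in this section is larger than the Z-score count. The sketch below uses ten hypothetical readings from a single meter, with one obvious spike.

```python
import pandas as pd

# Hypothetical daily readings for one meter; the last value is a spike
readings = pd.Series([10, 11, 9, 10, 12, 10, 11, 9, 10, 30], dtype=float)

# Z-score rule: |z| > 3, using the sample standard deviation (pandas default)
z = (readings - readings.mean()) / readings.std()
z_flags = z.abs() > 3

# IQR rule: outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
q1, q3 = readings.quantile(0.25), readings.quantile(0.75)
iqr = q3 - q1
iqr_flags = (readings < q1 - 1.5 * iqr) | (readings > q3 + 1.5 * iqr)

print(f'max |z| = {z.abs().max():.2f}, z-score flags = {int(z_flags.sum())}')
print(f'IQR flags = {int(iqr_flags.sum())}')
```

Here the spike inflates the standard deviation enough that its own z-score stays below 3, while the IQR fences, which are built from quartiles, still catch it.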

In [9]:
# Z-score anomaly detection per meter
elec_data_clean = elec_data[elec_data['readingvalue'] > 0].copy()

# Compute per-meter mean and std
meter_stats = elec_data_clean.groupby('meterid')['readingvalue'].agg(['mean', 'std']).reset_index()
meter_stats.columns = ['meterid', 'meter_mean', 'meter_std']
meter_stats = meter_stats[meter_stats['meter_std'] > 0]  # skip constant meters

elec_anom = elec_data_clean.merge(meter_stats, on='meterid', how='inner')
elec_anom['z_score'] = (elec_anom['readingvalue'] - elec_anom['meter_mean']) / elec_anom['meter_std']

# Flag anomalies at |z| > 3
elec_anom['is_anomaly_zscore'] = elec_anom['z_score'].abs() > 3
n_anomalies = elec_anom['is_anomaly_zscore'].sum()
print(f'Z-score anomalies (|z| > 3): {n_anomalies:,} days ({100*n_anomalies/len(elec_anom):.2f}%)')

# IQR method: per-meter quartiles and interquartile range in one pass
meter_q = elec_data_clean.groupby('meterid')['readingvalue'].quantile([0.25, 0.75]).unstack()
meter_q.columns = ['q1', 'q3']
meter_q['iqr'] = meter_q['q3'] - meter_q['q1']

elec_anom = elec_anom.merge(meter_q.reset_index(), on='meterid', how='left')
elec_anom['is_anomaly_iqr'] = (elec_anom['readingvalue'] < elec_anom['q1'] - 1.5 * elec_anom['iqr']) | \
                                (elec_anom['readingvalue'] > elec_anom['q3'] + 1.5 * elec_anom['iqr'])

n_iqr = elec_anom['is_anomaly_iqr'].sum()
print(f'IQR anomalies: {n_iqr:,} days ({100*n_iqr/len(elec_anom):.2f}%)')

Z-score anomalies (|z| > 3): 21,931 days (0.58%)
IQR anomalies: 100,326 days (2.68%)


In [10]:
# Visualize anomaly distribution over time
anom_by_date = elec_anom.groupby('date')['is_anomaly_zscore'].sum().reset_index()
anom_by_date.columns = ['date', 'n_anomalies']

fig, axes = plt.subplots(2, 1, figsize=(16, 8))

axes[0].bar(anom_by_date['date'], anom_by_date['n_anomalies'], color='coral', alpha=0.7, width=1)
axes[0].set_title('Daily Count of Z-Score Anomalies (|z| > 3) Across All Meters', fontweight='bold')
axes[0].set_ylabel('Number of Anomalous Meters')
import matplotlib.dates as mdates
axes[0].xaxis.set_major_formatter(mdates.DateFormatter('%b'))

# Anomaly rate by meter (which meters are most problematic?)
meter_anom_rate = elec_anom.groupby('meterid').agg(
    n_anomalies=('is_anomaly_zscore', 'sum'),
    n_total=('is_anomaly_zscore', 'count'),
    building=('buildingname', 'first')
).reset_index()
meter_anom_rate['anom_rate'] = meter_anom_rate['n_anomalies'] / meter_anom_rate['n_total']
top_anom_meters = meter_anom_rate.nlargest(15, 'anom_rate')

axes[1].barh(top_anom_meters['building'].apply(lambda x: str(x)[:30] if pd.notna(x) else ''),
             top_anom_meters['anom_rate'] * 100, color='steelblue', edgecolor='black')
axes[1].set_title('Top 15 Meters by Anomaly Rate', fontweight='bold')
axes[1].set_xlabel('% of Days with Anomalous Readings')

plt.tight_layout()
plt.show()

## 5. Anomaly Detection (Machine Learning)

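We apply Isolation Forest, an unsupervised ML algorithm, to several features at once (daily kWh, intra-day variability, base load ratio, temperature, and day of week). This can surface multivariate anomalies that per-meter thresholds miss. Note that `contamination=0.05` fixes the flagged share at about 5% of each building's days, so the per-building anomaly counts are set by that parameter; the anomaly scores and the flagged dates carry the information.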
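To illustrate how the `contamination` parameter drives the number of flagged rows, here is a minimal sketch on synthetic Gaussian data (an assumption for illustration, with no injected anomalies). The model settings are the ones used in the cell below.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic, well-behaved data: no anomalies were injected
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))

iso = IsolationForest(contamination=0.05, random_state=42, n_estimators=100)
labels = iso.fit_predict(X)  # -1 = flagged as anomaly, 1 = normal

n_flagged = int((labels == -1).sum())
print(f'{n_flagged} of {len(X)} rows flagged ({100 * n_flagged / len(X):.1f}%)')
```

Roughly 5% of rows are flagged even though the data contains no real anomalies. That is consistent with every building in the summary showing 19 flagged days out of 362, so the ranking of days by `iso_score` is more informative than the count.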

In [11]:
# Prepare features for Isolation Forest
# Aggregate to building-day level with multiple features
bldg_day = elec_data.groupby(['simscode', 'date']).agg(
    daily_kwh=('readingvalue', 'sum'),
    min_reading=('readingwindowmin', 'mean'),
    max_reading=('readingwindowmax', 'mean'),
    stddev=('readingwindowstandarddeviation', 'mean'),
    pct_missing=('pct_missing', 'mean'),
    buildingname=('buildingname', 'first')
).reset_index()

bldg_day = bldg_day.merge(daily_weather[['date', 'temp_mean']], on='date', how='inner')
bldg_day['day_of_week'] = bldg_day['date'].dt.dayofweek
bldg_day['load_range'] = bldg_day['max_reading'] - bldg_day['min_reading']
bldg_day['base_load_ratio'] = np.where(bldg_day['max_reading'] > 0,
                                        bldg_day['min_reading'] / bldg_day['max_reading'], np.nan)

# Run Isolation Forest on the top 30 most consuming buildings
top30 = bldg_day.groupby('simscode')['daily_kwh'].sum().nlargest(30).index
iso_results = []

for bldg_id in top30:
    bdf = bldg_day[bldg_day['simscode'] == bldg_id].copy()
    feature_cols = ['daily_kwh', 'stddev', 'base_load_ratio', 'temp_mean', 'day_of_week']
    bdf_features = bdf[feature_cols].dropna()
    
    if len(bdf_features) < 60:
        continue
    
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(bdf_features)
    
    iso = IsolationForest(contamination=0.05, random_state=42, n_estimators=100)
    bdf.loc[bdf_features.index, 'iso_anomaly'] = iso.fit_predict(X_scaled)
    bdf.loc[bdf_features.index, 'iso_score'] = iso.decision_function(X_scaled)
    
    n_anom = (bdf['iso_anomaly'] == -1).sum()
    iso_results.append({
        'simscode': bldg_id,
        'buildingname': bdf['buildingname'].iloc[0],
        'n_anomalies': n_anom,
        'anom_pct': 100 * n_anom / len(bdf_features),
        'total_days': len(bdf_features)
    })

iso_summary = pd.DataFrame(iso_results)
print('=== Isolation Forest Anomaly Summary (top 30 buildings) ===')
print(iso_summary.sort_values('n_anomalies', ascending=False).head(15).to_string(index=False))

=== Isolation Forest Anomaly Summary (top 30 buildings) ===
simscode                                   buildingname  n_anomalies  anom_pct  total_days
      82                            Ohio Stadium (0082)           19  5.248619         362
      79                 OSU Electric Substation (0079)           19  5.248619         362
     308                          Rightmire Hall (0308)           19  5.248619         362
     171                               Dodd Hall (0171)           19  5.248619         362
     277                             Graves Hall (0277)           19  5.248619         362
      24                             Postle Hall (0024)           19  5.248619         362
      89                               Doan Hall (0089)           19  5.248619         362
     372                Brain and Spine Hospital (0372)           19  5.248619         362
     353                     Ross Heart Hospital (0353)           19  5.248619         362
      70               Physics

## 6. Change-Point Detection (CUSUM)

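CUSUM (Cumulative Sum) detects step changes in consumption patterns, which can indicate equipment failure, a retrofit, or occupancy changes. This goes beyond daily anomalies to find persistent shifts. Because each series is normalized by its full-period mean and the sums reset after every alarm, a long excursion from that mean (for example a seasonal load) re-triggers at regular intervals. Evenly spaced change points are therefore more likely one sustained shift than many separate events.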
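The reset behavior is easiest to see on a noise-free synthetic step. The sketch below mirrors the logic of `detect_cusum` from the next cell, with the same default threshold and drift. The function name `cusum_alarms` and the test series are hypothetical, and it uses NumPy's population standard deviation rather than the pandas sample standard deviation.

```python
import numpy as np

def cusum_alarms(values, threshold=5.0, drift=0.5):
    """Two-sided CUSUM on z-scores, resetting both sums after each alarm."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # population std (ddof=0); pandas .std() uses ddof=1
    if std == 0:
        return []
    z = (values - values.mean()) / std
    s_pos = s_neg = 0.0
    alarms = []
    for i in range(1, len(z)):
        s_pos = max(0.0, s_pos + z[i] - drift)  # accumulates upward deviations
        s_neg = max(0.0, s_neg - z[i] - drift)  # accumulates downward deviations
        if s_pos > threshold or s_neg > threshold:
            alarms.append(i)
            s_pos = s_neg = 0.0
    return alarms

# One true step change at index 50: 50 days at 10 kWh, then 50 days at 20 kWh
step = np.r_[np.full(50, 10.0), np.full(50, 20.0)]
alarms = cusum_alarms(step)
print(alarms)
```

The single step produces eight alarms, spaced 11 samples apart on each side of the change, and none at index 50 itself. Each half of the series sits one standard deviation from the overall mean, so the sum refills after every reset.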

In [12]:
def detect_cusum(series, threshold=5.0, drift=0.5):
    """CUSUM change-point detection. Returns indices of detected change points."""
    mean = series.mean()
    std = series.std()
    if std == 0:
        return []
    
    normalized = (series - mean) / std
    s_pos = np.zeros(len(normalized))
    s_neg = np.zeros(len(normalized))
    change_points = []
    
    for i in range(1, len(normalized)):
        s_pos[i] = max(0, s_pos[i-1] + normalized.iloc[i] - drift)
        s_neg[i] = max(0, s_neg[i-1] - normalized.iloc[i] - drift)
        if s_pos[i] > threshold or s_neg[i] > threshold:
            change_points.append(i)
            s_pos[i] = 0
            s_neg[i] = 0
    
    return change_points

# Run CUSUM on top 20 electricity consumers
top20 = bldg_day.groupby('simscode')['daily_kwh'].sum().nlargest(20).index
cusum_results = []

for bldg_id in top20:
    bdf = bldg_day[bldg_day['simscode'] == bldg_id].sort_values('date')
    if len(bdf) < 60:
        continue
    cps = detect_cusum(bdf['daily_kwh'])
    cusum_results.append({
        'simscode': bldg_id,
        'buildingname': bdf['buildingname'].iloc[0],
        'n_change_points': len(cps),
        'change_dates': [bdf.iloc[cp]['date'] for cp in cps[:5]] if cps else []
    })

cusum_df = pd.DataFrame(cusum_results)
print('=== CUSUM Change-Point Detection ===')
for _, row in cusum_df.iterrows():
    name = str(row['buildingname'])[:35] if pd.notna(row['buildingname']) else row['simscode']
    dates = [d.strftime('%b %d') if hasattr(d, 'strftime') else str(d) for d in row['change_dates']]
    print(f'{name}: {row["n_change_points"]} change-points — {dates}')

=== CUSUM Change-Point Detection ===
Energy Advancement and Innovation C: 7 change-points — ['Nov 10', 'Nov 11', 'Nov 13', 'Nov 17', 'Nov 24']
OSU Electric Substation (0079): 6 change-points — ['Jan 12', 'Feb 23', 'Mar 05', 'Jun 25', 'Sep 14']
Dreese Laboratories (0279): 4 change-points — ['Nov 16', 'Nov 27', 'Dec 09', 'Dec 30']
Substation West Campus (0134): 5 change-points — ['Apr 06', 'May 26', 'Jun 26', 'Nov 16', 'Dec 10']
Waterman - Turf Shed 1 (0992): 1 change-points — ['Nov 23']
McPherson Chemical Laboratory (0053: 18 change-points — ['Apr 06', 'Apr 09', 'Apr 12', 'Apr 15', 'Apr 18']
Hopkins Hall (0149): 1 change-points — ['Dec 24']
Scott Laboratory (0148): 1 change-points — ['Dec 14']
Chiller Plant, South Campus Central: 25 change-points — ['Jan 12', 'Jan 23', 'Feb 04', 'Feb 15', 'Feb 26']
James Cancer Hospital (0375): 5 change-points — ['May 26', 'Aug 22', 'Oct 07', 'Nov 02', 'Dec 17']
McCracken Power Plant (0069): 24 change-points — ['Jan 11', 'Jan 21', 'Jan 31', 'Feb 09', 'F

In [13]:
# Visualize CUSUM for a building with change points
bldg_with_cp = cusum_df[cusum_df['n_change_points'] > 0].iloc[0] if len(cusum_df[cusum_df['n_change_points'] > 0]) > 0 else None

if bldg_with_cp is not None:
    bdf = bldg_day[bldg_day['simscode'] == bldg_with_cp['simscode']].sort_values('date')
    cps = detect_cusum(bdf['daily_kwh'])
    
    fig, ax = plt.subplots(figsize=(16, 6))
    ax.plot(bdf['date'], bdf['daily_kwh'], color='steelblue', alpha=0.6, linewidth=0.8)
    ax.plot(bdf['date'], bdf['daily_kwh'].rolling(14).mean(), color='navy', linewidth=2, label='14-day rolling mean')
    
    for cp in cps:
        ax.axvline(bdf.iloc[cp]['date'], color='red', linestyle='--', alpha=0.7)
    
    name = str(bldg_with_cp['buildingname'])[:40] if pd.notna(bldg_with_cp['buildingname']) else bldg_with_cp['simscode']
    ax.set_title(f'CUSUM Change-Point Detection: {name}', fontsize=14, fontweight='bold')
    ax.set_xlabel('Date')
    ax.set_ylabel('Daily Electricity (kWh)')
    ax.legend()
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
    plt.tight_layout()
    plt.show()
    print(f'Red lines indicate CUSUM-detected step changes in consumption pattern.')
else:
    print('No buildings with detected change points.')

Red lines indicate CUSUM-detected step changes in consumption pattern.


## 7. Savings Quantification

We estimate potential savings for buildings identified as inefficient, combining peer benchmarking and off-hours waste analysis. We use a conservative electricity rate of $0.08/kWh.
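The savings arithmetic for a single building can be sketched as follows. The helper `peer_savings` and the example inputs are hypothetical; the $0.08/kWh rate is the one stated above.

```python
def peer_savings(eui, peer_median_eui, grossarea, rate=0.08):
    """Return (excess kWh/year, dollars/year) if EUI dropped to the peer median."""
    # Buildings at or below the peer median have no excess to recover
    excess_eui = max(eui - peer_median_eui, 0.0)
    excess_kwh = excess_eui * grossarea
    return excess_kwh, excess_kwh * rate

# Hypothetical building: 100,000 sqft at 5.0 kWh/sqft/year vs. a peer median of 3.0
kwh, dollars = peer_savings(eui=5.0, peer_median_eui=3.0, grossarea=100_000)
print(f'{kwh:,.0f} kWh/year, ${dollars:,.0f}/year')
```

Because the estimate scales linearly with both excess EUI and floor area, a single building with an erroneous EUI can dominate the campus total, so it is worth screening extreme EUIs before summing.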

In [14]:
ELEC_RATE = 0.08  # $/kWh (conservative campus rate)

# Savings from peer benchmarking: if each building reduced EUI to peer median
savings_peer = elec_eui_peer[elec_eui_peer['eui_vs_peer'] > 1.0].copy()
savings_peer['excess_eui'] = savings_peer['eui'] - savings_peer['peer_median_eui']
savings_peer['excess_kwh'] = savings_peer['excess_eui'] * savings_peer['grossarea']
savings_peer['savings_dollars'] = savings_peer['excess_kwh'] * ELEC_RATE
savings_peer = savings_peer.sort_values('excess_kwh', ascending=False)

total_savings_kwh = savings_peer['excess_kwh'].sum()
total_savings_dollars = savings_peer['savings_dollars'].sum()

print('=== Savings Potential from Peer Benchmarking ===')
print('If every above-median building reduced EUI to its peer median:')
print(f'  Total savings: {total_savings_kwh/1e6:.1f} million kWh/year')
print(f'  Estimated cost savings: ${total_savings_dollars/1e6:.2f} million/year')
print('\nTop 10 buildings by savings potential:')
print(savings_peer[['buildingname', 'campusname', 'eui', 'peer_median_eui', 'excess_kwh', 'savings_dollars']].head(10).to_string(index=False))

=== Savings Potential from Peer Benchmarking ===
If every above-median building reduced EUI to its peer median:
  Total savings: 68653.5 million kWh/year
  Estimated cost savings: $5492.28 million/year

Top 10 buildings by savings potential:
                                   buildingname campusname          eui  peer_median_eui   excess_kwh  savings_dollars
Energy Advancement and Innovation Center (1044)   Columbus 4.336126e+05         3.179032 2.869194e+10     2.295355e+09
                 OSU Electric Substation (0079)   Columbus 1.266989e+06         3.024960 1.715499e+10     1.372399e+09
                     Dreese Laboratories (0279)   Columbus 8.337157e+04         2.532811 1.545912e+10     1.236730e+09
                  Substation West Campus (0134)   Columbus 6.198429e+05         3.020524 6.034141e+09     4.827313e+08
                  Waterman - Turf Shed 1 (0992)   Columbus 5.267399e+05         3.020524 1.000800e+09     8.006400e+07
           McPherson Chemical Laboratory (00

In [15]:
# Savings visualization
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Top 10 savings opportunities
top10_savings = savings_peer.head(10).copy()
top10_savings['label'] = top10_savings['buildingname'].apply(lambda x: str(x)[:30] if pd.notna(x) else 'Unknown')
axes[0].barh(top10_savings['label'], top10_savings['savings_dollars'] / 1000, color='green', edgecolor='black')
axes[0].set_title('Top 10 Buildings: Annual Savings Potential', fontweight='bold')
axes[0].set_xlabel('Estimated Savings ($1,000/year)')

# Cumulative savings (Pareto-style)
savings_sorted = savings_peer.sort_values('excess_kwh', ascending=False)
savings_sorted['cum_savings'] = savings_sorted['excess_kwh'].cumsum()
savings_sorted['cum_pct'] = savings_sorted['cum_savings'] / total_savings_kwh * 100
axes[1].plot(range(1, len(savings_sorted) + 1), savings_sorted['cum_pct'], 'o-', color='steelblue', markersize=3)
axes[1].axhline(80, color='red', linestyle='--', alpha=0.5, label='80% of total')
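# Smallest number of top-ranked buildings whose cumulative share reaches 80%
n_80 = min(int((savings_sorted['cum_pct'] < 80).sum()) + 1, len(savings_sorted))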
axes[1].axvline(n_80, color='red', linestyle='--', alpha=0.5)
axes[1].set_title(f'Cumulative Savings: {n_80} Buildings = 80% of Total', fontweight='bold')
axes[1].set_xlabel('Number of Buildings (ranked by savings)')
axes[1].set_ylabel('Cumulative % of Total Savings')
axes[1].legend()

plt.tight_layout()
plt.show()

## 8. Data Quality Scorecard

Per-meter data quality assessment: mean and maximum % missing readings, the number of days with any missing data, and mean % filtered readings. This informs which meters need sensor maintenance.
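The scorecard cell in this section does not measure consecutive gaps. As a sketch of how that could be added, the hypothetical helper below finds the longest run of fully missing days for one meter, assuming a date-ordered series of daily `pct_missing` values.

```python
import pandas as pd

def longest_missing_run(pct_missing, threshold=100.0):
    """Length of the longest run of consecutive days with pct_missing >= threshold."""
    is_gap = pd.Series(pct_missing, dtype=float) >= threshold
    if is_gap.empty:
        return 0
    # A new run starts whenever the gap flag differs from the previous day
    run_id = is_gap.ne(is_gap.shift()).cumsum()
    # Summing the boolean flag per run gives the run length for gaps and 0 otherwise
    return int(is_gap.groupby(run_id).sum().max())

# Hypothetical meter: a 2-day outage followed later by a 3-day outage
example = [0, 100, 100, 0, 100, 100, 100, 0]
print(longest_missing_run(example))
```

Applied per meter (for example with `groupby('meterid')` on date-sorted data), this would separate meters with one long outage from meters with scattered single-day dropouts.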

In [16]:
# Data quality scorecard per meter
quality = meter_all.groupby(['meterid', 'simscode', 'buildingname', 'utility']).agg(
    total_days=('readingvalue', 'count'),
    mean_pct_missing=('pct_missing', 'mean'),
    max_pct_missing=('pct_missing', 'max'),
    days_with_missing=('pct_missing', lambda x: (x > 0).sum()),
    mean_pct_filtered=('pct_filtered', 'mean'),
    mean_reading=('readingvalue', 'mean')
).reset_index()

quality['data_quality_grade'] = np.where(
    quality['mean_pct_missing'] == 0, 'A (Perfect)',
    np.where(quality['mean_pct_missing'] < 5, 'B (Good)',
    np.where(quality['mean_pct_missing'] < 25, 'C (Fair)',
    np.where(quality['mean_pct_missing'] < 50, 'D (Poor)', 'F (Critical)'))))

print('=== Data Quality Scorecard ===')
print(quality['data_quality_grade'].value_counts().sort_index().to_string())

# Worst meters
worst = quality.nlargest(10, 'mean_pct_missing')
print(f'\nWorst 10 meters (highest missing rate):')
print(worst[['meterid', 'buildingname', 'utility', 'mean_pct_missing', 'total_days', 'data_quality_grade']].to_string(index=False))

=== Data Quality Scorecard ===
data_quality_grade
A (Perfect)     783
B (Good)         84
C (Fair)         32
D (Poor)         58
F (Critical)     58

Worst 10 meters (highest missing rate):
 meterid                       buildingname     utility  mean_pct_missing  total_days data_quality_grade
  245952   Baker Systems Engineering (0280) ELECTRICITY             100.0           0       F (Critical)
  246000      Pump House - Cannon Dr (1010) ELECTRICITY             100.0           0       F (Critical)
  246025 Comprehensive Cancer Center (0363) ELECTRICITY             100.0           0       F (Critical)
  246263                  Parks Hall (0273) ELECTRICITY             100.0           0       F (Critical)
  246264                  Parks Hall (0273) ELECTRICITY             100.0           0       F (Critical)
  246265                  Parks Hall (0273) ELECTRICITY             100.0           0       F (Critical)
  246386   Veterinary Medical Center (0299) ELECTRICITY             100.0 

In [17]:
# Save results for Notebook 5
try:
    savings_peer.to_parquet(DATA_DIR / 'savings_potential.parquet')
    quality.to_parquet(DATA_DIR / 'data_quality_scorecard.parquet')
    print('Saved savings_potential.parquet and data_quality_scorecard.parquet')
except NameError:
    print('Dataframes not found, cell execution order issue?')

Saved savings_potential.parquet and data_quality_scorecard.parquet


## Key Findings

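1. **EUI Benchmarking**: Electricity EUI varies widely across the 270 buildings with valid data. The median is 3.13 kWh/sqft/year, but the mean (10,867) and the maximum (about 1.27 million) are driven by a handful of extreme values, including two substations, which likely reflect metering, unit, or floor-area issues rather than real intensity.
2. **Peer Comparison**: 84 buildings exceed 1.5x the median EUI of their peer group (same campus and size tier). Once data errors are screened out, these are the highest-impact efficiency candidates.
3. **Off-Hours Waste**: 93 buildings have a mean base load ratio above 0.7, meaning they "never turn off," and the top-ranked ones show weekend/weekday ratios close to 1.0. This points to scheduling or controls issues, although some facilities may legitimately run around the clock.
4. **Anomaly Detection**: The Z-score rule flags 21,931 meter-days (0.58%) and the IQR rule flags 100,326 (2.68%). Isolation Forest flags about 5% of days per building by construction, so its value lies in which days it ranks as most unusual, potentially indicating equipment malfunction or operational issues.
5. **Change-Point Detection**: CUSUM identifies persistent shifts in building consumption, which is useful for detecting equipment failures or successful retrofits. Buildings with many evenly spaced change points are more likely showing one sustained deviation from their annual mean than many separate events.
6. **Savings Potential**: The raw estimate from reducing above-median buildings to their peer median is 68,653.5 million kWh/year ($5,492.28 million/year). The five buildings with the largest excess account for more than 99% of that total, and all five have EUIs orders of magnitude above the campus median, so the figure should be treated as an artifact of data outliers until those buildings are validated.
7. **Data Quality**: 783 of 1,015 meters have no missing data (grade A), while 58 meters are graded F, including several with 100% missing readings. These should be prioritized for sensor maintenance.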