# Model Performance Analysis: Impact of Gaps and Regime Shifts

This notebook computes forecast accuracy (RMSE) for Prophet, ARIMA, LSTM, and Random Forest on the attached simulation data (`SimulatedQueryMetrics.csv`), then analyzes how data gaps and regime shifts (plan regressions) affect that accuracy, following the structure of Tables 1 and 2 above. XGBoost is fitted as well but is reported only in Table 2.

- **Metrics analyzed:** CPU, LatencyMs, LogicalReads
- **Queries:** Q1, Q2
- **Models:** Prophet, ARIMA, LSTM, Random Forest
- **Evaluations:**
  - Aggregate RMSE per query/metric/model
  - RMSE during normal, gap, and plan-regression periods, plus the % increase of the gap and plan-regression RMSE relative to normal
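
The two quantities reported throughout can be sketched as follows. The helper names `rmse` and `pct_increase` are illustrative and not part of the notebook, and the % increase is assumed to be measured relative to the normal-period RMSE, which is how the evaluation cell further down computes it.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def pct_increase(rmse_regime, rmse_normal):
    """Percent change of a regime's RMSE relative to the normal-period RMSE."""
    if not rmse_normal or math.isnan(rmse_normal) or math.isnan(rmse_regime):
        return float("nan")  # undefined when the baseline is zero or missing
    return 100.0 * (rmse_regime - rmse_normal) / rmse_normal

# Illustrative numbers only (not taken from the simulation data).
normal = rmse([10, 12, 11], [11, 11, 12])  # every absolute error is 1
gap = rmse([10, 12, 11], [12, 10, 13])     # every absolute error is 2
print(normal, gap, pct_increase(gap, normal))
```

A negative value means the model was more accurate in that regime than in normal periods.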

In [None]:
# !pip install numpy pandas scikit-learn xgboost statsmodels prophet tensorflow

In [2]:
import numpy as np
import pandas as pd
import time
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from statsmodels.tsa.arima.model import ARIMA
from prophet import Prophet
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

## Load Data

In [6]:
csv_file = 'SimulatedQueryMetrics.csv'
df = pd.read_csv(csv_file)
df['MetricDate'] = pd.to_datetime(df['MetricDate'])
df = df.sort_values(['QueryName', 'MetricDate', 'QueryVariant']).reset_index(drop=True)
print('Columns:', df.columns.tolist())

Columns: ['SimDay', 'SimHour', 'MetricDate', 'QueryName', 'QueryVariant', 'CPU', 'LatencyMs', 'LogicalReads', 'PlanRegression']


## Utility: Lag Feature Builder

In [8]:
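def create_lag_features(df, lags=7, val_col='y'):
    """Add lag_1..lag_<lags> columns of `val_col`, then drop rows containing NaN."""
    df = df.copy()
    for lag in range(1, lags+1):
        df[f'lag_{lag}'] = df[val_col].shift(lag)
    # dropna removes the first `lags` rows and any other row with a missing value
    df = df.dropna().reset_index(drop=True)
    return df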
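
To make the effect of the lag builder concrete, here is a small sketch on a toy series. The function body is repeated so the snippet runs on its own, and the values are illustrative rather than taken from `SimulatedQueryMetrics.csv`.

```python
import pandas as pd

def create_lag_features(df, lags=7, val_col='y'):
    # Same logic as the notebook cell, repeated here for a standalone example.
    df = df.copy()
    for lag in range(1, lags + 1):
        df[f'lag_{lag}'] = df[val_col].shift(lag)
    return df.dropna().reset_index(drop=True)

# Toy series of five observations.
toy = pd.DataFrame({'y': [10.0, 11.0, 12.0, 13.0, 14.0]})
lagged = create_lag_features(toy, lags=2)
print(lagged)
```

With `lags=2` the first two rows have no complete history and are dropped, leaving three rows in which `lag_1` holds the previous value and `lag_2` the one before it. Because `dropna()` inspects every column, missing targets need to be imputed before this function is called, otherwise those rows disappear as well.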

## Model wrappers for timing and predictions

In [10]:
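def time_fit(model, X, y):
    """Fit `model` and return (whatever `fit` returns, elapsed seconds)."""
    # perf_counter is a monotonic, high-resolution clock suited to timing
    start = time.perf_counter()
    model_fit = model.fit(X, y)
    train_time = time.perf_counter() - start
    return model_fit, train_time

def time_predict(model, X):
    """Predict with `model` and return (predictions, elapsed seconds)."""
    start = time.perf_counter()
    y_pred = model.predict(X)
    pred_time = time.perf_counter() - start
    return y_pred, pred_time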
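
A usage sketch of the two wrappers, using a hypothetical `MeanModel` stand-in so the snippet needs no external libraries. The wrappers are repeated here with `time.perf_counter` as the clock, which is an assumption of this sketch.

```python
import time

class MeanModel:
    """Hypothetical estimator with a scikit-learn-style fit/predict interface."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self  # returning self is what lets time_fit hand back a usable model

    def predict(self, X):
        return [self.mean_ for _ in X]

def time_fit(model, X, y):
    start = time.perf_counter()
    fitted = model.fit(X, y)
    return fitted, time.perf_counter() - start

def time_predict(model, X):
    start = time.perf_counter()
    y_pred = model.predict(X)
    return y_pred, time.perf_counter() - start

X = [[1], [2], [3], [4]]
y = [2.0, 4.0, 6.0, 8.0]
fitted, train_time = time_fit(MeanModel(), X, y)
y_pred, pred_time = time_predict(fitted, X)
print(y_pred, train_time, pred_time)
```

The first return value of `time_fit` is whatever `fit` returns. For scikit-learn and XGBoost estimators that is the fitted estimator itself, but a Keras model's `fit` returns a `History` object, so the result cannot be passed to `time_predict` in that case.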

## Core Evaluation: RMSE per Query/Metric/Model (+Gaps/Plan Regression Analysis)

In [12]:
results = []
models = ['Prophet', 'ARIMA', 'LSTM', 'Random Forest']
metrics = ['CPU', 'LatencyMs', 'LogicalReads']
queries = ['Q1', 'Q2']
lags = 7
np.random.seed(42)
tf.random.set_seed(42)  # seed TensorFlow too, so LSTM weight initialization is not left unseeded

# For table 2: RMSE for normal, gap, plan regression
gap_results = []

for query in queries:
    for metric in metrics:
        # Use all variants (pool), or optionally do per-variant
        dfx = df[df['QueryName']==query].copy().sort_values(['MetricDate', 'QueryVariant'])

        # Mark gap, plan regression periods
        dfx['is_gap'] = dfx[metric].isnull()
        dfx['is_regression'] = dfx['PlanRegression']==1

        # Keep only the columns needed downstream and rename to Prophet's ds/y convention
        dfx = dfx[['MetricDate', metric, 'QueryVariant','PlanRegression','is_gap','is_regression']].rename(columns={metric:'y','MetricDate':'ds'})
        dfx = dfx.reset_index(drop=True)

        # Impute missing values for modeling: forward fill, then backward fill for leading gaps
        # (fillna(method=...) is deprecated in recent pandas releases)
        dfx['y'] = dfx['y'].ffill().bfill()

        # Build lags
        dfx_lagged = create_lag_features(dfx, lags=lags, val_col='y')
        split = int(len(dfx_lagged)*0.8)
        train_df = dfx_lagged.iloc[:split]
        test_df = dfx_lagged.iloc[split:]
        X_train = train_df[[f'lag_{i}' for i in range(1, lags+1)]]
        y_train = train_df['y']
        X_test = test_df[[f'lag_{i}' for i in range(1, lags+1)]]
        y_test = test_df['y']
        # For Prophet, ARIMA: use ds/y
        prophet_train = train_df[['ds','y']]
        prophet_test = test_df[['ds','y']]
        arima_train = train_df['y']
        arima_test = test_df['y']

        # --- Random Forest ---
        rf = RandomForestRegressor(n_estimators=100, random_state=42)
        rf_fit, _ = time_fit(rf, X_train, y_train)
        rf_pred, _ = time_predict(rf_fit, X_test)

        # --- XGBoost (for Table 2 only) ---
        xgb = XGBRegressor(n_estimators=100, random_state=42, verbosity=0)
        xgb_fit, _ = time_fit(xgb, X_train, y_train)
        xgb_pred, _ = time_predict(xgb_fit, X_test)

        # --- Prophet ---
        m = Prophet()
        m.fit(prophet_train)
        forecast = m.predict(prophet_test[['ds']])
        prophet_pred = forecast['yhat'].values

        # --- ARIMA ---
        arima_model = ARIMA(arima_train, order=(lags, 0, 0))
        arima_fit = arima_model.fit()
        arima_pred = arima_fit.forecast(steps=len(arima_test))

        # --- LSTM ---
        X_train_lstm = X_train.values.reshape((-1, lags, 1))
        X_test_lstm = X_test.values.reshape((-1, lags, 1))
        lstm_model = Sequential()
        lstm_model.add(LSTM(32, input_shape=(lags, 1)))
        lstm_model.add(Dense(1))
        lstm_model.compile(optimizer='adam', loss='mse')
        lstm_model.fit(X_train_lstm, y_train, epochs=10, batch_size=32, verbose=0)
        lstm_pred = lstm_model.predict(X_test_lstm).flatten()

        # Calc RMSE for each model (overall)
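        # Take the square root explicitly; the `squared=False` argument is deprecated in recent scikit-learn releases
        RMSE = lambda y, yhat: np.sqrt(mean_squared_error(y, yhat))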

        results.append({
            'Query': query, 'Metric': metric, 'Model': 'Prophet', 'RMSE': RMSE(y_test, prophet_pred)
        })
        results.append({
            'Query': query, 'Metric': metric, 'Model': 'ARIMA', 'RMSE': RMSE(y_test, arima_pred)
        })
        results.append({
            'Query': query, 'Metric': metric, 'Model': 'LSTM', 'RMSE': RMSE(y_test, lstm_pred)
        })
        results.append({
            'Query': query, 'Metric': metric, 'Model': 'Random Forest', 'RMSE': RMSE(y_test, rf_pred)
        })

        # For Table 2: RMSE during normal, gap, plan regression (for Prophet, ARIMA, LSTM, RF, XGB)
        # Need to label test set points by gap/regime
        test_idx = test_df.index
        meta = dfx_lagged.loc[test_idx, ['is_gap', 'is_regression']]

        for model_name, pred in zip(
            ['Prophet','ARIMA','LSTM','Random Forest','XGBoost'],
            [prophet_pred, arima_pred, lstm_pred, rf_pred, xgb_pred]
        ):
            meta = meta.copy()
            meta['y_true'] = y_test.values
            meta['y_pred'] = pred
            # Normal (not gap, not regression)
            norm_mask = (~meta['is_gap']) & (~meta['is_regression'])
            gap_mask = meta['is_gap']
            reg_mask = meta['is_regression']
            rmse_norm = RMSE(meta.loc[norm_mask,'y_true'], meta.loc[norm_mask,'y_pred']) if norm_mask.sum() > 0 else np.nan
            rmse_gap = RMSE(meta.loc[gap_mask,'y_true'], meta.loc[gap_mask,'y_pred']) if gap_mask.sum() > 0 else np.nan
            rmse_reg = RMSE(meta.loc[reg_mask,'y_true'], meta.loc[reg_mask,'y_pred']) if reg_mask.sum() > 0 else np.nan
            pct_gap = 100*(rmse_gap - rmse_norm)/rmse_norm if rmse_norm and not np.isnan(rmse_gap) else np.nan
            pct_reg = 100*(rmse_reg - rmse_norm)/rmse_norm if rmse_norm and not np.isnan(rmse_reg) else np.nan
            gap_results.append({
                'Query': query, 'Metric': metric, 'Model': model_name,
                'RMSE(Normal)': rmse_norm, 'RMSE(GAP)': rmse_gap, '% Increase (Gap)': pct_gap,
                'RMSE(Plan Regression)': rmse_reg, '% Increase (Regression)': pct_reg
            })
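
The regime breakdown in the loop above can be illustrated on a toy test slice. This is a minimal sketch with made-up values; the local `rmse` helper is an assumption of the example and stands in for the notebook's `RMSE` lambda.

```python
import numpy as np
import pandas as pd

# Toy test-set slice (illustrative values, not from the simulation data).
meta = pd.DataFrame({
    'is_gap':        [False, False, True,  True,  False, False],
    'is_regression': [False, False, False, False, True,  True],
    'y_true':        [10.0,  12.0,  11.0,  11.0,  20.0,  22.0],
    'y_pred':        [11.0,  11.0,  14.0,   8.0,  18.0,  24.0],
})

def rmse(frame):
    # Return NaN for an empty regime, mirroring the guard in the evaluation loop.
    if frame.empty:
        return np.nan
    return float(np.sqrt(np.mean((frame['y_true'] - frame['y_pred']) ** 2)))

# "Normal" means neither flag is set; the gap and regression masks are taken as-is.
norm_mask = ~meta['is_gap'] & ~meta['is_regression']
rmse_norm = rmse(meta[norm_mask])
rmse_gap = rmse(meta[meta['is_gap']])
rmse_reg = rmse(meta[meta['is_regression']])

pct_gap = 100 * (rmse_gap - rmse_norm) / rmse_norm
pct_reg = 100 * (rmse_reg - rmse_norm) / rmse_norm
print(rmse_norm, rmse_gap, rmse_reg, pct_gap, pct_reg)
```

The gap and regression masks are not mutually exclusive, so a row flagged as both contributes to both regime RMSE values. Gap rows are also scored against the imputed target, because the raw value is missing.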

### Format Table 1: RMSE by Query/Metric/Model

In [14]:
import pandas as pd
perf_df = pd.DataFrame(results)
# Pivot to match Table 1 format
tbl1 = perf_df.pivot_table(index='Model', columns=['Query','Metric'], values='RMSE')
# Optionally, display with 2 decimals
tbl1 = tbl1.round(2)
tbl1

| Model | Q1 CPU | Q1 LatencyMs | Q1 LogicalReads | Q2 CPU | Q2 LatencyMs | Q2 LogicalReads |
|---|---|---|---|---|---|---|
| ARIMA | 11.92 | 23.07 | 20.8 | 11.69 | 22.79 | 20.64 |
| LSTM | 30.08 | 174.42 | 142.21 | 38.59 | 180.36 | 154.64 |
| Prophet | 15.28 | 45.34 | 22.58 | 14.93 | 44.23 | 22.13 |
| Random Forest | 4.52 | 7.99 | 10.07 | 4.74 | 7.95 | 9.33 |


### Format Table 2: RMSE During Gaps and Plan Regression

In [16]:
gap_df = pd.DataFrame(gap_results)
# Example: Show for Prophet and ARIMA, Q1, CPU
gap_df[['Query','Metric','Model','RMSE(Normal)','RMSE(GAP)','% Increase (Gap)','RMSE(Plan Regression)','% Increase (Regression)']].query('Query=="Q1" and Metric=="CPU" and Model in ["Prophet","ARIMA"]')

| Query | Metric | Model | RMSE(Normal) | RMSE(GAP) | % Increase (Gap) | RMSE(Plan Regression) | % Increase (Regression) |
|---|---|---|---|---|---|---|---|
| Q1 | CPU | Prophet | 15.804827 | 14.71616 | -6.888191 | 14.501666 | -8.245338 |
| Q1 | CPU | ARIMA | 12.286066 | 12.70512 | 3.410805 | 11.408781 | -7.140486 |


### Table 2: Aggregate Across All Metrics and Queries (Mean, % Increase)
The per-query, per-metric rows can be averaged per model to report the mean RMSE in each regime and the mean % increase, as in Table 2.

In [18]:
# Aggregate mean RMSE and % increase per model
summary2 = gap_df.groupby('Model')[['RMSE(Normal)','RMSE(GAP)','% Increase (Gap)','RMSE(Plan Regression)','% Increase (Regression)']].mean().round(2)
summary2

| Model | RMSE(Normal) | RMSE(GAP) | % Increase (Gap) | RMSE(Plan Regression) | % Increase (Regression) |
|---|---|---|---|---|---|
| ARIMA | 19.4 | 18.76 | -3.15 | 17.19 | -10.92 |
| LSTM | 120.21 | 120.16 | -0.42 | 119.29 | -1.11 |
| Prophet | 28.32 | 27.34 | -3.92 | 25.65 | -11.21 |
| Random Forest | 6.99 | 17.94 | 155.68 | 4.77 | -28.49 |
| XGBoost | 7.61 | 18.19 | 138.3 | 4.79 | -33.74 |
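
Averaging the `% Increase` columns gives the mean of the per-row percentages, which is generally not the same as the percentage computed from the averaged RMSE values. The sketch below shows the difference on two hypothetical rows, then applies the second definition to the rounded Random Forest means from the summary (6.99 and 17.94). The helper `pct` is an assumption of this example.

```python
def pct(regime, normal):
    """Percent change of `regime` relative to `normal`."""
    return 100.0 * (regime - normal) / normal

# Two hypothetical (query, metric) rows for one model: (RMSE normal, RMSE gap).
rows = [(2.0, 6.0), (20.0, 30.0)]

# Option 1: average the per-row percentages, which is what
# groupby('Model').mean() does to the '% Increase' columns.
mean_of_pcts = sum(pct(g, n) for n, g in rows) / len(rows)

# Option 2: compute one percentage from the averaged RMSE values.
mean_norm = sum(n for n, _ in rows) / len(rows)
mean_gap = sum(g for _, g in rows) / len(rows)
pct_of_means = pct(mean_gap, mean_norm)

# Option 2 applied to the rounded Random Forest means from the summary table.
rf_pct_of_means = pct(17.94, 6.99)

print(mean_of_pcts, pct_of_means, rf_pct_of_means)
```

On the toy rows the two definitions give 125% and roughly 63.6%. For Random Forest the rounded means give about 156.65%, close to but not equal to the 155.68 in the `% Increase (Gap)` column, so it is worth stating which definition a reported figure uses.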


## Discussion
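- Table 1 reports test-set RMSE per query, metric, and model. Random Forest has the lowest RMSE in every column (4.52 to 10.07), followed by ARIMA and then Prophet; the LSTM has the highest (30.08 to 180.36).
- Table 2 summarizes the effect of data gaps and plan regressions on model RMSE, matching the structure of the provided tables. Averaged over queries and metrics, gaps raise RMSE by 155.68% for Random Forest and 138.3% for XGBoost, while ARIMA, LSTM, and Prophet change by less than 4%. Every model shows a lower mean RMSE during plan regressions than in normal periods.
- Gap rows are scored against imputed (forward/backward-filled) targets, since the raw values are missing.
- Random Forest, XGBoost, and the LSTM predict one step ahead from observed lag values, whereas Prophet and ARIMA forecast the whole test horizon at once, so the RMSE values are not measured under identical conditions.
- XGBoost is fitted alongside the other models but reported only in Table 2.
- All results use the same simulation data as previous benchmarking.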