# UK Energy AI Forecasting Evaluation


This notebook evaluates two forecasting models: 
1. Using 30-minute granularity 
2. Using daily aggregates 
<br>Each model's forecasts are compared with ground truth. 
<br>We assess model performance globally as well as per low-voltage feeder (`lv_feeder_unique_id`) using common error metrics: MAE, MSE, RMSE, MAPE, and SMAPE.

We're at the final stage and will be looking at the following:

<img src="../docs/imgs/energy-sa-evaluation.png" width="200">

Now is a good time to pause!
Training a good model is hard. The results in this notebook will be unsurprising: the approach we have taken to the specific models here is naive. The focus is on the end-to-end framework for training and evaluating models in this space. A more comprehensive approach would make individual feeder-level or meter-level forecasts and aggregate them up to a portfolio-level prediction.

Luckily, there is an accelerator for many-model forecasting, [publicly available here](https://github.com/databricks-industry-solutions/ray-framework-on-databricks/tree/main/Many_Models_Training).

In [0]:
%run ./includes/common_functions_and_imports

In [0]:
from pyspark.sql import functions as F
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
import numpy as np
import pandas as pd  # used by mape_smape_buckets (pd.cut)
import mlflow
import matplotlib.pyplot as plt


## Data Loading and Preparation
Data Sources:

* 30-minute forecasts and ground truth 
* Daily forecasts and ground truth

Joins:
Forecast and ground truth are joined on keys such as lv_feeder_unique_id and timestamps (data_collection_log_timestamp or ts) to form consolidated DataFrames (i.e., joined_df and daily_joined_df).


In [0]:
table_prefix = f"{CONFIG.target_catalog}.{CONFIG.target_schema}"

thirtymin_univariate_table_name = f"{table_prefix}.ai_forecast_uk_energy_30min"
thirtymin_univariate_ground_truth_table = f"{table_prefix}.unscaled_test_features"

ai_daily_table_name = f"{table_prefix}.ai_forecast_uk_energy_daily"
prophet_daily_forecast_table = f"{table_prefix}.predictions_daily_forecast_scaled"
prophet_unscaled_daily_forecast_table = (
    f"{table_prefix}.predictions_daily_forecast_exogenous_noscaling_ext"
)
daily_min_ground_truth_table = f"{table_prefix}.ai_test_uk_energy_daily"

tables_to_check = [
    thirtymin_univariate_table_name,
    thirtymin_univariate_ground_truth_table,
    ai_daily_table_name,
    prophet_daily_forecast_table,
    prophet_unscaled_daily_forecast_table,
    daily_min_ground_truth_table,
]
table_existence_check = [spark.catalog.tableExists(t) for t in tables_to_check]

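if not all(table_existence_check):
    # Report which tables are missing so the failure is easier to diagnose
    missing_tables = [t for t, exists in zip(tables_to_check, table_existence_check) if not exists]
    dbutils.notebook.exit(f"Missing source tables: {missing_tables}")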

In [0]:
# half-hourly
forecast_df = spark.table(thirtymin_univariate_table_name)
ground_truth_df = spark.table(thirtymin_univariate_ground_truth_table)
# daily
daily_forecast_df = spark.table(ai_daily_table_name)
daily_ground_truth_df = spark.table(daily_min_ground_truth_table)

df_exo_unscaled_pred = spark.table(prophet_unscaled_daily_forecast_table).withColumnRenamed("ds", "ts")  
df_exo_pred = spark.table(prophet_daily_forecast_table).withColumnRenamed("ds", "ts") 

joined_df = forecast_df.join(ground_truth_df, on=["lv_feeder_unique_id", "data_collection_log_timestamp"], how="inner")
daily_joined_df = daily_forecast_df.join(daily_ground_truth_df, on=["lv_feeder_unique_id", "ts"], how="inner")

In [0]:
forecast_df.printSchema()

In [0]:
joined_df.show(5)

In [0]:
daily_joined_df.show(5)


## Metrics Calculation Functions
Two functions are defined for evaluating the forecast accuracy:

1. `calculate_overall_metrics(df, ground_truth_col, forecast_col)`:
This function computes global (overall) metrics by:

  * Calculating per-row errors (absolute, squared, percentage, and symmetric percentage errors).

  * Aggregating these errors to generate MAE, MSE, RMSE, MAPE, and SMAPE values.

2. `calculate_group_metrics(df, ground_truth_col, forecast_col, group_col)`:
Similar to the overall metrics function, but computes error metrics grouped by lv_feeder_unique_id (i.e., feeder-level performance).
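
To make the per-row logic concrete, here is a minimal plain-Python sketch of the same calculations on a toy sample. It is for intuition only: `toy_metrics` and the sample values are hypothetical, they are not part of this notebook, and Spark's null handling is not reproduced.

```python
import math


def toy_metrics(actuals, forecasts):
    """Plain-Python stand-in for the Spark aggregation (illustrative only)."""
    errors = [f - a for a, f in zip(actuals, forecasts)]
    abs_errors = [abs(e) for e in errors]
    sq_errors = [e ** 2 for e in errors]
    # Rows with a zero actual contribute 0 to MAPE, mirroring .otherwise(0).
    pct_errors = [abs(e / a) if a != 0 else 0.0 for e, a in zip(errors, actuals)]
    # Rows where actual and forecast are both zero contribute 0 to SMAPE.
    smapes = [
        2 * abs(e) / (abs(a) + abs(f)) if (abs(a) + abs(f)) != 0 else 0.0
        for e, a, f in zip(errors, actuals, forecasts)
    ]
    n = len(errors)
    mse = sum(sq_errors) / n
    return {
        "MAE": sum(abs_errors) / n,
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAPE": sum(pct_errors) / n,
        "SMAPE": sum(smapes) / n,
    }


# Hypothetical sample: the third row has a zero actual.
toy = toy_metrics(actuals=[10.0, 20.0, 0.0, 5.0], forecasts=[12.0, 18.0, 1.0, 5.0])
print(toy)
```

Note how the zero-actual row adds nothing to MAPE but contributes the maximum per-row value of 2 to SMAPE. This is one reason the two metrics can tell different stories for feeders with idle periods.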


In [0]:
def calculate_overall_metrics(df, ground_truth_col, forecast_col):
    """
    Compute overall forecast metrics (MAE, MSE, RMSE, MAPE, and SMAPE)
    on the provided DataFrame.
    
    Parameters:
      df (DataFrame): The input DataFrame containing predictions and actual values.
      ground_truth_col (str): Column name for actual values.
      forecast_col (str): Column name for forecasted values.
    
    Returns:
      A Row with the aggregated metrics.
    """
    # Compute per-row errors and additional metrics
    df_errors = df.withColumn("error", F.col(forecast_col) - F.col(ground_truth_col)) \
        .withColumn("abs_error", F.abs(F.col("error"))) \
        .withColumn("squared_error", F.pow(F.col("error"), 2)) \
        .withColumn("percentage_error",
                    F.when(F.col(ground_truth_col) != 0,
                           F.abs(F.col("error") / F.col(ground_truth_col)))
                     .otherwise(0)) \
        .withColumn("smape",
                    F.when(
                        (F.col(ground_truth_col).isNotNull()) &
                        (F.col(forecast_col).isNotNull()) &
                        ((F.abs(F.col(ground_truth_col)) + F.abs(F.col(forecast_col))) != 0),
                        2 * F.abs(F.col("error")) / (F.abs(F.col(ground_truth_col)) + F.abs(F.col(forecast_col)))
                    ).otherwise(0))
    
    # Aggregate the error metrics over the entire dataset
    agg_metrics = df_errors.agg(
        F.avg("abs_error").alias("MAE"),
        F.avg("squared_error").alias("MSE"),
        F.sqrt(F.avg("squared_error")).alias("RMSE"),
        F.avg("percentage_error").alias("MAPE"),
        F.avg("smape").alias("SMAPE")
    )
    
    overall_metrics = agg_metrics.collect()[0]
    
    print("Overall Metrics:")
    print(f"MAE: {overall_metrics['MAE']}")
    print(f"MSE: {overall_metrics['MSE']}")
    print(f"RMSE: {overall_metrics['RMSE']}")
    print(f"MAPE: {overall_metrics['MAPE']}")
    print(f"SMAPE: {overall_metrics['SMAPE']}")
    
    return overall_metrics


In [0]:

def calculate_group_metrics(df, ground_truth_col, forecast_col, group_col):
    """
    Compute forecast metrics (MAE, MSE, RMSE, MAPE, and SMAPE)
    grouped by the specified group column (e.g., lv_feeder_unique_id).
    
    Parameters:
      df (DataFrame): The input DataFrame containing predictions and actual values.
      ground_truth_col (str): Column name for actual values.
      forecast_col (str): Column name for forecasted values.
      group_col (str): Column name for grouping (e.g., "lv_feeder_unique_id").
    
    Returns:
      A DataFrame with metrics computed for each group.
    """
    # Compute per-row errors and additional metrics, as in calculate_overall_metrics
    df_errors = df.withColumn("error", F.col(forecast_col) - F.col(ground_truth_col)) \
        .withColumn("abs_error", F.abs(F.col("error"))) \
        .withColumn("squared_error", F.pow(F.col("error"), 2)) \
        .withColumn("percentage_error",
                    F.when(F.col(ground_truth_col) != 0,
                           F.abs(F.col("error") / F.col(ground_truth_col)))
                     .otherwise(0)) \
        .withColumn("smape",
                    F.when(
                        (F.col(ground_truth_col).isNotNull()) &
                        (F.col(forecast_col).isNotNull()) &
                        ((F.abs(F.col(ground_truth_col)) + F.abs(F.col(forecast_col))) != 0),
                        2 * F.abs(F.col("error")) / (F.abs(F.col(ground_truth_col)) + F.abs(F.col(forecast_col)))
                    ).otherwise(0))
    
    # Group by the specified column and aggregate metrics per group
    group_metrics_df = df_errors.groupBy(group_col).agg(
        F.avg("abs_error").alias("MAE"),
        F.avg("squared_error").alias("MSE"),
        F.sqrt(F.avg("squared_error")).alias("RMSE"),
        F.avg("percentage_error").alias("MAPE"),
        F.avg("smape").alias("SMAPE")
    )
    
    return group_metrics_df


### Visualisation functions

In [0]:
def mape_smape_histogram(pdf, experiment_name):
    # Create clipped versions of the metrics (values above 1 become 1)
    pdf["MAPE_clipped"] = pdf["MAPE"].clip(upper=1)
    pdf["SMAPE_clipped"] = pdf["SMAPE"].clip(upper=1)

    # Histogram of clipped MAPE and SMAPE values
    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.hist(pdf["MAPE_clipped"], bins=30, edgecolor="black")
    plt.title("Distribution of Clipped MAPE Across Feeders"+"\n"+experiment_name)
    plt.xlabel("Clipped MAPE")
    plt.ylabel("Frequency")

    plt.subplot(1, 2, 2)
    plt.hist(pdf["SMAPE_clipped"], bins=30, edgecolor="black", color="orange")
    plt.title("Distribution of Clipped SMAPE Across Feeders"+"\n"+experiment_name)
    plt.xlabel("Clipped SMAPE")
    plt.ylabel("Frequency")

    plt.tight_layout()
    plt.show()

In [0]:
def mape_smape_buckets(pdf):
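    # Define bucket thresholds (these are just example thresholds; adjust based on your domain).
    # Note: with right=False the last bin is [0.4, 1.0), so values >= 1.0 (100%) are left
    # unbucketed (NaN) and the bucket proportions can sum to less than 100%.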
    mape_bins = [0, 0.1, 0.2, 0.4, 1.0]
    mape_labels = ["Excellent (<10%)", "Good (10-20%)", "Average (20-40%)", "Poor (>=40%)"]

    smape_bins = [0, 0.1, 0.2, 0.4, 1.0]
    smape_labels = ["Excellent (<10%)", "Good (10-20%)", "Average (20-40%)", "Poor (>=40%)"]

    # Create categorical buckets for MAPE and SMAPE
    pdf["MAPE_bucket"] = pd.cut(pdf["MAPE"], bins=mape_bins, labels=mape_labels, right=False)
    pdf["SMAPE_bucket"] = pd.cut(pdf["SMAPE"], bins=smape_bins, labels=smape_labels, right=False)

    # Get counts for each bucket
    mape_bucket_counts = pdf["MAPE_bucket"].value_counts().sort_index()
    smape_bucket_counts = pdf["SMAPE_bucket"].value_counts().sort_index()

    # Calculate proportions (as a percentage of total feeders)
    total_feeders = len(pdf)
    mape_proportions = (mape_bucket_counts / total_feeders) * 100
    smape_proportions = (smape_bucket_counts / total_feeders) * 100

    return mape_bucket_counts, mape_proportions, smape_bucket_counts, smape_proportions

In [0]:
def visualise_buckets(mape_bucket_counts, mape_proportions, smape_bucket_counts, smape_proportions, experiment_name):
    print("MAPE Buckets:")
    print(mape_bucket_counts)
    print("\nMAPE Proportions (in %):")
    print(mape_proportions)

    print("\nSMAPE Buckets:")
    print(smape_bucket_counts)
    print("\nSMAPE Proportions (in %):")
    print(smape_proportions)

    fig, ax = plt.subplots(1, 2, figsize=(14, 6))

    # Plot MAPE bucket counts on the first subplot
    mape_bucket_counts.plot(kind="bar", ax=ax[0], color="blue", edgecolor="black")
    ax[0].set_title("Distribution of Feeders by MAPE Bucket"+"\n"+experiment_name)
    ax[0].set_xlabel("MAPE Bucket")
    ax[0].set_ylabel("Number of Feeders")
    ax[0].set_xticklabels(mape_bucket_counts.index, rotation=45)

    # Plot SMAPE bucket counts on the second subplot
    smape_bucket_counts.plot(kind="bar", ax=ax[1], color="orange", edgecolor="black")
    ax[1].set_title("Distribution of Feeders by SMAPE Bucket"+"\n"+experiment_name)
    ax[1].set_xlabel("SMAPE Bucket")
    ax[1].set_ylabel("Number of Feeders")
    ax[1].set_xticklabels(smape_bucket_counts.index, rotation=45)

    plt.tight_layout()
    plt.show()


## 30-Minute Forecasting Evaluation
### Global Evaluation:
The overall performance is calculated for the 30-minute model. 


In [0]:
overall_results = calculate_overall_metrics(joined_df, "normalized_consumption_kwh", "normalized_consumption_kwh_forecast")


Our global forecasting model exhibits a MAPE of roughly 43.8%, indicating that, on average, forecasts deviate by 43.8% from actual values. However, the SMAPE of about 31% suggests a more balanced performance when both predicted and observed values are considered, mitigating the impact of low or near-zero actuals. These metrics imply that while there is room for improvement, especially in handling low consumption periods, the model delivers a more robust performance under symmetric evaluation.


### Feeder-Level Evaluation:

In [0]:
group_results_df = calculate_group_metrics(joined_df, "normalized_consumption_kwh", "normalized_consumption_kwh_forecast","lv_feeder_unique_id")
group_results_df.show(truncate=False)

In [0]:
pdf = group_results_df.toPandas()
mape_smape_histogram(pdf,"30-min univariate forecaster")

In [0]:
thirtymin_univariate_mape_bucket_counts, thirtymin_univariate_mape_proportions, thirtymin_univariate_smape_bucket_counts, thirtymin_univariate_smape_proportions = mape_smape_buckets(pdf)

In [0]:
visualise_buckets(thirtymin_univariate_mape_bucket_counts, thirtymin_univariate_mape_proportions, thirtymin_univariate_smape_bucket_counts, thirtymin_univariate_smape_proportions, '30-min univariate forecaster')

The distribution of forecast errors across feeders reveals that, according to MAPE, an almost negligible fraction (<0.01%) achieve an excellent error level (<10%), about 8% are classified as good (10–20%), the majority (55%) fall into the average range (20–40%), and roughly 34% are in the poor category (>=40%). In contrast, when viewed through SMAPE, only about 1% of feeders are excellent, approximately 14% are good, 71% have average performance, and around 14% are poor. This indicates that most feeders exhibit moderate forecast errors, but there is a notable subset with high errors, suggesting that further model fine-tuning or segmentation might be necessary to improve forecast accuracy for those underperforming groups.

## Daily Forecasting Evaluation
### Global Evaluation:

In [0]:
daily_overall_results = calculate_overall_metrics(daily_joined_df, "daily_normalized_consumption_kwh", "daily_normalized_consumption_kwh_forecast")


Overall, the daily forecasting model yields a mean absolute error (MAE) of about 3.5 and a root mean square error (RMSE) of around 45.51. An RMSE this much larger than the MAE points to a small number of very large errors. The MAPE is 1.9 (190%), indicating that, on average, the absolute forecast error is almost twice the actual value, while the symmetric MAPE (SMAPE) is 0.21 (21%). The large gap between the two suggests that MAPE is inflated by very low actual values, whereas SMAPE, whose per-row value is bounded between 0 and 2 as implemented here, gives a more balanced view of forecast accuracy. This discrepancy is important to consider when evaluating model performance and when comparing forecasts across feeders with varying consumption levels.
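
The effect of low actuals on MAPE can be reproduced with a small sketch. The values below are hypothetical and chosen only to illustrate the mechanism; they are not taken from the feeder data.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical daily values: nine ordinary days plus one near-zero actual.
actuals = [10.0] * 9 + [0.1]
forecasts = [11.0] * 9 + [2.0]

# Per-row absolute percentage error and symmetric percentage error.
ape = [abs(f - a) / abs(a) for a, f in zip(actuals, forecasts)]
sape = [2 * abs(f - a) / (abs(a) + abs(f)) for a, f in zip(actuals, forecasts)]

mape = mean(ape)
smape = mean(sape)

print(f"MAPE:  {mape:.3f}")
print(f"SMAPE: {smape:.3f}")
```

Nine of the ten rows have a 10% error, yet the single near-zero actual pushes MAPE to roughly 2, while SMAPE stays below 0.3 because each row's contribution is capped at 2.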

### Feeder-Level Evaluation:

In [0]:
daily_group_results_df = calculate_group_metrics(daily_joined_df, "daily_normalized_consumption_kwh", "daily_normalized_consumption_kwh_forecast","lv_feeder_unique_id")
daily_group_results_df.show(truncate=False)

In [0]:
daily_pdf = daily_group_results_df.toPandas()
mape_smape_histogram(daily_pdf, "Daily univariate forecaster")

In [0]:
daily_univariate_mape_bucket_counts, daily_univariate_mape_proportions, daily_univariate_smape_bucket_counts, daily_univariate_smape_proportions = mape_smape_buckets(daily_pdf)

In [0]:
visualise_buckets(daily_univariate_mape_bucket_counts, daily_univariate_mape_proportions, daily_univariate_smape_bucket_counts, daily_univariate_smape_proportions, 'Daily univariate forecaster')

For the daily forecaster, the MAPE breakdown indicates that about 16.6% of feeders achieve excellent performance (<10% error), 11.6% are good (10–20% error), 2.7% fall in the average range (20–40% error), and 20.4% are classified as poor (>=40% error). These shares sum to only about 51%: the remaining feeders are not assigned to any bucket, most likely because their MAPE is at or above 100%, which lies beyond the last bin edge (1.0) used in `mape_smape_buckets`. Meanwhile, SMAPE offers a more favorable view with 17.4% excellent, 43.1% good, 35.6% average, and only 3.3% poor. In short, MAPE suggests that a large share of feeders have high errors, whereas SMAPE places the majority of daily forecasts within acceptable symmetric error bounds. This points to a generally robust performance, with room for targeted improvements in the high-error cases identified by MAPE.
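
Bucket shares can sum to less than 100% because the bin edges in `mape_smape_buckets` stop at 1.0 and `pd.cut` is called with `right=False`. The sketch below is a plain-Python stand-in for that bucketing, not the notebook's code: `bucket` and the sample values are hypothetical, and `bisect` is used here only to mimic left-closed, right-open bins.

```python
from bisect import bisect_right
from collections import Counter

BINS = [0, 0.1, 0.2, 0.4, 1.0]
LABELS = ["Excellent (<10%)", "Good (10-20%)", "Average (20-40%)", "Poor (>=40%)"]


def bucket(value):
    """Mimic pd.cut(..., right=False): left-closed, right-open bins."""
    if value is None or value < BINS[0] or value >= BINS[-1]:
        return None  # pd.cut would yield NaN for these values
    return LABELS[bisect_right(BINS, value) - 1]


# Hypothetical per-feeder MAPE values; three of them are >= 1.0.
feeder_mape = [0.05, 0.15, 0.35, 0.40, 0.99, 1.0, 1.9, 3.2]

counts = Counter(bucket(v) for v in feeder_mape)
total = len(feeder_mape)

# Shares are taken over all feeders, so unbucketed feeders leave a gap.
shares = {label: 100 * counts[label] / total for label in LABELS}
unbucketed_share = 100 * counts[None] / total

print(shares)
print(f"Unbucketed: {unbucketed_share}%")
```

If every feeder should land in a bucket, one option is to replace the last edge with `float("inf")`. The bucket proportions would then need to be recomputed before being compared with the figures quoted in this notebook.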

## Comparison with exogenous forecaster

### Unscaled exogenous variables

In [0]:
df_exo_unscaled_eval = df_exo_unscaled_pred.join(daily_ground_truth_df, on=["lv_feeder_unique_id", "ts"], how="inner") \
                         .withColumnRenamed("y", "actual") \
                         .withColumnRenamed("yhat", "forecast")

In [0]:
df_exo_unscaled_pred.printSchema()

In [0]:
daily_unscaled_overall_results = calculate_overall_metrics(df_exo_unscaled_eval, "daily_normalized_consumption_kwh", "forecast")

In [0]:
daily_prophet_unscaled_group_results_df = calculate_group_metrics(df_exo_unscaled_eval, "daily_normalized_consumption_kwh", "forecast","lv_feeder_unique_id")
daily_prophet_unscaled_group_results_df.show(truncate=False)

In [0]:
prophet_unscaled_daily_pdf = daily_prophet_unscaled_group_results_df.toPandas()
mape_smape_histogram(prophet_unscaled_daily_pdf, "Daily forecaster with unscaled exogenous variables")

In [0]:
daily_exogenous_unscaled_mape_bucket_counts, daily_exogenous_unscaled_mape_proportions, daily_exogenous_unscaled_smape_bucket_counts, daily_exogenous_unscaled_smape_proportions = mape_smape_buckets(prophet_unscaled_daily_pdf)

In [0]:
visualise_buckets(daily_exogenous_unscaled_mape_bucket_counts, daily_exogenous_unscaled_mape_proportions, daily_exogenous_unscaled_smape_bucket_counts, daily_exogenous_unscaled_smape_proportions, 'Daily forecaster with unscaled exogenous variables')

### Scaled exogenous variables

In [0]:
# Join exogenous forecaster predictions (scaled numerical inputs) with the daily ground truth
df_exo_eval = df_exo_pred.join(daily_ground_truth_df, on=["lv_feeder_unique_id", "ts"], how="inner") \
                         .withColumnRenamed("y", "actual") \
                         .withColumnRenamed("yhat", "forecast")

In [0]:
df_exo_eval.printSchema()

In [0]:
# Use a distinct name so the univariate daily results computed earlier are not overwritten
daily_scaled_overall_results = calculate_overall_metrics(df_exo_eval, "daily_normalized_consumption_kwh", "forecast")

In [0]:
daily_prophet_group_results_df = calculate_group_metrics(df_exo_eval, "daily_normalized_consumption_kwh", "forecast","lv_feeder_unique_id")
daily_prophet_group_results_df.show(truncate=False)

In [0]:
prophet_daily_pdf = daily_prophet_group_results_df.toPandas()
mape_smape_histogram(prophet_daily_pdf, "Daily forecaster with scaled exogenous variables")

In [0]:
daily_exogenous_scaled_mape_bucket_counts, daily_exogenous_scaled_mape_proportions, daily_exogenous_scaled_smape_bucket_counts, daily_exogenous_scaled_smape_proportions = mape_smape_buckets(prophet_daily_pdf)

In [0]:
visualise_buckets(daily_exogenous_scaled_mape_bucket_counts, daily_exogenous_scaled_mape_proportions, daily_exogenous_scaled_smape_bucket_counts, daily_exogenous_scaled_smape_proportions, 'Daily forecaster with scaled exogenous variables')

### Comparing all models

In [0]:
import matplotlib.pyplot as plt
import numpy as np

model_names = [
    "daily_exogenous_unscaled",
    "daily_exogenous_scaled",
    "daily_univariate",
    "thirtymin_univariate"
]

buckets = ["Excellent (<10%)", "Good (10-20%)", "Average (20-40%)", "Poor (>=40%)"]

# The bucket proportions computed above may not sum to 100% (some feeders can be left unbucketed),
# so re-normalize each series to sum to 100% before comparing models
mape_dict = {
    model: (series.reindex(buckets, fill_value=0) / series.sum() * 100).values
    for model, series in {
        "daily_exogenous_unscaled": daily_exogenous_unscaled_mape_proportions,
        "daily_exogenous_scaled":   daily_exogenous_scaled_mape_proportions,
        "daily_univariate":         daily_univariate_mape_proportions,
        "thirtymin_univariate":     thirtymin_univariate_mape_proportions
    }.items()
}

smape_dict = {
    model: (series.reindex(buckets, fill_value=0) / series.sum() * 100).values
    for model, series in {
        "daily_exogenous_unscaled": daily_exogenous_unscaled_smape_proportions,
        "daily_exogenous_scaled":   daily_exogenous_scaled_smape_proportions,
        "daily_univariate":         daily_univariate_smape_proportions,
        "thirtymin_univariate":     thirtymin_univariate_smape_proportions
    }.items()
}

# --- Print MAPE proportions ---
print("MAPE proportions (%):")
for model, vals in mape_dict.items():
    print(f"  {model}:")
    for bucket, v in zip(buckets, vals):
        print(f"    {bucket}: {v:.2f}%")

# --- Print SMAPE proportions ---
print("\nSMAPE proportions (%):")
for model, vals in smape_dict.items():
    print(f"  {model}:")
    for bucket, v in zip(buckets, vals):
        print(f"    {bucket}: {v:.2f}%")

# Now plot and annotate
x = np.arange(len(buckets))
n_models = len(model_names)
total_width = 0.8
bar_width = total_width / n_models
offsets = np.linspace(-total_width/2 + bar_width/2,
                      total_width/2 - bar_width/2,
                      n_models)
colors = plt.cm.tab10.colors[:n_models]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,12), constrained_layout=True)

# MAPE
for i, model in enumerate(model_names):
    bars = ax1.bar(
        x + offsets[i],
        mape_dict[model],
        width=bar_width,
        label=model,
        color=colors[i],
        edgecolor="black"
    )
    ax1.bar_label(bars, fmt="%.1f%%", padding=3)
ax1.set_xticks(x)
ax1.set_xticklabels(buckets, rotation=45, ha='right')
ax1.set_ylabel("Proportion (%)")
ax1.set_title("MAPE Bucket Comparison (Normalized to 100%)")
ax1.legend(loc="upper left")
ax1.grid(axis="y", linestyle="--", alpha=0.5)
ax1.margins(x=0.05)

# SMAPE
for i, model in enumerate(model_names):
    bars = ax2.bar(
        x + offsets[i],
        smape_dict[model],
        width=bar_width,
        label=model,
        color=colors[i],
        edgecolor="black"
    )
    ax2.bar_label(bars, fmt="%.1f%%", padding=3)
ax2.set_xticks(x)
ax2.set_xticklabels(buckets, rotation=45, ha='right')
ax2.set_ylabel("Proportion (%)")
ax2.set_title("SMAPE Bucket Comparison (Normalized to 100%)")
ax2.legend(loc="upper left")
ax2.grid(axis="y", linestyle="--", alpha=0.5)
ax2.margins(x=0.05)

plt.show()


The key takeaway is that the simple daily univariate baseline far outperforms the Prophet models with exogenous variables. Adding temperature, solar radiation, and device-count regressors did not improve overall forecast accuracy compared to the straightforward univariate approach, and the scaled version in fact made SMAPE substantially worse. This suggests that either these covariates aren't the right drivers at the daily level, or the model needs further tuning or feature engineering (e.g. different lags, interactions, or variable transformations) before an exogenous Prophet model can surpass the univariate baseline.