# Daily Forecast with Prophet & Exogenous Variables (Scaled)
This notebook generates daily energy consumption forecasts per feeder using a Prophet model augmented with selected exogenous variables. 

We're looking at this part of the flow:

<img src="../docs/imgs/energy-sa-forecasting-prophet.png" width="300">

## Environment Setup & Configuration

In [0]:
%pip install prophet==1.1.4 holidays scikit-learn

In [0]:
%run ./includes/common_functions_and_imports

In [0]:
from pyspark.sql import functions as F
from pyspark.sql import types as T
import pandas as pd
from prophet import Prophet
from sklearn.preprocessing import StandardScaler

## Data Ingestion & Aggregation

In [0]:
source_table_name = (
  f"{CONFIG.target_catalog}.{CONFIG.target_schema}.unscaled_train_features"
)

if not spark.catalog.tableExists(source_table_name):
  dbutils.notebook.exit('Source table does not exist')

target_table_name = f"{CONFIG.target_catalog}.{CONFIG.target_schema}.predictions_daily_forecast_scaled"

if spark.catalog.tableExists(target_table_name) and not CONFIG.overwrite_data:
  dbutils.notebook.exit('Target table already exists, skipping run to save on processing')
  
df_training = spark.table(source_table_name)


In [0]:
# For this experiment, we select dynamic regressors based on correlation analysis (see unscaled notebook)
selected_regressors = ["ssrd", "aggregated_device_count_active"]

In [0]:
df_daily = (
    df_training
    .withColumn("ds", F.date_trunc("day", F.col("data_collection_log_timestamp")))
    .groupBy("lv_feeder_unique_id", "ds")
    .agg(
        F.sum("normalized_consumption_kwh").alias("y"),
        *[F.avg(r).alias(r) for r in selected_regressors]
    )
)

## Scaling & Forecast Function

SSRD (surface solar radiation downwards) and the active device count were chosen because they vary substantially from day to day and showed the strongest correlations with daily consumption (|corr| ≈ 0.19 and |corr| ≈ 0.15, respectively). Although both correlations are weak in absolute terms, these dynamic covariates complement Prophet's built-in seasonal components by capturing external demand drivers.
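
The correlation screening described above could be reproduced along these lines. This is a minimal pandas sketch on a small made-up frame, not the analysis from the unscaled notebook; the helper name `rank_regressors_by_correlation` and all sample values are hypothetical.

```python
import pandas as pd

def rank_regressors_by_correlation(daily, target, candidates):
    """Return |Pearson correlation| of each candidate with the target, strongest first."""
    corr = daily[candidates].corrwith(daily[target])
    return corr.abs().sort_values(ascending=False)

# Hypothetical daily data for one feeder; the values are illustrative only.
sample = pd.DataFrame({
    "y": [10.0, 12.0, 9.0, 14.0, 11.0],
    "ssrd": [1.0, 2.5, 0.5, 3.0, 1.0],
    "aggregated_device_count_active": [5.0, 5.0, 6.0, 5.0, 6.0],
})

ranking = rank_regressors_by_correlation(
    sample, "y", ["ssrd", "aggregated_device_count_active"]
)
print(ranking)
```

Ranking by absolute value keeps strongly negative relationships in view, which matters because a regressor that moves against consumption is as informative as one that moves with it.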

In [0]:
# Define the function applied to each feeder's daily data via applyInPandas.
# It fits a Prophet model on the feeder's daily history and forecasts the next
# `forecast_horizon` days.
def apply_forecast_daily(pdf):
    output_columns = ["lv_feeder_unique_id", "ds", "yhat"]

    # Only use regressors that are actually present in this feeder's data.
    regressors = [r for r in selected_regressors if r in pdf.columns]

    # Prophet rejects NaN regressor values and needs at least 2 non-NaN rows.
    # Group rows arrive in no guaranteed order, so sort by date to make the
    # last row the most recent observation.
    pdf = pdf.dropna(subset=["y"] + regressors).sort_values("ds")
    if pdf.shape[0] < 2:
        return pd.DataFrame(columns=output_columns)

    forecast_horizon = 90  # Days to forecast; adjust as needed.

    changepoint_prior_scale = 0.01
    seasonality_mode = 'multiplicative'
    n_changepoints = 50  # Adjust if necessary.

    # Initialize and configure the Prophet model.
    m = Prophet(
        changepoint_prior_scale=changepoint_prior_scale,
        seasonality_mode=seasonality_mode,
        n_changepoints=n_changepoints,
        # The data has one row per day, so there is no intra-day cycle to fit.
        daily_seasonality=False,
        # Yearly seasonality is added explicitly below as "annual".
        yearly_seasonality=False
    ).add_country_holidays(country_name="GB")\
     .add_seasonality(name="weekly", period=7, fourier_order=3)\
     .add_seasonality(name="annual", period=365, fourier_order=10)

    for reg in regressors:
        m.add_regressor(reg)

    # Fit the model on this feeder's historical data.
    m.fit(pdf)

    # Create a future DataFrame for the forecast horizon using daily frequency.
    future = m.make_future_dataframe(periods=forecast_horizon, freq='D', include_history=False)

    # Future regressor values are unknown, so carry forward the last observed value.
    for reg in regressors:
        future[reg] = pdf[reg].iloc[-1]

    # Generate the forecast using Prophet.
    forecast = m.predict(future)[["ds", "yhat"]]
    forecast["ds"] = forecast["ds"].dt.date  # Match the DateType column in the output schema.
    forecast["lv_feeder_unique_id"] = pdf["lv_feeder_unique_id"].iloc[0]

    return forecast[output_columns]
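
The cell above passes the regressors to Prophet as raw values. Prophet's `add_regressor` already standardizes non-binary regressors by default (`standardize='auto'`), so an explicit scaling step is optional. If one is wanted, as the notebook title suggests, a reasonable approach is to compute the mean and standard deviation on the history only and reuse them for the future frame. The sketch below assumes that approach; `scale_regressors` is a hypothetical helper that is not part of this notebook, and it uses plain pandas with the population standard deviation, the same convention `StandardScaler` follows.

```python
import pandas as pd

def scale_regressors(history, future, columns):
    """Standardize `columns` in both frames using mean/std from `history` only."""
    history, future = history.copy(), future.copy()
    for col in columns:
        mean = history[col].mean()
        std = history[col].std(ddof=0)  # population std, as StandardScaler uses
        if std == 0:
            std = 1.0  # constant column: only center it, avoid division by zero
        history[col] = (history[col] - mean) / std
        future[col] = (future[col] - mean) / std
    return history, future

# Illustrative values only.
history = pd.DataFrame({"ssrd": [1.0, 2.0, 3.0]})
future = pd.DataFrame({"ssrd": [3.0, 3.0]})  # last observed value carried forward

scaled_history, scaled_future = scale_regressors(history, future, ["ssrd"])
```

Fitting the statistics on the history alone keeps the future frame on the same scale the model was trained on and avoids leaking information from the forecast period.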




In [0]:
output_schema = T.StructType([
    T.StructField("lv_feeder_unique_id", T.StringType(), True),
    T.StructField("ds", T.DateType(), True),
    T.StructField("yhat", T.DoubleType(), True)
])

In [0]:
# Apply the forecasting function per feeder using applyInPandas.
results_df = df_daily.groupBy("lv_feeder_unique_id").applyInPandas(apply_forecast_daily, schema=output_schema)
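
Before running on the cluster, the per-feeder contract of `applyInPandas` (one pandas DataFrame in, one DataFrame with the `output_schema` columns out) can be checked locally. The following is a pandas-only sketch under that assumption; `stub_forecast` is a hypothetical stand-in for `apply_forecast_daily` that avoids fitting Prophet, `apply_per_group` is likewise illustrative, and the sample rows are made up.

```python
import pandas as pd

OUTPUT_COLUMNS = ["lv_feeder_unique_id", "ds", "yhat"]

def stub_forecast(pdf):
    """Stand-in for the real forecast function: one-day-ahead forecast equal to the group mean."""
    if pdf.shape[0] < 2:
        return pd.DataFrame(columns=OUTPUT_COLUMNS)
    next_day = pdf["ds"].max() + pd.Timedelta(days=1)
    return pd.DataFrame({
        "lv_feeder_unique_id": [pdf["lv_feeder_unique_id"].iloc[0]],
        "ds": [next_day],
        "yhat": [pdf["y"].mean()],
    })

def apply_per_group(df, key, fn):
    """Local analogue of groupBy(key).applyInPandas(fn, ...): call fn once per group."""
    parts = [fn(group) for _, group in df.groupby(key)]
    parts = [p for p in parts if not p.empty]  # skipped groups return empty frames
    if not parts:
        return pd.DataFrame(columns=OUTPUT_COLUMNS)
    return pd.concat(parts, ignore_index=True)

sample = pd.DataFrame({
    "lv_feeder_unique_id": ["A", "A", "B"],
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
    "y": [10.0, 14.0, 7.0],
})

local_results = apply_per_group(sample, "lv_feeder_unique_id", stub_forecast)
```

Feeder `B` has a single row and is skipped, which mirrors how the real function returns an empty frame for feeders with too little history.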

In [0]:
results_df.write.mode("overwrite").option("overwriteSchema", "true").saveAsTable(target_table_name)