| **Phase 3: Forecasting Models** | | |
| 9 | Phase 3 | **Forecasting Models:** Development and calibration of the spot price forecasting framework. |
| 9.1 | 3.1 | **Data Preparation for Forecasting:** Feature engineering (lags, rolling stats) and train/test split setup. |
| 9.2 | 3.2 | **Baseline Models:** Implementation of simple models (e.g., Seasonal ARIMA, persistence model) for comparison. |
| 9.3 | 3.3 | **Machine Learning Models:** Implementation of advanced models (e.g., Gradient Boosting, Neural Networks) leveraging EDA insights. |
| 9.4 | 3.4 | **Model Evaluation & Comparison:** Using relevant metrics (e.g., MAE, RMSE, $\text{WAPE}$, $\text{Q95}$ loss) to select the best performer. |
| 9.5 | 3.5 | **Forecast Outputs for Power BI:** Generating and formatting forecast and scenario data for the decision-support dashboard. |
| **Final** | | |
| 10 | Final | **Key Insights & Business Recommendations:** Summarizing actionable findings for energy market participants and outlining next steps. |

## PHASE 3 — Forecasting Models (Spot Price Forecasting)

Goal: Build forecasting models for NSW spot price (RRP) using a simple, industry-style progression:

- Create modelling dataset (train/validation/test split)

- Establish baseline benchmarks

- Train ML models (XGBoost as main)

- Evaluate with clear metrics and error analysis

- Save forecasts for Power BI

### PHASE 3.1 — Prepare Dataset for Forecasting

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# The bare "seaborn" style name is no longer accepted by recent matplotlib releases;
# the seaborn styles are now named "seaborn-v0_8*" (see plt.style.available).
plt.style.use("seaborn-v0_8")

# Alternative: set the style through seaborn itself, e.g.
# sns.set_style("darkgrid")  # or "whitegrid", "dark", "white", "ticks"

df_final = pd.read_csv("data/processed/final_spot_price_dataset.csv", parse_dates=["timestamp"])

df_final.head(), df_final.shape

(            timestamp        RRP  TOTALDEMAND  net_demand_after_pv  \
 0 2025-01-01 00:30:00  121.87000      7162.36              7162.36   
 1 2025-01-01 00:35:00  111.65192      7078.85              7078.85   
 2 2025-01-01 00:40:00  119.04851      7051.56              7051.56   
 3 2025-01-01 00:45:00  119.49351      7029.33              7029.33   
 4 2025-01-01 00:50:00  115.95158      6941.93              6941.93   
 
    pv_rooftop_mw  TOTALINTERMITTENTGENERATION  temperature  wind_speed  hour  \
 0            0.0                     89.49685         27.3        13.0     0   
 1            0.0                     89.09685         27.3        13.0     0   
 2            0.0                     89.26684         27.3        13.0     0   
 3            0.0                     89.38685         27.3        13.0     0   
 4            0.0                     89.66684         27.3        13.0     0   
 
    dayofweek  month  
 0          2      1  
 1          2      1  
 2          2  

In [3]:
df_model = df_final.copy()


In [1]:
import pandas as pd
import numpy as np

df_final = pd.read_csv(
    "data/processed/final_spot_price_dataset.csv",
    parse_dates=["timestamp"]
)

df = df_final.sort_values("timestamp").copy()
df = df.set_index("timestamp")

print("Shape:", df.shape)
df.head()


Shape: (87547, 10)


timestamp,RRP,TOTALDEMAND,net_demand_after_pv,pv_rooftop_mw,TOTALINTERMITTENTGENERATION,temperature,wind_speed,hour,dayofweek,month
2025-01-01 00:30:00,121.87,7162.36,7162.36,0.0,89.49685,27.3,13.0,0,2,1
2025-01-01 00:35:00,111.65192,7078.85,7078.85,0.0,89.09685,27.3,13.0,0,2,1
2025-01-01 00:40:00,119.04851,7051.56,7051.56,0.0,89.26684,27.3,13.0,0,2,1
2025-01-01 00:45:00,119.49351,7029.33,7029.33,0.0,89.38685,27.3,13.0,0,2,1
2025-01-01 00:50:00,115.95158,6941.93,6941.93,0.0,89.66684,27.3,13.0,0,2,1


In [3]:
df.columns

Index(['RRP', 'TOTALDEMAND', 'net_demand_after_pv', 'pv_rooftop_mw',
       'TOTALINTERMITTENTGENERATION', 'temperature', 'wind_speed', 'hour',
       'dayofweek', 'month', 'target_rrp', 'rrp_clip_for_log',
       'target_log_rrp'],
      dtype='object')

## Define the target variable

The model needs an explicit target to predict. Here the target is:

**RRP (spot price)**

RRP is a difficult target because:

- it has spikes

- it can be negative

- it is not normally distributed

Some models learn better from a more stable version of the target, so two columns are created:

- target_rrp = original price (business-friendly)

- target_log_rrp = log-transformed price (model-friendly). Negative prices are clipped to 0 before `log1p` is applied, so this version cannot represent negative prices.

In [2]:
df["target_rrp"] = df["RRP"]

# log needs non-negative values
df["rrp_clip_for_log"] = df["RRP"].clip(lower=0)
df["target_log_rrp"] = np.log1p(df["rrp_clip_for_log"])

df[["target_rrp", "target_log_rrp"]].head()


timestamp,target_rrp,target_log_rrp
2025-01-01 00:30:00,121.87,4.811127
2025-01-01 00:35:00,111.65192,4.724303
2025-01-01 00:40:00,119.04851,4.787896
2025-01-01 00:45:00,119.49351,4.791596
2025-01-01 00:50:00,115.95158,4.76176
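
To make the effect of the clip-then-`log1p` transform concrete, the sketch below applies it to a handful of illustrative prices (made up for the example, not taken from the dataset) and inverts it with `expm1`. It also shows `arcsinh` as one possible alternative that keeps the sign of negative prices; that alternative is an assumption for comparison and is not used in this notebook.

```python
import numpy as np
import pandas as pd

# Illustrative prices only (not from the NSW dataset), including a negative value.
prices = pd.Series([-50.0, 0.0, 81.88, 121.87, 20300.0], name="RRP")

# Transform used in this notebook: clip at zero, then log1p.
log_target = np.log1p(prices.clip(lower=0))
recovered = np.expm1(log_target)  # expm1 is the exact inverse of log1p

# Alternative (assumption, not used above): arcsinh is defined for negative inputs
# and is inverted with sinh, so the sign of the price survives the round trip.
asinh_target = np.arcsinh(prices)
recovered_asinh = np.sinh(asinh_target)

comparison = pd.DataFrame(
    {"RRP": prices, "from_log1p": recovered, "from_arcsinh": recovered_asinh}
)
print(comparison)
```

The negative price comes back as 0 from the `log1p` route, which is why `target_rrp` is kept alongside `target_log_rrp`: any metric reported in dollars should be computed against the original price.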


In [4]:
df[["hour", "dayofweek", "month"]].describe()


statistic,hour,dayofweek,month
count,87547.0,87547.0,87547.0
mean,11.500657,3.000091,5.526689
std,6.921878,1.991822,2.870929
min,0.0,0.0,1.0
25%,6.0,1.0,3.0
50%,12.0,3.0,6.0
75%,18.0,5.0,8.0
max,23.0,6.0,11.0


### Step 3.3: Ramp Features (System Stress Indicators)

In [5]:
# RAMP FEATURES (system stress indicators)

# Demand ramps
df["total_demand_ramp"] = df["TOTALDEMAND"].diff()
df["net_demand_ramp"] = df["net_demand_after_pv"].diff()

# Renewable generation ramp
df["renewable_ramp"] = df["TOTALINTERMITTENTGENERATION"].diff()

# Optional: absolute ramps (magnitude of change)
df["abs_total_demand_ramp"] = df["total_demand_ramp"].abs()
df["abs_net_demand_ramp"] = df["net_demand_ramp"].abs()
df["abs_renewable_ramp"] = df["renewable_ramp"].abs()

# Quick sanity check
df[[
    "TOTALDEMAND", "total_demand_ramp",
    "net_demand_after_pv", "net_demand_ramp",
    "TOTALINTERMITTENTGENERATION", "renewable_ramp"
]].head(10)


timestamp,TOTALDEMAND,total_demand_ramp,net_demand_after_pv,net_demand_ramp,TOTALINTERMITTENTGENERATION,renewable_ramp
2025-01-01 00:30:00,7162.36,,7162.36,,89.49685,
2025-01-01 00:35:00,7078.85,-83.51,7078.85,-83.51,89.09685,-0.4
2025-01-01 00:40:00,7051.56,-27.29,7051.56,-27.29,89.26684,0.16999
2025-01-01 00:45:00,7029.33,-22.23,7029.33,-22.23,89.38685,0.12001
2025-01-01 00:50:00,6941.93,-87.4,6941.93,-87.4,89.66684,0.27999
2025-01-01 00:55:00,6962.78,20.85,6962.78,20.85,87.71685,-1.94999
2025-01-01 01:00:00,6970.56,7.78,6970.56,7.78,88.32857,0.61172
2025-01-01 01:05:00,6949.87,-20.69,6949.87,-20.69,88.82856,0.49999
2025-01-01 01:10:00,6960.56,10.69,6960.56,10.69,88.47857,-0.34999
2025-01-01 01:15:00,6899.02,-61.54,6899.02,-61.54,87.80857,-0.67


## Interpretation — Demand & Renewable Ramp Features (5-Minute Intervals)

### 1. Key Insight
The ramp features capture **short-term system dynamics**: demand and renewable generation change by small-to-moderate amounts at almost every 5-minute interval, even in this 45-minute sample. Rapid fluctuations of this kind are the conditions under which spot price volatility tends to emerge.

---

### 2. Statistical Observations
- **Total demand ramp** values range from -87.40 MW to +20.85 MW in the first ten rows and change sign frequently, indicating continuous adjustment of system load rather than a smooth transition.
- **Net demand ramp** is identical to total demand ramp in this window because rooftop PV output is 0 MW overnight.
- **Renewable ramp** values are much smaller than demand ramps, all within ±2 MW, reflecting gradual changes in intermittent generation at this time.

---

### 3. Patterns Identified
- Demand declines dominate the early intervals (6 of the 9 ramps are negative), consistent with post-midnight load reduction.
- Occasional positive ramps (e.g., +20.85 MW, +10.69 MW) indicate short-lived rebounds in demand rather than a steady trend.
- Renewable generation changes are **noisy but low magnitude**, implying limited short-term stress contribution from renewables during this period.

---

### 4. Impact on Forecasting / Market Behaviour
- Even when absolute demand levels are moderate, **rapid negative or positive ramps** can signal upcoming price movements.
- Demand ramps can flag system stress that is not visible in demand levels alone.
- Renewable ramps, while smaller here, become critical during sunrise/sunset and weather-driven events and should be retained as volatility indicators.

---

### 5. What to Explore Next
- Compare ramp magnitudes during known price spike periods to confirm their predictive strength.
- Analyse ramp distributions by hour to identify high-risk transition periods (e.g. morning ramp-up, evening ramp-down).
- Combine ramp features with lagged prices to capture momentum-driven price behaviour.
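
The first two follow-ups above can be sketched as two `groupby` summaries. The example runs on synthetic 5-minute data so that it is self-contained; the random series, the `demo` frame, and the `SPIKE_THRESHOLD` value are all illustrative assumptions, and on the real data the same two summaries would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic 5-minute data for two days (illustrative; not the NSW dataset).
rng = np.random.default_rng(0)
idx = pd.date_range("2025-01-01", periods=2 * 288, freq="5min")
demo = pd.DataFrame(
    {
        "TOTALDEMAND": 7000 + rng.normal(0, 40, len(idx)).cumsum(),
        "RRP": rng.lognormal(mean=4.4, sigma=0.6, size=len(idx)),
    },
    index=idx,
)
demo["abs_total_demand_ramp"] = demo["TOTALDEMAND"].diff().abs()

# 1) Ramp distribution by hour of day: highlights high-risk transition periods.
ramp_by_hour = (
    demo.groupby(demo.index.hour)["abs_total_demand_ramp"]
    .describe()[["mean", "50%", "max"]]
)

# 2) Ramp magnitude in spike vs non-spike intervals.
SPIKE_THRESHOLD = 300.0  # hypothetical cut-off in $/MWh, chosen only for illustration
demo["is_spike"] = demo["RRP"] > SPIKE_THRESHOLD
ramp_by_spike = demo.groupby("is_spike")["abs_total_demand_ramp"].agg(
    ["count", "mean", "median"]
)

print(ramp_by_hour.head())
print(ramp_by_spike)
```

If ramps carry predictive signal, the spike group should show a visibly larger mean or median ramp than the non-spike group; on random data like this, no such difference is expected.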


### Step 3.5: Lag Features

In [6]:
# LAG FEATURES (market memory)

LAGS = [1, 12, 288]  # 5-min, 1-hour, 1-day

lag_cols = [
    "RRP",
    "TOTALDEMAND",
    "net_demand_after_pv",
    "pv_rooftop_mw",
    "TOTALINTERMITTENTGENERATION",
    "temperature",
    "wind_speed"
]

for col in lag_cols:
    for lag in LAGS:
        df[f"{col}_lag_{lag}"] = df[col].shift(lag)

# Quick check
df[[c for c in df.columns if "_lag_" in c]].head(10)


timestamp,RRP_lag_1,RRP_lag_12,RRP_lag_288,TOTALDEMAND_lag_1,TOTALDEMAND_lag_12,TOTALDEMAND_lag_288,net_demand_after_pv_lag_1,net_demand_after_pv_lag_12,net_demand_after_pv_lag_288,pv_rooftop_mw_lag_1,...,pv_rooftop_mw_lag_288,TOTALINTERMITTENTGENERATION_lag_1,TOTALINTERMITTENTGENERATION_lag_12,TOTALINTERMITTENTGENERATION_lag_288,temperature_lag_1,temperature_lag_12,temperature_lag_288,wind_speed_lag_1,wind_speed_lag_12,wind_speed_lag_288
2025-01-01 00:30:00,,,,,,,,,,,...,,,,,,,,,,
2025-01-01 00:35:00,121.87,,,7162.36,,,7162.36,,,0.0,...,,89.49685,,,27.3,,,13.0,,
2025-01-01 00:40:00,111.65192,,,7078.85,,,7078.85,,,0.0,...,,89.09685,,,27.3,,,13.0,,
2025-01-01 00:45:00,119.04851,,,7051.56,,,7051.56,,,0.0,...,,89.26684,,,27.3,,,13.0,,
2025-01-01 00:50:00,119.49351,,,7029.33,,,7029.33,,,0.0,...,,89.38685,,,27.3,,,13.0,,
2025-01-01 00:55:00,115.95158,,,6941.93,,,6941.93,,,0.0,...,,89.66684,,,27.3,,,13.0,,
2025-01-01 01:00:00,128.51108,,,6962.78,,,6962.78,,,0.0,...,,87.71685,,,27.3,,,13.0,,
2025-01-01 01:05:00,116.72427,,,6970.56,,,6970.56,,,0.0,...,,88.32857,,,27.9,,,27.7,,
2025-01-01 01:10:00,116.33505,,,6949.87,,,6949.87,,,0.0,...,,88.82856,,,27.9,,,27.7,,
2025-01-01 01:15:00,109.74,,,6960.56,,,6960.56,,,0.0,...,,88.47857,,,27.9,,,27.7,,
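
The row-based `shift(lag)` above equals a 5-minute, 1-hour, or 1-day lag only if the timestamp index has no missing intervals. Whether this dataset has gaps is not shown here, so the following is a minimal sketch, on a toy series with one interval deliberately removed, of how a gap check and a time-aware lag could look. The toy values and the names `step`, `row_lag`, and `time_lag` are illustrative.

```python
import pandas as pd

# Toy 5-minute series with one missing interval (00:15 is absent).
idx = pd.to_datetime(
    [
        "2025-01-01 00:00",
        "2025-01-01 00:05",
        "2025-01-01 00:10",
        "2025-01-01 00:20",
        "2025-01-01 00:25",
    ]
)
s = pd.Series([100.0, 101.0, 102.0, 104.0, 105.0], index=idx, name="RRP")

# Gap check: count consecutive timestamps that are not exactly 5 minutes apart.
step = pd.Timedelta(minutes=5)
spacing = s.index.to_series().diff().dropna()
n_gaps = int((spacing != step).sum())

# Row-based lag: takes the previous row, whatever its timestamp is.
row_lag = s.shift(1)

# Time-aware lag: move the index forward by 5 minutes, then align to the original index.
# Timestamps whose true predecessor is missing become NaN instead of a stale value.
time_lag = s.shift(freq=step).reindex(s.index)

print("gaps found:", n_gaps)
print(pd.DataFrame({"RRP": s, "row_lag": row_lag, "time_lag": time_lag}))
```

At 00:20 the row-based lag silently returns the 00:10 value, while the time-aware lag returns NaN. If the gap check reports zero gaps on the real index, the two approaches give identical results and the simpler row-based shift is sufficient.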


#### Feature Freeze

In [7]:
# -------------------------------
# FEATURE FREEZE
# -------------------------------

# Target columns
target_cols = [
    "target_rrp",
    "target_log_rrp"
]

# Core level features
core_features = [
    "TOTALDEMAND",
    "net_demand_after_pv",
    "pv_rooftop_mw",
    "TOTALINTERMITTENTGENERATION",
    "temperature",
    "wind_speed"
]

# Time features (already present)
time_features = [
    "hour",
    "dayofweek",
    "month"
]

# Ramp features
ramp_features = [
    "total_demand_ramp",
    "net_demand_ramp",
    "renewable_ramp",
    "abs_total_demand_ramp",
    "abs_net_demand_ramp",
    "abs_renewable_ramp"
]

# Lag features (automatically pick all lag columns)
lag_features = [c for c in df.columns if "_lag_" in c]

# Combine all model columns (features plus targets, so both pass through the same dropna below)
final_feature_cols = (
    core_features
    + time_features
    + ramp_features
    + lag_features
    + target_cols
)

# Create final modelling dataframe
df_model = df[final_feature_cols].copy()

# Drop rows with NaNs caused by lags/ramps
df_model = df_model.dropna()

print("Final model dataset shape:", df_model.shape)
df_model.head()


Final model dataset shape: (87259, 38)


timestamp,TOTALDEMAND,net_demand_after_pv,pv_rooftop_mw,TOTALINTERMITTENTGENERATION,temperature,wind_speed,hour,dayofweek,month,total_demand_ramp,...,TOTALINTERMITTENTGENERATION_lag_12,TOTALINTERMITTENTGENERATION_lag_288,temperature_lag_1,temperature_lag_12,temperature_lag_288,wind_speed_lag_1,wind_speed_lag_12,wind_speed_lag_288,target_rrp,target_log_rrp
2025-01-02 00:30:00,7051.85,7051.85,0.0,167.15794,22.2,27.7,0,3,1,-0.23,...,193.10306,89.49685,22.2,22.2,27.3,27.7,35.3,13.0,76.00048,4.343812
2025-01-02 00:35:00,6913.44,6913.44,0.0,165.9869,22.2,27.7,0,3,1,-138.41,...,201.29928,89.09685,22.2,22.2,27.3,27.7,35.3,13.0,76.00054,4.343812
2025-01-02 00:40:00,6956.15,6956.15,0.0,164.53814,22.2,27.7,0,3,1,42.71,...,199.37626,89.26684,22.2,22.2,27.3,27.7,35.3,13.0,82.98951,4.430692
2025-01-02 00:45:00,6854.92,6854.92,0.0,165.65535,22.2,27.7,0,3,1,-101.23,...,195.66558,89.38685,22.2,22.2,27.3,27.7,35.3,13.0,75.99976,4.343802
2025-01-02 00:50:00,6833.35,6833.35,0.0,165.27923,22.2,27.7,0,3,1,-21.57,...,196.06196,89.66684,22.2,22.2,27.3,27.7,35.3,13.0,76.00054,4.343812


In [8]:
df_model.isna().sum().sum()


0

In [9]:
df_model[["target_rrp", "target_log_rrp"]].describe()


statistic,target_rrp,target_log_rrp
count,87259.0,87259.0
mean,109.320505,3.892245
std,485.001347,1.673189
min,-999.99406,0.0
25%,51.78441,3.966216
50%,81.8825,4.417424
75%,121.984715,4.81206
max,20300.0,9.918425


In [10]:
# Save final dataset
output_path = "data/processed/model_features_dataset.csv"
df_model.reset_index().to_csv(output_path, index=False)

print(f"Saved final modelling dataset → {output_path}")


Saved final modelling dataset → data/processed/model_features_dataset.csv
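
With the feature table saved, the train/validation/test split named in the phase goals could be set up chronologically, so that every validation and test row lies after every training row. The sketch below is one way to do that under stated assumptions: the helper `chronological_split` and the 70/15/15 fractions are hypothetical choices for illustration, and the toy frame stands in for `df_model`.

```python
import numpy as np
import pandas as pd


def chronological_split(frame, train_frac=0.70, val_frac=0.15):
    """Split a time-indexed frame into consecutive train/validation/test blocks.

    No shuffling is applied, so later blocks never leak into earlier ones.
    """
    frame = frame.sort_index()
    n = len(frame)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]


# Toy stand-in for df_model (illustrative values only).
idx = pd.date_range("2025-01-01", periods=1000, freq="5min")
toy = pd.DataFrame(
    {
        "TOTALDEMAND": np.linspace(6000, 8000, len(idx)),
        "target_rrp": np.linspace(50, 150, len(idx)),
    },
    index=idx,
)

train, val, test = chronological_split(toy)

# Separate predictors from the target so the target never appears among the inputs.
target_cols = ["target_rrp"]
X_train, y_train = train.drop(columns=target_cols), train["target_rrp"]

print(len(train), len(val), len(test))
print(train.index.max(), "<", val.index.min())
```

Because `df_model` stores both `target_rrp` and `target_log_rrp` next to the features, both target columns would need to be dropped from the predictor matrix before fitting.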
