# Feature Engineering for Electricity Demand Forecasting (h+1)

This notebook focuses on transforming the cleaned and merged dataset into a model-ready format for **short-term electricity demand forecasting**.

The objective is to predict electricity load at **t+1 hour** (one hour ahead).

Feature selection is driven by the EDA findings, which highlighted:
- Strong temporal dependence of load
- Daily and weekly seasonality
- Temperature as the main exogenous driver

## 1. Imports & Setup

In [68]:
from pathlib import Path
import pandas as pd
import holidays

## 2. Project Paths and Parameters

In [69]:
# We rely on pathlib for robust and portable file paths
PROJECT_ROOT = Path.cwd().parents[0]
PROCESSED_BASE_PATH = PROJECT_ROOT / "data" / "processed"

Modeling scope:
- Country: France (FR)
- Forecast horizon: h+1
- Multiple years are loaded as one continuous series so that lag features are computed correctly across year boundaries

In [70]:
# Parameters
country = "FR"
years = list(range(2012, 2026)) # Candidate range 2012-2025; years without a processed file are skipped at load time

## 3. Load Preprocessed Demand + Weather Data

We load all available years and concatenate them into **a single continuous time series**. This is required to compute lag features (h-1, h-24, h-168) without breaking temporal continuity.

In [71]:
# Load preprocessed data for several years
dfs = []

for year in years:
    path = (
        PROCESSED_BASE_PATH
        / f"country={country}"
        / f"year={year}"
        / "load_weather.parquet"
    )
    if path.exists():
        dfs.append(pd.read_parquet(path))

assert len(dfs) > 0, "No data loaded"

# Continuous time series
df = pd.concat(dfs, ignore_index=True).sort_values("datetime").reset_index(drop=True)
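The lag features in Section 7 are built with row-based `shift`, so they implicitly assume exactly one row per hour with no gaps or duplicates. The following is a minimal sketch of a continuity check, run on a small synthetic frame rather than the real data. The helper name `find_gaps` and the toy values are illustrative assumptions, not part of the original pipeline.

```python
import pandas as pd

# Synthetic hourly series with one missing hour (illustrative data only)
idx = pd.date_range("2023-01-01", periods=6, freq=pd.Timedelta(hours=1), tz="UTC")
toy = pd.DataFrame({"datetime": idx, "load_MW": range(6)}).drop(index=3)


def find_gaps(frame, col="datetime", step="1h"):
    """Return timestamps whose distance to the previous row is not exactly `step`."""
    deltas = frame[col].diff().dropna()
    bad = deltas[deltas != pd.Timedelta(step)]
    return frame.loc[bad.index, col]


gaps = find_gaps(toy)
print(gaps)  # the row at 04:00 follows a two-hour jump
```

Applied to the real `df`, an empty result would suggest that `shift(24)` really does point 24 hours back. A non-empty result would call for reindexing to a full hourly range before computing lags.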

In [72]:
print(df.shape)
df.head()

(17544, 8)


| | datetime | load_MW | country | temperature_2m | relative_humidity_2m | wind_speed_10m | shortwave_radiation_instant | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 2023-01-01 00:00:00+00:00 | 45709.0 | FR | 14.85 | 53.719143 | 27.859905 | 0.0 | 2023 |
| 1 | 2023-01-01 01:00:00+00:00 | 44640.0 | FR | 14.95 | 52.638969 | 26.302181 | 0.0 | 2023 |
| 2 | 2023-01-01 02:00:00+00:00 | 41533.0 | FR | 14.75 | 53.321426 | 23.0653 | 0.0 | 2023 |
| 3 | 2023-01-01 03:00:00+00:00 | 39248.0 | FR | 14.2 | 55.827904 | 21.385939 | 0.0 | 2023 |
| 4 | 2023-01-01 04:00:00+00:00 | 38389.0 | FR | 14.15 | 57.980919 | 20.683559 | 0.0 | 2023 |


## 4. Target Variable (h+1)

We define the prediction target as **load at time t+1**.

In [73]:
df["target_load_t+1"] = df["load_MW"].shift(-1)

## 5. Calendar Features

Electricity demand strongly depends on human activity cycles.

In [74]:
# Hour of day
df["hour"] = df["datetime"].dt.hour

In [75]:
# Day of week and weekday indicator
df["day_of_week"] = df["datetime"].dt.dayofweek # 0=Monday, 6=Sunday

df["is_weekday"] = (df["day_of_week"] < 5).astype(int)

In [76]:
# Week of year (ISO calendar)
df["week_of_year"] = df["datetime"].dt.isocalendar().week.astype(int)

# Replace week 53 by 52 to ensure consistent annual seasonality
df.loc[df["week_of_year"] == 53, "week_of_year"] = 52

## 6. Holidays

In [77]:
# Holidays indicator per country
countryHolidays = holidays.country_holidays(country=country)

In [78]:
# Create a holiday feature (1 if holiday, 0 otherwise), based on the UTC calendar date of each timestamp
df["is_holiday"] = df["datetime"].apply(lambda x: 1 if x.date() in countryHolidays else 0)

# Holidays are treated as non-working days by electricity network operators
df.loc[df["is_holiday"] == 1, "is_weekday"] = 0
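Because the timestamps are stored in UTC, `x.date()` returns the UTC calendar date, which differs from the French local date for one or two hours around midnight. If local-time holidays are what is intended (an assumption, since the notebook does not say), a sketch of the difference might look like this. The set `toy_holidays` is a hypothetical stand-in for the holiday calendar, and the `Europe/Paris` time zone is likewise assumed.

```python
import datetime as dt

import pandas as pd

# Hypothetical stand-in for the holiday calendar: 1 May (Labour Day) only
toy_holidays = {dt.date(2023, 5, 1)}

# Three UTC timestamps around local midnight (Paris is UTC+2 in late April)
ts = pd.Series(
    pd.to_datetime(
        ["2023-04-30 21:00", "2023-04-30 22:00", "2023-05-01 00:00"], utc=True
    )
)

utc_dates = ts.dt.date  # calendar date in UTC
local_dates = ts.dt.tz_convert("Europe/Paris").dt.date  # calendar date in local time

is_holiday_utc = utc_dates.map(lambda d: int(d in toy_holidays))
is_holiday_local = local_dates.map(lambda d: int(d in toy_holidays))

print(is_holiday_utc.tolist())    # [0, 0, 1]
print(is_holiday_local.tolist())  # [0, 1, 1]
```

The same consideration would apply to `hour` and `day_of_week`, which are also derived from the UTC timestamp.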

## 7. Lag Features

Electricity demand shows **strong temporal dependence**. Lagged values are therefore among the most powerful predictors.

In [79]:
# Load lags
df["load_t-1"] = df["load_MW"].shift(1)
df["load_t-24"] = df["load_MW"].shift(24)
df["load_t-168"] = df["load_MW"].shift(24 * 7)

These lags capture:
- short-term inertia (h-1)
- daily seasonality (h-24)
- weekly seasonality (h-168)
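To make the alignment concrete, here is a small sketch on a synthetic series in which the load value equals the hour number, so each lag can be read off directly. The frame `toy` is illustrative only; the column names mirror those used in this notebook.

```python
import pandas as pd

# Synthetic hourly series: the load value at row i is simply i
idx = pd.date_range("2023-01-01", periods=200, freq=pd.Timedelta(hours=1), tz="UTC")
toy = pd.DataFrame({"load_MW": range(200)}, index=idx)

toy["load_t-1"] = toy["load_MW"].shift(1)         # previous hour
toy["load_t-24"] = toy["load_MW"].shift(24)       # same hour, previous day
toy["load_t-168"] = toy["load_MW"].shift(24 * 7)  # same hour, previous week
toy["target_load_t+1"] = toy["load_MW"].shift(-1) # next hour (the target)

row = toy.iloc[170]
print(row.to_dict())
# load_MW=170, load_t-1=169, load_t-24=146, load_t-168=2, target_load_t+1=171

complete = toy.dropna()
print(len(toy), "->", len(complete))  # 200 -> 31 (first 168 rows and last row dropped)
```

Positive shifts look backward and the negative shift looks forward, so every feature at row t uses information from t or earlier while the target sits at t+1.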

## 8. Weather Features

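We use weather variables observed at time t to predict load at t+1. The list below names the available weather columns; only temperature enters the final feature set for now.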

In [80]:
weather_features = [
    "temperature_2m",
    "relative_humidity_2m",
    "wind_speed_10m",
    "shortwave_radiation_instant"
]

## 9. Final Feature Set

In [88]:
# Rename the load and temperature columns to make their time index (t) explicit
df = df.rename(columns = {"load_MW": "load_t", "temperature_2m": "temperature_t"})

# Features columns
feature_cols = [
    "load_t",
    "load_t-1",
    "load_t-24",
    "load_t-168",
    "temperature_t", # For the moment, we only use temperature from weather data
    "hour",
    "is_weekday",
    "week_of_year",
]

In [89]:
# Target column
target_col = "target_load_t+1"

## 10. Remove Invalid Rows (Lag & Target NaNs)

This step removes:
- the first 168 hours (one week) of the dataset, which lack the h-168 lag
- the last hour, which has no t+1 target

In [90]:
# Final dataset for modeling
df_model = (
    df
    .set_index("datetime")[feature_cols + [target_col]] # Index by timestamp; keep features and target
    .dropna() # Drop rows with missing lags or target
    .copy()
)

In [91]:
# Final checks
assert df_model.index.is_monotonic_increasing
assert df_model.index.is_unique
assert df_model.isna().sum().sum() == 0

In [92]:
df_model

| datetime | load_t | load_t-1 | load_t-24 | load_t-168 | temperature_t | hour | is_weekday | week_of_year | target_load_t+1 |
|---|---|---|---|---|---|---|---|---|---|
| 2023-01-08 00:00:00+00:00 | 47749.00 | 49353.00 | 49262.0 | 45709.0 | 9.50 | 0 | 0 | 1 | 46757.00 |
| 2023-01-08 01:00:00+00:00 | 46757.00 | 47749.00 | 48267.0 | 44640.0 | 9.40 | 1 | 0 | 1 | 44097.00 |
| 2023-01-08 02:00:00+00:00 | 44097.00 | 46757.00 | 45683.0 | 41533.0 | 9.60 | 2 | 0 | 1 | 42462.00 |
| 2023-01-08 03:00:00+00:00 | 42462.00 | 44097.00 | 44278.0 | 39248.0 | 9.35 | 3 | 0 | 1 | 43667.00 |
| 2023-01-08 04:00:00+00:00 | 43667.00 | 42462.00 | 44889.0 | 38389.0 | 8.70 | 4 | 0 | 1 | 44210.00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-12-31 18:00:00+00:00 | 60473.50 | 59745.75 | 72000.0 | 57992.0 | 5.15 | 18 | 1 | 1 | 61201.25 |
| 2024-12-31 19:00:00+00:00 | 61201.25 | 60473.50 | 68613.0 | 54931.0 | 5.15 | 19 | 1 | 1 | 61929.00 |
| 2024-12-31 20:00:00+00:00 | 61929.00 | 61201.25 | 65922.0 | 52952.0 | 5.40 | 20 | 1 | 1 | 62676.50 |
| 2024-12-31 21:00:00+00:00 | 62676.50 | 61929.00 | 66599.0 | 54334.0 | 5.40 | 21 | 1 | 1 | 63424.00 |


## 11. Save Feature Dataset (Partitioned by Year)

Each year is stored separately to enable:
- clean backtesting
- scalable training pipelines

In [None]:
# Save features per year
for year, df_year in df_model.groupby(df_model.index.year):

    output_dir = (
        PROCESSED_BASE_PATH
        / f"country={country}"
        / f"year={year}"
    )
    output_dir.mkdir(parents=True, exist_ok=True)

    output_path = output_dir / "load_forecasting_features.parquet"
    # Keep the datetime index: index=False would drop the timestamps from the saved file
    df_year.to_parquet(output_path, index=True)

    print(f"[SAVED] {output_path} | rows={len(df_year)}")

## 12. Summary

At this stage, we have:
- A clean **feature matrix**
- No temporal leakage: every feature is observed at time t or earlier, while the target is at t+1
- Strong domain-driven predictors
- A dataset ready for **time-series backtesting and modeling**

The next steps are:
- defining train / validation / test splits
- training baseline models (naive, linear, tree-based)
- evaluating performance on unseen years
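As a rough illustration of the first two steps, the sketch below splits a synthetic stand-in for `df_model` by calendar year and scores a persistence baseline (predicting that load at t+1 equals load at t). The sinusoidal data, the choice of 2023 for training and 2024 for testing, and the use of MAE are all assumptions made for the example, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_model: two years of hourly data with a daily cycle
idx = pd.date_range(
    "2023-01-01", "2024-12-31 23:00", freq=pd.Timedelta(hours=1), tz="UTC"
)
hours = np.arange(len(idx))
toy = pd.DataFrame(
    {"load_t": 50000 + 5000 * np.sin(2 * np.pi * hours / 24)}, index=idx
)
toy["target_load_t+1"] = toy["load_t"].shift(-1)
toy = toy.dropna()

# Split by calendar year so the test year is never seen during training
train = toy[toy.index.year == 2023]
test = toy[toy.index.year == 2024]

# Persistence (naive) baseline: predict that load at t+1 equals load at t
mae = (test["target_load_t+1"] - test["load_t"]).abs().mean()
print(f"train rows={len(train)}, test rows={len(test)}, naive MAE={mae:.1f} MW")
```

Splitting on whole years keeps the evaluation chronological, which matters here because the lag features carry information from the immediate past.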