# Feature Engineering for Electricity Demand Forecasting (h+1)

This notebook focuses on transforming the cleaned and merged dataset into a model-ready format for **short-term electricity demand forecasting**.

The objective is to predict electricity load at **t+1 hour** (one hour ahead).

Feature selection is driven by the EDA findings, which highlighted:
- Strong temporal dependence of load
- Daily and weekly seasonality
- Temperature as the main exogenous driver

## 1. Imports & Setup

In [1]:
from pathlib import Path
import pandas as pd
import holidays

## 2. Project Paths and Parameters

In [2]:
# We rely on pathlib for robust and portable file paths
PROJECT_ROOT = Path.cwd().parents[0]
PROCESSED_BASE_PATH = PROJECT_ROOT / "data" / "processed"
FEATURED_BASE_PATH = PROJECT_ROOT / "data" / "featured"

Modeling scope:
- Country: France (FR)
- Forecast horizon: h+1
- Multiple consecutive years are loaded as one series so that lag features stay correct across year boundaries

In [3]:
# Parameters
country = "FR"
years = list(range(2015, 2024+1)) # Full historical range

## 3. Load Preprocessed Demand + Weather Data

We load all available years and concatenate them into **a single continuous time series**. This is required to compute lag features (h-1, h-24, h-168) without breaking temporal continuity.

In [4]:
# Load preprocessed data for several years
dfs = []

for year in years:
    path = (
        PROCESSED_BASE_PATH
        / f"country={country}"
        / f"year={year}"
        / "load_weather.parquet"
    )
    if path.exists():
        dfs.append(pd.read_parquet(path))

assert len(dfs) > 0, "No data loaded"

# Continuous time series
df = pd.concat(dfs, ignore_index=True).sort_values("datetime").reset_index(drop=True)

In [5]:
print(df.shape)
df.head()

(87671, 8)


| | datetime | load_MW | country | temperature_2m | relative_humidity_2m | wind_speed_10m | shortwave_radiation_instant | year |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01 00:00:00+00:00 | 70929.0 | FR | -0.646 | 97.476974 | 2.968636 | 0.0 | 2015 |
| 1 | 2015-01-01 01:00:00+00:00 | 69773.0 | FR | -0.946 | 97.470795 | 4.213692 | 0.0 | 2015 |
| 2 | 2015-01-01 02:00:00+00:00 | 66417.0 | FR | -1.096 | 97.825935 | 5.4 | 0.0 | 2015 |
| 3 | 2015-01-01 03:00:00+00:00 | 64182.0 | FR | -1.846 | 97.812531 | 6.638072 | 0.0 | 2015 |
| 4 | 2015-01-01 04:00:00+00:00 | 63859.0 | FR | -3.196 | 97.423698 | 5.351785 | 0.0 | 2015 |
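
Ten full years from 2015 to 2024 contain 87,672 hours, while the shape printed above shows 87,671 rows, which hints at a missing timestamp. Because the lag and target columns are built with positional shifts, it can be worth locating gaps before going further. The sketch below is one possible check on a toy series; `find_gaps` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd


def find_gaps(timestamps: pd.Series, step: pd.Timedelta = pd.Timedelta(hours=1)) -> pd.DataFrame:
    """Return one row per gap: timestamp before, timestamp after, number of missing steps."""
    ts = timestamps.sort_values().reset_index(drop=True)
    delta = ts.diff()            # time elapsed since the previous row
    is_gap = delta > step        # NaT (first row) compares as False
    return pd.DataFrame({
        "previous": ts.shift(1)[is_gap],
        "next": ts[is_gap],
        "missing_steps": (delta[is_gap] // step) - 1,
    }).reset_index(drop=True)


# Toy hourly series with the 03:00 timestamp removed
toy = pd.Series(pd.date_range("2021-01-01", periods=6, freq=pd.Timedelta(hours=1), tz="UTC"))
toy = toy.drop(index=3)

gaps = find_gaps(toy)
print(gaps)
```

On the real data, the same call would be `find_gaps(df["datetime"])`; an empty result means the series is strictly hourly.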


## 4. Target Variable (h+1)

We define the prediction target as **load at time t+1**.

In [6]:
df["target_load_t+1"] = df["load_MW"].shift(-1)
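
To make the effect of `shift(-1)` concrete, here is a small illustration on the first four load values shown in the preview above. Each row receives the load of the following row as its target, and the last row has no target.

```python
import pandas as pd

toy = pd.DataFrame({
    "datetime": pd.date_range("2015-01-01", periods=4, freq=pd.Timedelta(hours=1), tz="UTC"),
    "load_MW": [70929.0, 69773.0, 66417.0, 64182.0],
})

# A negative shift moves values up: row t receives the value of row t+1
toy["target_load_t+1"] = toy["load_MW"].shift(-1)
print(toy)
```

Note that the shift is positional: it matches "one hour ahead" only if consecutive rows are exactly one hour apart.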

## 5. Calendar Features

Electricity demand strongly depends on human activity cycles.

In [7]:
# Hour of day
df["hour"] = df["datetime"].dt.hour

In [8]:
# Day of week and weekday indicator
df["day_of_week"] = df["datetime"].dt.dayofweek # 0=Monday, 6=Sunday

df["is_weekday"] = (df["day_of_week"] < 5).astype(int)

In [9]:
# Week of year (ISO calendar)
df["week_of_year"] = df["datetime"].dt.isocalendar().week.astype(int)

# Replace week 53 with 52 so that every year uses the same set of week values
df.loc[df["week_of_year"] == 53, "week_of_year"] = 52
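
ISO week numbering has an edge case worth keeping in mind: the first days of January can belong to the last ISO week of the previous year. The sketch below illustrates this on three dates and applies the same 53 to 52 replacement as the cell above.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2015-12-31", "2016-01-01", "2016-01-04"], utc=True))

iso = dates.dt.isocalendar()              # columns: year, week, day
week = iso["week"].astype(int)
week_capped = week.where(week != 53, 52)  # same rule as the notebook cell

print(pd.DataFrame({
    "date": dates.dt.date,
    "iso_year": iso["year"],
    "iso_week": week,
    "week_of_year": week_capped,
}))
```

Here 1 January 2016 belongs to ISO week 53 of 2015, so it ends up with `week_of_year = 52`, while 4 January 2016 starts week 1.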

## 6. Holidays

In [10]:
# Public-holiday calendar for the selected country
country_holidays = holidays.country_holidays(country=country)

In [11]:
# Create a holiday feature (1 if holiday, 0 otherwise).
# The date is taken from the UTC timestamp stored in "datetime".
df["is_holiday"] = df["datetime"].apply(lambda x: 1 if x.date() in country_holidays else 0)

# Holidays are treated as non-working days by electricity network operators
df.loc[df["is_holiday"] == 1, "is_weekday"] = 0
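
The timestamps in this dataset are in UTC, whereas public holidays follow local calendar days. The sketch below shows how the two can differ around midnight. It is an illustration only: `toy_holidays` is a hypothetical stand-in for the holiday calendar (it contains just Labour Day 2024), and using `Europe/Paris` as the local time zone for France is an assumption.

```python
from datetime import date

import pandas as pd

# Hypothetical stand-in for the holiday calendar: only 1 May 2024
toy_holidays = {date(2024, 5, 1)}

ts = pd.Series(pd.to_datetime(
    ["2024-04-30 21:00", "2024-04-30 22:00", "2024-05-01 21:00", "2024-05-01 22:00"],
    utc=True,
))
local = ts.dt.tz_convert("Europe/Paris")  # assumed local time zone

out = pd.DataFrame({
    "utc": ts,
    "local": local,
    # Holiday flag based on the UTC calendar day
    "is_holiday_utc": ts.dt.date.map(lambda d: d in toy_holidays).astype(int),
    # Holiday flag based on the local calendar day
    "is_holiday_local": local.dt.date.map(lambda d: d in toy_holidays).astype(int),
})
print(out)
```

The two flags disagree for the hours just before midnight UTC. Whether this matters depends on how the model is evaluated; the same remark applies to the `hour` feature.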

## 7. Lag Features

Electricity demand shows **strong temporal dependence**. Lagged values are therefore among the most powerful predictors.

In [12]:
# Load lags
df["load_t-1"] = df["load_MW"].shift(1)
df["load_t-24"] = df["load_MW"].shift(24)
df["load_t-168"] = df["load_MW"].shift(24 * 7)

These lags capture:
- short-term inertia (h-1)
- daily seasonality (h-24)
- weekly seasonality (h-168)
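
Positional shifts equal "k hours ago" only when every hour is present. As a hedged alternative, the sketch below reindexes the series onto a complete hourly grid before shifting, so a missing hour yields a NaN lag instead of a silently misaligned value. `add_time_aware_lags` is a hypothetical helper and assumes unique timestamps.

```python
import pandas as pd


def add_time_aware_lags(df, value_col="load_MW", time_col="datetime", lags=(1, 24, 168)):
    """Add lag columns that refer to exactly `lag` hours earlier, even if rows are missing."""
    step = pd.Timedelta(hours=1)
    series = df.set_index(time_col)[value_col].sort_index()
    # Complete hourly grid between the first and last timestamp
    full_index = pd.date_range(series.index.min(), series.index.max(), freq=step)
    regular = series.reindex(full_index)  # missing hours become NaN
    out = df.copy()
    for lag in lags:
        lagged = regular.shift(lag)
        # Map the lagged values back onto the original rows
        out[f"load_t-{lag}"] = lagged.reindex(out[time_col]).to_numpy()
    return out


# Toy data: six hours with the 03:00 row removed
times = pd.date_range("2021-01-01", periods=6, freq=pd.Timedelta(hours=1), tz="UTC")
toy = pd.DataFrame({"datetime": times, "load_MW": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
toy = toy.drop(index=3).reset_index(drop=True)

result = add_time_aware_lags(toy, lags=(1, 2))
positional_lag1 = toy["load_MW"].shift(1)
print(result)
```

At 04:00, the positional lag returns the 02:00 value, whereas the time-aware `load_t-1` is NaN because 03:00 is missing.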

## 8. Weather Features

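Weather variables observed at time t are used to predict load at t+1. Four variables are available in the dataset; for the moment, only temperature is kept in the final feature set.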

In [13]:
weather_features = [
    "temperature_2m",
    "relative_humidity_2m",
    "wind_speed_10m",
    "shortwave_radiation_instant"
]

## 9. Final Feature Set

In [14]:
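# Rename the load and temperature columns to make the time reference explicit
df = df.rename(columns={"load_MW": "load_t", "temperature_2m": "temperature_t"})

# Feature columns
feature_cols = [
    "datetime",  # kept so the dataset can be partitioned by year when saving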
    "load_t",
    "load_t-1",
    "load_t-24",
    "load_t-168",
    "temperature_t", # For the moment, we only use temperature from weather data
    "hour",
    "is_weekday",
    "week_of_year",
]

In [15]:
# Target column
target_col = "target_load_t+1"

## 10. Remove Invalid Rows (Lag & Target NaNs)

This step removes:
- the first 168 hours of the dataset (the longest lag, h-168, is undefined there)
- the last hour (no t+1 target available)

In [16]:
# Final dataset for modeling: select features and target, then drop rows with NaNs.
# "datetime" is kept as a regular column (not as the index).

df_model = df[feature_cols + [target_col]].dropna().copy()
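
The number of rows removed by `dropna()` can be verified on a toy frame: with lags of 1, 24 and 168 and a one-step-ahead target, the first 168 rows and the last row should disappear. This matches the preview of `df_model`, whose first row carries index 168.

```python
import pandas as pd

n = 200
toy = pd.DataFrame({"load_t": [float(i) for i in range(n)]})

# Same lag and target construction as in the notebook
toy["load_t-1"] = toy["load_t"].shift(1)
toy["load_t-24"] = toy["load_t"].shift(24)
toy["load_t-168"] = toy["load_t"].shift(24 * 7)
toy["target_load_t+1"] = toy["load_t"].shift(-1)

toy_model = toy.dropna().copy()

print(len(toy), "->", len(toy_model))
print("first kept index:", toy_model.index[0], "| last kept index:", toy_model.index[-1])
```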

In [17]:
# Final checks ("datetime" is a column here, so check it rather than the row index)
assert df_model["datetime"].is_monotonic_increasing
# assert df_model["datetime"].is_unique
assert df_model.isna().sum().sum() == 0
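
The checks above confirm ordering and completeness, but not that each target really is the load observed one hour later. A possible additional check is sketched below on toy data; `misaligned_targets` is a hypothetical helper, and it assumes unique timestamps in the reference frame.

```python
import pandas as pd


def misaligned_targets(model_df, reference_df, time_col="datetime",
                       load_col="load_t", target_col="target_load_t+1"):
    """Return rows whose target differs from the load observed exactly one hour later."""
    step = pd.Timedelta(hours=1)
    # Load indexed by timestamp, taken from the frame before rows were dropped
    load_by_time = reference_df.set_index(time_col)[load_col]
    # Load actually observed one hour after each row (NaN if that hour is absent)
    next_load = load_by_time.reindex(model_df[time_col] + step).to_numpy()
    bad = model_df[target_col].to_numpy() != next_load  # NaN never equals anything
    return model_df.loc[bad, [time_col, target_col]].assign(load_next_hour=next_load[bad])


# Toy data: six hours with the 03:00 row removed before building the target
step = pd.Timedelta(hours=1)
times = pd.date_range("2021-01-01", periods=6, freq=step, tz="UTC")
full = pd.DataFrame({"datetime": times, "load_t": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
toy = full.drop(index=3).reset_index(drop=True)
toy["target_load_t+1"] = toy["load_t"].shift(-1)
toy_model = toy.dropna().copy()

bad_rows = misaligned_targets(toy_model, toy)
print(bad_rows)
```

In the toy example, the 02:00 row is flagged: its target is the 04:00 load because the 03:00 row is missing.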

In [18]:
df_model.head()

Unnamed: 0,datetime,load_t,load_t-1,load_t-24,load_t-168,temperature_t,hour,is_weekday,week_of_year,target_load_t+1
168,2015-01-08 00:00:00+00:00,65948.0,68372.0,67621.0,70929.0,5.154,0,1,2,64676.0
169,2015-01-08 01:00:00+00:00,64676.0,65948.0,66393.0,69773.0,5.704,1,1,2,61551.0
170,2015-01-08 02:00:00+00:00,61551.0,64676.0,63640.0,66417.0,6.354,2,1,2,60541.0
171,2015-01-08 03:00:00+00:00,60541.0,61551.0,62955.0,64182.0,6.954,3,1,2,62833.0
172,2015-01-08 04:00:00+00:00,62833.0,60541.0,65636.0,63859.0,7.554,4,1,2,68782.0


## 11. Save Feature Dataset (Partitioned by Year)

Each year is stored separately to enable:
- clean backtesting
- scalable training pipelines

In [19]:
# Save features per year
for year, df_year in df_model.groupby(df_model["datetime"].dt.year):
    output_dir = (
        FEATURED_BASE_PATH
        / f"country={country}"
        / f"year={year}"
    )
    output_dir.mkdir(parents=True, exist_ok=True)

    output_path = output_dir / "load_forecasting_features.parquet"
    df_year.to_parquet(output_path, index=False)

    print(f"[SAVED] {output_path} | rows={len(df_year)}")

[SAVED] /Users/bachirijihane/energy-intelligence-platform/data/featured/country=FR/year=2015/load_forecasting_features.parquet | rows=8592
[SAVED] /Users/bachirijihane/energy-intelligence-platform/data/featured/country=FR/year=2016/load_forecasting_features.parquet | rows=8784
[SAVED] /Users/bachirijihane/energy-intelligence-platform/data/featured/country=FR/year=2017/load_forecasting_features.parquet | rows=8760
[SAVED] /Users/bachirijihane/energy-intelligence-platform/data/featured/country=FR/year=2018/load_forecasting_features.parquet | rows=8760
[SAVED] /Users/bachirijihane/energy-intelligence-platform/data/featured/country=FR/year=2019/load_forecasting_features.parquet | rows=8760
[SAVED] /Users/bachirijihane/energy-intelligence-platform/data/featured/country=FR/year=2020/load_forecasting_features.parquet | rows=8784
[SAVED] /Users/bachirijihane/energy-intelligence-platform/data/featured/country=FR/year=2021/load_forecasting_features.parquet | rows=8759
[SAVED] /Users/bachirijihan
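
For downstream notebooks, it may be handy to discover which years were written under this `country=.../year=...` layout. The sketch below is one way to do it with `pathlib` only; `available_years` is a hypothetical helper, and the demo uses empty placeholder files in a temporary directory rather than real parquet data.

```python
import tempfile
from pathlib import Path


def available_years(base_path: Path, country: str) -> list[int]:
    """List the years that have a feature file under base_path/country=<country>/year=<year>/."""
    country_dir = base_path / f"country={country}"
    years = []
    for path in country_dir.glob("year=*"):
        if path.is_dir() and (path / "load_forecasting_features.parquet").exists():
            years.append(int(path.name.split("=", 1)[1]))
    return sorted(years)


# Demo on a temporary directory with placeholder files
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for year in (2016, 2015):
        year_dir = base / "country=FR" / f"year={year}"
        year_dir.mkdir(parents=True)
        (year_dir / "load_forecasting_features.parquet").touch()
    # A year directory without a feature file is ignored
    (base / "country=FR" / "year=2017").mkdir()
    found = available_years(base, "FR")

print(found)
```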

## 12. Summary

At this stage, we have:
- A clean **feature matrix**
- No temporal leakage: all features are available at time t, while the target is the load at t+1
- Strong domain-driven predictors
- A dataset ready for **time-series backtesting and modeling**

The next steps are:
- defining train / validation / test splits
- training baseline models (naive, linear, tree-based)
- evaluating performance on unseen years
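
As a rough preview of these steps, the sketch below splits a toy frame chronologically by year and scores the naive persistence forecast (load at t+1 equals load at t). The cut-off years, the helper names and the toy values are all hypothetical and chosen only for illustration.

```python
import pandas as pd


def split_by_year(df, train_end, val_end, time_col="datetime"):
    """Split chronologically: years <= train_end, then years <= val_end, then the rest."""
    year = df[time_col].dt.year
    train = df[year <= train_end]
    val = df[(year > train_end) & (year <= val_end)]
    test = df[year > val_end]
    return train, val, test


def persistence_mae(df, load_col="load_t", target_col="target_load_t+1"):
    """Mean absolute error of the naive forecast 'load at t+1 equals load at t'."""
    return (df[target_col] - df[load_col]).abs().mean()


# Toy values, not taken from the real dataset
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2021-06-01", "2022-06-01", "2023-06-01", "2023-06-02", "2024-06-01"], utc=True
    ),
    "load_t": [100.0, 110.0, 120.0, 130.0, 140.0],
    "target_load_t+1": [105.0, 108.0, 125.0, 128.0, 150.0],
})

train, val, test = split_by_year(toy, train_end=2022, val_end=2023)
val_mae = persistence_mae(val)
test_mae = persistence_mae(test)
print(len(train), len(val), len(test), val_mae, test_mae)
```

Splitting on whole years keeps the evaluation strictly out of sample in time, which fits the year-partitioned files saved above.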