# Electricity Demand & Weather Preprocessing

This notebook walks step by step through the preprocessing pipeline that builds a clean, continuous, hourly dataset combining:

- Electricity demand (load_MW)
- Weather variables

# 1. Context & Objectives 

We start from raw time series data that may contain:
- Missing timestamps
- Irregular hourly frequency
- Gaps (NaNs)
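
The issues listed above can be detected before any cleaning is done. The sketch below is illustrative only: the toy values and the variable names (`raw`, `expected`, `missing_timestamps`) are made up for this example and are not part of the pipeline.

```python
import pandas as pd

# Toy series (made-up values): the 02:00 row is absent and 03:00 holds a NaN
raw = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            [
                "2015-01-01 00:00",
                "2015-01-01 01:00",
                "2015-01-01 03:00",
                "2015-01-01 04:00",
            ]
        ),
        "load_MW": [100.0, 110.0, None, 140.0],
    }
)

# Timestamps we would expect if the series were strictly hourly
expected = pd.date_range(raw["datetime"].min(), raw["datetime"].max(), freq="h")

# Hours that are absent from the raw data
missing_timestamps = expected.difference(pd.DatetimeIndex(raw["datetime"]))

# Rows that exist but carry no value
nan_count = int(raw["load_MW"].isna().sum())

print(missing_timestamps)
print(nan_count)
```

Missing timestamps and NaN values are different problems: the first requires new rows to be created, the second only requires existing rows to be filled.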

### Final objectives: 
- Ensure a complete hourly timeline
- Interpolate numeric variables only
- Safely handle categorical variables
- Merge demand and weather into a single dataset
- Produce a model-ready dataset

# 2. Imports & Project Structure

In [None]:
import pandas as pd
from pathlib import Path

In [None]:
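# We rely on pathlib for robust and portable file paths.
# PROJECT_ROOT is the parent of the current working directory, so the notebook
# must be run from a direct subfolder of the project root.
PROJECT_ROOT = Path.cwd().parent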
RAW_BASE_PATH = PROJECT_ROOT / "data" / "raw"
PROCESSED_BASE_PATH = PROJECT_ROOT / "data" / "processed"

# 3. Load Raw Data

In [None]:
country = "FR"
year = 2015

# Raw file paths
demand_path = (
    RAW_BASE_PATH
    / "electricity_demand"
    / f"country={country}"
    / f"year={year}"
    / "demand.parquet"
)

weather_path = (
    RAW_BASE_PATH
    / "weather"
    / f"country={country}"
    / f"year={year}"
    / "weather.parquet"
)

# Load dataframes
df_demand = pd.read_parquet(demand_path)
df_weather = pd.read_parquet(weather_path)
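
Both raw paths follow the same `country=.../year=...` partition layout, so the construction could be factored into a small helper. This is only a sketch: `build_raw_path` is a hypothetical function introduced for illustration and is not claimed to exist in the project.

```python
from pathlib import Path


def build_raw_path(
    base: Path, dataset: str, country: str, year: int, filename: str
) -> Path:
    """Hypothetical helper: build a partitioned raw-data path."""
    return base / dataset / f"country={country}" / f"year={year}" / filename


# Example with a relative base directory (assumed layout, as in the cell above)
demand_example = build_raw_path(
    Path("data") / "raw", "electricity_demand", "FR", 2015, "demand.parquet"
)
print(demand_example)
```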

In [None]:
df_demand.head()

In [None]:
df_demand.info()

In [None]:
df_weather.head()

In [None]:
df_weather.info()

# 4. Helper functions

In [None]:
# Build a full hourly time index
def build_full_hourly_index(df: pd.DataFrame, time_col: str) -> pd.DatetimeIndex:
    """
    Build a complete hourly DatetimeIndex between min and max timestamps.
    """
    return pd.date_range(
        start=df[time_col].min(),
        end=df[time_col].max(),
        freq="h",
    )

Why does this matter?
- Models expect regular time steps
- Missing hours must be explicitly created as rows before they can be interpolated
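
To make the second point concrete, here is a minimal sketch on made-up data showing that reindexing on a full hourly index turns absent hours into explicit NaN rows. The names `toy`, `full_index` and `reindexed` are illustrative only.

```python
import pandas as pd

# Toy demand series (made-up values) with 02:00 and 03:00 absent
toy = pd.DataFrame(
    {
        "datetime": pd.to_datetime(
            ["2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 04:00"]
        ),
        "load_MW": [100.0, 110.0, 140.0],
    }
)

# Complete hourly index between the first and last observed timestamps
full_index = pd.date_range(
    start=toy["datetime"].min(), end=toy["datetime"].max(), freq="h"
)

# Reindexing adds one row per missing hour, filled with NaN
reindexed = toy.set_index("datetime").reindex(full_index)

print(reindexed)
```

Only once these rows exist can an interpolation method assign them a value.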

In [None]:
# Reindex & interpolate a time series
def reindex_and_interpolate_ts(
    df: pd.DataFrame,
    time_col: str,
    numeric_cols: list[str],
    categorical_cols: list[str] | None = None,
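) -> pd.DataFrame:
    """
    Deduplicate, reindex on a complete hourly index, then fill the gaps.

    Numeric columns are interpolated in time; categorical columns are
    forward-filled, then backward-filled.
    """
    # Parse timestamps first so that deduplication and sorting operate on
    # real datetimes rather than on strings
    df = df.copy()
    df[time_col] = pd.to_datetime(df[time_col])

    df = (
        df
        .drop_duplicates(subset=time_col)
        .sort_values(time_col)
    )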

    full_index = build_full_hourly_index(df, time_col)

    df = (
        df
        .set_index(time_col)
        .reindex(full_index)
    )

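    # Time-weighted interpolation; limit_area="inside" fills only gaps that
    # are surrounded by valid values, so nothing is extrapolated at the edges
    df[numeric_cols] = (
        df[numeric_cols]
        .interpolate(method="time", limit_area="inside")
    )

    # Categorical values are propagated forward, then backward, so new rows
    # inherit the previous known label (or the next one at the series start)
    if categorical_cols:
        df[categorical_cols] = (
            df[categorical_cols]
            .ffill()
            .bfill()
        )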

    df = (
        df
        .rename_axis(time_col)
        .reset_index()
    )

    assert df[time_col].is_monotonic_increasing

    return df
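
As a worked illustration of the two filling strategies, the sketch below applies them to a tiny made-up frame with a two-hour gap. It is not the pipeline itself, just the same pandas calls on toy data.

```python
import pandas as pd

# Toy frame (made-up values): observations at 00:00 and 03:00 only
toy = pd.DataFrame(
    {"load_MW": [100.0, 130.0], "country": ["FR", "FR"]},
    index=pd.to_datetime(["2015-01-01 00:00", "2015-01-01 03:00"]),
)

# Create the missing hours as NaN rows
full_index = pd.date_range(toy.index.min(), toy.index.max(), freq="h")
toy = toy.reindex(full_index)

# Numeric column: interpolate proportionally to elapsed time
toy["load_MW"] = toy["load_MW"].interpolate(method="time", limit_area="inside")

# Categorical column: propagate the known label
toy["country"] = toy["country"].ffill().bfill()

print(toy)
```

With evenly spaced hours, time interpolation gives values of roughly 110 and 120 for the two created rows, while the country label is simply copied.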

# 5. Applying the Functions

## 5.1. Demand data

In [None]:
df_demand_processed = reindex_and_interpolate_ts(
    df=df_demand,
    time_col="datetime",
    numeric_cols=["load_MW"],
    categorical_cols=["country"],
)

In [None]:
df_demand_processed.head()

In [None]:
df_demand_processed.info()

In [None]:
df_demand_processed.isna().sum()

## 5.2. Weather data

In [None]:
weather_cols = [
    "temperature_2m",
    "relative_humidity_2m",
    "wind_speed_10m",
    "shortwave_radiation_instant",
]

df_weather_processed = reindex_and_interpolate_ts(
    df=df_weather,
    time_col="datetime",
    numeric_cols=weather_cols,
)

In [None]:
df_weather_processed.head()

# 6. Merge Demand & Weather DataFrames

In [None]:
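# Weather: drop the "country" column. It duplicates the one kept in the demand
# data, and it was not filled on the rows created during reindexing.
df_weather_processed = df_weather_processed.drop(columns=["country"])

# Inner join: keep only the timestamps present in both datasets
df_merged = df_demand_processed.merge(
    df_weather_processed,
    on="datetime",
    how="inner",
)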

# Add metadata
df_merged["year"] = year

In [None]:
df_merged.head()

In [None]:
df_merged.info()

# 7. Final Data Quality Checks

In [None]:
# No missing values
assert df_merged.isna().sum().sum() == 0

# Strict hourly continuity: every gap between consecutive rows must be 1 hour
assert (df_merged["datetime"].diff().dropna() == pd.Timedelta("1h")).all()
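
A continuity check is only reliable if it inspects every gap between consecutive timestamps; looking at a single gap value would miss an irregularity that appears later in the series. The sketch below uses a hypothetical helper, `is_strictly_hourly`, written for illustration on made-up data.

```python
import pandas as pd


def is_strictly_hourly(timestamps: pd.Series) -> bool:
    """Hypothetical helper: True if all consecutive gaps equal one hour."""
    gaps = timestamps.diff().dropna()
    return bool((gaps == pd.Timedelta("1h")).all())


# Regular hourly series
regular = pd.Series(pd.date_range("2015-01-01", periods=4, freq="h"))

# Series whose last step is two hours long
with_gap = pd.Series(
    pd.to_datetime(["2015-01-01 00:00", "2015-01-01 01:00", "2015-01-01 03:00"])
)

print(is_strictly_hourly(regular))
print(is_strictly_hourly(with_gap))
```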

The dataset is now clean, regular and model-ready.

This preprocessing logic is reused as-is inside the production script src/preprocessing/build_preprocessed_dataset.py.