# 02. Preprocessing — Building the Processed Dataset

This notebook aims to **test, understand, and document** the preprocessing pipeline used to build the `processed` dataset from raw data sources:

- Hourly electricity demand (ENTSO-E)
- Hourly weather data (Open-Meteo)

The production-ready logic is implemented in `src/preprocessing/build_processed_dataset.py`.

This notebook is intended for:
- step-by-step exploration
- sanity checks
- visualization
- documentation for recruiters and reviewers

## 1. Environment setup

In [1]:
import sys
import os

PROJECT_ROOT = os.path.abspath("..")
if PROJECT_ROOT not in sys.path:
    sys.path.append(PROJECT_ROOT)

PROJECT_ROOT

'/Users/bachirijihane/energy-intelligence-platform'

## 2. Imports

In [3]:
import pandas as pd
import matplotlib.pyplot as plt

from src.preprocessing.build_preprocessed_dataset import (
    check_hourly_continuity,
    interpolate_time_series,
    build_processed_dataset_for_country_year,
    build_processed_dataset,
)

## 3. Load raw datasets

In [6]:
# Parameters
country = "FR"
year = 2023

demand_path = f"../data/raw/electricity_demand/country={country}/year={year}/demand.parquet"
weather_path = f"../data/raw/weather/country={country}/year={year}/weather.parquet"

# Load raw data
df_demand = pd.read_parquet(demand_path)
df_weather = pd.read_parquet(weather_path)

In [7]:
df_demand.head(), df_weather.head()

(                   datetime  load_MW country
 0 2023-01-01 00:00:00+00:00  45709.0      FR
 1 2023-01-01 01:00:00+00:00  44640.0      FR
 2 2023-01-01 02:00:00+00:00  41533.0      FR
 3 2023-01-01 03:00:00+00:00  39248.0      FR
 4 2023-01-01 04:00:00+00:00  38389.0      FR,
                    datetime  temperature_2m  relative_humidity_2m  \
 0 2023-01-01 00:00:00+00:00           14.85             53.719143   
 1 2023-01-01 01:00:00+00:00           14.95             52.638969   
 2 2023-01-01 02:00:00+00:00           14.75             53.321426   
 3 2023-01-01 03:00:00+00:00           14.20             55.827904   
 4 2023-01-01 04:00:00+00:00           14.15             57.980919   
 
    wind_speed_10m  shortwave_radiation_instant country  
 0       27.859905                          0.0      FR  
 1       26.302181                          0.0      FR  
 2       23.065300                          0.0      FR  
 3       21.385939                          0.0      FR  
 4       20

In [15]:
df_demand

                     datetime  load_MW country
0   2023-01-01 00:00:00+00:00  45709.0      FR
1   2023-01-01 01:00:00+00:00  44640.0      FR
2   2023-01-01 02:00:00+00:00  41533.0      FR
3   2023-01-01 03:00:00+00:00  39248.0      FR
4   2023-01-01 04:00:00+00:00  38389.0      FR
..                        ...      ...     ...
267 2023-01-12 03:00:00+00:00  48990.0      FR
268 2023-01-12 04:00:00+00:00  51213.0      FR
269 2023-01-12 05:00:00+00:00  56161.0      FR
270 2023-01-12 06:00:00+00:00  61413.0      FR


## 4. Inspect raw data structure

In [8]:
# Checking dataframes info
df_demand.info()
df_weather.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 272 entries, 0 to 271
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype              
---  ------    --------------  -----              
 0   datetime  272 non-null    datetime64[ns, UTC]
 1   load_MW   272 non-null    float64            
 2   country   272 non-null    object             
dtypes: datetime64[ns, UTC](1), float64(1), object(1)
memory usage: 6.5+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 6 columns):
 #   Column                       Non-Null Count  Dtype              
---  ------                       --------------  -----              
 0   datetime                     8760 non-null   datetime64[ns, UTC]
 1   temperature_2m               8760 non-null   float32            
 2   relative_humidity_2m         8760 non-null   float32            
 3   wind_speed_10m               8760 non-null   float32            
 4   shortwave_radiation_instant  8760 

In [9]:
# Checking timezone information
df_demand["datetime"].dt.tz

<UTC>

In [10]:
# Detecting duplicate timestamps
df_demand["datetime"].duplicated().sum()

np.int64(0)

In [11]:
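# Detecting missing timestamps: reindex each frame to an hourly grid and count NaNs per column
demand_missing = df_demand.set_index("datetime").asfreq("h").isna().sum()
weather_missing = df_weather.set_index("datetime").asfreq("h").isna().sum()

# A notebook cell displays only its last expression, so the output below is for the weather data
weather_missing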

temperature_2m                 0
relative_humidity_2m           0
wind_speed_10m                 0
shortwave_radiation_instant    0
country                        0
dtype: int64
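
Note that `asfreq("h")` only spans the range between the first and last timestamp present in the frame, so it cannot reveal hours missing before or after that range. A minimal sketch of an explicit check against a reference hourly grid, on a small synthetic frame (the helper name `find_missing_hours` is hypothetical and not part of `src`):

```python
import pandas as pd

def find_missing_hours(df, time_col, start, end):
    """Return the hourly UTC timestamps in [start, end] that are absent from df[time_col]."""
    expected = pd.date_range(start=start, end=end, freq="h", tz="UTC")
    observed = pd.DatetimeIndex(df[time_col])
    return expected.difference(observed)

# Synthetic example: the grid covers 00:00 to 05:00, but 02:00 and 05:00 are absent
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 03:00", "2023-01-01 04:00"],
        utc=True,
    ),
    "load_MW": [45709.0, 44640.0, 39248.0, 38389.0],
})

missing = find_missing_hours(toy, "datetime", "2023-01-01 00:00", "2023-01-01 05:00")
print(missing)
```

Passing the first and last hour of the target year as `start` and `end` would also surface hours missing at the edges of the period.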

### Missing data handling strategy

Hourly time series data may contain missing timestamps or values due to:
- data source issues
- API downtime
- aggregation problems

In this project, the following strategy is applied:

- All missing hourly timestamps are explicitly added
- Time-based interpolation is applied
- Interpolation is strictly limited to avoid artificial signal creation
- No forward-fill is used to prevent unrealistic persistence
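
The strategy above can be sketched as follows. This is an illustration on synthetic data, not the project's `interpolate_time_series` implementation; the helper name `fill_short_gaps` and the 3-hour limit are assumptions made for the example.

```python
import pandas as pd

MAX_GAP_HOURS = 3  # assumed limit for illustration; the project's actual value is not shown here

def fill_short_gaps(df, time_col, value_cols, max_gap=MAX_GAP_HOURS):
    """Add missing hourly timestamps, then time-interpolate at most max_gap consecutive hours."""
    indexed = df.set_index(time_col).sort_index()
    # Explicitly add every missing hourly timestamp between the first and last observation
    full_index = pd.date_range(indexed.index.min(), indexed.index.max(), freq="h")
    out = indexed.reindex(full_index)
    # Time-based interpolation, capped by `limit`; no forward-fill is involved
    out[value_cols] = out[value_cols].interpolate(
        method="time", limit=max_gap, limit_area="inside"
    )
    out.index.name = time_col
    return out

# Synthetic series: a 1-hour gap (02:00) and a 5-hour gap (04:00 to 08:00)
toy = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 03:00", "2023-01-01 09:00"],
        utc=True,
    ),
    "load_MW": [10.0, 20.0, 40.0, 100.0],
})

filled = fill_short_gaps(toy, "datetime", ["load_MW"])
print(filled)
```

Note that the `limit` argument of pandas caps the number of consecutive values filled: a gap longer than the limit is filled only partially, and the remaining hours stay `NaN` rather than being skipped as a whole.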

In [12]:
# Checking hourly continuity
full_index = check_hourly_continuity(df_demand, "datetime")

In [14]:
full_index.size

272
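
For reference, a complete hourly index for a full calendar year in UTC has 8,760 entries (8,784 in a leap year), which matches the 8,760 weather rows shown earlier. A size of 272 therefore covers only a little over 11 days. A minimal sketch of that arithmetic (the helper `expected_hours_in_year` is hypothetical):

```python
import calendar

def expected_hours_in_year(year):
    """Number of hourly timestamps in a calendar year (UTC, so no DST shifts)."""
    days = 366 if calendar.isleap(year) else 365
    return days * 24

expected = expected_hours_in_year(2023)
coverage = 272 / expected  # 272 is the size of the demand index observed above
print(expected, f"{coverage:.1%}")
```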

## 5. Running the preprocessing pipeline

In [None]:
# Building the processed dataset
build_processed_dataset_for_country_year(
    country="FR",
    year=2023
)  # Currently skipped: the pipeline reports missing raw data for FR 2023 (to be fixed later)

[SKIP] Missing raw data for FR 2023
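
As a rough illustration of what combining the two sources might look like, here is a sketch that assumes an inner join on `datetime` and `country`. The actual logic lives in the `src` module and may differ; the frames below are small synthetic samples built from the first rows shown earlier.

```python
import pandas as pd

hours = pd.date_range("2023-01-01", periods=4, freq="h", tz="UTC")

# Demand covers only the first three hours; weather covers all four
demand = pd.DataFrame({
    "datetime": hours[:3],
    "load_MW": [45709.0, 44640.0, 41533.0],
    "country": "FR",
})
weather = pd.DataFrame({
    "datetime": hours,
    "temperature_2m": [14.85, 14.95, 14.75, 14.20],
    "country": "FR",
})

# Assumed join: keep only hours present in both sources, and fail on duplicate keys
merged = demand.merge(
    weather, on=["datetime", "country"], how="inner", validate="one_to_one"
)
print(merged)
```

With an inner join, hours available in only one source are dropped, so the merged frame is limited by the shorter of the two series.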


In [21]:
processed_path = f"../data/processed/country={country}/year={year}/load_weather.parquet"
df_processed = pd.read_parquet(processed_path)

FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/country=FR/year=2023/load_weather.parquet'
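
Because the pipeline skipped this country and year, the processed file was never written. One way to keep the notebook running in that case is to guard the read; the sketch below uses a hypothetical helper, `load_parquet_if_exists`, which is not part of `src`:

```python
from pathlib import Path

import pandas as pd

def load_parquet_if_exists(path):
    """Return the DataFrame stored at path, or None if the file has not been written yet."""
    path = Path(path)
    if not path.exists():
        print(f"[SKIP] No file at {path}")
        return None
    return pd.read_parquet(path)

df_processed = load_parquet_if_exists(
    "../data/processed/country=FR/year=2023/load_weather.parquet"
)
```

Downstream cells can then check `df_processed is not None` before inspecting the data.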

In [None]:
df_processed.head()