# Preprocessing: Weather Aggregation & Master Dataset

**Project Phase:** 2. Data Integration and Preprocessing
**Goal:** Process the raw ERA5 NetCDF weather files and merge them with the cleaned hourly price data to create the final "Master Dataset" for the prediction models.

### Objectives (from Project Proposal)
1.  **Load Weather Data:** Read all monthly `.nc` files for each zone (downloaded by `01_ingestion/02_weather_era5`).
2.  **Load Price Data:** Read the cleaned hourly price data (created by `02_preprocessing/01_cleaning_and_profiles`).
3.  **Aggregate Weather:** Calculate the mean for temperature, radiation, and wind components across the zone's latitude/longitude grid. Calculate the sum for precipitation.
4.  **Engineer Features:** Calculate `wind_speed` from the `u10` and `v10` components.
5.  **Merge & Sync:** Combine price and weather data into a single, hourly-timestamped DataFrame.
6.  **Save:** Store one `{zone}_master_hourly.parquet` file per zone for use in the `03_analysis` (XGBoost) phase.

In [None]:
import pandas as pd
import numpy as np
import xarray as xr
from pathlib import Path
import os
import warnings

warnings.filterwarnings('ignore')

# 1. Setup Paths
# Navigate up two levels to reach the project root
current_dir = Path(os.getcwd())
project_dir = current_dir.parent.parent

# Input paths
weather_raw_dir = project_dir / 'data' / 'raw' / 'weather'
prices_clean_dir = project_dir / 'data' / 'clean'

# Output path
processed_dir = project_dir / 'data' / 'processed'
processed_dir.mkdir(parents=True, exist_ok=True)

print(f"Project Directory: {project_dir}")
print(f"Weather Input: {weather_raw_dir}")
print(f"Price Input: {prices_clean_dir}")
print(f"Output Directory: {processed_dir}")

## Configuration

We define the same zones and timeframe used during ingestion to ensure all files are processed correctly.
(Note: We use **NO4** to align strictly with the project proposal.)
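
As a quick reference for later sanity checks, the sketch below computes how many hourly records a zone would have if the full configured timeframe were covered without gaps. It assumes the timeframe runs from the first hour of the first configured year to the last hour of the last one, in UTC; `expected_hours` is an illustrative name, not part of the pipeline.

```python
import pandas as pd

# Same timeframe as the configuration cell
YEARS = [str(year) for year in range(2015, 2025)]

# Half-open interval [start, end): first hour of 2015 up to, but excluding, 2025
start = pd.Timestamp(f"{YEARS[0]}-01-01 00:00", tz="UTC")
end = pd.Timestamp(f"{int(YEARS[-1]) + 1}-01-01 00:00", tz="UTC")

# Number of whole hours in the interval (leap years included)
expected_hours = int((end - start) / pd.Timedelta(hours=1))
print(f"Expected hourly records per zone: {expected_hours}")
```

A record count well below this figure after aggregation or merging would suggest missing monthly files or gaps in the price data.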

In [None]:
# Timeframe: 10 Years (must match 02_weather_era5.ipynb)
YEARS = [str(year) for year in range(2015, 2025)]
MONTHS = [f"{month:02d}" for month in range(1, 13)]

# Zones (must match 01_prices_entsoe.ipynb and 02_weather_era5.ipynb)
ZONES = ['ES', 'DK1', 'NO4']

print(f"Processing {len(ZONES)} zones over {len(YEARS)} years...")

## Processing and Merge Function

This function handles the core logic:

1.  **Weather Processing (xarray):** It opens all monthly `.nc` files for a zone at once (`open_mfdataset`). `xarray` handles the time concatenation automatically.
2.  **Aggregation:** It calculates the spatial mean (`.mean(dim=['latitude', 'longitude'])`) for all variables. This flattens the 4D data (time, lat, lon, var) into a 2D time series (time, var).
3.  **Feature Engineering:** Calculates `wind_speed` from U and V components.
4.  **Price Loading:** Loads the cleaned hourly prices.
5.  **Merging:** Joins the two datasets on their hourly timestamp.
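
To make steps 2 and 3 concrete, here is a minimal sketch on a tiny synthetic grid (made-up values, not real ERA5 data). The variable names `u10` and `v10` and the dimension names `latitude` and `longitude` mirror the ones used in the function; `toy_df` and the grid coordinates are purely illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic 3-hour, 2x2 grid standing in for one zone's weather data
time = pd.date_range("2020-01-01", periods=3, freq=pd.Timedelta(hours=1))
dims = ("time", "latitude", "longitude")
ds = xr.Dataset(
    {
        "u10": (dims, np.full((3, 2, 2), 3.0)),  # eastward wind, m/s
        "v10": (dims, np.full((3, 2, 2), 4.0)),  # northward wind, m/s
    },
    coords={"time": time, "latitude": [60.0, 60.25], "longitude": [10.0, 10.25]},
)

# Spatial mean: collapses latitude/longitude, leaving one value per hour
ds_agg = ds.mean(dim=["latitude", "longitude"])

# Time-indexed DataFrame with one column per variable
toy_df = ds_agg.to_dataframe()

# Wind speed is the magnitude of the (u, v) vector
toy_df["wind_speed_ms"] = np.sqrt(toy_df["u10"] ** 2 + toy_df["v10"] ** 2)
print(toy_df)
```

Note that the speed here is computed from the spatially averaged components, as in the function; averaging per-cell speeds instead would generally give a different (never smaller) value.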

In [None]:
def process_and_merge_zone(zone):
    print(f"\nProcessing Zone: {zone}")
    
    # --- 1. Process Weather Data ---
    print(f"  Loading raw weather files for {zone}...")
    
    # Find all NetCDF files for the zone
    file_paths = list(weather_raw_dir.glob(f'era5_{zone}_*.nc'))
    
    if not file_paths:
        print(f"  Warning: No weather files found for {zone}. Skipping.")
        return None

    # Use xarray to open all files as a single dataset
    # This automatically handles time alignment and merging
    with xr.open_mfdataset(file_paths, combine='by_coords') as ds:
        # Aggregate across space: calculate the mean for the entire bounding box
        # This converts the 4D grid (time, lat, lon, var) to 2D (time, var)
        ds_agg = ds.mean(dim=['latitude', 'longitude'])
        
        # Convert to a pandas DataFrame
        weather_df = ds_agg.to_dataframe()

    # Rename columns for clarity (e.g., 't2m' -> 'temp_k')
    weather_df = weather_df.rename(columns={
        't2m': 'temp_k',  # Temperature is in Kelvin
        'tp': 'precipitation_m', # Total precipitation (in meters)
        'ssrd': 'solar_radiation_j_m2', # Solar radiation (Joule / m^2)
        'u10': 'wind_u_ms', # Wind U-component (m/s)
        'v10': 'wind_v_ms'  # Wind V-component (m/s)
    })

    # --- 2. Feature Engineering (Weather) ---
    
    # Convert Temperature from Kelvin to Celsius
    weather_df['temp_c'] = weather_df['temp_k'] - 273.15
    
    # Calculate Wind Speed (m/s) from U and V components
    weather_df['wind_speed_ms'] = np.sqrt(
        weather_df['wind_u_ms']**2 + weather_df['wind_v_ms']**2
    )
    
    # Select final columns
    weather_final = weather_df[[
        'temp_c', 
        'precipitation_m', 
        'solar_radiation_j_m2', 
        'wind_speed_ms'
    ]]
    
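    # ERA5 timestamps are tz-naive but expressed in UTC. Label them as UTC so
    # they align with the tz-aware price index (required for merging).
    # Note: tz_convert() raises a TypeError on a tz-naive index, so we localize.
    weather_final.index.name = 'timestamp'
    weather_final = weather_final.tz_localize('UTC')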

    print(f"  Weather data aggregated: {weather_final.shape[0]} hourly records")

    # --- 3. Load Clean Price Data ---
    print(f"  Loading clean price data for {zone}...")
    price_path = prices_clean_dir / f"{zone}_clean.csv"
    
    if not price_path.exists():
        print(f"  Error: Clean price file not found at {price_path}. Skipping merge.")
        return None
        
    price_df = pd.read_csv(price_path)
    price_df['timestamp'] = pd.to_datetime(price_df['timestamp'], utc=True)
    price_df = price_df.set_index('timestamp')

    # --- 4. Merge Price and Weather ---
    print("  Merging price and weather data...")
    
    # Merge on the timestamp index
    master_df = pd.merge(price_df, weather_final, left_index=True, right_index=True, how='inner')
    
    master_df = master_df.reset_index()
    
    print(f"  Merge complete. Final dataset shape: {master_df.shape}")
    
    return master_df
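
The timestamp alignment behind the merge step can be illustrated with a small sketch. It assumes the weather index is tz-naive but in UTC (as ERA5 timestamps are) and the price index is tz-aware UTC. The `price` column name and all values are hypothetical, since the layout of the cleaned price files is not shown here.

```python
import pandas as pd

hour = pd.Timedelta(hours=1)

# Weather: tz-naive hourly index (00:00 to 03:00), then labelled as UTC
weather_idx = pd.date_range("2020-01-01 00:00", periods=4, freq=hour)
toy_weather = pd.DataFrame({"temp_c": [1.0, 1.5, 2.0, 2.5]}, index=weather_idx)
toy_weather = toy_weather.tz_localize("UTC")

# Prices: tz-aware UTC hourly index (01:00 to 04:00); 'price' is a made-up column
price_idx = pd.date_range("2020-01-01 01:00", periods=4, freq=hour, tz="UTC")
toy_prices = pd.DataFrame({"price": [30.0, 31.0, 29.5, 28.0]}, index=price_idx)

# Inner join keeps only the hours present in both frames (01:00 to 03:00)
toy_master = pd.merge(
    toy_prices, toy_weather, left_index=True, right_index=True, how="inner"
)
print(toy_master)
```

Because the join is `inner`, any hour missing from either source is dropped silently, so the merged row count is worth comparing against the input row counts.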

## Execution

Run the pipeline for all zones and save the master datasets.

In [None]:
print("Starting Master Dataset Creation...")

master_datasets = {}

for zone in ZONES:
    df = process_and_merge_zone(zone)
    
    if df is not None:
        master_datasets[zone] = df
        
        # Save the final dataset
        save_path = processed_dir / f"{zone}_master_hourly.parquet"
        df.to_parquet(save_path, index=False)
        print(f"  Saved master dataset to {save_path.name}")

print("\n----------------------------------")
print("Master Dataset Pipeline Complete.")
print(f"Output files are in: {processed_dir}")
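
## Optional Sanity Check

Since the inner join drops unmatched hours without warning, a continuity check on the saved datasets may be useful. The helper below is a sketch, not part of the original pipeline: `find_missing_hours` is a hypothetical name, and it assumes the `timestamp` column produced by `reset_index()` in the merge function.

```python
import pandas as pd

def find_missing_hours(df, column="timestamp"):
    """Return the hourly timestamps absent between the first and last record."""
    ts = pd.DatetimeIndex(df[column])
    full_range = pd.date_range(ts.min(), ts.max(), freq=pd.Timedelta(hours=1))
    return full_range.difference(ts)

# Toy example with one gap at 02:00
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 03:00"], utc=True
    )
})
missing = find_missing_hours(toy)
print(f"Missing hours: {len(missing)}")
```

With the real outputs, this could be applied to each entry of `master_datasets`, for example `find_missing_hours(master_datasets['NO4'])`.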