# Live Weather Forecast Ingestion (Inference)

**Project:** Electricity Price Forecasting
**Stage:** Ingestion (Live Data)

**Objective:**
This notebook fetches hourly weather forecasts for the next 3 days. These forecasts are the inference input that the machine learning model uses to predict future electricity prices.

**Data Source:**
* **Provider:** WeatherAPI.com
* **Method:** REST API
* **Granularity:** Hourly

**Methodology:**
To match the "Zonal Spatial Mean" used in our training data (ERA5), we do not rely on a single city. Instead, we:
1.  Query **3-4 representative cities** spread geographically across each bidding zone.
2.  Aggregate these forecasts into a single **Zonal Mean**.
3.  Map variables to match the ERA5 feature set exactly.
4.  **Fallback Strategy:** The standard API tier often omits explicit solar radiation, so when it is missing we estimate it from the UV Index ($\text{Solar} \approx \text{UV} \times 25$ W/m²). This is a rough proxy, but it preserves the day/night cycle.
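
Steps 1 and 2 can be illustrated with a small pandas sketch. The city names come from the configuration in this notebook, but the temperature values are invented for illustration, and `zonal_mean` is a hypothetical helper name, not part of any library.

```python
import pandas as pd

def zonal_mean(city_frames):
    """Average hourly values across cities, aligned on the timestamp index."""
    combined = pd.concat(city_frames)
    return combined.groupby(combined.index).mean(numeric_only=True)

# Illustrative values only (not real forecasts)
hours = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"])
aarhus = pd.DataFrame({"temperature_2m": [2.0, 1.0]}, index=hours)
odense = pd.DataFrame({"temperature_2m": [4.0, 3.0]}, index=hours)

zone_df = zonal_mean([aarhus, odense])
print(zone_df)
```

Each timestamp ends up with one row holding the mean over whichever cities reported that hour, so a city that fails to download reduces the sample size instead of breaking the aggregation.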

In [None]:
import os
import requests
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from dotenv import load_dotenv

# 1. Load Environment Variables
env_path = Path('../../secrets.env')
load_dotenv(dotenv_path=env_path)

# Retrieve API Key
API_KEY = os.getenv('WEATHER_API_KEY')

if not API_KEY:
    raise ValueError("API Key not found! Please check your secrets.env file.")

print("[INFO] Environment loaded successfully.")

## Configuration

We define the geographic scope for each zone by selecting representative cities.

* **DK1 (West Denmark):** Aarhus, Aalborg, Esbjerg, Odense
* **NO2 (South Norway):** Kristiansand, Stavanger, Skien
* **ES (Spain):** Madrid, Seville, Bilbao, Barcelona

**Output:**
Data will be saved to `data/live` in the project root.

In [None]:
# API Endpoint
# Use HTTPS: the API key travels in the query string
BASE_URL = "https://api.weatherapi.com/v1/forecast.json"

# Output Directory (Project Root / data / live)
OUTPUT_DIR = Path('../../data/live')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Representative Cities for Zonal Mean
ZONE_LOCATIONS = {
    "DK1": ["Aarhus", "Aalborg", "Esbjerg", "Odense"], 
    "NO2": ["Kristiansand", "Stavanger", "Skien"],     
    "ES":  ["Madrid", "Seville", "Bilbao", "Barcelona"] 
}

print(f"[INFO] Target Directory: {OUTPUT_DIR.resolve()}")

## Fetch Function

This function queries the API for a specific city.
It extracts the exact fields required by our XGBoost model:
* `temp_c` -> **temperature_2m**
* `precip_mm` -> **precipitation_mm**
* `wind_kph` -> **wind_speed_10m** (converted from km/h to m/s during zonal aggregation)
* `short_rad` -> **solar_radiation_W** (W/m², with a UV-based fallback)

**Note on Solar Radiation:**
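If `short_rad` is missing (common in standard API tiers), we use `uv` (UV Index) multiplied by 25.0 as a rough proxy in W/m². The estimate is coarse, but it is zero at night and peaks around midday, which is the pattern the model needs.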
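
The fallback can be sketched in isolation as below. `solar_from_hour` is a hypothetical helper name, and the three hourly records are invented dictionaries that only mimic the `short_rad` and `uv` fields referenced above; they are not real API responses.

```python
def solar_from_hour(hour, uv_to_wm2=25.0):
    """Return shortwave radiation in W/m2, falling back to a UV-based estimate."""
    solar = hour.get("short_rad")
    if solar is None:
        # Heuristic used in this notebook: 1 UV index unit ~ 25 W/m2
        uv_index = hour.get("uv") or 0.0
        solar = uv_index * uv_to_wm2
    return float(solar)

# Hypothetical hourly records (values are illustrative)
noon_without_rad = {"time": "2024-06-01 12:00", "uv": 6.0}
noon_with_rad = {"time": "2024-06-01 12:00", "uv": 6.0, "short_rad": 540.0}
night = {"time": "2024-06-01 00:00", "uv": 0.0, "short_rad": 0.0}

print(solar_from_hour(noon_without_rad))  # 150.0 (6.0 * 25.0)
print(solar_from_hour(noon_with_rad))     # 540.0 (explicit value wins)
print(solar_from_hour(night))             # 0.0
```

Testing with `is None` instead of truthiness matters here: a reported `short_rad` of 0.0 at night is a valid measurement and should not trigger the fallback.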

In [None]:
def fetch_city_forecast(city, days=3):
    """
    Fetches hourly forecast for a specific city.
    Returns a DataFrame with raw API columns.
    """
    params = {
        'key': API_KEY,
        'q': city,
        'days': days,
        'aqi': 'no',
        'alerts': 'no'
    }
    
    try:
        # A timeout prevents the notebook from hanging on a stalled connection
        response = requests.get(BASE_URL, params=params, timeout=30)
        data = response.json()
        
        # Check for API error messages
        if 'error' in data:
            print(f"[ERROR] Fetching {city}: {data['error']['message']}")
            return pd.DataFrame()
            
        # Parse Hourly Data
        forecast_days = data['forecast']['forecastday']
        hourly_list = []
        
        for day in forecast_days:
            for hour in day['hour']:
                
                # --- SOLAR RADIATION FALLBACK ---
                # Try to get explicit Shortwave Radiation
                solar_rad = hour.get('short_rad')
                
                # If missing, approximate using UV Index
                # 1 UV Index unit approx 25 W/m2
                if solar_rad is None:
                    uv_index = hour.get('uv', 0.0)
                    solar_rad = uv_index * 25.0
                
                hourly_list.append({
                    'time_local': hour['time'],
                    'temperature_2m': hour.get('temp_c'),
                    'precipitation_mm': hour.get('precip_mm'),
                    'wind_kph': hour.get('wind_kph'),
                    'solar_radiation_W': solar_rad
                })
                
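        # Guard against a valid response that contains no hourly records
        if not hourly_list:
            print(f"[WARNING] {city}: response contained no hourly data.")
            return pd.DataFrame()

        df = pd.DataFrame(hourly_list)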
        df['time'] = pd.to_datetime(df['time_local'])
        return df
        
    except Exception as e:
        print(f"[EXCEPTION] Fetching {city}: {e}")
        return pd.DataFrame()

## Zonal Aggregation

This main loop iterates through all defined zones. For each zone, it:
1.  Fetches forecasts for all defined cities.
2.  Aggregates the data by calculating the mean across all cities (Spatial Mean).
3.  Converts Wind Speed from `kph` to `m/s` to align with ERA5 training data.
4.  Saves the result as a clean CSV for the inference pipeline.
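
As a quick worked example of step 3, the sketch below converts illustrative wind speeds (not real forecasts) from km/h to m/s. Because the conversion is a division by a constant, averaging first and converting afterwards gives the same zonal value as converting each city first.

```python
import pandas as pd

KPH_PER_MS = 3.6  # 1 m/s = 3.6 km/h

# Illustrative wind speeds (km/h) for the NO2 cities at a single timestamp
wind_kph = pd.Series(
    [18.0, 36.0, 54.0],
    index=["Kristiansand", "Stavanger", "Skien"],
)

mean_then_convert = wind_kph.mean() / KPH_PER_MS
convert_then_mean = (wind_kph / KPH_PER_MS).mean()

# Both orderings give roughly 10 m/s for these inputs
print(round(mean_then_convert, 6), round(convert_then_mean, 6))
```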

In [None]:
def process_and_save_forecasts():
    print("Starting Forecast Ingestion...")
    print("-" * 30)
    
    for zone, cities in ZONE_LOCATIONS.items():
        print(f"Processing Zone: {zone}")
        all_cities_dfs = []
        
        # 1. Fetch data for each city
        for city in cities:
            df_city = fetch_city_forecast(city)
            if not df_city.empty:
                df_city = df_city.set_index('time')
                all_cities_dfs.append(df_city)
        
        if not all_cities_dfs:
            print(f"[SKIP] {zone}: No data fetched.")
            continue
            
        # 2. Aggregate (Spatial Mean)
        # Concatenate all city dataframes
        df_combined = pd.concat(all_cities_dfs)
        
        # Calculate mean, ignoring non-numeric columns (like 'time_local')
        df_zonal_mean = df_combined.groupby(df_combined.index).mean(numeric_only=True)
        
        # 3. Feature Engineering
        # Convert Wind Speed: kph -> m/s (1 m/s = 3.6 kph)
        if 'wind_kph' in df_zonal_mean.columns:
            df_zonal_mean['wind_speed_10m'] = df_zonal_mean['wind_kph'] / 3.6
        
        # Select and Reorder Final Columns
        final_cols = [
            'temperature_2m', 
            'precipitation_mm', 
            'wind_speed_10m', 
            'solar_radiation_W'
        ]
        
        # Ensure columns exist (fill 0.0 if missing, and report it)
        for col in final_cols:
            if col not in df_zonal_mean.columns:
                print(f"[WARNING] {zone}: '{col}' missing, filling with 0.0.")
                df_zonal_mean[col] = 0.0
                
        df_final = df_zonal_mean[final_cols]
        
        # 4. Save to CSV
        save_path = OUTPUT_DIR / f"{zone}_forecast.csv"
        df_final.to_csv(save_path)
        
        print(f"[SUCCESS] Saved: {save_path.name}")
        print(f"Time Range: {df_final.index.min()} to {df_final.index.max()}")
        print("-" * 30)

# Run the pipeline
if __name__ == "__main__":
    process_and_save_forecasts()

## Sanity Check

We validate the quality of the downloaded data to ensure it is physically realistic before using it for predictions.

**Checks:**
1.  **Missing Values:** Should be 0.
2.  **Physical Ranges:** No negative wind speed or solar radiation. Temperature and precipitation are judged from the printed min/max/mean summary.
3.  **Visual Check:** Plotting the time series to verify patterns (e.g., daily solar cycles).
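
The range checks could also be made explicit per column, as in the sketch below. The bounds in `BOUNDS` are assumed plausibility limits chosen for illustration, not values taken from the training data, and `find_range_violations` is a hypothetical helper name.

```python
import pandas as pd

# Assumed plausibility bounds (illustrative; tune them for the zones of interest)
BOUNDS = {
    "temperature_2m": (-50.0, 50.0),
    "wind_speed_10m": (0.0, 60.0),
    "precipitation_mm": (0.0, 100.0),
    "solar_radiation_W": (0.0, 1400.0),
}

def find_range_violations(df, bounds=BOUNDS):
    """Return a list of human-readable messages, one per failed check."""
    issues = []
    for col, (low, high) in bounds.items():
        if col not in df.columns:
            issues.append(f"{col}: column missing")
            continue
        n_missing = int(df[col].isna().sum())
        if n_missing:
            issues.append(f"{col}: {n_missing} missing values")
        n_out = int(((df[col] < low) | (df[col] > high)).sum())
        if n_out:
            issues.append(f"{col}: {n_out} values outside [{low}, {high}]")
    return issues

# Toy data with one deliberately invalid wind speed
sample = pd.DataFrame({
    "temperature_2m": [5.0, 6.5],
    "wind_speed_10m": [4.2, -1.0],
    "precipitation_mm": [0.0, 0.3],
    "solar_radiation_W": [0.0, 120.0],
})

issues = find_range_violations(sample)
print(issues)
```

Returning a list of messages instead of printing inside the loop makes it easy to stop the inference pipeline when the list is non-empty.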

In [None]:
def check_data_quality():
    print("\n--- Running Sanity Checks ---")
    
    for zone in ZONE_LOCATIONS.keys():
        file_path = OUTPUT_DIR / f"{zone}_forecast.csv"
        
        if not file_path.exists():
            print(f"[SKIP] {zone}: {file_path.name} not found.")
            continue
            
        print(f"\nZone: {zone}")
        df = pd.read_csv(file_path, index_col=0, parse_dates=True)
        
        # 1. Missing Values
        missing = df.isna().sum().sum()
        if missing > 0:
            print(f"[WARNING] Found {missing} missing values.")
        else:
            print("[OK] No missing values.")
            
        # 2. Physical Constraints
        if df['wind_speed_10m'].min() < 0:
            print("[CRITICAL] Negative wind speed detected.")
        
        if df['solar_radiation_W'].min() < 0:
            print("[CRITICAL] Negative solar radiation detected.")
            
        # 3. Print Stats
        print(df.describe().loc[['min', 'max', 'mean']])

check_data_quality()

In [None]:
def plot_forecasts():
    for zone in ZONE_LOCATIONS.keys():
        file_path = OUTPUT_DIR / f"{zone}_forecast.csv"
        if not file_path.exists():
            continue
            
        df = pd.read_csv(file_path, index_col=0, parse_dates=True)
        
        # Setup Plot (2 Rows: Temp/Solar and Wind/Precip)
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8), sharex=True)
        
        # Row 1: Temperature & Solar
        color_temp = 'tab:red'
        ax1.set_ylabel('Temperature (C)', color=color_temp)
        ax1.plot(df.index, df['temperature_2m'], color=color_temp, label='Temp')
        ax1.tick_params(axis='y', labelcolor=color_temp)
        ax1.grid(True, alpha=0.3)
        
        ax1_right = ax1.twinx()
        color_solar = 'tab:orange'
        ax1_right.set_ylabel('Solar (W/m2)', color=color_solar)
        ax1_right.plot(df.index, df['solar_radiation_W'], color=color_solar, label='Solar', linestyle='--')
        ax1_right.tick_params(axis='y', labelcolor=color_solar)
        ax1.set_title(f"Forecast Validation: {zone}")
        
        # Row 2: Wind & Precip
        color_wind = 'tab:blue'
        ax2.set_ylabel('Wind Speed (m/s)', color=color_wind)
        ax2.plot(df.index, df['wind_speed_10m'], color=color_wind, label='Wind')
        ax2.tick_params(axis='y', labelcolor=color_wind)
        ax2.grid(True, alpha=0.3)
        
        ax2_right = ax2.twinx()
        color_rain = 'tab:cyan'
        ax2_right.set_ylabel('Precip (mm)', color=color_rain)
        ax2_right.plot(df.index, df['precipitation_mm'], color=color_rain, label='Rain', linestyle=':')
        ax2_right.tick_params(axis='y', labelcolor=color_rain)
        
        plt.tight_layout()
        plt.show()

plot_forecasts()