# Renewable Energy Forecasting Pipeline

This notebook walks through building a **next-24h renewable generation forecast system** with:

- **EIA data integration** - Hourly wind/solar generation for US regions
- **Weather features** - Open-Meteo integration (wind speed, solar radiation)
- **Probabilistic forecasting** - Dual prediction intervals (80%, 95%)
- **Drift monitoring** - Automatic detection of model degradation

## Architecture Overview

```
EIA API (WND/SUN) ──┐
                    ├──► Data Pipeline ──► StatsForecast ──► Predictions
Open-Meteo API ─────┘         │                  │              │
                              ▼                  ▼              ▼
                         Validation        Multi-Series    Probabilistic
                         & Quality         [unique_id,     (80%, 95%
                                           ds, y, X]       intervals)
                                                              │
                                                              ▼
                                                         Streamlit
                                                         Dashboard
                                                         (drift, alerts)
```

## Key Concepts

1. **StatsForecast format**: `[unique_id, ds, y]` - where `unique_id` = `{region}_{fuel_type}`
2. **Zero-value handling**: Solar generates 0 at night - we use RMSE/MAE, NOT MAPE
3. **Leakage prevention**: Use **forecasted** weather as prediction-time features, never observed weather from the forecast window
4. **Drift detection**: Threshold = mean + 2*std from backtest
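
The drift rule in item 4 can be sketched in a few lines. This is a minimal illustration, assuming per-window RMSE values from a backtest are already available; `drift_threshold`, `is_drifting`, and the sample numbers are hypothetical and not part of the project code. It also assumes the sample standard deviation, which the text does not specify.

```python
from statistics import mean, stdev


def drift_threshold(window_errors: list[float], n_std: float = 2.0) -> float:
    """Return mean + n_std * std of per-window backtest errors."""
    if len(window_errors) < 2:
        raise ValueError("Need at least two backtest windows to estimate the spread")
    # Assumption: sample standard deviation (n - 1 in the denominator)
    return mean(window_errors) + n_std * stdev(window_errors)


# Hypothetical RMSE values from five backtest windows
backtest_rmse = [100.0, 110.0, 90.0, 105.0, 95.0]
threshold = drift_threshold(backtest_rmse)


def is_drifting(live_rmse: float) -> bool:
    """Flag a live error that exceeds the backtest-derived threshold."""
    return live_rmse > threshold


print(f"threshold={threshold:.1f}, drift at 120? {is_drifting(120.0)}")
```

With only a handful of backtest windows the standard deviation is a rough estimate, so the threshold is best treated as a starting point for alerting.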

## Setup

First, let's ensure we have the project root in our path and configure logging.

In [1]:
import sys
import logging
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Configure logging for visibility
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

print(f"Project root: {project_root}")

Project root: c:\docker_projects\atsaf


---

# Module 1: Region Definitions

**File:** `src/renewable/regions.py`

This module maps **EIA balancing authority regions** to their geographic coordinates. Why do we need coordinates?

- **Weather API lookup**: Open-Meteo requires latitude/longitude
- **Regional analysis**: Compare forecast accuracy across regions
- **Timezone handling**: Each region has a primary timezone

## Key Design Decisions

1. **NamedTuple for RegionInfo**: Immutable and lightweight, with named, type-annotated fields
2. **Centroid coordinates**: Approximate centers - good enough for hourly weather
3. **Fuel type codes**: `WND` (wind), `SUN` (solar) - match EIA's API
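
To illustrate the first design decision, the sketch below uses a stand-in `Site` tuple (hypothetical, with made-up values) to show that a `NamedTuple` rejects attribute assignment and unpacks like a plain tuple.

```python
from typing import NamedTuple


class Site(NamedTuple):
    """Illustrative stand-in for a region record."""
    name: str
    lat: float
    lon: float


site = Site(name="Example", lat=36.7, lon=-119.4)

# Fields unpack positionally, like a plain tuple
name, lat, lon = site

# Attribute assignment is rejected because tuples are immutable
try:
    site.lat = 0.0
    mutated = True
except AttributeError:
    mutated = False

# _replace returns a modified copy and leaves the original untouched
moved = site._replace(lat=40.0)
print(mutated, site.lat, moved.lat)
```

Note that the annotations document the expected types but are not enforced at runtime.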

In [2]:
# Module 1: Region Definitions
# This is the foundation - defines WHERE our data comes from

from typing import NamedTuple


class RegionInfo(NamedTuple):
    """Region metadata for EIA and weather lookups."""
    name: str
    lat: float
    lon: float
    timezone: str


# EIA Balancing Authority regions with centroid coordinates
REGIONS: dict[str, RegionInfo] = {
    # Western Interconnection
    "CALI": RegionInfo(name="California ISO", lat=36.7, lon=-119.4, timezone="America/Los_Angeles"),
    "NW": RegionInfo(name="Northwest", lat=45.5, lon=-122.0, timezone="America/Los_Angeles"),
    "SW": RegionInfo(name="Southwest", lat=33.5, lon=-112.0, timezone="America/Phoenix"),
    
    # Texas Interconnection (its own grid!)
    "ERCO": RegionInfo(name="ERCOT (Texas)", lat=31.0, lon=-100.0, timezone="America/Chicago"),
    
    # Eastern Interconnection - Northeast
    "NE": RegionInfo(name="New England ISO", lat=42.3, lon=-71.5, timezone="America/New_York"),
    "NY": RegionInfo(name="New York ISO", lat=42.5, lon=-75.5, timezone="America/New_York"),
    "PJM": RegionInfo(name="PJM Interconnection", lat=40.0, lon=-77.0, timezone="America/New_York"),
    
    # Eastern Interconnection - Midwest
    "MISO": RegionInfo(name="Midcontinent ISO", lat=41.0, lon=-93.0, timezone="America/Chicago"),
    "SWPP": RegionInfo(name="Southwest Power Pool", lat=36.0, lon=-97.0, timezone="America/Chicago"),
    "CENT": RegionInfo(name="Central", lat=39.0, lon=-95.0, timezone="America/Chicago"),
    
    # Eastern Interconnection - Southeast
    "SE": RegionInfo(name="Southeast", lat=33.0, lon=-84.0, timezone="America/New_York"),
    "FLA": RegionInfo(name="Florida", lat=28.0, lon=-82.0, timezone="America/New_York"),
    "CAR": RegionInfo(name="Carolinas", lat=35.5, lon=-80.0, timezone="America/New_York"),
    "TEN": RegionInfo(name="Tennessee Valley", lat=35.5, lon=-86.0, timezone="America/Chicago"),
    
    # Aggregate
    "US48": RegionInfo(name="Lower 48 States", lat=39.8, lon=-98.5, timezone="America/Chicago"),
}

# Fuel type codes for renewable generation
FUEL_TYPES = {
    "WND": "Wind",
    "SUN": "Solar",
}


def get_region_coords(region_code: str) -> tuple[float, float]:
    """Return (lat, lon) for weather API lookup."""
    region = REGIONS[region_code]
    return (region.lat, region.lon)


def list_regions() -> list[str]:
    """Return all valid region codes."""
    return sorted(REGIONS.keys())


def get_region_info(region_code: str) -> RegionInfo:
    """Get full region information."""
    return REGIONS[region_code]


def validate_region(region_code: str) -> bool:
    """Check if region code is valid."""
    return region_code in REGIONS


def validate_fuel_type(fuel_type: str) -> bool:
    """Check if fuel type code is valid."""
    return fuel_type in FUEL_TYPES

### Example: Using Region Definitions

In [3]:
# Example run - test region functions

print("=== Available Regions ===")
print(f"Total regions: {len(REGIONS)}")
print(f"Region codes: {list_regions()}")

print("\n=== Example: California ===")
cali_info = get_region_info("CALI")
print(f"Name: {cali_info.name}")
print(f"Coordinates: ({cali_info.lat}, {cali_info.lon})")
print(f"Timezone: {cali_info.timezone}")

print("\n=== Weather API Coordinates ===")
for region in ["CALI", "ERCO", "MISO"]:
    lat, lon = get_region_coords(region)
    print(f"{region}: lat={lat}, lon={lon}")

print("\n=== Fuel Types ===")
for code, name in FUEL_TYPES.items():
    print(f"{code}: {name}")

print("\n=== Validation ===")
print(f"validate_region('CALI'): {validate_region('CALI')}")
print(f"validate_region('INVALID'): {validate_region('INVALID')}")
print(f"validate_fuel_type('WND'): {validate_fuel_type('WND')}")

=== Available Regions ===
Total regions: 15
Region codes: ['CALI', 'CAR', 'CENT', 'ERCO', 'FLA', 'MISO', 'NE', 'NW', 'NY', 'PJM', 'SE', 'SW', 'SWPP', 'TEN', 'US48']

=== Example: California ===
Name: California ISO
Coordinates: (36.7, -119.4)
Timezone: America/Los_Angeles

=== Weather API Coordinates ===
CALI: lat=36.7, lon=-119.4
ERCO: lat=31.0, lon=-100.0
MISO: lat=41.0, lon=-93.0

=== Fuel Types ===
WND: Wind
SUN: Solar

=== Validation ===
validate_region('CALI'): True
validate_region('INVALID'): False
validate_fuel_type('WND'): True


---

# Module 2: EIA Data Fetcher

**File:** `src/renewable/eia_renewable.py`

This module fetches **hourly wind and solar generation** from the EIA API.

## Critical Concepts

### StatsForecast Format
StatsForecast expects data in a specific format:
```
unique_id | ds                  | y
----------|---------------------|--------
CALI_WND  | 2024-01-01 00:00:00 | 1234.5
CALI_WND  | 2024-01-01 01:00:00 | 1456.7
ERCO_WND  | 2024-01-01 00:00:00 | 2345.6
```

- `unique_id`: Identifies the time series (e.g., "CALI_WND" = California Wind)
- `ds`: Datetime column (timezone-naive UTC)
- `y`: Target value (generation in MWh)
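
As a small illustration of this layout, the sketch below reshapes a wide table (one column per series) into the long `[unique_id, ds, y]` format with pandas. The input frame and its values are made up for the example.

```python
import pandas as pd

# Hypothetical wide table: one column per series
wide = pd.DataFrame(
    {
        "ds": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 02:00"]
        ),
        "CALI_WND": [1234.5, 1456.7, 1500.0],
        "ERCO_WND": [2345.6, 2400.1, 2380.9],
    }
)

# melt() stacks the series columns into (unique_id, y) pairs
long_df = (
    wide.melt(id_vars="ds", var_name="unique_id", value_name="y")
    .loc[:, ["unique_id", "ds", "y"]]
    .sort_values(["unique_id", "ds"])
    .reset_index(drop=True)
)

print(long_df)
```

Each series contributes one row per timestamp, so two series over three hours yield six rows.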

### API Rate Limiting
- EIA API has rate limits (~5 requests/second)
- We cap concurrency (`max_workers=3` by default) and pause 0.2 s after each request in every worker
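
One way to keep several worker threads under a single shared request budget is a lock-guarded minimum-interval limiter. The sketch below is an assumption about how such throttling could be done, not the fetcher's own mechanism; `MinIntervalLimiter` is a hypothetical helper, and the injectable `clock` and `sleep` arguments exist only to make the demo deterministic.

```python
import threading
import time


class MinIntervalLimiter:
    """Allow at most one call per `min_interval` seconds across all threads."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> float:
        """Block until the caller's reserved slot; return the seconds slept."""
        with self._lock:
            now = self._clock()
            # Reserve the next free slot while holding the lock
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            # Sleep outside the lock so other threads can reserve later slots
            self._sleep(delay)
        return delay


# Demo with a frozen fake clock so the spacing is easy to see
slept = []
limiter = MinIntervalLimiter(0.2, clock=lambda: 100.0, sleep=slept.append)
delays = [limiter.wait() for _ in range(3)]
print(delays)
```

In real use, each worker would call `limiter.wait()` on one shared instance just before issuing its HTTP request.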

In [4]:
# Module 2: EIA Data Fetcher
# Fetches wind/solar generation data for multi-series forecasting

import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Optional

import pandas as pd
import requests
from dotenv import load_dotenv

load_dotenv()


class EIARenewableFetcher:
    """Fetch wind/solar data for multiple regions from EIA API.
    
    Outputs data in StatsForecast multi-series format:
    - unique_id: "{region}_{fuel_type}" (e.g., "CALI_WND")
    - ds: timezone-naive UTC datetime
    - y: generation value (MWh)
    """
    
    BASE_URL = "https://api.eia.gov/v2/electricity/rto/fuel-type-data/data/"
    MAX_RECORDS_PER_REQUEST = 5000
    RATE_LIMIT_DELAY = 0.2  # pause after each request (~5 requests/second per worker thread)
    
    def __init__(self, api_key: Optional[str] = None):
        """Initialize with EIA API key."""
        self.api_key = api_key or os.getenv("EIA_API_KEY")
        if not self.api_key:
            raise ValueError(
                "EIA API key required. Set EIA_API_KEY environment variable "
                "or pass api_key parameter."
            )
    
    def fetch_region(
        self,
        region: str,
        fuel_type: str,
        start_date: str,
        end_date: str,
    ) -> pd.DataFrame:
        """Fetch data for a single region and fuel type."""
        if not validate_region(region):
            raise ValueError(f"Invalid region: {region}")
        if not validate_fuel_type(fuel_type):
            raise ValueError(f"Invalid fuel type: {fuel_type}")
        
        all_records = []
        offset = 0
        
        while True:
            params = {
                "api_key": self.api_key,
                "data[]": "value",
                "facets[respondent][]": region,
                "facets[fueltype][]": fuel_type,
                "frequency": "hourly",
                "start": f"{start_date}T00",
                "end": f"{end_date}T23",
                "length": self.MAX_RECORDS_PER_REQUEST,
                "offset": offset,
                "sort[0][column]": "period",
                "sort[0][direction]": "asc",
            }
            
            response = requests.get(self.BASE_URL, params=params, timeout=30)
            response.raise_for_status()
            
            data = response.json()
            
            if "response" not in data or "data" not in data["response"]:
                break
            
            records = data["response"]["data"]
            if not records:
                break
            
            all_records.extend(records)
            offset += self.MAX_RECORDS_PER_REQUEST
            time.sleep(self.RATE_LIMIT_DELAY)
        
        if not all_records:
            return pd.DataFrame(columns=["ds", "value", "region", "fuel_type"])
        
        df = pd.DataFrame(all_records)
        df["ds"] = pd.to_datetime(df["period"], errors="coerce")
        df["value"] = pd.to_numeric(df["value"], errors="coerce")
        df["region"] = region
        df["fuel_type"] = fuel_type
        df = df.dropna(subset=["ds", "value"])
        
        # Convert to UTC naive
        if df["ds"].dt.tz is not None:
            df["ds"] = df["ds"].dt.tz_convert("UTC").dt.tz_localize(None)
        
        return df[["ds", "value", "region", "fuel_type"]].sort_values("ds").reset_index(drop=True)
    
    def fetch_all_regions(
        self,
        fuel_type: str,
        start_date: str,
        end_date: str,
        regions: Optional[list[str]] = None,
        max_workers: int = 3,
    ) -> pd.DataFrame:
        """Fetch data for all regions, return in StatsForecast format.
        
        Returns DataFrame with columns [unique_id, ds, y]
        where unique_id = "{region}_{fuel_type}" (e.g., "CALI_WND")
        """
        if regions is None:
            regions = [r for r in REGIONS.keys() if r != "US48"]
        
        all_dfs = []
        
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {
                executor.submit(
                    self.fetch_region, region, fuel_type, start_date, end_date
                ): region
                for region in regions
            }
            
            for future in as_completed(futures):
                region = futures[future]
                try:
                    df = future.result()
                    if len(df) > 0:
                        all_dfs.append(df)
                        print(f"[OK] {region}: {len(df)} rows")
                except Exception as e:
                    print(f"[FAIL] {region}: {e}")
        
        if not all_dfs:
            return pd.DataFrame(columns=["unique_id", "ds", "y"])
        
        combined = pd.concat(all_dfs, ignore_index=True)
        
        # Create unique_id for StatsForecast format
        combined["unique_id"] = combined["region"] + "_" + combined["fuel_type"]
        combined = combined.rename(columns={"value": "y"})
        
        return combined[["unique_id", "ds", "y"]].sort_values(["unique_id", "ds"]).reset_index(drop=True)
    
    def get_series_summary(self, df: pd.DataFrame) -> pd.DataFrame:
        """Get summary statistics per series."""
        return df.groupby("unique_id").agg(
            count=("y", "count"),
            min_value=("y", "min"),
            max_value=("y", "max"),
            mean_value=("y", "mean"),
            zero_count=("y", lambda x: (x == 0).sum()),
        ).reset_index()

### Example: Fetching EIA Data

**Note:** This requires a valid EIA API key. Get one free at: https://www.eia.gov/opendata/

In [5]:
# Example run - test EIA fetcher
# Requires EIA_API_KEY environment variable

try:
    fetcher = EIARenewableFetcher()
    
    print("=== Testing Single Region Fetch ===")
    df_single = fetcher.fetch_region(
        region="CALI",
        fuel_type="WND",
        start_date="2024-12-01",
        end_date="2024-12-03",
    )
    print(f"Single region: {len(df_single)} rows")
    print(df_single.head())
    
    print("\n=== Testing Multi-Region Fetch ===")
    df_multi = fetcher.fetch_all_regions(
        fuel_type="WND",
        start_date="2024-12-01",
        end_date="2024-12-03",
        regions=["CALI", "ERCO", "MISO"],
    )
    print(f"\nMulti-region: {len(df_multi)} rows")
    print(f"Series: {df_multi['unique_id'].unique().tolist()}")
    
    print("\n=== Series Summary ===")
    print(fetcher.get_series_summary(df_multi))
    
except ValueError as e:
    print(f"\nNote: {e}")
    print("Set your EIA API key to run this example.")
    print("Get a free key at: https://www.eia.gov/opendata/")

=== Testing Single Region Fetch ===
Single region: 0 rows
Empty DataFrame
Columns: [ds, value, region, fuel_type]
Index: []

=== Testing Multi-Region Fetch ===
[OK] ERCO: 49 rows
[OK] MISO: 49 rows

Multi-region: 98 rows
Series: ['ERCO_WND', 'MISO_WND']

=== Series Summary ===
  unique_id  count  min_value  max_value   mean_value  zero_count
0  ERCO_WND     49        747       9331  4775.061224           0
1  MISO_WND     49       4648      11717  8110.673469           0


---

# Module 3: Weather Integration

**File:** `src/renewable/open_meteo.py`

Weather is **critical** for renewable forecasting:
- **Wind generation** depends on wind speed (especially at hub height ~100m)
- **Solar generation** depends on radiation and cloud cover

## Key Concept: Preventing Leakage

**Data leakage** occurs when training uses information that wouldn't be available at prediction time.

```
❌ WRONG: Using observed (actual) weather from the forecast window as prediction features
   - At prediction time, those observations don't exist yet!
   
✅ CORRECT: Use forecasted weather for predictions
   - Training: historical weather aligned with historical generation
   - Prediction: weather forecast for the prediction horizon
```
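
A lightweight guard can make this rule explicit. The sketch below assumes a forecast-weather frame with a `ds` column and keeps only rows that fall strictly after the last training timestamp and within the horizon. The function name `select_horizon_features` and the sample data are illustrative. It checks timestamp alignment only; it cannot tell whether the values came from a forecast or from observations.

```python
import pandas as pd


def select_horizon_features(
    forecast_weather: pd.DataFrame,
    last_train_ds: pd.Timestamp,
    horizon_hours: int,
) -> pd.DataFrame:
    """Keep forecast rows in (last_train_ds, last_train_ds + horizon]."""
    end = last_train_ds + pd.Timedelta(hours=horizon_hours)
    mask = (forecast_weather["ds"] > last_train_ds) & (forecast_weather["ds"] <= end)
    selected = forecast_weather.loc[mask].reset_index(drop=True)
    if len(selected) != horizon_hours:
        raise ValueError(f"Expected {horizon_hours} hourly rows, got {len(selected)}")
    return selected


# Made-up hourly forecast that overlaps the end of the training period
last_train_ds = pd.Timestamp("2024-12-03 23:00")
start = pd.Timestamp("2024-12-03 22:00")
forecast_weather = pd.DataFrame(
    {
        "ds": [start + pd.Timedelta(hours=i) for i in range(30)],
        "wind_speed_100m": [5.0 + 0.1 * i for i in range(30)],
    }
)

future_X = select_horizon_features(forecast_weather, last_train_ds, horizon_hours=24)
print(future_X["ds"].min(), future_X["ds"].max(), len(future_X))
```

Raising on a short window surfaces gaps in the weather forecast early, before they become missing features at prediction time.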

## Open-Meteo API

Open-Meteo is **free** and requires no API key:
- Historical API: Past weather data
- Forecast API: Up to 16 days ahead

In [6]:
# Module 3: Weather Integration
# Fetches weather features from Open-Meteo (free, no API key needed)

from datetime import datetime, timedelta
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class OpenMeteoRenewable:
    """Fetch weather features for renewable energy forecasting.
    
    Uses Open-Meteo APIs (free, no API key required):
    - Historical: archive-api.open-meteo.com (past data)
    - Forecast: api.open-meteo.com (up to 16 days ahead)
    """
    
    HISTORICAL_URL = "https://archive-api.open-meteo.com/v1/archive"
    FORECAST_URL = "https://api.open-meteo.com/v1/forecast"
    
    # Weather variables relevant for renewable forecasting
    WEATHER_VARS = [
        "temperature_2m",      # Ambient temperature
        "wind_speed_10m",      # Wind at 10m (standard measurement height)
        "wind_speed_100m",     # Wind at hub height (~100m for turbines)
        "wind_direction_10m",  # Wind direction
        "direct_radiation",    # Direct solar radiation
        "diffuse_radiation",   # Diffuse solar radiation
        "cloud_cover",         # Cloud coverage (%)
    ]
    
    def __init__(self, timeout: int = 30):
        self.timeout = timeout
        self.session = self._create_session()
    
    def _create_session(self) -> requests.Session:
        """Create session with retry logic."""
        session = requests.Session()
        retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
        session.mount("https://", HTTPAdapter(max_retries=retries))
        return session
    
    def fetch_historical(
        self,
        lat: float,
        lon: float,
        start_date: str,
        end_date: str,
        variables: Optional[list[str]] = None,
    ) -> pd.DataFrame:
        """Fetch historical hourly weather data.
        
        Use this for TRAINING data to align with historical generation.
        """
        if variables is None:
            variables = self.WEATHER_VARS
        
        params = {
            "latitude": lat,
            "longitude": lon,
            "start_date": start_date,
            "end_date": end_date,
            "hourly": ",".join(variables),
            "timezone": "UTC",
        }
        
        response = self.session.get(self.HISTORICAL_URL, params=params, timeout=self.timeout)
        response.raise_for_status()
        
        return self._parse_response(response.json(), variables)
    
    def fetch_forecast(
        self,
        lat: float,
        lon: float,
        horizon_hours: int = 48,
        variables: Optional[list[str]] = None,
    ) -> pd.DataFrame:
        """Fetch weather forecast for future predictions.
        
        CRITICAL: Use this for PREDICTION features to avoid leakage!
        Do NOT use historical weather for future forecasts.
        """
        if variables is None:
            variables = self.WEATHER_VARS
        
        forecast_days = min((horizon_hours // 24) + 1, 16)
        
        params = {
            "latitude": lat,
            "longitude": lon,
            "hourly": ",".join(variables),
            "timezone": "UTC",
            "forecast_days": forecast_days,
        }
        
        response = self.session.get(self.FORECAST_URL, params=params, timeout=self.timeout)
        response.raise_for_status()
        
        df = self._parse_response(response.json(), variables)
        
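        # Trim to requested horizon
        if len(df) > 0:
            # datetime.utcnow() is deprecated (Python 3.12+); build a naive UTC timestamp via pandas
            cutoff = pd.Timestamp.now(tz="UTC").tz_localize(None) + timedelta(hours=horizon_hours)
            df = df[df["ds"] <= cutoff].reset_index(drop=True)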
        
        return df
    
    def fetch_for_region(
        self,
        region_code: str,
        start_date: str,
        end_date: str,
    ) -> pd.DataFrame:
        """Fetch historical weather using region centroid coordinates."""
        lat, lon = get_region_coords(region_code)
        df = self.fetch_historical(lat, lon, start_date, end_date)
        df["region"] = region_code
        return df
    
    def fetch_all_regions_historical(
        self,
        regions: list[str],
        start_date: str,
        end_date: str,
    ) -> pd.DataFrame:
        """Fetch historical weather for multiple regions."""
        all_dfs = []
        
        for region in regions:
            try:
                df = self.fetch_for_region(region, start_date, end_date)
                all_dfs.append(df)
                print(f"[OK] Weather for {region}: {len(df)} rows")
            except Exception as e:
                print(f"[FAIL] Weather for {region}: {e}")
        
        if not all_dfs:
            return pd.DataFrame()
        
        return pd.concat(all_dfs, ignore_index=True).sort_values(["region", "ds"]).reset_index(drop=True)
    
    def _parse_response(self, data: dict, variables: list[str]) -> pd.DataFrame:
        """Parse Open-Meteo API response to DataFrame."""
        hourly = data.get("hourly", {})
        times = hourly.get("time", [])
        
        if not times:
            return pd.DataFrame(columns=["ds"] + variables)
        
        df_data = {"ds": pd.to_datetime(times, errors="coerce", utc=True).tz_localize(None)}
        
        for var in variables:
            values = hourly.get(var)
            df_data[var] = values if values else [None] * len(times)
        
        df = pd.DataFrame(df_data)
        
        for var in variables:
            if var in df.columns:
                df[var] = pd.to_numeric(df[var], errors="coerce")
        
        return df.sort_values("ds").reset_index(drop=True)

### Example: Fetching Weather Data

In [7]:
# Example run - test weather fetcher (no API key needed!)

weather = OpenMeteoRenewable()

print("=== Testing Historical Weather ===")
hist_df = weather.fetch_for_region(
    region_code="CALI",
    start_date="2024-12-01",
    end_date="2024-12-03",
)
print(f"Historical: {len(hist_df)} rows")
print(hist_df.head())

print("\n=== Testing Weather Forecast ===")
fcst_df = weather.fetch_forecast(
    lat=36.7, lon=-119.4,
    horizon_hours=24,
)
print(f"Forecast: {len(fcst_df)} rows")
print(fcst_df.head())

print("\n=== Weather Variables Explained ===")
print("Wind forecasting:")
print("  - wind_speed_10m: Standard measurement height")
print("  - wind_speed_100m: Hub height for wind turbines (critical!)")
print("\nSolar forecasting:")
print("  - direct_radiation: Direct sunlight (W/m²)")
print("  - diffuse_radiation: Scattered light from clouds")
print("  - cloud_cover: Affects both direct and diffuse")

=== Testing Historical Weather ===
Historical: 72 rows
                   ds  temperature_2m  wind_speed_10m  wind_speed_100m  \
0 2024-12-01 00:00:00            12.2             5.9              7.0   
1 2024-12-01 01:00:00             9.2             1.2              5.3   
2 2024-12-01 02:00:00             8.4             2.4              3.6   
3 2024-12-01 03:00:00             8.7             3.8              5.4   
4 2024-12-01 04:00:00            10.3             0.9              2.2   

   wind_direction_10m  direct_radiation  diffuse_radiation  cloud_cover region  
0                  72              84.0               57.0          100   CALI  
1                  27               6.0               12.0           41   CALI  
2                  63               0.0                0.0          100   CALI  
3                  90               0.0                0.0          100   CALI  
4                 101               0.0                0.0          100   CALI  

=== Testing Weather Forecast ===


---

# Module 4: Probabilistic Modeling

**File:** `src/renewable/modeling.py`

This is where the forecasting happens! We use **StatsForecast** for:

1. **Multi-series forecasting**: Forecast every region/fuel-type series in one call (StatsForecast fits each model separately per series)
2. **Probabilistic predictions**: Get prediction intervals, not just point forecasts
3. **Exogenous weather features**: Include weather variables as predictors

## Key Concepts

### Why Prediction Intervals?

Point forecasts are useful, but energy traders need **uncertainty quantification**:
- **80% interval**: "I'm 80% confident generation will be between X and Y"
- **95% interval**: Wider, for risk management
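
Whether an interval deserves its nominal level can be checked empirically: for a well-calibrated 80% interval, the share of actuals falling inside should be close to 0.80. A minimal sketch with made-up numbers follows; `empirical_coverage` is a hypothetical helper, not the project's `ForecastMetrics.coverage`.

```python
def empirical_coverage(
    y_true: list[float], lower: list[float], upper: list[float]
) -> float:
    """Fraction of actual values inside [lower, upper] (bounds inclusive)."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("All inputs must have the same length")
    if not y_true:
        raise ValueError("Inputs must not be empty")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return hits / len(y_true)


# Made-up actuals and 80% interval bounds for five hours
actual = [100.0, 120.0, 90.0, 150.0, 80.0]
lo_80 = [90.0, 100.0, 95.0, 120.0, 70.0]
hi_80 = [110.0, 130.0, 115.0, 140.0, 95.0]

coverage_80 = empirical_coverage(actual, lo_80, hi_80)
print(f"Empirical coverage of the 80% interval: {coverage_80:.2f}")
```

Coverage well below the nominal level suggests intervals that are too narrow; coverage well above it suggests intervals that are wider than necessary.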

### Zero-Value Safety (CRITICAL)

**Solar panels generate ZERO at night!** This breaks MAPE:

```
MAPE = mean(|actual - predicted| / |actual|)

When actual = 0:
term = |0 - predicted| / 0  ->  undefined (division by zero!)
```

**Solution**: Always use RMSE and MAE for renewable forecasting.
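
The failure mode is easy to reproduce. The sketch below uses a made-up solar profile with night-time zeros; the helper functions are illustrative and are not the project's `ForecastMetrics` API.

```python
import math


def mae(y_true: list[float], y_pred: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: list[float], y_pred: list[float]):
    """MAPE as a fraction, or None when any actual value is zero."""
    if any(a == 0 for a in y_true):
        return None  # division by zero: the metric is undefined
    return sum(abs(a - p) / abs(a) for a, p in zip(y_true, y_pred)) / len(y_true)


# Two night hours (zero generation) followed by two daylight hours
solar_actual = [0.0, 0.0, 50.0, 100.0]
solar_pred = [2.0, 0.0, 45.0, 110.0]

print("MAE: ", mae(solar_actual, solar_pred))
print("RMSE:", rmse(solar_actual, solar_pred))
print("MAPE:", mape(solar_actual, solar_pred))
```

MAE and RMSE stay finite because they never divide by the actual value, while MAPE has no defined value as soon as a single zero appears.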

In [8]:
# Module 4: Probabilistic Modeling
# Multi-series forecasting with prediction intervals

import numpy as np
from dataclasses import dataclass, field

# Import our evaluation metrics (from chapter 2)
from src.chapter2.evaluation import ForecastMetrics


@dataclass
class ForecastConfig:
    """Configuration for renewable forecasting."""
    horizon: int = 24
    confidence_levels: tuple[int, int] = (80, 95)
    season_length: int = 24  # Hourly seasonality
    weekly_season: int = 168  # 24 * 7
    models: list[str] = field(default_factory=lambda: ["AutoARIMA", "MSTL"])


class RenewableForecastModel:
    """Multi-series probabilistic forecasting with weather exogenous.
    
    Designed for wind/solar generation with:
    - Weather features (wind speed, solar radiation)
    - Dual prediction intervals (80%, 95%)
    - Zero-safe metrics (solar has 0s at night)
    """
    
    def __init__(
        self,
        horizon: int = 24,
        confidence_levels: tuple[int, int] = (80, 95),
    ):
        self.horizon = horizon
        self.confidence_levels = confidence_levels
        self.sf = None
        self.fitted = False
    
    def prepare_features(
        self,
        df: pd.DataFrame,
        weather_df: Optional[pd.DataFrame] = None,
    ) -> pd.DataFrame:
        """Add time and weather features.
        
        Time features use cyclic encoding (sin/cos) because:
        - Hour 23 and Hour 0 are adjacent, but 23 > 0 numerically
        - Sin/cos creates a smooth circular representation
        """
        result = df.copy()
        
        # Cyclic encoding for hour of day
        result["hour"] = result["ds"].dt.hour
        result["hour_sin"] = np.sin(2 * np.pi * result["hour"] / 24)
        result["hour_cos"] = np.cos(2 * np.pi * result["hour"] / 24)
        
        # Cyclic encoding for day of week
        result["dayofweek"] = result["ds"].dt.dayofweek
        result["dow_sin"] = np.sin(2 * np.pi * result["dayofweek"] / 7)
        result["dow_cos"] = np.cos(2 * np.pi * result["dayofweek"] / 7)
        
        # Merge weather if provided
        if weather_df is not None and len(weather_df) > 0:
            result["region"] = result["unique_id"].str.split("_").str[0]
            weather_cols = [c for c in weather_df.columns if c not in ["ds", "region"]]
            
            result = result.merge(
                weather_df[["ds", "region"] + weather_cols],
                on=["ds", "region"],
                how="left",
            )
            
            # Forward fill missing weather
            for col in weather_cols:
                if col in result.columns:
                    result[col] = result.groupby("unique_id")[col].ffill()
            
            result = result.drop(columns=["region"])
        
        # Lag features (shifted to prevent leakage)
        result = result.sort_values(["unique_id", "ds"])
        result["y_lag_1"] = result.groupby("unique_id")["y"].shift(1)
        result["y_lag_24"] = result.groupby("unique_id")["y"].shift(24)
        
        result = result.drop(columns=["hour", "dayofweek"], errors="ignore")
        
        return result
    
    def fit(self, df: pd.DataFrame, weather_df: Optional[pd.DataFrame] = None) -> None:
        """Train StatsForecast models."""
        from statsforecast import StatsForecast
        from statsforecast.models import MSTL, AutoARIMA, AutoETS, SeasonalNaive
        
        train_df = self.prepare_features(df, weather_df)
        
        # Define models
        models = [
            AutoARIMA(season_length=24),
            SeasonalNaive(season_length=24),
            AutoETS(season_length=24),
            MSTL(
                season_length=[24, 168],  # Daily and weekly seasonality
                trend_forecaster=AutoARIMA(),
                alias="MSTL_ARIMA",
            ),
        ]
        
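        self.sf = StatsForecast(models=models, freq="h", n_jobs=-1)
        # NOTE: only [unique_id, ds, y] are kept here, so the time, weather, and lag
        # features from prepare_features() are not passed to the models. The actual
        # fitting happens inside sf.forecast() when predict() is called.
        self._train_df = train_df[["unique_id", "ds", "y"]].copy()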
        
        print(f"Model fit: {len(train_df)} rows, {train_df['unique_id'].nunique()} series")
        self.fitted = True
    
    def predict(self, future_weather: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        """Generate forecasts with dual prediction intervals."""
        if not self.fitted:
            raise RuntimeError("Model must be fitted before prediction. Call fit() first.")
        
        forecasts = self.sf.forecast(
            h=self.horizon,
            df=self._train_df,
            level=list(self.confidence_levels),
        )
        
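        # Depending on the StatsForecast version, unique_id is returned as the index
        # or as a column; only reset when needed to avoid a stray "index" column.
        if "unique_id" not in forecasts.columns:
            forecasts = forecasts.reset_index()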
        result = self._standardize_forecast_columns(forecasts)
        
        print(f"Predictions generated: {len(result)} rows, {self.horizon}h horizon")
        return result
    
    def cross_validate(
        self,
        df: pd.DataFrame,
        weather_df: Optional[pd.DataFrame] = None,
        n_windows: int = 5,
        step_size: int = 168,
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        """Run rolling-origin cross-validation."""
        from statsforecast import StatsForecast
        from statsforecast.models import MSTL, AutoARIMA, AutoETS, SeasonalNaive
        
        cv_df = self.prepare_features(df, weather_df)
        cv_df = cv_df[["unique_id", "ds", "y"]].copy()
        
        models = [
            AutoARIMA(season_length=24),
            SeasonalNaive(season_length=24),
            AutoETS(season_length=24),
            MSTL(season_length=[24, 168], trend_forecaster=AutoARIMA(), alias="MSTL_ARIMA"),
        ]
        
        sf = StatsForecast(models=models, freq="h", n_jobs=-1)
        
        print(f"Running CV: {n_windows} windows, step={step_size}h, horizon={self.horizon}h")
        
        cv_results = sf.cross_validation(
            df=cv_df,
            h=self.horizon,
            step_size=step_size,
            n_windows=n_windows,
            level=list(self.confidence_levels),
        )
        
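        # Depending on the StatsForecast version, unique_id is returned as the index
        # or as a column; a stray "index" column would be scored as a model below.
        if "unique_id" not in cv_results.columns:
            cv_results = cv_results.reset_index()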
        leaderboard = self._compute_leaderboard(cv_results)
        
        return cv_results, leaderboard
    
    def _standardize_forecast_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        """Standardize column names to yhat, yhat_lo_80, etc."""
        result = df.copy()
        
        model_cols = [c for c in result.columns if c not in ["unique_id", "ds", "cutoff"]]
        point_cols = [c for c in model_cols if not any(x in c for x in ["-lo-", "-hi-"])]
        
        # Prefer MSTL_ARIMA as the main model
        if "MSTL_ARIMA" in point_cols:
            best_model = "MSTL_ARIMA"
        elif "AutoARIMA" in point_cols:
            best_model = "AutoARIMA"
        else:
            best_model = point_cols[0] if point_cols else None
        
        if best_model:
            result["yhat"] = result[best_model]
            
            for level in self.confidence_levels:
                lo_col = f"{best_model}-lo-{level}"
                hi_col = f"{best_model}-hi-{level}"
                
                if lo_col in result.columns:
                    result[f"yhat_lo_{level}"] = result[lo_col]
                if hi_col in result.columns:
                    result[f"yhat_hi_{level}"] = result[hi_col]
        
        return result
    
    def _compute_leaderboard(self, cv_results: pd.DataFrame) -> pd.DataFrame:
        """Compute model leaderboard from CV results."""
        model_cols = [
            c for c in cv_results.columns
            if c not in ["unique_id", "ds", "cutoff", "y"]
            and not any(x in c for x in ["-lo-", "-hi-"])
        ]
        
        rows = []
        for model in model_cols:
            y_true = cv_results["y"].values
            y_pred = cv_results[model].values
            
            # CRITICAL: Use RMSE and MAE, NOT MAPE (solar has zeros)
            rmse = ForecastMetrics.rmse(y_true, y_pred)
            mae = ForecastMetrics.mae(y_true, y_pred)
            
            # Coverage for each level
            coverages = {}
            for level in self.confidence_levels:
                lo_col = f"{model}-lo-{level}"
                hi_col = f"{model}-hi-{level}"
                
                if lo_col in cv_results.columns and hi_col in cv_results.columns:
                    coverage = ForecastMetrics.coverage(
                        y_true,
                        cv_results[lo_col].values,
                        cv_results[hi_col].values,
                    )
                    coverages[f"coverage_{level}"] = coverage
            
            rows.append({"model": model, "rmse": rmse, "mae": mae, **coverages})
        
        return pd.DataFrame(rows).sort_values("rmse")
    
    def compute_metrics(self, y_true: np.ndarray, y_pred: np.ndarray) -> dict:
        """Compute metrics - RMSE and MAE only (no MAPE for solar!)."""
        return {
            "rmse": ForecastMetrics.rmse(y_true, y_pred),
            "mae": ForecastMetrics.mae(y_true, y_pred),
            # NOTE: MAPE intentionally excluded - undefined when y=0
        }


def compute_baseline_metrics(
    cv_results: pd.DataFrame,
    model_name: str = "MSTL_ARIMA",
) -> dict:
    """Compute baseline metrics from backtest for drift detection.
    
    Drift threshold = mean + 2*std (flags unusual performance)
    """
    if model_name not in cv_results.columns:
        raise ValueError(f"Model {model_name} not found in CV results")
    
    window_metrics = []
    for cutoff in cv_results["cutoff"].unique():
        window = cv_results[cv_results["cutoff"] == cutoff]
        y_true = window["y"].values
        y_pred = window[model_name].values
        
        rmse = ForecastMetrics.rmse(y_true, y_pred)
        mae = ForecastMetrics.mae(y_true, y_pred)
        window_metrics.append({"cutoff": cutoff, "rmse": rmse, "mae": mae})
    
    metrics_df = pd.DataFrame(window_metrics)
    
    baseline = {
        "model": model_name,
        "rmse_mean": metrics_df["rmse"].mean(),
        "rmse_std": metrics_df["rmse"].std(),
        "mae_mean": metrics_df["mae"].mean(),
        "mae_std": metrics_df["mae"].std(),
        "n_windows": len(metrics_df),
    }
    
    # Drift threshold: mean + 2*std
    baseline["drift_threshold_rmse"] = baseline["rmse_mean"] + 2 * baseline["rmse_std"]
    baseline["drift_threshold_mae"] = baseline["mae_mean"] + 2 * baseline["mae_std"]
    
    return baseline
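
The `coverage_80` and `coverage_95` columns in the leaderboard report empirical interval coverage. `ForecastMetrics.coverage` is defined elsewhere in the notebook, so the sketch below is only a plain-Python stand-in. It assumes coverage is the percentage of actuals that fall inside `[lower, upper]` (inclusive), on a 0-100 scale. The name `interval_coverage` and the sample values are illustrative.

```python
def interval_coverage(y_true, lower, upper):
    """Percentage of actual values that fall inside their [lower, upper] interval."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("y_true, lower, and upper must have the same length")
    if not y_true:
        raise ValueError("need at least one observation")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper))
    return 100.0 * hits / len(y_true)


# Illustrative values: four hourly actuals with an 80% interval for each
y_true = [100.0, 104.0, 97.0, 120.0]
lo_80 = [95.0, 98.0, 99.0, 101.0]
hi_80 = [110.0, 112.0, 113.0, 115.0]

coverage_80 = interval_coverage(y_true, lo_80, hi_80)
print(f"Empirical 80% coverage: {coverage_80:.1f}%")
```

A well-calibrated 80% interval should land near 80. Values well below the nominal level mean the intervals are too narrow.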

### Example: Training and Cross-Validation

In [9]:
# Example run - test modeling with synthetic data

np.random.seed(42)

# Create synthetic generation data (simulates 30 days of hourly data)
dates = pd.date_range("2024-01-01", periods=720, freq="h")
series_ids = ["CALI_WND", "ERCO_WND"]

dfs = []
for sid in series_ids:
    # Simulate generation with daily seasonality + noise
    y = 100 + 20 * np.sin(np.arange(720) * 2 * np.pi / 24) + np.random.normal(0, 5, 720)
    dfs.append(pd.DataFrame({"unique_id": sid, "ds": dates, "y": y}))

df = pd.concat(dfs, ignore_index=True)
print(f"Synthetic data: {len(df)} rows, {df['unique_id'].nunique()} series")

print("\n=== Feature Preparation ===")
model = RenewableForecastModel(horizon=24)
features = model.prepare_features(df)
print(f"Features added: {[c for c in features.columns if c not in ['unique_id', 'ds', 'y']]}")

print("\n=== Cross-Validation ===")
cv_results, leaderboard = model.cross_validate(df, n_windows=3, step_size=168)
print(f"CV results: {len(cv_results)} rows")
print("\nLeaderboard (sorted by RMSE):")
print(leaderboard.to_string(index=False))

print("\n=== Baseline Metrics (for drift detection) ===")
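# NOTE: compute_baseline_metrics defaults to model_name="MSTL_ARIMA", which need not
# be the leaderboard winner. Pass model_name=leaderboard.iloc[0]["model"] to
# baseline the best-ranked model instead.
baseline = compute_baseline_metrics(cv_results)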
print(f"Best model: {baseline['model']}")
print(f"RMSE: {baseline['rmse_mean']:.2f} ± {baseline['rmse_std']:.2f}")
print(f"Drift threshold: {baseline['drift_threshold_rmse']:.2f}")
print("\n(Drift detected when current RMSE > threshold)")

Synthetic data: 1440 rows, 2 series

=== Feature Preparation ===
Features added: ['hour_sin', 'hour_cos', 'dow_sin', 'dow_cos', 'y_lag_1', 'y_lag_24']

=== Cross-Validation ===
Running CV: 3 windows, step=168h, horizon=24h
CV results: 144 rows

Leaderboard (sorted by RMSE):
        model      rmse       mae  coverage_80  coverage_95
      AutoETS  4.882920  3.763942    79.861111    93.750000
    AutoARIMA  5.171008  4.104589    79.166667    95.138889
   MSTL_ARIMA  6.070609  4.806524    56.944444    72.916667
SeasonalNaive  6.327316  5.083790    80.555556    98.611111
        index 53.896518 44.150693          NaN          NaN

=== Baseline Metrics (for drift detection) ===
Best model: MSTL_ARIMA
RMSE: 6.07 ± 0.28
Drift threshold: 6.63

(Drift detected when current RMSE > threshold)
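
The threshold printed above follows directly from the baseline: 6.07 + 2 × 0.28 = 6.63. Below is a minimal sketch of the check a monitoring job might run against it. `drift_threshold` and `check_drift` are hypothetical helpers, not functions from the project's modules, and the "current" RMSE values are made up. The returned labels reuse the `alert_type` strings from `save_drift_alert`.

```python
def drift_threshold(rmse_mean: float, rmse_std: float, k: float = 2.0) -> float:
    """Threshold = mean + k * std of the per-window backtest RMSE."""
    return rmse_mean + k * rmse_std


def check_drift(current_rmse: float, threshold: float) -> str:
    """Return 'drift_detected' when the current RMSE exceeds the threshold."""
    return "drift_detected" if current_rmse > threshold else "drift_check"


# Baseline values taken from the output above (rounded to two decimals)
threshold = drift_threshold(rmse_mean=6.07, rmse_std=0.28)
print(f"Threshold: {threshold:.2f}")

# Hypothetical RMSE values from two later scoring runs
for current in (6.10, 7.40):
    print(current, check_drift(current, threshold))
```

With only a few CV windows, the standard deviation is a rough estimate, so the threshold is best treated as a heuristic. With a single window, pandas returns `NaN` for the standard deviation, the threshold becomes `NaN`, and the comparison never fires.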


---

# Module 5: Database Operations

**File:** `src/renewable/db.py`

This module handles **persistence** of forecasts, metrics, and drift alerts.

## Database Schema

```
renewable_forecasts    - Forecasts with dual intervals
renewable_scores       - Evaluation metrics per run
weather_features       - Weather data by region
drift_alerts           - Drift detection history
baseline_metrics       - Backtest baselines for thresholds
```

## Why SQLite?

- **Simple**: Single file, no server needed
- **Fast**: WAL mode lets readers keep working while a write is in progress (still one writer at a time)
- **Portable**: Easy to share or back up
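
To see the WAL setting take effect, the sketch below opens a throwaway database file and reads back the journal mode. It uses only the standard `sqlite3` module, and the file name is arbitrary.

```python
import sqlite3
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    db_file = Path(tmpdir) / "wal_demo.db"
    con = sqlite3.connect(db_file)
    try:
        # The PRAGMA returns the journal mode that is now in effect
        mode = con.execute("PRAGMA journal_mode=WAL;").fetchone()[0]
        con.execute("PRAGMA synchronous=NORMAL;")
        print(f"journal_mode = {mode}")
    finally:
        con.close()
```

WAL is a property of the database file, so it persists across connections. An in-memory database reports `memory` instead.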

In [10]:
# Module 5: Database Operations
# Persistence for forecasts, metrics, and alerts

import json
import sqlite3
import tempfile


def connect(db_path: str) -> sqlite3.Connection:
    """Connect to SQLite with optimized settings."""
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA journal_mode=WAL;")  # Write-Ahead Logging for performance
    con.execute("PRAGMA synchronous=NORMAL;")
    return con


def init_renewable_db(db_path: str) -> None:
    """Initialize database schema."""
    con = connect(db_path)
    cur = con.cursor()
    
    # Forecasts with dual prediction intervals
    cur.execute("""
    CREATE TABLE IF NOT EXISTS renewable_forecasts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        run_id TEXT NOT NULL,
        created_at TEXT NOT NULL,
        unique_id TEXT NOT NULL,
        region TEXT NOT NULL,
        fuel_type TEXT NOT NULL,
        ds TEXT NOT NULL,
        model TEXT NOT NULL,
        yhat REAL,
        yhat_lo_80 REAL,
        yhat_hi_80 REAL,
        yhat_lo_95 REAL,
        yhat_hi_95 REAL,
        UNIQUE (run_id, model, unique_id, ds)
    );
    """)
    
    # Evaluation scores
    cur.execute("""
    CREATE TABLE IF NOT EXISTS renewable_scores (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        scored_at TEXT NOT NULL,
        run_id TEXT NOT NULL,
        unique_id TEXT NOT NULL,
        model TEXT NOT NULL,
        rmse REAL,
        mae REAL,
        coverage_80 REAL,
        coverage_95 REAL
    );
    """)
    
    # Weather features
    cur.execute("""
    CREATE TABLE IF NOT EXISTS weather_features (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        region TEXT NOT NULL,
        ds TEXT NOT NULL,
        temperature_2m REAL,
        wind_speed_10m REAL,
        wind_speed_100m REAL,
        direct_radiation REAL,
        cloud_cover REAL,
        UNIQUE (region, ds)
    );
    """)
    
    # Drift alerts
    cur.execute("""
    CREATE TABLE IF NOT EXISTS drift_alerts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        alert_at TEXT NOT NULL,
        run_id TEXT,
        unique_id TEXT,
        alert_type TEXT NOT NULL,
        severity TEXT NOT NULL,
        current_rmse REAL,
        threshold_rmse REAL,
        message TEXT
    );
    """)
    
    # Baseline metrics for drift detection
    cur.execute("""
    CREATE TABLE IF NOT EXISTS baseline_metrics (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        created_at TEXT NOT NULL,
        unique_id TEXT NOT NULL,
        model TEXT NOT NULL,
        rmse_mean REAL NOT NULL,
        rmse_std REAL NOT NULL,
        drift_threshold_rmse REAL NOT NULL,
        UNIQUE (unique_id, model)
    );
    """)
    
    con.commit()
    con.close()


def save_forecasts(db_path: str, forecasts_df: pd.DataFrame, run_id: str, model: str = "MSTL_ARIMA") -> int:
    """Save forecasts to database."""
    con = connect(db_path)
    created_at = datetime.utcnow().isoformat()
    
    rows = []
    for _, row in forecasts_df.iterrows():
        unique_id = row["unique_id"]
        parts = unique_id.split("_")
        region = parts[0] if len(parts) > 0 else ""
        fuel_type = parts[1] if len(parts) > 1 else ""
        
        rows.append((
            run_id, created_at, unique_id, region, fuel_type, str(row["ds"]), model,
            row.get("yhat"), row.get("yhat_lo_80"), row.get("yhat_hi_80"),
            row.get("yhat_lo_95"), row.get("yhat_hi_95"),
        ))
    
    cur = con.cursor()
    cur.executemany("""
        INSERT OR REPLACE INTO renewable_forecasts
        (run_id, created_at, unique_id, region, fuel_type, ds, model,
         yhat, yhat_lo_80, yhat_hi_80, yhat_lo_95, yhat_hi_95)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, rows)
    
    con.commit()
    con.close()
    return len(rows)


def save_drift_alert(
    db_path: str, run_id: str, unique_id: str,
    current_rmse: float, threshold_rmse: float, severity: str = "warning"
) -> None:
    """Save drift detection alert."""
    con = connect(db_path)
    
    alert_type = "drift_detected" if current_rmse > threshold_rmse else "drift_check"
    message = f"RMSE {current_rmse:.1f} {'>' if current_rmse > threshold_rmse else '<='} threshold {threshold_rmse:.1f}"
    
    cur = con.cursor()
    cur.execute("""
        INSERT INTO drift_alerts (alert_at, run_id, unique_id, alert_type, severity, current_rmse, threshold_rmse, message)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?)
    """, (datetime.utcnow().isoformat(), run_id, unique_id, alert_type, severity, current_rmse, threshold_rmse, message))
    
    con.commit()
    con.close()


def get_recent_forecasts(db_path: str, hours: int = 48) -> pd.DataFrame:
    """Get recent forecasts from database."""
    con = connect(db_path)
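    # Bind the look-back window as a parameter rather than formatting it into the SQL
    query = """
        SELECT * FROM renewable_forecasts
        WHERE datetime(created_at) > datetime('now', ?)
        ORDER BY ds DESC
    """
    df = pd.read_sql_query(query, con, params=(f"-{int(hours)} hours",))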
    con.close()
    return df


def get_drift_alerts(db_path: str, hours: int = 24) -> pd.DataFrame:
    """Get recent drift alerts."""
    con = connect(db_path)
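    # Bind the look-back window as a parameter rather than formatting it into the SQL
    query = """
        SELECT * FROM drift_alerts
        WHERE datetime(alert_at) > datetime('now', ?)
        ORDER BY alert_at DESC
    """
    df = pd.read_sql_query(query, con, params=(f"-{int(hours)} hours",))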
    con.close()
    return df
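
The schema defines a `baseline_metrics` table, but this cell does not show a writer for it. One possible shape is sketched below. `save_baseline` is a hypothetical helper, not a function from `db.py`. It relies on the table's `UNIQUE (unique_id, model)` constraint so that `INSERT OR REPLACE` keeps a single, latest baseline per series and model. The sketch uses an in-memory database and recreates only that table so it runs on its own. The baseline values are illustrative.

```python
import sqlite3
from datetime import datetime, timezone

BASELINE_DDL = """
CREATE TABLE IF NOT EXISTS baseline_metrics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT NOT NULL,
    unique_id TEXT NOT NULL,
    model TEXT NOT NULL,
    rmse_mean REAL NOT NULL,
    rmse_std REAL NOT NULL,
    drift_threshold_rmse REAL NOT NULL,
    UNIQUE (unique_id, model)
);
"""


def save_baseline(con: sqlite3.Connection, unique_id: str, baseline: dict) -> None:
    """Hypothetical helper: store the latest baseline for one series/model pair."""
    con.execute(
        """
        INSERT OR REPLACE INTO baseline_metrics
        (created_at, unique_id, model, rmse_mean, rmse_std, drift_threshold_rmse)
        VALUES (?, ?, ?, ?, ?, ?)
        """,
        (
            datetime.now(timezone.utc).isoformat(),
            unique_id,
            baseline["model"],
            baseline["rmse_mean"],
            baseline["rmse_std"],
            baseline["drift_threshold_rmse"],
        ),
    )
    con.commit()


con = sqlite3.connect(":memory:")
con.execute(BASELINE_DDL)

# Illustrative baselines: the second call replaces the first for the same key
first = {"model": "MSTL_ARIMA", "rmse_mean": 6.0, "rmse_std": 0.3, "drift_threshold_rmse": 6.6}
second = {"model": "MSTL_ARIMA", "rmse_mean": 5.5, "rmse_std": 0.2, "drift_threshold_rmse": 5.9}
save_baseline(con, "CALI_WND", first)
save_baseline(con, "CALI_WND", second)

stored = con.execute(
    "SELECT unique_id, rmse_mean, drift_threshold_rmse FROM baseline_metrics"
).fetchall()
print(stored)
con.close()
```

Note that `compute_baseline_metrics` as written pools all series within each cutoff. A per-series baseline would require filtering the CV results by `unique_id` first.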

### Example: Database Operations

In [11]:
# Example run - test database operations

with tempfile.TemporaryDirectory() as tmpdir:
    db_path = f"{tmpdir}/test_renewable.db"
    
    print("=== Initializing Database ===")
    init_renewable_db(db_path)
    
    # Check tables
    con = connect(db_path)
    cur = con.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
    tables = [row[0] for row in cur.fetchall()]
    print(f"Tables created: {tables}")
    con.close()
    
    print("\n=== Saving Test Forecasts ===")
    test_forecasts = pd.DataFrame({
        "unique_id": ["CALI_WND", "CALI_WND"],
        "ds": [datetime.utcnow(), datetime.utcnow() + timedelta(hours=1)],
        "yhat": [100.0, 105.0],
        "yhat_lo_80": [90.0, 95.0],
        "yhat_hi_80": [110.0, 115.0],
        "yhat_lo_95": [80.0, 85.0],
        "yhat_hi_95": [120.0, 125.0],
    })
    rows = save_forecasts(db_path, test_forecasts, run_id="test_run")
    print(f"Saved {rows} forecast rows")
    
    print("\n=== Saving Drift Alert ===")
    save_drift_alert(
        db_path,
        run_id="test_run",
        unique_id="CALI_WND",
        current_rmse=150.0,
        threshold_rmse=100.0,
        severity="warning",
    )
    print("Alert saved")
    
    print("\n=== Retrieving Data ===")
    forecasts = get_recent_forecasts(db_path)
    print(f"Retrieved {len(forecasts)} forecasts")
    
    alerts = get_drift_alerts(db_path)
    print(f"Retrieved {len(alerts)} alerts")
    print(alerts[["alert_at", "unique_id", "severity", "message"]].to_string(index=False))

=== Initializing Database ===
Tables created: ['renewable_forecasts', 'sqlite_sequence', 'renewable_scores', 'weather_features', 'drift_alerts', 'baseline_metrics']

=== Saving Test Forecasts ===
Saved 2 forecast rows

=== Saving Drift Alert ===
Alert saved

=== Retrieving Data ===
Retrieved 2 forecasts
Retrieved 1 alerts
                  alert_at unique_id severity                      message



---

# Module 6: Pipeline Tasks

**File:** `src/renewable/tasks.py`

This module orchestrates the complete pipeline:

1. **Fetch generation data** from EIA
2. **Fetch weather data** from Open-Meteo
3. **Train models** with cross-validation
4. **Generate forecasts** with prediction intervals
5. **Compute drift metrics** vs baseline

## Key Feature: Adaptive CV

Cross-validation requires sufficient data:
```
Minimum rows = horizon + (n_windows × step_size)
```

For short series, we **adapt** the CV settings automatically.
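
As a worked example, the default settings (`horizon=24`, `cv_windows=5`, `cv_step_size=168`) need 24 + 5 × 168 = 864 rows by the rule above, which is more than a 15-day hourly series (360 rows) provides. The sketch below mirrors the adaptation arithmetic used in `run_full_pipeline`. The function names `adapt_cv_settings` and `min_rows` are introduced here for illustration only.

```python
def adapt_cv_settings(
    min_series_len: int,
    horizon: int = 24,
    cv_windows: int = 5,
    cv_step_size: int = 168,
) -> tuple[int, int]:
    """Shrink step size and window count to fit the shortest series."""
    available_for_cv = min_series_len - horizon
    step_size = min(cv_step_size, max(24, available_for_cv // 3))
    n_windows = min(cv_windows, max(2, available_for_cv // step_size))
    return n_windows, step_size


def min_rows(horizon: int, n_windows: int, step_size: int) -> int:
    """Minimum rows = horizon + (n_windows x step_size)."""
    return horizon + n_windows * step_size


print(min_rows(24, 5, 168))  # requirement under the default settings

for n_rows in (360, 720):  # 15 and 30 days of hourly data
    n_windows, step_size = adapt_cv_settings(n_rows)
    print(n_rows, n_windows, step_size, min_rows(24, n_windows, step_size))
```

Because of the `max(24, ...)` and `max(2, ...)` floors, a series shorter than 72 rows (with a 24h horizon) still ends up requesting more data than it has, so a length check before calling cross-validation may be worth adding.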

In [12]:
# Module 6: Pipeline Tasks
# Orchestrates the complete forecasting pipeline

from datetime import timezone


@dataclass
class RenewablePipelineConfig:
    """Configuration for the renewable forecasting pipeline."""
    
    # Data parameters
    regions: list[str] = field(default_factory=lambda: ["CALI", "ERCO", "MISO"])
    fuel_types: list[str] = field(default_factory=lambda: ["WND", "SUN"])
    start_date: str = ""
    end_date: str = ""
    lookback_days: int = 30
    
    # Forecast parameters
    horizon: int = 24
    confidence_levels: tuple[int, int] = (80, 95)
    
    # CV parameters
    cv_windows: int = 5
    cv_step_size: int = 168  # 1 week
    
    # Output paths
    data_dir: str = "data/renewable"
    overwrite: bool = False
    
    def __post_init__(self):
        """Set default dates if not provided."""
        if not self.end_date:
            self.end_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        if not self.start_date:
            end = datetime.strptime(self.end_date, "%Y-%m-%d")
            start = end - timedelta(days=self.lookback_days)
            self.start_date = start.strftime("%Y-%m-%d")
    
    def generation_path(self) -> Path:
        return Path(self.data_dir) / "generation.parquet"
    
    def weather_path(self) -> Path:
        return Path(self.data_dir) / "weather.parquet"
    
    def forecasts_path(self) -> Path:
        return Path(self.data_dir) / "forecasts.parquet"


def run_full_pipeline(config: RenewablePipelineConfig) -> dict:
    """Run the complete renewable forecasting pipeline.
    
    Steps:
    1. Fetch generation data from EIA (or load the cached file)
    2. Fetch weather data from Open-Meteo (or load the cached file)
    3. Train models with adaptive cross-validation and compute the drift baseline
    4. Generate forecasts with prediction intervals
    """
    print(f"Pipeline: {config.start_date} to {config.end_date}")
    print(f"Regions: {config.regions}")
    print(f"Fuel types: {config.fuel_types}")
    
    results = {}
    
    # Step 1: Fetch generation (or use cached)
    generation_path = config.generation_path()
    if generation_path.exists() and not config.overwrite:
        print(f"\n[Step 1] Loading cached generation data")
        generation_df = pd.read_parquet(generation_path)
    else:
        print(f"\n[Step 1] Fetching generation data from EIA...")
        fetcher = EIARenewableFetcher()
        all_dfs = []
        for fuel_type in config.fuel_types:
            df = fetcher.fetch_all_regions(
                fuel_type=fuel_type,
                start_date=config.start_date,
                end_date=config.end_date,
                regions=config.regions,
            )
            all_dfs.append(df)
        generation_df = pd.concat(all_dfs, ignore_index=True)
        generation_path.parent.mkdir(parents=True, exist_ok=True)
        generation_df.to_parquet(generation_path, index=False)
    
    results["generation_rows"] = len(generation_df)
    results["series_count"] = generation_df["unique_id"].nunique()
    print(f"Generation: {results['generation_rows']} rows, {results['series_count']} series")
    
    # Step 2: Fetch weather (or use cached)
    weather_path = config.weather_path()
    if weather_path.exists() and not config.overwrite:
        print(f"\n[Step 2] Loading cached weather data")
        weather_df = pd.read_parquet(weather_path)
    else:
        print(f"\n[Step 2] Fetching weather data from Open-Meteo...")
        weather_api = OpenMeteoRenewable()
        weather_df = weather_api.fetch_all_regions_historical(
            regions=config.regions,
            start_date=config.start_date,
            end_date=config.end_date,
        )
        weather_df.to_parquet(weather_path, index=False)
    
    results["weather_rows"] = len(weather_df)
    print(f"Weather: {results['weather_rows']} rows")
    
    # Step 3: Train with adaptive CV
    print(f"\n[Step 3] Training models with cross-validation...")
    model = RenewableForecastModel(
        horizon=config.horizon,
        confidence_levels=config.confidence_levels,
    )
    
    # Adaptive CV settings based on data length
    min_series_len = generation_df.groupby("unique_id").size().min()
    available_for_cv = min_series_len - config.horizon
    
    step_size = min(config.cv_step_size, max(24, available_for_cv // 3))
    n_windows = min(config.cv_windows, max(2, available_for_cv // step_size))
    
    print(f"Adaptive CV: {n_windows} windows, step={step_size}h (min_series={min_series_len} rows)")
    
    cv_results, leaderboard = model.cross_validate(
        df=generation_df,
        weather_df=weather_df,
        n_windows=n_windows,
        step_size=step_size,
    )
    
    best_model = leaderboard.iloc[0]["model"]
    baseline = compute_baseline_metrics(cv_results, model_name=best_model)
    
    results["best_model"] = best_model
    results["best_rmse"] = float(leaderboard.iloc[0]["rmse"])
    results["baseline"] = baseline
    print(f"Best model: {best_model}, RMSE: {results['best_rmse']:.1f}")
    
    # Step 4: Generate forecasts
    print(f"\n[Step 4] Generating {config.horizon}h forecasts...")
    model.fit(generation_df, weather_df)
    forecasts = model.predict()
    
    forecasts_path = config.forecasts_path()
    forecasts.to_parquet(forecasts_path, index=False)
    
    results["forecast_rows"] = len(forecasts)
    print(f"Forecasts saved: {results['forecast_rows']} rows")
    
    return results
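
Steps 1 and 2 share the same load-or-fetch pattern: reuse the cached file unless it is missing or `overwrite` is set. A generic version might look like the sketch below. `load_or_fetch` and `fake_fetch` are hypothetical, and the sketch caches JSON rather than Parquet purely to stay dependency-free.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def load_or_fetch(path: Path, fetch: Callable[[], list], overwrite: bool = False) -> list:
    """Return cached records from `path`, calling `fetch` only when needed."""
    if path.exists() and not overwrite:
        return json.loads(path.read_text())
    records = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)  # create the cache directory first
    path.write_text(json.dumps(records))
    return records


calls = []


def fake_fetch() -> list:
    """Stand-in for an API call; records how often it is invoked."""
    calls.append(1)
    return [{"unique_id": "CALI_WND", "y": 100.0}]


with tempfile.TemporaryDirectory() as tmpdir:
    cache = Path(tmpdir) / "renewable" / "generation.json"
    first = load_or_fetch(cache, fake_fetch)                  # fetches and writes the cache
    second = load_or_fetch(cache, fake_fetch)                 # served from the cache
    third = load_or_fetch(cache, fake_fetch, overwrite=True)  # forces a refetch

print(f"fetch calls: {len(calls)}")
```

Creating the parent directory inside the helper means no step has to rely on an earlier one having created it.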

### Example: Running the Full Pipeline

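This example mirrors the pipeline's training and forecasting steps on synthetic data, so no API calls are made. It drives `RenewableForecastModel` directly rather than calling `run_full_pipeline`.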

In [13]:
# Example run - demonstrate pipeline with synthetic data

print("=== Running Pipeline with Synthetic Data ===")

# Create synthetic data (simulates what the pipeline would fetch)
np.random.seed(42)
dates = pd.date_range("2024-01-01", periods=360, freq="h")  # 15 days

# Synthetic generation data
generation_data = []
for region in ["CALI", "ERCO"]:
    for fuel in ["WND"]:
        unique_id = f"{region}_{fuel}"
        # Simulate generation with daily pattern
        y = 100 + 30 * np.sin(np.arange(360) * 2 * np.pi / 24) + np.random.normal(0, 10, 360)
        for i, date in enumerate(dates):
            generation_data.append({"unique_id": unique_id, "ds": date, "y": max(0, y[i])})

generation_df = pd.DataFrame(generation_data)
print(f"Synthetic generation: {len(generation_df)} rows, {generation_df['unique_id'].nunique()} series")

# Synthetic weather data
weather_data = []
for region in ["CALI", "ERCO"]:
    for i, date in enumerate(dates):
        weather_data.append({
            "ds": date,
            "region": region,
            "temperature_2m": 15 + 10 * np.sin(i * 2 * np.pi / 24),
            "wind_speed_10m": 5 + 3 * np.random.random(),
            "wind_speed_100m": 8 + 4 * np.random.random(),
            "direct_radiation": max(0, 500 * np.sin((date.hour - 6) * np.pi / 12)) if 6 < date.hour < 18 else 0,
        })

weather_df = pd.DataFrame(weather_data)
print(f"Synthetic weather: {len(weather_df)} rows")

# Run model training and forecasting
print("\n--- Training and Cross-Validation ---")
model = RenewableForecastModel(horizon=24, confidence_levels=(80, 95))

# CV settings chosen by hand to fit the short 15-day series
cv_results, leaderboard = model.cross_validate(
    df=generation_df,
    weather_df=weather_df,
    n_windows=3,
    step_size=72,
)

print("\nModel Leaderboard:")
print(leaderboard.to_string(index=False))

print("\n--- Generating Forecasts ---")
model.fit(generation_df, weather_df)
forecasts = model.predict()

print(f"\nForecast shape: {forecasts.shape}")
print(f"Columns: {forecasts.columns.tolist()}")
print("\nSample forecasts:")
print(forecasts[["unique_id", "ds", "yhat", "yhat_lo_80", "yhat_hi_80"]].head(10).to_string(index=False))

=== Running Pipeline with Synthetic Data ===
Synthetic generation: 720 rows, 2 series
Synthetic weather: 720 rows

--- Training and Cross-Validation ---
Running CV: 3 windows, step=72h, horizon=24h

Model Leaderboard:
        model      rmse       mae  coverage_80  coverage_95
      AutoETS 10.277663  8.030556    75.694444    96.527778
    AutoARIMA 10.620644  8.273038    79.166667    95.833333
SeasonalNaive 13.564617 10.646426    83.333333    95.833333
   MSTL_ARIMA 14.122738 10.984680    34.722222    56.944444
        index 57.895321 47.814595          NaN          NaN

--- Generating Forecasts ---
Model fit: 720 rows, 2 series
Predictions generated: 48 rows, 24h horizon

Forecast shape: (48, 28)
Columns: ['index', 'unique_id', 'ds', 'AutoARIMA', 'AutoARIMA-lo-95', 'AutoARIMA-lo-80', 'AutoARIMA-hi-80', 'AutoARIMA-hi-95', 'SeasonalNaive', 'SeasonalNaive-lo-80', 'SeasonalNaive-lo-95', 'SeasonalNaive-hi-80', 'SeasonalNaive-hi-95', 'AutoETS', 'AutoETS-lo-95', 'AutoETS-lo-80', 'AutoETS-hi

---

# Module 7: End-to-End Demonstration

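Let's tie it all together. This cell summarizes the objects created in the previous example (`generation_df`, `weather_df`, `cv_results`, `leaderboard`, `forecasts`), so run that cell first.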

In [14]:
# Complete end-to-end demonstration

print("="*60)
print("RENEWABLE ENERGY FORECASTING - END-TO-END DEMO")
print("="*60)

# 1. Configuration
print("\n1. CONFIGURATION")
print("-"*40)
print(f"Regions available: {list_regions()[:5]}... ({len(REGIONS)} total)")
print(f"Fuel types: {list(FUEL_TYPES.keys())}")

# 2. Data Summary
print("\n2. DATA SUMMARY")
print("-"*40)
print(f"Generation series: {generation_df['unique_id'].unique().tolist()}")
print(f"Date range: {generation_df['ds'].min()} to {generation_df['ds'].max()}")
print(f"Weather features: {[c for c in weather_df.columns if c not in ['ds', 'region']]}")

# 3. Model Performance
print("\n3. MODEL PERFORMANCE")
print("-"*40)
best_model = leaderboard.iloc[0]
print(f"Best model: {best_model['model']}")
print(f"RMSE: {best_model['rmse']:.2f}")
print(f"MAE: {best_model['mae']:.2f}")
if 'coverage_80' in leaderboard.columns:
    print(f"80% Coverage: {best_model.get('coverage_80', 'N/A')}")
    print(f"95% Coverage: {best_model.get('coverage_95', 'N/A')}")

# 4. Forecast Output
print("\n4. FORECAST OUTPUT")
print("-"*40)
print(f"Horizon: 24 hours")
print(f"Total predictions: {len(forecasts)}")
print(f"\nSample forecast for {forecasts['unique_id'].iloc[0]}:")
sample = forecasts[forecasts['unique_id'] == forecasts['unique_id'].iloc[0]].head(6)
print(sample[["ds", "yhat", "yhat_lo_80", "yhat_hi_80"]].to_string(index=False))

# 5. Drift Detection Setup
print("\n5. DRIFT DETECTION")
print("-"*40)
baseline = compute_baseline_metrics(cv_results, model_name=best_model['model'])
print(f"Baseline RMSE: {baseline['rmse_mean']:.2f} ± {baseline['rmse_std']:.2f}")
print(f"Drift threshold: {baseline['drift_threshold_rmse']:.2f}")
print("\n(Production system would alert when RMSE exceeds threshold)")

print("\n" + "="*60)
print("DEMO COMPLETE")
print("="*60)

RENEWABLE ENERGY FORECASTING - END-TO-END DEMO

1. CONFIGURATION
----------------------------------------
Regions available: ['CALI', 'CAR', 'CENT', 'ERCO', 'FLA']... (15 total)
Fuel types: ['WND', 'SUN']

2. DATA SUMMARY
----------------------------------------
Generation series: ['CALI_WND', 'ERCO_WND']
Date range: 2024-01-01 00:00:00 to 2024-01-15 23:00:00
Weather features: ['temperature_2m', 'wind_speed_10m', 'wind_speed_100m', 'direct_radiation']

3. MODEL PERFORMANCE
----------------------------------------
Best model: AutoETS
RMSE: 10.28
MAE: 8.03
80% Coverage: 75.69444444444444
95% Coverage: 96.52777777777779

4. FORECAST OUTPUT
----------------------------------------
Horizon: 24 hours
Total predictions: 48

Sample forecast for CALI_WND:
                 ds       yhat  yhat_lo_80  yhat_hi_80
2024-01-16 00:00:00  95.900289   87.827969  103.972608
2024-01-16 01:00:00 101.007711   92.934154  109.081268
2024-01-16 02:00:00 116.013987  107.873635  124.154339
2024-01-16 03:00:00 125

---

# Module 8: Dashboard

**File:** `src/renewable/dashboard.py`

The Streamlit dashboard provides:
- **Forecast visualization** with prediction intervals
- **Drift monitoring** and alerts
- **Coverage analysis** (nominal vs empirical)
- **Weather features** by region
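
The coverage comparison reduces to a subtraction: empirical coverage minus the nominal level. The sketch below applies it to two rows of the synthetic-data leaderboard from the pipeline example. How the dashboard itself computes or displays this is not shown in this notebook, so `calibration_gap` is illustrative only.

```python
def calibration_gap(empirical: dict[int, float]) -> dict[int, float]:
    """Empirical minus nominal coverage, in percentage points, per level."""
    return {level: round(observed - level, 2) for level, observed in empirical.items()}


# Coverage values copied from the pipeline example's leaderboard
leaderboard_coverage = {
    "AutoETS": {80: 75.694444, 95: 96.527778},
    "MSTL_ARIMA": {80: 34.722222, 95: 56.944444},
}

gaps = {model: calibration_gap(cov) for model, cov in leaderboard_coverage.items()}
for model, gap in gaps.items():
    print(model, gap)
```

Negative gaps mean the intervals are too narrow. In that run, MSTL_ARIMA's 80% interval covered only about 35% of actuals, while AutoETS stayed within a few points of nominal.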

## Running the Dashboard

```bash
streamlit run src/renewable/dashboard.py
```

The dashboard will:
1. Load forecasts from `data/renewable/forecasts.parquet`
2. Display interactive charts with Plotly
3. Show drift alerts from the database

In [15]:
# Dashboard feature summary (the dashboard itself runs with Streamlit, not in Jupyter)

dashboard_preview = '''
# Key Dashboard Components:

## 1. Forecast Tab
- Interactive line chart with 80% and 95% prediction intervals
- Series selector (e.g., CALI_WND, ERCO_SUN)
- Data table with forecast details

## 2. Drift Monitor Tab  
- Real-time drift status (stable/warning/critical)
- Alert history with timestamps
- RMSE vs threshold visualization

## 3. Coverage Tab
- Nominal vs empirical coverage comparison
- Bar chart showing calibration quality
- Per-series coverage breakdown

## 4. Weather Tab
- Wind speed at 10m and 100m by region
- Solar radiation patterns
- Cloud cover trends
'''

print(dashboard_preview)

print("\n" + "="*50)
print("To launch the dashboard, run:")
print("="*50)
print("\n  streamlit run src/renewable/dashboard.py")
print("\nThe dashboard will open in your browser.")


# Key Dashboard Components:

## 1. Forecast Tab
- Interactive line chart with 80% and 95% prediction intervals
- Series selector (e.g., CALI_WND, ERCO_SUN)
- Data table with forecast details

## 2. Drift Monitor Tab  
- Real-time drift status (stable/warning/critical)
- Alert history with timestamps
- RMSE vs threshold visualization

## 3. Coverage Tab
- Nominal vs empirical coverage comparison
- Bar chart showing calibration quality
- Per-series coverage breakdown

## 4. Weather Tab
- Wind speed at 10m and 100m by region
- Solar radiation patterns
- Cloud cover trends


To launch the dashboard, run:

  streamlit run src/renewable/dashboard.py

The dashboard will open in your browser.


---

# Summary

## What We Built

| Module | Purpose | Key Concept |
|--------|---------|-------------|
| `regions.py` | Region definitions | EIA codes + coordinates |
| `eia_renewable.py` | Data fetching | StatsForecast format |
| `open_meteo.py` | Weather integration | Leakage prevention |
| `modeling.py` | Forecasting | Probabilistic intervals |
| `db.py` | Persistence | SQLite with WAL |
| `tasks.py` | Pipeline orchestration | Adaptive CV |
| `dashboard.py` | Visualization | Streamlit + Plotly |

## Key Takeaways

1. **StatsForecast format**: `[unique_id, ds, y]` enables multi-series modeling
2. **No MAPE for renewables**: Solar has zeros - use RMSE/MAE instead
3. **Weather leakage**: Use forecast weather for predictions, not historical
4. **Drift detection**: threshold = baseline_mean + 2 × baseline_std
5. **Adaptive CV**: Shrink the step size and window count for short time series
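
Takeaway 2 is easy to demonstrate. In the plain-Python sketch below, the helper functions are written for illustration (the notebook's own `ForecastMetrics` class is defined elsewhere), and the values are made up to resemble a solar series with night-time zeros.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when any actual is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Two night-time hours (zero generation) followed by two daytime hours
y_true = [0.0, 0.0, 50.0, 100.0]
y_pred = [0.0, 5.0, 45.0, 110.0]

rmse_value = rmse(y_true, y_pred)
mae_value = mae(y_true, y_pred)

try:
    mape_value = mape(y_true, y_pred)
except ZeroDivisionError:
    mape_value = None  # MAPE cannot be computed once an actual value is zero

print(f"RMSE={rmse_value:.2f}, MAE={mae_value:.2f}, MAPE={mape_value}")
```

With NumPy arrays the same division produces `inf` or `nan` instead of raising, which is arguably worse because the bad value can propagate silently into a leaderboard.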

## Next Steps

1. Get an EIA API key and run with real data
2. Launch the dashboard: `streamlit run src/renewable/dashboard.py`
3. Experiment with different regions and fuel types
4. Set up scheduled pipeline runs for production

In [16]:
# Final test - verify all imports work

print("Verifying module imports...")

try:
    from src.renewable import (
        REGIONS,
        FUEL_TYPES,
        EIARenewableFetcher,
        OpenMeteoRenewable,
        RenewableForecastModel,
        RenewablePipelineConfig,
        init_renewable_db,
    )
    print("All imports successful!")
    print(f"\nAvailable regions: {len(REGIONS)}")
    print(f"Available fuel types: {list(FUEL_TYPES.keys())}")
except ImportError as e:
    print(f"Import error: {e}")
    print("\nMake sure the project root is on sys.path (see the Setup cell).")

Verifying module imports...
All imports successful!

Available regions: 15
Available fuel types: ['WND', 'SUN']
