# Renewable Energy Forecasting Pipeline

This notebook walks through building a **next-24h renewable generation forecast system** with:

- **EIA data integration** - Hourly wind/solar generation for US regions
- **Weather features** - Open-Meteo integration (wind speed, solar radiation)
- **Probabilistic forecasting** - Dual prediction intervals (80%, 95%)
- **Drift monitoring** - Automatic detection of model degradation

## Architecture Overview

```
EIA API (WND/SUN) ──┐
                    ├──► Data Pipeline ──► StatsForecast ──► Predictions
Open-Meteo API ─────┘         │                  │              │
                              ▼                  ▼              ▼
                         Validation        Multi-Series    Probabilistic
                         & Quality         [unique_id,     (80%, 95%
                                           ds, y, X]       intervals)
                                                              │
                                                              ▼
                                                         Streamlit
                                                         Dashboard
                                                         (drift, alerts)
```

## Key Concepts

1. **StatsForecast format**: `[unique_id, ds, y]` - where `unique_id` = `{region}_{fuel_type}`
2. **Zero-value handling**: Solar generates 0 at night - we use RMSE/MAE, NOT MAPE
3. **Leakage prevention**: At prediction time, use **forecasted** weather for the horizon, not observed weather from those same hours
4. **Drift detection**: Threshold = mean + 2*std from backtest
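
The drift rule in item 4 can be sketched in a few lines. This is a minimal illustration, assuming the tracked statistic is a per-window backtest RMSE and that the sample standard deviation is used (neither is stated at this point in the notebook). The name `is_drifting` and all numbers are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical per-window RMSE values from a backtest (illustrative numbers only).
backtest_rmse = [100.0, 110.0, 90.0, 105.0, 95.0]

# Drift threshold = mean + 2 * std of the backtest errors.
# Assumption: sample standard deviation (ddof=1), which is also the pandas default.
threshold = mean(backtest_rmse) + 2 * stdev(backtest_rmse)


def is_drifting(live_rmse: float, threshold: float) -> bool:
    """Flag drift when the live error exceeds the backtest-derived threshold."""
    return live_rmse > threshold


print(f"threshold={threshold:.1f}")
print(is_drifting(104.0, threshold), is_drifting(150.0, threshold))
```

With these numbers the threshold lands near 115.8, so a live RMSE of 104 passes and 150 is flagged.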

## Setup

First, let's ensure we have the project root in our path and configure logging.

In [2]:
import sys
import logging
from pathlib import Path
import os 

# Add project root to path
project_root = r"c:\docker_projects\atsaf"
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

if os.getcwd() != str(project_root):
    os.chdir(project_root)
    print(f"Changed working directory to project root: {project_root} we are currently at {os.getcwd()}")

# Configure logging for visibility
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

print(f"Project root: {project_root}")

Changed working directory to project root: c:\docker_projects\atsaf we are currently at c:\docker_projects\atsaf
Project root: c:\docker_projects\atsaf


---

# Module 1: Region Definitions

**File:** `src/renewable/regions.py`

This module maps **EIA balancing authority regions** to their geographic coordinates. Why do we need coordinates?

- **Weather API lookup**: Open-Meteo requires latitude/longitude
- **Regional analysis**: Compare forecast accuracy across regions
- **Timezone handling**: Each region has a primary timezone

## Key Design Decisions

1. **NamedTuple for RegionInfo**: Immutable, type-safe, and memory-efficient
2. **Centroid coordinates**: Approximate centers - good enough for hourly weather
3. **Fuel type codes**: `WND` (wind), `SUN` (solar) - match EIA's API

In [None]:
%%writefile src/renewable/regions.py
# src/renewable/regions.py
from __future__ import annotations

from typing import NamedTuple, Optional


class RegionInfo(NamedTuple):
    """Region metadata for EIA and weather lookups."""
    name: str
    lat: float
    lon: float
    timezone: str
    # Some internal regions may not map cleanly to an EIA respondent.
    # We keep them in REGIONS for weather/features, but EIA fetch requires this.
    eia_respondent: Optional[str] = None


REGIONS: dict[str, RegionInfo] = {
    # Western Interconnection
    "CALI": RegionInfo(
        name="California ISO",
        lat=36.7,
        lon=-119.4,
        timezone="America/Los_Angeles",
        eia_respondent="CISO",
    ),
    "NW": RegionInfo(
        name="Northwest",
        lat=45.5,
        lon=-122.0,
        timezone="America/Los_Angeles",
        eia_respondent=None,  # intentionally unset until verified
    ),
    "SW": RegionInfo(
        name="Southwest",
        lat=33.5,
        lon=-112.0,
        timezone="America/Phoenix",
        eia_respondent=None,  # intentionally unset until verified
    ),

    # Texas Interconnection
    "ERCO": RegionInfo(
        name="ERCOT (Texas)",
        lat=31.0,
        lon=-100.0,
        timezone="America/Chicago",
        eia_respondent="ERCO",
    ),

    # Midwest
    "MISO": RegionInfo(
        name="Midcontinent ISO",
        lat=41.0,
        lon=-93.0,
        timezone="America/Chicago",
        eia_respondent="MISO",
    ),

    # Internal/aggregate regions kept for non-EIA use (weather/features/etc.)
    "SE": RegionInfo(name="Southeast", lat=33.0, lon=-84.0, timezone="America/New_York", eia_respondent=None),
    "FLA": RegionInfo(name="Florida", lat=28.0, lon=-82.0, timezone="America/New_York", eia_respondent=None),
    "CAR": RegionInfo(name="Carolinas", lat=35.5, lon=-80.0, timezone="America/New_York", eia_respondent=None),
    "TEN": RegionInfo(name="Tennessee Valley", lat=35.5, lon=-86.0, timezone="America/Chicago", eia_respondent=None),

    "US48": RegionInfo(name="Lower 48 States", lat=39.8, lon=-98.5, timezone="America/Chicago", eia_respondent=None),
}

FUEL_TYPES = {"WND": "Wind", "SUN": "Solar"}


def list_regions() -> list[str]:
    return sorted(REGIONS.keys())


def get_region_info(region_code: str) -> RegionInfo:
    return REGIONS[region_code]


def get_region_coords(region_code: str) -> tuple[float, float]:
    r = REGIONS[region_code]
    return (r.lat, r.lon)


def get_eia_respondent(region_code: str) -> str:
    """Return the code EIA expects for facets[respondent][]. Fail loudly if missing."""
    info = REGIONS[region_code]
    if not info.eia_respondent:
        raise ValueError(
            f"Region '{region_code}' has no configured eia_respondent. "
            f"Set REGIONS['{region_code}'].eia_respondent to a verified EIA respondent code "
            f"before using it for EIA fetches."
        )
    return info.eia_respondent


def validate_region(region_code: str) -> bool:
    return region_code in REGIONS


def validate_fuel_type(fuel_type: str) -> bool:
    return fuel_type in FUEL_TYPES



if __name__ == "__main__":
    # Example run - test region functions

    print("=== Available Regions ===")
    print(f"Total regions: {len(REGIONS)}")
    print(f"Region codes: {list_regions()}")

    print("\n=== Example: California ===")
    cali_info = get_region_info("CALI")
    print(f"Name: {cali_info.name}")
    print(f"Coordinates: ({cali_info.lat}, {cali_info.lon})")
    print(f"Timezone: {cali_info.timezone}")

    print("\n=== Weather API Coordinates ===")
    for region in ["CALI", "ERCO", "MISO"]:
        lat, lon = get_region_coords(region)
        print(f"{region}: lat={lat}, lon={lon}")

    print("\n=== Fuel Types ===")
    for code, name in FUEL_TYPES.items():
        print(f"{code}: {name}")

    print("\n=== Validation ===")
    print(f"validate_region('CALI'): {validate_region('CALI')}")
    print(f"validate_region('INVALID'): {validate_region('INVALID')}")
    print(f"validate_fuel_type('WND'): {validate_fuel_type('WND')}")


Overwriting src/renewable/regions.py



### Example: Using Region Definitions
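
A short usage sketch for the region definitions. To keep it runnable outside the repository, it re-declares `RegionInfo` and a trimmed two-entry copy of `REGIONS` instead of importing them. Inside the project you would import both from `src.renewable.regions`.

```python
from typing import NamedTuple, Optional


class RegionInfo(NamedTuple):
    name: str
    lat: float
    lon: float
    timezone: str
    eia_respondent: Optional[str] = None


# Trimmed copy of two REGIONS entries so the snippet runs on its own.
REGIONS = {
    "CALI": RegionInfo("California ISO", 36.7, -119.4, "America/Los_Angeles", "CISO"),
    "NW": RegionInfo("Northwest", 45.5, -122.0, "America/Los_Angeles"),
}

cali = REGIONS["CALI"]
lat, lon = cali.lat, cali.lon  # the pair a weather lookup needs
print(f"{cali.name}: lat={lat}, lon={lon}, respondent={cali.eia_respondent}")

# NamedTuple instances are immutable: attribute assignment raises AttributeError.
try:
    cali.lat = 0.0
except AttributeError as exc:
    print(f"immutable: {exc}")

# Regions without a verified respondent still work for weather lookups,
# but should be filtered out before any EIA fetch.
fetchable = [code for code, info in REGIONS.items() if info.eia_respondent]
print(fetchable)
```

Filtering on `eia_respondent` up front avoids the `ValueError` that `get_eia_respondent` raises for unverified regions.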

---

# Module 2: EIA Data Fetcher

**File:** `src/renewable/eia_renewable.py`

This module fetches **hourly wind and solar generation** from the EIA API.

## Critical Concepts

### StatsForecast Format
StatsForecast expects data in a specific format:
```
unique_id | ds                  | y
----------|---------------------|--------
CALI_WND  | 2024-01-01 00:00:00 | 1234.5
CALI_WND  | 2024-01-01 01:00:00 | 1456.7
ERCO_WND  | 2024-01-01 00:00:00 | 2345.6
```

- `unique_id`: Identifies the time series (e.g., "CALI_WND" = California Wind)
- `ds`: Datetime column (timezone-naive UTC)
- `y`: Target value (generation in MWh)
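
As a rough illustration of the layout above, the following sketch reshapes per-region rows into `[unique_id, ds, y]`. The `raw` frame is hypothetical sample data shaped like the fetcher's per-region output (`ds`, `value`, `region`, `fuel_type`).

```python
import pandas as pd

# Hypothetical raw rows, deliberately out of order.
raw = pd.DataFrame(
    {
        "ds": pd.to_datetime(
            ["2024-01-01 01:00", "2024-01-01 00:00", "2024-01-01 00:00"]
        ),
        "value": [1456.7, 1234.5, 2345.6],
        "region": ["CALI", "CALI", "ERCO"],
        "fuel_type": ["WND", "WND", "WND"],
    }
)

long_df = (
    raw.assign(unique_id=raw["region"] + "_" + raw["fuel_type"])
    .rename(columns={"value": "y"})[["unique_id", "ds", "y"]]
    .sort_values(["unique_id", "ds"])
    .reset_index(drop=True)
)

# One row per (series, timestamp); duplicates usually signal a fetch or merge problem.
assert not long_df.duplicated(["unique_id", "ds"]).any()
print(long_df)
```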

### API Rate Limiting
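- The EIA API is rate-limited (this pipeline assumes roughly 5 requests/second)
- We use controlled parallelism (`max_workers=3` by default) plus a `RATE_LIMIT_DELAY` of 0.2 s between paginated requests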
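
The module below sleeps between pages inside each worker. A process-wide limiter shared by all workers is one alternative, sketched here as an assumption and not as the module's actual approach. `throttled_call` and `MIN_INTERVAL` are hypothetical names, and the HTTP request is replaced by a timestamp so the snippet runs offline.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MIN_INTERVAL = 0.05  # seconds between request starts (0.2 would match 5 req/s)
_lock = threading.Lock()
_next_slot = 0.0
call_times: list[float] = []


def throttled_call(region: str) -> str:
    """Reserve the next free time slot under a lock, then wait for it."""
    global _next_slot
    with _lock:
        slot = max(time.monotonic(), _next_slot)
        _next_slot = slot + MIN_INTERVAL
    time.sleep(max(0.0, slot - time.monotonic()))
    call_times.append(time.monotonic())  # stand-in for requests.get(...)
    return region


start = time.monotonic()
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(throttled_call, ["CALI", "ERCO", "MISO", "CALI", "ERCO"]))

elapsed = max(call_times) - start
print(results, f"elapsed={elapsed:.2f}s")
```

Because slots are handed out under a lock, five calls take at least four intervals in total regardless of how many workers are running.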

In [None]:
%%writefile src/renewable/eia_renewable.py
# src/renewable/eia_renewable.py
from __future__ import annotations

import logging
import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Optional

import pandas as pd
import requests
from dotenv import find_dotenv, load_dotenv

from src.renewable.regions import REGIONS, get_eia_respondent, validate_fuel_type, validate_region

logger = logging.getLogger(__name__)


def _load_env_once(*, debug: bool = False) -> Optional[str]:
    """
    Load .env if present.
    - Primary: find_dotenv(usecwd=True) (walk up from CWD)
    - Fallback: repo_root/.env based on this file location
    Returns the path loaded (or None).
    """
    # 1) Try from current working directory upward
    dotenv_path = find_dotenv(usecwd=True)
    if dotenv_path:
        load_dotenv(dotenv_path, override=False)
        if debug:
            logger.info("Loaded .env via find_dotenv: %s", dotenv_path)
        return dotenv_path

    # 2) Fallback: assume src-layout -> repo root is ../../ from this file
    try:
        repo_root = Path(__file__).resolve().parents[2]
        fallback = repo_root / ".env"
        if fallback.exists():
            load_dotenv(fallback, override=False)
            if debug:
                logger.info("Loaded .env via fallback: %s", str(fallback))
            return str(fallback)
    except Exception:
        pass

    if debug:
        logger.info("No .env found to load.")
    return None


class EIARenewableFetcher:
    BASE_URL = "https://api.eia.gov/v2/electricity/rto/fuel-type-data/data/"
    MAX_RECORDS_PER_REQUEST = 5000
    RATE_LIMIT_DELAY = 0.2  # 5 requests/second max

    def __init__(self, api_key: Optional[str] = None, *, debug_env: bool = False):
        """
        Initialize API key. Pulls from:
        1) explicit api_key argument
        2) environment variable EIA_API_KEY (optionally loaded from .env)
        """
        loaded_env = _load_env_once(debug=debug_env)

        self.api_key = api_key or os.getenv("EIA_API_KEY")
        if not self.api_key:
            raise ValueError(
                "EIA API key required but not found.\n"
                "- Ensure .env contains EIA_API_KEY=...\n"
                "- Ensure your process CWD is under the repo (so find_dotenv can locate it), OR\n"
                "- Pass api_key=... explicitly.\n"
                f"Loaded .env path: {loaded_env}"
            )

        # Debug without leaking the key
        if debug_env:
            masked = self.api_key[:4] + "..." + self.api_key[-4:] if len(self.api_key) >= 8 else "***"
            logger.info("EIA_API_KEY loaded (masked): %s", masked)

    @staticmethod
    def _extract_eia_response(payload: dict, *, request_url: Optional[str] = None) -> tuple[list[dict], dict]:
        if not isinstance(payload, dict):
            raise TypeError(f"EIA payload is not a dict. type={type(payload)} url={request_url}")

        if "error" in payload and payload.get("response") is None:
            raise ValueError(f"EIA returned error payload. url={request_url} error={payload.get('error')}")

        if "response" not in payload:
            raise ValueError(
                f"EIA payload missing 'response'. url={request_url} keys={list(payload.keys())[:25]}"
            )

        response = payload.get("response") or {}
        if not isinstance(response, dict):
            raise TypeError(f"EIA payload['response'] is not a dict. type={type(response)} url={request_url}")

        if "data" not in response:
            raise ValueError(
                f"EIA response missing 'data'. url={request_url} response_keys={list(response.keys())[:25]}"
            )

        records = response.get("data") or []
        if not isinstance(records, list):
            raise TypeError(f"EIA response['data'] is not a list. type={type(records)} url={request_url}")

        total = response.get("total", None)
        offset = response.get("offset", None)

        meta_obj = response.get("metadata") or {}
        if isinstance(meta_obj, dict):
            if total is None and "total" in meta_obj:
                total = meta_obj.get("total")
            if offset is None and "offset" in meta_obj:
                offset = meta_obj.get("offset")

        try:
            total = int(total) if total is not None else None
        except Exception:
            pass
        try:
            offset = int(offset) if offset is not None else None
        except Exception:
            pass

        return records, {"total": total, "offset": offset}

    def fetch_region(
        self,
        region: str,
        fuel_type: str,
        start_date: str,
        end_date: str,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        if not validate_region(region):
            raise ValueError(f"Invalid region: {region}")
        if not validate_fuel_type(fuel_type):
            raise ValueError(f"Invalid fuel type: {fuel_type}")

        respondent = get_eia_respondent(region)

        all_records: list[dict] = []
        offset = 0

        while True:
            params = {
                "api_key": self.api_key,
                "data[]": "value",
                "facets[respondent][]": respondent,
                "facets[fueltype][]": fuel_type,
                "frequency": "hourly",
                "start": f"{start_date}T00",
                "end": f"{end_date}T23",
                "length": self.MAX_RECORDS_PER_REQUEST,
                "offset": offset,
                "sort[0][column]": "period",
                "sort[0][direction]": "asc",
            }

            resp = requests.get(self.BASE_URL, params=params, timeout=30)
            resp.raise_for_status()
            payload = resp.json()

            records, meta = self._extract_eia_response(payload, request_url=resp.url)
            returned = len(records)

            if debug:
                print(f"[PAGE] region={region} fuel={fuel_type} returned={returned} offset={offset} total={meta.get('total')} url={resp.url}")

            if returned == 0 and offset == 0:
                return pd.DataFrame(columns=["ds", "value", "region", "fuel_type"])
            if returned == 0:
                break

            all_records.extend(records)

            if returned < self.MAX_RECORDS_PER_REQUEST:
                break

            offset += self.MAX_RECORDS_PER_REQUEST
            time.sleep(self.RATE_LIMIT_DELAY)

        df = pd.DataFrame(all_records)

        missing_cols = [c for c in ["period", "value"] if c not in df.columns]
        if missing_cols:
            sample_keys = sorted(set().union(*(r.keys() for r in all_records[:5]))) if all_records else []
            raise ValueError(
                f"EIA records missing expected keys {missing_cols}. "
                f"columns={df.columns.tolist()} sample_record_keys={sample_keys}"
            )

        raw_rows = len(df)
        # utc=True already yields UTC-aware timestamps; drop the tz to get naive UTC
        df["ds"] = pd.to_datetime(df["period"], utc=True, errors="coerce").dt.tz_localize(None)
        df["value"] = pd.to_numeric(df["value"], errors="coerce")

        bad_ds = int(df["ds"].isna().sum())
        bad_val = int(df["value"].isna().sum())

        df["region"] = region
        df["fuel_type"] = fuel_type

        df = df.dropna(subset=["ds", "value"]).sort_values("ds").reset_index(drop=True)

        if debug:
            kept = len(df)
            print(f"[PARSE] raw_rows={raw_rows} kept={kept} dropped_bad_ds={bad_ds} dropped_bad_value={bad_val}")
            expected = pd.date_range(f"{start_date} 00:00", f"{end_date} 23:00", freq="h")
            dup = int(df["ds"].duplicated().sum())
            missing = int(len(expected.difference(df["ds"])))
            print(f"[QC] expected_hours={len(expected)} actual_hours={len(df)} duplicates={dup} missing_hours={missing}")

        return df[["ds", "value", "region", "fuel_type"]]

    def fetch_all_regions(
        self,
        fuel_type: str,
        start_date: str,
        end_date: str,
        regions: Optional[list[str]] = None,
        max_workers: int = 3,
    ) -> pd.DataFrame:
        if regions is None:
            regions = [r for r in REGIONS.keys() if r != "US48"]

        all_dfs: list[pd.DataFrame] = []

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {
                executor.submit(self.fetch_region, region, fuel_type, start_date, end_date): region
                for region in regions
            }
            for future in as_completed(futures):
                region = futures[future]
                try:
                    df = future.result()
                    if len(df) > 0:
                        all_dfs.append(df)
                        print(f"[OK] {region}: {len(df)} rows")
                except Exception as e:
                    print(f"[FAIL] {region}: {e}")

        if not all_dfs:
            return pd.DataFrame(columns=["unique_id", "ds", "y"])

        combined = pd.concat(all_dfs, ignore_index=True)
        combined["unique_id"] = combined["region"] + "_" + combined["fuel_type"]
        combined = combined.rename(columns={"value": "y"})
        return combined[["unique_id", "ds", "y"]].sort_values(["unique_id", "ds"]).reset_index(drop=True)

    def get_series_summary(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.groupby("unique_id").agg(
            count=("y", "count"),
            min_value=("y", "min"),
            max_value=("y", "max"),
            mean_value=("y", "mean"),
            zero_count=("y", lambda x: (x == 0).sum()),
        ).reset_index()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    fetcher = EIARenewableFetcher(debug_env=True)

    print("=== Testing Single Region Fetch ===")
    df_single = fetcher.fetch_region("CALI", "WND", "2024-12-01", "2024-12-03", debug=True)
    print(f"Single region: {len(df_single)} rows")
    print(df_single.head())

    print("\n=== Testing Multi-Region Fetch ===")
    df_multi = fetcher.fetch_all_regions("WND", "2024-12-01", "2024-12-03", regions=["CALI", "ERCO", "MISO"])
    print(f"\nMulti-region: {len(df_multi)} rows")
    print(f"Series: {df_multi['unique_id'].unique().tolist()}")

    print("\n=== Series Summary ===")
    print(fetcher.get_series_summary(df_multi))


Overwriting src/renewable/eia_renewable.py



---

# Module 3: Weather Integration

**File:** `src/renewable/open_meteo.py`

Weather is **critical** for renewable forecasting:
- **Wind generation** depends on wind speed (especially at hub height ~100m)
- **Solar generation** depends on radiation and cloud cover

## Key Concept: Preventing Leakage

**Data leakage** occurs when training uses information that wouldn't be available at prediction time.

```
❌ WRONG: Using observed (actual) weather for the hours being predicted
   - At prediction time, we don't have future actual weather!
   
✅ CORRECT: Use forecasted weather for predictions
   - Training: historical weather aligned with historical generation
   - Prediction: weather forecast for the prediction horizon
```
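
The split above can be sketched with small hypothetical frames: observed weather is joined to observed generation for training, while the prediction features come only from forecast weather after the training cutoff. All values and frame names are made up for illustration.

```python
import pandas as pd

hours = pd.date_range("2024-12-01 00:00", periods=6, freq="h")

# Hypothetical data: 4 observed hours, then a 2-hour prediction horizon.
generation = pd.DataFrame({"ds": hours[:4], "y": [10.0, 12.0, 11.0, 13.0]})
observed_weather = pd.DataFrame(
    {"ds": hours[:4], "wind_speed_100m": [5.0, 6.0, 5.5, 6.5]}
)
forecast_weather = pd.DataFrame({"ds": hours[4:], "wind_speed_100m": [7.0, 6.8]})

# Training frame: observed weather aligned with observed generation.
train = generation.merge(observed_weather, on="ds", how="inner")

# Prediction frame: forecast weather only, strictly after the training cutoff.
cutoff = train["ds"].max()
future_X = forecast_weather[forecast_weather["ds"] > cutoff].reset_index(drop=True)

# Guard: every prediction feature row must lie after the cutoff.
assert future_X["ds"].min() > cutoff
print(len(train), len(future_X))
```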

## Open-Meteo API

Open-Meteo is **free** and requires no API key:
- Historical API: Past weather data
- Forecast API: Up to 16 days ahead
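
To show the shape of a forecast request without touching the network, this sketch only builds the URL. The parameter names match the ones the module below sends. The coordinates are the CALI centroid from `regions.py`, and the two-variable selection is an arbitrary example.

```python
from urllib.parse import urlencode

FORECAST_URL = "https://api.open-meteo.com/v1/forecast"

params = {
    "latitude": 36.7,  # CALI centroid
    "longitude": -119.4,
    "hourly": ",".join(["wind_speed_100m", "direct_radiation"]),
    "timezone": "UTC",
    "forecast_days": 2,
}

# urlencode percent-encodes the comma in the hourly list as %2C.
url = f"{FORECAST_URL}?{urlencode(params)}"
print(url)
```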

In [None]:
%%writefile src/renewable/open_meteo.py
# src/renewable/open_meteo.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

import pandas as pd
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

from src.renewable.regions import get_region_coords, validate_region


@dataclass(frozen=True)
class OpenMeteoEndpoints:
    historical_url: str = "https://archive-api.open-meteo.com/v1/archive"
    forecast_url: str = "https://api.open-meteo.com/v1/forecast"


class OpenMeteoRenewable:
    """
    Fetch weather features for renewable energy forecasting.

    Strict-by-default:
    - If Open-Meteo doesn't return a requested variable, we raise.
    - We do NOT fabricate values or silently "fill" missing columns.
    """

    WEATHER_VARS = [
        "temperature_2m",
        "wind_speed_10m",
        "wind_speed_100m",
        "wind_direction_10m",
        "direct_radiation",
        "diffuse_radiation",
        "cloud_cover",
    ]

    def __init__(self, timeout: int = 30, *, strict: bool = True):
        self.timeout = timeout
        self.strict = strict
        self.endpoints = OpenMeteoEndpoints()
        self.session = self._create_session()

    def _create_session(self) -> requests.Session:
        session = requests.Session()
        retries = Retry(
            total=3,
            backoff_factor=0.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=frozenset(["GET"]),
        )
        session.mount("https://", HTTPAdapter(max_retries=retries))
        return session

    def fetch_historical(
        self,
        lat: float,
        lon: float,
        start_date: str,
        end_date: str,
        variables: Optional[list[str]] = None,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        if variables is None:
            variables = self.WEATHER_VARS

        params = {
            "latitude": lat,
            "longitude": lon,
            "start_date": start_date,
            "end_date": end_date,
            "hourly": ",".join(variables),
            "timezone": "UTC",
        }

        resp = self.session.get(self.endpoints.historical_url, params=params, timeout=self.timeout)
        if debug:
            print(f"[OPENMETEO][HIST] status={resp.status_code} url={resp.url}")
        resp.raise_for_status()

        return self._parse_response(resp.json(), variables, debug=debug, request_url=resp.url)

    def fetch_forecast(
        self,
        lat: float,
        lon: float,
        horizon_hours: int = 48,
        variables: Optional[list[str]] = None,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        if variables is None:
            variables = self.WEATHER_VARS

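        # Open-Meteo counts forecast days from today 00:00 UTC, so allow for the
        # hours already elapsed today (up to 23) before rounding up to whole days.
        forecast_days = min(((horizon_hours + 23) // 24) + 1, 16)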
        params = {
            "latitude": lat,
            "longitude": lon,
            "hourly": ",".join(variables),
            "timezone": "UTC",
            "forecast_days": forecast_days,
        }

        resp = self.session.get(self.endpoints.forecast_url, params=params, timeout=self.timeout)
        if debug:
            print(f"[OPENMETEO][FCST] status={resp.status_code} url={resp.url}")
        resp.raise_for_status()

        df = self._parse_response(resp.json(), variables, debug=debug, request_url=resp.url)

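        # Trim to requested horizon (ds is naive UTC).
        # datetime.utcnow() is deprecated since Python 3.12; build naive UTC "now" via pandas.
        if len(df) > 0:
            cutoff = pd.Timestamp.now(tz="UTC").tz_localize(None) + timedelta(hours=horizon_hours)
            df = df[df["ds"] <= cutoff].reset_index(drop=True)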

        return df

    def fetch_for_region(
        self,
        region_code: str,
        start_date: str,
        end_date: str,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        if not validate_region(region_code):
            raise ValueError(f"Invalid region_code: {region_code}")

        lat, lon = get_region_coords(region_code)
        df = self.fetch_historical(lat, lon, start_date, end_date, debug=debug)
        df["region"] = region_code
        return df

    def fetch_all_regions_historical(
        self,
        regions: list[str],
        start_date: str,
        end_date: str,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        all_dfs: list[pd.DataFrame] = []
        for region in regions:
            try:
                df = self.fetch_for_region(region, start_date, end_date, debug=debug)
                all_dfs.append(df)
                print(f"[OK] Weather for {region}: {len(df)} rows")
            except Exception as e:
                print(f"[FAIL] Weather for {region}: {e}")

        if not all_dfs:
            return pd.DataFrame()

        return (
            pd.concat(all_dfs, ignore_index=True)
            .sort_values(["region", "ds"])
            .reset_index(drop=True)
        )

    def _parse_response(
        self,
        data: dict,
        variables: list[str],
        *,
        debug: bool,
        request_url: str,
    ) -> pd.DataFrame:
        hourly = data.get("hourly")
        if not isinstance(hourly, dict):
            raise ValueError(f"Open-Meteo response missing/invalid 'hourly'. url={request_url}")

        times = hourly.get("time")
        if not isinstance(times, list) or len(times) == 0:
            raise ValueError(f"Open-Meteo response has no hourly time grid. url={request_url}")

        # Build ds (naive UTC)
        ds = pd.to_datetime(times, errors="coerce", utc=True).tz_localize(None)
        if ds.isna().any():
            bad = int(ds.isna().sum())
            raise ValueError(f"Open-Meteo returned unparsable times. bad={bad} url={request_url}")

        df_data = {"ds": ds}

        # Strict variable presence: raise if missing (no silent None padding)
        missing_vars = [v for v in variables if v not in hourly]
        if missing_vars and self.strict:
            raise ValueError(f"Open-Meteo missing requested vars={missing_vars}. url={request_url}")

        for var in variables:
            values = hourly.get(var)
            if values is None:
                # If not strict, keep as all-NA but be explicit (not hidden)
                df_data[var] = [None] * len(ds)
                continue

            if not isinstance(values, list):
                raise ValueError(f"Open-Meteo var '{var}' not a list. type={type(values)} url={request_url}")

            if len(values) != len(ds):
                raise ValueError(
                    f"Open-Meteo length mismatch for '{var}': "
                    f"len(values)={len(values)} len(time)={len(ds)} url={request_url}"
                )

            df_data[var] = pd.to_numeric(values, errors="coerce")

        df = pd.DataFrame(df_data).sort_values("ds").reset_index(drop=True)

        if debug:
            dup = int(df["ds"].duplicated().sum())
            na_counts = {v: int(df[v].isna().sum()) for v in variables if v in df.columns}
            print(f"[OPENMETEO][PARSE] rows={len(df)} dup_ds={dup} na_counts(sample)={dict(list(na_counts.items())[:3])}")

        return df


if __name__ == "__main__": 
    # Real API smoke test (no key needed)
    weather = OpenMeteoRenewable(strict=True)

    print("=== Testing Historical Weather (REAL API) ===")
    hist_df = weather.fetch_for_region("CALI", "2024-12-01", "2024-12-03", debug=True)
    print(f"Historical rows: {len(hist_df)}")
    print(hist_df.head())


Overwriting src/renewable/open_meteo.py



---

# Module 4: Probabilistic Modeling

**File:** `src/renewable/modeling.py`

This is where the forecasting happens! We use **StatsForecast** for:

1. **Multi-series forecasting**: Handle multiple regions/fuel types in one model
2. **Probabilistic predictions**: Get prediction intervals, not just point forecasts
3. **Exogenous weather features**: Include weather variables as predictors

## Key Concepts

### Why Prediction Intervals?

Point forecasts are useful, but energy traders need **uncertainty quantification**:
- **80% interval**: "I'm 80% confident generation will be between X and Y"
- **95% interval**: Wider, for risk management
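
One way to check whether intervals deserve their label is empirical coverage: the share of actuals that fall inside the bounds. The sketch below uses made-up numbers and a hypothetical `coverage_pct` helper. Five points are far too few to judge calibration and serve only to show the arithmetic.

```python
# Hypothetical actuals with 80% and 95% interval bounds (illustrative numbers).
y_true = [10.0, 12.0, 0.0, 15.0, 9.0]
lo_80 = [8.0, 11.0, 0.0, 10.0, 10.0]
hi_80 = [12.0, 14.0, 2.0, 14.0, 13.0]
lo_95 = [6.0, 9.0, 0.0, 8.0, 8.0]
hi_95 = [14.0, 16.0, 4.0, 16.0, 15.0]


def coverage_pct(y, lower, upper):
    """Percentage of actuals that fall inside [lower, upper]."""
    hits = [lo <= v <= hi for v, lo, hi in zip(y, lower, upper)]
    return 100.0 * sum(hits) / len(hits)


cov_80 = coverage_pct(y_true, lo_80, hi_80)
cov_95 = coverage_pct(y_true, lo_95, hi_95)
print(cov_80, cov_95)
```

Over a long backtest, a well-calibrated 80% interval should cover roughly 80% of actuals. Coverage far below the nominal level suggests the intervals are too narrow.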

### Zero-Value Safety (CRITICAL)

**Solar panels generate ZERO at night!** This breaks MAPE:

```
MAPE = mean(|actual - predicted| / |actual|)

When actual = 0:
|0 - predicted| / 0 = undefined (division by zero!)
```

**Solution**: Always use RMSE and MAE for renewable forecasting.
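
A small numeric illustration with a hypothetical four-hour solar series: MAE and RMSE score every hour, while MAPE either has undefined terms or, if zeros are masked out, silently skips the night hours.

```python
import math

# Hypothetical solar series: zero at night, positive during the day.
actual = [0.0, 0.0, 50.0, 100.0]
pred = [1.0, 0.0, 45.0, 110.0]

errors = [p - a for a, p in zip(actual, pred)]
mae = sum(abs(e) for e in errors) / len(errors)
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# Naive MAPE divides by the actual value, so these hours cannot be scored.
undefined_terms = sum(1 for a in actual if a == 0.0)

# Masked MAPE (zeros dropped) is computable but ignores every night hour.
day = [(a, p) for a, p in zip(actual, pred) if abs(a) > 1e-10]
masked_mape = 100.0 * sum(abs(p - a) / abs(a) for a, p in day) / len(day)

print(f"MAE={mae:.2f} RMSE={rmse:.2f} undefined MAPE terms={undefined_terms}")
print(f"masked MAPE={masked_mape:.1f}% over {len(day)} of {len(actual)} hours")
```

The 1 MW night-time miss counts toward MAE and RMSE but is invisible to the masked MAPE.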

In [None]:
%%writefile src/chapter2/evaluation.py
# file: src/chapter2/evaluation.py
"""
Chapter 2: Model Evaluation Metrics

Computes forecasting metrics with explicit NaN handling (fail-loud principle).
"""

import logging
from typing import Dict, Optional, Tuple

import numpy as np
import pandas as pd

logger = logging.getLogger(__name__)


class ForecastMetrics:
    """Compute and track forecasting evaluation metrics"""

    @staticmethod
    def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        Root Mean Squared Error

        Explicit NaN masking (fail-loud):
        - Returns NaN if no valid predictions
        - Masks NaN/inf values before computation
        """
        valid_mask = np.isfinite(y_pred) & np.isfinite(y_true)

        if valid_mask.sum() == 0:
            return np.nan

        return np.sqrt(np.mean((y_pred[valid_mask] - y_true[valid_mask]) ** 2))

    @staticmethod
    def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        Mean Absolute Error

        Explicit NaN masking (fail-loud):
        - Returns NaN if no valid predictions
        - Masks NaN/inf values before computation
        """
        valid_mask = np.isfinite(y_pred) & np.isfinite(y_true)

        if valid_mask.sum() == 0:
            return np.nan

        return np.mean(np.abs(y_pred[valid_mask] - y_true[valid_mask]))

    @staticmethod
    def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        Mean Absolute Percentage Error (%)

        Explicit NaN masking (fail-loud):
        - Returns NaN if no valid predictions
        - Masks NaN/inf values and zero y_true before computation
        """
        valid_mask = (
            np.isfinite(y_pred) &
            np.isfinite(y_true) &
            (np.abs(y_true) > 1e-10)
        )

        if valid_mask.sum() == 0:
            return np.nan

        ape = np.abs((y_pred[valid_mask] - y_true[valid_mask]) / np.abs(y_true[valid_mask]))
        return 100 * np.mean(ape)

    @staticmethod
    def mase(
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_train: np.ndarray,
        season_length: int = 24
    ) -> float:
        """
        Mean Absolute Scaled Error

        Scales error relative to naive seasonal forecasting.

        Explicit NaN masking (fail-loud):
        - Returns NaN if insufficient training data
        - Masks NaN/inf values before computation
        """
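        # Need more than one full season to form at least one seasonal difference
        y_train = np.asarray(y_train, dtype=float)
        if len(y_train) <= season_length:
            return np.nan

        # Compute seasonal naive MAE on finite differences only
        seasonal_diff = np.abs(y_train[season_length:] - y_train[:-season_length])
        seasonal_diff = seasonal_diff[np.isfinite(seasonal_diff)]
        if seasonal_diff.size == 0:
            return np.nan

        mae_train = np.mean(seasonal_diff)
        if mae_train < 1e-10:
            return np.nan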

        # Compute test MAE
        valid_mask = np.isfinite(y_pred) & np.isfinite(y_true)

        if valid_mask.sum() == 0:
            return np.nan

        mae_test = np.mean(np.abs(y_pred[valid_mask] - y_true[valid_mask]))

        return mae_test / mae_train

    @staticmethod
    def coverage(
        y_true: np.ndarray,
        lower: np.ndarray,
        upper: np.ndarray
    ) -> float:
        """
        Prediction Interval Coverage (%)

        Percentage of actual values within prediction interval.

        Explicit NaN masking (fail-loud):
        - Returns NaN if no valid predictions
        - Counts valid (non-NaN) rows in denominator
        """
        valid_mask = (
            np.isfinite(y_true) &
            np.isfinite(lower) &
            np.isfinite(upper)
        )

        if valid_mask.sum() == 0:
            return np.nan

        covered = (y_true[valid_mask] >= lower[valid_mask]) & \
                  (y_true[valid_mask] <= upper[valid_mask])

        return 100 * np.mean(covered)

    @staticmethod
    def compute_all(
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_train: Optional[np.ndarray] = None
    ) -> Dict[str, float]:
        """
        Compute all metrics at once

        Args:
            y_true: Actual values
            y_pred: Predictions
            y_train: Training values (for MASE)

        Returns:
            Dictionary of metrics
        """
        metrics = {
            "rmse": ForecastMetrics.rmse(y_true, y_pred),
            "mae": ForecastMetrics.mae(y_true, y_pred),
            "mape": ForecastMetrics.mape(y_true, y_pred),
        }

        if y_train is not None:
            metrics["mase"] = ForecastMetrics.mase(
                y_true, y_pred, y_train, season_length=24
            )

        return metrics


def compute_series_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    y_train: Optional[np.ndarray] = None,
    valid_threshold: int = 1
) -> Dict[str, float]:
    """
    Compute metrics with explicit validation

    Args:
        y_true: Actual values
        y_pred: Predictions
        y_train: Training values (for MASE)
        valid_threshold: Minimum valid predictions required

    Returns:
        Dictionary of metrics
    """
    # Count valid predictions
    valid_mask = np.isfinite(y_pred) & np.isfinite(y_true)
    valid_count = valid_mask.sum()

    if valid_count < valid_threshold:
        return {
            "rmse": np.nan,
            "mae": np.nan,
            "mape": np.nan,
            "mase": np.nan,
            "valid_count": valid_count,
            "error": f"Insufficient valid predictions: {valid_count} < {valid_threshold}"
        }

    metrics = ForecastMetrics.compute_all(y_true, y_pred, y_train)
    metrics["valid_count"] = valid_count

    return metrics


def aggregate_metrics(
    results: pd.DataFrame,
    by: Optional[str] = None
) -> pd.DataFrame:
    """
    Aggregate metrics across splits and series

    Args:
        results: DataFrame with metric columns
        by: Groupby column ("model_name", "unique_id", etc.)

    Returns:
        Aggregated metrics DataFrame
    """
    metric_cols = ["rmse", "mae", "mape", "mase"]

    if by is None:
        # Overall aggregation
        agg = results[metric_cols].agg(["mean", "std", "min", "max"])
        return agg
    else:
        # Grouped aggregation
        agg = results.groupby(by)[metric_cols].agg([
            ("mean", "mean"),
            ("std", "std"),
            ("count", "count")
        ])
        return agg

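The docstrings above explain why MASE and interval coverage suit solar series better than MAPE. The following is a minimal standalone sketch of that argument in plain NumPy. It does not call `ForecastMetrics`, and the toy series, noise level, and interval width are made-up assumptions chosen only for illustration.

```python
import numpy as np

# Toy solar-like series (illustrative values): zero at night, noisy bump by day.
rng = np.random.default_rng(0)
hours = np.arange(72)
base = np.clip(np.sin((hours % 24 - 6) * np.pi / 12), 0, None) * 100
noise = rng.normal(0, 5, size=hours.size)
y_train = np.where(base > 0, np.clip(base + noise, 0, None), 0.0)

y_true = y_train[-24:].copy()
y_pred = y_true + 5.0  # constant 5-unit over-forecast

# MAPE divides by the actuals, so night-time zeros give infinite terms.
with np.errstate(divide="ignore", invalid="ignore"):
    ape = np.abs(y_true - y_pred) / np.abs(y_true)

# MASE divides by the in-sample seasonal-naive MAE instead, which is finite
# as long as the training series is not perfectly periodic.
season_length = 24
naive_mae = np.mean(np.abs(y_train[season_length:] - y_train[:-season_length]))
mae_test = np.mean(np.abs(y_pred - y_true))
mase = mae_test / naive_mae

# Coverage: share of actuals inside a (hypothetical) +/-10 band around y_pred.
lower, upper = y_pred - 10.0, y_pred + 10.0
coverage = 100 * np.mean((y_true >= lower) & (y_true <= upper))

print("any infinite APE term:", bool(np.isinf(ape).any()))
print("MASE is finite:", bool(np.isfinite(mase)))
print("coverage (%):", coverage)
```

A MASE below 1 means the forecast beats the seasonal-naive baseline on average, and a value above 1 means it does worse.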


In [1]:
# %%writefile src/renewable/modeling.py
# file: src/renewable/modeling.py
# Module 4: Probabilistic Modeling
# Multi-series forecasting with prediction intervals

import numpy as np
from dataclasses import dataclass, field

# Import our evaluation metrics (from chapter 2)
from src.chapter2.evaluation import ForecastMetrics
import pandas as pd
from typing import Optional



@dataclass
class ForecastConfig:
    """Configuration for renewable forecasting."""
    horizon: int = 24
    confidence_levels: tuple[int, int] = (80, 95)
    season_length: int = 24  # Hourly seasonality
    weekly_season: int = 168  # 24 * 7
    models: list[str] = field(default_factory=lambda: ["AutoARIMA", "MSTL"])


class RenewableForecastModel:
    """Multi-series probabilistic forecasting with exogenous weather features.
    
    Designed for wind/solar generation with:
    - Weather features (wind speed, solar radiation)
    - Dual prediction intervals (80%, 95%)
    - Zero-safe metrics (solar has 0s at night)
    """
    
    def __init__(
        self,
        horizon: int = 24,
        confidence_levels: tuple[int, int] = (80, 95),
    ):
        self.horizon = horizon
        self.confidence_levels = confidence_levels
        self.sf = None
        self.fitted = False
    
    def prepare_features(
        self,
        df: pd.DataFrame,
        weather_df: Optional[pd.DataFrame] = None,
    ) -> pd.DataFrame:
        """Add time and weather features.
        
        Time features use cyclic encoding (sin/cos) because:
        - Hour 23 and Hour 0 are adjacent, but 23 > 0 numerically
        - Sin/cos creates a smooth circular representation
        """
        result = df.copy()
        
        # Cyclic encoding for hour of day
        result["hour"] = result["ds"].dt.hour
        result["hour_sin"] = np.sin(2 * np.pi * result["hour"] / 24)
        result["hour_cos"] = np.cos(2 * np.pi * result["hour"] / 24)
        
        # Cyclic encoding for day of week
        result["dayofweek"] = result["ds"].dt.dayofweek
        result["dow_sin"] = np.sin(2 * np.pi * result["dayofweek"] / 7)
        result["dow_cos"] = np.cos(2 * np.pi * result["dayofweek"] / 7)
        
        # Merge weather if provided
        if weather_df is not None and len(weather_df) > 0:
            result["region"] = result["unique_id"].str.split("_").str[0]
            weather_cols = [c for c in weather_df.columns if c not in ["ds", "region"]]
            
            result = result.merge(
                weather_df[["ds", "region"] + weather_cols],
                on=["ds", "region"],
                how="left",
            )
            
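            # Forward fill missing weather within each series, in time order
            # (ffill depends on row order, so sort before filling)
            result = result.sort_values(["unique_id", "ds"])
            for col in weather_cols:
                if col in result.columns:
                    result[col] = result.groupby("unique_id")[col].ffill()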
            
            result = result.drop(columns=["region"])
        
        # Lag features (shifted to prevent leakage)
        result = result.sort_values(["unique_id", "ds"])
        result["y_lag_1"] = result.groupby("unique_id")["y"].shift(1)
        result["y_lag_24"] = result.groupby("unique_id")["y"].shift(24)
        
        result = result.drop(columns=["hour", "dayofweek"], errors="ignore")
        
        return result
    
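    def fit(self, df: pd.DataFrame, weather_df: Optional[pd.DataFrame] = None) -> None:
        """Configure StatsForecast models and cache the training frame.

        No model is fitted here: predict() calls sf.forecast(), which fits and
        forecasts in one pass. Only [unique_id, ds, y] is cached, so the time,
        weather, and lag features from prepare_features() are not passed to
        the models.
        """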
        from statsforecast import StatsForecast
        from statsforecast.models import MSTL, AutoARIMA, AutoETS, SeasonalNaive
        
        train_df = self.prepare_features(df, weather_df)
        
        # Define models
        models = [
            AutoARIMA(season_length=24),
            SeasonalNaive(season_length=24),
            AutoETS(season_length=24),
            MSTL(
                season_length=[24, 168],  # Daily and weekly seasonality
                trend_forecaster=AutoARIMA(),
                alias="MSTL_ARIMA",
            ),
        ]
        
        self.sf = StatsForecast(models=models, freq="h", n_jobs=-1)
        self._train_df = train_df[["unique_id", "ds", "y"]].copy()
        
        print(f"Model fit: {len(train_df)} rows, {train_df['unique_id'].nunique()} series")
        self.fitted = True
    
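    def predict(self, future_weather: Optional[pd.DataFrame] = None) -> pd.DataFrame:
        """Generate forecasts with dual prediction intervals.

        future_weather is currently unused: the cached training frame carries
        no exogenous columns, so the models forecast from y alone.
        """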
        if not self.fitted:
            raise RuntimeError("Model must be fitted before prediction. Call fit() first.")
        
        forecasts = self.sf.forecast(
            h=self.horizon,
            df=self._train_df,
            level=list(self.confidence_levels),
        )
        
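        # Older StatsForecast versions return unique_id as the index; newer ones
        # return it as a column, where reset_index() would add a stray "index" column.
        if "unique_id" not in forecasts.columns:
            forecasts = forecasts.reset_index()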
        result = self._standardize_forecast_columns(forecasts)
        
        print(f"Predictions generated: {len(result)} rows, {self.horizon}h horizon")
        return result
    
    def cross_validate(
        self,
        df: pd.DataFrame,
        weather_df: Optional[pd.DataFrame] = None,
        n_windows: int = 5,
        step_size: int = 168,
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        """Run rolling-origin cross-validation."""
        from statsforecast import StatsForecast
        from statsforecast.models import MSTL, AutoARIMA, AutoETS, SeasonalNaive
        
        cv_df = self.prepare_features(df, weather_df)
        cv_df = cv_df[["unique_id", "ds", "y"]].copy()
        
        models = [
            AutoARIMA(season_length=24),
            SeasonalNaive(season_length=24),
            AutoETS(season_length=24),
            MSTL(season_length=[24, 168], trend_forecaster=AutoARIMA(), alias="MSTL_ARIMA"),
        ]
        
        sf = StatsForecast(models=models, freq="h", n_jobs=-1)
        
        print(f"Running CV: {n_windows} windows, step={step_size}h, horizon={self.horizon}h")
        
        cv_results = sf.cross_validation(
            df=cv_df,
            h=self.horizon,
            step_size=step_size,
            n_windows=n_windows,
            level=list(self.confidence_levels),
        )
        
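        # Only reset the index when unique_id is not already a column; otherwise
        # a stray "index" column would be scored as a model in the leaderboard.
        if "unique_id" not in cv_results.columns:
            cv_results = cv_results.reset_index()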
        leaderboard = self._compute_leaderboard(cv_results)
        
        return cv_results, leaderboard
    
    def _standardize_forecast_columns(self, df: pd.DataFrame) -> pd.DataFrame:
        """Standardize column names to yhat, yhat_lo_80, etc."""
        result = df.copy()
        
        model_cols = [c for c in result.columns if c not in ["unique_id", "ds", "cutoff"]]
        point_cols = [c for c in model_cols if not any(x in c for x in ["-lo-", "-hi-"])]
        
        # Prefer MSTL_ARIMA as the main model
        if "MSTL_ARIMA" in point_cols:
            best_model = "MSTL_ARIMA"
        elif "AutoARIMA" in point_cols:
            best_model = "AutoARIMA"
        else:
            best_model = point_cols[0] if point_cols else None
        
        if best_model:
            result["yhat"] = result[best_model]
            
            for level in self.confidence_levels:
                lo_col = f"{best_model}-lo-{level}"
                hi_col = f"{best_model}-hi-{level}"
                
                if lo_col in result.columns:
                    result[f"yhat_lo_{level}"] = result[lo_col]
                if hi_col in result.columns:
                    result[f"yhat_hi_{level}"] = result[hi_col]
        
        return result
    
    def _compute_leaderboard(self, cv_results: pd.DataFrame) -> pd.DataFrame:
        """Compute model leaderboard from CV results."""
        model_cols = [
            c for c in cv_results.columns
            if c not in ["unique_id", "ds", "cutoff", "y"]
            and not any(x in c for x in ["-lo-", "-hi-"])
        ]
        
        rows = []
        for model in model_cols:
            y_true = cv_results["y"].values
            y_pred = cv_results[model].values
            
            # CRITICAL: Use RMSE and MAE, NOT MAPE (solar has zeros)
            rmse = ForecastMetrics.rmse(y_true, y_pred)
            mae = ForecastMetrics.mae(y_true, y_pred)
            
            # Coverage for each level
            coverages = {}
            for level in self.confidence_levels:
                lo_col = f"{model}-lo-{level}"
                hi_col = f"{model}-hi-{level}"
                
                if lo_col in cv_results.columns and hi_col in cv_results.columns:
                    coverage = ForecastMetrics.coverage(
                        y_true,
                        cv_results[lo_col].values,
                        cv_results[hi_col].values,
                    )
                    coverages[f"coverage_{level}"] = coverage
            
            rows.append({"model": model, "rmse": rmse, "mae": mae, **coverages})
        
        return pd.DataFrame(rows).sort_values("rmse")
    
    def compute_metrics(self, y_true: np.ndarray, y_pred: np.ndarray) -> dict:
        """Compute metrics - RMSE and MAE only (no MAPE for solar!)."""
        return {
            "rmse": ForecastMetrics.rmse(y_true, y_pred),
            "mae": ForecastMetrics.mae(y_true, y_pred),
            # NOTE: MAPE intentionally excluded - undefined when y=0
        }


def compute_baseline_metrics(
    cv_results: pd.DataFrame,
    model_name: str = "MSTL_ARIMA",
) -> dict:
    """Compute baseline metrics from backtest for drift detection.
    
    Drift threshold = mean + 2*std (flags unusual performance)
    """
    if model_name not in cv_results.columns:
        raise ValueError(f"Model {model_name} not found in CV results")
    
    window_metrics = []
    for cutoff in cv_results["cutoff"].unique():
        window = cv_results[cv_results["cutoff"] == cutoff]
        y_true = window["y"].values
        y_pred = window[model_name].values
        
        rmse = ForecastMetrics.rmse(y_true, y_pred)
        mae = ForecastMetrics.mae(y_true, y_pred)
        window_metrics.append({"cutoff": cutoff, "rmse": rmse, "mae": mae})
    
    metrics_df = pd.DataFrame(window_metrics)
    
    baseline = {
        "model": model_name,
        "rmse_mean": metrics_df["rmse"].mean(),
        "rmse_std": metrics_df["rmse"].std(),
        "mae_mean": metrics_df["mae"].mean(),
        "mae_std": metrics_df["mae"].std(),
        "n_windows": len(metrics_df),
    }
    
    # Drift threshold: mean + 2*std
    baseline["drift_threshold_rmse"] = baseline["rmse_mean"] + 2 * baseline["rmse_std"]
    baseline["drift_threshold_mae"] = baseline["mae_mean"] + 2 * baseline["mae_std"]
    
    return baseline

if __name__ == "__main__":
    import pandas as pd

    print("=== Renewable Forecast Modeling Test ===")
    # Example run - test modeling with synthetic data

    # np.random.seed(42)

    # # Create synthetic generation data (simulates 30 days of hourly data)
    # dates = pd.date_range("2024-01-01", periods=720, freq="h")
    # series_ids = ["CALI_WND", "ERCO_WND"]

    # dfs = []
    # for sid in series_ids:
    #     # Simulate generation with daily seasonality + noise
    #     y = 100 + 20 * np.sin(np.arange(720) * 2 * np.pi / 24) + np.random.normal(0, 5, 720)
    #     dfs.append(pd.DataFrame({"unique_id": sid, "ds": dates, "y": y}))

    # df = pd.concat(dfs, ignore_index=True)
    # print(f"Synthetic data: {len(df)} rows, {df['unique_id'].nunique()} series")

    from src.renewable.eia_renewable import EIARenewableFetcher
    from src.renewable.open_meteo import OpenMeteoRenewable
    from src.renewable.regions import get_eia_respondent, get_region_coords

    # 1) Region config check (fail early)
    for r in ["CALI", "ERCO", "MISO"]:
        print(r, "EIA respondent =", get_eia_respondent(r), "coords =", get_region_coords(r))

    # 2) Pull a tiny real slice
    fetcher = EIARenewableFetcher(debug_env=True)
    gen = fetcher.fetch_region("CALI", "WND", "2024-12-01", "2024-12-03", debug=True)

    weather_api = OpenMeteoRenewable(strict=True)
    wx = weather_api.fetch_for_region("CALI", "2024-12-01", "2024-12-03", debug=True)

    # 3) Alignment gate (no filling)
    gen2 = gen.rename(columns={"value": "y"})
    joined = gen2.merge(wx, on=["ds"], how="left")  # wx is already region-tagged in fetch_for_region
    missing_cols = [c for c in wx.columns if c not in ["ds", "region"]]
    missing_any = joined[missing_cols].isna().any(axis=1)

    print("gen rows:", len(gen2), "wx rows:", len(wx), "joined rows:", len(joined))
    print("rows missing any weather:", int(missing_any.sum()))
    if missing_any.any():
        print(joined.loc[missing_any, ["ds", "y"] + missing_cols].head(10))
        raise RuntimeError("Weather alignment failed (missing weather for generation rows). Fix ds/region alignment; do not fill.")



    # print("\n=== Feature Preparation ===")
    # model = RenewableForecastModel(horizon=24)
    # features = model.prepare_features(df)
    # print(f"Features added: {[c for c in features.columns if c not in ['unique_id', 'ds', 'y']]}")

    # print("\n=== Cross-Validation ===")
    # cv_results, leaderboard = model.cross_validate(df, n_windows=3, step_size=168)
    # print(f"CV results: {len(cv_results)} rows")
    # print("\nLeaderboard (sorted by RMSE):")
    # print(leaderboard.to_string(index=False))

    # print("\n=== Baseline Metrics (for drift detection) ===")
    # baseline = compute_baseline_metrics(cv_results)
    # print(f"Best model: {baseline['model']}")
    # print(f"RMSE: {baseline['rmse_mean']:.2f} ± {baseline['rmse_std']:.2f}")
    # print(f"Drift threshold: {baseline['drift_threshold_rmse']:.2f}")
    # print("\n(Drift detected when current RMSE > threshold)")


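`prepare_features` uses sin/cos encoding so that hour 23 and hour 0 end up close together. As a quick check of that claim, here is a small sketch that measures distances between encoded hours. The helper name `encode_hour` is hypothetical and not part of `modeling.py`.

```python
import numpy as np


def encode_hour(hour: int) -> np.ndarray:
    """Map an hour of day (0-23) onto the unit circle as [sin, cos]."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])


# Raw integer hours put 23 and 0 far apart...
raw_gap_23_0 = abs(23 - 0)

# ...while the cyclic encoding treats them like any other adjacent pair.
cyclic_gap_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
cyclic_gap_0_1 = np.linalg.norm(encode_hour(0) - encode_hour(1))

# Opposite hours (midnight vs noon) are the farthest apart on the circle.
cyclic_gap_0_12 = np.linalg.norm(encode_hour(0) - encode_hour(12))

print("raw gap 23 -> 0:", raw_gap_23_0)
print("cyclic gap 23 -> 0 equals 0 -> 1:", bool(np.isclose(cyclic_gap_23_0, cyclic_gap_0_1)))
print("cyclic gap 0 -> 12:", cyclic_gap_0_12)
```

The same reasoning applies to the day-of-week features, with a period of 7 instead of 24.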
---

# Module 6: Pipeline Tasks

**File:** `src/renewable/tasks.py`

This module orchestrates the complete pipeline:

1. **Fetch generation data** from EIA
2. **Fetch weather data** from Open-Meteo
3. **Train models** with cross-validation
4. **Generate forecasts** with prediction intervals
5. **Compute drift metrics** vs baseline
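Step 5 relies on the rule from the Key Concepts section: threshold = mean + 2*std of the backtest error. The sketch below works through that arithmetic on made-up per-window RMSE values (they are not results from this pipeline). The helper `is_drifting` is a hypothetical name. `Series.std()` uses the sample standard deviation (ddof=1), which is also what `compute_baseline_metrics` gets from pandas.

```python
import pandas as pd

# Hypothetical RMSE per CV window (illustrative numbers only).
window_rmse = pd.Series([10.0, 12.0, 11.0, 13.0, 9.0])

rmse_mean = window_rmse.mean()
rmse_std = window_rmse.std()  # sample std, ddof=1
drift_threshold = rmse_mean + 2 * rmse_std


def is_drifting(current_rmse: float, threshold: float) -> bool:
    """Flag drift when the live RMSE exceeds the backtest threshold."""
    return current_rmse > threshold


print(f"mean={rmse_mean:.2f} std={rmse_std:.2f} threshold={drift_threshold:.2f}")
print("RMSE 12.0 drifting:", is_drifting(12.0, drift_threshold))
print("RMSE 15.0 drifting:", is_drifting(15.0, drift_threshold))
```

With a single CV window the standard deviation is NaN, so the threshold is NaN and the comparison never flags drift. At least two windows are needed for the rule to be meaningful.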

## Key Feature: Adaptive CV

Cross-validation requires sufficient data:
```
Minimum rows = horizon + (n_windows × step_size)
```

For short series, we **adapt** the CV settings automatically.
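To make the adaptation concrete, the sketch below restates the step/window arithmetic used in `train_renewable_models` as a standalone function. The function name `adaptive_cv_settings` is hypothetical and the row counts are examples, not measurements.

```python
def adaptive_cv_settings(
    min_series_len: int,
    horizon: int = 24,
    cv_windows: int = 5,
    cv_step_size: int = 168,
) -> tuple[int, int]:
    """Return (n_windows, step_size) that fit the shortest series."""
    available = min_series_len - horizon
    # Shrink the step for short series, but never below one day.
    step_size = min(cv_step_size, max(24, available // 3))
    # Use as many windows as fit, capped at cv_windows, with a floor of 2.
    n_windows = min(cv_windows, max(2, available // step_size))
    return n_windows, step_size


# 30 days of hourly data: the weekly step is kept, but only 4 windows fit.
long_settings = adaptive_cv_settings(720)
# 4 days of hourly data: the step shrinks to 24h and 3 windows fit.
short_settings = adaptive_cv_settings(96)

for rows, (n_windows, step_size) in [(720, long_settings), (96, short_settings)]:
    needed = 24 + n_windows * step_size
    print(f"{rows} rows -> {n_windows} windows, step={step_size}h, needs {needed} rows")
```

Note that the floor of 2 windows and the 24h minimum step can still ask for more rows than a very short series has, so a length check before running CV is worth considering.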

In [None]:
%%writefile src/renewable/tasks.py
# file: src/renewable/tasks.py
"""Renewable energy forecasting pipeline tasks.

Idempotent tasks for:
- Fetching EIA renewable generation data
- Fetching weather data from Open-Meteo
- Training probabilistic models
- Generating forecasts with intervals
- Computing drift metrics
"""

import argparse
import logging
import os
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

import pandas as pd

from src.renewable.eia_renewable import EIARenewableFetcher
from src.renewable.modeling import (
    RenewableForecastModel,
    _log_series_summary,
    compute_baseline_metrics,
)
from src.renewable.open_meteo import OpenMeteoRenewable
from src.renewable.regions import REGIONS, list_regions

logger = logging.getLogger(__name__)


@dataclass
class RenewablePipelineConfig:
    """Configuration for renewable forecasting pipeline."""

    # Data parameters
    regions: list[str] = field(default_factory=lambda: ["CALI", "ERCO", "MISO", "PJM", "SWPP"])
    fuel_types: list[str] = field(default_factory=lambda: ["WND", "SUN"])
    start_date: str = ""  # Set dynamically
    end_date: str = ""  # Set dynamically
    lookback_days: int = 30

    # Forecast parameters
    horizon: int = 24
    confidence_levels: tuple[int, int] = (80, 95)

    # CV parameters
    cv_windows: int = 5
    cv_step_size: int = 168  # 1 week

    # Output paths
    data_dir: str = "data/renewable"
    overwrite: bool = False

    def __post_init__(self):
        # Set default dates if not provided
        if not self.end_date:
            self.end_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        if not self.start_date:
            end = datetime.strptime(self.end_date, "%Y-%m-%d")
            start = end - timedelta(days=self.lookback_days)
            self.start_date = start.strftime("%Y-%m-%d")

    def generation_path(self) -> Path:
        return Path(self.data_dir) / "generation.parquet"

    def weather_path(self) -> Path:
        return Path(self.data_dir) / "weather.parquet"

    def forecasts_path(self) -> Path:
        return Path(self.data_dir) / "forecasts.parquet"

    def baseline_path(self) -> Path:
        return Path(self.data_dir) / "baseline.json"


def fetch_renewable_data(
    config: RenewablePipelineConfig,
    fetch_diagnostics: Optional[list[dict]] = None,
) -> pd.DataFrame:
    """Task 1: Fetch EIA generation data for all regions and fuel types.

    Args:
        config: Pipeline configuration
        fetch_diagnostics: Optional list to capture per-region fetch metadata

    Returns:
        DataFrame with columns [unique_id, ds, y]
    """
    output_path = config.generation_path()
    output_path.parent.mkdir(parents=True, exist_ok=True)

    def _log_generation_summary(df: pd.DataFrame, source: str) -> None:
        _log_series_summary(df, value_col="y", label=f"generation_data_{source}")

        expected_series = {
            f"{region}_{fuel}" for region in config.regions for fuel in config.fuel_types
        }
        present_series = set(df["unique_id"]) if "unique_id" in df.columns else set()
        missing_series = sorted(expected_series - present_series)
        if missing_series:
            logger.warning(
                "[fetch_generation] Missing expected series (%s): %s",
                source,
                missing_series,
            )

        if df.empty:
            logger.warning("[fetch_generation] No generation data rows (%s).", source)
            return

        coverage = (
            df.groupby("unique_id")["ds"]
            .agg(min_ds="min", max_ds="max", rows="count")
            .reset_index()
            .sort_values("unique_id")
        )
        max_series_log = 25
        if len(coverage) > max_series_log:
            logger.info(
                "[fetch_generation] Coverage (%s, first %s series):\n%s",
                source,
                max_series_log,
                coverage.head(max_series_log).to_string(index=False),
            )
        else:
            logger.info("[fetch_generation] Coverage (%s):\n%s", source, coverage.to_string(index=False))

    if output_path.exists() and not config.overwrite:
        logger.info(f"[fetch_generation] exists, loading: {output_path}")
        cached = pd.read_parquet(output_path)
        # Log cached coverage to surface missing series without refetching.
        _log_generation_summary(cached, source="cache")
        return cached

    logger.info(f"[fetch_generation] Fetching {config.fuel_types} for {config.regions}")

    fetcher = EIARenewableFetcher()
    all_dfs = []

    for fuel_type in config.fuel_types:
        df = fetcher.fetch_all_regions(
            fuel_type=fuel_type,
            start_date=config.start_date,
            end_date=config.end_date,
            regions=config.regions,
            diagnostics=fetch_diagnostics,
        )
        all_dfs.append(df)

    combined = pd.concat(all_dfs, ignore_index=True)
    combined = combined.sort_values(["unique_id", "ds"]).reset_index(drop=True)

    # Log fresh coverage to highlight gaps or unexpected negatives.
    _log_generation_summary(combined, source="fresh")

    if fetch_diagnostics:
        empty_series = [
            entry
            for entry in fetch_diagnostics
            if entry.get("empty")
        ]
        for entry in empty_series:
            logger.warning(
                "[fetch_generation] Empty series detail: region=%s fuel=%s total=%s pages=%s",
                entry.get("region"),
                entry.get("fuel_type"),
                entry.get("total_records"),
                entry.get("pages"),
            )

    combined.to_parquet(output_path, index=False)
    logger.info(f"[fetch_generation] Saved: {output_path} ({len(combined)} rows)")

    return combined


def fetch_renewable_weather(
    config: RenewablePipelineConfig,
    include_forecast: bool = True,
) -> pd.DataFrame:
    """Task 2: Fetch weather data for all regions.

    Args:
        config: Pipeline configuration
        include_forecast: Include forecast weather for predictions

    Returns:
        DataFrame with columns [ds, region, weather_vars...]
    """
    output_path = config.weather_path()
    output_path.parent.mkdir(parents=True, exist_ok=True)

    def _log_weather_summary(df: pd.DataFrame, source: str) -> None:
        if df.empty:
            logger.warning("[fetch_weather] No weather data rows (%s).", source)
            return

        coverage = (
            df.groupby("region")["ds"]
            .agg(min_ds="min", max_ds="max", rows="count")
            .reset_index()
            .sort_values("region")
        )
        max_region_log = 25
        if len(coverage) > max_region_log:
            logger.info(
                "[fetch_weather] Coverage (%s, first %s regions):\n%s",
                source,
                max_region_log,
                coverage.head(max_region_log).to_string(index=False),
            )
        else:
            logger.info("[fetch_weather] Coverage (%s):\n%s", source, coverage.to_string(index=False))

        missing_cols = [
            col for col in OpenMeteoRenewable.WEATHER_VARS if col not in df.columns
        ]
        if missing_cols:
            logger.warning(
                "[fetch_weather] Missing expected weather columns (%s): %s",
                source,
                missing_cols,
            )

        missing_values = {
            col: int(df[col].isna().sum())
            for col in OpenMeteoRenewable.WEATHER_VARS
            if col in df.columns and df[col].isna().any()
        }
        if missing_values:
            logger.warning(
                "[fetch_weather] Missing weather values (%s): %s",
                source,
                missing_values,
            )

    if output_path.exists() and not config.overwrite:
        logger.info(f"[fetch_weather] exists, loading: {output_path}")
        cached = pd.read_parquet(output_path)
        # Log cached weather coverage to surface missing regions/columns.
        _log_weather_summary(cached, source="cache")
        return cached

    logger.info(f"[fetch_weather] Fetching weather for {config.regions}")

    weather = OpenMeteoRenewable()

    # Historical weather
    hist_df = weather.fetch_all_regions_historical(
        regions=config.regions,
        start_date=config.start_date,
        end_date=config.end_date,
    )

    # Forecast weather (for prediction, prevents leakage)
    if include_forecast:
        fcst_df = weather.fetch_all_regions_forecast(
            regions=config.regions,
            horizon_hours=config.horizon + 24,  # Buffer
        )

        # Combine, preferring forecast for overlapping times
        combined = pd.concat([hist_df, fcst_df], ignore_index=True)
        combined = combined.drop_duplicates(subset=["ds", "region"], keep="last")
    else:
        combined = hist_df

    combined = combined.sort_values(["region", "ds"]).reset_index(drop=True)

    # Log fresh weather coverage and missing values before saving.
    _log_weather_summary(combined, source="fresh")

    combined.to_parquet(output_path, index=False)
    logger.info(f"[fetch_weather] Saved: {output_path} ({len(combined)} rows)")

    return combined


def train_renewable_models(
    config: RenewablePipelineConfig,
    generation_df: Optional[pd.DataFrame] = None,
    weather_df: Optional[pd.DataFrame] = None,
) -> tuple[pd.DataFrame, pd.DataFrame, dict]:
    """Task 3: Train models and compute baseline metrics via cross-validation.

    Args:
        config: Pipeline configuration
        generation_df: Generation data (loads from file if None)
        weather_df: Weather data (loads from file if None)

    Returns:
        Tuple of (cv_results, leaderboard, baseline_metrics)
    """
    # Load data if not provided
    if generation_df is None:
        generation_df = pd.read_parquet(config.generation_path())
    if weather_df is None:
        weather_df = pd.read_parquet(config.weather_path())

    logger.info(f"[train_models] Training on {len(generation_df)} rows")

    model = RenewableForecastModel(
        horizon=config.horizon,
        confidence_levels=config.confidence_levels,
    )

    # Compute adaptive CV settings based on shortest series
    min_series_len = generation_df.groupby("unique_id").size().min()

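    # CV needs: horizon + (n_windows * step_size) rows minimum
    # Solve for n_windows: n_windows = (min_series_len - horizon) // step_size
    available_for_cv = min_series_len - config.horizon

    # Fail early if even the smallest setting (2 windows, 24h step) cannot fit
    if available_for_cv < 2 * 24:
        raise ValueError(
            f"Shortest series has {min_series_len} rows; need at least "
            f"{config.horizon + 2 * 24} for 2 CV windows with a 24h step"
        )

    # Adjust step_size and n_windows to fit data
    step_size = min(config.cv_step_size, max(24, available_for_cv // 3))
    n_windows = min(config.cv_windows, max(2, available_for_cv // step_size))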

    logger.info(
        f"[train_models] Adaptive CV: {n_windows} windows, "
        f"step={step_size}h (min_series={min_series_len} rows)"
    )

    # Cross-validation
    cv_results, leaderboard = model.cross_validate(
        df=generation_df,
        weather_df=weather_df,
        n_windows=n_windows,
        step_size=step_size,
    )

    # Compute baseline for drift detection
    best_model = leaderboard.iloc[0]["model"]
    baseline = compute_baseline_metrics(cv_results, model_name=best_model)

    logger.info(f"[train_models] Best model: {best_model}, RMSE: {baseline['rmse_mean']:.1f}")

    return cv_results, leaderboard, baseline


def generate_renewable_forecasts(
    config: RenewablePipelineConfig,
    generation_df: Optional[pd.DataFrame] = None,
    weather_df: Optional[pd.DataFrame] = None,
) -> pd.DataFrame:
    """Task 4: Generate forecasts with prediction intervals.

    Args:
        config: Pipeline configuration
        generation_df: Generation data (loads from file if None)
        weather_df: Weather data (loads from file if None)

    Returns:
        DataFrame with forecasts [unique_id, ds, yhat, intervals...]
    """
    output_path = config.forecasts_path()
    output_path.parent.mkdir(parents=True, exist_ok=True)

    # Load data if not provided
    if generation_df is None:
        generation_df = pd.read_parquet(config.generation_path())
    if weather_df is None:
        weather_df = pd.read_parquet(config.weather_path())

    logger.info(f"[generate_forecasts] Generating {config.horizon}h forecasts")

    model = RenewableForecastModel(
        horizon=config.horizon,
        confidence_levels=config.confidence_levels,
    )

    # Fit on all data
    model.fit(generation_df, weather_df)

    # Predict
    forecasts = model.predict()

    forecasts.to_parquet(output_path, index=False)
    logger.info(f"[generate_forecasts] Saved: {output_path} ({len(forecasts)} rows)")

    return forecasts


def compute_renewable_drift(
    predictions: pd.DataFrame,
    actuals: pd.DataFrame,
    baseline_metrics: dict,
) -> dict:
    """Task 5: Detect drift by comparing current metrics to baseline.

    Drift is flagged when current RMSE > baseline_mean + 2*baseline_std

    Args:
        predictions: Forecast DataFrame with [unique_id, ds, yhat]
        actuals: Actual values DataFrame with [unique_id, ds, y]
        baseline_metrics: Baseline from cross-validation

    Returns:
        Dictionary with drift status and details
    """
    from src.chapter2.evaluation import ForecastMetrics

    # Merge predictions with actuals
    merged = predictions.merge(
        actuals[["unique_id", "ds", "y"]],
        on=["unique_id", "ds"],
        how="inner",
    )

    if len(merged) == 0:
        return {
            "status": "no_data",
            "message": "No overlapping data between predictions and actuals",
        }

    # Compute current metrics
    y_true = merged["y"].values
    y_pred = merged["yhat"].values

    current_rmse = ForecastMetrics.rmse(y_true, y_pred)
    current_mae = ForecastMetrics.mae(y_true, y_pred)

    # Check against threshold
    threshold = baseline_metrics.get("drift_threshold_rmse", float("inf"))
    is_drifting = current_rmse > threshold

    result = {
        "status": "drift_detected" if is_drifting else "stable",
        "current_rmse": float(current_rmse),
        "current_mae": float(current_mae),
        "baseline_rmse": float(baseline_metrics.get("rmse_mean", 0)),
        "drift_threshold": float(threshold),
        "threshold_exceeded_by": float(max(0, current_rmse - threshold)),
        "n_predictions": len(merged),
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

    if is_drifting:
        logger.warning(
            f"[drift] DRIFT DETECTED: RMSE={current_rmse:.1f} > threshold={threshold:.1f}"
        )
    else:
        logger.info(f"[drift] Stable: RMSE={current_rmse:.1f} <= threshold={threshold:.1f}")

    return result


def run_full_pipeline(
    config: RenewablePipelineConfig,
    fetch_diagnostics: Optional[list[dict]] = None,
) -> dict:
    """Run the complete renewable forecasting pipeline.

    Steps:
    1. Fetch generation data
    2. Fetch weather data
    3. Train models (CV)
    4. Generate forecasts

    Args:
        config: Pipeline configuration
        fetch_diagnostics: Optional list to capture per-region fetch metadata

    Returns:
        Dictionary with pipeline results
    """
    logger.info(f"[pipeline] Starting: {config.start_date} to {config.end_date}")
    logger.info(f"[pipeline] Regions: {config.regions}")
    logger.info(f"[pipeline] Fuel types: {config.fuel_types}")

    results = {}

    # Step 1: Fetch generation
    generation_df = fetch_renewable_data(config, fetch_diagnostics=fetch_diagnostics)
    results["generation_rows"] = len(generation_df)
    results["series_count"] = generation_df["unique_id"].nunique()

    # Step 2: Fetch weather
    weather_df = fetch_renewable_weather(config)
    results["weather_rows"] = len(weather_df)

    # Step 3: Train and validate
    cv_results, leaderboard, baseline = train_renewable_models(
        config, generation_df, weather_df
    )
    results["best_model"] = leaderboard.iloc[0]["model"]
    results["best_rmse"] = float(leaderboard.iloc[0]["rmse"])
    results["baseline"] = baseline

    # Step 4: Generate forecasts
    forecasts = generate_renewable_forecasts(config, generation_df, weather_df)
    results["forecast_rows"] = len(forecasts)

    if fetch_diagnostics is not None:
        results["fetch_diagnostics"] = fetch_diagnostics

    logger.info(f"[pipeline] Complete. Best model: {results['best_model']}")

    return results


def main():
    """CLI entry point for renewable pipeline."""
    parser = argparse.ArgumentParser(description="Renewable Energy Forecasting Pipeline")

    parser.add_argument(
        "--regions",
        type=str,
        default="CALI,ERCO,MISO",
        help="Comma-separated region codes (default: CALI,ERCO,MISO)",
    )
    parser.add_argument(
        "--fuel",
        type=str,
        default="WND,SUN",
        help="Comma-separated fuel types (default: WND,SUN)",
    )
    parser.add_argument(
        "--days",
        type=int,
        default=30,
        help="Lookback days (default: 30)",
    )
    parser.add_argument(
        "--horizon",
        type=int,
        default=24,
        help="Forecast horizon in hours (default: 24)",
    )
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help="Overwrite existing data files",
    )
    parser.add_argument(
        "--data-dir",
        type=str,
        default="data/renewable",
        help="Output directory (default: data/renewable)",
    )

    args = parser.parse_args()

    # Configure logging
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    # Build config
    config = RenewablePipelineConfig(
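        # Strip whitespace and drop empty items so "CALI, ERCO" parses cleanly
        regions=[r.strip() for r in args.regions.split(",") if r.strip()],
        fuel_types=[f.strip() for f in args.fuel.split(",") if f.strip()],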
        lookback_days=args.days,
        horizon=args.horizon,
        overwrite=args.overwrite,
        data_dir=args.data_dir,
    )

    # Run pipeline
    results = run_full_pipeline(config)

    print("\n" + "=" * 60)
    print("PIPELINE RESULTS")
    print("=" * 60)
    print(f"  Series count: {results['series_count']}")
    print(f"  Generation rows: {results['generation_rows']}")
    print(f"  Weather rows: {results['weather_rows']}")
    print(f"  Forecast rows: {results['forecast_rows']}")
    print(f"  Best model: {results['best_model']}")
    print(f"  Best RMSE: {results['best_rmse']:.1f}")
    print("=" * 60)


if __name__ == "__main__":
    main()


# Validation

**File:** `src/renewable/validation.py`

In [None]:
%%writefile src/renewable/validation.py
# file: src/renewable/validation.py
"""Validation utilities for renewable generation data."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional

import pandas as pd


@dataclass(frozen=True)
class ValidationReport:
    """Outcome of one validation run: pass/fail flag, summary, diagnostics."""

    ok: bool
    message: str
    details: dict


def validate_generation_df(
    df: pd.DataFrame,
    *,
    max_lag_hours: int = 3,
    max_missing_ratio: float = 0.02,
    expected_series: Optional[Iterable[str]] = None,
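) -> ValidationReport:
    """Validate generation data in StatsForecast layout (unique_id, ds, y).

    Checks run in order and the first failure is returned: required columns,
    non-empty frame, parseable ds and y, non-negative y, no duplicate
    (unique_id, ds) rows, expected series present, freshness (overall and
    per series), and the share of missing hourly points per series.

    Args:
        df: Generation data with unique_id, ds, and y columns
        max_lag_hours: Maximum allowed age of the newest timestamp, in hours
        max_missing_ratio: Maximum allowed share of missing hourly points
        expected_series: Optional unique_id values that must be present

    Returns:
        ValidationReport with ok=True only if every check passes
    """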
    required = {"unique_id", "ds", "y"}
    missing_cols = required - set(df.columns)
    if missing_cols:
        return ValidationReport(
            False,
            "Missing required columns",
            {"missing_cols": sorted(missing_cols)},
        )

    if df.empty:
        return ValidationReport(False, "Generation data is empty", {})

    work = df.copy()

    work["ds"] = pd.to_datetime(work["ds"], errors="coerce", utc=True)
    if work["ds"].isna().any():
        return ValidationReport(
            False,
            "Unparseable ds values found",
            {"bad_ds": int(work["ds"].isna().sum())},
        )

    work["y"] = pd.to_numeric(work["y"], errors="coerce")
    if work["y"].isna().any():
        return ValidationReport(
            False,
            "Unparseable y values found",
            {"bad_y": int(work["y"].isna().sum())},
        )

    if (work["y"] < 0).any():
        return ValidationReport(
            False,
            "Negative generation values found",
            {"neg_y": int((work["y"] < 0).sum())},
        )

    dup = work.duplicated(subset=["unique_id", "ds"]).sum()
    if dup:
        return ValidationReport(
            False,
            "Duplicate (unique_id, ds) rows found",
            {"duplicates": int(dup)},
        )

    if expected_series:
        expected = sorted(set(expected_series))
        present = sorted(set(work["unique_id"]))
        missing_series = sorted(set(expected) - set(present))
        if missing_series:
            return ValidationReport(
                False,
                "Missing expected series",
                {"missing_series": missing_series, "present_series": present},
            )

    # Lowercase "h" is the hourly alias; uppercase "H" is deprecated in pandas 2.2+
    now_utc = pd.Timestamp.now(tz="UTC").floor("h")
    max_ds = work["ds"].max()
    lag_hours = (now_utc - max_ds).total_seconds() / 3600.0
    if lag_hours > max_lag_hours:
        return ValidationReport(
            False,
            "Data not fresh enough",
            {
                "now_utc": now_utc.isoformat(),
                "max_ds": max_ds.isoformat(),
                "lag_hours": lag_hours,
            },
        )

    series_max = work.groupby("unique_id")["ds"].max()
    series_lag = (now_utc - series_max).dt.total_seconds() / 3600.0
    stale = series_lag[series_lag > max_lag_hours].sort_values(ascending=False)
    if not stale.empty:
        return ValidationReport(
            False,
            "Stale series found",
            {
                "stale_series": stale.head(10).to_dict(),
                "max_lag_hours": max_lag_hours,
            },
        )

    missing_ratios = {}
    for uid, group in work.groupby("unique_id"):
        group = group.sort_values("ds")
        start = group["ds"].iloc[0]
        end = group["ds"].iloc[-1]
        expected = int(((end - start) / pd.Timedelta(hours=1)) + 1)
        actual = len(group)
        missing = max(expected - actual, 0)
        missing_ratios[uid] = missing / max(expected, 1)

    worst_uid = max(missing_ratios, key=missing_ratios.get)
    worst_ratio = missing_ratios[worst_uid]
    if worst_ratio > max_missing_ratio:
        return ValidationReport(
            False,
            "Too many missing hourly points",
            {"worst_uid": worst_uid, "worst_missing_ratio": worst_ratio},
        )

    return ValidationReport(
        True,
        "OK",
        {
            "row_count": len(work),
            "series_count": int(work["unique_id"].nunique()),
            "max_ds": max_ds.isoformat(),
            "lag_hours": lag_hours,
            "worst_missing_ratio": worst_ratio,
        },
    )
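
The gap check at the end of `validate_generation_df` is easier to reason about with a small worked example. The sketch below repeats the same arithmetic (expected hourly slots between the first and last timestamp, minus the rows present) using only the standard library. The helper name `missing_hour_ratio` and the sample timestamps are illustrative assumptions, not part of `src/renewable/validation.py`.

```python
from datetime import datetime, timedelta, timezone


def missing_hour_ratio(timestamps: list[datetime]) -> float:
    """Share of hourly slots absent between the first and last timestamp.

    Hypothetical helper that mirrors the per-series arithmetic in
    validate_generation_df; it is not part of the project code.
    """
    if not timestamps:
        return 0.0
    ordered = sorted(timestamps)
    span = ordered[-1] - ordered[0]
    # Both endpoints count as slots, hence the +1.
    expected = int(span / timedelta(hours=1)) + 1
    missing = max(expected - len(ordered), 0)
    return missing / max(expected, 1)


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
# 24 hourly slots with hours 5 and 6 dropped to simulate a short outage.
series = [start + timedelta(hours=h) for h in range(24) if h not in (5, 6)]

ratio = missing_hour_ratio(series)
print(f"missing ratio: {ratio:.4f}")
print("within default 2% limit:", ratio <= 0.02)
```

Two missing hours out of 24 give a ratio of about 0.083, so this series would fail the default `max_missing_ratio=0.02`. For scale, a series spanning 720 hourly slots (30 days) tolerates at most 14 missing points under that default.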


---

# Module 8: Dashboard

**File:** `src/renewable/dashboard.py`

The Streamlit dashboard provides:
- **Forecast visualization** with prediction intervals
- **Drift monitoring** and alerts
- **Coverage analysis** (nominal vs empirical)
- **Weather features** by region
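
Coverage analysis compares the nominal level of an interval (for example 80%) with the share of actual values that landed inside it. The sketch below shows that calculation on made-up numbers. The function name `empirical_coverage` and the sample values are assumptions for illustration; the dashboard may compute coverage differently.

```python
def empirical_coverage(
    actuals: list[float], lower: list[float], upper: list[float]
) -> float:
    """Share of actual values that fall inside [lower, upper] (bounds inclusive)."""
    if not (len(actuals) == len(lower) == len(upper)):
        raise ValueError("actuals, lower, and upper must have equal length")
    if not actuals:
        raise ValueError("need at least one observation")
    hits = sum(lo <= y <= hi for y, lo, hi in zip(actuals, lower, upper))
    return hits / len(actuals)


# Illustrative values only, not pipeline output.
actuals = [10.0, 12.0, 0.0, 7.5, 20.0]
lower_80 = [8.0, 9.0, 0.0, 8.0, 15.0]
upper_80 = [12.0, 11.0, 1.0, 10.0, 25.0]

nominal = 0.80
coverage = empirical_coverage(actuals, lower_80, upper_80)
print(f"nominal={nominal:.2f} empirical={coverage:.2f} gap={coverage - nominal:+.2f}")
```

Here three of five actuals fall inside the interval, so empirical coverage is 0.60 against a nominal 0.80. A persistent shortfall like this suggests the intervals are too narrow, while coverage well above nominal suggests they are wider than necessary.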

## Running the Dashboard

```bash
streamlit run src/renewable/dashboard.py
```

The dashboard will:
1. Load forecasts from `data/renewable/forecasts.parquet`
2. Display interactive charts with Plotly
3. Show drift alerts from the database

---

# Summary

## What We Built

| Module | Purpose | Key Concept |
|--------|---------|-------------|
| `regions.py` | Region definitions | EIA codes + coordinates |
| `eia_renewable.py` | Data fetching | StatsForecast format |
| `open_meteo.py` | Weather integration | Leakage prevention |
| `modeling.py` | Forecasting | Probabilistic intervals |
| `db.py` | Persistence | SQLite with WAL |
| `tasks.py` | Pipeline orchestration | Adaptive CV |
| `validation.py` | Data quality checks | Freshness + missing-hour limits |
| `dashboard.py` | Visualization | Streamlit + Plotly |

## Key Takeaways

1. **StatsForecast format**: `[unique_id, ds, y]` enables multi-series modeling
2. **No MAPE for renewables**: Solar has zeros - use RMSE/MAE instead
3. **Weather leakage**: Use forecast weather for predictions, not historical
4. **Drift detection**: threshold = baseline_mean + 2 × baseline_std
5. **Adaptive CV**: Adjust window count for short time series
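
The drift rule in takeaway 4 can be worked through by hand. The sketch below derives a threshold from a few backtest RMSE values and classifies new RMSE readings against it, using the same `drift_detected` / `stable` labels as the pipeline's drift check. The helper names, the sample RMSE values, and the choice of population standard deviation (`pstdev`) are assumptions; the pipeline's own baseline computation may use a different estimator.

```python
from statistics import mean, pstdev


def drift_threshold(backtest_rmse: list[float], n_std: float = 2.0) -> float:
    """Return mean + n_std * std of the backtest RMSE values.

    Assumption: population standard deviation (pstdev). Swap in
    statistics.stdev if the baseline uses the sample estimator.
    """
    return mean(backtest_rmse) + n_std * pstdev(backtest_rmse)


def drift_status(current_rmse: float, threshold: float) -> str:
    """Flag drift only when the current RMSE strictly exceeds the threshold."""
    return "drift_detected" if current_rmse > threshold else "stable"


# Illustrative RMSE per cross-validation window, not real results.
backtest_rmse = [100.0, 110.0, 90.0, 105.0, 95.0]
threshold = drift_threshold(backtest_rmse)

for current in (110.0, 120.0):
    print(f"RMSE={current:.1f} threshold={threshold:.1f} -> {drift_status(current, threshold)}")
```

With these numbers the mean is 100 and the population standard deviation is about 7.07, so the threshold is about 114.1: an RMSE of 110 is stable and 120 is flagged. With only a handful of windows the standard deviation is itself noisy, which is one reason to keep enough CV windows when the series is long enough.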

## Next Steps

1. Get an EIA API key and run with real data
2. Launch the dashboard: `streamlit run src/renewable/dashboard.py`
3. Experiment with different regions and fuel types
4. Set up scheduled pipeline runs for production