# Renewable Energy Forecasting Pipeline

This notebook walks through building a **next-24h renewable generation forecast system** with:

- **EIA data integration** - Hourly wind/solar generation for US regions
- **Weather features** - Open-Meteo integration (wind speed, solar radiation)
- **Probabilistic forecasting** - Dual prediction intervals (80%, 95%)
- **Drift monitoring** - Automatic detection of model degradation

## Architecture Overview

```
EIA API (WND/SUN) ──┐
                    ├──► Data Pipeline ──► StatsForecast ──► Predictions
Open-Meteo API ─────┘         │                  │              │
                              ▼                  ▼              ▼
                         Validation        Multi-Series    Probabilistic
                         & Quality         [unique_id,     (80%, 95%
                                           ds, y, X]       intervals)
                                                              │
                                                              ▼
                                                         Streamlit
                                                         Dashboard
                                                         (drift, alerts)
```

## Key Concepts

1. **StatsForecast format**: `[unique_id, ds, y]` - where `unique_id` = `{region}_{fuel_type}`
2. **Zero-value handling**: Solar generates 0 at night - we use RMSE/MAE, NOT MAPE
3. **Leakage prevention**: Use **forecasted** weather for predictions, not historical
4. **Drift detection**: Threshold = mean + 2*std from backtest
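
The drift rule in item 4 can be sketched in a few lines. This is a minimal illustration, assuming the backtest yields one RMSE value per window; the helper names `drift_threshold` and `is_drifting`, the sample error values, and the choice of population standard deviation (`ddof=0`) are assumptions for the example, not part of the project code.

```python
import numpy as np


def drift_threshold(backtest_errors, n_std: float = 2.0) -> float:
    """Hypothetical helper: threshold = mean + n_std * std of backtest errors."""
    errors = np.asarray(backtest_errors, dtype=float)
    # ddof=0 (population std) is an assumption; the source does not specify it.
    return float(errors.mean() + n_std * errors.std(ddof=0))


def is_drifting(current_error: float, threshold: float) -> bool:
    """Flag drift when the live error exceeds the backtest-derived threshold."""
    return current_error > threshold


# Illustrative per-window RMSE values from a backtest (made up for the example)
backtest_rmse = [100.0, 110.0, 90.0, 105.0, 95.0]
threshold = drift_threshold(backtest_rmse)

print(f"threshold={threshold:.2f}")
print(f"RMSE 120 drifting? {is_drifting(120.0, threshold)}")
print(f"RMSE 110 drifting? {is_drifting(110.0, threshold)}")
```

Deriving the threshold from backtest errors keeps the alert level tied to what the model has actually achieved, rather than to a hand-picked constant.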

## Setup

First, let's ensure we have the project root in our path and configure logging.

In [24]:
import sys
import logging
from pathlib import Path
import os 

# Add project root to path
project_root = r"c:\docker_projects\atsaf"
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

if os.getcwd() != str(project_root):
    os.chdir(project_root)
    print(f"Changed working directory to project root: {os.getcwd()}")

# Configure logging for visibility
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)

print(f"Project root: {project_root}")

Project root: c:\docker_projects\atsaf


---

# Module 1: Region Definitions

**File:** `src/renewable/regions.py`

This module maps **EIA balancing authority regions** to their geographic coordinates. Why do we need coordinates?

- **Weather API lookup**: Open-Meteo requires latitude/longitude
- **Regional analysis**: Compare forecast accuracy across regions
- **Timezone handling**: Each region has a primary timezone

## Key Design Decisions

1. **NamedTuple for RegionInfo**: Immutable, type-safe, and memory-efficient
2. **Centroid coordinates**: Approximate centers - good enough for hourly weather
3. **Fuel type codes**: `WND` (wind), `SUN` (solar) - match EIA's API
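
To make the immutability point in item 1 concrete, the sketch below redefines a `RegionInfo` with the same fields as the module and shows that attribute assignment fails, while `_replace` returns an updated copy. The respondent code `"XXXX"` is a placeholder for illustration, not a verified EIA respondent.

```python
from typing import NamedTuple, Optional


class RegionInfo(NamedTuple):
    """Same fields as the module's RegionInfo, repeated so the example is self-contained."""
    name: str
    lat: float
    lon: float
    timezone: str
    eia_respondent: Optional[str] = None


nw = RegionInfo(name="Northwest", lat=45.5, lon=-122.0, timezone="America/Los_Angeles")

# NamedTuple fields are read-only: assignment raises AttributeError.
try:
    nw.eia_respondent = "XXXX"  # placeholder code, not a real respondent
    mutated = True
except AttributeError:
    mutated = False

# _replace builds a new tuple and leaves the original untouched.
nw_updated = nw._replace(eia_respondent="XXXX")

print(mutated, nw.eia_respondent, nw_updated.eia_respondent)
```

Because entries cannot be mutated in place, configuring a respondent means editing the `REGIONS` literal or assigning a new `RegionInfo` to the dictionary key.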

In [25]:
%%writefile src/renewable/regions.py
# src/renewable/regions.py
from __future__ import annotations

from typing import NamedTuple, Optional


class RegionInfo(NamedTuple):
    """Region metadata for EIA and weather lookups."""
    name: str
    lat: float
    lon: float
    timezone: str
    # Some internal regions may not map cleanly to an EIA respondent.
    # We keep them in REGIONS for weather/features, but EIA fetch requires this.
    eia_respondent: Optional[str] = None


REGIONS: dict[str, RegionInfo] = {
    # Western Interconnection
    "CALI": RegionInfo(
        name="California ISO",
        lat=36.7,
        lon=-119.4,
        timezone="America/Los_Angeles",
        eia_respondent="CISO",
    ),
    "NW": RegionInfo(
        name="Northwest",
        lat=45.5,
        lon=-122.0,
        timezone="America/Los_Angeles",
        eia_respondent=None,  # intentionally unset until verified
    ),
    "SW": RegionInfo(
        name="Southwest",
        lat=33.5,
        lon=-112.0,
        timezone="America/Phoenix",
        eia_respondent=None,  # intentionally unset until verified
    ),

    # Texas Interconnection
    "ERCO": RegionInfo(
        name="ERCOT (Texas)",
        lat=31.0,
        lon=-100.0,
        timezone="America/Chicago",
        eia_respondent="ERCO",
    ),

    # Midwest
    "MISO": RegionInfo(
        name="Midcontinent ISO",
        lat=41.0,
        lon=-93.0,
        timezone="America/Chicago",
        eia_respondent="MISO",
    ),
    "PJM": RegionInfo(
        name="PJM Interconnection",
        lat=39.0,
        lon=-77.0,
        timezone="America/New_York",
        eia_respondent="PJM",
    ),
    "SWPP": RegionInfo(
        name="Southwest Power Pool",
        lat=37.0,
        lon=-97.0,
        timezone="America/Chicago",
        eia_respondent="SWPP",
    ),

    # Internal/aggregate regions kept for non-EIA use (weather/features/etc.)
    "SE": RegionInfo(name="Southeast", lat=33.0, lon=-84.0, timezone="America/New_York", eia_respondent=None),
    "FLA": RegionInfo(name="Florida", lat=28.0, lon=-82.0, timezone="America/New_York", eia_respondent=None),
    "CAR": RegionInfo(name="Carolinas", lat=35.5, lon=-80.0, timezone="America/New_York", eia_respondent=None),
    "TEN": RegionInfo(name="Tennessee Valley", lat=35.5, lon=-86.0, timezone="America/Chicago", eia_respondent=None),

    "US48": RegionInfo(name="Lower 48 States", lat=39.8, lon=-98.5, timezone="America/Chicago", eia_respondent=None),
}

FUEL_TYPES = {"WND": "Wind", "SUN": "Solar"}


def list_regions() -> list[str]:
    return sorted(REGIONS.keys())


def get_region_info(region_code: str) -> RegionInfo:
    return REGIONS[region_code]


def get_region_coords(region_code: str) -> tuple[float, float]:
    r = REGIONS[region_code]
    return (r.lat, r.lon)


def get_eia_respondent(region_code: str) -> str:
    """Return the code EIA expects for facets[respondent][]. Fail loudly if missing."""
    info = REGIONS[region_code]
    if not info.eia_respondent:
        raise ValueError(
            f"Region '{region_code}' has no configured eia_respondent. "
            "Add a verified EIA respondent code to its REGIONS entry "
            "(RegionInfo is immutable, so the entry itself must be redefined) "
            "before using it for EIA fetches."
        )
    return info.eia_respondent


def validate_region(region_code: str) -> bool:
    return region_code in REGIONS


def validate_fuel_type(fuel_type: str) -> bool:
    return fuel_type in FUEL_TYPES



if __name__ == "__main__":
    # Example run - test region functions

    print("=== Available Regions ===")
    print(f"Total regions: {len(REGIONS)}")
    print(f"Region codes: {list_regions()}")

    print("\n=== Example: California ===")
    cali_info = get_region_info("CALI")
    print(f"Name: {cali_info.name}")
    print(f"Coordinates: ({cali_info.lat}, {cali_info.lon})")
    print(f"Timezone: {cali_info.timezone}")

    print("\n=== Weather API Coordinates ===")
    for region in ["CALI", "ERCO", "MISO"]:
        lat, lon = get_region_coords(region)
        print(f"{region}: lat={lat}, lon={lon}")

    print("\n=== Fuel Types ===")
    for code, name in FUEL_TYPES.items():
        print(f"{code}: {name}")

    print("\n=== Validation ===")
    print(f"validate_region('CALI'): {validate_region('CALI')}")
    print(f"validate_region('INVALID'): {validate_region('INVALID')}")
    print(f"validate_fuel_type('WND'): {validate_fuel_type('WND')}")


Overwriting src/renewable/regions.py


### Example: Using Region Definitions

---

# Module 2: EIA Data Fetcher

**File:** `src/renewable/eia_renewable.py`

This module fetches **hourly wind and solar generation** from the EIA API.

## Critical Concepts

### StatsForecast Format
StatsForecast expects data in a specific format:
```
unique_id | ds                  | y
----------|---------------------|--------
CALI_WND  | 2024-01-01 00:00:00 | 1234.5
CALI_WND  | 2024-01-01 01:00:00 | 1456.7
ERCO_WND  | 2024-01-01 00:00:00 | 2345.6
```

- `unique_id`: Identifies the time series (e.g., "CALI_WND" = California Wind)
- `ds`: Datetime column (timezone-naive UTC)
- `y`: Target value (generation in MWh)
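
As a rough illustration of the column contract above, the following sketch reshapes a small per-region frame into the `[unique_id, ds, y]` layout. The input column names (`period`, `region`, `fuel_type`, `value`) and the sample rows are assumptions for the example; they mirror the table above rather than a real API response.

```python
import pandas as pd

# Illustrative raw rows (values copied from the table above)
raw = pd.DataFrame({
    "period": ["2024-01-01 00:00:00", "2024-01-01 01:00:00", "2024-01-01 00:00:00"],
    "region": ["CALI", "CALI", "ERCO"],
    "fuel_type": ["WND", "WND", "WND"],
    "value": [1234.5, 1456.7, 2345.6],
})

long_df = (
    pd.DataFrame({
        # unique_id = {region}_{fuel_type}
        "unique_id": raw["region"] + "_" + raw["fuel_type"],
        # Parse as UTC, then drop the timezone to get naive-UTC timestamps
        "ds": pd.to_datetime(raw["period"], utc=True).dt.tz_localize(None),
        "y": raw["value"].astype(float),
    })
    .sort_values(["unique_id", "ds"])
    .reset_index(drop=True)
)

print(long_df)
```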

### API Rate Limiting
- EIA API has rate limits (~5 requests/second)
- We use controlled parallelism with delays
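
One way to combine parallel workers with a global request rate is a shared throttle. The sketch below is an assumption-laden illustration, not the module's approach (the module sleeps between pages inside each worker): the `Throttle` class and `fake_request` function are hypothetical, and no real HTTP call is made.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class Throttle:
    """Hypothetical helper: at most one call per `min_interval` seconds across all threads."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        # Reserve the next free slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            sleep_for = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if sleep_for > 0:
            time.sleep(sleep_for)


throttle = Throttle(min_interval=0.05)  # 0.2 would correspond to ~5 requests/second


def fake_request(i: int) -> int:
    """Stand-in for an API call; only the throttling is real."""
    throttle.wait()
    return i


start = time.monotonic()
with ThreadPoolExecutor(max_workers=3) as pool:
    results = sorted(pool.map(fake_request, range(4)))
elapsed = time.monotonic() - start

print(results, f"elapsed={elapsed:.2f}s")
```

A shared throttle bounds the aggregate rate regardless of `max_workers`, whereas a per-worker sleep bounds only each thread's own rate.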

In [26]:
%%writefile src/renewable/eia_renewable.py
# src/renewable/eia_renewable.py
from __future__ import annotations

import logging
import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Optional

import pandas as pd
import requests
from dotenv import find_dotenv, load_dotenv
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

from src.renewable.regions import REGIONS, get_eia_respondent, validate_fuel_type, validate_region

logger = logging.getLogger(__name__)

def _sanitize_url(url: str) -> str:
    parts = urlsplit(url)
    q = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k.lower() != "api_key"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(q), parts.fragment))


def _load_env_once(*, debug: bool = False) -> Optional[str]:
    """
    Load .env if present.
    - Primary: find_dotenv(usecwd=True) (walk up from CWD)
    - Fallback: repo_root/.env based on this file location
    Returns the path loaded (or None).
    """
    # 1) Try from current working directory upward
    dotenv_path = find_dotenv(usecwd=True)
    if dotenv_path:
        load_dotenv(dotenv_path, override=False)
        if debug:
            logger.info("Loaded .env via find_dotenv: %s", dotenv_path)
        return dotenv_path

    # 2) Fallback: assume src-layout -> repo root is ../../ from this file
    try:
        repo_root = Path(__file__).resolve().parents[2]
        fallback = repo_root / ".env"
        if fallback.exists():
            load_dotenv(fallback, override=False)
            if debug:
                logger.info("Loaded .env via fallback: %s", str(fallback))
            return str(fallback)
    except Exception:
        pass

    if debug:
        logger.info("No .env found to load.")
    return None


class EIARenewableFetcher:
    BASE_URL = "https://api.eia.gov/v2/electricity/rto/fuel-type-data/data/"
    MAX_RECORDS_PER_REQUEST = 5000
    RATE_LIMIT_DELAY = 0.2  # 5 requests/second max

    def __init__(self, api_key: Optional[str] = None, *, debug_env: bool = False):
        """
        Initialize API key. Pulls from:
        1) explicit api_key argument
        2) environment variable EIA_API_KEY (optionally loaded from .env)
        """
        loaded_env = _load_env_once(debug=debug_env)

        self.api_key = api_key or os.getenv("EIA_API_KEY")
        if not self.api_key:
            raise ValueError(
                "EIA API key required but not found.\n"
                "- Ensure .env contains EIA_API_KEY=...\n"
                "- Ensure your process CWD is under the repo (so find_dotenv can locate it), OR\n"
                "- Pass api_key=... explicitly.\n"
                f"Loaded .env path: {loaded_env}"
            )

        # Debug without leaking the key
        if debug_env:
            masked = self.api_key[:4] + "..." + self.api_key[-4:] if len(self.api_key) >= 8 else "***"
            logger.info("EIA_API_KEY loaded (masked): %s", masked)

    @staticmethod
    def _extract_eia_response(payload: dict, *, request_url: Optional[str] = None) -> tuple[list[dict], dict]:
        if not isinstance(payload, dict):
            raise TypeError(f"EIA payload is not a dict. type={type(payload)} url={request_url}")

        if "error" in payload and payload.get("response") is None:
            raise ValueError(f"EIA returned error payload. url={request_url} error={payload.get('error')}")

        if "response" not in payload:
            raise ValueError(
                f"EIA payload missing 'response'. url={request_url} keys={list(payload.keys())[:25]}"
            )

        response = payload.get("response") or {}
        if not isinstance(response, dict):
            raise TypeError(f"EIA payload['response'] is not a dict. type={type(response)} url={request_url}")

        if "data" not in response:
            raise ValueError(
                f"EIA response missing 'data'. url={request_url} response_keys={list(response.keys())[:25]}"
            )

        records = response.get("data") or []
        if not isinstance(records, list):
            raise TypeError(f"EIA response['data'] is not a list. type={type(records)} url={request_url}")

        total = response.get("total", None)
        offset = response.get("offset", None)

        meta_obj = response.get("metadata") or {}
        if isinstance(meta_obj, dict):
            if total is None and "total" in meta_obj:
                total = meta_obj.get("total")
            if offset is None and "offset" in meta_obj:
                offset = meta_obj.get("offset")

        try:
            total = int(total) if total is not None else None
        except Exception:
            pass
        try:
            offset = int(offset) if offset is not None else None
        except Exception:
            pass

        return records, {"total": total, "offset": offset}

    def fetch_region(
        self,
        region: str,
        fuel_type: str,
        start_date: str,
        end_date: str,
        *,
        debug: bool = False,
        diag: Optional[dict] = None,
    ) -> pd.DataFrame:
        if not validate_region(region):
            raise ValueError(f"Invalid region: {region}")
        if not validate_fuel_type(fuel_type):
            raise ValueError(f"Invalid fuel type: {fuel_type}")

        respondent = get_eia_respondent(region)

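        all_records: list[dict] = []
        offset = 0
        page_count = 0  # number of pages fetched so far
        total_hint: Optional[int] = None  # total row count reported by the API, if any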

        while True:
            params = {
                "api_key": self.api_key,
                "data[]": "value",
                "facets[respondent][]": respondent,
                "facets[fueltype][]": fuel_type,
                "frequency": "hourly",
                "start": f"{start_date}T00",
                "end": f"{end_date}T23",
                "length": self.MAX_RECORDS_PER_REQUEST,
                "offset": offset,
                "sort[0][column]": "period",
                "sort[0][direction]": "asc",
            }

            resp = requests.get(self.BASE_URL, params=params, timeout=30)
            resp.raise_for_status()
            payload = resp.json()

            records, meta = self._extract_eia_response(payload, request_url=resp.url)
            page_count += 1
            if total_hint is None:
                total_hint = meta.get("total")

            returned = len(records)

            if debug:
                safe_url = _sanitize_url(resp.url)
                print(
                    f"[PAGE] region={region} fuel={fuel_type} returned={returned} "
                    f"offset={offset} total={meta.get('total')} url={safe_url}"
                )
            if returned == 0 and offset == 0:
                return pd.DataFrame(columns=["ds", "value", "region", "fuel_type"])
            if returned == 0:
                break

            all_records.extend(records)

            if returned < self.MAX_RECORDS_PER_REQUEST:
                break

            offset += self.MAX_RECORDS_PER_REQUEST
            time.sleep(self.RATE_LIMIT_DELAY)

        df = pd.DataFrame(all_records)

        missing_cols = [c for c in ["period", "value"] if c not in df.columns]
        if missing_cols:
            sample_keys = sorted(set().union(*(r.keys() for r in all_records[:5]))) if all_records else []
            raise ValueError(
                f"EIA records missing expected keys {missing_cols}. "
                f"columns={df.columns.tolist()} sample_record_keys={sample_keys}"
            )

        raw_rows = len(df)
        df["ds"] = pd.to_datetime(df["period"], utc=True, errors="coerce").dt.tz_convert("UTC").dt.tz_localize(None)
        df["value"] = pd.to_numeric(df["value"], errors="coerce")

        bad_ds = int(df["ds"].isna().sum())
        bad_val = int(df["value"].isna().sum())

        df["region"] = region
        df["fuel_type"] = fuel_type

        df = df.dropna(subset=["ds", "value"]).sort_values("ds").reset_index(drop=True)

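        if diag is not None:
            diag.update({
                "region": region,
                "fuel_type": fuel_type,
                "start_date": start_date,
                "end_date": end_date,
                "total_records": total_hint,
                "pages": page_count,
                "rows_raw": raw_rows,       # rows before dropping unparsable values
                "bad_ds": bad_ds,           # rows whose period failed to parse
                "bad_value": bad_val,       # rows whose value was non-numeric
                "rows_parsed": int(len(df)),
                "empty": bool(len(df) == 0),
            })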

        return df[["ds", "value", "region", "fuel_type"]]

    def fetch_all_regions(
        self,
        fuel_type: str,
        start_date: str,
        end_date: str,
        regions: Optional[list[str]] = None,
        max_workers: int = 3,
        diagnostics: Optional[list[dict]] = None,
    ) -> pd.DataFrame:
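        if regions is None:
            # Default to regions with a configured EIA respondent; this also excludes
            # US48 and the other internal regions, which would otherwise fail in fetch_region.
            regions = [code for code, info in REGIONS.items() if info.eia_respondent]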

        all_dfs: list[pd.DataFrame] = []

        def _run_one(region: str) -> tuple[str, pd.DataFrame, dict]:
            d: dict = {}
            df = self.fetch_region(region, fuel_type, start_date, end_date, diag=d)
            return region, df, d

        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            futures = {executor.submit(_run_one, region): region for region in regions}
            for future in as_completed(futures):
                region = futures[future]
                try:
                    _, df, d = future.result()
                    if diagnostics is not None:
                        diagnostics.append(d)

                    if len(df) > 0:
                        all_dfs.append(df)
                        print(f"[OK] {region}: {len(df)} rows")
                    else:
                        print(f"[EMPTY] {region}: 0 rows")
                except Exception as e:
                    if diagnostics is not None:
                        diagnostics.append({
                            "region": region,
                            "fuel_type": fuel_type,
                            "start_date": start_date,
                            "end_date": end_date,
                            "error": str(e),
                        })
                    print(f"[FAIL] {region}: {e}")

        if not all_dfs:
            return pd.DataFrame(columns=["unique_id", "ds", "y"])

        combined = pd.concat(all_dfs, ignore_index=True)
        combined["unique_id"] = combined["region"] + "_" + combined["fuel_type"]
        combined = combined.rename(columns={"value": "y"})
        return combined[["unique_id", "ds", "y"]].sort_values(["unique_id", "ds"]).reset_index(drop=True)

    def get_series_summary(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.groupby("unique_id").agg(
            count=("y", "count"),
            min_value=("y", "min"),
            max_value=("y", "max"),
            mean_value=("y", "mean"),
            zero_count=("y", lambda x: (x == 0).sum()),
        ).reset_index()


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    fetcher = EIARenewableFetcher(debug_env=True)

    print("=== Testing Single Region Fetch ===")
    df_single = fetcher.fetch_region("CALI", "WND", "2024-12-01", "2024-12-03", debug=True)
    print(f"Single region: {len(df_single)} rows")
    print(df_single.head())

    print("\n=== Testing Multi-Region Fetch ===")
    df_multi = fetcher.fetch_all_regions("WND", "2024-12-01", "2024-12-03", regions=["CALI", "ERCO", "MISO"])
    print(f"\nMulti-region: {len(df_multi)} rows")
    print(f"Series: {df_multi['unique_id'].unique().tolist()}")

    print("\n=== Series Summary ===")
    print(fetcher.get_series_summary(df_multi))

    # sun checks:
    f = EIARenewableFetcher()
    df = f.fetch_region("CALI", "SUN", "2024-12-01", "2024-12-03", debug=True)
    print(df.head(), len(df))


Overwriting src/renewable/eia_renewable.py


---

# Module 3: Weather Integration

**File:** `src/renewable/open_meteo.py`

Weather is **critical** for renewable forecasting:
- **Wind generation** depends on wind speed (especially at hub height ~100m)
- **Solar generation** depends on radiation and cloud cover

## Key Concept: Preventing Leakage

**Data leakage** occurs when training uses information that wouldn't be available at prediction time.

```
❌ WRONG: Using historical weather to predict future generation
   - At prediction time, we don't have future actual weather!
   
✅ CORRECT: Use forecasted weather for predictions
   - Training: historical weather aligned with historical generation
   - Prediction: weather forecast for the prediction horizon
```
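
The split between training and prediction inputs can be sketched with pandas. This is a minimal illustration with made-up numbers; the frame names (`generation`, `historical_weather`, `forecast_weather`, `future_X`) are assumptions for the example and are not taken from the project code.

```python
import pandas as pd

# Observed generation and the weather that actually occurred in the same hours
generation = pd.DataFrame({
    "unique_id": "CALI_WND",
    "ds": pd.date_range("2024-01-01 00:00", periods=4, freq="h"),
    "y": [10.0, 12.0, 11.0, 13.0],
})
historical_weather = pd.DataFrame({
    "ds": pd.date_range("2024-01-01 00:00", periods=4, freq="h"),
    "wind_speed_100m": [5.0, 6.0, 5.5, 6.5],
})

# A weather forecast that overlaps the last training hours and extends beyond them
forecast_weather = pd.DataFrame({
    "ds": pd.date_range("2024-01-01 02:00", periods=4, freq="h"),
    "wind_speed_100m": [5.4, 6.6, 7.0, 7.2],
})

# Training: historical weather aligned with historical generation
train_df = generation.merge(historical_weather, on="ds", how="inner")

# Prediction: only forecasted weather, strictly after the last training timestamp
last_train_ds = train_df["ds"].max()
future_X = forecast_weather[forecast_weather["ds"] > last_train_ds].copy()
future_X.insert(0, "unique_id", "CALI_WND")

print(train_df)
print(future_X)
```

The future frame carries no `y` column and no observed weather, so nothing from the prediction horizon can leak into the features.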

## Open-Meteo API

Open-Meteo is **free** and requires no API key:
- Historical API: Past weather data
- Forecast API: Up to 16 days ahead

In [27]:
%%writefile src/renewable/open_meteo.py
# src/renewable/open_meteo.py
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

import pandas as pd
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

from src.renewable.regions import get_region_coords, validate_region


@dataclass(frozen=True)
class OpenMeteoEndpoints:
    historical_url: str = "https://archive-api.open-meteo.com/v1/archive"
    forecast_url: str = "https://api.open-meteo.com/v1/forecast"


class OpenMeteoRenewable:
    """
    Fetch weather features for renewable energy forecasting.

    Strict-by-default:
    - If Open-Meteo doesn't return a requested variable, we raise.
    - We do NOT fabricate values or silently "fill" missing columns.
    """

    WEATHER_VARS = [
        "temperature_2m",
        "wind_speed_10m",
        "wind_speed_100m",
        "wind_direction_10m",
        "direct_radiation",
        "diffuse_radiation",
        "cloud_cover",
    ]

    def __init__(self, timeout: int = 30, *, strict: bool = True):
        self.timeout = timeout
        self.strict = strict
        self.endpoints = OpenMeteoEndpoints()
        self.session = self._create_session()

    def _create_session(self) -> requests.Session:
        session = requests.Session()
        retries = Retry(
            total=3,
            backoff_factor=0.5,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=frozenset(["GET"]),
        )
        session.mount("https://", HTTPAdapter(max_retries=retries))
        return session

    def fetch_historical(
        self,
        lat: float,
        lon: float,
        start_date: str,
        end_date: str,
        variables: Optional[list[str]] = None,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        if variables is None:
            variables = self.WEATHER_VARS

        params = {
            "latitude": lat,
            "longitude": lon,
            "start_date": start_date,
            "end_date": end_date,
            "hourly": ",".join(variables),
            "timezone": "UTC",
        }

        resp = self.session.get(self.endpoints.historical_url, params=params, timeout=self.timeout)
        if debug:
            print(f"[OPENMETEO][HIST] status={resp.status_code} url={resp.url}")
        resp.raise_for_status()

        return self._parse_response(resp.json(), variables, debug=debug, request_url=resp.url)

    def fetch_forecast(
        self,
        lat: float,
        lon: float,
        horizon_hours: int = 48,
        variables: Optional[list[str]] = None,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        if variables is None:
            variables = self.WEATHER_VARS

        forecast_days = min((horizon_hours // 24) + 1, 16)
        params = {
            "latitude": lat,
            "longitude": lon,
            "hourly": ",".join(variables),
            "timezone": "UTC",
            "forecast_days": forecast_days,
        }

        resp = self.session.get(self.endpoints.forecast_url, params=params, timeout=self.timeout)
        if debug:
            print(f"[OPENMETEO][FCST] status={resp.status_code} url={resp.url}")
        resp.raise_for_status()

        df = self._parse_response(resp.json(), variables, debug=debug, request_url=resp.url)

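        # Trim to requested horizon (ds is naive UTC)
        if len(df) > 0:
            # datetime.utcnow() is deprecated since Python 3.12; build a naive-UTC "now" instead
            now_utc = pd.Timestamp.now(tz="UTC").tz_localize(None)
            cutoff = now_utc + timedelta(hours=horizon_hours)
            df = df[df["ds"] <= cutoff].reset_index(drop=True)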

        return df

    def fetch_for_region(
        self,
        region_code: str,
        start_date: str,
        end_date: str,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        if not validate_region(region_code):
            raise ValueError(f"Invalid region_code: {region_code}")

        lat, lon = get_region_coords(region_code)
        df = self.fetch_historical(lat, lon, start_date, end_date, debug=debug)
        df["region"] = region_code
        return df

    def fetch_all_regions_historical(
        self,
        regions: list[str],
        start_date: str,
        end_date: str,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        all_dfs: list[pd.DataFrame] = []
        for region in regions:
            try:
                df = self.fetch_for_region(region, start_date, end_date, debug=debug)
                all_dfs.append(df)
                print(f"[OK] Weather for {region}: {len(df)} rows")
            except Exception as e:
                print(f"[FAIL] Weather for {region}: {e}")

        if not all_dfs:
            return pd.DataFrame()

        return (
            pd.concat(all_dfs, ignore_index=True)
            .sort_values(["region", "ds"])
            .reset_index(drop=True)
        )

    def _parse_response(
        self,
        data: dict,
        variables: list[str],
        *,
        debug: bool,
        request_url: str,
    ) -> pd.DataFrame:
        hourly = data.get("hourly")
        if not isinstance(hourly, dict):
            raise ValueError(f"Open-Meteo response missing/invalid 'hourly'. url={request_url}")

        times = hourly.get("time")
        if not isinstance(times, list) or len(times) == 0:
            raise ValueError(f"Open-Meteo response has no hourly time grid. url={request_url}")

        # Build ds (naive UTC)
        ds = pd.to_datetime(times, errors="coerce", utc=True).tz_localize(None)
        if ds.isna().any():
            bad = int(ds.isna().sum())
            raise ValueError(f"Open-Meteo returned unparsable times. bad={bad} url={request_url}")

        df_data = {"ds": ds}

        # Strict variable presence: raise if missing (no silent None padding)
        missing_vars = [v for v in variables if v not in hourly]
        if missing_vars and self.strict:
            raise ValueError(f"Open-Meteo missing requested vars={missing_vars}. url={request_url}")

        for var in variables:
            values = hourly.get(var)
            if values is None:
                # If not strict, keep as all-NA but be explicit (not hidden)
                df_data[var] = [None] * len(ds)
                continue

            if not isinstance(values, list):
                raise ValueError(f"Open-Meteo var '{var}' not a list. type={type(values)} url={request_url}")

            if len(values) != len(ds):
                raise ValueError(
                    f"Open-Meteo length mismatch for '{var}': "
                    f"len(values)={len(values)} len(time)={len(ds)} url={request_url}"
                )

            df_data[var] = pd.to_numeric(values, errors="coerce")

        df = pd.DataFrame(df_data).sort_values("ds").reset_index(drop=True)

        if debug:
            dup = int(df["ds"].duplicated().sum())
            na_counts = {v: int(df[v].isna().sum()) for v in variables if v in df.columns}
            print(f"[OPENMETEO][PARSE] rows={len(df)} dup_ds={dup} na_counts(sample)={dict(list(na_counts.items())[:3])}")

        return df

    def fetch_for_region_forecast(
        self,
        region_code: str,
        horizon_hours: int = 48,
        variables: Optional[list[str]] = None,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        if not validate_region(region_code):
            raise ValueError(f"Invalid region_code: {region_code}")

        lat, lon = get_region_coords(region_code)
        df = self.fetch_forecast(lat, lon, horizon_hours=horizon_hours, variables=variables, debug=debug)
        df["region"] = region_code
        return df


    def fetch_all_regions_forecast(
        self,
        regions: list[str],
        horizon_hours: int = 48,
        variables: Optional[list[str]] = None,
        *,
        debug: bool = False,
    ) -> pd.DataFrame:
        all_dfs: list[pd.DataFrame] = []
        for region in regions:
            try:
                df = self.fetch_for_region_forecast(
                    region, horizon_hours=horizon_hours, variables=variables, debug=debug
                )
                all_dfs.append(df)
                print(f"[OK] Forecast weather for {region}: {len(df)} rows")
            except Exception as e:
                print(f"[FAIL] Forecast weather for {region}: {e}")

        if not all_dfs:
            return pd.DataFrame()

        return (
            pd.concat(all_dfs, ignore_index=True)
            .sort_values(["region", "ds"])
            .reset_index(drop=True)
        )



if __name__ == "__main__": 
    # Real API smoke test (no key needed)
    weather = OpenMeteoRenewable(strict=True)

    print("=== Testing Historical Weather (REAL API) ===")
    hist_df = weather.fetch_for_region("CALI", "2024-12-01", "2024-12-03", debug=True)
    print(f"Historical rows: {len(hist_df)}")
    print(hist_df.head())


Overwriting src/renewable/open_meteo.py


---

# Module 4: Probabilistic Modeling

**File:** `src/renewable/modeling.py`

This is where the forecasting happens! We use **StatsForecast** for:

1. **Multi-series forecasting**: Handle multiple regions/fuel types in one model
2. **Probabilistic predictions**: Get prediction intervals, not just point forecasts
3. **Exogenous weather features**: Include weather variables as predictors

## Key Concepts

### Why Prediction Intervals?

Point forecasts are useful, but energy traders need **uncertainty quantification**:
- **80% interval**: "I'm 80% confident generation will be between X and Y"
- **95% interval**: Wider, for risk management
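
To see why the 95% interval is wider than the 80% one, the sketch below builds both around the same point forecast. It assumes Gaussian forecast errors with a known standard deviation, which is a simplification for illustration; the models in this pipeline may derive their intervals differently. The function name `gaussian_interval` and the numbers are hypothetical.

```python
from statistics import NormalDist


def gaussian_interval(point: float, sigma: float, level: int) -> tuple[float, float]:
    """Symmetric interval around `point`, assuming normally distributed errors."""
    # A central interval at `level` percent leaves (100 - level) / 2 percent in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 200)
    return point - z * sigma, point + z * sigma


point_forecast = 1000.0  # illustrative point forecast
error_sigma = 100.0      # illustrative error standard deviation

lo80, hi80 = gaussian_interval(point_forecast, error_sigma, 80)
lo95, hi95 = gaussian_interval(point_forecast, error_sigma, 95)

print(f"80%: [{lo80:.1f}, {hi80:.1f}]")
print(f"95%: [{lo95:.1f}, {hi95:.1f}]")
```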

### Zero-Value Safety (CRITICAL)

**Solar panels generate ZERO at night!** This breaks MAPE:

```
MAPE = mean(|actual - predicted| / actual)

When actual = 0:
MAPE = |0 - pred| / 0 = undefined (division by zero!)
```

**Solution**: Always use RMSE and MAE for renewable forecasting.
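
A small numeric check makes the problem visible. The arrays below are made-up values in which two "night" hours have zero actual generation; the unmasked MAPE is computed directly from the formula above.

```python
import numpy as np

# Two night hours with zero generation, two daytime hours (illustrative values)
y_true = np.array([0.0, 0.0, 50.0, 100.0])
y_pred = np.array([5.0, 0.0, 45.0, 110.0])

# Unmasked MAPE: division by zero yields inf (5/0) and nan (0/0)
with np.errstate(divide="ignore", invalid="ignore"):
    ape = np.abs(y_true - y_pred) / np.abs(y_true)
mape = float(np.mean(ape))

# MAE and RMSE stay well defined on the same data
mae = float(np.mean(np.abs(y_true - y_pred)))
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

print(f"MAPE={mape}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```

Masking the zero rows avoids the division error, but the metric then ignores every night hour, so it no longer describes the full series.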

In [28]:
%%writefile src/chapter2/evaluation.py
# file: src/chapter2/evaluation.py
"""
Chapter 2: Model Evaluation Metrics

Computes forecasting metrics with explicit NaN handling (fail-loud principle).
"""

import logging
from typing import Dict, Optional, Tuple

import numpy as np
import pandas as pd

logger = logging.getLogger(__name__)


class ForecastMetrics:
    """Compute and track forecasting evaluation metrics"""

    @staticmethod
    def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        Root Mean Squared Error

        Explicit NaN masking (fail-loud):
        - Returns NaN if no valid predictions
        - Masks NaN/inf values before computation
        """
        valid_mask = np.isfinite(y_pred) & np.isfinite(y_true)

        if valid_mask.sum() == 0:
            return np.nan

        return np.sqrt(np.mean((y_pred[valid_mask] - y_true[valid_mask]) ** 2))

    @staticmethod
    def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        Mean Absolute Error

        Explicit NaN masking (fail-loud):
        - Returns NaN if no valid predictions
        - Masks NaN/inf values before computation
        """
        valid_mask = np.isfinite(y_pred) & np.isfinite(y_true)

        if valid_mask.sum() == 0:
            return np.nan

        return np.mean(np.abs(y_pred[valid_mask] - y_true[valid_mask]))

    @staticmethod
    def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        Mean Absolute Percentage Error (%)

        Explicit NaN masking (fail-loud):
        - Returns NaN if no valid predictions
        - Masks NaN/inf values and zero y_true before computation
        """
        valid_mask = (
            np.isfinite(y_pred) &
            np.isfinite(y_true) &
            (np.abs(y_true) > 1e-10)
        )

        if valid_mask.sum() == 0:
            return np.nan

        ape = np.abs((y_pred[valid_mask] - y_true[valid_mask]) / np.abs(y_true[valid_mask]))
        return 100 * np.mean(ape)

    @staticmethod
    def mase(
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_train: np.ndarray,
        season_length: int = 24
    ) -> float:
        """
        Mean Absolute Scaled Error

        Scales error relative to naive seasonal forecasting.

        Explicit NaN masking (fail-loud):
        - Returns NaN if insufficient training data
        - Masks NaN/inf values before computation
        """
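        # Need more than one full season to form at least one seasonal difference
        if len(y_train) <= season_length:
            return np.nan

        # In-sample MAE of the seasonal naive forecast (y[t] vs. y[t - season_length])
        mae_train = np.mean(np.abs(
            y_train[season_length:] - y_train[:-season_length]
        ))

        # Non-finite or near-zero scale makes the ratio meaningless
        if not np.isfinite(mae_train) or mae_train < 1e-10:
            return np.nan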

        # Compute test MAE
        valid_mask = np.isfinite(y_pred) & np.isfinite(y_true)

        if valid_mask.sum() == 0:
            return np.nan

        mae_test = np.mean(np.abs(y_pred[valid_mask] - y_true[valid_mask]))

        return mae_test / mae_train

    @staticmethod
    def coverage(
        y_true: np.ndarray,
        lower: np.ndarray,
        upper: np.ndarray
    ) -> float:
        """
        Prediction Interval Coverage (%)

        Percentage of actual values within prediction interval.

        Explicit NaN masking (fail-loud):
        - Returns NaN if no valid predictions
        - Counts valid (non-NaN) rows in denominator
        """
        valid_mask = (
            np.isfinite(y_true) &
            np.isfinite(lower) &
            np.isfinite(upper)
        )

        if valid_mask.sum() == 0:
            return np.nan

        covered = (y_true[valid_mask] >= lower[valid_mask]) & \
                  (y_true[valid_mask] <= upper[valid_mask])

        return 100 * np.mean(covered)

    @staticmethod
    def compute_all(
        y_true: np.ndarray,
        y_pred: np.ndarray,
        y_train: Optional[np.ndarray] = None
    ) -> Dict[str, float]:
        """
        Compute all metrics at once

        Args:
            y_true: Actual values
            y_pred: Predictions
            y_train: Training values (for MASE)

        Returns:
            Dictionary of metrics
        """
        metrics = {
            "rmse": ForecastMetrics.rmse(y_true, y_pred),
            "mae": ForecastMetrics.mae(y_true, y_pred),
            "mape": ForecastMetrics.mape(y_true, y_pred),
        }

        if y_train is not None:
            metrics["mase"] = ForecastMetrics.mase(
                y_true, y_pred, y_train, season_length=24
            )

        return metrics


def compute_series_metrics(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    y_train: Optional[np.ndarray] = None,
    valid_threshold: int = 1
) -> Dict[str, float]:
    """
    Compute metrics with explicit validation

    Args:
        y_true: Actual values
        y_pred: Predictions
        y_train: Training values (for MASE)
        valid_threshold: Minimum valid predictions required

    Returns:
        Dictionary of metrics
    """
    # Count valid predictions
    valid_mask = np.isfinite(y_pred) & np.isfinite(y_true)
    valid_count = valid_mask.sum()

    if valid_count < valid_threshold:
        return {
            "rmse": np.nan,
            "mae": np.nan,
            "mape": np.nan,
            "mase": np.nan,
            "valid_count": valid_count,
            "error": f"Insufficient valid predictions: {valid_count} < {valid_threshold}"
        }

    metrics = ForecastMetrics.compute_all(y_true, y_pred, y_train)
    metrics["valid_count"] = valid_count

    return metrics


def aggregate_metrics(
    results: pd.DataFrame,
    by: Optional[str] = None
) -> pd.DataFrame:
    """
    Aggregate metrics across splits and series

    Args:
        results: DataFrame with metric columns
        by: Groupby column ("model_name", "unique_id", etc.)

    Returns:
        Aggregated metrics DataFrame
    """
    metric_cols = ["rmse", "mae", "mape", "mase"]

    if by is None:
        # Overall aggregation
        agg = results[metric_cols].agg(["mean", "std", "min", "max"])
        return agg
    else:
        # Grouped aggregation
        agg = results.groupby(by)[metric_cols].agg([
            ("mean", "mean"),
            ("std", "std"),
            ("count", "count")
        ])
        return agg


Overwriting src/chapter2/evaluation.py
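
As a quick sanity check on the MASE definition written to `evaluation.py` above, the following standalone sketch recomputes it by hand on a toy series. It uses a hypothetical `season_length` of 2 and made-up values so the arithmetic is easy to follow. It mirrors the formula but does not import the project module.

```python
def mase(y_true, y_pred, y_train, season_length):
    """MAE on the test window divided by the in-sample seasonal-naive MAE."""
    # Seasonal-naive error: each training point vs. the value one season earlier.
    naive_errors = [
        abs(y_train[i] - y_train[i - season_length])
        for i in range(season_length, len(y_train))
    ]
    mae_naive = sum(naive_errors) / len(naive_errors)
    mae_test = sum(abs(p - a) for a, p in zip(y_true, y_pred)) / len(y_true)
    return mae_test / mae_naive


y_train = [10.0, 20.0, 12.0, 24.0, 11.0, 22.0]  # toy history with a period of 2
y_true = [12.0, 23.0]
y_pred = [11.0, 22.0]

score = mase(y_true, y_pred, y_train, season_length=2)
print(f"MASE={score:.3f}")  # a value below 1 means the forecast beats seasonal naive
```

Because the denominator comes from the training series rather than from the actuals in the test window, MASE stays defined for solar series with night-time zeros, provided the history is not constant from one season to the next.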


In [29]:
%%writefile src/renewable/modeling.py
# file: src/renewable/modeling.py

from __future__ import annotations

import numpy as np
import pandas as pd
from dataclasses import dataclass
import re
from typing import Any, Optional

from src.chapter2.evaluation import ForecastMetrics


WEATHER_VARS = [
    "temperature_2m",
    "wind_speed_10m",
    "wind_speed_100m",
    "wind_direction_10m",
    "direct_radiation",
    "diffuse_radiation",
    "cloud_cover",
]


def _log_series_summary(df: pd.DataFrame, *, value_col: str = "y", label: str = "series") -> None:
    if df.empty:
        print(f"[{label}] EMPTY")
        return

    tmp = df.copy()
    tmp["ds"] = pd.to_datetime(tmp["ds"], errors="coerce")

    def _mode_delta_hours(g: pd.Series) -> float:
        d = g.sort_values().diff().dropna()
        if d.empty:
            return float("nan")
        return float(d.dt.total_seconds().div(3600).mode().iloc[0])

    g = tmp.groupby("unique_id").agg(
        rows=(value_col, "count"),
        na_y=(value_col, lambda s: int(s.isna().sum())),
        min_ds=("ds", "min"),
        max_ds=("ds", "max"),
        min_y=(value_col, "min"),
        max_y=(value_col, "max"),
        mean_y=(value_col, "mean"),
        zero_y=(value_col, lambda s: int((s == 0).sum())),
        mode_delta_hours=("ds", _mode_delta_hours),
    ).reset_index().sort_values("unique_id")

    print(f"[{label}] series={g['unique_id'].nunique()} rows={len(tmp)}")
    print(g.head(20).to_string(index=False))

def _missing_hour_blocks(ds: pd.Series) -> list[tuple[pd.Timestamp, pd.Timestamp, int]]:
    """
    Return contiguous blocks of missing hourly timestamps.
    Each tuple: (block_start, block_end, n_hours)
    """
    ds = pd.to_datetime(ds, errors="raise").sort_values()
    start, end = ds.iloc[0], ds.iloc[-1]
    expected = pd.date_range(start, end, freq="h")
    missing = expected.difference(ds)

    if missing.empty:
        return []

    blocks = []
    block_start = missing[0]
    prev = missing[0]
    for t in missing[1:]:
        if t - prev == pd.Timedelta(hours=1):
            prev = t
        else:
            n = int((prev - block_start).total_seconds() / 3600) + 1
            blocks.append((block_start, prev, n))
            block_start = t
            prev = t
    n = int((prev - block_start).total_seconds() / 3600) + 1
    blocks.append((block_start, prev, n))
    return blocks


def _hourly_grid_report(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for uid, g in df.groupby("unique_id"):
        g = g.sort_values("ds")
        start, end = g["ds"].iloc[0], g["ds"].iloc[-1]
        expected = pd.date_range(start, end, freq="h")
        missing = expected.difference(g["ds"])
        blocks = _missing_hour_blocks(g["ds"])

        rows.append(
            {
                "unique_id": uid,
                "start": start,
                "end": end,
                "expected_hours": int(len(expected)),
                "actual_hours": int(len(g)),
                "missing_hours": int(len(missing)),
                "missing_ratio": float(len(missing) / max(len(expected), 1)),
                "n_missing_blocks": int(len(blocks)),
                "largest_missing_block_hours": int(max([b[2] for b in blocks], default=0)),
                "first_missing_block_start": blocks[0][0] if blocks else pd.NaT,
                "first_missing_block_end": blocks[0][1] if blocks else pd.NaT,
            }
        )
    return pd.DataFrame(rows).sort_values(["missing_ratio", "missing_hours"], ascending=False)


def _enforce_hourly_grid(
    df: pd.DataFrame,
    *,
    label: str,
    policy: str = "raise",  # "raise" | "drop_incomplete_series"
) -> pd.DataFrame:
    """
    Enforce hourly continuity without imputation.
    - raise: fail loud with detailed report
    - drop_incomplete_series: drop series that have missing hours (log what was dropped)
    """
    rep = _hourly_grid_report(df)
    worst = rep.iloc[0].to_dict()

    if worst["missing_hours"] == 0:
        return df

    print(f"[{label}][GRID] report (top):\n{rep.head(10).to_string(index=False)}")

    if policy == "drop_incomplete_series":
        bad_uids = rep.loc[rep["missing_hours"] > 0, "unique_id"].tolist()
        kept = df.loc[~df["unique_id"].isin(bad_uids)].copy()
        print(f"[{label}][GRID] policy=drop_incomplete_series dropped={bad_uids} kept_series={kept['unique_id'].nunique()}")
        if kept.empty:
            raise RuntimeError(f"[{label}][GRID] all series dropped due to missing hours")
        return kept

    # default: raise
    worst_uid = worst["unique_id"]
    g = df[df["unique_id"] == worst_uid].sort_values("ds")
    blocks = _missing_hour_blocks(g["ds"])
    sample_blocks = blocks[:3]
    raise RuntimeError(
        f"[{label}][GRID] Missing hours detected (no imputation). "
        f"worst_unique_id={worst_uid} missing_hours={worst['missing_hours']} "
        f"missing_ratio={worst['missing_ratio']:.3f} blocks(sample)={sample_blocks}"
    )

def _validate_hourly_grid_fail_loud(
    df: pd.DataFrame,
    *,
    max_missing_ratio: float = 0.0,
    label: str = "generation",
) -> None:
    # Basic sanity checks: non-empty frame, no NaT timestamps, no duplicate keys
    if df.empty:
        raise RuntimeError(f"[{label}] empty dataframe")

    bad = df["ds"].isna().sum()
    if bad:
        raise RuntimeError(f"[{label}] ds has NaT values bad={int(bad)}")

    dup = df.duplicated(subset=["unique_id", "ds"]).sum()
    if dup:
        raise RuntimeError(f"[{label}] duplicate (unique_id, ds) rows dup={int(dup)}")

    rep = _hourly_grid_report(df)
    worst = rep.iloc[0].to_dict()
    if worst["missing_ratio"] > max_missing_ratio:
        print(f"[{label}][GRID] report (top):\n{rep.head(10).to_string(index=False)}")
        worst_uid = worst["unique_id"]
        g = df[df["unique_id"] == worst_uid].sort_values("ds")
        blocks = _missing_hour_blocks(g["ds"])
        raise RuntimeError(
            f"[{label}][GRID] Missing hours detected (no imputation allowed). "
            f"unique_id={worst_uid} missing_hours={worst['missing_hours']} "
            f"missing_ratio={worst['missing_ratio']:.3f} blocks(sample)={blocks[:3]}"
        )



def _add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["hour"] = out["ds"].dt.hour
    out["dow"] = out["ds"].dt.dayofweek

    out["hour_sin"] = np.sin(2 * np.pi * out["hour"] / 24)
    out["hour_cos"] = np.cos(2 * np.pi * out["hour"] / 24)

    out["dow_sin"] = np.sin(2 * np.pi * out["dow"] / 7)
    out["dow_cos"] = np.cos(2 * np.pi * out["dow"] / 7)

    return out.drop(columns=["hour", "dow"])

def _infer_model_columns(cv_df: pd.DataFrame) -> list[str]:
    """
    Infer StatsForecast model prediction columns from a cross_validation dataframe.

    We treat as "model columns" those that:
      - are not core columns (unique_id, ds, cutoff, y)
      - are not interval columns like '<model>-lo-80' or '<model>-hi-95'
    """
    core = {"unique_id", "ds", "cutoff", "y"}
    cols = [c for c in cv_df.columns if c not in core]

    model_cols: set[str] = set()
    interval_pat = re.compile(r"-(lo|hi)-\d+$")
    for c in cols:
        if interval_pat.search(c):
            continue
        model_cols.add(c)

    return sorted(model_cols)


def compute_leaderboard(
    cv_df: pd.DataFrame,
    *,
    confidence_levels: tuple[int, int] = (80, 95),
) -> pd.DataFrame:
    """
    Build an aggregated leaderboard from StatsForecast cross_validation output.

    Returns columns:
      - model, rmse, mae, mape, valid_rows
      - coverage_<level> if interval columns exist
    """
    required = {"y", "unique_id", "ds", "cutoff"}
    missing = required - set(cv_df.columns)
    if missing:
        raise ValueError(f"[leaderboard] cv_df missing required columns: {sorted(missing)}")

    model_cols = _infer_model_columns(cv_df)
    if not model_cols:
        raise RuntimeError(
            f"[leaderboard] Could not infer any model prediction columns. "
            f"cv_df columns={cv_df.columns.tolist()}"
        )

    rows: list[dict[str, Any]] = []
    y_true = cv_df["y"].to_numpy()

    for m in model_cols:
        if m not in cv_df.columns:
            continue

        y_pred = cv_df[m].to_numpy()
        valid_mask = np.isfinite(y_true) & np.isfinite(y_pred)
        valid_rows = int(valid_mask.sum())

        metrics = {
            "model": m,
            "rmse": float(ForecastMetrics.rmse(y_true, y_pred)),
            "mae": float(ForecastMetrics.mae(y_true, y_pred)),
            "mape": float(ForecastMetrics.mape(y_true, y_pred)),
            "valid_rows": valid_rows,
        }

        # Coverage if interval columns exist
        for lvl in confidence_levels:
            lo_col = f"{m}-lo-{lvl}"
            hi_col = f"{m}-hi-{lvl}"
            if lo_col in cv_df.columns and hi_col in cv_df.columns:
                cov = ForecastMetrics.coverage(
                    y_true,
                    cv_df[lo_col].to_numpy(),
                    cv_df[hi_col].to_numpy(),
                )
                metrics[f"coverage_{lvl}"] = float(cov)

        rows.append(metrics)

    lb = pd.DataFrame(rows)
    if lb.empty:
        raise RuntimeError("[leaderboard] computed empty leaderboard (no usable model columns).")

    # Fail-loud sorting: rmse NaNs should sort last
    lb = lb.sort_values(["rmse"], ascending=True, na_position="last").reset_index(drop=True)
    return lb


def compute_baseline_metrics(
    cv_df: pd.DataFrame,
    *,
    model_name: str,
    threshold_k: float = 2.0,
) -> dict:
    """
    Compute baseline metrics for drift detection from CV output.

    We compute RMSE/MAE per (unique_id, cutoff) window, then aggregate:
      rmse_mean, rmse_std, drift_threshold_rmse = mean + k*std

    No imputation/filling: metrics are computed only from finite values.
    """
    required = {"unique_id", "cutoff", "y", model_name}
    missing = required - set(cv_df.columns)
    if missing:
        raise ValueError(
            f"[baseline] cv_df missing required columns for model '{model_name}': {sorted(missing)}"
        )

    # Compute per-window metrics (unique_id, cutoff)
    def _window_metrics(g: pd.DataFrame) -> pd.Series:
        yt = g["y"].to_numpy()
        yp = g[model_name].to_numpy()
        valid = np.isfinite(yt) & np.isfinite(yp)
        if valid.sum() == 0:
            return pd.Series({"rmse": np.nan, "mae": np.nan, "valid_rows": 0})
        return pd.Series({
            "rmse": ForecastMetrics.rmse(yt, yp),
            "mae": ForecastMetrics.mae(yt, yp),
            "valid_rows": int(valid.sum()),
        })

    per_window = (
        cv_df.groupby(["unique_id", "cutoff"], sort=False, dropna=False)
        .apply(_window_metrics)
        .reset_index()
    )

    # Fail loud if baseline is entirely NaN
    if per_window["rmse"].notna().sum() == 0:
        sample_cols = ["unique_id", "cutoff", "y", model_name]
        raise RuntimeError(
            "[baseline] All per-window RMSE are NaN. "
            "This usually means predictions or y are non-finite everywhere. "
            f"Sample:\n{cv_df[sample_cols].head(20).to_string(index=False)}"
        )

    rmse_mean = float(per_window["rmse"].mean(skipna=True))
    rmse_std = float(per_window["rmse"].std(skipna=True, ddof=0))
    mae_mean = float(per_window["mae"].mean(skipna=True))
    mae_std = float(per_window["mae"].std(skipna=True, ddof=0))

    baseline = {
        "model": model_name,
        "rmse_mean": rmse_mean,
        "rmse_std": rmse_std,
        "mae_mean": mae_mean,
        "mae_std": mae_std,
        "drift_threshold_rmse": float(rmse_mean + threshold_k * rmse_std),
        "drift_threshold_mae": float(mae_mean + threshold_k * mae_std),
        "n_series": int(per_window["unique_id"].nunique()),
        "n_windows": int(per_window["cutoff"].nunique()),
        "per_window_rows": int(len(per_window)),
    }

    # Optional per-series baseline (useful later if you want drift per series)
    per_series = (
        per_window.groupby("unique_id")[["rmse", "mae"]]
        .agg(rmse_mean=("rmse", "mean"), rmse_std=("rmse", lambda s: s.std(ddof=0)),
             mae_mean=("mae", "mean"), mae_std=("mae", lambda s: s.std(ddof=0)))
        .reset_index()
    )
    per_series["drift_threshold_rmse"] = per_series["rmse_mean"] + threshold_k * per_series["rmse_std"]
    per_series["drift_threshold_mae"] = per_series["mae_mean"] + threshold_k * per_series["mae_std"]
    baseline["per_series"] = per_series.to_dict(orient="records")

    return baseline



@dataclass
class ForecastConfig:
    horizon: int = 24
    confidence_levels: tuple[int, int] = (80, 95)


class RenewableForecastModel:
    def __init__(self, horizon: int = 24, confidence_levels: tuple[int, int] = (80, 95)):
        self.horizon = horizon
        self.confidence_levels = confidence_levels
        self.sf = None
        self._train_df = None  # contains y + exog columns
        self._exog_cols: list[str] = []
        self.fitted = False

    def prepare_training_df(self, df: pd.DataFrame, weather_df: Optional[pd.DataFrame]) -> pd.DataFrame:
        req = {"unique_id", "ds", "y"}
        if not req.issubset(df.columns):
            raise ValueError(f"generation df missing cols={sorted(req - set(df.columns))}")

        work = df.copy()
        work["ds"] = pd.to_datetime(work["ds"], errors="raise")
        work = work.sort_values(["unique_id", "ds"]).reset_index(drop=True)

        # Fail-loud on null y (truthful: indicates missing measurements, not missing timestamps)
        y_null = work["y"].isna()
        if y_null.any():
            sample = work.loc[y_null, ["unique_id", "ds", "y"]].head(25)
            raise RuntimeError(
                f"[generation][Y] Found null y values (no imputation). rows={int(y_null.sum())}. "
                f"Sample:\n{sample.to_string(index=False)}"
            )

        # Hourly grid enforcement (no imputation).
        # NOTE: policy="drop_incomplete_series" is set here so CV can run while the
        # defect stays visible in the logs; switch back to policy="raise" for strict mode.
        work = _enforce_hourly_grid(work, label="generation", policy="drop_incomplete_series")


        # Deterministic time features
        work = _add_time_features(work)

        # Weather merge (FAIL LOUD on missing)
        if weather_df is not None and not weather_df.empty:
            if not {"ds", "region"}.issubset(weather_df.columns):
                raise ValueError("weather_df must have columns ['ds','region', ...]")

            work["region"] = work["unique_id"].str.split("_").str[0]

            wcols = [c for c in WEATHER_VARS if c in weather_df.columns]
            if not wcols:
                raise ValueError("weather_df has none of expected WEATHER_VARS")

            merged = work.merge(
                weather_df[["ds", "region"] + wcols],
                on=["ds", "region"],
                how="left",
                validate="many_to_one",
            )

            missing_any = merged[wcols].isna().any(axis=1)
            if missing_any.any():
                sample = merged.loc[missing_any, ["unique_id", "ds", "region"] + wcols].head(10)
                raise RuntimeError(
                    f"[weather][ALIGN] Missing weather after merge rows={int(missing_any.sum())}. "
                    f"Sample:\n{sample.to_string(index=False)}"
                )

            work = merged.drop(columns=["region"])
            self._exog_cols = ["hour_sin", "hour_cos", "dow_sin", "dow_cos"] + wcols
        else:
            self._exog_cols = ["hour_sin", "hour_cos", "dow_sin", "dow_cos"]

        return work


    def fit(self, df: pd.DataFrame, weather_df: Optional[pd.DataFrame] = None) -> None:
        from statsforecast import StatsForecast
        from statsforecast.models import AutoARIMA, SeasonalNaive, AutoETS, MSTL

        train_df = self.prepare_training_df(df, weather_df)

        models = [
            AutoARIMA(season_length=24),
            SeasonalNaive(season_length=24),
            AutoETS(season_length=24),
            MSTL(season_length=[24, 168], trend_forecaster=AutoARIMA(), alias="MSTL_ARIMA"),
        ]

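        # Models are estimated lazily: StatsForecast.forecast() in predict() fits and
        # forecasts in one call, so fit() only stores the configuration and training frame.
        self.sf = StatsForecast(models=models, freq="h", n_jobs=-1)
        self._train_df = train_df
        self.fitted = True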

        print(f"[fit] rows={len(train_df)} series={train_df['unique_id'].nunique()} exog_cols={self._exog_cols}")

    def build_future_X_df(self, future_weather: pd.DataFrame) -> pd.DataFrame:
        """
        Build future X_df for forecast horizon using forecast weather.
        Must include: unique_id, ds, and exactly the exog columns used in training.
        """
        if not self.fitted:
            raise RuntimeError("fit() first")

        if future_weather is None or future_weather.empty:
            raise RuntimeError("future_weather required to forecast with regressors (no fabrication).")

        if not {"ds", "region"}.issubset(future_weather.columns):
            raise ValueError("future_weather must have columns ['ds','region', ...]")

        # Create the future ds grid per series
        last_ds = self._train_df.groupby("unique_id")["ds"].max()
        frames = []
        for uid, end in last_ds.items():
            future_ds = pd.date_range(end + pd.Timedelta(hours=1), periods=self.horizon, freq="h")
            frames.append(pd.DataFrame({"unique_id": uid, "ds": future_ds}))
        X = pd.concat(frames, ignore_index=True)

        X = _add_time_features(X)
        X["region"] = X["unique_id"].str.split("_").str[0]

        wcols = [c for c in WEATHER_VARS if c in future_weather.columns]
        X = X.merge(
            future_weather[["ds", "region"] + wcols],
            on=["ds", "region"],
            how="left",
            validate="many_to_one",
        )

        # Fail loud on missing future regressors
        needed = [c for c in self._exog_cols if c not in ["hour_sin", "hour_cos", "dow_sin", "dow_cos"]]  # weather cols
        if needed:
            missing_any = X[needed].isna().any(axis=1)
            if missing_any.any():
                sample = X.loc[missing_any, ["unique_id", "ds", "region"] + needed].head(10)
                raise RuntimeError(
                    f"[future_weather][ALIGN] Missing future weather rows={int(missing_any.sum())}. "
                    f"Sample:\n{sample.to_string(index=False)}"
                )

        X = X.drop(columns=["region"])
        keep = ["unique_id", "ds"] + self._exog_cols
        return X[keep].sort_values(["unique_id", "ds"]).reset_index(drop=True)

    def predict(self, future_weather: pd.DataFrame) -> pd.DataFrame:
        if not self.fitted:
            raise RuntimeError("fit() first")

        X_df = self.build_future_X_df(future_weather)

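        # IMPORTANT: If you fit models using exogenous regressors, you must supply X_df at forecast time.
        fcst = self.sf.forecast(
            h=self.horizon,
            df=self._train_df,
            X_df=X_df,
            level=list(self.confidence_levels),
        )

        # Depending on the StatsForecast version, unique_id is returned as the index
        # or as a regular column; only reset the index when it is needed.
        if "unique_id" not in fcst.columns:
            fcst = fcst.reset_index()

        return fcst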

    def cross_validate(
        self,
        df: pd.DataFrame,
        weather_df: Optional[pd.DataFrame] = None,
        n_windows: int = 3,
        step_size: int = 168,
    ) -> tuple[pd.DataFrame, pd.DataFrame]:
        from statsforecast import StatsForecast
        from statsforecast.models import AutoARIMA, SeasonalNaive, AutoETS, MSTL

        train_df = self.prepare_training_df(df, weather_df)

        models = [
            AutoARIMA(season_length=24),
            SeasonalNaive(season_length=24),
            AutoETS(season_length=24),
            MSTL(season_length=[24, 168], trend_forecaster=AutoARIMA(), alias="MSTL_ARIMA"),
        ]
        sf = StatsForecast(models=models, freq="h", n_jobs=-1)

        print(
            f"[cv] windows={n_windows} step={step_size} h={self.horizon} "
            f"rows={len(train_df)} series={train_df['unique_id'].nunique()}"
        )

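        cv = sf.cross_validation(
            df=train_df,
            h=self.horizon,
            step_size=step_size,
            n_windows=n_windows,
            level=list(self.confidence_levels),
        )

        # Only reset the index when unique_id is not already a column; otherwise a
        # stray "index" column would be mistaken for a model in the leaderboard.
        if "unique_id" not in cv.columns:
            cv = cv.reset_index()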

        leaderboard = compute_leaderboard(cv, confidence_levels=self.confidence_levels)
        return cv, leaderboard



if __name__ == "__main__":
    # REAL EXAMPLE: multi-series WND with strict gates and CV

    from src.renewable.eia_renewable import EIARenewableFetcher
    from src.renewable.open_meteo import OpenMeteoRenewable

    regions = ["CALI", "ERCO", "MISO"]
    fuel = "WND"
    start_date = "2024-11-01"
    end_date = "2024-12-15"

    fetcher = EIARenewableFetcher(debug_env=True)
    gen = fetcher.fetch_all_regions(fuel, start_date, end_date, regions=regions)
    _log_series_summary(gen, label="generation_raw")

    weather_api = OpenMeteoRenewable(strict=True)
    wx_hist = weather_api.fetch_all_regions_historical(regions, start_date, end_date, debug=True)

    model = RenewableForecastModel(horizon=24, confidence_levels=(80, 95))

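    # CV (historical): regressors live in df, no filling allowed
    # cross_validate returns (cv_df, leaderboard), so unpack both
    cv, leaderboard = model.cross_validate(gen, weather_df=wx_hist, n_windows=3, step_size=168)
    print(cv.head().to_string(index=False))
    print(leaderboard.to_string(index=False))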

    # Optional: fit + forecast next 24h using forecast weather (no leakage)
    # wx_future = weather_api.fetch_all_regions_forecast(regions, horizon_hours=48, debug=True)
    # model.fit(gen, weather_df=wx_hist)
    # fcst = model.predict(future_weather=wx_future)
    # print(fcst.head().to_string(index=False))


Overwriting src/renewable/modeling.py
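
The drift rule in `compute_baseline_metrics` reduces to `threshold = mean + k * std` over the per-window RMSE values. Here is a minimal sketch with made-up backtest numbers. The RMSE values and the `drift_threshold` and `is_drifting` helpers are hypothetical and only illustrate how the stored threshold might be used:

```python
from statistics import fmean, pstdev


def drift_threshold(window_rmse: list[float], k: float = 2.0) -> float:
    """mean + k * population std (ddof=0, matching the baseline computation)."""
    return fmean(window_rmse) + k * pstdev(window_rmse)


def is_drifting(current_rmse: float, threshold: float) -> bool:
    # Hypothetical check: flag drift when live RMSE exceeds the backtest threshold.
    return current_rmse > threshold


backtest_rmse = [100.0, 110.0, 90.0, 100.0]  # hypothetical per-window RMSE from CV
threshold = drift_threshold(backtest_rmse)

print(f"threshold={threshold:.2f}")
print(is_drifting(105.0, threshold), is_drifting(130.0, threshold))
```

With only a few CV windows the standard deviation is itself a noisy estimate, so a threshold built from three windows should be treated as a rough guide.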


In [30]:
%%writefile src/renewable/db.py
# file: src/renewable/db.py
"""Database schema and operations for renewable forecasting.

Extends the Chapter 4 monitoring database with:
- Prediction intervals (80%, 95%)
- Weather features table
- Renewable-specific columns (fuel_type, region)
"""

import json
import sqlite3
from datetime import datetime
from pathlib import Path
from typing import Optional

import pandas as pd


def connect(db_path: str) -> sqlite3.Connection:
    """Connect to SQLite database with optimized settings."""
    Path(db_path).parent.mkdir(parents=True, exist_ok=True)
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA journal_mode=WAL;")
    con.execute("PRAGMA synchronous=NORMAL;")
    return con


def init_renewable_db(db_path: str) -> None:
    """Initialize renewable forecasting database schema.

    Creates tables:
    - renewable_forecasts: Forecasts with dual intervals
    - renewable_scores: Evaluation metrics with coverage
    - weather_features: Weather data by region
    - drift_alerts: Drift detection history
    - baseline_metrics: Backtest baselines for drift thresholds
    """
    con = connect(db_path)
    cur = con.cursor()

    # Forecasts with dual prediction intervals
    cur.execute("""
    CREATE TABLE IF NOT EXISTS renewable_forecasts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        run_id TEXT NOT NULL,
        created_at TEXT NOT NULL,
        unique_id TEXT NOT NULL,
        region TEXT NOT NULL,
        fuel_type TEXT NOT NULL,
        ds TEXT NOT NULL,
        model TEXT NOT NULL,
        yhat REAL,
        yhat_lo_80 REAL,
        yhat_hi_80 REAL,
        yhat_lo_95 REAL,
        yhat_hi_95 REAL,
        UNIQUE (run_id, model, unique_id, ds)
    );
    """)

    # Index for efficient queries
    cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_forecasts_region_ds
    ON renewable_forecasts (region, ds);
    """)

    cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_forecasts_fuel_ds
    ON renewable_forecasts (fuel_type, ds);
    """)

    # Evaluation scores with dual coverage
    cur.execute("""
    CREATE TABLE IF NOT EXISTS renewable_scores (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        scored_at TEXT NOT NULL,
        run_id TEXT NOT NULL,
        unique_id TEXT NOT NULL,
        region TEXT NOT NULL,
        fuel_type TEXT NOT NULL,
        model TEXT NOT NULL,
        horizon_hours INTEGER NOT NULL,
        rmse REAL,
        mae REAL,
        coverage_80 REAL,
        coverage_95 REAL,
        valid_rows INTEGER,
        UNIQUE (run_id, model, unique_id, horizon_hours)
    );
    """)

    # Weather features by region
    cur.execute("""
    CREATE TABLE IF NOT EXISTS weather_features (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        region TEXT NOT NULL,
        ds TEXT NOT NULL,
        temperature_2m REAL,
        wind_speed_10m REAL,
        wind_speed_100m REAL,
        wind_direction_10m REAL,
        direct_radiation REAL,
        diffuse_radiation REAL,
        cloud_cover REAL,
        is_forecast INTEGER DEFAULT 0,
        created_at TEXT DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (region, ds, is_forecast)
    );
    """)

    cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_weather_region_ds
    ON weather_features (region, ds);
    """)

    # Drift detection alerts
    cur.execute("""
    CREATE TABLE IF NOT EXISTS drift_alerts (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        alert_at TEXT NOT NULL,
        run_id TEXT,
        unique_id TEXT,
        region TEXT,
        fuel_type TEXT,
        alert_type TEXT NOT NULL,
        severity TEXT NOT NULL,
        current_rmse REAL,
        threshold_rmse REAL,
        message TEXT,
        metadata_json TEXT
    );
    """)

    cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_drift_alerts_time
    ON drift_alerts (alert_at);
    """)

    # Baseline metrics for drift detection
    cur.execute("""
    CREATE TABLE IF NOT EXISTS baseline_metrics (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        created_at TEXT NOT NULL,
        unique_id TEXT NOT NULL,
        model TEXT NOT NULL,
        rmse_mean REAL NOT NULL,
        rmse_std REAL NOT NULL,
        mae_mean REAL,
        mae_std REAL,
        drift_threshold_rmse REAL NOT NULL,
        drift_threshold_mae REAL,
        n_windows INTEGER,
        metadata_json TEXT,
        UNIQUE (unique_id, model)
    );
    """)

    con.commit()
    con.close()


def save_forecasts(
    db_path: str,
    forecasts_df: pd.DataFrame,
    run_id: str,
    model: str = "MSTL_ARIMA",
) -> int:
    """Save forecasts to database.

    Args:
        db_path: Path to SQLite database
        forecasts_df: DataFrame with [unique_id, ds, yhat, yhat_lo_80, ...]
        run_id: Pipeline run identifier
        model: Model name

    Returns:
        Number of rows inserted
    """
    con = connect(db_path)
    created_at = datetime.utcnow().isoformat()

    rows = []
    for _, row in forecasts_df.iterrows():
        unique_id = row["unique_id"]
        parts = unique_id.split("_")
        region = parts[0] if len(parts) > 0 else ""
        fuel_type = parts[1] if len(parts) > 1 else ""

        rows.append((
            run_id,
            created_at,
            unique_id,
            region,
            fuel_type,
            str(row["ds"]),
            model,
            row.get("yhat"),
            row.get("yhat_lo_80"),
            row.get("yhat_hi_80"),
            row.get("yhat_lo_95"),
            row.get("yhat_hi_95"),
        ))

    cur = con.cursor()
    cur.executemany("""
        INSERT OR REPLACE INTO renewable_forecasts
        (run_id, created_at, unique_id, region, fuel_type, ds, model,
         yhat, yhat_lo_80, yhat_hi_80, yhat_lo_95, yhat_hi_95)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, rows)

    con.commit()
    con.close()

    return len(rows)


def save_weather(
    db_path: str,
    weather_df: pd.DataFrame,
    is_forecast: bool = False,
) -> int:
    """Save weather features to database.

    Args:
        db_path: Path to SQLite database
        weather_df: DataFrame with [ds, region, weather_vars...]
        is_forecast: True if this is forecast weather data

    Returns:
        Number of rows inserted
    """
    con = connect(db_path)

    weather_cols = [
        "temperature_2m", "wind_speed_10m", "wind_speed_100m",
        "wind_direction_10m", "direct_radiation", "diffuse_radiation", "cloud_cover"
    ]

    rows = []
    for _, row in weather_df.iterrows():
        values = [row.get(col) for col in weather_cols]
        rows.append((
            row["region"],
            str(row["ds"]),
            *values,
            1 if is_forecast else 0,
        ))

    cur = con.cursor()
    cur.executemany(f"""
        INSERT OR REPLACE INTO weather_features
        (region, ds, {', '.join(weather_cols)}, is_forecast)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, rows)

    con.commit()
    con.close()

    return len(rows)


def save_drift_alert(
    db_path: str,
    run_id: str,
    unique_id: str,
    current_rmse: float,
    threshold_rmse: float,
    severity: str = "warning",
    metadata: Optional[dict] = None,
) -> None:
    """Save drift detection alert.

    Args:
        db_path: Path to SQLite database
        run_id: Pipeline run identifier
        unique_id: Series identifier
        current_rmse: Current RMSE value
        threshold_rmse: Drift threshold
        severity: Alert severity (info, warning, critical)
        metadata: Additional metadata
    """
    con = connect(db_path)

    parts = unique_id.split("_")
    region = parts[0] if len(parts) > 0 else ""
    fuel_type = parts[1] if len(parts) > 1 else ""

    alert_type = "drift_detected" if current_rmse > threshold_rmse else "drift_check"
    message = (
        f"RMSE {current_rmse:.1f} {'>' if current_rmse > threshold_rmse else '<='} "
        f"threshold {threshold_rmse:.1f}"
    )

    cur = con.cursor()
    cur.execute("""
        INSERT INTO drift_alerts
        (alert_at, run_id, unique_id, region, fuel_type, alert_type, severity,
         current_rmse, threshold_rmse, message, metadata_json)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        datetime.utcnow().isoformat(),
        run_id,
        unique_id,
        region,
        fuel_type,
        alert_type,
        severity,
        current_rmse,
        threshold_rmse,
        message,
        json.dumps(metadata) if metadata else None,
    ))

    con.commit()
    con.close()


def save_baseline(
    db_path: str,
    unique_id: str,
    model: str,
    baseline: dict,
) -> None:
    """Save baseline metrics for drift detection.

    Args:
        db_path: Path to SQLite database
        unique_id: Series identifier
        model: Model name
        baseline: Baseline metrics dictionary
    """
    con = connect(db_path)
    cur = con.cursor()

    cur.execute("""
        INSERT OR REPLACE INTO baseline_metrics
        (created_at, unique_id, model, rmse_mean, rmse_std, mae_mean, mae_std,
         drift_threshold_rmse, drift_threshold_mae, n_windows, metadata_json)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    """, (
        datetime.utcnow().isoformat(),
        unique_id,
        model,
        baseline.get("rmse_mean"),
        baseline.get("rmse_std"),
        baseline.get("mae_mean"),
        baseline.get("mae_std"),
        baseline.get("drift_threshold_rmse"),
        baseline.get("drift_threshold_mae"),
        baseline.get("n_windows"),
        json.dumps(baseline),
    ))

    con.commit()
    con.close()


def get_recent_forecasts(
    db_path: str,
    region: Optional[str] = None,
    fuel_type: Optional[str] = None,
    hours: int = 48,
) -> pd.DataFrame:
    """Get recent forecasts from database.

    Args:
        db_path: Path to SQLite database
        region: Filter by region (optional)
        fuel_type: Filter by fuel type (optional)
        hours: Hours of history to retrieve

    Returns:
        DataFrame with forecasts
    """
    con = connect(db_path)

    query = """
        SELECT *
        FROM renewable_forecasts
        WHERE datetime(created_at) > datetime('now', ?)
    """
    params = [f"-{hours} hours"]

    if region:
        query += " AND region = ?"
        params.append(region)

    if fuel_type:
        query += " AND fuel_type = ?"
        params.append(fuel_type)

    query += " ORDER BY ds DESC"

    df = pd.read_sql_query(query, con, params=params)
    con.close()

    return df


def get_drift_alerts(
    db_path: str,
    hours: int = 24,
    severity: Optional[str] = None,
) -> pd.DataFrame:
    """Get recent drift alerts.

    Args:
        db_path: Path to SQLite database
        hours: Hours of history
        severity: Filter by severity (optional)

    Returns:
        DataFrame with alerts
    """
    con = connect(db_path)

    query = """
        SELECT *
        FROM drift_alerts
        WHERE datetime(alert_at) > datetime('now', ?)
    """
    params = [f"-{hours} hours"]

    if severity:
        query += " AND severity = ?"
        params.append(severity)

    query += " ORDER BY alert_at DESC"

    df = pd.read_sql_query(query, con, params=params)
    con.close()

    return df


if __name__ == "__main__":
    # Test database initialization
    import tempfile

    with tempfile.TemporaryDirectory() as tmpdir:
        db_path = f"{tmpdir}/test_renewable.db"

        print("Initializing database...")
        init_renewable_db(db_path)

        print("Database initialized successfully!")

        # Test connection
        con = connect(db_path)
        cur = con.cursor()
        cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
        tables = cur.fetchall()
        print(f"Tables created: {[t[0] for t in tables]}")
        con.close()


Overwriting src/renewable/db.py
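
The save functions above stay idempotent by pairing `INSERT OR REPLACE` with a `UNIQUE` constraint. A minimal sketch of that behaviour, assuming an in-memory database and a hypothetical table `demo_baseline` (only its `UNIQUE (unique_id, model)` constraint mirrors `baseline_metrics`; the table and the `upsert` helper are not part of the project):

```python
import sqlite3

# Hypothetical toy table; only the UNIQUE constraint mirrors baseline_metrics.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE demo_baseline (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        unique_id TEXT NOT NULL,
        model TEXT NOT NULL,
        rmse_mean REAL NOT NULL,
        UNIQUE (unique_id, model)
    )
    """
)


def upsert(unique_id: str, model: str, rmse_mean: float) -> None:
    # On a (unique_id, model) conflict the old row is deleted and a new one inserted.
    con.execute(
        "INSERT OR REPLACE INTO demo_baseline (unique_id, model, rmse_mean) "
        "VALUES (?, ?, ?)",
        (unique_id, model, rmse_mean),
    )


upsert("CALI_WND", "MSTL_ARIMA", 120.0)
upsert("CALI_WND", "MSTL_ARIMA", 95.5)  # same key: replaces the first row
upsert("CALI_SUN", "MSTL_ARIMA", 80.0)
con.commit()

rows = con.execute(
    "SELECT unique_id, model, rmse_mean FROM demo_baseline ORDER BY unique_id"
).fetchall()
print(rows)
con.close()
```

Re-running a save with the same key therefore leaves one row per key rather than accumulating duplicates. Note that a replaced row receives a new `id`, and that `drift_alerts` has no `UNIQUE` constraint, so every call to `save_drift_alert` appends a row.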


---

# Module 6: Pipeline Tasks

**File:** `src/renewable/tasks.py`

This module orchestrates the complete pipeline:

1. **Fetch generation data** from EIA
2. **Fetch weather data** from Open-Meteo
3. **Train models** with cross-validation
4. **Generate forecasts** with prediction intervals
5. **Compute drift metrics** vs baseline

## Key Feature: Adaptive CV

Cross-validation requires sufficient data:
```
Minimum rows = horizon + (n_windows × step_size)
```

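For short series, the pipeline **adapts** the CV settings automatically: based on the shortest series, it shrinks `step_size` (never below 24 hours) and `n_windows` (never below 2).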
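
As a rough sketch of that adaptation, the helper below repeats the arithmetic used in `train_renewable_models`. The function name `adaptive_cv_settings` is hypothetical and exists only for this illustration; the defaults are taken from `RenewablePipelineConfig`.

```python
def adaptive_cv_settings(
    min_series_len: int,
    horizon: int = 24,
    cv_windows: int = 5,
    cv_step_size: int = 168,
) -> tuple[int, int]:
    """Return (n_windows, step_size) shrunk to fit the shortest series."""
    available_for_cv = min_series_len - horizon
    # Step by at least one day; otherwise use about a third of the spare rows.
    step_size = min(cv_step_size, max(24, available_for_cv // 3))
    # Keep at least 2 windows and at most the configured number.
    n_windows = min(cv_windows, max(2, available_for_cv // step_size))
    return n_windows, step_size


for days in (7, 30, 60):
    n_windows, step_size = adaptive_cv_settings(min_series_len=days * 24)
    needed = 24 + n_windows * step_size
    print(f"{days:>2} days: n_windows={n_windows}, step_size={step_size}, needs {needed} rows")
```

With the default 30-day lookback (720 hourly rows) this gives 4 windows of 168 hours instead of the configured 5. Because of the floors on both values, a series shorter than 72 rows still falls below the minimum for a 24-hour horizon, so very short series need to be filtered out or extended beforehand.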

In [31]:
%%writefile src/renewable/tasks.py
# file: src\renewable\tasks.py
"""Renewable energy forecasting pipeline tasks.

Idempotent tasks for:
- Fetching EIA renewable generation data
- Fetching weather data from Open-Meteo
- Training probabilistic models
- Generating forecasts with intervals
- Computing drift metrics
"""

import argparse
import logging
import os
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

import pandas as pd

from src.renewable.eia_renewable import EIARenewableFetcher
from src.renewable.modeling import (
    RenewableForecastModel,
    _log_series_summary,
    compute_baseline_metrics,
)
from src.renewable.open_meteo import OpenMeteoRenewable
from src.renewable.regions import REGIONS, list_regions

logger = logging.getLogger(__name__)


@dataclass
class RenewablePipelineConfig:
    """Configuration for renewable forecasting pipeline."""

    # Data parameters
    regions: list[str] = field(default_factory=lambda: ["CALI", "ERCO", "MISO", "PJM", "SWPP"])
    fuel_types: list[str] = field(default_factory=lambda: ["WND", "SUN"])
    start_date: str = ""  # Set dynamically
    end_date: str = ""  # Set dynamically
    lookback_days: int = 30

    # Forecast parameters
    horizon: int = 24
    confidence_levels: tuple[int, int] = (80, 95)

    # CV parameters
    cv_windows: int = 5
    cv_step_size: int = 168  # 1 week

    # Output paths
    data_dir: str = "data/renewable"
    overwrite: bool = False

    def __post_init__(self):
        # Set default dates if not provided
        if not self.end_date:
            self.end_date = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        if not self.start_date:
            end = datetime.strptime(self.end_date, "%Y-%m-%d")
            start = end - timedelta(days=self.lookback_days)
            self.start_date = start.strftime("%Y-%m-%d")

    def generation_path(self) -> Path:
        return Path(self.data_dir) / "generation.parquet"

    def weather_path(self) -> Path:
        return Path(self.data_dir) / "weather.parquet"

    def forecasts_path(self) -> Path:
        return Path(self.data_dir) / "forecasts.parquet"

    def baseline_path(self) -> Path:
        return Path(self.data_dir) / "baseline.json"


def fetch_renewable_data(
    config: RenewablePipelineConfig,
    fetch_diagnostics: Optional[list[dict]] = None,
) -> pd.DataFrame:
    """Task 1: Fetch EIA generation data for all regions and fuel types.

    Args:
        config: Pipeline configuration
        fetch_diagnostics: Optional list to capture per-region fetch metadata

    Returns:
        DataFrame with columns [unique_id, ds, y]
    """
    output_path = config.generation_path()
    output_path.parent.mkdir(parents=True, exist_ok=True)

    def _log_generation_summary(df: pd.DataFrame, source: str) -> None:
        _log_series_summary(df, value_col="y", label=f"generation_data_{source}")

        expected_series = {
            f"{region}_{fuel}" for region in config.regions for fuel in config.fuel_types
        }
        present_series = set(df["unique_id"]) if "unique_id" in df.columns else set()
        missing_series = sorted(expected_series - present_series)
        if missing_series:
            logger.warning(
                "[fetch_generation] Missing expected series (%s): %s",
                source,
                missing_series,
            )

        if df.empty:
            logger.warning("[fetch_generation] No generation data rows (%s).", source)
            return

        coverage = (
            df.groupby("unique_id")["ds"]
            .agg(min_ds="min", max_ds="max", rows="count")
            .reset_index()
            .sort_values("unique_id")
        )
        max_series_log = 25
        if len(coverage) > max_series_log:
            logger.info(
                "[fetch_generation] Coverage (%s, first %s series):\n%s",
                source,
                max_series_log,
                coverage.head(max_series_log).to_string(index=False),
            )
        else:
            logger.info("[fetch_generation] Coverage (%s):\n%s", source, coverage.to_string(index=False))

    if output_path.exists() and not config.overwrite:
        logger.info(f"[fetch_generation] exists, loading: {output_path}")
        cached = pd.read_parquet(output_path)
        # Log cached coverage to surface missing series without refetching.
        _log_generation_summary(cached, source="cache")
        return cached

    logger.info(f"[fetch_generation] Fetching {config.fuel_types} for {config.regions}")

    fetcher = EIARenewableFetcher()
    all_dfs = []

    for fuel_type in config.fuel_types:
        df = fetcher.fetch_all_regions(
            fuel_type=fuel_type,
            start_date=config.start_date,
            end_date=config.end_date,
            regions=config.regions,
            diagnostics=fetch_diagnostics,
        )
        all_dfs.append(df)

    combined = pd.concat(all_dfs, ignore_index=True)
    combined = combined.sort_values(["unique_id", "ds"]).reset_index(drop=True)

    # Log fresh coverage to highlight gaps or unexpected negatives.
    _log_generation_summary(combined, source="fresh")

    if fetch_diagnostics:
        empty_series = [
            entry
            for entry in fetch_diagnostics
            if entry.get("empty")
        ]
        for entry in empty_series:
            logger.warning(
                "[fetch_generation] Empty series detail: region=%s fuel=%s total=%s pages=%s",
                entry.get("region"),
                entry.get("fuel_type"),
                entry.get("total_records"),
                entry.get("pages"),
            )

    combined.to_parquet(output_path, index=False)
    logger.info(f"[fetch_generation] Saved: {output_path} ({len(combined)} rows)")

    return combined


def fetch_renewable_weather(
    config: RenewablePipelineConfig,
    include_forecast: bool = True,
) -> pd.DataFrame:
    """Task 2: Fetch weather data for all regions.

    Args:
        config: Pipeline configuration
        include_forecast: Include forecast weather for predictions

    Returns:
        DataFrame with columns [ds, region, weather_vars...]
    """
    output_path = config.weather_path()
    output_path.parent.mkdir(parents=True, exist_ok=True)

    def _log_weather_summary(df: pd.DataFrame, source: str) -> None:
        if df.empty:
            logger.warning("[fetch_weather] No weather data rows (%s).", source)
            return

        coverage = (
            df.groupby("region")["ds"]
            .agg(min_ds="min", max_ds="max", rows="count")
            .reset_index()
            .sort_values("region")
        )
        max_region_log = 25
        if len(coverage) > max_region_log:
            logger.info(
                "[fetch_weather] Coverage (%s, first %s regions):\n%s",
                source,
                max_region_log,
                coverage.head(max_region_log).to_string(index=False),
            )
        else:
            logger.info("[fetch_weather] Coverage (%s):\n%s", source, coverage.to_string(index=False))

        missing_cols = [
            col for col in OpenMeteoRenewable.WEATHER_VARS if col not in df.columns
        ]
        if missing_cols:
            logger.warning(
                "[fetch_weather] Missing expected weather columns (%s): %s",
                source,
                missing_cols,
            )

        missing_values = {
            col: int(df[col].isna().sum())
            for col in OpenMeteoRenewable.WEATHER_VARS
            if col in df.columns and df[col].isna().any()
        }
        if missing_values:
            logger.warning(
                "[fetch_weather] Missing weather values (%s): %s",
                source,
                missing_values,
            )

    if output_path.exists() and not config.overwrite:
        logger.info(f"[fetch_weather] exists, loading: {output_path}")
        cached = pd.read_parquet(output_path)
        # Log cached weather coverage to surface missing regions/columns.
        _log_weather_summary(cached, source="cache")
        return cached

    logger.info(f"[fetch_weather] Fetching weather for {config.regions}")

    weather = OpenMeteoRenewable()

    # Historical weather
    hist_df = weather.fetch_all_regions_historical(
        regions=config.regions,
        start_date=config.start_date,
        end_date=config.end_date,
    )

    # Forecast weather (for prediction, prevents leakage)
    if include_forecast:
        fcst_df = weather.fetch_all_regions_forecast(
            regions=config.regions,
            horizon_hours=config.horizon + 24,  # Buffer
        )

        # Combine, preferring forecast for overlapping times
        combined = pd.concat([hist_df, fcst_df], ignore_index=True)
        combined = combined.drop_duplicates(subset=["ds", "region"], keep="last")
    else:
        combined = hist_df

    combined = combined.sort_values(["region", "ds"]).reset_index(drop=True)

    # Log fresh weather coverage and missing values before saving.
    _log_weather_summary(combined, source="fresh")

    combined.to_parquet(output_path, index=False)
    logger.info(f"[fetch_weather] Saved: {output_path} ({len(combined)} rows)")

    return combined


def train_renewable_models(
    config: RenewablePipelineConfig,
    generation_df: Optional[pd.DataFrame] = None,
    weather_df: Optional[pd.DataFrame] = None,
) -> tuple[pd.DataFrame, pd.DataFrame, dict]:
    """Task 3: Train models and compute baseline metrics via cross-validation.

    Args:
        config: Pipeline configuration
        generation_df: Generation data (loads from file if None)
        weather_df: Weather data (loads from file if None)

    Returns:
        Tuple of (cv_results, leaderboard, baseline_metrics)
    """
    # Load data if not provided
    if generation_df is None:
        generation_df = pd.read_parquet(config.generation_path())
    if weather_df is None:
        weather_df = pd.read_parquet(config.weather_path())

    logger.info(f"[train_models] Training on {len(generation_df)} rows")

    model = RenewableForecastModel(
        horizon=config.horizon,
        confidence_levels=config.confidence_levels,
    )

    # Compute adaptive CV settings based on shortest series
    min_series_len = generation_df.groupby("unique_id").size().min()

    # CV needs: horizon + (n_windows * step_size) rows minimum
    # Solve for n_windows: n_windows = (min_series_len - horizon) / step_size
    available_for_cv = min_series_len - config.horizon

    # Adjust step_size and n_windows to fit data
    step_size = min(config.cv_step_size, max(24, available_for_cv // 3))
    n_windows = min(config.cv_windows, max(2, available_for_cv // step_size))

    logger.info(
        f"[train_models] Adaptive CV: {n_windows} windows, "
        f"step={step_size}h (min_series={min_series_len} rows)"
    )

    # Cross-validation
    cv_results, leaderboard = model.cross_validate(
        df=generation_df,
        weather_df=weather_df,
        n_windows=n_windows,
        step_size=step_size,
    )

    best_model = leaderboard.iloc[0]["model"]
    baseline = compute_baseline_metrics(cv_results, model_name=best_model)

    logger.info(f"[train_models] Best model: {best_model}, RMSE: {baseline['rmse_mean']:.1f}")

    return cv_results, leaderboard, baseline


def generate_renewable_forecasts(
    config: RenewablePipelineConfig,
    generation_df: Optional[pd.DataFrame] = None,
    weather_df: Optional[pd.DataFrame] = None,
) -> pd.DataFrame:
    """Task 4: Generate forecasts with prediction intervals."""
    output_path = config.forecasts_path()
    output_path.parent.mkdir(parents=True, exist_ok=True)

    if generation_df is None:
        generation_df = pd.read_parquet(config.generation_path())
    if weather_df is None:
        weather_df = pd.read_parquet(config.weather_path())

    logger.info(f"[generate_forecasts] Generating {config.horizon}h forecasts")

    # Ensure datetime types
    generation_df = generation_df.copy()
    generation_df["ds"] = pd.to_datetime(generation_df["ds"], errors="raise")
    weather_df = weather_df.copy()
    weather_df["ds"] = pd.to_datetime(weather_df["ds"], errors="raise")

    model = RenewableForecastModel(
        horizon=config.horizon,
        confidence_levels=config.confidence_levels,
    )

    # fit() uses only historical generation timestamps; the weather merge fails loudly if weather rows are missing.
    model.fit(generation_df, weather_df)

    # Future weather must cover the horizon after the latest generation timestamp.
    last_gen_ds = generation_df["ds"].max()
    future_weather = weather_df[weather_df["ds"] > last_gen_ds].copy()

    if future_weather.empty:
        raise RuntimeError(
            "[generate_forecasts] No future weather rows found after last generation timestamp. "
            f"last_gen_ds={last_gen_ds}"
        )

    forecasts = model.predict(future_weather=future_weather)

    forecasts.to_parquet(output_path, index=False)
    logger.info(f"[generate_forecasts] Saved: {output_path} ({len(forecasts)} rows)")

    return forecasts



def compute_renewable_drift(
    predictions: pd.DataFrame,
    actuals: pd.DataFrame,
    baseline_metrics: dict,
) -> dict:
    """Task 5: Detect drift by comparing current metrics to baseline.

    Drift is flagged when current RMSE > baseline_mean + 2*baseline_std

    Args:
        predictions: Forecast DataFrame with [unique_id, ds, yhat]
        actuals: Actual values DataFrame with [unique_id, ds, y]
        baseline_metrics: Baseline from cross-validation

    Returns:
        Dictionary with drift status and details
    """
    from src.chapter2.evaluation import ForecastMetrics

    # Merge predictions with actuals
    merged = predictions.merge(
        actuals[["unique_id", "ds", "y"]],
        on=["unique_id", "ds"],
        how="inner",
    )

    if len(merged) == 0:
        return {
            "status": "no_data",
            "message": "No overlapping data between predictions and actuals",
        }

    # Compute current metrics
    y_true = merged["y"].values
    y_pred = merged["yhat"].values

    current_rmse = ForecastMetrics.rmse(y_true, y_pred)
    current_mae = ForecastMetrics.mae(y_true, y_pred)

    # Check against threshold
    threshold = baseline_metrics.get("drift_threshold_rmse", float("inf"))
    is_drifting = current_rmse > threshold

    result = {
        "status": "drift_detected" if is_drifting else "stable",
        "current_rmse": float(current_rmse),
        "current_mae": float(current_mae),
        "baseline_rmse": float(baseline_metrics.get("rmse_mean", 0)),
        "drift_threshold": float(threshold),
        "threshold_exceeded_by": float(max(0, current_rmse - threshold)),
        "n_predictions": len(merged),
        # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

    if is_drifting:
        logger.warning(
            f"[drift] DRIFT DETECTED: RMSE={current_rmse:.1f} > threshold={threshold:.1f}"
        )
    else:
        logger.info(f"[drift] Stable: RMSE={current_rmse:.1f} <= threshold={threshold:.1f}")

    return result


def run_full_pipeline(
    config: RenewablePipelineConfig,
    fetch_diagnostics: Optional[list[dict]] = None,
) -> dict:
    """Run the complete renewable forecasting pipeline.

    Steps:
    1. Fetch generation data
    2. Fetch weather data
    3. Train models (CV)
    4. Generate forecasts

    Args:
        config: Pipeline configuration
        fetch_diagnostics: Optional list to capture per-region fetch metadata

    Returns:
        Dictionary with pipeline results
    """
    logger.info(f"[pipeline] Starting: {config.start_date} to {config.end_date}")
    logger.info(f"[pipeline] Regions: {config.regions}")
    logger.info(f"[pipeline] Fuel types: {config.fuel_types}")

    results = {}

    # Step 1: Fetch generation
    generation_df = fetch_renewable_data(config, fetch_diagnostics=fetch_diagnostics)
    results["generation_rows"] = len(generation_df)
    results["series_count"] = generation_df["unique_id"].nunique()

    # Step 2: Fetch weather
    weather_df = fetch_renewable_weather(config)
    results["weather_rows"] = len(weather_df)

    # Step 3: Train and validate
    cv_results, leaderboard, baseline = train_renewable_models(
        config, generation_df, weather_df
    )
    results["best_model"] = leaderboard.iloc[0]["model"]
    results["best_rmse"] = float(leaderboard.iloc[0]["rmse"])
    results["baseline"] = baseline

    # Step 4: Generate forecasts
    forecasts = generate_renewable_forecasts(config, generation_df, weather_df)
    results["forecast_rows"] = len(forecasts)

    if fetch_diagnostics is not None:
        results["fetch_diagnostics"] = fetch_diagnostics

    logger.info(f"[pipeline] Complete. Best model: {results['best_model']}")

    return results


def main():
    """CLI entry point for renewable pipeline."""
    parser = argparse.ArgumentParser(description="Renewable Energy Forecasting Pipeline")

    parser.add_argument(
        "--regions",
        type=str,
        default="CALI,ERCO,MISO",
        help="Comma-separated region codes (default: CALI,ERCO,MISO)",
    )
    parser.add_argument(
        "--fuel",
        type=str,
        default="WND,SUN",
        help="Comma-separated fuel types (default: WND,SUN)",
    )
    parser.add_argument(
        "--days",
        type=int,
        default=30,
        help="Lookback days (default: 30)",
    )
    parser.add_argument(
        "--horizon",
        type=int,
        default=24,
        help="Forecast horizon in hours (default: 24)",
    )
    parser.add_argument(
        "--overwrite",
        action="store_true",
        help="Overwrite existing data files",
    )
    parser.add_argument(
        "--data-dir",
        type=str,
        default="data/renewable",
        help="Output directory (default: data/renewable)",
    )

    args = parser.parse_args()

    # Configure logging
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    )

    # Build config
    config = RenewablePipelineConfig(
        regions=args.regions.split(","),
        fuel_types=args.fuel.split(","),
        lookback_days=args.days,
        horizon=args.horizon,
        overwrite=args.overwrite,
        data_dir=args.data_dir,
    )

    # Run pipeline
    results = run_full_pipeline(config)

    print("\n" + "=" * 60)
    print("PIPELINE RESULTS")
    print("=" * 60)
    print(f"  Series count: {results['series_count']}")
    print(f"  Generation rows: {results['generation_rows']}")
    print(f"  Weather rows: {results['weather_rows']}")
    print(f"  Forecast rows: {results['forecast_rows']}")
    print(f"  Best model: {results['best_model']}")
    print(f"  Best RMSE: {results['best_rmse']:.1f}")
    print("=" * 60)


if __name__ == "__main__":
    main()


Overwriting src/renewable/tasks.py
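
The drift rule in Task 5 compares the current RMSE with a threshold of baseline mean plus two standard deviations. The sketch below works through that arithmetic on made-up numbers. The helpers `rmse` and `drift_threshold` are hypothetical, and the use of the population standard deviation is an assumption, since `compute_baseline_metrics` is not shown in this section.

```python
import math
import statistics


def rmse(y_true: list[float], y_pred: list[float]) -> float:
    """Root mean squared error; well defined when y_true contains zeros."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def drift_threshold(window_rmses: list[float], n_std: float = 2.0) -> float:
    """Threshold = mean + n_std * std over backtest windows (population std assumed)."""
    return statistics.fmean(window_rmses) + n_std * statistics.pstdev(window_rmses)


# Made-up RMSE values from four backtest windows.
window_rmses = [100.0, 110.0, 90.0, 100.0]
threshold = drift_threshold(window_rmses)

# Made-up solar actuals (zeros at night) and forecasts.
y_true = [0.0, 0.0, 50.0, 200.0]
y_pred = [0.0, 10.0, 40.0, 500.0]
current = rmse(y_true, y_pred)

print(f"threshold={threshold:.1f} current_rmse={current:.1f} drifting={current > threshold}")
```

The night-time zeros in `y_true` are why the pipeline evaluates with RMSE and MAE: a percentage error such as MAPE would divide by zero on those rows.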


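---

# Module 7: Validation

**File:** `src/renewable/validation.py`

This module checks generation data before it reaches the models. `validate_generation_df` returns a `ValidationReport` and stops at the first check that fails:

1. **Schema** - required columns `unique_id`, `ds`, `y` are present and the frame is not empty
2. **Parsing** - `ds` parses as a datetime and `y` as a number
3. **Value sanity** - no negative generation and no duplicate `(unique_id, ds)` rows
4. **Series coverage** - every expected series is present
5. **Freshness** - the latest timestamp, overall and per series, is within `max_lag_hours`
6. **Completeness** - the worst per-series share of missing hourly points is within `max_missing_ratio`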

In [32]:
%%writefile src/renewable/validation.py
# file: src/renewable/validation.py
"""Validation utilities for renewable generation data."""

from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional

import pandas as pd


@dataclass(frozen=True)
class ValidationReport:
    ok: bool
    message: str
    details: dict


def validate_generation_df(
    df: pd.DataFrame,
    *,
    max_lag_hours: int = 3,
    max_missing_ratio: float = 0.02,
    expected_series: Optional[Iterable[str]] = None,
) -> ValidationReport:
    required = {"unique_id", "ds", "y"}
    missing_cols = required - set(df.columns)
    if missing_cols:
        return ValidationReport(
            False,
            "Missing required columns",
            {"missing_cols": sorted(missing_cols)},
        )

    if df.empty:
        return ValidationReport(False, "Generation data is empty", {})

    work = df.copy()

    work["ds"] = pd.to_datetime(work["ds"], errors="coerce", utc=True)
    if work["ds"].isna().any():
        return ValidationReport(
            False,
            "Unparseable ds values found",
            {"bad_ds": int(work["ds"].isna().sum())},
        )

    work["y"] = pd.to_numeric(work["y"], errors="coerce")
    if work["y"].isna().any():
        return ValidationReport(
            False,
            "Unparseable y values found",
            {"bad_y": int(work["y"].isna().sum())},
        )

    if (work["y"] < 0).any():
        return ValidationReport(
            False,
            "Negative generation values found",
            {"neg_y": int((work["y"] < 0).sum())},
        )

    dup = work.duplicated(subset=["unique_id", "ds"]).sum()
    if dup:
        return ValidationReport(
            False,
            "Duplicate (unique_id, ds) rows found",
            {"duplicates": int(dup)},
        )

    if expected_series:
        expected = sorted(set(expected_series))
        present = sorted(set(work["unique_id"]))
        missing_series = sorted(set(expected) - set(present))
        if missing_series:
            return ValidationReport(
                False,
                "Missing expected series",
                {"missing_series": missing_series, "present_series": present},
            )

    # Lowercase "h" is the hourly alias; uppercase "H" is deprecated since pandas 2.2.
    now_utc = pd.Timestamp.now(tz="UTC").floor("h")
    max_ds = work["ds"].max()
    lag_hours = (now_utc - max_ds).total_seconds() / 3600.0
    if lag_hours > max_lag_hours:
        return ValidationReport(
            False,
            "Data not fresh enough",
            {
                "now_utc": now_utc.isoformat(),
                "max_ds": max_ds.isoformat(),
                "lag_hours": lag_hours,
            },
        )

    series_max = work.groupby("unique_id")["ds"].max()
    series_lag = (now_utc - series_max).dt.total_seconds() / 3600.0
    stale = series_lag[series_lag > max_lag_hours].sort_values(ascending=False)
    if not stale.empty:
        return ValidationReport(
            False,
            "Stale series found",
            {
                "stale_series": stale.head(10).to_dict(),
                "max_lag_hours": max_lag_hours,
            },
        )

    missing_ratios = {}
    for uid, group in work.groupby("unique_id"):
        group = group.sort_values("ds")
        start = group["ds"].iloc[0]
        end = group["ds"].iloc[-1]
        expected = int(((end - start) / pd.Timedelta(hours=1)) + 1)
        actual = len(group)
        missing = max(expected - actual, 0)
        missing_ratios[uid] = missing / max(expected, 1)

    worst_uid = max(missing_ratios, key=missing_ratios.get)
    worst_ratio = missing_ratios[worst_uid]
    if worst_ratio > max_missing_ratio:
        return ValidationReport(
            False,
            "Too many missing hourly points",
            {"worst_uid": worst_uid, "worst_missing_ratio": worst_ratio},
        )

    return ValidationReport(
        True,
        "OK",
        {
            "row_count": len(work),
            "series_count": int(work["unique_id"].nunique()),
            "max_ds": max_ds.isoformat(),
            "lag_hours": lag_hours,
            "worst_missing_ratio": worst_ratio,
        },
    )


Overwriting src/renewable/validation.py
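
The completeness check in `validate_generation_df` counts how many hourly timestamps are absent between a series' first and last observation. A small sketch of the same arithmetic using only the standard library; the helper `missing_hourly_ratio` and the timestamps are illustrative, not part of the module.

```python
from datetime import datetime, timedelta, timezone


def missing_hourly_ratio(timestamps: list[datetime]) -> float:
    """Share of hourly slots between the first and last timestamp that have no row."""
    ordered = sorted(timestamps)
    expected = int((ordered[-1] - ordered[0]) / timedelta(hours=1)) + 1
    missing = max(expected - len(ordered), 0)
    return missing / max(expected, 1)


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
# Ten hourly slots (00:00 through 09:00) with hours 3 and 7 dropped.
stamps = [start + timedelta(hours=h) for h in range(10) if h not in (3, 7)]

ratio = missing_hourly_ratio(stamps)
print(f"missing ratio = {ratio:.2f}")
```

Two of ten slots are missing here, far above the default `max_missing_ratio` of 0.02, so this series would fail validation. Gaps at the very start or end of a series do not count towards this ratio, because the expected range is measured between the first and last row; trailing gaps are caught by the freshness check instead.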


---

# Module 8: Dashboard

**File:** `src/renewable/dashboard.py`

The Streamlit dashboard provides:
- **Forecast visualization** with prediction intervals
- **Drift monitoring** and alerts
- **Coverage analysis** (nominal vs empirical)
- **Weather features** by region
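
Coverage analysis compares the nominal level of a prediction interval (80% or 95%) with the share of actuals that fall inside it. A minimal sketch on made-up numbers; `empirical_coverage` is a hypothetical helper, not a function from the dashboard module.

```python
def empirical_coverage(
    actuals: list[float], lower: list[float], upper: list[float]
) -> float:
    """Fraction of actual values inside [lower, upper], bounds inclusive."""
    inside = sum(lo <= val <= hi for val, lo, hi in zip(actuals, lower, upper))
    return inside / len(actuals)


# Made-up actuals and 80% interval bounds for five forecast hours.
y = [10.0, 12.0, 30.0, 18.0, 0.0]
lo_80 = [8.0, 9.0, 15.0, 14.0, 0.0]
hi_80 = [14.0, 13.0, 25.0, 22.0, 3.0]

coverage = empirical_coverage(y, lo_80, hi_80)
print(f"nominal=0.80 empirical={coverage:.2f}")
```

Empirical coverage well below the nominal level suggests intervals that are too narrow, while coverage well above it suggests intervals that are wider than they need to be.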

## Running the Dashboard

```bash
streamlit run src/renewable/dashboard.py
```

The dashboard will:
1. Load forecasts from `data/renewable/forecasts.parquet`
2. Display interactive charts with Plotly
3. Show drift alerts from the database

In [33]:
%%writefile src/renewable/dashboard.py
# file: src/renewable/dashboard.py
"""Streamlit dashboard for renewable energy forecasting.

Provides:
- Forecast visualization with prediction intervals
- Drift monitoring and alerts
- Coverage analysis (nominal vs empirical)
- Weather features by region

Run with:
    streamlit run src/renewable/dashboard.py
"""

import os
import sys
from datetime import datetime, timedelta, timezone
from pathlib import Path

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import streamlit as st

# Add project root to path
sys.path.insert(0, str(Path(__file__).parent.parent.parent))

from src.renewable.db import (
    connect,
    get_drift_alerts,
    get_recent_forecasts,
    init_renewable_db,
)
from src.renewable.regions import FUEL_TYPES, REGIONS

# Page config
st.set_page_config(
    page_title="Renewable Forecast Dashboard",
    page_icon="⚡",
    layout="wide",
)


def main():
    """Main dashboard application."""
    st.title("⚡ Renewable Energy Forecast Dashboard")
    st.markdown("Next-24h wind/solar generation forecasts with drift monitoring")

    # Sidebar configuration
    with st.sidebar:
        st.header("Configuration")

        db_path = st.text_input(
            "Database Path",
            value="data/renewable/renewable.db",
        )

        # Initialize database if it doesn't exist
        if not Path(db_path).exists():
            init_renewable_db(db_path)
            st.info("Database initialized")

        st.divider()

        # Region filter
        all_regions = list(REGIONS.keys())
        selected_regions = st.multiselect(
            "Regions",
            options=all_regions,
            default=["CALI", "ERCO", "MISO"],
        )

        # Fuel type filter
        fuel_type = st.selectbox(
            "Fuel Type",
            options=["WND", "SUN", "Both"],
            index=0,
        )

        st.divider()

        # Actions
        show_debug = st.checkbox("Show Debug", value=False)
        if st.button("🔄 Refresh Data", width="stretch"):
            st.rerun()

        if st.button("📊 Run Pipeline", width="stretch"):
            run_pipeline_from_dashboard(db_path, selected_regions, fuel_type)

    # Main content tabs
    tab1, tab2, tab3, tab4 = st.tabs([
        "📈 Forecasts",
        "⚠️ Drift Monitor",
        "📊 Coverage",
        "🌤️ Weather",
    ])

    with tab1:
        render_forecasts_tab(db_path, selected_regions, fuel_type, show_debug=show_debug)

    with tab2:
        render_drift_tab(db_path)

    with tab3:
        render_coverage_tab(db_path)

    with tab4:
        render_weather_tab(db_path, selected_regions)


def render_forecasts_tab(db_path: str, regions: list, fuel_type: str, *, show_debug: bool = False):
    """Render forecast visualization with prediction intervals."""
    st.subheader("Generation Forecasts")

    forecasts_df = pd.DataFrame()
    data_source = "none"
    derived_columns: list[str] = []

    # Try to load from parquet file first (pipeline output)
    parquet_path = Path("data/renewable/forecasts.parquet")
    if parquet_path.exists():
        try:
            forecasts_df = pd.read_parquet(parquet_path)
            data_source = f"parquet:{parquet_path}"
            # Add region/fuel_type columns if missing
            if "unique_id" in forecasts_df.columns:
                parts = forecasts_df["unique_id"].astype(str).str.split("_", n=1, expand=True)
                if "region" not in forecasts_df.columns:
                    forecasts_df["region"] = parts[0]
                    derived_columns.append("region")
                if "fuel_type" not in forecasts_df.columns:
                    forecasts_df["fuel_type"] = parts[1] if parts.shape[1] > 1 else pd.NA
                    derived_columns.append("fuel_type")
            st.success(f"Loaded {len(forecasts_df)} forecasts from pipeline")
        except Exception as e:
            st.warning(f"Could not load parquet: {e}")

    # Fall back to database
    if forecasts_df.empty:
        try:
            forecasts_df = get_recent_forecasts(db_path, hours=72)
            data_source = f"db:{db_path}"
        except Exception as e:
            st.warning(f"Could not load from database: {e}")

    if forecasts_df.empty:
        # Show demo data
        st.info("No forecasts found. Showing demo data.")
        forecasts_df = generate_demo_forecasts(regions, fuel_type)
        data_source = "demo"

    if show_debug:
        with st.expander("Debug: Forecast Data", expanded=False):
            st.markdown("**Source**")
            st.code(data_source)
            st.markdown("**Columns**")
            st.code(", ".join(forecasts_df.columns.tolist()))

            st.markdown("**Counts (pre-filter)**")
            st.write({"rows": int(len(forecasts_df))})

            if derived_columns:
                st.markdown("**Derived Columns**")
                st.write(derived_columns)

            if "unique_id" in forecasts_df.columns:
                st.markdown("**unique_id sample**")
                st.write(forecasts_df["unique_id"].dropna().astype(str).head(10).tolist())

            if "fuel_type" in forecasts_df.columns:
                st.markdown("**fuel_type counts**")
                st.dataframe(forecasts_df["fuel_type"].value_counts(dropna=False).to_frame())

                unknown = sorted(
                    {str(v) for v in forecasts_df["fuel_type"].dropna().unique()}
                    - set(FUEL_TYPES.keys())
                )
                if unknown:
                    st.warning(f"Unknown fuel_type values: {unknown}")

            if "region" in forecasts_df.columns:
                st.markdown("**region counts**")
                st.dataframe(forecasts_df["region"].value_counts(dropna=False).to_frame())

    # Filter by selections
    if fuel_type != "Both":
        forecasts_df = forecasts_df[forecasts_df["fuel_type"] == fuel_type]

    if regions:
        forecasts_df = forecasts_df[forecasts_df["region"].isin(regions)]

    if show_debug:
        with st.expander("Debug: Filter Result", expanded=False):
            st.markdown("**Applied Filters**")
            st.write({"fuel_type": fuel_type, "regions": regions})
            st.markdown("**Counts (post-filter)**")
            st.write({"rows": int(len(forecasts_df))})
            if "unique_id" in forecasts_df.columns:
                st.markdown("**unique_id after filter**")
                st.write(sorted(forecasts_df["unique_id"].dropna().astype(str).unique().tolist()))

    if forecasts_df.empty:
        st.warning("No data matching filters")
        return

    # Series selector
    series_options = forecasts_df["unique_id"].unique().tolist()
    selected_series = st.selectbox(
        "Select Series",
        options=series_options,
        index=0 if series_options else None,
    )

    if selected_series:
        series_data = forecasts_df[forecasts_df["unique_id"] == selected_series].copy()
        series_data = series_data.sort_values("ds")

        # Create forecast plot with intervals
        fig = create_forecast_plot(series_data, selected_series)
        st.plotly_chart(fig, width="stretch")

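        # Show data table (only the columns that are present, so a forecast
        # file without interval columns does not raise KeyError)
        with st.expander("View Data"):
            table_cols = ["ds", "yhat", "yhat_lo_80", "yhat_hi_80", "yhat_lo_95", "yhat_hi_95"]
            st.dataframe(
                series_data[[c for c in table_cols if c in series_data.columns]],
                width="stretch",
            )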


def create_forecast_plot(df: pd.DataFrame, title: str) -> go.Figure:
    """Create Plotly figure with forecast and prediction intervals."""
    fig = go.Figure()

    # Ensure datetime
    df["ds"] = pd.to_datetime(df["ds"])

    # 95% interval (outer, lighter)
    if "yhat_lo_95" in df.columns and "yhat_hi_95" in df.columns:
        fig.add_trace(go.Scatter(
            x=pd.concat([df["ds"], df["ds"][::-1]]),
            y=pd.concat([df["yhat_hi_95"], df["yhat_lo_95"][::-1]]),
            fill="toself",
            fillcolor="rgba(68, 138, 255, 0.2)",
            line=dict(color="rgba(255,255,255,0)"),
            name="95% Interval",
            hoverinfo="skip",
        ))

    # 80% interval (inner, darker)
    if "yhat_lo_80" in df.columns and "yhat_hi_80" in df.columns:
        fig.add_trace(go.Scatter(
            x=pd.concat([df["ds"], df["ds"][::-1]]),
            y=pd.concat([df["yhat_hi_80"], df["yhat_lo_80"][::-1]]),
            fill="toself",
            fillcolor="rgba(68, 138, 255, 0.4)",
            line=dict(color="rgba(255,255,255,0)"),
            name="80% Interval",
            hoverinfo="skip",
        ))

    # Point forecast
    fig.add_trace(go.Scatter(
        x=df["ds"],
        y=df["yhat"],
        mode="lines",
        name="Forecast",
        line=dict(color="#1f77b4", width=2),
    ))

    # Actuals if available
    if "y" in df.columns:
        fig.add_trace(go.Scatter(
            x=df["ds"],
            y=df["y"],
            mode="markers",
            name="Actual",
            marker=dict(color="#2ca02c", size=6),
        ))

    fig.update_layout(
        title=f"Forecast: {title}",
        xaxis_title="Time",
        yaxis_title="Generation (MWh)",
        hovermode="x unified",
        legend=dict(orientation="h", yanchor="bottom", y=1.02),
        height=450,
    )

    return fig


def render_drift_tab(db_path: str):
    """Render drift monitoring and alerts."""
    st.subheader("Drift Detection")

    col1, col2, col3 = st.columns(3)

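    # Try to load alerts; report failures instead of silently showing "stable"
    try:
        alerts_df = get_drift_alerts(db_path, hours=48)
    except Exception as exc:
        st.warning(f"Could not load drift alerts: {exc}")
        alerts_df = pd.DataFrame()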

    # Summary metrics
    with col1:
        critical = len(alerts_df[alerts_df["severity"] == "critical"]) if not alerts_df.empty else 0
        st.metric(
            "Critical Alerts",
            critical,
            delta=None,
            delta_color="inverse" if critical > 0 else "off",
        )

    with col2:
        warning = len(alerts_df[alerts_df["severity"] == "warning"]) if not alerts_df.empty else 0
        st.metric("Warnings", warning)

    with col3:
        stable = len(alerts_df[alerts_df["alert_type"] == "drift_check"]) if not alerts_df.empty else 0
        st.metric("Stable Checks", stable)

    st.divider()

    if alerts_df.empty:
        st.info("No drift alerts in the last 48 hours. System is stable.")

        # Show demo drift status
        st.markdown("### Demo Drift Status")
        demo_drift = pd.DataFrame({
            "Series": ["CALI_WND", "ERCO_WND", "MISO_WND", "CALI_SUN", "ERCO_SUN"],
            "Current RMSE": [125.3, 98.7, 156.2, 45.1, 67.8],
            "Threshold": [150.0, 120.0, 180.0, 60.0, 80.0],
            "Status": ["✅ Stable", "✅ Stable", "✅ Stable", "✅ Stable", "✅ Stable"],
        })
        st.dataframe(demo_drift, width="stretch")
    else:
        # Show alerts table
        st.dataframe(
            alerts_df[["alert_at", "unique_id", "severity", "current_rmse", "threshold_rmse", "message"]],
            width="stretch",
        )

        # Drift timeline
        if len(alerts_df) > 1:
            alerts_df["alert_at"] = pd.to_datetime(alerts_df["alert_at"])
            fig = px.scatter(
                alerts_df,
                x="alert_at",
                y="current_rmse",
                color="severity",
                size="current_rmse",
                hover_data=["unique_id", "message"],
                title="Drift Timeline",
            )
            fig.add_hline(
                y=alerts_df["threshold_rmse"].mean(),
                line_dash="dash",
                annotation_text="Avg Threshold",
            )
            st.plotly_chart(fig, width="stretch")


def render_coverage_tab(db_path: str):
    """Render coverage analysis comparing nominal vs empirical."""
    st.subheader("Prediction Interval Coverage")

    st.markdown("""
    **Coverage** measures how often actual values fall within prediction intervals.
    - **Nominal**: The expected coverage (80% or 95%)
    - **Empirical**: The actual observed coverage
    - **Gap**: Difference indicates calibration quality
    """)

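    # Demo coverage data (hard-coded; db_path is not used yet)
    st.caption("Demo values: coverage is not yet computed from stored forecasts.")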
    coverage_data = pd.DataFrame({
        "Series": ["CALI_WND", "ERCO_WND", "MISO_WND", "SWPP_WND", "CALI_SUN", "ERCO_SUN"],
        "Nominal 80%": [80, 80, 80, 80, 80, 80],
        "Empirical 80%": [78.5, 82.1, 76.3, 79.8, 81.2, 77.9],
        "Nominal 95%": [95, 95, 95, 95, 95, 95],
        "Empirical 95%": [93.2, 96.1, 91.5, 94.8, 95.7, 92.3],
    })

    coverage_data["Gap 80%"] = coverage_data["Empirical 80%"] - coverage_data["Nominal 80%"]
    coverage_data["Gap 95%"] = coverage_data["Empirical 95%"] - coverage_data["Nominal 95%"]

    # Summary
    col1, col2 = st.columns(2)

    with col1:
        avg_80 = coverage_data["Empirical 80%"].mean()
        st.metric("Avg 80% Coverage", f"{avg_80:.1f}%", f"{avg_80 - 80:.1f}%")

    with col2:
        avg_95 = coverage_data["Empirical 95%"].mean()
        st.metric("Avg 95% Coverage", f"{avg_95:.1f}%", f"{avg_95 - 95:.1f}%")

    st.divider()

    # Coverage comparison chart
    fig = go.Figure()

    fig.add_trace(go.Bar(
        name="80% Empirical",
        x=coverage_data["Series"],
        y=coverage_data["Empirical 80%"],
        marker_color="rgba(68, 138, 255, 0.7)",
    ))

    fig.add_trace(go.Bar(
        name="95% Empirical",
        x=coverage_data["Series"],
        y=coverage_data["Empirical 95%"],
        marker_color="rgba(68, 138, 255, 0.4)",
    ))

    # Nominal lines
    fig.add_hline(y=80, line_dash="dash", line_color="red", annotation_text="80% Nominal")
    fig.add_hline(y=95, line_dash="dash", line_color="orange", annotation_text="95% Nominal")

    fig.update_layout(
        title="Coverage by Series",
        xaxis_title="Series",
        yaxis_title="Coverage (%)",
        barmode="group",
        height=400,
    )

    st.plotly_chart(fig, width="stretch")

    # Detailed table
    with st.expander("View Coverage Data"):
        st.dataframe(coverage_data, width="stretch")


def render_weather_tab(db_path: str, regions: list):
    """Render weather features visualization."""
    st.subheader("Weather Features")

    weather_df = pd.DataFrame()

    # Prefer real pipeline output; no demo fallback.
    parquet_path = Path("data/renewable/weather.parquet")
    if parquet_path.exists():
        try:
            weather_df = pd.read_parquet(parquet_path)
            st.success(f"Loaded {len(weather_df)} weather rows from pipeline")
        except Exception as exc:
            st.warning(f"Could not load weather parquet: {exc}")

    if weather_df.empty and Path(db_path).exists():
        try:
            with connect(db_path) as con:
                weather_df = pd.read_sql_query(
                    "SELECT * FROM weather_features ORDER BY ds ASC",
                    con,
                )
            if not weather_df.empty:
                st.success(f"Loaded {len(weather_df)} weather rows from database")
        except Exception as exc:
            st.warning(f"Could not load weather data from database: {exc}")

    if weather_df.empty:
        st.warning("No weather data available. Run the pipeline to populate weather features.")
        return

    weather_df["ds"] = pd.to_datetime(weather_df["ds"], errors="coerce")
    if regions:
        weather_df = weather_df[weather_df["region"].isin(regions)]
    if weather_df.empty:
        st.warning("No weather data matching selected regions.")
        return

    # Variable selector
    weather_vars = [
        col for col in ["wind_speed_10m", "wind_speed_100m", "direct_radiation", "cloud_cover"]
        if col in weather_df.columns
    ]
    if not weather_vars:
        st.warning("Weather data missing expected variables.")
        return
    selected_var = st.selectbox("Weather Variable", options=weather_vars)

    # Plot
    fig = px.line(
        weather_df,
        x="ds",
        y=selected_var,
        color="region",
        title=f"{selected_var} by Region",
    )
    fig.update_layout(height=400)
    st.plotly_chart(fig, width="stretch")

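    # Summary stats: latest 10 m wind speed for up to four regions
    if "wind_speed_10m" not in weather_df.columns:
        return

    # Fall back to the regions present in the data when none are selected;
    # st.columns() raises if asked for zero columns.
    shown_regions = (regions or sorted(weather_df["region"].dropna().unique()))[:4]
    if not shown_regions:
        return

    st.markdown("### Current Conditions")

    cols = st.columns(len(shown_regions))
    for col, region in zip(cols, shown_regions):
        region_rows = (
            weather_df[weather_df["region"] == region]
            .dropna(subset=["ds", "wind_speed_10m"])
            .sort_values("ds")
        )
        value = f"{region_rows['wind_speed_10m'].iloc[-1]:.1f} m/s" if not region_rows.empty else "n/a"
        with col:
            st.metric(region, value, help="Wind speed at 10m")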


def generate_demo_forecasts(regions: list, fuel_type: str) -> pd.DataFrame:
    """Generate demo forecast data for display."""
    data = []
    base_time = datetime.now(timezone.utc).replace(minute=0, second=0, microsecond=0)

    fuel_types = [fuel_type] if fuel_type != "Both" else ["WND", "SUN"]

    for region in regions[:3]:
        for ft in fuel_types:
            unique_id = f"{region}_{ft}"
            base_value = 500 if ft == "WND" else 300

            for h in range(24):
                ds = base_time + timedelta(hours=h)

                # Add daily pattern
                if ft == "SUN":
                    # Daylight bell curve between 06:00 and 18:00; exactly zero at night
                    hour_factor = max(0, np.sin((ds.hour - 6) * np.pi / 12)) if 6 < ds.hour < 18 else 0
                    noise = np.random.normal(0, 20) if hour_factor > 0 else 0.0
                    yhat = base_value * hour_factor + noise
                else:
                    yhat = base_value + np.sin(ds.hour * np.pi / 12) * 100 + np.random.normal(0, 30)

                yhat = max(0, yhat)

                data.append({
                    "unique_id": unique_id,
                    "region": region,
                    "fuel_type": ft,
                    "ds": ds,
                    "yhat": yhat,
                    "yhat_lo_80": yhat * 0.85,
                    "yhat_hi_80": yhat * 1.15,
                    "yhat_lo_95": yhat * 0.75,
                    "yhat_hi_95": yhat * 1.25,
                })

    return pd.DataFrame(data)


def run_pipeline_from_dashboard(db_path: str, regions: list, fuel_type: str):
    """Placeholder for running the forecasting pipeline from the dashboard."""
    # "Both" is a UI-only option; expand it to the real fuel codes.
    fuel_types = ["WND", "SUN"] if fuel_type == "Both" else [fuel_type]

    # In production, this would call:
    # from src.renewable.tasks import run_full_pipeline, RenewablePipelineConfig
    # config = RenewablePipelineConfig(regions=regions, fuel_types=fuel_types)
    # results = run_full_pipeline(config)

    st.info(
        f"Pipeline trigger is not wired up yet; it would run for regions={regions}, "
        f"fuel_types={fuel_types}."
    )


if __name__ == "__main__":
    main()


Overwriting src/renewable/dashboard.py
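
The drift tab compares each series' current RMSE with a stored threshold. The sketch below shows how such a threshold could be derived from backtest errors, following the notebook's rule of mean + 2 × std. The function names and the RMSE values are hypothetical, and using the sample standard deviation is an assumption, since the rule does not say which estimator applies:

```python
from statistics import mean, stdev


def drift_threshold(backtest_rmse, k=2.0):
    """Threshold = mean + k * std of the per-window backtest RMSEs.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    if len(backtest_rmse) < 2:
        raise ValueError("need at least two backtest windows to estimate std")
    return mean(backtest_rmse) + k * stdev(backtest_rmse)


def is_drifting(current_rmse, threshold):
    """Flag drift when the current error exceeds the threshold."""
    return current_rmse > threshold


# Hypothetical RMSE values from five cross-validation windows
backtest_rmse = [100.0, 110.0, 90.0, 105.0, 95.0]
threshold = drift_threshold(backtest_rmse)

print(f"threshold={threshold:.2f}")
print("drift at 120:", is_drifting(120.0, threshold))
print("drift at 110:", is_drifting(110.0, threshold))
```

With only a few backtest windows the standard deviation estimate is noisy, so the threshold moves noticeably whenever a window is added or dropped.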


---

# Summary

## What We Built

| Module | Purpose | Key Concept |
|--------|---------|-------------|
| `regions.py` | Region definitions | EIA codes + coordinates |
| `eia_renewable.py` | Data fetching | StatsForecast format |
| `open_meteo.py` | Weather integration | Leakage prevention |
| `modeling.py` | Forecasting | Probabilistic intervals |
| `db.py` | Persistence | SQLite with WAL |
| `tasks.py` | Pipeline orchestration | Adaptive CV |
| `validation.py` | Data validation | Freshness lag + missing-hour checks |
| `dashboard.py` | Visualization | Streamlit + Plotly |

## Key Takeaways

1. **StatsForecast format**: `[unique_id, ds, y]` enables multi-series modeling
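2. **No MAPE for renewables**: Solar generation is zero at night, so MAPE divides by zero - use RMSE/MAE instead
3. **Weather leakage**: Build prediction features from forecasted weather, not historical observations that would be unavailable at forecast time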
4. **Drift detection**: threshold = baseline_mean + 2 × baseline_std
5. **Adaptive CV**: Adjust window count for short time series
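
To make takeaway 2 concrete, the sketch below computes MAE and RMSE on a hypothetical solar series that contains night-time zeros and shows that MAPE cannot be evaluated on it. The metric functions are plain-Python illustrations, not the project's evaluation code:

```python
import math


def mae(actuals, forecasts):
    """Mean absolute error."""
    return sum(abs(y - f) for y, f in zip(actuals, forecasts)) / len(actuals)


def rmse(actuals, forecasts):
    """Root mean squared error."""
    return math.sqrt(
        sum((y - f) ** 2 for y, f in zip(actuals, forecasts)) / len(actuals)
    )


def mape(actuals, forecasts):
    """Mean absolute percentage error; fails when any actual value is zero."""
    return 100.0 * sum(abs((y - f) / y) for y, f in zip(actuals, forecasts)) / len(actuals)


# Hypothetical solar series: zero at night, positive during the day
actuals = [0.0, 0.0, 200.0, 300.0]
forecasts = [0.0, 10.0, 180.0, 330.0]

solar_mae = mae(actuals, forecasts)
solar_rmse = rmse(actuals, forecasts)

try:
    solar_mape = mape(actuals, forecasts)
except ZeroDivisionError:
    solar_mape = None  # undefined: division by a zero actual

print(f"MAE={solar_mae:.2f}  RMSE={solar_rmse:.2f}  MAPE={solar_mape}")
```

Dropping the zero hours before computing MAPE would avoid the error, but it would also hide the night-time forecasts from the evaluation.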
