# 02 – ERA5 Weather Download (Raw Data)

This notebook downloads ERA5 single-level weather data from the Copernicus Climate Data Store (CDS)  
for multiple bidding zones and stores the raw NetCDF files under `data/raw/weather`.

The data will later be merged with price data for further analysis and modeling.


## 1. Objectives

In this notebook we:

- Define a project-relative output directory for ERA5 raw weather data
- Configure geographic areas (bounding boxes) for the bidding zones ES, DK1, and NO2 (NO4 is defined but currently commented out)
- Define the time range (years, months) and ERA5 variables we want to download
- Initialize the CDS API client using credentials from `config/secrets.env`
- Implement a robust download function with:
  - Progress logging
  - Simple validation that the NetCDF file is readable
  - Per-file timing
- Run an execution loop over all zones, years and months with:
  - Global progress counters
  - Total runtime measurement
- Add checks to:
  - List downloaded files and summarize them by zone and year
  - Detect missing `(zone, year, month)` combinations


## 2. Setup paths, environment, and imports

We assume the following project structure:

- `<project_root>/notebooks/02_weather_era5.ipynb`   (this notebook)
- `<project_root>/config/secrets.env`               (CDS credentials)
- `<project_root>/data/raw/weather`                 (output folder for ERA5 NetCDF files)

The notebook will:
- Resolve `project_root` relative to the current working directory (the `notebooks` folder)
- Ensure that `data/raw/weather` exists
- Load the `secrets.env` file which must define `CDS_URL` and `CDS_KEY`
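
The path logic relies on the notebook being started from the `notebooks` folder. As a hedged alternative, the sketch below walks upward from a starting directory until it finds a folder containing `config`. The helper name `find_project_root` and the choice of `config` as marker are assumptions for illustration; they are not part of this notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "config") -> Path:
    """Return the first directory at or above `start` that contains `marker`.

    Hypothetical helper: the marker directory name is an assumption.
    """
    start = start.resolve()
    # Check the start directory itself, then each parent in turn
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")
```

Calling `find_project_root(Path.cwd())` would then give the same result whether the kernel starts in `notebooks` or in the project root itself.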


In [None]:
import os
import time
from pathlib import Path
from collections import defaultdict
import zipfile

import cdsapi
import netCDF4
import requests
from dotenv import load_dotenv

# 1. Setup paths (relative to the notebooks directory)
current_dir = Path.cwd()
project_dir = current_dir.parent  # up from /notebooks to project root
secrets_path = project_dir / "config" / "secrets.env"

# 2. Define output directory for raw weather data
output_dir = project_dir / "data" / "raw" / "weather"
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Current working directory: {current_dir}")
print(f"Project directory        : {project_dir}")
print(f"Secrets file             : {secrets_path}")
print(f"Weather output directory : {output_dir.resolve()}")

# 3. Load secrets (CDS credentials)
if not secrets_path.exists():
    print("Warning: secrets.env file not found at:", secrets_path)
else:
    load_dotenv(secrets_path)
    print("Loaded environment variables from secrets.env")


## 3. ERA5 configuration

Here we define:

- `AREAS`: geographic bounding boxes for each bidding zone  
  Format: `north, west, south, east` (ERA5 / CDS expected format)
- `YEARS`: years to download
- `MONTHS`: months to download (two-digit strings `"01"`, `"02"`, ..., `"12"`)
- `VARIABLES`: ERA5 variables required for later analysis

The bounding boxes have been derived using `https://boundingbox.klokantech.com`  
and then converted into the `[North, West, South, East]` format expected by ERA5.
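
The reordering described above can be sketched as a small function. The name `bbox_to_era5_area` is hypothetical and only illustrates the conversion from `(minLon, minLat, maxLon, maxLat)` to `[North, West, South, East]`; the sample values are the DK1 coordinates used in this notebook.

```python
def bbox_to_era5_area(min_lon, min_lat, max_lon, max_lat):
    """Reorder (minLon, minLat, maxLon, maxLat) into [North, West, South, East]."""
    if min_lat > max_lat:
        raise ValueError("min_lat must not exceed max_lat")
    # North = maxLat, West = minLon, South = minLat, East = maxLon
    return [max_lat, min_lon, min_lat, max_lon]


# DK1 bounding box as used in AREAS
dk1_area = bbox_to_era5_area(7.8714, 54.7545, 11.0739, 57.846)
print(dk1_area)  # [57.846, 7.8714, 54.7545, 11.0739]
```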


In [None]:
# Bidding zone areas in ERA5 format: [North, West, South, East]
# Derived from boundingbox.klokantech.com (minLon, minLat, maxLon, maxLat)
# and converted to [maxLat, minLon, minLat, maxLon].

AREAS = {
    "DK1": {
        "north": 57.846,
        "west": 7.8714,
        "south": 54.7545,
        "east": 11.0739,
    },
    "ES": {
        "north": 43.733,
        "west": -9.5816,
        "south": 36.0242,
        "east": 3.4922,
    },
    "NO2": {
        "north": 59.361,
        "west": 5.3171,
        "south": 57.9584,
        "east": 9.8517,
    },
    #"NO4": {
    #    "north": 71.09,
    #    "west": 12.55,
    #    "south": 66.3,
    #    "east": 30.85,
    #},
}

# Time configuration
YEARS = [str(year) for year in range(2023, 2026)]  
MONTHS = [f"{month:02d}" for month in range(1, 13)]

# ERA5 variables
VARIABLES = [
    "2m_temperature",
    "total_precipitation",
    "10m_u_component_of_wind",
    "10m_v_component_of_wind",
    "surface_solar_radiation_downwards",
]

print("Configured zones: ", list(AREAS.keys()))
print("Years           : ", YEARS)
print("Months          : ", MONTHS)
print("Variables       : ", VARIABLES)


## 4. Initialize CDS API client and connectivity check

We initialize the CDS client using:

- `CDS_URL` and `CDS_KEY` from `config/secrets.env`

Then we perform a lightweight HTTP GET request to the CDS API endpoint to verify basic network connectivity.  
This does **not** download data; it only checks whether the endpoint is reachable.


In [None]:
# Initialize CDS client
try:
    cds_url = os.getenv("CDS_URL")
    cds_key = os.getenv("CDS_KEY")

    if not cds_url or not cds_key:
        raise ValueError("Missing CDS_URL or CDS_KEY in environment variables.")

    client = cdsapi.Client(url=cds_url, key=cds_key)
    print("CDS API Client initialized successfully.")

except Exception as e:
    client = None
    print(f"Initialization failed: {e}")


def check_cds_connection():
    """
    Simple connectivity check for the CDS API endpoint.
    This does not use authentication; it only checks whether the endpoint is reachable.
    """
    # Prefer the configured endpoint; fall back to the default URL
    url = os.getenv("CDS_URL") or "https://cds.climate.copernicus.eu/api/v2"
    print(f"Checking basic connectivity to CDS API: {url}")
    start = time.time()
    try:
        response = requests.get(url, timeout=10)
        elapsed = time.time() - start
        print(f"  HTTP status: {response.status_code}")
        print(f"  Elapsed    : {elapsed:.2f} seconds")
        if 200 <= response.status_code < 500:
            print("  Network connectivity looks OK (request reached CDS).")
        else:
            print("  Server error (5xx). Check the CDS status page if downloads fail.")
    except Exception as e:
        elapsed = time.time() - start
        print(f"  Connection failed after {elapsed:.2f} seconds:")
        print(f"  {e}")


# Optional: run connectivity check
if client is not None:
    check_cds_connection()
else:
    print("Client is not initialized, skipping connectivity check.")


## 5. ERA5 download helper function

The function `download_era5_month`:

- Builds the file stem `era5_<ZONE>_<YEAR>_<MONTH>`; the final outputs are `<stem>_instant.nc` and `<stem>_accum.nc`
- Skips the download if both output files already exist and open as valid NetCDF (idempotent behavior)
- Sends the request to the ERA5 dataset `reanalysis-era5-single-levels`
- Unpacks the response if CDS delivers a ZIP archive instead of a plain NetCDF file
- Measures per-file runtime
- Validates the resulting NetCDF files (can they be opened and their variables listed?)
- Returns a status string: `"downloaded"`, `"skipped"`, or `"failed"`

This function is used in the main execution loop below.
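
Because a file with a `.nc` name may actually be a ZIP archive, it can help to inspect the first bytes before opening it. The sketch below is a hypothetical helper (`sniff_container` is not defined in this notebook) that separates ZIP archives, HDF5-based NetCDF-4 files, and classic NetCDF files by their leading signature. It assumes the signature sits at the very start of the file.

```python
from pathlib import Path

ZIP_MAGIC = b"PK\x03\x04"           # local file header of a non-empty ZIP
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"   # NetCDF-4 files are HDF5 containers
CLASSIC_MAGIC = b"CDF"              # NetCDF classic and 64-bit offset formats


def sniff_container(path: Path) -> str:
    """Classify a file by its leading bytes (hypothetical helper)."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(HDF5_MAGIC):
        return "netcdf4"
    if head.startswith(CLASSIC_MAGIC):
        return "netcdf3"
    return "unknown"
```

A check like this only looks at the header; a truncated download can still pass it, so opening the file with `netCDF4.Dataset` remains the real validation step.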


In [None]:

DATASET = "reanalysis-era5-single-levels"


def _maybe_unzip_era5_file(file_path: Path) -> Path:
    """
    If file_path is actually a ZIP archive (starts with 'PK'),
    extract the contained NetCDF file(s) and return the primary .nc path.

    Behaviour:
    - If header is not ZIP -> return file_path unchanged
    - If ZIP:
        - Rename file_path to file_path + '.zip'
        - Extract contents to output_dir
        - Rename NetCDFs to:
            <stem>_instant.nc or <stem>_accum.nc depending on the name
        - If an '_instant.nc' exists, return that for validation,
          otherwise return the first .nc.
    """
    if not file_path.exists():
        return file_path

    # Check ZIP signature
    with open(file_path, "rb") as f:
        sig = f.read(2)

    if sig != b"PK":
        # Not a ZIP archive, nothing to do
        return file_path

    print("  Detected ZIP archive instead of plain NetCDF. Unzipping...")

    stem = file_path.stem  # e.g. 'era5_DK1_2023_01'
    zip_path = file_path.with_suffix(file_path.suffix + ".zip")
    if zip_path.exists():
        zip_path.unlink()
    file_path.rename(zip_path)

    with zipfile.ZipFile(zip_path, "r") as zf:
        members = zf.namelist()
        nc_members = [m for m in members if m.endswith(".nc")]

        print(f"  ZIP contents: {len(members)} files, {len(nc_members)} NetCDF file(s).")
        zf.extractall(path=output_dir)

    if not nc_members:
        print("  Warning: no .nc files found inside ZIP. Leaving ZIP on disk.")
        return zip_path

    renamed_paths = []
    instant_path = None

    for nc_name in nc_members:
        src = output_dir / nc_name

        lower = nc_name.lower()
        if "instant" in lower:
            dst = output_dir / f"{stem}_instant.nc"
        elif "accum" in lower:
            dst = output_dir / f"{stem}_accum.nc"
        else:
            dst = output_dir / f"{stem}_{nc_name.replace('/', '_')}"

        if dst.exists():
            dst.unlink()
        src.rename(dst)
        renamed_paths.append(dst)

        if "instant" in lower:
            instant_path = dst

    # Clean up ZIP
    zip_path.unlink(missing_ok=True)

    print("  Extracted NetCDF files:")
    for p in renamed_paths:
        print(f"    - {p.name}")

    if instant_path is not None:
        print(f"  Using instant file for validation: {instant_path.name}")
        return instant_path

    # No explicit instant file: just use the first
    first_nc = renamed_paths[0]
    print(f"  Using first NetCDF for validation: {first_nc.name}")
    return first_nc


def download_era5_month(year: str, month: str, zone: str, coords: dict) -> str:
    """
    Download one ERA5 month for a given zone, if needed.

    Rules:
    - If both <stem>_instant.nc and <stem>_accum.nc exist and are valid -> skip
    - Else, download to <stem>.nc, unzip if necessary, create _instant/_accum, validate.

    Parameters
    ----------
    year : str   e.g. "2023"
    month: str   e.g. "01"
    zone : str   e.g. "DK1"
    coords : dict with keys 'north', 'west', 'south', 'east'

    Returns
    -------
    status : "downloaded", "skipped", or "failed"
    """
    if client is None:
        print("CDS client is not initialized. Cannot download.")
        return "failed"

    stem = f"era5_{zone}_{year}_{month}"
    instant_path = output_dir / f"{stem}_instant.nc"
    accum_path = output_dir / f"{stem}_accum.nc"
    base_nc_path = output_dir / f"{stem}.nc"

    # ------------------------------------------------------
    # 1) Check if we already have valid instant + accum data
    # ------------------------------------------------------
    if instant_path.exists() and accum_path.exists():
        print(f"\nExisting files found for {zone} {year}-{month}:")
        print(f"  {instant_path.name}")
        print(f"  {accum_path.name}")

        try:
            # Quick validation: can both files be opened as NetCDF?
            with netCDF4.Dataset(instant_path, "r") as nc_i:
                _ = list(nc_i.variables.keys())
            with netCDF4.Dataset(accum_path, "r") as nc_a:
                _ = list(nc_a.variables.keys())

            print("  Both instant and accum files are valid NetCDF. Skipping download.")
            return "skipped"

        except Exception as e:
            print("  Existing files are not valid NetCDF, will re-download.")
            print(f"  Reason: {e}")
            # Remove broken files and continue to fresh download
            if instant_path.exists():
                instant_path.unlink()
            if accum_path.exists():
                accum_path.unlink()
            if base_nc_path.exists():
                base_nc_path.unlink()

    # ------------------------------------------------------
    # 2) No valid instant/accum files -> download (or reuse .nc if present)
    # ------------------------------------------------------
    # If a leftover base_nc_path exists (from interrupted run), try to unzip/use it
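    if base_nc_path.exists():
        print(f"\nFound leftover base file: {base_nc_path.name}")
        try:
            _maybe_unzip_era5_file(base_nc_path)
            # Reuse the leftover only if it yields two readable NetCDF files
            for path in (instant_path, accum_path):
                with netCDF4.Dataset(path, "r"):
                    pass
            print("  Recovered instant/accum files from leftover. Skipping download.")
            return "skipped"
        except Exception as e:
            print(f"  Leftover file could not be reused ({e}). Re-downloading.")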

    print(f"\nRequesting file: {stem}.nc")
    print(f"  Year: {year}, Month: {month}, Zone: {zone}")
    print(f"  Area (N,W,S,E): [{coords['north']}, {coords['west']}, {coords['south']}, {coords['east']}]")
    start_time = time.time()

    # Build CDS request payload
    request = {
        "product_type": "reanalysis",
        "format": "netcdf",  # CDS may still return a ZIP with NetCDF inside
        "variable": VARIABLES,
        "year": year,
        "month": month,
        "day": [f"{day:02d}" for day in range(1, 32)],     # 01-31
        "time": [f"{hour:02d}:00" for hour in range(24)],  # 00:00-23:00
        "area": [coords["north"], coords["west"], coords["south"], coords["east"]],
    }

    try:
        # Submit request
        print("  Submitting request to CDS API...")
        client.retrieve(DATASET, request, target=str(base_nc_path))
        print("  Download finished, checking file size...")

        if not base_nc_path.exists():
            print("  Error: base .nc file does not exist after download.")
            return "failed"

        size_bytes = base_nc_path.stat().st_size
        print(f"  File size: {size_bytes} bytes (~{size_bytes / (1024*1024):.2f} MB)")

        # Unzip if needed and create _instant/_accum files
        _ = _maybe_unzip_era5_file(base_nc_path)

        # After unzipping, we expect instant/accum files
        if not (instant_path.exists() and accum_path.exists()):
            print("  Error: instant/accum files not found after unzipping.")
            return "failed"

        # Validate NetCDF files
        try:
            with netCDF4.Dataset(instant_path, "r") as nc_i:
                vars_instant = list(nc_i.variables.keys())
            with netCDF4.Dataset(accum_path, "r") as nc_a:
                vars_accum = list(nc_a.variables.keys())

            elapsed = time.time() - start_time
            print(f"  Success: {instant_path.name}, {accum_path.name}")
            print(f"  Instant vars: {vars_instant}")
            print(f"  Accum vars  : {vars_accum}")
            print(f"  Elapsed time for this month: {elapsed:.1f} seconds")
            return "downloaded"

        except Exception as e:
            print("  Validation error for instant/accum files:")
            print(f"  {e}")
            return "failed"

    except Exception as e:
        elapsed = time.time() - start_time
        print(f"  Failed to download {stem}.nc after {elapsed:.1f} seconds:")
        print(f"  {e}")
        return "failed"


## 6. Execution loop with progress and total timing

This loop runs over all zones, years, and months:

- Uses `download_era5_month` for each `(zone, year, month)` combination
- Tracks how many files were:
  - downloaded
  - skipped (already existed)
  - failed
- Prints ongoing progress after every file
- Measures and prints total runtime at the end
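
The bookkeeping described above can also be expressed with `itertools.product` and `collections.Counter`. The following is a minimal sketch, not the loop used in this notebook: `run_queue` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `download_era5_month` so the example runs without CDS access.

```python
from collections import Counter
from itertools import product


def run_queue(zones, years, months, fetch):
    """Call fetch(year, month, zone) for every combination and tally the statuses."""
    tally = Counter()
    jobs = list(product(zones, years, months))
    for done, (zone, year, month) in enumerate(jobs, start=1):
        tally[fetch(year, month, zone)] += 1
        print(f"Progress: {done}/{len(jobs)} {dict(tally)}")
    return tally


# Stand-in for the real download function (assumption for illustration)
def fake_fetch(year, month, zone):
    return "skipped" if month == "01" else "downloaded"


result = run_queue(["DK1", "ES"], ["2023"], ["01", "02", "03"], fake_fetch)
```

A `Counter` also copes with an unexpected status string, which the explicit `if`/`elif` chain would silently drop from the progress total.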


In [None]:
if client is None:
    print("CDS client is not initialized. Skipping download loop.")
else:
    print("Starting ERA5 download queue.")
    print("Note: This process may take time depending on the CDS queue.")

    total_files = len(AREAS) * len(YEARS) * len(MONTHS)
    downloaded = 0
    skipped = 0
    failed = 0

    overall_start = time.time()

    for zone, coords in AREAS.items():
        print("\n======================================")
        print(f"Processing Zone: {zone}")
        print("--------------------------------------")

        for year in YEARS:
            for month in MONTHS:
                status = download_era5_month(year, month, zone, coords)

                if status == "downloaded":
                    downloaded += 1
                elif status == "skipped":
                    skipped += 1
                elif status == "failed":
                    failed += 1

                processed = downloaded + skipped + failed
                print(
                    f"  Progress: {processed}/{total_files} "
                    f"(downloaded={downloaded}, skipped={skipped}, failed={failed})"
                )

    overall_elapsed = time.time() - overall_start
    overall_minutes = overall_elapsed / 60.0

    print("\n======================================")
    print("Download loop finished.")
    print("Summary:")
    print(f"  Downloaded: {downloaded}")
    print(f"  Skipped   : {skipped}")
    print(f"  Failed    : {failed}")
    print(f"  Total     : {total_files}")
    print(f"Total elapsed time: {overall_minutes:.1f} minutes ({overall_elapsed:.0f} seconds)")


## 7. Check downloaded ERA5 files

This section provides a quick overview of all downloaded ERA5 NetCDF files in the  
`data/raw/weather` directory.

It shows:

- The total number of files
- The first few file names for a quick manual check
- A summary by zone and year (number of months found)

This allows us to verify that:

- The download loop produced the expected coverage
- Re-runs correctly skip already existing files
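
File names can be parsed by splitting on underscores, as done in this section, or with a regular expression that also captures the `instant`/`accum` suffix. The pattern below is a sketch under the assumption that zone codes are alphanumeric; `ERA5_NAME` and `parse_era5_name` are hypothetical names.

```python
import re

# era5_<ZONE>_<YEAR>_<MONTH>[_instant|_accum].nc
ERA5_NAME = re.compile(
    r"^era5_(?P<zone>[A-Za-z0-9]+)_(?P<year>\d{4})_(?P<month>\d{2})"
    r"(?:_(?P<kind>instant|accum))?\.nc$"
)


def parse_era5_name(filename):
    """Return zone, year, month, and kind as a dict, or None if the name does not match."""
    match = ERA5_NAME.match(filename)
    return match.groupdict() if match else None


print(parse_era5_name("era5_DK1_2023_01_instant.nc"))
```

Unlike a plain split, the pattern rejects names with a non-numeric year or month instead of passing them through to the summary.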


In [None]:
print(f"Weather directory: {output_dir.resolve()}")

# List all NetCDF files that follow the naming pattern:
#   era5_<ZONE>_<YEAR>_<MONTH>_*.nc
all_files = sorted(output_dir.glob("era5_*.nc"))
print(f"Total ERA5 files found: {len(all_files)}")

if not all_files:
    print("No files found. Check the download loop and paths.")
else:
    print("\nFirst 20 files (sorted):")
    for path in all_files[:20]:
        print("  ", path.name)

    # Parse file names and collect stats
    # We want to count DISTINCT MONTHS per (zone, year),
    # regardless of whether "instant" and "accum" both exist.
    per_zone_year_months = defaultdict(set)
    zones_seen = set()
    years_seen = set()
    parse_errors = []

    for path in all_files:
        # Expected pattern(s):
        #   era5_ZONE_YEAR_MONTH_instant.nc
        #   era5_ZONE_YEAR_MONTH_accum.nc
        name = path.stem  # without .nc
        parts = name.split("_")
        if len(parts) < 4 or parts[0] != "era5":
            parse_errors.append(name)
            continue

        # Take only the first four parts: era5, ZONE, YEAR, MONTH
        _, zone, year_str, month_str = parts[0:4]

        zones_seen.add(zone)
        years_seen.add(year_str)
        per_zone_year_months[(zone, year_str)].add(month_str)

    if parse_errors:
        print("\nWarning: Some files do not match the expected name pattern:")
        for n in parse_errors:
            print("  ", n)

    print("\nZones present on disk:", sorted(zones_seen))
    print("Years present on disk:", sorted(years_seen))

    print("\nSummary: number of DISTINCT months per (zone, year)")
    for (zone, year_str), month_set in sorted(per_zone_year_months.items()):
        print(f"  {zone} {year_str}: {len(month_set):2d} months")


## 8. Check for missing zone–year–month combinations

Here we compare:

- All expected combinations of `AREAS × YEARS × MONTHS`
- With all existing files on disk

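A `(zone, year, month)` combination is reported as "missing" when it is expected but no file for it exists on disk.  
A month counts as present as soon as either its `_instant` or its `_accum` file is found, so a month with only one of the two files is not flagged here.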

This is useful to:

- Detect incomplete downloads
- Decide whether we need to rerun the notebook only for certain months or zones
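
The check in this section treats a month as present when any file for it exists. A stricter variant, sketched below, requires both the `_instant` and the `_accum` file and reports which one is absent. `find_incomplete` is a hypothetical helper that works on plain file names, so it can be tried without any downloads.

```python
from itertools import product

KINDS = ("instant", "accum")


def find_incomplete(filenames, zones, years, months):
    """Map each (zone, year, month) to the list of file kinds that are absent."""
    present = set(filenames)
    report = {}
    for zone, year, month in product(zones, years, months):
        absent = [
            kind for kind in KINDS
            if f"era5_{zone}_{year}_{month}_{kind}.nc" not in present
        ]
        if absent:
            report[(zone, year, month)] = absent
    return report


files = [
    "era5_DK1_2023_01_instant.nc",
    "era5_DK1_2023_01_accum.nc",
    "era5_DK1_2023_02_instant.nc",
]
print(find_incomplete(files, ["DK1"], ["2023"], ["01", "02", "03"]))
```

In the notebook, the input could be built as `[p.name for p in all_files]`.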


In [None]:
# Build the expected set of (zone, year, month) combinations
expected = set()
for zone in AREAS.keys():
    for year in YEARS:
        for month in MONTHS:
            expected.add((zone, str(year), str(month)))

# Build the set of existing combinations from the file names
existing = set()
parse_errors = []

for path in all_files:
    name = path.stem  # e.g. "era5_DK1_2023_01_instant"
    parts = name.split("_")
    if len(parts) < 4 or parts[0] != "era5":
        parse_errors.append(name)
        continue

    # Ignore trailing parts like "instant"/"accum"
    _, zone, year_str, month_str = parts[0:4]
    existing.add((zone, year_str, month_str))

missing = expected - existing

print(f"Expected combinations: {len(expected)}")
print(f"Existing combinations: {len(existing)}")
print(f"Missing combinations : {len(missing)}")

if parse_errors:
    print("\nWarning: some files could not be parsed into (zone, year, month):")
    for n in parse_errors:
        print("  ", n)

if missing:
    print("\nMissing (zone, year, month) combinations:")
    for zone, year_str, month_str in sorted(missing):
        print(f"  Zone={zone}, Year={year_str}, Month={month_str}")
else:
    print("\nNo missing combinations. All expected files are present.")


## 9. Notes and next steps

- The downloaded ERA5 NetCDF files are now available in `data/raw/weather`
- Each month of hourly data for one bidding zone is stored as two files: `<stem>_instant.nc` and `<stem>_accum.nc`
- The next notebook will:
  - Load these NetCDF files
  - Aggregate and transform the variables as needed
  - Merge them with price data for the same zones and timestamps
