feat: add seasonal/medium-range forecast datasets via dynamical.org remote Zarr #125


Background

The Climate API currently serves historical and near-real-time observed datasets (CHIRPS, ERA5-Land, WorldPop, seNorge). Adding forecast data enables DHIS2 to support early warning and anticipatory action use cases — for example, alerting health programmes to an unusually wet season forecast 3–5 weeks ahead.

dynamical.org hosts free, open, continuously updated forecast datasets in cloud-native Zarr format. Their most relevant datasets for this use case are:

| Dataset | Lead time | Resolution | Update frequency |
| --- | --- | --- | --- |
| NOAA GEFS (ensemble forecast) | 35 days | 0.25° | Daily 00z run |
| ECMWF IFS ENS (ensemble forecast) | 15 days | 0.25° | Twice daily |

Note on terminology: These are medium-range ensemble forecasts (up to 35 days), not true multi-month seasonal outlooks. True seasonal forecasts (3–6 months) are available from Copernicus C3S (SEAS5/SEAS6) and could be added separately.

Implemented approach: pre-computed derived artifacts

Rather than proxying the raw GEFS store directly to clients, we pre-compute a derived local zarr artifact on each sync. The raw GEFS store has five dimensions (init_time, lead_time, ensemble_member, latitude, longitude), which are unsuitable for map clients and inexperienced users. The derivation step produces a simple time × latitude × longitude zarr from the latest complete run (a condensed sketch follows the list):

  1. Select the most recent init_time with a fully distributed forecast (the latest run is often still publishing its longer lead_times when we read it — we walk back up to 5 runs to find the first complete one)
  2. Average across ensemble members to produce the ensemble mean
  3. Map lead_time → valid_time (calendar dates) to produce a standard time dimension
  4. Subset to the configured instance extent
  5. Resample the sub-daily (3- and 6-hourly) steps to a daily mean
  6. Trim trailing NaN time steps (unpublished lead_times that exist as placeholders)
  7. Write as a local zarr artifact
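
In xarray terms, the sketch promised above looks roughly like this (illustrative, not the exact climate_api.processing.gefs code; the extent key names are assumptions):

```python
from pathlib import Path

import xarray as xr

def derive_gefs_sketch(ds: xr.Dataset, extent: dict, output_path: Path) -> None:
    # 1. Latest run, simplified; the real code walks back to the latest
    #    *complete* init_time (see lesson 2 below).
    run = ds.isel(init_time=-1)

    # 2. Ensemble mean.
    run = run.mean(dim="ensemble_member")

    # 3. lead_time -> valid_time calendar dates, exposed as "time".
    run = run.assign_coords(valid_time=run.init_time + run.lead_time)
    run = run.swap_dims({"lead_time": "valid_time"}).rename({"valid_time": "time"})

    # 4. Subset to the instance extent (key names hypothetical; the raw
    #    store's latitude is descending, hence the max->min slice).
    run = run.sel(
        latitude=slice(extent["max_lat"], extent["min_lat"]),
        longitude=slice(extent["min_lon"], extent["max_lon"]),
    )

    # 5. Sub-daily steps -> daily mean.
    daily = run.resample(time="1D").mean()

    # 6. Drop all-NaN time steps (unpublished placeholder lead_times).
    daily = daily.dropna(dim="time", how="all")

    # 7. Ascending coords and uniform chunks (lessons 4 and 5), then write.
    daily = daily.sortby(["latitude", "longitude"])
    daily = daily.chunk({"time": 10, "latitude": -1, "longitude": -1})
    daily.to_zarr(output_path, mode="w")
```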

The derived artifact is treated like any other ingested dataset: it works with ZarrLayer, pygeoapi, the time slider, and all existing serving infrastructure. Re-syncing (e.g. on a daily schedule) keeps the 35-day window current.

Plugin extensibility

The derived sync kind is designed to be generic. Any country plugin can add a forecast dataset from any remote zarr store by declaring ingestion.function in the template YAML — no core code changes required:

```yaml
- id: ecmwf_ifs_temperature_forecast
  variable: temperature_2m
  period_type: daily
  sync:
    kind: derived
  ingestion:
    function: my_plugin.ecmwf.derive_dataset
  store:
    kind: remote_zarr
    store_url: "s3://..."
    store_format: icechunk
    crs: "EPSG:4326"   # optional, defaults to EPSG:4326
```

The function must follow this contract:

```python
from pathlib import Path

def derive_dataset(
    *,
    store_config: dict,    # the template's store block
    output_path: Path,     # where to write the zarr
    extent: dict | None,   # configured instance spatial extent
    variable: str,
    period_type: str,
) -> None: ...
```

The built-in GEFS implementation lives at climate_api.processing.gefs.derive_dataset and can be imported and reused from plugins.
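
For instance, a hypothetical plugin module my_plugin/ecmwf.py could simply delegate to it (assuming the IFS ENS store shares the GEFS dimension layout):

```python
from pathlib import Path

from climate_api.processing.gefs import derive_dataset as gefs_derive

def derive_dataset(
    *,
    store_config: dict,
    output_path: Path,
    extent: dict | None,
    variable: str,
    period_type: str,
) -> None:
    # Delegate to the built-in derivation; the assumption here is that
    # the IFS ENS store uses the same dimension layout as GEFS.
    gefs_derive(
        store_config=store_config,
        output_path=output_path,
        extent=extent,
        variable=variable,
        period_type=period_type,
    )
```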

Implementation — PR #127

Lessons learned

1. Icechunk v2 is a complete API rewrite
The dynamical.org store uses Icechunk 2.0. There is no migration path from the v1 interface: icechunk.StorageConfig and icechunk.Store are entirely different in v2. Target the v2 API from the start.
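
A minimal read-only open, sketched against the v2-style Repository/Session API (bucket and prefix are placeholders; verify the exact signatures against the icechunk docs):

```python
import icechunk
import xarray as xr

# Open the remote repository read-only; bucket/prefix are placeholders.
storage = icechunk.s3_storage(bucket="...", prefix="...", anonymous=True)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session(branch="main")

# Icechunk stores have no consolidated metadata; see also lesson 6 on
# opening a fresh Dataset per call.
ds = xr.open_zarr(session.store, consolidated=False)
```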

2. The latest GEFS run is usually incomplete
NOAA GEFS uses mixed temporal resolution (3-hourly for days 0–10, 6-hourly for days 10–35), giving 181 lead_times total. The latest 00z run typically only has ~100 lead_times distributed by the time we read it (~16 days of data). Selecting the most recent complete init_time (last lead_time non-NaN) reliably gives the full 35-day forecast.
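
A sketch of that completeness check (the helper name is ours; precipitation_surface is the raw store's variable name):

```python
import numpy as np
import xarray as xr

def latest_complete_init_time(ds: xr.Dataset, var: str = "precipitation_surface",
                              max_back: int = 5):
    """Walk back from the newest init_time until the final lead_time has data."""
    for i in range(1, max_back + 1):
        # Check a single ensemble member to keep the remote read small;
        # placeholder lead_times are all-NaN across the grid.
        last_step = ds[var].isel(init_time=-i, lead_time=-1, ensemble_member=0)
        if not np.isnan(last_step.values).all():
            return ds["init_time"].values[-i]
    raise RuntimeError(f"no complete run in the last {max_back} init_times")
```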

3. Derived artifacts need correct CRS in STAC metadata
Local zarr artifacts are normally assumed to be in the instance CRS (e.g. UTM). Derived data from remote stores is usually WGS84. The STAC proj:code must match the actual data CRS for the map viewer to pass correct tile coordinates to ZarrLayer — using the wrong CRS caused ZarrLayer to misinterpret coordinates and render a blank map. The store.crs field in the template controls this; it defaults to EPSG:4326.

4. ZarrLayer requires ascending coordinate order
The raw GEFS store orders latitude 90→−90 (descending). Zarr stores written with descending latitude cause ZarrLayer to compute tile bounding boxes incorrectly. Always .sortby(["latitude", "longitude"]) before writing.

5. Rechunk dask arrays after resample().mean() and before to_zarr()
resample().mean() produces irregular dask chunks that cannot be written directly to zarr. Explicit .chunk({"time": 10, "latitude": -1, "longitude": -1}) is required before to_zarr().

6. Don't cache xr.Dataset instances across requests
Caching a shared xr.Dataset and allowing callers to call .close() on it poisons the cache for all subsequent requests. Cache only the underlying store object (icechunk session); open a fresh xr.Dataset per call — xr.open_zarr on an already-open store is cheap.
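
A sketch of the safe pattern, assuming the icechunk session API from lesson 1 (helper names are illustrative):

```python
from functools import lru_cache

import icechunk
import xarray as xr

@lru_cache(maxsize=8)
def _session_for(bucket: str, prefix: str):
    # Cache only the store-level handle; it holds no per-request state.
    storage = icechunk.s3_storage(bucket=bucket, prefix=prefix, anonymous=True)
    return icechunk.Repository.open(storage).readonly_session(branch="main")

def open_forecast(bucket: str, prefix: str) -> xr.Dataset:
    # A fresh Dataset per request: callers may close() it freely without
    # poisoning anyone else's handle; open_zarr on an open store is cheap.
    return xr.open_zarr(_session_for(bucket, prefix).store, consolidated=False)
```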

7. Derived sync kind should use ingestion.function — not hardcoded logic
The first implementation hardcoded the GEFS-specific transformation in services.py, making it impossible for plugin authors to add other forecast sources. The right pattern is the same one used by all other dataset templates: declare ingestion.function pointing to a Python function, and have the framework dispatch to it. This makes the derivation pipeline fully extensible without core changes.
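
The dispatch itself is the standard dotted-path importlib pattern; roughly (function name is illustrative):

```python
import importlib
from typing import Callable

def resolve_ingestion_function(dotted_path: str) -> Callable:
    """Resolve e.g. 'my_plugin.ecmwf.derive_dataset' to the callable it names."""
    module_path, func_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_path), func_name)
```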

8. Use the existing transforms pipeline for unit conversions — don't bake conversion into the derivation function
The raw GEFS precipitation variable (precipitation_surface) is in kg m⁻² s⁻¹ (instantaneous flux). The display range [0.0, 0.001] in the original template was really [0, 86.4] mm/day in disguise (0.001 kg m⁻² s⁻¹ × 86 400 s/day = 86.4 mm/day), which is confusing to anyone reading the YAML. Rather than adding a scale_factor parameter to derive_dataset, the right fix is to declare a transform in the template and let the existing transform pipeline run after the zarr is written. This keeps derive_dataset unit-agnostic, lets any transform from unit_conversion.py work for derived datasets, and makes the units and display range in the YAML self-evidently correct.
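
The resulting template shape, sketched with assumed transform keys (the exact schema lives in the transforms pipeline and is not shown here):

```yaml
- id: noaa_gefs_precipitation_forecast
  variable: precipitation_surface
  period_type: daily
  sync:
    kind: derived
  ingestion:
    function: climate_api.processing.gefs.derive_dataset
  transforms:
    # kg m-2 s-1 (instantaneous flux) x 86 400 s/day -> mm/day
    - kind: unit_conversion
      from_unit: "kg m-2 s-1"
      to_unit: "mm/day"
```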
