feat: add seasonal/medium-range forecast datasets via dynamical.org remote Zarr #125


Background

The Climate API currently serves historical and near-real-time observed datasets (CHIRPS, ERA5-Land, WorldPop, seNorge). Adding forecast data enables DHIS2 to support early warning and anticipatory action use cases — for example, alerting health programmes to an unusually wet season forecast 3–5 weeks ahead.

dynamical.org hosts free, open, continuously updated forecast datasets in cloud-native Zarr format. Their most relevant datasets for this use case are:

| Dataset | Lead time | Resolution | Update frequency |
| --- | --- | --- | --- |
| NOAA GEFS (ensemble forecast) | 35 days | 0.25° | Daily 00z run |
| ECMWF IFS ENS (ensemble forecast) | 15 days | 0.25° | Twice daily |

Note on terminology: These are medium-range ensemble forecasts (up to 35 days), not true multi-month seasonal outlooks. True seasonal forecasts (3–6 months) are available from Copernicus C3S (SEAS5/SEAS6) and could be added separately.

Implemented approach: pre-computed derived artifacts

Rather than proxying the raw GEFS store directly to clients, we pre-compute a derived local zarr artifact on each sync. The raw GEFS store has five dimensions (init_time, lead_time, ensemble_member, latitude, longitude), which are unsuitable for map clients and inexperienced users. The derivation step produces a simple time × latitude × longitude zarr from the latest complete run (a condensed sketch follows the list):

  1. Select the most recent init_time with a fully distributed forecast (the latest run is often still publishing its longer lead_times when we read it — we walk back up to 5 runs to find the first complete one)
  2. Average across ensemble members to produce the ensemble mean
  3. Map lead_time → valid_time (calendar dates) to produce a standard time dimension
  4. Subset to the configured instance extent
  5. Resample the sub-daily (3- and 6-hourly) steps to a daily mean
  6. Trim trailing NaN time steps (unpublished lead_times that exist as placeholders)
  7. Write as a local zarr artifact
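
In xarray terms, the sketch promised above looks roughly like this (illustrative, not the exact climate_api.processing.gefs code; the extent key names are assumptions):

```python
from pathlib import Path

import xarray as xr

def derive_gefs_sketch(ds: xr.Dataset, extent: dict, output_path: Path) -> None:
    # 1. Latest run, simplified; the real code walks back to the latest
    #    *complete* init_time (see lesson 2 below).
    run = ds.isel(init_time=-1)

    # 2. Ensemble mean.
    run = run.mean(dim="ensemble_member")

    # 3. lead_time -> valid_time calendar dates, exposed as "time".
    run = run.assign_coords(valid_time=run.init_time + run.lead_time)
    run = run.swap_dims({"lead_time": "valid_time"}).rename({"valid_time": "time"})

    # 4. Subset to the instance extent (key names hypothetical; the raw
    #    store's latitude is descending, hence the max->min slice).
    run = run.sel(
        latitude=slice(extent["max_lat"], extent["min_lat"]),
        longitude=slice(extent["min_lon"], extent["max_lon"]),
    )

    # 5. Sub-daily steps -> daily mean.
    daily = run.resample(time="1D").mean()

    # 6. Drop all-NaN time steps (unpublished placeholder lead_times).
    daily = daily.dropna(dim="time", how="all")

    # 7. Ascending coords and uniform chunks (lessons 4 and 5), then write.
    daily = daily.sortby(["latitude", "longitude"])
    daily = daily.chunk({"time": 10, "latitude": -1, "longitude": -1})
    daily.to_zarr(output_path, mode="w")
```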

The derived artifact is treated like any other ingested dataset: it works with ZarrLayer, pygeoapi, the time slider, and all existing serving infrastructure. Re-syncing (e.g. on a daily schedule) keeps the 35-day window current.

Plugin extensibility

The derived sync kind is designed to be generic. Any country plugin can add a forecast dataset from any remote zarr store by declaring ingestion.function in the template YAML — no core code changes required:

```yaml
- id: ecmwf_ifs_temperature_forecast
  variable: temperature_2m
  period_type: daily
  sync:
    kind: derived
  ingestion:
    function: my_plugin.ecmwf.derive_dataset
  store:
    kind: remote_zarr
    store_url: "s3://..."
    store_format: icechunk
    crs: "EPSG:4326"   # optional, defaults to EPSG:4326
```

The function must follow this contract:

```python
from pathlib import Path

def derive_dataset(
    *,
    store_config: dict,    # the template's store block
    output_path: Path,     # where to write the zarr
    extent: dict | None,   # configured instance spatial extent
    variable: str,
    period_type: str,
) -> None: ...
```

The built-in GEFS implementation lives at climate_api.processing.gefs.derive_dataset and can be imported and reused from plugins.
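
For instance, a hypothetical plugin module my_plugin/ecmwf.py could simply delegate to it (assuming the IFS ENS store shares the GEFS dimension layout):

```python
from pathlib import Path

from climate_api.processing.gefs import derive_dataset as gefs_derive

def derive_dataset(
    *,
    store_config: dict,
    output_path: Path,
    extent: dict | None,
    variable: str,
    period_type: str,
) -> None:
    # Delegate to the built-in derivation; the assumption here is that
    # the IFS ENS store uses the same dimension layout as GEFS.
    gefs_derive(
        store_config=store_config,
        output_path=output_path,
        extent=extent,
        variable=variable,
        period_type=period_type,
    )
```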

Implementation — PR #127

Lessons learned

1. Icechunk v2 is a complete API rewrite
The dynamical.org store uses Icechunk 2.0. There is no migration path from the v1 interface: icechunk.StorageConfig and icechunk.Store are entirely different in v2. Target the v2 API from the start.
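
A minimal read-only open, sketched against the v2-style Repository/Session API (bucket and prefix are placeholders; verify the exact signatures against the icechunk docs):

```python
import icechunk
import xarray as xr

# Open the remote repository read-only; bucket/prefix are placeholders.
storage = icechunk.s3_storage(bucket="...", prefix="...", anonymous=True)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session(branch="main")

# Icechunk stores have no consolidated metadata; see also lesson 6 on
# opening a fresh Dataset per call.
ds = xr.open_zarr(session.store, consolidated=False)
```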

2. The latest GEFS run is usually incomplete
NOAA GEFS uses mixed temporal resolution (3-hourly for days 0–10, 6-hourly for days 10–35), giving 181 lead_times total. The latest 00z run typically only has ~100 lead_times distributed by the time we read it (~16 days of data). Selecting the most recent complete init_time (last lead_time non-NaN) reliably gives the full 35-day forecast.
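
A sketch of that completeness check (the helper name is ours; precipitation_surface is the raw store's variable name):

```python
import numpy as np
import xarray as xr

def latest_complete_init_time(ds: xr.Dataset, var: str = "precipitation_surface",
                              max_back: int = 5):
    """Walk back from the newest init_time until the final lead_time has data."""
    for i in range(1, max_back + 1):
        # Check a single ensemble member to keep the remote read small;
        # placeholder lead_times are all-NaN across the grid.
        last_step = ds[var].isel(init_time=-i, lead_time=-1, ensemble_member=0)
        if not np.isnan(last_step.values).all():
            return ds["init_time"].values[-i]
    raise RuntimeError(f"no complete run in the last {max_back} init_times")
```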

3. Derived artifacts need correct CRS in STAC metadata
Local zarr artifacts are normally assumed to be in the instance CRS (e.g. UTM). Derived data from remote stores is usually WGS84. The STAC proj:code must match the actual data CRS for the map viewer to pass correct tile coordinates to ZarrLayer — using the wrong CRS caused ZarrLayer to misinterpret coordinates and render a blank map. The store.crs field in the template controls this; it defaults to EPSG:4326.

4. ZarrLayer requires ascending coordinate order
The raw GEFS store orders latitude 90→−90 (descending). Zarr stores written with descending latitude cause ZarrLayer to compute tile bounding boxes incorrectly. Always .sortby(["latitude", "longitude"]) before writing.

5. Rechunk dask arrays after resample().mean() and before to_zarr()
resample().mean() produces irregular dask chunks that cannot be written directly to zarr. Explicit .chunk({"time": 10, "latitude": -1, "longitude": -1}) is required before to_zarr().

6. Don't cache xr.Dataset instances across requests
Caching a shared xr.Dataset and allowing callers to call .close() on it poisons the cache for all subsequent requests. Cache only the underlying store object (icechunk session); open a fresh xr.Dataset per call — xr.open_zarr on an already-open store is cheap.
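
A sketch of the safe pattern, assuming the icechunk session API from lesson 1 (helper names are illustrative):

```python
from functools import lru_cache

import icechunk
import xarray as xr

@lru_cache(maxsize=8)
def _session_for(bucket: str, prefix: str):
    # Cache only the store-level handle; it holds no per-request state.
    storage = icechunk.s3_storage(bucket=bucket, prefix=prefix, anonymous=True)
    return icechunk.Repository.open(storage).readonly_session(branch="main")

def open_forecast(bucket: str, prefix: str) -> xr.Dataset:
    # A fresh Dataset per request: callers may close() it freely without
    # poisoning anyone else's handle; open_zarr on an open store is cheap.
    return xr.open_zarr(_session_for(bucket, prefix).store, consolidated=False)
```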

7. Derived sync kind should use ingestion.function — not hardcoded logic
The first implementation hardcoded the GEFS-specific transformation in services.py, making it impossible for plugin authors to add other forecast sources. The right pattern is the same one used by all other dataset templates: declare ingestion.function pointing to a Python function, and have the framework dispatch to it. This makes the derivation pipeline fully extensible without core changes.
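
The dispatch itself is the standard dotted-path importlib pattern; roughly (function name is illustrative):

```python
import importlib
from typing import Callable

def resolve_ingestion_function(dotted_path: str) -> Callable:
    """Resolve e.g. 'my_plugin.ecmwf.derive_dataset' to the callable it names."""
    module_path, func_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_path), func_name)
```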

8. Use the existing transforms pipeline for unit conversions — don't bake conversion into the derivation function
The raw GEFS precipitation variable (precipitation_surface) is in kg m⁻² s⁻¹ (instantaneous flux). The display range [0.0, 0.001] in the original template was really [0, 86.4] mm/day in disguise (0.001 kg m⁻² s⁻¹ × 86 400 s/day = 86.4 mm/day), which is confusing to anyone reading the YAML. Rather than adding a scale_factor parameter to derive_dataset, the right fix is to declare a transform in the template and let the existing transform pipeline run after the zarr is written. This keeps derive_dataset unit-agnostic, lets any transform from unit_conversion.py work for derived datasets, and makes the units and display range in the YAML self-evidently correct.
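
The resulting template shape, sketched with assumed transform keys (the exact schema lives in the transforms pipeline and is not shown here):

```yaml
- id: noaa_gefs_precipitation_forecast
  variable: precipitation_surface
  period_type: daily
  sync:
    kind: derived
  ingestion:
    function: climate_api.processing.gefs.derive_dataset
  transforms:
    # kg m-2 s-1 (instantaneous flux) x 86 400 s/day -> mm/day
    - kind: unit_conversion
      from_unit: "kg m-2 s-1"
      to_unit: "mm/day"
```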
