Background
The Climate API currently serves historical and near-real-time observed datasets (CHIRPS, ERA5-Land, WorldPop, seNorge). Adding forecast data enables DHIS2 to support early warning and anticipatory action use cases — for example, alerting health programmes to an unusually wet season forecast 3–5 weeks ahead.
dynamical.org hosts free, open, continuously-updated forecast datasets in cloud-native Zarr format. Their most relevant datasets for this use case:
| Dataset | Lead time | Resolution | Update frequency |
| --- | --- | --- | --- |
| NOAA GEFS (ensemble forecast) | 35 days | 0.25° | Daily 00z run |
| ECMWF IFS ENS (ensemble forecast) | 15 days | 0.25° | Twice daily |
Note on terminology: These are medium-range ensemble forecasts (up to 35 days), not true multi-month seasonal outlooks. True seasonal forecasts (3–6 months) are available from Copernicus C3S (SEAS5/SEAS6) and could be added separately.
Implemented approach: pre-computed derived artifacts
Rather than proxying the raw GEFS store directly to clients, we pre-compute a derived local zarr artifact on each sync. The raw GEFS store has five dimensions (`init_time`, `lead_time`, `ensemble_member`, `latitude`, `longitude`), which are unsuitable for map clients and inexperienced users. The derivation step produces a simple `time × latitude × longitude` zarr from the latest complete run:

- Select the most recent `init_time` with a fully-distributed forecast (the latest run is often still publishing its longer lead_times when we read it — we walk back up to 5 runs to find the first complete one)
- Average across ensemble members to produce the ensemble mean
- Map `lead_time` → `valid_time` (calendar dates) to produce a standard `time` dimension
- Subset to the configured instance extent
- Resample 6-hourly steps to daily means
- Trim trailing NaN time steps (unpublished lead_times that exist as placeholders)
- Write as a local zarr artifact
The derived artifact is treated like any other ingested dataset: it works with ZarrLayer, pygeoapi, the time slider, and all existing serving infrastructure. Re-syncing (e.g. on a daily schedule) keeps the 35-day window current.
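As an illustrative sketch (not the actual climate_api code), the steps after run selection map closely onto plain xarray. The function name `derive` and the variable names here are assumptions for the example:

```python
import numpy as np
import pandas as pd
import xarray as xr

def derive(raw: xr.Dataset, init_time, var: str) -> xr.Dataset:
    """Collapse one complete forecast run to a time × latitude × longitude cube."""
    ds = raw.sel(init_time=init_time)
    # ensemble mean
    ds = ds.mean("ensemble_member")
    # lead_time -> valid_time calendar dates, renamed to a standard time dim
    valid = pd.Timestamp(init_time) + pd.to_timedelta(ds.lead_time.values)
    ds = ds.assign_coords(lead_time=valid).rename({"lead_time": "time"})
    # sub-daily steps -> daily mean
    ds = ds.resample(time="1D").mean()
    # drop all-NaN placeholder time steps
    ds = ds.dropna("time", how="all", subset=[var])
    # ascending coordinate order, as ZarrLayer requires
    return ds.sortby(["latitude", "longitude"])
```

The extent subset and the final `to_zarr` write are omitted here; they depend on instance configuration.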
Plugin extensibility
The `derived` sync kind is designed to be generic. Any country plugin can add a forecast dataset from any remote zarr store by declaring `ingestion.function` in the template YAML — no core code changes required:
```yaml
- id: ecmwf_ifs_temperature_forecast
  variable: temperature_2m
  period_type: daily
  sync:
    kind: derived
  ingestion:
    function: my_plugin.ecmwf.derive_dataset
  store:
    kind: remote_zarr
    store_url: "s3://..."
    store_format: icechunk
    crs: "EPSG:4326"  # optional, defaults to EPSG:4326
```
The function must follow this contract:
```python
from pathlib import Path

def derive_dataset(
    *,
    store_config: dict,    # the template's store block
    output_path: Path,     # where to write the zarr
    extent: dict | None,   # configured instance spatial extent
    variable: str,
    period_type: str,
) -> None: ...
```
The built-in GEFS implementation lives at `climate_api.processing.gefs.derive_dataset` and can be imported and reused from plugins.
Implementation — PR #127
Lessons learned
1. Icechunk v2 is a complete API rewrite
The dynamical.org store uses Icechunk 2.0. There is no migration path from the Zarr v3 interface — `icechunk.StorageConfig` and `icechunk.Store` are entirely different from v1. Target the v2 API from the start.
2. The latest GEFS run is usually incomplete
NOAA GEFS uses mixed temporal resolution (3-hourly for days 0–10, 6-hourly for days 10–35), giving 181 lead_times total. The latest 00z run typically only has ~100 lead_times distributed by the time we read it (~16 days of data). Selecting the most recent complete init_time (last lead_time non-NaN) reliably gives the full 35-day forecast.
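A minimal sketch of that walk-back check, assuming the store is already open as an xarray object with `init_time` and `lead_time` dimensions (the function name `latest_complete_init` is hypothetical):

```python
import numpy as np
import xarray as xr

def latest_complete_init(da: xr.DataArray, max_lookback: int = 5):
    """Walk back from the newest init_time to the first run whose
    final lead_time has published data."""
    for t in da.init_time.values[::-1][:max_lookback]:
        final_step = da.sel(init_time=t).isel(lead_time=-1)
        if not np.isnan(final_step).all():
            return t
    raise ValueError(f"no complete run among the last {max_lookback} init_times")
```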
3. Derived artifacts need correct CRS in STAC metadata
Local zarr artifacts are normally assumed to be in the instance CRS (e.g. UTM). Derived data from remote stores is usually WGS84. The STAC `proj:code` must match the actual data CRS for the map viewer to pass correct tile coordinates to ZarrLayer — using the wrong CRS caused ZarrLayer to misinterpret coordinates and render a blank map. The `store.crs` field in the template controls this; it defaults to `EPSG:4326`.
4. ZarrLayer requires ascending coordinate order
The raw GEFS store orders latitude 90→−90 (descending). Zarr stores written with descending latitude cause ZarrLayer to compute tile bounding boxes incorrectly. Always `.sortby(["latitude", "longitude"])` before writing.
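For example, a toy array written with descending latitude is fixed by a single `sortby`:

```python
import numpy as np
import xarray as xr

# latitude descending (90 -> 89.75), as in the raw GEFS store
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    coords={"latitude": [90.0, 89.75], "longitude": [0.0, 0.25, 0.5]},
    dims=("latitude", "longitude"),
)
fixed = da.sortby(["latitude", "longitude"])  # ascending on both axes
```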
5. Rechunk dask arrays before `to_zarr` after `resample().mean()`
`resample().mean()` produces irregular dask chunks that cannot be written directly to zarr. An explicit `.chunk({"time": 10, "latitude": -1, "longitude": -1})` is required before `to_zarr()`.
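A sketch of the fix on synthetic data (the `to_zarr` call is left commented so the snippet has no filesystem side effects; it assumes dask is installed alongside xarray):

```python
import numpy as np
import pandas as pd
import xarray as xr

# 20 six-hourly steps = 5 full days
times = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(20) * 6, unit="h")
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), np.random.rand(20, 2, 2))},
    coords={"time": times, "latitude": [0.0, 0.25], "longitude": [0.0, 0.25]},
).chunk({"time": 7})  # deliberately awkward dask chunking

daily = ds.resample(time="1D").mean()
# resample().mean() can leave irregular dask chunks; normalize explicitly
daily = daily.chunk({"time": 10, "latitude": -1, "longitude": -1})
# daily.to_zarr(output_path, mode="w")  # now safe to write
```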
6. Don't cache xr.Dataset instances across requests
Caching a shared `xr.Dataset` and allowing callers to call `.close()` on it poisons the cache for all subsequent requests. Cache only the underlying store object (the icechunk session); open a fresh `xr.Dataset` per call — `xr.open_zarr` on an already-open store is cheap.
7. Derived sync kind should use ingestion.function — not hardcoded logic
The first implementation hardcoded the GEFS-specific transformation in `services.py`, making it impossible for plugin authors to add other forecast sources. The right pattern is the same one used by all other dataset templates: declare `ingestion.function` pointing to a Python function, and have the framework dispatch to it. This makes the derivation pipeline fully extensible without core changes.
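The dispatch itself can be as small as resolving the dotted path at sync time. A sketch, assuming the framework resolves `ingestion.function` roughly like this (the resolver name is hypothetical):

```python
import importlib

def resolve_ingestion_function(dotted_path: str):
    """Turn a dotted path like 'my_plugin.ecmwf.derive_dataset'
    into the callable it names."""
    module_name, _, attr = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# e.g. resolve_ingestion_function("math.sqrt")(9.0) -> 3.0
```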
8. Use the existing transforms pipeline for unit conversions — don't bake conversion into the derivation function
The raw GEFS precipitation variable (`precipitation_surface`) is in kg m⁻² s⁻¹ (instantaneous flux). The display range `[0.0, 0.001]` in the original template was 86.4 mm/day in disguise — confusing to anyone reading the YAML. Rather than adding a `scale_factor` parameter to `derive_dataset`, the right fix is to declare a transform in the template and let the existing transform pipeline run after the zarr is written. This keeps `derive_dataset` unit-agnostic, lets any transform from `unit_conversion.py` work for derived datasets, and makes the units and display range in the YAML self-evidently correct.
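A hypothetical template fragment illustrating the pattern; the transform key names below are assumptions, since the exact schema depends on what `unit_conversion.py` exposes. The arithmetic is fixed: 1 kg m⁻² of water is 1 mm of depth, so multiplying the flux by 86 400 s/day gives mm/day, which is why `0.001` corresponded to 86.4 mm/day:

```yaml
- id: gefs_precipitation_forecast
  variable: precipitation_surface
  period_type: daily
  sync:
    kind: derived
  ingestion:
    function: climate_api.processing.gefs.derive_dataset
  transform:
    function: unit_conversion.flux_to_mm_per_day  # hypothetical transform name
  display_range: [0.0, 86.4]  # mm/day, self-evident in the YAML
```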