
feat: derived GEFS forecast artifact (ensemble mean, valid_time) #127

Draft
turban wants to merge 34 commits into docs/zarr-and-geozarr from feat/gefs-derived-forecast-artifact

Conversation


@turban turban commented May 12, 2026

Depends on

#128 (architecture documentation) — this PR extends that doc and must be rebased on it after merge.

Summary

Implements sync.kind: derived and sync.kind: remote for dataset templates, and ships the first two derived datasets: GEFS precipitation forecast and GEFS 2 m temperature forecast via dynamical.org's icechunk store. Closes #125.

The raw GEFS store has five dimensions (init_time, lead_time, ensemble_member, latitude, longitude) which are not suitable for direct serving. This PR derives a simple time × latitude × longitude zarr from the latest complete run on each sync.

Breaking changes

  • CreateIngestionRequest.start is now optional (str | None, default null). Required for temporal and release datasets; omit for derived and remote datasets. Clients that always supply start are unaffected.

Changes

New sync kinds

sync.kind: derived — a derivation function transforms a remote store into a local zarr on each sync. The framework applies declared transforms, writes GeoZarr root attributes, computes coverage, and registers the artifact. The derivation function only needs to write a (time, latitude, longitude) zarr; the framework handles everything else.
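In miniature, the framework/plugin split can be sketched like this (hypothetical names and toy data; the real framework code and contract live in the architecture docs, not here):

```python
def derive_artifact(derive_fn, transforms, register):
    """Framework side: the plugin's derive_fn only produces the cube; the
    framework applies transforms, writes GeoZarr attrs, and registers."""
    cube = derive_fn()                      # plugin: (time, lat, lon) data
    for transform in transforms:            # framework: declared transforms
        cube = transform(cube)
    attrs = {"proj:code": "EPSG:4326"}      # framework: GeoZarr root attrs
    register(cube, attrs)                   # framework: coverage + registry
    return cube

seen = {}
cube = derive_artifact(
    derive_fn=lambda: [1.0, 2.0],                       # stands in for writing a zarr
    transforms=[lambda c: [v * 86400 for v in c]],      # e.g. flux -> mm/day
    register=lambda c, a: seen.update(cube=c, attrs=a),
)
print(cube)  # [86400.0, 172800.0]
```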

sync.kind: remote — registers the remote store URL as the artifact path and proxies requests directly. No local copy. Use only when the remote store is already in a shape clients can consume.

Framework changes

  • GeoZarr root attributes (spatial:bbox, proj:code) are now written by the framework in _derive_artifact, not by the derivation function. This ensures attrs always reflect the data after transforms run. Plugin code must not write them.
  • CreateIngestionRequest.start made optional with a guard that raises HTTP 422 for temporal/release datasets where start is still required.
  • get_store_if_cached() added to remote_zarr.py as the public accessor for the store cache (replaces private _store_cache import).
  • asyncio.get_running_loop() replaces deprecated get_event_loop() in lifespan hook.
  • Range header int() parsing wrapped in try/except — returns 416 instead of 500 for malformed specs.
  • HEAD requests for missing zarr keys return 404 instead of 503.
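The hardened Range parsing can be sketched as a standalone helper (illustrative; the actual parsing lives inside the /zarr proxy endpoint and the function name here is invented). A `None` return maps to HTTP 416:

```python
def parse_range(header: str, total: int):
    """Parse 'bytes=start-end' into (start, end_inclusive), or None for 416."""
    if not header.startswith("bytes="):
        return None
    start_s, _, end_s = header[len("bytes="):].partition("-")
    try:
        if start_s == "":                        # suffix form: bytes=-N
            start, end = max(total - int(end_s), 0), total - 1
        else:
            start = int(start_s)
            end = int(end_s) if end_s else total - 1
    except ValueError:                           # e.g. bytes=abc -> 416, not 500
        return None
    if start > end or start >= total:            # unsatisfiable -> 416
        return None
    return start, min(end, total - 1)

print(parse_range("bytes=0-99", 1000))   # (0, 99)
print(parse_range("bytes=abc", 1000))    # None
```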

GEFS derivation pipeline (climate_api/processing/gefs.py)

  1. Select the most recent init_time with a complete forecast (latest run is often still distributing — walk back up to 5 runs)
  2. Average across ensemble members (ensemble mean)
  3. Map lead_time → valid_time to produce a standard time dimension
  4. Subset to the configured instance extent
  5. Resample 6-hourly steps to daily mean
  6. Trim trailing NaN time steps (unpublished lead times still in transit)
  7. Sort ascending lat/lon, rechunk, write zarr
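Steps 2–6 can be sketched with plain numpy (illustrative only; the real pipeline in climate_api/processing/gefs.py operates on xarray Datasets, and shapes here are invented):

```python
import numpy as np

def derive_run(run, init_ts, lead_hours, steps_per_day=4):
    """run: (lead_time, ensemble_member, lat, lon) slice for one init_time."""
    ens_mean = run.mean(axis=1)                                # step 2: ensemble mean
    valid = init_ts + lead_hours.astype("timedelta64[h]")      # step 3: lead -> valid_time
    days = ens_mean.shape[0] // steps_per_day                  # step 5: 6-hourly -> daily mean
    daily = ens_mean[: days * steps_per_day].reshape(
        days, steps_per_day, *ens_mean.shape[1:]).mean(axis=1)
    complete = ~np.isnan(daily).all(axis=(1, 2))               # step 6: trim trailing NaN days
    last = int(np.nonzero(complete)[0].max()) + 1 if complete.any() else 0
    return daily[:last], valid[::steps_per_day][:days][:last]

run = np.ones((8, 3, 2, 2))
run[4:] = np.nan                                               # unpublished lead times in transit
daily, times = derive_run(run, np.datetime64("2026-05-12T00"), np.arange(0, 48, 6))
print(daily.shape)  # (1, 2, 2)
```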

gefs_precipitation_forecast uses the flux_to_mm_per_day transform (×86400 s/day) so stored values are in mm/day rather than the raw GEFS unit of kg m⁻² s⁻¹.
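The conversion itself is a plain unit identity: 1 kg of water over 1 m² forms a 1 mm layer, so a flux in kg m⁻² s⁻¹ is numerically mm/s, and scaling by seconds per day yields mm/day. A minimal sketch (the real transform lives in climate_api/transforms/unit_conversion.py):

```python
SECONDS_PER_DAY = 86_400

def flux_to_mm_per_day(flux_kg_m2_s: float) -> float:
    """kg m^-2 s^-1 == mm/s of water depth, so scale by seconds per day."""
    return flux_kg_m2_s * SECONDS_PER_DAY

print(round(flux_to_mm_per_day(1e-4), 2))  # 8.64
```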

Map viewer fix (climate_api/templates/map-viewer.html)

Dimension selector HTML construction replaced with DOM API (createElement/textContent) to prevent XSS from dataset dimension keys.

Documentation

  • docs/architecture.md — extended from #128 (docs: architecture overview — concepts, lifecycle, sync kinds, plugin contracts) with derived and remote sync kinds, derivation function contract, GeoZarr/CRS handling for derived artifacts, and design consequences
  • docs/adding_custom_datasets.md — new guide covering template authoring, plugin layout, and the derivation function contract
  • docs/extensibility.md — new reference for the five extension points
  • docs/built_in_datasets.md — GEFS precipitation and temperature datasets added
  • docs/transforms.md — flux_to_mm_per_day documented

Plugin extensibility

Any country plugin can add any forecast source without core changes:

ingestion:
  function: my_plugin.ecmwf.derive_dataset
transforms:
  - climate_api.transforms.unit_conversion.flux_to_mm_per_day
store:
  kind: remote_zarr
  store_url: "s3://..."
  crs: "EPSG:4326"   # optional, defaults to EPSG:4326

Test plan

  • Derived artifact written to disk with correct time dimension (calendar dates, 35 days)
  • Falls back to most recent complete init_time when latest run is still distributing
  • Trailing NaN time steps trimmed from resampled output
  • ZarrLayer renders tiles correctly (WGS84 coordinates, ascending lat/lon)
  • Re-sync produces updated artifact with new init_time
  • Both precipitation and temperature datasets verified on Norway instance
  • ingestion.function dispatch tested with mock derive function
  • most_recent_complete_init tested with incomplete and complete run fixtures
  • GEFS precipitation stored in mm/day (flux converted via flux_to_mm_per_day transform)


Copilot AI left a comment

Pull request overview

This PR adds support for GEFS forecast datasets by introducing remote Zarr/Icechunk store handling plus a new “derived” sync mode that materializes a simplified local time × lat × lon Zarr artifact from the latest GEFS run, making forecasts easier to serve through existing APIs and map clients.

Changes:

  • Add REMOTE_ZARR artifacts (registration, STAC metadata, /zarr access with Range support) and optional forecast dependencies (icechunk, s3fs).
  • Add SyncKind.REMOTE and SyncKind.DERIVED, including sync planning/execution behaviors for always-rematerialized datasets.
  • Add GEFS derived forecast dataset templates and enhance the map viewer to expose non-spatial “extra dimensions” as sliders.

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 7 comments.

Summary per file:

  • uv.lock — Locks new optional forecast dependencies (icechunk/s3fs and transitive deps).
  • tests/test_remote_zarr.py — Adds unit tests for remote Zarr template validation, artifact registration, /zarr info/file behaviors, and REMOTE sync planning.
  • pyproject.toml — Introduces a forecast optional-dependency extra and a mypy ignore for icechunk.
  • climate_api/templates/map-viewer.html — Adds UI controls for non-spatial extra dimensions (e.g., lead_time/ensemble_member) via sliders.
  • climate_api/stac/services.py — Extends STAC building to support REMOTE_ZARR and derive cube metadata by opening remote stores.
  • climate_api/publications/services.py — Excludes REMOTE_ZARR from pygeoapi publication routing (served via /zarr).
  • climate_api/providers/remote_zarr.py — New provider module for opening/warming remote Zarr/Icechunk stores and caching store access.
  • climate_api/main.py — Adds FastAPI lifespan warmup to pre-open published remote stores on startup.
  • climate_api/ingestions/sync_engine.py — Adds REMOTE/DERIVED sync planning and execution paths that always rematerialize.
  • climate_api/ingestions/services.py — Implements remote store artifact registration, remote /zarr proxying, and GEFS derivation/materialization logic.
  • climate_api/ingestions/schemas.py — Adds ArtifactFormat.REMOTE_ZARR and the new sync kinds remote/derived.
  • climate_api/ingestions/routes.py — Updates the /zarr/{dataset_id}/{path} route to async and passes Request through for proxy support.
  • climate_api/data/datasets/gefs.yaml — Adds dataset templates for derived GEFS precipitation + 2 m temperature forecasts.
  • climate_api/data_registry/services/datasets.py — Extends template validation to allow remote/derived sync kinds and validate store blocks.
  • climate_api/data_manager/services/utils.py — Extends get_time_dim() to recognize init_time/valid_time.
  • climate_api/data_accessor/services/accessor.py — Opens remote datasets via the new remote_zarr provider when templates declare store.kind: remote_zarr.

@turban turban marked this pull request as draft May 12, 2026 17:48
@turban turban requested a review from Copilot May 12, 2026 23:54

Copilot AI left a comment


Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

climate_api/ingestions/services.py:765

  • _remote_zarr_store_info() hardcodes crs = "EPSG:4326" for all REMOTE_ZARR artifacts. If a remote template declares store.crs (or if future remote stores are not WGS84), the map client will be given incorrect CRS/proj4 metadata. Consider deriving crs from the template store block (with EPSG:4326 as the default) to keep ZarrLayer/STAC metadata consistent with the actual coordinate system.
def _remote_zarr_store_info(dataset_id: str, artifact: ArtifactRecord) -> dict[str, object]:
    """Return store metadata for a remote zarr artifact."""
    source_dataset = registry_datasets.get_dataset(artifact.dataset_id) or {}
    raw_store = source_dataset.get("store", {})
    store: dict[str, object] = raw_store if isinstance(raw_store, dict) else {}
    crs = "EPSG:4326"
    coverage = artifact.coverage

@turban turban changed the base branch from main to docs/platform-concepts May 13, 2026 12:08
@turban turban force-pushed the feat/gefs-derived-forecast-artifact branch from a399eea to 81399d0 Compare May 14, 2026 07:45
@turban turban changed the base branch from docs/platform-concepts to docs/zarr-and-geozarr May 14, 2026 07:45
@turban turban force-pushed the docs/zarr-and-geozarr branch from 8c667b2 to ba5cdf7 Compare May 14, 2026 07:53
turban added 11 commits May 14, 2026 09:53
Introduces a new store.kind: remote_zarr template field that lets the API
register externally-hosted Zarr/Icechunk stores without downloading or
materialising local copies.

Changes:
- gefs.yaml: NOAA GEFS 35-day ensemble forecast template pointing at the
  dynamical.org Icechunk store (s3://dynamical-noaa-gefs/...)
- providers/remote_zarr.py: open_remote_dataset() dispatches on store_format;
  icechunk stores use the icechunk library (optional dep), zarr URLs use
  xr.open_zarr directly
- datasets.py: allow sync.kind: remote and store.kind: remote_zarr in place
  of ingestion.function; validate store.store_url is present
- schemas.py: add ArtifactFormat.REMOTE_ZARR and SyncKind.REMOTE
- services.py: _register_remote_zarr_artifact() opens the remote store to
  read coverage metadata, stores ArtifactRecord with the remote URL as path;
  _upsert_remote_zarr_record() replaces the existing record on re-register;
  _artifact_storage_exists() skips the Path.exists() check for remote formats;
  /zarr/{id} returns remote_url + provider info; /zarr/{id}/{path} returns 501
- sync_engine.py: SyncKind.REMOTE always plans REMATERIALIZE; run_sync()
  handles it by calling create_artifact without start/end
- accessor.py: get_data() routes remote source datasets to open_remote_dataset()
- utils.py: get_time_dim() now recognises init_time (GEFS time dimension)
- pyproject.toml: add [forecast] optional dep group (icechunk, s3fs)
- 16 new tests, all passing; 230 total, 0 regressions

Closes #125 (proof-of-concept)
icechunk v2 replaced StorageConfig/IcechunkStore.open_existing with
s3_storage() + Repository.open() + readonly_session(branch=).
Remote zarr stores (icechunk/S3) cannot be served by pygeoapi's xarray
coverage provider. Skip them in the publication loop and clear
pygeoapi_path, matching the existing pyramid zarr exclusion.
Remote datasets (GEFS, etc.) are always in WGS84. Passing the
instance CRS to _coverage_from_dataset incorrectly re-projects
lon/lat values as if they were in the local projection.
- Include REMOTE_ZARR artifacts in the STAC catalog
- Proxy zarr v3 keys from icechunk stores through /zarr/{dataset_id}/{path}
  so browser-side ZarrLayer can read the store (no public zarr.json on S3)
- Build STAC cube:dimensions from the remote dataset, including ordinal dims
  for ensemble_member (0-30) and lead_time (hours)
- Add ensemble_member and lead_time sliders to the map viewer panel;
  all dimension sliders feed a unified selector on ZarrLayer
- Fix GEFS units (kg m-2 s-1) and add display range for the legend
…sult

get_time_dim previously matched 2D auxiliary coords (e.g. GEFS valid_time)
before actual dimension coords (init_time), causing _cube_dimensions_from_dataset
to classify init_time as ordinal with raw nanosecond timestamps.

Fix get_time_dim to prefer actual dimension coordinates first (consistent with
get_x_y_dims), with hasattr fallback for non-xarray objects. Also update
_cube_dimensions_from_dataset to classify any datetime64 dimension as temporal,
so the map viewer time slider works even if time_dim detection is imprecise.
Without range requests, ZarrLayer had to download the full 420 MB icechunk
shard to read any slice. With Range support, it can fetch just the shard
index (~65 KB) then the specific inner shard bytes for the requested tile.

Changes:
- Make zarr file proxy endpoint async (await store.get/getsize directly)
- HEAD requests use store.getsize() — no S3 download, ~3s instead of 24s
- GET with Range header uses RangeByteRequest or SuffixByteRequest
- All responses include Accept-Ranges: bytes so zarr clients use ranges
- open_icechunk_store runs in asyncio.to_thread to avoid blocking event loop
Each zarr tile request previously re-opened the icechunk repository (a full S3
metadata read). With the store cached per store_url, subsequent requests skip
store initialization entirely. Chunk sizes are also cached per (store_url, key)
so suffix byte range requests avoid an extra getsize() S3 call on repeat access.
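The per-URL caching described here can be sketched as follows (names echo get_store_if_cached from the summary above, but this is a toy; the opener stands in for the expensive icechunk repository open):

```python
_store_cache: dict[str, object] = {}

def get_store_if_cached(store_url: str):
    """Public accessor: return the cached store or None, never opening one."""
    return _store_cache.get(store_url)

def open_store(store_url: str, opener):
    """Open once per store_url; subsequent requests skip initialization."""
    store = _store_cache.get(store_url)
    if store is None:                    # first request pays the S3 open
        store = opener(store_url)
        _store_cache[store_url] = store
    return store

calls = []
def opener(url):
    calls.append(url)                    # counts expensive opens
    return "store:" + url

open_store("s3://bucket/a", opener)
open_store("s3://bucket/a", opener)      # cache hit, opener not re-run
print(len(calls))  # 1
```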
Cold start previously required opening the icechunk repository and reading
zarr metadata (coordinate arrays) on the first STAC collection request,
causing noticeable latency when selecting a forecast dataset in the UI.

Changes:
- Cache opened xr.Dataset per store_url in open_remote_dataset() so
  coordinate metadata is only read from S3 once per process
- Add warmup_remote_store() that pre-populates the store and dataset caches
- Add FastAPI lifespan that runs warmup in a background thread at startup,
  so GEFS is ready before the first user request
- In the zarr file proxy, skip asyncio.to_thread when the store is already
  in _store_cache (dict lookup needs no thread)
_coverage_from_dataset returns ds[time_dim].max() as the temporal end,
which for GEFS is the last init_time (today). Forecast data extends
beyond that by the full lead_time horizon — GEFS runs to 840 hours
(35 days) ahead, so the correct end is last_init_time + max_lead_time.

Check for a lead_time dimension in _register_remote_zarr_artifact and
add the maximum lead timedelta to the temporal end after coverage is
computed. Keeps the fix local to the remote zarr registration path so
existing local dataset coverage logic is unaffected.
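The coverage-end fix amounts to one line of datetime arithmetic. A sketch under invented names (the real code runs inside _register_remote_zarr_artifact on xarray coordinates):

```python
import numpy as np

def temporal_end(init_times, lead_hours):
    """Coverage ends at the last init_time plus the maximum lead, not at
    the last init_time itself."""
    return init_times.max() + np.timedelta64(int(lead_hours.max()), "h")

inits = np.array(["2026-05-10T00", "2026-05-11T00"], dtype="datetime64[h]")
leads = np.arange(0, 841, 6)             # GEFS runs to 840 h (35 days) ahead
print(temporal_end(inits, leads))        # 2026-06-15T00
```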
turban added 21 commits May 14, 2026 09:53
- Fix I001 (import ordering) in ingestions/services.py and main.py
- Fix F401 (unused TestClient import) in tests/test_remote_zarr.py
- Fix E501 (line too long) in publications/services.py by splitting conditional
- Add explicit byte_request union type to fix mypy assignment errors in services.py
- Fix RangeByteRequest end=None by resolving total size before constructing request
- Rename step → time_step in stac/services.py to avoid no-redef error across loop branches
- Add AsyncGenerator return type to _lifespan in main.py to satisfy asynccontextmanager typing
- Add type: ignore[arg-type] to numpy_datetime_to_period_string call consistent with existing usage
Adds a new SyncKind.DERIVED and _derive_gefs_artifact() that reads the
latest GEFS remote run, averages ensemble members, maps lead_time to
valid_time calendar dates, subsets to the configured instance extent,
resamples 6-hourly steps to daily mean, and writes a standard flat zarr.

The result is a normal ZARR artifact (time x lat x lon) compatible with
ZarrLayer, pygeoapi, and all existing serving infrastructure — no special
casing required. Tile reads are local disk instead of S3 range requests.

New dataset templates: gefs_precipitation_forecast and
gefs_temperature_2m_forecast with sync.kind: derived.
… zarr datasets

The derived templates now carry their own store block directly instead
of relying on a separate source_dataset_id lookup. This removes the raw
gefs_precipitation and gefs_temperature_2m remote zarr templates — the
only exposed datasets are the derived ensemble-mean local zarr artifacts.

Registry validation updated: derived kind validates the store block
rather than checking for sync.source_dataset_id.
…facts

- Check _is_derived_source before is_remote_source in create_artifact so
  derived templates with a store block are not mis-routed to remote zarr
  registration
- Compute coverage using _coverage_from_dataset with native_crs=EPSG:4326
  so the instance CRS (UTM) is not applied to data already in WGS84,
  which was producing a near-zero spatial_wgs84 and a wrong STAC bbox
GEFS raw store has latitude in descending order (90→-90). After the
sel/resample pipeline the order is preserved, but ZarrLayer requires
ascending coordinates to compute correct chunk indices. sortby before
rechunking ensures standard ascending lat/lon in the output zarr.
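The ascending-coordinate fix in miniature (plain numpy stand-in for xarray's sortby; data values are invented): sorting the latitude coordinate must reorder the data along the same axis.

```python
import numpy as np

lat = np.array([90.0, 45.0, 0.0, -45.0, -90.0])  # descending, as in raw GEFS
field = np.arange(5.0)                           # one value per latitude row
order = np.argsort(lat)                          # indices for ascending order
lat_sorted, field_sorted = lat[order], field[order]
print(lat_sorted.tolist())    # [-90.0, -45.0, 0.0, 45.0, 90.0]
print(field_sorted.tolist())  # [4.0, 3.0, 2.0, 1.0, 0.0]
```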
…acts

The GEFS icechunk store fills in lead_times gradually after each 00z run;
unpublished steps exist as NaN placeholders. Drop trailing all-NaN time
steps after daily resampling so the artifact only covers validated data.

STAC proj:code for local zarr artifacts was always set to the instance CRS
(e.g. EPSG:32633 UTM), which caused ZarrLayer to misinterpret WGS84
coordinates as projected. Use coverage.spatial_wgs84 as the discriminator:
when it is None the data is WGS84-native and proj:code should be EPSG:4326.
The latest GEFS run is still being distributed when we derive the artifact,
so its longer lead_times are NaN placeholders.  Walk backwards through the
last 5 init_times and select the most recent one whose final lead_time
has actual data, giving a complete 35-day forecast instead of a partial run.
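A sketch of the walk-back (the helper name echoes most_recent_complete_init from the test plan, but the data layout and completeness check here are illustrative):

```python
import numpy as np

def most_recent_complete_init(runs, max_back=5):
    """runs: list of (init_index, final_lead_slice), oldest first.
    Return the newest init whose final lead_time holds real data."""
    for init_idx, final_lead in reversed(runs[-max_back:]):
        if not np.isnan(final_lead).all():   # final lead published -> complete
            return init_idx
    return None                              # nothing complete in the window

runs = [
    (0, np.ones((2, 2))),
    (1, np.ones((2, 2))),
    (2, np.full((2, 2), np.nan)),            # newest run still distributing
]
print(most_recent_complete_init(runs))  # 1
```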
- remote_zarr: remove _dataset_cache; return a fresh xr.Dataset per call
  so callers can close freely without poisoning a shared instance — the
  underlying icechunk store remains cached
- stac/services: remove duplicate ds.close() in except block (finally
  already closes it)
- main: use asyncio.get_running_loop() instead of deprecated get_event_loop()
  in async lifespan context
- datasets: align validation error messages with actual template field names
  (store block / store.store_url instead of source block / source.store_url)
- ingestions: wrap Range header int() parsing in try/except and return 416
  for malformed specs; return 404 from HEAD when key does not exist
…tion

The derivation pipeline was hardcoded in services.py and tied to GEFS
dimensions, making it impossible for countries to add other forecast sources
(e.g. ECMWF IFS) without modifying core code.

- Move GEFS derivation logic to climate_api/processing/gefs.py as a public
  derive_dataset() function with a documented keyword-only contract
- Replace _derive_gefs_artifact/_most_recent_complete_init in services.py
  with a generic _derive_artifact() that resolves and calls ingestion.function
  (same mechanism used by all other dataset templates)
- Update gefs.yaml to explicitly declare
  ingestion.function: climate_api.processing.gefs.derive_dataset
- Update template validation: derived kind now requires both store.store_url
  and ingestion.function, making the contract explicit for plugin authors
- Add tests for most_recent_complete_init, derived routing, and
  ingestion.function dispatch
Hardcoding EPSG:4326 broke coverage and STAC proj:code for remote stores
in projected coordinate systems.  store.crs now overrides the default so
plugin authors can declare the correct CRS in the template YAML:

  store:
    crs: "EPSG:32633"

Falls back to EPSG:4326 when absent (correct for dynamical.org, ARCO-ERA5,
and most other public WGS84 stores).  spatial_wgs84 is now set when the
store CRS is non-WGS84, consistent with the rest of the artifact pipeline.
…orms

Adds flux_to_mm_per_day to unit_conversion.py and wires it into the
derived artifact pipeline. _derive_artifact now applies dataset transforms
after derive_fn writes the zarr, reusing the same transform infrastructure
as the regular download pipeline. Updates gefs_precipitation_forecast to
declare the transform, units: mm/day, and a display range of [0, 25].
…_mm_per_day transform

Adds GEFS precipitation and temperature entries to built_in_datasets.md,
documents sync.kind: derived in extensibility.md and adding_custom_datasets.md
(including step-by-step derivation function contract and YAML reference),
and adds flux_to_mm_per_day to transforms.md. Also clarifies that transforms
run after derive_fn for derived datasets, not only after download.
…rr path

Writing to_zarr(..., mode='w') on the same path we are reading from truncates
the store before dask reads complete, producing all-zero values. Load the
computed result into memory, close the source, then write.
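The hazard is not zarr-specific: any lazy read from a path that a subsequent open-for-write truncates will see wiped data. A plain-file sketch of the materialize-then-write pattern:

```python
import os
import tempfile

def rewrite_in_place(path, transform):
    with open(path) as f:
        data = transform(f.read())  # materialize fully BEFORE truncating
    with open(path, "w") as f:      # only now is the source truncated
        f.write(data)

d = tempfile.mkdtemp()
p = os.path.join(d, "store.txt")
with open(p, "w") as f:
    f.write("abc")
rewrite_in_place(p, str.upper)
result = open(p).read()
print(result)  # ABC
```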
… for derived

GeoZarr root attrs (spatial:bbox, proj:code) are a framework concern — every
derived artifact needs them for map tile positioning regardless of which
derive_fn produced it. Moving create_geozarr_attrs out of gefs.py into
_derive_artifact means plugin authors don't need to know about it.

The framework now computes bbox from the written zarr's actual spatial
coordinates (via get_x_y_dims), writes the attrs directly to the zarr root
using zarr.open_group, and re-consolidates metadata — all after transforms
are applied, in the same pass used for coverage computation.

Also makes CreateIngestionRequest.start optional (str | None) so derived and
remote dataset syncs can be triggered without a start date. A guard in
create_artifact raises HTTP 422 if start is None for non-derived datasets.
- HEAD requests to remote zarr proxy now return 503 on unexpected store
  errors instead of masking them as 200 with Content-Length: 0
- _store_cache private import replaced by get_store_if_cached() public
  helper in remote_zarr.py
- _register_remote_zarr_artifact reads store.crs (default EPSG:4326)
  instead of hardcoding WGS84, consistent with _derive_artifact
- map-viewer: renderExtraDimSliders replaced innerHTML template literal
  with DOM construction; safeDimId() sanitises dim keys used in element
  ids; event listener references elements by closure rather than id lookup
@turban turban force-pushed the feat/gefs-derived-forecast-artifact branch from 81399d0 to 9eb5e4e Compare May 14, 2026 07:53

Development

Successfully merging this pull request may close these issues.

feat: add seasonal/medium-range forecast datasets via dynamical.org remote Zarr
