feat: derived GEFS forecast artifact (ensemble mean, valid_time) #127
Draft
turban wants to merge 34 commits into
Conversation
Contributor
Pull request overview
This PR adds support for GEFS forecast datasets by introducing remote Zarr/Icechunk store handling plus a new “derived” sync mode that materializes a simplified local time × lat × lon Zarr artifact from the latest GEFS run, making forecasts easier to serve through existing APIs and map clients.
Changes:
- Add `REMOTE_ZARR` artifacts (registration, STAC metadata, `/zarr` access with Range support) and optional `forecast` dependencies (`icechunk`, `s3fs`).
- Add `SyncKind.REMOTE` and `SyncKind.DERIVED`, including sync planning/execution behaviors for always-rematerialized datasets.
- Add GEFS derived forecast dataset templates and enhance the map viewer to expose non-spatial “extra dimensions” as sliders.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| uv.lock | Locks new optional forecast dependencies (icechunk/s3fs and transitive deps). |
| tests/test_remote_zarr.py | Adds unit tests for remote Zarr template validation, artifact registration, /zarr info/file behaviors, and REMOTE sync planning. |
| pyproject.toml | Introduces forecast optional-dependency extra and mypy ignore for icechunk. |
| climate_api/templates/map-viewer.html | Adds UI controls for non-spatial extra dimensions (e.g., lead_time/ensemble_member) via sliders. |
| climate_api/stac/services.py | Extends STAC building to support REMOTE_ZARR and derive cube metadata by opening remote stores. |
| climate_api/publications/services.py | Excludes REMOTE_ZARR from pygeoapi publication routing (served via /zarr). |
| climate_api/providers/remote_zarr.py | New provider module for opening/warming remote Zarr/Icechunk stores and caching store access. |
| climate_api/main.py | Adds FastAPI lifespan warmup to pre-open published remote stores on startup. |
| climate_api/ingestions/sync_engine.py | Adds REMOTE/DERIVED sync planning and execution paths that always rematerialize. |
| climate_api/ingestions/services.py | Implements remote store artifact registration, remote /zarr proxying, and GEFS derivation/materialization logic. |
| climate_api/ingestions/schemas.py | Adds ArtifactFormat.REMOTE_ZARR and new sync kinds remote/derived. |
| climate_api/ingestions/routes.py | Updates /zarr/{dataset_id}/{path} route to async and passes Request through for proxy support. |
| climate_api/data/datasets/gefs.yaml | Adds dataset templates for derived GEFS precipitation + 2m temperature forecasts. |
| climate_api/data_registry/services/datasets.py | Extends template validation to allow remote/derived sync kinds and validate store blocks. |
| climate_api/data_manager/services/utils.py | Extends get_time_dim() to recognize init_time/valid_time. |
| climate_api/data_accessor/services/accessor.py | Opens remote datasets via the new remote_zarr provider when templates declare store.kind: remote_zarr. |
Contributor
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
climate_api/ingestions/services.py:765
- _remote_zarr_store_info() hardcodes crs = "EPSG:4326" for all REMOTE_ZARR artifacts. If a remote template declares store.crs (or if future remote stores are not WGS84), the map client will be given incorrect CRS/proj4 metadata. Consider deriving crs from the template store block (with EPSG:4326 as the default) to keep ZarrLayer/STAC metadata consistent with the actual coordinate system.
```python
def _remote_zarr_store_info(dataset_id: str, artifact: ArtifactRecord) -> dict[str, object]:
    """Return store metadata for a remote zarr artifact."""
    source_dataset = registry_datasets.get_dataset(artifact.dataset_id) or {}
    raw_store = source_dataset.get("store", {})
    store: dict[str, object] = raw_store if isinstance(raw_store, dict) else {}
    crs = "EPSG:4326"
    coverage = artifact.coverage
```
Introduces a new store.kind: remote_zarr template field that lets the API
register externally-hosted Zarr/Icechunk stores without downloading or
materialising local copies.
Changes:
- gefs.yaml: NOAA GEFS 35-day ensemble forecast template pointing at the
dynamical.org Icechunk store (s3://dynamical-noaa-gefs/...)
- providers/remote_zarr.py: open_remote_dataset() dispatches on store_format;
icechunk stores use the icechunk library (optional dep), zarr URLs use
xr.open_zarr directly
- datasets.py: allow sync.kind: remote and store.kind: remote_zarr in place
of ingestion.function; validate store.store_url is present
- schemas.py: add ArtifactFormat.REMOTE_ZARR and SyncKind.REMOTE
- services.py: _register_remote_zarr_artifact() opens the remote store to
read coverage metadata, stores ArtifactRecord with the remote URL as path;
_upsert_remote_zarr_record() replaces the existing record on re-register;
_artifact_storage_exists() skips the Path.exists() check for remote formats;
/zarr/{id} returns remote_url + provider info; /zarr/{id}/{path} returns 501
- sync_engine.py: SyncKind.REMOTE always plans REMATERIALIZE; run_sync()
handles it by calling create_artifact without start/end
- accessor.py: get_data() routes remote source datasets to open_remote_dataset()
- utils.py: get_time_dim() now recognises init_time (GEFS time dimension)
- pyproject.toml: add [forecast] optional dep group (icechunk, s3fs)
- 16 new tests, all passing; 230 total, 0 regressions
Closes #125 (proof-of-concept)
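The datasets.py validation rule described above (allow `sync.kind: remote` in place of `ingestion.function`, require `store.store_url`) might look roughly like this; the function name and error strings are illustrative:

```python
def validate_remote_store_template(template: dict) -> list[str]:
    """Illustrative check for templates using store.kind: remote_zarr."""
    errors: list[str] = []
    sync = template.get("sync") or {}
    store = template.get("store") or {}
    if sync.get("kind") == "remote":
        # remote sync kind replaces ingestion.function with a store block
        if store.get("kind") != "remote_zarr":
            errors.append("sync.kind: remote requires store.kind: remote_zarr")
        if not store.get("store_url"):
            errors.append("store.store_url is required")
    return errors
```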
icechunk v2 replaced StorageConfig/IcechunkStore.open_existing with s3_storage() + Repository.open() + readonly_session(branch=).
Remote zarr stores (icechunk/S3) cannot be served by pygeoapi's xarray coverage provider. Skip them in the publication loop and clear pygeoapi_path, matching the existing pyramid zarr exclusion.
Remote datasets (GEFS, etc.) are always in WGS84. Passing the instance CRS to _coverage_from_dataset incorrectly re-projects lon/lat values as if they were in the local projection.
- Include REMOTE_ZARR artifacts in the STAC catalog
- Proxy zarr v3 keys from icechunk stores through /zarr/{dataset_id}/{path}
so browser-side ZarrLayer can read the store (no public zarr.json on S3)
- Build STAC cube:dimensions from the remote dataset, including ordinal dims
for ensemble_member (0-30) and lead_time (hours)
- Add ensemble_member and lead_time sliders to the map viewer panel;
all dimension sliders feed a unified selector on ZarrLayer
- Fix GEFS units (kg m-2 s-1) and add display range for the legend
…sult
get_time_dim previously matched 2D auxiliary coords (e.g. GEFS valid_time) before actual dimension coords (init_time), causing _cube_dimensions_from_dataset to classify init_time as ordinal with raw nanosecond timestamps. Fix get_time_dim to prefer actual dimension coordinates first (consistent with get_x_y_dims), with a hasattr fallback for non-xarray objects. Also update _cube_dimensions_from_dataset to classify any datetime64 dimension as temporal, so the map viewer time slider works even if time_dim detection is imprecise.
Without range requests, ZarrLayer had to download the full 420 MB icechunk shard to read any slice. With Range support, it can fetch just the shard index (~65 KB) then the specific inner shard bytes for the requested tile. Changes:
- Make the zarr file proxy endpoint async (await store.get/getsize directly)
- HEAD requests use store.getsize() — no S3 download, ~3 s instead of 24 s
- GET with a Range header uses RangeByteRequest or SuffixByteRequest
- All responses include Accept-Ranges: bytes so zarr clients use ranges
- open_icechunk_store runs in asyncio.to_thread to avoid blocking the event loop
Each zarr tile request previously re-opened the icechunk repository (a full S3 metadata read). With the store cached per store_url, subsequent requests skip store initialization entirely. Chunk sizes are also cached per (store_url, key) so suffix byte range requests avoid an extra getsize() S3 call on repeat access.
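A minimal sketch of the two caches described above, with illustrative names (`get_store`, `get_chunk_size`) and an injected opener so the icechunk dependency stays out of the example:

```python
import threading
from typing import Callable

_store_cache: dict[str, object] = {}
_chunk_size_cache: dict[tuple[str, str], int] = {}
_lock = threading.Lock()

def get_store(store_url: str, opener: Callable[[str], object]) -> object:
    """Open each store once per process; later lookups skip the S3 metadata read."""
    with _lock:
        store = _store_cache.get(store_url)
        if store is None:
            store = opener(store_url)   # expensive open happens only once
            _store_cache[store_url] = store
        return store

def get_chunk_size(store_url: str, key: str, getsize: Callable[[str], int]) -> int:
    """Cache sizes per (store_url, key) so repeat suffix-range requests skip getsize()."""
    cache_key = (store_url, key)
    size = _chunk_size_cache.get(cache_key)
    if size is None:
        size = getsize(key)
        _chunk_size_cache[cache_key] = size
    return size
```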
Cold start previously required opening the icechunk repository and reading zarr metadata (coordinate arrays) on the first STAC collection request, causing noticeable latency when selecting a forecast dataset in the UI. Changes:
- Cache the opened xr.Dataset per store_url in open_remote_dataset() so coordinate metadata is only read from S3 once per process
- Add warmup_remote_store() that pre-populates the store and dataset caches
- Add a FastAPI lifespan that runs warmup in a background thread at startup, so GEFS is ready before the first user request
- In the zarr file proxy, skip asyncio.to_thread when the store is already in _store_cache (a dict lookup needs no thread)
_coverage_from_dataset returns ds[time_dim].max() as the temporal end, which for GEFS is the last init_time (today). Forecast data extends beyond that by the full lead_time horizon — GEFS runs to 840 hours (35 days) ahead, so the correct end is last_init_time + max_lead_time. Check for a lead_time dimension in _register_remote_zarr_artifact and add the maximum lead timedelta to the temporal end after coverage is computed. Keeps the fix local to the remote zarr registration path so existing local dataset coverage logic is unaffected.
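The fix described here is plain date arithmetic; a sketch with a hypothetical helper name:

```python
from datetime import datetime, timedelta

def forecast_temporal_end(last_init: datetime, max_lead_hours: int) -> datetime:
    """Forecast coverage ends at the last init plus the full lead horizon,
    not at the last init itself (illustrative helper)."""
    return last_init + timedelta(hours=max_lead_hours)

# GEFS runs 840 hours (35 days) ahead of each init
end = forecast_temporal_end(datetime(2024, 1, 1), 840)
```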
- Fix I001 (import ordering) in ingestions/services.py and main.py
- Fix F401 (unused TestClient import) in tests/test_remote_zarr.py
- Fix E501 (line too long) in publications/services.py by splitting a conditional
- Add an explicit byte_request union type to fix mypy assignment errors in services.py
- Fix RangeByteRequest end=None by resolving the total size before constructing the request
- Rename step → time_step in stac/services.py to avoid a no-redef error across loop branches
- Add an AsyncGenerator return type to _lifespan in main.py to satisfy asynccontextmanager typing
- Add type: ignore[arg-type] to the numpy_datetime_to_period_string call, consistent with existing usage
Adds a new SyncKind.DERIVED and _derive_gefs_artifact() that reads the latest GEFS remote run, averages ensemble members, maps lead_time to valid_time calendar dates, subsets to the configured instance extent, resamples 6-hourly steps to daily mean, and writes a standard flat zarr. The result is a normal ZARR artifact (time x lat x lon) compatible with ZarrLayer, pygeoapi, and all existing serving infrastructure — no special casing required. Tile reads are local disk instead of S3 range requests. New dataset templates: gefs_precipitation_forecast and gefs_temperature_2m_forecast with sync.kind: derived.
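The core derivation steps (ensemble mean, lead_time → valid_time, daily resample) reduce to a short pipeline. Here is a deliberately 1-D pandas illustration with hypothetical names; the actual _derive_gefs_artifact operates on the full lat/lon grid with xarray:

```python
import pandas as pd

def derive_daily_mean(init: str, lead_hours: list[int],
                      members: list[list[float]]) -> pd.Series:
    """members: one value series per ensemble member, indexed by lead step."""
    # 1. ensemble mean across members at each lead step
    mean = [sum(vals) / len(vals) for vals in zip(*members)]
    # 2. map lead_time offsets to valid_time calendar timestamps
    valid = pd.Timestamp(init) + pd.to_timedelta(lead_hours, unit="h")
    # 3. resample the 6-hourly steps to daily means on the new time axis
    return pd.Series(mean, index=valid).resample("1D").mean()

daily = derive_daily_mean("2024-01-01", [0, 6, 12, 18, 24],
                          [[1.0] * 5, [3.0] * 5])
```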
… zarr datasets
The derived templates now carry their own store block directly instead of relying on a separate source_dataset_id lookup. This removes the raw gefs_precipitation and gefs_temperature_2m remote zarr templates — the only exposed datasets are the derived ensemble-mean local zarr artifacts. Registry validation updated: the derived kind validates the store block rather than checking for sync.source_dataset_id.
…facts
- Check _is_derived_source before is_remote_source in create_artifact so derived templates with a store block are not mis-routed to remote zarr registration
- Compute coverage using _coverage_from_dataset with native_crs=EPSG:4326 so the instance CRS (UTM) is not applied to data already in WGS84, which was producing a near-zero spatial_wgs84 and a wrong STAC bbox
GEFS raw store has latitude in descending order (90→-90). After the sel/resample pipeline the order is preserved, but ZarrLayer requires ascending coordinates to compute correct chunk indices. sortby before rechunking ensures standard ascending lat/lon in the output zarr.
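The reorder is the standard flip-if-descending idiom; a numpy sketch with a hypothetical helper name (the PR itself uses xarray's sortby):

```python
import numpy as np

def ensure_ascending_lat(lats: np.ndarray, grid: np.ndarray):
    """grid has shape (lat, lon); flip rows when latitude runs 90 -> -90."""
    if lats[0] > lats[-1]:
        return lats[::-1], grid[::-1, :]
    return lats, grid

lats, grid = ensure_ascending_lat(
    np.array([90.0, 0.0, -90.0]),
    np.array([[1.0], [2.0], [3.0]]),
)
```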
…acts
The GEFS icechunk store fills in lead_times gradually after each 00z run; unpublished steps exist as NaN placeholders. Drop trailing all-NaN time steps after daily resampling so the artifact only covers validated data.
STAC proj:code for local zarr artifacts was always set to the instance CRS (e.g. EPSG:32633 UTM), which caused ZarrLayer to misinterpret WGS84 coordinates as projected. Use coverage.spatial_wgs84 as the discriminator: when it is None the data is WGS84-native and proj:code should be EPSG:4326.
The latest GEFS run is still being distributed when we derive the artifact, so its longer lead_times are NaN placeholders. Walk backwards through the last 5 init_times and select the most recent one whose final lead_time has actual data, giving a complete 35-day forecast instead of a partial run.
- remote_zarr: remove _dataset_cache; return a fresh xr.Dataset per call so callers can close freely without poisoning a shared instance — the underlying icechunk store remains cached
- stac/services: remove the duplicate ds.close() in the except block (finally already closes it)
- main: use asyncio.get_running_loop() instead of the deprecated get_event_loop() in the async lifespan context
- datasets: align validation error messages with actual template field names (store block / store.store_url instead of source block / source.store_url)
- ingestions: wrap Range header int() parsing in try/except and return 416 for malformed specs; return 404 from HEAD when the key does not exist
…tion
The derivation pipeline was hardcoded in services.py and tied to GEFS dimensions, making it impossible for countries to add other forecast sources (e.g. ECMWF IFS) without modifying core code.
- Move GEFS derivation logic to climate_api/processing/gefs.py as a public derive_dataset() function with a documented keyword-only contract
- Replace _derive_gefs_artifact/_most_recent_complete_init in services.py with a generic _derive_artifact() that resolves and calls ingestion.function (the same mechanism used by all other dataset templates)
- Update gefs.yaml to explicitly declare ingestion.function: climate_api.processing.gefs.derive_dataset
- Update template validation: the derived kind now requires both store.store_url and ingestion.function, making the contract explicit for plugin authors
- Add tests for most_recent_complete_init, derived routing, and ingestion.function dispatch
Hardcoding EPSG:4326 broke coverage and STAC proj:code for remote stores
in projected coordinate systems. store.crs now overrides the default so
plugin authors can declare the correct CRS in the template YAML:
```yaml
store:
  crs: "EPSG:32633"
```
Falls back to EPSG:4326 when absent (correct for dynamical.org, ARCO-ERA5,
and most other public WGS84 stores). spatial_wgs84 is now set when the
store CRS is non-WGS84, consistent with the rest of the artifact pipeline.
…orms
Adds flux_to_mm_per_day to unit_conversion.py and wires it into the derived artifact pipeline. _derive_artifact now applies dataset transforms after derive_fn writes the zarr, reusing the same transform infrastructure as the regular download pipeline. Updates gefs_precipitation_forecast to declare the transform, units: mm/day, and a display range of [0, 25].
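The conversion itself is a one-liner, since 1 kg of water per m² is a 1 mm layer; a sketch of the behavior described above (the exact signature in unit_conversion.py may differ):

```python
SECONDS_PER_DAY = 86_400

def flux_to_mm_per_day(flux_kg_m2_s: float) -> float:
    """Convert a precipitation flux (kg m-2 s-1) to mm/day.

    1 kg of water over 1 m^2 is a 1 mm deep layer, so the flux is
    numerically mm/s; multiply by seconds per day.
    """
    return flux_kg_m2_s * SECONDS_PER_DAY
```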
…_mm_per_day transform
Adds GEFS precipitation and temperature entries to built_in_datasets.md, documents sync.kind: derived in extensibility.md and adding_custom_datasets.md (including a step-by-step derivation function contract and YAML reference), and adds flux_to_mm_per_day to transforms.md. Also clarifies that transforms run after derive_fn for derived datasets, not only after download.
…rr path
Writing to_zarr(..., mode='w') on the same path we are reading from truncates the store before dask reads complete, producing all-zero values. Load the computed result into memory, close the source, then write.
… for derived
GeoZarr root attrs (spatial:bbox, proj:code) are a framework concern — every derived artifact needs them for map tile positioning regardless of which derive_fn produced it. Moving create_geozarr_attrs out of gefs.py into _derive_artifact means plugin authors don't need to know about it. The framework now computes the bbox from the written zarr's actual spatial coordinates (via get_x_y_dims), writes the attrs directly to the zarr root using zarr.open_group, and re-consolidates metadata — all after transforms are applied, in the same pass used for coverage computation.
Also makes CreateIngestionRequest.start optional (str | None) so derived and remote dataset syncs can be triggered without a start date. A guard in create_artifact raises HTTP 422 if start is None for non-derived datasets.
- HEAD requests to the remote zarr proxy now return 503 on unexpected store errors instead of masking them as 200 with Content-Length: 0
- The private _store_cache import is replaced by a public get_store_if_cached() helper in remote_zarr.py
- _register_remote_zarr_artifact reads store.crs (default EPSG:4326) instead of hardcoding WGS84, consistent with _derive_artifact
- map-viewer: renderExtraDimSliders replaced the innerHTML template literal with DOM construction; safeDimId() sanitises dim keys used in element ids; event listeners reference elements by closure rather than id lookup
Depends on
#128 (architecture documentation) — this PR extends that doc and must be rebased on it after merge.
Summary
Implements `sync.kind: derived` and `sync.kind: remote` for dataset templates, and ships the first two derived datasets: GEFS precipitation forecast and GEFS 2 m temperature forecast via dynamical.org's icechunk store. Closes #125.

The raw GEFS store has five dimensions (`init_time`, `lead_time`, `ensemble_member`, `latitude`, `longitude`) which are not suitable for direct serving. This PR derives a simple `time × latitude × longitude` zarr from the latest complete run on each sync.

Breaking changes

`CreateIngestionRequest.start` is now optional (`str | None`, default `null`). Required for `temporal` and `release` datasets; omit for `derived` and `remote` datasets. Clients that always supply `start` are unaffected.

Changes
New sync kinds
`sync.kind: derived` — a derivation function transforms a remote store into a local zarr on each sync. The framework applies declared transforms, writes GeoZarr root attributes, computes coverage, and registers the artifact. The derivation function only needs to write a `(time, latitude, longitude)` zarr; the framework handles everything else.

`sync.kind: remote` — registers the remote store URL as the artifact path and proxies requests directly. No local copy. Use only when the remote store is already in a shape clients can consume.

Framework changes

- GeoZarr root attrs (`spatial:bbox`, `proj:code`) are now written by the framework in `_derive_artifact`, not by the derivation function. This ensures attrs always reflect the data after transforms run. Plugin code must not write them.
- `CreateIngestionRequest.start` made optional, with a guard that raises HTTP 422 for `temporal`/`release` datasets where `start` is still required.
- `get_store_if_cached()` added to `remote_zarr.py` as the public accessor for the store cache (replaces the private `_store_cache` import).
- `asyncio.get_running_loop()` replaces the deprecated `get_event_loop()` in the lifespan hook.
- Range header `int()` parsing wrapped in try/except — returns 416 instead of 500 for malformed specs.

GEFS derivation pipeline (`climate_api/processing/gefs.py`)

- Selects the most recent `init_time` with a complete forecast (the latest run is often still distributing — walk back up to 5 runs)
- Maps `lead_time → valid_time` to produce a standard `time` dimension
- `gefs_precipitation_forecast` uses the `flux_to_mm_per_day` transform (×86400 s/day) so stored values are in mm/day rather than the raw GEFS unit of kg m⁻² s⁻¹

Map viewer fix (`climate_api/templates/map-viewer.html`)

Dimension selector HTML construction replaced with the DOM API (`createElement`/`textContent`) to prevent XSS from dataset dimension keys.

Documentation

- `docs/architecture.md` — extended from #128 (docs: architecture overview — concepts, lifecycle, sync kinds, plugin contracts) with `derived` and `remote` sync kinds, the derivation function contract, GeoZarr/CRS handling for derived artifacts, and design consequences
- `docs/adding_custom_datasets.md` — new guide covering template authoring, plugin layout, and the derivation function contract
- `docs/extensibility.md` — new reference for the five extension points
- `docs/built_in_datasets.md` — GEFS precipitation and temperature datasets added
- `docs/transforms.md` — `flux_to_mm_per_day` documented

Plugin extensibility
Any country plugin can add any forecast source without core changes:
Test plan
- Derived artifact has a standard `time` dimension (calendar dates, 35 days)
- Falls back to an earlier `init_time` when the latest run is still distributing
- `ingestion.function` dispatch tested with a mock derive function
- `most_recent_complete_init` tested with incomplete and complete run fixtures
- Precipitation stored in mm/day (`flux_to_mm_per_day` transform)