## Background
PR #63 introduced temporal resampling as a derived dataset type — `sync_kind: derived` with a `resample` block in the YAML template, materialised via `POST /processes/resample`. This establishes a useful pattern but hardcodes a single operation. The same design should be generalised into a full process registry where:
- Core processes (resampling, normals computation, anomaly calculation) ship with the package
- Country teams and HISP groups can register their own processes via YAML without touching the core codebase
- All registered processes are automatically exposed as OGC API Processes through pygeoapi
This is Layer 2 from the extensibility roadmap (#40), informed by the cloud execution investigation (#49).
## Proposed design

### OGC API Processes exposure — dynamic config generation
Rather than wrapping pygeoapi's `BaseProcessor` interface, the process registry should generate pygeoapi process config dynamically — the same pattern `publications/services.py` already uses for OGC API collections. The registry reads process definitions, writes pygeoapi config, and pygeoapi serves the OGC API surface.
This approach is preferred over `BaseProcessor` for three reasons:
- **Simpler authoring.** `BaseProcessor` requires process authors to learn pygeoapi internals and write a boilerplate Python class. A YAML definition plus a plain function is accessible to country teams without pygeoapi knowledge.
- **Internal processes.** Processes with `ogcapi.expose: false` (e.g. the post-sync anomaly cascade) should never appear in the OGC catalogue. pygeoapi has no concept of a non-exposed process; the dynamic generation layer handles this naturally.
- **Consistency.** The pattern is already established. Reusing it for processes keeps one mental model across the codebase.
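The generation step can be sketched as follows. The `build_pygeoapi_resources` helper and the flat process-record shape are illustrative, and the exact pygeoapi resource keys may differ from this sketch:

```python
from typing import Any


def build_pygeoapi_resources(processes: list[dict[str, Any]]) -> dict[str, Any]:
    """Translate registered process definitions into pygeoapi resource entries.

    Only processes flagged ogcapi.expose: true are emitted; internal
    processes never appear in the generated config at all.
    """
    resources: dict[str, Any] = {}
    for proc in processes:
        if not proc.get("ogcapi", {}).get("expose", False):
            continue  # internal process: skip OGC exposure entirely
        resources[proc["id"]] = {
            "type": "process",
            # Resolved back to the registered Python function at runtime.
            "processor": {"name": proc["id"]},
        }
    return resources
```

Because the filter runs before any config is written, a non-exposed process is simply absent from the catalogue rather than present-but-hidden.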
### Process registry (mirrors the dataset registry)
A `processes/` directory (built-in at `src/climate_api/data/processes/`, user-supplied via `processes_dir` in `CLIMATE_API_CONFIG`) holds YAML files, one per process. User-supplied processes are merged with the built-ins — a custom process with the same `id` overrides the built-in one. This is the same behaviour as `datasets_dir`.
```yaml
id: resample
name: Temporal resampling
description: Resample a managed dataset to a coarser time period (weekly, monthly).
execution:
  function: climate_api.processing.resample.run
ogcapi:
  expose: true
  async: true
```
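The merge behaviour can be sketched at the dict level; a definition with the same `id` replaces the built-in wholesale rather than being deep-merged, mirroring `datasets_dir`:

```python
from typing import Any

ProcessDef = dict[str, Any]


def merge_process_registries(
    builtin: dict[str, ProcessDef],
    user: dict[str, ProcessDef],
) -> dict[str, ProcessDef]:
    """Merge user-supplied process definitions over the built-ins.

    A user process with the same id replaces the built-in entirely;
    definitions are not deep-merged.
    """
    merged = dict(builtin)  # start from the shipped processes
    merged.update(user)     # same id: the user definition wins
    return merged
```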
Custom processes reference a user-supplied function in the same way dataset templates reference `eo_function`:
```yaml
id: dengue_suitability
name: Dengue suitability index
execution:
  function: nepal_climate_tools.processes.dengue.run
ogcapi:
  expose: true
  async: true
```
Note: `inputs` and `outputs` are intentionally absent from the YAML. The OGC process description (input/output schema, types, constraints) is generated from the Python function's type hints at registration time — the same way FastAPI generates OpenAPI schemas from function signatures. A YAML-level `inputs`/`outputs` block would be added only if constraints need to be expressed that cannot be captured in type hints.
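A minimal sketch of the mechanism for a hypothetical typed process function; the `input_schema` helper and the annotation-to-JSON-Schema mapping are illustrative, not the real implementation:

```python
import inspect
from typing import Any, Callable, get_type_hints

# Illustrative mapping from Python annotations to JSON Schema types.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def input_schema(func: Callable[..., Any]) -> dict[str, Any]:
    """Build a JSON-Schema-like description of a function's inputs."""
    hints = get_type_hints(func)
    hints.pop("return", None)
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(func).parameters.items():
        annotation = hints.get(name, str)
        properties[name] = {"type": _JSON_TYPES.get(annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default: the input is mandatory
    return {"type": "object", "properties": properties, "required": required}
```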
### Python function interface
A process function has a simple, consistent signature:
```python
def run(inputs: dict[str, Any], context: ProcessContext) -> dict[str, Any]:
    ...
```
`ProcessContext` provides access to the artifact registry, the configured extent, and the data directory. The function returns a dict of named outputs. No base class required — any callable with this signature qualifies.
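A sketch of what the context and a conforming function might look like; the field names on `ProcessContext` are assumptions based on the description above:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any


@dataclass
class ProcessContext:
    """Runtime services handed to every process function."""
    artifacts: Any          # artifact registry (lookup/store of managed datasets)
    extent: dict[str, Any]  # configured spatial extent, e.g. a bbox
    data_dir: Path          # root directory for reading/writing artifacts


def run(inputs: dict[str, Any], context: ProcessContext) -> dict[str, Any]:
    """A trivial conforming process: report which artifact path it would write."""
    out_path = context.data_dir / f"{inputs['dataset_id']}_{inputs['period']}.nc"
    return {"artifact_path": str(out_path)}
```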
### OGC API Processes exposure
Execution follows OGC API Processes Part 1:
```
POST /ogcapi/processes/{id}/execution → 202 Accepted + jobID
GET  /ogcapi/jobs/{jobID}             → status / results
```
Processes with `ogcapi.expose: false` are available internally (e.g. triggered by derived dataset sync) but not listed in the OGC API catalogue.
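For an exposed process, the Part 1 execution request carries named inputs in the body posted to the execution endpoint; the input names here are illustrative, not a fixed contract:

```json
{
  "inputs": {
    "dataset_id": "era5_temperature_daily",
    "method": "mean",
    "period": "monthly"
  }
}
```

For an async process this returns `202 Accepted` with a job ID, which the client then polls via the jobs endpoint above.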
### Relationship to derived datasets
The `sync_kind: derived` pattern from PR #63 stays as-is for dataset templates that reference a process by `id`:
```yaml
sync_kind: derived
processing:
  process_id: resample
  params:
    method: mean
    period: monthly
```
This decouples the dataset definition (what to produce and when) from the process definition (how to produce it). The same resampling process can back multiple derived dataset templates.
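The decoupling amounts to a small dispatch step in the sync engine: resolve the process by `id`, import its function by dotted path, and call it. The helper names and template shape here are assumptions:

```python
from importlib import import_module
from typing import Any, Callable


def resolve_function(dotted_path: str) -> Callable[..., Any]:
    """Import 'pkg.module.attr' and return the attribute."""
    module_path, _, attr = dotted_path.rpartition(".")
    return getattr(import_module(module_path), attr)


def run_derived_sync(
    template: dict[str, Any],
    registry: dict[str, dict[str, Any]],
    context: Any,
) -> dict[str, Any]:
    """Look up the process named by the dataset template and invoke it."""
    processing = template["processing"]
    process = registry[processing["process_id"]]
    func = resolve_function(process["execution"]["function"])
    return func(processing.get("params", {}), context)
```

Because the template only names a `process_id`, swapping the backing implementation (built-in vs. custom override) changes nothing in the dataset definition.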
### Built-in processes (starting point)
| id | description | ref |
| --- | --- | --- |
| `resample` | Temporal resampling — produces weekly or monthly aggregates from a higher-frequency source dataset | PR #63 |
| `normals` | Climate normals — computes long-term averages (e.g. 1991–2020) grouped by day-of-year, with optional WMO 31-day smoothing window | #56 |
| `anomaly` | Climate anomalies — computes departure from normals for a given time range; can be triggered automatically after each sync via a post-sync cascade | #56 |
The normals→anomaly cascade (#56) fits naturally into the process registry: after a base dataset syncs, the sync engine checks whether any `sync_kind: derived` anomaly datasets reference it and triggers the `anomaly` process for the new delta range as a background job. The cascade requires that the normals artifact exists first; if it does not, the cascade is skipped with a warning.
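The cascade check could be sketched as below; the dataset-template shape, the `normals_exists` predicate, and the `submit_job` hook are all assumptions for illustration:

```python
import logging
from typing import Any, Callable

log = logging.getLogger(__name__)


def trigger_anomaly_cascade(
    synced_dataset_id: str,
    datasets: list[dict[str, Any]],
    normals_exists: Callable[[str], bool],
    submit_job: Callable[[str, dict[str, Any]], None],
) -> list[str]:
    """Queue anomaly jobs for derived datasets that reference a freshly
    synced base dataset; skip with a warning if normals are missing."""
    triggered: list[str] = []
    for ds in datasets:
        proc = ds.get("processing", {})
        if ds.get("sync_kind") != "derived" or proc.get("process_id") != "anomaly":
            continue
        if proc.get("params", {}).get("source") != synced_dataset_id:
            continue  # derived dataset does not depend on this sync
        if not normals_exists(synced_dataset_id):
            log.warning("normals missing for %s; skipping anomaly cascade", synced_dataset_id)
            continue
        submit_job("anomaly", proc.get("params", {}))
        triggered.append(ds["id"])
    return triggered
```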
ERA5-Land hourly data requires an intermediate daily aggregation step before normals are meaningful. The `resample` process covers this: a daily ERA5 temperature or precipitation dataset is itself a publishable artifact and the natural source for normals computation.
### Relationship to cloud execution (#49)
The `ProcessContext` passed to each function is the natural place to inject an execution backend — local by default, Dask or serverless for deployments co-located with cloud storage. The function signature stays the same regardless of where it runs. This keeps the API surface standard (OGC API Processes) while allowing the execution location to become a deployment configuration.
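One way to make the execution location a deployment concern: give the context a backend with a `submit` hook whose default runs in-process. The backend protocol and field names are assumptions, not a committed interface:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Protocol


class ExecutionBackend(Protocol):
    def submit(self, func: Callable[..., Any], *args: Any) -> Any: ...


class LocalBackend:
    """Default backend: run the work in-process."""
    def submit(self, func: Callable[..., Any], *args: Any) -> Any:
        return func(*args)


@dataclass
class ProcessContext:
    backend: ExecutionBackend = field(default_factory=LocalBackend)


def run(inputs: dict[str, Any], context: ProcessContext) -> dict[str, Any]:
    # The function body is identical whether `backend` is local,
    # a Dask client wrapper, or a serverless dispatcher.
    total = context.backend.submit(sum, inputs["values"])
    return {"total": total}
```

A Dask-backed deployment would supply a backend whose `submit` forwards to a cluster client; the process function never changes.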
## Open questions
- Should there be a `POST /processes` endpoint for registering a process at runtime (OGC API Processes Part 2 — Deploy), or is file-based registration sufficient for the target user base?
- How do we handle process versioning — if a custom process function changes, do existing derived artifacts need to be marked stale?