YAML-defined process registry with OGC API Processes exposure #65

@turban

Background

PR #63 introduced temporal resampling as a derived dataset type — sync_kind: derived with a resample block in the YAML template, materialized via POST /processes/resample. This establishes a useful pattern but hardcodes a single operation. The same design should be generalised into a full process registry where:

  • Core processes (resampling, normals computation, anomaly calculation) ship with the package
  • Country teams and HISP groups can register their own processes via YAML without touching the core codebase
  • Registered processes are automatically exposed as OGC API Processes through pygeoapi, unless explicitly marked internal

This is Layer 2 from the extensibility roadmap (#40), informed by the cloud execution investigation (#49).

Proposed design

OGC API Processes exposure — dynamic config generation

Rather than wrapping pygeoapi's BaseProcessor interface, the process registry should generate pygeoapi process config dynamically — the same pattern publications/services.py already uses for OGC API collections. The registry reads process definitions, writes pygeoapi config, and pygeoapi serves the OGC API surface.
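
pygeoapi's config defines processes as resources of type: process pointing at a processor plugin, so the generation step could look roughly like the sketch below (ProcessDefinition and the single GenericProcessAdapter are assumptions about how the registry would satisfy that requirement, not existing code):

def build_process_resources(definitions: list["ProcessDefinition"]) -> dict:
    """Turn registered process definitions into pygeoapi resource entries,
    mirroring the collection-generation pattern in publications/services.py."""
    resources = {}
    for d in definitions:
        if not d.expose:                                     # ogcapi.expose: false stays internal
            continue
        resources[d.id] = {
            "type": "process",                               # pygeoapi resource type for processes
            "processor": {"name": "GenericProcessAdapter"},  # one adapter dispatches by process id
        }
    return resources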

This approach is preferred over BaseProcessor for three reasons:

  • Simpler authoring. BaseProcessor requires process authors to learn pygeoapi internals and write a boilerplate Python class. A YAML definition + plain function is accessible to country teams without pygeoapi knowledge.
  • Internal processes. Processes with ogcapi.expose: false (e.g. the post-sync anomaly cascade) should never appear in the OGC catalogue. pygeoapi has no concept of a non-exposed process; the dynamic generation layer handles this naturally.
  • Consistency. The pattern is already established. Reusing it for processes keeps one mental model across the codebase.

Process registry (mirrors the dataset registry)

A processes/ directory (built-in at src/climate_api/data/processes/, user-supplied via processes_dir in CLIMATE_API_CONFIG) holds YAML files, one per process. User-supplied processes are merged with the built-ins — a custom process with the same id overrides the built-in one. This is the same behaviour as datasets_dir.

id: resample
name: Temporal resampling
description: Resample a managed dataset to a coarser time period (weekly, monthly).
execution:
  function: climate_api.processing.resample.run
ogcapi:
  expose: true
  async: true
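
The merge/override behaviour might be implemented roughly as below (a sketch; the loader name and signature are assumptions):

from pathlib import Path

import yaml

def load_process_definitions(builtin_dir: Path, user_dir: Path | None) -> dict[str, dict]:
    """Load built-in definitions first, then user-supplied ones; later
    directories win, so a matching id overrides the built-in (as with datasets_dir)."""
    definitions: dict[str, dict] = {}
    for directory in filter(None, [builtin_dir, user_dir]):
        for path in sorted(directory.glob("*.yaml")):
            definition = yaml.safe_load(path.read_text())
            definitions[definition["id"]] = definition
    return definitions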

Custom processes reference a user-supplied function in the same way dataset templates reference eo_function:

id: dengue_suitability
name: Dengue suitability index
execution:
  function: nepal_climate_tools.processes.dengue.run
ogcapi:
  expose: true
  async: true

Note: inputs and outputs are intentionally absent from the YAML. The OGC process description (input/output schema, types, constraints) is generated from the Python function's type hints at registration time — the same way FastAPI generates OpenAPI schemas from function signatures. A YAML-level inputs/outputs block would be added only if constraints need to be expressed that cannot be captured in type hints.
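
One way that could work, assuming processes type their inputs with a TypedDict so per-input types are visible to introspection (all names below are illustrative, not existing code):

from typing import Any, TypedDict, get_type_hints

class ResampleInputs(TypedDict):  # hypothetical typed inputs for the resample process
    dataset_id: str
    method: str
    period: str

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def describe_inputs(inputs_type: type) -> dict[str, Any]:
    """Map a TypedDict's fields to OGC-style input descriptions."""
    return {
        name: {"schema": {"type": _JSON_TYPES.get(tp, "object")}}
        for name, tp in get_type_hints(inputs_type).items()
    }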

Python function interface

A process function has a simple, consistent signature:

def run(inputs: dict[str, Any], context: ProcessContext) -> dict[str, Any]:
    ...

ProcessContext provides access to the artifact registry, the configured extent, and the data directory. The function returns a dict of named outputs. No base class required — any callable with this signature qualifies.
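
For illustration, ProcessContext might look like this (field names are assumptions grounded in the description above), with a trivial process function against it:

from dataclasses import dataclass
from pathlib import Path
from typing import Any

@dataclass
class ProcessContext:
    artifacts: Any   # handle to the artifact registry
    extent: Any      # the configured spatial extent
    data_dir: Path   # root of the managed data directory

def run(inputs: dict[str, Any], context: ProcessContext) -> dict[str, Any]:
    """A do-nothing process: writes its inputs to a file and returns named outputs."""
    out_path = context.data_dir / "echo.txt"
    out_path.write_text(str(inputs))
    return {"result": str(out_path)}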

Execution via OGC API Processes

Execution follows OGC API Processes Part 1:

POST /ogcapi/processes/{id}/execution  →  202 Accepted + jobID
GET  /ogcapi/jobs/{jobID}              →  status / results

Processes with ogcapi.expose: false are available internally (e.g. triggered by derived dataset sync) but not listed in the OGC API catalogue.
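
From a client's perspective the async flow might look like this (URL and inputs are illustrative; per OGC API Processes Part 1, async execution is requested via the Prefer: respond-async header and the 202 response carries the job URL in Location):

import time

import requests

base = "https://example.org/ogcapi"  # hypothetical deployment
resp = requests.post(
    f"{base}/processes/resample/execution",
    json={"inputs": {"dataset_id": "era5_temperature", "method": "mean", "period": "monthly"}},
    headers={"Prefer": "respond-async"},
)
assert resp.status_code == 202
job_url = resp.headers["Location"]  # /ogcapi/jobs/{jobID}

# Poll until the job leaves the accepted/running states.
while (job := requests.get(job_url).json())["status"] in ("accepted", "running"):
    time.sleep(5)
print(job["status"])  # successful / failed / dismissed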

Relationship to derived datasets

The sync_kind: derived pattern from PR #63 carries over; dataset templates now reference a process by id instead of an inline resample block:

sync_kind: derived
processing:
  process_id: resample
  params:
    method: mean
    period: monthly

This decouples the dataset definition (what to produce and when) from the process definition (how to produce it). The same resampling process can back multiple derived dataset templates.

Built-in processes (starting point)

  • resample — Temporal resampling: produces weekly or monthly aggregates from a higher-frequency source dataset (PR #63)
  • normals — Climate normals: computes long-term averages (e.g. 1991–2020) grouped by day-of-year, with optional WMO 31-day smoothing window (#56)
  • anomaly — Climate anomalies: computes departure from normals for a given time range; can be triggered automatically after each sync via a post-sync cascade (#56)

The normals→anomaly cascade (#56) fits naturally into the process registry: after a base dataset syncs, the sync engine checks whether any sync_kind: derived anomaly datasets reference it and triggers the anomaly process for the new delta range as a background job. The cascade requires that the normals artifact exists first; if it does not, the cascade is skipped with a warning.
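
A sketch of that post-sync check (registry and job-queue interfaces are hypothetical names):

import logging

log = logging.getLogger(__name__)

def maybe_trigger_anomaly_cascade(dataset_id, delta_range, registry, jobs) -> None:
    for template in registry.derived_templates(process_id="anomaly"):
        if template.source_dataset != dataset_id:
            continue  # cascade only for the dataset that just synced
        if not registry.artifact_exists(template.normals_id):
            log.warning("normals for %s missing; skipping anomaly cascade", dataset_id)
            continue  # normals must exist before anomalies can be computed
        jobs.submit("anomaly", inputs={"dataset": dataset_id, "time_range": delta_range})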

ERA5-Land hourly data requires an intermediate daily aggregation step before normals are meaningful. The resample process covers this: a daily ERA5 temperature or precipitation dataset is itself a publishable artifact and the natural source for normals computation.
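
The core of that hourly-to-daily step with xarray might look like this (file and variable names are illustrative; accumulated variables such as ERA5-Land precipitation need dataset-specific handling rather than a plain mean):

import xarray as xr

hourly = xr.open_dataset("era5_land_hourly.nc")        # hypothetical input file
daily_t2m = hourly["t2m"].resample(time="1D").mean()   # daily mean 2 m temperature
daily_t2m.to_netcdf("era5_land_daily_t2m.nc")          # itself a publishable artifact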

Relationship to cloud execution (#49)

The ProcessContext passed to each function is the natural place to inject an execution backend — local by default, Dask or serverless for deployments co-located with cloud storage. The function signature stays the same regardless of where it runs. This keeps the API surface standard (OGC API Processes) while allowing the execution location to become a deployment configuration.
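
A sketch of that injection point (the Protocol and backend names are assumptions):

from typing import Any, Callable, Protocol

class ExecutionBackend(Protocol):
    def submit(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any: ...

class LocalBackend:
    """Default: run the work in-process."""
    def submit(self, func, *args, **kwargs):
        return func(*args, **kwargs)

class DaskBackend:
    """For deployments co-located with cloud storage; client is a dask.distributed.Client."""
    def __init__(self, client):
        self.client = client
    def submit(self, func, *args, **kwargs):
        return self.client.submit(func, *args, **kwargs).result()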

Open questions

  • Should there be a POST /processes endpoint for registering a process at runtime (OGC API Processes Part 2 — Deploy), or is file-based registration sufficient for the target user base?
  • How do we handle process versioning — if a custom process function changes, do existing derived artifacts need to be marked stale?
