YAML-defined process registry with OGC API Processes exposure #65

@turban

Background

PR #63 introduced temporal resampling as a derived dataset type — sync_kind: derived with a resample block in the YAML template, materialized via POST /processes/resample. This establishes a useful pattern but hardcodes a single operation. The same design should be generalised into a full process registry where:

  • Core processes (resampling, normals computation, anomaly calculation) ship with the package
  • Country teams and HISP groups can register their own processes via YAML without touching the core codebase
  • Registered processes are automatically exposed as OGC API Processes through pygeoapi, unless explicitly marked internal

This is Layer 2 from the extensibility roadmap (#40), informed by the cloud execution investigation (#49).

Proposed design

OGC API Processes exposure — dynamic config generation

Rather than wrapping pygeoapi's BaseProcessor interface, the process registry should generate pygeoapi process config dynamically — the same pattern publications/services.py already uses for OGC API collections. The registry reads process definitions, writes pygeoapi config, and pygeoapi serves the OGC API surface.
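
pygeoapi's config defines processes as resources of type: process pointing at a processor plugin, so the generation step could look roughly like the sketch below (ProcessDefinition and the single GenericProcessAdapter are assumptions about how the registry would satisfy that requirement, not existing code):

def build_process_resources(definitions: list["ProcessDefinition"]) -> dict:
    """Turn registered process definitions into pygeoapi resource entries,
    mirroring the collection-generation pattern in publications/services.py."""
    resources = {}
    for d in definitions:
        if not d.expose:                                     # ogcapi.expose: false stays internal
            continue
        resources[d.id] = {
            "type": "process",                               # pygeoapi resource type for processes
            "processor": {"name": "GenericProcessAdapter"},  # one adapter dispatches by process id
        }
    return resources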

This approach is preferred over BaseProcessor for three reasons:

  • Simpler authoring. BaseProcessor requires process authors to learn pygeoapi internals and write a boilerplate Python class. A YAML definition + plain function is accessible to country teams without pygeoapi knowledge.
  • Internal processes. Processes with ogcapi.expose: false (e.g. the post-sync anomaly cascade) should never appear in the OGC catalogue. pygeoapi has no concept of a non-exposed process; the dynamic generation layer handles this naturally.
  • Consistency. The pattern is already established. Reusing it for processes keeps one mental model across the codebase.

Process registry (mirrors the dataset registry)

A processes/ directory (built-in at src/climate_api/data/processes/, user-supplied via processes_dir in CLIMATE_API_CONFIG) holds YAML files, one per process. User-supplied processes are merged with the built-ins — a custom process with the same id overrides the built-in one. This is the same behaviour as datasets_dir.

id: resample
name: Temporal resampling
description: Resample a managed dataset to a coarser time period (weekly, monthly).
execution:
  function: climate_api.processing.resample.run
ogcapi:
  expose: true
  async: true
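
The merge/override behaviour might be implemented roughly as below (a sketch; the loader name and signature are assumptions):

from pathlib import Path

import yaml

def load_process_definitions(builtin_dir: Path, user_dir: Path | None) -> dict[str, dict]:
    """Load built-in definitions first, then user-supplied ones; later
    directories win, so a matching id overrides the built-in (as with datasets_dir)."""
    definitions: dict[str, dict] = {}
    for directory in filter(None, [builtin_dir, user_dir]):
        for path in sorted(directory.glob("*.yaml")):
            definition = yaml.safe_load(path.read_text())
            definitions[definition["id"]] = definition
    return definitions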

Custom processes reference a user-supplied function in the same way dataset templates reference eo_function:

id: dengue_suitability
name: Dengue suitability index
execution:
  function: nepal_climate_tools.processes.dengue.run
ogcapi:
  expose: true
  async: true

Note: inputs and outputs are intentionally absent from the YAML. The OGC process description (input/output schema, types, constraints) is generated from the Python function's type hints at registration time — the same way FastAPI generates OpenAPI schemas from function signatures. A YAML-level inputs/outputs block would be added only if constraints need to be expressed that cannot be captured in type hints.
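
One way that could work, assuming processes type their inputs with a TypedDict so per-input types are visible to introspection (all names below are illustrative, not existing code):

from typing import Any, TypedDict, get_type_hints

class ResampleInputs(TypedDict):  # hypothetical typed inputs for the resample process
    dataset_id: str
    method: str
    period: str

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def describe_inputs(inputs_type: type) -> dict[str, Any]:
    """Map a TypedDict's fields to OGC-style input descriptions."""
    return {
        name: {"schema": {"type": _JSON_TYPES.get(tp, "object")}}
        for name, tp in get_type_hints(inputs_type).items()
    }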

Python function interface

A process function has a simple, consistent signature:

def run(inputs: dict[str, Any], context: ProcessContext) -> dict[str, Any]:
    ...

ProcessContext provides access to the artifact registry, the configured extent, and the data directory. The function returns a dict of named outputs. No base class required — any callable with this signature qualifies.
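
For illustration, ProcessContext might look like this (field names are assumptions grounded in the description above), with a trivial process function against it:

from dataclasses import dataclass
from pathlib import Path
from typing import Any

@dataclass
class ProcessContext:
    artifacts: Any   # handle to the artifact registry
    extent: Any      # the configured spatial extent
    data_dir: Path   # root of the managed data directory

def run(inputs: dict[str, Any], context: ProcessContext) -> dict[str, Any]:
    """A do-nothing process: writes its inputs to a file and returns named outputs."""
    out_path = context.data_dir / "echo.txt"
    out_path.write_text(str(inputs))
    return {"result": str(out_path)}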

Execution via OGC API Processes

Execution follows OGC API Processes Part 1:

POST /ogcapi/processes/{id}/execution  →  202 Accepted + jobID
GET  /ogcapi/jobs/{jobID}              →  status / results

Processes with ogcapi.expose: false are available internally (e.g. triggered by derived dataset sync) but not listed in the OGC API catalogue.
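
From a client's perspective the async flow might look like this (URL and inputs are illustrative; per OGC API Processes Part 1, async execution is requested via the Prefer: respond-async header and the 202 response carries the job URL in Location):

import time

import requests

base = "https://example.org/ogcapi"  # hypothetical deployment
resp = requests.post(
    f"{base}/processes/resample/execution",
    json={"inputs": {"dataset_id": "era5_temperature", "method": "mean", "period": "monthly"}},
    headers={"Prefer": "respond-async"},
)
assert resp.status_code == 202
job_url = resp.headers["Location"]  # /ogcapi/jobs/{jobID}

# Poll until the job leaves the accepted/running states.
while (job := requests.get(job_url).json())["status"] in ("accepted", "running"):
    time.sleep(5)
print(job["status"])  # successful / failed / dismissed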

Relationship to derived datasets

The sync_kind: derived pattern from PR #63 carries over; dataset templates now reference a process by id instead of an inline resample block:

sync_kind: derived
processing:
  process_id: resample
  params:
    method: mean
    period: monthly

This decouples the dataset definition (what to produce and when) from the process definition (how to produce it). The same resampling process can back multiple derived dataset templates.

Built-in processes (starting point)

  • resample — Temporal resampling: produces weekly or monthly aggregates from a higher-frequency source dataset (PR #63)
  • normals — Climate normals: computes long-term averages (e.g. 1991–2020) grouped by day-of-year, with optional WMO 31-day smoothing window (#56)
  • anomaly — Climate anomalies: computes departure from normals for a given time range; can be triggered automatically after each sync via a post-sync cascade (#56)

The normals→anomaly cascade (#56) fits naturally into the process registry: after a base dataset syncs, the sync engine checks whether any sync_kind: derived anomaly datasets reference it and triggers the anomaly process for the new delta range as a background job. The cascade requires that the normals artifact exists first; if it does not, the cascade is skipped with a warning.
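
A sketch of that post-sync check (registry and job-queue interfaces are hypothetical names):

import logging

log = logging.getLogger(__name__)

def maybe_trigger_anomaly_cascade(dataset_id, delta_range, registry, jobs) -> None:
    for template in registry.derived_templates(process_id="anomaly"):
        if template.source_dataset != dataset_id:
            continue  # cascade only for the dataset that just synced
        if not registry.artifact_exists(template.normals_id):
            log.warning("normals for %s missing; skipping anomaly cascade", dataset_id)
            continue  # normals must exist before anomalies can be computed
        jobs.submit("anomaly", inputs={"dataset": dataset_id, "time_range": delta_range})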

ERA5-Land hourly data requires an intermediate daily aggregation step before normals are meaningful. The resample process covers this: a daily ERA5 temperature or precipitation dataset is itself a publishable artifact and the natural source for normals computation.
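
The core of that hourly-to-daily step with xarray might look like this (file and variable names are illustrative; accumulated variables such as ERA5-Land precipitation need dataset-specific handling rather than a plain mean):

import xarray as xr

hourly = xr.open_dataset("era5_land_hourly.nc")        # hypothetical input file
daily_t2m = hourly["t2m"].resample(time="1D").mean()   # daily mean 2 m temperature
daily_t2m.to_netcdf("era5_land_daily_t2m.nc")          # itself a publishable artifact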

Relationship to cloud execution (#49)

The ProcessContext passed to each function is the natural place to inject an execution backend — local by default, Dask or serverless for deployments co-located with cloud storage. The function signature stays the same regardless of where it runs. This keeps the API surface standard (OGC API Processes) while allowing the execution location to become a deployment configuration.
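
A sketch of that injection point (the Protocol and backend names are assumptions):

from typing import Any, Callable, Protocol

class ExecutionBackend(Protocol):
    def submit(self, func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any: ...

class LocalBackend:
    """Default: run the work in-process."""
    def submit(self, func, *args, **kwargs):
        return func(*args, **kwargs)

class DaskBackend:
    """For deployments co-located with cloud storage; client is a dask.distributed.Client."""
    def __init__(self, client):
        self.client = client
    def submit(self, func, *args, **kwargs):
        return self.client.submit(func, *args, **kwargs).result()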

Open questions

  • Should there be a POST /processes endpoint for registering a process at runtime (OGC API Processes Part 2 — Deploy), or is file-based registration sufficient for the target user base?
  • How do we handle process versioning — if a custom process function changes, do existing derived artifacts need to be marked stale?
