
Temporal resampling: period type conversion for GeoZarr datasets (hourly→daily, daily→weekly, daily→monthly) #57

@turban


Background

Several derived dataset types require period type conversion before they can be computed:

  • ERA5-Land is ingested as hourly data. Daily mean/min/max temperature and daily total precipitation are far more useful for DHIS2 health applications and are required before climate normals can be computed (see Climate normals and anomalies: computation, storage, and automatic cascade #56).
  • CHIRPS is daily. Monthly and weekly aggregates are needed for monthly normals, seasonal analysis, and import into DHIS2 monthly data elements.
  • Any temporal aggregation pipeline (normals, anomalies, exposure indices) starts with a resampling step.

This issue defines resampling as a first-class process: a job that takes one GeoZarr as input, produces another GeoZarr at a coarser period type, and updates automatically when the upstream dataset is synced.

This is a prerequisite for #56 (climate normals and anomalies), which depends on daily ERA5 and potentially monthly CHIRPS datasets.


Design decisions

Output is a separate dataset, not an in-place transformation. A resampled dataset has a different period type than its source, which means a different dataset ID, different chunk shape, and different sync behaviour. Modifying the source store in place is not possible without violating the one-period-type-per-dataset constraint.

Only complete periods are written. If the source data ends mid-week, the weekly output truncates to the last complete week. This means a resampled dataset may lag the source by up to one output period. This is explicit and documented in the dataset metadata, not a silent omission.

Aggregation method is declared per dataset template. Each climate variable has a correct aggregation method: temperature uses mean (or min/max), accumulated precipitation uses sum. Declaring this in the template rather than at request time prevents incorrect aggregations from being applied silently.


Dataset template design

A new resample block in the YAML template declares the source and aggregation:

```yaml
- id: era5land_temperature_daily
  name: ERA5-Land 2m temperature (daily mean)
  variable: t2m
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: era5land_temperature_hourly
    method: mean

- id: era5land_precipitation_daily
  name: ERA5-Land total precipitation (daily sum)
  variable: tp
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: era5land_precipitation_hourly
    method: sum

- id: chirps3_precipitation_weekly
  name: CHIRPS3 total precipitation (weekly sum)
  variable: precip
  period_type: weekly
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: chirps3_precipitation_daily
    method: sum
    week_start: monday            # ISO default; configurable

- id: chirps3_precipitation_monthly
  name: CHIRPS3 total precipitation (monthly sum)
  variable: precip
  period_type: monthly
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: chirps3_precipitation_daily
    method: sum
```
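
The shape of the `resample` block can be enforced when templates are loaded, so a bad declaration fails at startup rather than at job time. A minimal validation sketch (the helper name and dict-based template representation are assumptions, not the actual loader):

```python
# Hypothetical template validation for the resample block. The real schema
# layer may use pydantic or similar; this shows only the invariants.
ALLOWED_METHODS = {"mean", "sum", "min", "max", "first", "last"}


def validate_resample_block(template: dict) -> None:
    """Raise ValueError if a template's resample block is malformed."""
    resample = template.get("resample")
    if resample is None:
        return  # not a resampled dataset; nothing to check
    if "source_dataset_id" not in resample:
        raise ValueError(f"{template['id']}: resample.source_dataset_id is required")
    method = resample.get("method")
    if method not in ALLOWED_METHODS:
        raise ValueError(f"{template['id']}: unknown aggregation method {method!r}")
    if template.get("period_type") == "weekly":
        # week_start is only meaningful for weekly output (assumed values)
        week_start = resample.get("week_start", "monday")
        if week_start not in {"monday", "sunday"}:
            raise ValueError(f"{template['id']}: unsupported week_start {week_start!r}")
```
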

Supported aggregation methods

| Method  | Use case                                        |
|---------|-------------------------------------------------|
| `mean`  | Temperature, humidity, wind speed               |
| `sum`   | Precipitation, radiation (accumulated variables) |
| `min`   | Daily minimum temperature                       |
| `max`   | Daily maximum temperature                       |
| `first` | Snapshot or categorical variables               |
| `last`  | Snapshot or categorical variables               |
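
Each method name corresponds directly to a reducer on xarray's resample object, which is what makes table-driven dispatch possible. A self-contained demonstration on synthetic data (not the service code):

```python
import numpy as np
import pandas as pd
import xarray as xr

# 48 hourly steps of constant 1.0 precipitation over two days.
times = pd.date_range("2024-01-01", periods=48, freq="h")
ds = xr.Dataset({"tp": ("time", np.ones(48))}, coords={"time": times})

# "sum" and "mean" are real methods on the resample object, so the declared
# method name can be dispatched with getattr.
daily_sum = getattr(ds.resample(time="1D"), "sum")()
daily_mean = getattr(ds.resample(time="1D"), "mean")()

assert daily_sum.sizes["time"] == 2
assert float(daily_sum["tp"][0]) == 24.0   # 24 hourly values of 1.0
assert float(daily_mean["tp"][0]) == 1.0
```
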

Computation

New `src/climate_api/processing/resample.py`:

```python
import xarray as xr


def resample_dataset(
    source_zarr_path: str,
    *,
    target_period_type: str,
    method: str,
    start: str,
    end: str,
) -> xr.Dataset:
    ds = xr.open_zarr(source_zarr_path).sel(time=slice(start, end))
    freq = _period_type_to_pandas_freq(target_period_type)  # "1D", "1W-MON", "1ME", "1YE"
    resampler = ds.resample(time=freq)
    result = getattr(resampler, method)()
    # Drop incomplete final period if source data ends mid-period
    return _drop_incomplete_trailing_period(result, source_end=end)
```

`_drop_incomplete_trailing_period` compares the last output time step against the expected end of that period. If the source data does not cover the full period, the last output step is dropped.


New endpoint: POST /processes/resample

```
POST /processes/resample
{
  "source_dataset_id": "era5land_temperature_hourly_sle",
  "target_period_type": "daily",
  "method": "mean",
  "start": "2024-01-01",
  "end": "2024-12-31"
}
```

Returns `202 Accepted` with `{ "job_id": "..." }`.

The endpoint writes the output GeoZarr and registers it as a published artifact under the resampled dataset ID. On subsequent calls for overlapping ranges it appends only the missing periods.
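
The append-only behaviour amounts to intersecting the requested range with the coverage already written. A sketch of that delta computation, with illustrative names and daily granularity assumed (the actual service would work at the output period granularity):

```python
import pandas as pd


def missing_range(requested_start: str, requested_end: str, existing_end):
    """Return the (start, end) still to compute, or None if fully covered."""
    if existing_end is None:
        # Nothing written yet: compute the whole requested range.
        return (requested_start, requested_end)
    nxt = pd.Timestamp(existing_end) + pd.Timedelta(days=1)
    if nxt > pd.Timestamp(requested_end):
        return None  # requested range already fully covered
    return (str(nxt.date()), requested_end)
```
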


Automatic cascade on sync

When run_sync completes with status="completed" for a source dataset, the sync engine checks the dataset registry for any resampled datasets that declare that dataset as their source_dataset_id. If found, it triggers a resample job for the new delta range.

```python
# In sync_engine.run_sync, after artifact is stored:
_trigger_downstream_resample_jobs(
    source_dataset_id=managed_dataset_id,
    delta_start=sync_detail.delta_start,
    delta_end=sync_detail.delta_end,
)
```

Partial period guard: the cascade passes delta_end to the resample job, which drops any trailing incomplete period. The resampled dataset's coverage end will reflect only fully covered periods, and this is recorded in the artifact metadata.

Example cascade: syncing era5land_temperature_hourly_sle through 2024-03-31T23 triggers a resample job for era5land_temperature_daily_sle covering 2024-01-01 through 2024-03-31. If era5land_temperature_daily_normals_sle exists (from #56), a normals computation cascade fires next.


Dependency order

1. resample block in dataset template YAML schema + validation
2. resample_dataset computation service
3. POST /processes/resample endpoint
4. Post-sync cascade hook in sync_engine

Steps 1–3 are sequential. Step 4 can be built in parallel with step 3 once the service exists.


Relationship to #56

The ERA5 daily aggregation step described in #56 (Phase 1.4) is the first use case of this process. Once this issue is implemented, #56 Phase 1.4 becomes a dataset template declaration rather than a bespoke computation step.


Deferred

  • Downsampling (daily → hourly interpolation): not a climate use case and out of scope.
  • Custom period alignment (e.g. dekadal — 10-day periods used in agricultural meteorology): the period_type enum can be extended; the xarray resample frequency mapping handles the rest.
  • OGC API Processes compliance: the /processes/resample endpoint is a REST job for now, consistent with /processes/normals and /processes/anomaly from Climate normals and anomalies: computation, storage, and automatic cascade #56.
