
Temporal resampling: period type conversion for GeoZarr datasets (hourly→daily, daily→weekly, daily→monthly) #57

@turban


Background

Several derived dataset types require period type conversion before they can be computed:

  • ERA5-Land is ingested as hourly data. Daily mean/min/max temperature and daily total precipitation are far more useful for DHIS2 health applications and are required before climate normals can be computed (see Climate normals and anomalies: computation, storage, and automatic cascade #56).
  • CHIRPS is daily. Monthly and weekly aggregates are needed for monthly normals, seasonal analysis, and import into DHIS2 monthly data elements.
  • Any temporal aggregation pipeline (normals, anomalies, exposure indices) starts with a resampling step.

This issue defines resampling as a first-class process: a job that takes one GeoZarr as input, produces another GeoZarr at a coarser period type, and updates automatically when the upstream dataset is synced.

This is a prerequisite for #56 (climate normals and anomalies), which depends on daily ERA5 and potentially monthly CHIRPS datasets.


Design decisions

Output is a separate dataset, not an in-place transformation. A resampled dataset has a different period type than its source, which means a different dataset ID, different chunk shape, and different sync behaviour. Modifying the source store in place is not possible without violating the one-period-type-per-dataset constraint.

Only complete periods are written. If the source data ends mid-week, the weekly output truncates to the last complete week. This means a resampled dataset may lag the source by up to one output period. This is explicit and documented in the dataset metadata, not a silent omission.

Aggregation method is declared per dataset template. Each climate variable has a correct aggregation method: temperature uses mean (or min/max), accumulated precipitation uses sum. Declaring this in the template rather than at request time prevents incorrect aggregations from being applied silently.


Dataset template design

A new resample block in the YAML template declares the source and aggregation:

```yaml
- id: era5land_temperature_daily
  name: ERA5-Land 2m temperature (daily mean)
  variable: t2m
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: era5land_temperature_hourly
    method: mean

- id: era5land_precipitation_daily
  name: ERA5-Land total precipitation (daily sum)
  variable: tp
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: era5land_precipitation_hourly
    method: sum

- id: chirps3_precipitation_weekly
  name: CHIRPS3 total precipitation (weekly sum)
  variable: precip
  period_type: weekly
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: chirps3_precipitation_daily
    method: sum
    week_start: monday            # ISO default; configurable

- id: chirps3_precipitation_monthly
  name: CHIRPS3 total precipitation (monthly sum)
  variable: precip
  period_type: monthly
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: chirps3_precipitation_daily
    method: sum
```
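
The shape of the `resample` block can be enforced when templates are loaded, so a bad declaration fails at startup rather than at job time. A minimal validation sketch (the helper name and dict-based template representation are assumptions, not the actual loader):

```python
# Hypothetical template validation for the resample block. The real schema
# layer may use pydantic or similar; this shows only the invariants.
ALLOWED_METHODS = {"mean", "sum", "min", "max", "first", "last"}


def validate_resample_block(template: dict) -> None:
    """Raise ValueError if a template's resample block is malformed."""
    resample = template.get("resample")
    if resample is None:
        return  # not a resampled dataset; nothing to check
    if "source_dataset_id" not in resample:
        raise ValueError(f"{template['id']}: resample.source_dataset_id is required")
    method = resample.get("method")
    if method not in ALLOWED_METHODS:
        raise ValueError(f"{template['id']}: unknown aggregation method {method!r}")
    if template.get("period_type") == "weekly":
        # week_start is only meaningful for weekly output (assumed values)
        week_start = resample.get("week_start", "monday")
        if week_start not in {"monday", "sunday"}:
            raise ValueError(f"{template['id']}: unsupported week_start {week_start!r}")
```
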

Supported aggregation methods

| Method  | Use case                                        |
|---------|-------------------------------------------------|
| `mean`  | Temperature, humidity, wind speed               |
| `sum`   | Precipitation, radiation (accumulated variables) |
| `min`   | Daily minimum temperature                       |
| `max`   | Daily maximum temperature                       |
| `first` | Snapshot or categorical variables               |
| `last`  | Snapshot or categorical variables               |
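
Each method name corresponds directly to a reducer on xarray's resample object, which is what makes table-driven dispatch possible. A self-contained demonstration on synthetic data (not the service code):

```python
import numpy as np
import pandas as pd
import xarray as xr

# 48 hourly steps of constant 1.0 precipitation over two days.
times = pd.date_range("2024-01-01", periods=48, freq="h")
ds = xr.Dataset({"tp": ("time", np.ones(48))}, coords={"time": times})

# "sum" and "mean" are real methods on the resample object, so the declared
# method name can be dispatched with getattr.
daily_sum = getattr(ds.resample(time="1D"), "sum")()
daily_mean = getattr(ds.resample(time="1D"), "mean")()

assert daily_sum.sizes["time"] == 2
assert float(daily_sum["tp"][0]) == 24.0   # 24 hourly values of 1.0
assert float(daily_mean["tp"][0]) == 1.0
```
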

Computation

New `src/climate_api/processing/resample.py`:

```python
import xarray as xr


def resample_dataset(
    source_zarr_path: str,
    *,
    target_period_type: str,
    method: str,
    start: str,
    end: str,
) -> xr.Dataset:
    ds = xr.open_zarr(source_zarr_path).sel(time=slice(start, end))
    freq = _period_type_to_pandas_freq(target_period_type)  # "1D", "1W-MON", "1ME", "1YE"
    resampler = ds.resample(time=freq)
    result = getattr(resampler, method)()
    # Drop incomplete final period if source data ends mid-period
    return _drop_incomplete_trailing_period(result, source_end=end)
```

`_drop_incomplete_trailing_period` compares the last output time step against the expected end of that period. If the source data does not cover the full period, the last output step is dropped.


New endpoint: POST /processes/resample

```
POST /processes/resample
{
  "source_dataset_id": "era5land_temperature_hourly_sle",
  "target_period_type": "daily",
  "method": "mean",
  "start": "2024-01-01",
  "end": "2024-12-31"
}
```

Returns `202 Accepted` with `{ "job_id": "..." }`.

The endpoint writes the output GeoZarr and registers it as a published artifact under the resampled dataset ID. On subsequent calls for overlapping ranges it appends only the missing periods.
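
The append-only behaviour amounts to intersecting the requested range with the coverage already written. A sketch of that delta computation, with illustrative names and daily granularity assumed (the actual service would work at the output period granularity):

```python
import pandas as pd


def missing_range(requested_start: str, requested_end: str, existing_end):
    """Return the (start, end) still to compute, or None if fully covered."""
    if existing_end is None:
        # Nothing written yet: compute the whole requested range.
        return (requested_start, requested_end)
    nxt = pd.Timestamp(existing_end) + pd.Timedelta(days=1)
    if nxt > pd.Timestamp(requested_end):
        return None  # requested range already fully covered
    return (str(nxt.date()), requested_end)
```
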


Automatic cascade on sync

When run_sync completes with status="completed" for a source dataset, the sync engine checks the dataset registry for any resampled datasets that declare that dataset as their source_dataset_id. If found, it triggers a resample job for the new delta range.

```python
# In sync_engine.run_sync, after artifact is stored:
_trigger_downstream_resample_jobs(
    source_dataset_id=managed_dataset_id,
    delta_start=sync_detail.delta_start,
    delta_end=sync_detail.delta_end,
)
```

Partial period guard: the cascade passes delta_end to the resample job, which drops any trailing incomplete period. The resampled dataset's coverage end will reflect only fully covered periods, and this is recorded in the artifact metadata.

Example cascade: syncing era5land_temperature_hourly_sle through 2024-03-31T23 triggers a resample job for era5land_temperature_daily_sle covering 2024-01-01 through 2024-03-31. If era5land_temperature_daily_normals_sle exists (from #56), a normals computation cascade fires next.


Dependency order

1. resample block in dataset template YAML schema + validation
2. resample_dataset computation service
3. POST /processes/resample endpoint
4. Post-sync cascade hook in sync_engine

Steps 1–3 are sequential. Step 4 can be built in parallel with step 3 once the service exists.


Relationship to #56

The ERA5 daily aggregation step described in #56 (Phase 1.4) is the first use case of this process. Once this issue is implemented, #56 Phase 1.4 becomes a dataset template declaration rather than a bespoke computation step.


Deferred

  • Downsampling (daily → hourly interpolation): not a climate use case and out of scope.
  • Custom period alignment (e.g. dekadal — 10-day periods used in agricultural meteorology): the period_type enum can be extended; the xarray resample frequency mapping handles the rest.
  • OGC API Processes compliance: the /processes/resample endpoint is a REST job for now, consistent with /processes/normals and /processes/anomaly from Climate normals and anomalies: computation, storage, and automatic cascade #56.
