Background
Several derived dataset types require period type conversion before they can be computed:
- CHIRPS is daily. Monthly and weekly aggregates are needed for monthly normals, seasonal analysis, and import into DHIS2 monthly data elements.
- Any temporal aggregation pipeline (normals, anomalies, exposure indices) starts with a resampling step.
This issue defines resampling as a first-class process: a job that takes one GeoZarr as input, produces another GeoZarr at a coarser period type, and updates automatically when the upstream dataset is synced.
This is a prerequisite for #56 (climate normals and anomalies), which depends on daily ERA5 and potentially monthly CHIRPS datasets.
Design decisions
Output is a separate dataset, not an in-place transformation. A resampled dataset has a different period type than its source, which means a different dataset ID, different chunk shape, and different sync behaviour. Modifying the source store in place is not possible without violating the one-period-type-per-dataset constraint.
Only complete periods are written. If the source data ends mid-week, the weekly output truncates to the last complete week. This means a resampled dataset may lag the source by up to one output period. This is explicit and documented in the dataset metadata, not a silent omission.
Aggregation method is declared per dataset template. Each climate variable has a correct aggregation method: temperature uses mean (or min/max), accumulated precipitation uses sum. Declaring this in the template rather than at request time prevents incorrect aggregations from being applied silently.
Dataset template design
A new resample block in the YAML template declares the source and aggregation:
```yaml
- id: era5land_temperature_daily
  name: ERA5-Land 2m temperature (daily mean)
  variable: t2m
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: era5land_temperature_hourly
    method: mean

- id: era5land_precipitation_daily
  name: ERA5-Land total precipitation (daily sum)
  variable: tp
  period_type: daily
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: era5land_precipitation_hourly
    method: sum

- id: chirps3_precipitation_weekly
  name: CHIRPS3 total precipitation (weekly sum)
  variable: precip
  period_type: weekly
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: chirps3_precipitation_daily
    method: sum
    week_start: monday  # ISO default; configurable

- id: chirps3_precipitation_monthly
  name: CHIRPS3 total precipitation (monthly sum)
  variable: precip
  period_type: monthly
  sync_kind: temporal
  sync_execution: append
  resample:
    source_dataset_id: chirps3_precipitation_daily
    method: sum
```
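A minimal validation sketch for the `resample` block, assuming templates are parsed into plain dicts. The helper name and return convention are illustrative, not part of the existing codebase:

```python
# Hypothetical template validation: a `resample` block must name a known
# source dataset and one of the supported aggregation methods.
ALLOWED_METHODS = {"mean", "sum", "min", "max", "first", "last"}


def validate_resample_block(template: dict, known_dataset_ids: set) -> list:
    """Return a list of validation errors (empty if the block is valid)."""
    errors = []
    block = template.get("resample")
    if block is None:
        return errors  # not a resampled dataset; nothing to check
    if block.get("source_dataset_id") not in known_dataset_ids:
        errors.append(f"unknown source_dataset_id: {block.get('source_dataset_id')!r}")
    if block.get("method") not in ALLOWED_METHODS:
        errors.append(f"unsupported method: {block.get('method')!r}")
    return errors
```

Running this at template load time surfaces misconfigured blocks before any resample job is scheduled.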
Supported aggregation methods
| Method | Use case |
| --- | --- |
| mean | Temperature, humidity, wind speed |
| sum | Precipitation, radiation (accumulated variables) |
| min | Daily minimum temperature |
| max | Daily maximum temperature |
| first | Snapshot or categorical variables |
| last | Snapshot or categorical variables |
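The distinction matters in practice: applying `sum` to an instantaneous variable (or `mean` to an accumulated one) silently produces wrong values. A quick pandas illustration with constant hourly temperature:

```python
import pandas as pd

# 48 hourly temperature readings at a constant 20 degrees C.
times = pd.date_range("2024-01-01", periods=48, freq="h")
t2m = pd.Series(20.0, index=times)

daily_mean = t2m.resample("D").mean()  # correct for temperature: 20.0 per day
daily_sum = t2m.resample("D").sum()    # wrong for temperature: 480.0 per day
```

Both calls succeed without warning, which is exactly why the method is declared once in the template instead of chosen per request.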
Computation
New src/climate_api/processing/resample.py:
```python
import xarray as xr


def resample_dataset(
    source_zarr_path: str,
    *,
    target_period_type: str,
    method: str,
    start: str,
    end: str,
) -> xr.Dataset:
    ds = xr.open_zarr(source_zarr_path).sel(time=slice(start, end))
    freq = _period_type_to_pandas_freq(target_period_type)  # "1D", "1W-MON", "1ME", "1YE"
    resampler = ds.resample(time=freq)
    result = getattr(resampler, method)()
    # Drop the incomplete final period if the source data ends mid-period
    return _drop_incomplete_trailing_period(result, source_end=end)
```
_drop_incomplete_trailing_period compares the last output time step against the expected end of that period. If the source data does not cover the full period, the last output step is dropped.
New endpoint: POST /processes/resample
The endpoint writes the output GeoZarr and registers it as a published artifact under the resampled dataset ID. On subsequent calls for overlapping ranges it appends only the missing periods.
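The append-only behaviour reduces to a small pure function. This is a hypothetical helper (names and date-string convention are illustrative): given the existing coverage end and a requested range, only the part after the coverage end needs to be computed:

```python
import pandas as pd


def periods_to_append(existing_end, requested_start, requested_end):
    """Return the (start, end) range still missing from the output store.

    `existing_end` is the last date already covered by the resampled
    dataset, or None if the store is empty. Returns None when the
    requested range is already fully covered.
    """
    if existing_end is None:
        return requested_start, requested_end  # empty store: compute everything
    next_needed = pd.Timestamp(existing_end) + pd.Timedelta(days=1)
    if next_needed > pd.Timestamp(requested_end):
        return None  # nothing new to append
    start = max(next_needed, pd.Timestamp(requested_start))
    return start.strftime("%Y-%m-%d"), requested_end
```

Keeping this logic separate from the Zarr write makes the idempotence of repeated overlapping calls easy to test.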
Automatic cascade on sync
When run_sync completes with status="completed" for a source dataset, the sync engine checks the dataset registry for any resampled datasets that declare that dataset as their source_dataset_id. If found, it triggers a resample job for the new delta range.
```python
# In sync_engine.run_sync, after the artifact is stored:
_trigger_downstream_resample_jobs(
    source_dataset_id=managed_dataset_id,
    delta_start=sync_detail.delta_start,
    delta_end=sync_detail.delta_end,
)
```
Partial period guard: the cascade passes delta_end to the resample job, which drops any trailing incomplete period. The resampled dataset's coverage end will reflect only fully covered periods, and this is recorded in the artifact metadata.
Example cascade: syncing era5land_temperature_hourly_sle through 2024-03-31T23 triggers a resample job for era5land_temperature_daily_sle covering 2024-01-01–2024-03-31. If era5land_temperature_daily_normals_sle exists (from #56), a normals computation cascade fires next.
Dependency order
1. resample block in dataset template YAML schema + validation
2. resample_dataset computation service
3. POST /processes/resample endpoint
4. Post-sync cascade hook in sync_engine
Steps 1–3 are sequential. Step 4 can be built in parallel with step 3 once the service exists.
Relationship to #56
The ERA5 daily aggregation step described in #56 (Phase 1.4) is the first use case of this process. Once this issue is implemented, #56 Phase 1.4 becomes a dataset template declaration rather than a bespoke computation step.
Deferred
- Upsampling (daily → hourly interpolation): not a climate use case and out of scope.
- Custom period alignment (e.g. dekadal, the 10-day periods used in agricultural meteorology): the period_type enum can be extended; the xarray resample frequency mapping handles the rest.
- Job orchestration: the /processes/resample endpoint is a REST job for now, consistent with /processes/normals and /processes/anomaly from #56.