Background
ERA5-Land tp (total precipitation) is stored as accumulated precipitation since the start of each day, not as incremental per-hour values. The accumulation resets at 00:00 UTC each day, and the 00:00 timestamp carries the final accumulated value from the previous day (a one-hour spillover).
Any consumer that sums or aggregates raw tp values (temporal resampling, climate normals, DHIS2 export) will produce incorrect results unless it first deaccumulates them. If this is left to each consumer, ERA5-specific encoding knowledge spreads throughout the codebase.
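To make the size of the error concrete, here is a small arithmetic sketch. It uses made-up numbers (steady rain of 1 unit per hour for one day) and plain Python lists rather than real ERA5-Land data. It only shows what happens when running totals are summed as if they were hourly amounts.

```python
# Illustrative values only: 1 unit of rain in each of the 24 hours of a day.
true_hourly = [1.0] * 24

# Accumulated-since-midnight encoding: the running total after each hour.
accumulated = []
running = 0.0
for amount in true_hourly:
    running += amount
    accumulated.append(running)

# Summing the running totals counts early hours many times over.
naive_daily_total = sum(accumulated)

# The day total is simply the final accumulated value.
correct_daily_total = accumulated[-1]

print(naive_daily_total, correct_daily_total)
```

In this toy case the naive sum is 300 units against a true daily total of 24, so the error is far larger than a rounding issue.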
Proposal
Apply deaccumulation once, at ingestion time, so the hourly GeoZarr store contains incremental (per-hour) precipitation from the start. Every downstream step then works with physically correct values and needs no special handling for ERA5 tp.
The three-step deaccumulation (as documented in the DHIS2 Climate Tools aggregation guide):
import xarray as xr

# ds is the xarray.Dataset holding the raw hourly download
# Step 1: shift values back one hour to undo the daily spillover
accum = ds["tp"].shift(valid_time=-1)
# Step 2: convert accumulated to incremental hourly values
incremental = accum.diff(dim="valid_time").reindex(valid_time=accum.valid_time)
# Step 3: at each day boundary (hour == 0) the diff is invalid; use raw accumulated value
is_reset = accum["valid_time"].dt.hour == 0
incremental = xr.where(is_reset, accum, incremental)
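One way to sanity-check the three steps is to run them on a synthetic series whose hourly amounts are known in advance. The sketch below is illustrative only: the helper name deaccumulate, the made-up rainfall values and the 1-D layout are assumptions, and the synthetic encoding follows the description in Background (running total since midnight, with each 00:00 timestamp holding the previous day's total).

```python
import numpy as np
import pandas as pd
import xarray as xr


def deaccumulate(tp: xr.DataArray, dim: str = "valid_time") -> xr.DataArray:
    """Apply the three steps above to one accumulated variable."""
    accum = tp.shift({dim: -1})                                   # Step 1
    incremental = accum.diff(dim=dim).reindex({dim: accum[dim]})  # Step 2
    is_reset = accum[dim].dt.hour == 0                            # Step 3
    return xr.where(is_reset, accum, incremental)


# Known rainfall for 48 consecutive hours (arbitrary units, made-up values).
true_hourly = (np.arange(48) % 5).astype(float)

# Encode it as a running total since midnight; each 00:00 timestamp holds the
# previous day's full total. One extra timestamp closes the second day.
times = pd.date_range("2024-01-01", periods=49, freq="h")
raw = np.zeros(49)  # raw[0] would hold the prior day's total; unknown here
for k in range(1, 49):
    day_start = ((k - 1) // 24) * 24  # first hour of the day being accumulated
    raw[k] = true_hourly[day_start:k].sum()

tp = xr.DataArray(raw, dims="valid_time", coords={"valid_time": times}, name="tp")
result = deaccumulate(tp)

# The first 48 values reproduce the known hourly amounts.
np.testing.assert_allclose(result.values[:48], true_hourly)
```

In this setup the final timestamp has no successor to shift from, so it comes out as NaN. A batch therefore needs the first raw value of the following hour to recover its last increment.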
Dataset template
Add an optional transform field to cache_info declaring a named post-download, pre-write transformation:
# data/datasets/era5_land.yaml
- id: era5land_precipitation_hourly
  variable: tp
  period_type: hourly
  cache_info:
    eo_function: dhis2eo.data.cds.era5_land.hourly.download
    transform: deaccumulate_era5_tp
deaccumulate_era5_tp is registered in processing/transforms.py. The field is absent for all other variables (temperature, humidity, etc.) and absent for all other dataset sources.
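A named transform could be resolved through a small registry. The sketch below is hypothetical: TRANSFORMS, register_transform and resolve_transform are illustrative names, not existing project APIs, and the transform body is a placeholder. It shows one way an absent transform field can mean "no transform" while an unknown name fails early.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry: transform name -> callable applied before the Zarr write.
TRANSFORMS: Dict[str, Callable] = {}


def register_transform(name: str) -> Callable[[Callable], Callable]:
    """Decorator that records a function under a template-facing name."""
    def decorator(func: Callable) -> Callable:
        if name in TRANSFORMS:
            raise ValueError(f"Transform {name!r} is already registered")
        TRANSFORMS[name] = func
        return func
    return decorator


@register_transform("deaccumulate_era5_tp")
def deaccumulate_era5_tp(ds):
    # Placeholder body: the real function would apply the three-step
    # deaccumulation to ds["tp"] and return the updated dataset.
    return ds


def resolve_transform(cache_info: dict) -> Optional[Callable]:
    """Return the callable named in cache_info, or None if the field is absent."""
    name = cache_info.get("transform")
    if name is None:
        return None
    if name not in TRANSFORMS:
        raise ValueError(f"Unknown transform {name!r}; known: {sorted(TRANSFORMS)}")
    return TRANSFORMS[name]
```

Validating the name when the template is parsed, rather than at download time, would surface a typo in the YAML before any data is fetched.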
Append boundary
The deaccumulation is a diff operation — the first hour of each new batch requires the last accumulated value from the previous batch as its anchor. When appending a new month:
- Before writing, read the last raw accumulated value for tp from the final timestamp of the existing Zarr store.
- Prepend it as the anchor for diff() at hour 0 of the new batch.
- Write only the new incremental values (not the anchor).
This makes the append path stateful with respect to the existing store, but the overlap is a single time step (one value per grid cell), not a full re-read.
If no existing store is present (first ingest), the first hourly value is set to the raw accumulated value at 00:00 (which equals the incremental value for that hour, since there is no prior accumulation to subtract).
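The anchor step can be sketched as a concatenate-then-diff. This is a minimal illustration under stated assumptions: diff_with_anchor is a hypothetical helper, the grid is a toy 1x2 array with made-up values, and the anchor (23:00) and the first new value (00:00) are assumed to belong to the same accumulation window, as in the spillover described in Background. Day-boundary resets later in the batch still need the handling from the three-step procedure.

```python
import numpy as np
import pandas as pd
import xarray as xr


def diff_with_anchor(
    anchor: xr.DataArray, new_raw: xr.DataArray, dim: str = "valid_time"
) -> xr.DataArray:
    """Difference new_raw along dim, treating anchor as the step before it."""
    combined = xr.concat([anchor, new_raw], dim=dim)
    # diff() labels each result with the later timestamp, so the anchor's own
    # label is dropped and the output lines up with new_raw.
    return combined.diff(dim=dim)


# Last raw accumulated slice of the previous batch (made-up values).
anchor = xr.DataArray(
    np.array([[2.0, 5.0]]),
    dims=("valid_time", "x"),
    coords={"valid_time": pd.to_datetime(["2024-01-31T23:00"])},
)

# First raw slice of the new batch: the 00:00 spillover value.
new_raw = xr.DataArray(
    np.array([[2.5, 5.0]]),
    dims=("valid_time", "x"),
    coords={"valid_time": pd.to_datetime(["2024-02-01T00:00"])},
)

first_increment = diff_with_anchor(anchor, new_raw)
print(first_increment.values)
```

Only one time slice is read from the existing store, and the returned array carries the new batch's timestamps, so the anchor itself is never written.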
Zarr metadata
Update the tp variable attributes after transformation:
ds["tp"].attrs["long_name"] = "Hourly total precipitation"
ds["tp"].attrs["cell_methods"] = "time: sum (interval: 1 hour)"
The original ERA5-Land cell_methods value ("time: sum" without interval) is ambiguous about the accumulation window; the corrected attribute makes the per-hour semantics explicit.
Scope
- processing/transforms.py: deaccumulate_era5_tp function
- data_manager/downloader.py: apply transform after download, before Zarr write; handle append boundary
- data/datasets/era5_land.yaml: add transform: deaccumulate_era5_tp to precipitation template
- data_registry/: parse and validate the new transform field