ERA5 precipitation: deaccumulate tp at ingestion time #58

@turban

Background

ERA5-Land tp (total precipitation) is stored as accumulated precipitation since the start of each day, not as incremental per-hour values. The accumulation resets at 00:00 UTC each day, and the 00:00 timestamp carries the final accumulated value from the previous day (a one-hour spillover).

Any consumer that sums or aggregates raw tp values (temporal resampling, climate normals, DHIS2 export) will produce incorrect results unless it first deaccumulates them. If this is left to each consumer, ERA5-specific encoding knowledge spreads throughout the codebase.
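
To make the size of the error concrete, here is a small sketch in plain Python. It assumes a hypothetical day with a constant 1 mm of rain per hour; the variable names are illustrative and not part of the codebase.

```python
# Hypothetical day: exactly 1.0 mm of precipitation in each of the 24 hours.
true_hourly = [1.0] * 24

# ERA5-Land style encoding: each timestamp holds the running total since the
# start of the day (01:00 -> 1.0, 02:00 -> 2.0, ..., next 00:00 -> 24.0).
accumulated = []
running = 0.0
for value in true_hourly:
    running += value
    accumulated.append(running)

naive_daily_total = sum(accumulated)    # what a consumer gets by summing raw tp
correct_daily_total = sum(true_hourly)  # what the day actually received

print(naive_daily_total, correct_daily_total)
```

Summing the raw values reports 300 mm for a day that received 24 mm, which is the kind of error every consumer would otherwise have to guard against.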

Proposal

Apply deaccumulation once, at ingestion time, so the hourly GeoZarr store contains incremental (per-hour) precipitation from the start. Every downstream step then works with physically correct values and needs no special handling for ERA5 tp.

The three-step deaccumulation (as documented in the DHIS2 Climate Tools aggregation guide):

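import xarray as xr

# ds: the downloaded ERA5-Land dataset (an xarray.Dataset with a valid_time dimension)

# Step 1: shift values back one hour to undo the daily spillover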
accum = ds["tp"].shift(valid_time=-1)

# Step 2: convert accumulated to incremental hourly values
incremental = accum.diff(dim="valid_time").reindex(valid_time=accum.valid_time)

# Step 3: at each day boundary (hour == 0) the diff spans the daily reset and is invalid;
# use the shifted accumulated value, which is already that hour's total
is_reset = accum["valid_time"].dt.hour == 0
incremental = xr.where(is_reset, accum, incremental)
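
As a sanity check on the three steps, the sketch below mirrors them in plain Python for a single grid cell and runs them on synthetic data with known hourly increments. The function name deaccumulate_1d and the synthetic series are assumptions made for illustration; the real transform would operate on the xarray dataset as shown above.

```python
from datetime import datetime, timedelta


def deaccumulate_1d(times, raw):
    """Plain-Python mirror of the three steps for one grid cell (toy sketch)."""
    # Step 1: shift values back one hour; the final timestamp has no successor.
    accum = raw[1:] + [None]

    incremental = []
    for i, (time, value) in enumerate(zip(times, accum)):
        if value is None:
            incremental.append(None)   # trailing hour: nothing to shift in
        elif time.hour == 0:
            incremental.append(value)  # Step 3: day boundary, keep accumulated value
        elif i == 0:
            incremental.append(None)   # series starts mid-day: no predecessor
        else:
            incremental.append(value - accum[i - 1])  # Step 2: hourly difference
    return incremental


# Synthetic input: 49 hourly timestamps covering two full days plus one 00:00.
start = datetime(2024, 1, 1)
times = [start + timedelta(hours=i) for i in range(49)]
true_hourly = [(i % 5) * 0.5 for i in range(48)]  # known per-hour increments

# Encode as "accumulated since start of day"; the first 00:00 carries the
# previous day's total (arbitrary here).
raw = [7.0]
for i in range(1, 49):
    day_start = ((i - 1) // 24) * 24
    raw.append(sum(true_hourly[day_start:i]))

recovered = deaccumulate_1d(times, raw)
```

Here recovered[:48] equals true_hourly, and the final entry is None because the value it needs lies one hour past the end of the series.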

Dataset template

Add an optional transform field to cache_info declaring a named post-download, pre-write transformation:

# data/datasets/era5_land.yaml
- id: era5land_precipitation_hourly
  variable: tp
  period_type: hourly
  cache_info:
    eo_function: dhis2eo.data.cds.era5_land.hourly.download
    transform: deaccumulate_era5_tp

deaccumulate_era5_tp is registered in processing/transforms.py. The field is absent for all other variables (temperature, humidity, etc.) and absent for all other dataset sources.
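
One possible shape for the registration and lookup, as a minimal sketch. The names TRANSFORMS, register_transform and apply_transform are hypothetical, and the body of deaccumulate_era5_tp is a placeholder; only the transform name itself comes from this proposal.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: maps the name used in the YAML template to a callable.
TRANSFORMS: Dict[str, Callable[[Any], Any]] = {}


def register_transform(name: str):
    """Decorator that registers a transform under the given template name."""
    def decorator(func):
        if name in TRANSFORMS:
            raise ValueError(f"Transform {name!r} is already registered")
        TRANSFORMS[name] = func
        return func
    return decorator


@register_transform("deaccumulate_era5_tp")
def deaccumulate_era5_tp(ds):
    # Placeholder body: the real function would apply the three-step
    # deaccumulation to ds["tp"] and return the updated dataset.
    return ds


def apply_transform(cache_info: dict, ds):
    """Apply the transform named in cache_info, or return ds unchanged."""
    name = cache_info.get("transform")
    if name is None:
        return ds  # field absent: temperature, humidity, other sources
    if name not in TRANSFORMS:
        raise ValueError(f"Unknown transform {name!r}")
    return TRANSFORMS[name](ds)
```

With this layout, datasets without a transform field pass through untouched, and a typo in the template fails loudly instead of silently skipping the deaccumulation.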

Append boundary

The deaccumulation is a diff operation — the first hour of each new batch requires the last accumulated value from the previous batch as its anchor. When appending a new month:

  1. Before writing, read the last raw accumulated value for tp from the final timestamp of the existing Zarr store.
  2. Prepend it as the anchor for diff() at hour 0 of the new batch.
  3. Write only the new incremental values (not the anchor).

This makes the append path stateful with respect to the existing store, but the overlap is a single time step (one value per grid cell), not a full re-read.

If no existing store is present (first ingest), the first hourly value is set to the accumulated value at 00:00 after the one-hour shift (which equals the incremental value for that hour, since there is no prior accumulation to subtract).
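
The anchor idea can be sketched in plain Python for a single grid cell. The function increments_with_anchor is hypothetical and works on the already shifted accumulated series; how the anchor is obtained from the store is outside the sketch. The split is placed mid-day purely to exercise the anchor; for a batch that starts at 00:00 the reset rule applies to its first hour.

```python
def increments_with_anchor(hours, accum, anchor=None):
    """Per-hour increments from shifted accumulated values (toy sketch).

    anchor is the last accumulated value of the previous batch, or None on
    first ingest. Without an anchor the first value is kept as is, which is
    only correct when the batch starts at a day boundary.
    """
    out = []
    prev = anchor
    for hour, value in zip(hours, accum):
        if hour == 0 or prev is None:
            out.append(value)         # daily reset or first ingest
        else:
            out.append(value - prev)  # ordinary hourly difference
        prev = value
    return out


# Two synthetic days of known increments, encoded as running daily totals.
hours = [h for _ in range(2) for h in range(24)]
true_hourly = [(i % 4) * 0.25 for i in range(48)]
accum = []
running = 0.0
for hour, inc in zip(hours, true_hourly):
    running = inc if hour == 0 else running + inc
    accum.append(running)

single_pass = increments_with_anchor(hours, accum)

# Same data in two batches; the second is anchored on the last accumulated
# value of the first, and only the new increments are returned.
split = 30
first = increments_with_anchor(hours[:split], accum[:split])
second = increments_with_anchor(hours[split:], accum[split:], anchor=accum[split - 1])
```

In this sketch first + second matches single_pass, while dropping the anchor would make the first value of the second batch a running total rather than an hourly increment.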

Zarr metadata

Update the tp variable attributes after transformation:

ds["tp"].attrs["long_name"] = "Hourly total precipitation"
ds["tp"].attrs["cell_methods"] = "time: sum (interval: 1 hour)"

The original ERA5-Land cell_methods value ("time: sum" without interval) is ambiguous about the accumulation window; the corrected attribute makes the per-hour semantics explicit.

Scope

  • processing/transforms.py — deaccumulate_era5_tp function
  • data_manager/downloader.py — apply transform after download, before Zarr write; handle append boundary
  • data/datasets/era5_land.yaml — add transform: deaccumulate_era5_tp to precipitation template
  • data_registry/ — parse and validate the new transform field
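
For the data_registry/ item, parse-time validation could look roughly like the sketch below. The function validate_transform_field and the KNOWN_TRANSFORMS set are hypothetical; the template dict is the parsed form of the YAML entry above, written inline so the example has no YAML dependency.

```python
from typing import Optional

# Hypothetical: in practice this would be derived from the transform registry.
KNOWN_TRANSFORMS = {"deaccumulate_era5_tp"}


def validate_transform_field(template: dict, known: set) -> Optional[str]:
    """Return the transform name declared in cache_info, or None if absent.

    Raises ValueError when the field is present but malformed or unknown.
    """
    cache_info = template.get("cache_info") or {}
    transform = cache_info.get("transform")
    if transform is None:
        return None  # the field is optional
    if not isinstance(transform, str):
        raise ValueError(
            f"{template.get('id')}: transform must be a string, "
            f"got {type(transform).__name__}"
        )
    if transform not in known:
        raise ValueError(f"{template.get('id')}: unknown transform {transform!r}")
    return transform


# Parsed equivalent of the precipitation template.
precipitation_template = {
    "id": "era5land_precipitation_hourly",
    "variable": "tp",
    "period_type": "hourly",
    "cache_info": {
        "eo_function": "dhis2eo.data.cds.era5_land.hourly.download",
        "transform": "deaccumulate_era5_tp",
    },
}

declared = validate_transform_field(precipitation_template, KNOWN_TRANSFORMS)
```

Validating at registry load time means a misspelled transform name surfaces before any download starts.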
