---
title: Aggregate data from NetCDF to DHIS2 organisation units
short_title: Aggregate data from CICERO
---

In this notebook we will show how to load daily air pollution data from NetCDF using earthkit and aggregate the data to DHIS2 organisation units.

In [1]:
import earthkit.data
from earthkit.transforms import aggregate
from dhis2eo.integrations.pandas import dataframe_to_dhis2_json

Load a NetCDF file using earthkit.

In [2]:
file = "data/pm_final_srilanka_linearp.nc"
data = earthkit.data.from_source("file", file)

To work with and display the contents of the dataset more easily, we can convert it to an xarray dataset. The output shows that the file has three dimensions (latitude, longitude and time) and one data variable.

In [3]:
data_array = data.to_xarray()
data_array

Unnamed: 0,Array,Chunk
Bytes,2.11 GiB,2.11 GiB
Shape,"(1401, 450, 450)","(1401, 450, 450)"
Dask graph,1 chunks in 2 graph layers,1 chunks in 2 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray

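As a quick sanity check, the reported shape and size can be reproduced with simple arithmetic. The sketch below assumes that the first axis is the daily time axis and uses the first and last dates that appear in the aggregated table later in this notebook.

```python
from datetime import date

# Shape reported in the output above; treating the first axis as time is an assumption.
shape = (1401, 450, 450)
bytes_per_value = 8  # float64 uses 8 bytes per value

# Total number of grid values and their size in GiB.
n_values = shape[0] * shape[1] * shape[2]
size_gib = n_values * bytes_per_value / 2**30

# Number of daily steps between the first and last date, both inclusive.
n_days = (date(2023, 12, 31) - date(2020, 3, 1)).days + 1

print(f"{n_values} values, {size_gib:.2f} GiB, {n_days} days")
```

The day count matches the length of the first axis, and the size agrees with the 2.11 GiB shown in the output.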

Earthkit can also load the organisation units from DHIS2 that we [saved as a GeoJSON file](organization-units).

In [4]:
district_file = "data/sri-lanka-provinces.geojson"
features = earthkit.data.from_source("file", district_file)

We can display the features to see the information we have for each org unit. For the aggregation, we are particularly interested in the id and the geometry (polygon) of each org unit.

In [5]:
features

Unnamed: 0,shapeName,shapeISO,shapeID,shapeGroup,shapeType,geometry
0,Northern Province,LK-4,99731895B93054189817547,LKA,ADM1,"MULTIPOLYGON (((79.91381 8.94183, 79.91835 8.9..."
1,Eastern Province,LK-5,99731895B51072878877669,LKA,ADM1,"POLYGON ((80.75304 8.90515, 80.78104 8.91667, ..."
2,Central Province,LK-2,99731895B28050807675820,LKA,ADM1,"POLYGON ((80.98913 7.72169, 80.98026 7.7161, 8..."
3,North Central Province,LK-7,99731895B66209916164902,LKA,ADM1,"POLYGON ((80.03237 8.52721, 80.0428 8.50005, 8..."
4,North Western Province,LK-6,99731895B5260290378804,LKA,ADM1,"MULTIPOLYGON (((79.77994 8.26209, 79.78165 8.2..."
5,Sabaragamuwa Province,LK-9,99731895B16804022686405,LKA,ADM1,"POLYGON ((80.42215 7.35518, 80.41441 7.35669, ..."
6,Southern Province,LK-3,99731895B60977758614393,LKA,ADM1,"POLYGON ((81.60831 6.58004, 81.60487 6.5688, 8..."
7,Uva Province,LK-8,99731895B77537271836321,LKA,ADM1,"POLYGON ((80.97791 7.62496, 80.97926 7.61174, ..."
8,Western Province,LK-1,99731895B46623695729728,LKA,ADM1,"POLYGON ((79.84202 7.27339, 79.84265 7.26526, ..."


To aggregate the data to the org unit features we use the aggregate package of [earthkit-transforms](https://earthkit-transforms.readthedocs.io). We keep the daily period type and only aggregate the data spatially to the org unit features. `mask_dim` names the dimension (org unit id) that is created when the spatial dimensions (longitude/latitude grid) are reduced.

In [6]:
agg_data = aggregate.spatial.reduce(data, features, mask_dim="id")
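
Conceptually, a spatial reduction selects the grid cells that fall inside each org unit polygon and summarises them into one value per time step. The sketch below illustrates that idea in plain Python on a toy 3x3 grid, assuming a plain mean over the cells whose centre lies inside the polygon. The helpers `point_in_polygon` and `masked_mean`, the grid values and the square polygon are all made up for illustration. This is not how earthkit-transforms is implemented, and the library may select or weight cells differently.

```python
from statistics import mean


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def masked_mean(lons, lats, values, polygon):
    """Mean of values[i][j] over cells whose centre (lons[j], lats[i]) is inside."""
    selected = [
        values[i][j]
        for i, lat in enumerate(lats)
        for j, lon in enumerate(lons)
        if point_in_polygon(lon, lat, polygon)
    ]
    return mean(selected) if selected else None


# Toy grid for a single time step (made-up values, not from the dataset).
lons = [80.0, 80.5, 81.0]
lats = [7.0, 7.5, 8.0]
values = [
    [10.0, 20.0, 30.0],
    [40.0, 50.0, 60.0],
    [70.0, 80.0, 90.0],
]

# Hypothetical square org unit that contains four of the nine cell centres.
square = [(79.9, 6.9), (80.6, 6.9), (80.6, 7.6), (79.9, 7.6)]
org_unit_mean = masked_mean(lons, lats, values, square)
print(org_unit_mean)
```

Repeating this for every org unit and every time step gives one value per (id, time) pair, which is the shape of the result produced by the aggregation above.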

The aggregated data is returned as an xarray dataset with two dimensions (`id` and `time`) and the same data variable.

In [7]:
agg_data

The `id` dimension represents the org unit and `time` the period. The data variable is named `__xarray_dataarray_variable__`, and we can select it by name:

In [8]:
dataArray = agg_data['__xarray_dataarray_variable__']

To more easily work with tabular aggregated data, we convert the results to a pandas.DataFrame and inspect the results:

In [9]:
agg_df = agg_data.to_dataframe().reset_index()
agg_df

Unnamed: 0,time,id,__xarray_dataarray_variable__
0,2020-03-01,0,34.554021
1,2020-03-01,1,26.434494
2,2020-03-01,2,22.862232
3,2020-03-01,3,33.436081
4,2020-03-01,4,33.294359
...,...,...,...
12604,2023-12-31,4,23.459132
12605,2023-12-31,5,25.993425
12606,2023-12-31,6,25.063835
12607,2023-12-31,7,23.114059
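
The `id` column holds integers rather than the identifiers from the features table. If the values need to be labelled with an identifier column such as `shapeID`, one option is a simple lookup. The sketch below assumes that the integer ids follow the row order of the features table shown earlier. That assumption should be verified before relying on it, and the identifiers needed for an import are whatever ids your DHIS2 instance uses for these org units.

```python
# shapeID values of the first three rows of the features table shown earlier.
# Assumption: the integer `id` values follow the row order of that table.
shape_ids = [
    "99731895B93054189817547",  # row 0: Northern Province
    "99731895B51072878877669",  # row 1: Eastern Province
    "99731895B28050807675820",  # row 2: Central Province
]
id_to_shape_id = dict(enumerate(shape_ids))

# First three rows of the aggregated table as (time, id, value) tuples.
rows = [
    ("2020-03-01", 0, 34.554021),
    ("2020-03-01", 1, 26.434494),
    ("2020-03-01", 2, 22.862232),
]

# Replace each integer id with the matching identifier.
relabelled = [(time, id_to_shape_id[i], value) for time, i, value in rows]
print(relabelled[0])
```

With pandas, the same lookup could be applied to the whole column with `agg_df['id'].map(id_to_shape_id)`, given a dictionary that covers all nine org units.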


Two decimals are sufficient for our use, so we round all the air pollution values:

In [10]:
agg_df['__xarray_dataarray_variable__'] = agg_df['__xarray_dataarray_variable__'].astype('float64').round(decimals=2)
agg_df

Unnamed: 0,time,id,__xarray_dataarray_variable__
0,2020-03-01,0,34.55
1,2020-03-01,1,26.43
2,2020-03-01,2,22.86
3,2020-03-01,3,33.44
4,2020-03-01,4,33.29
...,...,...,...
12604,2023-12-31,4,23.46
12605,2023-12-31,5,25.99
12606,2023-12-31,6,25.06
12607,2023-12-31,7,23.11


Use the `dhis2eo` utility function `dataframe_to_dhis2_json` to translate the `pandas.DataFrame` into the JSON structure used by the DHIS2 Web API:

In [11]:
json_dict = dataframe_to_dhis2_json(
    df = agg_df,                                 # aggregated pandas.DataFrame
    org_unit_col = 'id',                         # column containing the org unit id
    period_col = 'time',                         # column containing the period
    value_col = '__xarray_dataarray_variable__', # column containing the value
    data_element_id = 'abc123'                   # id of the DHIS2 data element
)

We can display the first 3 items to see that we have one value for each org unit and period combination.

In [12]:
json_dict['dataValues'][:3]

[{'orgUnit': 0, 'period': '20200301', 'value': 34.55, 'dataElement': 'abc123'},
 {'orgUnit': 1, 'period': '20200301', 'value': 26.43, 'dataElement': 'abc123'},
 {'orgUnit': 2, 'period': '20200301', 'value': 22.86, 'dataElement': 'abc123'}]
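
To make the structure of the output above explicit, the sketch below builds the same three items by hand. The helper `rows_to_data_values` is hypothetical and only mirrors the keys and the daily period format (`YYYYMMDD`) visible in the output. It is not the `dhis2eo` implementation.

```python
from datetime import date


def rows_to_data_values(rows, data_element_id):
    """Build a payload like the one above from (day, org_unit, value) tuples."""
    return {
        "dataValues": [
            {
                "orgUnit": org_unit,
                "period": day.strftime("%Y%m%d"),  # daily period, e.g. 20200301
                "value": value,
                "dataElement": data_element_id,
            }
            for day, org_unit, value in rows
        ]
    }


# The first three rows of the rounded table.
rows = [
    (date(2020, 3, 1), 0, 34.55),
    (date(2020, 3, 1), 1, 26.43),
    (date(2020, 3, 1), 2, 22.86),
]
payload = rows_to_data_values(rows, "abc123")
print(payload["dataValues"][0])
```

Each item carries exactly one value for one org unit, one period and one data element, which is why the list has as many items as the table has rows.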

At this point we have successfully aggregated the air pollution data and converted it to a JSON format that can be used by DHIS2. To learn how to import this JSON data into DHIS2, see [our guide for uploading data values using the Python DHIS2 client](../import-data/using-python-client.ipynb).