---
title: Automated Data Import from Climate Data Store
short_title: Automated Data Import
---

In this notebook we will demonstrate a full workflow for how we can use Climate Tools to automate regularly downloading data from the [Climate Data Store (CDS)](https://cds.climate.copernicus.eu/datasets), aggregating to DHIS2 organisation units, and importing the aggregated climate data back to DHIS2. 

For our example we will connect to a local DHIS2 instance containing the Sierra Leone demo database, setup new data elements for daily Temperature data and daily Total precipitation data, and show how to use `dhis2eo` to download, and import/update DHIS2 with the latest [daily data from the Climate Data Store ERA5 dataset](https://cds.climate.copernicus.eu/datasets/derived-era5-single-levels-daily-statistics?tab=download). 

----------------------------------------
## Requirements

In order to run this notebook, you first need to connect to an instance of DHIS2. For our example, we will connect to a local instance of DHIS2 containing the standard Sierra Leone demo database, but you should be able to switch out the instance url and credentials to work directly with your own database. 

In [1]:
from dhis2_client import DHIS2Client
from dhis2_client.settings import ClientSettings

# Create DHIS2 client connection
cfg = ClientSettings(
  base_url="http://localhost:8080",
  username="admin",
  password="district"
)
client = DHIS2Client(settings=cfg)

# Verify connection
info = client.get_system_info()
print("Current DHIS2 version:", info["version"])

Current DHIS2 version: 2.42.2


We also need to create the data elements for importing data into. If you haven't already created your data elements manually, you can follow the steps below to create the data element using the `python-dhis2-client`.

First create the temperature data element: 

In [2]:
data_element = {
    "name": "2m Temperature (ERA5)",
    "shortName": "Temperature (ERA5)",
    "valueType": "NUMBER",
    "aggregationType": "AVERAGE",
    "domainType": "AGGREGATE"
}
temperature_de = client.create_data_element(data_element)
print(f"Data element creation status: {temperature_de['status']} and UID: {temperature_de['response']['uid']}")

Data element creation status: OK and UID: gPPVvS6u23w


Next, create the total precipitation data element: 

In [3]:
data_element = {
    "name": "Total precipitation (ERA5)",
    "shortName": "Total precipitation (ERA5)",
    "valueType": "NUMBER",
    "aggregationType": "SUM",
    "domainType": "AGGREGATE"
}
precipitation_de = client.create_data_element(data_element)
print(f"Data element creation status: {precipitation_de['status']} and UID: {precipitation_de['response']['uid']}")

Data element creation status: OK and UID: i9W7DhW60kK


Since we plan to import daily data values, we also create and assign our data element to a new dataset for climate variables with `Daily` period type:

In [4]:
data_set = {
    "name": "Daily climate data", 
    "shortName": "Daily climate data",
    "periodType": "Daily",
    "dataSetElements": [
        {
            "dataElement": {"id": temperature_de['response']['uid']},
            "dataElement": {"id": precipitation_de['response']['uid']}
        }
    ]
}

data_set_response = client.create_data_set(data_set)
print(f"Data set creation status: {data_set_response['status']} and UID: {data_set_response['response']['uid']}")

Data set creation status: OK and UID: hAegfkyGjuu


--------------------------------------------
## Downloading CDS data for a given month

For downloading data from the Climate Data Store (CDS), we will demonstrate step-by-step how to use the `dhsi2eo` package to programmatically retreive the data from the CDS API. 

In [2]:
import dhis2eo
import dhis2eo.org_units
import dhis2eo.data.cds

### Prerequisites

#### 1. Authenticate with your ECMWF user

Before you can download the dataset programmatically, you need to [create an ECMWF user](https://www.ecmwf.int/user/login), and authenticate using your user credentials:

- Go to the [CDSAPI Setup page](https://cds.climate.copernicus.eu/how-to-api) and make sure to login.
- Once logged in, scroll down to the section "Setup the CDS API personal access token". 
  - This should show your login credentials, and look something like this:

        url: https://cds.climate.copernicus.eu/api
        key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

- Copy those two lines to a file `.cdsapirc` in your user's $HOME directory.

#### 2. Accept the dataset license

ECMWF requires that you manually accept the user license for each dataset that you download. 

- Start by visiting the Download page of the dataset we are interested in: ["ERA5 post-processed daily statistics on single levels from 1940 to present"](https://cds.climate.copernicus.eu/datasets/derived-era5-single-levels-daily-statistics?tab=download). 
- Scroll down until you get to the "Terms of Use" section.
- Click the button to accept and login with your user if you haven't already. 

### Retrieving organisation units

Before we can download the data, we first need to load our organisation units in order to limit which region to download data for.

First we retrieve the organisation units as a GeoJSON dict from the `dhis2-python-client`: 

In [3]:
org_units_geojson = client.get_org_units_geojson(level=2)

Next, load this GeoJSON dict as a `geopandas.GeoDataFrame` by using the `dhis2eo.org_units` module. This makes it easier work with the organisation units for later steps: 

In [4]:
org_units = dhis2eo.org_units.from_dhis2_geojson(org_units_geojson)
print(org_units)

    org_unit_id          name  \
0   O6uvpzGd5pu            Bo   
1   fdc6uOvgoji       Bombali   
2   lc3eMKXaEfw        Bonthe   
3   jUb8gELQApl      Kailahun   
4   PMa2VCrupOd        Kambia   
5   kJq2mPyFEHo        Kenema   
6   qhqAxPSTUXp     Koinadugu   
7   Vth0fbpFcsO          Kono   
8   jmIPBj66vD6       Moyamba   
9   TEQlaapDQoK     Port Loko   
10  bL4ooGhyHRQ       Pujehun   
11  eIQbndfxQMb     Tonkolili   
12  at6UHUQatSo  Western Area   

                                             geometry  
0   POLYGON ((-11.5914 8.4875, -11.5906 8.4769, -1...  
1   POLYGON ((-11.8091 9.2032, -11.8102 9.1944, -1...  
2   MULTIPOLYGON (((-12.5568 7.3832, -12.5574 7.38...  
3   POLYGON ((-10.7972 7.5866, -10.8002 7.5878, -1...  
4   MULTIPOLYGON (((-13.1349 8.8471, -13.1343 8.84...  
5   POLYGON ((-11.3596 8.5317, -11.3513 8.5234, -1...  
6   POLYGON ((-10.585 9.0434, -10.5877 9.0432, -10...  
7   POLYGON ((-10.585 9.0434, -10.5848 9.0432, -10...  
8   MULTIPOLYGON (((-12.6351 7.66

### Downloading daily ERA5 data

In order to get users started, we provide a convenience function for downloading the most commonly requested climate variables from the [ERA5 post-processed daily statistics on single levels from 1940 to present](https://cds.climate.copernicus.eu/datasets/derived-era5-single-levels-daily-statistics). 

Simply provide the year, month, and org_units you want to download for. The region to download data for is automatically calculated from the provided organisation units:

In [7]:
data = dhis2eo.data.cds.get_daily_era5_data(2021, 1, org_units)

dhis2eo.data.cds - INFO - Loading from cache: C:\Users\karimba\AppData\Local\Temp\cds_daily-era5_params-ca5bab_region-37098a_2021-01.nc


Since data downloads can be slow, this function also caches the download results and reuses it if the file has already been downloaded. Calling it again will be much faster by loading directly from the cache: 

In [6]:
data = dhis2eo.data.cds.get_daily_era5_data(2021, 1, org_units)

dhis2eo.data.cds - INFO - Loading from cache: C:\Users\karimba\AppData\Local\Temp\cds_daily-era5_params-ca5bab_region-37098a_2021-01.nc


Note: that since we are working with a daily updated dataset, the function only caches downloads for past months, but not for the current month. 

THIS IS HOW FAR IVE REACHED...............

----------------------------------------------

## Aggregating the data to organisation units

The next step is creating a generic function that aggregates the data downloaded from the previous step to a set of input organisation units:

In [49]:
def aggregate(data, org_units, id_col):
    from earthkit.transforms import aggregate
    # aggregate to org unit for each time period
    agg_data = aggregate.spatial.reduce(data, org_units, mask_dim=id_col)
    # convert to dataframe
    agg_df = agg_data.to_dataframe().reset_index()
    # return
    return agg_df

Let's try it for our previously downloaded test data:

In [62]:
agg = aggregate(data, org_units, id_col='org_unit_id')
agg['t2m'] -= 273.15 # convert to celsius
print(agg)

    valid_time  org_unit_id  number        t2m
0   2012-02-02  O6uvpzGd5pu       0  27.335388
1   2012-02-02  fdc6uOvgoji       0  27.853302
2   2012-02-02  lc3eMKXaEfw       0  26.812439
3   2012-02-02  jUb8gELQApl       0  27.563202
4   2012-02-02  PMa2VCrupOd       0  27.259949
..         ...          ...     ...        ...
346 2012-02-28  jmIPBj66vD6       0  26.785797
347 2012-02-28  TEQlaapDQoK       0  27.028656
348 2012-02-28  bL4ooGhyHRQ       0  26.023712
349 2012-02-28  eIQbndfxQMb       0  27.649323
350 2012-02-28  at6UHUQatSo       0        NaN

[351 rows x 4 columns]


We see that the aggregated data contains temperature values for each organisation unit (`org_unit_id`) and all the 28 days in February 2012 contained in the downloaded NetCDF data. 

-------------------------------------------------

## Determining the data period for importing

Now that we have a simple way to download and aggregate temperature data, we want a function that defines a time period for which we want data. We have two goals here:

1. Since data is downloaded and processed on a monthly basis we want to return which year-month period we want to process. 
2. Return all year-month periods between the last valid data value for a given data element and today's date. 

In [53]:
def iter_month_periods_since_last_data_value(data_element_id, earliest_year, earliest_month):
    from datetime import date
    # get current year and month
    current_date = date.today()
    current_year,current_month = current_date.year, current_date.month
    # get last year and month for which data values exist in dhis2 data element
    first_period_response = {'existing': None} # TODO: update once daily periods are supported # client.analytics_latest_period_for_level(de_uid=data_element_id, level=org_unit_level)
    if first_period_response['existing']: 
        # last data value found
        first_period = first_period_response['existing']['id']
        first_year,first_month = int(first_period[:4]), int(first_period[4:6])
        # but no earlier than earliest year-month
        first_year = max(earliest_year, first_year)
        first_month = max(earliest_month, first_month)
    else:
        # no data values exists, start at earliest year-month
        first_year,first_month = earliest_year,earliest_month
    # loop years and months between last dhis2 value and today's date
    for year in range(first_year, current_year + 1):
        start_month = first_month if year == first_year else 1
        end_month = current_month if year == current_year else 12
        for month in range(start_month, end_month + 1):
            # yield year-month pairs
            yield year,month

---------------------

## The full workflow

Now we have all the components needed to automatically download data from the Climate Data Store. In this last section we will tie all the pieces together into a single function, which we can use to easily perform the data import at regular intervals: 

In [72]:
def main(data_element_id, earliest_year, earliest_month):
    import pandas as pd
    import geopandas as gpd
    from dhis2eo.integrations.pandas import dataframe_to_dhis2_json

    # download org units geojson
    print('Getting organisation units...')
    geojson = client.get_org_units_geojson(level=org_unit_level)

    # add org unit id to properties
    for feat in geojson['features']:
        feat['properties']['org_unit_id'] = feat['id']

    # convert to geopandas and get bbox
    org_units = gpd.GeoDataFrame.from_features(geojson["features"])
    bbox = get_bbox(org_units)

    # fetch, aggregate, and import data month-by-month
    for year, month in iter_month_periods_since_last_data_value(data_element_id, earliest_year, earliest_month):
        print('====================')
        print('Period:', year, month)
        # download data
        print('Getting data...')
        data = get_temperature_data(year, month, bbox)
        # aggregate to org units
        print('Aggregating...')
        agg = aggregate(data, org_units, id_col='org_unit_id')
        # convert to celsius
        agg['t2m'] -= 273.15
        # ignore nan values
        agg = agg[~pd.isna(agg['t2m'])]
        # convert to dhis2 json
        payload = dataframe_to_dhis2_json(
            df=agg,
            org_unit_col='org_unit_id',
            period_col='valid_time',
            value_col='t2m',
            data_element_id=data_element_id,
        )
        # upload to dhis2
        print('Importing to DHIS2...')
        res = client.post("/api/dataValueSets", json=payload)
        print("Results:", res['response']['importCount'])

    print('=====================')
    print('Data import finished!')

Finally, let's try to run the function to import daily temperature data since 1 January 2025 until today: 

In [73]:
data_element_id = 'GbUpvHzCzn8' # data element id that you want to import data into
start_year = 2025
start_month = 1
main(data_element_id, start_year, start_month)

Getting organisation units...
Period: 2025 1
Getting data...
Loading from cache ../data/local\temperature_2025-01_bbox_-13_6_-10_10.nc
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 348, 'ignored': 0, 'deleted': 0}
Period: 2025 2
Getting data...
Loading from cache ../data/local\temperature_2025-02_bbox_-13_6_-10_10.nc
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 276, 'ignored': 0, 'deleted': 0}
Period: 2025 3
Getting data...
Loading from cache ../data/local\temperature_2025-03_bbox_-13_6_-10_10.nc
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 312, 'ignored': 0, 'deleted': 0}
Period: 2025 4
Getting data...
Loading from cache ../data/local\temperature_2025-04_bbox_-13_6_-10_10.nc
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 348, 'ignored': 0, 'deleted': 0}
Period: 2025 5
Getting data...
Loading from cache ../data/local\temperature_2025-05_bbox_-13_6_-10_10.nc
Aggregating...
Import

2025-10-09 08:59:01,894 INFO Request ID is c8242a40-9ded-41e2-ab18-ee811e8edb0f
2025-10-09 08:59:01,982 INFO status has been updated to accepted
2025-10-09 08:59:10,703 INFO status has been updated to running
2025-10-09 08:59:23,579 INFO status has been updated to successful


7e36e70de28fdff87153c77cd2a5d634.nc:   0%|          | 0.00/22.5k [00:00<?, ?B/s]

Data is for the current month and will not be cached, since data is added daily
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 24, 'ignored': 0, 'deleted': 0}
Data import finished!


Note: Running this data import function multiple times in the same month, will result in the entire month being downloaded and imported each time, since the data is updated on a daily basis. But the results from the data import will report how many data values already existed and were ignored, and how many new data values were imported since last time. 

## Next steps

In this notebook we have created a function that can be run at regular intervals, e.g. every day or week, to fetch and import only the latest temperature data for your org units. But we still need a way to run the script. This can be done either manually, or automatically via a `cron` job. Further guidance on how to automatically schedule running a script will be added in the future. 