# Automated Data Download from Climate Data Store

In this notebook we demonstrate a full workflow for using Climate Tools to automate regular downloads of data from the [Climate Data Store (CDS)](https://cds.climate.copernicus.eu/datasets), aggregate the data to DHIS2 organisation units, and upload the aggregated climate data back to DHIS2.

For our example we will connect to a local DHIS2 instance containing the Sierra Leone demo database, set up a new data element for daily temperature data, and show how to create a function that can be called at regular intervals to update DHIS2 with the latest [daily 2m temperature data from the Climate Data Store](https://cds.climate.copernicus.eu/datasets/derived-era5-single-levels-daily-statistics?tab=download).

Each of the steps will be explained in detail throughout the notebook. At the end of the notebook, we will tie it all together in a single code snippet that automatically checks what data has already been imported, and downloads and imports only the relevant data. 

----------------------------------------
## Requirements

To run this notebook, you first need to connect to an instance of DHIS2. For our example, we will connect to a local instance of DHIS2 containing the standard Sierra Leone demo database, but you should be able to swap in the URL and credentials of your own instance to work directly with your own database.

In [1]:
from dhis2_client import DHIS2Client
from dhis2_client.settings import ClientSettings

# Create DHIS2 client connection
cfg = ClientSettings(
  base_url="http://localhost:8080",
  username="admin",
  password="district"
)
client = DHIS2Client(settings=cfg)

# Verify connection
info = client.get_system_info()
print("Current DHIS2 version:", info["version"])

Current DHIS2 version: 2.42.2


We also need a data element to import the data into. If you haven't already created your data element manually, you can follow the steps below to create it using the `python-dhis2-client`:

In [3]:
data_element = {
    "name": "2m Temperature (ERA5)",
    "shortName": "Temperature (ERA5)",
    "valueType": "NUMBER",
    "aggregationType": "AVERAGE",
    "domainType": "AGGREGATE"
}
data_element_response = client.create_data_element(data_element)
print(f"Data element creation status: {data_element_response['status']} and UID: {data_element_response['response']['uid']}")

Data element creation status: OK and UID: GbUpvHzCzn8


Since we plan to import daily temperature values, we also create a new data set for climate variables with the `Daily` period type and assign our data element to it:

In [4]:
data_set = {
    "name": "Daily climate data", 
    "shortName": "Daily climate data",
    "periodType": "Daily",
    "dataSetElements": [
        {
            "dataElement": {"id": data_element_response['response']['uid']}
        }
    ]
}

data_set_response = client.create_data_set(data_set)
print(f"Data set creation status: {data_set_response['status']} and UID: {data_set_response['response']['uid']}")

Data set creation status: OK and UID: MMPdeGYhikN


--------------------------------------------
## Downloading CDS data for a given month

To download data from the Climate Data Store (CDS), we will demonstrate step-by-step how to use the `earthkit.data` package to programmatically retrieve the data from the CDS API. For more information about CDS data access, see our guide for [manually downloading CDS data](../getting-data/climate-data-store.ipynb).

In [5]:
import earthkit.data

### Prerequisites

#### 1. Authenticate with your ECMWF user

Before you can download the dataset programmatically, you need to [create an ECMWF user](https://www.ecmwf.int/user/login), and authenticate using your user credentials:

- Go to the [CDSAPI Setup page](https://cds.climate.copernicus.eu/how-to-api) and make sure to log in.
- Once logged in, scroll down to the section "Setup the CDS API personal access token". 
  - This should show your login credentials, and look something like this:

        url: https://cds.climate.copernicus.eu/api
        key: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

- Copy those two lines to a file `.cdsapirc` in your user's $HOME directory.
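
Before running a download, it can be useful to confirm that the credentials file is in place. The following is a minimal sketch, not part of `earthkit` or `cdsapi`: the helper names `read_cdsapirc` and `missing_cds_settings` are hypothetical, and the sketch assumes the file contains simple `name: value` lines as shown above.

```python
from pathlib import Path

def read_cdsapirc(path):
    """Parse a .cdsapirc-style file into a dict of its `name: value` lines."""
    settings = {}
    for line in Path(path).read_text().splitlines():
        if ":" in line:
            # split on the first colon only, since the url itself contains colons
            name, value = line.split(":", 1)
            settings[name.strip()] = value.strip()
    return settings

def missing_cds_settings(path):
    """Return the required entries ('url', 'key') that are absent or empty."""
    path = Path(path)
    if not path.exists():
        return ["url", "key"]
    settings = read_cdsapirc(path)
    return [name for name in ("url", "key") if not settings.get(name)]

# Usage: check the default location in the user's home directory
# print(missing_cds_settings(Path.home() / ".cdsapirc"))
```

An empty list means both entries were found; anything else names what still needs to be added to the file.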

#### 2. Accept the dataset license

ECMWF requires that you manually accept the user license for each dataset that you download. 

- Start by visiting the Download page of the dataset we are interested in: ["ERA5 post-processed daily statistics on single levels from 1940 to present"](https://cds.climate.copernicus.eu/datasets/derived-era5-single-levels-daily-statistics?tab=download). 
- Scroll down until you get to the "Terms of Use" section.
- Click the button to accept, and log in with your user if you haven't already.

### Try an example request query

Earthkit provides a convenience method for retrieving data from CDS, `earthkit.data.from_source("cds", ...)`. To obtain the correct parameters to use for the data query, you can follow these steps:

- Manually go to the dataset [Download page](https://cds.climate.copernicus.eu/datasets/derived-era5-single-levels-daily-statistics?tab=download)
  - In our case, select 2m Temperature, select all days of a single month, and a subregion containing Sierra Leone. 
- At the bottom of the page, click "Show API Request Code" in the "API Request" section.
- This should show something like this:

        import cdsapi

        dataset = "derived-era5-single-levels-daily-statistics"
        request = {
            "product_type": "reanalysis",
            "variable": ["2m_temperature"],
            "year": "2024",
            "month": ["12"],
            "day": [
                "01", "02", "03",
                "04", "05", "06",
                "07", "08", "09",
                "10", "11", "12",
                "13", "14", "15",
                "16", "17", "18",
                "19", "20", "21",
                "22", "23", "24",
                "25", "26", "27",
                "28", "29", "30",
                "31"
            ],
            "daily_statistic": "daily_mean",
            "time_zone": "utc+00:00",
            "frequency": "1_hourly",
            "area": [10.0, -13.3, 6.9, -10.3],
        }

        client = cdsapi.Client()
        client.retrieve(dataset, request).download()

Since `earthkit` uses `cdsapi` in the background, we can copy these values directly into the parameters required by `earthkit`. We also add two parameters, `data_format` and `download_format`, so that the data is downloaded as an unarchived NetCDF file:

In [7]:
data = earthkit.data.from_source("cds",
    "derived-era5-single-levels-daily-statistics",
    {
        "product_type": "reanalysis",
        "variable": ["2m_temperature"],
        "year": "2024",
        "month": ["12"],
        "day": [
            "01", "02", "03",
            "04", "05", "06",
            "07", "08", "09",
            "10", "11", "12",
            "13", "14", "15",
            "16", "17", "18",
            "19", "20", "21",
            "22", "23", "24",
            "25", "26", "27",
            "28", "29", "30",
            "31"
        ],
        "daily_statistic": "daily_mean",
        "time_zone": "utc+00:00",
        "frequency": "1_hourly",
        "area": [10.0, -13.3, 6.9, -10.3],
        "data_format": "netcdf",
        "download_format": "unarchived",
    }
)

2025-10-08 22:42:30,964 INFO Request ID is 031b6ad4-71d0-4c11-aa07-f37d7afe839c
2025-10-08 22:42:31,270 INFO status has been updated to accepted
2025-10-08 22:42:39,739 INFO status has been updated to running
2025-10-08 22:43:21,398 INFO status has been updated to successful



### Creating a generalized function for downloading data

To make this more useful, we generalize the request into a function that sets some of the download parameters based on input arguments, as described below.

#### Year and month inputs

In our function we want the user to be able to input a `year` and `month`, and then update the necessary query parameters. To automatically select all individual days for a particular month, we can use the built-in Python function `calendar.monthrange(year, month)`, which returns the weekday of the first day of the month and the number of days in the month. We also have to left-pad all numbers with zeros. This function might look something like this:

In [8]:
def download_temperature_for_month(year, month):
    import calendar
    # construct the query parameters
    params = {
        "product_type": "reanalysis",
        "variable": ["2m_temperature"],
        "year": str(year),
        "month": [str(month).zfill(2)],
        "daily_statistic": "daily_mean",
        "time_zone": "utc+00:00",
        "frequency": "1_hourly",
        "area": [10.0, -13.3, 6.9, -10.3],
        "data_format": "netcdf",
        "download_format": "unarchived",
    }
    # monthrange returns (weekday of the first day, number of days in the month)
    _,num_days = calendar.monthrange(year, month)
    # select every day of the month, from day 1 to the last day inclusive
    params['day'] = [str(day).zfill(2) for day in range(1, num_days + 1)]
    print(params)
    # download the data
    data = earthkit.data.from_source("cds",
        "derived-era5-single-levels-daily-statistics",
        params,
    )
    # return
    return data
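
To see what `calendar.monthrange` actually returns, here is a small standalone sketch. The helper name `days_in_month` is only for illustration and is not part of the notebook's functions.

```python
import calendar

def days_in_month(year, month):
    """Return zero-padded day strings ("01", "02", ...) for every day of the month."""
    # monthrange returns (weekday of the 1st, number of days in the month),
    # where weekdays are numbered from Monday=0 to Sunday=6
    _first_weekday, num_days = calendar.monthrange(year, month)
    return [str(day).zfill(2) for day in range(1, num_days + 1)]

print(calendar.monthrange(2012, 2))   # (2, 29): 1 Feb 2012 was a Wednesday, and 2012 is a leap year
days = days_in_month(2012, 2)
print(days[:3], days[-1], len(days))  # ['01', '02', '03'] 29 29
```

Note that the first value is a weekday, not a day of the month, so it should not be used as the start of the day range.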

#### Determining the area coordinates from organisation units

Notice that we didn't yet do anything with the `area` parameter. To make our function more generic we also want to set this `area` parameter based on the bounding box of our organisation unit geometries. We therefore add another function that calculates the bounding box from a `geopandas.GeoDataFrame` and allow a bounding box input to our download function. The updated code would look like this:

In [9]:
import geopandas as gpd

def get_bbox(org_units: gpd.GeoDataFrame):
    '''Returns bounding box of a geopandas GeoDataFrame in standard format: xmin,ymin,xmax,ymax.'''
    bbox = org_units.total_bounds
    return bbox

def download_temperature_for_month_and_bbox(year, month, bbox):
    import calendar
    # extract the coordinates from input bounding box
    xmin,ymin,xmax,ymax = bbox
    # construct the query parameters
    params = {
        "product_type": "reanalysis",
        "variable": ["2m_temperature"],
        "year": str(year),
        "month": [str(month).zfill(2)],
        "daily_statistic": "daily_mean",
        "time_zone": "utc+00:00",
        "frequency": "1_hourly",
        "area": [ymax, xmin, ymin, xmax], # notice how we reordered the bbox coordinate sequence
        "data_format": "netcdf",
        "download_format": "unarchived",
    }
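    # monthrange returns (weekday of the first day, number of days in the month)
    _,num_days = calendar.monthrange(year, month)
    # select every day of the month, from day 1 to the last day inclusive
    params['day'] = [str(day).zfill(2) for day in range(1, num_days + 1)]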
    # download the data
    data = earthkit.data.from_source("cds",
        "derived-era5-single-levels-daily-statistics",
        params,
    )
    # return
    return data
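
The coordinate reordering is easy to get wrong, so it can help to isolate it. The CDS `area` parameter is ordered north, west, south, east, while `total_bounds` returns xmin, ymin, xmax, ymax. A minimal sketch follows; the helper name `bbox_to_cds_area` is hypothetical, and converting to plain Python floats is a precaution rather than a documented requirement.

```python
def bbox_to_cds_area(bbox):
    """Convert (xmin, ymin, xmax, ymax) to the CDS area order [north, west, south, east]."""
    # convert to plain floats in case the input is a numpy array
    xmin, ymin, xmax, ymax = (float(v) for v in bbox)
    return [ymax, xmin, ymin, xmax]

# bounding box of the Sierra Leone org units used in this notebook
area = bbox_to_cds_area([-13.3035, 6.9176, -10.2658, 10.0004])
print(area)  # [10.0004, -13.3035, 6.9176, -10.2658]
```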

#### Caching download results

Since data downloads can be slow, we also want to cache the download results and reuse them if the file has already been downloaded. We wrap our previous download function so that it saves the results of each download into a local folder, or loads the data from disk if the file already exists. Since we are working with a dataset that is updated daily, we make sure to only cache downloads for completed months. The cached wrapper function for getting temperature data looks like this:

In [57]:
def get_temperature_data(year, month, bbox, cache_folder='../data/local'):
    import os
    from datetime import date
    current_date = date.today()
    # convert input args to a cache filename
    xmin,ymin,xmax,ymax = bbox
    file_name = f'temperature_{year}-{str(month).zfill(2)}_bbox_{int(xmin)}_{int(ymin)}_{int(xmax)}_{int(ymax)}.nc'
    file_path = os.path.join(cache_folder, file_name)
    # check if cache filename already exists
    if os.path.exists(file_path):
        # load from cache
        print('Loading from cache', file_path)
        data = earthkit.data.from_source('file', file_path)
    else:
        # download data from the api
        print('Downloading from api...')
        data = download_temperature_for_month_and_bbox(year, month, bbox)
        # save to cache, but not if we're still in the current month
        if year == current_date.year and month == current_date.month:
            print('Data is for the current month and will not be cached, since data is added daily')
        else:
            print('Saving to cache', file_path)
            data.to_target('file', file_path)            
    # return
    return data
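
The "only cache completed months" rule can also be written as a small function that takes today's date as an argument, which makes it easy to check in isolation. This is a sketch with a hypothetical helper name, `is_completed_month`. Unlike the equality check in the function above, it compares `(year, month)` pairs, so it also treats any future month as not yet complete.

```python
from datetime import date

def is_completed_month(year, month, today=None):
    """Return True if the given month ended before the month of `today`."""
    today = today or date.today()
    # tuples compare element by element: first the year, then the month
    return (year, month) < (today.year, today.month)

print(is_completed_month(2025, 9, today=date(2025, 10, 9)))   # True
print(is_completed_month(2025, 10, today=date(2025, 10, 9)))  # False
```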

#### Test case

Finally, let's try our download function for February 2012 and a set of organisation units from Sierra Leone:

In [32]:
import geopandas as gpd

# get org units
org_unit_level = 2
geojson = client.get_org_units_geojson(level=org_unit_level)

# add org unit id to properties
for feat in geojson['features']:
    feat['properties']['org_unit_id'] = feat['id']

# convert to geopandas
org_units = gpd.GeoDataFrame.from_features(geojson["features"])

# calc bbox
bbox = get_bbox(org_units)
print('Bbox:', bbox)

# download data
data = get_temperature_data(2012, 2, bbox)

Bbox: [-13.3035   6.9176 -10.2658  10.0004]
Loading from cache ../data/local\temperature_2012-02_bbox_-13_6_-10_10.nc


----------------------------------------------

## Aggregating the data to organisation units

The next step is creating a generic function that aggregates the data downloaded from the previous step to a set of input organisation units:

In [49]:
def aggregate(data, org_units, id_col):
    from earthkit.transforms import aggregate
    # aggregate to org unit for each time period
    agg_data = aggregate.spatial.reduce(data, org_units, mask_dim=id_col)
    # convert to dataframe
    agg_df = agg_data.to_dataframe().reset_index()
    # return
    return agg_df

Let's try it for our previously downloaded test data:

In [62]:
agg = aggregate(data, org_units, id_col='org_unit_id')
agg['t2m'] -= 273.15 # convert to celsius
print(agg)

    valid_time  org_unit_id  number        t2m
0   2012-02-02  O6uvpzGd5pu       0  27.335388
1   2012-02-02  fdc6uOvgoji       0  27.853302
2   2012-02-02  lc3eMKXaEfw       0  26.812439
3   2012-02-02  jUb8gELQApl       0  27.563202
4   2012-02-02  PMa2VCrupOd       0  27.259949
..         ...          ...     ...        ...
346 2012-02-28  jmIPBj66vD6       0  26.785797
347 2012-02-28  TEQlaapDQoK       0  27.028656
348 2012-02-28  bL4ooGhyHRQ       0  26.023712
349 2012-02-28  eIQbndfxQMb       0  27.649323
350 2012-02-28  at6UHUQatSo       0        NaN

[351 rows x 4 columns]


We see that the aggregated data contains one temperature value per organisation unit (`org_unit_id`) for each day (`valid_time`) present in the downloaded NetCDF data.

-------------------------------------------------

## Determining the data period for importing

Now that we have a simple way to download and aggregate temperature data, we need a function that determines which time periods to process. It has two goals:

1. Since data is downloaded and processed on a monthly basis, it should return periods as year-month pairs.
2. It should return all year-month periods between the last existing data value for a given data element and today's date.

Note that the lookup of the last existing data value is currently a placeholder (see the `TODO` in the code), so for now the function always starts at the earliest year-month provided.

In [53]:
def iter_month_periods_since_last_data_value(data_element_id, earliest_year, earliest_month):
    from datetime import date
    # get current year and month
    current_date = date.today()
    current_year,current_month = current_date.year, current_date.month
    # get last year and month for which data values exist in dhis2 data element
    first_period_response = {'existing': None} # TODO: update once daily periods are supported # client.analytics_latest_period_for_level(de_uid=data_element_id, level=org_unit_level)
    if first_period_response['existing']: 
        # last data value found
        first_period = first_period_response['existing']['id']
        first_year,first_month = int(first_period[:4]), int(first_period[4:6])
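        # but no earlier than earliest year-month (compare as (year, month) pairs,
        # since comparing year and month separately gives wrong results across years)
        first_year,first_month = max((earliest_year, earliest_month), (first_year, first_month))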
    else:
        # no data values exists, start at earliest year-month
        first_year,first_month = earliest_year,earliest_month
    # loop years and months between last dhis2 value and today's date
    for year in range(first_year, current_year + 1):
        start_month = first_month if year == first_year else 1
        end_month = current_month if year == current_year else 12
        for month in range(start_month, end_month + 1):
            # yield year-month pairs
            yield year,month

---------------------

## The full workflow

Now we have all the components needed to automatically download data from the Climate Data Store. In this last section we will tie all the pieces together into a single function, which we can use to easily perform the data import at regular intervals: 

In [72]:
def main(data_element_id, earliest_year, earliest_month):
    import pandas as pd
    import geopandas as gpd
    from dhis2eo.integrations.pandas import dataframe_to_dhis2_json

    # download org units geojson
    print('Getting organisation units...')
    geojson = client.get_org_units_geojson(level=org_unit_level)

    # add org unit id to properties
    for feat in geojson['features']:
        feat['properties']['org_unit_id'] = feat['id']

    # convert to geopandas and get bbox
    org_units = gpd.GeoDataFrame.from_features(geojson["features"])
    bbox = get_bbox(org_units)

    # fetch, aggregate, and import data month-by-month
    for year, month in iter_month_periods_since_last_data_value(data_element_id, earliest_year, earliest_month):
        print('====================')
        print('Period:', year, month)
        # download data
        print('Getting data...')
        data = get_temperature_data(year, month, bbox)
        # aggregate to org units
        print('Aggregating...')
        agg = aggregate(data, org_units, id_col='org_unit_id')
        # convert to celsius
        agg['t2m'] -= 273.15
        # ignore nan values
        agg = agg[~pd.isna(agg['t2m'])]
        # convert to dhis2 json
        payload = dataframe_to_dhis2_json(
            df=agg,
            org_unit_col='org_unit_id',
            period_col='valid_time',
            value_col='t2m',
            data_element_id=data_element_id,
        )
        # upload to dhis2
        print('Importing to DHIS2...')
        res = client.post("/api/dataValueSets", json=payload)
        print("Results:", res['response']['importCount'])

    print('=====================')
    print('Data import finished!')

Finally, let's run the function to import daily temperature data from 1 January 2025 until today:

In [73]:
data_element_id = 'GbUpvHzCzn8' # data element id that you want to import data into
start_year = 2025
start_month = 1
main(data_element_id, start_year, start_month)

Getting organisation units...
Period: 2025 1
Getting data...
Loading from cache ../data/local\temperature_2025-01_bbox_-13_6_-10_10.nc
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 348, 'ignored': 0, 'deleted': 0}
Period: 2025 2
Getting data...
Loading from cache ../data/local\temperature_2025-02_bbox_-13_6_-10_10.nc
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 276, 'ignored': 0, 'deleted': 0}
Period: 2025 3
Getting data...
Loading from cache ../data/local\temperature_2025-03_bbox_-13_6_-10_10.nc
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 312, 'ignored': 0, 'deleted': 0}
Period: 2025 4
Getting data...
Loading from cache ../data/local\temperature_2025-04_bbox_-13_6_-10_10.nc
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 348, 'ignored': 0, 'deleted': 0}
Period: 2025 5
Getting data...
Loading from cache ../data/local\temperature_2025-05_bbox_-13_6_-10_10.nc
Aggregating...
Import

2025-10-09 08:59:01,894 INFO Request ID is c8242a40-9ded-41e2-ab18-ee811e8edb0f
2025-10-09 08:59:01,982 INFO status has been updated to accepted
2025-10-09 08:59:10,703 INFO status has been updated to running
2025-10-09 08:59:23,579 INFO status has been updated to successful



Data is for the current month and will not be cached, since data is added daily
Aggregating...
Importing to DHIS2...
Results: {'imported': 0, 'updated': 24, 'ignored': 0, 'deleted': 0}
Data import finished!


Note: Running this data import function multiple times in the same month results in the entire current month being downloaded and imported each time, since the data is updated on a daily basis. The import summary reports how many data values were newly imported and how many existing values were updated or ignored.

## Next steps

In this notebook we have created a function that can be run at regular intervals, e.g. every day or week, to fetch and import only the latest temperature data for your org units. But we still need a way to run the script. This can be done either manually, or automatically via a `cron` job. Further guidance on how to automatically schedule running a script will be added in the future. 
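
Until that guidance is available, one possible starting point is to move the functions from this notebook into a script and give it a small command-line interface, so that a scheduler can call it with the desired arguments. The sketch below only covers the argument parsing; the option names and the script name in the comments are hypothetical, and it assumes the `main` function defined above is available in the same script.

```python
import argparse

def parse_args(argv=None):
    """Parse the command-line arguments for a scheduled import run."""
    parser = argparse.ArgumentParser(description="Import ERA5 daily temperature into DHIS2")
    parser.add_argument("data_element_id", help="UID of the data element to import into")
    parser.add_argument("--earliest-year", type=int, required=True)
    parser.add_argument("--earliest-month", type=int, choices=range(1, 13), default=1)
    return parser.parse_args(argv)

args = parse_args(["GbUpvHzCzn8", "--earliest-year", "2025"])
print(args.data_element_id, args.earliest_year, args.earliest_month)  # GbUpvHzCzn8 2025 1

# In a script, this would be followed by a call such as:
#   main(args.data_element_id, args.earliest_year, args.earliest_month)
# and a crontab entry along these lines (hypothetical path) would run it daily at 06:00:
#   0 6 * * * python /path/to/import_temperature.py GbUpvHzCzn8 --earliest-year 2025
```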