[![logo](https://climate.copernicus.eu/sites/default/files/custom-uploads/branding/LogoLine_horizon_C3S.png)](https://climate.copernicus.eu)

# Downloading PECD4.2 subsample of CSV data from the CDS via cdsapi

Building on [Notebook *download-data-from-cds.ipynb*](./download-data-from-cds.ipynb) and [Notebook *explore-csv-data.ipynb*](./explore-csv-data.ipynb), this notebook provides a tool to efficiently download [PECD4.2](https://cds.climate.copernicus.eu/datasets/sis-energy-pecd?tab=overview) spatially aggregated data (CSV format) and retain only the information you need (e.g. a limited list of countries). Although you may need more advanced processing downstream of the download, this tool is useful if you have limited storage and want to discard unnecessary information as soon as possible.

To learn how to access climate and energy related variables from the Pan-European Climate Database (PECD4.2) derived from reanalysis and climate projections, please have a preliminary look at [Notebook *download-data-from-cds.ipynb*](./download-data-from-cds.ipynb).

To learn more about CSV data handling, of which we'll make some use here, please have a look at [Notebook *explore-csv-data.ipynb*](./explore-csv-data.ipynb).

In this example, we will download aggregated data in CSV format for one energy variable, Solar PhotoVoltaic Capacity Factor (SPV), covering both a historical time window (2011-2012), reconstructed from ERA5 reanalysis climate data, and a future window (2031-2032), computed from 3 different CMIP6 climate projection models under one of the available scenarios, SSP245. However, we will retain only the data for a subselection of countries: Italy, France, and Germany.

> **Note**  
>[ERA5](https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5) is the fifth-generation atmospheric reanalysis program developed by the European Centre for Medium-Range Weather Forecasts (ECMWF) in collaboration with the Copernicus Climate Change Service (C3S). It operates on a global scale and has a spatial resolution of $0.25° \times 0.25°$ (latitude and longitude), which corresponds to approximately 31 km; estimates of atmospheric variables are provided hourly throughout a temporal coverage of about eight decades, from 1940 to today.

> **Note**  
>[CMIP6](https://pcmdi.llnl.gov/CMIP6/) (Coupled Model Intercomparison Project Phase 6) is an international effort that brings together climate models from research institutions worldwide. Its goal is to standardize and compare climate simulations to better understand past and future climate behavior. The results are widely used in scientific research and reports like those from the IPCC.

> **Note**  
>The SSP245 (or SSP2-4.5) climate scenario is one of the plausible future pathways that combine assumptions about human development (like population growth, energy use, and policy) with projections of greenhouse gas emissions. Climate models use these scenarios to simulate how the Earth’s climate might respond under different conditions. SSP2-4.5 is a “middle-of-the-road” scenario that assumes moderate population and economic growth, slow and uneven progress toward sustainability, and some mitigation of emissions (though not aggressive climate policies). The "4.5" refers to the projected radiative forcing (the extra energy trapped in the Earth system) of 4.5 W/m² by the year 2100.

## Learning objectives 🎯

In this notebook, you will split a large data request (through the CDS API) into smaller chunks and send them with parallel calls in Python, while retaining only the needed information from the downloaded CSV files. This reduces the amount of data stored on disk.

## Target Audience 🎯

**Anyone** interested in learning how to download data from the PECD4.2 dataset and retain just a subsample of needed information.

## Prepare your environment

### Import libraries

In the following we will import a few libraries:

*   [os](https://docs.python.org/3/library/os.html) provides a way to interact with the operating system; it is used here to create folders and manage files.
*   [glob](https://docs.python.org/3/library/glob.html) finds all the pathnames matching a specified pattern according to the rules used by the Unix shell.
*   [pandas](https://pandas.pydata.org/) is one of the most common and easy-to-use tools for data analysis and manipulation.
*   [cdsapi](https://github.com/ecmwf/cdsapi?tab=readme-ov-file) provides programmatic access to the Copernicus Climate Data Store (CDS), allowing you to download data.
*   [multiprocessing](https://docs.python.org/3/library/multiprocessing.html) lets us leverage multiple processors on a given machine (in our case, on Colab); it is used here to send API requests in parallel.

In [1]:
import os
import glob
import pandas as pd
import cdsapi
from multiprocessing import Pool

### Set up the CDS API and your credentials


To learn how to use the CDS API, see the [official guide](https://cds.climate.copernicus.eu/how-to-api). If you have already set up your .cdsapirc file locally, you can upload it directly to your home directory.

Alternatively, you can replace `None` in the following code cell with your API Token as a string (i.e. enclosed in quotes, like `"your_api_key"`). Your token can be found on the CDS portal at: https://cds.climate.copernicus.eu/profile (you will need to log in to view your credentials).
Remember to agree to the Terms and Conditions of every dataset you intend to download.

In [2]:
# If you have already set up your .cdsapirc file you can leave this as None
cdsapi_key = None
cdsapi_url = "https://cds.climate.copernicus.eu/api"
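
As a quick sanity check before sending any request, the sketch below looks for credentials in the two places described above: an explicit token or a `.cdsapirc` file. The helper name `find_cds_credentials` is hypothetical (it is not part of `cdsapi`), and the parser assumes the simple `name: value` layout, one pair per line, described in the official guide.

```python
from pathlib import Path


def find_cds_credentials(key=None, rc_path=None):
    """Return (url, key) from an explicit token or a .cdsapirc-style file.

    Hypothetical helper for illustration; it is not part of cdsapi.
    """
    default_url = "https://cds.climate.copernicus.eu/api"
    if key is not None:
        return default_url, key
    rc_path = Path(rc_path) if rc_path else Path.home() / ".cdsapirc"
    if not rc_path.is_file():
        raise FileNotFoundError(f"No token given and {rc_path} not found")
    # assumed layout: one "name: value" pair per line
    config = {}
    for line in rc_path.read_text().splitlines():
        if ":" in line:
            name, value = line.split(":", 1)
            config[name.strip()] = value.strip()
    return config.get("url", default_url), config.get("key")
```

A call such as `url, key = find_cds_credentials(cdsapi_key)` would fail early with a readable message if neither source of credentials is available, rather than partway through a batch of downloads.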

## Create a function to handle the data download

The PECD data can be downloaded from the CDS download form, by ticking the boxes of interest. Once all the required information is manually selected, scroll to the bottom of the form and click on "Show API request". This will reveal a code block that can be copied and pasted directly into a cell of your Jupyter Notebook. If you'd like to try it yourself, visit the [CDS download form](https://cds.climate.copernicus.eu/datasets/sis-energy-pecd?tab=download) and test it.
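
For orientation, a request of this kind might look like the sketch below. The keys and values are taken from the parameters used later in this notebook; the exact block shown by the form may differ, and the `download` wrapper and the `pecd_sample.zip` file name are assumptions added here so that the dictionary can be inspected without sending anything.

```python
# Values mirror the parameters used later in this notebook.
dataset = "sis-energy-pecd"
request = {
    "pecd_version": "pecd4_2",
    "temporal_period": ["historical"],
    "origin": ["era5_reanalysis"],
    "spatial_resolution": ["nuts_0"],
    "variable": ["solar_photovoltaic_generation_capacity_factor"],
    "technology": ["60"],
    "year": ["2011", "2012"],
}


def download(target="pecd_sample.zip"):
    """Send the request; needs valid CDS credentials and network access."""
    import cdsapi  # imported here so the dictionary can be inspected offline

    client = cdsapi.Client()
    client.retrieve(dataset, request, target)
```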

However, in this exercise, we will skip this step and directly build a function that sends a single API request, downloads and unzips the files, and selects just the data we need. After that, we will split a large request into smaller ones, and then call the function to process all the requests in parallel.

After creating a folder where data will be stored, we'll define the function `retrieve_sel_cds_csv_data`, which will take as input several arguments that identify the specific data you need to download:


*   `dataset` (string): the name of the dataset to download from.
*   `pecd_version` (string): the version of the Pan-European Climate Database (PECD) you are interested in.
*   `temporal_period` (list of strings): specifies the time period of the data (e.g., 'historical', 'future_projections').
*   `origin` (list of strings): indicates the source of the data, such as a specific climate model or reanalysis dataset.
*   `spatial_resolution` (list of strings): defines the geographical resolution of the data (e.g., 'nuts_0').
*   `variable` (list of strings): the specific climate or energy variable you want to download (e.g., '2m_temperature', 'solar_photovoltaic_generation_capacity_factor').
*   `year` (list of strings): the years for which you want to retrieve data (e.g., '2011', '2012').
*   `reg_list` (list of strings, optional, default is None): if applicable, the codes of the regions to keep (e.g., 'IT', 'FR').
*   `technology` (list of strings, optional, default is None): if applicable, specifies a technology related to the energy variables.
*   `emissions` (list of strings, optional, default is None): if applicable, the emissions scenario (e.g., 'ssp2_4_5').


In [3]:
# create folder to store downloaded data
folder = "cds_data/dowload_subsample_data_from_cds"
os.makedirs(folder, exist_ok=True)


def retrieve_sel_cds_csv_data(
    dataset: str,
    pecd_version: str,
    temporal_period: list[str],
    origin: list[str],
    spatial_resolution: list[str],
    variable: list[str],
    year: list[str],
    reg_list: list[str] = None,
    technology: list[str] = None,
    emissions: list[str] = None,
):

    # dictionary of the api request
    request = {
        "pecd_version": pecd_version,
        "temporal_period": temporal_period,
        "origin": origin,
        "spatial_resolution": spatial_resolution,
        "year": year,
        "variable": variable,
    }

    # build the file path to the downloaded data
    id_string = (f"{pecd_version}_{temporal_period[0]}_{origin[0]}_"
                 f"{variable[0]}_{spatial_resolution[0]}_{year[0]}")
    folder_i = f"{folder}/{id_string}"
    os.makedirs(folder_i, exist_ok=True)
    file_path = f"{folder_i}/{id_string}"

    # add emissions and technology fields if needed
    if emissions is not None:
        request["emission_scenario"] = emissions
        file_path += f"_{emissions[0]}"
    if technology is not None:
        request["technology"] = technology
        file_path += f"_{technology[0]}"
    file_path += ".zip"

    # initialize Client object
    client = cdsapi.Client(cdsapi_url, cdsapi_key)
    # call retrieve method that downloads the data
    client.retrieve(dataset, request, file_path)  # .download()

    # unzipping files to temporary folder
    os.system(f"unzip {file_path} -d {folder_i}/temp")

    # listing all newly downloaded files
    fpaths = sorted(glob.glob(os.path.join(folder_i, "temp", "*")))

    # Checking if list of regions was submitted
    if reg_list:
        for fpath in fpaths:
            df = pd.read_csv(fpath, comment="#", index_col=["Date"], parse_dates=["Date"])
            # keeping only the requested regions that are present in this CSV file
            # (a new list is built so that reg_list is not modified while iterating)
            available = [reg for reg in reg_list if reg in df.columns]
            for reg in reg_list:
                if reg not in available:
                    print(f"Warning: region {reg} not available in dataframe. Skipping this region.")
            if not available:
                print("None of the provided regions were in the downloaded file.")
                continue
            # selecting needed regions from CSV file
            df = df[available]
            # saving new CSV file
            df.to_csv(os.path.join(folder, os.path.basename(fpath)))
        # deleting unnecessary original CSVs
        for f in glob.glob(f"{folder_i}/temp/*.csv"):
            os.remove(f)
    else:
        # moving files from temporary folder
        for fpath in fpaths:
            # os.rename already removes the file from the temporary folder
            os.rename(fpath, os.path.join(folder, os.path.basename(fpath)))
    # deleting unnecessary .zip files
    for f in glob.glob(f"{folder_i}/*.zip"):
        os.remove(f)
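
The region-selection step can be tried in isolation, without any download. The sketch below runs the same `pd.read_csv` options on a small in-memory table; the header line, region columns and values are synthetic stand-ins made up for illustration, not real PECD data.

```python
import io

import pandas as pd

# synthetic stand-in for a PECD CSV: comment header, a Date column, one column per region
raw = io.StringIO(
    "# illustrative header line\n"
    "Date,AT,DE,FR,IT\n"
    "2011-01-01 00:00:00,0.0,0.0,0.0,0.0\n"
    "2011-01-01 12:00:00,0.21,0.18,0.25,0.33\n"
)
df = pd.read_csv(raw, comment="#", index_col="Date", parse_dates=["Date"])

wanted = ["IT", "FR", "XX"]  # "XX" is deliberately not in the file
available = [reg for reg in wanted if reg in df.columns]
missing = [reg for reg in wanted if reg not in df.columns]

# keep only the columns that were both requested and found
subset = df[available]
print(list(subset.columns), missing)  # ['IT', 'FR'] ['XX']
```

Building `available` as a new list leaves `wanted` untouched, so the same list of regions can safely be reused for every file.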

## Set up the parameters for data download

This section of the code defines several variables that specify the data to be downloaded from the Climate Data Store. These variables act as parameters for the API requests that will be made later. We will create a list of years for both the historical and the projection data, then divide those lists into 2-year chunks.

In [4]:
# define our dataset
dataset = "sis-energy-pecd"

# constants
pecd_version = "pecd4_2"
emissions = ["ssp2_4_5"]
spatial_resolution = ["nuts_0"]
reg_list = ["IT", "FR", "DE"]  # list of regions (based on spatial resolution)

# list of years to download
hist_start, hist_end = 2011, 2012
proj_start, proj_end = 2031, 2032
hist_years = [str(i) for i in range(hist_start, hist_end + 1)]
proj_years = [str(i) for i in range(proj_start, proj_end + 1)]

# divide each list of years into chunks of n years (the last chunk may be shorter)
n = 2
hist_years_list = [hist_years[i: i + n] for i in range(0, len(hist_years), n)]
proj_years_list = [proj_years[i: i + n] for i in range(0, len(proj_years), n)]

# list of variables to download
vars = ["solar_photovoltaic_generation_capacity_factor"]
technology = ["60"]

# dictionary of origins - projection models
origins = {
    "historical": ["era5_reanalysis"],
    "future_projections": ["cmcc_cm2_sr5", "ec_earth3", "mpi_esm1_2_hr"],
}
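
When the number of years is not a multiple of the chunk size, the last chunk should simply be shorter rather than dropped. A minimal sketch of such a split, with a hypothetical helper named `chunk` and a 5-year range chosen only to show the remainder:

```python
def chunk(items, n):
    """Split items into consecutive chunks of at most n elements."""
    return [items[i:i + n] for i in range(0, len(items), n)]


years = [str(y) for y in range(2011, 2016)]  # 5 years, not divisible by 2
print(chunk(years, 2))  # [['2011', '2012'], ['2013', '2014'], ['2015']]
```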

## Generate a list of api requests

This section of the code focuses on creating a list of requests that will be used to download data from the Copernicus Climate Change Service (C3S) Climate Data Store (CDS). Each item in this list represents a specific data download request.

We will create a nested loop structure. The outer loop iterates through each variable defined in the `vars` list. For each variable, the code generates requests for both historical and future projection data, each stored as a tuple. The inner loops iterate through each group of years in the corresponding list of years and, for the projections, through each climate model. The requests are collected in a list of tuples because this is the input expected by the `starmap` method of `multiprocessing`.

In [5]:
requests = []
# outer loop through variables
for var in vars:
    period = "historical"
    # loop through historical years
    for year in hist_years_list:
        request = (
            dataset,
            pecd_version,
            [period],
            origins[period],
            spatial_resolution,
            [var],
            year,
            reg_list,
            technology,
        )
        requests.append(request)
    period = "future_projections"
    # loop through projection years
    for year in proj_years_list:
        for origin in origins[period]:
            request = (
                dataset,
                pecd_version,
                [period],
                [origin],
                spatial_resolution,
                [var],
                year,
                reg_list,
                technology,
                emissions,
            )
            requests.append(request)

# print requests
print(f"total requests: {len(requests)}")
for request in requests:
    print(request)

total requests: 4
('sis-energy-pecd', 'pecd4_2', ['historical'], ['era5_reanalysis'], ['nuts_0'], ['solar_photovoltaic_generation_capacity_factor'], ['2011', '2012'], ['IT', 'FR', 'DE'], ['60'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['cmcc_cm2_sr5'], ['nuts_0'], ['solar_photovoltaic_generation_capacity_factor'], ['2031', '2032'], ['IT', 'FR', 'DE'], ['60'], ['ssp2_4_5'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['ec_earth3'], ['nuts_0'], ['solar_photovoltaic_generation_capacity_factor'], ['2031', '2032'], ['IT', 'FR', 'DE'], ['60'], ['ssp2_4_5'])
('sis-energy-pecd', 'pecd4_2', ['future_projections'], ['mpi_esm1_2_hr'], ['nuts_0'], ['solar_photovoltaic_generation_capacity_factor'], ['2031', '2032'], ['IT', 'FR', 'DE'], ['60'], ['ssp2_4_5'])
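
The same list of tuples could also be built with `itertools.product`, which flattens the nested loops into one loop per period. This is only an alternative sketch: the values are copied from the cells above, `variables` and `models` are names introduced here, and with more than one variable the order of the requests would differ from the loop version.

```python
from itertools import product

dataset = "sis-energy-pecd"
pecd_version = "pecd4_2"
spatial_resolution = ["nuts_0"]
reg_list = ["IT", "FR", "DE"]
technology = ["60"]
emissions = ["ssp2_4_5"]
variables = ["solar_photovoltaic_generation_capacity_factor"]
hist_years_list = [["2011", "2012"]]
proj_years_list = [["2031", "2032"]]
models = ["cmcc_cm2_sr5", "ec_earth3", "mpi_esm1_2_hr"]

requests = []
# historical: one request per (variable, group of years)
for var, years in product(variables, hist_years_list):
    requests.append((dataset, pecd_version, ["historical"], ["era5_reanalysis"],
                     spatial_resolution, [var], years, reg_list, technology))
# projections: one request per (variable, group of years, model)
for var, years, model in product(variables, proj_years_list, models):
    requests.append((dataset, pecd_version, ["future_projections"], [model],
                     spatial_resolution, [var], years, reg_list, technology, emissions))

print(f"total requests: {len(requests)}")  # total requests: 4
```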


These requests can be parallelized with multiprocessing (as done here), but you could equally loop over the `requests` list and send one request at a time. Either way, the result will be the same.
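
A sequential version might look like the sketch below. To keep it runnable offline, it uses a hypothetical stand-in function, `fake_retrieve`, and two shortened request tuples instead of the real download function and the full list.

```python
def fake_retrieve(dataset, pecd_version, temporal_period, origin, *rest):
    """Stand-in for the download function: just reports what it would fetch."""
    return f"{dataset}:{temporal_period[0]}:{origin[0]}"


requests = [
    ("sis-energy-pecd", "pecd4_2", ["historical"], ["era5_reanalysis"]),
    ("sis-energy-pecd", "pecd4_2", ["future_projections"], ["ec_earth3"]),
]

results = []
for request in requests:
    # *request unpacks the tuple into positional arguments
    results.append(fake_retrieve(*request))
print(results)
```

With the real function, the loop body would be `retrieve_sel_cds_csv_data(*request)`.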

In this example we initialize the Pool object with 4 processes and call the starmap method, passing as arguments the function previously defined and the list of tuples created before.

In [6]:
# Running several processes
with Pool(4) as p:
    p.starmap(retrieve_sel_cds_csv_data, requests)

2025-07-14 08:35:31,140 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-07-14 08:35:31,148 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-07-14 08:35:31,158 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-07-14 08:35:31,164 INFO [2024-09-26T00:00:00] Watch our [Forum](https://forum.ecmwf.int/) for Announcements, news and other discussed topics.
2025-07-14 08:35:36,292 INFO Request ID is 0d481b03-2fb9-4dad-b7db-a3ea19b79b73
2025-07-14 08:35:36,347 INFO status has been updated to accepted
2025-07-14 08:35:36,397 INFO Request ID is 16766230-a40d-4f41-b4da-d1640802078e
2025-07-14 08:35:36,549 INFO status has been updated to accepted
2025-07-14 08:35:37,147 INFO Request ID is d345825d-edad-4a44-90ab-5494711d76e2
2025-07-14 08:35:37,192 INFO status has be

d84c50a7cb489bdc2fd2e31699e3f6d2.zip:   0%|          | 0.00/1.51M [00:00<?, ?B/s]

Archive:  cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_ec_earth3_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/pecd4_2_future_projections_ec_earth3_solar_photovoltaic_generation_capacity_factor_nuts_0_2031_ssp2_4_5_60.zip
  inflating: cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_ec_earth3_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/temp/P_CMI6_ECEC_ECE3_SPV_0000m_Pecd_NUT0_S203101010000_E203112312300_CFR_TIM_01h_NA-_noc_org_60_SP245_NA---_PhM03_PECD4.2_fv1.csv  
  inflating: cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_ec_earth3_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/temp/P_CMI6_ECEC_ECE3_SPV_0000m_Pecd_NUT0_S203201010000_E203212312300_CFR_TIM_01h_NA-_noc_org_60_SP245_NA---_PhM03_PECD4.2_fv1.csv  


2025-07-14 08:35:51,213 INFO status has been updated to successful


10ac1e890af4776dc80c9ce4cd6d8569.zip:   0%|          | 0.00/1.51M [00:00<?, ?B/s]

Archive:  cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_mpi_esm1_2_hr_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/pecd4_2_future_projections_mpi_esm1_2_hr_solar_photovoltaic_generation_capacity_factor_nuts_0_2031_ssp2_4_5_60.zip
  inflating: cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_mpi_esm1_2_hr_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/temp/P_CMI6_MPI-_MEHR_SPV_0000m_Pecd_NUT0_S203101010000_E203112312300_CFR_TIM_01h_NA-_noc_org_60_SP245_NA---_PhM03_PECD4.2_fv1.csv  
  inflating: cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_mpi_esm1_2_hr_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/temp/P_CMI6_MPI-_MEHR_SPV_0000m_Pecd_NUT0_S203201010000_E203212312300_CFR_TIM_01h_NA-_noc_org_60_SP245_NA---_PhM03_PECD4.2_fv1.csv  


2025-07-14 08:35:57,620 INFO status has been updated to successful


fb0e4412effeec518746e67415299e1c.zip:   0%|          | 0.00/1.51M [00:00<?, ?B/s]

2025-07-14 08:35:58,154 INFO status has been updated to successful


Archive:  cds_data/dowload_subsample_data_from_cds/pecd4_2_historical_era5_reanalysis_solar_photovoltaic_generation_capacity_factor_nuts_0_2011/pecd4_2_historical_era5_reanalysis_solar_photovoltaic_generation_capacity_factor_nuts_0_2011_60.zip
  inflating: cds_data/dowload_subsample_data_from_cds/pecd4_2_historical_era5_reanalysis_solar_photovoltaic_generation_capacity_factor_nuts_0_2011/temp/H_ERA5_ECMW_T639_SPV_0000m_Pecd_NUT0_S201101010000_E201112312300_CFR_TIM_01h_COM_noc_org_60_NA---_NA---_PhM03_PECD4.2_fv1.csv  
  inflating: cds_data/dowload_subsample_data_from_cds/pecd4_2_historical_era5_reanalysis_solar_photovoltaic_generation_capacity_factor_nuts_0_2011/temp/H_ERA5_ECMW_T639_SPV_0000m_Pecd_NUT0_S201201010000_E201212312300_CFR_TIM_01h_COM_noc_org_60_NA---_NA---_PhM03_PECD4.2_fv1.csv  


7bae838a3a17f1d5f9621db8db6c5bdb.zip:   0%|          | 0.00/1.51M [00:00<?, ?B/s]

Archive:  cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_cmcc_cm2_sr5_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/pecd4_2_future_projections_cmcc_cm2_sr5_solar_photovoltaic_generation_capacity_factor_nuts_0_2031_ssp2_4_5_60.zip
  inflating: cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_cmcc_cm2_sr5_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/temp/P_CMI6_CMCC_CMR5_SPV_0000m_Pecd_NUT0_S203101010000_E203112312300_CFR_TIM_01h_NA-_noc_org_60_SP245_NA---_PhM03_PECD4.2_fv1.csv  
  inflating: cds_data/dowload_subsample_data_from_cds/pecd4_2_future_projections_cmcc_cm2_sr5_solar_photovoltaic_generation_capacity_factor_nuts_0_2031/temp/P_CMI6_CMCC_CMR5_SPV_0000m_Pecd_NUT0_S203201010000_E203212312300_CFR_TIM_01h_NA-_noc_org_60_SP245_NA---_PhM03_PECD4.2_fv1.csv  
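
Process-based pools can be awkward in some environments, for instance where worker processes are started with the spawn method and cannot see functions defined in a notebook cell. Since the work here is mostly waiting on the network, a thread pool is one possible alternative. This is a swapped-in technique, not what the notebook uses: `multiprocessing.pool.ThreadPool` exposes the same `starmap` method, and `fake_retrieve` is a hypothetical stand-in for the download function.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_retrieve(dataset, origin, years):
    """Stand-in for a download: waits briefly, then returns a label."""
    time.sleep(0.1)
    return f"{dataset}/{origin[0]}/{years[0]}-{years[-1]}"


requests = [
    ("sis-energy-pecd", ["era5_reanalysis"], ["2011", "2012"]),
    ("sis-energy-pecd", ["cmcc_cm2_sr5"], ["2031", "2032"]),
    ("sis-energy-pecd", ["ec_earth3"], ["2031", "2032"]),
    ("sis-energy-pecd", ["mpi_esm1_2_hr"], ["2031", "2032"]),
]

with ThreadPool(4) as pool:
    # starmap unpacks each tuple into positional arguments and keeps the input order
    results = pool.starmap(fake_retrieve, requests)
print(results)
```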


## Take home messages 📌



*  To download large amounts of data from the CDS efficiently, you need to split your request into smaller requests; you can do this very easily with Python code, and you can also parallelize them.
*  If you want to keep only some specific information and drop unnecessary data, you can build a function that does it for you straight after each request goes through, so that the final files contain just what you need.

