**<center>Downloading CMIP6 data from the ESGF platform using ESGF Python package "esgf-pyclient"</center>**

*More info about the esgf-pyclient package is found here: https://esgf-pyclient.readthedocs.io/en/latest/index.html*

*This script was developed on Python 3.9.19.*

*I recommend creating a virtual environment specifically for this script and installing the following packages in it: esgf-pyclient, pandas, xarray, requests. Writing and reading the .xlsx file with pandas also requires openpyxl.*

*Feel free to optimise this script and share the update*

**1. Importing necessary libraries**

In [None]:
from pyesgf.search import SearchConnection
import os
import pandas as pd
import requests

**2. Data search based on criteria**

In [None]:
# Here we define the criteria used to select the desired datasets.

project = ["CMIP6"] # Indicates that we want CMIP6 data

source_id = ["TaiESM1"] # Model(s) to search for. The list can contain more than one model.

experiment_id = ["historical", "ssp245", "ssp585"] # Historical data and/or projections (ssp245, ssp585, etc.)

variant_label = ["r1i1p1f1"] # Variant label

frequency = ["day"] # Temporal frequency of the data

variable = ["tas", "tasmin", "tasmax"] # Climate variable(s) to search for

# The list below ("facets") must contain all the facets you intend to use to filter your search.
# "latest" specifies whether or not you want only the latest version of each dataset.
# "replica" specifies whether or not you want duplicates to be included in your search results.
# It is advised to always keep "latest" and "replica" in the list.
facets = ["project", "source_id", "experiment_id", "variable", "frequency", "variant_label", "latest", "replica"]

In [None]:
# Here we specify the node from which the data search starts. Setting the "distrib" parameter to True expands the search to the other nodes
conn = SearchConnection('https://esgf-node.ipsl.upmc.fr/esg-search', distrib=True)

# Here we launch the search
query = conn.new_context(latest=True, # Set to True to only get the latest version of each dataset
                         replica=False, # Set to False to avoid getting duplicates in the search results
                         project=project,
                         source_id=source_id,
                         experiment_id=experiment_id,
                         variable=variable,
                         frequency=frequency,
                         variant_label=variant_label,
                         facets=facets) # List of facets the search works with. Must always be kept here


# Here we obtain the total number of results the search has returned
results_count = query.hit_count

print(f"The search has returned {results_count} results")

*Note that each result (dataset) contains a certain number of files, and each file is associated with a unique URL that can be used to download it. The cell below extracts these URLs and stores them in an Excel spreadsheet.*

In [None]:
# Initialising an empty list that will be used to store the extracted URLs
urls = []

# Running the search once and reusing the result set (calling query.search() inside the loop would repeat the search for every result)
datasets = query.search()

# Starting the extraction of URLs.
for i, dataset in enumerate(datasets, start=1): # This first loop iterates over each result (dataset)

    files_list = dataset.file_context().search() # This creates a list of the files contained in the dataset

    for file in files_list: # This loop iterates over each file of the list to extract its URL

        urls.append(file.download_url)

    print(f"Result {i} out of {results_count} processed")

# Saving the URLs in an Excel spreadsheet
df = pd.DataFrame(urls, columns=["Links"])

df.to_excel("C:/Users/gilunga/Documents/files_url.xlsx", index=False) # index=False avoids writing an extra index column


*This script does not filter the search results by date. If you want to exclude some years, you can delete the corresponding rows from the created spreadsheet before heading to the next part of the script.*
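
*As an alternative to editing the spreadsheet by hand, the year filtering could be sketched in Python as shown below. This is only a minimal sketch: the function `filter_urls_by_year` and the example URLs are hypothetical, and it assumes that file names end with a date range of the form `_YYYYMMDD-YYYYMMDD.nc`, which is common for daily CMIP6 files but should be checked against your own list of URLs.*

```python
import re

# Assumption: file names end with "_YYYYMMDD-YYYYMMDD.nc".
# Only the two years are captured; months and days are ignored.
DATE_RANGE = re.compile(r"_(\d{4})\d{4}-(\d{4})\d{4}\.nc$")


def filter_urls_by_year(urls, first_year, last_year):
    """Keep the URLs whose file overlaps the period first_year..last_year."""
    kept = []
    for url in urls:
        match = DATE_RANGE.search(url)
        if match is None:
            # No recognisable date range: keep the URL for manual review
            kept.append(url)
            continue
        start_year, end_year = int(match.group(1)), int(match.group(2))
        # A file is kept if its period overlaps the requested period
        if start_year <= last_year and end_year >= first_year:
            kept.append(url)
    return kept


# Hypothetical URLs, used only to illustrate the filter
example_urls = [
    "https://example.org/data/tas_day_TaiESM1_historical_r1i1p1f1_gn_19800101-19891231.nc",
    "https://example.org/data/tas_day_TaiESM1_historical_r1i1p1f1_gn_19900101-19991231.nc",
    "https://example.org/data/tas_day_TaiESM1_historical_r1i1p1f1_gn_20000101-20091231.nc",
]

selected = filter_urls_by_year(example_urls, 1995, 2005)
print(f"{len(selected)} files kept out of {len(example_urls)}")
```

*The same function could be applied to the list of URLs read from the spreadsheet before launching the download.*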

**3. Data download**

*Using the URLs extracted in the previous step, we will now download the dataset of interest*

In [None]:
# Opening the Excel spreadsheet containing the URLs
df_urls = pd.read_excel("C:/Users/gilunga/Documents/files_url.xlsx")

# Converting the dataframe into a list
list_urls = df_urls["Links"].tolist()

# Directory where files will be saved
path = "C:/Users/gilunga/Documents/"

In [None]:
# Launching the download
for url in list_urls:

    file_name = url.rsplit("/", 1)[-1] # Extracting the file name from the URL

    output = os.path.join(path, file_name) # Full path of the file to be written

    response = requests.get(url, stream=True) # stream=True avoids loading the whole file into memory

    if response.status_code == 200:

        with open(output, 'wb') as f:

            for chunk in response.iter_content(chunk_size=8192): # Writing the file chunk by chunk

                f.write(chunk)

        print(f'Successful download of: {file_name}')

    else:
        print(f'Unsuccessful download of: {file_name}. Status code: {response.status_code}')

print("End of download. Check the log to ensure all data were successfully downloaded")
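
*Rather than reading the log line by line, the check could be sketched as below. This is a minimal sketch, not part of the original script: the helper `find_missing_files` is hypothetical, and it only tests that each expected file exists and is not empty. It does not verify checksums, so a truncated file would not be detected.*

```python
import os


def find_missing_files(urls, directory):
    """Return the URLs whose target file is absent or empty in `directory`."""
    missing = []
    for url in urls:
        file_name = url.rsplit("/", 1)[-1]  # Same file-name rule as the download loop
        target = os.path.join(directory, file_name)
        if not os.path.isfile(target) or os.path.getsize(target) == 0:
            missing.append(url)
    return missing
```

*Calling `find_missing_files(list_urls, path)` after the download would return the URLs that still need to be fetched, and the download loop could then be run again on that shorter list.*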