# Get metadata on available ESGF (CMIP6) datasets and store in json files
In this notebook ```ESMValTool``` is used to query the available ESGF nodes and find which climate model runs contain the variables we are interested in. Using a template dictionary, we find all available datasets that match the template and provide the variables from a given list. For each ```dataset``` (for example 'EC-Earth3') we create a JSON file that holds the meta-info on all ensemble members that match the criteria from the template. These JSON files can later be used to generate forcing.

We use this here to get meta-info for a specific CMIP6 scenario (in our case: ssp585) so we can later build forcing input for hydrological models in eWaterCycle.

## Note on names
Confusingly, names like ```dataset``` are not uniquely defined. A ```Dataset``` object in ESMValTool is a combination of climate model, ensemble member, experiment, etc. However, the climate model within a ```Dataset``` is identified using the key ```dataset```. This is confusing, but outside of my control.
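
To make the distinction concrete, the sketch below uses a plain Python dict as a stand-in for the facets of one ```Dataset``` object. The facet values are illustrative assumptions and not the result of a real ESGF query.

```python
# Stand-in for the facets of a single Dataset object (values are illustrative).
example_facets = {
    'project': 'CMIP6',
    'activity': 'ScenarioMIP',
    'exp': 'ssp585',
    'mip': 'day',
    'short_name': 'tas',
    'dataset': 'EC-Earth3',   # the climate model, despite the key name
    'ensemble': 'r1i1p1f1',   # one particular run of that climate model
}

# The key 'dataset' selects only the climate model, not the whole combination.
climate_model = example_facets['dataset']
print(f"Climate model: {climate_model}, ensemble member: {example_facets['ensemble']}")
```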

## Note on downloaded data
This notebook only queries meta-data and does not download any other data. It does, however, generate a considerable number of JSON files, which are stored in the configFiles directory.
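
As a rough sketch of how these files might be read back later, the helper below loads every file that follows the ```datasets_<model>_<experiment>.json``` naming used in the final cell of this notebook. The function name ```load_dataset_files``` and the example content are hypothetical; the demonstration runs on a temporary directory so nothing in the real configFiles directory is touched.

```python
import json
import tempfile
from pathlib import Path


def load_dataset_files(config_dir):
    """Return {file stem: {ensemble: facets}} for every datasets_*.json file."""
    result = {}
    for json_path in sorted(Path(config_dir).glob("datasets_*.json")):
        with open(json_path) as the_file:
            result[json_path.stem] = json.load(the_file)
    return result


# Demonstration with one hand-written file (illustrative values only).
with tempfile.TemporaryDirectory() as tmp_dir:
    example_content = {
        'r1i1p1f1': {'dataset': 'EC-Earth3', 'ensemble': 'r1i1p1f1', 'grid': 'gr'}
    }
    example_path = Path(tmp_dir) / "datasets_EC-Earth3_ssp585.json"
    example_path.write_text(json.dumps(example_content))
    loaded = load_dataset_files(tmp_dir)

print(loaded)
```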

## Note on eWaterCycle vs ESMValTool
In this notebook we use none of the core eWaterCycle package functions. The function to query ESGF is from ESMValTool and there is no eWaterCycle wrapper around ESMValTool for this specific use case.

In [1]:
# libraries needed from ESMValTool
from esmvalcore.config import CFG
from esmvalcore.dataset import Dataset

# more general libraries used
import json
from rich import print
from pathlib import Path

In [2]:
# User settings. If this notebook is used for different scenarios, models, etc., only this cell
# should be changed.

# Variable short names. We will look for datasets that make all of these available and only report
# back on those datasets. For the purpose of using this notebook to subsequently run an ESMValTool
# recipe (for example to generate eWaterCycle forcing), this list must include all the variables
# that the recipe requires. The naming convention used follows that of the ESGF list.
short_names = ['tas','pr','rsds']


# The fields of the datasets that we fix; the other fields are the ones that will be queried.
# The jargon of ESGF, CMIP and ESMValTool can be jarring when you are new to this field. A combination
# of project, activity and exp defines a protocol for a climate model run that multiple research
# groups can subsequently run with their own climate model. dataset refers to the climate model used,
# institute to the organisation that manages this particular climate model. ensemble identifies a
# particular run, since often multiple runs are done for the same experiment.
dataset_template = {
  'project': 'CMIP6',
  'activity': 'ScenarioMIP',
  'exp': 'ssp585',
  'mip': 'day',
}

# These are the further fields that we will be querying and that need to be found
#  'ensemble': '*',
#  'grid': '*',
#  'institute': '*',
#  'dataset': '*',

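# The location (directory) where the JSON files will be written.
# It is created here if it does not exist yet, so that writing the files cannot fail on a missing directory.
json_output_dir = Path.cwd() / "configFiles"
json_output_dir.mkdir(parents=True, exist_ok=True)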
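
The split between fixed and queried fields can be sketched with plain dicts: the fixed template is merged with wildcard entries for the fields left open. The names ```example_template```, ```wildcard_facets``` and ```build_query_facets``` are hypothetical and only serve this illustration; the notebook itself passes the fields to ```Dataset``` one by one.

```python
# Fixed fields, mirroring the template used in this notebook.
example_template = {
    'project': 'CMIP6',
    'activity': 'ScenarioMIP',
    'exp': 'ssp585',
    'mip': 'day',
}

# Fields left open, to be filled in by the ESGF query.
wildcard_facets = {'dataset': '*', 'institute': '*', 'ensemble': '*', 'grid': '*'}


def build_query_facets(short_name):
    """Merge one variable short name, the fixed template and the wildcards."""
    return {'short_name': short_name, **example_template, **wildcard_facets}


query_facets = build_query_facets('tas')
print(query_facets)
```

This only builds the dict; it does not contact ESGF.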

In [3]:
# Setting for ESMValTool to make sure the online ESGF resources are always searched,
# rather than relying only on information that is already available locally.
CFG['search_esgf'] = 'always'

In [4]:
# Loop through the short_names and, for each, ask ESGF for the datasets that match the template.
# Collect all of these in one long list.
# The server that holds the ESGF meta-data might respond slowly (minutes). If the exact same request
# was made recently, the result is still cached and is returned in 10-30 seconds per request.

# empty list to hold results
datasets = list()

for short_name_str in short_names:

    #dataset object to pass to esmvaltool
    dataset_query = Dataset(
        short_name=short_name_str,
        activity=dataset_template['activity'],
        mip=dataset_template['mip'],
        project=dataset_template['project'],
        exp=dataset_template['exp'],
        dataset='*',
        institute='*',
        ensemble='*',
        grid='*',
    )

    #this line does the actual query to ESGF, which is hidden in the 'from_files()'
    datasets_this_shortname = list(dataset_query.from_files())

    #print 2 datasets to show what this data looks like
    print(f"Found {len(datasets_this_shortname)} datasets for short name: { short_name_str }, showing the first 2:")
    print(datasets_this_shortname[:2])

    #add to the list of datasets
    datasets.extend(datasets_this_shortname)

In [5]:
# Define a small function that goes through the list of datasets and returns the set of unique
# values found for a given key. Useful for, for example, getting all the climate models (i.e. 'datasets').
def get_unique_keys_from_nested_datasets(datasets,key_category):
    key_found_set = set()

    for dataset in datasets:
        key_found_set.add(dataset[key_category])

    return key_found_set
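
A small usage sketch of this idea, with plain dicts standing in for the ```Dataset``` objects returned by the query. The compact helper ```unique_facet_values``` is a hypothetical equivalent written as a set comprehension, and the example values are illustrative.

```python
def unique_facet_values(datasets, key):
    """Return the set of distinct values found under `key`."""
    return {dataset[key] for dataset in datasets}


# Plain dicts as stand-ins for Dataset objects (illustrative values).
example_datasets = [
    {'dataset': 'EC-Earth3', 'ensemble': 'r1i1p1f1', 'short_name': 'tas'},
    {'dataset': 'EC-Earth3', 'ensemble': 'r1i1p1f1', 'short_name': 'pr'},
    {'dataset': 'MPI-ESM1-2-HR', 'ensemble': 'r2i1p1f1', 'short_name': 'tas'},
]

models = unique_facet_values(example_datasets, 'dataset')
members = unique_facet_values(example_datasets, 'ensemble')
print(sorted(models), sorted(members))
```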

In [6]:
# get lists of the unique datasets (climate models) and ensemble members
# we will loop over this later.
dataset_names = get_unique_keys_from_nested_datasets(datasets,'dataset')
unique_ensemble_members = get_unique_keys_from_nested_datasets(datasets,'ensemble')


In [7]:
# First we create a dict that maps each climate model ('dataset') to the set of ensemble members
# that are available from that climate model.
unique_ensemble_per_dataset = {}
for dataset_name in dataset_names:
    unique_ensemble_per_dataset[dataset_name] = set()
    for dataset in datasets:
        if dataset['dataset'] == dataset_name:
            unique_ensemble_per_dataset[dataset_name].add(dataset['ensemble'])


# Here we create the JSON output per model ('dataset').
# As a double check, we verify that a combination of climate model and ensemble does indeed
# have all the variables (short names) available.
correct_ensemble_per_dataset = {}

for dataset_name in unique_ensemble_per_dataset.keys():
    correct_ensemble_per_dataset[dataset_name] = set()
    
    dataset_json_output = dict()

    dataset_json_filename = json_output_dir / ("datasets_" + dataset_name + 
                                               "_" + dataset_template['exp'] + ".json")
    
    for ensemble in unique_ensemble_per_dataset[dataset_name]:
        count = 0 
        for dataset in datasets:
            if (dataset['dataset'] == dataset_name): 
                if (dataset['ensemble'] == ensemble) :
                    # check if the number of datasets found equals the number of variables,
                    # under the assumption that no duplicates exist
                    count += 1
                    if (count == len(short_names)):
                        correct_ensemble_per_dataset[dataset_name].add(dataset['ensemble'])

                        # create a dict with all the meta-info on this dataset
                        # note that we assume that all variables use the same grid
                        # because we assume that all variables are from the same
                        # model run on a single grid
                        this_dataset = {}
                        this_dataset['project'] = dataset_template['project']
                        this_dataset['activity'] = dataset_template['activity']
                        this_dataset['exp'] = dataset_template['exp']
                        this_dataset['mip'] = dataset_template['mip']
                        this_dataset['dataset'] = dataset_name
                        this_dataset['ensemble'] = ensemble
                        this_dataset['institute'] = dataset['institute']
                        this_dataset['grid'] = dataset['grid']
    
                        dataset_json_output[ensemble] = this_dataset
                        
                        count = 0

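    # Write the JSON file for this climate model. Mode 'x' creates a new file and raises
    # FileExistsError if the file already exists, so earlier results are never overwritten.
    with open(dataset_json_filename, 'x') as the_file:
        json.dump(dataset_json_output, the_file)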
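
The counting check above relies on the assumption that no duplicates exist. A possible alternative, sketched here and not part of the notebook's method, collects the set of short names found per (climate model, ensemble) pair and compares it with the required set, so a duplicate entry cannot be mistaken for a missing variable. The function name ```complete_members``` is hypothetical, plain dicts stand in for the ```Dataset``` objects, and it is assumed that these objects expose a ```short_name``` key in the same way as ```dataset``` and ```ensemble```.

```python
from collections import defaultdict


def complete_members(datasets, required_short_names):
    """Return the (dataset, ensemble) pairs that provide every required variable."""
    found = defaultdict(set)
    for dataset in datasets:
        found[(dataset['dataset'], dataset['ensemble'])].add(dataset['short_name'])
    required = set(required_short_names)
    # A pair is complete when the required set is a subset of what was found.
    return {pair for pair, names in found.items() if required <= names}


# Illustrative input: the first member has 'tas' twice (two grids) but lacks 'rsds'.
example_datasets = [
    {'dataset': 'EC-Earth3', 'ensemble': 'r1i1p1f1', 'short_name': 'tas', 'grid': 'gr'},
    {'dataset': 'EC-Earth3', 'ensemble': 'r1i1p1f1', 'short_name': 'tas', 'grid': 'gn'},
    {'dataset': 'EC-Earth3', 'ensemble': 'r1i1p1f1', 'short_name': 'pr', 'grid': 'gr'},
    {'dataset': 'EC-Earth3', 'ensemble': 'r2i1p1f1', 'short_name': 'tas', 'grid': 'gr'},
    {'dataset': 'EC-Earth3', 'ensemble': 'r2i1p1f1', 'short_name': 'pr', 'grid': 'gr'},
    {'dataset': 'EC-Earth3', 'ensemble': 'r2i1p1f1', 'short_name': 'rsds', 'grid': 'gr'},
]

complete = complete_members(example_datasets, ['tas', 'pr', 'rsds'])
print(complete)
```

In this example a pure count would see three entries for the first member and accept it, while the set comparison notices that 'rsds' is missing.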
           