# Get metadata on available ESGF (CMIP6) datasets and store in json files
In this notebook ```ESMValTool``` is used to query the available ESGF nodes and find which climate model runs contain the variables we are interested in. Using a template dictionary, we find all available datasets that match the template and contain the variables from a given list. For each ```dataset``` (for example 'EC-Earth3') we create a json file that holds the meta-info on all ensemble members that match the criteria from the template. These json files can later be used to generate forcing.

We use this here to get meta-info for a specific CMIP6 scenario (in our case: ssp585) so we can later build forcing input for hydrological models in eWaterCycle.

## Note on names
Confusingly, names like ```dataset``` are not uniquely defined. A ```Dataset``` object in ESMValTool is a combination of climate model, ensemble member, experiment, etc. However, the climate model within a ```Dataset``` is identified by the key ```dataset```. This is confusing, but outside of my control.
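
To make the naming issue concrete, here is a minimal sketch that uses a plain dictionary rather than a real ```Dataset``` object. The keys follow the facet names as I understand ESMValTool to use them for CMIP6, and the values are illustrative examples, not query results. The helper ```climate_model``` is hypothetical.

```python
# Illustrative facets only: a plain dict standing in for a Dataset object.
# The values are examples and were not returned by any ESGF query.
facets = {
    "project": "CMIP6",
    "dataset": "EC-Earth3",  # the climate model name, stored under 'dataset'
    "exp": "ssp585",         # the experiment (scenario)
    "ensemble": "r1i1p1f1",  # the ensemble member
    "mip": "day",            # the table that holds daily data
    "short_name": "pr",      # the variable
}


def climate_model(facets):
    """Return the climate model name (hypothetical helper)."""
    return facets["dataset"]


model_name = climate_model(facets)
```

So the whole dictionary describes one "Dataset", while ```facets["dataset"]``` is only the model name.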

## Note on downloaded data
This notebook only queries for meta-data and does not download any other data. It will, however, generate quite a few json files, which are stored in the configFiles directory.
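
As a rough sketch of what writing one json file per climate model could look like, the block below uses only the standard library. The function ```write_member_files```, the file layout (the ```dataset``` and ```ensembles``` keys) and the example members are assumptions for illustration, not the layout this notebook actually produces. The demonstration writes into a temporary directory so that nothing is left on disk.

```python
import json
import tempfile
from pathlib import Path


def write_member_files(members_by_model, out_dir):
    """Write one json file per climate model (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for model, members in members_by_model.items():
        path = out_dir / f"{model}.json"
        # Sets are not JSON serializable, so store a sorted list instead.
        content = {"dataset": model, "ensembles": sorted(members)}
        with open(path, "w") as json_file:
            json.dump(content, json_file, indent=4)
        written.append(path)
    return written


# Made-up example data: climate model -> set of ensemble members.
example = {
    "EC-Earth3": {"r1i1p1f1", "r4i1p1f1"},
    "MPI-ESM1-2-HR": {"r1i1p1f1"},
}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_member_files(example, Path(tmp) / "configFiles")
    names = sorted(path.name for path in paths)
    first = json.loads(paths[0].read_text())
```

In real use, a directory such as ```configFiles``` would be passed in place of the temporary one.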

## Note on eWaterCycle vs ESMValTool
In this notebook we use none of the core eWaterCycle package functions. The function to query ESGF is from ESMValTool and there is no eWaterCycle wrapper around ESMValTool for this specific use case.

In [1]:
import ewatercycle.esmvaltool.search
from esmvalcore.config import CFG

from rich import print
import json

In [2]:
# Setting for ESMValTool to make sure the online esgf resources are always used and
# we don't rely on locally cached information.
CFG['search_esgf'] = 'always'

In [3]:
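# Search criteria for the ESGF query. No trailing commas here: a trailing comma
# after an assignment would turn the value into a one-element tuple.
experiment = ["historical", "ssp245", "ssp585"]
project = "CMIP6"
frequency = "day"
variables = ["pr", "tas", "rsds"]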


In [4]:
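# Query the ESGF databases for available datasets. This calls external servers that host
# ESGF metadata, which may be down at any moment, and the query can take a long time to
# complete (from minutes to half an hour).
# The call below is commented out. Note the trailing comma after its closing parenthesis:
# it wraps the result in a one-element tuple, which is why a later cell indexes
# valid_datasets[0].
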
# valid_datasets = ewatercycle.esmvaltool.search.search_esgf(
#     experiment=["historical","ssp126","ssp245","ssp370","ssp585"],
#     project = "CMIP6",
#     frequency="day",
#     variables=["pr", "tas", "rsds"]
# ),

In [5]:
# print(valid_datasets)

In [6]:
# # Convert sets to lists for JSON serialization
# serializable_valid_datasets = {key: list(value) for key, value in valid_datasets[0].items()}
#
# # Write to a JSON file
# with open("available_climate_datasets.json", "w") as json_file:
#     json.dump(serializable_valid_datasets, json_file, indent=4)

In [7]:
# Read from the JSON file
with open("available_climate_datasets.json", "r") as json_file:
    loaded_dict = json.load(json_file)

# Convert lists back to sets
restored_valid_datasets = {key: set(value) for key, value in loaded_dict.items()}

print("Restored dictionary:", restored_valid_datasets)
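
The conversion between sets and lists in the cells above is needed because the ```json``` module cannot serialize Python sets. The following sketch shows the round trip on a small made-up dictionary. The model name and ensemble members are illustrative and not taken from the query results.

```python
import json

# Made-up example in the same shape as the dictionary above: key -> set of strings.
datasets = {"EC-Earth3": {"r1i1p1f1", "r2i1p1f1"}}

# Dumping a set directly raises a TypeError.
try:
    json.dumps(datasets)
    set_is_serializable = True
except TypeError:
    set_is_serializable = False

# Convert the sets to (sorted) lists, serialize, then restore the sets on load.
as_lists = {key: sorted(value) for key, value in datasets.items()}
text = json.dumps(as_lists, indent=4)
restored = {key: set(value) for key, value in json.loads(text).items()}
```

Sorting the lists is optional, but it keeps the json file stable between runs, since sets have no guaranteed order.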