# Get metadata on available ESGF (CMIP6) datasets and store in JSON files
In this notebook, ```ESMValTool``` is used to query the available ESGF nodes and find which climate model runs contain the variables we are interested in. The ```search_esgf``` function calls online metadata servers and returns all available climate datasets that match the specified criteria. The result is used to choose which climate model to use for the impact analyses. In notebook 0a this choice is hard-coded.

## Note on names
Confusingly, names like ```dataset``` are not uniquely defined. A ```Dataset``` object in ESMValTool is a combination of climate model, ensemble member, experiment, etc. However, the climate model within a ```Dataset``` is identified by the key ```dataset```. This is confusing, but outside of my control. See [Juckes et al. 2020](https://doi.org/10.5194/gmd-13-201-2020) for a detailed description of the jargon used in CMIP6.
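
To make the naming concrete, here is a minimal sketch that uses a plain Python dictionary instead of the real ESMValTool ```Dataset``` class. The key names follow the facet names commonly used in ESMValTool recipes. The values (model, ensemble member) are illustrative assumptions, not the result of a query.

```python
# Illustrative facets describing one "Dataset" in the ESMValTool sense.
# The values are example assumptions, not output of an ESGF search.
facets = {
    "project": "CMIP6",
    "dataset": "MPI-ESM1-2-HR",  # the climate model, despite the key name
    "exp": "historical",         # experiment
    "ensemble": "r1i1p1f1",      # ensemble member
    "mip": "day",                # CMIP6 table with daily data
    "short_name": "pr",          # variable
}

# The climate model is looked up with the key "dataset".
climate_model = facets["dataset"]
print(f"Climate model: {climate_model}")
```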

## Note on downloaded data
This notebook only queries metadata and does not download any other data. The actual data sits on different servers, any of which may be offline at any time. The search returns a list of data that exists, but it does not guarantee that the data is available at the current time.
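
Because the servers can be offline, one possible pattern is to fall back on a previously saved JSON file when a query fails. The sketch below is an assumption about how this could be done, not part of eWaterCycle or ESMValTool: ```load_or_query``` is a hypothetical helper, and ```query``` stands for any function that returns JSON-serializable metadata.

```python
import json
from pathlib import Path


def load_or_query(cache_path, query):
    """Run query(); on failure, fall back to a JSON cache if one exists.

    Hypothetical helper: `query` is any zero-argument callable that
    returns JSON-serializable metadata.
    """
    cache_path = Path(cache_path)
    try:
        result = query()
    except Exception as err:  # broad on purpose: the failure mode is unknown
        if not cache_path.exists():
            raise  # nothing cached, so surface the original error
        print(f"Query failed ({err!r}); using cached metadata from {cache_path}")
        return json.loads(cache_path.read_text())
    # The query succeeded: refresh the cache for later offline use.
    cache_path.write_text(json.dumps(result, indent=4))
    return result
```

Note that the cache only reflects what existed at the time of the last successful query, so it can be out of date.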

## Note on eWaterCycle vs ESMValTool
In this notebook we use the ```search_esgf``` function that we built on top of ESMValTool; the actual call to the metadata server is made by the ESMValTool package, which we have wrapped in eWaterCycle. None of this would be possible without the work of the ESMValTool team.

In [1]:
import ewatercycle.esmvaltool.search
from esmvalcore.config import CFG

from rich import print
import json



In [2]:
# Parameters
region_id = None
settings_path = "settings.json"

In [3]:
# Parameters
region_id = "camelsgb_22001"
settings_path = "regions/camelsgb_22001/settings.json"


In [4]:
# Setting for ESMValTool to make sure the online esgf resources are always used and
# we don't rely on locally cached information.
CFG['search_esgf'] = 'always'

In [5]:
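# Search criteria: one historical run and four SSP scenarios, at daily frequency.
experiment_of_interest = ["historical", "ssp126", "ssp245", "ssp370", "ssp585"]
project_of_interest = "CMIP6"
frequency_of_interest = "day"
# pr: precipitation, tas: near-surface air temperature,
# rsds: surface downwelling shortwave radiation
variables_of_interest = ["pr", "tas", "rsds"]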


In [6]:
# We query the ESGF databases for available datasets. This calls external servers that host ESGF
# metadata, which may be down at any moment. The query can take a long time to complete
# (from a few minutes up to half an hour).



# valid_datasets = ewatercycle.esmvaltool.search.search_esgf(
#     experiment=experiment_of_interest,
#     project = project_of_interest,
#     frequency = frequency_of_interest,
#     variables = variables_of_interest
# )

In [7]:
# print(valid_datasets["MPI-ESM1-2-HR"])

In [8]:
# # Convert sets to lists for JSON serialization
# serializable_valid_datasets = {key: list(value) for key, value in valid_datasets[0].items()}

# # Write to a JSON file
# with open(f"regions/{region_id}/available_climate_datasets.json", "w") as json_file:
#     json.dump(serializable_valid_datasets, json_file, indent=4)

In [9]:
# Read from the JSON file
with open(f"regions/{region_id}/available_climate_datasets.json", "r") as json_file:
    loaded_dict = json.load(json_file)

# Convert lists back to sets
restored_valid_datasets = {key: set(value) for key, value in loaded_dict.items()}

print("Restored dictionary:", restored_valid_datasets)
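
The conversion between sets and lists is needed because JSON has no set type. The following self-contained sketch shows the round trip on made-up data. The dictionary ```example``` and its contents are hypothetical stand-ins, not actual search results.

```python
import json

# Hypothetical stand-in for a search result: model name -> set of strings.
example = {"MODEL-A": {"r1i1p1f1", "r2i1p1f1"}, "MODEL-B": {"r1i1p1f1"}}

# json.dumps(example) would raise a TypeError because sets are not
# JSON-serializable, so convert each set to a list first. Sorting keeps
# the written file stable between runs.
serializable = {key: sorted(value) for key, value in example.items()}
text = json.dumps(serializable, indent=4)

# Reading back yields lists; convert them to sets to restore the original.
restored = {key: set(value) for key, value in json.loads(text).items()}
print(restored == example)
```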