# Importing CMIP6 data from glade (on Cheyenne)

## Here is a notebook example of how to access CMIP6 data from glade using intake-esm.

### Make sure the CMIP6 2019.10 kernel is selected.

Imports:

In [1]:
%matplotlib inline

import xarray as xr
import intake
import util 

Use the following data cataloging utility to source CMIP6 data sets:

In [2]:
if util.is_ncar_host():
    col = intake.open_esm_datastore("../catalogs/glade-cmip6.json")
else:
    col = intake.open_esm_datastore("../catalogs/pangeo-cmip6.json")
col

glade-cmip6-ESM Collection with 698724 entries:
	> 13 activity_id(s)

	> 24 institution_id(s)

	> 47 source_id(s)

	> 68 experiment_id(s)

	> 162 member_id(s)

	> 35 table_id(s)

	> 1027 variable_id(s)

	> 12 grid_label(s)

	> 59 dcpp_init_year(s)

	> 248 version(s)

	> 6813 time_range(s)

	> 698724 path(s)
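
The `util` module imported above is local to the notebook's repository and its contents are not shown here. As a rough sketch of what a helper like `is_ncar_host` might do, the version below matches the machine's host name against a list of name fragments. The pattern list, the optional `hostname` argument, and the `catalog_path` variable are assumptions made for illustration, not the actual implementation.

```python
import re
import socket

# Hypothetical host-name fragment; the real helper may check other names too.
NCAR_HOST_PATTERNS = ("cheyenne",)


def is_ncar_host(hostname=None):
    """Return True if the host name contains one of the known fragments."""
    if hostname is None:
        hostname = socket.getfqdn()
    return any(re.search(pattern, hostname) for pattern in NCAR_HOST_PATTERNS)


# Pick the catalog file the same way the cell above does.
catalog_path = (
    "../catalogs/glade-cmip6.json"
    if is_ncar_host()
    else "../catalogs/pangeo-cmip6.json"
)
print(catalog_path)
```

Accepting the host name as an argument keeps the check easy to exercise on any machine, for example `is_ncar_host("some-laptop")`.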

A list of CMIP6 global attributes and id(s) can be found here: https://www.earthsystemcog.org/site_media/projects/wip/CMIP6_global_attributes_filenames_CVs_v6.2.6.pdf

For our purposes, we will look for datasets that contain monthly near-surface air temperature (``tas``, from the ``Amon`` table) on each model's native grid (``gn``), keeping only the models that provide all three of the ``historical``, ``ssp585`` (ScenarioMIP), and ``piControl`` (pre-industrial control) experiments.

In [3]:
uni_dict = col.unique(['source_id', 'experiment_id', 'table_id'])
models = set(uni_dict['source_id']['values'])  # start from every model in the catalog

# Keep only the models that provide tas for all three experiments
for experiment_id in ['historical', 'ssp585', 'piControl']:
    query = dict(experiment_id=experiment_id, table_id='Amon',
                 variable_id='tas', grid_label='gn')
    cat = col.search(**query)
    # Intersect with the models that appear in this experiment's results
    models = models.intersection(cat.df.source_id.unique())

models = list(models)
models

['MRI-ESM2-0',
 'CAMS-CSM1-0',
 'FGOALS-g3',
 'UKESM1-0-LL',
 'MIROC-ES2L',
 'CanESM5',
 'BCC-CSM2-MR',
 'MIROC6']
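
The filtering step above is a running set intersection: start with every model, then drop any model that is missing from one of the required experiments. The sketch below shows the same logic without a catalog. The `records` list and the `model-A`/`model-B`/`model-C` names are made up for illustration and do not come from the CMIP6 catalog.

```python
# Made-up (source_id, experiment_id) pairs standing in for catalog rows.
records = [
    ("model-A", "historical"),
    ("model-A", "ssp585"),
    ("model-A", "piControl"),
    ("model-B", "historical"),
    ("model-B", "piControl"),
    ("model-C", "historical"),
    ("model-C", "ssp585"),
    ("model-C", "piControl"),
]
required = ["historical", "ssp585", "piControl"]

# Start from every model, then keep only those seen in each experiment.
models = {source_id for source_id, _ in records}
for experiment_id in required:
    has_experiment = {s for s, e in records if e == experiment_id}
    models &= has_experiment

# sorted() gives a reproducible order; list(set) order is arbitrary.
models = sorted(models)
print(models)
```

Here `model-B` is dropped because it has no `ssp585` entry.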

Search the CMIP6 catalog again, this time restricting the query to the models found above, and display the metadata of the matching files as a pandas dataframe.

In [4]:
cat = col.search(experiment_id=['historical', 'ssp585', 'piControl'], table_id='Amon', 
                 variable_id='tas', grid_label='gn', source_id=models)
cat.df

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | dcpp_init_year | version | time_range | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35731 | CMIP | BCC | BCC-CSM2-MR | historical | r2i1p1f1 | Amon | tas | gn | | v20181115 | 185001-201412 | /glade/collections/cmip/CMIP6/CMIP/BCC/BCC-CSM... |
| 36172 | CMIP | BCC | BCC-CSM2-MR | historical | r1i1p1f1 | Amon | tas | gn | | v20181126 | 185001-201412 | /glade/collections/cmip/CMIP6/CMIP/BCC/BCC-CSM... |
| 36609 | CMIP | BCC | BCC-CSM2-MR | historical | r3i1p1f1 | Amon | tas | gn | | v20181119 | 185001-201412 | /glade/collections/cmip/CMIP6/CMIP/BCC/BCC-CSM... |
| 37431 | CMIP | BCC | BCC-CSM2-MR | piControl | r1i1p1f1 | Amon | tas | gn | | v20181016 | 185001-244912 | /glade/collections/cmip/CMIP6/CMIP/BCC/BCC-CSM... |
| 174777 | CMIP | CAS | FGOALS-g3 | historical | r2i1p1f1 | Amon | tas | gn | | v20190828 | 196001-196912 | /glade/collections/cmip/CMIP6/CMIP/CAS/FGOALS-... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 652996 | ScenarioMIP | CAMS | CAMS-CSM1-0 | ssp585 | r1i1p1f1 | Amon | tas | gn | | v20190708 | 201501-209912 | /glade/collections/cmip/CMIP6/ScenarioMIP/CAMS... |
| 697364 | ScenarioMIP | MRI | MRI-ESM2-0 | ssp585 | r1i1p1f1 | Amon | tas | gn | | v20190222 | 201501-210012 | /glade/collections/cmip/CMIP6/ScenarioMIP/MRI/... |
| 697661 | ScenarioMIP | MIROC | MIROC-ES2L | ssp585 | r1i1p1f2 | Amon | tas | gn | | v20190823 | 201501-210012 | /glade/collections/cmip/CMIP6/ScenarioMIP/MIRO... |
| 697914 | ScenarioMIP | MIROC | MIROC6 | ssp585 | r2i1p1f1 | Amon | tas | gn | | v20190627 | 201501-210012 | /glade/collections/cmip/CMIP6/ScenarioMIP/MIRO... |


Finally, we will save the pandas dataframe as a CSV file that contains model information for the simulations that we will use for follow-up analyses.

In [5]:
cat.df.to_csv('/glade/scratch/molina/CMIP6_pathnames.csv')
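
As a sketch of how a follow-up notebook might consume such a CSV, the example below groups file paths by model, experiment, and ensemble member using only the standard library. To stay self-contained it reads from an in-memory string instead of the file on disk; that string holds the header and two rows from the table above, with the paths left truncated exactly as displayed. The names `sample_csv` and `paths_by_run` are illustrative.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the saved CSV file: header plus two rows from the table above.
# The first header field is empty because it holds the unnamed dataframe index.
sample_csv = io.StringIO(
    ",activity_id,institution_id,source_id,experiment_id,member_id,table_id,"
    "variable_id,grid_label,dcpp_init_year,version,time_range,path\n"
    "35731,CMIP,BCC,BCC-CSM2-MR,historical,r2i1p1f1,Amon,tas,gn,,v20181115,"
    "185001-201412,/glade/collections/cmip/CMIP6/CMIP/BCC/BCC-CSM...\n"
    "37431,CMIP,BCC,BCC-CSM2-MR,piControl,r1i1p1f1,Amon,tas,gn,,v20181016,"
    "185001-244912,/glade/collections/cmip/CMIP6/CMIP/BCC/BCC-CSM...\n"
)

# Collect every path that belongs to the same (model, experiment, member).
paths_by_run = defaultdict(list)
for row in csv.DictReader(sample_csv):
    run = (row["source_id"], row["experiment_id"], row["member_id"])
    paths_by_run[run].append(row["path"])

for run, paths in sorted(paths_by_run.items()):
    print(run, len(paths))
```

With the real file, `sample_csv` would be replaced by an open file handle; a simulation split across several time ranges would then contribute several paths to the same key.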

The saved file can be used in follow-up notebooks to loop through the ensemble members and simulations.

### All done!

## Extra: Exploring the metadata of selected CMIP6 models

Using the ``cat`` object from above, which holds the filtered catalog entries, we can lazily load the data into a dictionary of xarray datasets (this may take a little while).

In [6]:
dset_dict = cat.to_dataset_dict(zarr_kwargs={'consolidated': True, 'decode_times': False}, 
                                cdf_kwargs={'chunks': {}, 'decode_times': False})


xarray will load netCDF datasets with dask using a single chunk for all arrays.
For effective chunking, please provide chunks in cdf_kwargs.
For example: cdf_kwargs={'chunks': {'time': 36}}

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There will be 24 group(s)
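
Because each key follows the dotted pattern printed above, it can be split back into its catalog fields. The helper below is a minimal sketch that assumes no individual field contains a dot; `KEY_FIELDS` and `parse_key` are names introduced here for illustration.

```python
# Field order taken from the key pattern printed by to_dataset_dict.
KEY_FIELDS = ("activity_id", "institution_id", "source_id",
              "experiment_id", "table_id", "grid_label")


def parse_key(key):
    """Split a dataset key into a dict of its six catalog fields."""
    parts = key.split(".")
    if len(parts) != len(KEY_FIELDS):
        raise ValueError(
            f"expected {len(KEY_FIELDS)} fields, got {len(parts)}: {key!r}"
        )
    return dict(zip(KEY_FIELDS, parts))


info = parse_key("CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn")
print(info["source_id"], info["experiment_id"])
```

A helper like this makes it straightforward to pick out, say, only the keys whose `experiment_id` is `historical`.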


Example of what ``cat`` contains.

In [7]:
cat

glade-cmip6-ESM Collection with 285 entries:
	> 2 activity_id(s)

	> 7 institution_id(s)

	> 8 source_id(s)

	> 3 experiment_id(s)

	> 59 member_id(s)

	> 1 table_id(s)

	> 1 variable_id(s)

	> 1 grid_label(s)

	> 0 dcpp_init_year(s)

	> 24 version(s)

	> 114 time_range(s)

	> 285 path(s)

Print the keys of the dataset dictionary using the ``keys`` method. Each key identifies one group of files (model, experiment, table, and grid), not an individual file name.

In [8]:
dset_dict.keys()

dict_keys(['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn', 'CMIP.BCC.BCC-CSM2-MR.piControl.Amon.gn', 'CMIP.CAMS.CAMS-CSM1-0.historical.Amon.gn', 'CMIP.CAMS.CAMS-CSM1-0.piControl.Amon.gn', 'CMIP.CAS.FGOALS-g3.historical.Amon.gn', 'CMIP.CAS.FGOALS-g3.piControl.Amon.gn', 'CMIP.CAS.FGOALS-g3.ssp585.Amon.gn', 'CMIP.CCCma.CanESM5.historical.Amon.gn', 'CMIP.CCCma.CanESM5.piControl.Amon.gn', 'CMIP.MIROC.MIROC-ES2L.historical.Amon.gn', 'CMIP.MIROC.MIROC-ES2L.piControl.Amon.gn', 'CMIP.MIROC.MIROC6.historical.Amon.gn', 'CMIP.MIROC.MIROC6.piControl.Amon.gn', 'CMIP.MOHC.UKESM1-0-LL.historical.Amon.gn', 'CMIP.MOHC.UKESM1-0-LL.piControl.Amon.gn', 'CMIP.MRI.MRI-ESM2-0.historical.Amon.gn', 'CMIP.MRI.MRI-ESM2-0.piControl.Amon.gn', 'ScenarioMIP.BCC.BCC-CSM2-MR.ssp585.Amon.gn', 'ScenarioMIP.CAMS.CAMS-CSM1-0.ssp585.Amon.gn', 'ScenarioMIP.CCCma.CanESM5.ssp585.Amon.gn', 'ScenarioMIP.MIROC.MIROC-ES2L.ssp585.Amon.gn', 'ScenarioMIP.MIROC.MIROC6.ssp585.Amon.gn', 'ScenarioMIP.MOHC.UKESM1-0-LL.ssp585.Amon.gn', 'Scenari

Select a single member ID from the historical CMIP6 ensembles and view xarray metadata.

In [9]:
first_dataset = dset_dict['CMIP.BCC.BCC-CSM2-MR.historical.Amon.gn'].sel(member_id='r1i1p1f1')
first_dataset

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, time: 1980)
Coordinates:
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
  * time       (time) float64 15.5 45.0 74.5 ... 6.015e+04 6.018e+04 6.021e+04
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
    member_id  <U8 'r1i1p1f1'
Dimensions without coordinates: bnds
Data variables:
    lat_bnds   (lat, bnds) float64 -90.0 -88.59 -88.59 ... 88.59 88.59 90.0
    time_bnds  (time, bnds) float64 dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lon_bnds   (lon, bnds) float64 -0.5625 0.5625 0.5625 ... 358.3 358.3 359.4
    height     float64 2.0
    tas        (time, lat, lon) float32 dask.array<chunksize=(1980, 160, 320), meta=np.ndarray>
Attributes:
    forcing_index:          1
    source_id:              BCC-CSM2-MR
    initialization_index:   1
    run_variant:            forcing: greenhouse gases,solar constant,aerosol,...
    table_info:             Creation D
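
Since the datasets were opened with `decode_times=False`, the `time` coordinate above is a plain float offset, not a date. The sketch below shows how such offsets map to calendar months, assuming the units are days since 1 January 1850 in a 365-day (no-leap) calendar. Neither the units nor the calendar appear in the output above, so both are assumptions; they are consistent with the first values shown (15.5, 45.0, 74.5). In practice xarray can decode CF time metadata itself (for example with `xr.decode_cf`), so this is only meant to show what the raw numbers mean.

```python
# Month lengths in a 365-day (no-leap) year.
DAYS_IN_MONTH = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def noleap_year_month(days, base_year=1850):
    """Map a day offset from 1 January of base_year to (year, month),
    assuming every year has exactly 365 days."""
    year = base_year + int(days // 365)
    day_of_year = days % 365
    for month, length in enumerate(DAYS_IN_MONTH, start=1):
        if day_of_year < length:
            return year, month
        day_of_year -= length
    raise ValueError("offset must be a finite number")


# First three time values from the dataset output above.
for offset in (15.5, 45.0, 74.5):
    print(offset, noleap_year_month(offset))
```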

View the size, shape, and chunking of the underlying dask array for the monthly near-surface air temperature variable (``tas``).

In [11]:
first_dataset.tas.data

| | Array | Chunk |
|---|---|---|
| Bytes | 405.50 MB | 405.50 MB |
| Shape | (1980, 160, 320) | (1980, 160, 320) |
| Count | 16 Tasks | 1 Chunks |
| Type | float32 | numpy.ndarray |

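The byte count reported above follows directly from the array shape and the 4-byte `float32` element size. The arithmetic below reproduces it and, as an illustration, estimates the chunk size that would result from the `{'time': 36}` chunking suggested in the warning message earlier. The value 36 is simply that example, not a tuned recommendation.

```python
from math import ceil, prod

shape = (1980, 160, 320)  # (time, lat, lon) from the output above
itemsize = 4              # bytes per float32 value

# Whole array held in a single chunk, as in the current output.
total_bytes = prod(shape) * itemsize
print(f"whole array: {total_bytes / 1e6:.2f} MB")

# Chunk length along time taken from the example in the warning message.
time_chunk = 36
chunk_bytes = time_chunk * shape[1] * shape[2] * itemsize
n_chunks = ceil(shape[0] / time_chunk)
print(f"per chunk:   {chunk_bytes / 1e6:.2f} MB in {n_chunks} chunks")
```

Smaller chunks along `time` let dask work on the array piece by piece without holding all 405.50 MB in memory at once.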
