In [1]:
# temporary while code remains on branch
!pip install git+https://github.com/andersy005/intake-esm.git@refactor -q

# Hello World!

Here's an example notebook with some documentation on how to access CMIP data.

In [2]:
%matplotlib inline

import xarray as xr
import intake

import util

In [3]:
print('hello world!')

hello world!


## Demonstrate how to spin up a dask cluster
The syntax differs depending on whether you are on an NCAR machine or on the cloud.

In [4]:
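if util.is_ncar_host():
    from ncar_jobqueue import NCARCluster
    cluster = NCARCluster()
    # Scale between 1 and 40 batch jobs depending on load
    cluster.adapt(minimum_jobs=1, maximum_jobs=40)
else:
    # TODO: need cloud block. Placeholder (assumption): fall back to a
    # local cluster so that `cluster` is always defined.
    from dask.distributed import LocalCluster
    cluster = LocalCluster()
cluster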

VBox(children=(HTML(value='<h2>NCARCluster</h2>'), HBox(children=(HTML(value='\n<div>\n  <style scoped>\n    .…

In [5]:
from dask.distributed import Client
client = Client(cluster) # Connect this local process to remote workers
client

Client: Scheduler: tcp://128.117.181.208:45376, Dashboard: http://128.117.181.208/proxy/8787/status
Cluster: Workers: 0, Cores: 0, Memory: 0 B


## Demonstrate how to use `intake-esm`
[Intake-esm](https://intake-esm.readthedocs.io) is a data cataloging utility that facilitates access to CMIP data. It's pretty awesome.

An `intake-esm` collection object establishes a link to a database that contains file locations and associated metadata (e.g., which experiment and model they come from).

### Opening a collection
The first step is to open the collection by pointing to the collection definition file, which is a JSON file that conforms to the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec).
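
To give a feel for what such a collection definition file contains, here is a minimal sketch that writes and reads back a heavily trimmed, hypothetical definition. The field names (`esmcat_version`, `catalog_file`, `attributes`, `assets`) reflect my reading of the ESM Collection Specification and every value is illustrative; the real files in `../catalogs/` contain more fields and may differ.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, heavily trimmed collection definition. Field names follow the
# ESM Collection Specification as understood here; all values are made up.
spec = {
    "esmcat_version": "0.1.0",
    "id": "example-cmip6",
    "description": "Illustrative collection definition (not a real catalog)",
    "catalog_file": "example-cmip6.csv.gz",
    "attributes": [
        {"column_name": "source_id"},
        {"column_name": "experiment_id"},
        {"column_name": "variable_id"},
    ],
    "assets": {"column_name": "path", "format": "netcdf"},
}

# Round-trip the definition through a temporary JSON file
with tempfile.TemporaryDirectory() as tmp:
    spec_path = Path(tmp) / "example-cmip6.json"
    spec_path.write_text(json.dumps(spec, indent=2))
    loaded = json.loads(spec_path.read_text())

# The definition names the catalog table and describes its columns
attribute_columns = [entry["column_name"] for entry in loaded["attributes"]]
print("catalog table:", loaded["catalog_file"])
print("attribute columns:", attribute_columns)
print("asset column:", loaded["assets"]["column_name"])
```

The point is only that the JSON file describes a table of assets: which columns hold metadata and which column holds the file location.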

In [19]:
if util.is_ncar_host():
    col = intake.open_esm_metadatastore("../catalogs/glade-cmip6.json")
else:
    col = intake.open_esm_metadatastore("../catalogs/pangeo-cmip6.json")
col

ESM Collection with 608513 entries:
	> 9 activity_id(s)

	> 21 institution_id(s)

	> 38 source_id(s)

	> 55 experiment_id(s)

	> 161 member_id(s)

	> 34 table_id(s)

	> 1022 variable_id(s)

	> 11 grid_label(s)

	> 225 version(s)

	> 4128 time_range(s)

	> 608513 path(s)

`intake-esm` is built on top of [pandas](https://pandas.pydata.org/pandas-docs/stable). It is possible to view the `pandas.DataFrame` as follows.

In [20]:
col.df.head()

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | version | time_range | path |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | hfls | gn | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |
| 1 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | va | gn | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |
| 2 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | tas | gn | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |
| 3 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | rsds | gn | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |
| 4 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | pr | gn | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |


It is possible to interact with the `DataFrame`; for instance, we can see what the "attributes" of the datasets are by printing the columns.

In [10]:
col.df.columns

Index(['activity_id', 'institution_id', 'source_id', 'experiment_id',
       'member_id', 'table_id', 'variable_id', 'grid_label', 'version',
       'time_range', 'path'],
      dtype='object')
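
Because the catalog is an ordinary `DataFrame`, the usual pandas tools apply. The sketch below uses a small toy table as a stand-in for `col.df` (built from the rows shown above, with the truncated `path` column left out), so it runs without access to the real catalog.

```python
import pandas as pd

# Toy stand-in for col.df, built from the first rows shown above.
# The path column is omitted because it is truncated in the output.
df = pd.DataFrame(
    {
        "activity_id": ["AerChemMIP"] * 5,
        "source_id": ["BCC-ESM1"] * 5,
        "experiment_id": ["ssp370"] * 5,
        "table_id": ["Amon"] * 5,
        "variable_id": ["hfls", "va", "tas", "rsds", "pr"],
    }
)

# Number of distinct values in each column
n_unique = df.nunique()
print(n_unique)

# Number of catalog entries per (source_id, variable_id) pair
counts = df.groupby(["source_id", "variable_id"]).size()
print(counts)
```

On the real catalog, the same calls would be made on `col.df` directly.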

### Search and discovery

#### Finding unique entries
Let's query the data to see what models ("source_id"), experiments ("experiment_id") and temporal frequencies ("table_id") are available.

In [22]:
import pprint 
uni_dict = col.unique(['source_id', 'experiment_id', 'table_id'])
pprint.pprint(uni_dict, compact=True)

{'experiment_id': {'count': 55,
                   'values': ['ssp370', 'histSST-piNTCF', 'histSST',
                              'histSST-1950HC', 'hist-1950HC', 'hist-piNTCF',
                              'piClim-NTCF', 'ssp370SST-lowNTCF',
                              'ssp370-lowNTCF', 'ssp370SST', 'amip-future4K',
                              'amip-m4K', 'a4SST', 'aqua-p4K', 'piSST',
                              'amip-4xCO2', 'a4SSTice', 'amip-p4K',
                              'aqua-control', 'aqua-4xCO2', 'abrupt-4xCO2',
                              'historical', 'piControl', 'amip', '1pctCO2',
                              'esm-hist', 'esm-piControl', 'ssp245', 'ssp585',
                              'ssp126', 'highresSST-present',
                              'land-hist-princeton', 'land-hist-cruNcep',
                              'land-hist', 'deforest-globe',
                              'esm-ssp585-ssp126Lu', 'land-cCO2', 'hist-noLu',
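
The structure of this result (a `count` and a list of `values` per column) can be reproduced with plain pandas. The following is a rough sketch of the idea, not the `intake-esm` implementation; `unique_summary` is a hypothetical helper and the table is toy data.

```python
import pandas as pd


def unique_summary(df, columns):
    """Return {column: {'count': n, 'values': [...]}} for the given columns."""
    summary = {}
    for column in columns:
        # unique() keeps values in order of first appearance
        values = df[column].dropna().unique().tolist()
        summary[column] = {"count": len(values), "values": values}
    return summary


# Toy catalog table standing in for col.df
df = pd.DataFrame(
    {
        "source_id": ["CESM2-WACCM", "CESM2-WACCM", "MIROC-ES2L"],
        "experiment_id": ["historical", "historical", "ssp585"],
        "table_id": ["Oyr", "Oyr", "Oyr"],
    }
)

result = unique_summary(df, ["source_id", "experiment_id", "table_id"])
print(result)
```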
                              

#### Searching for specific datasets

Let's find all the dissolved oxygen data at annual frequency from the ocean for the `historical` and `ssp585` experiments.

In [26]:
cat = col.search(experiment_id=['historical', 'ssp585'], table_id='Oyr', variable_id='o2')
cat.df

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | version | time_range | path |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 42864 | CMIP | NCAR | CESM2-WACCM | historical | r2i1p1f1 | Oyr | o2 | gn | v20190917 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 42865 | CMIP | NCAR | CESM2-WACCM | historical | r2i1p1f1 | Oyr | o2 | gr | v20190917 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 45458 | CMIP | NCAR | CESM2-WACCM | historical | r1i1p1f1 | Oyr | o2 | gn | v20190917 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 45459 | CMIP | NCAR | CESM2-WACCM | historical | r1i1p1f1 | Oyr | o2 | gr | v20190917 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 48052 | CMIP | NCAR | CESM2-WACCM | historical | r3i1p1f1 | Oyr | o2 | gn | v20190917 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 578708 | ScenarioMIP | DKRZ | MPI-ESM1-2-HR | ssp585 | r1i1p1f1 | Oyr | o2 | gn | v20190710 | 2030-2034 | /glade/collections/cmip/CMIP6/ScenarioMIP/DKRZ... |
| 578709 | ScenarioMIP | DKRZ | MPI-ESM1-2-HR | ssp585 | r1i1p1f1 | Oyr | o2 | gn | v20190710 | 2085-2089 | /glade/collections/cmip/CMIP6/ScenarioMIP/DKRZ... |
| 578710 | ScenarioMIP | DKRZ | MPI-ESM1-2-HR | ssp585 | r1i1p1f1 | Oyr | o2 | gn | v20190710 | 2055-2059 | /glade/collections/cmip/CMIP6/ScenarioMIP/DKRZ... |
| 607737 | ScenarioMIP | MIROC | MIROC-ES2L | ssp585 | r1i1p1f2 | Oyr | o2 | gn | v20190823 | 2015-2100 | /glade/collections/cmip/CMIP6/ScenarioMIP/MIRO... |
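
Conceptually, a search like this is a row filter on the catalog table: each keyword names a column, and a row is kept when its value matches one of the requested values. The sketch below shows that idea with plain pandas on toy data. The `search` function here is a hypothetical stand-in, and `col.search` may well be implemented differently.

```python
import pandas as pd


def search(df, **query):
    """Keep rows whose value in each queried column is among the wanted values."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in query.items():
        if isinstance(wanted, str):
            wanted = [wanted]  # allow a single string as well as a list
        mask &= df[column].isin(wanted)
    return df[mask]


# Toy catalog table standing in for col.df
df = pd.DataFrame(
    {
        "experiment_id": ["historical", "ssp585", "piControl", "historical", "historical"],
        "table_id": ["Oyr", "Oyr", "Oyr", "Omon", "Oyr"],
        "variable_id": ["o2", "o2", "o2", "o2", "thetao"],
    }
)

subset = search(df, experiment_id=["historical", "ssp585"], table_id="Oyr", variable_id="o2")
print(subset)
```

Only the first two toy rows satisfy all three conditions, which mirrors how the query above narrows the full catalog to annual ocean `o2` entries for the two experiments.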


### Loading data

The best part about `intake-esm` is that it enables loading data directly into an [xarray.Dataset](http://xarray.pydata.org/en/stable/api.html#dataset).

Note that data on the cloud are stored in [zarr](https://zarr.readthedocs.io/en/stable/) format, while data on [glade](https://www2.cisl.ucar.edu/resources/storage-and-file-systems/glade-file-spaces) are stored as [netCDF](https://www.unidata.ucar.edu/software/netcdf/) files. `intake-esm` hides this difference from the user.

`intake-esm` has rules for aggregating datasets; these rules are defined in the collection-specification file.
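
As a rough illustration of what aggregation means, the sketch below groups toy catalog rows by a set of columns, so that entries differing only in `member_id` or `time_range` fall into the same group. The list of grouping columns and the dot-joined key format are assumptions made for this example; the actual rules come from the collection-specification file and may differ.

```python
import pandas as pd

# Assumed grouping columns (illustrative; the real list lives in the
# collection-specification file)
groupby_attrs = [
    "activity_id", "institution_id", "source_id",
    "experiment_id", "table_id", "grid_label",
]

# Toy rows modeled on the search result above
rows = [
    ("CMIP", "NCAR", "CESM2-WACCM", "historical", "r2i1p1f1", "Oyr", "gn", "1850-2014"),
    ("CMIP", "NCAR", "CESM2-WACCM", "historical", "r2i1p1f1", "Oyr", "gr", "1850-2014"),
    ("CMIP", "NCAR", "CESM2-WACCM", "historical", "r1i1p1f1", "Oyr", "gn", "1850-2014"),
    ("ScenarioMIP", "DKRZ", "MPI-ESM1-2-HR", "ssp585", "r1i1p1f1", "Oyr", "gn", "2030-2034"),
    ("ScenarioMIP", "DKRZ", "MPI-ESM1-2-HR", "ssp585", "r1i1p1f1", "Oyr", "gn", "2085-2089"),
]
columns = [
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "grid_label", "time_range",
]
df = pd.DataFrame(rows, columns=columns)

# One group per candidate dataset, keyed by the dot-joined grouping values
groups = {".".join(key): group for key, group in df.groupby(groupby_attrs)}
for name, group in groups.items():
    print(name, "->", len(group), "catalog entries")
```

Here the two `gn` CESM2-WACCM members end up in one group and the two MPI-ESM1-2-HR time ranges in another, which is the kind of combination the aggregation rules describe.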

In [None]:
dset_dict = cat.to_dataset_dict(zarr_kwargs={'consolidated': True, 'decode_times': False}, 
                                cdf_kwargs={'decode_times': False})
dset_dict