# Hello World!

Here's an example notebook with some documentation on how to access CMIP data.

In [1]:
%matplotlib inline

import xarray as xr
import intake

# util.py is in the local directory
# it contains code that is common across project notebooks
# or routines that are too extensive and might otherwise clutter
# the notebook design
import util 



In [2]:
print('hello world!')

hello world!


## Demonstrate how to spin up a dask cluster
The syntax differs depending on whether the notebook runs on an NCAR machine or in the cloud.

In [3]:
if util.is_ncar_host():
    from ncar_jobqueue import NCARCluster
    cluster = NCARCluster(project='UCGD0006')
    cluster.adapt(minimum_jobs=1, maximum_jobs=40)
else:
    from dask_kubernetes import KubeCluster
    cluster = KubeCluster()
    cluster.adapt(minimum=1, maximum=40)
cluster

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


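The `util.is_ncar_host()` helper used above lives in the local `util.py`, which is not reproduced in this notebook. As a rough sketch of how such a check could work, the function below matches the machine's hostname against a few NCAR system names. The pattern list and the optional `hostname` argument are assumptions for illustration, not the project's actual code.

```python
import re
import socket

# Hypothetical hostname fragments; the real util.py may use a different list.
NCAR_HOST_PATTERNS = ("cheyenne", "casper", "hobart")


def is_ncar_host(hostname=None):
    """Return True if the hostname looks like an NCAR machine."""
    if hostname is None:
        # Fall back to the fully qualified name of the current machine.
        hostname = socket.getfqdn()
    return any(re.search(pattern, hostname) for pattern in NCAR_HOST_PATTERNS)


print(is_ncar_host("casper10.ucar.edu"))
print(is_ncar_host("jupyter-user.example.org"))
```

Passing the hostname explicitly makes the check easy to try out without being on either system.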

In [4]:
from dask.distributed import Client
client = Client(cluster) # Connect this local process to remote workers
client

| Client | Cluster |
| --- | --- |
| Scheduler: tcp://128.117.181.208:46641 | Workers: 0 |
| Dashboard: https://jupyterhub.ucar.edu/dav/user/mclong/proxy/46201/status | Cores: 0 |
| | Memory: 0 B |


## Demonstrate how to use `intake-esm`
[Intake-esm](https://intake-esm.readthedocs.io) is a data cataloging utility that facilitates access to CMIP data. It's pretty awesome.

An `intake-esm` collection object establishes a link to a database that contains file locations and associated metadata (e.g., which experiment and model they come from).

### Opening a collection
The first step is to open the collection by pointing to the collection definition file, which is a JSON file that conforms to the [ESM Collection Specification](https://github.com/NCAR/esm-collection-spec).

The collection JSON files are stored locally in this repository for reproducibility, and because Cheyenne compute nodes don't have Internet access.

The primary source for these files is the [intake-esm-datastore](https://github.com/NCAR/intake-esm-datastore) repository. Any changes made to these files should be pulled from that repo. For instance, the Pangeo cloud collection is available [here](https://raw.githubusercontent.com/NCAR/intake-esm-datastore/master/catalogs/pangeo-cmip6.json).

In [5]:
if util.is_ncar_host():
    col = intake.open_esm_metadatastore("../catalogs/glade-cmip6.json")
else:
    col = intake.open_esm_metadatastore("../catalogs/pangeo-cmip6.json")
col

ESM Collection with 590735 entries:
	> 10 activity_id(s)
	> 21 institution_id(s)
	> 38 source_id(s)
	> 60 experiment_id(s)
	> 161 member_id(s)
	> 34 table_id(s)
	> 1022 variable_id(s)
	> 11 grid_label(s)
	> 59 dcpp_init_year(s)
	> 222 version(s)
	> 4275 time_range(s)
	> 590735 path(s)
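To give a sense of what a collection definition file holds, here is a minimal sketch of one, written as a Python dictionary and round-tripped through JSON. The field names follow the ESM Collection Specification as I understand it; the values (the `id`, the catalog path, the choice of attributes) are made up for illustration and are not copied from `glade-cmip6.json`.

```python
import json

# Illustrative only: values are invented, not taken from glade-cmip6.json.
collection_definition = {
    "esmcat_version": "0.1.0",
    "id": "example-cmip6",
    "description": "Toy collection used to illustrate the file layout",
    "catalog_file": "example-cmip6.csv.gz",
    "attributes": [
        {"column_name": "source_id"},
        {"column_name": "experiment_id"},
        {"column_name": "variable_id"},
    ],
    "assets": {"column_name": "path", "format": "netcdf"},
}

# Serialize and parse again, as would happen when reading the file from disk.
parsed = json.loads(json.dumps(collection_definition))

attribute_columns = [attr["column_name"] for attr in parsed["attributes"]]
print("metadata columns:", attribute_columns)
print("asset column:", parsed["assets"]["column_name"])
```

The file itself only describes the catalog; the table of entries is a separate file that the definition points to.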

`intake-esm` is built on top of [pandas](https://pandas.pydata.org/pandas-docs/stable). It is possible to view the `pandas.DataFrame` as follows.

In [6]:
col.df.head()

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | dcpp_init_year | version | time_range | path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | hfls | gn | | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |
| 1 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | va | gn | | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |
| 2 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | tas | gn | | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |
| 3 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | rsds | gn | | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |
| 4 | AerChemMIP | BCC | BCC-ESM1 | ssp370 | r2i1p1f1 | Amon | pr | gn | | v20190624 | 201501-205512 | /glade/collections/cmip/CMIP6/AerChemMIP/BCC/B... |


It is possible to interact with the `DataFrame`; for instance, we can see what the "attributes" of the datasets are by printing the columns.

In [7]:
col.df.columns

Index(['activity_id', 'institution_id', 'source_id', 'experiment_id',
       'member_id', 'table_id', 'variable_id', 'grid_label', 'dcpp_init_year',
       'version', 'time_range', 'path'],
      dtype='object')
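Because `col.df` is an ordinary `pandas.DataFrame`, the usual pandas methods apply. The sketch below uses a small hand-made frame (a stand-in for `col.df`, with rows invented to match a few of the column names above) to count distinct values per column and tally entries per model.

```python
import pandas as pd

# Toy stand-in for col.df; the rows are invented for illustration.
df = pd.DataFrame(
    {
        "source_id": ["BCC-ESM1", "BCC-ESM1", "CanESM5", "CESM2-WACCM"],
        "experiment_id": ["ssp370", "ssp370", "historical", "historical"],
        "variable_id": ["tas", "pr", "o2", "o2"],
    }
)

# Number of distinct values in each column.
distinct_counts = df.nunique()
print(distinct_counts)

# Number of catalog entries per model.
entries_per_model = df["source_id"].value_counts()
print(entries_per_model)
```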

### Search and discovery

#### Finding unique entries
Let's query the data to see what models ("source_id"), experiments ("experiment_id") and temporal frequencies ("table_id") are available.

In [8]:
import pprint 
uni_dict = col.unique(['source_id', 'experiment_id', 'table_id'])
pprint.pprint(uni_dict, compact=True)

{'experiment_id': {'count': 60,
                   'values': ['ssp370', 'histSST-piNTCF', 'histSST',
                              'histSST-1950HC', 'hist-1950HC', 'hist-piNTCF',
                              'piClim-NTCF', 'ssp370SST-lowNTCF',
                              'ssp370-lowNTCF', 'ssp370SST', 'amip-future4K',
                              'amip-m4K', 'a4SST', 'aqua-p4K', 'piSST',
                              'amip-4xCO2', 'a4SSTice', 'amip-p4K',
                              'aqua-control', 'aqua-4xCO2', 'abrupt-4xCO2',
                              'historical', 'piControl', 'amip', '1pctCO2',
                              'esm-hist', 'esm-piControl', 'ssp245', 'ssp585',
                              'ssp126', 'dcppA-hindcast',
                              'dcppC-hindcast-noPinatubo',
                              'dcppC-hindcast-noElChichon', 'dcppA-assim',
                              'dcppC-hindcast-noAgung', 'highresSST-present',
                              'land-
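The dictionary returned by `col.unique` pairs each requested column with a count and a list of values. A rough pandas equivalent, shown here on a toy frame rather than the real catalog, might look like the following. The helper name `unique_summary` is hypothetical, and this is not a claim about how `intake-esm` implements it internally.

```python
import pandas as pd

# Toy stand-in for col.df; the rows are invented for illustration.
df = pd.DataFrame(
    {
        "source_id": ["CanESM5", "CanESM5", "CESM2-WACCM"],
        "experiment_id": ["historical", "ssp585", "historical"],
        "table_id": ["Oyr", "Oyr", "Oyr"],
    }
)


def unique_summary(frame, columns):
    """Build a {column: {'count': n, 'values': [...]}} mapping."""
    summary = {}
    for column in columns:
        # unique() keeps values in order of first appearance.
        values = list(frame[column].unique())
        summary[column] = {"count": len(values), "values": values}
    return summary


summary = unique_summary(df, ["source_id", "experiment_id", "table_id"])
print(summary)
```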

#### Searching for specific datasets

Let's find all the dissolved oxygen data at annual frequency from the ocean for the `historical` and `ssp585` experiments.

In [9]:
cat = col.search(experiment_id=['historical', 'ssp585'], table_id='Oyr', variable_id='o2', grid_label='gn')
cat.df

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | dcpp_init_year | version | time_range | path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 42454 | CMIP | NCAR | CESM2-WACCM | historical | r2i1p1f1 | Oyr | o2 | gn | | v20190917 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 44704 | CMIP | NCAR | CESM2-WACCM | historical | r1i1p1f1 | Oyr | o2 | gn | | v20190917 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 46954 | CMIP | NCAR | CESM2-WACCM | historical | r3i1p1f1 | Oyr | o2 | gn | | v20190917 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 263234 | CMIP | CCCma | CanESM5 | historical | r2i1p1f1 | Oyr | o2 | gn | | v20190429 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/CCCma/CanES... |
| 263717 | CMIP | CCCma | CanESM5 | historical | r5i1p1f1 | Oyr | o2 | gn | | v20190429 | 1850-2014 | /glade/collections/cmip/CMIP6/CMIP/CCCma/CanES... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 560932 | ScenarioMIP | DKRZ | MPI-ESM1-2-HR | ssp585 | r1i1p1f1 | Oyr | o2 | gn | | v20190710 | 2030-2034 | /glade/collections/cmip/CMIP6/ScenarioMIP/DKRZ... |
| 560933 | ScenarioMIP | DKRZ | MPI-ESM1-2-HR | ssp585 | r1i1p1f1 | Oyr | o2 | gn | | v20190710 | 2085-2089 | /glade/collections/cmip/CMIP6/ScenarioMIP/DKRZ... |
| 560934 | ScenarioMIP | DKRZ | MPI-ESM1-2-HR | ssp585 | r1i1p1f1 | Oyr | o2 | gn | | v20190710 | 2055-2059 | /glade/collections/cmip/CMIP6/ScenarioMIP/DKRZ... |
| 589961 | ScenarioMIP | MIROC | MIROC-ES2L | ssp585 | r1i1p1f2 | Oyr | o2 | gn | | v20190823 | 2015-2100 | /glade/collections/cmip/CMIP6/ScenarioMIP/MIRO... |
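Conceptually, a search like the one above is a row filter on the catalog table: a list of values matches any of them, and separate keywords must all match. The sketch below mimics that behavior on a toy frame with plain pandas. It illustrates the idea only; the helper `search_frame` is hypothetical and is not `intake-esm`'s implementation.

```python
import pandas as pd

# Toy stand-in for col.df; the rows are invented for illustration.
df = pd.DataFrame(
    {
        "experiment_id": ["historical", "ssp585", "piControl", "historical"],
        "table_id": ["Oyr", "Oyr", "Oyr", "Omon"],
        "variable_id": ["o2", "o2", "o2", "o2"],
    }
)


def search_frame(frame, **query):
    """Keep rows where every queried column matches one of the given values."""
    mask = pd.Series(True, index=frame.index)
    for column, wanted in query.items():
        if isinstance(wanted, str):
            wanted = [wanted]  # treat a single value like a one-item list
        mask &= frame[column].isin(wanted)
    return frame[mask]


result = search_frame(df, experiment_id=["historical", "ssp585"], table_id="Oyr")
print(result)
```

Here the `piControl` row fails the experiment filter and the `Omon` row fails the table filter, so only the first two rows remain.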


### Loading data

The best part about `intake-esm` is that it enables loading data directly into an [xarray.Dataset](http://xarray.pydata.org/en/stable/api.html#dataset).

Note that data on the cloud are stored in [zarr](https://zarr.readthedocs.io/en/stable/) format, while data on [glade](https://www2.cisl.ucar.edu/resources/storage-and-file-systems/glade-file-spaces) are stored as [netCDF](https://www.unidata.ucar.edu/software/netcdf/) files. The user does not need to handle this difference; `intake-esm` takes care of it!

`intake-esm` has rules for aggregating datasets; these rules are defined in the collection-specification file.

In [10]:
dset_dict = cat.to_dataset_dict(zarr_kwargs={'consolidated': True, 'decode_times': False}, 
                                cdf_kwargs={'chunks': {}, 'decode_times': False})

xarray will load the datasets with dask using a single chunk for all arrays.


--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'

--> There will be 9 groups
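The grouping reported above can be pictured as a `groupby` over the attributes named in the key template. The following sketch, using invented rows and only three of the six attributes for brevity, joins each group's attribute values with dots to form keys of the same shape. The actual aggregation rules come from the collection definition file and may do more than this.

```python
import pandas as pd

# Invented rows; the real catalog has many more columns and entries.
df = pd.DataFrame(
    {
        "source_id": ["CanESM5", "CanESM5", "CanESM5", "MIROC-ES2L"],
        "experiment_id": ["historical", "historical", "ssp585", "ssp585"],
        "member_id": ["r1i1p1f1", "r2i1p1f1", "r1i1p1f1", "r1i1p1f2"],
        "table_id": ["Oyr", "Oyr", "Oyr", "Oyr"],
    }
)

groupby_attrs = ["source_id", "experiment_id", "table_id"]

# One group per unique combination of the groupby attributes.
groups = {
    ".".join(values): rows["member_id"].tolist()
    for values, rows in df.groupby(groupby_attrs)
}

for key, members in groups.items():
    print(key, members)
```

Entries that share a key differ only in attributes left out of the grouping, such as `member_id` in this toy example.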


`dset_dict` is a dictionary of `xarray.Dataset` objects; its keys are constructed to refer to compatible groups.

In [11]:
dset_dict.keys()

dict_keys(['CMIP.CCCma.CanESM5.historical.Oyr.gn', 'CMIP.IPSL.IPSL-CM6A-LR.historical.Oyr.gn', 'CMIP.MIROC.MIROC-ES2L.historical.Oyr.gn', 'CMIP.NCAR.CESM2-WACCM.historical.Oyr.gn', 'CMIP.NCC.NorESM2-LM.historical.Oyr.gn', 'ScenarioMIP.CCCma.CanESM5.ssp585.Oyr.gn', 'ScenarioMIP.DKRZ.MPI-ESM1-2-HR.ssp585.Oyr.gn', 'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Oyr.gn', 'ScenarioMIP.MIROC.MIROC-ES2L.ssp585.Oyr.gn'])
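Since each key follows the dotted template printed by `to_dataset_dict`, it can be split back into its parts. A small sketch, assuming none of the attribute values contains a dot (true for the keys listed above); the helper `parse_key` is hypothetical:

```python
# Field order taken from the key template printed by to_dataset_dict.
KEY_FIELDS = (
    "activity_id",
    "institution_id",
    "source_id",
    "experiment_id",
    "table_id",
    "grid_label",
)


def parse_key(key):
    """Split a dataset key into a {field: value} dictionary."""
    parts = key.split(".")
    if len(parts) != len(KEY_FIELDS):
        raise ValueError(f"expected {len(KEY_FIELDS)} fields, got {len(parts)}")
    return dict(zip(KEY_FIELDS, parts))


info = parse_key("CMIP.CCCma.CanESM5.historical.Oyr.gn")
print(info["source_id"], info["experiment_id"])
```

This can be handy for looping over `dset_dict` and selecting, say, only the `ssp585` entries.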

We can access a particular dataset as follows.

In [12]:
dset_dict['CMIP.CCCma.CanESM5.historical.Oyr.gn']

<xarray.Dataset>
Dimensions:             (bnds: 2, i: 360, j: 291, lev: 45, member_id: 20, time: 165, vertices: 4)
Coordinates:
  * time                (time) float64 182.5 547.5 912.5 ... 5.968e+04 6.004e+04
  * j                   (j) int32 0 1 2 3 4 5 6 ... 284 285 286 287 288 289 290
  * i                   (i) int32 0 1 2 3 4 5 6 ... 353 354 355 356 357 358 359
  * lev                 (lev) float64 3.047 9.454 16.36 ... 5.375e+03 5.625e+03
  * member_id           (member_id) <U9 'r12i1p1f1' 'r14i1p1f1' ... 'r9i1p1f1'
Dimensions without coordinates: bnds, vertices
Data variables:
    vertices_latitude   (j, i, vertices) float64 -78.29 -78.49 ... 50.11 50.11
    lev_bnds            (lev, bnds) float64 0.0 6.194 6.194 ... 5.5e+03 5.75e+03
    time_bnds           (time, bnds) float64 dask.array<chunksize=(165, 2), meta=np.ndarray>
    vertices_longitude  (j, i, vertices) float64 74.0 74.0 73.0 ... 72.95 73.0
    latitude            (j, i) float64 -78.39 -78.39 -78.39 ... 50.23 50.01
 

In [13]:
cat_fx = col.search(table_id='Ofx', grid_label='gn',
                    variable_id='volcello')
cat_fx.df

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | dcpp_init_year | version | time_range | path |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3391 | AerChemMIP | NCAR | CESM2-WACCM | hist-1950HC | r1i1p1f1 | Ofx | volcello | gn | | v20190606 | | /glade/collections/cmip/CMIP6/AerChemMIP/NCAR/... |
| 4919 | AerChemMIP | NCAR | CESM2-WACCM | hist-piNTCF | r1i2p1f1 | Ofx | volcello | gn | | v20190531 | | /glade/collections/cmip/CMIP6/AerChemMIP/NCAR/... |
| 8650 | AerChemMIP | NCAR | CESM2-WACCM | ssp370-lowNTCF | r1i2p1f1 | Ofx | volcello | gn | | v20191001 | | /glade/collections/cmip/CMIP6/AerChemMIP/NCAR/... |
| 9856 | AerChemMIP | NCAR | CESM2 | ssp370-lowNTCF | r3i2p1f1 | Ofx | volcello | gn | | v20191001 | | /glade/collections/cmip/CMIP6/AerChemMIP/NCAR/... |
| 10686 | AerChemMIP | NCAR | CESM2 | ssp370-lowNTCF | r2i2p1f1 | Ofx | volcello | gn | | v20191001 | | /glade/collections/cmip/CMIP6/AerChemMIP/NCAR/... |
| 37088 | CMIP | NCAR | CESM2-WACCM | abrupt-4xCO2 | r1i1p1f1 | Ofx | volcello | gn | | v20190425 | | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 47238 | CMIP | NCAR | CESM2-WACCM | piControl | r1i1p1f1 | Ofx | volcello | gn | | v20190320 | | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 60320 | CMIP | NCAR | CESM2-WACCM | 1pctCO2 | r1i1p1f1 | Ofx | volcello | gn | | v20190425 | | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2-... |
| 63723 | CMIP | NCAR | CESM2 | abrupt-4xCO2 | r1i1p1f1 | Ofx | volcello | gn | | v20190425 | | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/... |
| 63724 | CMIP | NCAR | CESM2 | abrupt-4xCO2 | r1i1p1f1 | Ofx | volcello | gn | | v20190927 | | /glade/collections/cmip/CMIP6/CMIP/NCAR/CESM2/... |
