Pangeo & Data Catalogs
====================
An exploration of the cataloging approaches used by Pangeo along with instructions on how to use them.

### What approaches are there currently?
Pangeo currently offers two primary approaches to data cataloging:
- Intake, a lightweight Python package that reads YAML-based catalogs
- ESMCol, a collection specification for large, homogeneous datasets

## Intake Data Catalogs

<img src="https://intake.readthedocs.io/en/latest/_static/images/logo.png" align="right" width=20% alt="Intake Logo">

Intake allows us to load in YAML-based catalogs with specified metadata describing how to open the data files they point to.

This lets us go from an individual entry in an Intake catalog to an xarray dataset with a single method call, `to_dask()`:

In [5]:
import intake

url = "https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/atmosphere.yaml"

cat = intake.open_catalog(url)
display(list(cat))

display(cat['gmet_v1'].describe())

['gmet_v1',
 'trmm_3b42rt',
 'sam_ngaqua_qobs_eqx_3d',
 'sam_ngaqua_qobs_eqx_2d',
 'gpcp_cdr_daily_v1_3',
 'wrf50_erai']

{'name': 'gmet_v1',
 'container': 'xarray',
 'plugin': ['zarr'],
 'description': 'Full GMET version 1 (Newman) met ensemble in zarr format',
 'direct_access': 'forbid',
 'user_parameters': [],
 'metadata': {},
 'args': {'storage_options': {'project': 'pangeo-181919',
   'token': 'anon',
   'access': 'read_only'},
  'urlpath': 'gcs://pangeo-data/gmet_v1.zarr',
  'consolidated': True}}

In [None]:
cat['gmet_v1'].to_dask()
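
The `describe()` output above already holds everything needed to open the store by hand. As a rough sketch of how its fields map onto the `gcsfs` and `xarray` calls used later in this notebook, the snippet below splits a description into its parts. The helper name `zarr_open_plan` and the keys of the dictionary it returns are hypothetical, and this is not how Intake itself is implemented.

```python
# Entry description, copied (in abbreviated form) from the describe() output above.
description = {
    "name": "gmet_v1",
    "container": "xarray",
    "plugin": ["zarr"],
    "args": {
        "storage_options": {
            "project": "pangeo-181919",
            "token": "anon",
            "access": "read_only",
        },
        "urlpath": "gcs://pangeo-data/gmet_v1.zarr",
        "consolidated": True,
    },
}


def zarr_open_plan(desc):
    """Hypothetical helper: split a description into the pieces needed to open a zarr store."""
    args = desc["args"]
    return {
        "urlpath": args["urlpath"],
        "filesystem_kwargs": dict(args.get("storage_options", {})),
        "open_kwargs": {"consolidated": args.get("consolidated", False)},
    }


plan = zarr_open_plan(description)
print(plan["urlpath"])
print(plan["filesystem_kwargs"])

# With gcsfs and xarray installed and network access available, the plan could
# be used roughly like this (not executed here):
#   fs = gcsfs.GCSFileSystem(**plan["filesystem_kwargs"])
#   ds = xr.open_zarr(fs.get_mapper(plan["urlpath"]), **plan["open_kwargs"])
```

In practice `to_dask()` does this work for us, so the sketch is mainly useful for understanding what a catalog entry encodes.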

### Searching & filtering Intake catalogs
Entries in an Intake catalog can be filtered using either Intake's `search` or `gui`:

In [None]:
search = cat.search("sam")
display(list(search))
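
To get a feel for what a text search over a catalog does, here is a simplified stand-in that matches a substring against entry names only. The function `filter_names` is hypothetical, and Intake's own `search` may also match other fields such as descriptions, so treat this as an illustration rather than a description of Intake's behavior.

```python
# Entry names copied from the catalog listing shown earlier.
names = [
    "gmet_v1",
    "trmm_3b42rt",
    "sam_ngaqua_qobs_eqx_3d",
    "sam_ngaqua_qobs_eqx_2d",
    "gpcp_cdr_daily_v1_3",
    "wrf50_erai",
]


def filter_names(entry_names, text):
    """Return the entry names that contain `text`, ignoring case."""
    needle = text.lower()
    return [name for name in entry_names if needle in name.lower()]


matches = filter_names(names, "sam")
print(matches)
```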

In [None]:
cat.gui

In [14]:
cat.gui.sources[0].to_dask()

<xarray.Dataset>
Dimensions:        (lat: 480, lon: 1440, time: 41320)
Coordinates:
  * lat            (lat) float64 59.88 59.62 59.38 ... -59.38 -59.62 -59.88
  * lon            (lon) float64 0.125 0.375 0.625 0.875 ... 359.4 359.6 359.9
  * time           (time) datetime64[ns] 2000-03-01T12:00:00 ... 2014-04-22T09:00:00
Data variables:
    precipitation  (time, lat, lon) float32 dask.array<shape=(41320, 480, 1440), chunksize=(40, 480, 1440)>

## ESM Collection Specification

The Earth System Model Collection Specification (ESMCol) describes a way of cataloging large datasets with a homogeneous metadata structure, such as those produced by the Coupled Model Intercomparison Project of the World Climate Research Programme.

ESMCol will serve as the primary cataloging approach for the NCAR CMIP6 Hackathon, and through it CMIP6 data can be accessed directly from a Jupyter environment, or viewed from a higher level through Pangeo's Cloud Data Catalog.

ESMCol consists of three parts:

### Collection
A single JSON file containing *homogeneous* metadata about a catalog of data, along with a path to access that catalog.
This metadata may include information on how to interpret the data as well as how it is encoded.

### Catalog
The single catalog that the collection points to is a CSV file in which each row represents an individual dataset:

|activity_id|source_id|path|
|-|-|-|
|CMIP|ACCESS-CM2|gs://pangeo-data/store1.zarr|
|CMIP|GISS-E2-1-G|gs://pangeo-data/store1.zarr|

### Assets
Ultimately, each row of the catalog will have a path pointing to some data file, the location of which has been specified in the collection.
For this hackathon, these data files will be either netCDF or zarr.
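
The relationship between the three parts can be sketched with the standard library alone. In the toy example below, the collection is a JSON document that names the catalog file and says which catalog column holds the asset paths. The field names (`catalog_file`, `attributes`, `assets`, `column_name`, `format`) and the `id` and `description` values are assumptions made for illustration and should be checked against the specification; the catalog rows are the ones from the table above.

```python
import csv
import io
import json

# Hypothetical collection file; field names are assumptions, not quoted from the spec.
collection_text = """
{
  "id": "example-collection",
  "description": "Toy collection for illustration",
  "catalog_file": "catalog.csv",
  "attributes": [{"column_name": "activity_id"}, {"column_name": "source_id"}],
  "assets": {"column_name": "path", "format": "zarr"}
}
"""

# Catalog rows copied from the table above.
catalog_text = """activity_id,source_id,path
CMIP,ACCESS-CM2,gs://pangeo-data/store1.zarr
CMIP,GISS-E2-1-G,gs://pangeo-data/store1.zarr
"""

collection = json.loads(collection_text)
rows = list(csv.DictReader(io.StringIO(catalog_text)))

# The collection tells us which catalog column holds the asset paths
# and which format those assets are stored in.
asset_column = collection["assets"]["column_name"]
asset_format = collection["assets"]["format"]

assets = [(row["source_id"], row[asset_column]) for row in rows]
for source_id, path in assets:
    print(f"{source_id}: {path} ({asset_format})")
```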

## Using ESMCol in Jupyter
With the path to an ESMCol catalog, we can use `pandas.read_csv` to generate a DataFrame using the CSV file:

In [16]:
import pandas as pd

df = pd.read_csv("https://storage.googleapis.com/pangeo-cmip6/pangeo-cmip6-zarr-consolidated-stores.csv")
df.head()

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,dcpp_init_year
0,AerChemMIP,BCC,BCC-ESM1,ssp370,r1i1p1f1,Amon,pr,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...,
1,AerChemMIP,BCC,BCC-ESM1,ssp370,r1i1p1f1,Amon,prsn,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...,
2,AerChemMIP,BCC,BCC-ESM1,ssp370,r1i1p1f1,Amon,tas,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...,
3,AerChemMIP,BCC,BCC-ESM1,ssp370,r1i1p1f1,Amon,tasmax,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...,
4,AerChemMIP,BCC,BCC-ESM1,ssp370,r1i1p1f1,Amon,tasmin,gn,gs://cmip6/AerChemMIP/BCC/BCC-ESM1/ssp370/r1i1...,


In [2]:
df.institution_id.unique(), df.variable_id.unique()

(array(['BCC', 'CCCma', 'CNRM-CERFACS', 'MOHC', 'NASA-GISS', 'NCAR',
        'NOAA-GFDL', 'AWI', 'CAMS', 'CAS', 'E3SM-Project',
        'EC-Earth-Consortium', 'FIO-QLNM', 'IPSL', 'MIROC', 'MRI', 'NCC',
        'NUIST', 'SNU', 'UA', 'CMCC', 'ECMWF', 'DKRZ'], dtype=object),
 array(['pr', 'prsn', 'tas', 'tasmax', 'tasmin', 'ts', 'ua', 'va', 'cLeaf',
        'cVeg', 'gpp', 'lai', 'npp', 'ra', 'tran', 'chl', 'detoc',
        'diftrblo', 'difvho', 'difvso', 'dissic', 'dissicabio',
        'dissicnat', 'fgco2', 'fgco2abio', 'fgco2nat', 'no3', 'o2', 'phyc',
        'phyn', 'pon', 'talk', 'zooc', 'nbp', 'fgo2', 'hfds', 'sos', 'tos',
        'calc', 'dfe', 'dissoc', 'expc', 'expn', 'expp', 'expsi', 'graz',
        'nh4', 'ph', 'phydiat', 'phydiaz', 'phypico', 'pnitrate', 'po4',
        'pp', 'remoc', 'si', 'hus', 'psl', 'ta', 'zg', 'mlotst', 'so',
        'tauuo', 'tauvo', 'thetao', 'thetaoga', 'uo', 'vo', 'volo', 'wo',
        'zos', 'sithick', 'huss', 'rlds', 'rlus', 'hfls', 'uas', 'vas',
    

From here, we can search through the data using familiar `pandas` methodology:

In [3]:
df_search = df[(df.institution_id == "BCC") & (df.variable_id == "sfcWind")]
df_search

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,zstore,dcpp_init_year
343,CMIP,BCC,BCC-CSM2-MR,historical,r1i1p1f1,day,sfcWind,gn,gs://cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i...,
606,CMIP,BCC,BCC-ESM1,historical,r1i1p1f1,day,sfcWind,gn,gs://cmip6/CMIP/BCC/BCC-ESM1/historical/r1i1p1...,
19235,ScenarioMIP,BCC,BCC-CSM2-MR,ssp245,r1i1p1f1,day,sfcWind,gn,gs://cmip6/ScenarioMIP/BCC/BCC-CSM2-MR/ssp245/...,
19333,ScenarioMIP,BCC,BCC-CSM2-MR,ssp585,r1i1p1f1,day,sfcWind,gn,gs://cmip6/ScenarioMIP/BCC/BCC-CSM2-MR/ssp585/...,
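
When several columns need to be filtered at once, the boolean masks can be wrapped in a small helper. The sketch below uses a tiny in-memory DataFrame built from rows shown in the outputs above (with the truncated `zstore` column left out), so it runs without network access. The `search` function is a hypothetical convenience and not part of `pandas` or ESMCol.

```python
import pandas as pd

# A few catalog rows copied from the outputs above; the zstore column is omitted.
df_demo = pd.DataFrame(
    [
        ("AerChemMIP", "BCC", "BCC-ESM1", "ssp370", "Amon", "tas"),
        ("CMIP", "BCC", "BCC-CSM2-MR", "historical", "day", "sfcWind"),
        ("CMIP", "BCC", "BCC-ESM1", "historical", "day", "sfcWind"),
        ("ScenarioMIP", "BCC", "BCC-CSM2-MR", "ssp245", "day", "sfcWind"),
        ("ScenarioMIP", "BCC", "BCC-CSM2-MR", "ssp585", "day", "sfcWind"),
    ],
    columns=[
        "activity_id",
        "institution_id",
        "source_id",
        "experiment_id",
        "table_id",
        "variable_id",
    ],
)


def search(df, **criteria):
    """Keep rows where every named column equals the value (or one of the values) given."""
    mask = pd.Series(True, index=df.index)
    for column, wanted in criteria.items():
        values = list(wanted) if isinstance(wanted, (list, tuple, set)) else [wanted]
        mask &= df[column].isin(values)
    return df[mask]


hits = search(df_demo, variable_id="sfcWind", experiment_id=["ssp245", "ssp585"])
print(hits[["source_id", "experiment_id", "variable_id"]])
```

The same helper could be applied to the full `df` loaded from the CSV file, for example `search(df, institution_id="BCC", variable_id="sfcWind")`.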


Once we have a suitable subset of the data, we can open it in `xarray` with the help of `gcsfs`. First we initialize a `GCSFileSystem`, which connects us to Pangeo's cloud bucket:

In [20]:
import gcsfs

fs = gcsfs.GCSFileSystem(project='pangeo-181919', token='anon', access='read_only')
fs

<gcsfs.core.GCSFileSystem at 0x7f58ec4bc048>

With the file system initialized, we can now use `fs.get_mapper` on any of the paths listed in `zstore` to get a mapping which can be opened in `xarray`:

In [5]:
import xarray as xr

store = df_search.zstore.values[0]
mapper = fs.get_mapper(store)
xr.open_zarr(mapper)

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 160, lon: 320, time: 60225)
Coordinates:
    height     float64 ...
  * lat        (lat) float64 -89.14 -88.03 -86.91 -85.79 ... 86.91 88.03 89.14
    lat_bnds   (lat, bnds) float64 dask.array<shape=(160, 2), chunksize=(160, 2)>
  * lon        (lon) float64 0.0 1.125 2.25 3.375 ... 355.5 356.6 357.8 358.9
    lon_bnds   (lon, bnds) float64 dask.array<shape=(320, 2), chunksize=(320, 2)>
  * time       (time) object 1850-01-01 12:00:00 ... 2014-12-31 12:00:00
    time_bnds  (time, bnds) object dask.array<shape=(60225, 2), chunksize=(30113, 1)>
Dimensions without coordinates: bnds
Data variables:
    sfcWind    (time, lat, lon) float32 dask.array<shape=(60225, 160, 320), chunksize=(600, 160, 320)>
Attributes:
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          Standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  2289.0
    cmor_version:           3.3.2
    comment:          
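
A search usually returns more than one row, so it can be handy to open every matching store and keep the results in a dictionary. The sketch below shows one possible way to do that. The names `dataset_key` and `open_stores` are hypothetical, and the demo uses a stand-in file system and placeholder bucket paths so that it runs without network access.

```python
def dataset_key(row):
    """Build a readable key from the identifying columns of a catalog row."""
    columns = ("source_id", "experiment_id", "table_id", "variable_id")
    return ".".join(str(row[column]) for column in columns)


def open_stores(rows, fs, open_zarr):
    """Open the store behind each row and return a {key: dataset} dictionary.

    rows      -- iterable of mappings that include a "zstore" entry
    fs        -- object with a get_mapper(path) method, e.g. a GCSFileSystem
    open_zarr -- callable that opens a mapper, e.g. xr.open_zarr
    """
    return {dataset_key(row): open_zarr(fs.get_mapper(row["zstore"])) for row in rows}


class FakeFileSystem:
    """Stand-in for gcsfs.GCSFileSystem so the demo needs no network access."""

    def get_mapper(self, path):
        return path


# Placeholder rows; the zstore paths are hypothetical, not real bucket locations.
demo_rows = [
    {"source_id": "BCC-CSM2-MR", "experiment_id": "ssp245", "table_id": "day",
     "variable_id": "sfcWind", "zstore": "gs://example-bucket/ssp245.zarr"},
    {"source_id": "BCC-CSM2-MR", "experiment_id": "ssp585", "table_id": "day",
     "variable_id": "sfcWind", "zstore": "gs://example-bucket/ssp585.zarr"},
]

opened = open_stores(demo_rows, FakeFileSystem(), lambda mapper: f"opened {mapper}")
print(sorted(opened))
```

In the notebook itself, the call would look something like `open_stores(df_search.to_dict("records"), fs, xr.open_zarr)`, which returns lazily loaded `xarray` datasets keyed by model, experiment, table, and variable.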

## Using ESMCol in a Browser
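If we are unfamiliar with `pandas` or simply want to view the data outside of a Jupyter environment, we can do so using the [Pangeo Cloud Data Catalog](https://pangeo-data.github.io/pangeo-datastore/cmip6_pangeo.html). A selection saved from the browser as a CSV file (here assumed to be named `export.csv`) can then be loaded back into a DataFrame: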

In [1]:
import pandas as pd

df = pd.read_csv("export.csv")
df.head()

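In [None]:
# Assumes `fs` and `xr` are defined as in the earlier cells,
# and that the exported CSV keeps the `zstore` column.
store = df.zstore.values[0]
mapper = fs.get_mapper(store)
xr.open_zarr(mapper)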

## Where to go from here?
The methods of cataloging data at Pangeo are changing rapidly!

To keep up with this development, there are a variety of places to look:
- Discussion and development on Intake can be viewed on its [GitHub repository](https://github.com/intake/intake).
- Progress on ESMCol can be tracked at its [GitHub repository](https://github.com/NCAR/esm-collection-spec).
- The entirety of Pangeo's Intake and ESMCol catalogs can be viewed at [https://pangeo-data.github.io/pangeo-datastore/](https://pangeo-data.github.io/pangeo-datastore/).
- This presentation (and the interactive code blocks in it) can be viewed in a Binder at [https://github.com/charlesbluca/pangeo-catalogs](https://github.com/charlesbluca/pangeo-catalogs).