# Intake / Pangeo Catalog: Making It Easier To Consume Earth’s Climate and Weather Data

## Introduction

Computer simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on HPC systems or in the cloud across multiple data assets in a variety of formats (netCDF, Zarr, etc.). Finding, investigating, and loading these data assets into compute-ready data containers costs time and effort. Before loading and analyzing a specific data set, the data user needs to know which data sets are available and which attributes describe each one.


In this notebook, we demonstrate [intake-esm](https://github.com/intake/intake-esm), a Python package and an [intake](https://github.com/intake/intake) plugin that aims to facilitate
- the discovery of Earth's climate and weather datasets.
- the ingestion of these datasets into xarray dataset containers.


The usual starting point for finding and investigating large datasets is a data catalog. A *data catalog* is a collection of metadata, combined with search tools, that helps data analysts and other data users find the data they need. To take full advantage of intake-esm, a user needs to point it to an ESM data catalog file: a JSON-based catalog file that conforms to the Earth System Model (ESM) collection specification.

## ESM Collection Specification

The [ESM collection specification](https://github.com/NCAR/esm-collection-spec) provides a machine-readable (JSON) format for describing a wide range of climate and weather datasets. The specification's goal is to make it easier to index and discover climate and weather data assets. An asset is any netCDF/HDF file or Zarr store that contains relevant data.


An ESM catalog serves as an inventory of available data and provides the information needed to explore the existing data assets. Additionally, an ESM catalog can contain information about how to aggregate compatible groups of data assets into single xarray datasets.
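To make the two roles of a catalog concrete (inventory plus aggregation hints), here is a minimal sketch of what such a JSON description might look like, written as a Python dictionary. The field names follow my reading of the ESM collection specification and should be checked against the specification itself; every value (the collection id, the CSV file name, the chosen columns) is a placeholder invented for illustration.

```python
import json

# Illustrative catalog description. All values are placeholders,
# not the content of any real catalog file.
catalog = {
    "esmcat_version": "0.1.0",
    "id": "example-collection",
    "description": "Toy collection used only for illustration",
    # Tabular inventory of assets, one row per asset
    "catalog_file": "example-collection.csv",
    # Columns that describe each asset and can be searched
    "attributes": [
        {"column_name": "source_id"},
        {"column_name": "experiment_id"},
        {"column_name": "variable_id"},
    ],
    # Column holding the asset location, and the asset format
    "assets": {"column_name": "zstore", "format": "zarr"},
    # Hints on how compatible assets are combined into one dataset
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["source_id", "experiment_id"],
        "aggregations": [{"type": "union", "attribute_name": "variable_id"}],
    },
}

print(json.dumps(catalog, indent=2))
```

In this sketch, the `attributes` and `assets` entries cover the inventory role, while `aggregation_control` carries the aggregation information.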


## Use Case: CMIP6 hosted on Google Cloud


The Coupled Model Intercomparison Project (CMIP) is an international collaborative effort to improve knowledge of climate change and its impacts on the Earth system and on our society. [CMIP began in 1995](https://www.wcrp-climate.org/wgcm-cmip), and today we are in its sixth phase (CMIP6). The CMIP6 data archive consists of model output produced across approximately 30 working groups and 1,000 researchers investigating the urgent environmental problem of climate change. CMIP6 will provide a wealth of information for the next Assessment Report (AR6) of the [Intergovernmental Panel on Climate Change](https://www.ipcc.ch/) (IPCC).

Last year, Pangeo partnered with Google Cloud to bring CMIP6 climate data to Google Cloud’s Public Datasets Program. You can read more about the dataset and the process [here](https://cloud.google.com/blog/products/data-analytics/new-climate-model-data-now-google-public-datasets).

For the rest of the notebook, we will demonstrate intake-esm's features using the intake-esm catalog for the CMIP6 data stored on Google Cloud Storage. This catalog resides [here](https://storage.googleapis.com/cmip6/pangeo-cmip6.json).


### Load an intake-esm catalog

In [1]:
# Import intake
import intake

In [2]:
col = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json", sep=".")
col

| | unique |
|---|---|
| activity_id | 15 |
| institution_id | 33 |
| source_id | 70 |
| experiment_id | 102 |
| member_id | 140 |
| table_id | 29 |
| variable_id | 369 |
| grid_label | 10 |
| zstore | 267459 |
| dcpp_init_year | 60 |


The summary above tells us that this catalog contains close to 268,000 data assets (267,459 unique `zstore` entries). The first row of the catalog (shown below) describes the Ambient Aerosol Optical Thickness at 550 nm (`variable_id='od550aer'`), as a function of latitude, longitude, time, and member_id, from an individual climate model experiment run with the Taiwan Earth System Model 1.0 (`source_id='TaiESM1'`), developed by the Taiwan Research Center for Environmental Changes (`institution_id='AS-RCEC'`). This model is **forced** by the experiment histSST (`experiment_id='histSST'`), which stands for *Historical transient with SSTs prescribed from historical*. This simulation was run as part of the AerChemMIP activity, which stands for *Aerosols and Chemistry Model Intercomparison Project*.

In [3]:
# List assets in the catalog
col.df

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | zstore | dcpp_init_year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AerChemMIP | AS-RCEC | TaiESM1 | histSST | r1i1p1f1 | AERmon | od550aer | gn | gs://cmip6/AerChemMIP/AS-RCEC/TaiESM1/histSST/... | |
| 1 | AerChemMIP | BCC | BCC-ESM1 | histSST | r1i1p1f1 | AERmon | mmrbc | gn | gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... | |
| 2 | AerChemMIP | BCC | BCC-ESM1 | histSST | r1i1p1f1 | AERmon | mmrdust | gn | gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... | |
| 3 | AerChemMIP | BCC | BCC-ESM1 | histSST | r1i1p1f1 | AERmon | mmroa | gn | gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... | |
| 4 | AerChemMIP | BCC | BCC-ESM1 | histSST | r1i1p1f1 | AERmon | mmrso4 | gn | gs://cmip6/AerChemMIP/BCC/BCC-ESM1/histSST/r1i... | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 267454 | ScenarioMIP | UA | MCM-UA-1-0 | ssp585 | r1i1p1f2 | Omon | tos | gn | gs://cmip6/ScenarioMIP/UA/MCM-UA-1-0/ssp585/r1... | |
| 267455 | ScenarioMIP | UA | MCM-UA-1-0 | ssp585 | r1i1p1f2 | Omon | uo | gn | gs://cmip6/ScenarioMIP/UA/MCM-UA-1-0/ssp585/r1... | |
| 267456 | ScenarioMIP | UA | MCM-UA-1-0 | ssp585 | r1i1p1f2 | Omon | vo | gn | gs://cmip6/ScenarioMIP/UA/MCM-UA-1-0/ssp585/r1... | |
| 267457 | ScenarioMIP | UA | MCM-UA-1-0 | ssp585 | r1i1p1f2 | Omon | wo | gn | gs://cmip6/ScenarioMIP/UA/MCM-UA-1-0/ssp585/r1... | |
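Because `col.df` is a pandas DataFrame, the `unique` counts in the catalog summary can be reproduced with ordinary pandas operations. The sketch below does this on a toy DataFrame holding three rows and four columns copied from the table above; it stands in for `col.df` so that the example runs without network access.

```python
import pandas as pd

# Toy stand-in for `col.df`: three rows and four columns from the catalog table.
df = pd.DataFrame(
    {
        "activity_id": ["AerChemMIP", "AerChemMIP", "ScenarioMIP"],
        "institution_id": ["AS-RCEC", "BCC", "UA"],
        "source_id": ["TaiESM1", "BCC-ESM1", "MCM-UA-1-0"],
        "variable_id": ["od550aer", "mmrbc", "tos"],
    }
)

# Number of distinct values in each column
unique_counts = df.nunique()
print(unique_counts)

# The first row of the table, as a Series indexed by column name
first_row = df.iloc[0]
print(first_row["variable_id"])
```

On the real catalog, the same `nunique()` call on `col.df` should give one count per column, comparable to the summary shown earlier.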


### Search 

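We can now use intake-esm's `search` method to query the collection. Here we look for monthly near-surface air temperature (`variable_id='tas'`, `table_id='Amon'`) in the historical experiment and two future scenarios. Passing `require_all_on=["source_id"]` keeps only the models that provide data for all three experiments.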

In [4]:
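# Monthly near-surface air temperature (tas) from one ensemble member,
# for the historical run and two future scenarios
query = dict(
    experiment_id=['historical', 'ssp245', 'ssp585'],
    table_id='Amon',
    variable_id=['tas'],
    member_id='r1i1p1f1',
    grid_label='gr'
)

# Keep only the models (source_id) that match every requested value
col_subset = col.search(require_all_on=["source_id"], **query)
# Count the distinct values found for each model
col_subset.df.groupby("source_id")[
    ["experiment_id", "variable_id", "table_id"]
].nunique()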

| source_id | experiment_id | variable_id | table_id |
|---|---|---|---|
| CIESM | 3 | 1 | 1 |
| EC-Earth3 | 3 | 1 | 1 |
| EC-Earth3-Veg | 3 | 1 | 1 |
| FGOALS-f3-L | 3 | 1 | 1 |
| IPSL-CM6A-LR | 3 | 1 | 1 |
| KACE-1-0-G | 3 | 1 | 1 |
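Every model in the result has all three experiments, which is the effect of `require_all_on=["source_id"]` in the search call. The following pure-Python sketch illustrates that filtering idea on toy rows; it is not intake-esm's implementation, the model names are hypothetical, and it checks only `experiment_id` because that is the only query field with more than one requested value.

```python
from collections import defaultdict

# Toy catalog rows: (source_id, experiment_id). Model names are hypothetical.
rows = [
    ("model-A", "historical"),
    ("model-A", "ssp245"),
    ("model-A", "ssp585"),
    ("model-B", "historical"),
    ("model-B", "ssp585"),  # model-B has no ssp245 run
]
requested = {"historical", "ssp245", "ssp585"}

# Collect the requested experiments available for each model
found = defaultdict(set)
for source_id, experiment_id in rows:
    if experiment_id in requested:
        found[source_id].add(experiment_id)

# Keep only the models that cover every requested experiment
complete_sources = sorted(s for s, exps in found.items() if exps == requested)
subset = [row for row in rows if row[0] in complete_sources]

print(complete_sources)  # ['model-A']
```

Without such a filter, `model-B` would stay in the results even though it cannot be compared across all three experiments.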


In [5]:
# List the keys of the datasets that the search results will be grouped into
col_subset.keys()

['CMIP.CAS.FGOALS-f3-L.historical.Amon.gr',
 'CMIP.EC-Earth-Consortium.EC-Earth3.historical.Amon.gr',
 'CMIP.EC-Earth-Consortium.EC-Earth3-Veg.historical.Amon.gr',
 'CMIP.IPSL.IPSL-CM6A-LR.historical.Amon.gr',
 'CMIP.NIMS-KMA.KACE-1-0-G.historical.Amon.gr',
 'CMIP.THU.CIESM.historical.Amon.gr',
 'ScenarioMIP.CAS.FGOALS-f3-L.ssp245.Amon.gr',
 'ScenarioMIP.CAS.FGOALS-f3-L.ssp585.Amon.gr',
 'ScenarioMIP.EC-Earth-Consortium.EC-Earth3.ssp245.Amon.gr',
 'ScenarioMIP.EC-Earth-Consortium.EC-Earth3.ssp585.Amon.gr',
 'ScenarioMIP.EC-Earth-Consortium.EC-Earth3-Veg.ssp245.Amon.gr',
 'ScenarioMIP.EC-Earth-Consortium.EC-Earth3-Veg.ssp585.Amon.gr',
 'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp245.Amon.gr',
 'ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.Amon.gr',
 'ScenarioMIP.NIMS-KMA.KACE-1-0-G.ssp245.Amon.gr',
 'ScenarioMIP.NIMS-KMA.KACE-1-0-G.ssp585.Amon.gr',
 'ScenarioMIP.THU.CIESM.ssp245.Amon.gr',
 'ScenarioMIP.THU.CIESM.ssp585.Amon.gr']

In [6]:
dsets = col_subset.to_dataset_dict(zarr_kwargs={'consolidated': True})

Dataset(s):   0%|                                       | 0/18 [00:00<?, ?it/s]


--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'


Dataset(s): 100%|██████████████████████████████| 18/18 [00:01<00:00, 11.05it/s]
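The key template printed above makes it straightforward to map a dictionary key back to the catalog attributes it encodes. This is a small sketch that assumes, as in this catalog, that the separator is `.` and that no attribute value itself contains the separator.

```python
key_template = "activity_id.institution_id.source_id.experiment_id.table_id.grid_label"
key = "ScenarioMIP.THU.CIESM.ssp585.Amon.gr"

# Pair each attribute name with the matching piece of the key
attrs = dict(zip(key_template.split("."), key.split(".")))
print(attrs["source_id"], attrs["experiment_id"])  # CIESM ssp585

# Going the other way: join the attribute values to rebuild the key
rebuilt = ".".join(attrs[name] for name in key_template.split("."))
```

Splitting works here because hyphenated names such as `KACE-1-0-G` contain no dots; a value that did contain one would need a different separator.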


In [7]:
dsets['ScenarioMIP.THU.CIESM.ssp585.Amon.gr']

Array and chunk summary of the dask-backed variables in this dataset:

| Bytes (array) | Bytes (chunk) | Shape (array) | Shape (chunk) | Count | Type (array) | Type (chunk) |
|---|---|---|---|---|---|---|
| 4.61 kB | 4.61 kB | (288, 2) | (288, 2) | 2 Tasks, 1 Chunks | float64 | numpy.ndarray |
| 16.51 kB | 16.51 kB | (1032, 2) | (1032, 2) | 2 Tasks, 1 Chunks | object | numpy.ndarray |
| 3.07 kB | 3.07 kB | (192, 2) | (192, 2) | 2 Tasks, 1 Chunks | float64 | numpy.ndarray |
| 228.26 MB | 67.02 MB | (1, 1032, 192, 288) | (1, 303, 192, 288) | 9 Tasks, 4 Chunks | float32 | numpy.ndarray |