# Intake - find, browse and access `intake-esm` collections

<a class="anchor" id="motivation"></a>

For an introduction to intake please see the [intake documentation](https://intake.readthedocs.io/en/latest/).
Here we follow the guidance presented by `intake-esm` in its [documentation](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html).

## Motivation of intake and the intake-esm plugin

> Simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on different storages in a variety of formats (netCDF, zarr, etc...). Finding, investigating, loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available, the attributes describing each data set, before loading a specific data set and analyzing it.

> `Intake` provides a lightweight package for finding, investigating, loading and disseminating data. Different types of catalogs are supported via different drivers and plugins. 

> The `Intake-esm` plugin provides functionality for **searching, discovering, data access and data loading** climate model data. 

> The `Intake-esm` data cataloging utility is built on top of **intake, pandas, and xarray**.

For intake users, many data preparation tasks **are no longer necessary**. They do not need to know:

- 🌍 where data is saved
- 🪧 how data is saved
- 📤  how data should be loaded

but can still search, discover, access and load the data of a project.

<a class="anchor" id="features"></a>

## Features of intake and intake-esm

Intake is a generic **cataloging system** for listing data sources. As a plugin, `intake-esm` is built on top of `intake`, `pandas`, and `xarray` and configures `intake` such that it is able to also **load and process** ESM data.

- display catalogs as clearly structured tables 📄 inside jupyter notebooks for easy investigation
- browse 🔍 through the catalog and select your data without
    - being next to the data (e.g. logged in on dkrz's luv)
    - knowing the project's data reference syntax i.e. the storage tree hierarchy and path and file name templates
- open climate data in an analysis ready dictionary of `xarray` datasets 🎁

All required information for searching, accessing and loading the catalog's data is configured within the catalogs:

- 🌍 where data is saved
    * users can browse data without knowing the data storage platform including e.g. the root path of the project and the directory syntax
    * data of different platforms (cloud or disk) can be combined in one catalog
    * in the medium term, intake catalogs can become **a single point of access**
- 🪧 how data is saved
    * users can work with an *xarray* dataset representation of the data no matter whether it is saved in **grb, netcdf or zarr** format.
    * catalogs can contain more information, and therefore more search facets, than is obvious from the names and paths of the data.
- 📤  how data should be loaded
    * users work with an **aggregated** *xarray* dataset representation which merges files/assets perfectly fitted to the project's data model design.
    * with *xarray* and the underlying *dask* library, data which are **larger than the RAM** can be loaded

In this tutorial, we load a CMIP6 catalog which contains all data from the pool on DKRZ's mistral disk storage.
CMIP6 is the 6th phase of the Coupled Model Intercomparison Project and builds the data base used in the IPCC AR6.
The CMIP6 catalog contains all data that is published or replicated at the ESGF node at DKRZ.

<a class="anchor" id="terminology"></a>

## Terminology: **Catalog**, **Catalog file** and **Collection**

We align our wording with `intake`'s [*glossary*](https://intake.readthedocs.io/en/latest/glossary.html) which is still evolving. The names overlap with other definitions, making it difficult to keep track. Here we try to give an overview of the hierarchy of catalog terms:

- a **top level catalog file** 📋 is the **main** catalog of an institution which will be opened first. It contains other project [*catalogs*](#catalog)  📖 📖 📖. Such catalogs can be assigned an [*intake driver*](#intakedriver) which is used to open and load the catalog within the top level catalog file. Technically, a catalog file 📋  <a class="anchor" id="catalogfile"></a>
    - is a `.yaml` file
    - can be opened with `open_catalog`, e.g.:
```python
    intake.open_catalog(["https://dkrz.de/s/intake"])
```
- **intake driver**s, also named **plugin**s, are specified for [*catalogs*](#catalog) because they load specific data sets. There are [many driver libraries](https://intake.readthedocs.io/en/latest/plugin-directory.html) for intake; we will concentrate on the intake-esm driver for climate model data. <a class="anchor" id="intakedriver"></a>

- a **catalog** 📖 (or collection) is defined by two parts: <a class="anchor" id="catalog"></a>
    - a **description** of a group of data sets. It describes how to *load* **assets** of the data set(s) with the specified [driver](#intakedriver). This group forms an entity. E.g., all CMIP6 data sets can be collected in a catalog. <a class="anchor" id="description"></a>
        - an **asset** is most often a file. <a class="anchor" id="asset"></a>
    - a **collection** of all [assets](#asset) of the data set(s).   <a class="anchor" id="collection"></a>
        - the collection can be included in the catalog or separately saved in a **data base** 🗂. In the latter case, the catalog references the data base, e.g.:
```json
  "catalog_file": "/mnt/lustre02/work/ik1017/Catalogs/dkrz_cmip6_disk.csv.gz"
```

```{note}
The term *collection* is often used synonymously with [catalog](#catalog).
```

- an *intake-esm* **catalog**  📖 consists of a `.json` file (the **description**) and the underlying data base. The data base is either provided within the `.json` file or as a `.csv.gz` formatted list.

The intake-esm catalog can be opened with intake-esm's function `intake.open_esm_datastore()`, where the `.json` part is the argument, e.g.:

```python
intake.open_esm_datastore("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_cmip6_disk.json")
```
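
The split between the `.json` description and the separately saved data base can be made concrete with a small sketch. The field names follow the ESM Collection Specification to our understanding; the `id`, the column names and the file name `toy_catalog.csv.gz` are invented for illustration and do not describe the real DKRZ catalog.

```python
import json

# Hypothetical, minimal description of an intake-esm catalog.
# All values are illustrative; only the overall layout matters.
description = {
    "esmcat_version": "0.1.0",
    "id": "toy_cmip6_disk",
    "description": "Toy example of an ESM collection description",
    # Reference to the separately saved data base (one row per asset).
    "catalog_file": "toy_catalog.csv.gz",
    # Columns of the data base that can be used as search facets.
    "attributes": [
        {"column_name": "source_id"},
        {"column_name": "experiment_id"},
        {"column_name": "variable_id"},
    ],
    # Column that holds the location of each asset, and the asset format.
    "assets": {"column_name": "uri", "format": "netcdf"},
}

# The description is stored as plain JSON text.
text = json.dumps(description, indent=2)
facets = [attr["column_name"] for attr in json.loads(text)["attributes"]]
print(facets)
```

A real description usually also tells the driver how to aggregate assets into datasets; that part is left out here to keep the sketch short.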
    

In [None]:
#note that intake_esm is imported with `import intake` as a plugin
import intake

<a class="anchor" id="browse"></a>

## Open and browse through catalogs

There are essentially two options to work with specific catalogs:
1) use intake to **open** top level catalog files in `yaml` format. These contain information about additional sources: other catalogs/collections which will be loaded with specific *plugins*/*drivers*. The command is `open_catalog`.
2) use intake-esm to directly open esm catalogs in `json` format. 


In [None]:
# on DKRZ resources the catalog is accessible in the data pool directory
dkrz_catalog=intake.open_catalog(["/pool/data/Catalogs/dkrz_catalog.yaml"])
#
# for opening the catalog remotely, it is also available on gitlab:
# dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])

We can look into the catalog with `print` and `list`.

Over time, many collections have been created. `dkrz_catalog` is a **main** catalog prepared to keep an overview of all other collections. `list` shows all **project catalogs** which are available at DKRZ.

In [None]:
list(dkrz_catalog)

All these catalogs are **intake-esm** catalogs. You can find this information via the `_entries` attribute. The line `plugin: ['esm_datastore']` refers to **intake-esm**'s function `open_esm_datastore()`.

In [None]:
print(dkrz_catalog._entries)

The DKRZ ESM-Collections follow a name template:

`dkrz_${project}_${store}[_${auxiliary_catalog}]`

where

- **project** can be one of the *model intercomparison projects*, e.g. `cmip6`, `cmip5`, `cordex`, `era5` or `mpi-ge`.
- **store** is the data store and can be one of:
    - `disk`: DKRZ holds a lot of data on a consortial disk space on the file system of the High Performance Computer (HPC) where it is accessible for every HPC user. Working next to the data on the file system will be the fastest way possible.
    - `cloud`: A small subset is transferred into DKRZ's cloud in order to test the performance. swift is DKRZ's cloud storage.
    - `archive`: A lot of data exists in the tape archive of DKRZ. Before it can be accessed, it has to be retrieved. Therefore, catalogs for `hsm` are limited in functionality but still convenient for data browsing.
- **auxiliary_catalog** can be *grid*
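
The name template can be turned into a small parser, which is handy when filtering the output of `list(dkrz_catalog)`. This is a minimal sketch: the helper `parse_collection_name` is hypothetical, and it assumes that project names contain no underscore and that the store is one of the three values listed above.

```python
import re

# Assumption: project names contain no underscore and the store is one of
# the three values listed above.
NAME_PATTERN = re.compile(
    r"^dkrz_(?P<project>[^_]+)_(?P<store>disk|cloud|archive)"
    r"(?:_(?P<auxiliary_catalog>.+))?$"
)


def parse_collection_name(name):
    """Split a collection name into the parts of the name template."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the name template")
    return match.groupdict()


print(parse_collection_name("dkrz_cmip6_disk"))
# A name with the optional suffix, built from the template for illustration.
print(parse_collection_name("dkrz_cmip6_disk_grid"))
```

Names that do not follow the template raise a `ValueError`, so unexpected entries are noticed instead of being parsed silently.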

### The `intake-esm` catalogs

We now look into a catalog which is opened by the plugin `intake-esm`. 
As mentioned earlier there are two options to open intake-esm catalogs: 
1) use intake to **open** top level catalog files in `yaml` format. These contain information about additional sources: other catalogs/collections which will be loaded with specific *plugins*/*drivers*. The command is `open_catalog`.
2) use intake-esm to directly open esm catalogs in `json` format. 

> An ESM (Earth System Model) collection file is a `JSON` file that conforms to the ESM Collection Specification. When provided a link/path to an esm collection file, intake-esm establishes a link to a database (`CSV` file) that contains data assets locations and associated metadata (i.e., which experiment, model, they come from).

Since the data base of the CMIP6 ESM Collection is about 100MB in compressed format, it takes up to a minute to load the catalog.

```{note}
The project catalogs contain only valid and current project data. They are constantly updated.

If your work is based on a catalog and a subset of the data from it, be sure to save that subset so you can later compare your database to the most current catalog.
```

#### Use the top level catalog to access the intake-esm catalog

In [None]:
dkrz_catalog=intake.open_catalog(["/pool/data/Catalogs/dkrz_catalog.yaml"])
#
# for opening the catalog remotely, it is also available on gitlab:
# dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])

In [None]:
esm_col=dkrz_catalog.dkrz_cmip6_disk
print(esm_col)

`intake-esm` gives us an overview of the content of the ESM collection. The ESM collection is a data base described by specific attributes, which are technically columns. The columns are derived from the data standard of each project and are filled by parsing the path and file names.

Displaying `esm_col` shows the number of unique values in each column. Since each `uri` refers to one file, we can conclude that the DKRZ-CMIP6 ESM Collection contained **6.1 million files** in 2022.

The data base is loaded into an underlying `pandas` DataFrame, which we can access with `esm_col.df`. `esm_col.df.head()` displays the first rows of the table:

In [None]:
esm_col.df.head() 

### Browse through the data of the ESM collection

Technically, you browse the collection by setting values for the **column names** of the underlying table. By default, the catalog is loaded with all attributes/columns that define the CMIP6 data standard:

In [None]:
list(esm_col.df.columns)

These are configured in the top level catalog, so you <mark> do not need to open the catalog to see the columns </mark>.

In [None]:
query = dict(
    variable_id="tas",
    table_id="Amon",
    experiment_id=["piControl", "historical", "ssp370"])
# piControl = pre-industrial control, simulation to represent a stable climate from 1850 for >100 years.
# historical = historical Simulation, 1850-2014
# ssp370 = Shared Socioeconomic Pathways (SSPs) are scenarios of projected socioeconomic global changes. Simulation covers 2015-2100
cat = esm_col.search(**query)

In [None]:
cat.df.head()
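
To our understanding, `search` combines the keyword arguments as follows: a row is selected if, for every queried column, its value equals one of the requested values. The plain-Python sketch below applies this rule to a hypothetical three-row table. The real catalog is a `pandas` table, and the helper `matches` is invented for illustration; pattern matching is ignored here.

```python
# Hypothetical miniature data base: one dict per asset.
rows = [
    {"variable_id": "tas", "table_id": "Amon", "experiment_id": "historical"},
    {"variable_id": "tas", "table_id": "day", "experiment_id": "historical"},
    {"variable_id": "pr", "table_id": "Amon", "experiment_id": "ssp370"},
]

query = dict(
    variable_id="tas",
    table_id="Amon",
    experiment_id=["piControl", "historical", "ssp370"],
)


def matches(row, query):
    """Return True if the row satisfies every column of the query."""
    for column, wanted in query.items():
        # A single string is treated like a one-element list.
        allowed = [wanted] if isinstance(wanted, str) else list(wanted)
        if row[column] not in allowed:
            return False
    return True


selected = [row for row in rows if matches(row, query)]
print(selected)
```

Only the first row passes: the second fails on `table_id` and the third on `variable_id`, although both match one of the requested experiments.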

We can also use *wildcards*. For example, to find out which ESMs of the institution *MPI-M* have produced data for our subset:

In [None]:
result = cat.search(source_id="MPI-ES*")
result

We can find out which models have submitted data for at least one of the experiments with:

In [None]:
list(result.df["source_id"].unique())

If we instead look for the models that have submitted data for ALL experiments, we use the `require_all_on` keyword argument:

In [None]:
cat = esm_col.search(require_all_on=["source_id"], **query)
list(cat.df["source_id"].unique())
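
The effect of `require_all_on` can be pictured as a group-wise filter: rows are grouped by the listed columns, and a group is kept only if it covers every requested value. The plain-Python sketch below mimics this for `experiment_id` on a hypothetical table. `MODEL-A` and `MODEL-B` are invented names, and the actual implementation in `intake-esm` may differ in detail.

```python
from collections import defaultdict

# Hypothetical search result: (source_id, experiment_id) per asset.
rows = [
    ("MODEL-A", "piControl"),
    ("MODEL-A", "historical"),
    ("MODEL-A", "ssp370"),
    ("MODEL-B", "historical"),
    ("MODEL-B", "ssp370"),
]
requested = {"piControl", "historical", "ssp370"}

# Collect the experiments that each model provides.
experiments_by_model = defaultdict(set)
for source_id, experiment_id in rows:
    experiments_by_model[source_id].add(experiment_id)

# Keep only the models that provide every requested experiment.
complete_models = sorted(
    model
    for model, experiments in experiments_by_model.items()
    if requested <= experiments
)
print(complete_models)
```

`MODEL-B` is dropped because it lacks `piControl`, even though it matches two of the three experiments.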

Note that only the combination of a `variable_id` and a `table_id` is unique in CMIP6. If you search for `tas` in all tables, you will find many more entries:

In [None]:
query = dict(
    variable_id="tas",
#    table_id="Amon",
    experiment_id=["piControl", "historical", "ssp370"])
cat = esm_col.search(**query)
list(cat.df["table_id"].unique())

Be careful when you search for specific time slices. Each frequency is connected with an individual name template for the filename. If the data is yearly, you have YYYY-YYYY, whereas you have YYYYMM-YYYYMM for monthly data.
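
Because the width of the date stamps depends on the frequency, a time slice filter has to parse the time range before comparing it. The following is a minimal sketch that handles only the two templates named above; the helpers `parse_time_range` and `overlaps` are hypothetical and not part of `intake-esm`.

```python
def parse_time_range(time_range):
    """Return (start_year, end_year) for 'YYYY-YYYY' or 'YYYYMM-YYYYMM'."""
    start, end = time_range.split("-")
    if len(start) != len(end) or len(start) not in (4, 6):
        raise ValueError(f"unsupported time_range: {time_range!r}")
    if not (start.isdigit() and end.isdigit()):
        raise ValueError(f"non-numeric time_range: {time_range!r}")
    # The first four digits are the year in both templates.
    return int(start[:4]), int(end[:4])


def overlaps(time_range, first_year, last_year):
    """True if the asset covers at least one year of the requested span."""
    start_year, end_year = parse_time_range(time_range)
    return start_year <= last_year and end_year >= first_year


print(overlaps("185001-186912", 1860, 1865))  # monthly template
print(overlaps("1850-1859", 1860, 1865))      # yearly template
```

Comparing years as integers avoids the pitfall of comparing strings of different width, such as `"1850"` against `"185001"`.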

<a class="anchor" id="dataaccess"></a>

## Access and load data of the ESM collection

With the power of `xarray`, `intake` can load your subset into a `dict`ionary of datasets. For this, we focus on the data of `MPI-ESM1-2-LR`:

In [None]:
#case insensitive?
query = dict(
    variable_id="tas",
    table_id="Amon",
    source_id="MPI-ESM1-2-LR",
    experiment_id="historical")
cat = esm_col.search(**query)
cat

**Intake-ESM** natively supports the following data formats or access formats (since opendap is not really a file format):

- netcdf
- opendap
- zarr

You can also open **grb** data, but currently only by passing xarray's *engine* argument to the *open* function described below. In other words, specifying **grb** as the format makes no difference.

You can find an example in the *era5* notebook.

The function to open data is `to_dataset_dict`. 

We recommend setting the chunk size of the variable's data array through the keyword argument `xarray_open_kwargs`, which is used in the cell below; older `intake-esm` releases used `cdf_kwargs` for this. Otherwise, `xarray` may choose chunks that are too large. Most often, your data contains a time dimension, so you could set `xarray_open_kwargs={"chunks":{"time":1}}`.

If your collection contains **zarr** formatted data, older `intake-esm` releases expect another keyword argument, `zarr_kwargs`. <mark> The trick is: you can just specify both. Intake knows from the `format` column which *kwargs* should be taken. </mark>

In [None]:
xr_dict = cat.to_dataset_dict(xarray_open_kwargs=dict(chunks=dict(time=1)),
                                              #decode_times=True,
                                              #use_cftime=True)
                             )
xr_dict

`Intake` was able to aggregate many files into only one dataset:
- The `time_range` column was used to **concat** data along the `time` dimension
- The `member_id` column was used to generate a new dimension

The underlying `dask` package will only load the data into memory if needed. Note that attributes which disagree from file to file, e.g. *tracking_id*, are excluded from the dataset.
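
The two aggregation steps can be sketched without any climate data. The example groups hypothetical assets by `member_id` and orders each group by `time_range`, which is the order in which the files would be concatenated along `time`. File names and member labels are invented for illustration, and the real aggregation is carried out by `intake-esm` and `xarray`.

```python
from collections import defaultdict

# Hypothetical assets of one dataset: (member_id, time_range, uri).
assets = [
    ("r1i1p1f1", "187001-188912", "tas_r1_1870.nc"),
    ("r2i1p1f1", "185001-186912", "tas_r2_1850.nc"),
    ("r1i1p1f1", "185001-186912", "tas_r1_1850.nc"),
    ("r2i1p1f1", "187001-188912", "tas_r2_1870.nc"),
]

# Step 1: every member_id becomes one entry of the new dimension.
files_by_member = defaultdict(list)
for member_id, time_range, uri in assets:
    files_by_member[member_id].append((time_range, uri))

# Step 2: within a member, files are ordered by time_range so that they
# can be concatenated along the time dimension.
plan = {
    member_id: [uri for _, uri in sorted(files)]
    for member_id, files in sorted(files_by_member.items())
}
print(plan)
```

Sorting the `time_range` strings works here because all entries share the same template and width.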

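If we are only interested in a **single** dataset of the dictionary, we can *pop it out*. Note that `popitem` returns the most recently inserted entry; since the dictionary here holds only one dataset, that is the one we want: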

In [None]:
xr_dset = xr_dict.popitem()[1]
xr_dset

In [None]:
import hvplot.xarray
xr_dset["tas"].hvplot.quadmesh(width=600)