# The CLINT data repository and Intake data catalogs

## Organisation of the CLINT storage area containing the data repository

- The central CLINT data repository is hosted at **/work/bk1318/Data_Repository** as part of the **/work/bk1318** CLINT storage allocation
- The overall structure of the CLINT work area is:
    - *Data_Repository*: the central CLINT data repository, hosting FAIR CLINT data
    - kxxxx: private working areas of CLINT members

## Central data repository structure
- The overall structure will be agreed on and updated regularly at PMB and General Assembly meetings
- The initial structure is:
    - **Catalogs**: catalogs helping to locate and access data (including external sources)
    - **HISTORY.md**: high-level bookkeeping information on major updates and changes to the content and structure of the data repository
    - **notebooks**: notebooks illustrating the repository structure and its use
    - **wp1** .. **wp8**: work package areas
    - **climate_service**: data associated with (pre-operational) climate service prototypes
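
As a quick sanity check of the layout listed above, the following minimal sketch reports which of the expected top-level entries are absent under a given root directory. The helper name `missing_entries` is hypothetical and introduced only for illustration; the entry names and the repository path are taken from this document.

```python
from pathlib import Path

# Top-level entries named in the structure list above.
EXPECTED_ENTRIES = (
    ["Catalogs", "HISTORY.md", "notebooks", "climate_service"]
    + [f"wp{i}" for i in range(1, 9)]  # wp1 .. wp8
)


def missing_entries(root):
    """Return the expected top-level entries that do not exist under root."""
    root = Path(root)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    # Repository path as given in this document; adjust if it differs for you.
    print(missing_entries("/work/bk1318/Data_Repository"))
```

An empty list means all listed entries are present. Since the structure is expected to change over time, `EXPECTED_ENTRIES` would need to be kept in sync with the agreed layout.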

## Directly accessible data pools associated with the CLINT repository
- DKRZ hosts large climate data collections (CMIP3, CMIP5, CMIP6, CORDEX, ERA5, etc.) as part of its **/pool/data** storage pool
- All of this data can be searched and accessed directly with the help of Intake data catalogs
- A detailed overview of the use of Intake at DKRZ is available:
    - [Intake tutorials](https://tutorials.dkrz.de/tutorial_intake-1-introduction.html)
    - [Use cases](https://tutorials.dkrz.de/Use-cases.html)

### Intake Catalog examples
- Use the catalogs of the DKRZ data pool
- Use ad hoc catalogs for CLINT

In [None]:
import intake
import xarray as xr


In [None]:
# the master catalog references all available catalogs

master_catalog = intake.open_catalog('/pool/data/Catalogs/dkrz_catalog.yaml')
list(master_catalog)

The catalogs are accessible at /pool/data/Catalogs and can also be loaded directly via `intake_esm` (see [intake tutorials](https://tutorials.dkrz.de/intake.html)).

In [None]:
!ls /pool/data/Catalogs
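
Outside a notebook, where the `!ls` shell escape is not available, the same listing can be sketched in plain Python. The helper name `list_catalog_files` and the suffix filter are assumptions made for illustration; this document does not state which file types the directory contains.

```python
from pathlib import Path


def list_catalog_files(directory, suffixes=(".json", ".yaml", ".yml")):
    """Return sorted names of files in directory with a catalog-like suffix.

    The suffix list is an assumption for illustration, not a convention
    taken from this document.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return []  # e.g. on a system that does not mount /pool/data
    return sorted(
        p.name for p in directory.iterdir()
        if p.is_file() and p.suffix in suffixes
    )


if __name__ == "__main__":
    for name in list_catalog_files("/pool/data/Catalogs"):
        print(name)
```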

## Example 1: Use CMIP6 intake catalog

In [None]:
cmip6_catalog = master_catalog["dkrz_cmip6_disk"]

In [None]:
cmip6_catalog.df.head()

In [None]:
tas_catalog = cmip6_catalog.search(
    experiment_id="historical",
    source_id="MPI-ESM1-2-HR",
    variable_id="tas",
    table_id="Amon",
    member_id="r1i1p1f1",
)
tas_catalog.df.head()

In [None]:
# pick the second matching file (index 1); use .iloc[0] for the first one
tas_path = tas_catalog.df['uri'].iloc[1]
tas_path

In [None]:
ds_tas = xr.open_dataset(tas_path)
ds_tas
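
The `search` call used above narrows the catalog table to the rows whose columns match the given values. The following standard-library sketch illustrates that filtering idea on a small in-memory table. It is not the intake-esm implementation: the `search` function here is a hypothetical stand-in, and the rows and file names are made up for illustration.

```python
def search(rows, **query):
    """Keep rows whose columns match every key in query.

    A query value may be a single value or a list of accepted values.
    """
    def matches(row):
        for column, wanted in query.items():
            if isinstance(wanted, (list, tuple, set)):
                accepted = wanted
            else:
                accepted = [wanted]
            if row.get(column) not in accepted:
                return False
        return True

    return [row for row in rows if matches(row)]


# Made-up rows for illustration; a real catalog has more columns and real paths.
rows = [
    {"source_id": "MPI-ESM1-2-HR", "variable_id": "tas", "uri": "file_a.nc"},
    {"source_id": "MPI-ESM1-2-HR", "variable_id": "pr", "uri": "file_b.nc"},
    {"source_id": "MPI-ESM1-2-LR", "variable_id": "tas", "uri": "file_c.nc"},
]

hits = search(rows, source_id="MPI-ESM1-2-HR", variable_id="tas")
print([row["uri"] for row in hits])
```

Each additional keyword narrows the result further, which is why a query that is too specific can return an empty table; checking the number of rows before indexing into the `uri` column avoids an `IndexError`.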

## Example 2: Use ERA5 intake catalog

In [None]:
era5_catalog = master_catalog['dkrz_era5_disk']


In [None]:
query = {
    'level_type': 'surface',
    'frequency': 'hourly',
    'code': 167,  # ECMWF parameter code 167: 2 m temperature
}

my_catalog = era5_catalog.search(**query)
my_catalog.df.head()

In [None]:
# pick the second matching file (index 1); use .iloc[0] for the first one
era_path = my_catalog.df['uri'].iloc[1]
era_path

In [None]:
# read the GRIB file into memory with cfgrib; indexpath='' disables writing an index file
ds_era = xr.load_dataset(era_path, engine='cfgrib', backend_kwargs={'indexpath': ''})
ds_era