# The CLINT data repository and Intake data catalogs

## Organisation of the CLINT storage area containing the data repository

- The central CLINT data repository is hosted at **/work/bk1318/Data_Repository** as part of the **/work/bk1318** CLINT storage allocation
- The overall structure of the CLINT work area is:
     - *Data_Repository*: the central CLINT data repository, hosting FAIR CLINT data
     - kxxxx: private working areas of CLINT members

## Central data repository structure
- The overall structure will be agreed on and updated regularly at PMB and General Assembly meetings
- The initial structure is:
    - **Catalogs**: catalogs helping to locate and access data (including external sources)
    - **history.txt**: high-level bookkeeping information on major updates and changes to the data repository content and structure
    - **notebooks**: notebooks illustrating repository structure and use aspects
    - **wp1** .. **wp8**: work package areas
    - **climate_service**: data associated with (pre-operational) climate service prototypes
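
The layout above can be mirrored in a few path helpers so that notebooks do not hard-code directory names. This is only a minimal sketch: the helper name `wp_dir` is hypothetical, and the directory names are taken from the list above.

```python
from pathlib import PurePosixPath

# Root of the central CLINT data repository (path as given in this document).
REPO = PurePosixPath("/work/bk1318/Data_Repository")

# Directory holding the Intake catalogs.
catalog_dir = REPO / "Catalogs"


def wp_dir(number):
    """Return the directory of work package `number` (wp1 .. wp8).

    Hypothetical helper, not part of any CLINT tooling.
    """
    if not 1 <= number <= 8:
        raise ValueError(f"work package number must be in 1..8, got {number}")
    return REPO / f"wp{number}"


print(catalog_dir)
print(wp_dir(8))
```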

## Directly accessible data pools associated with the CLINT repository
- DKRZ hosts large climate data collections (CMIP3, CMIP5, CMIP6, CORDEX, ERA5, etc.) as part of its **/pool/data** storage pool
- All of this data can be searched and accessed directly with the help of Intake data catalogs
- A detailed overview of the use of Intake at DKRZ is available:
    - [Intake tutorials](https://tutorials.dkrz.de/tutorial_intake-1-introduction.html)
    - [Use cases](https://tutorials.dkrz.de/use-cases.html)

### Intake Catalog examples
- Use the catalogs of the DKRZ data pool
- Use ad hoc catalogs for CLINT

In [None]:
import intake
import xarray as xr

# display the Intake GUI
intake.gui

In [None]:
# the master catalog references all available catalogs
# it can be added to the GUI and loaded
intake.gui.add('/pool/data/Catalogs/dkrz_catalog.yaml')
master_catalog = intake.open_catalog('/pool/data/Catalogs/dkrz_catalog.yaml')

The catalogs are accessible at /pool/data/Catalogs and can also be loaded directly via intake-esm (see the [Intake tutorials](https://tutorials.dkrz.de/intake.html)).

In [None]:
!ls /pool/data/Catalogs

## Example 1: Use CMIP6 intake catalog

In [None]:
cmip6_catalog = master_catalog["dkrz_cmip6_disk"]

In [None]:
cmip6_catalog.df.head()

In [None]:
# select monthly near-surface air temperature from one historical simulation
tas = cmip6_catalog.search(
    experiment_id="historical",
    source_id="MPI-ESM1-2-HR",
    variable_id="tas",
    table_id="Amon",
    member_id="r1i1p1f1",
)
tas

In [None]:
# load the selected files into a dictionary of xarray datasets
my_tas = tas.to_dataset_dict()
my_tas['CMIP.MPI-ESM1-2-HR.historical.Amon.gn']
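
The dictionary key used above is built from catalog attributes joined with dots. The sketch below reproduces that key in plain Python. It assumes the grouping attributes are `activity_id`, `source_id`, `experiment_id`, `table_id` and `grid_label`, which matches the key shown in this notebook; intake-esm takes the actual list from the catalog description file, so it may differ for other catalogs. The function name `dataset_key` is hypothetical.

```python
# Assumed grouping attributes; the real list is defined in the catalog file.
GROUPBY_ATTRS = ("activity_id", "source_id", "experiment_id", "table_id", "grid_label")


def dataset_key(attrs, sep="."):
    """Join the grouping attributes of one catalog row into a dataset key."""
    return sep.join(attrs[name] for name in GROUPBY_ATTRS)


# Attribute values taken from the search and the key used in this notebook.
row = {
    "activity_id": "CMIP",
    "source_id": "MPI-ESM1-2-HR",
    "experiment_id": "historical",
    "table_id": "Amon",
    "grid_label": "gn",
    "variable_id": "tas",
    "member_id": "r1i1p1f1",
}

key = dataset_key(row)
print(key)
```

Since `my_tas` is an ordinary dictionary, `list(my_tas)` shows the keys that were actually returned, which avoids having to guess them.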

## Example 2: Use ERA5 intake catalog

In [None]:
era5_catalog = master_catalog['dkrz_era5_disk']

# code 167 is the ECMWF parameter code for 2 m temperature
query = {
    'level_type': 'surface',
    'frequency': 'hourly',
    'code': 167,
}


In [None]:
my_catalog = era5_catalog.search(**query)
my_catalog.df.head()
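
The `code` field in the query is a numeric ECMWF parameter code (GRIB table 128), where 167 stands for 2 m temperature. A small lookup table makes such queries easier to read. The sketch below is an illustration only: the helper `era5_query` is hypothetical, the default values are copied from the query above, and it has not been checked which of the other codes are available in the DKRZ ERA5 catalog for a given level type and frequency.

```python
# A few ECMWF parameter codes (GRIB table 128) keyed by their short names.
ERA5_CODES = {
    "10u": 165,  # 10 m u-component of wind
    "10v": 166,  # 10 m v-component of wind
    "2t": 167,   # 2 m temperature
    "2d": 168,   # 2 m dewpoint temperature
    "msl": 151,  # mean sea level pressure
    "tp": 228,   # total precipitation
}


def era5_query(short_name, level_type="surface", frequency="hourly"):
    """Build a search dictionary for the ERA5 catalog (hypothetical helper)."""
    return {
        "level_type": level_type,
        "frequency": frequency,
        "code": ERA5_CODES[short_name],
    }


query = era5_query("2t")
print(query)
```

The resulting dictionary can be passed to the catalog in the same way as before, i.e. `era5_catalog.search(**query)`.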

In [None]:
# pick one of the matching files (iloc[1] selects the second row)
path1 = my_catalog.df['path'].iloc[1]
path1

In [None]:
# read the GRIB file with cfgrib; indexpath='' avoids writing an index file
wds = xr.load_dataset(path1, engine='cfgrib', backend_kwargs={'indexpath': ''})
wds

## Intake example for structuring data collections in the CLINT data repository
- If files follow a simple naming convention, e.g. a_b_c_d.nc, this can be exploited directly to make the data searchable and accessible via Intake:

In [None]:
intake.gui.add('/work/bk1318/Data_Repository/Catalogs/test.yml')


In [None]:
cat = intake.open_catalog('/work/bk1318/Data_Repository/Catalogs/test.yml')

In [None]:
cat.wp8.get_entry_kwarg_sets()

In [None]:
cat.wp8.get_entry(foo='a', bar='b')
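
The idea behind such a naming convention can be illustrated in plain Python: each part of a file name `a_b_c_d.nc` becomes a searchable field. The sketch below only mimics this principle and is not how the catalog above is implemented. The field names and the example file names are hypothetical; the actual convention would have to be agreed within CLINT.

```python
from pathlib import PurePosixPath

# Hypothetical meaning of the four parts of "a_b_c_d.nc".
FIELDS = ("wp", "variable", "frequency", "year")


def parse_name(path):
    """Split '<a>_<b>_<c>_<d>.nc' into a record, or return None if it does not match."""
    p = PurePosixPath(path)
    if p.suffix != ".nc":
        return None
    parts = p.stem.split("_")
    if len(parts) != len(FIELDS):
        return None
    record = dict(zip(FIELDS, parts))
    record["path"] = str(p)
    return record


def search(records, **query):
    """Return the records whose fields equal all given query values."""
    return [r for r in records if all(r.get(k) == v for k, v in query.items())]


# Hypothetical file listing of a work package area.
files = ["wp8/wp8_tas_mon_2020.nc", "wp8/wp8_pr_mon_2020.nc", "wp8/README.txt"]
records = [r for r in map(parse_name, files) if r is not None]

hits = search(records, variable="tas")
print([h["path"] for h in hits])
```

Files that do not follow the convention (here `README.txt`) are simply skipped, so a consistent naming scheme is what makes the collection searchable.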