# Using intake to access the Australian Community Reference Climate Data Collection at NCI

This notebook shows how to use the intake catalogue we prepared for the aus-ref-clim-data-nci collection, hosted on the NCI server under the `ia39` project.<br>
To access the data and the catalogue you need to be part of the `ia39` project.<br><br>

You also need to have intake and intake-esm installed. They are both installed in the CLEX-managed hh5 conda environments.<br>
NB: as long as intake-esm is installed, you only need to import intake in your code.
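
A quick way to confirm that both packages are importable before opening the catalogue is sketched below. It relies only on the standard library; the helper name `missing_packages` is made up for this example, and `intake_esm` is assumed to be the import name of the intake-esm package.

```python
from importlib.util import find_spec


def missing_packages(names=("intake", "intake_esm")):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("intake and intake-esm are both available")
```

If anything is reported as missing, loading the conda module shown in the next cell is the likely fix on gadi.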

In [None]:
#!module use /g/data3/hh5/public/modules 
#!module load conda/analysis3

In [1]:
import intake

In [38]:
cat = intake.open_catalog('/g/data/ia39/aus-ref-clim-data-nci/acs-intake/catalogue.yaml')
list(cat)

['cmip6_etccdi', 'gpcc', 'gpcp', 'cmap', 'frogs', 'ghcn']

NB you can also use `cat._entries` to see a much more detailed description of catalogue entries.

In [3]:
# NB you can also use cat._entries to see a much more detailed description of catalogue entries
cat._entries

{'cmip6_etccdi': name: cmip6_etccdi
 container: xarray
 plugin: ['esm_datastore']
 driver: ['esm_datastore']
 description: Replica of the Climate extreme indices and heat stress indicators derived from CMIP6 global climate projections dataset (CICERO_ETCCDI) from the Copernicus Climate Datastore on gadi. This collection is a copy of CICERO_ETCCDI, which includes climate indices calculated on CMIP6 models historical and projections experiments. The collection includes ETCCDI (climate extrems indices) at yearly and monthly resolution, and HSI (Heat Stress Indicators) at daily resolution.
 This dataset is part of the Australian Community Reference Climate Data Collection at NCI.
 The files were downloaded from the CDS as netcdf files and compressed using nccopy.
 
 direct_access: forbid
 user_parameters: []
 metadata: 
 args: 
   esmcol_obj: {{CATALOG_DIR}}/cmip6_etccdi/catalogue.json,
 'gpcc': name: gpcc
 container: xarray
 plugin: ['esm_datastore']
 driver: ['esm_datastore']
 descriptio

In [4]:
# load the entry for cmip6_etccdi; its catalogue is held as a pandas DataFrame (etccdi.df)
etccdi = cat['cmip6_etccdi']
etccdi

| | unique |
|---|---|
| path | 33220 |
| index_type | 2 |
| base | 5 |
| frequency | 3 |
| experiment | 5 |
| model | 27 |
| ensemble | 77 |
| variable | 32 |
| date_range | 22 |


`etccdi` is an intake-esm datastore whose catalogue is stored as a pandas DataFrame, with `path` and all the dataset attributes as columns.<br>
As the catalogue is a DataFrame, you can use pandas methods like `head` and `columns` via the `df` attribute.

In [5]:
print(f"Columns are {etccdi.df.columns}\n")
etccdi.df.head()

Columns are Index(['path', 'index_type', 'base', 'frequency', 'experiment', 'model',
       'ensemble', 'variable', 'date_range'],
      dtype='object')



| | path | index_type | base | frequency | experiment | model | ensemble | variable | date_range |
|---|---|---|---|---|---|---|---|---|---|
| 0 | /g/data/ia39/aus-ref-clim-data-nci/cmip6-etccd... | etccdi | base_independent | yr | ssp370 | ACCESS-CM2 | r1i1p1f1 | txx | 2015-2100 |
| 1 | /g/data/ia39/aus-ref-clim-data-nci/cmip6-etccd... | etccdi | base_independent | yr | ssp370 | ACCESS-CM2 | r1i1p1f1 | r10mm | 2015-2100 |
| 2 | /g/data/ia39/aus-ref-clim-data-nci/cmip6-etccd... | etccdi | base_independent | yr | ssp370 | ACCESS-CM2 | r1i1p1f1 | tnn | 2015-2100 |
| 3 | /g/data/ia39/aus-ref-clim-data-nci/cmip6-etccd... | etccdi | base_independent | yr | ssp370 | ACCESS-CM2 | r1i1p1f1 | tnx | 2015-2100 |
| 4 | /g/data/ia39/aus-ref-clim-data-nci/cmip6-etccd... | etccdi | base_independent | yr | ssp370 | ACCESS-CM2 | r1i1p1f1 | txn | 2015-2100 |


In [None]:
# to see all the methods available for the catalogue, use dir()
# dir(etccdi)

Let's check what is in `etccdi`. We can use:<br>
 * `description` to get an overview;
 * `aggregation_info` to check the aggregation options for this dataset;
 * `unique()` to return all the unique values available for each column.
 
As `unique()` returns a dictionary, if we know the column name we can directly access the values for that key, as shown below.

In [6]:
print(etccdi.description)
print()
print(etccdi.aggregation_info)
print()
etccdi.unique()['frequency']

Replica of the Climate extreme indices and heat stress indicators derived from CMIP6 global climate projections dataset (CICERO_ETCCDI) from the Copernicus Climate Datastore on gadi. This collection is a copy of CICERO_ETCCDI, which includes climate indices calculated on CMIP6 models historical and projections experiments. The collection includes ETCCDI (climate extrems indices) at yearly and monthly resolution, and HSI (Heat Stress Indicators) at daily resolution.
This dataset is part of the Australian Community Reference Climate Data Collection at NCI.
The files were downloaded from the CDS as netcdf files and compressed using nccopy.


AggregationInfo(groupby_attrs=['index_type', 'base', 'frequency', 'experiment', 'model', 'ensemble'], variable_column_name='variable', aggregations=[{'type': 'union', 'attribute_name': 'variable'}], agg_columns=['variable'], aggregation_dict={'variable': {'type': 'union'}})



{'count': 3, 'values': ['yr', 'mon', 'day']}
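
To make the shape of that result concrete, here is a small standard-library sketch that builds the same kind of per-column summary from a toy catalogue. The rows and the helper `summarise_unique` are invented for illustration; this is not how intake-esm implements `unique()`.

```python
def summarise_unique(rows):
    """Map each column to {'count': n, 'values': [...]}, keeping first-seen order."""
    summary = {}
    for row in rows:
        for column, value in row.items():
            values = summary.setdefault(column, {"count": 0, "values": []})["values"]
            if value not in values:
                values.append(value)
    for info in summary.values():
        info["count"] = len(info["values"])
    return summary


# Toy rows standing in for catalogue entries (not real catalogue content)
rows = [
    {"frequency": "yr", "variable": "txx"},
    {"frequency": "mon", "variable": "txx"},
    {"frequency": "day", "variable": "tnn"},
]
info = summarise_unique(rows)
print(info["frequency"])
```

Indexing the summary by column name, as in `info["frequency"]`, mirrors the `etccdi.unique()['frequency']` call above.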

We can execute a query using the `search()` method. Let's select a subset by passing it some constraints.

In [7]:
subset = etccdi.search( base='base_independent', 
                        frequency='mon', 
                        model='ACCESS-CM2', 
                        experiment='historical', 
                        ensemble='r1i1p1f1' )
subset

| | unique |
|---|---|
| path | 7 |
| index_type | 1 |
| base | 1 |
| frequency | 1 |
| experiment | 1 |
| model | 1 |
| ensemble | 1 |
| variable | 7 |
| date_range | 1 |
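
Conceptually, `search()` keeps the catalogue rows whose columns match every constraint. The sketch below mimics that filtering on a toy list of rows with plain Python; `filter_rows` and the rows themselves are hypothetical, and the real method is richer than this exact-match logic.

```python
def filter_rows(rows, **constraints):
    """Keep rows where each constrained column equals the value (or one of a list of values)."""
    def matches(row):
        for column, wanted in constraints.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row.get(column) not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


# Toy rows standing in for catalogue entries (not real catalogue content)
rows = [
    {"frequency": "mon", "experiment": "historical", "variable": "txx"},
    {"frequency": "mon", "experiment": "ssp370", "variable": "txx"},
    {"frequency": "yr", "experiment": "historical", "variable": "tnn"},
]
subset_rows = filter_rows(rows, frequency="mon", experiment="historical")
print(len(subset_rows))
```

Every constraint narrows the selection, which is why five constraints leave only 7 of the 33220 paths in the real catalogue.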


Our subset consists of 7 files representing 7 variables for the same model, ensemble, experiment, etc.<br>
To actually load the data in xarray, we first have to create a dictionary of datasets using `to_dataset_dict()`.

In [8]:
dset_dict = subset.to_dataset_dict()
dset_dict.keys()


--> The keys in the returned dictionary of datasets are constructed as follows:
	'index_type.base.frequency.experiment.model.ensemble'


dict_keys(['etccdi.base_independent.mon.historical.ACCESS-CM2.r1i1p1f1'])
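
The key is the values of the `groupby_attrs` columns joined with dots. A minimal sketch, using the `groupby_attrs` list printed by `aggregation_info` earlier and the attribute values from our search (the helper `make_key` is made up for this example):

```python
groupby_attrs = ["index_type", "base", "frequency", "experiment", "model", "ensemble"]


def make_key(row, attrs=groupby_attrs, sep="."):
    """Join the grouping attributes of one catalogue row into a dataset key."""
    return sep.join(str(row[attr]) for attr in attrs)


row = {
    "index_type": "etccdi",
    "base": "base_independent",
    "frequency": "mon",
    "experiment": "historical",
    "model": "ACCESS-CM2",
    "ensemble": "r1i1p1f1",
    "variable": "txx",  # not a grouping attribute, so it does not appear in the key
}
key = make_key(row)
print(key)
```

Because `variable` is not one of the grouping attributes, all 7 variable files share this single key.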

Finally we can simply load a dataset using its key

In [9]:
ds = dset_dict['etccdi.base_independent.mon.historical.ACCESS-CM2.r1i1p1f1']
ds

| Number of arrays | Shape | Chunk shape | Bytes per array | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|
| 1 | (192, 2) | (192, 2) | 3.00 kiB | 30 | 1 | float64 |
| 1 | (144, 2) | (144, 2) | 2.25 kiB | 30 | 1 | float64 |
| 1 | (1980, 2) | (1980, 2) | 30.94 kiB | 30 | 1 | datetime64[ns] |
| 7 | (1980, 144, 192) | (1980, 144, 192) | 208.83 MiB | 2 | 1 | float32 |


Please note that the aggregation defined for this dataset has been applied automatically, so the 7 variable files have been merged into one xarray Dataset.<br>
If you want to access the files without aggregation, set the optional argument `aggregate` to False.<br>

In [10]:
dset_dict2 = subset.to_dataset_dict(aggregate=False)
dset_dict2.keys()


--> The keys in the returned dictionary of datasets are constructed as follows:
	'path.index_type.base.frequency.experiment.model.ensemble.variable.date_range'


dict_keys(['/g/data/ia39/aus-ref-clim-data-nci/cmip6-etccdi/data/v1-0/etccdi/base_independent/mon/historical/ACCESS-CM2/txxETCCDI_mon_ACCESS-CM2_historical_r1i1p1f1_no-base_v20191108_185001-201412_v1-0.nc.etccdi.base_independent.mon.historical.ACCESS-CM2.r1i1p1f1.txx.185001-201412', '/g/data/ia39/aus-ref-clim-data-nci/cmip6-etccdi/data/v1-0/etccdi/base_independent/mon/historical/ACCESS-CM2/tnnETCCDI_mon_ACCESS-CM2_historical_r1i1p1f1_no-base_v20191108_185001-201412_v1-0.nc.etccdi.base_independent.mon.historical.ACCESS-CM2.r1i1p1f1.tnn.185001-201412', '/g/data/ia39/aus-ref-clim-data-nci/cmip6-etccdi/data/v1-0/etccdi/base_independent/mon/historical/ACCESS-CM2/rx5dayETCCDI_mon_ACCESS-CM2_historical_r1i1p1f1_no-base_v20191108_185001-201412_v1-0.nc.etccdi.base_independent.mon.historical.ACCESS-CM2.r1i1p1f1.rx5day.185001-201412', '/g/data/ia39/aus-ref-clim-data-nci/cmip6-etccdi/data/v1-0/etccdi/base_independent/mon/historical/ACCESS-CM2/rx1dayETCCDI_mon_ACCESS-CM2_historical_r1i1p1f1_no-base

This time we get one key per file, and each key starts with the file path.<br>
We can then load one of the files as we've done before, using its key.

In [11]:
# Finally we can simply load a dataset using its key
ds = dset_dict2['/g/data/ia39/aus-ref-clim-data-nci/cmip6-etccdi/data/v1-0/etccdi/base_independent/mon/historical/ACCESS-CM2/rx5dayETCCDI_mon_ACCESS-CM2_historical_r1i1p1f1_no-base_v20191108_185001-201412_v1-0.nc.etccdi.base_independent.mon.historical.ACCESS-CM2.r1i1p1f1.rx5day.185001-201412']
ds

| Number of arrays | Shape | Chunk shape | Bytes per array | Tasks | Chunks | Type |
|---|---|---|---|---|---|---|
| 1 | (192, 2) | (192, 2) | 3.00 kiB | 2 | 1 | float64 |
| 1 | (144, 2) | (144, 2) | 2.25 kiB | 2 | 1 | float64 |
| 1 | (1980, 2) | (1980, 2) | 30.94 kiB | 2 | 1 | datetime64[ns] |
| 1 | (1980, 144, 192) | (1980, 144, 192) | 208.83 MiB | 2 | 1 | float32 |

## Other datasets 

### GPCP aggregation along time dimension

With the `cmip6_etccdi` dataset we used an aggregation of type `union`. We can also aggregate files along the time dimension. We will use the `gpcp` dataset to demonstrate.<br>
The `gpcp` daily data is organised as *frequency/version/year/files*.

In [None]:
gpcp = cat['gpcp']
gpcp.unique().keys()

Let's get all the daily files for the year 1999.

In [None]:
subset = gpcp.search(frequency='day', year='1999')

This subset contains 365 files and we want to load them as one aggregated dataset.

In [None]:
ds_dict = subset.to_dataset_dict()
ds_dict.keys()

As we can see from the keys, all the files are combined into one aggregation, which we can load as an xarray Dataset.

In [None]:
gpcp_ds = ds_dict['v1-3.day']
gpcp_ds

### GHCN csv files

In [39]:
ghcn = cat['ghcn']

In [40]:
ghcn

ghcn:
  args:
    urlpath: /g/data/ia39/aus-ref-clim-data-nci/ghcn/data/daily/by_year/{year}.csv
  description: 'Replica of the daily station precipitation data from the Global Historical
    Climatology Network (GHCN).

    GHCN-Daily is an integrated database of daily climate summaries from land surface
    stations across the globe.

    It contains records from over 100,000 stations in 180 countries and territories.
    The period of record station files are parsed into yearly csv files that contain
    all available GHCN Daily station data for that year plus a time of observation
    field (where available).

    '
  driver: intake.source.csv.CSVSource
  metadata:
    catalog_dir: /g/data/ia39/aus-ref-clim-data-nci/acs-intake/


As this entry uses the intake `csv` driver (`intake.source.csv.CSVSource`) to handle the data, we first need to read the data as we would for a csv file.

In [None]:
# This is still not working, probably because of the data types, as some columns are strings!
ghcn.read()
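
If mixed column types are indeed the cause, one workaround to try is to bypass type inference: read every field as text and convert only the columns you need afterwards. The sketch below shows the idea with the standard-library `csv` module on an in-memory sample. The column names, the absence of a header line, and the sample rows are all assumptions for illustration (check them against the actual yearly files), and on gadi you would open a file under the `urlpath` shown above instead of the `StringIO` object.

```python
import csv
import io

# Assumed column layout; the yearly files may differ, so verify before relying on it
COLUMNS = ["id", "date", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

# Made-up sample rows standing in for one yearly file (no header line assumed)
sample = io.StringIO(
    "STATION0001,19990101,PRCP,12,,,a,0900\n"
    "STATION0001,19990101,TMAX,305,,,a,\n"
)


def read_as_text(handle, columns=COLUMNS):
    """Read every field as a string, so empty flags and mixed types cannot break parsing."""
    return [dict(zip(columns, fields)) for fields in csv.reader(handle)]


records = read_as_text(sample)
# Convert only what is needed, here the precipitation values
prcp = [int(r["value"]) for r in records if r["element"] == "PRCP"]
print(prcp)
```

Keeping the flag and observation-time columns as strings avoids guessing a numeric type for fields that are often empty.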