Cal-Adapt Analytics Engine Data Catalog Access and Data Download
----------------------------------------------------------------

All of the climate data in the Analytics Engine is stored in a publicly accessible AWS S3 bucket. If you are familiar with Python, you can access the data with the intake package (through its intake-esm plugin) and open it as an xarray dataset. That dataset can then be exported to NetCDF and saved on your computer.

In [1]:
# If running this notebook outside the Cal-Adapt Analytics Engine JupyterHub, make sure the intake-esm and s3fs packages are installed
import intake

To connect to the data catalog, which stores the metadata needed to access the data, run these commands:

In [2]:
# Open catalog of available data sets using intake-esm package
cat = intake.open_esm_datastore('https://cadcat.s3.amazonaws.com/cae-collection.json')

In [3]:
# inspecting the catalog object will show the number of datasets and unique attributes
cat

Unnamed: 0,unique
activity_id,2
institution_id,3
source_id,18
experiment_id,5
member_id,15
table_id,4
variable_id,64
grid_label,3
path,6371
derived_variable_id,0


This catalog object can be converted to a Pandas dataframe to easily see the contents:

In [4]:
# Access catalog as dataframe and inspect the first few rows
cat_df = cat.df
cat_df.head()

Unnamed: 0,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,path
0,LOCA2,UCSD,ACCESS-CM2,historical,r1i1p1f1,day,hursmax,d03,s3://cadcat/loca2/ucsd/access-cm2/historical/r...
1,LOCA2,UCSD,ACCESS-CM2,historical,r1i1p1f1,day,hursmin,d03,s3://cadcat/loca2/ucsd/access-cm2/historical/r...
2,LOCA2,UCSD,ACCESS-CM2,historical,r1i1p1f1,day,huss,d03,s3://cadcat/loca2/ucsd/access-cm2/historical/r...
3,LOCA2,UCSD,ACCESS-CM2,historical,r1i1p1f1,day,pr,d03,s3://cadcat/loca2/ucsd/access-cm2/historical/r...
4,LOCA2,UCSD,ACCESS-CM2,historical,r1i1p1f1,day,rsds,d03,s3://cadcat/loca2/ucsd/access-cm2/historical/r...
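
Because the catalog is exposed as an ordinary Pandas dataframe, it can also be filtered with boolean masks before any catalog search is run. The sketch below is illustrative only: it assumes a small stand-in dataframe built from the five rows shown above (paths left truncated exactly as displayed), and the helper name filter_catalog is hypothetical, not part of intake-esm.

```python
import pandas as pd

# Stand-in for cat.df: the five rows shown in the output above.
# Paths are truncated exactly as displayed, so they are not usable URLs.
sample_df = pd.DataFrame({
    'activity_id': ['LOCA2'] * 5,
    'institution_id': ['UCSD'] * 5,
    'source_id': ['ACCESS-CM2'] * 5,
    'experiment_id': ['historical'] * 5,
    'member_id': ['r1i1p1f1'] * 5,
    'table_id': ['day'] * 5,
    'variable_id': ['hursmax', 'hursmin', 'huss', 'pr', 'rsds'],
    'grid_label': ['d03'] * 5,
    'path': ['s3://cadcat/loca2/ucsd/access-cm2/historical/r...'] * 5,
})


def filter_catalog(df, **criteria):
    """Return the rows of df that match every criterion.

    Each criterion is column=value or column=[allowed values].
    Hypothetical helper for illustration only.
    """
    mask = pd.Series(True, index=df.index)
    for column, wanted in criteria.items():
        # Accept a single value or a collection of allowed values
        if isinstance(wanted, (list, tuple, set)):
            allowed = list(wanted)
        else:
            allowed = [wanted]
        mask = mask & df[column].isin(allowed)
    return df[mask]


# Keep only the two relative humidity variables
humidity = filter_catalog(sample_df, variable_id=['hursmax', 'hursmin'])
print(humidity[['source_id', 'variable_id']])
```

With the real catalog, the same helper could be called on cat_df in place of sample_df.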


You can also list just the column names in the catalog by doing:

In [5]:
# Print column names
for col in cat_df:
    print(col)

activity_id
institution_id
source_id
experiment_id
member_id
table_id
variable_id
grid_label
path


To see the unique values in each column, run the following code:

In [6]:
# unique values in each column. Not all combinations of values will link to a dataset.
for col in cat_df:
    print(cat_df[col].unique())

['LOCA2' 'WRF']
['UCSD' 'CAE' 'UCLA']
['ACCESS-CM2' 'CESM2-LENS' 'CNRM-ESM2-1' 'EC-Earth3' 'EC-Earth3-Veg'
 'FGOALS-g3' 'GFDL-ESM4' 'HadGEM3-GC31-LL' 'INM-CM5-0' 'IPSL-CM6A-LR'
 'KACE-1-0-G' 'MIROC6' 'MPI-ESM1-2-HR' 'MRI-ESM2-0' 'TaiESM1' 'ensmean'
 'CESM2' 'ERA5']
['historical' 'ssp245' 'ssp370' 'ssp585' 'reanalysis']
['r1i1p1f1' 'r2i1p1f1' 'r3i1p1f1' 'r10i1p1f1' 'r4i1p1f1' 'r5i1p1f1'
 'r6i1p1f1' 'r7i1p1f1' 'r8i1p1f1' 'r9i1p1f1' 'r1i1p1f2' 'r1i1p1f3'
 'r2i1p1f3' 'r3i1p1f3' 'r11i1p1f1' nan]
['day' 'mon' 'yrmax' '1hr']
['hursmax' 'hursmin' 'huss' 'pr' 'rsds' 'tasmax' 'tasmin' 'uas' 'vas'
 'wspeed' 'lwdnbc' 'lwdnb' 'lwupbc' 'lwupb' 'prec' 'psfc' 'q2' 'rainc'
 'rainnc' 'runsb' 'runsf' 'snow' 'snownc' 'swddif' 'swdnbc' 'swdnb'
 'swupbc' 'swupb' 't2' 'tsk' 'u10' 'v10' 'etrans_sfc' 'evap_sfc' 'gh_sfc'
 'iwp' 'lh_sfc' 'lw_dwn' 'lwp' 'lw_sfc' 'prec_c' 'prec_max' 'prec_snow'
 'rh' 'sfc_runoff' 'sh_sfc' 'subsfc_runoff' 'sw_dwn' 'sw_sfc' 't2max'
 't2min' 'tskin' 'wspd10max' 'wspd10mean' 'p' 'ph' 

This will give you an idea of the available query parameters that can be entered to retrieve a particular set of data. Below is a sample query against the whole catalog to refine catalog entries to those of interest.

In [7]:
# form query dictionary
query = {
    # Downscaling method
    'activity_id': 'WRF',
    # GCM name
    'source_id': 'CESM2',
    # time period - historical or emissions scenario
    'experiment_id': ['historical', 'ssp370'],
    # variable
    'variable_id': 't2',
    # monthly time resolution
    'table_id': 'mon',
    # grid resolution: d01 = 45km, d02 = 9km, d03 = 3km
    'grid_label': 'd03'
}
# subset catalog, keeping only 'source_id' values that satisfy every query criterion
cat_subset = cat.search(require_all_on=['source_id'], **query)
cat_subset


Unnamed: 0,unique
activity_id,1
institution_id,1
source_id,1
experiment_id,2
member_id,1
table_id,1
variable_id,1
grid_label,1
path,2
derived_variable_id,0
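
Since a misspelled or unavailable value simply yields an empty subset, it can help to check a query against the known unique values before searching. The following is a minimal sketch under stated assumptions: KNOWN_VALUES is copied by hand from the unique-value output earlier in this notebook (only four columns are included), and find_unknown_values is a hypothetical helper, not an intake-esm function.

```python
# Unique values copied from the catalog output earlier in this notebook.
# Only some columns are included; extend as needed.
KNOWN_VALUES = {
    'activity_id': {'LOCA2', 'WRF'},
    'institution_id': {'UCSD', 'CAE', 'UCLA'},
    'experiment_id': {'historical', 'ssp245', 'ssp370', 'ssp585', 'reanalysis'},
    'table_id': {'day', 'mon', 'yrmax', '1hr'},
}


def find_unknown_values(query, known_values):
    """Return {column: [values absent from known_values]} for a query dict.

    Columns missing from known_values are skipped rather than flagged.
    Hypothetical helper for illustration only.
    """
    problems = {}
    for column, wanted in query.items():
        if column not in known_values:
            continue  # nothing to check this column against
        if isinstance(wanted, (list, tuple, set)):
            values = list(wanted)
        else:
            values = [wanted]
        unknown = [v for v in values if v not in known_values[column]]
        if unknown:
            problems[column] = unknown
    return problems


good_query = {
    'activity_id': 'WRF',
    'experiment_id': ['historical', 'ssp370'],
    'table_id': 'mon',
}
bad_query = {'activity_id': 'WRF', 'experiment_id': ['historical', 'ssp126']}

print(find_unknown_values(good_query, KNOWN_VALUES))
print(find_unknown_values(bad_query, KNOWN_VALUES))
```

Passing this check does not guarantee a match, because not all combinations of values link to a dataset; it only catches values that appear nowhere in the catalog.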


The zarr datasets of interest can then be opened as xarray datasets (the arrays are dask-backed, so values are read lazily rather than all at once) using:

In [8]:
dsets = cat_subset.to_dataset_dict(xarray_open_kwargs={'consolidated': True},
                                   storage_options={'anon': True})


--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'


To see the dataset keys, list the dictionary:

In [9]:
# See object keys in dsets
list(dsets)

['WRF.UCLA.CESM2.historical.mon.d03', 'WRF.UCLA.CESM2.ssp370.mon.d03']
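
The key pattern reported above can be used to build a key from catalog attributes, or to split a key back into them. This is a small sketch, assuming that no attribute value contains a dot (true of the values listed in this notebook); build_key and parse_key are hypothetical helper names.

```python
# Field order taken from the key pattern printed by to_dataset_dict
KEY_FIELDS = ('activity_id', 'institution_id', 'source_id',
              'experiment_id', 'table_id', 'grid_label')


def build_key(**attrs):
    """Join attribute values into a dataset key, in KEY_FIELDS order."""
    return '.'.join(attrs[field] for field in KEY_FIELDS)


def parse_key(key):
    """Split a dataset key into a {field: value} dictionary.

    Assumes no attribute value itself contains a dot.
    """
    parts = key.split('.')
    if len(parts) != len(KEY_FIELDS):
        raise ValueError(f'expected {len(KEY_FIELDS)} fields, got {len(parts)}')
    return dict(zip(KEY_FIELDS, parts))


key = build_key(activity_id='WRF', institution_id='UCLA', source_id='CESM2',
                experiment_id='historical', table_id='mon', grid_label='d03')
print(key)
print(parse_key('WRF.UCLA.CESM2.ssp370.mon.d03')['experiment_id'])
```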

To select a single dataset of interest, index the dictionary with its key.

In [10]:
# Subset to historical time period and examine data object
data = dsets['WRF.UCLA.CESM2.historical.mon.d03']
data

(xarray Dataset summary, condensed. The 2-D arrays in the dataset share this layout:)

Unnamed: 0,Array,Chunk
Bytes,467.02 kiB,387.61 kiB
Shape,"(492, 243)","(449, 221)"
Count,2 Graph Layers,4 Chunks
Type,float32,numpy.ndarray

(The 4-D array has this layout:)

Unnamed: 0,Array,Chunk
Bytes,186.08 MiB,127.94 MiB
Shape,"(1, 408, 492, 243)","(1, 338, 449, 221)"
Count,3 Graph Layers,8 Chunks
Type,float32,numpy.ndarray


Finally, to save a dataset to NetCDF, run the last cell below.

In [11]:
data.to_netcdf('WRF-UCLA-CESM2-historical-mon-d03.nc')
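
Exporting reads the full array from S3, so it can be worth estimating the uncompressed size first. The arithmetic below is a sketch that reproduces the sizes reported in the dataset summary above from the array shapes, assuming 4 bytes per float32 value; array_nbytes is a hypothetical helper, and the NetCDF file on disk may differ in size depending on encoding and compression.

```python
import math

BYTES_PER_FLOAT32 = 4


def array_nbytes(shape, itemsize=BYTES_PER_FLOAT32):
    """Uncompressed size in bytes of an array with the given shape."""
    return math.prod(shape) * itemsize


# Shapes taken from the dataset summary above
full_array = array_nbytes((1, 408, 492, 243))
one_chunk = array_nbytes((1, 338, 449, 221))
grid_2d = array_nbytes((492, 243))

print(f'4-D array: {full_array / 2**20:.2f} MiB')    # 186.08 MiB
print(f'largest chunk: {one_chunk / 2**20:.2f} MiB')  # 127.94 MiB
print(f'2-D array: {grid_2d / 2**10:.2f} kiB')        # 467.02 kiB
```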



