Cal-Adapt Analytics Engine Data Catalog Access and Data Download
----------------------------------------------------------------

**All the climate data used by the Analytics Engine is stored in a publicly accessible AWS S3 bucket.
If you are familiar with Python, you can access the data with the intake package, which loads it as an xarray dataset.
This xarray dataset can then be exported to NetCDF and saved on your computer.**

In [1]:
# If running this notebook outside the Cal-Adapt Analytics Engine JupyterHub, install the intake-esm and s3fs packages first
import intake
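
Outside the Analytics Engine JupyterHub, it can help to confirm that the required packages are importable before opening the catalog. The sketch below is one way to do that with the standard library. The helper name `missing_packages` is hypothetical, and the import names `intake`, `intake_esm`, and `s3fs` are assumed to correspond to the packages mentioned in the comment above.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed import names for the intake, intake-esm, and s3fs packages
required = ["intake", "intake_esm", "s3fs"]
missing = missing_packages(required)
if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    print("All required packages are available.")
```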

**To connect to the data catalog, which stores all the relevant metadata needed to access the data, run these commands:**

In [2]:
# Open catalog of available data sets using intake-esm package
cat = intake.open_esm_datastore('https://cadcat.s3.amazonaws.com/cae-collection.json')

In [3]:
# inspecting the catalog object will show the number of datasets and unique attributes
cat

attribute,unique
activity_id,2
institution_id,3
source_id,18
experiment_id,5
member_id,15
table_id,3
variable_id,50
grid_label,3
path,3540


**The catalog's contents can be viewed as a pandas DataFrame through its df attribute:**

In [4]:
# Access catalog as dataframe and inspect the first few rows
cat_df = cat.df
cat_df.head()

index,activity_id,institution_id,source_id,experiment_id,member_id,table_id,variable_id,grid_label,path
0,WRF,UCLA,CESM2,historical,r11i1p1f1,1hr,lwdnb,d01,s3://cadcat/wrf/ucla/cesm2/historical/1hr/lwdn...
1,WRF,UCLA,CESM2,historical,r11i1p1f1,1hr,lwdnb,d02,s3://cadcat/wrf/ucla/cesm2/historical/1hr/lwdn...
2,WRF,UCLA,CESM2,historical,r11i1p1f1,1hr,lwdnb,d03,s3://cadcat/wrf/ucla/cesm2/historical/1hr/lwdn...
3,WRF,UCLA,CESM2,historical,r11i1p1f1,1hr,lwdnbc,d01,s3://cadcat/wrf/ucla/cesm2/historical/1hr/lwdn...
4,WRF,UCLA,CESM2,historical,r11i1p1f1,1hr,lwdnbc,d02,s3://cadcat/wrf/ucla/cesm2/historical/1hr/lwdn...
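
Each catalog row describes one dataset, so a combination of attribute values is only usable if some row carries all of them. The sketch below illustrates that check on three rows copied from the preview above (the truncated path column is left out). The helper `matching_rows` is hypothetical; with the real catalog, the records could be obtained from `cat_df.to_dict("records")`.

```python
def matching_rows(records, **criteria):
    """Return the catalog rows whose fields equal every value in criteria."""
    return [
        row for row in records
        if all(row.get(col) == value for col, value in criteria.items())
    ]


# Fields shared by the rows in the preview
base = {
    "activity_id": "WRF", "institution_id": "UCLA", "source_id": "CESM2",
    "experiment_id": "historical", "member_id": "r11i1p1f1", "table_id": "1hr",
}

# Rows 0, 1, and 3 of the preview
sample_records = [
    dict(base, variable_id="lwdnb", grid_label="d01"),
    dict(base, variable_id="lwdnb", grid_label="d02"),
    dict(base, variable_id="lwdnbc", grid_label="d01"),
]

# Two sample rows hold hourly lwdnb data
print(len(matching_rows(sample_records, variable_id="lwdnb", table_id="1hr")))  # 2

# No sample row combines lwdnb with daily output
print(len(matching_rows(sample_records, variable_id="lwdnb", table_id="day")))  # 0
```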


**You can also list just the column names in the catalog:**

In [5]:
# Print column names
for col in cat_df:
    print(col)

activity_id
institution_id
source_id
experiment_id
member_id
table_id
variable_id
grid_label
path


**To see the unique values in each column, run the following code:**

In [6]:
# unique values in each column. Not all combinations of values will link to a dataset.
for col in cat_df:
    print(cat_df[col].unique())

['WRF' 'LOCA2']
['UCLA' 'CAE' 'UCSD']
['CESM2' 'CNRM-ESM2-1' 'EC-Earth3-Veg' 'ERA5' 'FGOALS-g3' 'ensmean'
 'ACCESS-CM2' 'CESM2-LENS' 'EC-Earth3' 'GFDL-ESM4' 'HadGEM3-GC31-LL'
 'INM-CM5-0' 'IPSL-CM6A-LR' 'KACE-1-0-G' 'MIROC6' 'MPI-ESM1-2-HR'
 'MRI-ESM2-0' 'TaiESM1']
['historical' 'ssp245' 'ssp370' 'ssp585' 'reanalysis']
['r11i1p1f1' 'r1i1p1f2' 'r1i1p1f1' nan 'r2i1p1f1' 'r3i1p1f1' 'r10i1p1f1'
 'r4i1p1f1' 'r5i1p1f1' 'r6i1p1f1' 'r7i1p1f1' 'r8i1p1f1' 'r9i1p1f1'
 'r1i1p1f3' 'r2i1p1f3' 'r3i1p1f3']
['1hr' 'day' 'mon']
['lwdnb' 'lwdnbc' 'lwupb' 'lwupbc' 'psfc' 'q2' 'rainc' 'rainnc' 'runsb'
 'runsf' 'snow' 'snownc' 'swddif' 'swdnb' 'swdnbc' 'swupb' 'swupbc' 't2'
 'tsk' 'u10' 'v10' 'etrans_sfc' 'evap_sfc' 'gh_sfc' 'iwp' 'lh_sfc'
 'lw_dwn' 'lw_sfc' 'lwp' 'prec' 'prec_c' 'prec_max' 'prec_snow' 'rh'
 'sfc_runoff' 'sh_sfc' 'subsfc_runoff' 'sw_dwn' 'sw_sfc' 't2max' 't2min'
 'tskin' 'wspd10max' 'wspd10mean' 'huss' 'pr' 'tasmax' 'tasmin' 'uas'
 'vas']
['d01' 'd02' 'd03']
['s3://cadcat/wrf/ucla/cesm2/his

**These values show which query parameters are available for retrieving a particular set of data. Below is a sample query that narrows the whole catalog down to the entries of interest:**

In [7]:
# Narrow the catalog to LOCA2 entries
cat_loca = cat.search(activity_id="LOCA2")
# Read the member_ids from the dataframe. Iterating over cat_loca.unique()["member_id"]
# can yield the keys 'count' and 'values' instead of the member_ids themselves.
unique_mem_ids = cat_loca.df["member_id"].dropna().unique()
print("{0} unique member_ids".format(len(unique_mem_ids)))
dsets = {}
for member_id in unique_mem_ids:
    print("getting data for member_id: {0}".format(member_id))
    cat_subset = cat_loca.search(member_id=member_id)
    try:
        # Open the datasets lazily with anonymous (public) S3 access
        data_dict = cat_subset.to_dataset_dict(
            xarray_open_kwargs={'consolidated': True},
            storage_options={'anon': True}
        )
        dsets.update(data_dict)
    except Exception as err:
        print("Encountered an issue with {0} ({1})...continuing loop".format(member_id, err))
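
The loop above tries to open every LOCA2 dataset. When only a few datasets are needed, several attributes can be passed to a single search instead. The following is a minimal sketch, assuming `cat` is the catalog opened earlier. The helper `search_catalog` is hypothetical: it only guards against misspelled column names before delegating to `cat.search`. The values in the usage comment come from the unique-value listing, but whether that exact combination exists in the catalog is not verified here.

```python
# Column names as listed in the catalog dataframe
CATALOG_COLUMNS = {
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "path",
}


def search_catalog(catalog, **query):
    """Validate query keys against the catalog columns, then run catalog.search."""
    unknown = sorted(set(query) - CATALOG_COLUMNS)
    if unknown:
        raise ValueError("Unknown catalog columns: {0}".format(", ".join(unknown)))
    return catalog.search(**query)


# Usage with the catalog opened earlier (requires network access):
# subset = search_catalog(cat, activity_id="LOCA2", experiment_id="ssp245",
#                         table_id="day", variable_id="tasmax")
# print(len(subset.df), "matching catalog rows")
```

A narrower query keeps the resulting dictionary of datasets small and makes it easier to see which entries were actually matched.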




**To see the dataset keys, type:**

In [8]:
# See object keys in dsets
list(dsets)

['LOCA2.UCSD.TaiESM1.ssp370.day.d03',
 'LOCA2.UCSD.FGOALS-g3.ssp585.day.d03',
 'LOCA2.UCSD.EC-Earth3.ssp585.day.d03',
 'LOCA2.UCSD.INM-CM5-0.ssp370.day.d03',
 'LOCA2.UCSD.MIROC6.ssp585.day.d03',
 'LOCA2.UCSD.GFDL-ESM4.ssp245.day.d03',
 'LOCA2.UCSD.KACE-1-0-G.ssp585.day.d03',
 'LOCA2.UCSD.ACCESS-CM2.ssp245.day.d03',
 'LOCA2.UCSD.ACCESS-CM2.ssp370.day.d03',
 'LOCA2.UCSD.MRI-ESM2-0.ssp585.day.d03',
 'LOCA2.UCSD.MPI-ESM1-2-HR.ssp245.day.d03',
 'LOCA2.UCSD.EC-Earth3.historical.day.d03',
 'LOCA2.UCSD.MIROC6.ssp245.day.d03',
 'LOCA2.UCSD.ACCESS-CM2.historical.day.d03',
 'LOCA2.UCSD.TaiESM1.ssp245.day.d03',
 'LOCA2.UCSD.GFDL-ESM4.ssp585.day.d03',
 'LOCA2.UCSD.EC-Earth3.ssp370.day.d03',
 'LOCA2.UCSD.GFDL-ESM4.ssp370.day.d03',
 'LOCA2.UCSD.IPSL-CM6A-LR.ssp245.day.d03',
 'LOCA2.UCSD.KACE-1-0-G.historical.day.d03',
 'LOCA2.UCSD.MRI-ESM2-0.historical.day.d03',
 'LOCA2.UCSD.KACE-1-0-G.ssp245.day.d03',
 'LOCA2.UCSD.EC-Earth3.ssp245.day.d03',
 'LOCA2.UCSD.INM-CM5-0.ssp245.day.d03',
 'LOCA2.UCSD.MIROC6
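
The keys appear to join six catalog attributes with dots: activity_id, institution_id, source_id, experiment_id, table_id, and grid_label. That ordering is inferred from the values shown above rather than from the catalog documentation, so treat it as an assumption. Under that assumption, a small hypothetical helper can split keys into fields and filter them, for example by experiment:

```python
# Assumed order of the attributes that make up a dataset key
KEY_FIELDS = ("activity_id", "institution_id", "source_id",
              "experiment_id", "table_id", "grid_label")


def parse_key(key):
    """Split a dataset key into a dict of catalog attributes."""
    parts = key.split(".")
    if len(parts) != len(KEY_FIELDS):
        raise ValueError("Unexpected key format: {0}".format(key))
    return dict(zip(KEY_FIELDS, parts))


def keys_for_experiment(keys, experiment_id):
    """Return the keys whose experiment_id field matches."""
    return [k for k in keys if parse_key(k)["experiment_id"] == experiment_id]


# Three keys copied from the listing above
sample_keys = [
    "LOCA2.UCSD.TaiESM1.ssp370.day.d03",
    "LOCA2.UCSD.EC-Earth3.historical.day.d03",
    "LOCA2.UCSD.ACCESS-CM2.historical.day.d03",
]
print(keys_for_experiment(sample_keys, "historical"))
```

With the real dictionary, `keys_for_experiment(list(dsets), "historical")` would list the historical runs.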

**To select a single dataset of interest, index the dictionary by its key:**

In [9]:
# Select the CNRM-ESM2-1 ssp245 dataset by key and examine the data object
data = dsets['LOCA2.UCSD.CNRM-ESM2-1.ssp245.day.d03']
data

property,Array,Chunk
Bytes,32.38 GiB,127.31 MiB
Shape,"(31411, 495, 559)","(1952, 123, 139)"
Dask graph,425 chunks in 2 graph layers,425 chunks in 2 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

(The output repeats this same summary eight times, once per array in the dataset.)


**Finally, to save a dataset to NetCDF, use:**

In [None]:
data.to_netcdf('LOCA2.UCSD.CNRM-ESM2-1.ssp245.day.d03.nc')
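
Each array in this dataset is reported as 32.38 GiB, so writing the whole dataset to disk may be impractical. The sketch below first reproduces that size from the array shape (float32 uses 4 bytes per value), then shows one way to export a smaller slice. The function names are hypothetical, and the variable name `tasmax` and the dimension name `time` in the usage comment are assumptions about the dataset that the output above does not confirm.

```python
from math import prod


def size_in_gib(shape, bytes_per_value=4):
    """Uncompressed size of an array with the given shape, in GiB."""
    return prod(shape) * bytes_per_value / 2**30


def save_subset(ds, variable, time_range, path, time_dim="time"):
    """Write one variable, restricted to a (start, end) time range, to NetCDF."""
    start, end = time_range
    subset = ds[variable].sel({time_dim: slice(start, end)})
    subset.to_netcdf(path)
    return subset


# Full array: matches the 32.38 GiB shown in the Bytes row
print(round(size_in_gib((31411, 495, 559)), 2))

# One chunk, converted from GiB to MiB: matches the 127.31 MiB chunk size
print(round(size_in_gib((1952, 123, 139)) * 1024, 2))

# Usage with the dataset selected earlier (assumed variable and dimension names):
# save_subset(data, "tasmax", ("2030-01-01", "2030-12-31"), "tasmax_2030.nc")
```

Selecting a single variable and a limited time range before calling `to_netcdf` keeps the download and the output file to a manageable size.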