# CMIP6 catalog

The CMIP6 data catalog is hosted as a CSV file on Google Cloud. We can read this catalog, filter it down to the datasets we want to work with, and then load only the data we need. This lets us work with large datasets without downloading everything.

In [None]:
# On Google Colab, install the packages that are not available by default.
if "google.colab" in str(get_ipython()):
    print("Running on Colab")
    !pip install zarr==2.18 cftime
else:
    print("Not running on Colab")

In [None]:
import gcsfs
import pandas as pd
import xarray as xr

First, we need to open an anonymous, read-only connection to the Google Cloud Storage file system:

In [None]:
fs = gcsfs.GCSFileSystem(token="anon", access="read_only")

We can now load the CMIP6 catalog into a `pandas` DataFrame.

In [None]:
cat = pd.read_csv(
    "https://storage.googleapis.com/cmip6/cmip6-zarr-consolidated-stores.csv"
)
cat.head()

Using `pandas` methods, we can filter the catalog and select the variables of interest. Then we can use `xarray` to open the matching dataset and work with it. More information about the CMIP6 variables and tables can be found in this Excel file: [CMIP6_MIP_tables.xlsx](https://github.com/ckaramp-research/code-snippets/blob/main/data/CMIP6_MIP_tables.xlsx)

In [None]:
data_query = cat.query(
    "activity_id == 'CMIP' & table_id == 'Amon' & variable_id == 'tas' & experiment_id == 'historical' & source_id == 'GFDL-CM4'"
)
data_query
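The same kind of filter can also be built from keyword arguments instead of a long query string. The sketch below is only an illustration: `toy_cat` is a made-up table that reuses the catalog's column names, its `zstore` paths are placeholders, and `filter_catalog` is a hypothetical helper, not part of `pandas` or of the catalog itself.

```python
import pandas as pd

# Hypothetical stand-in for the real catalog: same column names, made-up rows.
toy_cat = pd.DataFrame(
    {
        "activity_id": ["CMIP", "CMIP", "ScenarioMIP"],
        "table_id": ["Amon", "Omon", "Amon"],
        "variable_id": ["tas", "thetao", "tas"],
        "experiment_id": ["historical", "historical", "ssp585"],
        "source_id": ["GFDL-CM4", "GFDL-CM4", "GFDL-CM4"],
        "zstore": ["gs://example/a", "gs://example/b", "gs://example/c"],
    }
)


def filter_catalog(cat, **facets):
    """Return the rows of `cat` whose columns equal every given facet value."""
    mask = pd.Series(True, index=cat.index)
    for column, value in facets.items():
        # Combine the conditions with a logical AND, one column at a time.
        mask &= cat[column] == value
    return cat[mask]


toy_result = filter_catalog(
    toy_cat, table_id="Amon", variable_id="tas", experiment_id="historical"
)
print(toy_result)
```

With the real catalog, the call would look like `filter_catalog(cat, table_id="Amon", variable_id="tas", ...)`, which makes it easy to reuse a dictionary of facets across several queries.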

In [None]:
xrdata = xr.open_zarr(fs.get_mapper(data_query.zstore.iloc[0]), consolidated=True)
xrdata

We can also query variables with vertical levels and save the output using `xarray` methods. Because the data is stored in Zarr format, we should be able to access only the parts we need without downloading the entire dataset.

In [None]:
data_query = cat.query(
    "activity_id == 'CMIP' & table_id == 'Omon' & variable_id == 'thetao' & experiment_id == 'historical' & source_id == 'GFDL-CM4' & grid_label == 'gr'"
)
data_query

In [None]:
xrdata = xr.open_zarr(fs.get_mapper(data_query.zstore.iloc[0]), consolidated=True)
xrdata

In [None]:
xrdata.nbytes / 1e9  # in GB

We can see that the size of this dataset is almost 18 GB. Since we are interested in only one region, we can subset the data and load only that region. This will save us a lot of time and storage space.
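The number reported by `nbytes` is essentially the number of array elements times the bytes per element. The sketch below reproduces that arithmetic by hand. The shapes are assumptions for illustration only (monthly data for 1850 to 2014, 35 depth levels, a 1° grid, 4-byte floats); the actual values should be read from `xrdata.sizes` and the variable's `dtype`.

```python
import math


def estimate_gb(shape, itemsize=4):
    """Estimate the in-memory size in GB of an array with the given shape."""
    return math.prod(shape) * itemsize / 1e9


# Assumed (time, lev, lat, lon) shape of the full variable; check xrdata.sizes.
assumed_full_shape = (1980, 35, 180, 360)
# Assumed shape after keeping a 10° x 50° box on the same 1° grid.
assumed_subset_shape = (1980, 35, 10, 50)

full_gb = estimate_gb(assumed_full_shape)
subset_gb = estimate_gb(assumed_subset_shape)
print(f"full: {full_gb:.2f} GB, subset: {subset_gb:.2f} GB")
```

Under these assumptions the regional subset is a small fraction of the full array, which is why selecting before loading pays off.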

In [None]:
subset_region = xrdata.sel(lat=slice(-5, 5), lon=slice(190, 240))
subset_region
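The longitude bounds in the selection above are given in degrees east on a 0 to 360 scale. If a region is known in the -180 to 180 convention instead, the bounds can be converted first. This is a minimal sketch, assuming the dataset stores longitude as 0 to 360 as the slice values suggest; `to_degrees_east_0_360` is a hypothetical helper name.

```python
def to_degrees_east_0_360(lon):
    """Map a longitude in degrees east from [-180, 180] onto [0, 360)."""
    return lon % 360


west_edge = to_degrees_east_0_360(-170)  # 170°W
east_edge = to_degrees_east_0_360(-120)  # 120°W
print(west_edge, east_edge)
```

The converted values can then be passed to the selection, for example `xrdata.sel(lon=slice(west_edge, east_edge))`. A region that crosses the 0° meridian would need two slices rather than one.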

In [None]:
subset_region.nbytes / 1e9  # in GB

Saving the subset might still take a while, because the data is only read from the cloud at this point, but it is much faster than downloading the entire dataset.

In [None]:
subset_region.to_netcdf("subset_region.nc")  # Save the subset to a NetCDF file