# Subset and regrid CMIP6 data from ESGF

This notebook shows how to use `intake_esgf` to search and filter the CMIP6 collection, and how to use `rooki` to subset and regrid the data in the cloud before downloading it.

In [14]:
import os

import intake_esgf

# Run this on the DKRZ node in Germany, using the ESGF1 index node at LLNL
os.environ["ROOK_URL"] = "http://rook.dkrz.de/wps"
# data download directory
os.environ["ROOKI_OUTPUT_DIR"] = os.path.join(os.getcwd(), "rookie_output")

intake_esgf.conf.set(
    indices={"anl-dev": False, "ornl-dev": False, "esgf-node.llnl.gov": True}
)

import xarray as xr
from intake_esgf import ESGFCatalog
from rooki import operators as ops

## Retrieve subset of CMIP6 data

Each CMIP6 dataset is identified by a dataset_id. Using `intake_esgf` we can query the ESGF database for the variables and models we are interested in. For this demo we use the `tos` (sea surface temperature) variable from the historical runs. For the sake of simplicity, we only query a subset of the available models.

In [15]:
cat = ESGFCatalog()
cat.search(
    experiment_id=["historical"],
    variable_id=["tos"],
    table_id=["Omon"],
    project=["CMIP6"],
    grid_label=["gn"],
    source_id=[
        "CESM2-FV2",
        "CESM2-WACCM-FV2",
        "FGOALS-f3-L",
        "MIROC-ES2L",
    ],
)
cat.remove_ensembles()  # keep a single ensemble member per model
print(cat)

Summary information for 4 results:
mip_era                                                     [CMIP6]
activity_drs                                                 [CMIP]
institution_id                                   [NCAR, CAS, MIROC]
source_id         [CESM2-FV2, CESM2-WACCM-FV2, FGOALS-f3-L, MIRO...
experiment_id                                          [historical]
member_id                                      [r1i1p1f1, r1i1p1f2]
table_id                                                     [Omon]
variable_id                                                   [tos]
grid_label                                                     [gn]
dtype: object


Once the catalog has been queried, we need a little pandas manipulation to keep only the dataset_id. The same dataset is hosted at several online locations, and each entry in the `id` column has its location appended after a `|`. `rooki` only accepts the dataset_id without the online location, so we strip it in the next step.

In [16]:
cat.df.id[0]

['CMIP6.CMIP.NCAR.CESM2-FV2.historical.r1i1p1f1.Omon.tos.gn.v20191120|esgf-data.ucar.edu',
 'CMIP6.CMIP.NCAR.CESM2-FV2.historical.r1i1p1f1.Omon.tos.gn.v20191120|esgf-data1.llnl.gov',
 'CMIP6.CMIP.NCAR.CESM2-FV2.historical.r1i1p1f1.Omon.tos.gn.v20191120|esgf-data04.diasjp.net']

In [17]:
def keep_ds_id(ds):
    return ds[0].split("|")[0]

These are the dataset_ids we are looking for:

In [18]:
collections = cat.df.id.apply(keep_ds_id).to_list()
collections

['CMIP6.CMIP.NCAR.CESM2-FV2.historical.r1i1p1f1.Omon.tos.gn.v20191120',
 'CMIP6.CMIP.NCAR.CESM2-WACCM-FV2.historical.r1i1p1f1.Omon.tos.gn.v20191120',
 'CMIP6.CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f1.Omon.tos.gn.v20191007',
 'CMIP6.CMIP.MIROC.MIROC-ES2L.historical.r1i1p1f2.Omon.tos.gn.v20190823']
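The dataset_ids above are dot-separated facets, in the same order as the columns of the catalog summary. As a minimal sketch (the helper name `parse_dataset_id` is hypothetical, and it assumes no facet contains a dot), they can be split back into named parts, for example to build shorter dictionary keys later on:

```python
# Facet names in the order they appear in the dataset_ids shown above.
# The names follow the catalog columns; the helper itself is illustrative.
CMIP6_FACETS = (
    "mip_era", "activity_drs", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label", "version",
)


def parse_dataset_id(dataset_id):
    """Split a CMIP6 dataset_id into a dict of facets."""
    parts = dataset_id.split(".")
    if len(parts) != len(CMIP6_FACETS):
        raise ValueError(
            f"Expected {len(CMIP6_FACETS)} facets, got {len(parts)}: {dataset_id}"
        )
    return dict(zip(CMIP6_FACETS, parts))


example = "CMIP6.CMIP.NCAR.CESM2-FV2.historical.r1i1p1f1.Omon.tos.gn.v20191120"
facets = parse_dataset_id(example)
print(facets["source_id"], facets["member_id"], facets["version"])
```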

## Subset and regrid
We define a function that subsets and regrids each of our dataset_ids. The function takes a dataset_id as input and uses `rooki` operators to select about 25 years of data (1850 to 1875) for the `tos` variable in the tropical Pacific, and then regrids the result bilinearly to a 1° grid.

For more information about the operations, you can go to [rook's documentation](https://rook-wps.readthedocs.io/en/latest/processes.html#).

For regridding, refer to this [source code](https://github.com/roocs/rook/blob/main/src/rook/processes/wps_regrid.py)


**Note:** Some requests may fail when asking for more than 25 years of data (possibly size related; this needs more testing). It is safer to keep each request below that threshold and loop over several requests if more data is needed.
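One way to stay under the 25-year threshold is to split a longer period into consecutive chunks and issue one request per chunk. The sketch below only builds the time strings in the `start/end` form used by the `time` argument in this notebook; the helper name `split_time_range` is hypothetical, and the 31 December end date assumes a calendar in which that day exists (models with a 360-day calendar may need a different end day).

```python
def split_time_range(start_year, end_year, chunk_years=25):
    """Return 'YYYY-01-01/YYYY-12-31' strings covering start_year..end_year inclusive."""
    if chunk_years < 1:
        raise ValueError("chunk_years must be at least 1")
    ranges = []
    year = start_year
    while year <= end_year:
        # Last year of this chunk, clipped to the end of the requested period
        last = min(year + chunk_years - 1, end_year)
        ranges.append(f"{year:04d}-01-01/{last:04d}-12-31")
        year = last + 1
    return ranges


chunks = split_time_range(1850, 1949, chunk_years=25)
for chunk in chunks:
    print(chunk)
```

Each string could then be passed as the `time` argument of a subset request, and the resulting datasets combined along the time dimension, for instance with `xr.concat`.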

In [19]:
def get_pacific_ocean(dataset_id):
    wf = ops.Regrid(
        ops.Subset(
            ops.Input("tos", [dataset_id]),
            time="1850-01-01/1875-01-31",
            area="100,-20,280,20",
        ),
        method="bilinear",
        grid="1deg",
    )
    resp = wf.orchestrate()
    if resp.ok:
        print(f"{resp.size_in_mb=}")
        ds = resp.datasets()[0]
    else:
        raise ValueError(resp)
        # ds = xr.Dataset()
    return ds
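The `area` string in the function above is easy to get wrong when typed by hand. As a small sketch, a helper can build it from named bounds and run basic sanity checks. The helper name `make_area` is hypothetical, and the ordering (minimum longitude, minimum latitude, maximum longitude, maximum latitude) is my reading of the `"100,-20,280,20"` value used here; it should be checked against rook's documentation.

```python
def make_area(lon_min, lat_min, lon_max, lat_max):
    """Build an area string in the assumed order lon_min,lat_min,lon_max,lat_max."""
    if not -90 <= lat_min < lat_max <= 90:
        raise ValueError("latitudes must satisfy -90 <= lat_min < lat_max <= 90")
    if lon_min >= lon_max:
        raise ValueError("lon_min must be smaller than lon_max")
    # ":g" drops trailing zeros, so 100 becomes "100" and 2.5 stays "2.5"
    return ",".join(f"{value:g}" for value in (lon_min, lat_min, lon_max, lat_max))


tropical_pacific = make_area(lon_min=100, lat_min=-20, lon_max=280, lat_max=20)
print(tropical_pacific)
```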

In [None]:
sst_data = {dset: get_pacific_ocean(dset) for dset in collections}

resp.size_in_mb=74.72034358978271
Downloading to /Users/dangomelon/Repos/code_snippets/CMIP6/rookie_output/metalink_i7djw9c2/tos_Omon_CESM2-FV2_historical_r1i1p1f1_gr_18500115-18750115_regrid-bilinear-180x360_cells_grid.nc.
resp.size_in_mb=74.72035503387451
Downloading to /Users/dangomelon/Repos/code_snippets/CMIP6/rookie_output/metalink_ud0lqq1s/tos_Omon_CESM2-WACCM-FV2_historical_r1i1p1f1_gr_18500115-18750115_regrid-bilinear-180x360_cells_grid.nc.


The results are downloaded to the directory set in `ROOKI_OUTPUT_DIR` (here, the `rookie_output` folder in the current working directory). We can then explore the data using xarray or any other tool of our choice.

In [None]:
sst_data

{'CMIP6.CMIP.NCAR.CESM2-FV2.historical.r1i1p1f1.Omon.tos.gn.v20191120': <xarray.Dataset> Size: 78MB
 Dimensions:    (lat: 180, lon: 360, bnds: 2, time: 301, d2: 2)
 Coordinates:
   * lat        (lat) float64 1kB -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
   * lon        (lon) float64 3kB 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
     lat_bnds   (lat, bnds) float64 3kB ...
     lon_bnds   (lon, bnds) float64 6kB ...
   * time       (time) object 2kB 1850-01-15 13:00:00.000008 ... 1875-01-15 12...
     time_bnds  (time, d2) object 5kB ...
 Dimensions without coordinates: bnds, d2
 Data variables:
     tos        (time, lat, lon) float32 78MB ...
     mask       (lat, lon) int32 259kB ...
 Attributes: (12/50)
     Conventions:                  CF-1.7 CMIP-6.2
     activity_id:                  CMIP
     branch_method:                standard
     branch_time_in_child:         674885.0
     branch_time_in_parent:        10950.0
     case_id:                      1559
     ...   

## Requesting data with vertical levels

This process is similar to what we explored above. The general steps for requesting data are as follows:

1. Use `intake_esgf` to search for the dataset of interest.
2. Filter the results to get the dataset_ids we want.
3. Use `rooki` to subset and regrid the data.
4. Download the data.

In [None]:
cat = ESGFCatalog()
cat.search(
    experiment_id=["historical"],
    variable_id=["thetao"],
    table_id=["Omon"],
    project=["CMIP6"],
    grid_label=["gn"],
    source_id=[
        "CESM2-FV2",
        "CESM2-WACCM-FV2",
        "FGOALS-f3-L",
        "MIROC-ES2L",
    ],
)
cat.remove_ensembles()
print(cat)

Summary information for 4 results:
mip_era                                                     [CMIP6]
activity_drs                                                 [CMIP]
institution_id                                   [NCAR, CAS, MIROC]
source_id         [CESM2-FV2, CESM2-WACCM-FV2, FGOALS-f3-L, MIRO...
experiment_id                                          [historical]
member_id                                      [r1i1p1f1, r1i1p1f2]
table_id                                                     [Omon]
variable_id                                                [thetao]
grid_label                                                     [gn]
dtype: object


In [None]:
cat.df

| | project | mip_era | activity_drs | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | version | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CMIP6 | CMIP6 | CMIP | NCAR | CESM2-FV2 | historical | r1i1p1f1 | Omon | thetao | gn | 20191120 | [CMIP6.CMIP.NCAR.CESM2-FV2.historical.r1i1p1f1... |
| 1 | CMIP6 | CMIP6 | CMIP | NCAR | CESM2-WACCM-FV2 | historical | r1i1p1f1 | Omon | thetao | gn | 20191120 | [CMIP6.CMIP.NCAR.CESM2-WACCM-FV2.historical.r1... |
| 2 | CMIP6 | CMIP6 | CMIP | CAS | FGOALS-f3-L | historical | r1i1p1f1 | Omon | thetao | gn | 20191007 | [CMIP6.CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f... |
| 5 | CMIP6 | CMIP6 | CMIP | MIROC | MIROC-ES2L | historical | r1i1p1f2 | Omon | thetao | gn | 20190823 | [CMIP6.CMIP.MIROC.MIROC-ES2L.historical.r1i1p1... |


In [None]:
collections = cat.df.id.apply(keep_ds_id).to_list()
collections

['CMIP6.CMIP.NCAR.CESM2-FV2.historical.r1i1p1f1.Omon.thetao.gn.v20191120',
 'CMIP6.CMIP.NCAR.CESM2-WACCM-FV2.historical.r1i1p1f1.Omon.thetao.gn.v20191120',
 'CMIP6.CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f1.Omon.thetao.gn.v20191007',
 'CMIP6.CMIP.MIROC.MIROC-ES2L.historical.r1i1p1f2.Omon.thetao.gn.v20190823']

In [None]:
def get_pacific_ocean(dataset_id):
    wf = ops.Regrid(
        ops.Subset(
            ops.Input("thetao", [dataset_id]),
            time="1850-01-01/1851-01-31",
            area="100,-10,280,10",
            level="0/50",
        ),
        method="bilinear",
        grid="2pt5deg",
    )
    resp = wf.orchestrate()
    if resp.ok:
        print(f"{resp.size_in_mb=}")
        ds = resp.datasets()[0]
    else:
        ds = xr.Dataset()
    return ds

This might take some time to complete, depending on the region and time selection. Another convenient way to load this data is from the Google Cloud Storage bucket, which can be found [here](https://github.com/ckaramp-research/code-snippets/tree/main).

In [None]:
thetao_data = {dset: get_pacific_ocean(dset) for dset in collections}
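The two versions of `get_pacific_ocean` in this notebook handle failures differently: the first raises, which aborts the whole dictionary comprehension, while the second silently returns an empty dataset. A possible middle ground, sketched below with a hypothetical helper `fetch_all`, is to catch the error per dataset and record it, so the successful downloads are kept and the failures remain visible. The `fake_fetch` function is only a stand-in so the example runs without network access.

```python
def fetch_all(dataset_ids, fetch):
    """Call fetch(dataset_id) for each id; return (results, failures) dicts."""
    results, failures = {}, {}
    for dataset_id in dataset_ids:
        try:
            results[dataset_id] = fetch(dataset_id)
        except Exception as err:  # keep going, but remember what went wrong
            failures[dataset_id] = err
    return results, failures


def fake_fetch(dataset_id):
    """Stand-in for a real request function; fails for one model on purpose."""
    if "FGOALS" in dataset_id:
        raise ValueError(f"request failed for {dataset_id}")
    return f"data for {dataset_id}"


demo_results, demo_failures = fetch_all(["CESM2-FV2", "FGOALS-f3-L"], fake_fetch)
print("succeeded:", sorted(demo_results))
print("failed:", sorted(demo_failures))
```

With a request function that raises on failure, as the first `get_pacific_ocean` does, the call would look like `fetch_all(collections, get_pacific_ocean)`.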