# 0.01 Concatenate S2S

---

Concatenates all files into one large `zarr` file for anomaly generation.

Initializations run weekly from 1999 to 2015, with 10 ensemble members each.

**Make sure to run the `launch_cluster` notebook first to get your `dask` cluster going.**

Then open the Dask tab on the left and type `proxy/####`, where `####` is the port shown under the "Dashboard" link. I usually drag "Dask Graph" and "Dask Progress" to the right side of the notebook. "Dask Workers" is also useful.

In [1]:
%load_ext lab_black

import cftime
import numpy as np
import xarray as xr
import glob
from tqdm import tqdm

from dask.distributed import Client

In [2]:
# Scheduler address from the launch_cluster notebook.
client = Client("tcp://10.12.205.11:42722")
filepath = "/glade/scratch/jaye/S2S/tas_2m/"

Grab list of file names.

In [3]:
files = !ls /glade/scratch/jaye/S2S/tas_2m/
# Drop the first 17 entries, which are the per-year folders (1999-2015).
files = files[17:]
# Extracts initialization date from file names.
initializations = [f.split("_")[4] for f in files]
initializations = np.unique(initializations)
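
The slice `files[17:]` relies on the year folders sorting ahead of the data files, and `f.split("_")[4]` relies on the file-name layout used in the glob pattern later in this notebook. Below is a minimal sketch of a more defensive version using only the standard library: it skips any name that does not match and groups files by initialization so the member count can be checked. The helper names (`init_from_filename`, `group_by_init`) and the `m01`-style member suffix in the sample listing are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Prefix taken from the glob pattern used later in this notebook.
PREFIX = "tas_2m_CESM1_30LCAM5_"


def init_from_filename(name):
    """Return the init string (e.g. '04jan1999'), or None if the name does not match."""
    if not (name.startswith(PREFIX) and name.endswith(".nc")):
        return None
    # Same rule as the notebook: the init date is the fifth underscore-separated field.
    return name.split("_")[4]


def group_by_init(names):
    """Map each init string to the sorted list of its member files."""
    groups = defaultdict(list)
    for name in names:
        init = init_from_filename(name)
        if init is not None:
            groups[init].append(name)
    return {init: sorted(members) for init, members in groups.items()}


# Hypothetical directory listing: one year folder plus two members of one init.
listing = [
    "1999",
    "tas_2m_CESM1_30LCAM5_04jan1999_00z_d01_d45_m02.nc",
    "tas_2m_CESM1_30LCAM5_04jan1999_00z_d01_d45_m01.nc",
]
groups = group_by_init(listing)

expected_members = 2  # would be 10 for the real dataset
incomplete = [
    init for init, members in groups.items() if len(members) != expected_members
]
```

An empty `incomplete` list means every initialization has the expected number of members, which is what the later `member=np.arange(10) + 1` assignment assumes.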

In [4]:
def update_leads(x):
    """Converts from time coordinates to lead coordinates.

    I.e., lead days.
    """
    x = x.rename({"TIME": "lead", "LAT": "lat", "LON": "lon"})
    x = x.assign_coords(lead=np.arange(x.lead.size).astype(int))
    return x

The loop below goes through each initialization and lazily opens all 10 ensemble members. It renames the dimensions to the names `climpred` expects and stores each initialization as a single `dask` chunk of roughly 100MB. Each result is appended to a Python list, which is later concatenated into one large dataset with dimensions `(init, member, lead, lat, lon)`. This took about 2 minutes for me to run.

In [5]:
S2Sensemble = []
# Loop through each initialization and concatenate.
for init in tqdm(initializations):
    # Sort to make sure we have the same ensemble members at every init.
    filelist = np.sort(
        glob.glob(filepath + f"tas_2m_CESM1_30LCAM5_{init}_00z_d01_d45_*.nc")
    )

    # Open all ensemble members for a given init.
    ds = xr.open_mfdataset(
        filelist,
        combine="nested",
        parallel=True,
        concat_dim="member",
        preprocess=update_leads,
        chunks={"TIME": -1, "LON": -1, "LAT": -1},
        # speeds things up a bit
        coords="minimal",
        compat="override",
    )

    # Derive datetime for initialization based on the string in the filename.
    MON_TO_INT = {
        "jan": 1,
        "feb": 2,
        "mar": 3,
        "apr": 4,
        "may": 5,
        "jun": 6,
        "jul": 7,
        "aug": 8,
        "sep": 9,
        "oct": 10,
        "nov": 11,
        "dec": 12,
    }

    day = init[0:2]
    mon = init[2:5]
    mon = MON_TO_INT[mon]
    year = init[5::]

    # Assign initialization year and member numbers as coordinates.
    ds = ds.assign_coords(
        init=cftime.DatetimeNoLeap(int(year), int(mon), int(day)),
        member=np.arange(10) + 1,
    )

    # Chunk into one chunk. About 100MB per initialization
    # (full globe, all leads, all members)
    ds = ds.chunk({"member": -1})
    S2Sensemble.append(ds)

100%|██████████| 887/887 [02:01<00:00,  7.28it/s]
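
One thing to watch: `np.unique` returns the init strings sorted alphabetically (`01apr...` comes before `01jan...`), so the loop above does not visit them in chronological order. Below is a minimal sketch of parsing the strings and sorting them by date, using only the standard library. `parse_init` is a hypothetical helper name and the sample init strings are made up for illustration.

```python
MON_TO_INT = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}


def parse_init(init):
    """Split a 'DDmonYYYY' string into a (year, month, day) tuple of ints."""
    day = int(init[0:2])
    month = MON_TO_INT[init[2:5]]
    year = int(init[5:])
    return (year, month, day)


# Made-up init strings, listed in the alphabetical order np.unique would return.
inits = ["01apr2002", "06jan1999", "13jan1999", "27dec2000"]

# Tuples compare year first, then month, then day, so this sort is chronological.
chronological = sorted(inits, key=parse_init)
```

Looping over `sorted(initializations, key=parse_init)` instead of `initializations` should then produce a time-ordered `init` axis after concatenation.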


Now we can concatenate into one big dataset. Keep in mind we've used `dask` this whole time, so everything is still lazy and nothing has been loaded into memory. (This should take seconds or less.)

In [6]:
# Concatenate into one dataset with all inits, leads, members.
# Keywords just speed things up a bit.
ds = xr.concat(
    S2Sensemble,
    dim="init",
    coords="minimal",
    compat="override",
    combine_attrs="override",
)

This results in a ~100GB ensemble with 887 inits, 10 members, 45 leads, and a 181x360 lat/lon grid. We did all the chunking beforehand, so the chunks look good (one per initialization). I'll now save this out as a `zarr` store, since that will make bias-correcting much more efficient than working with netCDF files. `zarr` preserves the chunks we set up here.

In [7]:
display(ds.TAS)

| | Array | Chunk |
|---|---|---|
| Bytes | 104.03 GB | 117.29 MB |
| Shape | (887, 10, 45, 181, 360) | (1, 10, 45, 181, 360) |
| Count | 38141 Tasks | 887 Chunks |
| Type | float32 | numpy.ndarray |
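
The sizes in this output can be checked by hand from the array shape and the 4-byte `float32` dtype. Below is a small sketch of that arithmetic in plain Python. The dimension order `(init, member, lead, lat, lon)` is read off the shape shown above, and the variable names are illustrative.

```python
from math import prod

shape = (887, 10, 45, 181, 360)  # (init, member, lead, lat, lon)
chunk = (1, 10, 45, 181, 360)    # one initialization per chunk
bytes_per_value = 4              # float32

total_bytes = prod(shape) * bytes_per_value
chunk_bytes = prod(chunk) * bytes_per_value
n_chunks = total_bytes // chunk_bytes

# dask reports these sizes in decimal units (1 GB = 1e9 bytes).
print(f"total: {total_bytes / 1e9:.2f} GB")   # total: 104.03 GB
print(f"chunk: {chunk_bytes / 1e6:.2f} MB")   # chunk: 117.29 MB
print(f"chunks: {n_chunks}")                  # chunks: 887
```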


Save out the full raw results as a `zarr` file. Took 40s on my end. Not too bad!

In [9]:
%time ds.to_zarr('/glade/scratch/rbrady/S2S/CESM1.S2S.tas_2m.raw.zarr', consolidated=True)

CPU times: user 7.38 s, sys: 327 ms, total: 7.7 s
Wall time: 38.8 s


<xarray.backends.zarr.ZarrStore at 0x2afb12b27130>