In [1]:
import copy
import os

import numpy as np
import xarray as xr

from tqdm import tqdm

In this notebook, we read the ERA5 data stored on `/glade` and save the variables we need in a more accessible form locally. Specifically, we read $u$, $v$, $w$, $T$, and $Z$ (where $Z$ is the geopotential), as well as the coordinate variables $\vartheta$ and $p$. Seviour et al. (2012) suggest that six-hourly resolution is necessary to capture the diurnal variability of the residual upwelling. Fortunately, there is an ERA5 dataset on disk ([633.1](https://rda.ucar.edu/datasets/ds633.1/)) of monthly mean data derived from the original six-hourly resolution. To reduce the volume of data stored locally, we keep only the DJF and JJA averages.

We begin by setting the appropriate directory and enumerating the available years.

In [2]:
era_dir = '/gpfs/fs1/collections/rda/data/ds633.1/e5.moda.an.pl'
years = [int(x) for x in sorted(os.listdir(era_dir))]
n_year = len(years)

print(f'Found {n_year} years of ERA5 data, starting with {years[0]} and ending with {years[-1]}.')

Found 43 years of ERA5 data, starting with 1979 and ending with 2021.


Unfortunately, the full dataset is too large to hold comfortably in memory, even after taking seasonal means, so we create a separate `xr.Dataset` for each year and save it to disk. Below, we write a function that initializes a new, zero-filled `Dataset` for a given year. For convenience, we pull the three spatial coordinates (pressure, latitude, and longitude) from one of the existing ERA5 files.

In [3]:
coords_template = {
    'year' : [None],
    'season' : ['DJF', 'JJA']
}

fname = 'e5.moda.an.pl.128_131_u.ll025uv.1979010100_1979120100.nc'
with xr.open_dataset(f'{era_dir}/1979/{fname}') as f:
    coords_template['level'] = f['level']
    coords_template['latitude'] = f['latitude']
    coords_template['longitude'] = f['longitude']

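# Dimension lengths, in the same order as the keys of coords_template.
shape = tuple(len(a) for a in coords_template.values())
names = ['u', 'v', 'w', 'T', 'Z']

def initialize_dataset(year):
    """Return a zero-filled Dataset holding every variable for a single year."""
    coords = copy.deepcopy(coords_template)
    coords['year'] = [year]
    dims = tuple(coords)
    data = {name : (dims, np.zeros(shape)) for name in names}
    
    return xr.Dataset(data, coords)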
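
As a quick sanity check on the layout this produces, the sketch below builds the same kind of zero-filled `Dataset` on a tiny grid. The `toy_*` names and coordinate values are hypothetical stand-ins for the real ERA5 grid; only the dimension order and variable names mirror the cell above.

```python
import numpy as np
import xarray as xr

# Hypothetical, deliberately tiny coordinates standing in for the ERA5 grid.
toy_coords = {
    'year': [1979],
    'season': ['DJF', 'JJA'],
    'level': np.array([10.0, 100.0, 1000.0]),
    'latitude': np.array([-60.0, -20.0, 20.0, 60.0]),
    'longitude': np.array([0.0, 72.0, 144.0, 216.0, 288.0]),
}

# Dimension names and lengths follow the insertion order of the dict.
toy_dims = tuple(toy_coords)
toy_shape = tuple(len(values) for values in toy_coords.values())

# One zero-filled array per variable, all sharing the same dimensions.
toy_names = ['u', 'v', 'w', 'T', 'Z']
toy_data = {name: (toy_dims, np.zeros(toy_shape)) for name in toy_names}
toy_ds = xr.Dataset(toy_data, toy_coords)

print(toy_ds['u'].dims)
print(toy_ds['u'].shape)
```

Each variable carries a length-one `year` dimension, which should make it straightforward to concatenate the per-year files along `year` later.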

We first define the zero-based month indices that make up each season. We can then step through the years, selecting the appropriate file for each variable and computing the seasonal averages we want. Because each file holds a single calendar year, the DJF average combines January, February, and December of the same year rather than one contiguous winter. Using the initialization function defined above, we create a new `Dataset` for each year and save it to disk before moving on to the next one. We also change the ERA5 notation for geopotential $Z$ to the more conventional $\Phi$ before saving.

In [4]:
# Zero-based positions of each season's months along the time axis of a yearly file.
idxs = {
    'DJF' : np.array([11, 0, 1]),
    'JJA' : np.array([5, 6, 7])
}

output_dir = '/glade/work/dconnell/brewson'
for year in tqdm(years, unit='year'):
    ds = initialize_dataset(year)
    year_dir = f'{era_dir}/{year}'
    fnames = [x for x in os.listdir(year_dir) if x.endswith('.nc')]
    
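    for name in names:
        # ERA5 file names embed the lowercase variable name, e.g. '..._u.ll025uv...'.
        fname = [x for x in fnames if f'_{name.lower()}.' in x][0]
        with xr.open_dataset(f'{year_dir}/{fname}') as f:
            for season, idx in idxs.items():
                # Variables inside the files use uppercase names.
                seasonal_mean = f[name.upper()].isel(time=idx).mean('time')
                ds[name].loc[dict(season=season)] = seasonal_mean
    
    # Rename geopotential from the ERA5 name Z to Phi before writing.
    ds.rename({'Z' : 'Phi'}).to_netcdf(f'{output_dir}/{year}.nc')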

100%|██████████| 43/43 [28:35<00:00, 39.88s/year]
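
To illustrate what the month indices select, here is a minimal sketch on synthetic data. It assumes a made-up field whose value equals the calendar month, so the seasonal means are easy to check by hand; the `toy_*` names are hypothetical and no ERA5 files are read.

```python
import numpy as np
import xarray as xr

# Synthetic "monthly" field: every grid point holds the month number (1..12).
months = np.arange(1, 13)
toy_field = xr.DataArray(
    np.broadcast_to(months[:, None], (12, 3)).astype(float),
    dims=('time', 'latitude'),
    coords={'time': months, 'latitude': [-30.0, 0.0, 30.0]},
)

# Same zero-based month positions as in the cell above.
toy_idxs = {
    'DJF': np.array([11, 0, 1]),
    'JJA': np.array([5, 6, 7]),
}

# Select the three months of each season by position and average over time.
toy_means = {
    season: toy_field.isel(time=idx).mean('time')
    for season, idx in toy_idxs.items()
}

for season, mean in toy_means.items():
    print(season, mean.values)
```

Here DJF averages months 12, 1, and 2 of one calendar year, and JJA averages months 6, 7, and 8. Averaging a contiguous winter instead would require reading December from the previous year's file.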
