This example shows how to get a day's worth of analysis files for a single variable and combine them using high-level APIs.

In [9]:
import s3fs
import xarray as xr
import metpy  # imported for its side effect: registers the .metpy accessor on xarray objects
import datetime
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

We use xarray's open_mfdataset to load the data. A couple of things are missing from the metadata, so we use the MetPy xarray extension to add projection info and latitude/longitude. We also promote the "time" variable to a coordinate so that combining the datasets for each hour will work later on.

In [10]:
def load_dataset(urls):
    fs = s3fs.S3FileSystem(anon=True)
    ds = xr.open_mfdataset([s3fs.S3Map(url, s3=fs) for url in urls], engine='zarr')
    ds = ds.rename(projection_x_coordinate="x", projection_y_coordinate="y")
    ds = ds.metpy.assign_crs(grid_mapping_name="lambert_conformal_conic",
                             longitude_of_central_meridian=-97.5,
                             latitude_of_projection_origin=38.5,
                             standard_parallel=38.5)
    ds = ds.metpy.assign_latitude_longitude()
    ds = ds.set_coords("time")
    return ds

The following function demonstrates how to format the URLs to load the data, as well as how to combine the hours using xarray.concat. Note that because there's an extra level of nesting for the main data variable (level and variable name), we have to get both the zarr group URL and the URL for the nested subgroup. That's why we have to use open_mfdataset ("mf" means "multifile"); other zarr datasets likely won't have this quirk.

In [11]:
def load_combined_dataset(start_date, num_hours, level, param_short_name):
    combined_ds = None
    for i in range(num_hours):
        time = start_date + datetime.timedelta(hours=i)
        group_url = time.strftime(f"s3://hrrrzarr/sfc/%Y%m%d/%Y%m%d_%Hz_anl.zarr/{level}/{param_short_name}")
        subgroup_url = group_url + f"/{level}"
        partial_ds = load_dataset([group_url, subgroup_url])
        if combined_ds is None:  # explicit None check; a Dataset's truthiness depends on its contents
            combined_ds = partial_ds
        else:
            combined_ds = xr.concat([combined_ds, partial_ds], dim="time", combine_attrs="drop_conflicts")
    return combined_ds
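
The URL layout used above can be checked on its own before anything is requested from S3. The following is a minimal sketch: hourly_urls is a hypothetical helper name, not part of the original notebook, and it only reproduces the string formatting from load_combined_dataset without touching the network.

```python
import datetime


def hourly_urls(start_date, num_hours, level, param_short_name):
    """Return one (group_url, subgroup_url) pair per analysis hour.

    Hypothetical helper: it mirrors the URL formatting in
    load_combined_dataset but makes no network requests.
    """
    pairs = []
    for i in range(num_hours):
        time = start_date + datetime.timedelta(hours=i)
        # Format only the date part with strftime so that level and
        # parameter names are never interpreted as format codes.
        prefix = time.strftime("s3://hrrrzarr/sfc/%Y%m%d/%Y%m%d_%Hz_anl.zarr")
        group_url = f"{prefix}/{level}/{param_short_name}"
        subgroup_url = f"{group_url}/{level}"
        pairs.append((group_url, subgroup_url))
    return pairs


urls = hourly_urls(datetime.datetime(2021, 4, 1), 2, "1000mb", "TMP")
for group_url, subgroup_url in urls:
    print(group_url)
    print(subgroup_url)
```

Each pair is what load_dataset expects as its list of URLs: the group first, then the nested subgroup that holds the actual data variable.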

Just for demonstration purposes, we load up the data and calculate the standard deviation so we have something to plot across the geospatial domain. Note that the whole workflow takes about 2 minutes on my laptop, most of it spent downloading the data. You'll need some performance optimizations or parallelization if you're doing a large analysis.

In [12]:
%%time
ds = load_combined_dataset(datetime.datetime(2021, 4, 1), 24, "1000mb", "TMP")

CPU times: user 29 s, sys: 562 ms, total: 29.5 s
Wall time: 1min 32s


In [13]:
%%time
std_dev = ds.TMP.std(dim="time")

CPU times: user 18.3 ms, sys: 16 ms, total: 34.3 ms
Wall time: 34.2 ms
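
One possible form of the parallelization mentioned above is to open the hourly datasets from a thread pool rather than one after another. This is only a sketch under assumptions: load_in_parallel and fake_loader are hypothetical names, the stand-in loader exists so the example runs offline, and whether threads actually help depends on how much of the load time is network-bound in your setup.

```python
from concurrent.futures import ThreadPoolExecutor


def load_in_parallel(load_one, url_groups, max_workers=8):
    """Call load_one(urls) for each entry of url_groups using a thread pool.

    Results come back in the same order as url_groups.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_one, url_groups))


# Stand-in loader so the sketch runs without network access; in the
# notebook you would pass load_dataset instead.
def fake_loader(urls):
    return len(urls)


results = load_in_parallel(fake_loader, [["a", "b"], ["c", "d"], ["e"]])
print(results)
```

With the real loader, the returned list of datasets could be handed to a single xr.concat call along "time" instead of concatenating inside the loop.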


In [None]:
ax = plt.axes(projection=ccrs.PlateCarree())
ax.contourf(std_dev.longitude, std_dev.latitude, std_dev)
ax.coastlines()

plt.show()

**Troubleshooting:** In my Jupyter notebook setup, the pyproj package gives the following error when it tries to look up the projection info:

```
CRSError: Invalid datum string: urn:ogc:def:datum:EPSG::6326: (Internal Proj Error: proj_create: SQLite error on SELECT name, ellipsoid_auth_name, ellipsoid_code, prime_meridian_auth_name, prime_meridian_code, publication_date, frame_reference_epoch, deprecated FROM geodetic_datum WHERE auth_name = ? AND code = ?: no such column: publication_date)
```

I believe this is because I run Jupyter from a conda environment that's different from the one the kernel uses. In any case, there's an easy fix:

In [6]:
import pyproj
#pyproj.datadir.set_data_dir("/Users/<me>/.conda/envs/<this notebook's kernel env>/share/proj")
pyproj.datadir.set_data_dir("/Users/adairkovac/.conda/envs/TetheredBalloon-7710/share/proj")
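
To avoid hard-coding a user-specific path, the directory can be derived from the interpreter the kernel is running. This is a sketch, not a guaranteed fix: kernel_proj_dir is a hypothetical helper, and it assumes the conda layout on Linux/macOS, where PROJ data lives under the environment's share/proj directory.

```python
import sys
from pathlib import Path


def kernel_proj_dir():
    """Guess the PROJ data directory of the environment running this kernel.

    Assumes the conda layout on Linux/macOS (<env>/share/proj).
    """
    return Path(sys.prefix) / "share" / "proj"


proj_dir = kernel_proj_dir()
print(proj_dir)

# Only point pyproj at the directory if it actually exists.
if proj_dir.is_dir():
    try:
        import pyproj
        pyproj.datadir.set_data_dir(str(proj_dir))
    except ImportError:
        pass  # pyproj is not installed in this environment
```

If the printed path does not exist on your system, fall back to setting the directory by hand as in the cell above.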