# Demo of archiving NWM predictions 

This notebook demonstrates how to download National Water Model (NWM) predictions and append them to Zarr files; it could form the basis of an NWM archive service. Zarr works well for gridded data, but Parquet seems preferable for point-based data. It is unclear, however, whether we can append to Parquet, so a future iteration of this notebook may investigate that.

In [6]:
from os.path import join, exists, basename
import tempfile
from urllib import request
from os import makedirs
import os
import shutil

from tqdm import tqdm
import numpy as np
import xarray as xr
import pandas as pd

In [7]:
out_dir = '/opt/data/noaa/nwm-preds'
archive_dir = join(out_dir, 'archive')
tmp_dir = join(out_dir, 'tmp')
makedirs(archive_dir, exist_ok=True)
makedirs(tmp_dir, exist_ok=True)

In [8]:
from dask.distributed import Client
client = Client(n_workers=4)

Perhaps you already have a cluster running?
Hosting the HTTP server on port 44873 instead


In [9]:
def get_nwm_uri(date, data_type, cycle_runtime, forecast_hour):
    cycle_runtime = f'{cycle_runtime:02}'
    forecast_hour = f'{forecast_hour:03}'
    return (
        f'https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/nwm.{date}/short_range/'
        f'nwm.t{cycle_runtime}z.short_range.{data_type}.f{forecast_hour}.conus.nc')

## Download a subset of NWM predictions for today

For each day, there is one prediction file for every combination of the following two dimensions:
* `cycle_runtime`: the hour at which the predictions were generated; values are integers in [0-23].
* `forecast_hour`: how far into the future the predictions are, given as the hour offset from the `cycle_runtime`; values are integers in [1-18].
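As a rough illustration of that 2D space, the sketch below enumerates every file URI for one day and one `data_type`. It restates `get_nwm_uri` from above so that it runs on its own, and it assumes that all 24 cycles and all 18 forecast hours are published; the date is simply the one that appears in the download log.

```python
from itertools import product


def get_nwm_uri(date, data_type, cycle_runtime, forecast_hour):
    # Same URI scheme as the helper defined earlier in this notebook.
    cycle_runtime = f'{cycle_runtime:02}'
    forecast_hour = f'{forecast_hour:03}'
    return (
        f'https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/nwm.{date}/short_range/'
        f'nwm.t{cycle_runtime}z.short_range.{data_type}.f{forecast_hour}.conus.nc')


# Assumption: every cycle (0-23) and every forecast hour (1-18) exists for the day.
uris = [
    get_nwm_uri('20220317', 'terrain_rt', cycle_runtime, forecast_hour)
    for cycle_runtime, forecast_hour in product(range(24), range(1, 19))
]

print(len(uris))  # 24 cycles x 18 forecast hours = 432 files
print(uris[0])
```

A full daily archive for a single `data_type` would therefore need 432 downloads, which is why the rest of this notebook works with a small subset.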

Here we download a 2x2 grid for testing purposes.

In [10]:
# Timestamp.utcnow() is deprecated in recent pandas releases; this is the equivalent call.
ts = pd.Timestamp.now(tz='UTC')
date = ts.strftime("%Y%m%d")
# The set of data_types includes ['channel_rt', 'land', 'reservoir', 'terrain_rt']
data_type = 'terrain_rt'

for cycle_runtime in [1, 2]:
    for forecast_hour in [1, 2]:
        nwm_uri = get_nwm_uri(date, data_type, cycle_runtime, forecast_hour)
        nwm_path = join(tmp_dir, basename(nwm_uri))
        print(f'Downloading {nwm_uri}')
        request.urlretrieve(nwm_uri, nwm_path)

Downloading https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/nwm.20220317/short_range/nwm.t01z.short_range.terrain_rt.f001.conus.nc
Downloading https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/nwm.20220317/short_range/nwm.t01z.short_range.terrain_rt.f002.conus.nc
Downloading https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/nwm.20220317/short_range/nwm.t02z.short_range.terrain_rt.f001.conus.nc
Downloading https://nomads.ncep.noaa.gov/pub/data/nccf/com/nwm/prod/nwm.20220317/short_range/nwm.t02z.short_range.terrain_rt.f002.conus.nc


To create the archive file, we need to append predictions along two dimensions, but xarray's `to_zarr` accepts only a single `append_dim` per write. As a workaround, we create a temporary Zarr file for a single `cycle_runtime` and append to it along the `forecast_hour` dimension (stored as `time`). Once the temporary file for that `cycle_runtime` is complete, we append it to the main archive Zarr file along the `cycle_runtime` dimension (stored as `reference_time`).
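The shape bookkeeping behind this two-stage append can be sketched in memory. The example below is not the Zarr code itself: it swaps in plain NumPy concatenation for the `to_zarr` appends, and `fake_forecast` along with the tiny 3x4 grid are hypothetical stand-ins for the real NWM files.

```python
import numpy as np


def fake_forecast(cycle_runtime, forecast_hour, ny=3, nx=4):
    """Hypothetical stand-in for one NWM file: an array of shape (time=1, y, x)."""
    return np.full((1, ny, nx), cycle_runtime * 100 + forecast_hour, dtype='float64')


archive = None
for cycle_runtime in [1, 2]:
    # Stage 1: grow a per-cycle array along the time (forecast_hour) axis.
    cycle_tmp = None
    for forecast_hour in [1, 2]:
        block = fake_forecast(cycle_runtime, forecast_hour)
        cycle_tmp = block if cycle_tmp is None else np.concatenate([cycle_tmp, block], axis=0)

    # Stage 2: add a leading reference_time axis, then grow the archive along it.
    cycle_block = cycle_tmp[np.newaxis, ...]
    archive = cycle_block if archive is None else np.concatenate([archive, cycle_block], axis=0)

print(archive.shape)  # (2, 2, 3, 4): (reference_time, time, y, x)
```

Each stage extends exactly one axis, which mirrors the one-`append_dim`-per-write constraint.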

In [11]:
out_path = join(archive_dir, f'{data_type}.zarr')
for cycle_runtime in tqdm([1, 2]):
    cycle_tmp_path = join(tmp_dir, f'cycle-tmp-{cycle_runtime}.zarr')
    
    # Append forecast_hour to the temporary file.
    for forecast_hour in tqdm([1, 2]):
        nwm_uri = get_nwm_uri(date, data_type, cycle_runtime, forecast_hour)
        nwm_path = join(tmp_dir, basename(nwm_uri))

        with xr.open_dataset(nwm_path, chunks={'time': 1, 'y': 1000}) as ds:
            # Drop the CRS variable; drop_vars replaces the deprecated Dataset.drop.
            ds = ds.drop_vars('crs')
            # Replace with offset hour instead of absolute time.
            ds = ds.assign_coords(time=np.array([forecast_hour]))
            append_dim = 'time' if forecast_hour != 1 else None
            ds.to_zarr(cycle_tmp_path, append_dim=append_dim)
        
    # Append the temporary file along the cycle_runtime dimension, and then delete the temp file.
    # Need to use chunks argument to use Dask arrays which allow streaming IO.
    with xr.open_dataset(cycle_tmp_path, chunks={'time': 1, 'y': 1000}) as ds:
        ds = ds.assign_coords(reference_time=np.array([cycle_runtime]))
        ds['zwattablrt'] = ds.zwattablrt.expand_dims('reference_time')
        ds['sfcheadsubrt'] = ds.sfcheadsubrt.expand_dims('reference_time')
        append_dim = 'reference_time' if cycle_runtime != 1 else None
        ds.to_zarr(out_path, append_dim=append_dim)
        
    shutil.rmtree(cycle_tmp_path)

100%|██████████| 2/2 [00:11<00:00,  5.60s/it]
100%|██████████| 2/2 [00:10<00:00,  5.32s/it]
100%|██████████| 2/2 [00:42<00:00, 21.28s/it]


In [12]:
ds = xr.open_dataset(out_path, chunks={'time': 1, 'y': 1000})
ds



| | Array | Chunk |
|---|---|---|
| Bytes | 8.44 GiB | 140.62 MiB |
| Shape | (2, 2, 15360, 18432) | (1, 1, 1000, 18432) |
| Count | 65 Tasks | 64 Chunks |
| Type | float64 | numpy.ndarray |


In [13]:
ds.zwattablrt

| | Array | Chunk |
|---|---|---|
| Bytes | 8.44 GiB | 140.62 MiB |
| Shape | (2, 2, 15360, 18432) | (1, 1, 1000, 18432) |
| Count | 65 Tasks | 64 Chunks |
| Type | float64 | numpy.ndarray |
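The sizes reported in the output above follow directly from the array shape, the chunk shape, and the 8-byte `float64` item size. The short check below reproduces them with the standard library only; the variable names are illustrative.

```python
import math

shape = (2, 2, 15360, 18432)   # array shape from the output above
chunks = (1, 1, 1000, 18432)   # chunk shape from the output above
itemsize = 8                   # bytes per float64 value

array_bytes = math.prod(shape) * itemsize
chunk_bytes = math.prod(chunks) * itemsize
# Chunks per axis are rounded up, so the 15360-long axis needs 16 chunks of 1000.
n_chunks = math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

print(array_bytes / 2**30)  # 8.4375 (GiB)
print(chunk_bytes / 2**20)  # 140.625 (MiB)
print(n_chunks)             # 64
```

Even this 2x2 test subset occupies roughly 8.4 GiB uncompressed per variable, which is worth keeping in mind when sizing a full archive.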
