# Load and Save Files for CanRCM4 Downscaling Project
Load in and process HRDPS files here, then export them as netCDF or NumPy arrays.

*This version is the 2nd part of Doug's code review changes and comments on `LoadFiles.ipynb`.*

I'm adding the stdlib `time` module to the imports so that we can time some of the steps.
Dropping `datetime` because I don't think you need to worry about excluding the 2016 leap day.

In [1]:
from pathlib import Path
import time

import numpy as np
import xarray as xr

In [2]:
hrdps_path = Path("/results/forcing/atmospheric/GEM2.5/operational/")

We're going to iterate over the years that we want to process several times,
so I am giving the `range()` object the name `years`.

In [3]:
start_year, end_year = 2015, 2019
years = range(start_year, end_year+1)
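A small stdlib-only sketch of why a named `range()` is convenient here: unlike a generator, it can be iterated as many times as we like. It also works out the hour and day counts that show up in the dataset shapes later in the notebook. The names `days_per_year`, `hours_per_year`, and `total_days` are only for illustration.

```python
import calendar

start_year, end_year = 2015, 2019
years = range(start_year, end_year + 1)

# A range can be iterated any number of times (unlike a generator).
days_per_year = {year: 366 if calendar.isleap(year) else 365 for year in years}
hours_per_year = {year: days * 24 for year, days in days_per_year.items()}
total_days = sum(days_per_year[year] for year in years)

print(hours_per_year[2016])  # 8784
print(total_days)            # 1826
```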

Processing all of the years at once with a single `open_mfdataset()` call
was slow and had a huge memory footprint. So,
I decided to process 1 year at a time, storing the years in a `dict`,
then concatenate them near the end of the processing.

Also, in keeping with the philosophy of trimming the datasets as early as possible,
I am dropping the variables that we don't need.
We usually think of that trimming in terms of selecting spatial and/or temporal slices,
but dropping variables we don't need is important too.

This cell also shows how I am using `time.time()` to calculate how long sections of code take
to execute.
Also, adding `print()` messages to long-running pieces of code is helpful to monitor progress.
There are other ways of emitting messages from code using logging that are more appropriate for
library code, but `print()` works really well in notebooks.
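If the `time.time()` bookkeeping starts to feel repetitive, one possible way to wrap it is a small context manager. This is only a sketch: `timed` is a hypothetical helper, not something from this notebook, and it uses `time.perf_counter()`, which is the stdlib clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the body of the `with` block took (hypothetical helper)."""
    t_start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the body raises, so the elapsed time is always reported.
        print(f"{label}: {time.perf_counter() - t_start:.1f}s")


with timed("example step"):
    total = sum(range(1_000_000))
```

The explicit `t_start = time.time()` pattern used in the cells here does the same job and is easier to follow when you are first reading the code.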

The `PerformanceWarning` messages and a substantial increase in execution time and memory footprint
were due to the addition of the 3 variables `LHTFL_surface`, `PRATE_surface`, and `RH_2maboveground`
to the HRDPS datasets on 5-Dec-2018.
Adding those variables to the ones to drop resolved those issues.

In [4]:
drop_vars = (
    "precip", "qair", "solar", "therm_rad", "percentcloud", 
    "LHTFL_surface", "PRATE_surface", "RH_2maboveground",
)
hrdps = {}
t_total_start = time.time()
for year in years:
    print(f"start gathering {year}")
    t_start = time.time()
    hrdps[year] = xr.open_mfdataset(
        sorted(hrdps_path.glob(f"ops_y{year}*.nc")),
        drop_variables=drop_vars,
    )
    print(f"finished gathering {year}: {time.time() - t_start}s")
print(f"total gathering time: {time.time() - t_total_start}s")

start gathering 2015
finished gathering 2015: 44.388766288757324s
start gathering 2016
finished gathering 2016: 34.02922606468201s
start gathering 2017
finished gathering 2017: 33.67256569862366s
start gathering 2018
finished gathering 2018: 35.99414849281311s
start gathering 2019
finished gathering 2019: 57.364177227020264s
total gathering time: 205.451247215271s


Looking at the 2016 dataset we can see that the variables are `dask` arrays.
The `xarray` docs about `dask` at https://xarray.pydata.org/en/stable/dask.html
give some explanation of `dask` arrays.
The [dask](https://docs.dask.org/) docs go into way, way more detail.
Try not to be intimidated by those docs. Parallel processing is a "hard problem".
It takes time, thought, experimentation, and patience to learn about.

The key thing to understand is that those `dask` arrays represent computations that will
be executed sometime in the future.
The reason for deferring the execution of the computations is that the data we are processing
is too large to fit into memory all at once.
`dask` breaks the processing into a graph of tasks that can be distributed across multiple processes
to facilitate things like task-wise execution that will fit into memory,
access to multiple cores for parallel processing,
and access to more memory on multiple compute nodes.
What we want to do is to control when and how `dask` executes those tasks so that they happen 
in a way that makes the best use of the cores and memory we have available to us.
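To make the idea of deferred execution concrete, here is a minimal sketch that uses `dask.array` directly on a small synthetic array (not the HRDPS data). The array size and chunking are arbitrary choices for illustration, and the `"synchronous"` scheduler is used only so that the example runs in a single thread anywhere.

```python
import dask.array as da

# Build a lazy array: 8 chunks of 100 x 1000 values; nothing is computed yet.
lazy = da.ones((800, 1000), chunks=(100, 1000))
lazy_mean = (lazy * 2).mean()

print(type(lazy_mean))  # a dask array: a recipe for the result, not a number
print(lazy.numblocks)   # (8, 1)

# Execution happens only when we explicitly ask for it.
result = lazy_mean.compute(scheduler="synchronous")
print(float(result))    # 2.0
```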

I don't know if it is visible in the nbviewer rendering of this notebook
(I know that it is not in the GitHub rendering),
but if you are running the notebook the representation of `hrdps[2016]` below is a fancy HTML table
with embedded JavaScript.
You can click on the things that look like a stack of 3 disks on the right side of the coordinate and data
variables and see the representation of the `dask` arrays.
They show information that includes the number of tasks that will be executed to calculate the result.
There is also information about chunks there.
I don't think we have to deal with chunks in this notebook,
but chunking is another important tuning parameter in other contexts.

In [5]:
hrdps[2016]

| Variable (display order) | Type | Array bytes | Chunk bytes | Array shape | Chunk shape | Tasks | Chunks |
|---|---|---|---|---|---|---|---|
| 1 | float32 | 2.39 GB | 6.54 MB | (8784, 266, 256) | (24, 266, 256) | 1098 | 366 |
| 2 | float64 | 4.79 GB | 13.07 MB | (8784, 266, 256) | (24, 266, 256) | 1464 | 366 |
| 3 | float64 | 4.79 GB | 13.07 MB | (8784, 266, 256) | (24, 266, 256) | 1464 | 366 |
| 4 | float32 | 2.39 GB | 6.54 MB | (8784, 266, 256) | (24, 266, 256) | 1098 | 366 |
| 5 | float32 | 2.39 GB | 6.54 MB | (8784, 266, 256) | (24, 266, 256) | 1098 | 366 |
| 6 | float32 | 2.39 GB | 6.54 MB | (8784, 266, 256) | (24, 266, 256) | 1098 | 366 |

The [`dask` optimization tips](https://xarray.pydata.org/en/stable/dask.html#optimization-tips) section in the `xarray` docs
recommends doing `.sel()` and `.isel()` operations before `resample()` and `groupby()` ones.
So, we will do that.

The cell below is a "`dict` comprehension";
a compact way of building a `dict` when you need to iterate over something to do so.
It is equivalent to, but more optimized than:
```python
hrdps_ssc = {}
for year in years:
    hrdps_ssc[year] = hrdps[year].sel(x=slice(0, 480000))
```

I'm choosing to store the datasets we create at each step in new variables so that we can inspect them.
They don't take up too much more memory because they are collections of `dask` arrays,
not the actual results.

In [6]:
hrdps_ssc = {
    year: hrdps[year].sel(x=slice(0, 480000))
    for year in years
}

The main things to see here are that the `x` dimension has been reduced in size from 256 to 193,
and that the coordinates and data variables are still `dask` arrays,
albeit with more tasks than before.
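The count of 193 follows from label-based slices in `xarray` including both end points. The sketch below reproduces that count on a synthetic coordinate. It assumes a grid of 256 points spaced 2500 m apart starting at `x = 0`; that spacing is an assumption for illustration, not something read from the HRDPS files.

```python
import numpy as np
import xarray as xr

# Assumed synthetic grid: 256 points spaced 2500 m apart, starting at x = 0.
x = np.arange(256) * 2500.0
field = xr.DataArray(np.zeros(x.size), coords={"x": x}, dims="x")

# Label-based slicing keeps every point with 0 <= x <= 480000.
subset = field.sel(x=slice(0, 480000))
print(subset.sizes["x"])    # 193
print(float(subset.x[-1]))  # 480000.0
```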

In [7]:
hrdps_ssc[2016]

| Variable (display order) | Type | Array bytes | Chunk bytes | Array shape | Chunk shape | Tasks | Chunks |
|---|---|---|---|---|---|---|---|
| 1 | float32 | 1.80 GB | 4.93 MB | (8784, 266, 193) | (24, 266, 193) | 1464 | 366 |
| 2 | float64 | 3.61 GB | 9.86 MB | (8784, 266, 193) | (24, 266, 193) | 1830 | 366 |
| 3 | float64 | 3.61 GB | 9.86 MB | (8784, 266, 193) | (24, 266, 193) | 1830 | 366 |
| 4 | float32 | 1.80 GB | 4.93 MB | (8784, 266, 193) | (24, 266, 193) | 1464 | 366 |
| 5 | float32 | 1.80 GB | 4.93 MB | (8784, 266, 193) | (24, 266, 193) | 1464 | 366 |
| 6 | float32 | 1.80 GB | 4.93 MB | (8784, 266, 193) | (24, 266, 193) | 1464 | 366 |

Now we resample to get daily averages.
Again, I am using a `dict` comprehension.
This step does trigger some computation across all of the blocks of data that
`dask` has divided things up into.
So, it takes a bit of time, and I have wrapped it in timing code.

In [8]:
t_start = time.time()
day_avgs = {
    year: hrdps_ssc[year].resample(time_counter="D").mean(dim="time_counter")
    for year in years
}
print(f"resampled to day averages: {time.time() - t_start}s")

resampled to day averages: 33.99842166900635s


Here, notice that the `time_counter` dimension has been reduced from 8784 hours to 366 days.
The coordinate and data variables are still `dask` arrays, with even larger task counts.
Sadly, most of the attribute metadata got stripped away by `resample()`.
I don't know how to get it to be retained, so we will put it back later.
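One option that may be worth trying for retaining attributes is the `keep_attrs=True` argument that `xarray` reduction methods such as `mean()` accept. The sketch below checks it on a tiny synthetic `DataArray` (two days of hourly values); the `units` attribute and the data are made up for illustration, and behaviour on the full HRDPS datasets may depend on the `xarray` version in use.

```python
import numpy as np
import xarray as xr

# Two days of synthetic hourly values with one metadata attribute.
times = np.arange(
    "2016-01-01T00", "2016-01-03T00", dtype="datetime64[h]"
).astype("datetime64[ns]")
hourly = xr.DataArray(
    np.arange(48, dtype="float64"),
    coords={"time_counter": times},
    dims="time_counter",
    attrs={"units": "K"},
)

# Ask the reduction to carry the attributes through to the result.
daily = hourly.resample(time_counter="D").mean(keep_attrs=True)
print(daily.sizes["time_counter"])  # 2
print(daily.attrs)
```

Restoring the attributes by hand after concatenation, as this notebook does, works regardless of version.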

In [9]:
day_avgs[2016]

| Variable (display order) | Type | Array bytes | Chunk bytes | Array shape | Chunk shape | Tasks | Chunks |
|---|---|---|---|---|---|---|---|
| 1 | float32 | 75.16 MB | 205.35 kB | (366, 266, 193) | (1, 266, 193) | 3294 | 366 |
| 2 | float64 | 150.32 MB | 410.70 kB | (366, 266, 193) | (1, 266, 193) | 3660 | 366 |
| 3 | float64 | 150.32 MB | 410.70 kB | (366, 266, 193) | (1, 266, 193) | 3660 | 366 |
| 4 | float32 | 75.16 MB | 205.35 kB | (366, 266, 193) | (1, 266, 193) | 3294 | 366 |
| 5 | float32 | 75.16 MB | 205.35 kB | (366, 266, 193) | (1, 266, 193) | 3294 | 366 |
| 6 | float32 | 75.16 MB | 205.35 kB | (366, 266, 193) | (1, 266, 193) | 3294 | 366 |

Next, we concatenate the years together.
Recall that `day_avgs`, like `hrdps` above, is a `dict` whose keys are the `year` numbers,
and whose values are `xarray` datasets.
We get the collection of datasets by calling the `values()` method on the `dict`.
Unfortunately, and confusingly, that method has the same name as the `values` attribute that we use to
access the NumPy array underlying an `xarray.DataArray`.
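A minimal sketch of the two different things called `values`, using small synthetic arrays in place of the daily averages. The `per_year` dict and its contents are hypothetical stand-ins.

```python
import numpy as np
import xarray as xr

# Hypothetical per-year arrays standing in for the daily averages.
per_year = {
    year: xr.DataArray(np.full(2, float(year)), dims="time_counter")
    for year in (2015, 2016)
}

# dict.values() is a method call that returns the stored DataArrays...
combined = xr.concat(list(per_year.values()), dim="time_counter")

# ...while DataArray.values is an attribute holding the NumPy array.
print(type(combined.values).__name__)  # ndarray
print(combined.values.tolist())        # [2015.0, 2015.0, 2016.0, 2016.0]
```

Because a `dict` preserves insertion order, the years are concatenated in the order they were added.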

In [10]:
hrdps_allyears = xr.concat(day_avgs.values(), dim="time_counter")

In [11]:
hrdps_allyears

| Variable (display order) | Type | Array bytes | Chunk bytes | Array shape | Chunk shape | Tasks | Chunks |
|---|---|---|---|---|---|---|---|
| 1 | float32 | 374.97 MB | 205.35 kB | (1826, 266, 193) | (1, 266, 193) | 18260 | 1826 |
| 2 | float64 | 749.95 MB | 410.70 kB | (1826, 266, 193) | (1, 266, 193) | 20086 | 1826 |
| 3 | float64 | 749.95 MB | 410.70 kB | (1826, 266, 193) | (1, 266, 193) | 20086 | 1826 |
| 4 | float32 | 374.97 MB | 205.35 kB | (1826, 266, 193) | (1, 266, 193) | 18260 | 1826 |
| 5 | float32 | 374.97 MB | 205.35 kB | (1826, 266, 193) | (1, 266, 193) | 18260 | 1826 |
| 6 | float32 | 374.97 MB | 205.35 kB | (1826, 266, 193) | (1, 266, 193) | 18260 | 1826 |

Now for some cleanup...

In [12]:
# Drop the time_counter coordinate from nav_lat and nav_lon to make them 2d variables
hrdps_allyears["nav_lat"] = hrdps_allyears.nav_lat.sel(time_counter=hrdps_allyears.time_counter[0])
hrdps_allyears["nav_lon"] = hrdps_allyears.nav_lon.sel(time_counter=hrdps_allyears.time_counter[0])
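An alternative way to do the same kind of reduction to 2D is positional selection with `isel()` and `drop=True`, which also discards the leftover scalar `time_counter` coordinate. This is a sketch on a synthetic array, not a claim that it behaves identically on the real dataset; the array shape and values are made up.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in: a latitude field repeated along a time axis.
times = np.arange(
    "2016-01-01", "2016-01-04", dtype="datetime64[D]"
).astype("datetime64[ns]")
lat = xr.DataArray(
    np.ones((3, 4, 5)),
    coords={"time_counter": times},
    dims=("time_counter", "y", "x"),
)

# Take the first time step and drop the scalar time_counter coordinate.
lat_2d = lat.isel(time_counter=0, drop=True)
print(lat_2d.dims)                      # ('y', 'x')
print("time_counter" in lat_2d.coords)  # False
```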

In [13]:
# Restore the attribute metadata of the variables that got lost during resampling
for var in hrdps_allyears.data_vars:
    hrdps_allyears[var].attrs = hrdps[start_year][var].attrs

In [14]:
hrdps_allyears

| Variable (display order) | Type | Array bytes | Chunk bytes | Array shape | Chunk shape | Tasks | Chunks |
|---|---|---|---|---|---|---|---|
| 1 | float32 | 374.97 MB | 205.35 kB | (1826, 266, 193) | (1, 266, 193) | 18260 | 1826 |
| 2 | float64 | 410.70 kB | 410.70 kB | (266, 193) | (266, 193) | 20087 | 1 |
| 3 | float64 | 410.70 kB | 410.70 kB | (266, 193) | (266, 193) | 20087 | 1 |
| 4 | float32 | 374.97 MB | 205.35 kB | (1826, 266, 193) | (1, 266, 193) | 18260 | 1826 |
| 5 | float32 | 374.97 MB | 205.35 kB | (1826, 266, 193) | (1, 266, 193) | 18260 | 1826 |
| 6 | float32 | 374.97 MB | 205.35 kB | (1826, 266, 193) | (1, 266, 193) | 18260 | 1826 |

This is where the rubber hits the road!

All of the above processing has been telling dask to add tasks to its processing graph
but deferring the actual processing until we actually need to access its results
(lazy processing).
Doing things like accessing the underlying NumPy arrays in the dataset to plot them
or look at a slice triggers processing.
But it is best for large data collections for us to control when the processing is triggered,
and how it is executed.
We do that with the `load()` method on the dataset.

`load()` defaults to using threads in the same process that our notebook is running in.
That is rarely the most efficient way of doing things, so it takes a long time
and uses a lot of memory.
For the kind of "concatenate and lightly process lots of netCDF datasets" workload
in this notebook, telling dask to launch a collection of separate Python processes
(workers) to distribute the tasks in its graph on to is usually the best approach.

Here I chose to use 8 workers because I was running the notebook on `salish` while
the nowcast-dev NEMO run was in progress. If `top` showed me that `salish` was not
busy doing anything else I would have used up to 32 workers.
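The "look at `top`, then pick a worker count" judgment can be roughed out in code. The heuristic below is purely illustrative: `pick_num_workers` is a hypothetical helper, the 32-worker cap mirrors the number mentioned above, and rounding the load average to a count of busy cores is a simplifying assumption.

```python
import os


def pick_num_workers(load_avg, cpu_count, max_workers=32):
    """Hypothetical heuristic: use the cores that are not already busy."""
    idle_cores = cpu_count - int(round(load_avg))
    # Always use at least 1 worker, and never more than the cap.
    return max(1, min(max_workers, idle_cores))


print(pick_num_workers(load_avg=24.0, cpu_count=32))  # 8
print(pick_num_workers(load_avg=0.2, cpu_count=32))   # 32

# On Unix-like systems the live values can come from the stdlib.
if hasattr(os, "getloadavg"):
    print(pick_num_workers(os.getloadavg()[0], os.cpu_count() or 1))
```

Checking `top` by eye is still the more reliable approach when you know what else is scheduled to run on the machine.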

This step will take the longest of any in the notebook; 2-3 minutes per year using 8 workers.

In [15]:
num_workers = 8
t_start = time.time()
hrdps_loaded = hrdps_allyears.load(scheduler="processes", num_workers=num_workers)
print(f"dask processing in {num_workers} processes to load result: {time.time() - t_start}s")

dask processing in 8 processes to load result: 759.3855674266815s


Notice that the data variables are now NumPy arrays instead of dask arrays.

In [16]:
hrdps_loaded

I'm not going to bother adding dataset level metadata attributes like Susan does
for the rivers climatology datasets because this dataset is an intermediate processing artefact,
whereas the rivers climatology datasets are (in some sense) widely used research products.
But you can add metadata if you want.

Dump the daily averaged HRDPS fields and geo-coordinates to a netCDF
so that they can be loaded with `xarray.open_dataset()` in the other processing steps
without repeating the above processing.

`nc_file` can be any path you want. The one below puts the file in the same directory
as this notebook.

**Please don't commit the netCDF file to git. It is way too large to push to GitHub.**
It is a "product of processing" file: the kind of file we don't
track with git because it can be re-generated by the code in this notebook.
It is also a large binary file and those are generally unsuitable for tracking by git.

In [17]:
nc_file = Path("hrdps_day_avgs.nc")
hrdps_loaded.to_netcdf(nc_file)