# Intro to Reshapr

MOAD Group Software Presentation & Discussion

5 Aug 2022

This notebook can be viewed as a slideshow by using the 
[RISE](https://rise.readthedocs.io/en/stable/index.html)
slide show extension for Jupyter.

*Note: RISE only works with `jupyter notebook`, not with `jupyter lab` :-(*

If you are working in an up-to-date clone of the 
[UBC-MOAD/PythonNotes repo](https://github.com/UBC-MOAD/PythonNotes),
you can run the slideshow locally.
To do so:
* create a conda environment containing `jupyter` and `rise` with:
```bash
conda env create -f PythonNotes/reshapr-intro/environment.yaml
```
* start `jupyter notebook`
* open `PythonNotes/reshapr-intro/ReshaprIntroSlides.ipynb`
* use `Alt+r` or the `Enter/Exit RISE Slideshow` toolbar button to start/stop the slideshow mode
* use `Space` and `Shift+Space` to navigate forward and backward through the slide cells

# Reshapr

* Python package
* Command-line tool
* Based on Xarray and Dask
* Extraction of model variable time series from model products
* SalishSeaCast, HRDPS, (maybe CANESM2/CGCM4)

# Outline

* Motivation
* Code, Docs & Installation
* A Taste of Using Reshapr
* Discovery - `reshapr --help` and `reshapr info`
* Extraction - `reshapr extract` & Extraction Configurations
    * Time Series Extraction
    * Temporal & Spatial Selection
    * Resampling & Aggregation
* Give Me Your Use Cases and Bugs!
* Model Profiles
* Dask Cluster Configurations
* Discussion

# Motivation

* Ocean, climate & atmospheric model outputs are netCDF4 files

* Time series of model variables/fields are interesting and useful

<img src="https://salishsea.eos.ubc.ca/nowcast-green/02aug22/nitrate_diatoms_timeseries_02aug22.svg" alt="Recent nitrate & diatom concentrations at S3" />

* Variable values for time series are stored across many files (daily-ish or monthly-ish)
* Files usually contain more variables than you're interested in

* Files are large:
  24 hours, 40 depth levels, 398 x 898 x-y grid points of SalishSeaCast biology is 2.1 GB on disk

* ~10 GB in memory because the data are deflated by ~80% when the models write files

* Time series extraction = dealing with multiple files that exceed memory size if all loaded at once,
  and that have a computational cost to simply open due to the necessary data decompression
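
The arithmetic behind those numbers can be sketched in a few lines of Python. The 2.1 GB, ~80%, and grid dimensions come from the bullets above; the 4-byte float size is an assumption that is not stated in these slides, so treat the per-variable figure as a rough estimate only.

```python
# Figures quoted in the bullets above
on_disk_gb = 2.1  # one day (24 hours) of SalishSeaCast biology on disk
deflation = 0.80  # approximate fraction of the size removed by compression

# If ~80% is squeezed out on disk, the file holds ~20% of the in-memory size
in_memory_gb = on_disk_gb / (1 - deflation)
expansion = in_memory_gb / on_disk_gb
print(f"~{in_memory_gb:.1f} GB in memory, ~{expansion:.0f}x the file size")

# One variable for one day: 24 hours x 40 depths x 898 y x 398 x values
values = 24 * 40 * 898 * 398
# ASSUMPTION: each value is a 4-byte float (not stated in the slides)
per_variable_gb = values * 4 / 1e9
print(f"~{per_variable_gb:.2f} GB per uncompressed variable per day")

# A one-year time series touches 365 of these daily files
year_in_memory_tb = in_memory_gb * 365 / 1000
print(f"~{year_in_memory_tb:.1f} TB if a whole year were loaded at once")
```

That last number is why loading everything at once is not an option for multi-year time series.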

[Xarray](https://xarray.pydata.org/en/latest/) and [Dask](https://docs.dask.org/en/latest/) Address Those Challenges

* Xarray provides a high level programming interface for labelled multi-dimensional arrays
  that maps especially well to handling netCDF4 model results.

* Dask provides a flexible parallel computing framework for operating on extremely large
  datasets without loading them into memory.

`xarray.open_dataset()` ➛ `xarray.open_mfdataset()`!!

* Abstracts away the challenge of operating on tens or hundreds of multi-gigabyte files

* Hides the factor of ~5 expansion of the in-memory size of the data compared to the file sizes

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/SNice.svg/1200px-SNice.svg.png" alt=":-)" width=50/>
... for a while ...

Then, confusion and disappointment:

* Slow at best; at worst, failure because physical and virtual memory are exhausted

<img src="https://cdn-0.emojis.wiki/emoji-pics/google/crying-face-google.png" alt=":-(" width=50/>

Why the Sadness?

* Dask's task graph architecture

* Threads, processes, clusters, schedulers ...

* Compute latency vs. input/output latency ...

* ... And chunking in conjunction with those things

<img src="https://s3.amazonaws.com/pix.iemoji.com/images/emoji/apple/ios-12/256/exploding-head.png" alt="=-(" width=50/>

# Code, Docs & Installation

Code: https://github.com/UBC-MOAD/Reshapr

Docs: https://reshapr.readthedocs.io/en/latest/

# Installation

Install on `/ocean/` (e.g. `/ocean/dlatorne/MOAD/Reshapr/`) because `salish` is where you want to run Reshapr.

1. Clone the repo: `git clone git@github.com:UBC-MOAD/Reshapr.git`

2. Go into the Reshapr directory: `cd Reshapr`

3. Build a conda env: `conda env create -f envs/environment-user.yaml`

4. Activate the env: `conda activate reshapr`

5. Install the package: `python3 -m pip install --editable .`

6. Confirm your installation: `reshapr info` or `reshapr --version`

[Installation docs](https://reshapr.readthedocs.io/en/latest/installation.html)

Other details about:

* updating your installation
* adding Reshapr to your own conda env
* uninstalling

are also in the docs and in some appendix slides below.

# A Taste of Using Reshapr

Raisha needs day-averaged diatoms fields from the 15 complete years of the 201905 hindcast
to construct forcing files for Atlantis that will "nudge" its calculated diatoms
toward the SalishSeaCast values.
The files need to contain the lons and lats of the grid.
They need to be in NETCDF4_CLASSIC format.

We'll do a month of that extraction:

`reshapr extract docs/examples/extract_atlantis_diatoms.yaml`

Demo ...

```yaml
# reshapr extract processing configuration for diatoms nudging field
# for Atlantis ecosystem model

dataset:
  model profile: SalishSeaCast-201905.yaml
  time base: day
  variables group: biology

dask cluster: salish_cluster.yaml

start date: 2007-01-01
end date: 2021-12-31

extract variables:
  - diatoms

include lons lats: True

extracted dataset:
  name: SalishSeaCast_day_avg_diatoms
  description: Day-averaged diatoms biomass extracted from SalishSeaCast v201905 hindcast
  format: NETCDF4_CLASSIC
  dest dir: /ocean/dlatorne/Atlantis/day-avg-diatoms/
```

# Discovery

Reshapr can tell you:

* about itself

* how to use it

* about the model products it knows how to process

* about the Dask clusters it can use for that processing

Demo ...

* "Tell me about yourself"

  `reshapr --help`

* "Tell me how to do an extraction"

  `reshapr extract --help`

* "Tell me how to find out what you know about"

  `reshapr info --help`

* "Tell me what you know about"

  `reshapr info`

* "Tell me what you know about SalishSeaCast 201905"

  `reshapr info SalishSeaCast-201905`

* "Tell me what you know about SalishSeaCast 201905 day-averaged biology"

  `reshapr info SalishSeaCast-201905 day biology`

# Extraction

* Time Series Extraction

* Temporal & Spatial Selection

* Resampling & Aggregation

# Time Series Extraction

```yaml
dataset:
  model profile: SalishSeaCast-201905.yaml
  time base: day
  variables group: biology

dask cluster: salish_cluster.yaml

start date: 2007-01-01
end date: 2007-01-31

extract variables:
  - diatoms

extracted dataset:
  name: SalishSeaCast_day_avg_diatoms
  description: Day-averaged diatoms biomass extracted from SalishSeaCast v201905 hindcast
  dest dir: /ocean/dlatorne/
```

# YAML File Details

* Use `reshapr info` to get most of the contents of the YAML file:

  * `model profile`
  * `time base`
  * `variables group`
  * list of variables in `extract variables`
  
* Output will be stored in the location given by `dest dir`

* Output file name will be the value of `name` with `start date`, `end date`, and `.nc` appended; e.g.:
    
    `SalishSeaCast_day_avg_diatoms_20070101_20070131.nc`
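
If you want to predict the output file name in your own scripts, the naming rule above can be sketched as follows. The function `extracted_file_name` is a hypothetical helper written for this slide; it mirrors the rule as described here and is not Reshapr's own code.

```python
from datetime import date


def extracted_file_name(name, start_date, end_date):
    """Build a file name from `name`, `start date`, and `end date`.

    Hypothetical helper that follows the naming rule described above.
    """
    return f"{name}_{start_date:%Y%m%d}_{end_date:%Y%m%d}.nc"


file_name = extracted_file_name(
    "SalishSeaCast_day_avg_diatoms", date(2007, 1, 1), date(2007, 1, 31)
)
print(file_name)
```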

# Output File Details

* Time coordinate is named `time`

* Depth coordinate is named `depth`

* y and x grid index coordinates are named `gridY` and `gridX`

* If requested, longitude and latitude are variables named `longitude` and `latitude` with coordinates `gridY` and `gridX`

* Default storage settings are for best space-efficiency

* Adequate, consistent variable and coordinate metadata, and dataset metadata

* None of the extraneous NEMO variables and coordinates

Demo with `ncdump -cst` ...

# Best Practices

* Start small, then scale up

  * 1 variable
  * a few days
  
* Only extract the variables, time range, and spatial regions you need

  * More variables and a longer time range make the Dask task graph larger and more complicated
  * Limiting depth, y, and x ranges reduces memory use and maybe the task graph size; allows more concurrent tasks

# Temporal & Spatial Selection

```yaml
dataset:
  model profile: SalishSeaCast-201812.yaml
  time base: hour
  variables group: biology

dask cluster: salish_cluster.yaml

start date: 2015-01-01
end date: 2015-01-10

extract variables:
  - diatoms
  - nitrate

selection:
  time interval: 3  # multiple of dataset: time base
  depth:
    depth min: 0
    depth max: 25
    depth interval: 2  # multiple of depth index; e.g. 2 means every 2nd depth
  grid y:
    y min: 600
    y max: 700
    y interval: 10  # multiple of grid y index; e.g. 10 means every 10th grid point
  grid x:
    x min: 100
    x max: 300
    x interval: 5  # multiple of grid x index; e.g. 5 means every 5th grid point

extracted dataset:
  name: SalishSeaCast_hour_avg_biology_3h
  description: Hour-averaged diatoms biomass and nitrate every 3rd hour extracted from SalishSeaCast v201812 hindcast
  dest dir: /ocean/dlatorne/
```

Selection parameters all have sensible defaults so you can leave out coordinates you don't want to limit:

* default `time interval`, `depth interval`, `y interval`, and `x interval` are `1`
* default `depth min`, `y min`, and `x min` are `0`
* default `depth max`, `y max`, and `x max` are full grid extent

Presently, you have to handle the 0-based indexing yourself, including the +1 on the max values,
but that will be fixed (https://github.com/UBC-MOAD/Reshapr/issues/38).
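
To get a feel for what a selection keeps, here is a minimal Python sketch. It assumes that the min, max, and interval values act on 0-based indices like a Python slice, with the max value excluded; that is one reading of the +1 note above, not a statement of how Reshapr implements it. The helper `selected_indices` is hypothetical, and the grid sizes are the ones quoted in the Motivation slides.

```python
# Grid sizes quoted earlier in these slides: 40 depth levels, 898 y, 398 x points
GRID_SIZES = {"depth": 40, "y": 898, "x": 398}


def selected_indices(size, vmin=0, vmax=None, interval=1):
    """Return the indices kept by slice(vmin, vmax, interval).

    ASSUMPTION: vmax is exclusive, like the stop value of a Python slice.
    Hypothetical helper for illustration only.
    """
    return list(range(size)[vmin:vmax:interval])


# Values from the selection section of the YAML example above
depth = selected_indices(GRID_SIZES["depth"], 0, 25, 2)
grid_y = selected_indices(GRID_SIZES["y"], 600, 700, 10)
grid_x = selected_indices(GRID_SIZES["x"], 100, 300, 5)

for label, indices in (("depth", depth), ("grid y", grid_y), ("grid x", grid_x)):
    print(f"{label}: {len(indices)} points, first {indices[0]}, last {indices[-1]}")

# Fraction of each time step's grid points that survive the selection
full = GRID_SIZES["depth"] * GRID_SIZES["y"] * GRID_SIZES["x"]
kept = len(depth) * len(grid_y) * len(grid_x)
print(f"{kept} of {full} points per time step ({kept / full:.4%})")
```

Under that assumption, `y max: 700` stops at index 690, so you would write `701` to include index 700.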

# Resampling & Aggregation

```yaml
dataset:
  model profile: SalishSeaCast-201905.yaml
  time base: day
  variables group: biology

dask cluster: salish_cluster.yaml

start date: 2009-02-01
end date: 2009-02-28

extract variables:
  - diatoms
  - nitrate

resample:
  # A pandas time series frequency offset alias
  # with an optional multiplier digit prefix
  time interval: 1M
  # An xarray dataset reduction method to use for aggregation.
  # See the "resampling and grouped operations" sub-section in the Time Series Data
  # section of the xarray User Guide.
  # default: mean
  aggregation: mean

extracted dataset:
  name: SalishSeaCast_1m_ptrc_T
  description: Month-averaged diatoms biomass and nitrate extracted from SalishSeaCast v201905 hindcast
  deflate: True
  format: NETCDF4
  dest dir: /ocean/dlatorne/
```

Selection and resampling can be combined.
Selection happens first, then resampling.

Down-sampling implies aggregation.
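
As a plain-Python illustration of that idea (a stand-in for the concept, not the xarray code that Reshapr runs): going from daily values to monthly values means every month's days have to be reduced to one number, here with a mean. The daily values are made up, using the day of the month as the "data".

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# Made-up daily series for 2009-02-01 through 2009-03-31;
# the value for each day is simply its day of the month
first_day = date(2009, 2, 1)
daily = {}
for offset in range(59):  # 28 days in Feb 2009 + 31 days in Mar
    day = first_day + timedelta(days=offset)
    daily[day] = float(day.day)


def monthly_means(daily_values):
    """Group daily values by (year, month) and aggregate each group with mean."""
    groups = defaultdict(list)
    for day, value in daily_values.items():
        groups[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(groups.items())}


monthly = monthly_means(daily)
print(monthly)
```

59 daily values go in and 2 monthly values come out; the `aggregation` setting decides how each group is reduced.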

Up-sampling might work, but hasn't been tested.
If it doesn't work, it could probably be made to.

Pandas time series frequency offset aliases can be tricky!

* The syntax can be non-intuitive: check the docs at 
  https://pandas.pydata.org/docs/user_guide/timeseries.html#dateoffset-objects
  
* The results can be non-intuitive:
  For Karyn's 5-year average biology extraction I discovered that I had to use
  `1826D` rather than `5A`. Maybe because of leap years, but I'm not sure.
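
The leap year arithmetic is easy to check with the standard library. This sketch only shows where a number like 1826 comes from; it does not explain why the `5A` alias misbehaved. The start years are arbitrary examples, not the years of Karyn's extraction.

```python
from datetime import date


def days_in_span(start_year, n_years=5):
    """Number of days from 1 Jan of start_year to 1 Jan n_years later."""
    return (date(start_year + n_years, 1, 1) - date(start_year, 1, 1)).days


# Arbitrary example start years: the count depends on how many leap days fall in the span
for start_year in (2007, 2008):
    print(start_year, days_in_span(start_year))
```

A 5-year span holds one or two leap days depending on where it starts, so a fixed day count like `1826D` has to be worked out for the specific years you are averaging.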

# Give Me Your Use Cases and Bugs!

* 100% you will think of things to do with Reshapr that I haven't thought of
  
  * Tell me about them and we'll figure out how to do them 
    or try to figure out how to add code to make them possible
    
  * Even if it "just works" for something that is not in the
    [use examples docs](https://reshapr.readthedocs.io/en/latest/examples/index.html),
    adding a new use example blurb and YAML file is a Good Thing™
    
* If Reshapr raises an exception and gives you a traceback, that is a bug and I want to know about it!

  * Please tell me somehow!
  
  * My preference is the [issue tracker](https://github.com/UBC-MOAD/Reshapr/issues) but any channel will do!

# Model Profiles

YAML files that provide a standardized interface for the details of different models' netCDF4 files

* Coordinate names
* Chunk sizes
* Mapping between y-x grid indices and lons/lats
* Time origin for extractions
* Path where model product files are stored
* File path/name patterns and depth coordinates for variable group files

All the details in docs at https://reshapr.readthedocs.io/en/latest/model_profiles.html

# Dask Cluster Configurations

YAML files that provide details of Dask clusters to use for extractions

A key part of keeping the `xarray.open_mfdataset()` sadness away!

For now there is only `salish_cluster`.
MPI cluster configs for `graham` or other HPC systems are possible.

# On-Demand vs. Persistent Clusters

## On-Demand

* Highly recommended!

* Cluster is started for your extraction and shut down when it finishes

* Put the cluster config file name in your extraction YAML:

  `dask cluster: salish_cluster.yaml`
  
## Persistent Cluster

* Launch scheduler and workers in `tmux` terminals;
  docs at: https://reshapr.readthedocs.io/en/latest/dask_clusters.html#persistent-cluster

* I used one for month-by-month calculation of averages for hindcast: 
  hundreds of `reshapr extract` runs in a loop
  
* If everyone tries to run their own persistent cluster on `salish` there will be sadness!
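
A month-by-month loop like that can be sketched in Python by writing one extraction YAML file per month and collecting the matching `reshapr extract` commands. Everything named here is an assumption made for illustration: the helper functions, the file naming, the dataset `name`, and the template, which reuses keys from the example configs in these slides. The `dask cluster` value is left as a parameter; what to put there for a persistent cluster is covered in the docs linked above.

```python
import calendar
from datetime import date
from pathlib import Path

# Hypothetical template assembled from the example configs in these slides
CONFIG_TEMPLATE = """\
dataset:
  model profile: SalishSeaCast-201905.yaml
  time base: day
  variables group: biology

dask cluster: {cluster}

start date: {start}
end date: {end}

extract variables:
  - diatoms

resample:
  time interval: 1M
  aggregation: mean

extracted dataset:
  name: SalishSeaCast_1m_diatoms
  description: Month-averaged diatoms biomass extracted from SalishSeaCast v201905 hindcast
  dest dir: {dest_dir}
"""


def month_ranges(first_year, last_year):
    """Yield (first day, last day) for every month in the inclusive year range."""
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            last_day = calendar.monthrange(year, month)[1]
            yield date(year, month, 1), date(year, month, last_day)


def write_month_configs(config_dir, first_year, last_year,
                        cluster="salish_cluster.yaml", dest_dir="/ocean/dlatorne/"):
    """Write one YAML file per month; return the commands that would process them."""
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for start, end in month_ranges(first_year, last_year):
        config_path = config_dir / f"extract_{start:%Y%m}.yaml"
        config_path.write_text(
            CONFIG_TEMPLATE.format(
                cluster=cluster, start=start, end=end, dest_dir=dest_dir
            )
        )
        commands.append(["reshapr", "extract", str(config_path)])
    return commands
```

Each returned command could then be run in turn with `subprocess.run(command, check=True)`, so that a failed month stops the loop instead of being skipped silently.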

# Cluster Monitoring

Dask clusters have a web dashboard that can be used for monitoring and analysis of the processing.
Mostly blinking lights :-)

See step 5 in persistent cluster docs at 
https://reshapr.readthedocs.io/en/latest/dask_clusters.html#persistent-cluster

[Matthew Rocklin YouTube video about the dashboard](https://www.youtube.com/watch?v=N_GqzcuGLCY)

# Discussion

# Appendix Slides

# Updating Your Installation

1. Pull from GitHub: `git pull`

2. On rare occasions (I'll tell you when): `python3 -m pip install --editable .`

# Adding Reshapr to Your Own Conda Env

**Python 3.10 or later !!!**

1. Edit your env description YAML file to add packages that are in `envs/environment-user.yaml`

2. Update your env: `conda env update -f your-env-yaml`

3. Install Reshapr: `python3 -m pip install --editable path/to/Reshapr`

# Uninstalling

See the docs: https://reshapr.readthedocs.io/en/latest/installation.html#uninstalling