# Sample Diagnostic

Example of a diagnostic showing how the COSIMA cookbook works.

An objective of the COSIMA cookbook is to catalogue useful diagnostics for ocean and ice models.  Certain
tools and patterns are used extensively in these examples.  While the diagnostic itself should be portable
to another framework, there are some conventions used throughout that require explanation.

## Background of a diagnostic notebook

Each diagnostic is written up as a Jupyter notebook with the extension `.ipynb`.  The first cell in the notebook
must be a Markdown cell with a header and a one-line description.  This cell is used by sphinx-nbgallery to collect
all of the diagnostics together on the http://cosima-cookbook.readthedocs.io site.

In this first section, a brief background on the theory of the diagnostic is presented. For this worked example, we will be calculating the eddy kinetic energy (see the Kinetic Energy notebook).  While a real diagnostic notebook presents commentary about the diagnostic itself, here we provide commentary about the technical aspects of how these diagnostics have been implemented.

### A note on names
- The project's long name is "COSIMA Cookbook".
- The GitHub project name is "cosima-cookbook".
- The Python package is called "cosima_cookbook". 

This convention appears to be consistent with other Python-based projects.

### Python import statements

Early in the notebook, there is a code cell that imports all of the needed Python packages.  Internal to cosima_cookbook, other packages may also be imported. However, if they are needed in this notebook, they must be
imported explicitly into the notebook's namespace.

#### Example of an import cell block

In [None]:
files = pd.DataFrame(list(db['ncfiles'].all()))

In [None]:
files

Files seen before but now not found on disk

Notice that, unlike the find command above, the glob has only identified .nc files that lie within the
__configuration__/__experiment__/__run__ directory structure.

We want to produce an index over all of these NetCDF files. Once we do that, we can build our diagnostics by first
querying that index.

In [None]:
rows = [cosima_cookbook.index_ncfile(fn[0]) for fn in ncfiles[:15]]

index_ncfile() returns a list of dictionaries describing each variable in the NetCDF file.

In [None]:
rows[0][:3]

The list comprehension gives back a list of lists.  To continue, we first flatten this list of lists:

In [None]:
rows = [item for sublist in rows for item in sublist]
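The same flattening can be written with `itertools.chain.from_iterable`, which some readers find easier to scan than a nested comprehension. This is a minimal sketch; the rows below are made up and only stand in for real `index_ncfile()` output.

```python
from itertools import chain

# Hypothetical stand-in for the per-file lists returned by index_ncfile()
rows_per_file = [
    [{"ncfile": "a.nc", "variable": "u"}, {"ncfile": "a.nc", "variable": "v"}],
    [],  # an empty list contributes nothing to the result
    [{"ncfile": "b.nc", "variable": "temp"}],
]

# Equivalent to: [item for sublist in rows_per_file for item in sublist]
flat_rows = list(chain.from_iterable(rows_per_file))
print(len(flat_rows))
```

Either form produces one flat list of dictionaries, which is the shape that `pd.DataFrame` expects.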

Finally, we can convert this list of dictionaries into a pandas DataFrame.

In [None]:
%matplotlib inline

import cosima_cookbook as cc
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import xarray as xr

from tqdm import tqdm_notebook

For static plots, the plotting package matplotlib is used. The inline statement tells Jupyter to place those plots within the notebook file.

It is common to import standard packages with abbreviated package names such as plt, np, pd, and xr. XArray provides
named (labelled) arrays and can be thought of as a layer that sits above the netCDF4 package.

The package tqdm is for progress bars.

These diagnostics are usually memory-intensive and/or computationally expensive.  We leverage the `dask` library http://dask.pydata.org and its related package `distributed` https://distributed.readthedocs.io.

In [None]:
# output* directories
# match the parent and grandparent directory to configuration/experiment
m = re.compile(r'(.*)/(.*)/(.*)/(output\d+)/.*\.nc')  # raw string so \d and \. reach the regex engine unchanged

def index_variables(ncfile):

    matched = m.match(ncfile)
    if matched is None:
        return []
    
    if not os.path.exists(ncfile):
        return []
    
    try: 
        with netCDF4.Dataset(ncfile) as ds:
            ncvars = [ {'ncfile': ncfile,
                   'rootdir': matched.group(1),
                   'configuration': matched.group(2),
                   'experiment' : matched.group(3),
                   'run' : matched.group(4),
                   'basename' : os.path.basename(ncfile),
                   'variable' : v.name
                   } for v in ds.variables.values()]
    except Exception:  # unreadable or corrupt file: skip it, but let KeyboardInterrupt through
        return []
    
    return ncvars
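To see what the regular expression in `index_variables` extracts, it can help to run it against a single path. The sketch below uses a made-up path (not a real file on disk) that follows the configuration/experiment/run layout.

```python
import re

# Same pattern as in index_variables, written as a raw string
pattern = re.compile(r'(.*)/(.*)/(.*)/(output\d+)/.*\.nc')

# Hypothetical path, for illustration only
path = '/data/cosima/mom01v5/KDS75/output266/ocean.nc'

matched = pattern.match(path)
fields = {
    'rootdir': matched.group(1),
    'configuration': matched.group(2),
    'experiment': matched.group(3),
    'run': matched.group(4),
}
print(fields)

# A path without an output<N> directory does not match at all
no_match = pattern.match('/data/cosima/mom01v5/KDS75/restart266/ocean.nc')
print(no_match)
```

Because the pattern anchors on the `output<N>` directory, the three path components immediately before it are taken as experiment, configuration, and root directory, whatever their depth in the file system.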

Parallel approach

In [None]:
import dask.bag
from distributed.diagnostics.progressbar import progress

In [None]:
%%time

bag = dask.bag.from_sequence(files_to_add)
bag = bag.map(index_variables).flatten()
ncvars = bag.compute()
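The map-then-flatten pattern used with the dask bag is not specific to dask. As a rough illustration only, the sketch below reproduces the same shape with the standard library's `concurrent.futures` in place of dask. The function `fake_index_variables` and the file names are hypothetical stand-ins, and no NetCDF files are read.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain


def fake_index_variables(ncfile):
    """Hypothetical stand-in: return one record per variable in a file."""
    if not ncfile.endswith('.nc'):
        return []  # mirror index_variables: unusable files yield no records
    return [{'ncfile': ncfile, 'variable': name} for name in ('u', 'v')]


files_to_add = ['a.nc', 'b.nc', 'notes.txt']  # made-up names

# map: index each file independently, possibly in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    per_file = list(pool.map(fake_index_variables, files_to_add))

# flatten: merge the per-file lists into a single list of records
ncvars = list(chain.from_iterable(per_file))
print(len(ncvars))
```

Returning an empty list for files that cannot be indexed is what lets the flatten step drop them without any special handling.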

In [None]:
print(len(ncvars))

In [None]:
db['ncfiles'].insert_many(ncvars)

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(rows)

In [None]:
df

So this gives us a nice index for all variables over all NetCDF files.  However, since we have tens of thousands of ncfiles, generating this index can be slow.  To improve performance, we can use a dask bag.

In [None]:
bag = dask.bag.from_sequence([_[0] for _ in ncfiles],npartitions=1000)
rows = bag.map(cosima_cookbook.index_ncfile)

In [None]:
futures = client.compute(rows)

In [None]:
futures

In [None]:
import dataset

In [None]:
progress(futures)

To actually get the computation to occur, we can convert the bag into a list.  This takes a few minutes.

As before, we convert the list of lists to a single list

Finally, we can put this all into a pandas DataFrame

__Runs__ may change which variables they save, and at what temporal resolution, over the course of an __experiment__.  Rather than trying to enumerate the variables _a priori_, we can use a data discovery approach based on a glob.

In [None]:
directoriesToSearch = ['/g/data3/hh5/tmp/cosima/', 
                      ]

In [None]:
import netCDF4

In [None]:
import dataset
import re
import os
import fnmatch

Build index of all NetCDF files found in directories to search. 

In [None]:
%%time
m = re.compile(r'.*\.nc$')  # raw string so \. is passed to the regex engine unchanged

ncfiles = []
for directoryToSearch in directoriesToSearch:
    for root, dirs, filenames in os.walk(directoryToSearch):
        for filename in filenames:
            if m.match(filename) is not None:
                ncfiles.append(os.path.join(root, filename))

print(len(ncfiles))
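An equivalent search can be written with `pathlib`. The sketch below is an illustration on a temporary directory tree created on the fly (the directory and file names are invented), so it runs without access to the real data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build a tiny, made-up configuration/experiment/run tree
    run_dir = root / 'mom01v5' / 'KDS75' / 'output000'
    run_dir.mkdir(parents=True)
    (run_dir / 'ocean.nc').touch()
    (run_dir / 'ice.nc').touch()
    (run_dir / 'README.txt').touch()

    # rglob walks the tree recursively, like os.walk plus a filename filter
    found = sorted(p.name for p in root.rglob('*.nc'))

print(found)
```

On a file system with tens of thousands of files, either approach is dominated by the cost of walking the directories, so the choice is mostly one of readability.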

We can persist this index by storing it in a sqlite database placed in a centrally available location.

In [None]:
cosima_cookbook_dir = '/g/data1/v45/cosima-cookbook'
if not os.path.exists(cosima_cookbook_dir):
    os.mkdir(cosima_cookbook_dir)

database_file = '/{}/cosima-cookbook.db'.format(cosima_cookbook_dir)

In [None]:
# os.remove(database_file)

The use of the `dataset` module hides the details of working with SQL directly.

In [None]:
db = dataset.connect('sqlite://' + database_file)
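The number of slashes in the connection string matters. Assuming `dataset` passes the URL through to SQLAlchemy, an absolute path needs four slashes after `sqlite:`, which is why `database_file` is built with an extra leading slash. A quick check of the string, with no database actually opened:

```python
cosima_cookbook_dir = '/g/data1/v45/cosima-cookbook'

# Same construction as in the notebook: note the extra leading slash
database_file = '/{}/cosima-cookbook.db'.format(cosima_cookbook_dir)
url = 'sqlite://' + database_file

print(url)
```

With only three slashes, SQLAlchemy would interpret the path as relative to the current working directory.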

This database contains a single table listing every variable in the NetCDF4 files seen previously.

The above steps are implemented in

In [None]:
df = cosima_cookbook.build_index()

Here are all of the unique experiments found in the data

In [None]:
expts = df.experiment.unique()
expts

In [None]:
expts[7]

In [None]:
list(df)

## Calculation of EKE

Let's choose a specific experiment:

In [None]:
pd.DataFrame(list(db['ncfiles'].distinct('configuration', 'experiment')))

In [None]:
db['ncfiles'].columns

In [None]:
expt = 'KDS75'
expt

To calculate the eddy kinetic energy, we are going to consider only those portions of the simulations that have 5-day average velocities saved, which means directories with `ocean__*.nc` files.
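For reference, a common definition of eddy kinetic energy per unit mass, which we assume is the one intended here, splits each horizontal velocity component into a time mean and a deviation from that mean:

$$
u' = u - \bar{u}, \qquad v' = v - \bar{v}, \qquad \mathrm{EKE} = \tfrac{1}{2}\left(\overline{u'^2} + \overline{v'^2}\right)
$$

Here the overbar denotes a time average over the chosen period, and the saved 5-day average velocities play the role of $u$ and $v$.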

In [None]:
!ls {datadir}

The data directory contains several model __configurations__ (e.g. mom01v5 or mom025).

In [None]:
!ls {datadir}/mom01v5

Each configuration contains a number of __experiments__ (e.g. KDS75 or KDS75_wind).

In [None]:
!ls {datadir}/mom01v5/KDS75_salt10days

Each experiment is in turn made up of several __runs__ (e.g. output266).

In [None]:
!ls {datadir}/mom01v5/KDS75_salt10days/output266

The actual model output is stored in NetCDF4 files (denoted by the extension .nc).

In [None]:
fn = os.path.join(datadir, 'mom01v5/KDS75_salt10days/output266/ocean.nc')
xr.open_dataset(fn)

There are many, many such NetCDF4 files.

In [None]:
!find {datadir} -name '*.nc' | wc

In [None]:
import dask
import distributed

By default, we create a collection of workers -- one worker for each core. The memory limit is set to 70% of the total memory of the node. Beyond that, distributed will start caching results locally.

In [None]:
client = distributed.Client()
client

You see above that there is a URL for the Dashboard. This is a very useful tool for inspecting the progress of a dask
calculation. If you are running the VDI over VNC you should be able to click on the link to make the dashboard open in another tab.  

If you are running this notebook over an SSH tunnel, you will also have to tunnel the port for the dashboard to your local machine. Here's some code which generates the needed command string.  Run that command on your local machine. Then the link above should work.

In [None]:
import os

params = {'host': os.environ['HOSTNAME'],
          'user': os.environ['USER'],
          'port': client.scheduler_info()['services']['bokeh']}

tunnel_cmd = "ssh {host}.nci.org.au -l {user} -L {port}:127.0.0.1:{port}".format(**params)
print(tunnel_cmd)

### Organization of the model data

By default, all of the model output is assumed to be stored in the directory given by

In [None]:
df = pd.DataFrame.from_records(rows)

In [None]:
import cosima_cookbook.netcdf_index
cosima_cookbook.netcdf_index.directoriesToSearch

This global variable may be changed if needed.



In [None]:
datadir = cosima_cookbook.netcdf_index.directoriesToSearch[0]

In [None]:
files_already_seen = set([_['ncfile'] for _ in db['ncfiles'].distinct('ncfile')])
print(len(files_already_seen))

NetCDF files found on disk not seen before:

In [None]:
res = db.query('SELECT ncfile FROM ncfiles \
                WHERE experiment = "KDS75" \
                AND basename LIKE "%ocean__%" \
                AND variable = "u" \
                ORDER BY ncfile \
               ')
ncfiles = [row['ncfile'] for row in res]
ncfiles
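One subtlety in the query above: in SQL `LIKE` patterns, `_` matches any single character, so `%ocean__%` also matches names such as `ocean.nc`. If only names containing a literal double underscore are wanted, the underscores can be escaped. The sketch below demonstrates this with the standard library's `sqlite3` module on an in-memory table; the table contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ncfiles (basename TEXT, variable TEXT)')
conn.executemany(
    'INSERT INTO ncfiles VALUES (?, ?)',
    [('ocean.nc', 'u'), ('ocean__0001_001.nc', 'u'), ('ice.nc', 'u')],  # made-up rows
)

# Unescaped: each "_" is a one-character wildcard
loose = [r[0] for r in conn.execute(
    "SELECT basename FROM ncfiles WHERE basename LIKE ? ORDER BY basename",
    ('%ocean__%',))]

# Escaped: "\_" matches a literal underscore
strict = [r[0] for r in conn.execute(
    "SELECT basename FROM ncfiles WHERE basename LIKE ? ESCAPE '\\' ORDER BY basename",
    ('%ocean\\_\\_%',))]

conn.close()
print(loose, strict)
```

Passing the pattern as a bound parameter (`?`) also avoids having to quote string values inside the SQL text.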

Using our index of ncfiles, we search for such nc files.

In [None]:
files_to_add = set(ncfiles) - set(files_already_seen)
print(len(files_to_add))
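The same set arithmetic also answers the reverse question, namely which files are recorded in the index but no longer exist on disk. A small sketch with invented file names:

```python
# Made-up file names standing in for real paths
files_on_disk = {'a.nc', 'b.nc', 'c.nc'}
files_already_seen = {'b.nc', 'c.nc', 'd.nc'}

# On disk but not yet indexed: these need to be added
files_to_add = files_on_disk - files_already_seen

# Indexed but no longer on disk: candidates for removal from the index
files_missing = files_already_seen - files_on_disk

print(sorted(files_to_add), sorted(files_missing))
```

Set difference is not symmetric, so the order of the operands determines which of the two questions is being asked.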

For these new files, we can determine their configuration, experiment, and run. We use netCDF4 to get a list of all variables in each file.