# Exploring netCDF Files
Notebook copied from https://salishsea-meopar-docs.readthedocs.io/en/latest/work_env/python_notes.html and adjusted for the ClimateNet (https://gmd.copernicus.org/articles/14/107/2021/) data set.

This notebook provides discussion, examples, and best practices for working with netCDF files in Python.
Topics include:

* The [`netcdf4-python`](http://unidata.github.io/netcdf4-python/) library
* The [`salishsea_tools.nc_tools`](http://salishsea-meopar-tools.readthedocs.org/en/latest/SalishSeaTools/salishsea-tools.html#module-nc_tools) code module
* Reading netCDF files into Python data structures
* Exploring netCDF dataset dimensions, variables, and attributes
* Working with netCDF variable data as [NumPy](http://www.numpy.org/) arrays

The [`netcdf4-python`](http://unidata.github.io/netcdf4-python/) library
does all of the heavy lifting to let us work with netCDF files and their data.
Follow the link to get to the library documentation.
The [salishsea_tools.nc_tools](http://salishsea-meopar-tools.readthedocs.org/en/latest/SalishSeaTools/salishsea-tools.html#module-nc_tools) code module provides some shortcut functions for exploring netCDF datasets.
Let's go ahead and import those two packages.
We'll also import `numpy` because we're going to use it later and it's good Python form
to keep all of our imports at the top of the file.

This notebook assumes that you are working in Python 3.
If you don't have a Python 3 environment set up,
please see our
[Anaconda Python Distribution](http://salishsea-meopar-docs.readthedocs.org/en/latest/work_env/anaconda_python.html)
docs for instructions on how to set one up.

In [None]:
import netCDF4 as nc
import numpy as np

from salishsea_tools import nc_tools

Note that:

* By convention, we alias `netCDF4` to `nc` and `numpy` to `np`
so that we don't have to type as much
* For the same reason we use the `from ... import ...` form to get `nc_tools`
so that we can avoid typing `salishsea_tools.nc_tools` everywhere

`netCDF4` provides a `Dataset` object that allows us to load the contents
of a netCDF file into a Python data structure by simply passing in the
path and file name.
Let's explore one of the ClimateNet training data files:

In [None]:
ds = nc.Dataset('/mnt/data/ai4good/climatenet_new/train/data-2000-12-20-01-1_5.nc')

netCDF files are organized around 4 big concepts:

* groups
* dimensions
* variables
* attributes

We won't need netCDF groups for the ClimateNet data, so we'll ignore them.

`nc_tools` provides useful (convenience) functions to look at the other 3.

In [None]:
nc_tools.show_dimensions(ds)

- 3 dimensions: `lat`, `lon`, `time`

In [None]:
nc_tools.show_variables(ds)

In [None]:
nc_tools.show_dataset_attrs(ds)

netCDF attributes are metadata.
In the case of the dataset attributes they tell us about the dataset as a whole:
how, when, and by whom it was created, how it has been modified, etc.
The meanings of the various attributes and the conventions for them that we use
in the Salish Sea MEOPAR project are documented [elsewhere](http://salishsea-meopar-docs.readthedocs.org/en/latest/code-notes/salishsea-nemo/nemo-forcing/netcdf4.html).
Variables also have attributes and `nc_tools` provides a function to display them too:

In [None]:
nc_tools.show_variable_attrs(ds, 'lat')

In [None]:
nc_tools.show_variable_attrs(ds)

Before we can go further exploring and working with the variables we need to
associate them with Python variable names.
We do that by accessing them by name in the `variables` attribute of our `Dataset` object.
`variables` is a Python `dict`.
We can use any Python variable names we like, so let's shorten them
(being careful not to sacrifice readability for ease of typing):

In [None]:
lons = ds.variables['lon']
lats = ds.variables['lat']
times = ds.variables['time']
tmqs = ds.variables['TMQ']
labels = ds.variables['LABELS']

Our variables are instances of the `netCDF4.Variable` object.
In addition to their attributes, they carry a bunch of other
useful properties and methods that you can read about in the netCDF4-python docs.
Perhaps more importantly, the data associated with the variables
are returned as NumPy arrays when we index them.
So, we can use NumPy indexing and slicing to access the data values.
For instance, let's check the shape of each variable,
and then get the latitudes and longitudes of two opposite corners of the domain:

In [None]:
lons.shape, lats.shape, times.shape, tmqs.shape, labels.shape

In [None]:
print('Latitudes and longitudes of domain corners:')
print('  0, 0:        ', lats[0], lons[0])
print('  y-max, x-max:', lats[-1], lons[-1])

You can also access the entire variable data array, or subsets of it using slicing.
The `[:]` slice notation is a convenient shorthand that means "the entire array".

In [None]:
lats[:]

In [None]:
lons[:]
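
The same indexing extends to multi-dimensional variables such as `TMQ`.
Here is a minimal sketch that uses a small NumPy array as a stand-in and assumes a `(time, lat, lon)` dimension order; check `tmqs.dimensions` for the actual order in your file.

```python
import numpy as np

# Stand-in for a (time, lat, lon) variable; shape and values are illustrative
field = np.arange(1 * 4 * 6, dtype=float).reshape(1, 4, 6)

# 2-D (lat, lon) slab for the first time step
first_time = field[0]

# First 2 latitudes and first 3 longitudes of that slab
corner = field[0, :2, :3]

# Every second longitude, all latitudes
every_other_lon = field[0, :, ::2]

print(first_time.shape, corner.shape, every_other_lon.shape)
print(corner)
```

Applied to a netCDF variable, a slice like `tmqs[0, :2, :3]` reads only the requested subset from the file.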

In some cases, like the bathymetry depths in the Salish Sea NEMO model,
the netCDF variable has a `_FillValue` attribute value that is equal
to some of the values in the variable data.
In that case the data are represented by a [NumPy Masked Array](http://docs.scipy.org/doc/numpy/reference/maskedarray.html) with the
mask applied where the data values equal the `_FillValue`.

You can test whether a variable's data is masked like this:

In [None]:
np.ma.is_masked(labels[:])
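
To see what that masking looks like, here is a minimal NumPy-only sketch.
The fill value `-999.0` and the sample data are made up; `np.ma.masked_equal` is used here to mimic the masking that netCDF4-python applies when it reads a variable whose data contain its `_FillValue`.

```python
import numpy as np

fill_value = -999.0  # hypothetical fill value, for illustration only
raw = np.array([[1.5, fill_value], [fill_value, 4.0]])

# Mask every element that equals the fill value
masked = np.ma.masked_equal(raw, fill_value)

print(masked)
print(masked.mask)                # boolean array, True where masked
print(np.ma.is_masked(masked))    # at least one element is masked
print(np.ma.is_masked(raw))       # a plain ndarray has no masked elements
```

Note that `np.ma.is_masked` reports whether any element is actually masked, not merely whether the array is of the masked array type.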

Masked arrays are useful because calculations on them skip the masked values,
so missing data don't contaminate results such as means or sums.
Also, when masked arrays are plotted the masked values are all plotted
in the same colour (white by default).
We'll see in other example notebooks how this allows us to very easily
plot our bathymetry in a meaningful way,
and use it,
or other values, to mask velocity component, salinity, etc. results so
that they show values only in the water areas of the domain.
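
Reusing one variable's mask on another field can be sketched with NumPy alone.
All of the values below are made up, and the convention that a depth of 0 marks land is an assumption for this example.

```python
import numpy as np

# Made-up 2x3 bathymetry: 0 marks land, positive values are water depths
depths = np.ma.masked_equal(
    np.array([[0.0, 12.0, 30.0], [0.0, 0.0, 45.0]]), 0.0)

# Made-up salinity field on the same grid, with values in every cell
salinity = np.array([[28.0, 29.0, 30.0], [27.0, 28.5, 31.0]])

# Reuse the bathymetry mask so that only the water cells remain visible
water_salinity = np.ma.array(salinity, mask=np.ma.getmaskarray(depths))

print(water_salinity)
print(water_salinity.count())   # number of unmasked (water) cells
print(water_salinity.mean())    # mean over the water cells only
```

Because the mask travels with the array, both statistics and plots of `water_salinity` ignore the land cells.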