# Inspecting NetCDF data using xarray

This notebook provides some basic examples of opening a NetCDF dataset using the xarray package. Many of the cells are commented out because they need variable names from the dataset you are inspecting. Substitute the appropriate variable, coordinate, or attribute names and uncomment to run those cells.

This notebook is based on an excellent tutorial notebook by Kristen Thyng and Rob Hetland in their [python4geosciences GitHub repository](https://github.com/kthyng/python4geosciences). See also [this xarray tutorial by Anderson Banihirwe](https://github.com/andersy005/xarray-tutorial), which is also available on [YouTube](https://www.youtube.com/watch?v=Ss4ryKukhi4&t=145s).


`xarray` expands the utility of the time series analysis package `pandas` into more than one dimension. It is actively being developed in conjunction with many other packages under the [Pangeo](https://pangeo.io/) umbrella. For example, you can run with Dask to use multiple cores on your laptop when you are working with data read in with `xarray`.

NetCDF is a binary storage format for many different kinds of rectangular data. Examples include atmosphere and ocean model output, satellite images, and timeseries data. NetCDF files are intended to be device independent, and the dataset may be queried in a fast, random-access way. More information about NetCDF files can be found [here](http://www.unidata.ucar.edu/software/netcdf/). The [CF conventions](http://cfconventions.org) are used for storing NetCDF data for earth system models, so that programs can be aware of the coordinate axes used by the data cubes.

We will read the NetCDF file using `xarray`.

This template works with datasets that have schema.org metadata registered with the EarthCube GeoCODES catalog, and have a valid URL Dataset/distribution/contentURL with an associated ./encodingFormat value of 'application/x-netcdf'. This is a fall-through demonstration for NetCDF encoded datasets that do not self-identify in the metadata as conforming to a more specific NetCDF profile. 

In [None]:
# Cell tagged 'parameters' for papermill (github.com/nteract/papermill), which injects values into this template; the result is then posted as a gist runnable in Colab
url,ext,urn=None,None,None

# Parameters
# these parameters are passed from the GeoCODES Search interface;
# assign default values:
#url = "http://cmore.soest.hawaii.edu/cmoredata/Doney/3D/CMORE_NPAC_BEC.gx3.22.anthro.cv2.1959.nc"
url = "http://cmore.soest.hawaii.edu/cmoredata/Doney/3D/CMORE_NPAC_BEC.gx3.22.anthro.cv2.1982.nc"
ext = ""
urn = ""

In [None]:
import numpy as np
import netCDF4
import matplotlib.pyplot as plt
%matplotlib inline
import cartopy
#import cmocean.cm as cmo
import pandas as pd
import requests 
import xarray as xr
import cftime


In [None]:
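def testurl(theurl):
    """Return True if the URL answers with HTTP 200, trying HEAD before GET."""
    try:
        # try HEAD first in case the response document is big
        r = requests.head(theurl, timeout=30)
        if r.status_code == requests.codes.ok:
            return True
        # fall back to GET in case the server has an incomplete HTTP implementation
        r = requests.get(theurl, stream=True, timeout=30)
        r.close()
    except requests.RequestException as err:
        print('request failed:', err)
        return False
    # content-length is optional, so do not assume the header is present
    print('content size:', r.headers.get('content-length', 'unknown'))
    if r.status_code == requests.codes.ok:
        return True
    print('status code: ', r.status_code)
    return False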

In [None]:
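# download the file at the URL and save it locally with a .nc extension

if testurl(url):
    response = requests.get(url, allow_redirects=True)
    # use a context manager so the file is flushed and closed before xarray opens it
    with open('temp.nc', 'wb') as f:
        f.write(response.content)
else:
    print('url ', url, 'not responding')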

## Open dataset
We'll use the `xarray` package to read this file, which the previous cell saved as `temp.nc` in the current working directory.

One of the useful things about `xarray` is that it doesn't deal with the numbers in the file until it has to. This is called "lazy evaluation". It will note the operations you want done, but won't actually perform them until it needs to spit out numbers.

Viewing metadata is instantaneous since no calculations need to be done, even if the file is huge.

An xarray data object is a "dataset" or "data array".
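As a rough illustration of the two object types, the sketch below builds a small in-memory dataset. The names `toy` and `temp` and all of the values are made up for illustration; they are not part of the file opened in this notebook.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: 3 time steps on a 2 x 4 latitude/longitude grid
temp = xr.DataArray(
    np.arange(24, dtype=float).reshape(3, 2, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": [0, 1, 2],
        "lat": [10.0, 20.0],
        "lon": [0.0, 90.0, 180.0, 270.0],
    },
    name="temp",
    attrs={"units": "degC"},
)

# A Dataset is a dict-like container of DataArrays that share coordinates
toy = xr.Dataset({"temp": temp})

print(type(toy).__name__, type(toy["temp"]).__name__)
print(dict(toy.sizes))         # dimension names and lengths
print(toy["temp"].attrs)       # per-variable metadata
```

Indexing a dataset by variable name, as in `toy["temp"]`, returns a data array that keeps its dimensions, coordinates, and attributes.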

In [None]:
ds = xr.open_dataset('temp.nc',decode_times=False)

# look at overview of metadata for file
ds

In [None]:
# variables that are in our dataset
ds.data_vars

In [None]:
# dataset dimensions
ds.dims

In [None]:
# dataset coordinates
ds.coords

In [None]:
# dataset global attributes
ds.attrs

In [None]:
# look at shape and units for a variable
# copy in a variable name for {varname}
# ds.{varname}.shape, ds.{varname}.units

In [None]:
# view a single global attribute (here 'history'; returns None if the file does not define it)
ds.attrs.get('history')

## Extract numbers

Note that you can always extract the actual numbers from a call to your dataset by adding `.values` at the end. Be careful when you use this since it might be a lot of information. Always check the metadata without `.values` first to see how large the arrays you'll be reading in are.
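One way to follow this advice is sketched below: check the shape and memory footprint from the metadata, then call `.values` on a small subset only. The array `da` is a hypothetical stand-in for something like `ds[thevar]`.

```python
import numpy as np
import xarray as xr

# Hypothetical array standing in for a variable from your dataset
da = xr.DataArray(np.zeros((12, 50, 100)), dims=("time", "lat", "lon"))

# Inspect the size using metadata only
print(da.shape, da.dtype, da.nbytes / 1e6, "MB")

# Extract numbers for a small subset rather than for the whole array
subset = da.isel(time=0, lat=slice(0, 5)).values
print(type(subset).__name__, subset.shape)
```

The result of `.values` is a plain NumPy array, so the coordinates and attributes are no longer attached to it.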

In [None]:
# pick a variable from the ds.data_vars list (replace "HMXL" with a variable name from your dataset)
thevar = "HMXL"

In [None]:
# Extract the chosen variable as a DataArray
ds[thevar]

In [None]:
# The actual array data
ds[thevar].data

In [None]:
# The coordinates of the data array
ds[thevar].coords

In [None]:
# dataarray attributes
ds[thevar].attrs

## Select data

Extract data from `xarray` datasets using `.sel` and `.isel`. `.sel` selects by coordinate label (the coordinate values along a dimension), while `.isel` selects by integer index position along the dimension.
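The difference can be seen on a small made-up array, where the same element is picked once by label and once by position. The dimension names `depth` and `lon` are assumptions chosen for illustration.

```python
import xarray as xr

# Hypothetical 2 x 3 array with labelled coordinates
da = xr.DataArray(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    dims=("depth", "lon"),
    coords={"depth": [5.0, 15.0], "lon": [100.0, 110.0, 120.0]},
)

by_label = da.sel(depth=15.0, lon=110.0)   # uses coordinate values
by_index = da.isel(depth=1, lon=1)         # uses integer positions

print(by_label.item(), by_index.item())
```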

When a file is read in, its data arrays become data variables, and the arrays that locate them along each dimension are called "coordinates". For example, in the present dataset, we have the following coordinates:

In [None]:
ds.coords

We also have the following data variables, which are the main data of the file:

In [None]:
ds.data_vars

You should subselect from a data variable with respect to its coordinates. We can select on any number of the coordinates the variable is defined on, from none up to all of them. In the following cell, the index coordinates for the selected variable are indicated with an asterisk:

In [None]:
#ds.{datavar}.coords
ds[thevar].coords

### Selection by label 


The `.sel()` method is the primary access method for **purely coordinate label based indexing**. The following are valid inputs:

- A single coordinate label e.g. `time="2021-03-01"`
- A list or array of coordinate labels e.g. `time=["2021-01-01", "2021-03-10", "2021-03-12"]`
- A slice object with coordinate labels e.g. `time=slice("2021-01-01", "2021-03-01")`.  (Note that contrary to usual Python slices, both the start and the stop are included, when present in the index!)
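The three kinds of input listed above can be tried on a small synthetic series. This is only a sketch: the monthly `time` coordinate and the values are invented, and the list of labels is converted with `pd.to_datetime` so that it is compared as datetimes.

```python
import pandas as pd
import xarray as xr

# Hypothetical monthly series: Jan, Feb, Mar, Apr 2021
times = pd.date_range("2021-01-01", periods=4, freq="MS")
da = xr.DataArray([10.0, 20.0, 30.0, 40.0], dims=("time",), coords={"time": times})

single = da.sel(time="2021-03-01")                                   # one label
several = da.sel(time=pd.to_datetime(["2021-01-01", "2021-04-01"]))  # array of labels
window = da.sel(time=slice("2021-01-01", "2021-03-01"))              # inclusive slice

print(single.item())
print(several.values)
print(window.values)   # includes the value at the slice stop, 2021-03-01
```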

We'll start with a small example: let's plot a data series with a single coordinate. 
Choose one of the coordinates to select on and substitute for {thecoord} in the cell below.

If the coordinate holds floating point numbers, it's useful to pass `method="nearest"` to avoid precision problems. See the [xarray documentation](https://xarray.pydata.org/en/stable/generated/xarray.DataArray.sel.html) for more on usage of the `method` and `tolerance` parameters in `.sel()`. 

In [None]:
#ds[thevar].sel({thecoord}=7.151e+05,  method="nearest")
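The precision problem mentioned above can be reproduced with a made-up coordinate `x`. In this sketch an exact lookup of `0.3` fails because the stored value is `0.30000000000000004`, while `method="nearest"` finds it; `tolerance` then limits how far away the nearest match may be.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinate: the last value is 0.30000000000000004, not 0.3
x = np.arange(4) * 0.1
da = xr.DataArray([0.0, 1.0, 2.0, 3.0], dims=("x",), coords={"x": x})

try:
    da.sel(x=0.3)            # exact match required
    exact_ok = True
except KeyError:
    exact_ok = False

nearest = da.sel(x=0.3, method="nearest")

try:
    # the closest coordinate is about 0.04 away, which exceeds the tolerance
    da.sel(x=0.34, method="nearest", tolerance=0.01)
    within_tolerance = True
except KeyError:
    within_tolerance = False

print(exact_ok, nearest.item(), within_tolerance)
```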

Now let's plot it! 
This plot assumes the x axis is longitude and the y axis is latitude. Substitute the appropriate coordinate names from your data for {long} and {lat}. Use the selection from the previous cell as the independent variable.

Note that we are using `cartopy` to plot our maps. The map itself is drawn in the projection given when the axes are created (the `proj` variable), while the `transform` keyword argument tells cartopy which coordinate system the data are in. Here the data are in longitude and latitude, so we pass a PlateCarree projection (`pc`), which maps meridians to vertical straight lines of constant spacing, and circles of latitude to horizontal straight lines of constant spacing.

Documentation on functions used here:

[matplotlib pyplot module](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html)

[cartopy](https://scitools.org.uk/cartopy/docs/latest/reference/index.html)

In [None]:
proj = cartopy.crs.Mollweide(central_longitude=180)
pc = cartopy.crs.PlateCarree()
#plt is matplotlib pyplot module
fig = plt.figure(figsize=(20,15))

ax = fig.add_subplot(111, projection=proj)
mappable = ax.contourf(ds.TLONG, ds.TLAT, ds[thevar].sel(time=7.151e+05,  method="nearest"), 10,  transform=pc)

## Basic plotting via `.plot()`

Another approach to generating plots is xarray's built-in plotting.

Xarray provides a `.plot()` method on `DataArray` and `Dataset`. This method is a wrapper around Matplotlib's plotting functions. xarray will automatically guess the type of plot based on the dimensionality of the data. By default `.plot()` creates:

- a **line** plot for 1-D arrays using `matplotlib.pyplot.plot()`
- a **pcolormesh** plot for 2-D arrays using `matplotlib.pyplot.pcolormesh()`
- a **histogram** for everything else (more than 2 dimensions) using `matplotlib.pyplot.hist()`
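The defaults listed above can be checked on a synthetic array. In this sketch the array `cube` and its dimension names are invented, and the `Agg` backend is selected only so the example runs without a display; leave that line out inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

# Hypothetical 3-D array of random numbers
cube = xr.DataArray(
    np.random.default_rng(0).random((3, 4, 5)),
    dims=("time", "lat", "lon"),
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
line = cube.isel(lat=0, lon=0).plot(ax=axes[0])  # 1-D selection: line plot
mesh = cube.isel(time=0).plot(ax=axes[1])        # 2-D selection: pcolormesh
hist = cube.plot(ax=axes[2])                     # 3-D array: histogram

print(type(line[0]).__name__, type(mesh).__name__)
```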

In [None]:
#Selecting the long and latitude by array index integers:
ds[thevar].isel(nlon=3,nlat=6).plot(marker="o", size=6)

In [None]:
ds[thevar].isel(nlon=3).plot()

We can either select by coordinate value, such as in the following cell where we choose all times between (and including) the years 1900 and 1950, longitudes between 260 and 280 degrees, and latitudes between 16 and 30 degrees. (Substitute appropriate variable names for your dataset and uncomment to test.)

In [None]:
#ds.{variable}.sel(time=slice('1900','1950'), lon=slice(-100+360, -80+360), lat=slice(30,16))

... or by index, such as in the following cell where we select the first index along time, longitude, and latitude:

In [None]:
#ds.{variable}.isel(time=0, lon=0, lat=0)
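Two details of the label-based slice above are worth a small sketch: longitudes given in degrees west can be converted to a 0 to 360 convention with `% 360`, and a slice has to follow the order in which the coordinate is stored. The grid below is made up, with latitude assumed to be stored from north to south.

```python
import numpy as np
import xarray as xr

lat = [40.0, 30.0, 20.0, 10.0]               # hypothetical, stored descending
lon = [250.0, 260.0, 270.0, 280.0, 290.0]    # hypothetical, degrees east in 0..360
da = xr.DataArray(np.zeros((4, 5)), dims=("lat", "lon"), coords={"lat": lat, "lon": lon})

west, east = -100 % 360, -80 % 360           # 260 and 280

# latitude is descending, so the slice runs from the larger to the smaller value
box = da.sel(lon=slice(west, east), lat=slice(30, 16))

# slicing against the storage order selects nothing
empty = da.sel(lat=slice(16, 30))

print(box.lat.values, box.lon.values, empty.sizes["lat"])
```

If your dataset stores latitude in ascending order, reverse the bounds to `slice(16, 30)`.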

## Calculations

You can do basic operations using `xarray`, such as taking the mean. Pass the name of the dimension or dimensions you want the operation applied over in the function call.

In [None]:
#ds[{variable}].mean('time')

In [None]:
#ds[{variable}].sum(('lat','lon'))
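A small synthetic example shows how reducing over named dimensions changes the shape of the result. The array and its dimension names are assumptions for illustration, and the dimensions are passed here as a list.

```python
import numpy as np
import xarray as xr

# Hypothetical array: 2 time steps on a 3 x 4 grid, holding the values 0..23
da = xr.DataArray(
    np.arange(24, dtype=float).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
)

time_mean = da.mean("time")           # remaining dims: lat, lon
area_sum = da.sum(["lat", "lon"])     # remaining dim: time

print(time_mean.dims, area_sum.dims)
print(area_sum.values)
```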

### THREDDS example. Loading data from a remote dataset.

The netCDF library can be compiled with OPeNDAP (DAP) support, which means that you can put in a URL instead of a filename. This allows access to large remote datasets, such as those published by THREDDS servers, without having to download the entire file. You can find a large list of datasets served via an OPeNDAP/THREDDS server [here](http://apdrc.soest.hawaii.edu/data/data.php).

Let's look at the ESRL/NOAA 20th Century Reanalysis – Version 2. You can access the data by the following link (this is the link of the `.dds` and `.das` files without the extension.):

In [None]:
loc = 'http://apdrc.soest.hawaii.edu/dods/public_data/Reanalysis_Data/NOAA_20th_Century/V2c/daily/monolevel/cprat'
ds2 = xr.open_dataset(loc)
ds2

In [None]:
ds2['cprat'].long_name

In [None]:
proj = cartopy.crs.Sinusoidal(central_longitude=180)

fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
ax.coastlines(linewidth=0.25)
# use the last time available
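# cmocean is not imported above, so use a built-in matplotlib colormap;
# switch to cmap=cmo.tempo if you uncomment the cmocean import
mappable = ax.contourf(ds2.lon, ds2.lat, ds2.cprat.isel(time=-1), 20, cmap='YlGnBu', transform=pc)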
ax.set_title(pd.Timestamp(ds2.time[-1].values).isoformat()[:10])  # or use .strftime instead of .isoformat
fig.colorbar(mappable).set_label('%s' % ds2['cprat'].long_name)

Note that you can also just plot against the included coordinates with built-in convenience functions (this is analogous to `pandas`, which does the same for one dimension). In the first cell below the variable is plotted directly against longitude and latitude, which flattens it out; the second cell draws it on a map projection.

In [None]:
# neither dataset opened in this notebook has an 'sst' variable, so use cprat from ds2
ds2.cprat.isel(time=-1).plot()  # plotted directly against lon and lat

In [None]:
proj = cartopy.crs.Mollweide(central_longitude=180)
fig = plt.figure(figsize=(14,6))
ax = fig.add_subplot(111, projection=proj)
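# neither dataset opened in this notebook has an 'sst' variable, so use cprat from ds2
ds2.cprat.isel(time=-1).plot(ax=ax, transform=pc)  # transform: the coordinate system of the data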

## GroupBy

You can use the `groupby` method to do some neat things. Let's group by an attribute {attribute} on a variable {var} and save a new file.

In [None]:
#some_mean = ds.groupby('{var}.{attribute}').mean('{var}')
#some_mean
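As a concrete sketch of the pattern above, the example below groups a synthetic daily series by the `month` component of its `time` coordinate. The dataset `toy`, the variable `temp`, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical daily series: 1.0 throughout January, 3.0 throughout February
times = pd.date_range("2021-01-01", "2021-02-28", freq="D")
values = np.where(times.month == 1, 1.0, 3.0)
toy = xr.Dataset({"temp": (("time",), values)}, coords={"time": times})

# 'time.month' groups the data by the month of each time stamp
monthly_mean = toy.groupby("time.month").mean("time")

print(monthly_mean["month"].values, monthly_mean["temp"].values)
```

The result replaces the `time` dimension with a `month` dimension holding one entry per group.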

## Saving NetCDF files

Creating netCDF files is tedious if doing it from scratch, but it is very easy when starting from data that has been read in using `xarray`.

In [None]:
fname = 'test.nc'
# some_mean is defined in the GroupBy cell above; uncomment and run that cell first
some_mean.to_netcdf(fname)

In [None]:
xr.open_dataset(fname)