<div class="alert alert-success">  
    
-------
# XArray 101 🌍  
-------
* Jupyter and Python Basics
* __Xarray Intro__
* Xarray Advanced
* Vector Data
* Remote Sensing
* Visualization

-------  
</div>

# Working with gridded data: xarray

![xarray](http://xarray.pydata.org/en/stable/_static/dataset-diagram-logo.png)



[Xarray](http://xarray.pydata.org/en/stable/) is one of the most useful packages to know if you work with gridded data.

To cite from their homepage:

>Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw [NumPy](http://www.numpy.org/)-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.
>
>Xarray was inspired by and borrows heavily from [pandas](http://pandas.pydata.org/), the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with [netCDF](http://www.unidata.ucar.edu/software/netcdf) files, which were the source of xarray’s data model, and integrates tightly with [dask](http://dask.org/) for parallel computing.

This is great news: NumPy operations are fast, and pandas sits at the center of the Python data science stack because it is friendly, powerful and flexible. Furthermore, netCDF is a good data format to use: it stores not only multiple variables but also metadata and units, and it is widely used in science and industry. Finally, dask is very helpful if you have to work with large and potentially distributed data. We will have a quick look at dask later in the course. For now it's good to know that xarray can use it transparently when it is installed, for example when you open several files with `open_mfdataset()`.

Let's get a quick overview:

## Basics

Xarray has two core data structures, which build upon and extend the core strengths of NumPy and pandas. Both are fundamentally N-dimensional:

- **DataArray** is a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series.
- **Dataset** is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.
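
To make the pandas analogy concrete, here is a minimal sketch that converts a one-dimensional DataArray to a `pandas.Series` and a small Dataset to a `pandas.DataFrame`. The names `temps`, `flags` and `station_ds` are made up for illustration and do not appear elsewhere in this notebook.

```python
import pandas as pd
import xarray as xr

# hypothetical 1-D DataArray: three values labeled along a dimension called "t"
temps = xr.DataArray([1.5, 2.5, 3.5], dims='t', coords={'t': [10, 20, 30]}, name='temps')

# a 1-D DataArray maps directly onto a pandas.Series indexed by its coordinate
series = temps.to_series()

# a Dataset holds several aligned variables, much like the columns of a DataFrame
station_ds = xr.Dataset({'temps': temps, 'flags': ('t', [0, 1, 0])})
frame = station_ds.to_dataframe()

print(series)
print(frame)
```

With more than one dimension the same conversions still work, but pandas then has to fall back on a MultiIndex, which is where the N-dimensional xarray objects become more convenient.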

The value of attaching labels to numpy’s numpy.ndarray may be fairly obvious, but the dataset may need more motivation. The dataset data model is borrowed from the netCDF file format, which also provides xarray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.

Let's start with some [very basic examples](http://xarray.pydata.org/en/stable/quick-overview.html) to see it in action. We then proceed to a more realistic example.

In [None]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [None]:
import xarray as xr
import numpy as np

In [None]:
# we define a DataArray with 2 dimensions (named x and y) and the coordinate labels 10 and 20 along the x dimension
data = xr.DataArray(np.random.randn(2, 3), dims=('x', 'y'), coords={'x': [10, 20]})
data

In [None]:
# like in pandas, values is a numpy array that you can modify in-place
data.values

In [None]:
data.dims

In [None]:
data.coords

You can store additional meta-data in the `attrs` dictionary:

In [None]:
data.attrs

### Indexing

Like in numpy and pandas, indexing can get pretty complex but is really powerful. This only shows you the very basics. You probably want to read up on it [here](http://xarray.pydata.org/en/stable/indexing.html#indexing).

In xarray, there are four ways to index a DataArray:

In [None]:
# positional and by integer label, like numpy
data[[0, 1]]

In [None]:
# positional and by coordinate label, like pandas
data.loc[10:20]

In [None]:
# by dimension name and integer label
data.isel(x=slice(2))

In [None]:
# by dimension name and coordinate label
data.sel(x=[10, 20])
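
The four styles are interchangeable ways of addressing the same data. As a minimal sketch, the following cell builds a small deterministic array (the name `arr` is hypothetical) and checks that all four styles return the same first row:

```python
import numpy as np
import xarray as xr

# hypothetical example array: 2 x 3, with labels 10 and 20 along x
arr = xr.DataArray(np.arange(6).reshape(2, 3), dims=('x', 'y'), coords={'x': [10, 20]})

first_row = [
    arr[0],         # positional, by integer index
    arr.loc[10],    # positional, by coordinate label
    arr.isel(x=0),  # by dimension name and integer index
    arr.sel(x=10),  # by dimension name and coordinate label
]

# every variant should select the same values and coordinates
same = all(row.equals(first_row[0]) for row in first_row)
print(same)
```

The two name-based variants (`isel` and `sel`) do not depend on the order of the dimensions, which makes them more robust when an array gets transposed.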

### Attributes/ Meta-data

It’s often a good idea to set metadata attributes. A useful choice is to set `data.attrs['long_name']` and `data.attrs['units']` since xarray will use these, if present, to automatically label your plots. These special names were chosen following the [NetCDF Climate and Forecast (CF) Metadata Conventions](http://cfconventions.org/cf-conventions/cf-conventions.html). `attrs` is just a Python dictionary, so you can assign anything you wish.

In [None]:
# assigning attributes to dataarray
data.attrs['long_name'] = 'random velocity'
data.attrs['units'] = 'metres/sec'
data.attrs['description'] = 'A random variable created as an example.'
data.attrs['random_attribute'] = 123
data.attrs

### Computation

Another great feature is that DataArrays work similarly to NumPy ndarrays. Observe:

In [None]:
data + 10

In [None]:
np.sin(data)

In [None]:
data.T

In [None]:
data.sum()

However, aggregation operations can use dimension names instead of axis numbers:

In [None]:
# take the mean over the x-dimension
data.mean(dim='x')
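
A minimal sketch of why this matters, using a hypothetical deterministic array `arr`: the name-based reduction gives the same numbers as the NumPy call, but it keeps working after a transpose, when the axis number of `x` has changed.

```python
import numpy as np
import xarray as xr

# hypothetical 2 x 3 array with named dimensions
arr = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=('x', 'y'))

by_name = arr.mean(dim='x')        # reduce over the dimension called "x"
by_axis = arr.values.mean(axis=0)  # the NumPy equivalent needs the axis position

# transposing moves "x" to axis 1, but the name-based call is unchanged
by_name_t = arr.T.mean(dim='x')

print(by_name.values, by_axis, by_name_t.values)
```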

### GroupBy

Like pandas, xarray supports `groupby` operations (see: [here](http://xarray.pydata.org/en/stable/groupby.html#groupby)).

In [None]:
labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')
labels

In [None]:
# group the y positions by label, then take the mean over y within each group
data.groupby(labels).mean('y')
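
Because `data` holds random numbers, the result is hard to verify by eye. The sketch below repeats the same pattern on a small hand-written array; `arr` and `groups` are hypothetical names chosen for this illustration.

```python
import xarray as xr

# hypothetical 2 x 3 array with explicit coordinates on both dimensions
arr = xr.DataArray(
    [[1, 2, 5], [10, 20, 40]],
    dims=('x', 'y'),
    coords={'x': [10, 20], 'y': [0, 1, 2]},
)

# one label per position along y: positions 0 and 2 belong to group "E"
groups = xr.DataArray(['E', 'F', 'E'], dims='y', coords={'y': [0, 1, 2]}, name='groups')

# mean over y within each group; the result has a new "groups" dimension
group_means = arr.groupby(groups).mean('y')
mean_e = group_means.sel(groups='E')  # averages columns y=0 and y=2
mean_f = group_means.sel(groups='F')  # only column y=1

print(group_means)
```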

### Plotting

You can directly plot on xarray objects (like with pandas).

In [None]:
data.plot();

## Datasets

**xr.Dataset** is a dict-like container of aligned DataArray objects. You can think of it as a multi-dimensional generalization of the **pd.DataFrame**:

In [None]:
ds = xr.Dataset({'foo': data, 'bar': ('x', [1, 2]), 'baz': np.pi})
ds

You can access the individual variables (DataArrays) of a Dataset by key, as with a dictionary, or as attributes:

In [None]:
ds['foo']

In [None]:
ds.foo

### Read/ write netCDF files

NetCDF is the recommended file format for xarray objects. Users from the geosciences will recognize that the Dataset data model looks very similar to a netCDF file (which, in fact, inspired it).
You can directly read and write xarray objects to disk using `to_netcdf()`, `open_dataset()` and `open_dataarray()`.

Later, we will also use another variant. Because datasets are often distributed across multiple files (commonly one file per timestep), xarray supports this use case with the `open_mfdataset()` and `save_mfdataset()` functions. For more, see [Reading and writing files](http://xarray.pydata.org/en/stable/io.html#io) or later notebooks in the course.

In [None]:
ds.to_netcdf('example.nc')

In [None]:
ds2 = xr.open_dataset('example.nc')
ds2

In [None]:
# cleanup
! rm example.nc

## Open multiple files as one

The following example is from: https://rabernat.github.io/research_computing_2018/xarray-tips-and-tricks.html

One thing we love about xarray is the `open_mfdataset()` function, which combines many netCDF files into a single xarray Dataset.
But what if the files are stored on a remote server and accessed over OPeNDAP? An example can be found in NOAA's NCEP Reanalysis catalog.

https://www.esrl.noaa.gov/psd/thredds/catalog/Datasets/ncep.reanalysis/surface/catalog.html

The dataset is split into different files for each variable and year. For example, a single file for surface air temperature looks like:

http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995.1948.nc

In [None]:
# dataset split into different files
base_url = 'http://www.esrl.noaa.gov/psd/thredds/dodsC/Datasets/ncep.reanalysis/surface/air.sig995'
files = [f'{base_url}.{year}.nc' for year in range(1948, 2019)]
files

However, we can open them as if they were a single file!

In [None]:
# might fail due to network
ds = xr.open_mfdataset(files[-10:])
ds
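
Conceptually, the result resembles the per-file datasets concatenated along their shared time dimension. The sketch below imitates this in memory with `xr.concat` on small synthetic datasets. It is only an analogy: `make_year` is a hypothetical helper, the integer time stamps are made up, and `open_mfdataset()` additionally reads the files and loads the data lazily.

```python
import numpy as np
import xarray as xr

def make_year(year):
    # hypothetical stand-in for one yearly file: 4 time steps on a 2 x 3 grid,
    # filled with the year number so the pieces are easy to tell apart
    return xr.Dataset(
        {'air': (('time', 'lat', 'lon'), np.full((4, 2, 3), float(year)))},
        coords={
            'time': [year * 10 + step for step in range(4)],  # made-up time stamps
            'lat': [50.0, 40.0],
            'lon': [0.0, 10.0, 20.0],
        },
    )

yearly = [make_year(year) for year in (2016, 2017, 2018)]

# stitch the yearly pieces together along the time dimension
combined = xr.concat(yearly, dim='time')
print(combined)
```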

We can now operate on them, for example by selecting a region.

In [None]:
# select a region; lat is sliced from 60 down to 20, which assumes it is stored north to south
dseu = ds.sel(lat=slice(60,20), lon=slice(0,30))

ts = dseu.mean(dim=['lat','lon'])
ts
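
Note that the latitude slice runs from 60 down to 20. Label-based slices follow the order in which the coordinate is stored, so this only selects data when latitude is stored from north to south. A minimal sketch on a synthetic grid (the array `field` and its coordinate values are made up) shows the difference:

```python
import numpy as np
import xarray as xr

# hypothetical grid with latitude stored from north to south
lat = [90, 60, 30, 0, -30]
lon = [0, 15, 30, 45]
field = xr.DataArray(
    np.arange(20.0).reshape(5, 4),
    dims=('lat', 'lon'),
    coords={'lat': lat, 'lon': lon},
)

# slice bounds given in the same order as the stored coordinate
region = field.sel(lat=slice(60, 20), lon=slice(0, 30))

# bounds given in the opposite order select nothing
wrong_order = field.sel(lat=slice(20, 60), lon=slice(0, 30))

# unweighted mean over the selected region
region_mean = region.mean(dim=['lat', 'lon'])
print(region.lat.values, wrong_order.sizes['lat'], float(region_mean))
```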

In [None]:
ts.air.plot();

# Some actual Analysis: SST example

Example from: https://rabernat.github.io/research_computing_2018/intermediate-xarray.html  

In [None]:
# we want to use interactive plotting with hvplot
import holoviews as hv
from holoviews.streams import Params
import hvplot.xarray

In [None]:
# get the file from this address... 
! wget http://ldeo.columbia.edu/~rpa/NOAA_NCDC_ERSST_v3b_SST.nc

In [None]:
ds = xr.open_dataset('NOAA_NCDC_ERSST_v3b_SST.nc')
ds

As you can see, the longitudes run from 0 to 360. We can easily convert them to the -180 to 180 convention:

In [None]:
# convert to -180/180
ds.coords['lon'] = (ds.coords['lon'] + 180) % 360 - 180
ds = ds.sortby(ds.lon)

sst = ds.sst
sst
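
To see what the conversion does, here is a minimal sketch on four made-up longitudes. It uses `assign_coords` instead of assigning to `ds.coords` in place; the array `field` is hypothetical.

```python
import xarray as xr

# hypothetical values located at longitudes given in the 0..360 convention
field = xr.DataArray([0, 1, 2, 3], dims='lon', coords={'lon': [0, 90, 180, 270]})

# shift by 180, wrap with modulo 360, shift back: 180 -> -180, 270 -> -90
wrapped = field.assign_coords(lon=(field.lon + 180) % 360 - 180)

# the relabeled coordinate is no longer monotonic, so sort by longitude
wrapped = wrapped.sortby('lon')
print(wrapped)
```

The sort matters: without it, label-based slices such as `sel(lon=slice(-20, 40))` would not work reliably on the relabeled coordinate.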

We can also operate over the dimensions. In the following cell we group by month and compute the mean over the time dimension, creating a monthly climatology. Then we compute the anomaly of each month in the original time series relative to this climatology.

In [None]:
# group by time axis - take the mean of the grouped batches over the time dim
sst_clim = sst.groupby('time.month').mean(dim='time')
# subtract the climatology from the months of every year
sst_anom = sst.groupby('time.month') - sst_clim
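
The same two steps can be checked on a tiny synthetic series. In this sketch (the names `series`, `clim` and `anom` are hypothetical) two years of monthly values increase by one each month, so every month of the first year lies 6 below its climatological mean and every month of the second year lies 6 above it:

```python
import numpy as np
import pandas as pd
import xarray as xr

# hypothetical monthly series: 24 month-start time stamps with values 0..23
time = pd.date_range('2000-01-01', periods=24, freq='MS')
series = xr.DataArray(np.arange(24.0), dims='time', coords={'time': time})

# climatology: one mean value per calendar month (12 values)
clim = series.groupby('time.month').mean(dim='time')

# anomaly: each time step minus the climatology of its calendar month
anom = series.groupby('time.month') - clim

print(clim.values)
print(anom.values)
```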

If we use `groupby` in hvplot, we get a slider that lets us interact with the plot.

In [None]:
sst_anom.hvplot('lon','lat',groupby='time', width=600, cmap='RdBu', clim=(-2,2))

We can also select a point in the dataset. You do not have to specify the exact coordinates of a grid cell; pass `method='nearest'` to pick the closest one.

In [None]:
sst_ref = sst_anom.sel(lon=-160, lat=0, method='nearest')
sst_ref.plot();
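
A minimal sketch of nearest-neighbor selection on a made-up grid (the array `grid` and the query point are hypothetical): an exact lookup of a location between grid cells fails, while `method='nearest'` snaps to the closest cell.

```python
import numpy as np
import xarray as xr

# hypothetical 3 x 4 grid around the equatorial Pacific
grid = xr.DataArray(
    np.arange(12.0).reshape(3, 4),
    dims=('lat', 'lon'),
    coords={'lat': [-10.0, 0.0, 10.0], 'lon': [-170.0, -165.0, -160.0, -155.0]},
)

# an exact lookup fails because (2.7, -161.2) is not a grid point
try:
    grid.sel(lon=-161.2, lat=2.7)
    exact_match_found = True
except KeyError:
    exact_match_found = False

# nearest-neighbor lookup returns the closest grid cell instead
point = grid.sel(lon=-161.2, lat=2.7, method='nearest')
print(float(point.lat), float(point.lon), float(point))
```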

In [None]:
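def covariance(x, y, dims=None):
    # population covariance: mean of the product of the deviations along `dims`
    return xr.dot(x - x.mean(dims), y - y.mean(dims), dims=dims) / x.count(dims)

def correlation(x, y, dims=None):
    # Pearson correlation: covariance normalized by both standard deviations
    return covariance(x, y, dims) / (x.std(dims) * y.std(dims))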

We can then compute the correlation between the anomaly at this point and the anomaly at every grid cell:

In [None]:
sst_cor = correlation(sst_anom, sst_ref, dims='time')
pc = sst_cor.plot()
pc.axes.set_title('Correlation between global SST Anomaly and SST Anomaly at one point');
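
As an alternative to the hand-written `correlation` helper, xarray also provides a built-in `xr.corr` function, assuming your installed version is recent enough to include it. The sketch below applies it to synthetic random data (the names `field` and `ref` are hypothetical, not the SST data) and compares one value against `np.corrcoef`:

```python
import numpy as np
import xarray as xr

# hypothetical data: 50 time steps at 3 grid cells
rng = np.random.default_rng(0)
field = xr.DataArray(rng.normal(size=(50, 3)), dims=('time', 'cell'))

# reference series: the first grid cell
ref = field.isel(cell=0)

# Pearson correlation along time between every cell and the reference series
cor = xr.corr(field, ref, dim='time')

# cross-check the second cell against NumPy
expected = np.corrcoef(field.values[:, 1], ref.values)[0, 1]
print(cor.values, expected)
```

The reference cell correlates perfectly with itself, which is a quick sanity check that also applies to the SST map above: the correlation at the selected point should be 1.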