# xarray Tutorial

*Author: Creare* <br>
*Date: April 01 2020* <br>

**Keywords**: xarray

## Overview

A short tutorial on [`xarray`](http://xarray.pydata.org/en/stable/), the core underlying data structure of PODPAC.

### Prerequisites

- Python 3.6 or above
- [`xarray`](http://xarray.pydata.org/en/stable/)
- *Review the [README.md](../README.md) and [jupyter-tutorial.ipynb](jupyter-tutorial.ipynb) for additional info on using jupyter notebooks*


### See Also

- [xarray quick overview](https://xarray.pydata.org/en/stable/quick-overview.html)

## Labeled arrays using [xarray](http://xarray.pydata.org/en/stable/)

PODPAC Nodes return their output as arrays from the Python library `xarray`. `xarray` uses "labeled" arrays, which can be confusing to new users. A labeled array attaches a name to each dimension of an array, and coordinate values along each dimension.

For example, data in a 2-D array might have different rows related to latitudes, and different columns related to longitudes. `xarray` explicitly adds this information to the array. This has a number of advantages:

* Arrays are automatically aligned. If I store my data as latitude=rows and longitude=columns, but someone else stores it as latitude=columns and longitude=rows, then `xarray` will automatically transpose one of these arrays when doing math with them.
* Arrays are automatically broadcast. If I add a 2-D array with latitude and longitude coordinates to a 3-D array with latitude, longitude, and time coordinates, `xarray` will automatically broadcast the 2-D array, repeating it for each time point.
* Operations can be done by dimension name instead of axis number. To take the mean over the 'time' dimension, you pass `dim='time'`. You no longer have to remember whether it was the first, last, or some other axis in your array.
* Data can be accessed via dimension name. Again, instead of remembering the axis, you can subset or slice the data by dimension name.
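The four points above can be sketched in a few lines. The coordinate values and the variable names (`grid`, `grid_t`, `cube`, and so on) are made up for illustration and are not part of PODPAC.

```python
import numpy as np
import xarray as xr

# Illustrative coordinates (assumed values, not from PODPAC)
lat = [10.0, 20.0]
lon = [100.0, 110.0, 120.0]
time = ["2018-01-01", "2018-01-02"]

# A 2-D array stored as (lat, lon)
grid = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=["lat", "lon"],
    coords={"lat": lat, "lon": lon},
)

# The same data stored as (lon, lat)
grid_t = grid.transpose("lon", "lat")

# Alignment: dimensions are matched by name, not by axis order,
# so the difference is zero everywhere
diff = grid - grid_t

# Broadcasting: the 2-D array is repeated along the 'time' dimension
cube = xr.DataArray(
    np.ones((2, 3, 2)),
    dims=["lat", "lon", "time"],
    coords={"lat": lat, "lon": lon, "time": time},
)
total = cube + grid

# Reduce by dimension name instead of axis number
time_mean = total.mean(dim="time")

# Select by coordinate value instead of integer position
point = grid.sel(lat=20.0, lon=110.0)
```

Note that `grid - grid_t` works without any explicit transpose, which is the behavior the first bullet describes.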

While `xarray` offers many advantages over raw `NumPy` arrays, there are a few caveats and drawbacks. For example, because `xarray` automatically aligns coordinates, taking the difference between two arrays with different time coordinates requires extra care:

In [1]:
import xarray as xr

# create a labeled array
a = xr.DataArray([2018.1, 2018.2], dims=['time'], coords=[['2018-01-01', '2018-01-02']])
a

In [2]:
# create another labeled array with different time coordinate
b = xr.DataArray([2018.3, 2018.4], dims=['time'], coords=[['2018-01-03', '2018-01-04']])
b

In [3]:
# take the difference between the two arrays
# The result is an empty array, because none of the coordinates align
c = a - b
c
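The empty result comes from how `xarray` aligns labels in arithmetic: by default it keeps only the labels that both arrays share (an inner join). As a minimal sketch, `xr.align` with `join="outer"` shows the alternative of keeping every label from both arrays and filling the gaps with NaN. The names `empty`, `a_outer`, and `b_outer` are illustrative.

```python
import xarray as xr

a = xr.DataArray([2018.1, 2018.2], dims=["time"],
                 coords={"time": ["2018-01-01", "2018-01-02"]})
b = xr.DataArray([2018.3, 2018.4], dims=["time"],
                 coords={"time": ["2018-01-03", "2018-01-04"]})

# Default arithmetic uses an inner join: only shared time labels survive,
# and here there are none
empty = a - b

# An outer join keeps every time label from both arrays;
# positions with no data are filled with NaN
a_outer, b_outer = xr.align(a, b, join="outer")
```

Subtracting `a_outer` and `b_outer` would then yield NaN at every time, since no label has data in both arrays.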

In [4]:
# One way to do this with xarray is to index into the time dimension, which removes it
b_0 = b[0]  # b_0 is now a 0-dimensional (scalar) array

# Now we can take the difference
c = a - b_0
c

In [5]:
# Alternatively, select the value by its time coordinate
c = a - b.sel(time='2018-01-03')
c

Fortunately, if you prefer raw arrays, the underlying `NumPy` array can always be accessed via the `data` attribute.

In [6]:
a.data

array([2018.1, 2018.2])
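The `data` attribute also gives a way to take an element-by-element difference between arrays whose time coordinates do not match. This is a sketch of one possible approach, with `c` and `raw` as illustrative names. It assumes that the two arrays have the same shape and that ignoring the mismatched labels is what you actually want.

```python
import xarray as xr

a = xr.DataArray([2018.1, 2018.2], dims=["time"],
                 coords={"time": ["2018-01-01", "2018-01-02"]})
b = xr.DataArray([2018.3, 2018.4], dims=["time"],
                 coords={"time": ["2018-01-03", "2018-01-04"]})

# Subtract b's raw values by position; no label alignment takes place,
# and the result keeps a's time coordinates
c = a - b.data

# Or work entirely with raw NumPy arrays, dropping all labels
raw = a.data - b.data
```

Because the labels are bypassed, it is up to you to make sure the two arrays line up in the way you intend.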