# Xarray

https://carmengg.github.io/eds-220-book/lectures/lesson-15-xarray.html

Xarray
- Python package
- augments numpy by adding labeled dimensions, coordinates, and attributes
- based on the netCDF data model

Today:
- learn about the xarray.DataArray and the xarray.Dataset

xarray.DataArray:
- primary object of xarray
- n-dimensional array with **labeled** dimensions
- represents a single variable in the netCDF data model: holds the variable's values, dimensions, and attributes

In xarray, each dimension has a set of coordinates
that indicate the dimension's values (like tick labels along the dimension).

## Create an xarray.DataArray

Let’s suppose we want to make an xarray.DataArray that includes the information from our previous example about measuring temperature across three days. First, we import all the necessary libraries.

In [1]:
import pandas as pd
import numpy as np

import xarray as xr   # This is the package we'll explore

## Variable Values

The underlying data in the xarray.DataArray is a numpy.ndarray that holds the variable values. So we can start by making a numpy.array with our mock temperature data:

In [4]:
# values of a single variable at each point of the coords
# resulting shape is (3, 5, 5): 3 time steps, each a 5x5 grid
temp_data = np.array([np.zeros((5,5)), # 5x5 array of zeros
                      np.ones((5,5)), # 5x5 array of ones
                      np.ones((5,5))*2]).astype(int) # 5x5 array of twos; cast everything to int
temp_data

array([[[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]],

       [[1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1]],

       [[2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2],
        [2, 2, 2, 2, 2]]])
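
As a quick sanity check, the shape of the array shows how its axes are ordered. This is a minimal sketch that rebuilds the same mock data; the name `first_grid` is only used here for illustration.

```python
import numpy as np

# Same mock data as above: three 5x5 grids filled with 0s, 1s, and 2s
temp_data = np.array([np.zeros((5, 5)),
                      np.ones((5, 5)),
                      np.ones((5, 5)) * 2]).astype(int)

# The first axis has length 3 (one entry per grid); the other two are the 5x5 grid
print(temp_data.shape)   # (3, 5, 5)
print(temp_data.ndim)    # 3

# Indexing the first axis returns one full 5x5 grid
first_grid = temp_data[0]
print(first_grid.shape)  # (5, 5)
```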

We could think this is “all” we need to represent our data. But if we stopped at this point, we would need to

1. remember that the numbers in this array represent the temperature in degrees Celsius (doesn’t seem too bad),

2. remember that the first dimension of the array represents time, the second latitude and the third longitude (maybe ok), and

3. keep track of the range of values that time, latitude, and longitude take (not so good).

Keeping track of all this information separately could quickly get messy and could make it challenging to share our data and analyses with others. This is what the netCDF data model and xarray aim to simplify. We can get data and its descriptors together in an xarray.DataArray by adding the dimensions over which the variable is being measured and including attributes that appropriately describe dimensions and variables.

## Dimensions and Coordinates

To specify the dimensions of our upcoming xarray.DataArray, we must examine how we’ve constructed the numpy.ndarray holding the temperature data. The diagram in the book shows how the dimensions of temp_data are ordered: 

**the first dimension is time, the second is latitude (rows), and the third is longitude (cols).**

Remember that indexing in 2-dimensional numpy.ndarrays starts at the top-left corner of the array, and it is done by rows first and columns second (like matrices). This is why latitude is the second dimension and longitude the third. From the diagram, we can also see that the coordinates (values of each dimension) are as follows:

- date coordinates are 2022-09-01, 2022-09-02, 2022-09-03
- latitude coordinates are 70, 60, 50, 40, 30 (notice decreasing order)
- longitude coordinates are 60, 70, 80, 90, 100 (notice increasing order)

We add the dimensions as a tuple of strings and coordinates as a dictionary:

In [8]:
# names of the dimensions in the required order
dims = ('time', 'lat', 'lon') # this is the tuple

# create coordinates to use for indexing along each dimension 
# this is a dictionary
coords = {'time' : pd.date_range("2022-09-01", "2022-09-03"),
          'lat' : np.arange(70, 20, -10), # 70 down to 30 in steps of 10 (the stop value 20 is excluded)
          'lon' : np.arange(60, 110, 10)}  # 60 up to 100 in steps of 10 (the stop value 110 is excluded)
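
It can help to print the coordinate values to confirm that they match the list above, since `np.arange` excludes its stop value. In this small sketch, the names `time_values`, `lat_values`, and `lon_values` are illustrative and not part of the lesson.

```python
import numpy as np
import pandas as pd

# Build each coordinate separately so we can inspect it
time_values = pd.date_range("2022-09-01", "2022-09-03")  # daily frequency by default
lat_values = np.arange(70, 20, -10)   # stop value 20 is not included
lon_values = np.arange(60, 110, 10)   # stop value 110 is not included

print(list(time_values.strftime("%Y-%m-%d")))  # ['2022-09-01', '2022-09-02', '2022-09-03']
print(lat_values.tolist())                     # [70, 60, 50, 40, 30]
print(lon_values.tolist())                     # [60, 70, 80, 90, 100]
```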

## Attributes

Next, we add the attributes (metadata) for our temperature data as a dictionary:

In [9]:
# attributes (metadata) of the data array 
attrs = { 'title' : 'temperature across weather stations',
          'standard_name' : 'air_temperature',
          'units' : 'degree_c'}



## Putting It All Together

Finally, we put all these pieces together (data, dimensions, coordinates, and attributes) to create an xarray.DataArray:

In [10]:
# initialize xarray.DataArray
temp = xr.DataArray(data = temp_data, 
                    dims = dims,
                    coords = coords,
                    attrs = attrs)
temp
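
Once the object exists, each of the pieces we passed in can be read back from it. The sketch below rebuilds the same `temp` object so it runs on its own, then inspects its dimensions, shape, coordinates, and attributes.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Rebuild the DataArray from the same data, dimensions, coordinates, and attributes
temp_data = np.array([np.zeros((5, 5)),
                      np.ones((5, 5)),
                      np.ones((5, 5)) * 2]).astype(int)
temp = xr.DataArray(
    data=temp_data,
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2022-09-01", "2022-09-03"),
            "lat": np.arange(70, 20, -10),
            "lon": np.arange(60, 110, 10)},
    attrs={"title": "temperature across weather stations",
           "standard_name": "air_temperature",
           "units": "degree_c"},
)

print(temp.dims)            # ('time', 'lat', 'lon')
print(temp.shape)           # (3, 5, 5)
print(temp.attrs["units"])  # degree_c
print(temp.lat.values)      # [70 60 50 40 30]
print(type(temp.values))    # <class 'numpy.ndarray'>
```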

We can also update the variable’s attributes after creating the object. Notice that each of the coordinates is also an xarray.DataArray, so we can add attributes to them.


In [11]:
# update attributes
temp.attrs['description'] = 'simple example of an xarray.DataArray'

# add attributes to coordinates 
temp.time.attrs = {'description':'date of measurement'}

temp.lat.attrs['standard_name']= 'grid_latitude'
temp.lat.attrs['units'] = 'degree_N'

temp.lon.attrs['standard_name']= 'grid_longitude'
temp.lon.attrs['units'] = 'degree_E'
temp

At this point, since we have a single variable, the dataset attributes and the variable attributes are the same.

## Subsetting
An xarray.DataArray is a multi-dimensional array with labeled dimensions. To select data from it we need to specify which subsets along each dimension we are interested in. We can specify the data we need from each dimension either by relying on the dimension’s positions (dimension lookup by position) or by calling each dimension by its name (dimension lookup by name). Let’s see some examples.

Example

Suppose we want to know what temperature was recorded by the weather station located at 40°0′N 80°0′E on September 1st, 2022.

## Dimension lookup by position

When we want to rely on the position of the dimensions in the xarray.DataArray, we need to remember that time is the first dimension, lat is the second, and lon the third.

We can then access the values along each dimension in two ways:

- by integer: the exact same as a np.array. Use the locator brackets [] and “simply” remember which position each dimension occupies:

In [19]:
# access dimensions by position, then use integers for indexing
# 0 = first time step (2022-09-01), 3 = fourth latitude (40), 2 = third longitude (80)
temp[0,3,2]

- by label: same as pandas. We use the .loc[] locator to look up a specific coordinate at each position (which represents a dimension):

In [14]:
# access dimensions by position, then use labels for indexing
temp.loc['2022-09-01', 40, 80] # 40 degrees north, 80 degrees east

For datasets with dozens of dimensions, it can be confusing to remember which dimensions go where.
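
Because the mock temperature is constant within each day's grid, the output alone does not show whether the right cell was picked. The sketch below swaps in hypothetical data (`np.arange(75)` reshaped to 3x5x5, which is not part of the lesson) so that every cell holds a distinct value, and checks that both positional lookups land on the same cell. The names `demo`, `by_integer`, and `by_label` are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: every cell gets a unique value, 0 through 74
demo = xr.DataArray(
    data=np.arange(75).reshape(3, 5, 5),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2022-09-01", "2022-09-03"),
            "lat": np.arange(70, 20, -10),
            "lon": np.arange(60, 110, 10)},
)

# Positional lookup with integers: time index 0, lat index 3, lon index 2
by_integer = demo[0, 3, 2]

# Positional lookup with labels: same order of dimensions, coordinate values instead
by_label = demo.loc["2022-09-01", 40, 80]

print(by_integer.item())  # 17  (0*25 + 3*5 + 2)
print(by_label.item())    # 17
```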

## Dimension lookup by name

We can also use the dimension names to subset data, without the need to remember which dimension goes where. In this case, there are still two ways of selecting data along a dimension:

- by integer: we specify the integer location of the data we want along each dimension:


In [15]:
# access dimensions by name, then use integers for indexing
temp.isel(time=0, lon=2, lat=3)

- by label: we use the coordinate values we want to get!

In [18]:
# access dimensions by name, then use labels for indexing
temp.sel(time='2022-09-01', lat=40, lon=80)
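
Since the dimensions are named, the order of the keyword arguments does not matter. The sketch below illustrates this with hypothetical data in which every cell has a distinct value (`np.arange(75)`, not part of the lesson); the names `demo`, `by_isel`, `by_sel`, and `day_one` are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical data: every cell gets a unique value, 0 through 74
demo = xr.DataArray(
    data=np.arange(75).reshape(3, 5, 5),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2022-09-01", "2022-09-03"),
            "lat": np.arange(70, 20, -10),
            "lon": np.arange(60, 110, 10)},
)

# Keyword arguments can be given in any order
by_isel = demo.isel(lat=3, time=0, lon=2)
by_sel = demo.sel(lon=80, lat=40, time="2022-09-01")
print(by_isel.item(), by_sel.item())  # 17 17

# Selecting along only one dimension keeps the remaining ones
day_one = demo.sel(time="2022-09-01")
print(day_one.dims)   # ('lat', 'lon')
print(day_one.shape)  # (5, 5)
```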

Notice that the result of this indexing is a zero-dimensional (scalar) xarray.DataArray. This is because indexing an xarray.DataArray returns another xarray.DataArray. In particular, selections that pick out a single value still produce xarray objects, so we need to cast them as numbers manually. See xarray.DataArray.item.
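
A minimal sketch of that cast, rebuilding the same mock data so it runs on its own. The names `point` and `value` are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same mock temperature data as in the lesson
temp_data = np.array([np.zeros((5, 5)),
                      np.ones((5, 5)),
                      np.ones((5, 5)) * 2]).astype(int)
temp = xr.DataArray(
    data=temp_data,
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2022-09-01", "2022-09-03"),
            "lat": np.arange(70, 20, -10),
            "lon": np.arange(60, 110, 10)},
)

# The selection holds a single value but is still an xarray object
point = temp.sel(time="2022-09-01", lat=40, lon=80)
print(isinstance(point, xr.DataArray))  # True
print(point.size)                       # 1

# .item() extracts the value as a plain Python number
value = point.item()
print(value)                   # 0
print(isinstance(value, int))  # True
```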