# Xarray

![image.png](attachment:75b68d25-e549-4734-be0a-d6d5e8206c62.png)

Previous classes explored Pandas, a popular package for working with tabular data (i.e., rows and columns). Pandas' powerful features include indexing by row and column names and flexible methods to average over multiple columns or groups.

Many geoscience datasets are ***multidimensional*** (or *N-dimensional*), meaning that they have many independent dimensions. For example, temperature in Earth's atmosphere is a function of latitude, longitude, altitude and time, making it a four-dimensional variable
$$T(x,y,z,t).$$
Pandas isn't well suited for 2-D and higher dimensional data, but xarray is!

Xarray supports pandas-like features with multidimensional data.

![image.png](attachment:95e6f70e-a271-4f35-b012-34a0fb957503.png)

## Important Terms

**NetCDF**

NetCDF (Network Common Data Form) is a widely used file format for multi-dimensional data. NetCDF files are self-describing, meaning they contain both the data values and the metadata needed to understand them (e.g. coordinates, units, version, data source).

NetCDF filenames usually end with '.nc' or '.nc4'.

Xarray is designed for working with netCDF files.

**DataArray**

A single multi-dimensional variable, with its coordinates and attributes.

Corresponds to a NetCDF variable.

**Dataset**

A dict-like collection of multiple DataArray objects that potentially share coordinates. 

Corresponds to a NetCDF file (which often contains multiple variables).

**Dimension**

The names of dimension axes of a DataArray (e.g. 'x', 'y', 'z', 'latitude', 'longitude', 'altitude')

**Coordinate**

An array that provides values for a dimension. For a dimension 'latitude', the coordinate would be the values (e.g. -90, -45, 0, 45, 90).  



## Reading a netCDF file

In [None]:
# Conventional way to import xarray
import xarray as xr

import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Open a NetCDF file with xarray
merra2 = xr.open_dataset('datasets/MERRA2.2d_selected.20181010.nc4')

In [None]:
# Dataset contents (interactive mode)
merra2

# Non-interactive
# print( merra2.info() )

## DataArray

A DataArray has four essential attributes:
* `values`: a numpy.ndarray holding the array’s values
* `dims`: dimension names for each axis (e.g., ('x', 'y', 'z'), ('time','lat','lon'))
* `coords`: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
* `attrs`: a dictionary to hold metadata (attributes)
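
The four attributes listed above can be inspected on a small synthetic array. This is a minimal sketch; the name `demo` and all values in it are made up for illustration and are not part of the MERRA2 file.

```python
import numpy as np
import xarray as xr

# Hypothetical 2 x 3 grid of surface pressure values (Pa); values are made up
demo = xr.DataArray(
    np.array([[101325.0, 101200.0, 101100.0],
              [100900.0, 100800.0, 100700.0]]),
    dims=("lat", "lon"),
    coords={"lat": [30.0, 30.5], "lon": [-85.0, -84.375, -83.75]},
    attrs={"units": "Pa", "long_name": "surface pressure"},
)

print(type(demo.values))     # the underlying numpy.ndarray
print(demo.dims)             # names of the axes, in order
print(list(demo.coords))     # names of the coordinate arrays
print(demo.attrs["units"])   # metadata lookup works like a dictionary
```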


In [None]:
# Select a DataArray from a Dataset 
# Just like a pandas DataFrame column or dictionary key
da = merra2['PS']

In [None]:
# DataArray contents (interactive mode)
da

In [None]:
# Access DataArray attributes (uncomment one line at a time)
# da.values
# da.dims
# da.coords
# da.attrs

#### Exercise

Explore the `merra2` Dataset using the commands above.

1. What does the variable named 'H1000' represent and what are its units?
2. What variable contains sea-level pressure? 

In [None]:
# Write your code here
merra2['H1000'].attrs['units']

## Selecting or Indexing

Xarray provides `.sel()` and `.isel()` methods to select elements and slices of a Dataset or DataArray.

`isel()` selects by the index position (similar to pandas `.iloc[]`)

`sel()` selects by coordinate value (similar to pandas `.loc[]`)
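
To make the difference concrete, here is a minimal sketch on a synthetic 1-D array. The name `temps` and its values are hypothetical. All three selections return the same element.

```python
import xarray as xr

# Hypothetical 1-D temperature profile; values are made up
temps = xr.DataArray(
    [25.0, 20.0, 10.0, -5.0],
    dims=("lat",),
    coords={"lat": [0.0, 30.0, 60.0, 90.0]},
)

by_position = temps.isel(lat=1)                   # second element along 'lat'
by_label = temps.sel(lat=30.0)                    # element whose latitude equals 30
nearest = temps.sel(lat=32.0, method="nearest")   # closest latitude to 32

print(by_position.item(), by_label.item(), nearest.item())
```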

In [None]:
# 1st time element, 10th latitude, 5th longitude
da.isel(time=0, lat=9, lon=4)

In [None]:
# Select the location closest to Tallahassee
da.sel(lat=30.44, 
       lon=-84.28, 
       time='2018-10-10 12:00', 
       method='nearest' )

In [None]:
# We *can* use numpy indexing on DataArrays, but that's usually not recommended
# because it depends on the order of the dimensions

# Select the 1st time, 2nd latitude, 3rd longitude
da[0,1,2]

# Recall da.dims is (time,lat,lon)


Selection works the same for `Dataset` and `DataArray`.

In [None]:
# Select 1st time step of *all variables* in a Dataset
time1 = merra2.isel(time=0)
time1

Use `slice()` to select a range of values.

In [None]:
# Select a region over the contiguous United States
merra2_conus = merra2.sel(lon=slice(-130,-60), 
                          lat=slice(25,52))
merra2_conus
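
One detail worth checking on a small example is how the endpoints of a slice are treated. The sketch below uses a synthetic array (`field` is a hypothetical name) to compare label-based and position-based slices.

```python
import numpy as np
import xarray as xr

# Hypothetical 1-D array on a 10-degree latitude grid
lat = np.arange(-90, 91, 10)
field = xr.DataArray(np.arange(lat.size), dims=("lat",), coords={"lat": lat})

# Label-based slices include both endpoints when they exist in the coordinate
by_label = field.sel(lat=slice(20, 50))
print(by_label["lat"].values)      # 20, 30, 40, 50

# Position-based slices follow Python rules and exclude the stop index
by_position = field.isel(lat=slice(0, 3))
print(by_position["lat"].values)   # -90, -80, -70
```

If a coordinate is stored in descending order, the slice bounds generally need to be given in that same order (for example `slice(50, 20)`); otherwise the selection may come back empty.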

Use a `list` to select multiple specific values.

In [None]:
# Select a few discrete values
da.isel(lat=[0,100,200,300],
        lon=10,
        time=0)           

#### Exercise

In MERRA2, the surface variables are averaged over 1 hour and the time is marked at the middle of the time interval. For example, the first value is the average over 00:00-01:00 UTC and it is labeled at 00:30: '2018-10-10T00:30:00'.

1. What was the pressure in Tallahassee (30.44 °N, 84.28 °W) for 0-1 UTC on this day? (You should find 98693.54 Pa)
2. What was the pressure in Tallahassee at hour 23-24 UTC?

In [None]:
# Write your code here

## Plotting

Plotting in xarray works much like pandas:
1. Select part of the Dataset or DataArray (if desired)
2. Use the `.plot()` method; xarray chooses the appropriate plot type
3. Combine with Matplotlib commands as needed.

[xarray plotting guide](https://docs.xarray.dev/en/latest/user-guide/plotting.html)

In [None]:
# Select one time slice (result is 2D) and display
merra2['PS'].isel(time=0).plot()

# Notice that xarray knows & uses the DataArray and coordinate units

In [None]:
# Also subset latitude and longitude
merra2['PS']\
    .isel(time=0)\
    .sel(lat=slice(25,52),
         lon=slice(-130,-66))\
    .plot()

In [None]:
# Select a time series at a location
da.sel(lat=30.44, 
       lon=-84.28,
       method='nearest' )\
    .plot( ylim=(98000,101000) ) 
# xarray's plot command has lots of optional formatting arguments

In [None]:
# Plot multiple lines
da.sel(lat=[30,40,50], 
       lon=0,
       method='nearest' )\
    .plot.line(x='time')

In [None]:
# 3D variables are plotted as a histogram by default
da.plot(bins=100)

In [None]:
# Scatter plot of two variables
merra2.isel(time=0)\
    .plot.scatter(x='PS',y='T2M')

In [None]:
# Quiver
merra2_conus.isel(time=0)\
    .plot.quiver(x='lon',y='lat',u='U10M',v='V10M')

Xarray plotting works well with Matplotlib, just like pandas. 

In [None]:
# Create two panels
fig, axs = plt.subplots( ncols=2, figsize=(8,4) )
ax0, ax1 = axs

# Plot a map in the left panel
merra2['PS'].isel(time=0).plot( ax=ax0 )

# Annotate Tallahassee using standard Matplotlib commands
ax0.scatter(-84.28,30.44,color='black')
ax0.text(-84.28,30.44,'Tallahassee')

# Plot a time series in the right panel
merra2['T2M']\
    .sel(lat=30.44,
         lon=-84.28,
         method='nearest' )\
    .plot( ax=ax1 )

# Automatically fix spacing between panels
# fig.tight_layout()

The [User Guide](https://docs.xarray.dev/en/latest/user-guide/plotting.html) describes additional plotting capabilities.

#### Exercise

Plot a time series of sea-level pressure for Mexico Beach, FL (29.9 °N, 85.4 °W) for this date.


In [None]:
# Write your code here

#### Exercise

Plot a map of sea-level pressure over the region 25-30 °N, 80-90 °W, for the first time in the dataset. (Reminder: use slice())

In [None]:
# Write your code here

## Computation

`DataArrays` and `Datasets` work seamlessly with arithmetic operators and NumPy array functions.

In [None]:
# Convert Pa -> hPa
da_hPa = merra2['PS'] / 100
# Update the units attribute
da_hPa.attrs['units'] = 'hPa'

da_hPa.isel(time=0).plot()
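
The same pattern can be tried without the MERRA2 file. The sketch below uses a synthetic temperature series (the name `t2m` and its values are hypothetical) to show an arithmetic conversion, an explicit units update, and a NumPy function applied to a DataArray.

```python
import numpy as np
import xarray as xr

# Hypothetical 2-m temperatures in kelvin; values are made up
t2m = xr.DataArray(
    [273.15, 283.15, 293.15],
    dims=("time",),
    coords={"time": [0, 1, 2]},
    attrs={"units": "K"},
)

# Arithmetic operators act element by element and keep the coordinates
t2m_c = t2m - 273.15
t2m_c.attrs["units"] = "degC"   # record the new units explicitly

# NumPy functions accept a DataArray and return a DataArray
t2m_abs = np.abs(t2m_c)
print(type(t2m_abs).__name__, t2m_c.values)
```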

DataArrays can be easily added to a Dataset (like adding columns to a pandas DataFrame or keys to a dictionary).

In [None]:
# Add DataArray to the merra2 Dataset
merra2['PS_hPa'] = da_hPa
merra2

#### Exercise

MERRA2 doesn't contain the wind speed, but we can create it from the U and V components. 

1. Create a variable named 'WS10M' that contains wind speed. Recall that $s=\sqrt{u^2+v^2}$.
2. Set the units for the new variable.

In [None]:
# Write your code here

## Broadcasting & Alignment

### Broadcasting
Broadcasting is tricky with numpy arrays but much easier with labeled dimensions.

This is a useless calculation, but illustrates operating on `DataArrays` with different coordinates.

In [None]:
merra2

In [None]:
# lat and lon are 1-D arrays with different lengths
lat_times_lon = np.cos(merra2['lat']*np.pi/180) \
              * np.sin(merra2['lon']*np.pi/180)
# Product is 2-D
lat_times_lon.plot()
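
The shape of a broadcast result is easier to verify on tiny arrays. In this sketch, `a` and `b` are hypothetical 1-D arrays on different dimensions; their product is 2-D, with one entry for every (lat, lon) pair.

```python
import xarray as xr

# Two 1-D arrays that live on different dimensions; values are made up
a = xr.DataArray([1, 2, 3], dims=("lat",), coords={"lat": [10, 20, 30]})
b = xr.DataArray([10, 100], dims=("lon",), coords={"lon": [0, 90]})

# Dimensions are matched by name, so no reshaping is needed
product = a * b
print(product.dims, product.shape)           # ('lat', 'lon') (3, 2)
print(product.sel(lat=20, lon=90).item())    # 2 * 100 = 200
```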

### Alignment

For operations with two DataArrays that share a dimension name, xarray *aligns* the coordinates first. 

To illustrate, subset the data, then do arithmetic.

In [None]:
# Create subsets for illustration
da_tropics = da.sel(lat=slice(-23.5,23.5))
da_nh      = da.sel(lat=slice(0,90))

# Multiply
prod = da_tropics * da_nh

# Inspect coordinates of the product
# Product contains coordinates that were in *both* factors; *inner join*
prod['lat']

Alternatively, `xr.align(..., join='outer')` can expand both factors with missing data.
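
The two kinds of join can be compared on a small synthetic example. This is a sketch with hypothetical arrays `left` and `right` that overlap only at latitudes 10 and 20.

```python
import numpy as np
import xarray as xr

# Two arrays with partly overlapping latitude coordinates; values are made up
left = xr.DataArray([1.0, 2.0, 3.0], dims=("lat",), coords={"lat": [0, 10, 20]})
right = xr.DataArray([10.0, 20.0, 30.0], dims=("lat",), coords={"lat": [10, 20, 30]})

# Arithmetic keeps only the shared coordinate values (inner join)
inner = left + right
print(inner["lat"].values, inner.values)

# Outer join: align first, filling the gaps with NaN, then add
left_o, right_o = xr.align(left, right, join="outer")
outer = left_o + right_o
print(outer["lat"].values, outer.values)
```

In the outer result, latitudes 0 and 30 are present but hold NaN because one of the two inputs has no data there.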

## Creating DataArrays

Datasets and DataArrays are most commonly defined by opening and reading a file. However, we can explicitly create them as well. 

In [None]:
# DataArray of number of days in each month, non-leap years
ndays = xr.DataArray([31,28,31,30,31,30,31,31,30,31,30,31],
                   dims=('month',),
                   coords={'month':np.arange(1,13)})
print( ndays.sel(month=2).values )
ndays
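
A Dataset can be built explicitly in a similar way. The sketch below is one possible construction; the variable names mimic MERRA2, but `ds_demo` and every value in it are hypothetical.

```python
import numpy as np
import xarray as xr

# Made-up 2 x 3 fields for illustration
temp = np.full((2, 3), 290.0)
pres = np.full((2, 3), 101325.0)

# Each variable is given as (dimension names, data, attributes)
ds_demo = xr.Dataset(
    data_vars={
        "T2M": (("lat", "lon"), temp, {"units": "K"}),
        "PS": (("lat", "lon"), pres, {"units": "Pa"}),
    },
    coords={"lat": [30.0, 30.5], "lon": [-85.0, -84.375, -83.75]},
    attrs={"title": "synthetic example"},
)

print(list(ds_demo.data_vars))
print(ds_demo["T2M"].attrs["units"])
```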

## Combining Data: Concat and Merge

* `xr.concat`: concatenate DataArrays into a bigger *DataArray*, extending their dimensions
* `xr.merge`: combine DataArrays or Datasets into a larger *Dataset*

To illustrate concat, we will split the data into northern and southern hemispheres and then recombine them.

In [None]:
# Split into northern and southern hemisphere
NH = merra2.sel(lat=slice(0,90))
SH = merra2.sel(lat=slice(-90,0))

In [None]:
# Concatenate two Datasets along the 'lat' dimension
# Works the same for DataArrays
ds_concat = xr.concat( [SH, NH], dim='lat')
ds_concat
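
The effect of concatenating along an existing dimension can be checked on small arrays. In this sketch, `south` and `north` are hypothetical DataArrays with non-overlapping latitudes.

```python
import xarray as xr

# Two pieces of a latitude axis; values are made up
south = xr.DataArray([1.0, 2.0], dims=("lat",), coords={"lat": [-60.0, -30.0]})
north = xr.DataArray([3.0, 4.0], dims=("lat",), coords={"lat": [30.0, 60.0]})

# The pieces are joined end to end in the order given
both = xr.concat([south, north], dim="lat")
print(both["lat"].values, both.values)
```

Because label-based slices include both endpoints, `slice(-90,0)` and `slice(0,90)` both contain latitude 0 if the grid has a point there. In that case the concatenated result would hold that latitude twice, which is worth checking before further analysis.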

We can also concatenate along a *new* dimension.

In [None]:
# Create a new dimension while concatenating
ds_concat = xr.concat( [SH, NH], dim='hemi')
ds_concat

In [None]:
ds_concat['PS'].isel(time=0,hemi=1).plot()

Merging:

In [None]:
# Merge a list of DataArrays or Datasets into a single Dataset
xr.merge( [merra2['PS'], merra2['T2M']] )

## Reductions

Xarray can compute the mean, sum, standard deviation and other statistics across any dimension of a DataArray, just like NumPy. Xarray reductions refer to dimensions by name (rather than by axis number, as in NumPy), which adds clarity.

In [None]:
# Average over time
da_time_mean = da.mean(dim='time')
da_time_mean.plot()

In [None]:
# Zonal (average over longitude) and Time mean
da_mean = da.mean(dim=['lon','time'])

da_mean.plot()

Reduction methods include
* `mean()`
* `min()`, `max()`, `median()`
* `quantile()`
* `std()`
* `sum()`
* and others
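
To compare named-dimension reductions with NumPy's axis numbers, here is a minimal sketch on a synthetic 3-D array (the name `data` is hypothetical).

```python
import numpy as np
import xarray as xr

# Synthetic array with dimensions (time, lat, lon) holding the values 0..23
data = xr.DataArray(
    np.arange(24.0).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
)

time_mean = data.mean(dim="time")   # removes 'time'; result is (lat, lon)
zonal_max = data.max(dim="lon")     # removes 'lon'; result is (time, lat)
total = data.sum()                  # reduces over all dimensions

# The NumPy equivalent of the time mean needs the axis number instead
numpy_time_mean = data.values.mean(axis=0)

print(time_mean.dims, zonal_max.dims, total.item())
```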

#### Exercise

1. What was the mean sea-level pressure in Tallahassee (30.4 °N, 84.3 °W) on this date?
2. What was the minimum sea-level pressure?

In [None]:
# Write your code here
merra2['SLP']\
    .sel(lat=30.4,
         lon=-84.3,
         method='nearest')\
    .mean()

merra2.mean(dim='time')\
    .sel(lat=30.4,
         lon=-84.3,
         method='nearest')['SLP'].values

## Weighted reductions

For some averages (and other reductions), some elements of the array should have greater weight than others. For example, the MERRA2 grid cells (0.5° $\times$ 0.625°) have larger surface area near the equator than near the poles, which should be accounted for in global averages.  

In [None]:
# Artificial weights based on hour of the day
weights = np.sin( merra2['time'].dt.hour * np.pi/24 )**2
weights.plot()

In [None]:
# Weighted mean
T2M_weighted_mean = merra2['T2M'].weighted(weights).mean(dim='time')
# Unweighted (equal weights) mean
T2M_unweighted_mean = merra2['T2M'].mean(dim='time')


In [None]:
# Plot the weighted mean and compare to unweighted
T2M_weighted_mean.sel(lat=50,method='nearest').plot(label='weighted')

T2M_unweighted_mean.sel(lat=50,method='nearest').plot(label='unweighted')

plt.legend()
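
The area weighting mentioned at the start of this section can be sketched with the same `.weighted()` method. The example assumes a regular latitude-longitude grid, where cell area is approximately proportional to the cosine of latitude; the array `field` and its values are hypothetical.

```python
import numpy as np
import xarray as xr

# Synthetic field: 4 at the equator, 1 at 60 S and 60 N
field = xr.DataArray(
    [[1.0, 1.0], [4.0, 4.0], [1.0, 1.0]],
    dims=("lat", "lon"),
    coords={"lat": [-60.0, 0.0, 60.0], "lon": [0.0, 180.0]},
)

# Assumed weights: cos(latitude), proportional to grid-cell area
weights = np.cos(np.deg2rad(field["lat"]))

area_mean = field.weighted(weights).mean(dim=("lat", "lon"))
plain_mean = field.mean(dim=("lat", "lon"))

print(area_mean.item(), plain_mean.item())   # 2.5 versus 2.0
```

The weighted mean is larger here because the equatorial row, which has the largest values, counts twice as much as each row at 60° latitude.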

## Online datasets

Numerous datasets are available online through protocols that provide access to data *without* first downloading a file (e.g. THREDDS, OPeNDAP, S3). Xarray can access many of these protocols in almost exactly the same way as locally stored files.

Advantages of online or cloud data access:
* Less storage required on local computer
* Faster access if only a small part of a large file is required

Advantages of downloading a data file:
* Reading a local file is faster than over a network
* Available offline; insensitive to network outages, server changes
* Better for repeated data access 

In [None]:
# Open NOAA Extended Reconstructed Sea Surface Temperature (ERSST) version 5
ersst = xr.open_dataset('http://psl.noaa.gov/thredds/dodsC/Datasets/noaa.ersst.v5/sst.mnmean.nc')
ersst

In [None]:
# Plot SST
# ersst['sst'].isel(time=-1).plot()
ersst['sst'].sel(time='1945-12-01').plot()

## Xarray is "***lazy***" (a good thing)

Lazily evaluated operations do not load data into memory until necessary. When xarray opens a file, it reads the metadata and coordinates but leaves the data values on disk. Selections such as `.sel()` and `.isel()` only record which part of the data is wanted; the values are read when they are actually needed, for example to plot, print, or compute a result. This lazy approach saves time and memory because xarray only does the work when you actually need the results.

For example, a program may read a file containing multiple variables over a large domain (e.g. global) but create a figure or print a value for just a single variable and a single site. Xarray loads only the data that it needs into memory, saving the computational effort and time of loading and processing all data in the file.

## Interpolation

Interpolation is a method of estimating an unknown value at a new location from known values at nearby locations.

In [None]:
# DataArray of x**3 for integer values of x
x = np.arange(6)
da = xr.DataArray(x**3, 
                  dims=['x'], 
                  coords={'x': x})
da.plot(marker='o')

What if we want a value at x=3.5? 

In [None]:
da.sel(x=3.5) # We could use method='nearest', but that isn't what we want

Interpolation can do it!

In [None]:
da.interp(x=3.5).values

Xarray's `interp` method uses [scipy.interpolate](https://docs.scipy.org/doc/scipy/reference/interpolate.html) internally. There are several interpolation methods (linear, cubic, nearest).

In [None]:
print(da.interp(x=3.5,method='linear').values,
      da.interp(x=3.5,method='cubic').values,
      da.interp(x=3.5,method='nearest').values)
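
As a quick check on the linear result, a worked calculation may help. Linear interpolation draws a straight line between the two neighboring points, here $x=3$ (value 27) and $x=4$ (value 64):

$$f(3.5) \approx 27 + \frac{3.5-3}{4-3}\,(64-27) = 45.5$$

The exact value is $3.5^3 = 42.875$, so the linear estimate is too high by about 2.6 because the curve bends upward between the two points.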

Interpolate can process many values simultaneously.

In [None]:
xnew = np.linspace(0,5,21)
da_linear = da.interp(x=xnew,method='linear')
da_cubic  = da.interp(x=xnew,method='cubic')

da.plot(marker='o',label='original')
da_linear.plot(marker='.',label='linear')
da_cubic.plot(marker='.',label='cubic')
plt.legend()

**Caution** Interpolation outside the range of the original data will produce NaN. Interpolating with latitude and longitude is tricky because Xarray has no understanding of spherical geometry.

## Groupby

Xarray has a groupby feature very similar to that of pandas. We will illustrate this with the ERSST v5 dataset.

In [None]:
url = 'http://psl.noaa.gov/thredds/dodsC/Datasets/noaa.ersst.v5/sst.mnmean.nc'
ds = xr.open_dataset(url,drop_variables=['time_bnds'])
ds = ds.sel(time=slice('1960','2020')).load()

In [None]:
ds['sst'].isel(time=0).plot(vmin=-2,vmax=30)

In [None]:
# Select a single point
ds.sst.sel(lon=300, lat=50).plot()

As we can see from the plot, the time series at any one point is totally dominated by the seasonal cycle. We would like to remove this seasonal cycle (called the “climatology”) in order to better see the long-term variations in temperature. We will accomplish this using groupby.

The syntax of Xarray’s groupby is almost identical to Pandas. We will first apply groupby to a single DataArray.

### Split Step

The most important argument is `group`: this defines the unique values we will use to "split" the data for grouped analysis. We can pass either a DataArray or the name of a variable in the dataset. Let's first use a DataArray. Just like with Pandas, we can use the time index to extract specific components of dates and times. Xarray provides a special accessor for this, `.dt`, called the `DatetimeAccessor`.

In [None]:
ds.time.dt

In [None]:
ds.time.dt.month

In [None]:
ds.time.dt.year

We can use these time accessors in a groupby operation.

In [None]:
gb = ds.sst.groupby(ds.time.dt.month)
gb

Xarray also offers a more concise syntax when the variable you're grouping on is already present in the dataset. This is identical to the previous line:

In [None]:
# Equivalent to prior example, but more concise
gb = ds.sst.groupby('time.month')
gb

### Map & Combine

Now that we have groups defined, it's time to "apply" a calculation to the group. Like in Pandas, these calculations can either be:
- _aggregation_: reduces the size of the group
- _transformation_: preserves the group's full size

At the end of the apply step, xarray will automatically combine the aggregated / transformed groups back into a single object.

```{warning}
Xarray calls the "apply" step `map`. This is different from Pandas!
```

#### Aggregations

Like Pandas, xarray's groupby object has many built-in aggregation operations (e.g. `mean`, `min`, `max`, `std`, etc). Other functions, including custom functions, can be applied to the groups using `.map()`.

In [None]:
# Mean for each of 12 calendar months
sst_mm = gb.mean(dim='time')
sst_mm
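
For a calculation that is not built in, `.map()` applies a function to each group. This is a minimal sketch on a synthetic monthly series; `series` and `value_range` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Two years of synthetic monthly data holding the values 0..23
time = pd.date_range("2000-01-01", periods=24, freq="MS")
series = xr.DataArray(np.arange(24.0), dims=("time",), coords={"time": time})

def value_range(group):
    # 'group' holds every time step that falls in one calendar month
    return group.max(dim="time") - group.min(dim="time")

# Apply the custom function to each of the 12 calendar-month groups
monthly_range = series.groupby("time.month").map(value_range)
print(monthly_range.values)
```

Each calendar month appears twice, 12 steps apart, so every group has a range of 12.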

In [None]:
# Climatological mean for a single point
sst_mm.sel(lon=300, lat=50).plot()

In [None]:
# Zonal mean climatology
sst_mm.mean(dim='lon').transpose().plot.contourf(levels=12, vmin=-2, vmax=30)

In [None]:
# Climatology difference between January mean and July mean
(sst_mm.sel(month=1) - sst_mm.sel(month=7)).plot(vmax=10)

#### Transformations

Now we want to _remove_ this climatology from the dataset, to examine the residual, called the _anomaly_, which is the interesting part from a climate perspective.
Removing the seasonal climatology is a perfect example of a transformation: it operates over a group, but doesn't change the size of the dataset. 

Xarray makes these sorts of transformations easy by supporting _groupby arithmetic_.
This concept is most easily explained with an example:

In [None]:
gb = ds.groupby('time.month')
ds_anom = gb - gb.mean(dim='time')
ds_anom

Now we can view the climate signal without the overwhelming influence of the seasonal cycle.

In [None]:
# Anomaly time series for a single point in the N. Atlantic
ds_anom.sst.sel(lon=300, lat=50).plot()

In [None]:
ds_anom.time

In [None]:
# Anomaly difference between January 2018 and January 1960 (monthly means labeled on the 1st)
(ds_anom.sel(time='2018-01-01') - ds_anom.sel(time='1960-01-01')).sst.plot()

### Climatology and Anomaly summary

Here is a summary of the process of computing a mean annual cycle (i.e. climatology) and anomalies, using SST as the example.

In [None]:
# In this example, the dataset of interest is 'ds'.
# We could select a single variable from 'ds' if desired
# We assume that the time variable is named 'time'

# Compute climatology, grouping by calendar month and taking the time mean for each month
ds_clim = ds.groupby('time.month').mean(dim='time')

# Compute the anomalies, subtracting the climatology value from each monthly value
ds_anom = ds.groupby('time.month') - ds_clim
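
The same two steps can be tested offline on a synthetic series with a known seasonal cycle. This is a sketch under stated assumptions: the array `sst` below is made up (a cosine seasonal cycle plus a small linear trend) and is unrelated to the ERSST data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Four years of synthetic monthly data: seasonal cycle plus a slow trend
time = pd.date_range("2000-01-01", periods=48, freq="MS")
seasonal = 10.0 * np.cos(2 * np.pi * (time.month.values - 1) / 12)
trend = 0.01 * np.arange(time.size)
sst = xr.DataArray(seasonal + trend, dims=("time",), coords={"time": time})

# Climatology: one mean value per calendar month
clim = sst.groupby("time.month").mean(dim="time")

# Anomaly: each value minus the climatology of its calendar month
anom = sst.groupby("time.month") - clim

print(clim.sizes["month"])   # 12
print(anom.sizes["time"])    # 48
print(float(abs(anom).max()))
```

The anomalies keep the full length of the original series, and the 10-unit seasonal swing is gone; only the small trend signal remains.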

## Groupby-Related: Resample, Rolling, Coarsen


### Resample

Resample in xarray is nearly identical to Pandas.
**It can be applied only to time-index dimensions.** Here we compute the five-year mean.
It is effectively a group-by operation, and uses the same basic syntax.
Note that resampling changes the length of the output arrays.

In [None]:
# Resample to 5 year means
ds_anom_resample = ds_anom.resample(time='5YE').mean(dim='time')
ds_anom_resample

In [None]:
ds_anom.sst.sel(lon=300, lat=50).plot()
ds_anom_resample.sst.sel(lon=300, lat=50).plot(marker='o')

In [None]:
# Compare 2010-2015 vs. 1960-1965
(ds_anom_resample.sel(time='2015-01-01', method='nearest') -
 ds_anom_resample.sel(time='1965-01-01', method='nearest')).sst.plot()

### Rolling

Rolling is also similar to pandas.
It does not change the length of the arrays.
Instead, it allows a moving window to be applied to the data at each point.

In [None]:
ds_anom_rolling = ds_anom.rolling(time=12, center=True).mean()
ds_anom_rolling
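
To see how the window affects array length, the sketch below compares `rolling` with `coarsen` (named in the section heading) on a short synthetic series. The name `series` and the window size of 3 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# One year of synthetic monthly data holding the values 0..11
time = pd.date_range("2000-01-01", periods=12, freq="MS")
series = xr.DataArray(np.arange(12.0), dims=("time",), coords={"time": time})

# Rolling: a centered 3-month moving average; the length stays at 12,
# and the ends are NaN because the window is incomplete there
rolled = series.rolling(time=3, center=True).mean()

# Coarsen: average consecutive, non-overlapping blocks of 3 months;
# the length shrinks from 12 to 4
coarse = series.coarsen(time=3).mean()

print(rolled.values)
print(coarse.values)
```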

In [None]:
ds_anom.sst.sel(lon=300, lat=50).plot(label='monthly anom')
ds_anom_resample.sst.sel(lon=300, lat=50).plot(marker='o', label='5 year resample')
ds_anom_rolling.sst.sel(lon=300, lat=50).plot(label='12 month rolling mean', color='k')
plt.legend()

## Review Questions

### Code snippet

An xarray `Dataset` named `merra2` has the following contents

![image.png](attachment:c44220ab-774b-4656-8947-b189e3322986.png)

Write xarray commands to access the following parts of `merra2`

1. `T2M`
2. `T2M` and `Q2M` together
3. `T2M` for 2020-01-12
4. All variables for 2020-01-11 through 2020-01-13
5. All variables for the location nearest 45 °N, 100 °W.
6. All variables for the region 20°S-20°N, 0-90°E, on 2020-01-11
7. Units of `H1000`
8. 10th latitude and 12th longitude
9. 1st time for the location nearest 45 °N, 100 °W.

### Code snippet

An xarray `DataArray` named `da` has the following contents

![image.png](attachment:dc3be16b-97b0-4f75-a674-87e9d6d368d8.png)

Write xarray commands to compute the following quantities.

1. Overall sum
2. Zonal mean
3. Meridional mean
4. Sum over altitudes (column sum)
5. Standard deviation across longitude
6. Median for each altitude
7. Zonal mean of the column sum (\*challenge)
8. Sum of all points in the northern hemisphere (\*challenge)

### Code snippet

An xarray `Dataset` named `ds` contains variables named `T`,`q`, and `u`. Write a loop that prints the mean value of each of these variables in the form '*name* = *value*', where *name* is the variable name and *value* is its value.