# 09 Introduction: Multidimensional (N-d) arrays, xarray, ERA5 climate reanalysis data

UW Geospatial Data Analysis  
CEE498/CEWA599  
David Shean  

## Introduction

This week we are going to do some basic analysis of climate reanalysis data. This could be useful for some of your projects, especially if considering time series data.

We will use a few different products from the state-of-the-art global ERA5 reanalysis. These products currently span 1950-present with an hourly timestep, at resolutions as fine as 9 km.

We will use xarray to open, combine, analyze and plot the data.

## xarray

Take a moment to review this high-level introduction:
* http://xarray.pydata.org/en/stable/why-xarray.html

>"xarray (formerly xray) is an open source project and Python package that makes working with labelled multi-dimensional arrays simple, efficient, and fun!

>Xarray introduces **labels** in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.

>Xarray is inspired by and borrows heavily from pandas, the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing."

Remember back to Lab03, when you tried to extract a column of elevation values from a 2D NumPy array, with something like `myarray[:,4]`?  And then how much easier it was to do the same thing with a labeled pandas DataFrame `mydf['glas_z']`? Same deal here, just extended beyond 2D.
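
To make the comparison concrete, here is a minimal sketch of the same "pull out one slice" task in NumPy, pandas and xarray. The array values, the column names other than `glas_z`, and the coordinate values are all made up for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 3 x 5 table; column 4 plays the role of the elevation column
myarray = np.arange(15, dtype=float).reshape(3, 5)

# NumPy: you have to remember that elevation lives in column 4
z_numpy = myarray[:, 4]

# pandas: the column label documents itself (other column names are made up)
mydf = pd.DataFrame(myarray, columns=["lat", "lon", "t", "dt", "glas_z"])
z_pandas = mydf["glas_z"]

# xarray: same idea, but every axis of a 3D array has a name and labels
cube = xr.DataArray(
    np.arange(24, dtype=float).reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2020-01-01", periods=2),
        "lat": [46.0, 47.0, 48.0],
        "lon": [-123.0, -122.0, -121.0, -120.0],
    },
)

# Select by coordinate label instead of by axis position
point_series = cube.sel(lat=47.0, lon=-122.0)
print(point_series.dims)  # only the "time" dimension remains
```

The `sel` call never mentions which axis is which, so the code keeps working even if the array is stored with a different axis order.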

Why?
* Excellent choice for working with large datasets, as it uses lazy evaluation and parallel processing with Dask: http://xarray.pydata.org/en/stable/dask.html#
* Lots of great tutorials and resources: https://xarray.pydata.org/en/stable/tutorials-and-videos.html
* Big user community
* Some feel it should be the de facto Python data science object: https://xarray.pydata.org/en/stable/getting-started-guide/why-xarray.html#goals-and-aspirations

Why not?
* General means complicated
* Steep learning curve, especially for new users unfamiliar with Pandas
* Sometimes overkill for simple problems

## xarray data model overview
So, what's an nD array? (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.ndarray.html)  You've been using them all quarter, but mostly 1D and 2D NumPy arrays.  

As with many of the packages we've covered this quarter, vocabulary can be one of the biggest blocks to learning.  Let's discuss.

![xarray image with labels](http://matthewrocklin.com/blog/images/xarray-boxes-2.png)
(http://xarray.pydata.org/en/latest/data-structures.html#dataset)

### Comparison with Pandas

Pandas is very good at handling 2D tabular datasets (e.g., a csv with columns and rows, a time series of several met station variables [T, precip, etc.] from a single station, or a single variable across multiple stations).  
* "If your data fits nicely into a pandas DataFrame then you’re better off using one of the more developed tools there." (https://xarray.pydata.org/en/latest/user-guide/plotting.html)
* https://xarray.pydata.org/en/stable/getting-started-guide/faq.html#should-i-use-xarray-instead-of-pandas
* "pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xarray comes in." (http://xarray.pydata.org/en/stable/why-xarray.html#goals-and-aspirations)

xarray extends the Pandas functionality to support 3+ dimensions (e.g., time series of 2D rasters).

#### xarray is to Pandas...

* xarray DataArray : Pandas Series
* xarray Dataset : Pandas DataFrame
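
One way to see this correspondence is to convert back and forth. The sketch below uses the pandas `to_xarray()` methods and the xarray `to_series()` / `to_dataframe()` methods. The variable names (`snwd`, `wteq`) and values are made up for illustration.

```python
import pandas as pd
import xarray as xr

times = pd.date_range("2000-01-01", periods=3, name="time")

# pandas Series (1D, labeled) <-> xarray DataArray
s = pd.Series([1.0, 2.0, 3.0], index=times, name="snwd")
da = s.to_xarray()        # DataArray with a single dimension, "time"
s_back = da.to_series()   # and back to a pandas Series

# pandas DataFrame (columns = variables) <-> xarray Dataset
df = pd.DataFrame({"snwd": [1.0, 2.0, 3.0], "wteq": [0.3, 0.6, 0.9]}, index=times)
ds = df.to_xarray()       # Dataset with data variables "snwd" and "wteq"
df_back = ds.to_dataframe()

print(type(da).__name__, type(ds).__name__)
```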

## Terminology
* https://xarray.pydata.org/en/stable/user-guide/terminology.html
* https://xarray.pydata.org/en/latest/user-guide/data-structures.html#data-structures

### DataArray
Four essential pieces:
* `values`: a `numpy.ndarray` holding the actual data values (e.g., the gridded values of a variable such as 't2m' or 'tp')
* `dims`: dimension names for each axis (e.g., ('lon', 'lat', 'time'))
* `coords`: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
* `attrs`: a dict containing additional metadata (attributes)
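
A minimal sketch of these four pieces on a small, made-up temperature cube. The coordinate values, units and `long_name` are assumptions for illustration, not values read from ERA5.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical 2 m temperature cube: 2 times x 2 lats x 3 lons
t2m = xr.DataArray(
    np.full((2, 2, 3), 280.0),
    dims=("time", "lat", "lon"),
    coords={
        "time": pd.date_range("2020-01-01", periods=2),
        "lat": [47.0, 48.0],
        "lon": [-122.0, -121.0, -120.0],
    },
    attrs={"units": "K", "long_name": "2 metre temperature"},
    name="t2m",
)

print(type(t2m.values), t2m.values.shape)  # the underlying NumPy array
print(t2m.dims)                            # the name of each axis
print(list(t2m.coords))                    # the coordinates that label each axis
print(t2m.attrs)                           # free-form metadata
```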

### Dataset
* Essentially, a collection of DataArrays (like a dictionary of DataArrays)
* http://xarray.pydata.org/en/latest/data-structures.html#dataset

Notes:
* One value in one of the contained arrays (say a single temperature measurement) usually has multiple coordinates ('lon', 'lat', 'time')
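
As a rough illustration, the sketch below builds a small Dataset with two variables that share the same coordinates, then pulls out one DataArray and one value. The variable names follow the 't2m' and 'tp' examples above; the numbers and coordinates are made up.

```python
import numpy as np
import pandas as pd
import xarray as xr

dims = ("time", "lat", "lon")
coords = {
    "time": pd.date_range("2020-01-01", periods=2),
    "lat": [47.0, 48.0],
    "lon": [-122.0, -121.0],
}

# Two variables on the same grid, stored together
ds_demo = xr.Dataset(
    {
        "t2m": (dims, np.full((2, 2, 2), 280.0)),
        "tp": (dims, np.zeros((2, 2, 2))),
    },
    coords=coords,
)

# Dictionary-style access returns a DataArray
t2m = ds_demo["t2m"]

# A single temperature value is identified by all three coordinates
one_value = t2m.sel(time="2020-01-02", lat=48.0, lon=-121.0)
print(one_value.coords)
```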

### Useful xarray examples and references
* Indexing and selection: https://xarray.pydata.org/en/stable/user-guide/indexing.html
* Plotting: https://xarray.pydata.org/en/stable/user-guide/plotting.html
* Visualization examples: http://xarray.pydata.org/en/stable/examples/visualization_gallery.html
* Time-series analysis: https://xarray.pydata.org/en/stable/user-guide/time-series.html
* https://rabernat.github.io/research_computing/xarray.html

## netCDF format
Much of the xarray design and functionality is derived from the NetCDF project:

>"NetCDF (Network Common Data Form) is a set of software libraries and machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. It is also a community standard for sharing scientific data." [https://www.unidata.ucar.edu/software/netcdf/]

>"Data in netCDF format is:
>* Self-Describing. A netCDF file includes information about the data it contains.
>* Portable. A netCDF file can be accessed by computers with different ways of storing integers, characters, and floating-point numbers.
>* Scalable. Small subsets of large datasets in various formats may be accessed efficiently through netCDF interfaces, even from remote servers.
>* Appendable. Data may be appended to a properly structured netCDF file without copying the dataset or redefining its structure.
>* Sharable. One writer and multiple readers may simultaneously access the same netCDF file.
>* Archivable. Access to all earlier forms of netCDF data will be supported by current and future versions of the software."


>"Commonly used in climatology, meteorology and oceanography applications (e.g., weather forecasting, climate change) and GIS applications." [https://en.wikipedia.org/wiki/NetCDF]

In [None]:
import xarray as xr
import os
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Demo: Create simple DataArray
* From https://xarray.pydata.org/en/latest/user-guide/data-structures.html#data-structures

In [None]:
data = np.random.rand(4, 3)

In [None]:
xr.DataArray(data)

In [None]:
locs = ["IA", "IL", "IN"]

In [None]:
times = pd.date_range("2000-01-01", periods=4)
times

In [None]:
foo = xr.DataArray(data, coords=[times, locs], dims=["time", "space"])

In [None]:
foo

### Indexing

In [None]:
foo[0, :]

In [None]:
foo.loc['2000-01-03']

In [None]:
foo.isel(time=3)

In [None]:
foo.sel(time='2000-01-03')
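
Note that `isel` uses zero-based integer positions while `sel` and `.loc` use coordinate labels, so `isel(time=3)` above returns 2000-01-04, not 2000-01-03. The sketch below rebuilds `foo` with predictable values (an assumption, replacing the random data) so the four styles can be compared on the same row.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Same structure as foo above, but with values 0..11 so results are easy to check
data = np.arange(12).reshape(4, 3)
times = pd.date_range("2000-01-01", periods=4)
locs = ["IA", "IL", "IN"]
foo = xr.DataArray(data, coords=[times, locs], dims=["time", "space"])

# Four ways to get the third time step (2000-01-03)
by_position = foo[2, :]                  # positional, like NumPy
by_loc = foo.loc["2000-01-03"]           # label-based, like pandas
by_isel = foo.isel(time=2)               # positional, by dimension name
by_sel = foo.sel(time="2000-01-03")      # label-based, by dimension name

# Label-based slices include both endpoints; lists pick specific labels
subset = foo.sel(time=slice("2000-01-02", "2000-01-03"), space=["IA", "IN"])
print(subset.values)
```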

## Demo: Create xarray Dataset from SNOTEL data

In [None]:
snotel_datadir = '../08_Vector_TimeSeries_SNOTEL'

In [None]:
sites_fn = os.path.join(snotel_datadir, 'snotel_conus_sites.json')
#singlesite_pkl_fn = 'SNOTEL-SNWD_D_679_WA_SNTL.pkl'
allsites_SNWD_pkl_fn = os.path.join(snotel_datadir, 'SNOTEL-SNWD_D_CONUS_all.pkl')
allsites_WTEQ_pkl_fn = os.path.join(snotel_datadir, 'SNOTEL-WTEQ_D_CONUS_all.pkl')

In [None]:
sites_gdf_all = gpd.read_file(sites_fn).set_index('index')
allsites_snwd_df = pd.read_pickle(allsites_SNWD_pkl_fn).dropna(axis=0, how='all')
allsites_wteq_df = pd.read_pickle(allsites_WTEQ_pkl_fn).dropna(axis=0, how='all')

In [None]:
#allsites_snwd_df.to_xarray()

In [None]:
allsites_snwd_df

In [None]:
allsites_wteq_df = pd.read_pickle(allsites_WTEQ_pkl_fn)

In [None]:
allsites_wteq_df

### Note the difference in number of records and columns
* Some stations have one but not the other
* WTEQ extends farther back in time

In [None]:
def get_DataArray(df, sites, name="SNWD_D"):
    """Convert a (time x site) DataFrame to a DataArray with per-site lon, lat and elev coordinates."""
    # Keep only the site records that match the columns of df, in column order
    valid_sites = sites.loc[df.columns]

    lon = valid_sites.geometry.x.values
    lat = valid_sites.geometry.y.values
    elev = valid_sites.elevation_m.values

    # The DataFrame index becomes "time" and the columns become "site_id"
    da = xr.DataArray(df, dims=("time", "site_id"), name=name)
    #For some reason, the times are not read as datetime64 objects, so reassign
    da["time"] = df.index.values
    # Attach site metadata as extra (non-dimension) coordinates along site_id
    da = da.assign_coords(lon=("site_id", lon), lat=("site_id", lat), elev=("site_id", elev))
    return da

In [None]:
snwd_da = get_DataArray(allsites_snwd_df, sites_gdf_all, name="SNWD_D")

In [None]:
snwd_da
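
The `lon`, `lat` and `elev` coordinates attached along `site_id` are not dimensions, but they can still be used to filter sites. Here is a minimal sketch on a tiny made-up table standing in for the SNOTEL DataFrame; the site names, depths and elevations are hypothetical.

```python
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the SNOTEL table: rows are dates, columns are site IDs
df = pd.DataFrame(
    [[10.0, 0.0, 25.0], [12.0, 1.0, 30.0]],
    index=pd.date_range("2020-01-01", periods=2),
    columns=["site_A", "site_B", "site_C"],
)

da = xr.DataArray(df, dims=("time", "site_id"), name="SNWD_D")
# Attach elevation (made-up values, in meters) as a coordinate along site_id
da = da.assign_coords(elev=("site_id", [1500.0, 300.0, 2100.0]))

# Keep only sites above 1000 m; drop=True removes the sites that fail the test
high = da.where(da.elev > 1000, drop=True)
print(high.site_id.values)
```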

In [None]:
snwd_da.attrs['description'] = 'SNOTEL snow depth measurements'

In [None]:
snwd_da.attrs['units'] = 'inches'

In [None]:
snwd_da

In [None]:
wteq_da = get_DataArray(allsites_wteq_df, sites_gdf_all, name="WTEQ_D")

In [None]:
wteq_da

In [None]:
wteq_da.attrs['units'] = 'inches w.e.'
wteq_da.attrs['description'] = 'SNOTEL snow water equivalent measurements'

In [None]:
wteq_da

### Merge the two DataArrays into a single Dataset

In [None]:
ds = xr.merge([snwd_da, wteq_da])

In [None]:
ds.attrs = {}

In [None]:
ds
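
Because the two variables cover different sites and time ranges, the merge has to align them. The sketch below uses two tiny made-up arrays and passes `join="outer"` explicitly, so the result covers the union of times and sites and the gaps are filled with NaN.

```python
import pandas as pd
import xarray as xr

# Hypothetical snow depth: 2 days, sites A and B
snwd = xr.DataArray(
    [[10.0, 20.0], [11.0, 21.0]],
    dims=("time", "site_id"),
    coords={"time": pd.date_range("2020-01-02", periods=2), "site_id": ["A", "B"]},
    name="SNWD_D",
)
# Hypothetical SWE: starts one day earlier, sites B and C
wteq = xr.DataArray(
    [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]],
    dims=("time", "site_id"),
    coords={"time": pd.date_range("2020-01-01", periods=3), "site_id": ["B", "C"]},
    name="WTEQ_D",
)

# Outer join: keep every time and site found in either input
merged = xr.merge([snwd, wteq], join="outer")
print(dict(merged.sizes))

# Site C has no snow depth record, so that column is all NaN
print(merged["SNWD_D"].sel(site_id="C").values)
```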

### Write out as NetCDF file
* Better than a pickle file: NetCDF is self-describing, portable, and readable outside of Python

In [None]:
out_fn = os.path.join(snotel_datadir, 'SNOTEL_CONUS_all.nc')

In [None]:
ds.to_netcdf(out_fn)

In [None]:
reopened = xr.open_dataset(out_fn)

In [None]:
reopened

## Isolate one site

In [None]:
sitecode = 'SNOTEL:679_WA_SNTL'

In [None]:
ds.sel(site_id=sitecode)

In [None]:
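# Note: in current xarray versions, calling plot() directly on a Dataset raises a ValueError.
# Plot a single DataArray (as in the next cell) or use an explicit method like ds.plot.scatter().
ds.sel(site_id=sitecode).plot()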

In [None]:
ds.sel(site_id=sitecode).WTEQ_D.plot();

In [None]:
f, ax = plt.subplots()
ds.sel(site_id=sitecode).WTEQ_D.plot(ax=ax, label='WTEQ_D')
ds.sel(site_id=sitecode).SNWD_D.plot(ax=ax, label='SNWD_D')
ax.legend();

In [None]:
import hvplot.xarray

In [None]:
ds.sel(site_id=sitecode).hvplot()

In [None]:
ds.isel(site_id=0).plot.scatter(x="SNWD_D", y="WTEQ_D", s=1, c=ds.isel(site_id=0)['time'])

In [None]:
ds.plot.scatter(x="lon", y="lat", c=ds["elev"])