# Xarray Fundamentals

---

## Learning Objectives

- Provide an overview of xarray
- Describe the core xarray data structures, the `DataArray` and the `Dataset`, and the components that make them up
- Load an xarray dataset from a netCDF file
- Load an xarray dataset from a GRIB file
- Load an xarray dataset from a remote THREDDS server


## Prerequisites


| Concepts | Importance | Notes |
| --- | --- | --- |
| Basic familiarity with NumPy | Necessary | |
| Basic familiarity with Pandas | Helpful | |
| [Understanding of NetCDF Data Model](https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html) | Helpful | Familiarity with metadata structure |


- **Time to learn**: *15-20 minutes*



---

## Imports


In [None]:
import xarray as xr  # "canonical" namespace short-hand

## What is Xarray?

Xarray is a Python library for working with **labeled**, **multi-dimensional** arrays.

- Built on top of NumPy and pandas
- Brings the power of pandas to multi-dimensional arrays
- Supports data of any dimensionality

## Core Data Structures

- Xarray has **two** main data structures:
    - `xarray.DataArray`: a fancy, labeled version of `numpy.ndarray` with associated coordinates.
    - `xarray.Dataset`: a collection of multiple `xarray.DataArray` objects that share the same coordinates and/or dimensions.
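To make the relationship between the two structures concrete, here is a minimal sketch that builds both from plain NumPy arrays. The grid sizes, coordinate values, variable names (`tas`, `ps`), and constant data are all made up for illustration; they are not taken from the files used later in this notebook.

```python
import numpy as np
import xarray as xr

# Hypothetical toy grid: 2 time steps, 3 latitudes, 4 longitudes
times = np.array(["2000-01-01", "2000-02-01"], dtype="datetime64[ns]")
lats = np.array([-45.0, 0.0, 45.0])
lons = np.array([0.0, 90.0, 180.0, 270.0])
shared_coords = {"time": times, "lat": lats, "lon": lons}

# A DataArray wraps an array with dimension names, coordinates, and attributes
temperature = xr.DataArray(
    data=np.full((2, 3, 4), 280.0),
    dims=("time", "lat", "lon"),
    coords=shared_coords,
    name="tas",
    attrs={"units": "K"},
)

# A second DataArray defined on the same dimensions and coordinates
pressure = xr.DataArray(
    data=np.full((2, 3, 4), 1013.0),
    dims=("time", "lat", "lon"),
    coords=shared_coords,
    attrs={"units": "hPa"},
)

# A Dataset groups DataArrays that share dimensions and coordinates
toy_ds = xr.Dataset({"tas": temperature, "ps": pressure})
print(toy_ds)
```

Both variables in `toy_ds` point at the same `time`, `lat`, and `lon` coordinates, which is what lets xarray align them by label rather than by position.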

---

<img src="../images/xarray-data-structures.svg">

### Dataset

Xarray's interface is heavily inspired by the [netCDF data model](https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html). Xarray's Dataset is designed as an in-memory representation of a netCDF dataset. 


#### Loading data from a netCDF file

First, let's open a local netCDF file using the `xarray.open_dataset()` function:

In [None]:
%%time
ds = xr.open_dataset(
    "./data/tas_Amon_CESM2_historical_r11i1p1f1_gn_200001-201412.nc", engine="netcdf4"
)

By default, `xarray.open_dataset()` uses **lazy loading**: it reads only the coordinate and attribute metadata, **not** the values of the data variables themselves. Those values are read from disk only when they are actually needed (e.g. when performing a calculation) or when you call the `.load()` method.

Let's look at the HTML representation of the loaded dataset:

In [None]:
ds


<div class="admonition alert alert-info">
    <p class="title" style="font-weight:bold">Text based representation</p>
    If you prefer a text-based representation, you can set <code>display_style="text"</code> by uncommenting the line below.
</div>


In [None]:
# xr.set_options(display_style="text")

To look at the corresponding netCDF representation, we can use the `.info()` method:

In [None]:
ds.info()

Datasets have the following key properties:
- `data_vars`: a dictionary of `DataArray` objects corresponding to data variables
- `dims`: a dictionary mapping from dimension names to the fixed length of each dimension (e.g. `{'time': 1815, 'nv': 2, 'latitude': 180, 'longitude': 360}`)
- `coords`: a dictionary-like container of arrays (coordinates) that label each point (tick label) along our dimensions
- `attrs`: a dictionary holding arbitrary metadata pertaining to the dataset
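Because these properties behave like dictionaries, they can be inspected with ordinary Python idioms. The sketch below uses a small, hypothetical dataset (`demo_ds`, with invented values) so that it runs without any data file. It reads dimension lengths through `Dataset.sizes`, which is always a name-to-length mapping; recent xarray releases warn that `Dataset.dims` is moving toward returning only the dimension names.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset with one variable on a (time, lat) grid
demo_ds = xr.Dataset(
    data_vars={"tas": (("time", "lat"), np.zeros((3, 2)))},
    coords={"time": [0, 1, 2], "lat": [-30.0, 30.0]},
    attrs={"title": "hypothetical demo dataset"},
)

print(list(demo_ds.data_vars))  # names of the data variables
print(dict(demo_ds.sizes))      # dimension name -> length
print(list(demo_ds.coords))     # names of the coordinates
print(demo_ds.attrs["title"])   # one entry of the global metadata
```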

In [None]:
# variables that are in our dataset
ds.data_vars

In [None]:
# dataset dimensions
ds.dims

In [None]:
# dataset coordinates
ds.coords

In [None]:
# dataset global attributes
ds.attrs

### DataArray

The DataArray is xarray's implementation of a labeled, multi-dimensional array. It has several key properties:

- `data`: a duck array holding the array's values (a `numpy.ndarray`, a [`dask.array`](https://docs.dask.org/en/latest/array.html), a [`sparse`](https://sparse.pydata.org/en/stable/) array, or a [`cupy.ndarray`](https://docs.cupy.dev/en/stable/index.html))
- `dims`: dimension names for each axis e.g. `(lat, lon, time)`
- `coords`:  a dictionary-like container of arrays (coordinates) that label each point (tick label) along our dimensions
- `attrs`: a dictionary that holds arbitrary attributes/metadata (such as units). 
- `name`: an arbitrary name of the array

In [None]:
# Extract the tas variable (dataarray)
ds["tas"]

In [None]:
# ds["tas"] is equivalent to ds.tas
ds.tas


<div class="admonition alert alert-warning">
    <p class="admonition-title" style="font-weight:bold">Warning: dot notation vs bracket notation</p>




<ul>
    <li>You can use dot notation only if the variable/DataArray name is a valid Python identifier; e.g. a variable named "mydataset.1" cannot be accessed this way. See <a href="https://docs.python.org/3/reference/lexical_analysis.html#identifiers">here</a> for an explanation of valid identifiers.</li>
<li>Unexpected behavior may occur if the variable/DataArray name conflicts with an existing method name, e.g. using "ds.min" to refer to a variable called "min" collides with the xarray "min" (minimum) method, but "ds['min']" works fine.</li>
</ul>
</div>
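The name collision described in the warning can be reproduced with a tiny, made-up dataset. In this sketch the variable is deliberately called `min`; the dataset name `clash_ds` and its values are purely illustrative.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset whose only variable is named "min"
clash_ds = xr.Dataset({"min": ("x", np.array([1.0, 2.0, 3.0]))})

# Dot notation finds the Dataset.min method, not the variable
print(callable(clash_ds.min))

# Bracket notation always returns the variable as a DataArray
print(type(clash_ds["min"]).__name__)
```

For this reason, bracket notation is the safer default in scripts and library code, while dot notation remains convenient for interactive exploration.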

In [None]:
# The actual array data
ds["tas"].data

In [None]:
# dataarray coordinates
ds["tas"].coords

In [None]:
# dataarray attributes
ds["tas"].attrs

### Dimensions vs Coordinates

- A dimension is just a name of an axis, like "longitude" or "time"
- Labeled coordinates are tick labels along an axis, e.g. "2021-06-08"


#### `repr` & HTML representation of dimensions with or without coordinates 

| Dimension | HTML repr | Text based repr |
| --- | --- | --- |
| with coordinates | **bold** | `*` symbol in `.coords` |
| without coordinates | normal | listed explicitly |
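As a small illustration of the table above, the sketch below builds an array with one dimension that has a coordinate (`time`) and one that does not (`nv`). The array name `bounds` and its values are hypothetical. In the text repr, the dimension without a coordinate should appear on a separate "Dimensions without coordinates" line.

```python
import numpy as np
import xarray as xr

# "time" gets tick labels; no coordinate is supplied for "nv"
bounds = xr.DataArray(
    np.zeros((3, 2)),
    dims=("time", "nv"),
    coords={"time": [10, 20, 30]},
    name="time_bnds",
)

print(bounds.dims)          # both dimension names
print(list(bounds.coords))  # only "time" has a coordinate
print(bounds)               # text repr of the array
```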



In [None]:
ds

In [None]:
with xr.set_options(display_style="text"):
    print(ds)



### Loading data in other file formats 


#### Loading data from a GRIB file 

To load a GRIB file into an xarray Dataset, we use `xarray.open_dataset()` with `engine="cfgrib"`. This requires the `cfgrib` package to be installed in our Python environment:

In [None]:
ds = xr.open_dataset("./data/era5-levels-members.grib", engine="cfgrib")
ds
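If you are unsure whether the `cfgrib` backend is installed, one option is to ask xarray which engines it can see before opening the file. The following is a sketch only: `engine_available` is a hypothetical helper written for this example, and it assumes an xarray version that provides `xarray.backends.list_engines()`.

```python
import xarray as xr


def engine_available(name: str) -> bool:
    """Hypothetical helper: True if xarray lists a backend engine called `name`."""
    return name in xr.backends.list_engines()


if engine_available("cfgrib"):
    print("cfgrib backend found; engine='cfgrib' can be used")
else:
    print("cfgrib backend not found; install the cfgrib package first")
```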

#### Loading data from a remote OPeNDAP server 


If you have access to netCDF datasets hosted remotely on a THREDDS server, you can point xarray to the dataset's OPeNDAP URL and it will stream the data over the network without needing to download the file locally.

In [None]:
url = "http://esgf-data.ucar.edu/thredds/dodsC/esg_dataroot/CMIP6/CMIP/NCAR/CESM2/historical/r11i1p1f1/Amon/tas/gn/v20190514/tas_Amon_CESM2_historical_r11i1p1f1_gn_200001-201412.nc"

In [None]:
xr.open_dataset(url, engine="netcdf4")

---

In [None]:
%load_ext watermark
%watermark --time --python --updated --iversions

## Summary 


- Xarray has two main data structures: `DataArray` and `Dataset`
- A `DataArray` stores a labeled, multi-dimensional array; a `Dataset` is a collection of DataArrays that share coordinates and/or dimensions
- Xarray is built on top of NumPy and pandas, and its architecture is heavily inspired by the netCDF data model

## Resources and References

- [Xarray Documentation on Data Structures](http://xarray.pydata.org/en/latest/data-structures.html)
- [Xarray Documentation on reading files and writing files](https://xarray.pydata.org/en/stable/io.html)
- [cfgrib Documentation](https://github.com/ecmwf/cfgrib)

<div class="admonition alert alert-success">
    <p class="title" style="font-weight:bold">Next: <a href="./02-indexing-and-selecting-data.ipynb">Indexing and selecting data</a></p>
</div>