# Xarray Fundamentals

---

## Learning Objectives

- Provide an overview of xarray
- Describe the core xarray data structures, the `DataArray` and the `Dataset`, and the components that make them up
- Load an xarray dataset from a netCDF file
- Load an xarray dataset from a GRIB file
- Load an xarray dataset from a remote THREDDS server


## Prerequisites


| Concepts | Importance | Notes |
| --- | --- | --- |
| Basic familiarity with NumPy | Necessary | |
| Basic familiarity with Pandas | Helpful | |
| [Understanding of NetCDF Data Model](https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html) | Helpful | Familiarity with metadata structure |


- **Time to learn**: *short*
- **System requirements**: 
    - Python
    - xarray
    - jupyterlab 
    - netcdf4
    - [cfgrib](https://github.com/ecmwf/cfgrib)
    - pydap
    



---

## Imports


In [1]:
import xarray as xr  # "canonical" namespace short-hand

## What is Xarray?

Xarray is a Python library for working with **labelled**, **multi-dimensional** arrays.

- Built on top of numpy and pandas 
- Brings the power of pandas to multidimensional arrays 
- Supports data of any dimensionality 

## Core Data Structures

- Xarray has **two** main data structures:
    - `xarray.DataArray`: a labelled version of `numpy.ndarray` with associated coordinates.
    - `xarray.Dataset`: a collection of `xarray.DataArray` objects that share the same coordinates and/or dimensions.
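
To make the two structures concrete, here is a minimal sketch that builds each one by hand from NumPy arrays. The variable names (`temperature`, `pressure`, `toy`) and the coordinate values are hypothetical and chosen only for illustration; they are not part of the dataset used later in this notebook.

```python
import numpy as np
import xarray as xr

# Hypothetical coordinate values, chosen only for illustration
lat = [10.0, 20.0]
lon = [100.0, 110.0, 120.0]

# A DataArray: a NumPy array plus dimension names, coordinates, and attributes
temperature = xr.DataArray(
    np.arange(6, dtype="float64").reshape(2, 3),
    dims=("lat", "lon"),
    coords={"lat": lat, "lon": lon},
    attrs={"units": "C"},
    name="temperature",
)

# A Dataset: several arrays that share the same dimensions and coordinates
toy = xr.Dataset(
    {
        "temperature": temperature,
        "pressure": (("lat", "lon"), np.full((2, 3), 1000.0)),
    }
)

print(toy)
```

Both variables in `toy` are indexed by the same `lat` and `lon` coordinates, which is what lets a `Dataset` keep related fields aligned.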

---

<img src="../images/xarray-data-structures.svg">

### Dataset

Xarray's interface is heavily inspired by the [netCDF data model](https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_model.html). Xarray's Dataset is designed as an in-memory representation of a netCDF dataset. 


#### Loading data from a netCDF file

First, let's open a local netCDF file using the `xarray.open_dataset()` function:

In [2]:
%%time
ds = xr.open_dataset("./data/HadISST_sst.nc", engine="netcdf4")

CPU times: user 21.7 ms, sys: 2.87 ms, total: 24.6 ms
Wall time: 23.5 ms


By default, `xarray.open_dataset()` uses **lazy loading**: it reads only the coordinate and attribute metadata, **not** the values of the data variables themselves. Let's look at the HTML representation of the loaded dataset:

In [3]:
ds


<div class="admonition alert alert-info">
    <p class="title" style="font-weight:bold">Text-based representation</p>
    If you prefer a text-based representation, you can set <code>display_style="text"</code> by uncommenting the line below.
</div>


In [4]:
# xr.set_options(display_style="text")

To look at the corresponding netCDF representation, we can use the `.info()` method:

In [5]:
ds.info()

xarray.Dataset {
dimensions:
	latitude = 180 ;
	longitude = 360 ;
	nv = 2 ;
	time = 1815 ;

variables:
	datetime64[ns] time(time) ;
		time:long_name = Time ;
		time:standard_name = time ;
	float32 time_bnds(time, nv) ;
	float32 latitude(latitude) ;
		latitude:units = degrees_north ;
		latitude:long_name = Latitude ;
		latitude:standard_name = latitude ;
	float32 longitude(longitude) ;
		longitude:units = degrees_east ;
		longitude:long_name = Longitude ;
		longitude:standard_name = longitude ;
	float32 sst(time, latitude, longitude) ;
		sst:standard_name = sea_surface_temperature ;
		sst:long_name = sst ;
		sst:units = C ;
		sst:cell_methods = time: lat: lon: mean ;

// global attributes:
	:Title = Monthly version of HadISST sea surface temperature component ;
	:description = HadISST 1.1 monthly average sea surface temperature ;
	:institution = Met Office Hadley Centre ;
	:source = HadISST ;
	:reference = Rayner, N. A., Parker, D. E., Horton, E. B., Folland, C. K., Alexander, L. V., Ro

Datasets have the following key properties:
- `data_vars`: a dictionary of `DataArray` objects corresponding to data variables
- `dims`: a dictionary mapping from dimension names to the fixed length of each dimension (e.g. `{'time': 1815, 'nv': 2, 'latitude': 180, 'longitude': 360}`)
- `coords`: a dictionary-like container of arrays (coordinates) that label each point (tick label) along our dimensions
- `attrs`: a dictionary holding arbitrary metadata pertaining to the dataset

In [6]:
# variables that are in our dataset
ds.data_vars

Data variables:
    time_bnds  (time, nv) float32 0.0 31.0 31.0 ... 5.521e+04 5.524e+04
    sst        (time, latitude, longitude) float32 ...

In [7]:
# dataset dimensions
ds.dims

Frozen(SortedKeysDict({'time': 1815, 'nv': 2, 'latitude': 180, 'longitude': 360}))
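
The exact wrapper printed by `ds.dims` depends on the xarray version that produced the output. To the best of our knowledge, newer releases are moving `Dataset.dims` toward returning only the dimension names, so a sketch that relies on the `sizes` property (available on both `Dataset` and `DataArray`) is likely more portable. The `toy` dataset below is hypothetical and stands in for the real file.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the real dataset
toy = xr.Dataset(
    {"sst": (("time", "latitude", "longitude"), np.zeros((4, 2, 3)))}
)

# `sizes` maps each dimension name to its length
sizes = dict(toy.sizes)
print(sizes)

# A DataArray reports its dimension names as a tuple, in axis order
print(toy["sst"].dims)
```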

In [8]:
# dataset coordinates
ds.coords

Coordinates:
  * time       (time) datetime64[ns] 1870-01-16T11:59:59.505615234 ... 2021-0...
  * latitude   (latitude) float32 89.5 88.5 87.5 86.5 ... -87.5 -88.5 -89.5
  * longitude  (longitude) float32 -179.5 -178.5 -177.5 ... 177.5 178.5 179.5

In [9]:
# dataset global attributes
ds.attrs

{'Title': 'Monthly version of HadISST sea surface temperature component',
 'description': 'HadISST 1.1 monthly average sea surface temperature',
 'institution': 'Met Office Hadley Centre',
 'source': 'HadISST',
 'reference': 'Rayner, N. A., Parker, D. E., Horton, E. B., Folland, C. K., Alexander, L. V., Rowell, D. P., Kent, E. C., Kaplan, A.  Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century J. Geophys. Res.Vol. 108, No. D14, 4407 10.1029/2002JD002670',
 'Conventions': 'CF-1.0',
 'history': '25/5/2021 converted to netcdf from pp format',
 'supplementary_information': 'Updates and supplementary information will be available from http://www.metoffice.gov.uk/hadobs/hadisst',
 'comment': 'Data restrictions: for academic research use only. Data are Crown copyright see (http://www.opsi.gov.uk/advice/crown-copyright/copyright-guidance/index.htm)'}

### DataArray

The DataArray is xarray's implementation of a labeled, multi-dimensional array. It has several key properties:

- `data`: a duck array (`numpy.ndarray`, `dask.array`, `sparse`, or `cupy.ndarray`) holding the array's values.
- `dims`: dimension names for each axis, e.g. `(latitude, longitude, time)`
- `coords`:  a dictionary-like container of arrays (coordinates) that label each point (tick label) along our dimensions
- `attrs`: a dictionary that holds arbitrary attributes/metadata (such as units). 
- `name`: an arbitrary name for the array

In [10]:
# Extract the sst variable (DataArray)
ds["sst"]

In [11]:
# ds["sst"] is equivalent to ds.sst
ds.sst
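
The two access styles are equivalent for reading, but attribute-style access has limits worth knowing. The sketch below uses a hypothetical dataset, `toy`, with a variable deliberately named `mean` to show what happens when a variable name collides with an existing method, and why new variables are assigned with dictionary-style indexing.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset; "mean" is chosen to collide with the Dataset.mean method
toy = xr.Dataset(
    {
        "sst": ("x", np.array([1.0, 2.0, 3.0])),
        "mean": ("x", np.array([10.0, 20.0, 30.0])),
    }
)

# Reading a variable works with either style
same = toy.sst.identical(toy["sst"])

# `toy.mean` resolves to the method, while `toy["mean"]` returns the variable
attr_is_method = callable(toy.mean)
item_is_variable = isinstance(toy["mean"], xr.DataArray)

# New variables are assigned with dictionary-style indexing
toy["sst_kelvin"] = toy["sst"] + 273.15

print(same, attr_is_method, item_is_variable)
```

For this reason, dictionary-style access (`ds["sst"]`) is the safer default in scripts, and attribute-style access is mostly a convenience for interactive work.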

In [12]:
# The actual array data
ds["sst"].data

array([[[-1000. , -1000. , -1000. , ..., -1000. , -1000. , -1000. ],
        [-1000. , -1000. , -1000. , ..., -1000. , -1000. , -1000. ],
        [-1000. , -1000. , -1000. , ..., -1000. , -1000. , -1000. ],
        ...,
        [    nan,     nan,     nan, ...,     nan,     nan,     nan],
        [    nan,     nan,     nan, ...,     nan,     nan,     nan],
        [    nan,     nan,     nan, ...,     nan,     nan,     nan]],

       [[-1000. , -1000. , -1000. , ..., -1000. , -1000. , -1000. ],
        [-1000. , -1000. , -1000. , ..., -1000. , -1000. , -1000. ],
        [-1000. , -1000. , -1000. , ..., -1000. , -1000. , -1000. ],
        ...,
        [    nan,     nan,     nan, ...,     nan,     nan,     nan],
        [    nan,     nan,     nan, ...,     nan,     nan,     nan],
        [    nan,     nan,     nan, ...,     nan,     nan,     nan]],

       [[-1000. , -1000. , -1000. , ..., -1000. , -1000. , -1000. ],
        [-1000. , -1000. , -1000. , ..., -1000. , -1000. , -1000. ],
    

In [13]:
# DataArray coordinates
ds["sst"].coords

Coordinates:
  * time       (time) datetime64[ns] 1870-01-16T11:59:59.505615234 ... 2021-0...
  * latitude   (latitude) float32 89.5 88.5 87.5 86.5 ... -87.5 -88.5 -89.5
  * longitude  (longitude) float32 -179.5 -178.5 -177.5 ... 177.5 178.5 179.5

In [14]:
# DataArray attributes
ds["sst"].attrs

{'standard_name': 'sea_surface_temperature',
 'long_name': 'sst',
 'units': 'C',
 'cell_methods': 'time: lat: lon: mean'}

### Named Dimensions vs Labeled Coordinates

- A dimension is just a name of an axis, like "longitude" or "time"
- Labeled coordinates are tick labels along an axis, e.g. "2021-06-08"


#### `repr` & HTML representation of dimensions with or without coordinates 

| Dimension | HTML repr | Text-based repr |
| --- | --- | --- |
| with coordinates | **bold** | `*` symbol in `.coords` |
| without coordinates | normal | listed explicitly |
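
The distinction in the table can be reproduced with a small hand-built array. In this sketch, the hypothetical `bounds` array mimics `time_bnds`: its `time` dimension has coordinate labels, while `nv` is a bare dimension with no labels attached.

```python
import numpy as np
import xarray as xr

# Hypothetical array: "time" carries labels, "nv" does not
bounds = xr.DataArray(
    np.array([[0.0, 31.0], [31.0, 59.0]]),
    dims=("time", "nv"),
    coords={"time": np.array(["2021-01-16", "2021-02-15"], dtype="datetime64[ns]")},
    name="time_bnds",
)

# Both names are dimensions, but only "time" appears among the coordinates
has_time_labels = "time" in bounds.coords
has_nv_labels = "nv" in bounds.coords
print(bounds.dims, has_time_labels, has_nv_labels)

# The text repr marks "time" with `*` and lists "nv" separately
print(bounds)
```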



In [15]:
ds

In [16]:
with xr.set_options(display_style="text"):
    print(ds)

<xarray.Dataset>
Dimensions:    (latitude: 180, longitude: 360, nv: 2, time: 1815)
Coordinates:
  * time       (time) datetime64[ns] 1870-01-16T11:59:59.505615234 ... 2021-0...
  * latitude   (latitude) float32 89.5 88.5 87.5 86.5 ... -87.5 -88.5 -89.5
  * longitude  (longitude) float32 -179.5 -178.5 -177.5 ... 177.5 178.5 179.5
Dimensions without coordinates: nv
Data variables:
    time_bnds  (time, nv) float32 0.0 31.0 31.0 ... 5.521e+04 5.524e+04
    sst        (time, latitude, longitude) float32 -1e+03 -1e+03 ... nan nan
Attributes:
    Title:                      Monthly version of HadISST sea surface temper...
    description:                HadISST 1.1 monthly average sea surface tempe...
    institution:                Met Office Hadley Centre
    source:                     HadISST
    reference:                  Rayner, N. A., Parker, D. E., Horton, E. B., ...
    Conventions:                CF-1.0
    history:                    25/5/2021 converted to netcdf from pp format
 



### Loading data in other file formats 


#### Loading data from a GRIB file 

To load a GRIB file into an xarray Dataset, we call `xarray.open_dataset()` with `engine="cfgrib"`. This requires the `cfgrib` package to be installed in our Python environment:

In [17]:
ds = xr.open_dataset("./data/era5-levels-members.grib", engine="cfgrib")
ds
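
Whether the `cfgrib` engine is usable depends on what is installed in the current environment. As a rough check, the sketch below lists the backends that xarray can find using `xarray.backends.list_engines()`; the warning message is our own wording, not something xarray prints.

```python
import xarray as xr

# Mapping from engine names to the backends xarray found in this environment
engines = xr.backends.list_engines()
print(sorted(engines))

# Hypothetical guard: report a missing backend before trying to open a GRIB file
if "cfgrib" not in engines:
    print("cfgrib backend not found; install the cfgrib package to read GRIB files")
```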

#### Loading data from a remote OPeNDAP server 


If you have access to netCDF datasets that are hosted remotely on a THREDDS server, you can point xarray to a URL and it will stream the data over the network without you needing to download it locally. 

In [18]:
url = (
    "https://thredds.unidata.ucar.edu/thredds/dodsC/casestudies/python-gallery/GFS_20101026_1200.nc"
)

In [19]:
xr.open_dataset(url, engine="netcdf4")

---

## Resources and References

- [Xarray Documentation on Data Structures](http://xarray.pydata.org/en/latest/data-structures.html)
- [Xarray Documentation on Reading files and writing files](https://xarray.pydata.org/en/stable/io.html)
- [cfgrib Documentation](https://github.com/ecmwf/cfgrib)

- “HadISST data were obtained from https://www.metoffice.gov.uk/hadobs/hadisst/ and are © British Crown Copyright, Met Office, provided under a [Non-Commercial Government Licence](http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence/version/2/)”



<div class="admonition alert alert-success">
    <p class="title" style="font-weight:bold">Next: <a href="./02-subsetting-indexing.ipynb">Indexing, Slicing, and Subsetting</a></p>
</div>