# Introduction to xarray

This is an introduction to the Python package [xarray](http://xarray.pydata.org/en/stable/):

> Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.

> Xarray was inspired by and borrows heavily from pandas, the popular data analysis package focused on labelled tabular data. It is particularly tailored to working with netCDF files, which were the source of xarray’s data model, and integrates tightly with dask for parallel computing.

This series focuses on using xarray to analyse climate data stored in netCDF files.

These notebooks accompany the training videos on the [CLEX CMS YouTube channel](https://www.youtube.com/channel/UCSmoK6oWV9O0Hmyt9UdDNsQ).

The series consists of the following notebooks and videos:

1. Reading data and associated metadata from a netCDF file into an xarray dataset
2. Subsetting a dataset by time and space
3. Plotting
4. Calculating metrics, e.g. mean, maximum
5. Grouping and resampling in time
6. Masking
7. Opening multiple files as a single dataset
8. Saving dataset to netCDF


# Opening a dataset

An xarray [dataset](http://xarray.pydata.org/en/stable/data-structures.html#dataset) is a container for data and its associated metadata, including labelled coordinates.

The first step is to import the xarray package:

In [1]:
import xarray

When a netCDF file is opened, the file metadata is read and stored as an `xarray.Dataset`. In this case the file is accessed via an [OPeNDAP](https://www.opendap.org) server, so it is universally accessible. The equivalent command for a netCDF file saved in the same directory as the notebook is shown commented out for reference.

In [2]:
# open with OpenDap URL
ds = xarray.open_dataset('http://dapds00.nci.org.au/thredds/dodsC/rr3/CMIP5/output1/CSIRO-BOM/ACCESS1-3/historical/mon/atmos/Amon/r1i1p1/latest/tas/tas_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc')
# open on local filesystem
# ds = xarray.open_dataset('tas_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc')

The metadata for this dataset is now stored in the variable `ds`. A more informative name could be chosen, but `ds` is fast to type! To examine the contents, it is sufficient to put the variable name in a cell and evaluate it, which is equivalent to `print(ds)` in a Python program:

In [3]:
ds

<xarray.Dataset>
Dimensions:    (bnds: 2, lat: 145, lon: 192, time: 1872)
Coordinates:
  * time       (time) datetime64[ns] 1850-01-16T12:00:00 ... 2005-12-16T12:00:00
  * lat        (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon        (lon) float64 0.0 1.875 3.75 5.625 ... 352.5 354.4 356.2 358.1
    height     float64 ...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) datetime64[ns] ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    tas        (time, lat, lon) float32 ...
Attributes:
    institution:                     CSIRO (Commonwealth Scientific and Indus...
    institute_id:                    CSIRO-BOM
    experiment_id:                   historical
    source:                          ACCESS1-3 2011. Atmosphere: AGCM v1.0 (N...
    model_id:                        ACCESS1.3
    forcing:                         GHG, Oz, SA, Sl, Vl, BC, OC, (GHG = CO2,...
    parent_experiment_id:  

There are four sections to note: `Dimensions`, `Coordinates`, `Data variables` and `Attributes`.

`Dimensions` gives the size of each named dimension. This is a [CF-compliant](http://cfconventions.org) dataset, which means any variable that has the same name as a dimension is assumed to be a coordinate. Other metadata may also be used to denote a coordinate. In the `Coordinates` section, xarray marks the coordinates that index a dimension with `*`; `height` has no dimension of its own, so it is listed without one.

`Data variables` lists all the variables that are not coordinates. In this case three of them are bounds variables, defining the beginning and end values for the three coordinates. The only true data variable is `tas`.
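
The four sections can also be inspected programmatically. The following is a minimal sketch that builds a small synthetic dataset in memory rather than using the file opened above; the name `demo`, the grid sizes and all values are hypothetical and chosen only for illustration.

```python
import numpy as np
import xarray

# Hypothetical miniature dataset: 3 time steps on a 2 x 4 lat/lon grid
time = np.array(['2000-01-16', '2000-02-15', '2000-03-16'], dtype='datetime64[ns]')
lat = np.array([-45.0, 45.0])
lon = np.array([0.0, 90.0, 180.0, 270.0])

demo = xarray.Dataset(
    data_vars={
        'tas': (('time', 'lat', 'lon'), np.full((3, 2, 4), 280.0, dtype='float32')),
    },
    # 'height' is a scalar coordinate: it has a value but no dimension
    coords={'time': time, 'lat': lat, 'lon': lon, 'height': 2.0},
    attrs={'experiment_id': 'demo'},
)

print(dict(demo.sizes))      # size of each named dimension
print(list(demo.coords))     # all coordinates, including the scalar 'height'
print(list(demo.indexes))    # only the coordinates shown with '*' in the repr
print(list(demo.data_vars))  # variables that are not coordinates
print(demo.attrs)            # global attributes
```

Evaluating `demo` in a cell shows the same four sections as the real dataset, with `*` next to `time`, `lat` and `lon` but not `height`.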

# Accessing the data

The `open_dataset` command only reads the metadata from the netCDF file. It does not attempt to read any data until there is an operation that requires this.

The `xarray.Dataset` object provides several ways to access the coordinates, attributes and data. The data variables are stored in a `dict`-like structure, `ds.data_vars`:

In [4]:
ds.data_vars

Data variables:
    time_bnds  (time, bnds) datetime64[ns] ...
    lat_bnds   (lat, bnds) float64 ...
    lon_bnds   (lon, bnds) float64 ...
    tas        (time, lat, lon) float32 ...

It is possible to loop over the data variables just by looping over the dataset, which returns each variable name in turn:

In [5]:
for varname in ds:
    print(varname)

time_bnds
lat_bnds
lon_bnds
tas
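
When the variable itself is needed as well as its name, `data_vars.items()` can be used in the same way as for a `dict`. The sketch below uses a small synthetic dataset (the name `demo` and its contents are hypothetical) so that it runs without network access.

```python
import numpy as np
import xarray

# Hypothetical dataset with one data variable and one bounds variable
demo = xarray.Dataset(
    {
        'tas': (('time', 'lat'), np.zeros((3, 2), dtype='float32')),
        'lat_bnds': (('lat', 'bnds'), np.array([[-90.0, 0.0], [0.0, 90.0]])),
    },
    coords={'lat': [-45.0, 45.0]},
)

# Iterating over the dataset yields data variable names; coordinates are not included
names = [varname for varname in demo]

# .items() yields (name, DataArray) pairs, as for a dict
shapes = {name: var.shape for name, var in demo.data_vars.items()}

print(names)
print(shapes)
```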


An individual variable can be accessed using its name, either as a `dict`-like key:

In [6]:
ds['tas']

<xarray.DataArray 'tas' (time: 1872, lat: 145, lon: 192)>
[52116480 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T12:00:00 ... 2005-12-16T12:00:00
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 ...
Attributes:
    standard_name:     air_temperature
    long_name:         Near-Surface Air Temperature
    units:             K
    cell_methods:      time: mean
    cell_measures:     area: areacella
    history:           2012-02-05T23:49:51Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...

For ease of use, xarray also provides access to data variables as Python attributes:

In [7]:
ds.tas

<xarray.DataArray 'tas' (time: 1872, lat: 145, lon: 192)>
[52116480 values with dtype=float32]
Coordinates:
  * time     (time) datetime64[ns] 1850-01-16T12:00:00 ... 2005-12-16T12:00:00
  * lat      (lat) float64 -90.0 -88.75 -87.5 -86.25 ... 86.25 87.5 88.75 90.0
  * lon      (lon) float64 0.0 1.875 3.75 5.625 7.5 ... 352.5 354.4 356.2 358.1
    height   float64 ...
Attributes:
    standard_name:     air_temperature
    long_name:         Near-Surface Air Temperature
    units:             K
    cell_methods:      time: mean
    cell_measures:     area: areacella
    history:           2012-02-05T23:49:51Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...

So `ds.tas` is an `xarray.DataArray` and has its own metadata giving more information about the variable itself. In this case it is near-surface air temperature in Kelvin.
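
The individual pieces of a `DataArray`'s metadata are available as attributes of the object. The following is a minimal sketch using a synthetic `tas` array (the values and the `demo` dataset are hypothetical) to show that key access and attribute access give equivalent results, and how the name, dimensions and units might be read.

```python
import numpy as np
import xarray

# Hypothetical stand-in for the 'tas' variable, with a subset of its attributes
tas = xarray.DataArray(
    np.full((3, 2), 280.0, dtype='float32'),
    dims=('time', 'lat'),
    coords={'lat': [-45.0, 45.0]},
    name='tas',
    attrs={'standard_name': 'air_temperature', 'units': 'K'},
)
demo = xarray.Dataset({'tas': tas})

# Key access and attribute access return equivalent DataArray objects
same = demo['tas'].identical(demo.tas)

var = demo['tas']
print(same)
print(var.name, var.dims, var.shape, var.dtype)
print(var.attrs['standard_name'], var.attrs['units'])
```

Key access is the safer choice in scripts, since attribute access only works when the variable name is a valid Python identifier that does not clash with an existing `Dataset` attribute or method.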