# Introduction

- Unlabelled, N-dimensional arrays of numbers are a frequently used data structure in scientific computing. For example, sets of climate variables (e.g. temperature and precipitation) vary in space and time and are represented on a regularly spaced grid. Often we need to subset a large global grid to look at data for a particular region, or select a specific time slice. We might then apply statistical functions to these subsets to generate summary information [[1]](https://geohackweek.github.io/nDarrays/01-introduction/).

- NetCDF (Network Common Data Form) is a file format developed for multidimensional scientific data, for example:

    - __latitude x longitude x time x variable__

![](http://xarray.pydata.org/en/stable/_images/dataset-diagram.png)

- It has been widely adopted as a standard format for distributing N-dimensional arrays. 

- It is commonly used in climatology, meteorology and oceanography applications (e.g., weather forecasting, climate change, land cover, biomass) and GIS applications [[2]](https://en.wikipedia.org/wiki/NetCDF).

- Originally used for climate data

- File extension = `.nc` or sometimes `.nc4`

- Compatible with ArcGIS

- Python can work with netCDF files through the `xarray` package (R can also read netCDF files)

In [3]:
# import the package
import xarray as xr

In [2]:
# open the netCDF file; a forward slash in the path works on Windows, macOS and Linux
ds = xr.open_dataset('data/CCSM4-rcp45-tasmax.nc4')

## Dataset properties

Datasets (storage model) have the following components:

  - Dimensions (`dims`) - the name and length of each axis (e.g. number of time points, number of latitudinal grid cells)
  - Coordinates (`coords`) - the label values along each dimension (time points, longitudes and latitudes)
  - Data variables (`data_vars`) - hold the data values; the shape of each variable is given by the list of dimensions it spans
  - Attributes (`attrs`) - metadata describing a variable (e.g. units)
  - Global attributes apply to the entire file (authors, provenance, etc.)
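To make the four components concrete, here is a minimal sketch that builds a tiny dataset in memory and reads each component back. The grid values, the name `toy` and the attribute text are made up for illustration and are not taken from the CCSM4 file.

```python
import numpy as np
import xarray as xr

# Hypothetical toy grid: 2 time steps, 3 latitudes, 4 longitudes.
times = np.array(["2008-01-01", "2008-02-01"], dtype="datetime64[ns]")
lats = np.array([-45.0, 0.0, 45.0])
lons = np.array([-135.0, -45.0, 45.0, 135.0])

toy = xr.Dataset(
    data_vars={
        # (dimension names, values, variable-level attributes)
        "tasmax": (
            ("time", "lat", "lon"),
            np.full((2, 3, 4), 280.0, dtype="float32"),
            {"units": "K"},
        ),
    },
    coords={"time": times, "lat": lats, "lon": lons},
    attrs={"description": "toy example, not real data"},  # global attributes
)

dims = dict(toy.sizes)                     # dimension name -> length
coord_names = list(toy.coords)             # names of the coordinate variables
var_names = list(toy.data_vars)            # names of the data variables
var_units = toy["tasmax"].attrs["units"]   # attribute of a single variable
global_attrs = dict(toy.attrs)             # attributes of the whole dataset
```

The same accessors (`sizes`, `coords`, `data_vars`, `attrs`) can be used on `ds` once it has been opened from the file.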

In [3]:
ds

<xarray.Dataset>
Dimensions:  (lat: 720, lon: 1440, time: 12)
Coordinates:
  * time     (time) object 2008-01-01 12:00:00 ... 2008-12-01 12:00:00
  * lon      (lon) float32 -179.875 -179.625 -179.375 ... 179.625 179.875
  * lat      (lat) float32 -89.875 -89.625 -89.375 ... 89.375 89.625 89.875
Data variables:
    tasmax   (time, lat, lon) float32 ...
Attributes:
    CDI:          Climate Data Interface version ?? (http://mpimet.mpg.de/cdi)
    Conventions:  CF-1.6
    history:      Fri Jul 26 17:44:32 2019: cdo splitday 1.0.nc4 out.nc4
    version:      1.0
    repo:         https://gitlab.com/ClimateImpactLab/dtr_fix/
    frequency:    annual
    oneline:      linear interpolates tasmax and tasmin in grids with tasmax ...
    file:         /global/home/users/jsimcock/code/dtr_fix/dtr_fix.py
    year:         2008
    description:  linear interpolates tasmax and tasmin in grids with tasmax ...
    execute:      python /global/home/users/jsimcock/code/dtr_fix/dtr_fix.py run
    scenari

In [4]:
temperature = ds.tasmax

In [5]:
temperature

<xarray.DataArray 'tasmax' (time: 12, lat: 720, lon: 1440)>
[12441600 values with dtype=float32]
Coordinates:
  * time     (time) object 2008-01-01 12:00:00 ... 2008-12-01 12:00:00
  * lon      (lon) float32 -179.875 -179.625 -179.375 ... 179.625 179.875
  * lat      (lat) float32 -89.875 -89.625 -89.375 ... 89.375 89.625 89.875
Attributes:
    standard_name:     air_temperature
    long_name:         Daily Maximum Near-Surface Air Temperature
    units:             K
    time:              365.5
    original_name:     TREFHTMX
    comment:           TREFHTMX no change
    cell_methods:      time: maximum (interval: 1 day)
    cell_measures:     area: areacella
    history:           2011-11-09T20:45:21Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...

In [6]:
temperature.attrs

OrderedDict([('standard_name', 'air_temperature'),
             ('long_name', 'Daily Maximum Near-Surface Air Temperature'),
             ('units', 'K'),
             ('time', 365.5),
             ('original_name', 'TREFHTMX'),
             ('comment', 'TREFHTMX no change'),
             ('cell_methods', 'time: maximum (interval: 1 day)'),
             ('cell_measures', 'area: areacella'),
             ('history',
              "2011-11-09T20:45:21Z altered by CMOR: Treated scalar dimension: 'height'. 2011-11-09T20:45:21Z altered by CMOR: Reordered dimensions, original order: lat lon time. 2011-11-09T20:45:21Z altered by CMOR: replaced missing value flag (-1e+32) with standard missing value (1e+20)."),
             ('associated_files',
              'baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation gridspecFile: gridspec_atmos_fx_CCSM4_rcp45_r0i0p0.nc areacella: areacella_fx_CCSM4_rcp45_r0i0p0.nc')])

## Indexing

Indexing is used to select specific elements from the dataset.

In [8]:
# positional indexing
temperature[0,0,0]

<xarray.DataArray 'tasmax' ()>
array(277.1034, dtype=float32)
Coordinates:
    time     object 2008-01-01 12:00:00
    lon      float32 -179.875
    lat      float32 -89.875
Attributes:
    standard_name:     air_temperature
    long_name:         Daily Maximum Near-Surface Air Temperature
    units:             K
    time:              365.5
    original_name:     TREFHTMX
    comment:           TREFHTMX no change
    cell_methods:      time: maximum (interval: 1 day)
    cell_measures:     area: areacella
    history:           2011-11-09T20:45:21Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...

In [9]:
# isel refers to a selection by integer position
temperature.isel(time=0, lat=0, lon=0)

<xarray.DataArray 'tasmax' ()>
array(277.1034, dtype=float32)
Coordinates:
    time     object 2008-01-01 12:00:00
    lon      float32 -179.875
    lat      float32 -89.875
Attributes:
    standard_name:     air_temperature
    long_name:         Daily Maximum Near-Surface Air Temperature
    units:             K
    time:              365.5
    original_name:     TREFHTMX
    comment:           TREFHTMX no change
    cell_methods:      time: maximum (interval: 1 day)
    cell_measures:     area: areacella
    history:           2011-11-09T20:45:21Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
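The two cells above return the same element. As a small sketch on a made-up array (the name `da` and its values are illustrative only), the difference is that plain brackets rely on the order of the dimensions, while `isel` names each dimension and so works in any order:

```python
import numpy as np
import xarray as xr

# Hypothetical toy array; the values 0..23 make each cell easy to identify.
da = xr.DataArray(
    np.arange(24, dtype="float32").reshape(2, 3, 4),
    dims=("time", "lat", "lon"),
    coords={"lat": [-45.0, 0.0, 45.0], "lon": [-135.0, -45.0, 45.0, 135.0]},
    name="tasmax",
)

# Brackets: integers must follow the (time, lat, lon) order of the array.
by_position = da[1, 2, 3]

# isel: integers are tied to dimension names, so the order does not matter.
by_name = da.isel(lon=3, time=1, lat=2)

# Dimensions that are not mentioned are kept whole.
first_step = da.isel(time=0)
```

Here `by_position` and `by_name` both hold the value 23.0, and `first_step` is a 2-D array over `lat` and `lon`.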

In [11]:
# label-based indexing: .loc looks up coordinate values, with dimensions still given in positional order
temperature.loc['2008-01-01T12:00:00',:,:]

<xarray.DataArray 'tasmax' (time: 1, lat: 720, lon: 1440)>
[1036800 values with dtype=float32]
Coordinates:
  * time     (time) object 2008-01-01 12:00:00
  * lon      (lon) float32 -179.875 -179.625 -179.375 ... 179.625 179.875
  * lat      (lat) float32 -89.875 -89.625 -89.375 ... 89.375 89.625 89.875
Attributes:
    standard_name:     air_temperature
    long_name:         Daily Maximum Near-Surface Air Temperature
    units:             K
    time:              365.5
    original_name:     TREFHTMX
    comment:           TREFHTMX no change
    cell_methods:      time: maximum (interval: 1 day)
    cell_measures:     area: areacella
    history:           2011-11-09T20:45:21Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...

In [15]:
# sel refers to a selection by coordinate label, with dimensions given by name
temperature.sel(time='2008-01-01T12:00:00', lat=89.375, lon=179.625)

<xarray.DataArray 'tasmax' (time: 1)>
array([229.54843], dtype=float32)
Coordinates:
  * time     (time) object 2008-01-01 12:00:00
    lon      float32 179.625
    lat      float32 89.375
Attributes:
    standard_name:     air_temperature
    long_name:         Daily Maximum Near-Surface Air Temperature
    units:             K
    time:              365.5
    original_name:     TREFHTMX
    comment:           TREFHTMX no change
    cell_methods:      time: maximum (interval: 1 day)
    cell_measures:     area: areacella
    history:           2011-11-09T20:45:21Z altered by CMOR: Treated scalar d...
    associated_files:  baseURL: http://cmip-pcmdi.llnl.gov/CMIP5/dataLocation...
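The `sel` call above only succeeds when the requested labels match coordinate values exactly. As a hedged sketch on a made-up grid (the array `da` and its values are illustrative, not from the CCSM4 file), `method="nearest"` relaxes that requirement, and `slice` selects a region that can then be summarised:

```python
import numpy as np
import xarray as xr

# Hypothetical toy grid with values 0..11.
lats = np.array([-45.0, 0.0, 45.0])
lons = np.array([-135.0, -45.0, 45.0, 135.0])
da = xr.DataArray(
    np.arange(12, dtype="float32").reshape(3, 4),
    dims=("lat", "lon"),
    coords={"lat": lats, "lon": lons},
)

# Exact label lookup: the labels must exist in the coordinates.
exact = da.sel(lat=45.0, lon=135.0)

# Nearest-neighbour lookup: picks the closest grid cell to the requested point.
nearest = da.sel(lat=40.0, lon=130.0, method="nearest")

# Label slices include both end points, so this keeps lat 0 and 45, lon -45 and 45.
region = da.sel(lat=slice(0.0, 45.0), lon=slice(-45.0, 45.0))

# Summary statistic over the selected region.
region_mean = float(region.mean())
```

In this toy case `exact` and `nearest` resolve to the same cell, and `region` is a 2 x 2 block whose mean is 7.5. Note that label slices follow the order of the coordinate, so a grid stored with descending latitudes would need `slice(45.0, 0.0)` instead.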