In [1]:
import xarray as xr
import numpy as np

In [2]:
data = xr.DataArray(np.random.rand(5))

In [3]:
data

<xarray.DataArray (dim_0: 5)>
array([ 0.46885859,  0.09113166,  0.02410605,  0.72529958,  0.60852871])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4

## Dimension labels

In [4]:
data1 = xr.DataArray(np.random.rand(4, 2), [('sample',['a', 'b', 'c', 'd']), ('size',['width', 'height'])])

In [5]:
data1.dtype

dtype('float64')

In [6]:
data1

<xarray.DataArray (sample: 4, size: 2)>
array([[ 0.26656025,  0.60613195],
       [ 0.28862459,  0.17809684],
       [ 0.3569416 ,  0.16762222],
       [ 0.86648851,  0.71052394]])
Coordinates:
  * sample   (sample) <U1 'a' 'b' 'c' 'd'
  * size     (size) <U6 'width' 'height'

In [7]:
data1.sum('sample')

<xarray.DataArray (size: 2)>
array([ 1.77861495,  1.66237495])
Coordinates:
  * size     (size) <U6 'width' 'height'

## Exercise

These two lists contain average monthly temperatures (in degrees Celsius) in Erlangen and Paris:

```
erlangen = [-0.5, 0.7, 4.4, 8.5, 13.3, 16.7, 18.2, 17.5, 13.7, 8.9, 4.0, 0.9]
paris = [3.3, 4.2, 7.8, 10.8, 14.3, 17.5, 19.4, 19.1, 16.4, 11.6, 7.2, 4.2]
```

Design a `DataArray` for storing these data. Calculate the average annual temperature for each location.

## Indexing

In [8]:
data1[2]

<xarray.DataArray (size: 2)>
array([ 0.3569416 ,  0.16762222])
Coordinates:
    sample   <U1 'c'
  * size     (size) <U6 'width' 'height'

In [9]:
data1.loc['a']

<xarray.DataArray (size: 2)>
array([ 0.26656025,  0.60613195])
Coordinates:
    sample   <U1 'a'
  * size     (size) <U6 'width' 'height'

In [10]:
data1.sel(size='width')

<xarray.DataArray (sample: 4)>
array([ 0.26656025,  0.28862459,  0.3569416 ,  0.86648851])
Coordinates:
  * sample   (sample) <U1 'a' 'b' 'c' 'd'
    size     <U6 'width'
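
The cells above mix positional indexing (`data1[2]`), label lookup on the first dimension (`.loc['a']`), and label lookup by dimension name (`.sel`). As a minimal sketch with made-up values (the array `arr` is illustrative, not the random data above), `isel` and `sel` select by position and by label respectively, and both address dimensions by name, so they do not depend on the axis order:

```python
import numpy as np
import xarray as xr

# Illustrative array with known values: rows are samples, columns are sizes.
arr = xr.DataArray(np.arange(8.0).reshape(4, 2),
                   [('sample', ['a', 'b', 'c', 'd']),
                    ('size', ['width', 'height'])])

by_position = arr.isel(sample=2)   # third entry along 'sample'
by_label = arr.sel(sample='c')     # entry labelled 'c' along 'sample'

# Selecting on every dimension leaves a zero-dimensional array.
single = arr.sel(sample='c', size='height').item()

print(by_position.equals(by_label))
print(single)
```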

## Alignment

In [11]:
day2 = xr.DataArray(np.random.rand(4,2), [('sample',['b', 'c', 'd', 'e']), ('size', ['width', 'height'])])

In [12]:
data1 + day2

<xarray.DataArray (sample: 3, size: 2)>
array([[ 0.92768971,  0.38311718],
       [ 0.74488189,  1.05205423],
       [ 1.69517388,  0.91080021]])
Coordinates:
  * sample   (sample) object 'b' 'c' 'd'
  * size     (size) <U6 'width' 'height'
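
The sum above keeps only the labels present in both arrays (`'b'`, `'c'`, `'d'`): by default, arithmetic aligns operands on the intersection of their coordinates. As a minimal sketch of how to keep every label instead, `xr.align` with `join='outer'` pads the missing entries with NaN. The arrays `first` and `second` and their values are made up for illustration.

```python
import numpy as np
import xarray as xr

# Illustrative values (not the random data from the cells above).
first = xr.DataArray(np.ones((4, 2)),
                     [('sample', ['a', 'b', 'c', 'd']),
                      ('size', ['width', 'height'])])
second = xr.DataArray(np.ones((4, 2)),
                      [('sample', ['b', 'c', 'd', 'e']),
                       ('size', ['width', 'height'])])

# Default arithmetic aligns on the intersection of labels.
inner_sum = first + second

# Explicit outer alignment keeps the union of labels and fills gaps with NaN.
first_outer, second_outer = xr.align(first, second, join='outer')
outer_sum = first_outer + second_outer

print(inner_sum.sizes['sample'])
print(outer_sum.sizes['sample'])
print(outer_sum.sel(sample='a').values)
```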

## Broadcasting

In [13]:
units = xr.DataArray([0.001, 0.01, 1], [('unit', ['mm', 'cm', 'm'])])

In [14]:
data1 * units

<xarray.DataArray (sample: 4, size: 2, unit: 3)>
array([[[  2.66560246e-04,   2.66560246e-03,   2.66560246e-01],
        [  6.06131952e-04,   6.06131952e-03,   6.06131952e-01]],

       [[  2.88624594e-04,   2.88624594e-03,   2.88624594e-01],
        [  1.78096844e-04,   1.78096844e-03,   1.78096844e-01]],

       [[  3.56941600e-04,   3.56941600e-03,   3.56941600e-01],
        [  1.67622219e-04,   1.67622219e-03,   1.67622219e-01]],

       [[  8.66488515e-04,   8.66488515e-03,   8.66488515e-01],
        [  7.10523936e-04,   7.10523936e-03,   7.10523936e-01]]])
Coordinates:
  * sample   (sample) <U1 'a' 'b' 'c' 'd'
  * size     (size) <U6 'width' 'height'
  * unit     (unit) <U2 'mm' 'cm' 'm'
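
Broadcasting in xarray pairs dimensions by name rather than by position, so operands with disjoint dimensions combine into an array spanning all of them. A minimal sketch with made-up values (the names `lengths` and `factors` are illustrative):

```python
import numpy as np
import xarray as xr

lengths = xr.DataArray([[1.0, 2.0], [3.0, 4.0]],
                       [('sample', ['a', 'b']),
                        ('size', ['width', 'height'])])
factors = xr.DataArray([1000.0, 100.0, 1.0],
                       [('unit', ['mm', 'cm', 'm'])])

# 'unit' is absent from lengths, so the result gains it as a new dimension.
product = lengths * factors

# Swapping the operands changes only the dimension order, not the values.
swapped = (factors * lengths).transpose(*product.dims)

print(product.dims)
print(product.sel(sample='b', size='height', unit='cm').item())
```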

## Interoperability with pandas
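
A minimal sketch of the round trip between xarray and pandas, using made-up values: `DataArray.to_series` flattens an array into a pandas `Series` indexed by a `MultiIndex` (one level per dimension), `DataArray.to_dataframe` does the same for a named array, and `xr.DataArray.from_series` converts back.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative array; the name is needed for to_dataframe().
arr = xr.DataArray(np.arange(8.0).reshape(4, 2),
                   [('sample', ['a', 'b', 'c', 'd']),
                    ('size', ['width', 'height'])],
                   name='length')

series = arr.to_series()       # Series with MultiIndex levels (sample, size)
frame = arr.to_dataframe()     # DataFrame with a single column 'length'

# Back to xarray: each index level becomes a dimension again.
restored = xr.DataArray.from_series(series)

print(series.index.names)
print(restored.sel(sample='c', size='height').item())
```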

## Exercise

*Inspired by a data science [challenge](http://www.ramp.studio/events/drug_spectra) by C. Marini et al.*

A researcher measured a [Raman spectrum](https://en.wikipedia.org/wiki/Raman_spectroscopy) of an unknown sample and now wants to determine the substance and its concentration. Calibration data are available with Raman spectra of four different compounds at three different concentrations. Calculate the mean squared error between the sample and each calibration spectrum, and find the closest compound and concentration.

```python
import pandas as pd
# DataFrame.from_csv has been removed from pandas; read_csv is the replacement.
df = pd.read_csv('raman_data.csv', index_col=[0, 1, 2])
calibration = xr.DataArray.from_series(df['Raman'])

sample = xr.DataArray([[0, 10]], [('sample', ['X1042']),
                                  ('wavelength', [100, 300])])
```

**Hint**: To find the calibration sample with the minimum error, you can convert the `DataArray` to a pandas `Series`:

```python
err.to_series().idxmin()
```
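
As a hedged illustration with a made-up error array (the dimension names `compound` and `concentration` and all values are hypothetical, not taken from `raman_data.csv`), `idxmin` on the converted series returns the index label of the smallest entry. For a multi-dimensional array that label is a tuple of coordinate values. In recent pandas versions, `argmin` returns an integer position instead of a label.

```python
import xarray as xr

# Hypothetical errors: 2 compounds x 3 concentrations.
err = xr.DataArray([[4.0, 2.5, 3.0],
                    [1.5, 0.2, 0.9]],
                   [('compound', ['A', 'B']),
                    ('concentration', [1, 5, 10])])

# Label (compound, concentration) of the smallest error.
best = err.to_series().idxmin()
print(best)
```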

## Comparison

|     | pandas.DataFrame | xarray.DataArray | Structured NumPy array|
|-----|------------------|------------------|--------------|
|max. dimensions | 2 | 32 | 32 |
| non-homogeneous arrays | Yes | No | Yes |
|labelled dimensions | 2 | 32 | 1 |
| labelled coordinates | Yes | Yes | No |
| broadcasting | No | Yes | Yes |
| auto-alignment | Yes | Yes | No |
| groupby-split-combine | Yes | Yes | No |

# Other features

* `Dataset` -- key/value store; generalisation of `DataFrame` in `pandas` for N-dimensional data
* groupby/split/combine
* NetCDF I/O
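
A minimal sketch of the first two features, using made-up values and hypothetical variable names (`temperature`, `rainfall`, `season`): a `Dataset` holds several named arrays that share coordinates, and `groupby` splits along a coordinate, applies a reduction to each group, and combines the results.

```python
import numpy as np
import xarray as xr

months = np.arange(1, 13)
# Hypothetical season label for each month (1 = January).
season = ['winter', 'winter', 'spring', 'spring', 'spring', 'summer',
          'summer', 'summer', 'autumn', 'autumn', 'autumn', 'winter']

# Two variables sharing the 'month' dimension; values are illustrative.
ds = xr.Dataset(
    {'temperature': ('month', np.linspace(0.0, 11.0, 12)),
     'rainfall': ('month', np.full(12, 50.0))},
    coords={'month': months, 'season': ('month', season)},
)

# Split by season, average each group, combine into a new Dataset.
seasonal_mean = ds.groupby('season').mean()

print(seasonal_mean['temperature'].sel(season='summer').item())
```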

## Further reading

* xarray docs, http://xarray.pydata.org/en/stable/