# The `xarray` package for multidimensional data

Notes from [Multi-Dimensional Arrays for Decision Analysis, Persuasive Python](https://www.persuasivepython.com/4-multidimarrays)

Here we'll experiment with the `xarray` package as a way to store multiple simulation runs. It is more flexible than dataframes and more user-friendly than nested dictionaries.

Concepts:

* `DataArray`s
* Naming dimensions with `dims`
* Coordinates index an array and are specified with `coords`
* A `Dataset` can be created from multiple `DataArray`s

In this example, we'll simulate demand by taking 100 draws from a binomial distribution:

$$
d \sim \text{Binomial}(200, 0.2)
$$
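
As a quick sanity check on the simulated values, the standard binomial mean and variance for $n = 200$ and $p = 0.2$ work out as follows (the symbols $n$ and $p$ match the arguments passed to `rng.binomial` in the code):

$$
E[d] = np = 200 \times 0.2 = 40, \qquad \text{Var}(d) = np(1-p) = 200 \times 0.2 \times 0.8 = 32
$$

So the draws should cluster around 40, with a standard deviation of about 5.7.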

In [76]:
from numpy.random import default_rng
import numpy as np
import xarray as xr

rng = default_rng(seed = 111)  ## set random seed 
demand = rng.binomial(n=200,p=0.2,size=100)   ## get demand values

## make data array
xr.DataArray(data=demand)

### Naming Dimensions

We'll give the dimension a meaningful name instead of the default `dim_0`:

In [77]:
xr.DataArray(data=demand, 
             dims='draw')
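
To see the effect of `dims` directly, here is a minimal sketch on a three-element array. The values and the variable names `unnamed` and `named` are made up for illustration and are not the simulated draws.

```python
import numpy as np
import xarray as xr

values = np.array([42, 37, 40])  # illustrative numbers only

unnamed = xr.DataArray(data=values)               # dimension defaults to "dim_0"
named = xr.DataArray(data=values, dims=["draw"])  # dimension is called "draw"

print(unnamed.dims)          # ('dim_0',)
print(named.dims)            # ('draw',)
print(named.sizes["draw"])   # 3
```

The name matters later on, because operations such as selection and aggregation can refer to `draw` rather than to an axis position.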

### Coordinates

A `coordinate` is used to index an array. You can specify a `coordinate` with the `coords` argument and a dictionary:

In [78]:
demandDA = xr.DataArray(data=demand,
                        coords={'draw':np.arange(100)+1},
                        name='demand')
demandDA

Now we have an array of demand values $d_i$ where $i$ is a specific coordinate.
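
Because the coordinate labels start at 1 while positions start at 0, label-based and position-based lookups differ. A small sketch, assuming a toy array called `small` (hypothetical, three made-up values) with `dims` passed explicitly:

```python
import numpy as np
import xarray as xr

small = xr.DataArray(
    data=np.array([42, 37, 40]),         # illustrative values only
    dims=["draw"],
    coords={"draw": np.arange(3) + 1},   # labels 1, 2, 3
    name="demand",
)

by_label = small.sel(draw=1)       # look up the coordinate label 1
by_position = small.isel(draw=0)   # look up the first position

print(int(by_label), int(by_position))  # 42 42
```

Here `.sel` uses the coordinate value and `.isel` uses the integer position, so both calls return the first draw.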

In [79]:
## creating a DataArray of order quantities - must use name now to create dataset later
orderDA = xr.DataArray(data = np.arange(25,51), 
                       coords = {"orderQtyIndex": np.arange(25,51)},
                       name = "orderQty")
orderDA

## Creating a `Dataset` from multiple `DataArray` objects

Here's the magic

In [80]:
# create dataset by combining data arrays
ds = xr.merge([demandDA,orderDA])
ds
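
The merge step can be tried on a toy pair of arrays. This is only a sketch: `toy_ds`, `demand`, `order`, and the values are invented for illustration. It also shows why each `DataArray` needs a `name`, since an unnamed array has no variable name to be stored under.

```python
import xarray as xr

demand = xr.DataArray([30, 45], dims=["draw"], coords={"draw": [1, 2]}, name="demand")
order = xr.DataArray(
    [25, 35, 50],
    dims=["orderQtyIndex"],
    coords={"orderQtyIndex": [25, 35, 50]},
    name="orderQty",
)

toy_ds = xr.merge([demand, order])
print(sorted(toy_ds.data_vars))  # ['demand', 'orderQty']
print(dict(toy_ds.sizes))        # one entry per dimension

# An unnamed DataArray cannot be converted to a Dataset variable
unnamed = xr.DataArray([1, 2], dims=["draw"])
try:
    xr.merge([demand, unnamed])
    merge_failed = False
except ValueError:
    merge_failed = True
print(merge_failed)
```

Each array keeps its own dimension inside the dataset; nothing is combined until a calculation uses both variables.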

Now we can create new data variables with the `.assign` method, which lets us calculate across existing data variables.

In the example below, we'll create a new data variable `soldNewspapers`, which is the element-wise minimum of the `orderQty` and `demand` DataArrays:

In [81]:
ds = (ds
          .assign(soldNewspapers = np.minimum(ds.demand,ds.orderQty))
)
ds

But didn't `demand` have 100 draws and `orderQty` have 26 elements? Let's take a closer inspection of the result:

In [82]:
ds.soldNewspapers

We now have a data array of shape (100, 26). What's in each element?

In [83]:
ds.soldNewspapers[0]

What's neat here is that we created a new 100 x 26 DataArray, where each element is the number of newspapers sold for one pairing of a demand draw (from the binomial distribution) with an order quantity. It's like a mega cross-join!
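
The cross-join behaviour comes from broadcasting by dimension name: two arrays with different dimensions are expanded against each other. A minimal sketch with made-up numbers (2 draws and 3 order quantities, names chosen for illustration) makes the result small enough to check by hand:

```python
import numpy as np
import xarray as xr

demand = xr.DataArray([30, 45], dims=["draw"], coords={"draw": [1, 2]})
order = xr.DataArray(
    [25, 35, 50],
    dims=["orderQtyIndex"],
    coords={"orderQtyIndex": [25, 35, 50]},
)

# The dimensions differ, so every draw is paired with every order quantity
sold = np.minimum(demand, order)

print(sold.dims)    # ('draw', 'orderQtyIndex')
print(sold.values)
# [[25 30 30]
#  [25 35 45]]
```

Each row is one demand draw and each column is one order quantity, so the entry is whichever of the two is smaller.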

It might be useful to see this as a dataframe:

In [84]:
ds.to_dataframe()

| draw | orderQtyIndex | demand | orderQty | soldNewspapers |
|---|---|---|---|---|
| 1 | 25 | 42 | 25 | 25 |
| 1 | 26 | 42 | 26 | 26 |
| 1 | 27 | 42 | 27 | 27 |
| 1 | 28 | 42 | 28 | 28 |
| 1 | 29 | 42 | 29 | 29 |
| ... | ... | ... | ... | ... |
| 100 | 46 | 37 | 46 | 37 |
| 100 | 47 | 37 | 47 | 37 |
| 100 | 48 | 37 | 48 | 37 |
| 100 | 49 | 37 | 49 | 37 |


We can chain a further calculation using a `lambda` function, which receives the dataset produced by the previous step:

In [85]:
(  
    ds
    .assign(soldNewspapers = np.minimum(ds.demand,ds.orderQty))
    .assign(revenue = lambda DS: 3 * DS.soldNewspapers)
)
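
The reason for the `lambda` is easiest to see when the variable it needs does not exist on the original dataset. In this sketch (toy values; `toy`, `sold`, and `result` are hypothetical names), `sold` is created in the first `.assign` and is only available to the second one through the lambda's argument:

```python
import numpy as np
import xarray as xr

toy = xr.Dataset(
    {
        "demand": ("draw", [30, 45]),
        "orderQty": ("orderQtyIndex", [25, 35, 50]),
    },
    coords={"draw": [1, 2], "orderQtyIndex": [25, 35, 50]},
)

result = (
    toy
    .assign(sold=lambda d: np.minimum(d.demand, d.orderQty))
    .assign(revenue=lambda d: 3 * d.sold)  # d is the intermediate dataset, which has `sold`
)

print("sold" in toy.data_vars)   # False: the original dataset is unchanged
print(result.revenue.values)
# [[ 75  90  90]
#  [ 75 105 135]]
```

Writing `3 * toy.sold` in the second step would fail, because `toy` itself never gains the `sold` variable.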

For extra practice, instead of fixing `price=3`, we can test different prices. We must first add a new `DataArray`:

In [86]:
## creating a DataArray of Prices - must use name now to create dataset later
priceDA = xr.DataArray(data = np.arange(2,6), 
                       coords = {"priceIndex": np.arange(2,6)},
                       name = "price")
priceDA

In [88]:
ds2 = (ds
       .merge(priceDA)
       .assign(soldNewspapers = np.minimum(ds.demand,ds.orderQty))
       .assign(revenue=lambda x: x.price * x.soldNewspapers)
       )
ds2

## Subsetting

Let's look at the situation where we ordered 36 newspapers:

In [89]:
ds2.sel(orderQtyIndex=36)

We have every draw of demand here, and we can also see the `soldNewspapers` in this context:

In [90]:
ds2.sel(orderQtyIndex=36).soldNewspapers
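
`.sel` also accepts a list of labels or a label slice. A short sketch on a toy array (values and names invented for illustration) shows how each form affects the shape of the result:

```python
import xarray as xr

sold = xr.DataArray(
    [[25, 30, 30], [25, 35, 45]],
    dims=["draw", "orderQtyIndex"],
    coords={"draw": [1, 2], "orderQtyIndex": [25, 35, 50]},
    name="sold",
)

one = sold.sel(orderQtyIndex=35)                # single label: the dimension is dropped
some = sold.sel(orderQtyIndex=[25, 50])         # list of labels: the dimension is kept
ranged = sold.sel(orderQtyIndex=slice(25, 35))  # label slice: both endpoints are included

print(one.dims, one.values)          # ('draw',) [30 35]
print(some.shape)                    # (2, 2)
print(ranged.orderQtyIndex.values)   # [25 35]
```

Unlike positional slicing in plain Python, a label slice in `.sel` includes its upper endpoint.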

Or look at the results when `price=3`. Here is the total revenue across all draws for each order quantity, as a dataframe:

In [98]:
(ds2.sel(priceIndex=3)
    .revenue
    .sum(dim="draw")   ## total over all demand draws (same as axis=0 here)
    .to_dataframe())

| orderQtyIndex | priceIndex | revenue |
|---|---|---|
| 25 | 3 | 7500 |
| 26 | 3 | 7800 |
| 27 | 3 | 8100 |
| 28 | 3 | 8397 |
| 29 | 3 | 8688 |
| 30 | 3 | 8970 |
| 31 | 3 | 9249 |
| 32 | 3 | 9522 |
| 33 | 3 | 9792 |
| 34 | 3 | 10056 |
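
Reductions such as `.sum` can name the dimension to collapse instead of relying on its position. The sketch below uses a toy revenue array (made-up values and names) to show that both spellings agree when `draw` is the first dimension, and that other reductions such as `.mean` work the same way:

```python
import xarray as xr

revenue = xr.DataArray(
    [[75, 90, 90], [75, 105, 135]],
    dims=["draw", "orderQtyIndex"],
    coords={"draw": [1, 2], "orderQtyIndex": [25, 35, 50]},
    name="revenue",
)

by_name = revenue.sum(dim="draw")   # collapse the draws by name
by_axis = revenue.sum(axis=0)       # same result, but depends on dimension order

print(by_name.values)                     # [150 195 225]
print(by_name.equals(by_axis))            # True
print(revenue.mean(dim="draw").values)    # [ 75.   97.5 112.5]
```

Naming the dimension keeps the code correct even if the array is later transposed or gains another dimension.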
