<a href="https://colab.research.google.com/github/flyaflya/persuasive/blob/main/demoNotebooks/xarrayIntro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install daft matplotlib --upgrade

# Multi-Dimensional Arrays for Decision Analysis


In [None]:
#@title A Full Decision Model
#| echo: false
#| include: false

import matplotlib.pyplot as plt
import pandas as pd
from functools import partial, partialmethod
import daft   ### %pip install -U git+https://github.com/daft-dev/daft.git
from numpy.random import default_rng
import numpy as np

class dag(daft.PGM):
    def __init__(self, *args, **kwargs):
        daft.PGM.__init__(self, *args, **kwargs)
    
    obsNode = partialmethod(daft.PGM.add_node, aspect = 2.2, fontsize = 10, plot_params = {'facecolor': 'cadetblue'})
    decNode = partialmethod(daft.PGM.add_node, aspect = 2.2, fontsize = 10, shape = "rectangle", plot_params = {'facecolor': 'thistle'})
    detNode = partialmethod(daft.PGM.add_node, aspect = 5.4, fontsize = 9.25, alternate = True, plot_params = {'facecolor': 'aliceblue'})
    latNode = partialmethod(daft.PGM.add_node, scale = 1.2, aspect = 2.2, fontsize = 10, plot_params = {'facecolor': 'aliceblue'})
    detNodeBig = partialmethod(daft.PGM.add_node, scale = 1.6, aspect = 2.25, fontsize = 10, alternate = True, plot_params = {'facecolor': 'aliceblue'})
    latNodeBig = partialmethod(daft.PGM.add_node, scale = 1.6, aspect = 2.2, fontsize = 10, plot_params = {'facecolor': 'aliceblue'})
    
pgm = dag(dpi = 300, alternate_style="outer")
pgm.latNode("d","Demand\n"+r"$(d_i)$",1,2)
pgm.decNode("q","Order\n" + r"Qty$(q)$",6,2)
pgm.detNode("pi","Profit: " + r"$\pi(d_i,q) =$" + "\n" + r"$3 \times \min(d_i,q) - 1 \times q$", 3.5,2)
pgm.detNode("l","Lost Sales\n" + r"$\ell(d_i,q) = \max(0,d_i-q)$", 3.5,1)
pgm.add_edge("d","pi")
pgm.add_edge("q","pi")
pgm.add_edge("d","l")
pgm.add_edge("q","l")
pgm.add_plate([0.25, 0.25, 5, 2.25], label = "Sim Num:\n" + r"$i = 1, 2, 3$", 
              label_offset = (2,2), rect_params = dict({"fill": False, "linestyle": "dashed", "edgecolor": "cadetblue"}))
pgm.add_plate([2, -0.1, 4.7, 2.75], label = "Order Quantity:\n" + r"$q = 30, 40, 50$", 
              label_offset = (2,2), position = "bottom right", rect_params = dict({"fill": False, "linestyle": "dotted", "edgecolor": "darkorchid"}))
pgm.show(dpi = 140)

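Around the nodes, notice the new addition of rectangles, technically called _plates_.  Each plate indicates that the nodes inside it are repeated once for every value of the plate's index.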

$$
\begin{aligned}
i &\equiv \textrm{Index for simulation draws. } i \in \{1, 2, 3\}\\
d_i &\equiv \textrm{Daily demand for newspapers, and}\\
d_i &\sim \textrm{Binomial}(n=200,p=0.2).\\
q &\equiv \textrm{Order quantity chosen by decision-maker, where}\\
q &\in \{30, 40, 50\} \qquad \textit{   (potentially good order qtys.)}.\\
\pi &\equiv \textrm{Daily profit is revenue minus expenses.}\\
\pi(d_i,q) &= 3 \times \min(d_i,q) - 1 \times q \qquad \textit{   (cannot sell more than ordered)}.\\
\ell &\equiv \textrm{Lost sales.  Unmet demand due to being out of stock.}\\
\ell(d_i,q) &= \max(0, d_i-q) \qquad \textit{   (lost sales cannot be negative)}.
\end{aligned}
$$
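
As a quick check of the formulas above, the following minimal sketch evaluates profit and lost sales for a single demand value at a time. The function names `profit` and `lost_sales` and the demand values are illustrative assumptions, not part of the model code used later.

```python
def profit(d, q):
    """Daily profit: 3 per newspaper sold minus 1 per newspaper ordered."""
    return 3 * min(d, q) - 1 * q

def lost_sales(d, q):
    """Unmet demand when demand exceeds the order quantity."""
    return max(0, d - q)

# hypothetical demand values, chosen only for illustration
print(profit(38, 40), lost_sales(38, 40))   # 74 0  (demand below order quantity)
print(profit(45, 40), lost_sales(45, 40))   # 80 5  (demand above order quantity)
```

With demand of 45 and only 40 newspapers ordered, sales are capped at 40, so the 5 unserved customers show up as lost sales rather than as extra profit.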

In [None]:
potentialCusts = 200
purchaseProb = 0.2

rng = default_rng(seed = 111) 
numSims = 3

# create data frame to store simulated demand
newsDF = pd.DataFrame({"simNum": range(1, numSims+1),  # sequence of 1 to numSims
                   "demand": rng.binomial(n = potentialCusts, 
                                          p = purchaseProb,
                                          size = numSims)})

## google SEARCH PHRASE: get element-wise minimum of two columns in pandas dataframe
newsDF["profit_q40"] = 3 * np.minimum(newsDF.demand,40) - 1 * 40
newsDF["lostSales_q40"] = np.maximum(0,newsDF.demand - 40)

# view up to the first 5 rows of newsDF
newsDF.iloc[:5,:]

It will feel cumbersome to add more columns for each order quantity's profit and each quantity's lost sales into a dataframe.  There must be a better way to structure how we store this data.  

## The `xarray` Package

### `DataArray`

In its simplest 1-dimensional form, a `DataArray` is just a collection of values, like a column of a dataframe (`pandas.Series`) or a one-dimensional array of values (`numpy.ndarray`).  We can create a simple `DataArray` using its constructor function.

In [None]:
from numpy.random import default_rng
import numpy as np
import xarray as xr

rng = default_rng(seed = 111)  ## set random seed 
demand = rng.binomial(n=200,p=0.2,size=3)   ## get demand values

## make data array
xr.DataArray(data = demand)


*   The dimension key-value pair `dim_0: 3` in the output tells us that the cardinality of our demand array is 3; `dim_0` is the default name given to an unnamed dimension.


In [None]:
## make data array with labelled dimension name
xr.DataArray(data = demand, dims = "draw")

In [None]:
## explicit labeling of coordinates - must use name now to create dataset later
demandDA = xr.DataArray(data = demand, coords = {"draw": np.arange(3)+1}, name = "demand")
demandDA

Notice that we can drop the `dims` argument because the dimension name is supplied by the key of the dictionary passed to the `coords` argument.

In [None]:
#@title Repeating the graphical model
pgm.show(dpi=120)

In [None]:
## creating a DataArray of order quantities - must use name now to create dataset later
orderDA = xr.DataArray(data = [30, 40, 50], 
                       coords = {"orderQtyIndex": [30,40,50]},
                       name = "orderQty")
orderDA

## Merge the two data arrays into a dataset.

In [None]:
# create dataset by combining data arrays
newsvDS = xr.merge([demandDA,orderDA])
newsvDS

Our `Dataset` container now has the dimensions implied by the plate indices in the graphical model (@fig-newsvGM2), namely `draw: 3` and `orderQtyIndex: 3`.  In math terms we have two sets: a set of demand draws with cardinality $|D|=3$ and a set of order quantities with cardinality $|Q|=3$.  The set of all ordered pairs $(d_i,q)$ used to calculate profit and lost sales therefore has cardinality $|D| \times |Q| = 9$, so our simulated $\pi(d_i,q)$ and $\ell(d_i,q)$ values will each have 9 elements.
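
The nine $(d_i, q)$ pairs arise automatically through broadcasting: arithmetic between arrays indexed by different dimensions produces a result indexed by both. A minimal sketch, assuming made-up demand values (the names `d`, `q`, and `sold` and the numbers are illustrative only):

```python
import numpy as np
import xarray as xr

# hypothetical demand draws, indexed by "draw"
d = xr.DataArray([38, 45, 41], dims="draw", coords={"draw": [1, 2, 3]})
# candidate order quantities, indexed by "orderQtyIndex"
q = xr.DataArray([30, 40, 50], dims="orderQtyIndex",
                 coords={"orderQtyIndex": [30, 40, 50]})

# arrays with different dimensions broadcast against each other
sold = np.minimum(d, q)
print(sold.dims)   # both dimensions appear in the result
print(sold.size)   # 9, i.e. 3 draws x 3 order quantities
```

No loop over pairs is needed; the dimension names tell `xarray` how to line the values up.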

## Seven mental models of dataset manipulation

The seven most important ways we might want to manipulate a data array or dataset are to:

1. Assign: `.assign()` or `.assign_coords()`: Add data variables or coordinates with broadcasting and array math.  (Can also use dict-like methods.)
2. Subset: `.sel()` or `.where()` subset a data array or dataset based on coordinates or data values, respectively.
3. Drop: `.drop_vars()` or `.drop_dims()`: Remove an explicit list of data variables or remove all data variables indexed by a particular dimension. 
4. Sort: `.sortby()` sorts or arranges a data array or dataset based on data values or coordinate values.
5. Aggregate: See the [xarray documentation](https://docs.xarray.dev/en/stable/api.html?highlight=aggregation#) for a list of aggregation functions.  These functions will collapse all the data of a given dimension; for example one can collapse a time dimension using the `mean()` aggregation method to get the average value for all of time.
6. Split-Apply-Combine: `.groupby()` and `DatasetGroupBy.foo()` are usually used in combination to 1) _split_ the dataset into groups based on levels of a variable, 2) _apply_ a function (e.g. `foo()`) to each group's dataset individually, and then 3) _combine_ the modified datasets. See the [xarray documentation](https://docs.xarray.dev/en/stable/api.html?highlight=Groupby#groupby-objects) for more details. 
7. Merge (join): Intelligently combine information from two datasets.

### 1 - Assign: Adding Data Arrays



In [None]:
(  ## open parenthesis to start readable code
    newsvDS
    .assign(soldNewspapers = np.minimum(newsvDS.demand,newsvDS.orderQty))
) ## close parenthesis finishes the "method chaining"

In [None]:
#| eval: false
(  ## open parenthesis to start readable code
    newsvDS
    .assign(soldNewspapers = np.minimum(newsvDS.demand,newsvDS.orderQty))
    .assign(revenue = 3 * newsvDS.soldNewspapers)
) ## close parenthesis finishes the "method chaining"

The above code will yield an error:

```
AttributeError: 'Dataset' object has no attribute 'soldNewspapers'
```

The last assignment does not have visibility into the newly created `soldNewspapers` data because `newsvDS.soldNewspapers` refers to the original dataset, which the chain has not modified.  To pass the _current state_ of the dataset to the `.assign()` method, we use a `lambda` function.  The `lambda` function has syntax `lambda arguments : expression`, where `lambda` is a keyword telling Python to expect an argument (or arguments), followed by a colon (`:`), and then an expression for what the function returns.  Here is updated code that works:

In [None]:
(  ## open parenthesis to start readable code
    newsvDS
    .assign(soldNewspapers = np.minimum(newsvDS.demand,newsvDS.orderQty))
    .assign(revenue = lambda DS: 3 * DS.soldNewspapers)
) ## use lambda function to get current state of dataset in chain
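
To see the `lambda` syntax in isolation, here is a small sketch outside of `xarray`. The names `triple` and `triple_def` are hypothetical; the point is only that a `lambda` expression builds the same kind of callable as a `def` statement, which is why `.assign()` can call it with the dataset as it exists at that step of the chain.

```python
# a lambda expression: one argument x, returns 3 * x
triple = lambda x: 3 * x

# the equivalent named function
def triple_def(x):
    return 3 * x

print(triple(14), triple_def(14))  # 42 42
```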

## YOUR TURN:  Use `assign` to add a lost sales data variable to the dataset.  Modify the below.

In [None]:
newsvDS = (newsvDS
            .assign(soldNewspapers = np.minimum(newsvDS.demand,newsvDS.orderQty))
            .assign(revenue = lambda DS: 3 * DS.soldNewspapers)
            .assign(expense = 1 * newsvDS.orderQty)
            .assign(profit = lambda DS: DS.revenue - DS.expense)
)

(newsvDS
 .to_dataframe())  #dataframe for printing

Note that one can also add data variables directly using dict-like indexing when chains of operations are not required.  The following code works similarly to what we did earlier:

In [None]:
newsvDS["lostSales"] = np.maximum(0, newsvDS.demand - newsvDS.orderQty)

### 2 - Select a subset of the data array or dataset

Syntax of help documentation is often `packagename.Class.method`.  Let's find the `sel` method in the xarray documentation. (https://docs.xarray.dev/en/stable/user-guide/indexing.html)

In [None]:
# select a particular value for a dimension
newsvDS.sel(orderQtyIndex = 30) # returns 1-d dataset

To keep a range of coordinate values, `xarray` follows the pandas convention of label-based slicing with the built-in `slice` function.

In [None]:
# slicing returns all values inside the range (inclusive) 
# as long as the index labels are monotonic increasing
newsvDS.sel(orderQtyIndex = slice(36,58))
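
Label-based slicing keeps both endpoints, unlike Python's positional slicing. A minimal sketch on a toy array (the name `toy` and its values are assumptions for illustration):

```python
import xarray as xr

toy = xr.DataArray([1, 2, 3], dims="orderQtyIndex",
                   coords={"orderQtyIndex": [30, 40, 50]})

# label-based: both endpoints 30 and 40 are kept
by_label = toy.sel(orderQtyIndex=slice(30, 40))
# position-based: the stop position 1 is excluded, as in plain Python
by_position = toy.isel(orderQtyIndex=slice(0, 1))

print(by_label.orderQtyIndex.values)     # [30 40]
print(by_position.orderQtyIndex.values)  # [30]
```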

Slicing returns a smaller dataset or data array based on coordinates, but often we want a smaller dataset based on data values.  In these cases, we apply the `.where()` method where the argument is some logical condition for which data to keep:

In [None]:
# use explicit Dataset.DataArray syntax in the condition;
# values that do not meet the condition are replaced with NaN
newsvDS.where(newsvDS.lostSales > 0)

Often, the `lambda` syntax for anonymous functions is used to pass in the current dataset:

In [None]:
(
    newsvDS.where(lambda x: x.lostSales > 0, drop = True)
    .to_dataframe()  #convert to pandas dataframe for printing
    .dropna() # pandas method to remove NaN rows
 ) 

## YOUR TURN:  
Experiment with omitting the `pandas.DataFrame.dropna` method from the above.  What is different?



In [None]:
## experiment here:



### 3 - Drop Dimensions

The `drop_dims()` method returns a new object by dropping a full dimension from a dataset along with any variables whose coordinates rely on that dimension.

In [None]:
newsvDS.drop_dims("orderQtyIndex")

Above, the order quantity dimension is dropped along with all the data variables whose value depended on order quantity: `orderQty`, `soldNewspapers`, `revenue`, `expense`, `profit`, and `lostSales`.

If you want to just drop some of the data variables, you use `drop_vars()`:

In [None]:
newsvDS.drop_vars(["revenue","expense"])

### 4 - Sort a data array or dataset based on data values or coordinate values.

We will typically want dataframe-like reports generated out of `xarray` as a last step in data manipulation.  We will rely on `pandas.DataFrame.sort_values()` to help us for this mental model.

In [None]:
(newsvDS
 .to_dataframe()
 .sort_values("profit"))
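
For sorting that stays inside `xarray`, the `.sortby()` method reorders a dataset along a dimension, using either a coordinate or a one-dimensional data variable. A minimal sketch on a toy dataset (the name `toyDS` and its values are illustrative assumptions):

```python
import xarray as xr

toyDS = xr.Dataset(
    {"demand": ("draw", [45, 38, 41])},
    coords={"draw": [1, 2, 3]},
)

# sort along "draw" by the values of the 1-d variable "demand"
byDemand = toyDS.sortby("demand")
# sort by the coordinate itself, largest label first
byDrawDesc = toyDS.sortby("draw", ascending=False)

print(byDemand.demand.values)   # [38 41 45]
print(byDrawDesc.draw.values)   # [3 2 1]
```

Because `.sortby()` needs one-dimensional sort keys, two-dimensional variables such as profit are easier to rank after converting to a dataframe, as done above.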

## YOUR TURN:
Using `.sort_values("profit", ascending = False)`, reverse the sort order so maximum profit is first.

In [None]:
# Experiment Here:





### 5 - Aggregation 

See the `xarray` documentation at https://docs.xarray.dev/en/stable/api.html#id6 for a complete list of aggregation functions.

A typical aggregation workflow has two steps:

1) Aggregate the information in a data array. 
2) Assign the output of the aggregation to a new data array in a pre-existing dataset.

In [None]:
## collapse the 3 draws into 1 summary statistic per order quantity
(
    newsvDS
    .profit
    .mean(dim = "draw")
)

Notice, this returns a `DataArray` object.  We will then keep our data and summary statistics together in one dataset by adding the array back to the original dataset using `assign()`.  Here the two-step workflow is demonstrated to return expected profit and expected lost sales for each order quantity:

In [None]:
## create mean summary stats
(
    newsvDS
    .assign(expProfit = newsvDS.profit.mean(dim="draw"))
    .assign(expLossSales = newsvDS.lostSales.mean(dim="draw"))
)

Feel free to play around with other frequently used aggregation functions, including `count`, `first`, `last`, `max`, `mean`, `median`, `min`, `quantile`, and `sum` (`first` and `last` apply to grouped data).
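
As a starting point for experimenting, here is a sketch applying a few aggregations along one dimension of a toy array. The name `toyProfit` and its values are made up for illustration.

```python
import xarray as xr

# hypothetical profits: rows are draws, columns are order quantities
toyProfit = xr.DataArray(
    [[60, 74, 64], [60, 80, 85], [60, 80, 73]],
    dims=("draw", "orderQtyIndex"),
    coords={"draw": [1, 2, 3], "orderQtyIndex": [30, 40, 50]},
)

# each call collapses "draw", leaving one value per order quantity
print(toyProfit.max(dim="draw").values)      # [60 80 85]
print(toyProfit.min(dim="draw").values)      # [60 74 64]
print(toyProfit.median(dim="draw").values)   # [60. 80. 73.]
print(toyProfit.sum(dim="draw").values)      # [180 234 222]
```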

### 6 - Split-Apply-Combine

See the [xarray documentation](https://docs.xarray.dev/en/stable/api.html?highlight=Groupby#groupby-objects) for more details using split-apply-combine:



In [None]:
## find average profit by orderQty
## see documentation here: https://docs.xarray.dev/en/stable/generated/xarray.core.groupby.DatasetGroupBy.mean.html
(
    newsvDS
    .get("profit")
    .groupby("orderQtyIndex")
    .mean(...)
).to_dataframe()

## YOUR TURN
Find the maximum lost sales by order quantity.

In [None]:
# Code Here:

### 7 - Merge (join)

To be shown after the dinner break.