<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Multi-way Modeling

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

from IPython.display import HTML, Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

In [None]:
def retrieve_groupbys(df, by, column="X"):
    _, gbs =  zip(*list(df.groupby(by=by)[column]))
    return gbs

def make_plot(mus, ax=None, **plot_kwargs):
    if ax is None:
        f, ax = plt.subplots(figsize=(12, 6))
    xs = np.arange(mus.shape[0])
    ax.plot(xs, mus[:, 0], lw=4, color="C0", **plot_kwargs)
    ax.plot(xs, mus[:, 1], lw=4, color="C1", **plot_kwargs)

from matplotlib.lines import Line2D

def make_line(color, linewidth=4):
    return Line2D([0], [0], linewidth=linewidth, color=color)

# Previously, we worked with data that varied only along one category, or factor.

For example:
- how does a participant's performance on a task vary depending on whether they are given caffeine or not?
- how do a participant's brain activity patterns vary with the type of music they are listening to?

## In the next two lectures, we will learn how to work with data that varies along two or more categories, which might _interact_.

For example:
- how do whether a participant drinks caffeine and whether they're over 40 interact to determine their performance on a task?
- if we show a person movies inside an fMRI machine, how do both the sound and the image being presented together determine the pattern of activity in their brain?

# Let's start by considering how we might make a detailed mechanistic model of an experiment.

That is, we want to describe _exactly_ the process by which each data point is generated.

When we run an experiment,
we typically record and control only a very small subset
of all the variables we _could_ record and control,
and which _in principle_ have an effect on each other and on the variable we are interested in measuring.

## Physics Example: Measuring weight of an object

If we measure the weight of an object more than once,
we find that what we observe varies from measurement to measurement.

That's in part because we usually ignore all of the following:
- Air pressure
- Temperature
- Gravitational pull of the moon
- Gravitational pull of the sun and other heavenly bodies
- Changes in the stiffness of the springs inside our scale

## Psychology Example: Measuring a brain signal in response to hearing a word

If we perform an experiment where we repeatedly measure a brain signal,
e.g. the electrical signal of an EEG (aka brain wave)
or the magnetic signal of an fMRI,
in response to the same stimulus,
e.g. a spoken word,
we'll see that the signals vary,
both from person to person and within the same person.

That's in part because we are ignoring all of the following factors:

- How closely they are paying attention
- Their bodily state - hungry, sleepy, overheated
- The different individuals' histories with that word
- The precise orientation of their skull relative to our probes
- The intonation of the word
- The behavior of individual brain cells
- Variability in our equipment

## In one view, it is our ignorance of these factors that leads to what we call randomness.

The laws of physics are deterministic,
excepting certain interpretations of quantum mechanics,
meaning that, in principle,
once certain values are known
(position, velocity, mass, charge, etc.)
for all of the pieces of the system,
then there is nothing left to chance.

And at the scale that most phenomena of interest to humans happen,
quantum effects are negligible.

Therefore when we say our data is random, we _must_ be cheating a little bit.

## That is, _randomness_ is just code for _things I don't know_.

# Imagine an experiment where there are exactly 12 categorical factors that influence the measured value.

That is,
once one knows the values of each of the 12 factors,
the final value of the measurement is _fully determined_.

That is, it is deterministic, rather than random.

## Assume further that each factor is binary: it is present or not.

## Lastly, let's say every factor, when present, adds a certain amount to the measurement, which we call the _size_ of the _effect_ of the factor.

In [None]:
factor_effect_sizes = list(sorted(pm.Uniform.dist(lower=-1, upper=1).random(size=12)))

factor_effect_sizes

In [None]:
# factor_effect_sizes = [-3] + factor_effect_sizes + [6]

## To make this work with pyMC, let's say that on any given trial, which effects are present is _random_.

This is an example of using `pyMC` to simulate a system,
as opposed to using `pyMC` to compute posteriors.

In [None]:
with pm.Model() as many_effects_model:
    effects_present = pm.Bernoulli("effect_present", p=0.5, shape=len(factor_effect_sizes))
    
    sum_of_effects = 0
    for ii in range(len(factor_effect_sizes)):
        sum_of_effects += factor_effect_sizes[ii] * effects_present[ii]
        
    observed_data = pm.Deterministic("X", sum_of_effects)

How to make this more realistic:
- Not all factors are binary and have equal chance - could switch to `Categorical`
- Many factors are related to each other (we'll see more on that today)
- Not all effects are discrete! (we'll talk about that next week)

In [None]:
with many_effects_model:
    many_effects_trace = pm.sample(draws=500, chains=10)

    many_effects_df = pm.trace_to_dataframe(many_effects_trace)

In [None]:
many_effects_df.head()

## If we know which factors are present and absent, the data looks deterministic.

The next block of code finds all the rows that are equal to a given row
using `apply` on the `row_equal_to` function defined below.

Not all rows will be duplicated,
so if the result printed by this cell has only one row,
try changing the `row_index`.

If you inspect the `effect_present` columns in the output of the cell,
you'll see that the values are the same in all the printed rows.
Furthermore, the value of the `X` column is also equal.

In [None]:
def row_equal_to(row, other_row):
    return all(row[:-1] == other_row[:-1])

row_index = 0
equal_to_row = many_effects_df.apply(row_equal_to, axis=1,
                                     other_row=many_effects_df.iloc[row_index])

many_effects_df[equal_to_row]

And so if we were to group our data by all of these columns simultaneously
and then look at the histogram of `X`, the result would just be a single point:
there would be no "distribution" of `X`.
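The same point can be checked without `pyMC`. The sketch below is a stand-in for the simulation above, not the model itself: the effect sizes, the random seed, and the name `run_trial` are all made up for illustration. It groups simulated measurements by the full pattern of present and absent factors and counts how many distinct values of `X` land in each group.

```python
import random
from collections import defaultdict

rng = random.Random(0)
n_factors = 12
# hypothetical effect sizes, drawn once and then held fixed
effect_sizes = [rng.uniform(-1, 1) for _ in range(n_factors)]

def run_trial():
    """Pick which factors are present, then add up their effects."""
    present = tuple(rng.randint(0, 1) for _ in range(n_factors))
    x = sum(size for size, on in zip(effect_sizes, present) if on)
    return present, x

# group the measurements by the full pattern of present/absent factors
groups = defaultdict(set)
for _ in range(20000):
    present, x = run_trial()
    groups[present].add(x)

# how many distinct X values does each group contain?
values_per_group = {len(values) for values in groups.values()}
print(values_per_group)
```

Every group holds exactly one distinct value: once all 12 factors are known, nothing is left to vary.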

## Despite the deterministic nature of our data, if we ignore which factors are present, we obtain a familiar-looking distribution for the `X` values.

That is,
we pretend that we didn't measure the values of the `effect_present` variables
and then look at the distribution.

This simulates the realistic setting where we measure the outcome variable
but not all of the factors that determine it.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
xs = np.linspace(min(many_effects_df["X"]), max(many_effects_df["X"]))
ps = scipy.stats.norm.pdf(
    xs, loc=many_effects_df["X"].mean(), scale=many_effects_df["X"].std())
sns.distplot(many_effects_df["X"], label="Observed");
ax.plot(xs, ps, lw=4, label="Closest Gaussian"); ax.legend();

## The fact that this distribution is bell-shaped is a case of the _Central Limit Theorem_.

Whenever our measurement is the sum of a large number of independent, non-interacting effects
of about the same size, the distribution of measurement values we observe
when we allow those effects to vary is approximately a normal distribution.
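The bell curve that the sum approaches is pinned down by the effect sizes. If each factor is present independently with probability `p`, the mean of the sum is `p` times the total of the effect sizes, and the variance is `p * (1 - p)` times the total of their squares. The sketch below checks this against a plain-Python simulation; the effect sizes and seed are made up, and it does not use the `pyMC` model above.

```python
import random
import statistics

rng = random.Random(1)
# hypothetical effect sizes, one per binary factor
effect_sizes = [rng.uniform(-1, 1) for _ in range(12)]
p = 0.5  # chance that any one factor is present

# mean and standard deviation of a sum of independent on/off effects
predicted_mean = p * sum(effect_sizes)
predicted_sd = (p * (1 - p) * sum(e ** 2 for e in effect_sizes)) ** 0.5

# simulate: on each trial, add up the effects that happen to be present
samples = [
    sum(e for e in effect_sizes if rng.random() < p)
    for _ in range(50000)
]
simulated_mean = statistics.fmean(samples)
simulated_sd = statistics.pstdev(samples)

print(predicted_mean, simulated_mean)
print(predicted_sd, simulated_sd)
```

The formula for the variance also hints at why a few large effects cause trouble: they dominate the sum of squares, so the shape of the distribution is mostly set by those few factors.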

If some of the effects are much larger than the others,
then the Central Limit Theorem holds much more loosely:
we need more and more interfering effects to end up with a bell curve.

It is often the case that some effects are much larger and more important than others:
in science, we rely on this to make simple models.

To see what this looks like, uncomment and execute the cell skipped above
that adds two new factors to the data, each larger in magnitude than the others,
then re-run the cells above.
You'll see that the data is no longer distributed normally.
If you also run the cells below

## We can _estimate_ the effect sizes by grouping and taking averages.

In [None]:
effect_columns = many_effects_df.columns[:-1]; factor_index = -1
group_means = many_effects_df.groupby(effect_columns[factor_index])["X"].mean()
group_means

In [None]:
group_means[1] - group_means[0]

In [None]:
factor_effect_sizes[factor_index]

## If we group on one of the effects and then plot, we still see data that looks random.

In [None]:
column = effect_columns[-1]
gbs = retrieve_groupbys(many_effects_df, by=column)

f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(x=column, y="X", data=many_effects_df, ax=ax, width=0.5, linewidth=4);

scipy.stats.f_oneway(*gbs)

This type of plot is known as a [Violin Plot](https://seaborn.pydata.org/generated/seaborn.violinplot.html),
for the resemblance to the musical instrument.

The "lumpy" portion of the plot is a kernel density estimate, as in `distplot`.
The only difference is that the density is mirrored,
on the left and right of the box in the center.
The box in the center is a boxplot:
the median and the 25th and 75th percentile are shown.

## Our modeling tools are designed to try and manage the uncertainty that our ignorance of the unmeasured factors introduces.

## This remains true if we group on a small number of effects, relative to the total.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(x=effect_columns[-1], y="X", hue=effect_columns[-2], data=many_effects_df, ax=ax, width=0.5,
               linewidth=4);

# In a real experiment, we would typically only measure a small handful of the influencing factors at most: say, 2.

That is, treat the model above as the _true model_,
one that accurately describes our data-generating process.

In real life,
we usually don't know this model:
we don't know the effect sizes,
we don't even know the identities of the factors!
(and there are infinitely many, or at least an extremely large number of, possibilities
for factors).

And furthermore, we don't typically measure everything of relevance.
We identify, based on our intuition or on previous results,
factors that we think are important.

So the data we actually observe, in a real experiment, looks more like:

In [None]:
observed_data_df = pd.DataFrame()

factor1_idx = 0
factor2_idx = -1

observed_data_df["factor1"] = many_effects_df[effect_columns[factor1_idx]]
observed_data_df["factor2"] = many_effects_df[effect_columns[factor2_idx]]

observed_data_df["measurement"] = many_effects_df["X"]

Because the factors are ordered by their effect size, from most negative to most positive,
setting the indices to 0 and -1 presumes we identified the factors with the largest effects.

Try setting them to different values to see what happens when we try to build models of data
where some of the most important factors are left out.

In [None]:
print(observed_data_df.head())

Always think of your data this way:
the tip of the iceberg,
or as a "slice" through what you should be observing.

From this dataframe,
which represents what we might actually observe in an experiment,
we can produce the kinds of plots we've made for real data.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(x="factor1", y="measurement", data=observed_data_df, linewidth=4);

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(x="factor2", y="measurement", data=observed_data_df, linewidth=4);

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(x="factor1", y="measurement", hue="factor2", data=observed_data_df, linewidth=4);

But we cannot see the deterministic nature of the true model in `observed_data_df`,
nor can we recover the exact values of the effect sizes.

## Our modeling tools are designed to try and manage the uncertainty that our ignorance of the unmeasured factors introduces.

In [None]:
# first part of prior: describing uncertainty in the group means

with pm.Model() as synthetic_data_model:
    mus = pm.Normal("mus", mu=0, sd=1e2, shape=(2, 2))

The new piece here is in the `shape` argument:
now, the shape argument is getting a tuple of values, `shape=(2, 2)`,
rather than a single value, e.g. `shape=3`.

Previously, when shape had only one value,
we thought of `mus` as a list.

Technically, in that case and in this one,
`mus` is something called a `Tensor`,
from the `theano` library.

In this case, you can think of it like a list of lists:

In [None]:
list_of_lists = [[0, 1], [2, 3]]
list_of_lists

The interpretation of `mus` is still the same:
it's just a random variable that holds values for
each of the group means.

To determine which group we're in, we need to know our position in both the "outer" and the "inner" list.
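As a small illustration with a plain Python list of lists (the variable names here are only for this sketch), picking out one group takes two positions, one after the other:

```python
list_of_lists = [[0, 1], [2, 3]]

outer_position = 1  # which inner list to look in
inner_position = 0  # which entry inside that inner list

entry = list_of_lists[outer_position][inner_position]
print(entry)
```

In the model, the outer position plays the role of `factor1` and the inner position the role of `factor2`.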

In [None]:
# second part of prior: describing uncertainty in the standard deviation

with synthetic_data_model:
    sd = pm.Exponential("sigma", lam=0.1)

In [None]:
# likelihood: if we knew the means and sds, what would be our remaining uncertainty?

with synthetic_data_model:
    # where does the uncertainty represented by this Normal come from?
    #  from things we're not measuring and modeling
    observations = pm.Normal("observations",
                             mu=mus[observed_data_df["factor1"], observed_data_df["factor2"]],
                             # implementation detail: we use _two_ Series to index mus now
                             sd=sd, observed=observed_data_df["measurement"])

Notice: this isn't a _mechanistic_ model of our data,
or at least not a complete one.

We know that the real mechanistic model of our data
is the `many_effects_model` above.

Instead, we say that some of the mechanisms in our system
we are going to approximate with a Normal likelihood.

$$
\mu[i, j] \sim \text{Normal}(0, 1\mathrm{e}2)\\
\sigma \sim \text{Exponential}(0.1)\\
d \sim \text{Normal}(\mu[i, j], \sigma)
$$

The notation here is meant to evoke the syntax we use to index into `DataFrame`s with `iloc`.
More on that below.
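Indexing `mus` with two columns at once can be previewed with a NumPy array. This is a sketch under the assumption that the tensor inside the model pairs up its two indices the same way NumPy does; the group means and factor values are made up.

```python
import numpy as np

# hypothetical group means: rows follow factor1, columns follow factor2
mus = np.array([[0.0, 1.0],
                [2.0, 3.0]])

# one entry per observation: was each factor absent (0) or present (1)?
factor1 = np.array([0, 1, 1, 0])
factor2 = np.array([1, 1, 0, 0])

# indexing with two integer arrays pairs them up element by element,
# so observation k gets the mean mus[factor1[k], factor2[k]]
mean_per_observation = mus[factor1, factor2]
print(mean_per_observation)
```

The result has one mean per observation, which is what the `mu` argument of the likelihood needs.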

In [None]:
with synthetic_data_model:
    synthetic_trace = pm.sample()
    synthetic_posterior_samples = shared_util.samples_to_dataframe(synthetic_trace)

In [None]:
print(synthetic_posterior_samples.head())

Notice that the sampled values of `mus` also look like lists-of-lists:

In [None]:
synthetic_posterior_samples["mus"].iloc[0]

But they are _not_ actually lists:
they are `arrays`,
provided by the `numpy` library, alias `np`.

In [None]:
type(synthetic_posterior_samples["mus"].iloc[0])

They are also like `DataFrames`, in that we can use
the indexing syntax, `[...]`, to access their contents.

However, unlike `DataFrames`,
`arrays` only have one style of indexing,
which is equivalent to `iloc`.

In [None]:
print(synthetic_posterior_samples.head())

In [None]:
synthetic_posterior_samples.iloc[1, 0]   # entry in second row of first column of DataFrame

In [None]:
example_array = synthetic_posterior_samples.iloc[1, 0] 

example_array, example_array[1, 0]  # entry in second row of first column of array

In [None]:
example_array, example_array[:, 0]  # entries in all rows of first column of array (result is 1-D array)

Also unlike `DataFrames`, arrays can have more (or less) than two dimensions.

See [this tutorial for more on numpy and arrays](https://hackernoon.com/introduction-to-numpy-1-an-absolute-beginners-guide-to-machine-learning-and-data-science-5d87f13f0d51).
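A quick sketch of what "more (or less) than two dimensions" means in practice. The arrays are made up; `ndim` reports the number of dimensions, and the number of positions needed to reach one entry matches it.

```python
import numpy as np

scalar_like = np.array(3.5)          # zero dimensions: a single number
vector = np.array([1.0, 2.0, 3.0])   # one dimension: like a list
table = np.zeros((3, 2))             # two dimensions: like a DataFrame
stack = np.zeros((4, 3, 2))          # three dimensions: a stack of tables

dims = [a.ndim for a in (scalar_like, vector, table, stack)]
print(dims)

# a three-dimensional array needs three positions to reach one entry
entry = stack[0, 2, 1]
```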

## We then estimate the effects of factors from the entries of the `mus` array on each sample.

First, let's look at the mean when both factors are present and when they are absent.

In [None]:
def get_mean_both_factors_absent(row):
    mus = row["mus"]
    return mus[0, 0]

def get_mean_both_factors_present(row):
    mus = row["mus"]
    return mus[1, 1]

In [None]:
mean_both_absent = synthetic_posterior_samples.apply(get_mean_both_factors_absent, axis=1)
mean_both_present = synthetic_posterior_samples.apply(get_mean_both_factors_present, axis=1)

f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(mean_both_present - mean_both_absent);

But this doesn't tell us what either factor does separately.

To do that, we need a bit more work:

In [None]:
def compute_delta_factor1(row):
    mus = row["mus"]
    
    # what is the mean when factor1 is present, averaging across factor2?
    mean_factor1_present = (mus[1, 0] + mus[1, 1]) / 2
    
    # same as above, but for factor1 absent
    mean_factor1_absent = np.mean(mus[0, :]) 
    return mean_factor1_present - mean_factor1_absent

In [None]:
delta_factor1_posterior = synthetic_posterior_samples.apply(compute_delta_factor1, axis=1)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(delta_factor1_posterior, label="Posterior", axlabel="Factor 1 Effect");
ax.vlines(factor_effect_sizes[factor1_idx], 0, 4, lw=4, label="True Value");
ax.legend(); ax.set_xlim(-1.5, 1.5);

In [None]:
def compute_delta_factor2(row):
    mus = row["mus"]
    
    # what are the means for the factor2 groups, averaging across factor1?
    factor2_group_means = np.mean(mus[:, 0]), np.mean(mus[:, 1])
    
    return factor2_group_means[1] - factor2_group_means[0]

In [None]:
delta_factor2_posterior = synthetic_posterior_samples.apply(compute_delta_factor2, axis=1)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(delta_factor2_posterior, label="Posterior", axlabel="Factor 2 Effect");
ax.vlines(factor_effect_sizes[factor2_idx], 0, 4, lw=4, label="True Value");
ax.legend(); ax.set_xlim(-1.5, 1.5);

# What do we mean by _non-interacting_ effects?

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(x="factor1", y="measurement", hue="factor2", data=observed_data_df, linewidth=4);

Draw lines between the means: they'll be parallel.
Or, alternatively, the two pairs of "violins",
one pair for `factor1 = 0` and one pair for `factor1 = 1`,
are just shifted relative to one another.

We can check for interactions directly from our posterior values for the means:

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
[make_plot(synthetic_posterior_samples.iloc[ii]["mus"], ax=ax, alpha=0.1)
 for ii in random.sample(range(len(synthetic_posterior_samples)), 200)];
ax.set_xticks([0, 1]); ax.set_xlabel("factor1");
ax.set_ylabel("Group Average")
ax.legend([make_line("C0"), make_line("C1")], ["factor2 = 0", "factor2 = 1"]);

## When two factors _interact_, they are more than the sum of their parts.

Literally:
if there is _no interaction_,
we can guess the effect of both factors together
by estimating the effect of the two factors separately.
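Here is a minimal sketch of that idea on two made-up tables of group means, one built to be purely additive and one with an extra bump when both factors are present. The helper name `interaction_at` is invented for this example. It predicts a cell from its row mean and column mean, then compares the prediction with the cell itself.

```python
import numpy as np

def interaction_at(mus, i, j):
    """Cell mean minus the additive prediction from row and column means."""
    row_mean = mus[i, :].mean()
    col_mean = mus[:, j].mean()
    grand_mean = mus.mean()
    additive_prediction = row_mean + col_mean - grand_mean
    return mus[i, j] - additive_prediction

# hypothetical means: baseline 1.0, factor1 adds 0.5, factor2 subtracts 0.2
additive = np.array([[1.0, 0.8],
                     [1.5, 1.3]])

# same table, except the cell with both factors present gets an extra 0.4
interacting = additive.copy()
interacting[1, 1] += 0.4

print(interaction_at(additive, 1, 1))
print(interaction_at(interacting, 1, 1))
```

For the additive table the difference is zero up to rounding. For the second table it is 0.1 rather than 0.4, because the extra bump also raises the row mean, the column mean, and the grand mean that the prediction is built from.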

In [None]:
def compute_interaction_effect(row):
    mus = row["mus"]
    # compute the "mean of means" with np.mean,
    #  which by default averages _across rows and columns_
    grand_mean = np.mean(mus)
    
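    # each group mean sits half of its factor's effect away from the grand mean,
    #  so the additive prediction for the both-present cell uses half of each delta
    prediction_from_separate = grand_mean + \
        compute_delta_factor1(row) / 2 + \
        compute_delta_factor2(row) / 2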
    
    actually_observed_effect = mus[1, 1]
    
    return actually_observed_effect - prediction_from_separate

In [None]:
interaction_effects = synthetic_posterior_samples.apply(
    compute_interaction_effect, axis=1)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(interaction_effects, label="Posterior", axlabel="Interaction Effect");
ax.vlines(0, 0, 4, lw=4, label="True Value");
ax.legend(); ax.set_xlim(-1.5, 1.5);

In [None]:
(interaction_effects > 0).mean()

Note:
sometimes more than 95% of the posterior is on one side or the other
of `0`, which on its own would suggest a non-zero interaction effect.

But notice how small the values are,
relative to the effects for the main factors
and relative to the variability in the data.

This underscores the importance of thinking about whether an effect in the data is _meaningful_
not just whether it is non-zero.

In [None]:
# what is the posterior chance that the interaction effect is larger than 5% of the variability in the data?
(np.abs(interaction_effects) > (5e-2 * observed_data_df["measurement"].std())).mean()

In [None]:
# what is the posterior chance that the effect of factor1 is larger than 5% of the variability in the data?
(np.abs(delta_factor1_posterior) > (5e-2 * observed_data_df["measurement"].std())).mean()

# But many real-life factors do interact.

## For example: closing each eye while firing a bow and arrow at a target.

Trying to predict what happens when you close _both_ your eyes
by just adding together what happens when you close _either_ eye doesn't work.

Let's quickly connect this back to our mechanistic model:
what are some factors that determine accuracy that we aren't considering?

In [None]:
with pm.Model() as accuracy_model:
    left_eye_closed = pm.Bernoulli("left_eye_closed", p=0.5)
    right_eye_closed = pm.Bernoulli("right_eye_closed", p=0.5)
    
    accuracies = shared_util.to_pymc([[0.8, 0.73],  # notice: a list of lists
                                      [0.73, 0.1]])
    
    target_hit = pm.Bernoulli("target_hit", p=accuracies[left_eye_closed, right_eye_closed])

`shared_util.to_pymc` converts the argument to a type of `theano.Tensor` so that it can be used in a pyMC model.

In [None]:
with accuracy_model:
    accuracy_trace = pm.sample()
    accuracy_df = pm.trace_to_dataframe(accuracy_trace)

In [None]:
accuracy_df.groupby(["left_eye_closed", "right_eye_closed"]).mean()

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.barplot(x="left_eye_closed", y="target_hit", hue="right_eye_closed",
            data=accuracy_df, linewidth=4);

Draw lines between the means:
they are very much _not_ parallel.
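The size of the mismatch can be worked out directly from the hit rates used in the model. This sketch measures each eye's effect relative to the both-eyes-open baseline, which is a simpler contrast than the averaged effects used earlier in the lecture.

```python
# hit rates from the model above: rows are left eye open/closed,
# columns are right eye open/closed
accuracies = [[0.8, 0.73],
              [0.73, 0.1]]

baseline = accuracies[0][0]
left_effect = accuracies[1][0] - baseline    # closing only the left eye
right_effect = accuracies[0][1] - baseline   # closing only the right eye

# what we would expect for both eyes closed if the effects just added up
additive_prediction = baseline + left_effect + right_effect
actual = accuracies[1][1]
shortfall = actual - additive_prediction

print(additive_prediction, actual, shortfall)
```

Adding the two separate effects predicts a hit rate of about 0.66 with both eyes closed, while the model says 0.1: the combination is far worse than the sum of its parts.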

# Let's look for interactions in some real data.

We'll be using some EEG experiment data graciously provided by the [Voytek lab](http://voyteklab.com/about-us/) of UCSD. Participants of varying ages were asked to perform a working memory task with varying levels of difficulty. The raw EEG signal has been summarized into the following two measures:

* [Contralateral Delay Activity](https://www.ncbi.nlm.nih.gov/pubmed/26802451), or CDA, is used to measure the engagement of visual working memory.

* [Frontal Midline Theta](https://www.ncbi.nlm.nih.gov/pubmed/9895201) oscillation amplitude has been correlated with sustained, internally-directed cognitive activity.

The performance of the subjects has also been summarized using the measure
[d'](https://en.wikipedia.org/wiki/Sensitivity_index) (pronounced "d-prime"), also known as the *sensitivity index*. d' is a measure of the subject's performance in a task. It's based on comparing the true positive rate and false positive rate.

In this lecture, we will look at `d`, the subject performance metric.
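For intuition, here is the textbook form of the sensitivity index: convert the true positive (hit) rate and the false positive (false alarm) rate to z-scores and subtract. This is only a sketch; how the `d` column in this dataset was actually computed is not stated here, and the function name `d_prime` and the example rates are made up.

```python
from statistics import NormalDist

def d_prime(hit_rate, false_alarm_rate):
    """Textbook sensitivity index: z(hit rate) - z(false alarm rate)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)

guessing = d_prime(0.5, 0.5)  # hits and false alarms equally likely
good = d_prime(0.9, 0.1)      # many hits, few false alarms

print(guessing, good)
```

Rates of exactly 0 or 1 have no finite z-score, so `inv_cdf` raises an error for them; real analyses typically adjust such rates before computing the index.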

In [None]:
shared_data_path = Path("..") / ".." / "shared" / 'data'

df = pd.read_csv(shared_data_path / 'voytek_working_memory_aging_split.csv', index_col=None)

df.sample(5)

In [None]:
print(df.groupby("group")["age"].describe())

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(df["d"], ax=ax);

# If we split this data up by one variable at a time, we know what to do.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(y="d", x="group", data=df, ax=ax, linewidth=4, width=0.3);

In [None]:
scipy.stats.f_oneway(df["d"][df["group"] == 1],
                     df["d"][df["group"] == 2])

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(y="d", x="difficulty", data=df, ax=ax, linewidth=4, width=0.3);

In [None]:
scipy.stats.f_oneway(df["d"][df["difficulty"] == 1],
                     df["d"][df["difficulty"] == 2],
                     df["d"][df["difficulty"] == 3])

# We've already worked with one-way models in pyMC.

First, let's simplify and format our data.

In [None]:
data = pd.DataFrame()

data["age_group"] = df["group"] - 1  # subtract 1 so that it starts from 0, like Python indexing
data["difficulty"] = df["difficulty"] - 1
data["d"] = df["d"]

## For a one-way model, we define something like a list of parameters, then index into that list.

$$
\mu[i] \sim \text{Normal}(0, 1\mathrm{e}6, \text{shape}=3)\\
\sigma \sim \text{Exponential}(0.1)\\
d \sim \text{Normal}(\mu[i], \sigma)
$$

In [None]:
difficulty_indexer = data["difficulty"]

In [None]:
with pm.Model() as eeg_difficulty_model:
    means = pm.Normal("mus", mu=0, sd=1e6, shape=3)
    sd = pm.Exponential("sigma", lam=0.1)
    
    observations = pm.Normal("d", mu=means[difficulty_indexer], sd=sd, observed=data["d"])

In class, we won't sample from and work with these models,
since we've already seen them,
but feel free to add cells and look at the posteriors in your copy of the slides.

### For a different one-way model, we define a different indexer and change the shapes.

In [None]:
age_indexer = data["age_group"]

$$
\mu[j] \sim \text{Normal}(0, 1\mathrm{e}6, \text{shape}=2)\\
\sigma \sim \text{Exponential}(0.1)\\
d \sim \text{Normal}(\mu[j], \sigma)
$$

In [None]:
with pm.Model() as eeg_age_model:
    means = pm.Normal("mus", mu=0, sd=1e6, shape=2)
    sd = pm.Exponential("sigma", lam=0.1)
    
    observations = pm.Normal("d", mu=means[age_indexer], sd=sd, observed=data["d"])

# But if we group the data by two categories at once, it's unclear what to do.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.violinplot(y="d", x="difficulty", hue="group", data=df, ax=ax, linewidth=4);

`scipy` does not provide functions for performing statistical tests
about multiple factors at once:
hence the `one` in `f_oneway`.

Next time,
we'll see how this is done using a different Python library,
`statsmodels`.

The fact that the analytical statistical testing approach requires a new library
is another sign of its inflexibility.

We've already been working around this problem.
The `attention` dataset also has multiple possible grouping factors:
the `attention` column and the `solutions` column.
Previously, we either ignored one column
or looked at only rows where one of the two was fixed.

But we'll get a more complete understanding of our data
if we include all of the factors we measure.

# In pyMC, multi-way models are only a small adjustment: we define something like a list-of-lists for the parameters.

That is, we have one parameter for each combination of factor levels.

$$
d \sim \text{Normal}(\mu[i, j], \sigma)
$$

In [None]:
with pm.Model() as eeg_combined_model:
    means = pm.Normal("mus", mu=0, sd=1e6, shape=(3, 2))
    sigma = pm.Exponential("sigma", lam=0.1)
    
    observations = pm.Normal("d", mu=means[difficulty_indexer, age_indexer], sd=sigma, observed=data["d"])

In [None]:
with eeg_combined_model:
    eeg_combined_trace = pm.sample(draws=1000)
    eeg_combined_df = shared_util.samples_to_dataframe(eeg_combined_trace)

In [None]:
print(eeg_combined_df.head())

In [None]:
eeg_combined_df["mus"].iloc[0]

Each sample contains a mean for each combination of age group (column)
and task difficulty (row).

## As always, the first move is to visualize our posterior,

ideally in a manner similar to how we visualized our data.

In [None]:
f, axs = plt.subplots(figsize=(12, 12), nrows=2, sharex=True, sharey=True); ax=axs[0]
[make_plot(eeg_combined_df.iloc[ii]["mus"], ax=ax, alpha=0.1)
 for ii in random.sample(range(len(eeg_combined_df)), 200)];
ax.set_xticks([0, 1, 2]); ax.set_xlabel("Difficulty Index");
ax.set_ylabel("Group Average d")
ax.legend([make_line("C0"), make_line("C1")], ["Young", "Old"]);

sns.violinplot(x="difficulty", y="d", hue="age_group",
               data=data, ax=axs[1]); axs[1].get_legend().remove();

Notice how the slope of the line from difficulty 0 to difficulty 2 looks slightly steeper
for the yellow lines (the old age group)
than for the blue lines (the young age group)?

That suggests there is an interaction:
one way to phrase it is that the harder tasks are even harder for the older age group
than for the younger age group.

One thing that makes multi-way models harder is that the claims we are interested in
are not directly present in the group means.

That is, to get at the things we find interesting,
we typically need to `apply` some Python functions to the entries.
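Most of those functions boil down to averaging the array of group means along one direction or the other. The sketch below uses a made-up 3-by-2 array in the same layout as a posterior sample of `mus` (difficulty in rows, age group in columns) to show what each choice of `axis` returns.

```python
import numpy as np

# hypothetical sample: 3 difficulty levels (rows) x 2 age groups (columns)
mus = np.array([[3.0, 2.6],
                [2.4, 1.8],
                [1.8, 1.0]])

age_means = mus.mean(axis=0)         # collapse rows: one mean per age group
difficulty_means = mus.mean(axis=1)  # collapse columns: one mean per difficulty
grand_mean = mus.mean()              # collapse everything: a single number

print(age_means, difficulty_means, grand_mean)
```

The `axis` argument names the direction that disappears: `axis=0` averages away the rows, leaving one value per column.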

## We then estimate the effects of factors from the entries of the `mus` array on each sample.

In [None]:
def compute_grand_mean(mus):
    # compute the "mean of means" with np.mean,
    #  which by default averages _across rows and columns_
    return np.mean(mus)

In [None]:
grand_means = eeg_combined_df["mus"].apply(compute_grand_mean)

In [None]:
sns.distplot(grand_means);  # for d, a value of 0 means chance performance

In [None]:
def compute_age_means(mus):
    # use np.mean function, but only average across _rows_
    age_means = np.mean(mus, axis=0)
    return age_means

In [None]:
age_group_means = eeg_combined_df["mus"].apply(compute_age_means)

age_group_means.iloc[0]

We calculate a group mean for each age group on each sample.

Again, to calculate the "effect" of the age variable,
aka the average difference of the two group levels,
we need to subtract one from the other.

In [None]:
def age_factor_effect(age_group_means):
    return age_group_means[1] - age_group_means[0]

delta_ages = age_group_means.apply(age_factor_effect)

f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(delta_ages, axlabel="Old vs Young Factor Effect");
ax.set_xlim([-1.7, 1.7]);

In [None]:
def compute_difficulty_group_means(mus):
    return np.mean(mus, axis=1)

In [None]:
difficulty_group_means = eeg_combined_df["mus"].apply(compute_difficulty_group_means)

difficulty_group_means.iloc[0]

We calculate a group mean for each difficulty group on each sample.

Because there are more than two difficulty groups,
if we want to think about an "effect" we need to specify a difference between two groups.

For example, the highest difficulty and the lowest:

In [None]:
hard_vs_easy = difficulty_group_means.apply(
    lambda difficulty_group_means: difficulty_group_means[2] - difficulty_group_means[0])

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(hard_vs_easy, axlabel="Hard vs Easy Factor Effect");
ax.set_xlim([-1.7, 1.7]);
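Hard versus easy is one of three possible pairwise differences between the difficulty levels. As a sketch, with made-up group means standing in for one posterior sample, all of them can be listed with `itertools.combinations`:

```python
from itertools import combinations

# hypothetical difficulty-group means for one posterior sample
difficulty_means = [2.8, 2.1, 1.4]

# every pair of levels, with the difference "higher index minus lower index"
contrasts = {
    (low, high): difficulty_means[high] - difficulty_means[low]
    for low, high in combinations(range(len(difficulty_means)), 2)
}

for pair, difference in contrasts.items():
    print(pair, difference)
```

Which of these contrasts is worth reporting depends on the question being asked; the code only enumerates them.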

In [None]:
def compute_interaction_effect(mus, diff_index, age_index):
    prediction_from_separate = compute_difficulty_group_means(mus)[diff_index] \
        + compute_age_means(mus)[age_index]\
        - compute_grand_mean(mus)
    
    actually_observed_effect = mus[diff_index, age_index]
    
    return actually_observed_effect - prediction_from_separate
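
Written as a formula, the function above computes the following. The bar-and-dot notation, where a dot marks an index that has been averaged over, is introduced only for this note and is not used elsewhere in these slides.

$$
\text{interaction}[i, j] = \mu[i, j] - \left(\bar{\mu}[i, \cdot] + \bar{\mu}[\cdot, j] - \bar{\mu}[\cdot, \cdot]\right)
$$

If the two factors do not interact, this difference is zero for every combination of $i$ and $j$.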

In [None]:
interaction_effects_old_hard = eeg_combined_df["mus"].apply(
    compute_interaction_effect, diff_index=2, age_index=1)

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.distplot(interaction_effects_old_hard,
             axlabel="Interaction Effect Between\nOld Age Group and Hard Task Difficulty");
ax.set_xlim([-1.7, 1.7]);

In [None]:
(interaction_effects_old_hard > 0).mean()