<img src="../../shared/img/slides_banner.svg" width=2560></img>

# Categorical Effects 01 - Bayesian Inference for Categorical Effects Models

In [None]:
%matplotlib inline

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

import daft
from IPython.display import HTML, Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
import seaborn as sns
import scipy.stats

import utils.daft
import utils.plot
import utils.util

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

In [None]:
normal = lambda x, mu, sd: scipy.stats.norm(loc=mu, scale=sd).pdf(x)
exponential = lambda x, mu, lam: scipy.stats.expon(loc=mu, scale=1/lam).pdf(x)
cauchy = lambda x, loc, scale: scipy.stats.cauchy(loc=loc, scale=scale).pdf(x)
uniform = lambda x, lower, width: scipy.stats.uniform(loc=lower, scale=width).pdf(x)
f_dist = lambda x, loc, scale: scipy.stats.f(dfn=10, dfd=2, loc=loc, scale=scale).pdf(x)
binom = lambda xs, n, p: scipy.stats.binom(n, p).pmf(xs)

In the last two weeks, we've primarily focused on differences in means between two groups.

This is the only thing a $t$-test can do (and then only if the samples are large or the population distribution of each group is normal).

The Bayesian approach could do more (differences in $\sigma$, compute other inferences, etc.)
but our focus was on differences in means.

## This week, we generalize to the case where there are more than two groups.

We'll do this by first "turning the problem on its head"
and thinking about how to _generate_ data from multiple groups,
using mixture distributions,
rather than how to _analyze_ data from multiple groups,
then finding posteriors of those generative models given data.

# A mixture distribution is a weighted combination of distributions.

A mixture, just like baking mix:

- 2 cups sugar
- 1 3/4 cups all-purpose flour
- 3/4 cup unsweetened cocoa powder
- 1 1/2 teaspoons baking powder
- 1 1/2 teaspoons baking soda
- 1 teaspoon salt

From [FiveHeartHome.com](https://www.fivehearthome.com/homemade-chocolate-cake-mix-recipe/).

A cake mix recipe has _ingredients_ that we combine in different _amounts_ to get a _mix_.

Here's a "distributional recipe":

- 5% `Normal(-6, 1)`
- 10% `Exponential(lam=1) + 2`
- 50% `Normal(0, 1)`
- 20% `Uniform(-3, 0)`
- 15% `F(10, 2) + 5`

The "ingredients" are known as mixture _components_.
We bake with distributions instead of flour and sugar.

The "amounts" are known as the mixture _weights_:
we measure in fractions instead of in cups and teaspoons.

This "weighted combination" is the _mixture distribution_.

And here's what it looks like:

In [None]:
weights = [0.05, 0.1, 0.5, 0.2, 0.15]  # same proportions as the recipe: 5%, 10%, 50%, 20%, 15%
params = [(-6, 1), (2, 1), (0, 1), (-3, 3), (5, 1)]
funks = [normal, exponential, normal, uniform, f_dist]  # defined above

utils.plot.plot_mixture_marginal(weights, params, funks)

If a variable is sampled according to this mixture distribution,
the plot above shows what the histogram of samples would converge to.

We can compare that to all of the "ingredients":

In [None]:
utils.plot.comparative_mixture_plot(weights, params, funks);

In the top plot, we see the mixture distribution again.

In the bottom plot, we see each of the "ingredients" in the mixture.

If we think of the top plot as the _marginal_ distribution of the variable we observe,
the bottom plot is the collection of _conditional distributions_:
the distribution of the variable we observe _given_ which mixture component it came from.
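
In symbols, the relationship between the two plots can be written as a weighted sum. The notation here is introduced only for illustration: $g$ is the component label, $K$ is the number of components, and $w_k$ is the weight of component $k$.

$$
p(x) = \sum_{k=1}^{K} w_k \, p(x \vert g = k), \qquad \sum_{k=1}^{K} w_k = 1
$$

The left-hand side is the marginal (top plot); each $p(x \vert g = k)$ is one of the conditionals (bottom plot).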

## We can define a mixture distribution in pyMC by indexing.

In [None]:
with pm.Model() as mixture_model:
    # First, define each component of the mixture
    rv1 = pm.Normal("norm1",        # FIRST COMPONENT: Normal
                    mu=-6, sd=1)
    rv2 = pm.Deterministic("expo",  # SECOND COMPONENT: Exponential
                           pm.Exponential("_expo", lam=1) + 2)
    rv3 = pm.Normal("norm2",        # THIRD COMPONENT: Normal
                    mu=0, sd=1)
    rv4 = pm.Uniform("unif",        # FOURTH COMPONENT: Uniform
                     lower=-3, upper=0)
    rv5 = pm.Deterministic(
        "F",                        # FIFTH COMPONENT: F(10, 2), shifted by 5
        # an F(d1, d2) variable is (ChiSquared(d1) / d1) / (ChiSquared(d2) / d2)
        ((pm.ChiSquared("_u1", 10) / 10) / (pm.ChiSquared("_u2", 2) / 2)) + 5)
    
    # then, put all components in a "list"
    list_of_components = shared_util.to_pymc([rv1, rv2, rv3, rv4, rv5])
    
    # then, determine which component is "active"
    #  on this observation by chance, with p given by weights
    component = pm.Categorical("component", p=weights)
    # and select it from that list
    observed_value = pm.Deterministic("observed_value",
                             list_of_components[component])

This is just like what we've been doing for models with two groups in them,
but now there can be _any number of groups_:
just define a variable for each group and add it to the "list".
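
The same "draw every component, then select one by indexing" idea can be sketched in plain NumPy, without pyMC. This is only an illustration under assumed values: the two-component mixture and the names `mix_weights`, `mix_mus`, and `mix_sds` are made up for the example and are not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 100_000

# Illustrative two-component Normal mixture (assumed values).
mix_weights = np.array([0.3, 0.7])
mix_mus = np.array([-2.0, 3.0])
mix_sds = np.array([1.0, 0.5])

# First, draw a value from every component for every observation ...
all_components = rng.normal(loc=mix_mus, scale=mix_sds, size=(n_draws, 2))
# ... then decide by chance which component is "active" on each observation ...
component = rng.choice(2, size=n_draws, p=mix_weights)
# ... and select it by indexing, in the spirit of `list_of_components[component]`.
observed_value = all_components[np.arange(n_draws), component]

# The mean of a mixture is the weighted average of the component means.
mixture_mean = np.sum(mix_weights * mix_mus)
print(observed_value.mean(), mixture_mean)
```

The sample mean of `observed_value` should land close to `mixture_mean`, which is one quick sanity check that the indexing picks components in the right proportions.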

#### Programming Aside

Feel free to skip this, especially on a first read,
because it covers the internal details of pyMC.

Internally to pyMC, `list_of_components` is something called a `Tensor`.
It's very similar to an `array` in numpy.
`shared_util.to_pymc` converts normal Python objects,
like lists, into these special `Tensor` objects.

The biggest difference is that the `Tensor` is like a placeholder:
as you work with your pyMC model, e.g. by sampling or by doing MAP estimation,
the values inside it will change.

If you want to see what the `Tensor` looks like,
for a given setting of all the variables,
you need to define all of the input values to that `Tensor`
and then call a method called `eval`,
as in `eval`uate.

In this example, the `Tensor` contains our five group random variables,
and so if we want to know its contents,
we need to define a value for each.

This is done by providing `eval` a dictionary
where the keys are the `RandomVariable`s
and the values are the concrete values we want the variables to take on.

The cells below show an example of this process.

They also demonstrate what
`list_of_components` and `list_of_components[component]`
might look like for one sample.

In [None]:
rvs_and_values = {rv1: 1., rv2: 2., rv3: 3., rv4: 4., rv5: 10.}

list_of_components.eval(inputs_to_values=rvs_and_values)

In [None]:
list_of_components[4].eval(inputs_to_values=rvs_and_values)

## With a pyMC model, we can sample from a mixture distribution.

In [None]:
with mixture_model:
    samples = pm.sample(draws=1250, chains=4, target_accept=0.95)

In [None]:
f, axs = plt.subplots(figsize=(12, 12), nrows=2, sharex=True)
for var in ["norm1", "expo", "norm2", "unif", "F"]:
    sns.distplot(samples[var], ax=axs[1]);
sns.distplot(samples["observed_value"], kde=False, bins=100, color="k", norm_hist=True, ax=axs[0])
utils.plot.plot_mixture_marginal(weights, params, funks, ax=axs[0])
axs[1].set_xlim(-10, 10);
axs[0].set_title("Mixture Distribution, pyMC Estimate and Actual")
axs[1].set_title("Mixture Components, pyMC Estimates");

## As graphs, all mixture models look the same

In [None]:
utils.daft.make_mixture_plate_graph()

$$
g \sim \text{Categorical}(p=\text{weights})\\
x \sim \text{ComponentDistributions}_g(\lambda[g], \beta)
$$

This is the graph for a mixture distribution over the variable $x$.

Here $g$ is a variable that identifies which group, or component,
a value $x$ comes from.
Each group has $N$ observations in it.

$\text{ComponentDistributions}_i$ is the
distribution of component number $i$.
There are $K$ components, where `K = len(weights)`.
Each is the "population distribution" of a group.

These distributions typically differ in some of their parameters,
$\lambda$, and might share others, $\beta$.

To clarify,
here's what that model would look like for three total observations from two groups
if we were to "unravel" the plates.

In [None]:
utils.daft.make_mixture_plateless_graph()

The plate notation is much cleaner!

## The most important special case: mixtures within the same family

But we typically focus on the case where each component comes from the same family of distributions:

- 5% `Normal(-6, 1)`
- 10% `Normal(2, 1)`
- 50% `Normal(0, 1)`
- 20% `Normal(-3, 2)`
- 15% `Normal(5, 1)`

This is a _mixture-of-Gaussians_ or _mixture-of-Normals_ model.

### Given a definition, we can plot the distribution,

In [None]:
utils.plot.comparative_mixture_plot(weights, params, normal);

### draw a graph and specify its pieces,

In [None]:
utils.daft.make_mixture_plate_graph()

$$
g \sim \text{Categorical}(p=[0.05, 0.10, 0.50, 0.20, 0.15])\\
x \sim \text{Normal}(\mu=\lambda[g], \sigma=\beta)
$$

### and write a pyMC model.

In [None]:
with pm.Model() as normal_mixture_model:
    # First, define each component of the mixture
    rv1 = pm.Normal("norm1",
                    mu=-6, sd=1)
    rv2 = pm.Normal("norm2",
                    mu=2, sd=1)
    rv3 = pm.Normal("norm3",
                    mu=0, sd=1)
    rv4 = pm.Normal("norm4",
                    mu=-3, sd=2)
    rv5 = pm.Normal("norm5",
                    mu=5, sd=1)
    
    # then, put all components in a "list"
    list_of_components = shared_util.to_pymc([rv1, rv2, rv3, rv4, rv5])
    
    # then, determine which component is "active"
    #  on this observation by chance, with p given by weights
    component = pm.Categorical("component", p=weights)
    # and select it from that list
    observed_value = pm.Deterministic("observed_value",
                             list_of_components[component])

In [None]:
with normal_mixture_model:
    normal_samples = pm.sample(draws=1250, chains=4, target_accept=0.95)

In [None]:
f, axs = plt.subplots(figsize=(12, 12), nrows=2, sharex=True)
for var in ["norm1", "norm2", "norm3", "norm4", "norm5"]:
    sns.distplot(normal_samples[var], ax=axs[1]);
sns.distplot(normal_samples["observed_value"], kde=False, bins=100, color="k", norm_hist=True, ax=axs[0])
utils.plot.plot_mixture_marginal(weights, params, normal, ax=axs[0])
axs[1].set_xlim(-10, 10);
axs[0].set_title("Mixture Distribution, pyMC Estimate and Actual")
axs[1].set_title("Mixture Components, pyMC Estimates");

## We can do the same with any family of distributions!

In [None]:
utils.plot.comparative_mixture_plot(weights, params, uniform);

Note: a histogram is a "mixture-of-uniforms" of sorts!

In [None]:
utils.plot.comparative_mixture_plot(weights, params, exponential);

In [None]:
utils.plot.comparative_mixture_plot(weights, params, f_dist);

## Mixture models are _very general_, and you've already seen them!

Whenever the likelihood component of the model has a parameter that varies
discretely based on the value of a grouping variable,
we can think of the observed variable as having a _mixture distribution_
where each component of the mixture is a different group.

We can basically _always do this_:
notice the wide variety of shapes we achieved above.

The `kde` plotted by seaborn as part of `distplot`
is a mixture-of-Gaussians.

A histogram can be thought of as a mixture-of-Uniforms.
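
To make the `kde` claim concrete, here is a minimal sketch of a kernel density estimate built by hand as an equal-weight mixture of Normals, one component per data point. The data values and the `bandwidth` are made up for illustration, and seaborn picks its own bandwidth, so this is not a reproduction of what `distplot` draws.

```python
import numpy as np

def normal_pdf(x, mu, sd):
    """Density of a Normal(mu, sd) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

data = np.array([1.0, 2.5, 2.7, 4.0])  # made-up observations
bandwidth = 0.5                        # assumed smoothing width

xs = np.linspace(-5, 10, 3001)
# One Normal component centered on each data point; taking the mean
# gives every component the same weight, 1 / len(data).
kde = np.mean([normal_pdf(xs, mu=point, sd=bandwidth) for point in data], axis=0)

# A valid density integrates to 1; check with a simple Riemann sum.
area = np.sum(kde) * (xs[1] - xs[0])
print(area)
```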

Mixture distributions are most useful when
the component labels are easy to measure or identify
and the pieces are relatively simple.

## Concrete Example: Dog Weight Modeling with Mixture-of-Gaussians

Let's say that dogs come in two breeds: larger ones, called hound dogs,
and smaller ones, or lap dogs.

In [None]:
Image(url="http://cdn.akc.org/content/hero/508386698_bigsmall.jpg")

Pictured: elements from the tails of the distribution.
From the [American Kennel Club website](https://www.akc.org/expert-advice/lifestyle/why-small-dogs-behave-differently-than-large-dogs/).

We choose to model the dependence of weight on breed with a `Normal` distribution, aka a Gaussian.

That is, if we know the breed, then the likelihood of the weight is `Normal`.

In [None]:
labels = ["Hound", "Lap"]
weights = [0.5, 0.5]  # assume there are an equal number of hounds and lap dogs
params = [(20, 3), (10, 2)]  # hounds are heavier and lap dogs are lighter

In [None]:
utils.plot.comparative_mixture_plot(weights, params, normal, xs=np.linspace(5, 30), labels=labels)
ax = plt.gca(); ax.legend()
ax.set_xlabel("Weight of Dog");

In [None]:
with pm.Model() as normal_dog_weight_model:
    # First, define each component of the mixture
    rvs = [pm.Normal("hound", mu=20, sd=3), pm.Normal("lap", mu=10, sd=2)]
    
    # then, put all components in a "list"
    list_of_components = shared_util.to_pymc(rvs)
    
    # then, determine which component is "active"
    #  on this observation by chance, with p given by weights
    component = pm.Categorical("component", p=weights)

    # and select it from that list
    observed_value = pm.Deterministic("observed_value",
                             list_of_components[component])

Note the similarity of this model to the models for differences in means
that we've been working through.

The big difference is that instead of having the measurable values be observed
and the parameters unknown,
here, because we're working theoretically instead of with data,
we know the parameters and the data is unknown.

Otherwise, the models are structurally the same!

## Concrete Example: Piazza Post Modeling with Mixture-of-Poissons

Imagine, for the sake of argument,
that there exists a class that uses the
[Piazza](https://piazza.com) web service to host a forum for discussion about course material.

Imagine further that this class has assignments due two days a week,
say Monday and Friday.

The teacher of the class might observe that the number of posts on a given day
seems to vary according to whether an assignment is due.

They might choose to model it with a _mixture-of-Poissons_:
one `Poisson` with a larger mean for days when assignments are due,
and one `Poisson` with a smaller mean for days when assignments are not due.

In [None]:
labels = ["Assignment Due", "No Assignment Due"]
weights = [2 / 7, 5 / 7]  # 2 days a week assignments are due, 5 days they are not
params = [(8,), (1,)]  # when an assignment is due, average number of posts is higher
poiss = lambda xs, mu: scipy.stats.poisson(mu).pmf(xs)

The distribution for such a model would look like:

In [None]:
utils.plot.comparative_mixture_plot(weights, params, poiss, xs=np.arange(0, 15), labels=labels)
ax = plt.gca(); ax.legend()
ax.set_xlabel("Number of Piazza Posts");
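
The mixture-of-Poissons probabilities can also be computed directly, as a weighted sum of two Poisson pmfs. The sketch below uses only the standard library; the helper names `poisson_pmf` and `mixture_pmf` and the truncation at 60 posts are choices made for this example.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Probability of exactly k events under a Poisson with mean mu."""
    return mu ** k * exp(-mu) / factorial(k)

mix_weights = [2 / 7, 5 / 7]  # assignment due, no assignment due
mix_means = [8, 1]            # average number of posts in each case

def mixture_pmf(k):
    """Weighted sum of the two component pmfs."""
    return sum(w * poisson_pmf(k, mu) for w, mu in zip(mix_weights, mix_means))

ks = range(0, 60)  # truncated where the remaining tail is negligible
total = sum(mixture_pmf(k) for k in ks)
mean_posts = sum(k * mixture_pmf(k) for k in ks)
print(total, mean_posts)
```

The mean of the mixture is the weighted average of the component means, here $(2/7) \cdot 8 + (5/7) \cdot 1 = 3$ posts per day, even though neither kind of day has a mean of 3.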

## Concrete Example: Effort Modeling with Mixture-of-Binomials

It is not necessary for the group label to be observable,
though it does make things easier.

If not, then the group label becomes an unobserved variable as well:
a _hidden mixture model_ or _latent mixture model_,
with the graph below:

In [None]:
utils.daft.make_mixture_plate_graph(group_observed=False)

For example, we might suspect that there are two kinds of participants in our study:
ones who are engaged in the task and will put out a large amount of effort
and have a high success rate
and ones who are not engaged in the task and will only put out a small amount of effort
and have a lower success rate.

Building off the model from a few weeks back,
we might measure the number of successes on multiple trials of a task,
modeled as a binomial,
and hypothesize that our participants have differing values of the parameters $N$ and $p$.

In [None]:
labels = ["Low Effort", "High Effort"]
weights = [0.5, 0.5]  # assume there are an equal number of the two kinds of participants
params = [(5, 0.1), (20, 0.6)]  # low effort, low success; high effort, high success

In [None]:
utils.plot.comparative_mixture_plot(weights, params, binom, xs=np.arange(0, 21), labels=labels)  # include 20, the largest possible count
ax = plt.gca(); ax.legend();
ax.set_xlabel("Number of Successes");

We don't know whether an individual was a "low effort" or a "high effort" participant directly
-- there is no objective way to measure that, unlike dog breed or day of the week.

It is something we might be able to infer, by building a model and computing a posterior.
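
As a rough sketch of what that inference looks like in the simplest case, suppose the weights and the binomial parameters were known exactly (an assumption made only for this example). Then Bayes' rule gives the probability of each hidden label after seeing a participant's number of successes. The helper names `binom_pmf` and `posterior_over_groups` are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k successes in n trials with success rate p."""
    if k > n:
        return 0.0
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

prior = {"Low Effort": 0.5, "High Effort": 0.5}          # mixture weights
group_params = {"Low Effort": (5, 0.1), "High Effort": (20, 0.6)}

def posterior_over_groups(k):
    """Bayes' rule: prior times likelihood, then normalize."""
    unnormalized = {g: prior[g] * binom_pmf(k, *group_params[g]) for g in prior}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}

for k in [0, 2, 5, 12]:
    print(k, posterior_over_groups(k))
```

With these parameters, any count above 5 can only come from the "High Effort" component, because the "Low Effort" component has only 5 trials. In a real analysis the parameters would be unknown too, and would get priors of their own.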

## Concrete Example: Attention Experiment with a Mixture-of-Gaussians

Let's return to an old favorite: the `attention` dataset.

In [None]:
atten_df = sns.load_dataset("attention", index_col=0, data_home=Path("..") / ".." / "shared" / "data")
print(atten_df.head())

In this data, subjects completed a task with varying numbers of `solutions` and had their performance `score`d.

We've already used this data previously,
but always by only looking at a pair of groups at a time.
We're now ready to look at the `solutions` variable across all three values at once.

When we performed a $t$-test on this dataset, we were implicitly making a mixture-of-Gaussians model
whose null hypothesis was that both of the mixture components had the same parameters.

When we did Bayesian inference, we allowed the components to have different, unknown parameters.

Now, let's see what the model of the data looks like if
we use the means and standard deviations observed for each group as our parameters
(rather than treating them as unknown, random variables).

In [None]:
weights = [1 / 3, 1 / 3, 1 / 3]

means = atten_df.groupby("solutions")["score"].mean()
sds = atten_df.groupby("solutions")["score"].std()
params = list(zip(means, sds))

In [None]:
labels = [1, 2, 3]
f, axs = utils.plot.comparative_mixture_plot(weights, params, normal, xs=np.linspace(0, 10), labels=labels)
axs[1].legend(); axs[1].set_xlabel("Score");

Because we have data in this case,
we can compare what we observed to what the model predicts.

In [None]:
labels = [1, 2, 3]
f, axs = utils.plot.comparative_mixture_plot(weights, params, normal, xs=np.linspace(0, 10), labels=labels)
sns.distplot(atten_df["score"], ax=axs[0], kde=False, norm_hist=True, label="Observed", axlabel=False);
axs[0].legend();
# overlay the observed scores for each group on the bottom (per-component) axis
for ii in labels:
    sns.distplot(atten_df["score"][atten_df["solutions"] == ii], color="C" + str(ii - 1),
                 kde=False, norm_hist=True, label="Observed " + str(ii), ax=axs[1])
axs[1].legend(); axs[1].set_xlabel("Score");

This suggests that our mixture model is "not that bad"
-- the components seem to line up with the observations from each group
and the overall, marginal distribution looks about right.

But what if, rather than applying a single, fixed mixture model to the data,
we want to determine which mixture models (which settings of the parameters in a mixture model)
are likely or plausible given the data?

# Bayesian inference with a mixture distribution

When we perform Bayesian inference on data from a mixture distribution,
instead of taking the parameters to be known,
we make the parameters random variables
(including, now, priors!).

In [None]:
utils.daft.make_mixture_plate_graph(params_known=False, value_observed=True)

We call a model that has a mixture distribution a _mixture model_.
Though it's not always explicitly stated,
most models with discrete components are mixture models!

We then give the values of the observed variable to the model
and compute the posterior over the parameter variables.

In [None]:
atten_model = pm.Model()
N_groups = len(atten_df["solutions"].unique())

atten_grand_mean = atten_df["score"].mean()
pooled_sd = atten_df.groupby("solutions")["score"].std().mean()

with atten_model:
    mus = pm.Normal("mus", mu=atten_grand_mean, sd=10, shape=N_groups)
    sigma = pm.Exponential("sigma", lam=1 / pooled_sd)
    
    score = pm.Normal("score", mu=mus[atten_df["solutions"] - 1], sd=sigma,
                          observed=atten_df["score"])

Notice how close this is to the models for differences in means!
The only real change is that now we have three (or in general, $K$) groups,
instead of two.

Plus, we have a new way of thinking about it: as a mixture!

The cell below contains an equivalent model to the above,
but written more in the style of the mixture distributions in this lecture
and less in the style of the models from previous assignments.

You should compare the two blocks of code to each other
and recognize their equivalence,
then compare both to the mixture distributions above.

If you want to play with the model or check that it runs,
change the cell below from a Markdown cell to a Code cell using the Cell Menu
and comment out the first and last lines (<code>\`\`\`python</code> and <code>\`\`\`</code>).
This might break some of the downstream analysis code,
so make sure to redefine the `atten_model` using the first cell
before continuing further.

```python
atten_model = pm.Model()
N_groups = len(atten_df["solutions"].unique())

atten_grand_mean = atten_df["score"].mean()
pooled_sd = atten_df.groupby("solutions")["score"].std().mean()

with atten_model:
    # First, define each component of the mixture
    mu_0 = pm.Normal("mu_0", mu=atten_grand_mean, sd=10)
    mu_1 = pm.Normal("mu_1", mu=atten_grand_mean, sd=10)
    mu_2 = pm.Normal("mu_2", mu=atten_grand_mean, sd=10)
    
    sigma = pm.Exponential("sigma", lam=1 / pooled_sd)
    
    # then, put all components in a "list"
    mus = shared_util.to_pymc([mu_0, mu_1, mu_2])
    
    # then, determine which component is "active"
    # and select it from that list
    
    mus_for_each_observation = mus[atten_df["solutions"] - 1]
    
    score = pm.Normal("score", mu=mus_for_each_observation, sd=sigma,
                      observed=atten_df["score"])
```
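
The key step shared by both versions of the model is the indexing expression `mus[atten_df["solutions"] - 1]`, which looks up the right group mean for every observation. Here is that step in isolation, as a NumPy sketch with made-up labels and made-up values for the means:

```python
import numpy as np

# Hypothetical group labels, coded 1..3 like the `solutions` column.
solutions = np.array([1, 1, 2, 3, 3, 2])
# Made-up values for the three group means.
mus = np.array([-0.5, 0.1, 0.4])

# Subtracting 1 turns the 1-based labels into 0-based positions,
# so each observation picks up the mean of its own group.
mus_for_each_observation = mus[solutions - 1]
print(mus_for_each_observation)
```

The result has one entry per observation, not one per group, which is what the likelihood needs for its `mu` argument.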

In [None]:
with atten_model:
    atten_trace = pm.sample(draws=1000)

`plot_posterior` is our first choice for taking a look at the posterior approximated by `pm.sample`.

In [None]:
pm.plot_posterior(atten_trace,
                  figsize=(12, 12));

It's somewhat hard to directly compare these values to one another.

Let's start by using a new feature of `plot_posterior`:
the `ref_val` keyword argument lets us plot a vertical line
as a `ref`erence `val`ue to compare each posterior to.

In this case,
the value we want to compare to is the _overall mean_,
or _grand mean_,
which is the mean of our observed variable, the `score`,
without taking into account which group an observation came from.

In [None]:
atten_grand_mean = atten_df["score"].mean()

In [None]:
pm.plot_posterior(atten_trace, varnames=["mus"],
                  figsize=(12, 12), ref_val=atten_grand_mean);

With this reference value in place,
it's easier to see the information in our posterior:
there's weak evidence that groups 0 and 2 each have a mean different from the grand mean,
but fairly strong evidence that those two means differ from each other.

Rather than compare to the grand mean,
it's convenient to subtract the grand mean out
before we specify our model,
so that our parameters are in terms of
_differences of the means from the overall mean_,
rather than in their original units.

In [None]:
centered_atten_df = atten_df.copy()
centered_atten_df["score"] = centered_atten_df["score"] - atten_grand_mean

centered_atten_model = pm.Model()
N_groups = len(atten_df["solutions"].unique())

with centered_atten_model:
    mus = pm.Normal("mus", mu=0, sd=10, shape=N_groups)
    sigma = pm.Exponential("sigma", lam= 1 / pooled_sd)
    
    score = pm.Normal("score",
                      mu=mus[centered_atten_df["solutions"] - 1], sd=sigma,
                      observed=centered_atten_df["score"])

In [None]:
with centered_atten_model:
    centered_atten_trace = pm.sample(draws=1000)

In [None]:
pm.plot_posterior(centered_atten_trace, figsize=(12, 12),
                  ref_val=[0, 0, 0, pooled_sd]);

In this posterior, I've included `sigma` and used the `pooled_sd` as the reference value.

## Now, we make judgements based on posterior samples, as before.

In [None]:
samples = shared_util.samples_to_dataframe(centered_atten_trace)

In [None]:
sns.distplot(samples["mus"].apply(lambda mus: int(mus[2] > 0)),
             kde=False, norm_hist=True, bins=[0, 1, 2],
             axlabel="mu_2 larger than grand mean");

In [None]:
sns.distplot(samples["mus"].apply(lambda mus: int(mus[0] > 0)),
             kde=False, norm_hist=True, bins=[0, 1, 2],
             axlabel="mu_0 larger than grand mean");

In [None]:
sns.distplot(samples["mus"].apply(lambda mus: int(mus[2] > mus[0])),
             kde=False, norm_hist=True, bins=[0, 1, 2],
             axlabel="mu_2 larger than mu_0");

But using binary thresholds like this is somewhat silly:
sometimes, the difference between the means can be extremely small,
so small that it would be practically meaningless,
and yet the claim we are checking would still technically be true.

In [None]:
utils.util.get_smallest_difference(samples["mus"], 0, 2)

This function finds the smallest difference between the means of groups 2 and 0
that still counted as a difference in the comparison above.
I typically find a smallest value around `0.01`.
This is quite small,
given that the differences in the group means we observed were around 1.

And furthermore, these comparisons are still only pairwise:
we're comparing $\mu_2$ to $\mu_0$, or either mean to the grand mean.

Better:
if we have some notion of a "big enough difference to matter",
then we can check whether enough of our posterior
samples have a difference that big in _any_ of the means.

In [None]:
def max_absolute_difference(mus):
    return np.max(np.abs(mus - np.mean(mus)))

In [None]:
sns.distplot(samples["mus"].apply(max_absolute_difference), axlabel="Biggest Absolute Difference");

That is, we check whether the value of this absolute difference is at least `0.1`, or some other threshold.
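
A minimal sketch of that check, using made-up draws in place of real posterior samples: `fake_mus` is generated with NumPy purely for illustration and is not output from `pm.sample`, and the `threshold` of `0.1` is an assumed "big enough to matter" value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for posterior samples of three group means: 4000 made-up draws.
fake_mus = rng.normal(loc=[-0.5, 0.0, 0.5], scale=0.2, size=(4000, 3))

def max_absolute_difference(mus):
    return np.max(np.abs(mus - np.mean(mus)))

threshold = 0.1  # assumed smallest difference that would matter
differences = np.array([max_absolute_difference(mus) for mus in fake_mus])

# Fraction of samples in which at least one mean is far enough from the average.
fraction_above = np.mean(differences >= threshold)
print(fraction_above)
```

The resulting fraction plays the same role as the bar heights in the binary plots above, but for a claim that involves all of the groups at once.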

Importantly: this is the _difference in the means_, not the differences in the data!

It's more common to use the squared difference,
because it's connected to the variance and standard deviation:

In [None]:
def max_squared_difference(mus):
    return np.max(np.square(mus - np.mean(mus)))

In [None]:
sns.distplot(samples["mus"].apply(max_squared_difference),
             axlabel="Max Squared Difference of Means");

But if we take the maximum of the squared differences,
then the number we measure would only get bigger as we looked at more groups,
even if there were no effect.

So instead, we can look at the _average squared difference_,
which is the same as the _variance_ of the means.

In [None]:
def mean_squared_difference(mus):
    return np.var(mus)

In [None]:
sns.distplot(samples["mus"].apply(mean_squared_difference),
             axlabel="Mean Squared Difference of Means");

But this still doesn't give us a sense of scale:
is `1.0` large? Is `0.01` small?
We might know this from our application,
but it is nice to be able to get a normalized value.

If we divide the variance of the means by the variance of the data,
we get a value that is between 0 and 1.

In [None]:
def normalized_mean_squared_difference(mus):
    return np.var(mus) / np.var(atten_df["score"])

Quick explanation of why: the two numbers we are dividing can't be negative, so neither can their ratio.
The number on the bottom, the variance of the overall distribution, aka the variance of the mixture distribution,
can't be smaller than the variance of the means.

Consider the most extreme cases:
- If the top number is 0, because all the means are the same,
then the value is 0. Therefore this ratio is at least 0.
- We can make the variance of the data smaller by making the variance of each component smaller,
but even if we make the width of each component infinitely thin,
the overall variance of the data won't be 0,
since there is still variability due to the group being different for different observations.
Therefore the bottom number is at least as big as the top number,
so the ratio is no more than 1.
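
The argument above is an informal version of the law of total variance. Writing $g$ for the group label (notation assumed here for illustration), the variance of the data splits into a within-group part and a between-group part:

$$
\text{Var}(x) = \mathbb{E}_g\big[\text{Var}(x \vert g)\big] + \text{Var}_g\big(\mathbb{E}[x \vert g]\big) \geq \text{Var}_g\big(\mathbb{E}[x \vert g]\big)
$$

The first term is an average of variances, so it can't be negative, which gives the inequality. Note that $\text{Var}_g$ weights each group by its mixture weight, so it matches the unweighted `np.var(mus)` only when the groups are equally weighted.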

In [None]:
sns.distplot(samples["mus"].apply(normalized_mean_squared_difference),
             axlabel="Squared Difference in Means as Fraction of Variance");

This is connected to a statistic called the _variance explained_.

## Mixture distribution-based models are also called categorical effects models

Whenever the parameters of one variable depend on another grouping variable, like `"solutions"` above,
we say that there is a _categorical effect_:

the _category_ or _group_ that an observation comes from has an _effect_ on the distribution we choose to model the observation with.

The terminology implies, falsely, that finding a categorical effect establishes a "cause-and-effect" relationship.

### Counter-Example: Ice Cream

Say we measured, on each day, the temperature and whether we sold out of ice cream.

It is reasonable to expect that we will sell out of ice cream more when the temperature is higher,
and so if we split up days by whether we sold out of ice cream and then compare temperatures,
we'll see a "categorical effect", even though the direction of causation is the other way around!

Fundamentally, the problem of determining causation is as much philosophical as mathematical,
and philosophically, even Bayesian statistics can't really accommodate causal claims.

If you're interested in learning more about how to think about causal claims
mathematically and rigorously, instead of the somewhat mushy way it's done in
the statistics we're learning here, check out
[_The Book of Why_](https://www.basicbooks.com/titles/judea-pearl/the-book-of-why/9780465097609/)
by Judea Pearl,
who was one of the pioneers of the graphical modeling style
we've been using, among other things.
It's an accessible introduction to this exciting area of statistics research!

## The `Normal` case is called an ANOVA model, and is especially important for frequentist approaches

The particular mixture model we worked with above,
with `Normal` components that all have the same variance,
is known as an **ANOVA model**
because of its association with the **AN**alysis <b>O</b>f **VA**riance method,
which combines an ANOVA model with a collection of statistical tests.

On Wednesday, we will go over the ANOVA model and tests from a frequentist perspective.

It is perhaps _the most common_ statistical test in research psychology,
and it includes the $t$-test as a special case.

## As usual, the Bayesian approach is more flexible

We can justify our choice to use a `Normal` likelihood
if we think that many small, independent, unknown factors cause
the participants' `score`s to vary relative to the group mean.

How recently they had coffee, whether they've participated in a similar study before,
how much sleep they've gotten recently,
whether and how much they're trying to impress the experimenter, etc. etc.

But imagine that sometimes, in our attention model,
the `score` of a participant is wildly different from what it typically is
due to some large, rare effect.

For instance, a participant might fall asleep on one trial,
and so score very low,
or they might catch a glimpse of the solution when the experimenter,
while doing their make-up, holds up a small hand mirror,
and so score very high.

The resulting observation is called an _outlier_:
it "lies outside" the distribution of the other observations.

## We can model outlier-prone data with the `StudentT` distribution

The go-to choices for likelihoods of data where outliers are a potential problem
are the `StudentT` and the `Cauchy`.
For the `StudentT`,
you'll want to set the value of the `nu` parameter to be `1` or `2`;
as that value increases, the shape of the distribution becomes more `Normal`.
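
One way to see why these distributions tolerate outliers is to compare log-densities far from the center. The sketch below uses only the standard library and relies on the fact that a `StudentT` with `nu=1` is a `Cauchy`; the helper names and the test points are choices made for this example.

```python
from math import log, pi

def normal_logpdf(x, mu=0.0, sd=1.0):
    """Log-density of a Normal(mu, sd) at x."""
    z = (x - mu) / sd
    return -0.5 * z ** 2 - log(sd) - 0.5 * log(2 * pi)

def cauchy_logpdf(x, loc=0.0, scale=1.0):
    """Log-density of a Cauchy, i.e. a StudentT with nu=1, at x."""
    z = (x - loc) / scale
    return -log(pi * scale * (1 + z ** 2))

# Compare a typical value, a moderately unusual value, and an outlier.
for x in [0.0, 2.0, 10.0]:
    print(x, normal_logpdf(x), cauchy_logpdf(x))
```

The `Normal` log-density falls off quadratically, so a single point 10 units out is penalized enormously and drags the estimated mean toward it. The `Cauchy` log-density falls off only logarithmically, so the same point costs far less and has much less pull.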

In [None]:
robust_atten_model = pm.Model()
N_groups = len(atten_df["solutions"].unique())

with robust_atten_model:
    mus = pm.Normal("mus", mu=0, sd=10, shape=N_groups)
    sigma = pm.Exponential("sigma", lam= 1 / pooled_sd)
    
    score = pm.StudentT("score", nu=1,
                      mu=mus[centered_atten_df["solutions"] - 1], sd=sigma,
                      observed=centered_atten_df["score"])

In [None]:
with robust_atten_model:
    robust_atten_trace = pm.sample(draws=1000)

In [None]:
pm.plot_posterior(robust_atten_trace, figsize=(12, 12),
                  ref_val=[0, 0, 0, pooled_sd]);

## Use MAP estimation to get a single best guess

When we want a single "best guess" out of a Bayesian model,
we use the method of _maximum a posteriori_ inference, or MAP inference.

This method seeks to select the values of the parameters, $\theta$,
that make the posterior (log-)probability as large as possible:

$$
p(\theta\vert\text{data}) \propto p(\text{data}\vert\theta) p(\theta)
$$
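
To make "as large as possible" concrete, here is a small sketch that finds a MAP estimate by brute force on a grid, for a one-parameter model with a `Normal` prior and a `Normal` likelihood. Everything in it is assumed for illustration: the data are made up, `sigma` is treated as known, and a grid search is a stand-in for the numerical optimizer that a real tool would use.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=20)  # made-up observations
sigma = 2.0                                     # treated as known here
prior_mu, prior_sd = 0.0, 10.0

thetas = np.linspace(-5, 5, 10001)  # candidate values for the mean
log_prior = -0.5 * ((thetas - prior_mu) / prior_sd) ** 2
log_likelihood = np.array(
    [np.sum(-0.5 * ((data - t) / sigma) ** 2) for t in thetas])
# Log-posterior up to an additive constant, which doesn't move the maximum.
log_posterior = log_prior + log_likelihood

theta_map = thetas[np.argmax(log_posterior)]

# Closed form for this conjugate Normal-Normal case, as a check.
precision = 1 / prior_sd ** 2 + len(data) / sigma ** 2
closed_form = (prior_mu / prior_sd ** 2 + data.sum() / sigma ** 2) / precision
print(theta_map, closed_form)
```

Working with the log of the posterior turns the product of prior and likelihood into a sum, which is why the normalizing constant can be ignored.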

In [None]:
with centered_atten_model:
    MAP_estimates = pm.find_MAP(start=centered_atten_trace[-1])

In [None]:
MAP_estimates

## With a best guess for the parameters, we can make predictions.

The best test of a scientific model
is whether it can predict new observations.

The second best is to check how well it does predicting existing observations.

We can ask our model to predict the `score`s of new participants
based only on how many `solutions` the problem they are attempting has.

For a `Normal` mixture model, the natural prediction to make is the value of `mu`.

In [None]:
predictions = MAP_estimates["mus"][centered_atten_df["solutions"] - 1]

In [None]:
np.array(centered_atten_df["solutions"]), predictions

And we can quantify the quality of our predictions by checking the average squared error,
or squared difference between our prediction and what we observed.

In [None]:
average_prediction_error = np.mean(np.square(centered_atten_df["score"] - predictions))
average_prediction_error

Is this good performance? The answer is unclear.

Let's compare to the prediction performance of a very silly model:
one that always predicts that the participant will score the average,
regardless of group.

This is our "baseline" model:
it is a point of comparison for all other models.
It's what we would have to do if we didn't know what the value of `solutions` was.

In [None]:
baseline_predictions = [0] * len(predictions)

In [None]:
average_baseline_prediction_error = np.mean(np.square(centered_atten_df["score"] - baseline_predictions))
average_baseline_prediction_error

The error of the baseline model is larger, which is a good sign.

It is typical to compare the prediction error of the baseline model
and the actual model by a ratio:

In [None]:
1 - average_prediction_error / average_baseline_prediction_error

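This is essentially the same quantity computed by the `normalized_mean_squared_difference` function above; the two agree closely when the groups are the same size and the MAP estimates sit near the observed group means: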

In [None]:
normalized_mean_squared_difference(MAP_estimates["mus"])
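
The link between the two calculations can be checked on synthetic data. The sketch below assumes balanced groups and uses the observed group means as predictions (in place of MAP estimates); the data are made up with NumPy and are not the `attention` dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up balanced data: 3 groups of 50 observations, labels coded 1..3.
groups = np.repeat([1, 2, 3], 50)
scores = rng.normal(loc=np.array([4.0, 5.0, 6.0])[groups - 1], scale=1.5)
scores = scores - scores.mean()  # center, as in `centered_atten_df`

# Predict each observation with the mean of its own group.
group_means = np.array([scores[groups == g].mean() for g in [1, 2, 3]])
predictions = group_means[groups - 1]

error = np.mean(np.square(scores - predictions))
baseline_error = np.mean(np.square(scores - 0))  # always predict the grand mean

fraction_of_error_removed = 1 - error / baseline_error
variance_ratio = np.var(group_means) / np.var(scores)
print(fraction_of_error_removed, variance_ratio)
```

Under these assumptions the two numbers match up to floating-point error. With unequal group sizes, or with estimates pulled away from the group means by a prior, they would only be approximately equal.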