<img src="../../shared/img/banner.svg" width=2560></img>

# Categorical Effects 02 - ANOVA by Hand

In [None]:
import sys

sys.path.append("../../")

from shared.src import quiet
from shared.src import seed
from shared.src import style

In [None]:
from pathlib import Path
import random

from IPython.display import HTML, Image
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
import scipy.stats

import utils.daft

In [None]:
import utils.anova as anova

In [None]:
sns.set_context("notebook", font_scale=1.7)

In [None]:
import shared.src.utils.util as shared_util

In this tutorial, we will learn about the
[ANalysis Of VAriance](https://en.wikipedia.org/wiki/Analysis_of_variance)
model by walking through a by-hand implementation of it
in its simplest form:
the one-way ANOVA.

In all of its forms,
ANOVA works by "partitioning" the variance in an observation
into different components.
The null hypothesis is that the factors we are interested in are random with respect to the observation.
A randomly chosen partition of the data will, with high probability, not reduce the variance very much.
Therefore if,
once we've partitioned data according to the factors we're interested in,
the variance is greatly decreased,
then something unlikely has occurred under our null hypothesis
and we may have found an effect.

Below, we make these statements rigorous on a toy dataset
and then apply a one-way ANOVA to a real neuroscience dataset.

# Introducing Analysis of Variance

Following Wikipedia's
[motivating example](https://en.wikipedia.org/wiki/Analysis_of_variance#Motivating_example),
we consider what might happen if we were to gather some data
about a collection of dogs.

This collection of dogs has two breeds,
each of which comes in a long-haired and a short-haired variety.
We measure the weight of each dog,
taking care to note its breed and hair length,
and organize that information into a dataframe.

In [None]:
N = 75

dogs = anova.produce_dog_dataframe(N, weight_effect=6)
dogs.sample(5)

As always,
we begin by visualizing our data.
Here, we choose a histogram.
We also calculate a summary statistic --
since we'll need it later,
we look at the variance,
or mean squared difference from the mean.

In [None]:
anova.plot_data(dogs, "weight")

It appears that,
as far as weight is concerned,
there are only two kinds of dogs,
not four:
the histogram has two "clusters"
and the kernel density estimate has two peaks.

Intuitively,
we might guess that breed is more likely
to have a large effect on weight
than is hair length.

To determine whether this is the case,
we split our data up,
first by breed and then by hair length,
and make the same plots as before.
In addition to calculating the overall mean squared difference,
we calculate the mean squared difference within each group
and the mean squared difference of the group means from the overall mean.

In [None]:
anova.plot_partition(dogs, "breed", "weight")

In [None]:
anova.plot_partition(dogs, "hair_length", "weight")

Color indicates from which group
the data in the plot was drawn.

Notice how much less
"spread out"
the data is in the groups based on breed
than in the groups based on hair length.
Notice also how much further apart the means
of the two breeds are,
and how much further they are from the
mean of the weights of all of the dogs.

We can quantify this "spread",
as usual, using the variance,
or the mean squared difference
from the mean.

Though the exact values are random if we generate
the data over and over again,
we in general see that the variance in the groups based on breed
is substantially smaller than the original variance,
while the variance in the groups based on hair length
is around the same size.
Additionally,
we see that the mean squared difference of the group means
from *their* mean, which is the mean weight of all dogs,
is much larger when we group by breed than when we group by hair length.

You can also vary how much of an effect the breed has on weight
by changing the value of the `weight_effect`
argument to `produce_dog_dataframe`.
Try some different effect sizes
and see if you can predict the changes in spread.

This is the idea at the core of ANOVA:
if group membership has a strong effect on an observed variable,
then splitting the data up by group should
reduce the variance substantially,
while the mean squared difference of the group means from the overall mean should go up.
This isn't true of every possible way that an observed variable might
depend on group membership, but it is true in many cases,
as described below in "When to Use ANOVA".

What does it mean to reduce the variance "substantially"?
We define "substantially" as
"more than when splitting the data up by a labeling unrelated to the observation".
For the data above, our unrelated labeling was the hair length.
We can, however, also create random labels and use those.
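
As a minimal sketch of that idea, the cell below builds a synthetic stand-in for the dog data (the means, effect size, and noise level are made-up values, not taken from `produce_dog_dataframe`), shuffles the labels to get an unrelated labeling, and compares the average within-group variance under each labeling. The helper `mean_within_group_variance` is a hypothetical name introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic stand-in for the dog data: two groups whose means differ by 6.
n_per_group = 75
true_labels = np.repeat([0, 1], n_per_group)
weights = 20 + 6 * true_labels + rng.normal(0, 1, size=2 * n_per_group)


def mean_within_group_variance(values, labels):
    """Average the variance of `values` inside each group defined by `labels`."""
    return np.mean([values[labels == g].var(ddof=1) for g in np.unique(labels)])


# Shuffling the labels breaks any relationship between label and weight.
random_labels = rng.permutation(true_labels)

total_variance = weights.var(ddof=1)
related_variance = mean_within_group_variance(weights, true_labels)
unrelated_variance = mean_within_group_variance(weights, random_labels)

print(total_variance, related_variance, unrelated_variance)
```

The related labeling should cut the variance sharply, while the shuffled labeling should leave it close to the total.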

We firm up this intuition with the section below on the
data model underlying ANOVA.

## The Implicit Model in ANOVA

Unlike other traditional statistical tests,
the ANOVA is thought of as a _model_ of the data:
it is used to predict unobserved values, for example.
Most other tests also have an implicit model,
but that fact is typically ignored when they are used.

When we use a one-way ANOVA to model our data,
we are implicitly choosing to model the dependent variable
as the sum of three things:
1. the **grand mean**,
or the average value of all observations,
1. the **group effect**, or the change in the average value due to being in that group,
which is the difference between that grand mean and the average of observations in a particular group,
1. and the sum of any **unknown effects**,
or changes to the observed variable due to things we didn't measure,
which is just the difference between the sum of the first two terms and the value that was observed.

Below, we take this English-language description and convert it into math.

$$\begin{align}
    &\text{Observation}\ j\ \text{in Group}\ i\ &= \ &\text{Grand Mean}\ &+\ &\text{Group Effect}_i\ &+ \ &\text{Unknown Effects}_{ij} \\
    &Y_{ij} &= \ &\mu_\text{grand} \ &+\ &A_i\ &+ \ &\epsilon_{ij}
\end{align}$$

where the variable $A$ is the independent variable, the index $i$ indicates which group or level of this variable the individual was in and the index $j$ tells us which individual within the group they were.

In our dog example, $A$ would be either "hair length" or "breed" and $i$ would be the index, $0$ or $1$, for the groups within those variables. $j$ would index dogs within a group. To pick out a particular dog, we need both which group it's in, $i$, and which dog in that group it is, $j$.

This notation may seem like overkill, but it will be necessary for more complicated ANOVAs.

To make this model tractable,
we have to assume something about these "unknown effects".
Into this category we lump everything that we did not measure,
everything that does not correspond to a group in our study,
but which can conceivably impact the observed value.
Appealing to the
[Central Limit Theorem](https://www.khanacademy.org/math/statistics-probability/sampling-distributions-library/sample-means/v/central-limit-theorem),
we say that the things we aren't measuring are
independent from each other and our groups,
large in number,
and individually have small effects on our observed variable,
and therefore the distribution of their impacts on the observed variable
is a Gaussian.
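
To make the model concrete, here is a small simulation sketch that draws data from $Y_{ij} = \mu_\text{grand} + A_i + \epsilon_{ij}$ with Gaussian unknown effects. All parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

mu_grand = 20.0                    # grand mean (made-up value)
A = np.array([-3.0, 3.0])          # group effects, one per group; they sum to zero
sigma = 1.0                        # spread of the unknown effects
n_per_group = 1000

group = np.repeat(np.arange(len(A)), n_per_group)   # index i for each observation
epsilon = rng.normal(0.0, sigma, size=group.size)   # Gaussian unknown effects
Y = mu_grand + A[group] + epsilon                   # Y_ij = mu_grand + A_i + eps_ij

# With enough samples, the group means should land near mu_grand + A_i.
group_means = np.array([Y[group == i].mean() for i in range(len(A))])
print(group_means)
```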

If we know all of the parameters of this model,
as when we're determining the distribution under the null,
then the graph looks like this:

In [None]:
utils.daft.make_anova_graph()

$$
\mu_G \sim \text{Flat}() \\
\sigma \sim \text{HalfFlat}() \\
A \sim \text{Flat}\left(\text{shape}=K\right) \\
g \sim \text{DiscreteUniform}\left(0, K-1\right) \\
y \sim \text{Normal}(\mu= \mu_G + A_g, \text{sd}=\sigma)
$$

In the case of the dog breed and weight example,
the values for the parameters would be

$\mu_G$:

In [None]:
mu_G = dogs.groupby("breed")["weight"].mean().mean()
mu_G

$\sigma$:

In [None]:
sigma = np.sqrt(dogs.groupby("breed")["weight"].var(ddof=1).mean())
sigma

$A$

In [None]:
A = dogs.groupby("breed")["weight"].mean() - mu_G
A

In [None]:
anova.plot_partition(dogs, "breed", "weight")

The gray line represents $\mu_G$.

The two values of $A$ are given by the gaps between
the colored lines and the gray line.

$\sigma$ is a pooled measure of the widths of the two distributions:
the square root of the average of their variances.

Note that there's no fundamental need to separate out the grand mean $\mu_G$
from the values in $A$.

We could instead have

$$
\sigma \sim \text{HalfFlat}() \\
A^\prime \sim \text{Flat}\left(\text{shape}=K\right) \\
g \sim \text{DiscreteUniform}\left(0, K-1\right) \\
y \sim \text{Normal}(\mu= A^\prime_g, \text{sd}=\sigma)
$$

Then, the values of $A^\prime$ would be represented in the plot above by
the colored dashed lines.

If you're writing your own "Bayesian ANOVA model" for data,
you should parameterize it like this.
The former way of writing it is for convenience when
working with the statistical tests associated with ANOVA,
to which we now turn.

## The $F$ Test for ANOVA Models

As established above,
we break our data into groups and then calculate
the mean and variance _in each group_.
If the group labels have a big effect on the observed values,
then the differences between the group means will be large
relative to the variability of the data.

The statistic used to test this model is the ratio of
the _variance captured by the group effects_
to the _variance of the unknown effects_.
This statistic is called the
$F$-statistic,
named in honor of its inventor,
[Sir Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher).

For this case with two groups ($K=2$) of $N$ observations each, we have

$$
F = \frac{N \cdot \left(A_0^2 + A_1 ^2\right)}{\sigma^2}
$$

Notice that $\sigma$, the within-group standard deviation, is not the same as the standard deviation of all of the data with the groups ignored:

In [None]:
sigma, np.sqrt(dogs["weight"].var())

For the dog breeds example, the values are

In [None]:
# N is now the number of dogs in one group, not the total number of dogs
N = sum(dogs["breed"] == 0)

(N * A.var()) / (sigma ** 2)

Large values of $F$ are unlikely under the null hypothesis
that the true values of $A_0$ and $A_1$ are $0$.

In [None]:
scipy.stats.f_oneway(dogs[dogs["breed"] == 0]["weight"],
                     dogs[dogs["breed"] == 1]["weight"])

Whereas small values of the statistic are likely when the group label has little or no relation to the measurement.

In [None]:
F_hair_length, p_hair_length = scipy.stats.f_oneway(
    dogs[dogs["hair_length"] == 0]["weight"],
    dogs[dogs["hair_length"] == 1]["weight"])
F_hair_length, p_hair_length

In general,

$$
F = \frac{\frac{N}{K-1} \cdot \sum_i A_i^2}{\sigma^2}
$$

where $N$ is again the number of observations in each group and the $\sum$ symbol means "add up all of these things".

That is, if $S$ is a list or a Series, $\sum_i S_i$ means the same as `sum(S)` or `S.sum()` or, most directly

`sum([S[i] for i in range(len(S))])`.
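
As a sketch of the general formula, the cell below applies it to synthetic data with three equally sized groups (the group locations are made up) and compares the result with `scipy.stats.f_oneway`. The formula as written assumes equal group sizes, so that is what the example uses.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)

K, n = 3, 40   # K groups, n observations in each
samples = [rng.normal(loc, 1.0, size=n) for loc in (0.0, 0.5, 1.0)]

group_means = np.array([s.mean() for s in samples])
mu_G = group_means.mean()                              # grand mean (equal group sizes)
A = group_means - mu_G                                 # group effects
sigma_sq = np.mean([s.var(ddof=1) for s in samples])   # average within-group variance

F_by_formula = (n / (K - 1)) * np.sum(A ** 2) / sigma_sq
F_scipy = scipy.stats.f_oneway(*samples).statistic

print(F_by_formula, F_scipy)
```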

Remember that $p$-values are calculated by using
the sampling distribution of the statistic under the null hypothesis,
aka the null distribution.

The sampling distribution of $F$ under the null hypothesis
for this data is below.
It is the same for both the `hair_length` and the `breed` models.

In [None]:
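K = 2
# the residual degrees of freedom use the total number of dogs,
# not the per-group count currently stored in N
null_distribution = scipy.stats.f(K - 1, len(dogs) - K)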

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
fs = pd.Series(np.linspace(0, 10, num=5000))
ax.plot(fs, null_distribution.pdf(fs), lw=4, label="Null Distribution");
ax.vlines(F_hair_length, 0, null_distribution.pdf(F_hair_length),
          lw=4, zorder=3, label="Observed $F$");
ax.fill_between(color="C1", label="Contributes to $p$",
    x=fs[fs>=F_hair_length],
    y1=null_distribution.pdf(fs[fs>=F_hair_length]));
ax.set_xlabel("$F$"); ax.legend();

Here, the "extreme" values of the test statistic, $F$,
are all values as large or larger than the one we observed,
unlike in a $t$-test, where the extreme values
were those with _magnitude_ as large or larger.

In [None]:
1 - null_distribution.cdf(F_hair_length)
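
The upper-tail probability can also be read off the distribution's survival function, `sf`, which is defined as one minus the CDF. The sketch below uses hypothetical degrees of freedom and a hypothetical observed statistic, chosen only to show that the two routes agree.

```python
import numpy as np
import scipy.stats

# Hypothetical degrees of freedom and observed statistic, for illustration only.
null = scipy.stats.f(dfn=1, dfd=148)
F_observed = 2.5

p_from_cdf = 1 - null.cdf(F_observed)
p_from_sf = null.sf(F_observed)   # survival function: P(F > F_observed)

print(p_from_cdf, p_from_sf)
```

For very large values of $F$, `sf` tends to be the more numerically accurate of the two, since it avoids subtracting two nearly equal numbers.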

## When to Use ANOVA

The discussion above,
full of comparisons of means to baselines
and calculations of variances as yardsticks,
reminds us of the $t$-test.

And indeed, the two-sample $t$-test
(in its equal-variance form)
is a particularly simple one-way ANOVA,
with only two groups.
$t^2$ is then equal to $F$.
Running that $t$-test on our dog-breeds data
would give the same $p$-value as running an ANOVA.
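
A quick numerical check of this relationship, on synthetic two-group data with made-up means, might look like the following. It uses the pooled-variance $t$-test (`equal_var=True`), which is the version that matches a one-way ANOVA.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(3)
group_0 = rng.normal(20.0, 1.0, size=30)
group_1 = rng.normal(21.0, 1.0, size=30)

# equal_var=True selects the classic pooled-variance t-test.
t_result = scipy.stats.ttest_ind(group_0, group_1, equal_var=True)
f_result = scipy.stats.f_oneway(group_0, group_1)

print(t_result.statistic ** 2, f_result.statistic)
print(t_result.pvalue, f_result.pvalue)
```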

One might then imagine that we don't need an ANOVA:
we simply use the $t$-test for data from each group
versus the rest of the data.

However, in the case that we have more than two groups
(more than two dog breeds)
this requires us to perform multiple hypothesis tests.
Each time we perform a hypothesis test,
there is a chance of a false positive.
As we perform more and more hypothesis tests,
the chance that at least one of them produces a false positive goes up very quickly.
This chance is called the *familywise error rate*,
since it's the rate at which an entire family of tests has an error.
Issues with rising familywise error rates are called issues of *multiple comparisons*,
since they arise from comparing multiple test statistics to their critical values.
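
To see how quickly this rate grows, the sketch below assumes the tests are independent and each has the same false positive rate $\alpha$, in which case the familywise error rate is $1 - (1 - \alpha)^m$ for $m$ tests. Real families of tests are often not independent, so treat this as an illustration rather than an exact figure. The function name is hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 3, 10, 20):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:2d} tests: {rate:.3f}")
```

With ten such tests at $\alpha = 0.05$, the chance of at least one false positive is already about 40%.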

The utility of ANOVA comes from the fact that it lets us test
the hypothesis that the mean of at least one
of the levels of the variable we are using
to group our observations
is different from the overall mean
without specifying which level it is.

Once we've rejected this overall null hypothesis,
we can more confidently perform additional $t$-tests
to narrow down just which level is different.

This approach works well to control familywise error rates
for the simple experiments we are analyzing in this section.
For more complex experiments,
which require more complex ANOVAs,
multiple comparison issues can still arise.

## When Not to Use ANOVA

Our assumption about the distribution of the unobserved effects is frequently incorrect:
for example,
in a study of human participants of different ages,
whether a participant is hungover or not
can have a very strong impact on their performance in a task, and
it is not independent of age.
When we have rare effects that cause big changes in the observed value,
the result is *outliers*.
ANOVA is not robust to violations of the assumption
that there are no outliers,
but we won't be discussing alternatives in this course.

We also assume that the expected magnitude of the unknown effects
is the same for each group --
that the variance of the unknown effects
doesn't depend on the group.
The Greek term for "spread" is
*skedasis*, so if we assume that the spread,
or variance,
is the same across groups, we are assuming
*homoscedasticity*,
as opposed to
*heteroscedasticity*.
If your groups are all the same size,
then ANOVA is actually quite robust
to violations of this assumption.
See
[this discussion on CrossValidated](https://stats.stackexchange.com/questions/97098/practically-speaking-how-do-people-handle-anova-when-the-data-doesnt-quite-mee)
for more.
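
One informal way to look at the equal-spread assumption is to compare the sample variance of each group. The sketch below does this on synthetic data in which one group was deliberately given a larger spread. The column names and values are made up, and no particular cutoff for the ratio is implied.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Synthetic data: three equally sized groups, the last with a larger spread.
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 50),
    "value": np.concatenate([
        rng.normal(0.0, 1.0, size=50),
        rng.normal(0.0, 1.0, size=50),
        rng.normal(0.0, 3.0, size=50),
    ]),
})

group_variances = toy.groupby("group")["value"].var(ddof=1)
variance_ratio = group_variances.max() / group_variances.min()

print(group_variances)
print(variance_ratio)
```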

# Implementing ANOVA

## Example Dataset

For the rest of this section's tutorial and lab, we'll be using some EEG data graciously provided by the [Voytek lab](http://voyteklab.com/about-us/) of UCSD. Participants of varying ages were asked to perform a working memory task with varying levels of difficulty. The raw EEG signal has been summarized into the following two measures:

* [Contralateral Delay Activity](https://www.ncbi.nlm.nih.gov/pubmed/26802451), or CDA, is used to measure the engagement of visual working memory.

* [Frontal Midline Theta](https://www.ncbi.nlm.nih.gov/pubmed/9895201) oscillation amplitude has been correlated with sustained, internally-directed cognitive activity.

The performance of the subjects has also been summarized using the measure
[d'](https://en.wikipedia.org/wiki/Sensitivity_index) (pronounced "d-prime"), also known as the *sensitivity index*. It measures the subject's performance in a task and is based on comparing the true positive rate and the false positive rate.
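
For reference, the standard definition of d' is the difference between the $z$-scores of the true positive (hit) rate and the false positive (false alarm) rate. The sketch below implements that textbook formula with made-up rates. How the values in this dataset were actually computed is not described here, so this is only an illustration.

```python
import scipy.stats


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: z(hit rate) - z(false alarm rate)."""
    z = scipy.stats.norm.ppf   # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


print(d_prime(0.8, 0.2))   # hypothetical rates
print(d_prime(0.5, 0.5))   # equal hit and false alarm rates: no sensitivity
```

Rates of exactly 0 or 1 give infinite $z$-scores, so in practice they need some correction before this formula is applied.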

## Loading the Data

In [None]:
shared_data_path = Path("..") / ".." / "shared" / 'data'

In [None]:
df = pd.read_csv(shared_data_path / 'voytek_working_memory_aging_split.csv', index_col=None)

df.sample(5)

For the purposes of this tutorial, we're interested only in how task difficulty affects our three measures. We're uninterested in the subject's metadata -- `age_split`, `group`, `age`, and `idx`. Let's begin by dropping those columns from our dataframe using the DataFrame method `drop`.

In [None]:
data = df.drop(['age_split','group','age','idx'], axis=1)
data.sample(5)

It's good practice to keep an original copy of your dataframe around (here, named `df`) so you can undo irreversible changes, like dropping columns.

In [None]:
f, ax = plt.subplots(figsize=(12, 6))
sns.boxplot(y="cda", x="difficulty", hue="difficulty", data=data, ax=ax);

## ANOVA by Hand

To get a better understanding of ANOVA,
we'll now implement it from scratch.

To get started,
you'll need the total number of observations $N$,
the group sizes
(here, each group is the same size),
and the keys for each group
(here, 1, 2, and 3, and they're the `unique` values of the `difficulty` variable).

The first cell picks a measure to run ANOVA on.
We'll want to write all of our code that follows in such a way
that we can run ANOVA on the other measures
just by changing this one cell.

In [None]:
measure = "cda"

In [None]:
N = len(data[measure])

groups = data["difficulty"].unique()

In [None]:
groups

We'll proceed by generating a new data frame
that contains all the information we need to perform an ANOVA
-- each row will contain the grand mean and the group mean,
the group effect,
and the residual for that observation.
We will call this our `anova_frame`.

We also drop the other measures from this frame, to cut down on clutter.

In [None]:
anova_frame = data.copy().drop([unused_measure for unused_measure in ["cda", "d", "fmt"]
                                if measure != unused_measure], axis=1)

In [None]:
anova_frame.head()

### Computing the Grand and Group Means

The cell below computes the grand mean
and the group mean for each difficulty level
and stores them in the `anova_frame`.

The precise definitions of the grand and group means appear below.

$$\begin{align}
\mu_\text{grand} &= \frac{1}{N}\sum_\text{dataset} Y_{ij} \\
\mu_{\text{group}_i} &= \frac{1}{N_{\text{group}_i}}\sum_{\text{group}_i} Y_{ij}
\end{align}$$

Note that $\mu_{\text{group}_i}$, the group *mean*, is not the same as $A_i$, the group *effect*!

The group effect is the difference between the group mean and the grand mean. We must compute the group and grand mean first, before we can compute the group effect.

In [None]:
anova_frame["grand_mean"] = anova_frame[measure].mean()

group_means = anova_frame.groupby("difficulty")[measure].mean()

for group in groups:
    anova_frame.loc[anova_frame.difficulty==group,"group_mean"] = group_means[group]
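
An alternative to the loop, sketched here on a small made-up frame rather than on `anova_frame`, is the pandas groupby method `transform`, which returns one value per row and so can be assigned directly to a new column.

```python
import pandas as pd

# Toy frame with the same layout as `anova_frame`; the values are made up.
toy = pd.DataFrame({
    "difficulty": [1, 1, 2, 2, 3, 3],
    "cda": [1.0, 3.0, 4.0, 6.0, 7.0, 9.0],
})

toy["grand_mean"] = toy["cda"].mean()
# transform("mean") returns one value per row: the mean of that row's group.
toy["group_mean"] = toy.groupby("difficulty")["cda"].transform("mean")

print(toy)
```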

### Aside on Degrees of Freedom

Let's take a look at the resulting data frame.

In [None]:
anova_frame.sample(10)

There are only three unique values in the `group_mean` column, corresponding to the three group means. Because the groups are all the same size, if we calculate the average of these values, we'll find that it is equal to the grand mean.

In [None]:
group_means = anova_frame["group_mean"].unique()

print(group_means)

# np.isclose compares the size of the difference, whichever way it points
np.isclose(np.mean(group_means), anova_frame[measure].mean())

And so if we know the grand mean,
we only need two of the group means to know the other.

Put another way, though we have three numbers here in the form of our three group means,
the value of the third is constrained by the values of the other two (and the grand mean).
There are only two "free parameters" here,
rather than three.
The number of free parameters is also known as the number of
*degrees of freedom*,
terminology that is borrowed from physics.

When we are computing inferential statistics,
we need to use the number of degrees of freedom,
rather than the total number of observations.

For the mean, this doesn't cause an issue,
because computing the mean doesn't rely on
any other statistic estimated from the same datapoints.

However, the variance is computed
using the value of the mean,
which is calculated from the same data.
Therefore if we know the mean, the variance,
and all but one of the data values,
then we can calculate the missing value.
Therefore the variance has one less degree of freedom,
so the proper formula for the variance
as an inferential statistic of the data is

$$ \frac{1}{N-1} \sum_\text{dataset} \left(\text{observation}\ -\ \text{average of observations} \right)^2 $$

This will become important later when we calculate the variances.
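
In code, the choice of denominator shows up as the `ddof` ("delta degrees of freedom") argument. The sketch below, on a tiny made-up array, shows that NumPy divides by $N$ by default while pandas divides by $N-1$ by default.

```python
import numpy as np
import pandas as pd

values = np.array([2.0, 4.0, 4.0, 6.0])   # mean is 4, squared deviations sum to 8

var_divide_by_n = np.var(values)                 # divides by N: 8 / 4
var_divide_by_n_minus_1 = np.var(values, ddof=1) # divides by N - 1: 8 / 3
var_pandas_default = pd.Series(values).var()     # pandas defaults to ddof=1

print(var_divide_by_n, var_divide_by_n_minus_1, var_pandas_default)
```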

### Computing the Group and Unknown Effects

Now, we compute the explained and unexplained, or residual, components for each observation.
The explained differences are the differences between the group average and the overall average.
The unexplained difference is the difference between the individual score and the group average.

Above, these were called the "group effect" and the "unknown effect", respectively.
Terminology differs between authors.
Different terms correspond to different particular intuitions about ANOVA.

$$\begin{align}
\text{Group Effect}_i &= \text{Explained Component}_i \\
            &= \mu_{\text{group}_i} - \mu_\text{grand}  \\
            \\
\text{Unknown Effects}_{ij} &= \text{Unexplained Component}_{ij} = \text{Residual}_{ij} \\
            &= Y_{ij} - \left(\text{Grand Mean}\ +\ \text{Group Effect}_i \right) \\
            &= Y_{ij} - \mu_{\text{group}_i}
\end{align}$$

In [None]:
anova_frame["explained"] = anova_frame["group_mean"]-anova_frame["grand_mean"]

anova_frame["residual"] = anova_frame[measure]-anova_frame["group_mean"]

In [None]:
anova_frame.sample(10)

To check our work, we confirm that the total value for each observation is equal to the sum of the grand mean, the explained component, and the residual.

In [None]:
np.isclose(anova_frame[measure],
           anova_frame["grand_mean"]
           + anova_frame["explained"]
           + anova_frame["residual"]).all()

This is a condition that is guaranteed by our model,
which states that any component of our observation that isn't explainable
as a deviation in group mean from the grand mean
is due to an unknown effect and is
to be left unexplained, or residual.

Review the implicit ANOVA model below and compare it to the `anova_frame` above.
Where is each component in this dataframe? How are they related to each other?

$$\begin{align}
    &\text{Observation}\ j\ \text{in Group}\ i\ &= \ &\text{Grand Mean}\ &+\ &\text{Group Effect}_i\ &+ \ &\text{Unknown Effects}_{ij} \\
    &Y_{ij} &= \ &\mu_\text{grand} \ &+\ &A_i\ &+ \ &\epsilon_{ij}
\end{align}$$

### Computing Sums of Squares

Now that we have all of our components,
we need to compute their mean squares.
The general formula for computing the mean squared difference from the mean is

$$
\text{Variance of Observed Variable} = \frac{1}{\text{degrees of freedom}} \sum_\text{observations} \left(\text{observation}\ -\ \text{average of observations}\right)^2
$$

where everything except the inverse degrees of freedom
is called a
*sum of squares*,
for the obvious reason
that it is a _sum_ of things being _squared_.

As a first step to calculating each mean square,
we calculate several sums of squares.

Using the symbol $SS$ to stand for "sum of squares",
we define the following:

$$\begin{align}
SS_\text{Total} &= \sum_\text{dataset} Y_{ij}^2 \\
SS_\text{Grand Mean} &= \sum_\text{dataset} \mu_\text{grand}^2 \\
SS_\text{Explainable} = SS_\text{Total} - SS_\text{Grand Mean} 
                     &= \sum_\text{dataset} \left(Y_{ij} - \mu_\text{grand}\right)^2 \\
SS_{\text{Explained}} = \sum_\text{groups} SS_{\text{Group Effect}_i}
                     &= \sum_\text{groups} \sum_{\text{group}_i} A_i^2 \\
SS_{\text{Residual}} = SS_\text{Explainable} - SS_\text{Explained}
                     &= \sum_\text{dataset} \left(Y_{ij} - A_i - \mu_\text{grand}\right)^2
\end{align}$$

These are known, respectively, as the total sum of squares,
the sum of squares of the grand mean,
the sum of squares that can be explained by a model,
the sum of squares that was explained by the model,
and the residual sum of squares.
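
A tiny worked example can make these definitions easier to check by hand. The sketch below uses three made-up groups of three observations each, so every sum of squares is a small whole number.

```python
import numpy as np

# Three made-up groups of three observations each.
toy_groups = [np.array([1.0, 2.0, 3.0]),
              np.array([4.0, 5.0, 6.0]),
              np.array([7.0, 8.0, 9.0])]
Y = np.concatenate(toy_groups)
mu_grand = Y.mean()   # grand mean of all nine observations

ss_total = np.sum(Y ** 2)
ss_grand_mean = Y.size * mu_grand ** 2
ss_explainable = np.sum((Y - mu_grand) ** 2)
# each observation in a group contributes that group's squared effect A_i^2
ss_explained = sum(g.size * (g.mean() - mu_grand) ** 2 for g in toy_groups)
ss_residual = sum(np.sum((g - g.mean()) ** 2) for g in toy_groups)

print(ss_total, ss_grand_mean, ss_explainable, ss_explained, ss_residual)
```

Here the total is 285, the grand mean accounts for 225, and the remaining 60 splits into 54 explained and 6 residual.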

Several of the equations above, like the one relating
the explainable sum of squares to the total and grand mean sums of squares,
can be used to determine whether we've done our calculations correctly.
The assertion statements in the second code cell below check this.

We'll store the sums of squares in a dictionary, `sum_of_squares`, using the column name as the key.

In [None]:
sum_of_squares = {}

keys = [measure, "grand_mean", "explained", "residual"]

for key in keys:
    sum_of_squares[key] = np.sum(np.square(anova_frame[key]))

sum_of_squares["explainable"] = sum_of_squares[measure] - sum_of_squares["grand_mean"]

In [None]:
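# these should be the same, except for computer rounding error

# explainable = total - grand mean should match the directly computed
# sum of squared differences from the grand mean
assert np.isclose(sum_of_squares["explainable"],
                  np.sum(np.square(anova_frame[measure] - anova_frame["grand_mean"])))

# np.isclose checks the size of the difference in both directions
assert np.isclose(sum_of_squares["explainable"],
                  sum_of_squares["explained"] + sum_of_squares["residual"])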

In [None]:
sum_of_squares

### Computing Mean Squares

If we use an alternative formula for variance:

$$\begin{align}
\text{Variance}(X) = \frac{1}{\text{degrees of freedom}} \left( SS(X) - SS(\mu_X) \right)
\end{align}$$

where $\mu_X$ is the mean of $X$,
we see that our explained sum of squares, 
$$
\ \\
SS_\text{Explained} = SS_\text{Group Means} - SS_\text{Grand Mean}\\
$$
is ready to be turned into a variance:
the average of the group means is the grand mean,
so dividing the explained sum of squares
by its degrees of freedom tells us
the variance of our predictions,
aka the spread of the group means.

In addition, the mean of the residuals is by definition $0$,
so the mean squared difference of the residuals from their mean is just
$$
\ \\ 
\frac{1}{\text{degrees of freedom}}SS_{\text{Residual}} \\
$$
and so we also only need to divide by degrees of freedom to get the variance.

Because we are simply dividing a sum of squares,
these quantities are often called "mean squares"
rather than "mean squared differences from the mean"
or "variances".

The total degrees of freedom is the number of observations, $N$.

Each time we use a degree of freedom, we subtract from our total available.
When we are done, we have no degrees of freedom left.

Calculating the grand mean takes away one degree of freedom:
if we know the grand mean AND $N-1$ of the data values,
then we know the missing value.

Calculating the $k$ different group means,
which gave us our explained components,
takes away $k-1$ degrees of freedom.
If we know the grand mean AND $k-1$ of the group means,
then we know the missing group mean.

The remaining degrees of freedom are used by the residuals.
If we write the process of subtraction out,
we get

$$
\text{residual degrees of freedom} = N - 1 - (k-1) = N - 1 - k + 1 = N - k
$$
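
As a quick illustration of this bookkeeping, the sketch below splits the degrees of freedom for a hypothetical design with 9 observations in 3 groups. The function name is made up for this example.

```python
def degrees_of_freedom(n_observations, n_groups):
    """Split the total degrees of freedom among the ANOVA components."""
    return {
        "grand_mean": 1,
        "explained": n_groups - 1,
        "residual": n_observations - n_groups,
    }


example = degrees_of_freedom(n_observations=9, n_groups=3)
print(example)
```

The three entries add back up to the total number of observations.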

The cell below calculates these degrees of freedom
and places them in a dictionary
called `dof`.

As a sanity check, the assert statement below
checks that the sum of the other degrees of freedom
is equal to the total degrees of freedom.

In [None]:
# k is the number of groups
k = len(groups)

dof = {}
vals = [N, 1, k-1, N-k]

for key,val in zip(keys,vals):
    dof[key] = val

In [None]:
dof

In [None]:
# compare keys with != rather than `is not`, which tests object identity
assert sum(dof[key] for key in dof if key != measure) == dof[measure]

Now, we calculate our estimate for the mean square of the explained and unexplained components.

In [None]:
mean_square = {}

for key in ["explained","residual"]:
    mean_square[key] = sum_of_squares[key] / dof[key]

In [None]:
mean_square

The variance of the explained component tells us how much
the groups differ from one another:
the larger the spread in group means,
the larger the variance of the explained component.
This is also sometimes referred to as the
*variance in our predictions*.

We'd like our predictions to,
in addition to having small squared error,
aka small variance *within* groups,
have large variance *between* groups,
since that means that the prediction
is actually different for different groups.
For this reason,
these mean squares are sometimes called
the within group mean squares
and the between group mean squares,
rather than residual and explained,
respectively.

Note that the bigger the explained mean square is,
the more support there is for the hypothesis that the groups differ,
because it is less likely we would have observed such a result if the null hypothesis were true.
(Why?)

### Computing the $F$-Statistic

However,
the explained mean square by itself isn't sufficient to make decisions
about the validity of hypotheses
-- is a variance of `3.23` "big enough" to not be due to chance?
For our data, it seems like it is,
but for data with units in the billions and spread in the millions,
it would not be.
Therefore,
if we want a statistic that tells us how good our hypothesis is,
we need to somehow take into account the total variance.

To do so,
we divide the explained variance
not by the total variance,
but by the unexplained variance.
The ratio of the explained variance
to the unexplained variance
is called the $F$-statistic.

Why do we divide by the unexplained,
residual variance
instead of by the total variance?

The reasons for choosing $F$ are technical,
relating to which ratio is easier
to get a handle on mathematically.
The upshot is that the $F$ statistic
has a nice distribution that can be written down mathematically,
without the aid of computers,
as [Sir Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher)
did in the early 20th century.
Ratios of other variances are not as amenable
to mathematical analysis,
and so the only way to test hypotheses
using those ratios as the test statistic is by using
randomization tests,
which require substantial computing power
that was not available until recently.

We compute the value of $F$ for this data below.

In [None]:
F = mean_square["explained"] / mean_square["residual"]

F

We then use the null distribution of $F$ given by `scipy`
to compute a $p$-value.

In [None]:
1 - scipy.stats.f(dof["explained"], dof["residual"]).cdf(F)

In [None]:
difficulty_groupby = anova_frame.groupby("difficulty")[measure]

difficulty_groups = [difficulty_groupby.get_group(difficulty) for difficulty in [1, 2, 3]]

In [None]:
scipy.stats.f_oneway(difficulty_groups[0], difficulty_groups[1], difficulty_groups[2])
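
The by-hand steps above can be collected into a single function. The sketch below is one possible packaging, not the only one. It is run on three small made-up groups instead of the EEG data and compared with `scipy.stats.f_oneway`. The name `one_way_anova` is hypothetical.

```python
import numpy as np
import scipy.stats


def one_way_anova(samples):
    """Return (F, p) for a list of 1-D arrays, one array per group."""
    all_values = np.concatenate(samples)
    N, k = all_values.size, len(samples)
    grand_mean = all_values.mean()

    # explained: squared group effects, counted once per observation
    ss_explained = sum(s.size * (s.mean() - grand_mean) ** 2 for s in samples)
    # residual: squared differences from each observation's own group mean
    ss_residual = sum(np.sum((s - s.mean()) ** 2) for s in samples)

    ms_explained = ss_explained / (k - 1)
    ms_residual = ss_residual / (N - k)
    F = ms_explained / ms_residual
    p = scipy.stats.f(k - 1, N - k).sf(F)   # upper tail of the null distribution
    return F, p


toy_samples = [np.array([1.0, 2.0, 3.0]),
               np.array([4.0, 5.0, 6.0]),
               np.array([7.0, 8.0, 9.0])]
F_hand, p_hand = one_way_anova(toy_samples)
F_scipy, p_scipy = scipy.stats.f_oneway(*toy_samples)

print(F_hand, F_scipy)
print(p_hand, p_scipy)
```

For these toy groups the explained mean square is 54 / 2 and the residual mean square is 6 / 6, so $F$ comes out to 27 by either route.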