<img src="./img/HWNI_logo.svg"/>

# Tutorial - ANOVA by Hand

In [1]:
# makes our plots show up inside Jupyter
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats

import util.utils as utils
import util.shared as shared

shared.format_plots()
shared.format_dataframes()

## Analysis of Variance

## The Implicit Model in ANOVA

## When Should You Use ANOVA?

Define familywise error rate here?

## Example Dataset

For this week's lab, we'll be using some EEG data graciously provided by the [Voytek lab](http://voyteklab.com/about-us/) of UCSD. Participants of varying ages were asked to perform a working memory task with varying levels of difficulty. The raw EEG signal has been summarized into the following two measures:

* [Contralateral Delay Activity](https://www.ncbi.nlm.nih.gov/pubmed/26802451), or CDA, is used to measure the engagement of visual working memory.

* [Frontal Midline Theta](https://www.ncbi.nlm.nih.gov/pubmed/9895201) oscillation amplitude has been correlated with sustained, internally-directed cognitive activity.

The performance of the subjects has also been summarized using the measure
[d'](https://en.wikipedia.org/wiki/Sensitivity_index) (pronounced "d-prime"), also known as the *sensitivity index*. d' measures how well a subject performs in a task. It's based on comparing the true positive rate with the false positive rate.
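To make the definition concrete, here is a minimal sketch of the textbook formula for d', the difference between the z-scored true positive (hit) rate and the z-scored false positive (false alarm) rate. The function name `d_prime` and the example rates are hypothetical, and this is not necessarily how the values in the dataset were computed.

```python
from scipy.stats import norm


def d_prime(hit_rate, false_alarm_rate):
    """Textbook sensitivity index: z(hit rate) - z(false alarm rate)."""
    # norm.ppf is the inverse of the standard normal CDF (the z-transform)
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)


# Hypothetical subject: 90% hits, 10% false alarms
example_d_prime = d_prime(0.90, 0.10)
print(round(example_d_prime, 2))

# A subject responding at chance has equal rates, so d' is zero
chance_d_prime = d_prime(0.50, 0.50)
print(chance_d_prime)
```

Rates of exactly 0 or 1 give infinite z-scores, so in practice some correction is usually applied before taking the difference.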

## Loading the Data

In [2]:
df = pd.read_csv('./data/voytek_working_memory_aging_split.csv',index_col=None)

df.sample(5)

Unnamed: 0,idx,id,age_split,group,age,difficulty,d,cda,fmt
47,23,24,4,2,68,2,2.87,1.0,-0.78
30,6,7,1,1,21,2,3.61,2.26,1.06
19,19,20,3,2,58,1,4.13,0.65,-1.12
20,20,21,3,2,55,1,4.61,2.03,-0.01
62,14,15,3,2,51,3,4.05,1.02,0.62


For the purposes of this lab, we're interested only in how task difficulty affects our three measures. We're uninterested in the subject's metadata -- `age_split`, `group`, `age`, and `idx`. Let's begin by dropping those columns from our dataframe using the DataFrame method `drop`.

In [3]:
data = df.drop(['age_split','group','age','idx'], axis=1)
data[data.id == 1]

Unnamed: 0,id,difficulty,d,cda,fmt
0,1,1,4.86,1.0,0.8
24,1,2,4.89,2.04,0.49
48,1,3,4.55,1.81,0.29


It's good practice to keep an original copy of your dataframe around (here, named `df`) so you can undo irreversible changes, like dropping columns.
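As a small illustration of why this works, `drop` returns a new dataframe by default and leaves the original untouched. The sketch below uses a hypothetical miniature frame (`toy_df`, with invented values) rather than the lab data.

```python
import pandas as pd

# Hypothetical miniature of the lab data (values invented for illustration)
toy_df = pd.DataFrame({
    "id": [1, 2, 3],
    "age": [21, 58, 68],
    "difficulty": [1, 2, 3],
    "cda": [1.00, 0.65, 1.02],
})

# drop returns a new DataFrame; toy_df itself is not modified
toy_data = toy_df.drop(["age"], axis=1)

print(list(toy_df.columns))    # the original still has "age"
print(list(toy_data.columns))  # the new frame does not
```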

## ANOVA the Hard Way

To get a better understanding of ANOVA, we'll now implement it from scratch.

To get started, you'll need the total number of observations $N$ and the keys for each group (here, 1, 2, and 3, stored in the `difficulty` column). In this dataset, each group is the same size.

The first cell picks a measure to run ANOVA on. We'll want to write all of our code that follows in such a way that we can run ANOVA on the other measures just by changing this one cell.

In [4]:
measure = "cda"

In [5]:
N = len(data[measure])

groups = data["difficulty"].unique()

In [6]:
groups

array([1, 2, 3])

We'll proceed by generating a new data frame that contains all the information we need to perform an ANOVA -- each row will contain the grand mean and the group mean, the explained component, and the residual for that observation. We will call this our `anova_frame`.

In [7]:
anova_frame = data.copy()

The cell below computes the grand mean and the group mean for each difficulty level and stores them in the `anova_frame`.

In [8]:
anova_frame["grand_mean"] = anova_frame[measure].mean()

group_means = anova_frame.groupby("difficulty")[measure].mean()

for group in groups:
    anova_frame.loc[anova_frame.difficulty==group,"group_mean"] = group_means[group]
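The same per-row group means can also be produced without an explicit loop. The following is a sketch of one alternative using `groupby(...).transform("mean")`, shown on a hypothetical toy frame (`toy`, with invented values) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data: three difficulty levels, two observations each
toy = pd.DataFrame({
    "difficulty": [1, 1, 2, 2, 3, 3],
    "cda": [1.0, 2.0, 2.0, 4.0, 3.0, 5.0],
})

# The grand mean is a single number, broadcast to every row
toy["grand_mean"] = toy["cda"].mean()

# transform("mean") returns one value per row: the mean of that row's group
toy["group_mean"] = toy.groupby("difficulty")["cda"].transform("mean")

print(toy)
```

Both approaches fill the `group_mean` column with the same values; `transform` simply avoids the loop over group keys.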

Let's take a look at the resulting data frame.

In [9]:
anova_frame.sample(10)

Unnamed: 0,id,difficulty,d,cda,fmt,grand_mean,group_mean
43,20,2,2.88,1.87,-1.79,1.474444,1.484167
54,7,3,3.05,3.4,0.76,1.474444,1.83625
34,11,2,4.26,2.09,0.71,1.474444,1.484167
23,24,1,3.32,0.14,-0.47,1.474444,1.102917
45,22,2,4.34,0.93,0.25,1.474444,1.484167
9,10,1,4.26,2.15,1.33,1.474444,1.102917
58,11,3,4.82,1.45,1.02,1.474444,1.83625
62,15,3,4.05,1.02,0.62,1.474444,1.83625
59,12,3,3.15,1.01,-0.55,1.474444,1.83625
69,22,3,4.56,1.2,0.61,1.474444,1.83625


There are only three unique values in the `group_mean` column, corresponding to the three group means. Because the groups are the same size, the average of these three values is equal to the grand mean.

In [10]:
group_means = anova_frame["group_mean"].unique()

print(group_means)

np.abs(np.mean(group_means) - anova_frame[measure].mean()) < 1e-4

[ 1.10291667  1.48416667  1.83625   ]


True

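So the average of the group means is equal to the grand mean. This means that if we know the grand mean, we only need two of the three group means to work out the third: only two of them are free to vary. This is the idea behind **degrees of freedom**.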
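The point about needing only two of the three group means can be checked with a little arithmetic. This sketch uses the rounded values printed above for the `cda` measure and assumes, as in this dataset, that all groups are the same size; the variable names are illustrative.

```python
# Rounded values copied from the output above (cda measure)
grand_mean = 1.474444
known_group_means = [1.10291667, 1.48416667]
k = 3  # number of groups

# With equal group sizes, the grand mean is the average of the group means,
# so the missing group mean is whatever makes that average come out right
third_group_mean = k * grand_mean - sum(known_group_means)

print(round(third_group_mean, 3))
```

Up to rounding, the result matches the third group mean printed above.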


Now, we compute the explained and unexplained components for each observation. The explained differences are the differences between the group average and the overall average. The unexplained difference is the difference between the individual score and the group average.

In [11]:
anova_frame["explained"] = anova_frame["group_mean"]-anova_frame["grand_mean"]

anova_frame["residual"] = anova_frame[measure]-anova_frame["group_mean"]

In [12]:
anova_frame.sample(10)

Unnamed: 0,id,difficulty,d,cda,fmt,grand_mean,group_mean,explained,residual
44,21,2,4.25,1.73,-0.34,1.474444,1.484167,0.009722,0.245833
48,1,3,4.55,1.81,0.29,1.474444,1.83625,0.361806,-0.02625
54,7,3,3.05,3.4,0.76,1.474444,1.83625,0.361806,1.56375
64,17,3,3.8,1.46,0.28,1.474444,1.83625,0.361806,-0.37625
46,23,2,3.27,1.87,0.89,1.474444,1.484167,0.009722,0.385833
33,10,2,4.72,1.36,1.37,1.474444,1.484167,0.009722,-0.124167
27,4,2,4.89,1.79,0.53,1.474444,1.484167,0.009722,0.305833
26,3,2,4.29,1.36,0.22,1.474444,1.484167,0.009722,-0.124167
67,20,3,2.32,2.01,-1.47,1.474444,1.83625,0.361806,0.17375
37,14,2,3.43,1.52,0.24,1.474444,1.484167,0.009722,0.035833


To check our work, we confirm that the total value for each observation is equal to the sum of the grand mean, the explained component, and the residual.

In [13]:
np.isclose(
    anova_frame[measure],
    anova_frame["grand_mean"] + anova_frame["explained"] + anova_frame["residual"],
).all()

True
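Written as an equation, the identity being checked is the following, where the notation is introduced here for illustration: $y_{ij}$ is observation $j$ in group $i$, $\bar{y}$ is the grand mean, and $\bar{y}_i$ is the mean of group $i$.

$$
y_{ij} = \underbrace{\bar{y}}_{\text{grand mean}} + \underbrace{(\bar{y}_i - \bar{y})}_{\text{explained}} + \underbrace{(y_{ij} - \bar{y}_i)}_{\text{residual}}
$$

The group mean and grand mean terms cancel on the right-hand side, which is why the check holds exactly (up to floating-point rounding) for every observation.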

Now, write a sum-of-squares function using `np.sum` and `np.square` and then use it to compute the following sum of squares values:

- total sum of squares
- sum of the grand mean squared
- sum of squares explained by the model
- residual sum of squares (component not explained by the model)

Also, calculate the explainable sum of squares from the difference of two of the above quantities.

The assertion statements in the final code block can be used to check your work.

We'll store the sums of squares in a dictionary, `sum_of_squares`, using the column name as the key.

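Different textbooks use different names for these quantities: the explained sum of squares is also called the *between-group* or *model* sum of squares, and the residual sum of squares is also called the *within-group* or *error* sum of squares.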
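As a sketch of the formulas behind these quantities, using notation introduced here for illustration ($y_{ij}$ is observation $j$ in group $i$, $\bar{y}$ the grand mean, $\bar{y}_i$ the mean of group $i$, $n_i$ the size of group $i$, $k$ the number of groups, and $N$ the total number of observations):

$$
\begin{aligned}
SS_{\text{total}} &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} y_{ij}^2 \\
SS_{\text{grand mean}} &= N \bar{y}^2 \\
SS_{\text{explained}} &= \sum_{i=1}^{k} n_i \, (\bar{y}_i - \bar{y})^2 \\
SS_{\text{residual}} &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \\
SS_{\text{explainable}} &= SS_{\text{total}} - SS_{\text{grand mean}} = SS_{\text{explained}} + SS_{\text{residual}}
\end{aligned}
$$

Each line is the sum over all rows of the squared values in the corresponding column of the `anova_frame`. Note that the total here is the sum of the squared raw values, not of deviations from the grand mean; the latter is what this tutorial calls the explainable sum of squares.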

In [14]:
sum_of_squares = {}

keys = [measure,"grand_mean","explained","residual"]

for key in keys:
    sum_of_squares[key] = np.sum(np.square(anova_frame[key]))
    
sum_of_squares["explainable"] = sum_of_squares[measure] - sum_of_squares["grand_mean"]

In [15]:
# these should be the same, except for floating-point rounding error

assert np.isclose(sum_of_squares[measure],
                  sum_of_squares["grand_mean"] + sum_of_squares["explainable"])

assert np.isclose(sum_of_squares["explainable"],
                  sum_of_squares["explained"] + sum_of_squares["residual"])

In [16]:
sum_of_squares

{'cda': 204.53820000000002,
 'explainable': 48.011177777777817,
 'explained': 6.4567361111111206,
 'grand_mean': 156.5270222222222,
 'residual': 41.554441666666669}

Now, we need to calculate the following degrees of freedom in this model:

- total degrees of freedom
- the degrees of freedom of the model (or explained degrees of freedom)
- the "leftover" degrees of freedom (or the unexplained degrees of freedom)
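The degrees of freedom partition in the same way as the sums of squares. As a sketch, with $N$ observations and $k$ groups, and counting one degree of freedom for the grand mean:

$$
\underbrace{N}_{\text{total}} = \underbrace{1}_{\text{grand mean}} + \underbrace{(k - 1)}_{\text{explained}} + \underbrace{(N - k)}_{\text{residual}}
$$

For this dataset, $N = 72$ and $k = 3$, so the partition is $72 = 1 + 2 + 69$.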

In [17]:
# k is the number of groups
k = len(groups)

dof = {}
vals = [N,1,k-1,N-k]

for key,val in zip(keys,vals):
    dof[key] = val

In [18]:
dof

{'cda': 72, 'explained': 2, 'grand_mean': 1, 'residual': 69}

In [19]:
# the component degrees of freedom should add up to the total
assert dof[measure] == dof["grand_mean"] + dof["explained"] + dof["residual"]

Now, we calculate our estimate for the mean square of the explained and unexplained components. Note that, because we are estimating a parameter of the population, we want to divide by the appropriate degrees of freedom instead of the raw $N$ for each average.

In [20]:
mean_square = {}

for key in ["explained","residual"]:
    mean_square[key] = sum_of_squares[key]/dof[key]

In [21]:
mean_square

{'explained': 3.2283680555555603, 'residual': 0.60223828502415466}

The mean square of the explained component tells us how much, on average, our hypothesis improves our guess of the value of the outcome variable, in terms of squared error, over the "null" hypothesis. The bigger this is, the better supported our hypothesis is, and the less likely we are to have observed such a result if the null hypothesis were true.

However, a mean square value by itself doesn't tell you much -- is a reduction of 2 in mean squared error a "big" improvement? For our data, it would be, but for data with units in the billions and spread in the millions, it would not be. Therefore, if we want a statistic that tells us how good our hypothesis is, we need to take into account the amount of unexplained variance.
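The dependence on units can be seen in a small sketch. The helper `mean_squares` and the toy numbers below are hypothetical; the helper follows the same recipe as the cells above (explained and residual sums of squares divided by their degrees of freedom). Multiplying the data by 1000 multiplies both mean squares by a million but leaves their ratio unchanged.

```python
import numpy as np


def mean_squares(values, labels):
    """Return (explained, residual) mean squares for one grouping factor."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    group_keys = np.unique(labels)

    grand_mean = values.mean()

    # build a per-observation array holding the mean of each row's group
    group_mean = np.zeros_like(values)
    for key in group_keys:
        group_mean[labels == key] = values[labels == key].mean()

    ss_explained = np.sum(np.square(group_mean - grand_mean))
    ss_residual = np.sum(np.square(values - group_mean))

    k, n = len(group_keys), len(values)
    return ss_explained / (k - 1), ss_residual / (n - k)


# Hypothetical toy data: three groups of three observations
labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]
small = np.array([1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 4.0, 5.0, 6.0])
big = small * 1000  # same data, expressed in units 1000 times smaller

ms_exp_small, ms_res_small = mean_squares(small, labels)
ms_exp_big, ms_res_big = mean_squares(big, labels)

print(ms_exp_small, ms_exp_big)  # the mean squares differ by a factor of a million
print(ms_exp_small / ms_res_small, ms_exp_big / ms_res_big)  # the ratios agree
```

For this toy data both ratios come out at about 7, regardless of the units.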

The statistic used for this purpose in ANOVA is the *$F$-statistic*, named in honor of its inventor, [Sir Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher). Compute the value of $F$ for this data below.

In [22]:
F = mean_square["explained"]/mean_square["residual"]

F

5.3606157825487237
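As a sanity check on the by-hand recipe, the sketch below repeats the calculation on synthetic data and compares the result with `scipy.stats.f_oneway`. The data are randomly generated for illustration (the frame `toy`, the seed, and the group locations are all assumptions), so the value of $F$ will not match the one above.

```python
import numpy as np
import pandas as pd
import scipy.stats

# Synthetic stand-in for the lab data: 3 difficulty levels, 24 subjects each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "difficulty": np.repeat([1, 2, 3], 24),
    "cda": np.concatenate(
        [rng.normal(loc, 1.0, size=24) for loc in (1.1, 1.5, 1.8)]
    ),
})

# By hand: same steps as in the cells above
grand_mean = toy["cda"].mean()
group_mean = toy.groupby("difficulty")["cda"].transform("mean")

ss_explained = np.sum(np.square(group_mean - grand_mean))
ss_residual = np.sum(np.square(toy["cda"] - group_mean))

k = toy["difficulty"].nunique()
N = len(toy)
F_by_hand = (ss_explained / (k - 1)) / (ss_residual / (N - k))

# Library version: f_oneway takes one array of observations per group
samples = [group["cda"].to_numpy() for _, group in toy.groupby("difficulty")]
F_scipy = scipy.stats.f_oneway(*samples).statistic

print(np.isclose(F_by_hand, F_scipy))
```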

In [the lab for this section](./Lab - One-Way ANOVA.ipynb), we will first walk through the versions of ANOVA provided by the `scipy` and `statsmodels` packages, then extend the "homemade" approach above to calculating $p$ values and effect sizes.