<img src="./img/HWNI_logo.svg"/>

# Lab 04 - One-Way ANOVA

In [None]:
# makes our plots show up inside Jupyter
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats

# apply seaborn's defaults first, since sns.set resets the palette
sns.set(color_codes=True)
# then choose colors that work for most color-blind folks
sns.set_palette("colorblind")

import util.lab04utils as utils 

# this makes our tables easier to read
utils.formatDataframes()

## About the Dataset

For this week's lab, we'll be using some EEG data graciously provided by the [Voytek lab](http://voyteklab.com/about-us/) of UCSD. Participants of varying ages were asked to perform a working memory task with varying levels of difficulty. The raw EEG signal has been summarized into the following two measures:

* [Contralateral Delay Activity](https://www.ncbi.nlm.nih.gov/pubmed/26802451), or CDA, is used to measure the engagement of visual working memory.

* [Frontal Midline Theta](https://www.ncbi.nlm.nih.gov/pubmed/9895201) oscillation amplitude has been correlated with sustained, internally-directed cognitive activity.

The performance of the subjects has also been summarized using the measure
[d'](https://en.wikipedia.org/wiki/Sensitivity_index) (pronounced "d-prime"), also known as the *sensitivity index*. It measures how well the subject performed the task by comparing the true positive rate with the false positive rate.
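
As a rough illustration of that comparison, here is a minimal sketch of the usual definition of d' as the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The function name `d_prime` and the rates are made up for this example; the values in the dataset were computed by the original authors, and their exact procedure isn't described here.

```python
from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity index: z(hit rate) minus z(false-alarm rate)."""
    # norm.ppf is the inverse of the standard normal CDF (the "z-transform")
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# made-up rates, for illustration only
example = d_prime(0.80, 0.20)
print(example)
```

Rates of exactly 0 or 1 give infinite values under this definition, so in practice they are usually adjusted slightly before the transform.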

## Loading the Data

First, we load the data and take a look at a sample.

In [None]:
df = pd.read_csv('./data/voytek_working_memory_aging_split.csv',index_col=None)

df.sample(5)

For the purposes of this lab, we're interested only in how task difficulty affects our three measures. We're uninterested in the subject's metadata -- `age_split`, `group`, `age`, and `idx`. Let's begin by dropping those columns from our dataframe using the DataFrame method `drop`.

In [None]:
data = df.drop(['age_split','group','age','idx'], axis=1)
data[data.id == 1]

It's good practice to keep an original copy of your dataframe around (here, named `df`) so you can undo irreversible changes, like dropping columns.

If we're interested in looking at subject-by-subject information, we're not quite done with formatting our data. For subject-level analysis, our data is not yet tidy, since a single subject's observations are scattered over multiple rows. Use the `pivot` method to tidy our data. Hint: we want to get our row `index`es from the subject's `id` and to make new `columns` for our measures using the level of `difficulty`. Look back at the last lab for an example of using `pivot`.

In [None]:
pivotData = data.pivot(index=,columns=) # your code here

In [None]:
pivotData.sample(5)

It now takes two pieces of information to pick out a particular column: the measure we're interested in (one of `d`, `cda`, or `fmt`) and the `difficulty` level: `1`, `2`, or `3`. These need to be provided in order.

In [None]:
measure = 'cda'
difficulty = 2

pivotData[measure,difficulty].head()

If we instead only index into the first level, we get a single-level dataframe back:

In [None]:
pivotData[measure].head()
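
If the two-level indexing is unfamiliar, it can help to try it on a tiny frame first. The sketch below uses a made-up dataframe (`toy`, with invented values) rather than the lab data, and shows both ways of indexing into the columns produced by `pivot`.

```python
import pandas as pd

# made-up long-format data: one row per (subject, difficulty) pair
toy = pd.DataFrame({
    "id":         [1, 1, 2, 2],
    "difficulty": [1, 2, 1, 2],
    "cda":        [-0.5, -0.9, -0.3, -0.7],
    "d":          [3.1, 2.4, 3.5, 2.8],
})

# with no `values` argument, every remaining column becomes a top-level label,
# and the difficulty levels become the second level of the column index
wide = toy.pivot(index="id", columns="difficulty")

print(wide["cda", 2])  # both levels: a Series indexed by subject id
print(wide["cda"])     # first level only: a DataFrame with one column per difficulty
```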

## Visualizing the Data

Below, we'll run a separate ANOVA for each of the three measures, using difficulty as the independent factor and the measure as the dependent variable. Choose a visualization or collection of visualizations that you think would be appropriate for this purpose and plot the data below. Be ready to explain your choice in class.

You can use either the `pivot`ed data frame or the un`pivot`ed data frame.

Hint: the easiest way to apply most of our visualizations to this dataset is to write a loop that runs over our measures and produces a separate plot for each one.
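
The looping pattern from the hint might look something like the sketch below. It uses made-up random data in place of the lab dataframe, and box plots are only one possible choice of visualization, picked here to keep the example short.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# made-up long-format data standing in for the lab's dataframe
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "difficulty": np.repeat([1, 2, 3], 20),
    "cda": rng.normal(size=60),
    "fmt": rng.normal(size=60),
    "d":   rng.normal(size=60),
})

measures = ["cda", "fmt", "d"]
levels = sorted(toy["difficulty"].unique())

# one subplot per measure, filled in by the loop
fig, axes = plt.subplots(1, len(measures), figsize=(12, 4))

for ax, measure in zip(axes, measures):
    # one box per difficulty level for this measure
    per_level = [toy.loc[toy["difficulty"] == lvl, measure].to_numpy()
                 for lvl in levels]
    ax.boxplot(per_level)
    ax.set_xticks(range(1, len(levels) + 1))
    ax.set_xticklabels(levels)
    ax.set_xlabel("difficulty")
    ax.set_title(measure)
```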

In [None]:
# your code here

Recall the assumptions behind ANOVA.

#### Q1 Based on your visualization, do you think any of the assumptions of ANOVA are being violated for the case of `fmt`? What about d'?

## ANOVA the Easy Way

First, we'll use the built-in `scipy.stats` function `f_oneway` to perform ANOVA. This will be useful for checking your work in the next problem where you write your own "homebrew" ANOVA.

In [None]:
scipy.stats.f_oneway?
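
`f_oneway` expects one array of observations per group, passed as separate arguments. The sketch below shows one way to build those arrays from long-format data; the dataframe `toy` and its column `score` are made up for illustration.

```python
import pandas as pd
import scipy.stats

# made-up long-format data: three groups of three observations
toy = pd.DataFrame({
    "difficulty": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "score":      [4.0, 5.0, 6.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
})

# collect one array per group, then unpack the list into positional arguments
samples = [grp["score"].to_numpy() for _, grp in toy.groupby("difficulty")]
f_stat, p_value = scipy.stats.f_oneway(*samples)

print(f_stat, p_value)
```

Wrapping the last two steps in a function that takes the column name is one way to avoid repeating them for each measure.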

Try to organize the code you write in the cell below so that you minimize the amount you repeat yourself.

In [None]:
# your code here

## ANOVA the Hard Way

To get a better understanding of ANOVA, we'll now implement it from scratch.

To get started, you'll need the total number of observations $N$, the group size (here, each group is the same size), and the keys for each group (here, 1, 2, and 3, and they're stored in the second level of the column multi-index).

The first cell picks a measure to run ANOVA on. We'll want to write all of our code that follows in such a way that we can run ANOVA on the other measures just by changing this one cell.

In [None]:
measure = '' # you choose: fmt, cda, or d

In [None]:
N = len(df[measure])
groupSize = len(pivotData[measure])
groups = pivotData.columns.levels[1]

We'll proceed by generating a new data frame that contains all the information we need to perform an ANOVA -- each row will contain the grand mean and the group mean, the explained component, and the residual for that observation.

Because we'll be doing a between-subjects (aka unblocked) ANOVA, we use the un`pivot`ed data.

#### Q2 Why is this a sub-optimal design for our statistical analysis?

In [None]:
anovaFrame = data.copy()

In the cell below, compute the grand mean and group means and store them in the ANOVA dataframe. Hint: the `groupby` method is useful here.
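
For reference, here is how `groupby` behaves on a small made-up frame (the names `toy` and `score` are invented for this example). `transform` is an alternative to the loop-based approach in the cell below, not a required part of the solution.

```python
import pandas as pd

toy = pd.DataFrame({
    "difficulty": [1, 1, 2, 2, 3, 3],
    "score":      [1.0, 3.0, 4.0, 6.0, 7.0, 9.0],
})

# a scalar assigned to a column is broadcast to every row
toy["grandMean"] = toy["score"].mean()

# groupby(...).mean() gives one value per group, indexed by the group key
per_group = toy.groupby("difficulty")["score"].mean()
print(per_group)

# transform("mean") returns the group mean aligned to each original row
toy["groupMean"] = toy.groupby("difficulty")["score"].transform("mean")
print(toy)
```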

In [None]:
anovaFrame["grandMean"] = [] # your code here

groupMeans = [] # your code here

for group in groups:
    anovaFrame.loc[anovaFrame.difficulty==group,"groupMean"] = [] # your code here

In [None]:
anovaFrame.sample(10)

There are only three unique values in the `groupMean` column, corresponding to the three group means. Calculate their average value.

In [None]:
groupMeans = anovaFrame["groupMean"].unique()

print(groupMeans)

# your code here

#### Q3 What other quantity is this equal to? What does this have to do with degrees of freedom?

Now, compute the explained and unexplained components for each observation. The explained difference is the difference between the group average and the overall average. The unexplained difference is the difference between the individual score and the group average.
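
In symbols, this decomposition can be written as follows. The notation is introduced here for convenience and is not used elsewhere in the lab: $y_{ij}$ is observation $i$ in group $j$, $\bar{y}_j$ is the mean of group $j$, and $\bar{y}$ is the grand mean.

$$
    y_{ij} = \underbrace{\bar{y}}_{\text{grand mean}} + \underbrace{(\bar{y}_j - \bar{y})}_{\text{explained}} + \underbrace{(y_{ij} - \bar{y}_j)}_{\text{residual}}
$$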

In [None]:
anovaFrame["explained"] = [] # your code here

anovaFrame["residual"] = [] # your code here

To check your work, confirm that the total value for each observation is equal to the sum of the grand mean, the explained component, and the residual.

In [None]:
assert np.isclose(anovaFrame[measure],anovaFrame["grandMean"] 
                                        + anovaFrame["explained"]
                                          + anovaFrame["residual"]).all()

Now, write a sum-of-squares function using `np.sum` and `np.square` and then use it to compute the following sum of squares values:

- total sum of squares
- sum of the grand mean squared
- sum of squares explained by the model
- residual sum of squares (component not explained by the model)

Also, calculate the explainable sum of squares from the difference of two of the above quantities.

The assertion statements in the final code block can be used to check your work.

We'll store the sums of squares in a dictionary, `sumofsquares`, using the column name as the key.
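
For orientation, the relationships that the assertions check can be sketched as below. The notation is an assumption made for this sketch: $y_i$ is an individual observation, $N$ is the total number of observations, and $\bar{y}$ is the grand mean.

$$
    \underbrace{\sum_{i} y_i^2}_{\text{total}} = \underbrace{N\bar{y}^2}_{\text{grand mean}} + \text{SS}_{explainable},
    \qquad
    \text{SS}_{explainable} = \text{SS}_{explained} + \text{SS}_{residual}
$$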

In [None]:
def SS(x):
    return # your code here

In [None]:
sumofsquares = {}

keys = [measure,"grandMean","explained","residual"]

for key in keys:
    sumofsquares[key] = [] #your code here
    
sumofsquares["explainable"] = [] # your code here

In [None]:
# these should be the same, except for floating-point rounding error
# (abs is needed so that a large negative difference also fails the check)

assert abs(sumofsquares[measure] - (sumofsquares["grandMean"] +
                                    sumofsquares["explainable"])) <= 1e-4

assert abs(sumofsquares["explainable"] - (sumofsquares["explained"] +
                                          sumofsquares["residual"])) <= 1e-4

Now, calculate the following degrees of freedom in this model:

- total degrees of freedom
- the degrees of freedom of the model (or explained degrees of freedom)
- the "leftover" degrees of freedom (or the unexplained degrees of freedom)

#### Q4 The latter two should add up to $N-1$. Where does the other degree of freedom "go"?

In [None]:
# k is the number of groups
k = len(groups)

dof = {}
vals = [] # your code here

for key,val in zip(keys,vals):
    dof[key] = val

In [None]:
assert(sum([dof[key] for key in dof.keys()]) == 2*N)

Now, we calculate our estimate for the mean square of the explained and unexplained components. Note that, because we are estimating a parameter of the population, we want to divide each sum of squares by the appropriate degrees of freedom instead of the raw $N$.

In [None]:
meansquare = {}

for key in ["explained","residual"]:
    meansquare[key] = [] # your code here

The mean square of the explained component tells us how much, on average, our hypothesis is able to improve, in terms of squared error, our guess of the value of our outcome variable over the "null" hypothesis. The bigger this is, the more supported our hypothesis is, and the less likely we are to have observed such a result if the null hypothesis were true.

However, a mean square value by itself doesn't tell you much -- is a reduction of 2 in mean squared error a "big" improvement? For our data, it would be, but for data with units in the billions and spread in the millions, it would not be. Therefore, if we want a statistic that tells us how good our hypothesis is, we need to somehow take into account the amount of unexplained variance.

The statistic used for this purpose in ANOVA is the *F-statistic*, named in honor of its inventor, [Sir Ronald Fisher](https://en.wikipedia.org/wiki/Ronald_Fisher). Compute the value of F for this data below.

In [None]:
F = [] # your code here

F
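
As a cross-check on the logic, the whole chain from sums of squares to F can be traced on a tiny made-up dataset and compared against `scipy.stats.f_oneway`. This is a sketch using plain NumPy arrays and invented values (`toy_groups`), not the dataframe-based approach the lab builds up.

```python
import numpy as np
import scipy.stats

# made-up data: three groups of three observations
toy_groups = [np.array([4.0, 5.0, 6.0]),
              np.array([5.0, 6.0, 7.0]),
              np.array([8.0, 9.0, 10.0])]

all_obs = np.concatenate(toy_groups)
n_obs, n_groups = len(all_obs), len(toy_groups)
grand_mean = all_obs.mean()

# explained: squared distance of each group mean from the grand mean, once per observation
ss_explained = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in toy_groups)
# residual: squared distance of each observation from its own group mean
ss_residual = sum(np.sum((g - g.mean()) ** 2) for g in toy_groups)

# mean squares divide by degrees of freedom, not by the number of observations
ms_explained = ss_explained / (n_groups - 1)
ms_residual = ss_residual / (n_obs - n_groups)

F_by_hand = ms_explained / ms_residual
F_scipy = scipy.stats.f_oneway(*toy_groups).statistic

print(F_by_hand, F_scipy)
```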

### Calculating p

To interpret this F value and so determine if the amount of variance we were able to explain is close to what one would expect by chance, we need to get a hold of the sampling distribution of the F-statistic. Once we have that, we can calculate the area under the curve from the observed value on up and use that to determine our p-value.

#### Q5 Why do we only calculate a "one-tailed" area, above our value, rather than a "two-tailed" area? 

We have two options for acquiring the sampling distribution of F:

1. Use a pre-calculated distribution. In the old days, that'd mean looking up values in a table in a reference text. These days, statistical computing libraries like `scipy` provide this service via functions.
1. Simulate what our data would look like under the null hypothesis and use the distribution of the F-statistic from our simulations to approximate the true distribution of F.
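
For comparison, the first option takes only a couple of lines with `scipy`. The sketch below uses hypothetical values for the observed F and the degrees of freedom; substitute your own.

```python
import scipy.stats

# hypothetical values, for illustration only
f_observed = 13.0
df_explained = 2   # k - 1 for k = 3 groups
df_residual = 6    # N - k for N = 9 observations

# the survival function is 1 - CDF: the area under the F density above f_observed
p_value = scipy.stats.f.sf(f_observed, df_explained, df_residual)
print(p_value)
```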

We'll take the second road.
This will involve a [*resampling*](https://en.wikipedia.org/wiki/Resampling_%28statistics%29) technique called an
[*approximate permutation test*](https://en.wikipedia.org/wiki/Resampling_%28statistics%29#Monte_Carlo_testing).
It is closely related to [exact tests](https://en.wikipedia.org/wiki/Exact_test),
which are covered in Chapter 7 of *Intuitive Biostatistics*.

#### Q6 Explain, in your own words, what the null hypothesis of the ANOVA test is.

#### Q7 Under the null hypothesis of ANOVA, what can we say about the relationship between the group label and the measurement? 

#### Q8 Based on this relationship, how could we simulate what our data would look like under the null hypothesis? That is, how could we generate data that is distributed according to the null hypothesis?

The function `utils.estimate_f_distribution` will implement this simulation.
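
The internals of `utils.estimate_f_distribution` aren't shown in this lab, so the following is only a sketch of how an approximate permutation test for F could be written, not a description of that function. The name `permutation_f_distribution`, its parameters, and the data are all made up for illustration.

```python
import numpy as np
import scipy.stats

def permutation_f_distribution(samples, n_permutations=1000, seed=0):
    """Approximate the null distribution of F by shuffling observations across groups."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate(samples)
    # positions at which to cut the shuffled pool back into groups of the original sizes
    split_points = np.cumsum([len(s) for s in samples])[:-1]
    fs = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(pooled)
        fs[i] = scipy.stats.f_oneway(*np.split(shuffled, split_points)).statistic
    return fs

# made-up data: three groups of four observations
toy_samples = [np.array([4.0, 5.0, 6.0, 5.5]),
               np.array([5.0, 6.0, 7.0, 6.5]),
               np.array([8.0, 9.0, 10.0, 9.5])]

null_fs = permutation_f_distribution(toy_samples)
f_observed = scipy.stats.f_oneway(*toy_samples).statistic

# fraction of shuffled datasets whose F is at least as large as the observed one
p_approx = np.mean(null_fs >= f_observed)
print(p_approx)
```

Because only a random subset of all possible relabelings is tried, the result is an approximation whose precision improves as `n_permutations` grows.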

In [None]:
measure = "cda"

groupedData = utils.groupData(pivotData,measure,groups)

fs = utils.estimate_f_distribution(groupedData)

sortedFs = sorted(fs)

In [None]:
utils.plotApproximatedF(fs)

Using the `sortedFs` array, calculate an approximate p-value for the value of F you calculated above. Check your work by comparing it to the value provided by `scipy.stats.f_oneway`.

In [None]:
# your code here

As part of the process of generating our estimated F distribution, we need to generate a bunch of data sets that are distributed according to the null hypothesis. We can use these, plus any method that calculates p-values, to simulate the distribution of p-values under the null hypothesis. The function `utils.simulateNull` will do this for us.

In [None]:
ps = utils.simulateNull(groupedData,N=10000)

#### Q9 What sampling distribution do we expect the p-value to have under the null hypothesis? How might this distribution look different under the alternative hypothesis?

In the cell below, plot the simulated sampling distribution of the p-value.

In [None]:
# your code here

## Beyond p - $\eta^2$ and $\omega^2$

The F-statistic is used to determine the statistical significance of an ANOVA result.

#### Q10 How is this different from the practical or scientific significance of an ANOVA result?

The APA recommends that, in addition to reporting F and p, scientists report the value $\eta^2$, which is equal to the ratio of the explained sum-of-squares to the explainable sum-of-squares. It's also known as the *variance explained*.

#### Q11 Why is this number closer to a notion of practical significance than F is?

Calculate $\eta^2$ for a statistically significant test you ran above.

In [None]:
eta = [] # your code here
eta

#### Q12 Does this seem like a practically significant fraction to you? Look back at the visualizations of the data you produced at the beginning of the lab. Are there any visual hints that would lead you to expect a value of $\eta^2$ close to what you calculated?

Notice that $\eta^2$ uses the sums-of-squares, rather than mean squares. This makes it a biased estimator of the quantity it is trying to capture: it overestimates how much variance has been explained. Intuitively, it only captures how well you explained your dataset, not how well you might explain additional data from the same population.

A less biased estimator for explained variance exists, known as $\omega^2$ (pronounced "omega-squared"). You can [read more about it here](http://daniellakens.blogspot.com/2015/06/why-you-should-use-omega-squared.html).

Use the following formula to compute $\omega^2$ for your test:

$$
    \omega^2 = \frac{F-1}{F+\frac{\text{df}_{residual}+1}{\text{df}_{explained}}}
$$
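
To see the two effect sizes side by side, the sketch below computes them for a tiny made-up dataset. It assumes the standard one-way definition of $\omega^2$ in terms of sums of squares, $(\text{SS}_{explained} - \text{df}_{explained}\cdot\text{MS}_{residual}) / (\text{SS}_{explainable} + \text{MS}_{residual})$, and checks that it agrees with the equivalent expression in terms of F. All variable names and values are invented for this example.

```python
import numpy as np

# made-up data: three groups of three observations
toy_groups = [np.array([4.0, 5.0, 6.0]),
              np.array([5.0, 6.0, 7.0]),
              np.array([8.0, 9.0, 10.0])]

all_obs = np.concatenate(toy_groups)
grand_mean = all_obs.mean()
df_explained = len(toy_groups) - 1
df_residual = len(all_obs) - len(toy_groups)

ss_explained = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in toy_groups)
ss_residual = sum(np.sum((g - g.mean()) ** 2) for g in toy_groups)
ss_explainable = ss_explained + ss_residual

ms_residual = ss_residual / df_residual
F_toy = (ss_explained / df_explained) / ms_residual

# eta-squared: fraction of the explainable sum of squares that the groups account for
eta_sq = ss_explained / ss_explainable

# omega-squared, written two equivalent ways
omega_sq_from_ss = (ss_explained - df_explained * ms_residual) / (ss_explainable + ms_residual)
omega_sq_from_F = (F_toy - 1) / (F_toy + (df_residual + 1) / df_explained)

print(eta_sq, omega_sq_from_ss, omega_sq_from_F)
```

On this toy data $\omega^2$ comes out smaller than $\eta^2$, which is the direction of the correction described above.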

In [None]:
om_sqrd = [] # your code here
om_sqrd

#### Q13 Does this more accurate estimate change your opinion of the practical significance of the results of your test?