<img src="./img/HWNI_logo.svg"/>

# Lab 05a - Two-Way ANOVA

In [None]:
# makes our plots show up inside Jupyter
%matplotlib inline

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats

# choose colors that work for most color-blind folks
# (set in a single call: sns.set resets the palette to its default otherwise)
sns.set(palette="colorblind", color_codes=True)

import util.lab05utils as utils 

# this makes our tables easier to read
utils.formatDataframes()

## About the Lab

The first half of the lab is similar to Lab 04: you will code an ANOVA by hand as a computational exercise. The second half, on multiple comparisons, is set up as an experiment.

So even if you get stuck writing the code for the first half of the lab, you can still do the second half. Also, don't hesitate to post questions to bCourses!

## About the Dataset

The data for this lab is simulated.

After learning of [Nemeroff, Heim, et al.](http://www.pnas.org/content/100/24/14293.full)'s finding that psychotherapy is more effective than pharmacotherapy at treating depression in individuals with childhood trauma, you become interested in whether the same might be true for treating anxiety.

You run a clinical experiment in which individuals with and without childhood trauma are treated with a placebo, the standard of care for psychotherapy (CBT, or 
["Cognitive Behavioral Therapy"](https://en.wikipedia.org/wiki/Cognitive_behavioral_therapy)),
or the standard of care for pharmacotherapy
[(an anxiolytic GABA agonist)](https://en.wikipedia.org/wiki/Anxiolytic).
As an aside: in actual clinical studies, treatment with a placebo is considered unethical if there's already a standard of care, and a proper clinical trial compares alternative treatments to standard treatments.

The results of your experiment are summarized in the table `anxiety_dataset.csv`.

## Loading the Data

In [None]:
df = pd.read_csv('./data/anxiety_dataset.csv',index_col=0)

data = df.copy()

In [None]:
data.sample(10)

For the independent variables `treatment` and `trauma`, use the pandas `Series` method `unique` to determine the levels and store these in variables called `treatments` and `traumas`.

In [None]:
treatments = [] # your code here
traumas = [] # your code here
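
As a reminder of how `unique` behaves, here is a minimal sketch on a toy data frame. The column names `condition` and `score` are made up for illustration and are not part of the lab's dataset.

```python
import pandas as pd

# Hypothetical toy data; `condition` and `score` are illustrative names only.
toy = pd.DataFrame({
    "condition": ["a", "b", "a", "c", "b"],
    "score": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# `unique` returns the distinct values of a Series, in order of first appearance.
levels = toy["condition"].unique()
print(levels)
```

The result can be looped over directly, which is how the levels are used later in the lab.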

## Visualizing the Data

Visualize the data. Aim for a visualization that makes it possible to see any main effects along with the interaction effect you're interested in. Examples include: histograms, factorial plots (make sure to include error bars), and strip/swarm/violin plots.

#### Q1 Discuss the connections between visual aspects of your graph and 1) the assumptions of ANOVA and 2) the outcomes you expect.

### Easy Way

Because an ANOVA is, from one perspective, a test of the goodness-of-fit of a linear model, Python's statistical modeling packages provide the tools for performing ANOVAs more complicated than the one-way between-subjects design.

In this course, we'll be using the `statsmodels` package.

In [None]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

We specify models by describing them with strings that look like:

```
    "outcome ~ factor"
```

for one-way ANOVAs,

```
    "outcome ~ factor1*factor2*...factorN"
```

for N-way ANOVAs where we want to compute all main effects and interactions,
and

```
    "outcome ~ factor1:randomFactor1:randomFactor2:...randomFactorN"
```

for ANOVAs with N factors where we're only interested in interactions (e.g., a subject factor).

In [None]:
ols_lm = smf.ols('anxietyReduction ~ treatment*trauma',data=df)

fit = ols_lm.fit()

table = sm.stats.anova_lm(fit,typ=2)

table

Hang on to this table of results for comparison with your ANOVA by hand below.

#### Q2 Interpret the pattern of significant and non-significant results that you see. First, phrase your answer as you would write it in the results section of a paper (e.g. using the [APA guidelines](https://depts.washington.edu/psych/files/writing_center/stats.pdf)), then describe the results less formally, as in a research talk.

Just as we performed an overall F-test before doing t-tests in a one-way ANOVA, one can also perform an "omnibus" F-test before performing the individual F-tests of a multi-way ANOVA. In this test, we check to see whether the model as a whole has a significant between-groups mean-square.

We can calculate the model's overall between-groups mean-square by adding up the sums of squares for each component of the model and dividing by the sum of the degrees of freedom of each component of the model. Comparing this to the residual mean square gives us an F for which we can compute a p-value.
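
Written out as a formula, the description above might look like the following. The symbols are notation introduced here for illustration: $i$ runs over the components of the model (here treatment, trauma, and their interaction), $SS_i$ and $df_i$ are the sum of squares and degrees of freedom of component $i$, and the sketch assumes a balanced design, so that the component sums of squares add up to the sum of squares of the model as a whole.

$$
F_{\text{omnibus}} = \frac{\left(\sum_{i} SS_i\right) \Big/ \left(\sum_{i} df_i\right)}{SS_{\text{residual}} \big/ df_{\text{residual}}}
$$

Under the null hypothesis, this statistic would be compared against an F distribution with $\sum_i df_i$ and $df_{\text{residual}}$ degrees of freedom.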

The issue of multiple comparisons in ANOVA and the role of the omnibus test are explored in the second half of the lab.

Implement an omnibus test and run it on the results table above. You'll need the `cdf` method of `scipy.stats.f`. Note that the results table contains all the information you need to run an omnibus test.

In [None]:
scipy.stats.f?

In [None]:
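def computeP(fValue,dofB,dofW):
    # p-value: probability, under the null hypothesis, of an F at least as large as fValue
    # dofB and dofW are the between-groups and within-groups degrees of freedom
    fDistribution = scipy.stats.f
    p = 1 - fDistribution.cdf(fValue,dofB,dofW)
    return p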

def omnibusTest(result):
    #your code here
    
    F = []
    p = []
    
    return (F,p)

In [None]:
F,p = omnibusTest(table)
print((F,p))

### Hard Way

As previously, we'll build a pandas data frame to hold our decomposition matrix.

In [None]:
N = len(df)

anovaFrame = data.copy()

anovaFrame.sample(10)

We start by calculating the relevant means, using the `groupby` method when necessary. For the interaction term, note that you can do a "double" `groupby` by passing a list of columns to group on to the `groupby` method, instead of a single column.

In [None]:
dataMean = [] # your code here
treatmentMeans = [] # your code here
traumaMeans = [] # your code here

treatmentXtraumaMeans = [] # your code here

In [None]:
treatmentXtraumaMeans
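
To see what a "double" `groupby` returns, here is a small sketch on toy data. The columns `drug`, `group`, and `score` are hypothetical stand-ins, not the lab's variables.

```python
import pandas as pd

# Hypothetical toy data; `drug`, `group`, and `score` are illustrative names only.
toy = pd.DataFrame({
    "drug":  ["x", "x", "y", "y", "x", "y"],
    "group": ["p", "q", "p", "q", "p", "q"],
    "score": [1.0, 2.0, 3.0, 4.0, 3.0, 6.0],
})

# Grouping on a list of columns gives one mean per combination of levels.
cell_means = toy.groupby(["drug", "group"])["score"].mean()
print(cell_means)

# The result is indexed by (drug, group) pairs, so `.loc` takes both keys.
print(cell_means.loc["x", "p"])
```

Because the result is keyed by pairs of levels, a lookup needs one key per grouping column, in the same order as the list passed to `groupby`.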

Now, using the combination of a selector and the `.loc` indexer, add columns for the main effect of treatment and the interaction effect of trauma and treatment. We use `.loc` to look things up by key, and we use selectors to pick out rows with logical statements. `.loc` is useful for indexing into the means computed with `groupby`, while selectors are useful for indexing into the `anovaFrame`.

The column for main effect of trauma has been worked as an example, and code has been provided for the selection component of computing the interaction effect.

We covered selectors in the pandas tutorial, but look back if you need a refresher. If you're confused or curious about how `.loc` works, check out [this tutorial](http://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/). 
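
The two kinds of indexing can be seen side by side in a minimal sketch on toy data. The names `group`, `score`, and `groupMean` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical toy data; `group`, `score`, and `groupMean` are illustrative names only.
toy = pd.DataFrame({
    "group": ["p", "q", "p", "q"],
    "score": [1.0, 2.0, 3.0, 6.0],
})

group_means = toy.groupby("group")["score"].mean()

for level in toy["group"].unique():
    # selector: a boolean Series that is True for the rows at this level
    selector = (toy["group"] == level)
    # left side: boolean mask picks the rows; right side: key lookup picks the mean
    toy.loc[selector, "groupMean"] = group_means.loc[level]

print(toy)
```

On the left-hand side `.loc` receives a boolean selector plus a column name, while on the right-hand side it receives a key from the index of the grouped means.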

In [None]:
anovaFrame["meanEffect"] = dataMean

for trauma in traumas:
    selector = (anovaFrame.trauma == trauma)
    anovaFrame.loc[selector,"traumaMainEffect"] = traumaMeans.loc[trauma]-dataMean
    
for treatment in treatments:
    selector = (anovaFrame.treatment == treatment)
    anovaFrame.loc[selector,"treatmentMainEffect"] = [] #your code here
    
for trauma in traumas:
    for treatment in treatments:
        selector = (anovaFrame.treatment == treatment) & (anovaFrame.trauma == trauma)
        anovaFrame.loc[selector,"interactionEffect"]  = treatmentXtraumaMeans.loc[treatment,trauma] - \
            #your code here
            # the \ symbol lets you "wrap" a line of code like you would a line of text!

Lastly, calculate the residual error. This can be done by simply subtracting columns of `anovaFrame`.

In [None]:
anovaFrame["residual"] = [] #your code here

anovaFrame.sample(10)

With our decomposition matrix in hand, we can calculate the relevant sums of squares. Using a square-and-sum function, calculate the total, mean, main effects, interaction, and residual sums of squares.

As before, we'll store them in a dictionary, keyed by their name.

In [None]:
def SS(x):
    return # your code here

In [None]:
sumofsquares = {}

keys = ["total","mean","trauma","treatment","interaction","residual"]

columns = [] # your code here

for key,column in zip(keys,columns):
    sumofsquares[key] = SS(column)
    
sumofsquares
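
One sanity check on a decomposition like this is that the component sums of squares add up to the total. The sketch below shows the idea on a toy one-factor dataset with equal group sizes; the names `toy`, `group`, `score`, and `sum_of_squares` are hypothetical, and the same additivity is only assumed to carry over to the two-way case when the design is balanced.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one factor and equal group sizes.
toy = pd.DataFrame({
    "group": ["p", "p", "q", "q"],
    "score": [1.0, 3.0, 4.0, 8.0],
})

# Decompose each observation into grand mean + group effect + residual.
grand_mean = toy["score"].mean()
group_effect = toy.groupby("group")["score"].transform("mean") - grand_mean
residual = toy["score"] - grand_mean - group_effect

def sum_of_squares(x):
    """Square every entry of x, then add the squares up."""
    return float(np.sum(np.square(x)))

# The mean component is the grand mean repeated once per observation.
mean_column = np.full(len(toy), grand_mean)

total = sum_of_squares(toy["score"])
parts = (sum_of_squares(mean_column)
         + sum_of_squares(group_effect)
         + sum_of_squares(residual))

print(total, parts)
```

If the two printed numbers disagree by more than floating-point error, one of the decomposition columns is likely miscomputed.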

Calculate the degrees of freedom for each of the sums of squares above and store them in a dictionary.

In [None]:
dof = {}

keys = ["total","mean","trauma","treatment","interaction","residual"]
vals = [] #your code here

for key,val in zip(keys,vals):
    dof[key] = val
    
dof
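
A useful check on the values above is that the component degrees of freedom add up to the total. As a sketch, assume a fully crossed design with $N$ observations, $a$ levels of one factor, and $b$ levels of the other (this notation is introduced here, not taken from the lab), and assume the convention used above in which the total sum of squares includes the mean. Under those assumptions the textbook bookkeeping is:

$$
\underbrace{N}_{\text{total}} \;=\; \underbrace{1}_{\text{mean}} \;+\; \underbrace{(a-1)}_{\text{factor A}} \;+\; \underbrace{(b-1)}_{\text{factor B}} \;+\; \underbrace{(a-1)(b-1)}_{\text{interaction}} \;+\; \underbrace{(N-ab)}_{\text{residual}}
$$

Expanding the right-hand side, every term except $N$ cancels, which is why the identity holds for any $a$ and $b$.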

Using the dictionaries `sumofsquares` and `dof`, compute the mean square values for all of the keyed quantities.

In [None]:
meansquare = {}

for key in keys:
    meansquare[key] = [] #your code here
    
meansquare

Compute the F-statistic for each main effect and the interaction.

In [None]:
F = {}

for key in ["treatment","trauma","interaction"]:
    F[key] = [] #your code here

F

Lastly, compute the p-value for each effect using `scipy.stats.f.cdf`.

#### Q3 What's a CDF again? Why do we use the CDF of the sampling distribution of the statistic to compute p-values?

In [None]:
for effect in F.keys():
    print(effect)
    print() #your code here
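
To illustrate how a p-value comes out of the CDF, here is a small sketch using made-up numbers. The F value and degrees of freedom below are hypothetical and unrelated to the lab's data.

```python
import scipy.stats

# Hypothetical values chosen only for illustration.
f_value, dof_between, dof_within = 4.0, 2, 30

# cdf gives the probability, under the null, of an F at or below f_value ...
cdf_value = scipy.stats.f.cdf(f_value, dof_between, dof_within)

# ... so the p-value (probability of an F at least this large) is the complement.
p_from_cdf = 1 - cdf_value

# scipy also exposes the complement directly as the "survival function".
p_from_sf = scipy.stats.f.sf(f_value, dof_between, dof_within)

print(p_from_cdf, p_from_sf)
```

The two computations agree up to floating-point error; `sf` tends to be more accurate when the p-value is extremely small.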