# Lecture 24: One-Way Analysis of Variance (ANOVA) 
***

We'll need NumPy, Matplotlib, pandas, scipy.stats, and statsmodels for this notebook, so let's load them. 

In [None]:
import numpy as np 
from scipy import stats
import statsmodels.api as sm 
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline

### Exercise 1 - Diet, Exercise, and Weight Loss 
*** 

A randomized controlled study was performed with $9$ subjects to investigate the effect of exercise and diet on weight loss.  All $9$ subjects exercised daily.  One third of the subjects ate their regular diet (the control group), one third followed Diet $A$, and one third followed Diet $B$.  The observed weight loss after one week is summarized in the following data. 

In [None]:
dfD = pd.DataFrame({"Control": [3, 2, 1], "Diet A": [5, 3, 4], "Diet B": [5, 6, 7] })
dfD.head()

**Part A**: We're interested in determining whether the mean weight loss is the same in all three groups, or whether at least one group's mean differs.  We've done this example by hand in class.  In this exercise you'll use the [scipy.stats.f_oneway](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) function to verify our results. Check out the docs, then use the function to find the appropriate $F$-statistic and corresponding p-value for the ANOVA test. At the $\alpha = 0.05$ significance level, does the test reject the null hypothesis that the mean weight loss is the same across all groups?  
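
As a minimal sketch of the call pattern: `stats.f_oneway` takes one array of observations per group as separate positional arguments and returns the $F$-statistic together with its p-value. The DataFrame is rebuilt here so the block runs on its own, and the names `F_stat` and `p_val` are illustrative choices, not part of the exercise.

```python
import pandas as pd
from scipy import stats

# Same data as dfD above, rebuilt so this block is self-contained
dfD = pd.DataFrame({"Control": [3, 2, 1], "Diet A": [5, 3, 4], "Diet B": [5, 6, 7]})

# Pass each group's observations as a separate positional argument
F_stat, p_val = stats.f_oneway(dfD["Control"], dfD["Diet A"], dfD["Diet B"])

print("F = {:.3f}, p-value = {:.4f}".format(F_stat, p_val))
```

Compare the printed values with the hand computation from class, and compare the p-value with $\alpha = 0.05$ to decide whether to reject the null hypothesis.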

**Part B**: In class, we claimed that an ANOVA $F$-test is equivalent to a linear regression whose features are binary indicator variables for group membership. In this exercise we'll verify this.  The following code reshapes the DataFrame to create binary indicator variables corresponding to Diet $A$ and Diet $B$.  Look at the resulting DataFrame, and explain how the response and features are encoded. 

In [None]:
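# Stack the group columns into one response vector, group by group
y = dfD.values.T.flatten()
dct = {"loss": y}
# Group sizes and their cumulative sums (the row index where each group ends)
counts = [dfD[col].count() for col in dfD.columns]
ccf = [int(np.sum(counts[:ii])) for ii in range(1,len(counts))] + [len(y)]
# Build one 0/1 column for every group except the first
for ii in range(1,dfD.shape[1]):
    x = np.zeros(len(y))
    x[ccf[ii-1]:ccf[ii]] = 1  
    dct[dfD.columns[ii]] = x 

dfR = pd.DataFrame(dct)
dfR = dfR.loc[:,["loss"] + list(dfD.columns[1:].values)]
dfR.head(100)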
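
For comparison, the same layout can be sketched with built-in pandas reshaping tools. This is an alternative to the loop above, not the notebook's own method. The names `long_df`, `dummies`, and `dfR_alt` are hypothetical, and the `dtype` argument of `pd.get_dummies` assumes a reasonably recent pandas.

```python
import pandas as pd

# Same data as dfD above, rebuilt so this block is self-contained
dfD = pd.DataFrame({"Control": [3, 2, 1], "Diet A": [5, 3, 4], "Diet B": [5, 6, 7]})

# Long format: one row per subject, with the group name in its own column
long_df = dfD.melt(var_name="group", value_name="loss")

# One 0/1 column per group; dropping "Control" leaves it as the reference group
dummies = pd.get_dummies(long_df["group"], dtype=float).drop(columns="Control")

# Response first, then the indicator columns
dfR_alt = pd.concat([long_df["loss"], dummies], axis=1)
print(dfR_alt)
```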

**Part C**: Use statsmodels to perform a multiple linear regression on the data created in **Part B**.  Look at the model summary and compare the reported $F$-statistic, its p-value, and the model coefficients to the results from **Part A**.  
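
Independently of statsmodels, the claimed equivalence can be sanity-checked with plain least squares. The sketch below uses `np.linalg.lstsq` and the regression $F$-statistic computed from the total and residual sums of squares. It assumes the encoding from **Part B** (intercept plus two indicator columns), and all variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Response and indicator features, laid out as in Part B
loss = np.array([3, 2, 1, 5, 3, 4, 5, 6, 7], dtype=float)
diet_a = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)
diet_b = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=float)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(loss), diet_a, diet_b])
beta, *_ = np.linalg.lstsq(X, loss, rcond=None)

# Residual and total sums of squares
sse = np.sum((loss - X @ beta) ** 2)
sst = np.sum((loss - loss.mean()) ** 2)

n = len(loss)        # number of observations
p = X.shape[1] - 1   # number of features, excluding the intercept

# Regression F-statistic and its p-value from the F-distribution
F_reg = ((sst - sse) / p) / (sse / (n - p - 1))
pval_reg = stats.f.sf(F_reg, p, n - p - 1)

print("coefficients:", beta)
print("F = {:.3f}, p-value = {:.4f}".format(F_reg, pval_reg))
```

The intercept is the fitted mean of the reference group, and each remaining coefficient is the difference between a diet group's mean and the reference mean.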

### Exercise 2 - Who's the Better Archer? 
*** 

Three friendly archery enthusiasts are arguing over which one is the superior archer.  Having taken Intro to Data Science, they decide to settle the argument by having a short competition and then performing a statistical analysis on the results.  In the competition, each archer takes 6 shots, and each shot is scored by how close it lands to the bullseye (higher is better).  The results are as follows: 

In [None]:
dfA = pd.DataFrame({"Suzie": np.array([5,4,4,3,9,4]),"Jack": np.array([4,8,7,5,1,5]),"Ruth": np.array([9,8,8,10,5,10])})
dfA.head(10)

**Part A**: Use stats.f_oneway to perform an $F$-test to determine if the mean scores of the three archers are different at the $\alpha = 0.05$ significance level.  

**Part B**: Use NumPy to compute the $F$-statistic directly, then obtain the associated p-value from the $F$-distribution (for example with stats.f.sf).  Verify that you get the same results as produced by stats.f_oneway. 
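
For reference, here is one common way to write the quantities in the skeleton below. The notation is an assumption and may differ from the one used in class: $k$ groups, $n_i$ observations in group $i$, $N = \sum_{i=1}^{k} n_i$ observations in total, observations $y_{ij}$, group means $\bar{y}_i$, and grand mean $\bar{y}$.

$$
SSB = \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2, \qquad df_B = k - 1
$$

$$
SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2, \qquad df_W = N - k
$$

$$
F = \frac{SSB / df_B}{SSW / df_W}, \qquad \text{p-value} = P\left(F_{df_B,\, df_W} \geq F\right)
$$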

In [None]:
# Compute the grand mean
grand_mean = 0 # TODO 

# Compute the between-group sum of squares 
SSB = 0 # TODO 

# Compute the between-group degrees of freedom 
SSB_df = 0 # TODO 

# Compute the within-group sum of squares 
SSW = 0 # TODO 

# Compute the within-group degrees of freedom 
SSW_df = 0 # TODO 

# Compute the test statistic 
F = 0 # TODO 

# Compute the associated p-value 
pval = 0 # TODO 


**Part C**: Use Tukey's HSD to determine which pairs of archers have significantly different mean scores, using the [MultiComparison](http://www.statsmodels.org/dev/generated/statsmodels.sandbox.stats.multicomp.MultiComparison.html) class. Interpret the results. 

In [None]:
from statsmodels.stats.multicomp import MultiComparison

In [None]:
# Format the data 
data = dfA.values.T.flatten()
labels = []
for col in dfA.columns:
    labels += [col]*dfA[col].count()
    
# Run Tukey's HSD 
mc = MultiComparison(data, labels)
result = mc.tukeyhsd()
print(result)
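
As a cross-check, recent SciPy releases also provide `scipy.stats.tukey_hsd`. The sketch below assumes SciPy 1.8 or newer and is an alternative to the statsmodels call above, not a replacement for it. The names `res` and `names` are illustrative.

```python
import numpy as np
from scipy import stats

# Same scores as dfA above, rebuilt so this block is self-contained
suzie = np.array([5, 4, 4, 3, 9, 4])
jack = np.array([4, 8, 7, 5, 1, 5])
ruth = np.array([9, 8, 8, 10, 5, 10])

# Assumption: scipy.stats.tukey_hsd is available (SciPy >= 1.8)
res = stats.tukey_hsd(suzie, jack, ruth)

names = ["Suzie", "Jack", "Ruth"]
# res.pvalue[i, j] is the adjusted p-value for comparing group i with group j
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print("{} vs {}: p = {:.4f}".format(names[i], names[j], res.pvalue[i, j]))
```

Pairs whose adjusted p-value falls below $0.05$ should match the pairs that the statsmodels table marks as rejected.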