# Lecture 24: One-Way Analysis of Variance (ANOVA) Solutions
***

We'll need NumPy, Matplotlib, pandas, scipy.stats, and statsmodels for this notebook, so let's load them. 

In [1]:
import numpy as np 
from scipy import stats
import statsmodels.api as sm 
import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline



### Exercise 1 - Diet, Exercise, and Weight Loss 
*** 

A randomized control study was performed with $9$ subjects to investigate the effect of exercise and diet on weight loss.  All $9$ subjects of the study exercised on a daily basis, one third of the subjects ate their regular diet, one third of subjects ate based on Diet $A$, and one third of subjects ate based on Diet $B$.  The observed weight loss after one week is summarized in the following data. 

In [6]:
dfD = pd.DataFrame({"Control": [3, 2, 1], "Diet A": [5, 3, 4], "Diet B": [5, 6, 7] })
dfD.head()

| | Control | Diet A | Diet B |
|---|---|---|---|
| 0 | 3 | 5 | 5 |
| 1 | 2 | 3 | 6 |
| 2 | 1 | 4 | 7 |


**Part A**: We're interested in determining whether the mean weight loss is the same across all three groups, or whether some groups have better results.  We've done this example by hand in class.  In this exercise you'll use the [scipy.stats.f_oneway](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html) function to verify our results. Check out the docs, then use the function to find the appropriate $F$-statistic and corresponding p-value for the ANOVA test. Does the test indicate, at the $\alpha = 0.05$ significance level, that the mean weight loss is the same across all groups?  

In [7]:
F, pval = stats.f_oneway(dfD["Control"], dfD["Diet A"], dfD["Diet B"])
print("F = {:.5f}".format(F))
print("pval = {:.5f}".format(pval))

F = 12.00000
pval = 0.00800


Since $0.008 < \alpha = 0.05$, we reject the null hypothesis that all group means are equal and conclude that at least one of the group means is different from the others. 
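The hand calculation behind this result can be reproduced directly from the data above. As a sketch, write $\bar{y}_i$ for a group's sample mean and $\bar{y}$ for the grand mean (this notation is an assumption here and may differ from the one used in class):

$$
\begin{aligned}
\bar{y}_{\text{Control}} &= 2, \quad \bar{y}_{A} = 4, \quad \bar{y}_{B} = 6, \quad \bar{y} = 4 \\
SSB &= 3\left[(2-4)^2 + (4-4)^2 + (6-4)^2\right] = 24, \qquad df_B = 3 - 1 = 2 \\
SSW &= (1 + 0 + 1) + (1 + 1 + 0) + (1 + 0 + 1) = 6, \qquad df_W = 9 - 3 = 6 \\
F &= \frac{SSB/df_B}{SSW/df_W} = \frac{24/2}{6/6} = 12
\end{aligned}
$$

This matches the value $F = 12$ returned by stats.f_oneway.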

**Part B**: In class, we claimed that an ANOVA $F$-test is equivalent to linear regression where the features are binary categorical variables associated with group membership. In this exercise we'll verify this.  The following code re-factors the DataFrame to create binary categorical variables corresponding to Diet $A$ and Diet $B$.  Look at the resulting DataFrame, and explain how the response and features are encoded. 

In [8]:
y = dfD.values.T.flatten()
dct = {"loss": y}
counts = [dfD[col].count() for col in dfD.columns]
ccf = [int(np.sum(counts[:ii])) for ii in range(1,len(counts))] + [len(y)]
for ii in range(1,dfD.shape[1]):
    x = np.zeros(len(y))
    x[ccf[ii-1]:ccf[ii]] = 1  
    dct[dfD.columns[ii]] = x 

dfR = pd.DataFrame(dct)
dfR = dfR.loc[:,["loss"] + list(dfD.columns[1:].values)]
dfR.head(100)

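| | loss | Diet A | Diet B |
|---|---|---|---|
| 0 | 3 | 0.0 | 0.0 |
| 1 | 2 | 0.0 | 0.0 |
| 2 | 1 | 0.0 | 0.0 |
| 3 | 5 | 1.0 | 0.0 |
| 4 | 3 | 1.0 | 0.0 |
| 5 | 4 | 1.0 | 0.0 |
| 6 | 5 | 0.0 | 1.0 |
| 7 | 6 | 0.0 | 1.0 |
| 8 | 7 | 0.0 | 1.0 |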


The response, loss, stacks the observed weight losses of all nine subjects, one group after another.  We choose the Control column as the reference group, so it gets no feature of its own.  The binary features $X_1$ and $X_2$ correspond to the Diet $A$ and Diet $B$ groups, respectively.  The first three responses correspond to the control group, so both $X_1$ and $X_2$ are zero.  The next three responses correspond to Diet $A$ so $X_1 = 1$ and $X_2 = 0$.  Finally, the last three responses correspond to Diet $B$ so $X_1 = 0$ and $X_2 = 1$. 
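The same encoding can also be built with pandas reshaping tools instead of explicit index arithmetic. The following is a minimal sketch, not the notebook's own method; the names `long_df` and `dfR_alt` are hypothetical, and it assumes equal-sized groups with no missing values.

```python
import pandas as pd

dfD = pd.DataFrame({"Control": [3, 2, 1], "Diet A": [5, 3, 4], "Diet B": [5, 6, 7]})

# Stack the three columns into long format: one row per subject,
# with the group label in "group" and the response in "loss".
long_df = dfD.melt(var_name="group", value_name="loss")

# Build one 0/1 indicator column per group, then drop the reference
# group (Control) so that it is represented by all-zero features.
dummies = pd.get_dummies(long_df["group"]).drop(columns="Control").astype(float)

# Assemble the regression-ready frame (hypothetical name dfR_alt).
dfR_alt = pd.concat([long_df["loss"], dummies], axis=1)
print(dfR_alt)
```

Dropping one indicator column matters: keeping all three alongside an intercept would make the design matrix rank-deficient.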

**Part C**: Use statsmodels to perform a multiple linear regression on the data created in **Part B**.  Look at the model summary and compare the computed $F$-test and model coefficients to the results above.  

In [12]:
y = dfR.loc[:,"loss"]
X = dfR.loc[:,["Diet A", "Diet B"]]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
model.summary()



| | | | |
|---|---|---|---|
| Dep. Variable: | loss | R-squared: | 0.8 |
| Model: | OLS | Adj. R-squared: | 0.733 |
| Method: | Least Squares | F-statistic: | 12.0 |
| Date: | Mon, 04 Dec 2017 | Prob (F-statistic): | 0.008 |
| Time: | 22:54:58 | Log-Likelihood: | -10.946 |
| No. Observations: | 9 | AIC: | 27.89 |
| Df Residuals: | 6 | BIC: | 28.48 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 2.0000 | 0.577 | 3.464 | 0.013 | 0.587 | 3.413 |
| Diet A | 2.0000 | 0.816 | 2.449 | 0.050 | 0.002 | 3.998 |
| Diet B | 4.0000 | 0.816 | 4.899 | 0.003 | 2.002 | 5.998 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.38 | Durbin-Watson: | 2.333 |
| Prob(Omnibus): | 0.304 | Jarque-Bera (JB): | 0.844 |
| Skew: | -0.0 | Prob(JB): | 0.656 |
| Kurtosis: | 1.5 | Cond. No. | 3.73 |


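The computed $F$-statistic ($12.0$) and associated p-value ($0.008$) for the MLR model are identical to those from the ANOVA test.  The coefficients also recover the group means: the intercept ($2.0$) is the sample mean of the Control group, and the Diet $A$ and Diet $B$ coefficients ($2.0$ and $4.0$) are the differences between those groups' sample means ($4$ and $6$) and the Control mean. 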
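The numbers in the summary can be reproduced with plain NumPy, which makes the link between regression and ANOVA explicit. This is a sketch under the assumption that the design matrix is the one from **Part B** (intercept plus two indicators); it uses the standard identity $F = \frac{R^2/p}{(1-R^2)/(n-p-1)}$ for a model with $p$ features, and the variable names are illustrative only.

```python
import numpy as np

# Response and indicator features, in the same order as the Part B DataFrame.
loss = np.array([3, 2, 1, 5, 3, 4, 5, 6, 7], dtype=float)
diet_a = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0], dtype=float)
diet_b = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1], dtype=float)

# Design matrix with an intercept column, then an ordinary least-squares fit.
X = np.column_stack([np.ones_like(loss), diet_a, diet_b])
beta, *_ = np.linalg.lstsq(X, loss, rcond=None)

# R^2 from the residual and total sums of squares.
ss_res = np.sum((loss - X @ beta) ** 2)
ss_tot = np.sum((loss - loss.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Overall F-statistic from R^2 with p features and n observations.
n, p = len(loss), 2
F_from_r2 = (r2 / p) / ((1 - r2) / (n - p - 1))

print("coefficients:", np.round(beta, 4))
print("R^2 = {:.3f}, F = {:.3f}".format(r2, F_from_r2))
```

Here the residual sum of squares plays the role of $SSW$ and the total sum of squares equals $SSB + SSW$, which is why the two tests agree.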

### Exercise 2 - Who's the Better Archer? 
*** 

Three friendly archery enthusiasts are arguing over which one is the superior archer.  Having taken Intro to Data Science, they decide to settle the bet by having a short competition and then performing a statistical analysis on the results.  In the competition, each archer takes 6 shots and records their score based on distance from the bullseye (higher is better).  The results are as follows: 

In [2]:
dfA = pd.DataFrame({"Suzie": np.array([5,4,4,3,9,4]),"Jack": np.array([4,8,7,5,1,5]),"Ruth": np.array([9,8,8,10,5,10])})
dfA.head(10)

| | Jack | Ruth | Suzie |
|---|---|---|---|
| 0 | 4 | 9 | 5 |
| 1 | 8 | 8 | 4 |
| 2 | 7 | 8 | 4 |
| 3 | 5 | 10 | 3 |
| 4 | 1 | 5 | 9 |
| 5 | 5 | 10 | 4 |


**Part A**: Use stats.f_oneway to perform an $F$-test to determine if the mean scores of the three archers are different at the $\alpha = 0.05$ significance level.  

In [10]:
F, pval = stats.f_oneway(dfA["Jack"], dfA["Ruth"], dfA["Suzie"])
print("F, pval = {:.5f}, {:.5f}".format(F, pval))

F, pval = 5.00000, 0.02168


Since the p-value of the ANOVA test is $0.02168 < \alpha = 0.05$ we reject the null hypothesis and conclude that there is at least one mean in the three groups that is different from the others. 

**Part B**: Use numpy to compute the $F$-statistic and associated p-value directly.  Verify that you get the same results as produced by stats.f_oneway. 

In [13]:
# Get total number of points
N = len(dfA.values.flatten())

# Get total number of groups
I = len(dfA.columns)

# Compute the grand_mean
grand_mean = np.mean(dfA.values) 

# Compute the between-group sum of squares 
SSB = np.sum([dfA[col].count()*(dfA[col].mean()-grand_mean)**2 for col in dfA.columns])

# Compute the between_group degrees of freedom 
SSB_df = I-1 

# Compute the within-group sum of squares 
SSW = np.sum([np.sum((dfA[col] - dfA[col].mean())**2) for col in dfA.columns])

# Compute the within_group degrees of freedom 
SSW_df = N-I 

# Compute the test statistic 
F = (SSB/SSB_df)/(SSW/SSW_df) 

# Compute the associated p-value (upper-tail probability; sf is the numerically safer form of 1 - cdf)
pval = stats.f.sf(F, SSB_df, SSW_df) 

print("SSB, SSB_df = {:.3f}, {}".format(SSB, SSB_df))
print("SSW, SSW_df = {:.3f}, {}".format(SSW, SSW_df))
print("F, pval = {:.5f}, {:.5f}".format(F, pval))

SSB, SSB_df = 46.778, 2
SSW, SSW_df = 70.167, 15
F, pval = 5.00000, 0.02168


The computed $F$-statistic and associated $p$-value are identical to those computed by stats.f_oneway. 
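For reference, the quantities computed in the cell above correspond to the following formulas. The symbols $I$ and $N$ match the code; $n_i$, $y_{ij}$, $\bar{y}_i$, and $\bar{y}$ (group size, individual score, group mean, and grand mean) are notation assumed here.

$$
\begin{aligned}
SSB &= \sum_{i=1}^{I} n_i \left(\bar{y}_i - \bar{y}\right)^2, \qquad
SSW = \sum_{i=1}^{I} \sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
F &= \frac{SSB/(I-1)}{SSW/(N-I)} = \frac{46.778/2}{70.167/15} \approx \frac{23.389}{4.678} \approx 5.0
\end{aligned}
$$

The p-value is the upper-tail probability $P(F_{2,15} \geq 5.0)$ under the $F$ distribution with $I - 1 = 2$ and $N - I = 15$ degrees of freedom.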

**Part C**: Use Tukey's HSD to determine which archers are statistically different. Interpret the results. 

In [4]:
from statsmodels.stats.multicomp import MultiComparison

In [5]:
data = dfA.values.T.flatten()
labels = []
for col in dfA.columns:
    labels += [col]*dfA[col].count()
    
mc = MultiComparison(data, labels)
result = mc.tukeyhsd()
print(result)

Multiple Comparison of Means - Tukey HSD,FWER=0.05
group1 group2 meandiff  lower   upper  reject
---------------------------------------------
 Jack   Ruth   3.3333   0.0911  6.5755  True 
 Jack  Suzie  -0.1667  -3.4089  3.0755 False 
 Ruth  Suzie    -3.5   -6.7422 -0.2578  True 
---------------------------------------------


Since there are three groups (corresponding to the three archers) we make $3$ pairwise comparisons.  The reject column tells us whether the null hypothesis that the two means are equal is rejected.  We see that there is sufficient statistical evidence to believe that Jack's and Ruth's means are different and that Ruth's and Suzie's means are different, but **not** that Jack's and Suzie's means are different (the interval for their difference contains $0$). 
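To see where the interval half-widths in the table come from, the Tukey HSD margin can be computed by hand. The sketch below assumes equal group sizes and uses `scipy.stats.studentized_range`, which requires SciPy 1.7 or newer; it is not the code path statsmodels uses internally, so the last digits may differ slightly from the table.

```python
import itertools
import numpy as np
from scipy import stats

scores = {
    "Jack": np.array([4, 8, 7, 5, 1, 5]),
    "Ruth": np.array([9, 8, 8, 10, 5, 10]),
    "Suzie": np.array([5, 4, 4, 3, 9, 4]),
}

k = len(scores)          # number of groups
n = 6                    # shots per archer (equal group sizes assumed)
df_within = k * n - k    # within-group degrees of freedom

# Mean square within: pooled estimate of the common variance.
msw = sum(np.sum((x - x.mean()) ** 2) for x in scores.values()) / df_within

# Critical value of the studentized range at family-wise level 0.05.
q_crit = stats.studentized_range.ppf(0.95, k, df_within)

# Two means differ significantly if their gap exceeds this margin.
margin = q_crit * np.sqrt(msw / n)
print("HSD margin = {:.4f}".format(margin))

reject = {}
for a, b in itertools.combinations(scores, 2):
    diff = scores[b].mean() - scores[a].mean()
    reject[(a, b)] = bool(abs(diff) > margin)
    print("{:5s} {:5s} meandiff = {:7.4f}  reject = {}".format(a, b, diff, reject[(a, b)]))
```

Each interval in the table is the mean difference plus or minus this single margin, which is why all three intervals have the same width.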

From this we conclude that Ruth's mean is different from the other two, while the data do not let us distinguish between Jack's and Suzie's means.  Inspecting the sample means, we observe the following: 

In [15]:
print("Suzie mean: {:.3f}".format(dfA["Suzie"].mean()))
print("Jack mean: {:.3f}".format(dfA["Jack"].mean()))
print("Ruth mean: {:.3f}".format(dfA["Ruth"].mean()))

Suzie mean: 4.833
Jack mean: 5.000
Ruth mean: 8.333


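Since Ruth's sample mean is higher than the others, we conclude that $\mu_{Ruth} > \mu_{Jack}$ and $\mu_{Ruth} > \mu_{Suzie}$, while we have no evidence of a difference between $\mu_{Suzie}$ and $\mu_{Jack}$.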