# ANOVA - Lab

## Introduction

In this lab, you'll get brief practice generating an ANOVA table (AOV) and interpreting its output. You'll then compare ANOVA with the t-tests you previously used for hypothesis testing.

## Objectives

You will be able to:
* Use ANOVA for testing multiple pairwise comparisons
* Understand and explain the methodology behind ANOVA tests

## Loading the Data

Start by loading in the data stored in the file **ToothGrowth.csv**.

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv('ToothGrowth.csv')
df.head()

Unnamed: 0,len,supp,dose
0,4.2,VC,0.5
1,11.5,VC,0.5
2,7.3,VC,0.5
3,5.8,VC,0.5
4,6.4,VC,0.5


## Generating the ANOVA Table

Now generate an ANOVA table to analyze the influence of the supplement type and dosage on tooth length.

In [4]:
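formula = 'df.len ~ df.supp + df.dose'
lm = ols(formula, df).fit()
# anova_lm's keyword is `typ`, not `type`. A `type=2` argument is not recognized,
# so the table printed below is the default Type I (sequential) ANOVA.
table = sm.stats.anova_lm(lm, typ=1)
print(table)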

            df       sum_sq      mean_sq           F        PR(>F)
df.supp    1.0   205.350000   205.350000   11.446768  1.300662e-03
df.dose    1.0  2224.304298  2224.304298  123.988774  6.313519e-16
Residual  57.0  1022.555036    17.939562         NaN           NaN
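
To see how the columns of the table relate to each other, here is a minimal sketch that recomputes `mean_sq` and `F` from the `sum_sq` and `df` values printed above. The dictionary names are illustrative, and the numbers are copied by hand from the table rather than read from the fitted model.

```python
# Values copied from the ANOVA table above.
sum_sq = {"supp": 205.350000, "dose": 2224.304298, "Residual": 1022.555036}
dof = {"supp": 1, "dose": 1, "Residual": 57}

# Mean square = sum of squares / degrees of freedom.
mean_sq = {term: sum_sq[term] / dof[term] for term in sum_sq}

# F statistic = mean square of the term / mean square of the residual.
f_stats = {term: mean_sq[term] / mean_sq["Residual"] for term in ("supp", "dose")}

for term, f_value in f_stats.items():
    print(f"{term}: mean_sq = {mean_sq[term]:.3f}, F = {f_value:.3f}")
```

Each F statistic compares the variation explained by a term with the unexplained (residual) variation, which is why a large F corresponds to a small p-value.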


## Reading the Table

Briefly comment on what the table's statistics say about the effect of supplement and dosage on tooth length.

In [None]:
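# Both factors are statistically significant at the 0.05 level:
# supplement (F ≈ 11.4, p ≈ 0.0013) and dose (F ≈ 124, p ≈ 6.3e-16).
# Dose accounts for far more of the variation in tooth length
# (sum_sq ≈ 2224 vs. ≈ 205 for supplement).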

## Comparing to T-Tests

Now that you've had a brief chance to work with ANOVA, it's interesting to compare its results to those from the t-tests you were just working with. Start by breaking the data into two samples: those given the OJ supplement, and those given the VC supplement. Afterwards, you'll conduct a t-test to compare the tooth length of these two samples.

In [10]:
# Select tooth lengths for each supplement group (single .loc call avoids chained indexing)
oj = df.loc[df['supp'] == 'OJ', 'len']
vc = df.loc[df['supp'] == 'VC', 'len']

In [11]:
oj.head()

30    15.2
31    21.5
32    17.6
33     9.7
34    14.5
Name: len, dtype: float64

Now run a t-test comparing these two groups and print the associated two-sided p-value.

In [13]:
import scipy.stats  # provides ttest_ind for the two-sample t-test

In [15]:
scipy.stats.ttest_ind(oj, vc)[1]

0.06039337122412849

## A 2-Category ANOVA F-Test is Equivalent to a 2-Tailed t-Test!

Now, recalculate an ANOVA F-test with only the supplement variable. An ANOVA F-test between two categories is equivalent to a two-tailed t-test that assumes equal variances (the default for `scipy.stats.ttest_ind`): the F statistic is the square of the t statistic. So, the p-value in the table should match your calculation above.

> Note: there may be a small fractional difference (<0.001) between the two values due to rounding differences between implementations.

In [16]:
formula = 'df.len ~ df.supp'
lm = ols(formula, df).fit()
# The keyword is `typ`, not `type`; the table below is the default Type I ANOVA.
table = sm.stats.anova_lm(lm, typ=1)
print(table)
# Compare PR(>F) for df.supp with the t-test p-value above; they should match.

            df       sum_sq     mean_sq         F    PR(>F)
df.supp    1.0   205.350000  205.350000  3.668253  0.060393
Residual  58.0  3246.859333   55.980333       NaN       NaN
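
The equivalence can also be checked by hand. The following is a minimal sketch using two small made-up samples (not ToothGrowth values) and only the standard library: it computes the pooled-variance t statistic and the one-way ANOVA F statistic for the same two groups, then compares t² with F.

```python
from statistics import mean, variance

# Made-up samples, for illustration only (not ToothGrowth data).
a = [4.0, 6.0, 5.0, 7.0, 8.0]
b = [9.0, 7.0, 10.0, 8.0, 11.0]
n_a, n_b = len(a), len(b)

# Pooled-variance (equal-variance) two-sample t statistic.
pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
t_stat = (mean(a) - mean(b)) / (pooled_var * (1 / n_a + 1 / n_b)) ** 0.5

# One-way ANOVA F statistic for the same two groups.
grand_mean = mean(a + b)
ss_between = n_a * (mean(a) - grand_mean) ** 2 + n_b * (mean(b) - grand_mean) ** 2
ss_within = sum((x - mean(a)) ** 2 for x in a) + sum((x - mean(b)) ** 2 for x in b)
f_stat = (ss_between / 1) / (ss_within / (n_a + n_b - 2))  # 1 = (2 groups - 1)

print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
```

Because F equals t² and the F distribution with (1, n − 2) degrees of freedom is the distribution of a squared t variable with n − 2 degrees of freedom, the two tests give the same p-value.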


## Generating Multiple T-Tests

While the 2-category ANOVA test is identical to a 2-tailed t-Test, performing multiple t-tests leads to the multiple comparisons problem. To investigate this, look at the various sample groups you could create from the 2 features: 

In [17]:
# Each item yielded by groupby is a (group_name, data) pair
for group_name, data in df.groupby(['supp', 'dose'])['len']:
    print(group_name)

('OJ', 0.5)
('OJ', 1.0)
('OJ', 2.0)
('VC', 0.5)
('VC', 1.0)
('VC', 2.0)


Although running many t-tests is bad practice, examine what happens when you do so across the various combinations of these groups. To do this, generate all pairwise combinations of the above groups. For each pair, calculate the p-value of a two-sided t-test, then print the group combination and its associated p-value.

In [34]:
from itertools import combinations

# Labels of every (supp, dose) group
groups = [name for name, _ in df.groupby(['supp', 'dose'])['len']]

# Since there isn't a control group, compare each group to every other group
for combo in combinations(groups, 2):
    (supp1, dose1), (supp2, dose2) = combo
    sample1 = df.loc[(df['supp'] == supp1) & (df['dose'] == dose1), 'len']
    sample2 = df.loc[(df['supp'] == supp2) & (df['dose'] == dose2), 'len']
    # Two-sided t-test; index 1 of the result is the p-value
    p = scipy.stats.ttest_ind(sample1, sample2)
    print(combo, p[1])

(('OJ', 0.5), ('OJ', 1.0)) 8.357559281443774e-05
(('OJ', 0.5), ('OJ', 2.0)) 3.4018585295016214e-07
(('OJ', 0.5), ('VC', 0.5)) 0.005303661339923052
(('OJ', 0.5), ('VC', 1.0)) 0.04223992429368205
(('OJ', 0.5), ('VC', 2.0)) 7.025409196997986e-06
(('OJ', 1.0), ('OJ', 2.0)) 0.03736279585664383
(('OJ', 1.0), ('VC', 0.5)) 1.3372624230559434e-08
(('OJ', 1.0), ('VC', 1.0)) 0.0007807261651774468
(('OJ', 1.0), ('VC', 2.0)) 0.09583711277517494
(('OJ', 2.0), ('VC', 0.5)) 1.3381068810881244e-11
(('OJ', 2.0), ('VC', 1.0)) 2.3131084633597503e-07
(('OJ', 2.0), ('VC', 2.0)) 0.9637097790041267
(('VC', 0.5), ('VC', 1.0)) 6.492264598157612e-07
(('VC', 0.5), ('VC', 2.0)) 4.957285658438862e-09
(('VC', 1.0), ('VC', 2.0)) 3.397577925539582e-05
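
To put a number on the multiple comparisons problem, here is a rough sketch of how the chance of at least one false positive grows with 15 tests, along with a Bonferroni-corrected threshold. The family-wise error formula assumes independent tests, which is not strictly true here because the pairs share groups, so treat it as an approximation. The p-values are rounded copies of the output above, and the variable names are illustrative.

```python
from math import comb

alpha = 0.05
n_groups = 6                      # supplement-dose groups listed earlier
n_tests = comb(n_groups, 2)       # number of pairwise comparisons

# Probability of at least one false positive if all nulls were true
# and the tests were independent (an approximation in this setting).
fwer = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: divide alpha by the number of tests.
bonferroni_alpha = alpha / n_tests

# p-values rounded from the printed output above, in the same order.
p_values = [
    8.36e-05, 3.40e-07, 5.30e-03, 4.22e-02, 7.03e-06,
    3.74e-02, 1.34e-08, 7.81e-04, 9.58e-02, 1.34e-11,
    2.31e-07, 9.64e-01, 6.49e-07, 4.96e-09, 3.40e-05,
]

n_sig_raw = sum(p < alpha for p in p_values)
n_sig_bonf = sum(p < bonferroni_alpha for p in p_values)

print(f"{n_tests} tests, approximate family-wise error rate: {fwer:.3f}")
print(f"Bonferroni threshold: {bonferroni_alpha:.5f}")
print(f"Significant at {alpha}: {n_sig_raw}; after Bonferroni: {n_sig_bonf}")
```

Under these assumptions, the chance of at least one spurious "significant" result across 15 tests is roughly 54%, which is why a single ANOVA (or a corrected threshold) is preferred over many uncorrected t-tests.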


## Summary

In this lab, you used ANOVA to generalize A/B testing methods to multiple groups and factors, and saw how a two-category ANOVA F-test relates to the two-sample t-test.