# ANOVA (Analysis of Variance)

### One-way ANOVA

A one-way ANOVA uses the following null and alternative hypotheses:

- H0 (null hypothesis): μ1 = μ2 = μ3 = … = μk (all the population means are equal)
- H1 (alternative hypothesis): at least one population mean is different from the rest

>Case Statement
>
>A researcher recruits 30 students to participate in a study. The students are randomly assigned to use one of three studying techniques for the next three weeks to prepare for an exam. At the end of the three weeks, all of the students take the same test.

In [6]:
from scipy.stats import f_oneway

#enter exam scores for each group
group1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]



#perform one-way ANOVA
f_stat, p = f_oneway(group1, group2, group3)
print('stats ' + str(round(f_stat, 3)))
print('p ' + str(round(p, 3)))

#compare the unrounded p-value against the 0.05 significance level
if p > 0.05:
    print("As the p-value is more than 0.05, we fail to reject the null hypothesis")
else:
    print("As the p-value is not more than 0.05, we reject the null hypothesis")


stats 2.358
p 0.114
As the p-value is more than 0.05, we fail to reject the null hypothesis


NOTE
>This means we do not have sufficient evidence to say that there is a difference in exam scores among the three studying techniques.
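
To see where the F statistic comes from, the calculation can be reproduced by hand. The following is a minimal sketch, assuming NumPy and SciPy are available. The variable names (`ss_between`, `ss_within`, `f_manual`, `p_manual`) are illustrative and not part of the original notebook. It splits the variation into a between-group and a within-group sum of squares, and the result should agree with `f_oneway` up to rounding.

```python
import numpy as np
from scipy.stats import f

# Exam scores for the three studying techniques (same data as above)
groups = [
    np.array([85, 86, 88, 75, 78, 94, 98, 79, 71, 80]),
    np.array([91, 92, 93, 85, 87, 84, 82, 88, 95, 96]),
    np.array([79, 78, 88, 94, 92, 85, 83, 85, 82, 81]),
]

k = len(groups)                          # number of groups
n_total = sum(len(g) for g in groups)    # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the scores around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = k - 1
df_within = n_total - k

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)

# p-value: upper tail of the F distribution with (df_between, df_within) degrees of freedom
p_manual = f.sf(f_manual, df_between, df_within)

print(round(f_manual, 3), round(p_manual, 3))
```

A large F means the group means differ by more than the within-group noise would suggest. Here the ratio is modest, which is why the p-value stays above 0.05.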

### Two-way ANOVA

A two-way ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups that have been split on two factors.

Case Statement
>
>A botanist wants to know whether or not plant growth is influenced by sunlight exposure and watering frequency. She plants 30 seeds and lets them grow for two months under different conditions for sunlight exposure and watering frequency. After two months, she records the height of each plant, in inches.
>We will perform a two-way ANOVA to determine whether watering frequency and sunlight exposure have a significant effect on plant growth, and whether there is any interaction effect between watering frequency and sunlight exposure. The data set contains the following variables:
>
> - water: how frequently each plant was watered: daily or weekly
> - sun: how much sunlight exposure each plant received: low, medium, or high
> - height: the height of each plant (in inches) after two months

In [7]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

#create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

#view first ten rows of data 
df.head(10)


#perform two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)

                    sum_sq    df        F    PR(>F)
C(water)          8.533333   1.0  16.0000  0.000527
C(sun)           24.866667   2.0  23.3125  0.000002
C(water):C(sun)   2.466667   2.0   2.3125  0.120667
Residual         12.800000  24.0      NaN       NaN


Results Interpretation
>
>Since the p-values for water and sun are both less than .05, both factors have a statistically significant effect on plant height.
>
>Since the p-value for the interaction effect (.120667) is not less than .05, there is no statistically significant interaction effect between sunlight exposure and watering frequency.
>
>We would need to perform post hoc tests to determine exactly how the different levels of water and sunlight affect plant height.
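
Before running any formal follow-up test, it can help to look at the group means behind the ANOVA table. The following is a minimal sketch, assuming pandas and NumPy are available. The names `cell_means`, `water_means`, and `sun_means` are illustrative. It rebuilds the same data frame and summarises the mean height for each combination of the two factors.

```python
import numpy as np
import pandas as pd

# Same plant data as in the two-way ANOVA above
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# Mean height for every water x sun combination (one cell per combination)
cell_means = df.pivot_table(index='water', columns='sun',
                            values='height', aggfunc='mean')

# Put the sunlight levels in their natural order instead of alphabetical order
cell_means = cell_means[['low', 'med', 'high']]
print(cell_means)

# Marginal means for each factor on its own
water_means = df.groupby('water')['height'].mean()
sun_means = df.groupby('sun')['height'].mean()
print(water_means)
print(sun_means)
```

If the gap between the daily and weekly rows were very different from one sunlight level to the next, that would point towards an interaction. The table is only a descriptive aid and does not replace the test itself.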

## Post Hoc Tests (Two-way ANOVA)

>**Technical Note:** A post hoc test is only needed when the p-value for the ANOVA is statistically significant. If the p-value is not statistically significant, we do not have sufficient evidence that any of the group means differ, so there is no need to conduct a post hoc test to find out which groups are different from each other.
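
One common post hoc test is Tukey's HSD. The following is a minimal sketch for the sunlight factor, assuming `scipy.stats.tukey_hsd` is available (to our knowledge this needs SciPy 1.8 or later). It is a simplification: it pools the daily and weekly plants and compares the three sunlight levels as if this were a one-way design, so it does not adjust for watering frequency. The list names `low`, `med`, and `high` are illustrative.

```python
from scipy.stats import tukey_hsd

# Plant heights grouped by sunlight level (daily and weekly plants pooled)
low = [6, 6, 6, 5, 6, 3, 4, 4, 4, 5]
med = [5, 5, 6, 4, 5, 4, 4, 4, 4, 4]
high = [6, 6, 7, 8, 7, 5, 6, 6, 7, 8]

# Pairwise comparisons of all group means with a family-wise error correction
result = tukey_hsd(low, med, high)

labels = ['low', 'med', 'high']
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        # result.pvalue[i, j] is the adjusted p-value for group i versus group j
        p_adj = result.pvalue[i, j]
        verdict = 'different' if p_adj < 0.05 else 'not clearly different'
        print(f"{labels[i]} vs {labels[j]}: p = {p_adj:.4f} ({verdict})")
```

Each printed p-value is already adjusted for the fact that several pairs are compared at once, so each can be compared directly with 0.05.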

### Repeated Measures ANOVA

A repeated measures ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more groups in which the same subjects show up in each group.

A repeated measures ANOVA uses the following null and alternative hypotheses:

- The null hypothesis (H0): µ1 = µ2 = µ3 (the population means are all equal)

- The alternative hypothesis (Ha): at least one population mean is different from the rest

#### Assumptions of Repeated Measures ANOVA

1. Independence: Observations from different subjects should be independent of one another.

2. Normality: The response variable should be approximately normally distributed.

3. Sphericity: The variances of the differences between all combinations of related groups must be equal.

### One-way Repeated Measures ANOVA

Case Statement
>
>Researchers want to know if four different drugs lead to different reaction times. To test this, they measure the reaction time of five patients on the four different drugs.
>
>Since each patient is measured on each of the four drugs, we will use a repeated measures ANOVA to determine if the mean reaction time differs between drugs.

In [9]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

#create data
df = pd.DataFrame({'patient': np.repeat([1, 2, 3, 4, 5], 4),
                   'drug': np.tile([1, 2, 3, 4], 5),
                   'response': [30, 28, 16, 34,
                                14, 18, 10, 22,
                                24, 20, 18, 30,
                                38, 34, 20, 44, 
                                26, 28, 14, 30]})

#view first ten rows of data 
df.head(10)

#perform the repeated measures ANOVA
print(AnovaRM(data=df, depvar='response', subject='patient', within=['drug']).fit())


              Anova
     F Value Num DF  Den DF Pr > F
----------------------------------
drug 24.7589 3.0000 12.0000 0.0000



Results Interpretation
>
>The F test statistic is 24.7589 and the corresponding p-value is reported as 0.0000, which means it is below 0.0001 after rounding.
>
>Since this p-value is less than 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference in mean response times between the four drugs.
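
The F value in the table can also be reproduced by hand, which shows how a repeated measures ANOVA removes the between-patient variation from the error term. The following is a minimal sketch, assuming NumPy and SciPy are available. The array `scores` and the `ss_*` names are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy.stats import f

# Reaction times: one row per patient, one column per drug (same data as above)
scores = np.array([[30, 28, 16, 34],
                   [14, 18, 10, 22],
                   [24, 20, 18, 30],
                   [38, 34, 20, 44],
                   [26, 28, 14, 30]], dtype=float)

n, k = scores.shape            # n patients, k drugs
grand_mean = scores.mean()

# Variation between drug means (the effect of interest)
ss_drug = n * ((scores.mean(axis=0) - grand_mean) ** 2).sum()

# Variation between patient means (removed from the error term)
ss_subject = k * ((scores.mean(axis=1) - grand_mean) ** 2).sum()

# Whatever is left after removing drug and patient effects is the error
ss_total = ((scores - grand_mean) ** 2).sum()
ss_error = ss_total - ss_drug - ss_subject

df_drug = k - 1
df_error = (k - 1) * (n - 1)

f_value = (ss_drug / df_drug) / (ss_error / df_error)
p_value = f.sf(f_value, df_drug, df_error)

print(f"F = {f_value:.4f}, Num DF = {df_drug}, Den DF = {df_error}, p = {p_value:.6f}")
```

Because each patient serves as their own control, differences between patients do not inflate the error term, which is what makes this design more sensitive than a one-way ANOVA on the same numbers.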

#### What to Do if the Sphericity Assumption is Violated

If we reject the null hypothesis of Mauchly’s test of sphericity, then we typically apply a correction to the degrees of freedom used to calculate the F-value in the repeated measures ANOVA table.

There are three corrections we can apply:

- Huynh-Feldt (least conservative)
- Greenhouse–Geisser
- Lower-bound (most conservative)

Each of these corrections tends to increase the p-values in the output table of the repeated measures ANOVA to account for the fact that the assumption of sphericity is violated.

We can then use these p-values to determine if we should reject or fail to reject the null hypothesis of the repeated measures ANOVA.
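
As an illustration of how such a correction works, the following is a minimal sketch of the Greenhouse–Geisser and lower-bound corrections for the drug data, assuming NumPy and SciPy are available. The names `eps_gg` and `eps_lb` are illustrative, and the F value is copied from the table above rather than recomputed. Each correction multiplies both degrees of freedom by a factor epsilon between 1/(k-1) and 1 before the p-value is looked up.

```python
import numpy as np
from scipy.stats import f

# Reaction times: one row per patient, one column per drug (same data as above)
scores = np.array([[30, 28, 16, 34],
                   [14, 18, 10, 22],
                   [24, 20, 18, 30],
                   [38, 34, 20, 44],
                   [26, 28, 14, 30]], dtype=float)
n, k = scores.shape

# Sample covariance matrix of the k drug conditions
cov = np.cov(scores, rowvar=False)

# Double-centre the covariance matrix (remove row and column means)
centering = np.eye(k) - np.ones((k, k)) / k
cov_c = centering @ cov @ centering

# Greenhouse-Geisser epsilon: 1 under perfect sphericity, smaller otherwise
eps_gg = np.trace(cov_c) ** 2 / ((k - 1) * (cov_c ** 2).sum())

# Lower-bound epsilon: the most conservative possible value
eps_lb = 1 / (k - 1)

f_value = 24.7589                      # F statistic reported in the table above
df1, df2 = k - 1, (k - 1) * (n - 1)    # uncorrected degrees of freedom

# The F value stays the same; only the degrees of freedom shrink
p_uncorrected = f.sf(f_value, df1, df2)
p_gg = f.sf(f_value, eps_gg * df1, eps_gg * df2)
p_lb = f.sf(f_value, eps_lb * df1, eps_lb * df2)

print(f"GG epsilon = {eps_gg:.3f}")
print(f"p uncorrected = {p_uncorrected:.6f}, p GG = {p_gg:.6f}, p lower-bound = {p_lb:.6f}")
```

With only five patients the estimate of epsilon is itself noisy, so this should be read as a demonstration of the mechanics rather than a definitive analysis.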