# Analysis of Variance (ANOVA)

## ANOVA - Analysis of Variance
- Compares the means of three or more groups of data.
- Tests whether there is a **statistically significant difference** between the means of three or more groups.
- Assumes the data in each group are **normally distributed** and have **equal variances**.

### One-way ANOVA
- Compares the means of three or more groups of data defined by **one independent** variable (factor).
- Within each group there should be at least three observations.

### Two-way ANOVA
- Compares group means when the groups are defined by **two independent** variables (factors).

### Assumptions

- Observations in each sample are independent and identically distributed (iid).
- Observations in each sample are normally distributed.
- Observations in each sample have the same variance.
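
The normality and equal-variance assumptions can be checked before running the test. The sketch below shows one common approach and is not part of the original analysis: it applies `scipy.stats.shapiro` and `scipy.stats.levene` to synthetic groups (the data, seed, and group means are made up for illustration).

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration): three groups drawn
# from normal distributions with the same standard deviation.
rng = np.random.default_rng(0)
groups = {
    'Low': rng.normal(loc=70, scale=5, size=30),
    'Moderate': rng.normal(loc=72, scale=5, size=30),
    'High': rng.normal(loc=75, scale=5, size=30),
}

# Shapiro-Wilk: H0 is that the sample comes from a normal distribution.
# The result is a (statistic, p-value) pair; index 1 is the p-value.
shapiro_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Levene: H0 is that all groups have equal variances.
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f'{name}: Shapiro-Wilk p-value = {p:.3f}')
print(f'Levene p-value = {levene_p:.3f}')
```

A p-value below the chosen alpha in either test suggests the corresponding assumption may not hold for that data.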

### Interpretation

- H0: the means of all samples are equal.
- Ha: at least one sample mean differs from the others.
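
The hypotheses above are tested with an F-statistic, the ratio of between-group variance to within-group variance. As a minimal sketch on made-up numbers (the three small groups below are not from the pulse data), the statistic can be computed by hand with NumPy:

```python
import numpy as np

# Toy groups, invented for illustration.
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([2.0, 3.0, 4.0]),
          np.array([5.0, 6.0, 7.0])]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: spread of group means around the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)        # mean square between groups
ms_within = ss_within / (n - k)          # mean square within groups
f_stat = ms_between / ms_within

print(f'F = {f_stat:.2f}')  # F = 13.00
```

A large F means the group means vary more than the within-group noise would explain. `stats.f_oneway` returns the same statistic together with its p-value.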


In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
from scipy import stats
import researchpy as rp
sns.set(font_scale=2, palette="viridis")

In [None]:
data = pd.read_csv('../data/pulse_data.csv')
data.head() 

In [None]:
data.shape

In [None]:
data['Exercise'].unique()

In [None]:
data.groupby('Exercise')['Pulse1'].describe().T

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(data=data, x='Exercise', y='Pulse1')
plt.show() 

### One-way ANOVA with `scipy.stats`

In [None]:
stats.f_oneway(data['Height'][data['Exercise'] == 'Low'],
               data['Height'][data['Exercise'] == 'Moderate'],
               data['Height'][data['Exercise'] == 'High'])

In [None]:
stat, p_value = stats.f_oneway(
    data['Height'][data['Exercise'] == 'Low'],
    data['Height'][data['Exercise'] == 'Moderate'],
    data['Height'][data['Exercise'] == 'High'])

print(f'statistic = {stat}, p-value = {p_value}')
# interpret the p-value against the significance level
alpha = 0.05
if p_value > alpha:
    print('No evidence that the sample means differ (fail to reject H0, not significant)')
else:
    print('At least one sample mean differs (reject H0, significant)')

In [None]:
import statsmodels.api as sm 
from statsmodels.formula.api import ols

In [None]:
model = ols('Height ~ Exercise', data=data).fit()
anova_result = sm.stats.anova_lm(model, typ=2)
print(anova_result)

## Tukey's Honestly Significant Difference (HSD)
The test compares every pair of groups to find out which specific group means differ from each other.

In [None]:
from statsmodels.stats.multicomp import MultiComparison 
mul_com = MultiComparison(data['Height'], data['Exercise'])
mul_result = mul_com.tukeyhsd()
print(mul_result)

In [None]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd 
tukey = pairwise_tukeyhsd(data['Height'], data['Exercise'], alpha=0.05)
# print summary 
tukey.summary()

In [None]:
# plot 
tukey.plot_simultaneous()
plt.vlines(x = 20, ymin=0.5, ymax=4.5)
plt.show() 

### Two-way ANOVA with `statsmodels`
https://www.statsmodels.org/stable/examples/notebooks/generated/interactions_anova.html

In [None]:
rp.summary_cont(data.groupby('Exercise'))['Pulse1']

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(data=data, x='Exercise', y = 'Pulse1')
plt.show() 

In [None]:
rp.summary_cont(data.groupby('Exercise'))['Pulse2']

In [None]:
plt.figure(figsize=(12,8))
sns.boxplot(data=data, x='Exercise', y = 'Pulse2')
plt.show() 

In [None]:
model = ols('Pulse1 ~ C(Exercise) + C(BMICat)', data=data).fit()
anova_result = sm.stats.anova_lm(model)
print(anova_result)