# ANOVA

In [1]:
import sweepystats as sw
import numpy as np
import pandas as pd

## 1-way ANOVA

Suppose we are given an example data set, and we want to know:

> Do samples in different `Group` have different `Outcome`s?

In [2]:
df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]), 
})
df

   Outcome Group
0      3.6     A
1      3.5     A
2      4.2     B
3      2.7     B
4      4.1     A
5      5.2     C
6      3.0     B
7      4.8     C
8      4.0     C


Statistically, we want to test whether the group means (i.e. categories A vs B vs C) differ. The null hypothesis is $\mu_A = \mu_B = \mu_C$. For this, we can conduct a 1-way ANOVA.

`Sweepystats` accepts a patsy [formula](https://patsy.readthedocs.io/en/latest/formulas.html) to specify the response (left of `~`) and the variables being tested (right of `~`).
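To see what such a formula expands to, the sketch below builds the design matrix with patsy directly. It assumes patsy is installed and uses patsy's default treatment coding; whether `sweepystats` constructs exactly the same matrix internally is an assumption, not something stated here.

```python
from patsy import dmatrices

# Same data as above, passed as a plain dict (patsy treats string columns as categorical)
data = {
    "Outcome": [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    "Group": ["A", "A", "B", "B", "A", "C", "B", "C", "C"],
}

# Left of "~" becomes the response, right of "~" becomes the design matrix
y, X = dmatrices("Outcome ~ Group", data)

# Column names of the encoded design matrix
column_names = X.design_info.column_names
print(column_names)
print(X.shape)
```

With the default coding, group `A` acts as the reference level, so the matrix has an intercept column plus one indicator column each for `B` and `C`.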

In [3]:
formula = "Outcome ~ Group"
one_way = sw.ANOVA(df, formula)
one_way.fit()

100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 7341.26it/s]


The F-statistic and p-value can be extracted as:

In [5]:
f_stat, pval = one_way.f_test("Group")
f_stat, pval

(np.float64(3.96686746987947), np.float64(0.07984562357182826))

Since the p-value (about 0.080) exceeds $\alpha = 0.05$, we fail to reject the null hypothesis: there is no statistically significant evidence that the group means differ.
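For intuition, the same F-statistic can be reproduced by hand from the between-group and within-group sums of squares. This is a minimal sketch using only NumPy; the variable names are illustrative. The p-value line relies on a closed form for the F survival function that holds only when the numerator degrees of freedom equal 2, which happens to be the case here with 3 groups.

```python
import numpy as np

outcome = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = np.array(["A", "A", "B", "B", "A", "C", "B", "C", "C"])

labels = np.unique(group)
n, k = outcome.size, labels.size
grand_mean = outcome.mean()

# Between-group sum of squares: group size times squared distance
# of each group mean from the grand mean
ss_between = sum(
    (group == g).sum() * (outcome[group == g].mean() - grand_mean) ** 2
    for g in labels
)

# Within-group sum of squares: squared deviations from each group's own mean
ss_within = sum(
    ((outcome[group == g] - outcome[group == g].mean()) ** 2).sum()
    for g in labels
)

df_between, df_within = k - 1, n - k
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Special case: for an F(2, d) distribution, P(F > x) = (1 + 2x/d)^(-d/2)
p_manual = (1 + 2 * f_manual / df_within) ** (-df_within / 2)

print(f_manual, p_manual)  # approximately 3.9669 and 0.0798
```

For a general number of groups, a library routine for the F distribution would be needed in place of the closed-form p-value.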

### Check answer is correct

We can check that the answer obtained via the sweep operator is correct using the `statsmodels` package:

In [6]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Outcome ~ Group', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=3)  # Type III ANOVA
anova_table

              sum_sq   df           F    PR(>F)
Intercept  41.813333  1.0  113.349398  0.000040
Group       2.926667  2.0    3.966867  0.079846
Residual    2.213333  6.0         NaN       NaN


## $k$-way ANOVA

Now suppose we have another covariate `Factor` that was measured, and we want to know:

> Do samples in different `Group` and `Factor` have different `Outcome`s?

In [7]:
df = pd.DataFrame({
    'Outcome': [3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0],
    'Group': pd.Categorical(["A", "A", "B", "B", "A", "C", "B", "C", "C"]), 
    'Factor': pd.Categorical(["X", "X", "Y", "X", "Y", "Y", "X", "Y", "X"])
})
df

   Outcome Group Factor
0      3.6     A      X
1      3.5     A      X
2      4.2     B      Y
3      2.7     B      X
4      4.1     A      Y
5      5.2     C      Y
6      3.0     B      X
7      4.8     C      Y
8      4.0     C      X


We previously saw, using 1-way ANOVA, that `Group` alone is not significant. Let's additionally adjust for `Factor` and the interaction effect between `Group` and `Factor`.

In [8]:
formula = "Outcome ~ Group + Factor + Group:Factor"
two_way = sw.ANOVA(df, formula)
two_way.fit()

100%|████████████████████████████████████████████| 6/6 [00:00<00:00, 6572.43it/s]


Now, we can test for significance of `Group`, `Factor`, and their interaction using an F-test. For example,

In [9]:
# test for Group variable
f_stat, pval = two_way.f_test("Group")
f_stat, pval

(np.float64(11.561538461537321), np.float64(0.03891754069189004))

In [10]:
# test for interaction 
f_stat, pval = two_way.f_test("Group:Factor")
f_stat, pval

(np.float64(2.474358974358741), np.float64(0.2318655632501541))

Note that in each of these tests, we are **NOT** refitting the reduced model internally: we simply *sweep* the (one-hot encoded) variable out of the full model!
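To illustrate the idea, here is a minimal sketch of the textbook sweep operator applied to the 1-way example from earlier. The helper `sweep`, the treatment-coded design matrix, and the variable names are all assumptions made for illustration; this is not `sweepystats`' actual implementation. Sweeping the augmented Gram matrix on every design column leaves the residual sum of squares in the bottom-right corner, and inverse-sweeping the `Group` columns yields the reduced model's residual sum of squares without a second fit.

```python
import numpy as np

def sweep(A, k, inverse=False):
    """Return a copy of the symmetric matrix A swept (or inverse-swept) on pivot k."""
    A = np.array(A, dtype=float)  # np.array copies by default
    d = A[k, k]
    row, col = A[k, :].copy(), A[:, k].copy()
    A -= np.outer(col, row) / d           # A_ij - A_ik * A_kj / A_kk
    sign = -1.0 if inverse else 1.0
    A[k, :] = sign * row / d              # pivot row
    A[:, k] = sign * col / d              # pivot column
    A[k, k] = -1.0 / d                    # pivot element
    return A

y = np.array([3.6, 3.5, 4.2, 2.7, 4.1, 5.2, 3.0, 4.8, 4.0])
group = np.array(["A", "A", "B", "B", "A", "C", "B", "C", "C"])

# Assumed encoding: intercept plus indicators for B and C (A is the reference)
X = np.column_stack([np.ones(y.size), group == "B", group == "C"]).astype(float)

# Augmented Gram matrix [[X'X, X'y], [y'X, y'y]]
Z = np.column_stack([X, y])
G = Z.T @ Z

# Sweep in all three design columns: full model
for k in range(3):
    G = sweep(G, k)
rss_full = G[3, 3]

# Sweep the two Group columns back out: intercept-only model, no refit
G_reduced = G
for k in (1, 2):
    G_reduced = sweep(G_reduced, k, inverse=True)
rss_reduced = G_reduced[3, 3]

# F-test comparing the reduced and full models (2 and 6 degrees of freedom)
f_sweep = ((rss_reduced - rss_full) / 2) / (rss_full / 6)
print(rss_full, rss_reduced, f_sweep)  # approximately 2.2133, 5.14, 3.9669
```

After the full sweep, the last column of `G` also holds the least-squares coefficients, so a single swept matrix carries everything needed for both the fit and the nested-model comparison.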

### Check answer is correct

Again, we can check that the answer is correct using the `statsmodels` package:

In [11]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Fit the model
model = ols('Outcome ~ Group + Factor + Group:Factor', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=3)  # Type III ANOVA (note: use type 2 if no interaction term)
print(anova_table)

                 sum_sq   df           F    PR(>F)
Intercept     25.205000  1.0  581.653846  0.000156
Group          1.002000  2.0   11.561538  0.038918
Factor         0.201667  1.0    4.653846  0.119883
Group:Factor   0.214444  2.0    2.474359  0.231866
Residual       0.130000  3.0         NaN       NaN
