# Two-Way ANOVA
<a id = "top"></a>
A two-way ANOVA allows us to compare the mean of a numerical variable between groups of TWO categorical variables AND assess if there's an interaction effect (combination effect from the two predictors).  This is the first time in the course that we will be able to have more than one predictor that influences our outcome.

- [Two-way ANOVA basics](#basics)
    - [Graphically looking at differences in group means](#graph)
    - [Decomposing Variance](#decomp)
    - [Conducting the F-test (ANOVA)](#ftest)
    - [Interpreting the Result](#interp)
- [Assumptions](#assump)
    - [Basic Assumptions](#baseassump)
    - [Homogeneity of Variances (Levene's Test)](#homovar)
    - [Normality of Residuals (QQ Plots)](#resid)
- [Post-hoc Pairwise Comparisons](#posthoc)
    - [Tukey's Honest Significant Difference (HSD)](#tukey)
- [Effect Size](#effect)
    - [R-squared](#rsq)
    - [Cohen's $f$](#cohen)

In [None]:
library(tidyverse)
library(magrittr)
library(ggpubr) # contains line/dot plot for visualizing means
library(DescTools) # contains levene's test function
library(pwr) # for power analysis
library(scales) ## for scaling functions for ggplot2
library(gridExtra) # for side-by-side plots

options(repr.plot.width=14, repr.plot.height=5) ## set options for plot size within the notebook -
# this is only for jupyter notebooks, you can disregard this.

<a id = "basics"></a>
## Two-way ANOVA Basics
For this in-class example we're going to use a small example dataset built into R - "warpbreaks."  
https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/warpbreaks.html

We have three variables that describe weaving on 54 looms - 9 looms in each combo of wool and tension:
- `breaks` (numeric): the number of breaks
- `wool` (factor): the type of wool (A or B)
- `tension` (factor): the level of tension (L, M, H)

We will now be able to see if the type of tension (low, medium, or high) and/or the type of wool (A vs. B) have an effect on the number of breaks.  So we're looking to see if the mean number of breaks differs significantly by loom tension and type of wool used.

In [None]:
# basic summary statistics
data(warpbreaks)
summary(warpbreaks)
warpbreaks %$% table(wool, tension) ## new pipe operator - http://www.deeplytrivial.com/2020/04/e-is-for-exposition-pipe.html

[Return to Top](#top)
<a id = "graph"></a>
### Graphically looking at differences in group means.
We should always begin by visualizing the distribution of warp breaks within the groups of each predictor: tension (low, medium, and high) and wool (A and B).  There are a number of ways that we can look at the difference in distributions.  Here I'm going to look at two - density curve and boxplot.

In [None]:
## density plot
d1 <- warpbreaks %>%
  ggplot( aes(x=breaks, fill=tension)) +
    geom_density(alpha=0.6) +
    scale_fill_manual(values=c("#26d5b8", "#ff5733", "magenta")) +
    labs(fill= "Tension",
         y = "Density",
         x = "Warp Breaks",
         title = "Distribution of warp breaks by loom tension")

d2 <- warpbreaks %>%
  ggplot( aes(x=breaks, fill=wool)) +
    geom_density(alpha=0.6) +
    scale_fill_manual(values=c("#26d5b8", "#ff5733")) +
    labs(fill= "Wool",
         y = "Density",
         x = "Warp Breaks",
         title = "Distribution of warp breaks by type of wool")

grid.arrange(d1, d2, ncol = 2)

In [None]:
#boxplot
boxp <- warpbreaks %>%
  ggplot( aes(y=breaks, x=tension, fill=tension)) +
    geom_boxplot() +
    stat_summary(fun = mean, geom = "errorbar", aes(ymax = after_stat(y), ymin = after_stat(y), color = "mean"),
                 width = 0.75, linetype = "solid", size = 2) +
    stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), 
                geom="errorbar", color="#39ff14", width=0.2, size = 1) +
    scale_fill_manual(values=c("#26d5b8", "#ff5733", "magenta")) +
    scale_color_manual(values = "#39ff14")+
    labs(fill= "Tension",
         y = "Warp Breaks",
         x = "Loom Tension",
         color = "Group Mean",
         title = "Distribution of warp breaks by loom tension")

boxp2 <- warpbreaks %>%
  ggplot( aes(y=breaks, x=wool, fill=wool)) +
    geom_boxplot() +
    stat_summary(fun = mean, geom = "errorbar", aes(ymax = after_stat(y), ymin = after_stat(y), color = "mean"),
                 width = 0.75, linetype = "solid", size = 2) +
    stat_summary(fun.data=mean_sdl, fun.args = list(mult=1), 
                geom="errorbar", color="#39ff14", width=0.2, size = 1) +
    scale_fill_manual(values=c("#26d5b8", "#ff5733")) +
    scale_color_manual(values = "#39ff14")+
    labs(fill= "Wool",
         y = "Warp Breaks",
         x = "Type of Wool",
         color = "Group Mean",
         title = "Distribution of warp breaks by type of wool")

grid.arrange(boxp, boxp2, ncol = 2)

We're already familiar with the distribution of warp breaks by tension, but now we are adding wool into the mix.  Wool A has a more normal distribution of breaks than Wool B, although both have positive skew.  The boxplot showing breaks by the type of wool shows that the medians are very similar; however, the means are not.  The spread (IQR, variance) of the distribution within Wool A also appears to be larger than within Wool B.

For our Two-Way ANOVA we'll want to look at one more graph - a line plot of the means that will allow us to visualize any possible interactions:

In [None]:
bold.14.text <- element_text(face = "bold", size = 14)

int1 <- warpbreaks %>% 
  ggplot(aes(x = wool, color = tension, group = tension, y = breaks)) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  stat_summary(fun = mean, geom = "line", size = 2) +
  labs(title = "Average Breaks by Wool Type and Tension") +
  theme(text = bold.14.text) 
int2 <- warpbreaks %>% 
  ggplot(aes(x = tension, color = wool, group = wool, y = breaks)) +
  stat_summary(fun = mean, geom = "point", size = 5) +
  stat_summary(fun = mean, geom = "line", size = 2) +
  labs(title = "Average Breaks by Tension and Wool Type") +
  theme(text = bold.14.text)
grid.arrange(int1, int2, ncol = 2)

What interactions show us is any potential combined influence of the two predictors.  For example, in this graph, there appears to be an interaction between wool and tension in their influence on the number of breaks.  A plot can only suggest an interaction; whether it is statistically significant is decided by the F-test.

What is the visual cue of an interaction effect?   If there is no interaction, the lines between the means within each group would be parallel.  If they are not parallel, that suggests there might be an interaction.  If they cross, that's a strong suggestion that an interaction exists.

We can interpret it from two directions:

- How well the wool performs depends on the tension of the loom.
- The effect of loom tension depends on the type of wool.

Comparing Wool A to Wool B 
- In Wool A - medium and high tension perform similarly and low tension leads to an extreme amount of breaks
- In Wool B - high tension yields an extremely low number of breaks, low and medium tension perform similarly.

Within Tension Groups
- Low Tension performs better with Wool B
- Medium Tension performs better with Wool A
- High Tension performs better with Wool B
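
The visual reading above can be double-checked numerically. Here is a minimal sketch in base R that tabulates the six cell means and the Wool A minus Wool B gap at each tension; the names `cell_means` and `gap` are just illustrative. If the lines were parallel, the gap would be roughly the same at every tension.

```r
data(warpbreaks)

# mean breaks in each wool x tension cell (rows = wool, columns = tension)
cell_means <- with(warpbreaks,
                   tapply(breaks, list(wool = wool, tension = tension), mean))
round(cell_means, 2)

# Wool A minus Wool B at each tension level;
# a gap that changes size (or sign) across tensions is the non-parallel pattern
gap <- cell_means["A", ] - cell_means["B", ]
round(gap, 2)
```

A gap that flips sign from one tension to the next corresponds to lines that cross in the plot.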

[Return to Top](#top)
<a id = "decomp"></a>
### Decomposing Variance
The first step in conducting an ANOVA analysis is calculating the sum of squares.  For a two-way ANOVA with an interaction term we decompose the total sum of squares (SST) into 4 pieces:
- The group SS for the first categorical variable
- The group SS for the second categorical variable
- The SS for the interaction of the two categorical predictors
- The Residual SS ("whatever's left over")

![](slide_42.jpg)

Where these are the definitions for the terms:
![](defs.PNG)

First, let's calculate the total sum of squares (SST):
![](SST.PNG)

In [None]:
overallmean <- mean(warpbreaks$breaks)
warpbreaks %<>% mutate(SS = (breaks - overallmean)^2)
SST <- sum(warpbreaks$SS)
SST

Now that we know SST, once we calculate all but one of the other sums of squares, we can use subtraction to find the last one without manual calculation.

Let's move next to the SS_wool (Factor A in the above table):
![](SSA.PNG)

Here c is the number of levels of the OTHER categorical variable (tension, 3 levels) and n' is the sample size within each cell of the two-way table of wool x tension (9).

In [None]:
fact_a <- warpbreaks %>% group_by(wool) %>% 
                         summarize(group_mean = mean(breaks)) %>% 
                         mutate(SS = (group_mean - overallmean)^2)

fact_a
SSA <- 3*9*(sum(fact_a$SS))
SSA

We calculate the SS for the second predictor (tension) in the same manner.  The n' is still 9 and we multiply by the number of groups in the OTHER variable (wool has 2 groups).

![](SSB.PNG)

In [None]:
fact_b <- warpbreaks %>% group_by(tension) %>% 
                         summarize(group_mean = mean(breaks)) %>% 
                         mutate(SS = (group_mean - overallmean)^2)

fact_b
SSB <- 2*9*(sum(fact_b$SS))
SSB

We have two more sums of squares to calculate, one for the interaction and one for the residuals.

![](SSABE.PNG)

Calculating the sum of squares for the interaction term is a bit harder.  So let's calculate SSE and then get SSAB by subtracting all of our calculated sums of squares from the total (SST).
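
For reference, SSAB can also be computed directly. The sketch below assumes the standard balanced-design formula (cell mean minus row mean minus column mean plus grand mean, squared, summed, and multiplied by the cell size of 9); the object names are illustrative and it uses base R rather than the tidyverse pipeline used elsewhere in this notebook.

```r
data(warpbreaks)

grand  <- mean(warpbreaks$breaks)
# 2 x 3 matrix of cell means (rows = wool, columns = tension)
cell   <- tapply(warpbreaks$breaks,
                 list(warpbreaks$wool, warpbreaks$tension), mean)
wool_m <- rowMeans(cell)   # wool means (valid because the design is balanced)
tens_m <- colMeans(cell)   # tension means
n_cell <- 9                # looms per wool x tension cell

# interaction deviation for each cell: cell - wool mean - tension mean + grand mean
inter_dev   <- cell - outer(wool_m, tens_m, "+") + grand
SSAB_direct <- n_cell * sum(inter_dev^2)
SSAB_direct
```

This should agree with the value obtained by subtraction, which is a handy check on the arithmetic.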

In [None]:
resid <- warpbreaks %>% group_by(wool, tension) %>% 
                        mutate(cellmean = mean(breaks)) %>% 
                        ungroup() %>% 
                        mutate(SS = (breaks - cellmean)^2)
head(resid)
SSE <- sum(resid$SS)
SSE


So, we know SST, SSA, SSB, and SSE.  We still need SSAB, but since 

#### SST = SSA + SSB + SSAB + SSE

we can calculate SSAB:

In [None]:
SSAB <- SST - SSA - SSB - SSE
SSAB

So we were able to calculate all of the sums of squares.  I'm not going to build the entire ANOVA table, but let's quickly visit degrees of freedom:

![](dof.PNG)
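
As a quick sketch of the degrees of freedom for this dataset, assuming the usual two-way rules (levels minus one for each main effect, their product for the interaction, and N minus the number of cells for the residuals); the vector name `dof` is illustrative.

```r
data(warpbreaks)

a <- nlevels(warpbreaks$wool)     # levels of wool
b <- nlevels(warpbreaks$tension)  # levels of tension
N <- nrow(warpbreaks)             # total observations

dof <- c(wool        = a - 1,
         tension     = b - 1,
         interaction = (a - 1) * (b - 1),
         residual    = N - a * b,
         total       = N - 1)
dof
```

The first four entries add up to the total, mirroring the way the sums of squares add up to SST.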

[Return to Top](#top)
<a id = "ftest"></a>

### Conducting the F-test (ANOVA)
Recall that an ANOVA table looks something like this: 
![](slide_42.jpg)

Instead of calculating all the pieces, we're just going to run `aov()`

In [None]:
summary(aov(breaks ~ wool*tension, data = warpbreaks))

The mean sum of squares for each predictor has been calculated by dividing SS for that predictor by the degrees of freedom for that predictor.  

Then each of those MS values is divided by the residual MS to calculate an F-ratio for EACH term in the model.  With two categorical variables we get two "main effects", each with its own F-ratio, and one "interaction effect" with a third F-ratio.
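
To make that arithmetic concrete, here is a small sketch that rebuilds the F-ratios and p-values from the SS and df columns of the ANOVA table. It uses `anova()` on the fitted `aov` object to get the table as a data frame; the names `ms`, `f_ratio`, and `p_val` are illustrative.

```r
data(warpbreaks)

fit <- aov(breaks ~ wool * tension, data = warpbreaks)
tab <- anova(fit)   # rows: wool, tension, wool:tension, Residuals

ms      <- tab[["Sum Sq"]] / tab[["Df"]]   # mean squares = SS / df
k       <- length(ms)                      # residual row is the last one
f_ratio <- ms[-k] / ms[k]                  # each effect MS over residual MS
p_val   <- pf(f_ratio, tab[["Df"]][-k], tab[["Df"]][k], lower.tail = FALSE)

data.frame(term = rownames(tab)[-k], F = f_ratio, p = p_val)
```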


[Return to Top](#top)
<a id = "interp"></a>

### Interpreting the Result

We have three F-tests and three p-values to interpret.

1. The main effect of wool.  Setting everything else aside, the p-value for the effect of wool on breaks is 0.058, which is slightly above an alpha of 0.05, indicating that wool is not a significant predictor of breaks (_on its own..._)

2. The main effect of tension.  Setting wool and the interaction aside, the p-value for the effect of tension on breaks is 0.0007, which is below an alpha of 0.05.  Tension is a significant predictor of breaks, on its own.

3. The interaction effect.  Setting aside the main effects, the interaction of wool and tension is significant with a p-value of 0.02.  This means that while wool may not be significant on its own, the way that wool type interacts with tension significantly influences breaks.

Again - we just know that at least one mean is significantly different from the others - and not which specific pairs are different.

 [Return to Top](#top)
<a id = "assump"></a>

## Assumptions
As with any statistical test, we have a number of assumptions that we have to fulfill in order for the test to be fully valid.  Let's review the ANOVA assumptions.

<a id = "baseassump"></a>
The basic ones we don't need to test:
1. Dependent variable is numeric - **breaks is numeric, so we're good here!**
2. Group sample sizes are approximately equal - **in this case they're all exactly the same size**
3. Independence of observations - **each loom is randomly assigned to wool/tension combos**
4. No extreme outliers - **we can review the boxplot - nothing seems too extreme**

<a id = "homovar"></a>
Pre-check item:
5. Homogeneity of variance - the within group variance for each of the groups should be equal.  This is the same assumption we had for our t-test, but in that case we only had two groups, now we have more.  **We need to test this using Levene's Test.  This is typically reviewed PRIOR to conducting your ANOVA analysis.**

Because we have two categorical predictors, we have to run Levene's Test on both.

In [None]:
#LeveneTest(DV ~ IV, data = your data frame)

LeveneTest(breaks ~ tension, data = warpbreaks)
LeveneTest(breaks ~ wool, data = warpbreaks)

For both variables our p-value is greater than alpha, so we fail to reject the null hypothesis of equal variances: there is no significant difference in the within-group variances for either wool or tension.

**IMPORTANT**: This is a test of variance, but this is **NOT** the ANOVA test.  This just tests the assumption of homogeneity of variances.  We cannot use these results to make inference about our means.
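
To see what Levene's Test is doing under the hood, here is a hedged base R sketch of the median-centered version (often called Brown-Forsythe): take each observation's absolute deviation from its group median, then run a one-way ANOVA on those deviations. The helper `levene_by_hand()` is hypothetical, written only for illustration. Applying it to the six wool x tension cells is an extra check that goes beyond the two tests run above.

```r
data(warpbreaks)

# hypothetical helper: median-centered Levene test via a one-way ANOVA
levene_by_hand <- function(y, group) {
  z <- abs(y - ave(y, group, FUN = median))  # absolute deviation from group median
  anova(lm(z ~ group))                       # F-test on the deviations
}

# same grouping as the tension test above
lev_tension <- levene_by_hand(warpbreaks$breaks, warpbreaks$tension)
lev_tension

# optional extra: treat each wool x tension cell as its own group
cells     <- interaction(warpbreaks$wool, warpbreaks$tension)
lev_cells <- levene_by_hand(warpbreaks$breaks, cells)
lev_cells
```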

<a id = "resid"></a>

So, we have one final assumption:

6. Normality of _**residuals**_

Here we don't care about the normality of our observations (breaks); we care about the normality of the residuals (the difference between each observation of breaks and the group mean for that observation).  These residuals should be normally distributed.  Because the residuals are calculated as a result of conducting the analysis, we cannot check this assumption until after we conduct our ANOVA test.  So this is considered a "post-hoc" check of assumptions.

**We check this via a QQ plot of the _residuals_ from our data.**

Now our residuals are whatever's left over after we remove the variance due to each main effect and the interaction effect.
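
In a model with both main effects and the interaction, "whatever's left over" works out to each observation minus the mean of its own wool x tension cell. A minimal sketch to confirm that, with illustrative names:

```r
data(warpbreaks)

fit <- aov(breaks ~ wool * tension, data = warpbreaks)

# mean of each observation's own wool x tension cell
cell_mean    <- ave(warpbreaks$breaks, warpbreaks$wool, warpbreaks$tension)
manual_resid <- warpbreaks$breaks - cell_mean

# compare with the residuals stored in the aov object
all.equal(unname(residuals(fit)), manual_resid)
```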

In [None]:
## we can get the residuals from an aov object (the results of running aov)
twoway_aov <- aov(breaks ~ tension*wool, data = warpbreaks)
resid_df <- data.frame(resid = twoway_aov$residuals) ## the residuals part of the aov results using $residuals

resid_df %>% ggplot(aes(sample = resid)) +
  geom_qq_line(color = "red", size = 1) +
  geom_qq(color = "black") +
  labs(title = "QQ Plot of Residuals")

Our residuals seem to be almost perfectly normal!

[Return to Top](#top)
<a id = "posthoc"></a>

## Post-hoc Pairwise Comparisons
We can still do pairwise comparisons.  We'll test pairwise differences for each grouping in our model - one set for the main effect of wool, one set for the main effect of tension, and one final set for the pairs of each interaction (it gets a bit confusing).

<a id = "tukey"></a>
### Tukey's Honest Significant Difference (HSD)
Like before, we just need to run TukeyHSD on our aov object.

In [None]:
#use the TukeyHSD function and pass it your saved aov object.

TukeyHSD(twoway_aov) # aov object created in the QQ plot section

Lots of stuff going on here!!!

TENSION: The pairwise differences for tension are what we saw before in the one way ANOVA.  Low tension is significantly different from both medium and high, but medium and high are not significantly different from each other.

WOOL: For wool, there are only two levels, so only 1 pairwise comparison.  We still need to use the adjustment for multiple comparisons because overall we are still doing many t-tests on the same data.  We get the same p-value we got for the F-test in the ANOVA table.

INTERACTION: Each tension:wool pair is compared to each other tension:wool pair - which gives us 15 pairwise comparisons!  Remember we have 6 tension:wool pairs (L:A, L:B, M:A, M:B, H:A, H:B).  Most are not significant, 5 are significant:

- M vs. L tension inside Wool A
- H vs. L tension inside Wool A
- Wool A vs. Wool B inside L tension
- Wool B at M tension vs. Wool A at L tension
- Wool B at H tension vs. Wool A at L tension
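
With 15 rows in the interaction table it is easy to misread one, so here is a small sketch that counts the comparisons and keeps only the rows with an adjusted p-value below 0.05. It refits the model with `wool * tension`, so the table is named `wool:tension` and cells are labeled wool first (for example `B:L-A:L`); the object names are illustrative.

```r
data(warpbreaks)

fit <- aov(breaks ~ wool * tension, data = warpbreaks)

# number of pairwise comparisons among the 6 wool x tension cells
n_cells <- nlevels(warpbreaks$wool) * nlevels(warpbreaks$tension)
choose(n_cells, 2)

# interaction comparisons, filtered to the significant ones
cell_tukey <- TukeyHSD(fit)[["wool:tension"]]
sig        <- cell_tukey[cell_tukey[, "p adj"] < 0.05, , drop = FALSE]
round(sig, 4)
```

Every row that survives the filter involves the Wool A, low-tension cell, which is the cell with by far the most breaks.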

Let's look at our plot of interactions again:

In [None]:
grid.arrange(int1, int2, ncol = 2)

[Return to Top](#top)
<a id = "effect"></a>

## Effect Size
A statistical analysis is not complete without addressing the substantive significance of a result.  Just because we find that our result is statistically significant doesn't mean that it is large enough to matter. For ANOVA and through the end of the semester we will be focusing on measures of effect size that tell us the strength of our IV in predicting our DV.  In other words, how much of the variance in our outcome is explained by our predictor variable(s).

In the case of ANOVA - how much of the variance in the numerical variable is explained by group/level of the categorical variable(s) and interactions OVERALL.  Is the main source of variation in the scores due to the different group means, or due to individual variations in scores (residuals)?

<a id = "rsq"></a>
### R-squared
Our effect size for ANOVA is the same r-squared statistic we looked at in two-sample t-tests.  The interpretation is the same, the formula for calculating it is different.  In the ANOVA situation r-squared is sometimes referred to as eta-squared, but it is the same thing.

We have two different main effects, and an interaction effect, so I'm going to use a shortcut to get the r-squared of the overall ANOVA analysis by running it as a linear model instead.

In [None]:
# calculate r-squared
wb_lm <- lm(breaks ~ wool * tension, data = warpbreaks)
rsq <- summary(wb_lm)$r.squared 
rsq # proportion
percent(rsq, accuracy = .01) # percentage

So the r-squared for the warpbreaks data shows that 38% of the variance in breaks can be explained by our model (the combination of the main effects of wool and tension, plus the interaction between wool and tension). 

This is a sizeable amount of variance explained (medium-ish, IMHO), and larger than the variance explained by just tension in the one-way model (22%)
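
The same r-squared can be recovered from the sums of squares, which ties it back to the variance decomposition. A minimal sketch, with illustrative names, that computes it as one minus the residual SS over the total SS:

```r
data(warpbreaks)

fit <- lm(breaks ~ wool * tension, data = warpbreaks)
tab <- anova(fit)

ss_total  <- sum(tab[["Sum Sq"]])          # SST = all effect SS + residual SS
ss_resid  <- tab["Residuals", "Sum Sq"]    # SSE
r2_manual <- 1 - ss_resid / ss_total

r2_manual
all.equal(r2_manual, summary(fit)$r.squared)
```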

However, if we want to know the effect of each of our predictors separately we can look at something called partial eta squared.

In [None]:
eta2 <- EtaSq(twoway_aov) ## give eta-squared the saved anova output
eta2 # print the entire eta-squared output

Here we see the component parts of the variance explained by each IV.  The first column, labeled eta.sq, is the SS of that IV divided by the total SS.  The partial eta-squared (eta.sq.part) is the SS of that IV over the SS of that IV plus just the SS of the residuals.  

# $eta.sq_{wool} = \frac{SS_{wool}}{SS_{total}}$   &nbsp;&nbsp;&nbsp;   but   &nbsp;&nbsp;  &nbsp; $eta.sq.part_{wool} = \frac{SS_{wool}}{SS_{wool} + SS_{resid}}$

This is the effect of the variable in isolation from the other predictors.  This value is not necessarily useful in a situation with a single analysis, but is useful in comparing the effect of a certain predictor across experiments where there are different IVs included in each.
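
Both columns can be rebuilt by hand from the ANOVA table, which makes the difference in denominators concrete. This is a sketch with illustrative names; because the design is balanced, it should line up with the `EtaSq()` output, though that is worth confirming against your own run.

```r
data(warpbreaks)

tab <- anova(aov(breaks ~ wool * tension, data = warpbreaks))
ss  <- setNames(tab[["Sum Sq"]], rownames(tab))

effects  <- ss[names(ss) != "Residuals"]   # SS for wool, tension, wool:tension
ss_resid <- ss[["Residuals"]]

eta_sq      <- effects / sum(ss)               # SS_effect / SS_total
eta_sq_part <- effects / (effects + ss_resid)  # SS_effect / (SS_effect + SS_resid)

round(cbind(eta.sq = eta_sq, eta.sq.part = eta_sq_part), 4)
```

Note that the eta.sq values sum to the overall r-squared, while the partial values do not, since each one uses a different denominator.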

[Return to Top](#top)
<a id = "cohen"></a>

### Cohen's $f$
The other effect size statistic we will use is Cohen's $f$.  Cohen's $f$ is primarily needed because it is the effect size used in power calculations.  R-squared is preferred for "interpretation" purposes.  Cohen's $f$ is calculated using the r-squared value.

## $f = \sqrt{\frac{r^2}{1 - r^2}}$

In [None]:
## calculate cohen's f using saved value of rsq
cohenf <- sqrt(rsq / (1-rsq))
cohenf

Cohen's f can be interpreted similarly to Cohen's d, but as a standardized average effect size across all the levels of the categorical variable.  But we will use r-squared moving forward in the class, so become familiar with it now.  It also has a more straightforward interpretation once you can wrap your head around the understanding of "proportion of variance explained."

But, Cohen's f will be needed for the power calculation.  You cannot use r-squared in the power function (which you will see in lab).