In [None]:
## Ignore this code, this is just generating some data
library(infer)
library(tidyverse)
library(nycflights13)
library(palmerpenguins)

## 1. **AN**alysis **O**f **VA**riance (ANOVA)

- We learned how to compare the means of two groups: with `t.test` or a permutation test.

- But how does it work when we have more than two groups?

- For example, imagine we have $k$ groups. We set the hypotheses to be:
$$H_0: \mu_1=\mu_2=\cdots=\mu_k \quad\quad \text{vs} \quad\quad H_A: \mu_i\neq \mu_j \text{ for at least one pair } i\neq j$$
    
- In other words, the alternative hypothesis of ANOVA says that at least one group has a different mean.
    - We don't need all the groups to have different means. It is enough that one group has a different mean for $H_0$ to be false.
    
- Let's explore ANOVA with an example.

### 1.1 cars dataset

In [None]:
# cell-01
# Let's take a look at the dataset


In [None]:
# cell-02
# Let's make cyl a factor



In [1]:
# cell-03
# Horsepower per cylinder


- The idea of ANOVA is to compare the variation within each group (category) against the variation between the groups.

- If the within-group spread is small compared to the between-group spread, then we have evidence that the group means differ.

### 1.2 Variability within the groups (SSE)

- We want to see how spread out the points of a group are around the group's mean.

1. Take the difference between each point and the mean of its group.
2. Square these differences.
3. Sum the squared differences over all points.
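
The three steps above can be sketched in base R. This is only an illustration under an assumption: it uses R's built-in `mtcars` with `cyl` converted to a factor as a stand-in for the lecture's `cars` data, and the names `cars_demo` and `SSE_demo` are hypothetical.

```r
# Assumption: the lecture's `cars` behaves like built-in `mtcars` with `cyl` as a factor.
cars_demo <- transform(mtcars, cyl = factor(cyl))

# Step 1: difference between each point and the mean of its own group.
# ave() returns, for every row, the mean of hp within that row's cyl group.
group_mean <- ave(cars_demo$hp, cars_demo$cyl)
within_diff <- cars_demo$hp - group_mean

# Steps 2 and 3: square the differences and sum them over all points.
SSE_demo <- sum(within_diff^2)
SSE_demo
```

`ave()` is used here so that every row carries its own group mean, which keeps the subtraction a plain vector operation.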

In [None]:
# cell-04 
# Let's plot the data
cars %>%
    mutate(car_number = 1:nrow(cars)) %>%
    ggplot(aes(car_number, hp, color = cyl)) + 
    geom_point(size=3) +
    theme(text = element_text(size = 20)) #+ 
    #geom_hline(aes(yintercept = cars %>% filter(cyl == 4) %>% pull(hp) %>% mean()), color = 'red') +
    #geom_hline(aes(yintercept = cars %>% filter(cyl == 6) %>% pull(hp) %>% mean()), color = 'darkgreen') +
    #geom_hline(aes(yintercept = cars %>% filter(cyl == 8) %>% pull(hp) %>% mean()), color = 'blue')

In [None]:
# cell-05
# Calculate the SSE


### 1.3 Variability between groups (SST)



- We want to see how the **mean** of each group varies around the **overall mean** (the mean considering all the points of all the groups).
  
1. Take the difference between the mean of each group and the overall mean.
2. Square the difference.
3. Multiply the squared difference by the number of points in the group.
4. Sum these values over all groups.
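
A minimal base R sketch of this calculation, including the final sum over groups, is shown here. It assumes that the lecture's `cars` data behaves like the built-in `mtcars` with `cyl` as a factor; `cars_demo` and `SST_demo` are hypothetical names.

```r
# Assumption: the lecture's `cars` behaves like built-in `mtcars` with `cyl` as a factor.
cars_demo <- transform(mtcars, cyl = factor(cyl))

# Overall mean, using all points from all groups.
overall_mean <- mean(cars_demo$hp)

# Mean and number of points for each cyl group.
group_means <- tapply(cars_demo$hp, cars_demo$cyl, mean)
group_sizes <- tapply(cars_demo$hp, cars_demo$cyl, length)

# Squared distance of each group mean from the overall mean,
# weighted by group size, then summed over the groups.
SST_demo <- sum(group_sizes * (group_means - overall_mean)^2)
SST_demo
```

Weighting by group size means a large group that sits far from the overall mean contributes more than a small group at the same distance.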

In [None]:
# cell-06
# Plot the means for the variability between groups
cars %>%
    mutate(car_number = 1:nrow(cars)) %>%
    ggplot(aes(car_number, hp, color = cyl)) + 
    #geom_point(size=3) +
    theme(text = element_text(size = 20)) + 
    geom_hline(aes(yintercept = cars %>% filter(cyl == 4) %>% pull(hp) %>% mean()), color = 'red') +
    geom_hline(aes(yintercept = cars %>% filter(cyl == 6) %>% pull(hp) %>% mean()), color = 'darkgreen') +
    geom_hline(aes(yintercept = cars %>% filter(cyl == 8) %>% pull(hp) %>% mean()), color = 'blue') + 
    geom_hline(aes(yintercept = mean(cars$hp)), color = 'black', lwd = 2)

In [None]:
# cell-07 
# Calculate SST


### 1.4 Degrees of freedom 

We want to compare the SSE and SST, but both depend on the number of groups and on the number of points in each group. To account for that, we compute a kind of average: instead of dividing by the number of points, we divide each sum of squares by its "degrees of freedom". The results are the mean squares, MST and MSE.

- The degrees of freedom for SST is the number of groups minus one, $k - 1$.
- The degrees of freedom for SSE is the number of points minus the number of groups, $n - k$, where $n$ is the total number of points.
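
As an illustration, the mean squares can be computed as follows. The sketch assumes that `cars` corresponds to the built-in `mtcars` with `cyl` as a factor, and all names ending in `_demo` are hypothetical.

```r
# Assumption: the lecture's `cars` behaves like built-in `mtcars` with `cyl` as a factor.
cars_demo <- transform(mtcars, cyl = factor(cyl))

n <- nrow(cars_demo)          # total number of points
k <- nlevels(cars_demo$cyl)   # number of groups

# Per-row group mean, reused for both sums of squares.
group_mean <- ave(cars_demo$hp, cars_demo$cyl)
SSE_demo <- sum((cars_demo$hp - group_mean)^2)
# Summing over rows counts each group mean once per point in the group,
# which is the same as multiplying by the group size.
SST_demo <- sum((group_mean - mean(cars_demo$hp))^2)

# Divide each sum of squares by its degrees of freedom.
df_between <- k - 1
df_within <- n - k
MST_demo <- SST_demo / df_between
MSE_demo <- SSE_demo / df_within
c(MST = MST_demo, MSE = MSE_demo)
```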

In [None]:
# cell-08
# Get MSE and MST


### 1.5 The F-statistic

- Now that we have accounted for the number of points and number of groups, we can compare the MSE and MST.

- The test statistic for ANOVA is then given by

$$
F = MST/MSE
$$
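
The ratio can be sketched in base R as follows, again assuming that `cars` corresponds to the built-in `mtcars` with `cyl` as a factor. The p-value step goes slightly beyond the text above: it assumes the usual F reference distribution with $k - 1$ and $n - k$ degrees of freedom. Names ending in `_demo` are hypothetical.

```r
# Assumption: the lecture's `cars` behaves like built-in `mtcars` with `cyl` as a factor.
cars_demo <- transform(mtcars, cyl = factor(cyl))
n <- nrow(cars_demo)
k <- nlevels(cars_demo$cyl)

# Sums of squares from the per-row group means.
group_mean <- ave(cars_demo$hp, cars_demo$cyl)
SSE_demo <- sum((cars_demo$hp - group_mean)^2)
SST_demo <- sum((group_mean - mean(cars_demo$hp))^2)

# Mean squares and their ratio.
MST_demo <- SST_demo / (k - 1)
MSE_demo <- SSE_demo / (n - k)
F_demo <- MST_demo / MSE_demo

# Upper-tail probability of the F distribution: large F values are evidence against H_0.
p_value_demo <- pf(F_demo, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
c(F = F_demo, p_value = p_value_demo)
```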

In [None]:
# cell-09
# Calculate F-statistic


**<font color= "red">It is time for CLICKER QUESTION!!</font>**

### 1.6 ANOVA in R

To do ANOVA in R, we use the `aov` function: 

```
aov(formula = response ~ grouping_variable,
    data = dataframe)
```
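
As a hedged usage sketch, the call below fits the model on the built-in `mtcars`, which is assumed here to be the source of the lecture's `cars` data. It also shows why the grouping variable should be a factor: with a numeric `cyl`, `aov` fits a single slope instead of comparing three group means. The names `fit_factor` and `fit_numeric` are hypothetical.

```r
# Assumption: the lecture's `cars` behaves like built-in `mtcars` with `cyl` as a factor.
cars_demo <- transform(mtcars, cyl = factor(cyl))

# Grouping variable as a factor: one mean per group, k - 1 = 2 degrees of freedom.
fit_factor <- aov(formula = hp ~ cyl, data = cars_demo)
summary(fit_factor)

# Same call with numeric cyl: a straight-line fit with only 1 degree of freedom.
fit_numeric <- aov(formula = hp ~ cyl, data = mtcars)
summary(fit_numeric)
```

Comparing the `Df` column of the two summaries is a quick way to confirm that the grouping variable was treated as categorical.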

In [None]:
# cell-10
# Solve the problem above using aov


`broom::tidy` extracts all of this information into a data frame for you:

In [None]:
# cell-11
# Call broom::tidy on the aov object


Let's see if it matches what we did. 

- `sumsq` holds our SS terms:
    - `sumsq` of `cyl` is SST
    - `sumsq` of residuals is SSE

In [None]:
# cell-12
SST
SSE

- `meansq` holds our MS terms:
    - `meansq` of `cyl` is MST
    - `meansq` of residuals is MSE

In [None]:
# cell-13
MST
MSE

### 1.7 Assumptions for ANOVA

- The population of each group follows a Normal distribution.
- All the populations have the same variance.
    - In practice, we are fine as long as the largest variance is not several times larger than the smallest variance.
- The samples are independent across and within each group.
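
A rough check of the equal-variance assumption could look like the sketch below. It assumes that `cars` corresponds to the built-in `mtcars` with `cyl` as a factor, and the names `group_vars` and `variance_ratio` are hypothetical. It only compares the sample variances and is not a formal test.

```r
# Assumption: the lecture's `cars` behaves like built-in `mtcars` with `cyl` as a factor.
cars_demo <- transform(mtcars, cyl = factor(cyl))

# Sample variance of hp within each cyl group.
group_vars <- tapply(cars_demo$hp, cars_demo$cyl, var)
group_vars

# How many times larger the largest variance is than the smallest.
variance_ratio <- max(group_vars) / min(group_vars)
variance_ratio
```

A ratio close to 1 supports the equal-variance assumption, while a ratio of several times suggests interpreting the ANOVA result with caution.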

### 1.8 Tukey's Honest Significant Difference (HSD)

The limitation of ANOVA is that it only tells us that at least one group has a different mean. It doesn't tell us which groups are different.

Once ANOVA detects a difference, we can study the pairwise differences of means using Tukey's HSD. Tukey's HSD runs all the pairwise comparisons while controlling the overall probability of a Type I error.
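
As a sketch of what the pairwise output looks like, the code below runs base R's `TukeyHSD` on a model fitted to the built-in `mtcars`, which is assumed here to be the source of the lecture's `cars` data. The names `fit_demo` and `tukey_demo` are hypothetical.

```r
# Assumption: the lecture's `cars` behaves like built-in `mtcars` with `cyl` as a factor.
cars_demo <- transform(mtcars, cyl = factor(cyl))

# TukeyHSD() takes a fitted aov object, not the raw data.
fit_demo <- aov(hp ~ cyl, data = cars_demo)
tukey_demo <- TukeyHSD(fit_demo, conf.level = 0.95)

# One row per pair of groups: difference in means (diff), confidence
# interval bounds (lwr, upr), and the adjusted p-value (p adj).
tukey_demo$cyl
```

A pair whose interval excludes zero, or equivalently whose adjusted p-value is small, is one where the two group means appear to differ.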

In [None]:
# cell-14
# TukeyHSD
...(aov(hp ~ cyl, data = cars))