## $\color{red}{\text{Analysis of variance (ANOVA)}}$
---
- **Purpose**: To test whether the means of three or more groups are equal. ANOVA requires one numeric variable and one categorical variable.
- **Structure**: 

$$
\begin{equation}
  \begin{array}{l}
    H_0: \mu_{g1} = \mu_{g2} = \mu_{g3} = \cdots = \mu_{gk}\\
    H_1: \text{At least one group has a different mean}
  \end{array}
\end{equation}
$$
- **The alternative hypothesis does not say that ALL group means are different, only that AT LEAST ONE IS DIFFERENT**
  - Example: it could be that there's no difference between $g1$ and $g2$ but there's a statistical difference between $g1$ and $g3$
- **Assumptions**:
  - The numeric variable is normally distributed in each group
    - We can check this with a QQ-plot or a hypothesis test
  - The samples are independent
  - The variance is equal across groups
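
The equal-variance assumption can also be checked with a hypothesis test. The sketch below uses Levene's test from SciPy, which is one common choice and is not named in the notes above. The three groups are synthetic data generated only for illustration; they are not taken from the HR data set.

```python
import numpy as np
from scipy.stats import levene

# Synthetic groups (illustrative assumption, not the HR data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=50, scale=5, size=40)
group_b = rng.normal(loc=52, scale=5, size=40)
group_c = rng.normal(loc=55, scale=5, size=40)

# H0: all groups have equal variance
# H1: at least one group has a different variance
levene_stat, levene_pv = levene(group_a, group_b, group_c)
print(f"Levene statistic = {levene_stat:.3f}, p-value = {levene_pv:.3f}")
```

A small p-value would suggest that the variances differ across groups, which puts the equal-variance assumption in doubt.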

## $\color{red}{\text{Import Required Packages}}$

In [4]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

## $\color{red}{\text{Import Data}}$

In [3]:
df = pd.read_excel('hr_data', sheet_name='origData')


#### $\color{green}{\text{Example 1}}$
- Perform ANOVA on `MaritalStatus` and `HourlyRate`.
  - Provide a grouped box-plot
  - Check the ANOVA assumptions
  - Interpret the ANOVA results

In [None]:
import seaborn as sns

# Grouped box plot of HourlyRate by Gender
sns.boxplot(data=df, x='Gender', y='HourlyRate', hue='Gender')
plt.show()

### $\color{blue}{\text{QQ-Plot for normality assumptions}}$
---

In [None]:
import statsmodels.api as sm

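# HourlyRate values for each gender group
male_df = df[df.Gender == 'Male']['HourlyRate']
female_df = df[df.Gender == 'Female']['HourlyRate']

# Check the normality assumption. line='s' draws a reference line scaled to the
# sample mean and standard deviation ('45' only suits standardized data).
sm.qqplot(male_df, line='s')
sm.qqplot(female_df, line='s')
plt.show()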

### $\color{blue}{\text{Hypothesis tests for normality assumptions}}$
---
- Shapiro-Wilk test
  \begin{equation}
    \begin{array}{l}
      H_0: \text{The data follow a normal distribution} \\
      H_1: \text{The data do NOT follow a normal distribution}
    \end{array}
  \end{equation}

In [5]:
from scipy.stats import shapiro

# shapiro returns (statistic, p-value); `_` discards the statistic
_, male_pv = shapiro(male_df)
_, female_pv = shapiro(female_df)

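With more than two groups, the normality check has to be repeated for every group. A minimal sketch of one way to do that with `groupby` follows. The `demo` data frame, its column names, and the 0.05 significance level are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Synthetic data frame (hypothetical columns, not the HR data)
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 50),
    "value": rng.normal(loc=40, scale=6, size=150),
})

# Run the Shapiro-Wilk test once per group and keep the p-values
shapiro_pvalues = {
    name: shapiro(values).pvalue
    for name, values in demo.groupby("group")["value"]
}

alpha = 0.05  # assumed significance level
for name, pv in shapiro_pvalues.items():
    verdict = "reject H0" if pv < alpha else "fail to reject H0"
    print(f"{name}: p-value = {pv:.3f} ({verdict})")
```

Failing to reject $H_0$ for a group means the test found no evidence against normality in that group.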

### $\color{blue}{\text{ANOVA}}$
---
- **Structure**: 

$$\begin{equation}
  \begin{array}{l}
    H_0: \mu_{g1} = \mu_{g2} = \mu_{g3} = \cdots = \mu_{gk}\\
    H_1: \text{At least one group has a different mean}
  \end{array}
\end{equation}
$$

In [6]:
from scipy.stats import f_oneway

# Conduct one-way ANOVA on the HourlyRate series defined earlier (male_df, female_df)
_, anova_pv = f_oneway(male_df, female_df)
anova_pv
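
To see what `f_oneway` computes, the F statistic can be rebuilt by hand as the ratio of between-group to within-group variability. The sketch below does this on three synthetic groups (illustrative data, not the HR data set) and compares the result with SciPy's.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Synthetic groups (illustrative assumption)
rng = np.random.default_rng(1)
groups = [rng.normal(loc=m, scale=4, size=30) for m in (20, 22, 25)]

k = len(groups)                         # number of groups
n_total = sum(len(g) for g in groups)   # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_value = f.sf(f_stat, k - 1, n_total - k)  # upper-tail probability

scipy_stat, scipy_pv = f_oneway(*groups)
print(f"manual: F = {f_stat:.4f}, p = {p_value:.4g}")
print(f"scipy:  F = {scipy_stat:.4f}, p = {scipy_pv:.4g}")
```

A large F means the group means are spread out relative to the variation inside each group, which is evidence against $H_0$.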

## $\color{red}{\text{ANOVA post hoc tests}}$
---
- Post hoc tests are used to decide *which* groups differ significantly
- They are run after a statistical test involving three or more groups


### $\color{blue}{\text{Tukey's HSD test}}$
---
- Compares every pair of groups to decide which pairs differ significantly
- It is often done as a follow-up to ANOVA
- **Structure**: 

$$\begin{equation}
  \begin{array}{l}
    H_0: \mu_{g1} = \mu_{g2} \\
    H_1: \mu_{g1} \neq \mu_{g2}
  \end{array}
\end{equation}
$$
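
A minimal sketch of Tukey's HSD follows, assuming SciPy 1.8 or later, where `scipy.stats.tukey_hsd` is available. The three groups are synthetic: `g1` and `g2` share a mean and `g3` is shifted, so only the pairs involving `g3` are expected to differ.

```python
import numpy as np
from scipy.stats import tukey_hsd

# Synthetic groups (illustrative assumption, not the HR data)
rng = np.random.default_rng(2)
g1 = rng.normal(loc=20, scale=3, size=30)
g2 = rng.normal(loc=20, scale=3, size=30)
g3 = rng.normal(loc=30, scale=3, size=30)

# One pairwise comparison per pair of groups
result = tukey_hsd(g1, g2, g3)

# pvalues[i, j] is the adjusted p-value for H0: mean of group i = mean of group j
pvalues = result.pvalue
print(result)
```

Each off-diagonal entry of `pvalues` answers the pairwise hypothesis above for one pair of groups.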

## $\color{red}{\text{Some common non-parametric tests}}$

### $\color{blue}{\text{Kruskal-Wallis test}}$
---
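- A non-parametric version of ANOVA
  - We do not assume normality
  - It works on ranks, so it compares medians rather than means (assuming the groups have similarly shaped distributions); $M_{g}$ denotes the median of group $g$
- **Structure**: \begin{equation}
  \begin{array}{l}
    H_0: M_{g1} = M_{g2} = M_{g3} = \cdots = M_{gk}\\
    H_1: \text{At least one group has a different median}
  \end{array}
\end{equation}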

### $\color{blue}{\text{Dunn's test}}$
---
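- A post hoc test for Kruskal-Wallis that compares each pair of groups; $M_{g}$ denotes the median of group $g$
- **Structure**: \begin{equation}
  \begin{array}{l}
    H_0: M_{g1} = M_{g2} \\
    H_1: M_{g1} \neq M_{g2}
  \end{array}
\end{equation}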

In [None]:
from scipy import stats

# Kruskal-Wallis test on the HourlyRate series defined earlier (male_df, female_df)
_, pv = stats.kruskal(male_df, female_df)

In [7]:
# Dunn's test is the post hoc test for Kruskal-Wallis
import scikit_posthocs as sp

# posthoc_dunn accepts a list of groups, one sequence of values per group
groups = [male_df.tolist(), female_df.tolist()]

sp.posthoc_dunn(groups, p_adjust='holm')

This returns a table of pairwise p-values. Use them in your post hoc analysis to see which pairs of groups differ.