# One-Way ANOVA
In this notebook we will analyze batting data from Major League Baseball using one-way ANOVA.  We will look at:

- variance testing to determine equal variance across groups
- one-way ANOVA with both equal and unequal variances
- one-way ANOVA with repeated measures
- post hoc testing of assumptions (normality of residuals)
- post hoc pairwise testing of means
- non-parametric tests (testing of mean rank order)
- effect size
- power

...and we'll look at some visualizations along the way.

# Table of Contents
- [One-Way ANOVA - unequal variances](#owunequal)
- [One-Way ANOVA - equal variances - post season](#oweqpost)
- [One-Way ANOVA - equal variances - regular season](#oweqpre)
- [One-Way ANOVA - repeated measures](#repeated)
- [One-way Non-parametric test (for non-normal or ordinal data)](#kruskal)
- [Power Analysis](#power)

### Topic Subsections (within the larger analyses)
- [Testing Assumptions - Homogeneity of Variances](#levene)
- [Testing Assumptions - Post Hoc examination of normality of residuals](#resid)
- [Post Hoc Pairwise Comparisons - Tukey HSD](#tukey)
- [Post Hoc Pairwise Comparisons - Bonferroni](#bonf)
- [Post Hoc Pairwise Comparisons - Dunn Test (non-parametric analysis)](#dunn)
- [Effect Size - R-squared](#rsq)
- [Effect Size - non-parametric tests](#epsilon)

In [None]:
## load the libraries we'll need in the notebook

library(tidyverse)
library(magrittr)
library(ggpubr) # contains line/dot plot for visualizing means
library(DescTools) # contains levene's test function
library(broom) # to tidy model output
library(rcompanion) # for epsilonSquared function
library(pwr) # for power analysis
library(tidyr) # for pivot_longer

options(repr.plot.width=5, repr.plot.height=4) ## set options for plot size within the notebook -
# this is only for jupyter notebooks, you can disregard this.

## Loading and Cleaning Data
Before I get started with the ANOVA examples I am going to load and clean data.  In this example we're going to load the (rather large) baseball batting stats and player tables from data.world directly from their site via the internet.  We will then subset our data to 1920 and later (the so-called "live ball era") and use joins to add the player level stats (height, weight, handedness) to our batting stats.  Our batting stats data has one row for each player for each year.

In [None]:
## Read Sabermetrics Baseball data from data.world
## NOTE: these are large datasets and may take 10+ minutes to load

df_players <- read.csv("https://query.data.world/s/2lodzrpv2eyrgdvsih4udahoxpg3rx", header=TRUE, stringsAsFactors=FALSE)

df_batting <- read.csv("https://query.data.world/s/qbrtixcbxxuyxuqq5oq6tfrkxoaycs", header=TRUE, stringsAsFactors=FALSE)

df_batt_post <- read.csv("https://query.data.world/s/wrxv3xo54tvesrlfhf72jlfn2lxsm6", header=TRUE, stringsAsFactors=FALSE)


In [None]:
## look at first rows of each df
head(df_players)
head(df_batt_post)
head(df_batting)

In [None]:
## look at the "structure" of our dfs - variable names and types

str(df_players)
str(df_batt_post)
str(df_batting)

In [None]:
#subset batting dfs to "live ball era" - 1920 and later
df_batting <- df_batting  %>% filter(yearid >= 1920)
df_batt_post <- df_batt_post  %>% filter(yearid >= 1920)

In [None]:
# join player data (weight, height, handedness (bats / throws)) with 
# batting data (both general and post season) using playerid
plyr_cols <- df_players  %>% select(playerid, weight, height, bats, throws) ## select to only cols from player tbl I want
dfreg <- inner_join(df_batting, plyr_cols, by = "playerid") # join batting to plyr_cols for all rows that match
dfpost <- inner_join(df_batt_post, plyr_cols, by = "playerid") # join post season batting stats to plyr_cols
head(dfreg)
head(dfpost)

Now we have the data at a point where we can start the analysis.  We will use our player data as predictors for our batting statistics.

## Beginning the analysis
We will begin the analysis by looking at some summary statistics and visualizations of our variables of interest.  The first ANOVA test we will conduct is to look at regular season hits (h) by handedness during batting (bats).  Let's look at the distributions of these variables.

In [None]:
summary(dfreg$h)
table(dfreg$bats)

Oh, no! We have NAs on our batting variable.  Let's remove those - instead of removing the rows that have NAs on *ANY* variable we'll only remove those rows that have NA on the hitting variable (h).

In [None]:
## remove cases with NA value on dfreg$h
dfreg %<>% drop_na(h)  ## note the  %<>% pipe operator - this pipes forward AND does assignment back 
                       ## will overwrite the dataframe you're manipulating
## check our summary again
summary(dfreg$h)

We also have some missing data on our "bats" variable - notice the 2 in a column before B with no label?  Those are missing values that are not coded as NA; they are empty character values - "" - so drop_na() would not catch them.  Let's remove those too!

In [None]:
## remove cases with empty character value on dfreg$bats
dfreg %<>% filter(bats != "")  ## note the  %<>% pipe operator - this pipes forward AND does assignment back 
                       ## will overwrite the dataframe you're manipulating
## check our summary again
table(dfreg$bats)

I think we might finally be done with our data cleanup and ready to roll.  Let's start by looking at some graphical representations of our distribution.

In [None]:
#frequency histogram
dfreg  %>% ggplot(aes(h)) + 
  geom_histogram(binwidth=10)

Well, our distribution is a bit skewed, but let's move forward.

In [None]:
#box plot
dfreg  %>% ggplot(aes(y = h)) +
        geom_boxplot() +
        xlab("all players") +
          theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

In [None]:
#grouped box plot
dfreg %>% ggplot(aes(x = bats, y = h, fill = bats)) +
        geom_boxplot()

In [None]:
#distribution of means by groups
dfreg %>% ggline(x = "bats", y = "h", 
       add = c("mean_se", "jitter"),  add.params = list(color="bats"),
       ylab = "Hits", xlab = "Player Handedness (batting)") 

Our data is definitely skewed, but it looks like the means may be different.  To see if we can remove some of the skew, and make our analysis more appropriate for the data/meaning of the data, let's remove any observations where the player had 0 at bats (ab) because those players would have had no chance to make a hit.

In [None]:
# subset to observations with at least one at bat (ab)

dfreg2 <- dfreg %>% filter(ab > 0) 

In [None]:
# look at a couple of our graphs again

#frequency histogram
dfreg2  %>% ggplot(aes(h)) + 
  geom_histogram(binwidth=10)

#distribution of means by groups
dfreg2 %>% ggline(x = "bats", y = "h", 
       add = c("mean_se", "jitter"),  add.params = list(color="bats"),
       ylab = "Hits", xlab = "Player Handedness (batting)") 

It may be a bit better, but still skewed.  Let's move on and do our pre-checks of our assumptions:

## Pre-check - Assumptions
- DV is numeric (interval or ratio) - yes, number of hits by player by year
- No extreme outliers - nothing looks too bad
- Normality __*of residuals*__ - can't check this until after
- Independence of Observations (random selection, different samples) - not entirely since there are multiple observations from players for different years, but the observations are not paired in this format.  We'll agree to violate this assumption.
- Group sample sizes are approximately equal - equal enough, and large enough to not matter too much.

AND....
<a id="levene"></a>
- Homogeneity of Variance - Let's check this right now with Levene's test.

Levene's Test plays the same role as the var.test() we used with t-tests, but var.test() is an F test that compares only two groups, while Levene's Test can compare the variances of several groups at once.

Recall:

$H_0:$ The variances in the groups are equal. <BR>
$H_A:$ The variances in the groups are not equal.

In this test, we sort of want to fail to reject the null, because it's easier if our variances are equal and we don't need to make the adjustment.

In [None]:
#LeveneTest(DV ~ IV, data = your data frame)

LeveneTest(h ~ bats, data = dfreg2)

Our p-value is very, very low - so we reject the null hypothesis.  The null hypothesis here is that the variances of each group are equal - so our variances are unequal.  This means we need to use Welch's ANOVA for unequal variances.
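
To make the test less of a black box, here is a minimal sketch on made-up data (the `toy` data frame and its columns are hypothetical).  It assumes the median-centered variant of Levene's test, which I believe is the `LeveneTest()` default: take each observation's absolute deviation from its group median, then run an ordinary one-way ANOVA on those deviations.

```r
set.seed(42)
# hypothetical toy data: three groups with increasingly large spread
toy <- data.frame(
  grp = factor(rep(c("B", "L", "R"), each = 30)),
  y   = c(rnorm(30, 50, 5), rnorm(30, 50, 10), rnorm(30, 50, 20))
)

# absolute deviation of each observation from its own group median
toy$abs_dev <- abs(toy$y - ave(toy$y, toy$grp, FUN = median))

# a one-way ANOVA on the deviations is the (median-centered) Levene test
lev_fit <- anova(lm(abs_dev ~ grp, data = toy))
lev_fit

lev_p <- lev_fit[["Pr(>F)"]][1]  # a small p-value is evidence of unequal variances
lev_p
```

If the assumption about the default is right, `LeveneTest(y ~ grp, data = toy)` should report the same F and p-value.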
<a id="owunequal"></a>
## One-way ANOVA time! (unequal variances)
We are finally ready to run our first ANOVA analysis.  We'll set our alpha at 0.05.  Based on the results of Levene's Test we have unequal variances.  To run Welch's ANOVA we use the function oneway.test(), which does not assume equal variances by default (var.equal = FALSE).

In [None]:
# oneway.test(DV ~ IV, data = your data frame)
oneway.test(h ~ bats, data = dfreg2)

This output doesn't give us the sums of squares or the mean squares.  We get the F-value, the numerator and denominator degrees of freedom, and the p-value.  Our p-value is much lower than alpha and therefore we reject the null hypothesis.  This means that the mean number of hits differs significantly by batter handedness (left-handed, right-handed, or both).  We'll later look at whether the difference is substantively significant in a standardized way, but let's take a peek at the unstandardized difference.

In [None]:
dfreg2 %>% group_by(bats) %>% summarize(mean_hits = mean(h))

Right-handed batters average 39 hits per season, left-handed batters average 50 hits, and switch hitters average 55 hits per season.  This means that the difference between R and L is 11 hits, and between L and B is 5 hits.  Those seem somewhat significant to me.  We would look at the R-squared to see the standardized effect size (the proportion of variance explained by the IV), but we can't do that with the output of oneway.test(), so I'm going to use the lm() (linear model) function.

### R-squared (effect size)

In [None]:
summary(lm(h ~ bats, data = dfreg2)) ## use summary to get the full output that includes r-squared.

Our R-squared shows that handedness of the batter accounts for only about 1% of the variance in number of hits.  Even though the result is statistically significant, and the unstandardized difference looked substantively significant, the variance explained is very low (almost 0), which means that handedness tells us very little about the number of hits.  We'll return to this example when we look at non-parametric tests, as our data was extremely skewed.

<a id = "oweqpost"></a>

## One-way ANOVA - equal variances - post season
This time we're going to run an ANOVA analysis using the aov() function, which assumes equal variances.  Given the current World Series matchup - the Houston Astros vs. the Washington Nationals - we'll compare hitting stats between those two teams during the postseason.  First we'll need to subset our data to just include those teams.  The Nationals franchise previously played as the Montreal Expos, so we'll pull in the Expos stats as well.  We'll look at the teamid variable - we want "HOU" for the Astros and "WAS" for the Nationals, and we'll keep "MON" as a third group in our analysis rather than merging it into "WAS".

In [None]:
dfpost_sub <- dfpost %>% 
                    filter(teamid %in% c("HOU", "WAS", "MON")) 
table(dfpost_sub$teamid) ## check that the distribution of the team variable looks the way we want after filtering

We'll run a Levene's test to check for homogeneity of variances, but we'll assume equal variances in our ANOVA analysis either way, for the purposes of the example.

In [None]:
LeveneTest(h ~ as.factor(teamid), data = dfpost_sub)

We fail to reject the null hypothesis, therefore we can assume our variances are equal. Let's take a look at our distribution of hits.

In [None]:
## check means by team
dfpost_sub %>% group_by(teamid) %>% summarize(mean_hits = mean(h))

In [None]:
#grouped box plot
dfpost_sub %>% ggplot(aes(x = teamid, y = h, fill = teamid)) +
        geom_boxplot()


In [None]:
## line plot
#distribution of means by groups
dfpost_sub %>% ggline(x = "teamid", y = "h", 
       add = c("mean_se", "jitter"),  add.params = list(color="teamid"),
       ylab = "Hits", xlab = "Team") 

Doesn't look like there's a lot going on here, let's see what our results say:

In [None]:
# use aov(DV ~ IV, data = dataset) to run ANOVA.  Then look at summary() for full results. 

WSaov = aov(h ~ teamid, data=dfpost_sub)
summary(WSaov)

There is no statistically significant difference in hitting between HOU, WAS, and the old Montreal Expos during the post-season.  
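
Because `aov()` always assumes equal variances, it can be useful to see how much that assumption matters.  This sketch, on made-up data with hypothetical names, shows that `oneway.test()` with `var.equal = TRUE` reproduces the `aov()` F statistic, while the default `var.equal = FALSE` gives Welch's version, so the two can be compared side by side.

```r
set.seed(1)
# made-up post-season hit counts for three unequal-sized groups
toy <- data.frame(
  team = factor(rep(c("HOU", "MON", "WAS"), times = c(40, 25, 15))),
  h    = c(rpois(40, 4), rpois(25, 4), rpois(15, 5))
)

classic <- oneway.test(h ~ team, data = toy, var.equal = TRUE)
welch   <- oneway.test(h ~ team, data = toy)  # var.equal = FALSE is the default
aov_F   <- summary(aov(h ~ team, data = toy))[[1]][["F value"]][1]

# the first two values match; the Welch statistic generally differs
c(aov = aov_F,
  classic = unname(classic$statistic),
  welch = unname(welch$statistic))
```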
<a id = "oweqpre"></a>
## One-way ANOVA - equal variances - regular season
Let's look at a different variable, home runs (HR) during the regular season by weight.  Since weight is a numerical variable we will need to use cut() to make it into a factor with levels that are weight ranges.  We'll start by looking at the distribution of weight to determine our cuts.  

In [None]:
summary(dfreg$weight)

We see two concerning things here - 1) it seems unlikely that a baseball player weighed 65 pounds - possible outlier? - and 2) we have 7 NAs.  Let's look at a box plot to see if we do have outliers.

In [None]:
dfreg %>% ggplot(aes(y = weight)) +
        geom_boxplot()

That observation with a weight of 65 is a clear outlier, so let's drop him and the NAs before we move forward.

In [None]:
# drop NAs and observations with weights lower than 100
dfreg3 <- dfreg %>% drop_na(weight) %>% filter(weight >= 100)
summary(dfreg3$weight)

Now we'll make our cuts.  My groups will be (-Inf, 175], (175, 200], (200, 225], (225, 250], and (250, Inf].

In [None]:
wtcut <- c(-Inf,175,200,225,250,Inf) # define cut points
wtlbls <- c("less than 175lbs", "175 - 200lbs", "200 - 225lbs", "225 - 250lbs", "more than 250lbs" ) # create descriptive labels
dfreg3 %<>% mutate(wt_cut = cut(weight, breaks = wtcut, labels = wtlbls)) # use mutate to create new variable with the weight categories
table(dfreg3$wt_cut) #inspect the results

## I'm also going to filter out the players with 0 at bats
dfreg3 %<>% filter(ab>0) 
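
One detail of `cut()` worth checking: by default its intervals are closed on the right, so a player who weighs exactly 175 lands in the first group, not the second.  A quick sketch with made-up weights (the `toy_wt` vector is hypothetical):

```r
wtcut  <- c(-Inf, 175, 200, 225, 250, Inf)
wtlbls <- c("less than 175lbs", "175 - 200lbs", "200 - 225lbs",
            "225 - 250lbs", "more than 250lbs")

toy_wt  <- c(150, 175, 176, 250, 251)  # made-up weights, including boundary values
toy_cut <- cut(toy_wt, breaks = wtcut, labels = wtlbls)

data.frame(weight = toy_wt, group = toy_cut)

# right = FALSE closes the intervals on the left instead,
# which moves 175 into the "175 - 200lbs" group
left_closed <- cut(175, breaks = wtcut, labels = wtlbls, right = FALSE)
left_closed
```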

Now we can look at the distribution of HR (home runs) by weight categories.

In [None]:
# means by group

dfreg3 %>% group_by(wt_cut) %>% summarize(mean_hr = mean(hr)) 

In [None]:
## line plot
#distribution of means by groups
dfreg3 %>% ggline(x = "wt_cut", y = "hr", 
       add = c("mean_se", "jitter"),  add.params = list(color="wt_cut"),
       ylab = "Home Runs", xlab = "Weight") 

It looks like there may be an association here, let's run our ANOVA.

In [None]:
# use aov(DV ~ IV, data = dataset) to run ANOVA.  Then look at summary() for full results. 

wt_hr_aov = aov(hr ~ wt_cut, data=dfreg3)
summary(wt_hr_aov)

There is a statistically significant difference in mean home runs by weight.  But is the difference substantively significant?  Let's look at r-squared.  

<a id = "rsq"></a>
### R-squared
We know that $r^2 = \frac{SS_{between}}{SS_{total}}$, so we can calculate it from the saved output of the ANOVA.

In [None]:
# look at structure of aov result to locate the pieces of data we need
# first use the tidy function from broom to "tidy" the aov output
tidyout <- tidy(wt_hr_aov)
str(tidyout)

In [None]:
## calculate with aov output
ss_group <- tidyout$sumsq[1]
ss_total <- sum(tidyout$sumsq)


R_squared <- ss_group / ss_total
            
R_squared

In [None]:
# or use lm() function
wt_ht <- lm(hr ~ wt_cut, data=dfreg3)

anova(wt_ht)
summary(wt_ht)

Either way we calculate it, our R-squared reflects little substantive significance - only about 1.8% of the variance in number of HRs is explained by player weight.

Now we're wondering which groups are better at HRs / which groups have statistically significant paired differences.  Let's run our post-hoc pairwise comparisons.
<a id = "tukey"></a>
### Post-Hoc Pairwise Comparisons - Tukey HSD


In [None]:
#use the TukeyHSD function and pass it your saved ANOVA output.

TukeyHSD(wt_hr_aov)

It looks like all of our pairwise comparisons are significant except the two that compare players over 250 lbs with the 200 - 225 and 225 - 250 groups.  We'll try Bonferroni too.
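
For reading the `TukeyHSD()` table, a small sketch on made-up data (all names hypothetical): each row is one pair of groups, `diff` is the difference in means, `lwr` and `upr` bound the family-wise 95% confidence interval, and `p adj` is the adjusted p-value.  A pair is significant at 0.05 when its interval excludes 0.

```r
set.seed(7)
# made-up home run data for three hypothetical weight groups
toy <- data.frame(
  grp = factor(rep(c("light", "medium", "heavy"), each = 40),
               levels = c("light", "medium", "heavy")),
  hr  = c(rnorm(40, 5, 3), rnorm(40, 8, 3), rnorm(40, 8.5, 3))
)

tk <- TukeyHSD(aov(hr ~ grp, data = toy))$grp   # matrix, one row per pair
tk

# flag the pairs whose confidence interval excludes zero
excludes_zero <- tk[, "lwr"] > 0 | tk[, "upr"] < 0
data.frame(pair = rownames(tk), significant = excludes_zero)
```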
<a id = "bonf"></a>
### Post-hoc Pairwise Comparisons - Bonferroni Adjustment


In [None]:
# for bonferroni we use the function pairwise.t.test with the p.adj argument set to "bonf"
pairwise.t.test(dfreg3$hr, dfreg3$wt_cut, p.adj = "bonf")

These results are the same in terms of what's significant, but the p-values are slightly higher.
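
The Bonferroni adjustment itself is simple: multiply each raw p-value by the number of comparisons and cap the result at 1 (with 5 weight groups there are `choose(5, 2)` = 10 comparisons).  A sketch with made-up p-values, using base R's `p.adjust()`:

```r
raw_p <- c(0.001, 0.020, 0.400)   # made-up raw p-values for 3 comparisons
m     <- length(raw_p)            # number of comparisons

by_hand  <- pmin(1, raw_p * m)    # multiply by the number of tests, cap at 1
built_in <- p.adjust(raw_p, method = "bonferroni")

rbind(raw_p, by_hand, built_in)
```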
<a id = "resid"></a>
### Post Hoc Assumptions Test - Normality of Residuals

You'll recall that one of our assumptions with ANOVA is the normality of the residuals.  We can't test that prior to conducting the analysis, so we have to look at it afterwards.  We'll use the plot() function with our aov output to graphically check this assumption.  It is the second plot (the QQ plot) that we check for normality of residuals, comparing our line of residual points to the diagonal reference line.

In [None]:
plot(wt_hr_aov)

As could be expected with our heavy amount of 0 values, and our skewed distribution of hr, our residuals deviate significantly from normality.  This means that our ANOVA results may not be valid, due to the violation of this assumption.
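
`plot()` on an `aov` object cycles through several diagnostic plots.  Here is a sketch, on made-up skewed data with hypothetical names, of pulling out just the QQ plot with `which = 2` and backing it up with a Shapiro-Wilk test on the residuals.  Note that `shapiro.test()` only accepts between 3 and 5000 values, so on a data set as large as ours the residuals would have to be sampled first.

```r
set.seed(3)
# made-up, right-skewed outcome for three hypothetical groups
toy <- data.frame(
  grp = factor(rep(c("a", "b", "c"), each = 50)),
  hr  = rexp(150, rate = 0.2)
)
fit <- aov(hr ~ grp, data = toy)

plot(fit, which = 2)       # QQ plot of the residuals only

res <- residuals(fit)
sw  <- shapiro.test(res)   # null hypothesis: residuals are normally distributed
sw$p.value

# for more than 5000 residuals, test a random subset instead, e.g.
# shapiro.test(sample(res, 5000))
```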
<a id = "repeated"></a>
## One-way Repeated Measures ANOVA

Let's look at some paired data.  We'll look at HRs for players who played in the postseason compared to their regular season stats.  Specifically let's look at HRs among people who have at least one HR in the regular season (to hopefully deal with some of the skewness).  In order to deal with the unequal number of games, we'll do HR/at bats to standardize the DV.

In [None]:
# join regular season stats to everyone who played in the postseason, using both playerid and year.
dfjoin <- left_join(dfpost, dfreg, by = c("playerid","yearid"))
head(dfjoin)
# note our post seasons stats are now .x and our regular season stats are .y


In [None]:
# subset to only those who had at least one HR in regular season and make standardized hr/ab variable
dfjoin %<>% filter(hr.y > 0) %>%  # at least one hr in regular season
                mutate(stdhr_reg = hr.y/ab.y) %>%  # create std. hr variable for regular season
                mutate(stdhr_post = hr.x/ab.x) # create std. hr variable for post season
#check summary stats
summary(dfjoin[c("stdhr_reg", "stdhr_post")])

In [None]:
# remove NAs
dfjoin %<>% drop_na(stdhr_reg)

We're doing a repeated measures ANOVA, so our "grouping" variable is time.  To do this we need to convert our data from "wide" to "long" format.

In [None]:
#wide format
dfjoin_wide <- dfjoin %>% select(playerid, yearid, stdhr_reg, stdhr_post)
head(dfjoin_wide)

In [None]:
#long format
dfjoin_long <- dfjoin_wide %>% pivot_longer(
                                    cols = starts_with("stdhr"),
                                    names_to = "season",
                                    names_prefix = "stdhr_",
                                    values_to = "HR",
                                    values_drop_na = TRUE)
head(dfjoin_long, 12)

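Now we can run our ANOVA function with HR as the DV and season as our IV (time variable).  Note that aov(HR ~ season) on its own treats the two seasons as independent groups; it does not use the pairing of observations within a player.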

In [None]:
seas_hr_aov = aov(HR ~ season, data=dfjoin_long)
summary(seas_hr_aov)

Players who played in the playoffs and had at least one home run in the regular season had a significantly different average number of home runs per at bat between the regular season and the post season.  Let's look at the actual means.

In [None]:
dfjoin_long %>% group_by(season) %>% summarize(mean_hr = mean(HR))

So fewer average HR per at bats in the post season.  Let's check our effect size:

In [None]:
# get r-squared from lm()
summary(lm(HR ~ season, data=dfjoin_long))

Again, this result is not substantively significant.
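
The `aov(HR ~ season)` call treats the regular-season and post-season values as two independent groups.  One way to respect the pairing is to add an `Error()` term for the player.  The sketch below shows this on made-up data (the `toy_long` data frame and its columns are hypothetical); it is an illustration of the technique, not a rerun of the analysis above.

```r
set.seed(11)
n_players <- 20
reg  <- rnorm(n_players, mean = 0.040, sd = 0.010)         # made-up regular-season HR/AB
post <- reg + rnorm(n_players, mean = -0.005, sd = 0.008)  # same players, post season

toy_long <- data.frame(
  player = factor(rep(seq_len(n_players), times = 2)),
  season = factor(rep(c("reg", "post"), each = n_players)),
  HR     = c(reg, post)
)

# Error(player / season) tells aov that season is measured within each player
rm_fit <- aov(HR ~ season + Error(player / season), data = toy_long)
summary(rm_fit)

# with only two time points, the within-player F for season
# equals the squared paired t statistic
paired <- t.test(reg, post, paired = TRUE)
unname(paired$statistic)^2
```

With real data this requires one row per player per season, so duplicate player-year rows would need to be aggregated first.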

<a id = "kruskal"></a>
## Non-parametric Tests (Kruskal-Wallis Test)

Remember our non-parametric tests are not bound by the same assumptions as our parametric tests (most importantly normality) but they are less powerful, less detailed, and less specific.  

Because of our normality violation in our first test, we'll retry the analysis using the Kruskal-Wallis Test.

The test uses rank order instead of the actual values and compares the mean rank by group.

This test generates an H score (not F) - in R it says Chi Square, but it is an H statistic that uses the Chi Square method to determine statistical significance.

We're returning to our regular season hitting data, comparing by handedness.  Our DV is hits (h) and our IV is handedness for batting (bats).

In [None]:
# Running Kruskal-Wallis 
kruskal.test(h ~ bats, data = dfreg2)

We have a statistically significant difference - which means again that left-handed, right-handed, and switch hitters on average have significantly different numbers of hits.
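
To see what "comparing mean ranks" means in practice, here is a sketch that computes H by hand on made-up, tie-free data (names hypothetical) and checks it against `kruskal.test()`.  Without ties, $H = \frac{12}{N(N+1)}\sum_i \frac{R_i^2}{n_i} - 3(N+1)$, where $R_i$ is the rank sum of group $i$ and $n_i$ its size; when there are ties, R applies an additional correction.

```r
toy <- data.frame(
  bats = factor(rep(c("B", "L", "R"), each = 5)),
  h    = c(12, 48, 51, 73, 90,    # made-up hit counts, no tied values
           10, 35, 44, 60, 81,
            3,  8, 21, 29, 40)
)

r  <- rank(toy$h)                    # rank all observations together
N  <- nrow(toy)
Ri <- tapply(r, toy$bats, sum)       # rank sum per group
ni <- tapply(r, toy$bats, length)    # group sizes

H_by_hand <- 12 / (N * (N + 1)) * sum(Ri^2 / ni) - 3 * (N + 1)
kw <- kruskal.test(h ~ bats, data = toy)

c(by_hand = H_by_hand, kruskal = unname(kw$statistic))
tapply(r, toy$bats, mean)            # the mean ranks being compared
```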

<a id = "dunn"></a>

We can also look at the pairwise comparisons in the non-parametric test using the Dunn Test with the Bonferroni Adjustment.

In [None]:
## DunnTest (From DescTools package) DunnTest(DV, IV, method = "bonferroni")
DunnTest(dfreg2$h, as.factor(dfreg2$bats), method="bonferroni")

All of the pairwise comparisons are significant - so all of the groups are significantly different from each other.

<a id = "epsilon"></a>

The final thing we can calculate for the non-parametric test is the effect size.  This version is epsilon-squared (instead of r-squared) but the interpretation is the same.

In [None]:
#calculate epsilonSquared(DV,IV)
epsilonSquared(dfreg2$h, dfreg2$bats)

Similar to our results above, there is no substantive significance - the handedness of the batter only explains 1.5% of the variance in number of hits.  
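
For reference, epsilon-squared can be computed directly from the Kruskal-Wallis H statistic as $\epsilon^2 = H / (N - 1)$, which I believe is what `epsilonSquared()` reports.  A base R sketch on made-up data (hypothetical names):

```r
set.seed(5)
# made-up hit counts for three handedness groups
toy <- data.frame(
  bats = factor(rep(c("B", "L", "R"), each = 40)),
  h    = c(rpois(40, 55), rpois(40, 50), rpois(40, 40))
)

kw <- kruskal.test(h ~ bats, data = toy)
H  <- unname(kw$statistic)
N  <- nrow(toy)

eps2 <- H / (N - 1)   # effect size on a 0-to-1 scale
eps2
```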

<a id = "power"></a>

## Power Analysis

The final thing we will look at for one-way ANOVA is the power analysis.  

The pwr.anova.test() function is very similar to the power functions we've used before, except the arguments are:

- k = # of groups (categories in IV)
- f = effect size, expressed as Cohen's f (not r-squared or epsilon-squared itself; convert with $f = \sqrt{R^2 / (1 - R^2)}$)
- sig.level = alpha
- power = power 
- n = sample size __PER GROUP__ (with the assumption that the groups will be equal sizes)

As always, we supply 4 of the 5 things and R will calculate the fifth (set to NULL in the function call).  Typically we either calculate power of our analysis post hoc, or calculate the sample size we would need to achieve a certain power level for a particular effect size before conducting an experiment.

In [None]:
# first, we need to know sample size, because they are unequal, we should use the n of the smallest group:
table(dfreg2$bats)

In [None]:
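# calculate the power of our last, non-parametric test
# pwr.anova.test() expects Cohen's f, so we convert our effect size (0.015) first:
# f = sqrt(ES / (1 - ES)), treating epsilon-squared as a proportion of variance explained

n = 5773
f = sqrt(0.015 / (1 - 0.015))

pwr.anova.test(k=3, f=f, sig.level = 0.05, power = NULL, n = n)


With a sample this large, our analysis had a power of essentially 1, even for such a small effect.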

Let's see what sample size we would need to achieve a power of 0.8 with a very large effect size (Cohen's f) of 0.75, which corresponds to an r-squared of 0.36 - 36% of the variance in the DV explained by the IV.  Our study will have 5 groups/levels of our IV.

In [None]:
pwr.anova.test(k=5, f=0.75, sig.level = 0.05, power = 0.8, n = NULL)

We would need 6 observations __PER GROUP__ for a total of 6 * 5 groups = 30 observations total.
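
One caveat on effect sizes: `pwr.anova.test()` expects Cohen's f, not R-squared.  The two are related by $f = \sqrt{R^2 / (1 - R^2)}$.  Below is a sketch of the conversion in base R, with the power computed directly from the noncentral F distribution.  That is my understanding of what `pwr.anova.test()` does internally, and the helper names are hypothetical.

```r
# convert a proportion of variance explained into Cohen's f, and back
r2_to_f <- function(r2) sqrt(r2 / (1 - r2))
f_to_r2 <- function(f) f^2 / (1 + f^2)

# power of a balanced one-way ANOVA with k groups and n observations per group
anova_power <- function(k, n, f, sig.level = 0.05) {
  df1  <- k - 1
  df2  <- k * (n - 1)
  crit <- qf(sig.level, df1, df2, lower.tail = FALSE)
  pf(crit, df1, df2, ncp = k * n * f^2, lower.tail = FALSE)
}

r2_to_f(0.015)   # Cohen's f for an effect that explains 1.5% of the variance
f_to_r2(0.75)    # an f of 0.75 corresponds to an R-squared of 0.36

anova_power(k = 3, n = 5773, f = r2_to_f(0.015))
anova_power(k = 5, n = 6, f = 0.75)
```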