# One-Way ANOVA
<a id = "top"></a>
This lab will take you through two complete examples of conducting a one-way ANOVA analysis using the Cards Against Humanity poll data we've used in previous labs.  Each example will go through the process of preliminary data inspection/visualization, checking assumptions, conducting the ANOVA analysis, interpretation, post-hoc assumption checking, post-hoc pairwise comparisons, and effect size calculation and interpretation.  

Examples of power analysis will also be included at the end of the notebook.

### Table of Contents
- [Example 1: Income by Education Level](#ex1)
    - [Preliminary Inspection](#prelim1)
    - [Checking Assumptions](#assump1)
    - [ANOVA Analysis](#anova1)
    - [Post-hoc Assumption Check](#postchk1)
    - [Pairwise Comparisons (Tukey HSD)](#pair1)
    - [Effect Size](#eff1)
- [Example 2: Attractiveness by Age Range](#ex2)
    - [Preliminary Inspection](#prelim2)
    - [Checking Assumptions](#assump2)
    - [ANOVA Analysis](#anova2)
    - [Post-hoc Assumption Check](#postchk2)
    - [Pairwise Comparisons (Bonferroni)](#pair2)
    - [Effect Size](#eff2)
- [Special Cases](#special)
- [Power Analysis](#power)

In [None]:
# LIBRARIES
library(tidyverse)
library(magrittr) ## for pipe operators
library(pwr) ## for power function and ES.h (Cohen's h)
library(scales) ## for scaling functions for ggplot2
library(effsize) ## for Cohen's D
library(DescTools) ## for LeveneTest() and non-parametric tests
library(rcompanion) # for EpsilonSquared function

## bold text specification for ggplot
bold.14.text <- element_text(face = "bold", size = 14)

### these plot size options are for jupyter notebooks ONLY
options(repr.plot.width  = 9,
        repr.plot.height = 7)

In [None]:
## LOAD the DATA
cah <- read_csv("201806-CAH_PulseOfTheNation_Raw.csv")
## variable names currently full questions - need to rename
new_names <- c("gender", "age", "agerange", "race", "income", "educ", "partyid", "polaffil", 
               "trump", "hollymoney", "fed_min_is", "fed_min_should", "fed_tax_is", "fed_tax_should", 
               "redist", "redist_you", "redist_people", "baseincome", "faircomp", "ceofair", "attractive")
colnames(cah) <- new_names
glimpse(cah)

In [None]:
#question text
spec(cah)

<a id = "ex1"></a>
## Example 1: Income by Education Level

Again we're going to use data from a Cards Against Humanity poll, this time from June 2018.  For the first example we're going to see if the mean of income differs by education level.  We have strong reason to believe this might be true - higher education is generally related to obtaining a higher paying job.  

Before we start we need to do a bit of data cleaning on our two variables.  Note - I'm not removing NA for variables I'm not using - this is important to maintain the power of your analysis.

In [None]:
## data cleaning for income and education level
cah1 <- cah %>% drop_na(income) %>% 
                filter(!educ %in% c("DK/REF", "Other")) %>% 
                mutate(educ = fct_relevel(educ, "High school or less", "Some college", "College degree", "Graduate degree"))
summary(cah1$income)
table(cah1$educ)

<a id = "prelim1"></a>
### Preliminary Inspection
The first part of any analysis is to inspect and visualize your data so that you know what you're working with.  We've looked at basic summary data in the code above when we did the data cleaning, but we should look at a couple of graphs and get a feel for what the distribution of income looks like within and between education levels.  

First, I'm going to look at a density plot to see what the distributions look like - looking for things like normality and skewness as well as seeing how much the distributions overlap.

In [None]:
## density plot
cah1 %>%
  ggplot( aes(x=income/1000, fill=educ)) +  ## divide income by 1000 to make the axes tick marks more readable.
    geom_density(alpha=0.6) +
    scale_fill_manual(values=c("#26d5b8", "#ff5733", "magenta", "blue")) +
    labs(fill= "Education Level",
         y = "Density",
         x = "Income in $1000s",
         title = "Distribution of Income by Education Level")

It's clear that there is some difference between the groups, but it's hard to see because of the extreme outliers.  Let's drop those out of the graph for now so we can better visualize what's going on in the bulk of the distributions.

In [None]:
## density plot
cah1 %>% filter(income < 200000) %>% mutate(educ = fct_relevel(educ, rev)) %>% 
  ggplot( aes(x=income/1000, fill=educ)) +  ## divide income by 1000 to make the axes tick marks more readable.
    geom_density(alpha=0.6) +
    labs(fill= "Education Level",
         y = "Density",
         x = "Income in $1000s",
         title = "Distribution of Income by Education Level")

Now we can more clearly see what's going on with the main part of the data.  I have (momentarily) reversed the order of the factor labels on education so that the large blob that represents individuals with Graduate degrees doesn't lie over all of the activity happening with the other groups - especially between 50k and 125k.

All of the distributions seem to be approximately normal with some amount of right/positive skew. 

HS and less and Some college have higher peaks - this indicates that there is less variability in these groups - more responses collected around the mean.  The shorter and fatter distributions seen in the college grad and graduate degree groups mean that there is more variability within these groups - there is more spread of values around the mean.

It's hard to determine the location of the mean/median in this view, so let's also make a boxplot to look at that information.

In [None]:
#boxplot
cah1 %>% filter(income < 200000) %>% 
  ggplot(aes(y=income/1000, x=educ, fill=educ)) +  ## divide income by 1000 to make the axes tick marks more readable.
    geom_boxplot() +
    stat_summary(fun = mean, geom = "errorbar", aes(ymax = after_stat(y), ymin = after_stat(y), color = "mean"),
                 width = 0.75, linetype = "solid", size = 2) +  ## this one adds the line that indicates the group mean.
    scale_fill_manual(values=c("#26d5b8", "#ff5733", "magenta", "blue")) +
    scale_color_manual(values = "#39ff14")+ ## makes the group mean line green - doing it this way forces it to show in the legend
    labs(fill= "Education Level",
         y = "Income in $1000s",
         x = "Education Level",
         color = "Group Mean",
         title = "Distribution of Income by Education Level")

Again, I limited this to not include any income outliers - any responses greater than 200k.  

What we see here:
1. With the exception of the Graduate degree group, the mean and median within each group differ from each other - reflecting the skewness we saw in the density plot.
2. Like we saw in the density plot, the range/variation of the observations in the College degree and Graduate degree groups is larger than the variation in the other two groups.
3. The main point of looking at this box plot is that it allows us to more easily see where the group means fall in the distribution and compare them to each other.  It appears at first glance that the mean incomes of the HS and Some college groups are very close to each other.  The College degree and Graduate degree group means are both larger than those first two groups, and different from each other.

This is consistent with our measurable hypothesis - Income depends on one's level of education.

[Return to Top](#top)
<a id = "assump1"></a>
### Checking Assumptions
ANOVA analysis has 6 assumptions, 5 of which we can review/check before our analysis.

The basic ones we don't need to test but just need to confirm (possibly through looking at survey/data documentation):
1. Dependent variable is numeric - **income is numeric**
2. Group sample sizes are approximately equal - **if you look back at when we did our data cleaning, each group had roughly 100 observations**
3. Independence of observations - **the observations come from independent respondents who were randomly selected**
4. No extreme outliers - **We do have extreme outliers!  We will remove them from the data moving forward so that we do not violate this assumption (and so they don't unduly bias our results)**

In [None]:
# remove outliers
cah1 %<>% filter(income < 200000)
summary(cah1$income)

Our last pre-check item:
5. Homogeneity of variance - the within group variance for each of the groups should be equal. **We will check this with Levene's Test**

The statistical hypotheses for Levene's Test are:

$H_0:$ The variances in the groups are equal. <BR>
$H_A:$ The variances in the groups are not equal.

In this test, we sort of want to fail to reject the null, because it's easier if our variances are equal and we don't need to make the adjustment.

REMEMBER: This is a pre-test to check this assumption.  This is NOT your ANOVA analysis!!!!

In [None]:
#LeveneTest(DV ~ IV, data = your data frame)

LeveneTest(income ~ educ, data = cah1)

As we may have expected from reviewing our density graph and boxplots, the within group variances are not equal.   Because the p-value of Levene's Test is less than alpha = 0.05, that means that **we are violating the assumption of homogeneity of variance.** 

**IMPORTANT**: This is a test of variance, but this is **NOT** the ANOVA test.  This just tests the assumption of homogeneity of variances.  We cannot use these results to make inference about our means.
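If you're curious what Levene's Test is doing under the hood, here is a minimal sketch.  It uses R's built-in `PlantGrowth` data (my stand-in for illustration, not the CAH data) and assumes the median-centered version of the test, which as far as I know is the default for `LeveneTest()`: take each observation's absolute deviation from its own group median, then run an ordinary one-way ANOVA on those deviations.

```r
## sketch: Levene's Test "by hand" on a built-in dataset (PlantGrowth, NOT the CAH data)
pg <- PlantGrowth   # weight (numeric DV) by group (3-level factor IV)

## absolute deviation of each observation from its own group median
group_median <- ave(pg$weight, pg$group, FUN = median)
pg$abs_dev   <- abs(pg$weight - group_median)

## a one-way ANOVA on the absolute deviations gives the (median-centered) Levene statistic
levene_by_hand <- summary(aov(abs_dev ~ group, data = pg))[[1]]
levene_F <- levene_by_hand[["F value"]][1]
levene_p <- levene_by_hand[["Pr(>F)"]][1]
levene_F
levene_p
```

Groups with more spread have larger absolute deviations on average, so a significant F here points to unequal variances, not unequal means.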

We cannot test our last assumption until after our analysis, so let's move forward!

[Return to Top](#top)
<a id = "anova1"></a>
### ANOVA Analysis
We've checked off our pre-flight to do list, now we're ready for lift off!  And R makes it easy to conduct an ANOVA analysis - we just need one line of code!

Up front, let's review our statistical hypotheses

$H_0: \mu_{hs} = \mu_{sc} = \mu_{cd} = \mu_{gd}$  <BR>
$H_A: \mu_i \neq \mu_j$ for at least one pair of education levels $i, j$

Or, in words:

$H_0:$ There is no difference in average income between education levels. <BR>
$H_A:$ There is at least one significant difference between average income by education level.

In [None]:
# aov(DV ~ IV, data = yourdf)
aov(income ~ educ, data = cah1)

**Wait! Where are our results?**

By default, this is the output of `aov()` which is not very informative.  To get the "good stuff" we need to call the `summary()` function on the aov object.  We can do this by wrapping the entire call in summary, or by saving the result of calling aov to an object that you then pass to `summary()`

In [None]:
# get full results of aov
inc_educ_aov <- aov(income ~ educ, data = cah1)
summary(inc_educ_aov)

Now we can figure out our results! 

Sum Sq and Mean Sq are so large in this case because our income values (for example: $180,000) are large numbers.  The magnitude of these values depends on the units of your observations and cannot be compared between different analyses.  Taking the ratio of the two Mean Sq values to get the F-value removes the units (in the same way that converting an x value to a z-score removes the units).
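To make the point about units concrete, here is a small sketch that builds the F-value by hand and then rescales the outcome.  It uses the built-in `PlantGrowth` data as a stand-in (not the CAH data); multiplying the outcome by 1000 is just an arbitrary change of units for illustration.

```r
## sketch: where the F-value comes from, using built-in PlantGrowth data (NOT the CAH data)
y <- PlantGrowth$weight
g <- PlantGrowth$group

grand_mean  <- mean(y)
group_means <- ave(y, g)   # each observation's own group mean (ave() defaults to mean)

SSB <- sum((group_means - grand_mean)^2)   # between-group sum of squares
SSW <- sum((y - group_means)^2)            # within-group (residual) sum of squares

df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)

F_by_hand <- (SSB / df_between) / (SSW / df_within)   # ratio of the two mean squares

## same F from aov(), and again after rescaling the DV by 1000
F_aov      <- summary(aov(y ~ g))[[1]][["F value"]][1]
F_rescaled <- summary(aov(I(y * 1000) ~ g))[[1]][["F value"]][1]
c(by_hand = F_by_hand, aov = F_aov, rescaled = F_rescaled)
```

Rescaling multiplies both mean squares by the same factor, so it cancels in the ratio and all three F-values agree.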

Our p-value is very, very small (0.0000000000255), and below an alpha of 0.05, so we reject the null hypothesis.  This means that our result is statistically significant.

In non-statistical terms, this means that the average income of at least one of the educational levels significantly differs from the others.  We don't know which one(s) significantly differ from this output, however.  We'll address that in a minute with the pairwise comparisons.  In the meantime we need to check our one post-hoc assumption.

[Return to Top](#top)
<a id = "postchk1"></a>

### Post-hoc Assumption Check
We need to check one final assumption - the normality of the residuals.  Because the residuals are calculated during the ANOVA analysis (they're what make up SSW - or the residual sum of squares), we cannot check this assumption prior to the analysis.

6. Normality of Residuals - **We check this via a QQ plot of the _residuals_ from our data.**

You can run `str()` on your `aov()` result and see that inside that object are saved a number of different pieces that you can extract using $ indexing.  One of these pieces is the vector of residuals.  We can use these to create a QQ plot, in the same way we have previously created QQ plots, except this time it's of the residuals, not the observations.

In [None]:
## need to save residuals as a df so that we can use ggplot.
## we're getting the residuals from our previously saved aov object (See previous code block)
resid_df <- data.frame(resid = inc_educ_aov$residuals) ## the residuals part of the aov results using $residuals

resid_df %>% ggplot(aes(sample = resid)) +
  geom_qq_line(color = "red", size = 1) +
  geom_qq(color = "black") +
  labs(title = "QQ Plot of Residuals")

Here it appears that our residuals are somewhat normally distributed.  There is a small amount of deviation from normality in the lower tail, and a more extreme amount of deviation from normality in the upper tail.

Because the deviations are both on the top side of the reference line it's indicative of the right skew we can see in our density plot of the observations.  That right skew carries over into our residuals.  We can plot a density plot of the residuals if we're curious. (not necessary)

Remember, the residuals are the individual observation deviations from the GROUP mean.  They are in dollars (the same unit as the observations).
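If you want to convince yourself of that, here is a quick sketch that rebuilds the residuals by hand and compares them to what `aov()` stores.  It uses the built-in `PlantGrowth` data as a stand-in (not the CAH data).

```r
## sketch: aov() residuals are observation minus GROUP mean (built-in PlantGrowth data)
fit <- aov(weight ~ group, data = PlantGrowth)

group_mean    <- ave(PlantGrowth$weight, PlantGrowth$group)  # each row's own group mean
resid_by_hand <- PlantGrowth$weight - group_mean

all.equal(unname(fit$residuals), resid_by_hand)   # do the two versions agree?
mean(fit$residuals)                               # essentially 0 (up to rounding error)
```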

In [None]:
# not necessary density plot of residuals.
# quick and "dirty" - not PQ

resid_df %>% ggplot( aes(x=resid)) +  
    geom_density(fill = "blue") 

This confirms the right skew - notice where 0 (the mean of the residuals) sits on the x axis.

[Return to Top](#top)
<a id = "pair1"></a>
### Pairwise Comparisons (Tukey HSD)
Now we get to the fun part - it's time to figure out which group means (education levels) actually significantly differ from the others, through pairwise t-tests.  Recall that we need to use special procedures for these pairwise t-tests to adjust for the multiple comparisons problem - the inflation of Type I error that comes from repeatedly conducting statistical tests on the same data.  In this first example I will show Tukey HSD.  We'll look at Bonferroni in the second example.

Since we have 4 groups, the number of pairwise comparisons we'll have is:

## ${4 \choose 2} = \frac{4!}{2!(4-2)!} = \frac{4\times3\times2\times1}{(2\times1)(2\times1)} = \frac{24}{4} = 6$

This is a "combination" problem under the category of math/probability called combinatorics.  You may have seen this problem/formula in STAT 100 or other courses that cover probabilities.  How this is read is:

4 choose 2 equals 4 factorial over 2 factorial multiplied by (4-2)factorial.

4 choose 2 means that from four objects we choose 2.  This is a combination problem because we don't care about order: SC - HS is the same thing as HS - SC.

Note: you don't have to do the combination problem, you can just wait and see how many pairwise comparisons you get when you run the code.
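R can also do the counting for you.  This sketch uses base R's `choose()` and `combn()`; the labels HS, SC, CD, and GD are just my shorthand for the four education levels.

```r
## sketch: counting and listing the pairwise comparisons for 4 groups
## (HS, SC, CD, GD are shorthand labels for the four education levels)
groups <- c("HS", "SC", "CD", "GD")

n_pairs <- choose(length(groups), 2)   # 4 choose 2
pairs   <- combn(groups, 2)            # one column per unordered pair

n_pairs
apply(pairs, 2, paste, collapse = " - ")
```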

In [None]:
# TukeyHSD() with saved aov() object - we saved this in the ANOVA analysis section above.
TukeyHSD(inc_educ_aov)

As we may have expected from visual examination of the boxplot - all of the pairwise comparisons are significantly different EXCEPT for the difference between HS and Some college.  It looks like, in terms of income, Some College probably is very similar to HS diploma, and both of those groups significantly differ from the average income when one has a college or graduate degree.

The output shows the difference between the group means, the lower and upper bound of the 95% CI for that _difference_ and the p-value (adjusted for the multiple comparisons).  The first column shows you the order of the subtraction done in the numerator in the t-test, so the sign of diff can show you the direction of the difference.  The groups with degrees all have significantly greater income, on average, than the groups with HS diplomas or some college.
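When there are many comparisons, it can help to pull that table into a data frame so you can sort or filter it.  Here is a sketch using the built-in `PlantGrowth` data as a stand-in (not the CAH data); the `comparison` and `significant` columns are names I made up for illustration.

```r
## sketch: pulling the Tukey HSD table into a data frame (built-in PlantGrowth data)
fit   <- aov(weight ~ group, data = PlantGrowth)
tukey <- TukeyHSD(fit)

## the result is a list with one matrix per factor; columns are diff, lwr, upr, p adj
tukey_df <- as.data.frame(tukey$group)
tukey_df$comparison  <- rownames(tukey_df)           # e.g. "trt1-ctrl" = trt1 minus ctrl
tukey_df$significant <- tukey_df[["p adj"]] < 0.05   # flag comparisons below alpha

tukey_df[order(tukey_df[["p adj"]]), ]
```

A comparison flagged as significant at alpha = 0.05 is also one whose 95% CI (lwr to upr) does not include 0.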

Which leads us to effect size...

[Return to Top](#top)
<a id = "eff1"></a>
### Effect Size
We now need to look at the magnitude of these differences to see if they're substantively significant on top of being statistically significant.

#### Unstandardized Effect Size
Unstandardized Effect Size is always the difference between the means in the units of the observations.  Because it's in the units of the observations it's unstandardized - which means we can't compare between different analyses - we can't compare $10,000 to a difference of 2 in attractiveness on a range from 1-10.  Just because 2 is much smaller than 10,000 doesn't mean that the magnitude of the difference is any less.

Here we can look at the difference in means from our Tukey HSD output.

The difference between SC and HS is not statistically significant and also not substantively significant - the difference in mean income is about $2k, which I would not consider large enough to "matter" in the real world.

The largest differences are about 30k, which is a sizeable difference in means.  This is in the comparisons between the highest educated group (Graduate Degree earners) and the lowest educated groups (HS and SC).  

The other differences are in the 14k - 18k range, still substantive, but not as large as the 30k differences (about half as large).  These are seen between CD and HS or SC, and between CD and GD.

Overall, I would conclude that the unstandardized difference is substantively significant.

#### Standardized Effect Size - R-squared.
Unlike in previous statistical tests (where we used Cramer's V and Cohen's d), r-squared doesn't tell us about the magnitude of the differences.  It tells us how much of the variance in income can be explained by education level.  Or in other words - is educational level a substantive predictor of income?  An IV can be a significant predictor of the outcome while also not accounting for much variance in the outcome.

The formula for r-squared is:

## $r^2 = \frac{SS_{between}}{SS_{total}}$

It calculates the proportion of the total variation that is explained by the groups.

Because we didn't manually calculate SSB and SSW, we can extract these numbers from our saved `aov()` object.

In [None]:
# calculate r-squared
##first, obtain SSB and SSW from the aov output - aov() output saved in ANOVA analysis section
## we'll use tidy() from the broom package to convert the aov() summary to a df.
tidyaov <- broom::tidy(inc_educ_aov)
SSB <- tidyaov$sumsq[1] ## sumsq between is in the first row of the sumsq column
SSW <- tidyaov$sumsq[2] ## residual (within) sumsq is in the second row

rsq = SSB / (SSB + SSW) ## add SSB and SSW in the denominator to get SST (total sum of squares)
rsq # proportion
percent(rsq, accuracy = .01) # percentage

Education level explains 13% of the variance in income.  This may seem like a small amount, but in real world data analysis it is a decent amount of variance to be explained by a single predictor.
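As a sanity check on the formula, the r-squared we build from the ANOVA sums of squares is the same number `lm()` reports as R-squared for the same model.  This sketch uses the built-in `PlantGrowth` data as a stand-in (not the CAH data).

```r
## sketch: r-squared from the ANOVA sums of squares matches R-squared from lm()
fit_aov <- aov(weight ~ group, data = PlantGrowth)
ss      <- summary(fit_aov)[[1]][["Sum Sq"]]   # first entry = between, second = within

rsq_anova <- ss[1] / sum(ss)                   # SSB / (SSB + SSW)
rsq_lm    <- summary(lm(weight ~ group, data = PlantGrowth))$r.squared

c(anova = rsq_anova, lm = rsq_lm)
```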

#### Cohen's $f$
The other effect size statistic we will use is Cohen's $f$.  Cohen's $f$ is primarily needed because it is the effect size used in power calculations.  Cohen's $f$ can be interpreted similarly to Cohen's d (using rule-of-thumb cutoffs), but it now summarizes the overall magnitude of the differences in means (because now we have many pairwise differences).

R-squared is preferred for "interpretation" purposes.  Cohen's $f$ is calculated using the r-squared value.  And Cohen's $f$ is **required** for the power analysis.  You cannot use r-squared as the effect size in the `pwr.anova.test()` function.

## $f = \sqrt{\frac{r^2}{1 - r^2}}$

In [None]:
## calculate cohen's f using saved value of rsq
cohenf <- sqrt(rsq / (1-rsq))
cohenf

The Cohen's $f$ is approximately 0.4.  Note that Cohen's $f$ has its own rule-of-thumb benchmarks (0.10 small, 0.25 medium, 0.40 large), which differ from the Cohen's d cutoffs (0.2, 0.5, 0.8).  By those benchmarks, a value of about 0.4 is a large effect.

Overall, I'd conclude that the difference in average income by education level is both significant and substantive.
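Since Cohen's $f$ mainly exists to feed the power calculation, here is a rough sketch of how it enters.  It computes power directly from the noncentral F distribution in base R, assuming a balanced design (the same number of observations in every group).  To my understanding this mirrors what `pwr.anova.test()` does, but treat that as an assumption.  All of the input values are hypothetical, not results from the CAH data.

```r
## sketch: power of a one-way ANOVA from Cohen's f via the noncentral F distribution
## all inputs are hypothetical illustration values, NOT results from the CAH data
k     <- 4      # number of groups
n     <- 50     # observations per group (balanced design assumed)
f     <- 0.25   # Cohen's f
alpha <- 0.05

df1    <- k - 1                    # between-groups degrees of freedom
df2    <- k * (n - 1)              # within-groups degrees of freedom
lambda <- k * n * f^2              # noncentrality parameter: total N times f squared
F_crit <- qf(1 - alpha, df1, df2)  # critical F-value under the null

## power = probability of exceeding the critical F when the true effect is f
power <- pf(F_crit, df1, df2, ncp = lambda, lower.tail = FALSE)
power
```

Larger values of $f$ or $n$ increase the noncentrality parameter, which is what pushes power up.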

Let's move on to our second example

[Return to Top](#top)
<a id = "ex2"></a>

## Example 2: Attractiveness by Age Range
This example will not have as much discussion of the concepts and will be more focused on doing the analysis and the interpretation of the analysis.  If you need more information on the concepts definitely look at Example 1.  

In this example we'll also use the same CAH poll data, but we'll look at two different variables.  Attractiveness, which is a 1-10 self-rating of one's attractiveness (_This next question is about your physical appearance, and you may choose not to respond if it makes you uncomfortable. On a scale of 1-10, how physically attractive are you?_) for the outcome/numerical variable, and Age range (categorical age) for the predictor.  We seek to see if age influences how attractive people think they are.

Attractiveness is a rating on 1-10, so may be considered more ordinal than numerical, but it's got a big enough range to do numerical analysis using it.

Throughout this example I'll use some alternative ways of doing things (different types of graphs, the other type of pairwise tests), so make sure you review both examples.

In [None]:
## data cleaning
cah2 <- cah %>% filter(attractive != "DK/REF") %>% 
                mutate(attractive = replace(attractive, attractive == "Not attractive at all", "1")) %>% 
                mutate(attractive = replace(attractive, attractive == "Very attractive", "10")) %>% 
                mutate(attractive = as.numeric(attractive), agerange = factor(agerange))
table(cah2$agerange)
summary(cah2$attractive)

[Return to Top](#top)
<a id = "prelim2"></a>
### Preliminary Inspection
Our first part is always preliminary inspection.  This time I'm going to use a violin plot that shows both the density and the boxplot on one graph.

In [None]:
cah2 %>% ggplot(aes(x = agerange, y = attractive, fill = agerange)) + 
            geom_violin() +
            geom_boxplot(width=0.1, fill = "white", color = "black", size = 1)+
            stat_summary(fun = mean, geom = "errorbar", aes(ymax = after_stat(y), ymin = after_stat(y), color = "mean"),
                 width = 0.75, linetype = "solid", size = 2) +
            scale_color_manual(values = "#39ff14")+
            labs(fill="Age Range",
                 y = "self-rated attractiveness",
                 x = "",
                 title = "Distribution of self-rated attractiveness by Age Range",
                 color = "Group Mean") +
            theme(legend.position = "bottom", text = bold.14.text) +
            ylim(0,10)

First, we'll look at the density part of the graph - the colored "violin shape" - which is the same thing as our density graph, but mirrored on either side of the box plot (which is why it's thought to look like a violin, I'll leave to your own imagination what it may actually look like...).  It appears that none of the group distributions are particularly normal, except for maybe in the age range of 25-34.  Because this variable is more ordinal (it's a score on a discrete range), this is to be expected.  The spread of self-rated attractiveness is much narrower in the first two age groups (less variation) and much wider in the four other age groups (more variation).  

Moving to the boxplot portion of the graph, I've also added the neon green line here to indicate the group mean.  In most of the age groups it appears that there is not much difference between the group mean and median, indicating that there is probably not a lot of skewness in those groups.  However, amongst the two oldest age groups there is a difference between mean and median, indicating some potential skewness.  We can see this in the density part of the graph, with the long tails going down to about a score of 1.  

Moving to comparing the group means, there does seem to be a general downward trend in mean self-rated attractiveness as individuals get older.  There does not appear to be a discernible difference between the oldest two groups.  But, on a 1-10 scale, a 0.5 point difference in means can be substantial.  We'll have to run the ANOVA to see if any of this is statistically significant.

[Return to Top](#top)
<a id = "assump2"></a>
### Checking Assumptions
ANOVA analysis has 6 assumptions, 5 of which we can review/check before our analysis.

The basic ones we don't need to test but just need to confirm (possibly through looking at survey/data documentation):
1. Dependent variable is numeric - **it would best be considered ordinal, but it has a large enough range (1-10) to treat it as numerical in analysis**
2. Group sample sizes are approximately equal - **the group sample sizes range from ~40 to ~300, so not necessarily close to equal, but not far enough apart to be too concerned.  If we had one group that was 38 and one group that was 1038, that would be a problem.**
3. Independence of observations - **the observations come from independent respondents who were randomly selected**
4. No extreme outliers - **We have some skewness, but no observations that would be considered extreme**

And our pre-check assumption we'll run in R:
5. Homogeneity of variance - the within group variance for each of the groups should be equal. **We will check this with Levene's Test**

In [None]:
#LeveneTest(DV ~ IV, data = your data frame)
LeveneTest(attractive ~ agerange, data = cah2)

With a p-value less than alpha = 0.05, we reject the null hypothesis.  The alternative hypothesis states our within group variances are not equal, which means we are violating the assumption of homogeneity of variance.  There's a way we can deal with this (which I will show in the special cases section below), however, it makes it more difficult to do the pairwise tests afterwards.  By running `aov()` with this data we are not adjusting for this violation of the assumption of homogeneity of variance.

Our last assumption check (normality of residuals) will wait until after we run our ANOVA model.

[Return to Top](#top)
<a id = "anova2"></a>
### ANOVA Analysis
What are our statistical hypotheses this time?

$H_0:$ There is no difference in average self-rated attractiveness between age groups. <BR>
$H_A:$ There is at least one significant difference between average self-rated attractiveness by age groups.

In [None]:
## remember we want to save our result object (aov object) because we'll need it later.
attract_aov <- aov(attractive ~ agerange, data = cah2)
summary(attract_aov)

Our p-value (0.014) is below an alpha of 0.05, so we reject the null hypothesis.  This means that our result is statistically significant.

In non-statistical terms, this means that the average self-rated attractiveness of at least one of the age groups significantly differs from the others.  We don't know which one(s) significantly differ from this output, however.  We'll address that in a minute with the pairwise comparisons.  In the meantime we need to check our one post-hoc assumption.

[Return to Top](#top)
<a id = "postchk2"></a>

### Post-hoc Assumption Check
We need to check one final assumption - the normality of the residuals.  Because the residuals are calculated during the ANOVA analysis (they're what make up SSW - or the residual sum of squares), we cannot check this assumption prior to the analysis.

6. Normality of Residuals - **We check this via a QQ plot of the _residuals_ from our data.**

In [None]:
## need to save residuals as a df so that we can use ggplot.
## we're getting the residuals from our previously saved aov object (See previous code block)
resid_df2 <- data.frame(resid = attract_aov$residuals) ## the residuals part of the aov results using $residuals

resid_df2 %>% ggplot(aes(sample = resid)) +
  geom_qq_line(color = "red", size = 1) +
  geom_qq(color = "black") +
  labs(title = "QQ Plot of Residuals")

Here the distribution of the residuals appears to be approximately normal, with some deviation in the upper tail.  This could be due to under-dispersion of data - see this for explanation of interpretation of QQ plots http://www.ucd.ie/ecomodel/Resources/QQplots_WebVersion.html

Overall, the residuals for this result are fairly normal, despite that bit of deviation.

[Return to Top](#top)
<a id = "pair2"></a>
### Pairwise Comparisons (Bonferroni Adjustment)
Now we'll look to see which age groups are significantly different than each other on self-rated attractiveness.  Since I showed an example of Tukey HSD last time, I'll show the Bonferroni Adjustment this time.

Recall that the Bonferroni Adjustment conducts the t-test as normal, but adjusts alpha by dividing the overall alpha by the number of comparisons.  In our code output, the adjustment is instead made to the p-values: each p-value is multiplied by the number of comparisons (capped at 1), so we still compare against alpha = 0.05.
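Here is a small sketch showing that the two versions of the adjustment lead to the same decisions.  The raw p-values are made up for illustration; `p.adjust()` is the base R function that applies the correction.

```r
## sketch: the Bonferroni adjustment on a made-up vector of raw p-values
raw_p <- c(0.004, 0.020, 0.049, 0.300)   # hypothetical unadjusted p-values
m     <- length(raw_p)                   # number of comparisons

## option 1: compare the raw p-values to a smaller alpha
alpha_adj <- 0.05 / m
raw_p < alpha_adj

## option 2: inflate the p-values (capped at 1) and compare to the usual alpha
bonf_p <- p.adjust(raw_p, method = "bonferroni")
bonf_p
bonf_p < 0.05
```

Both options flag the same comparisons as significant; the second is just easier to read off of output.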

In [None]:
# for bonferroni we use the function pairwise.t.test with the p.adjust.method argument set to "bonferroni"
# pairwise.t.test(outcome, predictor, p.adjust.method = "bonferroni")
pairwise.t.test(cah2$attractive, cah2$agerange, p.adjust.method = "bonferroni")

The Bonferroni output only gives us the p-values and not the differences in means.  They're displayed in a format similar to a correlation matrix.  Looking at the p-values, none are below alpha = 0.05.  So while the F-test was significant, none of the pairwise comparisons are significant, using the Bonferroni adjustment.  Let's see if the results are different with Tukey HSD.  Remember that the Bonferroni adjustment is conservative, and yields higher p-values than with Tukey HSD (when looking at all possible pairwise comparisons).

In [None]:
# TukeyHSD() with saved aov() object - we saved this in the ANOVA analysis section above.
TukeyHSD(attract_aov)

The only difference that approaches significance at alpha = 0.05 is the difference between the highest age group (65+) and the lowest age group (18-24).

You may wonder why we don't see at least one significant pairwise difference when we said the result of the F-test was significant, and that meant that at least one mean was significantly different from the others.  

Well, what the F-test actually tells us is the ratio of the between group variance to the residual variance is large enough to pass the critical F-value.  This means that the between group variance is substantially large in comparison to the residual variance.  This indicates that the age groups are explaining some of the variance in attractiveness, enough to be significant.

From https://help.xlstat.com/s/article/how-to-interpret-contradictory-results-between-anova-and-multiple-pairwise-comparisons?language=en_US:

_Here are some suggestions why post-hoc tests may appear non-significant while the global effect is significant. The list below is not exhaustive. Other situations exist._

- _A lack of statistical power. For example, when groups have small sizes. When pairwise comparison tests are not statistically powerful, it is less likely to detect significant differences._ **we had one small group in comparison to the other groups.  Lack of statistical power means that we fail to see a significant effect where one might actually exist (Type II error)**
- _A high number of factor levels can also be an explanation. The more pairwise comparisons you have, the more your p-values will get penalized in order to decrease the risk of rejecting null hypotheses while they are true._ **we had 5 age groups, had we collapsed some (which would help with the small group size issue), we would have had a smaller number of groups, and therefore a smaller penalty for multiple comparisons**
- _A weakly significant global effect (p-value of the ANOVA table is very close to the significance level)._ **ours was 0.0143, so not too close to 0.05, but also a lot larger than the extremely small p-value we saw in the first example**
- _A conservative multiple comparisons test. The more conservative the test, the more likely you will reject significant differences between means that in reality are meaningful._ **compare the p-values in the Bonferroni adjustment (the more conservative adjustment) to the Tukey HSD results**

Looking at our effect size, especially our r-squared value, will tell us if age group has any predictive power for self-rated attractiveness.

[Return to Top](#top)
<a id = "eff2"></a>
### Effect Size
We now need to look at the magnitude of these differences to see if they're substantively significant.

#### Unstandardized Effect Size
Looking at the Tukey output, we see that the differences in mean self-rated attractiveness between groups range from practically 0 to about 0.8 points (on our 1-10 scale).  I would consider 1 point on a 1-10 scale to be a substantial difference, since it amounts to 1/10th of the overall scale, so even the largest difference here falls short of that mark.

#### R-squared
We can calculate r-squared using the SSB and SSW that we can obtain from our saved `aov()` output.

In [None]:
# calculate r-squared
## first, obtain SSB and SSW from the aov output (attract_aov was saved in the ANOVA Analysis section)
## tidy() from the broom package converts the aov() summary to a data frame;
## pkg::function() calls a function from an installed package without loading the entire library
tidyaov2 <- broom::tidy(attract_aov)
SSB2 <- tidyaov2$sumsq[1] ## between-group sum of squares is in the first row of the sumsq column
SSW2 <- tidyaov2$sumsq[2] ## residual (within-group) sum of squares is in the second row

rsq2 <- SSB2 / (SSB2 + SSW2) ## SSB + SSW in the denominator gives SST (total sum of squares)
rsq2 # proportion
percent(rsq2, accuracy = .01) # percentage

Age group explains about 2% of the variance in self-rated attractiveness.  So, consistent with what we concluded from the results above, age group is likely not a good predictor of self-rated attractiveness.
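
As a sanity check on the SSB / (SSB + SSW) calculation, the sketch below repeats it on `PlantGrowth`, a dataset that ships with base R. It is used here only as a stand-in and is not the notebook's data. The result is compared with the R-squared that `lm()` reports for the same one-factor model.

```r
# PlantGrowth is built into R; weight ~ group is a stand-in one-way ANOVA
fit <- aov(weight ~ group, data = PlantGrowth)

# Sums of squares from the ANOVA table:
# row 1 = between groups, row 2 = residual (within groups)
ss  <- anova(fit)[["Sum Sq"]]
ssb <- ss[1]
ssw <- ss[2]

rsq_manual <- ssb / (ssb + ssw)   # SSB / SST
rsq_lm     <- summary(lm(weight ~ group, data = PlantGrowth))$r.squared

c(manual = rsq_manual, lm = rsq_lm)   # the two values should agree
```

The two numbers match because a one-way ANOVA is a linear model with a single categorical predictor.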

#### Cohen's $f$
Finally, we can look at Cohen's $f$ just to get the full picture of the magnitude of the effect.

In [None]:
## calculate cohen's f using saved value of rsq
cohenf2 <- sqrt(rsq2 / (1-rsq2))
cohenf2

Our Cohen's $f$ is approximately 0.15.  Note that the conventional benchmarks for $f$ (0.10 small, 0.25 medium, 0.40 large) are not the same as the ones we used with Cohen's d (0.2, 0.5, 0.8).  On the $f$ scale our value falls between small and medium, closer to small.  This again supports the conclusion that the effect is not substantively significant.  It is probably only marginally statistically significant as well, although a case could be made for lack of power, which we can investigate below.
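
As a quick arithmetic check of the formula used in the cell above, plugging in the rounded value $R^2 \approx 0.02$ gives:

$$f = \sqrt{\frac{R^2}{1 - R^2}} = \sqrt{\frac{0.02}{0.98}} \approx 0.14$$

The value computed in the notebook is slightly different because it uses the unrounded r-squared.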

Let's move to some special cases first.

[Return to Top](#top)
<a id = "special"></a>

## Special Cases
Sometimes when our data doesn't adhere to our assumptions, there are alternative types of ANOVA analyses we can use.  

First, we'll look at what we can do when we violate the assumption of homogeneity of variances (equal within group variances).

### ANOVA Analysis with Adjustment for Unequal Variances
In both of the examples above, our Levene's Test indicated that we had significantly unequal variances; however, we did not address that in our code.  Unfortunately, the code that performs the analysis adjusted for unequal variances (Welch's ANOVA) doesn't leave us with an `aov()` object, which makes it harder to do some of our post-hoc analyses, especially pairwise comparisons using Tukey's HSD.  

To show you how this would work, however, I will revisit the analysis from the first example.

We were looking at whether level of education influences income.  Let's pick up at the ANOVA analysis step, now addressing the unequal variances (heteroscedasticity).

In [None]:
#oneway.test(formula, data, var.equal = FALSE) var.equal = FALSE is default, so can be left out of your function call

oneway.test(income ~ educ, data = cah1)

We get a similar result, but note that the denominator degrees of freedom are adjusted as part of the adjustment for unequal variances, the F-ratio is lower, and the p-value is higher.  By initially running `aov()` we obtained a result that was "more significant" than the one we get when we don't ignore the unequal variances.  In this case it doesn't change the result and interpretation (statistical significance), but if our result were closer to the threshold of statistical significance, the adjustment could make the difference between a p-value that is below 0.05 and one that is above 0.05.
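
To see the adjustment side by side, here is a minimal sketch using the built-in `PlantGrowth` data as a stand-in (an assumption for illustration; it is not the `cah1` data). Setting `var.equal = TRUE` in `oneway.test()` reproduces the classic F-test from `aov()`, while the default applies the unequal-variance adjustment.

```r
# PlantGrowth (built into R) stands in for cah1; weight ~ group plays the role of income ~ educ
classic <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = TRUE)  # classic F-test
welch   <- oneway.test(weight ~ group, data = PlantGrowth)                    # default: var.equal = FALSE

# F statistic from aov() for comparison with the classic version
aov_F <- anova(aov(weight ~ group, data = PlantGrowth))[["F value"]][1]

# Compare F statistic, denominator df, and p-value side by side
data.frame(
  test     = c("equal variances", "unequal variances"),
  F        = unname(c(classic$statistic, welch$statistic)),
  denom_df = unname(c(classic$parameter[2], welch$parameter[2])),
  p_value  = c(classic$p.value, welch$p.value)
)
```

The classic row matches `aov_F`, and the adjusted row has a smaller, non-integer denominator degrees of freedom. How much the F-ratio and p-value move depends on how unequal the group variances are.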

#### Pairwise Tests - Unequal Variances

Unfortunately, `oneway.test()` outputs an object of class `htest`, which is the same class of object that we obtain from `t.test()`, while `aov()` returns an object of class `c("aov", "lm")`.  This means that we can't run `TukeyHSD()` on the object obtained from `oneway.test()`.  However, we can use the same function we used for the pairwise tests with Bonferroni adjustment, with slightly different arguments.

In [None]:
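## pool.sd = FALSE: each pair gets its own t-test without a pooled SD; "BH" = Benjamini-Hochberg p-value adjustment
pairwise.t.test(cah1$income, cah1$educ, p.adjust.method = "BH", pool.sd = FALSE)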

Again, this gives us a very similar result - all pairs significantly different except for the SC vs. HS pair.  But we can't always assume that we don't need to worry about violating this assumption.

### Non-parametric Analysis
Similar to the non-parametric t-test (where we compared medians instead of means), there is also a non-parametric version of the ANOVA analysis.  The benefit of a non-parametric test is that it makes fewer assumptions than a parametric test (it does not require normality or numerical data), which makes it appropriate for cases when our data is not normal or is ordinal (vs. numerical). However, non-parametric tests are less powerful, less detailed, and less specific. 

Because the self-rated attractiveness score is ordinal, let's revisit that analysis in the non-parametric format: the Kruskal-Wallis Test.

The test uses the rank order of the observations instead of the actual values and compares the mean rank by group.

This test generates an H statistic (not F).  R labels it "Kruskal-Wallis chi-squared" in the output because the H statistic is compared against the Chi Square distribution to determine statistical significance.
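
The sketch below computes the H statistic by hand from ranks and compares it with what `kruskal.test()` reports. It uses the built-in `PlantGrowth` data as a stand-in (an assumption for illustration, not the `cah2` data), and includes the standard correction for tied values.

```r
x <- PlantGrowth$weight   # outcome (stand-in for attractive)
g <- PlantGrowth$group    # grouping factor (stand-in for agerange)

n <- length(x)
r <- rank(x)              # ranks across ALL observations; ties get the average rank

rank_sums <- tapply(r, g, sum)      # sum of ranks within each group
group_n   <- tapply(r, g, length)   # group sizes

# H statistic before the tie correction
H_raw <- 12 / (n * (n + 1)) * sum(rank_sums^2 / group_n) - 3 * (n + 1)

# Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n), where t = size of each set of tied values
ties <- table(x)
H    <- H_raw / (1 - sum(ties^3 - ties) / (n^3 - n))

# Significance comes from the chi-square distribution with (number of groups - 1) df
df    <- nlevels(g) - 1
p_val <- pchisq(H, df, lower.tail = FALSE)

kw <- kruskal.test(weight ~ group, data = PlantGrowth)
c(H_by_hand = H, H_from_R = unname(kw$statistic))
```

The hand-computed H and p-value should match the "Kruskal-Wallis chi-squared" and p-value in the `kruskal.test()` output.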

In [None]:
# Running Kruskal-Wallis
# kruskal.test(formula, data)
kruskal.test(attractive ~ agerange, data = cah2)

Our result is still significant; however, we again see that the p-value is slightly higher, which suggests that this test is more conservative here.  In other cases, ignoring the ordinal nature of the data might lead us to the wrong conclusion.

We can still do post-hoc pairwise comparisons in the non-parametric format.  This is accomplished through a Dunn Test.

In [None]:
## DunnTest (From DescTools package) DunnTest(DV, IV, method = "bonferroni")
DunnTest(cah2$attractive, cah2$agerange, method="bonferroni")

Even when looking at the mean rank order vs. the mean of the actual values we get similar p-values with the Bonferroni adjustment.  Again, we cannot always rely on `aov()` providing similar results to the more appropriate tests for the data we're using.

And we can even look at the effect size for this non-parametric test, using Epsilon-squared, which has the same interpretation as R-squared.

In [None]:
#calculate epsilonSquared(DV,IV)
epsilonSquared(cah2$attractive, cah2$agerange)

This is consistent with the r-squared result we got above: about 2% of the variance in self-rated attractiveness can be explained by age range.
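
For reference, a commonly cited definition of epsilon-squared for the Kruskal-Wallis test is shown below, where $H$ is the Kruskal-Wallis statistic and $N$ is the total number of observations. I am assuming this is the formula behind `epsilonSquared()`; check the rcompanion documentation to confirm.

$$\varepsilon^2 = \frac{H}{(N^2 - 1)/(N + 1)} = \frac{H}{N - 1}$$

Because $H$ is divided by a quantity that grows with the sample size, a statistically significant $H$ from a large sample can still correspond to a small effect.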

[Return to Top](#top)
<a id = "power"></a>

## Power Analysis
Finally, let's look at conducting a power analysis for an ANOVA model.  

The `pwr.anova.test()` function is very similar to the power functions we've used before, except the arguments are:

- k = # of groups (categories in IV)
- f = effect size (Cohen's f)
- sig.level = alpha
- power = power 
- n = sample size __PER GROUP__ (with the assumption that the groups will be equal sizes).  If the groups are unequal sizes, you should use the n of the **SMALLEST** group.

As always, we supply 4 of the 5 things and R will calculate the fifth (set to NULL in the function call).  
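
To show what the calculation involves, here is a base-R sketch of one-way ANOVA power using the noncentral F distribution. The helper `anova_power()` is a hypothetical name written for this illustration, and the inputs (4 groups, f = 0.25) are made-up values, not our data. My understanding is that `pwr.anova.test()` performs an equivalent calculation, but treat that as an assumption.

```r
# Hypothetical helper: power of a balanced one-way ANOVA
# k = number of groups, n = sample size PER GROUP, f = Cohen's f
anova_power <- function(k, n, f, sig.level = 0.05) {
  df1 <- k - 1            # numerator (between-group) degrees of freedom
  df2 <- k * (n - 1)      # denominator (within-group) degrees of freedom
  ncp <- k * n * f^2      # noncentrality parameter under the alternative
  f_crit <- qf(sig.level, df1, df2, lower.tail = FALSE)   # critical F-value
  pf(f_crit, df1, df2, ncp = ncp, lower.tail = FALSE)     # P(F > critical value | effect exists)
}

# Made-up inputs for illustration: 4 groups, 20 per group, medium effect (f = 0.25)
anova_power(k = 4, n = 20, f = 0.25)

# Solve for the per-group n that gives power = 0.80 (the "set n to NULL" case)
n_needed <- uniroot(function(n) anova_power(k = 4, n = n, f = 0.25) - 0.80,
                    interval = c(2, 1000))$root
ceiling(n_needed)   # round up to a whole number of observations per group
```

Supplying four of the five quantities pins down the fifth: here the group count, effect size, alpha, and target power determine the per-group sample size.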

Since we were concerned that we might have had low power in our second example, let's return to that example.

In [None]:
# there are 6 ageranges
cah_k <- 6
# we already saved Cohen's f as cohenf2
# we'll use alpha of 0.05
# power will be set to NULL - it's what we want to calculate
# n - the size of the SMALLEST group
cah_n <- min(table(cah2$agerange)) # the min value from the frequency table of agerange

# now we have all the pieces - conduct power analysis
pwr.anova.test(k = cah_k, f = cohenf2, sig.level = 0.05, power = NULL, n = cah_n)


As suspected, this analysis had low power.  Our power was 0.35, when we typically want power to be 0.8 (or higher).  Given a power of 0.35, we have a 65% probability of a Type II error (1 - 0.35 = 0.65) if an effect of this size truly exists.

What sample size would we need (per group) to yield a power of 0.8?

In [None]:
# now set n to NULL and power to 0.8
pwr.anova.test(k = cah_k, f = cohenf2, sig.level = 0.05, power = 0.8, n = NULL)

We would need a minimum of 98 observations PER GROUP to have a power of 0.8 for the analysis of self-rated attractiveness by age group (given the effect size we observed in our sample).