# Chi-Square Analysis
This lab will cover both types of Chi-Square Analysis: Goodness of Fit (one variable) and Test of Independence (two variables).  Within each section I will cover one "conceptual" example that shows how to manually calculate the values using the formulas outlined in the slides.  This allows you to conceptually see how it works "under the hood."  Then I will show additional "practical" or applied examples of how you would conduct the analysis in "the real world" (or for your project).

I will then also show you how to calculate the effect size (Phi or Cramer's V) and use the power function.

Finally I will cover making the type of PQ formatted table you will need for Project 2.

### Table of Contents
<a id = "top"></a>
- [Goodness of Fit](#gof)
    - [Conceptually](#gofc)
    - [Practically](#gofp)
    - [Special Case: Yates' Continuity Correction](#yates)
- [Test of Independence](#toi)
    - [Conceptually](#toic)
    - [Practically](#toip)
    - [Special Case: Fisher's Exact Test](#fisher)
- [Effect Size](#effsize)
    - [Phi](#phi)
    - [Cramer's V](#cramv)
- [Power](#power)
- [PQ output](#pqchi)


In [None]:
# LIBRARIES
library(tidyverse)
library(magrittr)
library(DescTools) ## for phi and cramer's v functions
library(pwr) ## for power function
library(sjPlot) ## for tab_xtab
library(webshot) ## to convert html objects to images

For this lab we'll use the October 2017 Cards Against Humanity Poll (we used the Sept 2017 file in Lab 2).

In [None]:
# DATA
cah_oct <- read_csv("201710-CAH_PulseOfTheNation_Raw.csv")
glimpse(cah_oct)

In [None]:
## variable names currently full questions - need to rename
new_names <- c("income", "gender", "age", "age_cat", "polaffil", "trump", "educ", "race", "whtnat", "whtnat_rep",
              "love_us", "love_us_dem", "helppoor", "helppoor_rep", "racist", "racist_dem", "friendtrump", "civilwar",
              "hunting", "kale", "therock", "trumpvader")
colnames(cah_oct) <- new_names

# keep columns that are not numerical - chisquare uses categorical variables
cah_oct %<>% select_if(is.character)

# convert all vars to factor
cah_oct %<>% mutate_if(is.character, as.factor)
summary(cah_oct)

*[Back to Top](#top)*
<a id = "gof"></a>
## Goodness of Fit
Chi-square Goodness of Fit test allows us to determine if there is a statistically significant difference between the observed frequencies/proportions of one categorical variable and the expected proportions/frequencies determined from an external population expectation.

<a id = "gofc"   ></a>
### GoF Conceptually
Before I show you how to conduct chi-square tests in the way you will in the "real world" (aka for your homework and projects), I will first run through a step by step manual calculation of a GOF test based on the functions outlined in the slides.  This is aimed at illustrating what goes on "under the hood" when conducting this test in R.

For this first example I will use the variable `friendtrump` which contains the answers to this question - "Have you lost any friendships or other relationships as a result of the 2016 presidential election?"

When I googled for previous polling to compare our sample data to, I found <a href="https://www.politico.com/story/2016/09/poll-2016-election-broken-friendships-228846" target="_blank">this article</a> that indicates that in a different survey 7% of respondents reported losing friends due to Trump.  I'm going to treat this as my "population" expectation.

The first thing we'll need is a frequency table:

In [None]:
# frequency table

table(cah_oct$friendtrump)

# remove 8 DK/REF

friendtab <- cah_oct %>% filter(friendtrump != "DK/REF") %>% 
                            group_by(friendtrump) %>% 
                            summarize(observed = n())
friendtab

Now I need to calculate the expected values.  Our sample has 992 observations, so we need to calculate 7% of 992, which will be our expected value for "Yes."  The remaining 93% of respondents are expected to be "No."

In [None]:
# add expected values column (order must match the rows of friendtab: No first, then Yes)
friendtab$expected <- c(sum(friendtab$observed)*.93, sum(friendtab$observed)*.07)
friendtab

The formula for Chi-Square is:
## $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$

$k$ is the number of categories, in this case 2.  $O_i$ is the observed frequency in cell $i$; $E_i$ is the expected frequency in cell $i$.

I'll calculate each cell chi-square, then sum them to get the overall chi-square for this table.

In [None]:
friendtab %<>% mutate(O_min_E = observed - expected, ## new var -> O - E 
                     O_min_Esq = O_min_E^2, ## new var -> (O - E)^2
                     cell_chi = O_min_Esq / expected) ## new var -> cell chisq (full formula)
friendtab

Now that I have the cell Chi-Square values I can sum them into one overall observed Chi-Square value for the variable.

In [None]:
obs_chisq <- sum(friendtab$cell_chi)
obs_chisq
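
As a quick sanity check on the manual steps, the same statistic can be computed in a couple of lines from two vectors and compared with `chisq.test()`.  This is only a sketch: the counts below are made up for illustration (they are not the CAH counts), and the names `obs` and `p_exp` are placeholders.

```r
# Hypothetical observed counts for a two-level variable (NOT the CAH data)
obs   <- c(No = 850, Yes = 150)
p_exp <- c(No = 0.93, Yes = 0.07)   # reference proportions, same order as obs

expected  <- sum(obs) * p_exp                     # E_i = n * p_i
manual_x2 <- sum((obs - expected)^2 / expected)   # sum of the cell chi-squares

# chisq.test() with a vector x and p = ... runs the same goodness-of-fit test
builtin_x2 <- unname(chisq.test(x = obs, p = p_exp)$statistic)

manual_x2
builtin_x2
all.equal(manual_x2, builtin_x2)
```

If the two numbers agree, the column-by-column calculation above is doing exactly what `chisq.test()` does for one variable.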

So now we know our observed Chi-Square is 74.92.  To figure out if this is significant we need a critical value.  To get a critical value we need to know our degrees of freedom and set our alpha level.

Degrees of freedom is k - 1, where k is the number of categories/levels of our variable.  We have 2, yes and no, so our degrees of freedom is 2 - 1 = 1.

We decide to set alpha to the conventional level, 0.05.

Let's get the critical value:

In [None]:
#access chi-square table and retrieve a critical value
# qchisq(1-alpha, df = #)  1-alpha because we want the top tail.
qchisq(0.95, df = 1)

The critical chi-square for alpha = 0.05 with 1 degree of freedom is 3.84.  Note that obtaining this value has nothing to do with the values of our data, other than specifying the right number of degrees of freedom.  This is the critical value for any chi-square test at alpha = 0.05 with one degree of freedom.

Comparing our observed chi-square value - 74.92 - with our critical value - 3.84 - we see that our observed chi-square is larger than the critical value therefore we reject null.  We would report this result by saying something like:

**In the CAH poll the proportion of respondents reporting having lost friendships due to the election of Donald Trump is significantly different from the reference proportion of 7% (or 0.07), $\chi^2$(1, n = 992) = 74.92.**

The final piece we can add for our readers is the p-value.  Without pulling out a chi-square table, a reader cannot readily confirm that your observed chi-square exceeds the critical value, but a p-value can be compared directly to alpha to determine whether a result is significant.

In [None]:
# obtain p-value
# pchisq(yourchisq, df=#, lower.tail=FALSE)  lower.tail = FALSE because we want the probability from the upper end of the dist. 
pchisq(obs_chisq, df=1, lower.tail=FALSE) 

The p-value is $4.89 \times 10^{-18}$, which is a very, very small number - much smaller than our alpha, 0.05.  When our p-value is less than alpha we can reject null.  This is another way to check for significance.

So when do we reject null?  We can check one of two ways:

1. Our observed chi-square value is **_LARGER_** than the critical chi-square value.
2. Our p-value is **_SMALLER_** than our predetermined alpha.

Because the p-value is so small (it prints in scientific notation) we can simply report it as $p < 0.001$ (see slide 25 for guidelines on reporting p-values).
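
The two rejection rules listed above always lead to the same decision, because the critical value and the p-value come from the same chi-square distribution.  Here is a small sketch that checks this for a few statistics; 74.92 is the value from this example, and the other two values are made up so that one falls on each side of the cutoff.

```r
alpha  <- 0.05
dof    <- 1
obs_x2 <- c(2.5, 3.9, 74.92)   # 74.92 is from the text; 2.5 and 3.9 are made up

crit     <- qchisq(1 - alpha, df = dof)                   # critical value (upper tail)
p_values <- pchisq(obs_x2, df = dof, lower.tail = FALSE)  # upper-tail p-values

decisions <- data.frame(obs_x2,
                        reject_by_crit = obs_x2 > crit,
                        p_value        = p_values,
                        reject_by_p    = p_values < alpha)
decisions
```

The `reject_by_crit` and `reject_by_p` columns should match row for row.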

So our conclusion is:

**In the CAH poll the proportion of respondents reporting having lost friendships due to the election of Donald Trump is significantly different from the reference proportion of 7% (or 0.07), $\chi^2$(1, n = 992) = 74.92, $p < 0.001$.**


*[Back to Top](#top)*
<a id = "gofp"   ></a>
### Goodness of Fit: Practical Application
I'm going to show you a couple of examples of conducting a GoF using `chisq.test()`.

I'm first going to start with `race`.  Often survey methodologists want to know if there is a significant difference in the makeup of their sample vs. the overall US population.  We can see if our sample is reflective of the US population by comparing the distribution of our sample to a reference value obtained from the <a href="https://www.census.gov/quickfacts/fact/table/US/LFE046218" target = "_blank">US Census Bureau.</a>

Percentages from Census Bureau
- Asian = 5.9%
- Black = 13.4%
- Latino = 18.3%
- Other = 1.9%
- White = 60.4%

First, we'll need a table of our observed frequencies, after removing DK/REF.

In [None]:
# remove observations that have DK/REF for race
cah2 <- cah_oct %>% 
            filter(race != "DK/REF")  %>% 
            droplevels() ## drop the empty DK/REF factor level

race_obs <- table(cah2$race)
race_obs

Next, I need to build a vector of my expected proportions.

In [None]:
# create vector of expected proportions
# make sure you list your proportions in the same order as in your observed table
# note: Other is entered as .02 rather than .019; with .02 the vector sums to exactly 1
race_exp <- c(.059, .134, .183, .02, .604)
race_exp
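
Getting the order wrong is an easy mistake to make, so one option is to name the expected proportions and reorder them to match the observed table.  This is a sketch with a made-up sample (not the CAH data); `race_demo`, `obs_demo`, `exp_named` and `exp_demo` are hypothetical names.

```r
# Hypothetical sample (made-up counts, not the CAH data)
race_demo <- factor(c(rep("White", 60), rep("Black", 15), rep("Latino", 17),
                      rep("Asian", 6), rep("Other", 2)))
obs_demo <- table(race_demo)   # factor levels come out in alphabetical order

# Expected proportions typed in an arbitrary order, but with names attached
exp_named <- c(White = .604, Latino = .183, Black = .134, Asian = .059, Other = .02)

# Reorder the expected vector so it lines up with the observed table
exp_demo <- exp_named[names(obs_demo)]
rbind(observed = obs_demo, expected_prop = exp_demo)
```

Printing the two rows side by side makes it easy to confirm each proportion sits under the right category before running the test.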

Now I have everything I need to run `chisq.test()` to test Goodness of Fit.

Documentation for `chisq.test()` ===> https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/chisq.test

In [None]:
# chisq.test(x = observed, p = expected)

chisq.test(x = race_obs, p = race_exp)

In some cases your proportions might not sum exactly to 1 (probably due to rounding).  In that case you need to add the argument `rescale.p = TRUE`; you will get an error if p does not sum to 1 and you do not use this argument.

In [None]:
# if p doesn't sum to 1 exactly

chisq.test(x = race_obs, p = race_exp, rescale.p = TRUE)

sum(race_obs)
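
To see what `rescale.p = TRUE` is doing, here is a sketch with made-up counts and proportions (not from the CAH data).  Rescaling divides each proportion by the sum of the vector, so doing that by hand should give the same result.

```r
obs_demo <- c(A = 30, B = 50, C = 20)        # made-up counts
p_raw    <- c(A = 0.33, B = 0.45, C = 0.20)  # sums to 0.98, e.g. after rounding

res_rescaled <- chisq.test(x = obs_demo, p = p_raw, rescale.p = TRUE)
res_manual   <- chisq.test(x = obs_demo, p = p_raw / sum(p_raw))

all.equal(res_rescaled$statistic, res_manual$statistic)

# Without rescale.p, chisq.test() stops with an error when sum(p) is not 1
no_rescale <- tryCatch(chisq.test(x = obs_demo, p = p_raw),
                       error = function(e) conditionMessage(e))
no_rescale
```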

#### Interpretation:

This p-value - $ 3.73 \times 10^{-7} $ - is smaller than an alpha of 0.05, therefore we can reject null.  Our conclusion would be written up like this:

The distribution of race in the CAH poll sample significantly deviates from US population proportions, $\chi^2= 35.46, p < 0.001$.  This means that our sample is not exactly representative of the population with regard to race.

*[Back to Top](#top)*

### GoF: Second Example
Let's try another variable - this time `polaffil` which indicates the respondent's political affiliation.  I will use proportions I've derived from responses to a larger survey, the GSS.  I set my expected (population) proportions as follows:

- Strong Republican = 10.5%
- Not Strong Republican = 12.7% 
- Independent = 43%
- Not Strong Democrat = 17%
- Strong Democrat = 17%

But first, I'm going to do a bit of data cleaning - removing DK/REF and reordering the factor levels as above.

In [None]:
## data cleaning - yours may differ
cah3 <- cah_oct %>% 
            filter(polaffil != "DK/REF") %>% 
            mutate(polaffil = fct_relevel(polaffil, "Strong Republican", "Not Strong Republican", "Independent",
                                         "Not Strong Democrat", "Strong Democrat")) %>% 
            droplevels()

In [None]:
# obtain table of observed frequencies from cah_oct data
pid_obs <- table(cah3$polaffil)

# set my vector of expected proportions from the outline in the text above
## note these are proportions, which are percentages divided by 100
pid_exp <- c(.105, .127, .43, .17, .17)

# use these values to run chisq.test()
#we can save the result as a chisq object
myresult <- chisq.test(x = pid_obs, p = pid_exp, rescale.p = TRUE)
myresult
print("-------------------------")
str(myresult) # structure of the chisq object (we can see the "pieces" of the output)

#### Interpretation:

Because the p-value is lower than alpha of 0.05 we reject null.  Our conclusion is:

**The distribution of political affiliations in our sample is significantly different from the distribution of political affiliations among the GSS sample, $\chi^2 = 72.07, p < 0.001$.**

For fun, we can graph a Chi-square distribution with 4 degrees of freedom, highlight the critical value, and indicate where our observed Chi-square is.  The shaded area under the curve is the "region of rejection" or the area exceeding the critical value where we would reject null.

In [None]:
crit_chisq <- qchisq(0.95, df = myresult$parameter) #parameter of the chisq object is the degrees of freedom
ggplot(data.frame(x = c(0, 20)), aes(x = x)) +  ## make up a dataframe because the main thing is showing the chi-square density
    stat_function(fun = dchisq, args = list(df = myresult$parameter)) +  ## dchisq is the density function 
                                                        ## that generates the curve based on our degrees of freedom
    stat_function(fun = dchisq, 
                  args = list(df = myresult$parameter),
                  xlim = c(crit_chisq, 20), 
                  geom = "area",
                  alpha=0.2) +
    labs(x = "Chi-Squared", title = "Chi-square Distribution with 4 Degrees of Freedom", y = "") +
    annotate(geom="text", x=10, y=0.025, label=paste("critical X^2 = ", round(crit_chisq, digits = 2)),
             color="blue", fontface = 2, size = 4) +  ## x and y tell ggplot the coords to place your words
    annotate(geom="text", x=17, y=0.0055, label=paste("obs X^2 =", round(myresult$statistic, digits = 2), ">>>>"),
             color="red", fontface = 2, size = 4)

<a id = "yates"   ></a>

*[Back to Top](#top)*
### Special Case - Yates' Continuity Correction for GoF
Yates' Continuity Correction must be used if degrees of freedom is 1.  Degrees of freedom would be 1 if we're working with a categorical variable with only two levels, such as a version of gender which only has male or female as options.  Let's look at the gender variable here and compare with a population proportion expecting 50.8% to be female.

In [None]:
# data cleaning
cah4 <- cah_oct %>% filter(gender %in% c("Male", "Female")) %>% droplevels()

#table of observed values
gend_obs <- table(cah4$gender)

#table of expected values
gend_exp <- c(.508, 1 - 0.508) ## Setting male proportion to 1 - proportion of females

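# run chisq.test with correct = TRUE to request the continuity correction
# NOTE: the chisq.test() documentation describes `correct` as applying only to 2 by 2 tables,
# so for a one-variable (GoF) table the statistic R reports is not continuity-corrected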
chisq.test(x = gend_obs, p = gend_exp, correct = TRUE)

#### Interpretation

Because the p-value is greater than our set alpha (0.05) we fail to reject the null hypothesis.  This means we do not have enough evidence to conclude that our sample deviates from the expected proportions (note that failing to reject null is not the same as proving that null is true).  We would write this as:

**We tested the distribution of gender against expected US population proportions; there was no statistically significant deviation from the expected distribution of males and females in our sample, $\chi^2 = 3.53, p = 0.06$.**

_If we had set our alpha to 0.1 (which means we would have a greater chance of Type I error - 10% vs. 5%) we would have rejected null, since p = 0.06 is less than 0.1._
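
One caveat worth checking in your own R session: the `chisq.test()` documentation describes `correct` as applying to 2 by 2 tables, so for a one-variable table the argument should have no effect on the statistic.  Below is a minimal sketch of applying the continuity correction by hand (subtract 0.5 from each absolute difference before squaring).  The counts are made up and are not the CAH gender counts.

```r
obs_demo <- c(Female = 480, Male = 520)       # made-up counts
p_demo   <- c(Female = 0.508, Male = 0.492)   # expected proportions
expected <- sum(obs_demo) * p_demo

x2_plain <- sum((obs_demo - expected)^2 / expected)             # usual formula
x2_yates <- sum((abs(obs_demo - expected) - 0.5)^2 / expected)  # continuity-corrected

p_plain <- pchisq(x2_plain, df = 1, lower.tail = FALSE)
p_yates <- pchisq(x2_yates, df = 1, lower.tail = FALSE)

# For comparison: what chisq.test() reports for a one-variable table
x2_builtin <- unname(chisq.test(x = obs_demo, p = p_demo, correct = TRUE)$statistic)

c(plain = x2_plain, yates = x2_yates, builtin = x2_builtin)
c(p_plain = p_plain, p_yates = p_yates)
```

The corrected statistic is always a bit smaller (and its p-value a bit larger), which is the point of the correction: it makes the test more conservative when df = 1.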

<a id = "toi"   ></a>

*[Back to Top](#top)*
## Chi-Square Test of Independence
Now we'll turn to the Chi-Square Test of Independence (ToI) that allows us to determine if the distribution of one categorical variable is associated with the distribution of another categorical variable.  To first graphically inspect the potential association between two categorical variables we can use a grouped bar chart, similar to the FOURTH item on the second Homework assignment.

For this first example I'll use the variables `racist` which corresponds to the question - "Do you think that most white people in America are racist?" and the variable `race` which indicates the respondent's race.  We are asking the question - do people of different races vary in their belief that most white people are racist?

In [None]:
## first, create a df with the percent of racist among each race
cah_oct %<>%
    # remove observations where the response is DK/REF for either variable
    filter(race != "DK/REF" & racist != "DK/REF")  %>%
    # reorder factor levels of race
    mutate(race = fct_relevel(race, sort)) %>% # sort by alphabet - factor label
    mutate(race = fct_relevel(race, "Other", after = Inf))  %>% # make "other" last
    droplevels()


    # group by both categorical variables
cah_race_pct <- cah_oct %>% mutate(race = fct_relevel(race, rev)) %>% # reverse so they are in correct order after coord_flip()
                            group_by(race, racist) %>%
                            # calculate proportions inside nested groups
                            summarize(inside = n())  %>% 
                            ungroup() %>% group_by(race) %>% 
                            mutate(outside = sum(inside), percent = inside/outside * 100)  %>% 
                            ungroup() %>% select(-inside, - outside)

cah_race_pct

In [None]:
# use summary pcts df to create plot
cah_race_pct %>% 
    # make the plot 
    ggplot(aes(fill=racist, y=percent, x=race)) + 
      geom_bar(position="dodge", stat="identity") +
      labs(x = "Race",
           y = "Percent",
           fill = "Are white people racist?",
           title = "Are white people racist? by Race of Respondent") +
      coord_flip() + # rotate so the axis tick labels don't overlap
      ## optional customizations
      theme(legend.position="bottom") +
      scale_fill_manual(values=c("#FF33CC", "#33FF99"))

By just inspecting the graph it looks like White and Asian people may be more likely to say that white people are not racist.  This could indicate that the variables are associated: whether a person thinks white people are racist may vary by their race.  Let's test this!  

*[Back to Top](#top)*
<a id = "toic"   ></a>
### ToI Conceptually

Again, this is the first example, where I will walk through calculating everything by hand.  The [practical examples](#toip) will then demonstrate how you will do this analysis in homework or in your project.

First, we need to get our observed values.  We will use table() with addmargins() to get the row and column totals.

In [None]:
addmargins(table(cah_oct$race, cah_oct$racist))

Because I'm calculating everything "by hand" I'm going to make a "flat" aka long version of this two way table.

In [None]:
race_tab <- cah_oct %>% 
                group_by(race, racist) %>% 
                summarize(observed = n())
race_tab

Now, I'll add columns for the row totals, the column totals, and the overall total, then calculate the expected value: row total x column total over "total total."

## $E_{ij} = \frac{rowtotal_i \times coltotal_j}{overalltotal}$

In [None]:
## add rowtot, coltot, tottot

race_tab %<>% # start with "flat" frequency table
    group_by(race) %>% 
    mutate(rowtot = sum(observed))  %>%  ## sum of observed within just race group (the row variable)
    ungroup()  %>%  # ungroups the df so that we can group by racist now
    group_by(racist) %>% 
    mutate(coltot = sum(observed))  %>%  ## sum of observed within racist group is the column total
    ungroup() %>% 
    mutate(tottot = sum(observed))
race_tab

In [None]:
# now calculate expected
race_tab %<>% 
    mutate(expected = (rowtot*coltot) / tottot) %>%  # expected is row total times column total over total total
    select(-rowtot, -coltot, -tottot)
race_tab
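
If your counts are already in a two-way table (rather than the long format used here), the expected-value formula can also be written as an outer product of the row and column totals.  A sketch with a made-up 3x2 table (not the CAH data), checked against the expected values that `chisq.test()` stores:

```r
# Made-up 3x2 table of counts (filled column by column)
tab_demo <- matrix(c(20, 30, 50,
                     40, 30, 30),
                   nrow = 3,
                   dimnames = list(group = c("a", "b", "c"), answer = c("No", "Yes")))

# E_ij = (row total_i * column total_j) / overall total
exp_manual <- outer(rowSums(tab_demo), colSums(tab_demo)) / sum(tab_demo)
exp_manual

# chisq.test() keeps its own expected counts in $expected
exp_builtin <- chisq.test(tab_demo)$expected
all.equal(exp_manual, exp_builtin, check.attributes = FALSE)
```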

We can now visually inspect the observed values versus the expected values for each cell.  Do the differences look large to you?  Do you think they may be statistically significant? 

Let's check!  Now we can use our calculated observed and expected values to calculate our cell chi-square values.

## $\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$

where $r$ is the number of rows and $c$ is the number of columns. $O_{ij}$ is the observed value in the ith row and jth column; $E_{ij}$ is the expected value for that same cell.  

In [None]:
race_tab %<>% 
    mutate(cell_chisq = ((observed - expected)^2) / expected)
race_tab

Before we finish, we can take a moment to look at the cell chi-square values.  The higher the cell chi-square value, the more that cell deviates from its expected value.  So we can say that a lot of the association between thinking white people are racist and the race of the respondent comes from Black respondents saying "Yes, white people are racist" more often than we would expect if the two variables were independent.
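
The cell chi-square values can also be pulled out of a saved `chisq.test()` result.  This sketch uses a made-up 3x2 table (not the CAH data): the Pearson residuals stored in `$residuals` are (O - E) / sqrt(E), so squaring them gives each cell's chi-square, and the sign tells you the direction of the deviation.

```r
# Made-up 3x2 table of counts (filled column by column)
tab_demo <- matrix(c(20, 30, 50, 40, 30, 30), nrow = 3,
                   dimnames = list(group = c("a", "b", "c"), answer = c("No", "Yes")))
res <- chisq.test(tab_demo)

# Squared Pearson residuals = cell chi-square values
cell_x2 <- res$residuals^2
round(cell_x2, 2)

# Sign of the residual: positive = more observed than expected, negative = fewer
round(res$residuals, 2)

# The cell contributions add up to the overall statistic
all.equal(sum(cell_x2), unname(res$statistic))
```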

Finally, to determine if our variables are significantly associated we sum the cell chi-square values and compare them to a critical chi-square value.  To get the critical chi-square value we need to know our degrees of freedom.  For ToI, the degrees of freedom is:

$df = (r - 1)(c - 1)$

So in our case the degrees of freedom equals (5-1)(2-1) = 4.

Also we can calculate the relevant p-value to compare to our alpha of 0.05.

In [None]:
# sum cell chi sq
chi_sq_race <- sum(race_tab$cell_chisq)
chi_sq_race

# get critical chi-square value with 4 degrees of freedom
qchisq(0.95, df = 4)

# get p-value
pchisq(chi_sq_race, df=4, lower.tail=FALSE)

So our overall observed chi-square value is 27.59.  This is larger than the critical value of 9.49.  The p-value is less than alpha of 0.05.  So we reject the null hypothesis.  These two variables are associated.  Let's confirm our results using `chisq.test()`

In [None]:
# first create table object using table()
rtab <- table(cah_oct$race, cah_oct$racist)
# pass table object to chisq.test()
chisq.test(rtab)

We get a warning that our chi-squared approximation may be incorrect.  This is because the expected values in some cells are small (R gives this warning when any expected count is below 5), which makes the chi-square approximation less accurate.
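
When you see this warning it helps to look at the expected counts directly.  The sketch below uses a made-up table with one very small row (not the CAH data).  It also shows one option that `chisq.test()` offers for this situation, a simulated (Monte Carlo) p-value; whether that is appropriate for your project is a judgment call, so treat it as an illustration rather than a recommendation.

```r
# Made-up 3x2 table with a small first row (filled column by column)
tab_small <- matrix(c(3, 12, 20,
                      2, 18, 25),
                    nrow = 3,
                    dimnames = list(group = c("a", "b", "c"), answer = c("No", "Yes")))

res <- suppressWarnings(chisq.test(tab_small))  # silence the warning for this demo
res$expected                 # inspect the expected counts
any(res$expected < 5)        # TRUE means at least one small expected count

# Simulated p-value: B is the number of random tables generated
set.seed(123)
res_sim <- chisq.test(tab_small, simulate.p.value = TRUE, B = 5000)
res_sim$p.value
```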

#### Our Conclusion?
There is a statistically significant association between whether or not a person believes that white people (in general) are racist and their own race, $\chi^2 = 27.59, p < 0.001$.  This means that the distribution of answers for the yes/no question about whether or not white people are racist differs across the categories of respondent's race.

*[Back to Top](#top)*
<a id = "toip"   ></a>

### ToI: Practical Application
Now we can look at a few examples of conducting the Test of Independence the same way you would in your assignments.  All we need are:
 1. cleaned variables
 2. a 2-way table of our two categorical variables of interest using `table()`
 3. run `chisq.test()` on the saved table object from 2.
 
#### First Example - Approval of Trump and Support of Dwayne "The Rock" Johnson
We have two variables.  The first is `trump` which represents each person's approval of Donald Trump as president.  This has two response options - approve or disapprove.  The second is `therock` which is the answer to the question - If Dwayne "The Rock" Johnson ran for president as a candidate for your political party, would you vote for him? - yes or no.  Because in each variable DK/REF (Don't know or refused) is sizable, we're going to leave those as their own category, reflecting those respondents who do not have strong opinions.

In [None]:
# create table object
pres_tab <- table(cah_oct$trump, cah_oct$therock)
pres_tab

In [None]:
# run chisq.test on table object
chisq.test(pres_tab)

The p-value is lower than 0.05 (remember that is scientific notation, so the number is actually 0.000009425 - 9.425 times ten to the negative six). Therefore we reject null and conclude that these two variables are associated.  

#### Our Conclusion:
Whether or not a person would support The Rock for president depends on whether or not they approve of Donald Trump, $\chi^2 = 28.6, p < 0.001$.

#### Second example - Hunting vs. Kale

Now I'm going to see if there is a significant association with `hunting` (Have you ever gone hunting?) and eating `kale` (Have you ever eaten a kale salad?)

In [None]:
# drop obs with DK/REF (don't know or refused) since they are not substantial numbers of observations for these
cah_oct %<>% filter(hunting != "DK/REF" & kale != "DK/REF") %>% droplevels()

# two-way table
kale_tab <- table(cah_oct$hunting, cah_oct$kale)
kale_tab
#chisq.test
chisq.test(kale_tab)

Our degrees of freedom is 1, so R automatically applied Yates' continuity correction.  We find a statistically significant association between going hunting and eating kale.

#### Our conclusion:
There is a significant association between whether a person has ever gone hunting and whether they've ever eaten kale salad, $\chi^2 = 8.89, p = 0.002$. The distribution of whether a person went hunting differs among the "yes kale" group and the "no kale" group.  Or, we can also say that whether or not a person has hunted depends on whether or not they've eaten kale.  **This DOES NOT mean that people who eat kale also go hunting,** it just means that there is an association between these variables - it could be that among the "yes kale" group there are few hunters and among the "no kale" group there are many hunters.  Let's look at labeled frequencies to get an idea of the distribution of hunting among kale eaters vs. not kale eaters.

In [None]:
kale_v_hunting <- cah_oct %>% group_by(kale, hunting) %>% summarize(freq = n()) %>% ungroup() %>% group_by(kale) %>% 
            mutate(pct_within_kale = freq / sum(freq) * 100)

kale_v_hunting

So among people who have not eaten kale salad 64% have not gone hunting and 36% have.  Among people who have eaten kale salad 53% are not hunters and 46% are.  The variables are associated because the split of non-hunters to hunters inside "no kale" is roughly 60/40 while the split inside "yes kale" is closer to 50/50.  _**If they were independent, we would expect the proportion of hunters inside each group of kale eating to be the same.**_
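
If you already have a two-way table, `prop.table()` gives the same within-group percentages without the `group_by()` steps.  The counts below are made up to roughly mimic the pattern described above; they are not the CAH counts.

```r
# Made-up 2x2 counts: rows = kale, columns = hunting (filled column by column)
tab_demo <- matrix(c(640, 530, 360, 470), nrow = 2,
                   dimnames = list(kale = c("No", "Yes"), hunting = c("No", "Yes")))

# margin = 1 -> percentages within each row, so each kale group sums to 100
pct_within_kale <- prop.table(tab_demo, margin = 1) * 100
round(pct_within_kale, 1)
rowSums(pct_within_kale)
```

Using `margin = 2` instead would give percentages within each hunting group, which answers a different question, so it is worth double checking which margin you want.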

*[Back to Top](#top)*
<a id = "fisher"   ></a>

### Special Case: Fisher's Exact Test.
When we have a 2x2 table (each variable has only two levels/groups/categories) AND the cell counts are small (less than 10), we can use Fisher's Exact Test to determine if there is a significant association between the variables.  Because the (computationally intensive) process exactly calculates the probability of seeing the observed differences or _more extreme_ differences, our p-value is exact and not an approximation.  That's why it's called Fisher's Exact Test.

In [None]:
# I'm going to create a little table for this example - you would get a table from your observations using table()
## DO NOT BUILD YOUR TABLE THIS WAY - THIS IS FOR EXAMPLE PURPOSES
example_table <- matrix(c(45,19,55,89), nrow = 2)
rownames(example_table) <- c("drug x", "drug y")
colnames(example_table) <- c("cured", "not cured")
example_table

In [None]:
# run fisher's exact test
fisher.test(example_table)

The output we get is different from what we get from `chisq.test()`, but the important thing we want to look at here is the p-value.  This p-value (0.00002217) is less than an alpha of 0.05, therefore we reject null and conclude that these variables are significantly associated - that the distribution of cured/not cured varies by type of drug.  This means the number/proportion of individuals cured is different based on which drug they received.  Let's compare this to if we ran an uncorrected chi-square test on this same example data.

In [None]:
chisq.test(example_table, correct = FALSE)

We make the same conclusion (reject null) but the p-values are different - in fact this uncorrected p-value is lower than the one from Fisher's Exact test.  In this data it's the difference between two very small p-values, but if you were working with data where your unadjusted p-value from chi-square was just under 0.05, using Fisher's Exact Test would be vital to ensure you're not incorrectly rejecting null (Type I error!).
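
To make that borderline situation concrete, here is a sketch with a made-up small table (20 observations, invented for illustration) where the uncorrected chi-square test and Fisher's Exact Test land on opposite sides of alpha = 0.05.

```r
# Made-up small 2x2 table (filled column by column)
small_tab <- matrix(c(7, 3, 2, 8), nrow = 2,
                    dimnames = list(drug = c("x", "y"),
                                    outcome = c("cured", "not cured")))
small_tab

p_fisher <- fisher.test(small_tab)$p.value
# suppressWarnings() hides the small-expected-count warning for this demo
p_chisq  <- suppressWarnings(chisq.test(small_tab, correct = FALSE))$p.value

c(fisher = p_fisher, chisq_uncorrected = p_chisq)
c(fisher_rejects = p_fisher < 0.05, chisq_rejects = p_chisq < 0.05)
```

With counts this small, the two tests can disagree about rejecting null, which is exactly when the exact test matters.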

*[Back to Top](#top)*
<a id = "effsize"   ></a>
## Effect Size
Now that we've determined statistical significance, we need to calculate that "so what?" factor.  Just because an association between variables is statistically significant doesn't mean that it's a substantively important association/finding. 

#### Unstandardized Effect Size
Our unstandardized effect size is comparing the proportions within the groups in an un-statistical way.  Basically we're looking to see what is the difference in proportions and do we think it's important.  For this let's return to a grouped bar chart. And again look at kale vs. hunting.

In [None]:
## review the table of percent of hunting within kale eating
kale_v_hunting

In [None]:
# make grouped bar chart

kale_v_hunting %>% 
    # make the plot 
    ggplot(aes(fill=hunting, y=pct_within_kale, x=kale)) + 
      geom_bar(position="dodge", stat="identity") +
      labs(x = "Eats Kale",
           y = "Percent",
           fill = "Goes Hunting",
           title = "Percent of hunters by kale eating") +
      ## optional customizations
      theme(legend.position="bottom") +
      scale_fill_manual(values=c("#FF33CC", "#33FF99"))+
      ## add percentage labels at top of bars
      geom_text(aes(label=paste0(round(pct_within_kale, 0),"%")), 
                vjust=-.3, color="black", position = position_dodge(0.9), size=5)

We would likely conclude that the 10 percentage point difference is probably substantive.  But let's quantify that effect size by standardizing it, which allows us to compare effect sizes across questions and datasets.

For chi-square we use two measures of effect size - phi, which is used on 2x2 tables, and Cramer's V, which is used for tables larger than 2x2.  

<a id = "phi"   ></a>
### Phi ($\phi$) - Effect Size for 2x2 tables
Kale vs. Hunting is an example of a 2x2 table - Kale has two levels - yes or no - and hunting has two levels - yes or no.  We can use phi to get a standardized measure of the effect size.

## $ \phi = \sqrt{\frac{\chi^2}{n}}$

$n$ is the sample size.

In [None]:
# use Phi() to calculate phi for kale v. hunting
kale_tab <- table(cah_oct$hunting, cah_oct$kale)
Phi(kale_tab)

So the phi for the association between kale and hunting is 0.1. 
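
The formula above is simple enough to check by hand.  This sketch uses made-up 2x2 counts (not the CAH data) and assumes that phi is based on the chi-square statistic without Yates' correction, which is my understanding of what `Phi()` uses; running `Phi()` on the same table is a good way to confirm.

```r
# Made-up 2x2 counts (filled column by column)
tab_demo <- matrix(c(640, 530, 360, 470), nrow = 2,
                   dimnames = list(kale = c("No", "Yes"), hunting = c("No", "Yes")))

# chi-square WITHOUT the continuity correction
x2 <- unname(chisq.test(tab_demo, correct = FALSE)$statistic)
n  <- sum(tab_demo)

phi_manual <- sqrt(x2 / n)   # phi = sqrt(chi-square / n)
phi_manual
```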

#### How do we interpret phi?
We use a "rule of thumb":
- 0.1 = small effect
- 0.3 = medium effect
- 0.5 = large effect

Here, our value of 0.1 means that the association between kale and hunting is substantively small (but not nonexistent).

*[Back to Top](#top)*
<a id = "cramv"   ></a>

### Cramer's V
For any other dimension of two-way tables we will use Cramer's V to calculate our standardized effect size.  For this example I'm going to go back to the 3x3 table with support for Trump vs. support for The Rock as president.

In [None]:
pres_tab <- table(cah_oct$trump, cah_oct$therock)
CramerV(pres_tab)

#### How do we interpret Cramer's V?
We use the same "rule of thumb" as with phi:
- 0.1 = small effect
- 0.3 = medium effect
- 0.5 = large effect

The association between approval of Trump and support for The Rock as president is substantively small (Cramer's V = 0.11).

We can also return to our chi-square test of belief that white people are racist by respondent's race:

In [None]:
rtab <- table(cah_oct$race, cah_oct$racist)
CramerV(rtab)

The association between belief that white people are racist and a person's race is small to medium - Cramer's V = 0.17.
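
For reference, Cramer's V is a generalization of phi: it divides the chi-square by n times the smaller of (rows - 1) and (columns - 1) before taking the square root.  A sketch with a made-up 3x3 table (not the CAH data); `v_manual` is a placeholder name, and you can compare it against `CramerV()` on the same table.

```r
# Made-up 3x3 table of counts (filled row by row)
tab_demo <- matrix(c(40, 30, 30,
                     25, 45, 30,
                     20, 30, 50),
                   nrow = 3, byrow = TRUE)

x2 <- unname(chisq.test(tab_demo)$statistic)
n  <- sum(tab_demo)
k  <- min(nrow(tab_demo), ncol(tab_demo)) - 1   # smaller dimension minus 1

v_manual <- sqrt(x2 / (n * k))
v_manual
```

For a 2x2 table `k` is 1, so the formula reduces to phi.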

*[Back to Top](#top)*
<a id = "power"   ></a>
## Power

The power of a statistical test is the probability that we correctly reject null when null is false.

Power depends on effect size (in this case Phi or Cramer's V), total sample size (N), degrees of freedom, and our chosen significance level (alpha = 0.05).

The power function we will use, `pwr.chisq.test()` from the `pwr` package, has 5 arguments:
- effect size (`w`)
- sample size (`N`)
- degrees of freedom (`df`)
- alpha (`sig.level`)
- power (`power`)

We supply 4 of the 5 and set the 5th to NULL.  This will prompt R to calculate the value for that 5th argument.  

One thing we use a power function for is to calculate power when we get a non-significant result.  This allows us to see if we had enough power to detect the effect *IF* it exists.  If our power is low, our Type II error is high - we don't have enough power to get over that threshold to be able to detect significant effects.

For this example I'm going to use the race-of-interviewer data from the lecture.

In [None]:
### THIS IS AN EXAMPLE - DONT USE THIS TYPE OF CODE TO ENTER YOUR DATA 
## IGNORE THIS DATA SETUP
roi <- matrix(c(96,164,91,186,35,89), ncol = 3) ## ncol spelled out in full (nc only works through partial argument matching)
rownames(roi) <- c("Black Int", "White Int")
colnames(roi) <- c("Wilder(D)", "Coleman(R)", "Undecided")
roi
chisq.test(roi)

#### Calculate the Power after an experiment

As we saw in the lecture slides, these variables were not significantly associated (p = 0.23, which is above alpha = 0.05).  To determine power we need the effect size (Cramer's V because we have a 2x3 table), sample size, degrees of freedom, and alpha.

In [None]:
## calculate all of the pieces of information I need 

eff_size <- CramerV(roi) ## calculate Cramer's V
samp_size <- sum(roi) ## add up all of the frequencies in my two-way table
dof <- 2 ## df from the Chi-square output (also we know df = (2-1)(3-1) = 2)
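If you want to double check these pieces by hand, the sketch below recomputes them with base R only. It assumes the usual Cramer's V formula, the square root of chi-square divided by N times (the smaller of rows or columns, minus 1). The names `chi_sq`, `n`, `k`, `v_manual`, and `dof_manual` are ones I made up for this check.

```r
## rebuild the race-of-interviewer table so this block runs on its own
roi <- matrix(c(96, 164, 91, 186, 35, 89), ncol = 3)

chi_sq <- unname(chisq.test(roi)$statistic)  # chi-square statistic
n      <- sum(roi)                            # total sample size
k      <- min(nrow(roi), ncol(roi))           # smaller of number of rows / columns

## Cramer's V by hand
v_manual <- sqrt(chi_sq / (n * (k - 1)))

## degrees of freedom by hand: (rows - 1) * (columns - 1)
dof_manual <- (nrow(roi) - 1) * (ncol(roi) - 1)

v_manual
dof_manual
```

Because k - 1 = 1 for a 2x3 table, Cramer's V here is the same as the square root of chi-square over N, which is the `w` that the power function expects.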

In [None]:
## calculate the power for the ROI analysis
pwr.chisq.test(w = eff_size, N = samp_size, df = dof, sig.level = 0.05, power = NULL) # power = NULL because that's what we want to get

The power function prints all 5 of the parameters.  We calculated and provided w, N, df, and sig.level.  Based on those values, the power of the ROI analysis was only 32%, which means we had a 68% chance of a Type II error if an effect of this size really exists (which is not good).

#### Calculate the sample size needed to get a power of 80% (before an experiment)

Given the same effect size, degrees of freedom, and significance level, we can instead calculate the N (sample size) we would need in order to obtain a power of 0.8.  This time we will specify a power of 0.8 and leave N as NULL so it will be calculated.

In [None]:
## determine sample size needed if we want 80% power
pwr.chisq.test(w = eff_size, N = NULL, df = dof, sig.level = 0.05, power = 0.8)

In order to get 80% power for this analysis with all else being equal (effect size, degrees of freedom, alpha) we would need 2,150 people in our sample.  Note - we always round up when calculating sample size.
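If you want R to do the rounding for you, one option is to store the result and apply `ceiling()` to its N element. This sketch rebuilds the ROI table and computes the effect size by hand (instead of with `CramerV()`) so that it runs on its own. The names `res` and `n_needed` are ones I made up.

```r
library(pwr)

## rebuild the race-of-interviewer table so this block runs on its own
roi <- matrix(c(96, 164, 91, 186, 35, 89), ncol = 3)

## effect size by hand: for a 2x3 table, Cramer's V = sqrt(chi-square / N)
eff_size <- sqrt(unname(chisq.test(roi)$statistic) / sum(roi))

## leave N as NULL so it is calculated, then store the result
res <- pwr.chisq.test(w = eff_size, N = NULL, df = 2, sig.level = 0.05, power = 0.8)

## always round UP - we can't sample a fraction of a person
n_needed <- ceiling(res$N)
n_needed
```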

*[Back to Top](#top)*

<a id = "pqchi"   ></a>
## Reporting Chi-Square Results in PQ format
Finally, in order to write your report, you will need to know how to generate a two-way table in PQ format with chi-square values included.  For this we can use tab_xtab() from the package `sjPlot` we saw for making PQ two-way tables in Lab 2.5.

See the documentation for tab_xtab() <a href="https://www.rdocumentation.org/packages/sjPlot/versions/2.8.2/topics/tab_xtab" target = '_blank'>here</a> for all of the optional arguments and adjustments to the output.  You can even add CSS to manually fine-tune the style of the table.

In [None]:
## create two way table WITH chi-square
tab_xtab(var.row = cah_oct$kale, ## variable that makes up the rows
         var.col = cah_oct$hunting,  ### variable that makes up the columns
         ### specify descriptive overall table title
         title = "Is there an association between hunting and eating kale?",
         ## specify variable labels in order of row then column (as a vector of strings)
         var.labels = c("Kale Salad", "Been Hunting"),
         show.cell.prc = TRUE, ## show percentages in the cells
         show.row.prc = TRUE,
         show.summary = TRUE, ## to get chi-square
         statistics = "phi",
         file = "kale_hunt.html"
         )

In [None]:
webshot("kale_hunt.html", "kale_hunt.png")

I've inserted the image kale_hunt.png here in this markdown block (image displayed below).  

This table shows us our two-way table with frequencies and proportions, AND now we have `show.summary = TRUE` to show our chi-square statistics at the bottom of the table.  We get chi-square, degrees of freedom, Phi or Cramer's V, and the p-value for our chi-square test.

We have the cell percent in red - the percentage of the sample that is in that cell.  I've also included row percentages - the percentage of hunter-yes vs. hunter-no within each category of the kale question.  This allows us to more easily review the difference in proportions of hunters by kale eating, which can inform us about the unstandardized substantive significance of our analysis.

![](kale_hunt.png)

## Your Turn!
Conduct a chi-square analysis between `trump` - 'Do you approve or disapprove of how Donald Trump is handling his job as president?' and `trumpvader` - 'Who would you prefer as president of the United States, Darth Vader or Donald Trump?'

Leave DK/REF on both variables as there are a sizeable number of observations in these categories.

1. Conduct the chi-square test.
2. Interpret the output - do we reject null?  What does that mean about the association between `trump` and `trumpvader`?
3. Calculate the appropriate measure of effect size - phi or Cramer's V - and interpret the value.
4. Construct a PQ table of the results.  Make sure to include appropriate descriptive names and title.