# Hypothesis Testing

In this lab, we will learn how to conduct some basic hypothesis tests in R.

First, let's load our packages. We'll be using the 2020 ANES again (with the `anesr` package), along with the script I created to recode some variables. Throughout the lab, we'll revisit some of the associations we examined earlier in the semester to see if we have evidence of ***statistically significant*** relationships.

In [9]:
# Uncomment the install lines below to run this notebook interactively in the cloud

#install.packages(c("tidyverse","haven", "knitr", "tidymodels", "modelsummary", "devtools"))

library(tidyverse)
library(devtools)
library(haven)
library(knitr)
library(tidymodels)
library(modelsummary)

#install_github("jamesmartherus/anesr") 
library(anesr)

Now we're ready to load the data and run the R script we used in lab 3.

In [10]:
# load the ANES 2020 dataset
data("timeseries_2020")

# store the url of the R script
myurl <- paste0("https://raw.githubusercontent.com/bowendc/510_labs/main/", "lab3_recodes.R")

# download the R script from my GitHub
download.file(url = myurl, "lab3recodes.R")

# Run the R script
source("lab3recodes.R")

# Create ordinal and nominal/factor variables 
anes20 <- anes20 |> mutate(welfare_ord = ordered(welfare, labels = welfare_lbl),
                           sex_fct = factor(sex, labels = sex_lbl))

## $\chi ^2$ test

As you know, the $\chi ^2$ test evaluates whether the cells in a crosstab are significantly different from what we would expect to see in the crosstab if the values of the outcome were unrelated to the values of the predictor. In R, you can conduct a $\chi ^2$ test on a table using the `chisq.test()` function that is part of the core `stats` package bundled with base R.
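
To make the "observed versus expected" logic concrete, here is a minimal sketch on a made-up 2 x 2 table (the counts and labels are hypothetical, not ANES data). It computes the expected counts and the $\chi ^2$ statistic by hand and compares them with `chisq.test()`.

```r
# Hypothetical 2 x 2 table of counts (made-up numbers, not ANES data)
toy <- matrix(c(30, 20,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(outcome = c("Favor", "Oppose"),
                              group   = c("A", "B")))

# Expected count in each cell if outcome and group were unrelated:
# (row total * column total) / grand total
expected <- outer(rowSums(toy), colSums(toy)) / sum(toy)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi2_by_hand <- sum((toy - expected)^2 / expected)

# correct = FALSE turns off the continuity correction that chisq.test()
# applies to 2 x 2 tables by default, so the two numbers match
chi2_from_r <- chisq.test(toy, correct = FALSE)

chi2_by_hand
chi2_from_r$statistic
chi2_from_r$expected
```

The continuity correction only applies to 2 x 2 tables, so it does not matter for larger crosstabs.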

Here, let's see if welfare attitudes are significantly related to respondent gender. 


In [None]:
ct1 <- table(anes20$welfare_ord, anes20$sex_fct) 

# notice I'm not using proportions from prop.table();
# just using the raw frequencies. That's what the test
# requires

chisq.test(ct1)

The test statistic is 50.311. To evaluate that number yourself, you would need to look it up in a $\chi ^2$ table with 4 degrees of freedom. Fortunately, R reports the p-value. Is the relationship between gender and welfare attitudes statistically significant at conventional significance thresholds?
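
If you would rather not dig out a printed table, R can do the lookup too. The sketch below uses only the statistic and degrees of freedom quoted above. The variable names are mine, and $\alpha = 0.05$ is just one conventional choice.

```r
# Values reported in the text above
chi2_stat <- 50.311
df <- 4

# Critical value at alpha = 0.05: the number a printed table would give
crit_05 <- qchisq(0.95, df = df)

# Upper-tail probability of a statistic at least this large under the null
p_val <- pchisq(chi2_stat, df = df, lower.tail = FALSE)

crit_05
p_val
```

Compare the test statistic with `crit_05`, or `p_val` with your chosen $\alpha$, to answer the question.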

## Difference of means 

To test whether one mean is significantly different from another, we can use the `t.test()` function. We need to take a slightly different approach than the one we used to create a mean comparison table. For reference, here's how we did that in lab 3:

In [None]:
mc <- anes20 |> filter(!is.na(sex_fct)) |>
                group_by(sex_fct) |>
                summarize(Democratic = mean(dem_therm, na.rm=TRUE),
                          Republican = mean(rep_therm, na.rm=TRUE))
mc

But what `t.test()` wants is vectors of data, not pre-calculated means. So, let's specify the various vectors (columns of data) that we can then compare.

In [13]:
# pull() returns each column as a plain vector,
# which is what t.test() expects
wom_dem <- anes20 |> 
            filter(sex_fct=="Women") |> 
            pull(dem_therm)
men_dem <- anes20 |> 
            filter(sex_fct=="Men") |> 
            pull(dem_therm)
wom_rep <- anes20 |> 
            filter(sex_fct=="Women") |> 
            pull(rep_therm)
men_rep <- anes20 |> 
            filter(sex_fct=="Men") |> 
            pull(rep_therm)

Now that we've stored those columns of data, we feed them into `t.test()`.

In [None]:
# the mu argument here specifies the null 
# hypothesis that the difference is 0 
# the default alternative hypothesis is a 
# two-sided test

t.test(wom_dem, men_dem, mu = 0)
t.test(wom_rep, men_rep, mu = 0)
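
To see what `t.test()` is doing with those vectors, here is a minimal sketch on simulated data (the group names, means, and seed are hypothetical and have nothing to do with the ANES). It computes the Welch $t$ statistic by hand and checks it against the function's output.

```r
set.seed(510)  # arbitrary seed so the simulated numbers are reproducible

# Hypothetical thermometer-style scores for two groups (simulated, not ANES)
group_a <- rnorm(200, mean = 55, sd = 20)
group_b <- rnorm(180, mean = 50, sd = 25)

# Welch t statistic: difference in means over its estimated standard error
se_diff <- sqrt(var(group_a) / length(group_a) + var(group_b) / length(group_b))
t_by_hand <- (mean(group_a) - mean(group_b)) / se_diff

# t.test() defaults to var.equal = FALSE, i.e. the Welch test
res <- t.test(group_a, group_b, mu = 0)

t_by_hand
res$statistic
res$p.value
```

Because the default does not assume equal variances, the reported degrees of freedom are usually not a whole number.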

## Difference in sample proportions 

We can test for a difference in sample proportions using the `prop.test()` function. The example below wraps `prop.test()` around an existing crosstab.

In [None]:

# keep only observations that are non-missing on the
# predictor and the outcome, and keep only Republican
# and Democratic respondents. In other words, the
# outcome needs to be 0/1 and the predictor needs
# to have only two categories.
vbm <- anes20 |> select(favor.vote.mail, pid3) |>
              filter(!is.na(pid3) & 
                      !is.na(favor.vote.mail) &
                     pid3 != "Independent")

# wrap  prop.test around the crosstab
prop.test(table(vbm$pid3, vbm$favor.vote.mail))
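
One detail worth knowing when you pass a crosstab: `prop.test()` treats the first column as the "successes". The sketch below uses made-up counts (not ANES data) to show which proportions are reported, and checks the test statistic against a pooled two-sample $z$ test computed by hand.

```r
# Hypothetical counts (made up): rows are groups, columns are outcomes
toy <- matrix(c(60, 40,
                45, 55),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              outcome = c("yes", "no")))

# With a two-column table, prop.test() treats the FIRST column as "successes"
# correct = FALSE drops the default continuity correction so the
# hand calculation below matches exactly
res <- prop.test(toy, correct = FALSE)
res$estimate   # proportion of "yes" in each group

# Same test by hand: pooled two-sample z test for proportions
n <- rowSums(toy)
p <- toy[, "yes"] / n
p_pool <- sum(toy[, "yes"]) / sum(n)
z <- (p[1] - p[2]) / sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))

z^2            # equals the X-squared statistic reported by prop.test()
res$statistic
```

If the outcome is coded 0/1, `table()` puts the 0 column first, so the proportions `prop.test()` reports would be the share of 0s in each group. The p-value is the same either way.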

## Hypothesis testing with regression

This is the easiest one! You already know all the functions you need to evaluate significance; just use the p-values and t-tests included in the regression output. In the regression output below, you can see the `statistic` column ($t$ score), as well as the `p.value` column.

`modelsummary` is a terrific package for displaying regression output. We will use it in greater detail later; for now, note that its `modelsummary()` function gives you a lot of flexibility over which parameter estimates are displayed and how they are formatted. What I have presented below is probably too much information: it includes coefficient estimates, stars to note whether a coefficient is significantly different from 0 at various thresholds of $\alpha$, the estimated standard error of each coefficient, the $t$ score, and the confidence interval around the coefficient. Most of the time, you would display only the coefficients and standard errors, plus stars, $t$ scores, or p-values.

In [None]:
m1 <- lm(welfare ~ age + income + pid7, data = anes20)
tidy(m1)
modelsummary(m1, estimate = "{estimate}{stars}",
                 statistic = "se = {std.error} t = {statistic} [{conf.low}, {conf.high}]")
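
The $t$ scores and p-values in regression output can be reproduced by hand. Here is a minimal sketch that uses the built-in `mtcars` data as a stand-in for the ANES (the model itself is arbitrary and only for illustration).

```r
# mtcars ships with R; it stands in for the ANES data here
m_toy <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- summary(m_toy)$coefficients

# t score = estimate / standard error
t_by_hand <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# two-sided p-value from the t distribution with residual degrees of freedom
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(m_toy), lower.tail = FALSE)

cbind(t_by_hand, t_from_r = coefs[, "t value"],
      p_by_hand, p_from_r = coefs[, "Pr(>|t|)"])
```

Each test here has the null hypothesis that the coefficient equals 0, which is the same null the stars refer to.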