## Bivariate Comparisons

This lab will describe how to make appropriate bivariate comparisons between variables measured at different levels of measurement. Remember, the main idea is to compare what happens to the outcome variable (either the distribution of its units across its values or some other relevant statistic) as the values of the predictor variable vary. 

For this lab, we will use the 2020 American National Election Study. The unit of analysis is the survey respondent, and the level of analysis is the individual. Let's install our necessary packages, along with the `devtools` package that will let us access the `anesr` set of functions to grab the data.

In [None]:
# Install the required packages (this only needs to be done once per machine)

install.packages(c('tidyverse', 'devtools', 'haven', 'knitr'))

# load the packages into the R session

library(tidyverse)
library(devtools)
library(haven)
library(knitr)

# install a user-generated package from GitHub with install_github()
# (this function comes from devtools, so devtools must be loaded first)

# anesr provides the American National Election Study survey data
install_github("jamesmartherus/anesr")

library(anesr)

Now we can load our data using the `data()` function. Because we loaded the `anesr` package, the ANES datasets can be read just like they're sitting in our working directory. 

In [None]:
# This is the 2020 wave of the ANES Time Series Study.
data(timeseries_2020)

Sometimes it is helpful to download a file, save it to your computer, and use it in your R script. You can download a file using `download.file` (clever naming, huh?). The code below saves a url and then uses `download.file` to get the .R file stored at that url and saves it with a different name into your working directory. Then, the code runs the R script.

In [None]:
# Save the url for easy access. The paste0() function combines
# strings with no separator character between them
# (plain paste() would insert a space by default).

myurl <- paste0("https://raw.githubusercontent.com/bowendc/510_labs/main/", "lab4_recodes.R")

download.file(url = myurl, "lab4recodes.R" ) # lab4recodes.R is the name we're giving to the file 

source("lab4recodes.R") # runs the .R script recoding ANES variables

# one more recode: attach labels to welfare (as an ordered factor)
# and to sex (as an unordered factor)

anes20 <- anes20 |> mutate(welfare_ord = ordered(welfare, labels = welfare_lbl),
                           sex_fct = factor(sex, labels = sex_lbl))
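
To see what the `labels` argument does in `factor()` and `ordered()`, here is a minimal sketch on made-up data. The vector `codes` and the label vector `toy_lbl` are hypothetical stand-ins, not the objects created by `lab4recodes.R`. Labels are matched to the sorted unique values of the input, so the first label goes with the smallest code.

```r
# hypothetical numeric codes, as they might appear in raw survey data
codes <- c(2, 1, 3, 1, NA, 2)

# hypothetical labels, listed in the order of the sorted codes (1, 2, 3)
toy_lbl <- c("Decrease", "Keep the same", "Increase")

# ordered() is factor() with ordered = TRUE: the categories have a rank order
toy_ord <- ordered(codes, labels = toy_lbl)
toy_ord
levels(toy_ord)

# because the factor is ordered, comparisons between categories are allowed
toy_ord < "Increase"

# missing codes stay missing after the recode
table(toy_ord, useNA = "ifany")
```

If the labels were listed in a different order than the sorted codes, they would be attached to the wrong categories, so it is worth checking a recode like this against the codebook.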

### Crosstabs

In my opinion, crosstabs in R are more difficult than they need to be. Perhaps because creating crosstabs in base R is cumbersome, there are many packages offering additional functionality. I haven't found any that I really love. Because of this, we will create crosstabs several different ways here. 

First, let's see what is available in base R. 

In [None]:
# table will display a one-, two-, or even three-way frequency distribution (just the frequencies, not percentages).
# outcome goes first, predictor second
table(anes20$welfare_ord, anes20$sex_fct)

# wrap in prop.table( , 2) to get column proportions
prop.table(table(anes20$welfare_ord, anes20$sex_fct), 2)

# store for later use:
ct1 <- table(anes20$welfare_ord, anes20$sex_fct)

# call back up to create rounded percentages using round() function
round(prop.table(ct1, 2), digits = 3) * 100
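
The second argument to `prop.table()` controls which totals the proportions are based on. The sketch below uses a tiny made-up dataset (the vectors `opinion` and `sex` are hypothetical, not ANES data) to show the difference between column, row, and cell proportions.

```r
# hypothetical responses from six people
opinion <- c("Increase", "Decrease", "Decrease", "Increase", "Increase", "Decrease")
sex     <- c("Men", "Men", "Men", "Women", "Women", "Women")

toy_ct <- table(opinion, sex)   # outcome in rows, predictor in columns
toy_ct

col_pct  <- prop.table(toy_ct, margin = 2)  # each column sums to 1
row_pct  <- prop.table(toy_ct, margin = 1)  # each row sums to 1
cell_pct <- prop.table(toy_ct)              # all cells together sum to 1

# column percentages: compare the outcome across categories of the predictor
round(col_pct * 100, digits = 1)
colSums(col_pct)
```

With the predictor in the columns, column proportions are the ones that answer "how does the outcome differ across values of the predictor?", which is why the lab uses `margin = 2`.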


This isn't bad, but we could also combine the table with the `kable()` function to add a caption and automate the rounding. If we were writing an interactive document using `knitr` or creating a table for a website with html, then the `kable()` function has some nice additional features as well. 

In [None]:
# kable() formats the table; try it for yourself on your computer

kable(prop.table(ct1, 2)*100, align = "lccc", 
      format = "simple", 
      digits = 2, 
      caption = "Opinions about welfare spending by sex of respondent, 2020 ANES")

Now let's use `tidyverse` to accomplish something similar.

In [None]:
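ct2 <- anes20 |> 
        # drop respondents missing on either variable
        filter(!is.na(welfare_ord) & !is.na(sex_fct)) |>
        # count respondents in each outcome-by-predictor cell
        group_by(welfare_ord, sex_fct) |> 
        summarize(n = n()) |>
        # spread the predictor's categories into their own columns
        pivot_wider(names_from = sex_fct,
                    values_from = n) |>
        ungroup() |>
        # convert each column of counts to column proportions
        mutate(`Men` = prop.table(`Men`),
               `Women` = prop.table(`Women`))
ct2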

# ct2 is stored as a tibble data frame. 
# we can export to a csv file using write.csv()

write.csv(ct2, file = "lab4.ct2.csv")
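
As a check on the logic of the pipeline above, here is a small sketch on made-up data. The tibble `toy` is hypothetical, and two choices differ from the lab code and are my own assumptions about what is convenient: `count()` is used as shorthand for `group_by()` plus `summarize(n = n())`, and `values_fill = 0` is added so that an empty cell becomes a zero rather than `NA`.

```r
library(dplyr)
library(tidyr)

# hypothetical data: six respondents, no women chose "Decrease"
toy <- tibble(
  opinion = factor(c("Decrease", "Same", "Increase", "Same", "Increase", "Increase"),
                   levels = c("Decrease", "Same", "Increase")),
  sex     = factor(c("Men", "Men", "Men", "Women", "Women", "Women"))
)

toy_ct <- toy |>
  count(opinion, sex) |>                          # one row per non-empty cell
  pivot_wider(names_from = sex,
              values_from = n,
              values_fill = 0) |>                 # empty cells become 0
  mutate(across(c(Men, Women), ~ .x / sum(.x)))   # column proportions

toy_ct

# each column of proportions should sum to 1
colSums(toy_ct[, c("Men", "Women")])
```

Without `values_fill = 0`, the missing cell would be `NA`, and every proportion in that column would become `NA` as well.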

### Mean comparisons 

Mean comparisons (or comparisons of some other statistic) can be created using `tidyverse` functions. Let's examine the thermometer ratings of the major political parties in the U.S. by sex of respondent.  

In [None]:
mc <- anes20 |> filter(!is.na(sex_fct)) |> 
                group_by(sex_fct) |> 
                summarize(Democratic = mean(dem_therm, na.rm = TRUE),
                          Republican = mean(rep_therm, na.rm = TRUE))
mc

mc |> 
  kable(align="lcc", 
        col.names = c("", "Democratic", "Republican"), 
        digits = 2, 
        format = "simple", 
        caption = "Mean thermometer ratings of political parties by sex of respondent, 2020 ANES") 
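
The `na.rm = TRUE` argument matters because `mean()` returns `NA` as soon as a single value is missing. A minimal sketch with made-up ratings (the tibble `toy` and its columns are hypothetical) shows the difference, and also reports how many valid cases each mean is based on.

```r
library(dplyr)

# hypothetical thermometer ratings with one missing value
toy <- tibble(
  sex   = c("Men", "Men", "Women", "Women"),
  therm = c(40, NA, 60, 80)
)

toy_means <- toy |>
  group_by(sex) |>
  summarize(with_na    = mean(therm),                # NA if any value is missing
            without_na = mean(therm, na.rm = TRUE),  # drops missing values first
            n_valid    = sum(!is.na(therm)))         # cases behind each mean

toy_means
```

Reporting the number of valid cases alongside each mean makes it easier to judge how much weight a difference between groups can bear.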

### Scatterplots and trend lines

**ggplot2** can be used to create scatterplots. By adding additional layers to the plot, we can overlay other graph types (like loess plots, lines of best fit, and more).

In [None]:
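# assign graph to `plot` for easy access later 
# geom_jitter() adds a little random noise to each point, and a low alpha
# makes points nearly transparent, so dense regions show up darker
plot <- ggplot(data = anes20, mapping = aes(x = age, y = trans_therm)) + 
            geom_jitter(size = 2, 
                        width = 2, 
                        height = 2, 
                        color = "black",  # the default point shape is drawn with color, not fill
                        alpha = .03, 
                        stroke = 0 ) +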
            labs(y = "Thermometer Rating: Transgender People",
                x = "Age") +
            theme_minimal()

plot

Now, let's add a loess line using `geom_smooth()`.

In [None]:
plot + geom_smooth(method = "loess", se = FALSE)

And a line of best fit!

In [None]:
plot2 <- plot + 
           geom_smooth(method = "loess", se = FALSE) +
           geom_smooth(method = "lm", se = FALSE, color = "grey15", 
                       linetype = "longdash", linewidth = 1) 
plot2
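
The line drawn by `geom_smooth(method = "lm")` is, by default, the fit from `lm(y ~ x)`. The sketch below fits that model directly on made-up data so the intercept and slope can be read off. The data frame `toy` is hypothetical and deliberately noise-free, so the coefficients come out exactly; real survey data will not behave this neatly.

```r
# hypothetical data: ratings fall by half a point per year of age, with no noise
toy <- data.frame(age = seq(20, 80, by = 10))
toy$therm <- 90 - 0.5 * toy$age

# the same kind of model that geom_smooth(method = "lm") fits behind the scenes
fit <- lm(therm ~ age, data = toy)
coef(fit)   # intercept and slope of the line of best fit

# predicted rating for a 45-year-old
predict(fit, newdata = data.frame(age = 45))
```

The slope is the expected change in the outcome for a one-unit increase in the predictor, which is the number the dashed line in the plot is summarizing.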

Good! Now let's try recreating this graph accounting for party identification of the respondent using `facet_grid`.

In [None]:
plot2 + facet_grid(~pid3)
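
The formula passed to `facet_grid()` has the form `rows ~ columns`, so `~pid3` places one panel per category side by side. A minimal sketch on made-up data (the data frame `toy` and its `party` variable are hypothetical stand-ins for `anes20` and `pid3`) shows both layouts.

```r
library(ggplot2)

# hypothetical data with a three-category grouping variable
toy <- data.frame(
  x     = rep(1:4, times = 3),
  y     = c(1, 2, 3, 4,  4, 3, 2, 1,  2, 2, 3, 3),
  party = rep(c("Democrat", "Independent", "Republican"), each = 4)
)

base <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

p_cols <- base + facet_grid(~ party)    # variable on the right: panels in columns
p_rows <- base + facet_grid(party ~ .)  # variable on the left: panels in rows

p_cols
```

Side-by-side panels share a y-axis, which makes it easier to compare the level and slope of the trend across groups.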