# Basic comparison tests 

In this lab, we will cover how to compare your independent and dependent variables to evaluate whether they move together as your hypothesis expects. For each comparison, we will present the distribution of cases across values of your dependent variable as we move across values of the independent variable. We will use three main tools: cross-tabulations, mean comparisons, and scatterplots. Of course, there are many more ways to compare two variables, but these three are core starting points. We will also learn how to conduct these comparisons while holding constant a third variable.

In [None]:
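# load necessary packages - install pacman first if it is not already installed

if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")

# p_load() installs (if needed) and attaches each package listed
pacman::p_load(tidyverse, modelsummary)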

For this lab, I've created a simple data frame, stored as a csv file on GitHub, with survey data on respondents' demographic characteristics and on whether, and how much, they have interacted with their state legislators.

In [None]:
# ingest the data from github and store in R session as "survey.df"
survey.df <- read.csv("https://raw.githubusercontent.com/bowendc/pol200_labs/refs/heads/main/simple_survey.csv")

head(survey.df, n = 15)

## Cross-tabulations

Cross-tabs are joint frequency tables. They present the frequency distribution of your dependent variable across columns defined by values of your independent variable. That way you can see if the distribution of the dependent variable changes as you move across the columns. 

Creating crosstabs in base R is a multi-step process. We need to use two functions: the `table()` function and the `prop.table()` function:

In [None]:

# create the table with joint frequencies (counts)
# syntax: table(df$y, df$x)
# here we store the crosstab in the object ct for future use
ct <- table(survey.df$cntrep_help, survey.df$income)
ct

# now we have the counts; we use prop.table() to calculate proportions
# the second argument sets the margin: 1 for row proportions, 2 for column proportions
prop.table(ct, 2) 

# we can clean up the output by rounding.
round(prop.table(ct, 2), 2)


# we could do all these in one line of code by wrapping
# the functions inside one another:

# this time, let's use another reason for contact: expressing opinions
round(prop.table(table(survey.df$cntrep_opinions, survey.df$income), 2), 2)
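
To see what the margin argument of `prop.table()` does, here is a minimal sketch on a tiny made-up data set. The vectors `contacted` and `income` below are hypothetical and are not drawn from the survey file; the point is only that `margin = 1` makes each row sum to 1, while `margin = 2` makes each column sum to 1.

```r
# Hypothetical toy data, not taken from the survey file
contacted <- c(0, 0, 1, 1, 1, 0, 1, 0)   # dependent variable (rows)
income    <- c(1, 1, 1, 2, 2, 2, 2, 1)   # independent variable (columns)

toy_ct <- table(contacted, income)
toy_ct

# margin = 1: proportions within each row (each row sums to 1)
row_props <- prop.table(toy_ct, margin = 1)

# margin = 2: proportions within each column (each column sums to 1)
col_props <- prop.table(toy_ct, margin = 2)

rowSums(row_props)    # check: every row total is 1
colSums(col_props)    # check: every column total is 1
round(col_props, 2)
```

Because the independent variable defines the columns, column proportions are usually the ones to read: they show how the distribution of the dependent variable shifts from one column to the next.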


This isn't bad, but it sure would be nice to have value labels instead of the value numbers for easier interpretation. There are many ways of doing this, none of which makes me particularly satisfied. But one thing we could do is store these variables as `factor` data, which allows for the inclusion of both levels (order of values) and labels. Here, I write them to new variables and use the new variables to create the crosstab. 

In [None]:
survey.df$cntrep_opinions.lab <- factor(survey.df$cntrep_opinions,
                                        levels = c(0, 1),           #list order of values
                                        labels= c("No", "Yes"))     # text labels should match order in levels

survey.df$income.lab <- factor(survey.df$income,
                                levels = c(1, 2, 3, 4),
                                labels= c( "<50k", "50k - 100k", "100k - 150k", ">150k"))


table(survey.df$cntrep_opinions.lab, survey.df$income.lab)
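
If the relationship between `levels` and `labels` is unclear, the following small sketch may help. It uses a hypothetical vector of income codes (`raw_income`) and hypothetical labels, neither of which comes from the survey file. The labels are matched to the levels by position, and any value not listed in `levels` (including `NA`) becomes `NA` in the factor.

```r
# Hypothetical codes: 1 = lowest income group, 3 = highest; NA = missing
raw_income <- c(3, 1, 2, 2, NA, 1)

income_f <- factor(raw_income,
                   levels = c(1, 2, 3),                  # order of the values
                   labels = c("Low", "Middle", "High"))  # matched to levels by position

levels(income_f)       # labels, in the order given by `levels`
table(income_f)        # counts appear in level order, not alphabetical order
as.integer(income_f)   # underlying integer codes; NA stays NA
```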

Here is another way of creating the crosstab by using the `datasummary_crosstab` function from the **modelsummary** package.

In [None]:
# using datasummary_crosstab
# the statistic argument below allows us to use column percentages

datasummary_crosstab(cntrep_opinions.lab ~ income.lab,
                         statistic = 1 ~ 1 + N + Percent("col"), 
                         data = survey.df |> filter(!is.na(cntrep_opinions.lab) & 
                                                    !is.na(income.lab))) 

Pretty good! But we can do even better by using some of the **tidyverse** functions we already know about. This time, let's use the `output` argument to format the table as a data frame, and then we can use our other functions for editing.

In [None]:
# the rename function will provide a descriptive name for the first column
# the select() function below will drop the column with the empty column header
ct3 <- datasummary_crosstab(cntrep_opinions.lab ~ income.lab,
                         statistic = 1 ~ 1 + N + Percent("col"),
                         data = survey.df |> filter(!is.na(cntrep_opinions.lab) & 
                                                    !is.na(income.lab)),
                         output = 'dataframe') |>
                         rename(`Contacted Representative` = cntrep_opinions.lab) |>
                         select(!` `)

ct3
    

### Controlling for a third factor variable

We can easily replicate this table while holding constant a third variable. To make the table easier to read, let's again use the `factor` function to code a new variable with text labels. Then we can specify our control variable inside the `datasummary_crosstab()` call.


In [None]:
survey.df$women.lab <- factor(survey.df$women,
                                        levels = c(0, 1),
                                        labels= c("Men", "Women"))


# Note the end of the first line. The * will break the columns of the table 
#   into combinations of our two variables.
#   The variable you list first will be the first division of the data, 
#   followed by the second variable. 
ct4 <- datasummary_crosstab(cntrep_opinions.lab ~ women.lab * income.lab,
                         statistic = 1 ~ 1 + N + Percent("col"),
                         data = survey.df,
                         output = 'default') 

ct4
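
For comparison, the same idea can be sketched in base R with a three-way `table()`. This is only an illustration on hypothetical toy vectors (`contacted`, `income`, `women`), not the survey data: passing `margin = c(2, 3)` to `prop.table()` computes column proportions separately within each category of the control variable.

```r
# Hypothetical toy data, not taken from the survey file
contacted <- c(0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1)
income    <- c(1, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1)
women     <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1)

# three-way table: rows = DV, columns = IV, layers = control variable
ct_3way <- table(contacted, income, women)

# margin = c(2, 3): each income column sums to 1 within each layer of women
props_3way <- prop.table(ct_3way, margin = c(2, 3))

# ftable() flattens the layers into a single printable table
ftable(round(props_3way, 2), row.vars = "contacted")
```

Reading across the income columns within one layer shows the relationship between income and contact for that group alone, which is what holding the third variable constant means.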

## Mean Comparisons

In a mean comparison, we calculate the mean of an interval or dichotomous dependent variable by categories of the independent variable. The values of the IV make up the *rows*, while the means of the DV go in the *column*. While we could use `datasummary_crosstab` to do this too, I think it is easier to use `summarize` for this.

In [None]:
# group your data by the independent variable

survey.df |>
    group_by(income.lab) |>
    summarize(`Mean of Contacted Rep` = mean(cntrep_opinions, na.rm = TRUE))
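
A note on interpreting these means: when the dependent variable is coded 0/1, its mean is simply the proportion of cases coded 1. The sketch below checks this on hypothetical toy vectors (`contacted` and `income` are made up, not read from the survey file) and uses base R's `tapply()` as a stand-in for the grouped `summarize()` call.

```r
# Hypothetical toy data: 1 = contacted a representative, 0 = did not
contacted <- c(1, 0, 0, 1, 1, NA, 0, 1)
income    <- c("low", "low", "low", "high", "high", "high", "high", "low")

# mean of a 0/1 variable = proportion of 1s among non-missing cases
mean(contacted, na.rm = TRUE)

# base R group means: tapply(DV, IV, function, extra arguments)
group_means <- tapply(contacted, income, mean, na.rm = TRUE)
group_means
```

So a group mean of 0.5 would mean that half of the non-missing respondents in that group reported contact.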


### Controlling for a third variable

We can hold constant a third variable by incorporating that variable into our grouping strategy, and then pivoting with `pivot_wider`:

In [None]:
survey.df |>
    group_by(women.lab, income.lab) |>
    summarize(`Mean of Contacted Rep` = mean(cntrep_opinions, na.rm = TRUE),
              .groups = "drop") |>         # drop the grouping before pivoting
    pivot_wider(names_from = women.lab,
                values_from = `Mean of Contacted Rep`)

That's pretty good, but we can clean it up! Let's filter out missing observations and let's round the means.

In [None]:
survey.df |>
    filter(!is.na(income.lab)) |>          # is.na() will identify observations coded NA
    group_by(women.lab, income.lab) |>
    summarize(`Mean of Contacted Rep` = round(mean(cntrep_opinions, na.rm = TRUE), 2),
              .groups = "drop") |>         # drop the grouping before pivoting
    pivot_wider(names_from = women.lab,
                values_from = `Mean of Contacted Rep`) |>
    rename(Income = income.lab)

## Scatterplots

Scatterplots are graphs showing the exact value of each observation on both the independent and dependent variables. Scatterplots are useful for interval data that would otherwise be unsuitable for presenting in a table. We're going to spend more time later learning about graphing in R, so in the example below, we'll just cover the very basics. We'll be using the package `ggplot2`, part of the **tidyverse**. Here we will use a count of all types of interactions the respondents had with legislators (0 to 10 scale) as the dependent variable, and the respondent's age as the independent variable.

First, create the plot space with `ggplot()`.

In [None]:
plot <- ggplot(data = survey.df,                # define the data used
               mapping = aes(x = age,           # set variables to use
                             y = interactions)) 
plot

In [None]:
plot + geom_point()  # create scatterplot representation of data

In [None]:
plot + geom_jitter() # create scatterplot with a little random noise
                     # useful when we have lots of observations at 
                     # same place on graph

In [None]:
plot + geom_jitter(alpha = .2) # make points semi-transparent
                               # 0 is invisible, 1 is fully opaque

### Controlling for a third variable on a scatterplot

When controlling for a third variable, you need to remake the scatterplot separately for each category of the control variable. We can do that in several ways, but here, let's add a `facet_grid()` layer to our `ggplot()` object.

In [None]:
plot + geom_jitter(alpha = .2) + 
       facet_grid(~income.lab)    # control for income