In this lab, we are going to cover some additional basic data management functions as well as how to make simple comparisons using crosstabs, mean comparisons, and difference of means tests. 

Let's begin by loading up our packages, including installing a new package, **vtable**.

In [None]:
# If needed, install the packages.

install.packages(c("tidyverse","haven","dataverse","vtable"))

In [None]:
library(tidyverse)
library(haven)
library(dataverse)
library(vtable)

## Downloading the dataset

As in Lab 1, we'll use the **dataverse** package functions to download a replication dataset. For this lab, we will be using a dataset created by Meier, Johnson, and An for their paper "Perceptual Bias and Public Programs: The Case of the United States and Hospital Care." The dataset is part of the replication file for their [PAR paper](https://onlinelibrary.wiley.com/doi/full/10.1111/puar.13067). In the paper, the authors retest a previous finding that citizens hold more negative opinions about public services than about private services. Respondents were randomly assigned to receive a cue about a public or a private hospital, and were then asked to evaluate the organization on several items. We'll use one: *q9*: "The hospital acts in the interest of patients", scored on a scale from 1 (doesn't fit) to 7 (fits very well).

In [72]:
public.hosp <- get_dataframe_by_doi(              
          filedoi = "doi:10.7910/DVN/NRQPZX/KOEWTQ",
          original = TRUE, 
          .f = haven::read_dta, 
          server = "dataverse.harvard.edu")          

## Creating a table of descriptive statistics 

To create a table of descriptive statistics, we will use the *sumtable()* function of the **vtable** package. It produces a clean, well-formatted table and offers several options for viewing, exporting, or saving it. The *select()* function is part of **tidyverse** and lets us drop columns we don't want in the table. You could also do this directly in the *sumtable()* function using the *vars* argument. Check out the *out* argument in the help file to see the other output options. I'm using *out = "return"* so that the table shows up as output in the console; the default settings open an HTML version in your web browser. You can also export to a CSV file.

In [None]:
sumtable(public.hosp, out = "return") 

public.hosp |> 
    select(!(c("q1", "q2"))) |>    # ! means "not", so we keep every column
    sumtable(out = "return")       # except q1 and q2
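
As a rough sketch of the *vars* alternative mentioned above, the example below uses R's built-in *mtcars* data (not the hospital dataset) so that it runs without a download. The column names *mpg* and *hp* belong to *mtcars* and are only stand-ins for whatever variables you want to summarize.

```r
library(vtable)

# mtcars ships with R; it is used here only as a stand-in dataset
st <- sumtable(mtcars,
               vars = c("mpg", "hp"),   # keep just these columns in the table
               out = "return")          # return the table as a data frame

st
```

This should give the same kind of table as piping through *select()* first; which one you use is mostly a matter of taste.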

## Recoding and generating new variables

Changing variables and creating new variables can be done in both base R and the tidyverse. I typically use the **tidyverse** function *mutate()* for recoding and generating new variables. To illustrate how *mutate* works, let's recode the age variable in the dataset. Notice a few things about the following code. First, if we want the changes to remain, we need to assign the output of the function to a new object, or overwrite the existing object by assigning the output to the same name. I would strongly recommend assigning to a new object when recoding, like in the code below. Second, you can recode or create several variables inside the same *mutate* call; just separate them with commas. Finally, to recode from interval data into categories, we'll use the *case_when* function. This function evaluates its conditions in order, and each observation gets the value from the first condition it satisfies, so the order in which you list the conditions matters. A later condition never overwrites an earlier match; it only applies to observations that have not been assigned yet. The final line of the *case_when* statement takes all of the remaining values and codes them to missing data.

If we want to label the values of our new variable, we can use R's *ordered()* function, which tells R the variable is ordinal and allows us to add labels. 

In [None]:
public.hosp.recode <- public.hosp |> 
                            mutate(agesq = age^2,
                                   age4cat = case_when(
                                        age < 35 ~ 0,
                                        age >= 35 & age < 50 ~ 1,
                                        age >=50 & age < 65 ~ 2, 
                                        age >=65 ~ 3,
                                        TRUE ~ NA_real_ # anything not already assigned gets NA
                                   ))

public.hosp.recode$age4cat <- ordered(public.hosp.recode$age4cat,
                                   labels = c("0-34", "35-49", "50-64", "65+"))

table(public.hosp.recode$age4cat)
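
To see how *case_when* handles the order of its conditions, here is a small sketch on a made-up vector of ages (the values and the object names *toy.age*, *toy.cat*, and *toy.ord* are invented for illustration). It also passes *levels* to *ordered()* explicitly, which is one way to keep the labels lined up with the codes even when a category happens to be empty.

```r
library(dplyr)

# Hypothetical ages, invented for illustration (not from the hospital data)
toy.age <- c(20, 40, 70, NA)

toy.cat <- case_when(
    toy.age < 35 ~ 0,     # 20 is assigned here
    toy.age < 50 ~ 1,     # 20 also satisfies this, but it was already assigned
    toy.age < 65 ~ 2,
    toy.age >= 65 ~ 3     # the NA matches no condition, so it stays NA
)

toy.cat                   # 0 1 3 NA

# Spelling out the levels keeps "50-64" as a category even though
# no one in the toy data falls into it
toy.ord <- ordered(toy.cat,
                   levels = 0:3,
                   labels = c("0-34", "35-49", "50-64", "65+"))

table(toy.ord)
```

Because the first match wins, the second condition does not need to repeat *age >= 35*; writing it out anyway, as in the lab code, does no harm and makes each line easier to read on its own.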

## Crosstabs 

I use cross-tabulations, or contingency tables, quite a bit in my own work. They are a simple way of seeing the joint frequency distribution of two (or more) variables. There are several packages in R that can create crosstabs, although I haven't found one that I am perfectly happy with. Instead of using a specially designed package, let's rely on **dplyr** (part of **tidyverse**) to create our own crosstabs.

First, we can group our data by our independent (*public*) and dependent (*q9*) variables. Then, we can tell R to generate a count of the number of cases by the categories of *q9* and *public*. 

In [None]:
CT.summary <- public.hosp |> 
            group_by(q9, public) |>
            count()

CT.summary

We're getting closer, but what we really want is *public* to be in two separate columns, one for each of its values. *pivot_wider* reshapes the data, and *rename* obviously renames the columns, which would have simply been labeled as "0" and "1".

In [None]:
CT.tab <- CT.summary |>
            pivot_wider(names_from = public,    # switches the table from long to wide
                        values_from = n) |>
            rename(private = "0",              # gives intelligible names
                    public = "1")

CT.tab              

We're almost there! Now, to calculate the percentages or proportions within each column, we need to ungroup our data (using *group_by(NULL)*; *ungroup()* does the same thing) and then call *mutate()* one more time.

To make a nicer looking table, let's only select the columns we want to present. 

In [None]:
CT.tab.prop <- CT.tab |> group_by(NULL) |> # we need to ungroup our data or the summaries will
                                           # continue to be by values of q9
                    mutate(pct.private = 100*(private/sum(private)),
                           pct.public = 100*(public/sum(public)))
                    
CT.tab.prop |> select(q9, pct.private, pct.public)
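
If you want a quick cross-check on column percentages like these, base R's *table()* and *prop.table()* get to the same place in two lines. The sketch below uses a tiny made-up data frame (*toy*, with invented values) rather than the hospital data; the *margin = 2* argument asks for proportions within each column.

```r
# Hypothetical responses, invented for illustration
toy <- data.frame(q9     = c(1, 1, 2, 2, 2, 3),
                  public = c(0, 1, 0, 0, 1, 1))

# Rows are values of q9, columns are values of public
toy.tab <- table(q9 = toy$q9, public = toy$public)

# margin = 2 divides each cell by its column total
toy.pct <- 100 * prop.table(toy.tab, margin = 2)

round(toy.pct, 1)
colSums(toy.pct)    # each column should sum to 100
```

The output is less flexible than the **dplyr** version, but it is a handy way to confirm that the percentages were computed down the columns and not across the rows.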

## Mean comparisons and difference of means tests

Mean comparisons are easier. We can build the comparison simply by grouping our data and then running *summarise* with the *mean* function. In a mean comparison, we group by the independent variable and calculate the average of the dependent variable within each group.

In [None]:
MC.tab <- public.hosp.recode |>
                    group_by(public) |>
                    summarise("Mean of q9" = mean(q9))

MC.tab
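
A mean is easier to interpret next to its standard deviation and group size, and *summarise* can return all three at once. Here is a minimal sketch on made-up data (the data frame *toy* and the column names *mean.q9*, *sd.q9*, and *n* are my own choices for illustration):

```r
library(dplyr)

# Hypothetical responses, invented for illustration
toy <- data.frame(public = c(0, 0, 0, 1, 1, 1),
                  q9     = c(5, 6, 7, 3, 4, 5))

MC.toy <- toy |>
    group_by(public) |>
    summarise(mean.q9 = mean(q9),   # group average of the dependent variable
              sd.q9   = sd(q9),     # spread within each group
              n       = n())        # number of cases in each group

MC.toy
```

If the variable you are averaging contains missing values, *mean()* and *sd()* return NA unless you add *na.rm = TRUE*.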

Hmmm. These look pretty similar. I'm not sure there is much of an effect of the public hospital prompt in the survey experiment. Let's test it more formally by calculating a t-test, which evaluates whether $\bar{y}_{private} - \bar{y}_{public}$ is significantly different from 0, the value we would expect under the null hypothesis.

In [None]:
t.test(q9 ~ public, data = public.hosp.recode, var.equal = TRUE)

What do you think? Is the experimental treatment significantly related to evaluations of whether the hospital acts in the interest of patients?
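
To see what *t.test()* is doing when *var.equal = TRUE*, it can help to compute the statistic by hand once. With equal variances assumed, the test statistic is the difference of means divided by a standard error built from the pooled variance. The sketch below uses two short made-up vectors (*private* and *public* here are invented numbers, not the survey responses) and checks the hand calculation against *t.test()*.

```r
# Hypothetical scores, invented for illustration
private <- c(5, 6, 4, 7, 5)
public  <- c(4, 5, 5, 6, 3)

n1 <- length(private)
n2 <- length(public)

# Pooled variance: a weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(private) + (n2 - 1) * var(public)) / (n1 + n2 - 2)

# Difference of means over its standard error
t.by.hand <- (mean(private) - mean(public)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# The same test from t.test(); the statistic should match
tt <- t.test(private, public, var.equal = TRUE)

t.by.hand
tt$statistic
```

Dropping *var.equal = TRUE* gives Welch's version of the test, which does not pool the variances and adjusts the degrees of freedom instead.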
