In [1]:
# tidyverse and presentation
library('dplyr')
library('forcats')
library('reshape2')
library('stringr')
library('readr')
library('ggplot2')
library('stargazer')
library('arsenal')
library('sjPlot')
library('tidyr')
library('naniar') # replacing -99 w/NA

# display
options(repr.matrix.max.cols=50, repr.matrix.max.rows=100)
options(dplyr.width = Inf)



# Overview

This notebook prepares survey data exported from Qualtrics for analysis. The survey ran in October 2021 with a Qualtrics panel and in November 2021 with Amazon Mechanical Turk workers (MTurkers). We estimated that N = 426 responses would give us enough statistical power to detect differences among our user, data, and use case variables. We received 586 responses from Qualtrics panelists and 432 from MTurkers, for 1,018 in total.

Where do the input files live, and where should the prepared data go?

In [2]:
qualtrics_file <- '../data/qualtrics_data.csv'

mturk_file <- '../data/mturk_data.csv'

out_file <- '../data/personal_data_survey_responses.Rdata'

In [3]:
qdf <- read_csv(qualtrics_file)
qdf$source <- "qualtrics"
mdf <- read_csv(mturk_file)
mdf$source <- "mturk"
df <- rbind(qdf, mdf)

Rows: 586 Columns: 141
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (140): responseid, distributionchannel, q43, social_media_use, how_often...
dbl   (1): gender_identity_6

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 432 Columns: 141
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (137): responseid, distributionchannel, q43, social_media_use, how_often...
dbl   (4): gender_identity_3, gender_identity_4, sexual_orientation_5, race_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set

Let's make sure I didn't introduce any duplicates when manually prepping spreadsheets.

In [4]:
df <- df %>% distinct()
nrow(df)

In [5]:
glimpse(df)

Rows: 1,018
Columns: 142
$ responseid           <chr> "R_3FXrXtg2n2GkeT4", "R_usoOyr9obtGrVYt", "R_12aM…
$ distributionchannel  <chr> "anonymous", "anonymous", "anonymous", "anonymous…
$ q43                  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
$ social_media_use     <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
$ how_often_1          <chr> "Several Times a Day", "Several Times a Day", "Se…
$ how_often_2          <chr> "A Few Times a Week", "Several Times a Day", "Sev…
$ how_often_3          <chr> "A Few Times a Week", "Don't use at all", "Every …
$ how_often_4          <chr> "About Once a Day", "Don't use at all", "Several …
$ how_often_5          <chr> "A Few Times a Week", "Don't use at all", "Every …
$ how_often_6          <chr> "About Once a Day", "Don't use at al

# Handling Missing Values

In [6]:
df <- df %>%
    replace_with_na_all(condition = ~.x == -99)
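
A minimal sketch of what this step does, written in base R on a toy data frame rather than with `naniar` (the column names `answer` and `score` are made up for illustration). Because R coerces the number to a string when it is compared with a character vector, `== -99` matches both the numeric `-99` and the text `"-99"`, so the `chr` columns produced by `read_csv` should be covered as well.

```r
# Toy data: a character column and a numeric column, both using -99 as "missing"
toy <- data.frame(
  answer = c("Yes", "-99", "No"),
  score  = c(3, -99, 5),
  stringsAsFactors = FALSE
)

# Replace the sentinel with NA in every column.
# "-99" == -99 is TRUE in R because the number is coerced to the string "-99".
toy[] <- lapply(toy, function(col) {
  col[!is.na(col) & col == -99] <- NA
  col
})

toy
```

One consequence worth keeping in mind: once the sentinel has been replaced, a later comparison against the string `'-99'` can no longer match, because those cells are already `NA`.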

# Recode and Relevel

We need to convert many `chr` variables to `factor` and order their levels from low to high. We start with the Likert scales and the yes/no questions.

In [7]:
agreelevel <- c("Disagree", "Neither Agree Nor Disagree", "Agree")
agreelevel_case <- c("Disagree", "Neither agree nor disagree", "Agree")
importlevel <- c("Not important", "Neither important nor unimportant", "Important")
ynlevel <- c("No", "Yes", "I'm not sure.")
freqlevel <- c("Never", "Sometimes", "Always")
idealevel <- c("Bad idea", "I'm not sure", "Good idea")
oftenlevel <- c("Don't use at all", "Less Often", "Every Few Weeks", "A Few Times a Week", "About Once a Day", "Several Times a Day")
senslevel <- c("-99","1","2","3","4","5","6","7","8","9","10")
trustlevel <- c("-99","1","2","3","4","5","6","7")
concernlevel <- c("Extremely Unconcerned", "Moderately Unconcerned", "Slightly Unconcerned", "Neither Concerned nor Unconcerned",
                 "Slightly Concerned", "Moderately Concerned", "Extremely Concerned")



In [8]:
df <- df %>%
    # recode the labelled endpoints first; values not listed in `levels` would otherwise become NA
    mutate_at(vars(starts_with("trust_")),
              list(~recode(., "Complete Trust - 7" = "7", "No trust at all - 1" = "1"))) %>%
    mutate_at(vars(starts_with("trust_")),
              list(~factor(., levels = trustlevel, ordered = TRUE))) %>%
    mutate_at(vars(starts_with("sensitivity")),
             list(~recode(., "Very Sensitive 10" = "10", "Not Sensitive 1" = "1"))) %>%
    mutate_at(vars(starts_with("sensitivity")),
             list(~factor(., levels = senslevel, ordered = TRUE))) %>%
    mutate_at(vars(starts_with("digital_privacy")),
             list(~factor(., levels = agreelevel, ordered = TRUE))) %>%
    mutate_at(vars(starts_with("data_archive")),
             list(~factor(., levels = importlevel, ordered = TRUE ))) %>%
    mutate_at(vars(starts_with("researchers_")),
             list(~factor(., levels = ynlevel, ordered = FALSE))) %>%
    mutate_at(vars(starts_with("sm_companies_")),
             list(~factor(., levels = ynlevel, ordered = FALSE))) %>%
    mutate_at(vars(starts_with("journalists_")),
             list(~factor(., levels = ynlevel, ordered = FALSE))) %>%
    mutate_at(vars(starts_with("privacy_behaviors")),
             list(~factor(., levels = freqlevel, ordered = TRUE))) %>%
    mutate_at(vars(starts_with("how_often")),
             list(~factor(., levels = oftenlevel, ordered = TRUE))) %>%
    mutate(secure_archive = factor(secure_archive, levels = idealevel, ordered = FALSE)) %>%
    mutate(anonymous_archive = factor(anonymous_archive, levels = idealevel, ordered = TRUE)) %>%
    mutate(social_science = factor(social_science, levels = agreelevel_case, ordered = TRUE)) %>%
    mutate(social_media_use = factor(social_media_use, levels = ynlevel, ordered = FALSE)) %>%
    mutate(sometimes = factor(sometimes, levels = ynlevel, ordered = FALSE)) %>%
    mutate(tweets_public = factor(tweets_public, levels = ynlevel, ordered = FALSE)) %>%
    mutate(insta_public = factor(insta_public, levels = ynlevel, ordered = FALSE)) %>%
    rename(concern_misuse = `concern-misuse`) %>%
    rename(concern_harm = `concern-harm`) %>%
    mutate_at(vars(starts_with("concern_")),
             list(~factor(., levels = concernlevel, ordered = TRUE))) %>%
    glimpse()

Rows: 1,018
Columns: 142
$ responseid           <chr> "R_3FXrXtg2n2GkeT4", "R_usoOyr9obtGrVYt", "R_12aM…
$ distributionchannel  <chr> "anonymous", "anonymous", "anonymous", "anonymous…
$ q43                  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
$ social_media_use     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
$ how_often_1          <ord> Several Times a Day, Several Times a Day, Several…
$ how_often_2          <ord> A Few Times a Week, Several Times a Day, Several …
$ how_often_3          <ord> A Few Times a Week, Don't use at all, Every Few W…
$ how_often_4          <ord> About Once a Day, Don't use at all, Several Times…
$ how_often_5          <ord> A Few Times a Week, Don't use at all, Every Few W…
$ how_often_6          <ord> About Once a Day, Don't use at all, 
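
The order of `recode()` and `factor()` matters when some responses carry text labels. A small sketch on a made-up vector (not the survey data): any value that is not listed in `levels` becomes `NA` as soon as `factor()` runs, so labelled endpoints need to be recoded to bare numbers first.

```r
library(dplyr)

# Hypothetical responses on a 1-7 scale whose endpoints carry text labels
x <- c("No trust at all - 1", "4", "Complete Trust - 7")
lv <- as.character(1:7)

# Recode first, then build the ordered factor: all three values survive
good <- factor(recode(x, "No trust at all - 1" = "1", "Complete Trust - 7" = "7"),
               levels = lv, ordered = TRUE)

# Factor first: the labelled endpoints are not in `lv`, so they turn into NA
bad <- factor(x, levels = lv, ordered = TRUE)

as.character(good)
as.character(bad)
```

Counting `NA`s per column before and after the conversion is a cheap way to confirm that no responses were dropped by a level mismatch.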

## Create scales of trust, digital privacy, and privacy behaviors

In [9]:
df <- df %>%
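    # trust levels are "-99", "1", ..., "7": convert via character so "1" maps to 1, not to level position 2
    mutate_at(vars(starts_with("trust_")),
              list(~as.numeric(as.character(.)))) %>%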
    mutate_at(vars(starts_with("digital_privacy_")),
              list(as.numeric)) %>%
    mutate_at(vars(starts_with("privacy_behaviors_")),
              list(as.numeric)) %>%
    rowwise() %>% 
    mutate(trust_scale = sum(across(starts_with("trust_")), na.rm = T)) %>% 
    mutate(dp_scale = sum(across(starts_with("digital_privacy_")), na.rm = T)) %>%
    mutate(pb_scale = sum(across(starts_with("privacy_behaviors_")), na.rm = T)) %>%
    ungroup() %>%
    glimpse()

Rows: 1,018
Columns: 145
$ responseid           <chr> "R_3FXrXtg2n2GkeT4", "R_usoOyr9obtGrVYt", "R_12aM…
$ distributionchannel  <chr> "anonymous", "anonymous", "anonymous", "anonymous…
$ q43                  <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", …
$ social_media_use     <fct> Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Yes,…
$ how_often_1          <ord> Several Times a Day, Several Times a Day, Several…
$ how_often_2          <ord> A Few Times a Week, Several Times a Day, Several …
$ how_often_3          <ord> A Few Times a Week, Don't use at all, Every Few W…
$ how_often_4          <ord> About Once a Day, Don't use at all, Several Times…
$ how_often_5          <ord> A Few Times a Week, Don't use at all, Every Few W…
$ how_often_6          <ord> About Once a Day, Don't use at all, 
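
One detail worth checking when turning factors into numbers: `as.numeric()` on a factor returns the position of each level, not its label. The sketch below uses a made-up factor to show the difference, and how `as.numeric(as.character(...))` recovers the labelled value.

```r
# Hypothetical ordered factor whose levels include an unused "-99" placeholder
lv <- c("-99", as.character(1:7))
f <- factor(c("1", "4", "7"), levels = lv, ordered = TRUE)

# Level positions: "1" is the second level, so every code is shifted by one
codes <- as.numeric(f)

# Labelled values: convert to character first
values <- as.numeric(as.character(f))

codes
values
```

For factors with text-only levels such as `agreelevel` or `freqlevel`, the level position is the intended score, so plain `as.numeric()` is appropriate there.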

## Recode Demographics

Next we handle the demographic questions. Many of these were asked as "check all that apply", so each option arrives as its own column and we have to derive a single variable per question. Very few respondents indicated a minority sexual orientation or gender identity (SOGI), so we collapse those categories.

In [10]:
df <- df %>%
    mutate(man_alone = case_when(gender_identity_2 == 'Man' ~ '1', 
                                 gender_identity_2 == '-99' ~ as.character(NA),
                                 gender_identity_2 == '0' ~ '0',
                                 gender_identity_1 == 'Woman' | gender_identity_3 == 'Transgender' | gender_identity_4 == 'Nonbinary/genderqueer' | gender_identity_5 == 'Something else' ~ '0',
                                 gender_identity_5 == 'Prefer not to answer' ~ as.character(NA))) %>%
    mutate(straight_alone = case_when(sexual_orientation_2 == 'Heterosexual (straight)' ~ '1', 
                                      sexual_orientation_2 == '-99' ~ as.character(NA),
                                      sexual_orientation_2 == '0' ~ '0',
                                      sexual_orientation_1 == 'Gay or lesbian' | sexual_orientation_3 == 'Bisexual' | sexual_orientation_5 == 'Something else' ~ '0',
                                      sexual_orientation_4 == 'Prefer not to answer' ~ as.character(NA))) %>%
    mutate(white_alone = case_when(race_ethnicity_7 == 'White' ~ '1', 
                                   race_ethnicity_7 == '-99' ~ as.character(NA),
                                   race_ethnicity_7 == '0' ~ '0',
                                   race_ethnicity_1 == 'American Indian or Alaskan Native' | race_ethnicity_2 == 'Asian' | race_ethnicity_3 == 'Black or African American' | race_ethnicity_4 == 'Hispanic or Latino' | race_ethnicity_5 == 'Middle Eastern or North Africa' | race_ethnicity_6 == 'Native Hawaiian or other Pacific Islander' | race_ethnicity_9 == 'Something else' ~ '0',                                  
                                   race_ethnicity_8 == 'Prefer not to answer' ~ as.character(NA))) %>%
    mutate(black_alone = case_when(race_ethnicity_3 == 'Black or African American' ~ '1', 
                                   race_ethnicity_3 == '-99' ~ as.character(NA),
                                   race_ethnicity_3 == '0' ~ '0',
                                   race_ethnicity_1 == 'American Indian or Alaskan Native' | race_ethnicity_2 == 'Asian' | race_ethnicity_7 == 'White' | race_ethnicity_4 == 'Hispanic or Latino' | race_ethnicity_5 == 'Middle Eastern or North Africa' | race_ethnicity_6 == 'Native Hawaiian or other Pacific Islander' | race_ethnicity_9 == 'Something else' ~ '0',
                                   race_ethnicity_8 == 'Prefer not to answer' ~ as.character(NA))) %>%
    mutate(race = case_when(black_alone == '1' ~ 'Black',
                                 white_alone == '1' ~ 'White',
                                 black_alone == '0' & white_alone == '0' ~ 'Other')) %>%
    mutate_at(vars(c('man_alone', 'straight_alone', 'white_alone', 'black_alone', 'race')), list(~factor(., ordered = FALSE)))
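
A minimal sketch of the pattern above on made-up data: each "check all that apply" option arrives as its own column, and `case_when()` collapses them into one indicator. `case_when()` uses the first condition that evaluates to `TRUE`, and a comparison against `NA` never counts as a match. The column names `opt_a` and `opt_b` are hypothetical.

```r
library(dplyr)

# Hypothetical multi-select question: one column per option, NA when not ticked
toy <- tibble(
  opt_a = c("A", NA, "A", NA),
  opt_b = c(NA, "B", "B", NA)
)

toy <- toy %>%
  mutate(a_flag = case_when(
    opt_a == "A" ~ "1",    # option A ticked (alone or together with others)
    opt_b == "B" ~ "0",    # another option ticked, but not A
    TRUE ~ NA_character_   # nothing ticked: leave missing
  ))

toy$a_flag
```

Note that the third toy respondent, who ticked both options, still gets `1`, because the first matching condition wins.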


In [11]:
df %>% count(man_alone)

man_alone,n
<fct>,<int>
0,456
1,551
NA,11


In [12]:
df %>% count(straight_alone)

straight_alone,n
<fct>,<int>
0,161
1,844
NA,13


In [13]:
df %>% count(white_alone)

white_alone,n
<fct>,<int>
0,226
1,769
NA,23


In [14]:
df %>% count(black_alone)

black_alone,n
<fct>,<int>
0,905
1,90
NA,23


In [15]:
df %>% count(race)

race,n
<fct>,<int>
Black,90
Other,140
White,765
NA,23


In [16]:
df %>% count(age)

age,n
<chr>,<int>
18 - 24,46
25 - 34,299
35 - 44,210
45 - 54,89
55 - 64,99
65 - 74,189
75 - 84,71
85 or older,15


In [17]:
df <- df %>%
    mutate(age_brackets = case_when(age == '18 - 24' ~ '18 - 34', 
                                    age == '25 - 34' ~ '18 - 34',
                                    age == '35 - 44' ~ '35 - 64',
                                    age == '45 - 54' ~ '35 - 64',
                                    age == '55 - 64' ~ '35 - 64',
                                    TRUE ~ '65 and over'                                 
                                   )) %>%
    mutate(age_brackets = factor(age_brackets, levels = c('18 - 34', '35 - 64', '65 and over'), ordered = TRUE))

df %>% count(age_brackets)

age_brackets,n
<ord>,<int>
18 - 34,345
35 - 64,398
65 and over,275


In [18]:
df %>%
    count(education) 

education,n
<chr>,<int>
Associate degree,89
"Bachelor's degree (For example: BA, AB, BS)",397
Did not complete high school,10
"Doctorate degree (For example: PhD, EdD)",15
High school graduate - high school diploma or equivalent (for example: GED),147
"Master's degree (For example: MA, MS)",151
"Professional Degree (For example: MBA, MFA, DDS, DVM, LLB, JD)",21
Some college but no degree,185
NA,3


In [19]:
df <- df %>% mutate(education_level = case_when(is.na(education) ~ NA_character_,  # keep the 3 missing responses missing
                                                education == 'Did not complete high school' | 
                                                education == 'High school graduate - high school diploma or equivalent (for example: GED)' |
                                                education == 'Some college but no degree' ~ 'Less than college degree',
                                                education == 'Associate degree' | education == 'Bachelor\'s degree (For example: BA, AB, BS)' ~ 'College degree',
                                                TRUE ~ 'Graduate degree'  # Master's, Doctorate, and Professional degrees
                                                )
                   ) %>%
            mutate(education_level = factor(education_level, 
                                            levels = c('Less than college degree', 'College degree', 'Graduate degree'), 
                                            ordered = TRUE)
                )
df %>% count(education_level)

education_level,n
<ord>,<int>
Less than college degree,342
College degree,486
Graduate degree,187
NA,3


In [20]:
df %>% count(income)

income,n
<chr>,<int>
"$10,000 - $19,999",88
"$100,000 - $149,999",79
"$20,000 - $29,999",113
"$30,000 - $39,999",132
"$40,000 - $49,999",129
"$50,000 - $59,999",142
"$60,000 - $69,999",67
"$70,000 - $79,999",75
"$80,000 - $89,999",49
"$90,000 - $99,999",49


In [21]:
df <- df %>% # mutate(income = str_trim(income, side = "both")) %>%
    mutate(income_level = case_when(income == '-99' ~ as.character(NA),
                                             income == 'Less than $10,000' | income == '$10,000 - $19,999' | income == '$20,000 - $29,999' | income == '$30,000 - $39,999' ~ 'Below 40K',
                                             income == '$40,000 - $49,999' | income == '$50,000 - $59,999' ~ '40K - 60K',
                                             TRUE ~ 'Over 60K'
                                    )
                   ) %>%
            mutate(income_level = factor(income_level, levels = c('Below 40K', '40K - 60K', 'Over 60K'), ordered = TRUE)
                  )
df %>% count(income_level)

income_level,n
<ord>,<int>
Below 40K,394
40K - 60K,271
Over 60K,353
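
A caution on the `TRUE ~ ...` fallback used in the bracket recodes: it also catches missing values, because every earlier comparison against `NA` evaluates to `NA` rather than `TRUE`. The sketch below, on a made-up vector, shows the effect and one way to keep missing values missing.

```r
library(dplyr)

# Hypothetical raw categories with one missing response
raw <- c("low", "high", NA)

# The fallback absorbs the NA into the last bracket
with_fallback <- case_when(raw == "low" ~ "Low",
                           TRUE ~ "High")

# Handling NA explicitly before the fallback keeps it missing
na_kept <- case_when(is.na(raw) ~ NA_character_,
                     raw == "low" ~ "Low",
                     TRUE ~ "High")

with_fallback
na_kept
```

Comparing the bracket counts with the counts of the raw variable is a quick way to see whether any missing responses were absorbed by the fallback.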


# Save the Data in R format

Save the prepared data frame as an `.Rdata` file so that `df`, with all of these transformations applied, is available for analysis.

In [22]:
save(df, file = out_file)
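
To pick the data up again in an analysis notebook, `load()` restores the saved object under its original name. A self-contained sketch with a temporary file and a toy data frame (in practice the path would be `out_file`):

```r
# Toy stand-in for the prepared survey data
df <- data.frame(id = 1:3, group = factor(c("a", "b", "a")))

# Save to a temporary .Rdata file, then remove the object from the session
tmp <- tempfile(fileext = ".Rdata")
save(df, file = tmp)
rm(df)

# load() returns the names of the restored objects and recreates `df`
restored <- load(tmp)
restored
str(df)
```

Unlike a CSV round trip, this keeps factor levels and their ordering intact, so the recoding does not have to be repeated.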