# Prescreen participation analysis

Purpose:
1. Capture basic participation statistics.
2. Peek at demographic distributions. We do not analyze these rigorously because they cover only a subset of Prolific prescreen participants, and main study participants were also recruited from other platforms.
3. Check for participation bias for the main survey.


## Prescreen check for bias in main analysis
The last question of the prescreen asked participants whether they were interested in the main study, given that it would ask them to log into their Amazon account. Only participants who answered this question correctly were invited to participate in the main study.
We know this question raised a potential privacy concern for some participants.

Question:
Were there demographic differences in the response to this question that may have contributed to non-response bias in the main study?

Analysis method:

Variable `opt_out` defined as:
- 1 if the participant answered the last question with "I am not interested in participating"
- 0 otherwise (participants who never reached the question are excluded)

Logistic regression:
opt_out ~ Age + Sex + Race

Note: 
- This analysis is only possible for participants recruited via Prolific who supplied their demographic information to Prolific
- The demographic variables available in this analysis are determined by what Prolific collected


In [1]:
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

prescreen_fname = "../../data/prescreen-survey/cleaned.csv"
data_df = pd.read_csv(prescreen_fname)

In [2]:
data_df.head()

Unnamed: 0,PASS_PRESCREEN,FAILED_ATTN_CHECK,RecordedDate,Q-requirements-1,Q-requirements-2,Q-attn-check-1,Q-prolific-mturk,Q-mturk-account,Q-attn-check-2,Duration (in seconds),...,version,Q-followup-study,connect,Age,Sex,Ethnicity simplified,Country of birth,Nationality,Student status,Employment status
0,,True,2022-11-01 14:49:05,Yes,Yes,Yes,,,,19,...,v1,,,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED
1,,,2022-11-01 14:50:03,Yes,No,,,,,10,...,v1,,,35,Male,White,United States,United States,Yes,DATA_EXPIRED
2,True,,2022-11-01 14:51:21,Yes,Yes,No,No,,No,28,...,v1,,,28,Male,Other,United States,United States,No,Full-Time
3,True,,2022-11-01 14:56:22,Yes,Yes,No,No,,No,40,...,v1,,,27,Male,White,United States,United States,DATA_EXPIRED,DATA_EXPIRED
4,True,,2022-11-01 14:56:37,Yes,Yes,No,No,,No,31,...,v1,,,35,Male,White,United States,United States,DATA_EXPIRED,DATA_EXPIRED


## Basic participation statistics

In [3]:
N = len(data_df)
print('%s total participants' % N)

N_pass_prescreen = data_df['PASS_PRESCREEN'].sum()
rate_pass_prescreen = N_pass_prescreen/N
print('%s --> %0.3f = %s/%s pass prescreen' % (N_pass_prescreen, rate_pass_prescreen, N_pass_prescreen, N))

21892 total participants
14010 --> 0.640 = 14010/21892 pass prescreen


In [4]:
print('Number of participants by survey version')
data_df['version'].value_counts()

Number of participants by survey version


version
v2               17052
cloudresearch     4430
v1                 410
Name: count, dtype: int64

Distribution of duration

In [5]:
data_df['Duration (in seconds)'].describe()

count    21892.000000
mean        94.690389
std        640.213671
min          5.000000
25%         42.000000
50%         66.000000
75%        101.000000
max      71767.000000
Name: Duration (in seconds), dtype: float64

Passing/failing attention check

In [6]:
print('Attention check failure rate')
round(data_df['FAILED_ATTN_CHECK'].sum()/len(data_df['FAILED_ATTN_CHECK']), 3)

Attention check failure rate


0.062

### Pass prescreen and opt out option

We ask at the end of the survey:

```
This is a pre-screen for another study that requires you to sign into your active Amazon account. 
In order to qualify for the main study, please verify that you can sign into your Amazon account and get to the following page: https://www.amazon.com/b2b/reports 

Fill in the blank from what you see at the top of that page: “Your Account > Your Orders > ___”
```

Answers include an option to opt out.

Of the participants who make it to this question (not filtered out beforehand):
- 0.796 pass and are invited to the main survey
- 0.102 opt out

We are interested in those who opt out.

In [7]:
(data_df['Q-followup-study'].value_counts().to_frame()
 .assign(portion=lambda x: round(x['count']/data_df['Q-followup-study'].value_counts().sum(),3)))

Unnamed: 0_level_0,count,portion
Q-followup-study,Unnamed: 1_level_1,Unnamed: 2_level_1
"The answer is ""Order History Reports""",13646,0.796
I am not interested in participating,1747,0.102
"The answer is ""Your Order History""",1337,0.078
I cannot access the page (not eligible for follow up study),227,0.013
"The answer is ""Your Reports""",123,0.007
"The answer is ""Your Payments History Reports""",62,0.004
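The portions in this table divide each count by the number of participants who answered the question. A shorter way to get the same kind of result is `value_counts(normalize=True)`, which drops missing answers by default. The sketch below uses a toy series with made-up labels, not study data.

```python
import numpy as np
import pandas as pd

# Toy answers (illustrative only, not study data). NaN marks a participant
# who was filtered out before reaching the question.
answers = pd.Series(["pass", "pass", "pass", "opt out", np.nan])

# normalize=True divides by the number of non-missing answers,
# because value_counts drops NaN by default.
portions = answers.value_counts(normalize=True).round(3)
print(portions)
```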


## Demographics

Note: demographics are only available for a subset of Prolific participants.

First, a quick look at the demographics we do not analyze rigorously.

In [8]:
data_df['Country of birth'].value_counts()

Country of birth
United States      15179
CONSENT_REVOKED     1075
DATA_EXPIRED         370
China                 72
Germany               50
                   ...  
Dominica               1
Lithuania              1
Cote d'Ivoire          1
Morocco                1
Qatar                  1
Name: count, Length: 110, dtype: int64

In [9]:
data_df['Nationality'].value_counts()

Nationality
United States      16343
CONSENT_REVOKED     1075
DATA_EXPIRED           1
Name: count, dtype: int64

In [10]:
data_df['Student status'].value_counts()

Student status
No                 9629
DATA_EXPIRED       3758
Yes                2957
CONSENT_REVOKED    1075
Name: count, dtype: int64

In [11]:
data_df['Employment status'].value_counts()

Employment status
DATA_EXPIRED                                                5673
Full-Time                                                   5610
Part-Time                                                   1933
Unemployed (and job seeking)                                1299
Not in paid work (e.g. homemaker', 'retired or disabled)    1120
CONSENT_REVOKED                                             1075
Other                                                        565
Due to start a new job within the next month                 144
Name: count, dtype: int64

## Analysis for Age, Sex, Race/Ethnicity

### Setup with data transformations and explore

#### Age:
Make numeric or NaN

Make any value less than 18 or more than 100 NaN (i.e. will be dropped)

Transform to age groups matching main survey:
- 18 - 24 years
- 25 - 34 years
- 35 - 44 years
- 45 - 54 years
- 55 - 64 years
- 65 and older
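As an alternative to a hand-written mapping function, the cleaning and grouping steps described above could be sketched with `Series.where` and `pd.cut`. This is a minimal sketch on toy ages, assuming ages are whole numbers; the variable names are illustrative and not part of the analysis.

```python
import numpy as np
import pandas as pd

# Toy ages (illustrative only), including values the cleaning step should drop
ages = pd.Series([17, 18, 24, 25, 34, 44.0, 54, 64, 65, 100, 923, np.nan])

# Keep ages from 18 to 100 inclusive; everything else becomes NaN
ages = ages.where(ages.between(18, 100))

# Left-closed bins: [18, 25), [25, 35), ..., [65, inf)
bins = [18, 25, 35, 45, 55, 65, np.inf]
labels = [
    "18 - 24 years",
    "25 - 34 years",
    "35 - 44 years",
    "45 - 54 years",
    "55 - 64 years",
    "65 and older",
]
age_groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(age_groups.value_counts().sort_index())
```

Note that `pd.cut` returns an ordered categorical, and NaN ages stay NaN, so they are still dropped by a later `dropna()`.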

In [12]:
display(data_df['Age'].value_counts())

Age
CONSENT_REVOKED    1075
22                  731
21                  660
24                  638
23                  614
                   ... 
1022                  1
923                   1
2                     1
93                    1
100                   1
Name: count, Length: 79, dtype: int64

In [13]:
# Transform age:
# Make numeric or nan
# Make any age less than 18 or more than 100 nan (i.e. will be dropped)
data_df['Age'] = pd.to_numeric(data_df['Age'], errors='coerce')
data_df['Age'] = data_df['Age'].where(data_df['Age'].between(18, 100))
data_df['Age'].describe()

count    16233.000000
mean        35.319103
std         12.778203
min         18.000000
25%         25.000000
50%         32.000000
75%         42.000000
max        100.000000
Name: Age, dtype: float64

In [14]:
# Transform to age groups that match main survey questions
def age_group(age):
    if (18 <= age <= 24):
        return "18 - 24 years"
    if (25 <= age <= 34):
        return "25 - 34 years"
    if (35 <= age <= 44):
        return "35 - 44 years"
    if (45 <= age <= 54):
        return "45 - 54 years"
    if (55 <= age <= 64):
        return "55 - 64 years"
    if (65 <= age):
        return "65 and older"
    return np.nan

data_df['Age'] = data_df['Age'].apply(age_group)
print('%s Total participants with an age group' % data_df['Age'].value_counts().sum())
(data_df['Age'].value_counts().sort_index().to_frame()
 .assign(portion=lambda x: round(x['count']/data_df['Age'].value_counts().sum(),3)))

16233 Total participants with an age group


Unnamed: 0_level_0,count,portion
Age,Unnamed: 1_level_1,Unnamed: 2_level_1
18 - 24 years,3606,0.222
25 - 34 years,5677,0.35
35 - 44 years,3439,0.212
45 - 54 years,1820,0.112
55 - 64 years,1164,0.072
65 and older,527,0.032


#### Sex:

Limit analysis to the Male/Female binary

In [15]:
data_df['Sex'].value_counts()

Sex
Female               8774
Male                 7484
CONSENT_REVOKED      1075
Prefer not to say      63
DATA_EXPIRED           23
Name: count, dtype: int64

In [16]:
# restrict to Male/Female binary
data_df['Sex'] = data_df['Sex'].apply(lambda s: s if s in ['Female', 'Male'] else np.nan)
(data_df['Sex'].value_counts().to_frame()
 .assign(portion=lambda x: round(x['count']/data_df['Sex'].value_counts().sum(),3)))

Unnamed: 0_level_0,count,portion
Sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,8774,0.54
Male,7484,0.46


#### Ethnicity

Prolific did not provide a race category, but did provide "Ethnicity simplified"

Transform:
- Drop 'CONSENT_REVOKED' and 'DATA_EXPIRED' answers by setting to np.nan
- Collapse 'Other' and 'Mixed' to one category: 'Other or mixed' to better match main survey analysis
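The two transform steps can be sketched on a toy series as follows. This is only an illustration with made-up rows; `valid_cats` and `race` are names chosen for the example.

```python
import pandas as pd

# Toy values (illustrative only) covering each kind of answer
raw = pd.Series([
    "White", "Mixed", "CONSENT_REVOKED", "Other",
    "Asian", "DATA_EXPIRED", "Black",
])

valid_cats = ["White", "Asian", "Black", "Mixed", "Other"]

# Step 1: anything outside the valid categories becomes NaN
cleaned = raw.where(raw.isin(valid_cats))

# Step 2: collapse 'Mixed' and 'Other' into a single category
race = cleaned.replace({"Mixed": "Other or mixed", "Other": "Other or mixed"})
print(race.value_counts())
```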

In [17]:
data_df['Ethnicity simplified'].value_counts()

Ethnicity simplified
White              11726
Asian               1355
Black               1295
Mixed               1130
CONSENT_REVOKED     1075
Other                648
DATA_EXPIRED         190
Name: count, dtype: int64

In [18]:
ethnicity_cats = ['White', 'Asian','Black', 'Mixed','Other']

data_df['Ethnicity simplified'] = data_df['Ethnicity simplified'].apply(lambda r: r if r in ethnicity_cats else np.nan)
data_df['Race'] = data_df['Ethnicity simplified'].apply(lambda r: 'Other or mixed' if r in ['Mixed','Other'] else r)
(data_df['Race'].value_counts().to_frame()
 .assign(portion=lambda x: round(x['count']/data_df['Race'].value_counts().sum(),3)))

Unnamed: 0_level_0,count,portion
Race,Unnamed: 1_level_1,Unnamed: 2_level_1
White,11726,0.726
Other or mixed,1778,0.11
Asian,1355,0.084
Black,1295,0.08


### Regression

Create dependent variable `opt_out` based on last question

- 1 if answered not interested in participating
- 0 otherwise

Restrict the analysis to participants with a value for each of the categories used in the regression model.
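A minimal sketch of this coding and of the complete-case restriction, on a toy frame with made-up rows. The constant `OPT_OUT_ANSWER` and the frame `toy` are illustrative names, and only one predictor is shown to keep the example short.

```python
import numpy as np
import pandas as pd

OPT_OUT_ANSWER = "I am not interested in participating"

# Toy rows (illustrative only): the third participant never reached the
# question, and the fourth has no usable age group.
toy = pd.DataFrame({
    "Q-followup-study": [
        OPT_OUT_ANSWER,
        'The answer is "Order History Reports"',
        np.nan,
        OPT_OUT_ANSWER,
    ],
    "Age": ["45 - 54 years", "25 - 34 years", "18 - 24 years", np.nan],
})

# 1.0 for the opt-out answer, 0.0 for any other answer, NaN if unanswered
answered = toy["Q-followup-study"].notna()
toy["opt_out"] = (
    toy["Q-followup-study"].eq(OPT_OUT_ANSWER).astype(float).where(answered)
)

# Complete cases only: rows with a value for the outcome and every predictor
complete = toy[["opt_out", "Age"]].dropna()
print(complete)
```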

```
opt_out ~ Age + Sex + Race
```
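Written out, and assuming the reference levels used when the model is fitted (35 - 44 years, Male, White), the model is a standard logistic regression with one indicator per non-reference category:

```latex
\log\frac{p_i}{1 - p_i}
  = \beta_0
  + \sum_{a \ne \text{35 - 44 years}} \beta_a \, \mathbf{1}[\text{Age}_i = a]
  + \beta_{\text{Female}} \, \mathbf{1}[\text{Sex}_i = \text{Female}]
  + \sum_{r \ne \text{White}} \beta_r \, \mathbf{1}[\text{Race}_i = r],
\qquad p_i = P(\text{opt\_out}_i = 1)
```

Each $e^{\beta}$ is then the odds ratio of opting out for that category relative to its reference level, holding the other predictors fixed.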


In [30]:
# displaying / printing helpers
def get_predictors_table(model):
    """Summarize a fitted model: log odds, odds ratios with 95% CI, and p-values."""
    return pd.DataFrame({
        'B (log odds)': model.params.round(3),
        'Odds Ratio': np.exp(model.params).round(3),
        '95% CI for Odds Ratio': (
            np.exp(model.conf_int())
            .apply(lambda r: '[%0.3f, %0.3f]' % (r[0], r[1]), axis=1)
        ),
        'P-value': model.pvalues.round(3)
    }).rename_axis('Predictor')

In [19]:
def get_opt_out(answer):
    """
    Returns np.nan if not answered -- indicates participant already filtered out
    Returns 1 if answer is to opt out; 0 otherwise
    """
    if pd.isna(answer):
        return np.nan
    if answer == 'I am not interested in participating':
        return 1
    return 0

data_df['opt_out'] = data_df['Q-followup-study'].apply(get_opt_out)
print('%s total opt outs' % data_df['opt_out'].sum())

1747.0 total opt outs


In [21]:
model_df = data_df[['opt_out', 'Age', 'Sex', 'Race']].dropna()
print('%s participants in model' % len(model_df))
model_df.head()

13154 participants in model


Unnamed: 0,opt_out,Age,Sex,Race
410,1.0,45 - 54 years,Male,White
411,0.0,25 - 34 years,Female,White
412,0.0,25 - 34 years,Male,Black
413,1.0,45 - 54 years,Male,White
414,0.0,55 - 64 years,Female,White


In [33]:
formula = "opt_out ~ C(Age, Treatment(reference='35 - 44 years')) + C(Sex, Treatment(reference='Male')) + C(Race, Treatment(reference='White'))"
print('formula')
print(formula)
model = smf.logit(formula=formula, data=model_df).fit()
display(model.summary2())
display(get_predictors_table(model))

formula
opt_out ~ C(Age, Treatment(reference='35 - 44 years')) + C(Sex, Treatment(reference='Male')) + C(Race, Treatment(reference='White'))
Optimization terminated successfully.
         Current function value: 0.325505
         Iterations 6


0,1,2,3
Model:,Logit,Pseudo R-squared:,0.004
Dependent Variable:,opt_out,AIC:,8583.3983
Date:,2023-05-30 18:19,BIC:,8658.2431
No. Observations:,13154,Log-Likelihood:,-4281.7
Df Model:,9,LL-Null:,-4299.4
Df Residuals:,13144,LLR p-value:,5.124e-05
Converged:,1.0000,Scale:,1.0
No. Iterations:,6.0000,,

0,1,2,3,4,5,6
,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Intercept,-2.1528,0.0686,-31.3805,0.0000,-2.2872,-2.0183
"C(Age, Treatment(reference='35 - 44 years'))[T.18 - 24 years]",-0.1072,0.0954,-1.1234,0.2613,-0.2941,0.0798
"C(Age, Treatment(reference='35 - 44 years'))[T.25 - 34 years]",-0.0736,0.0791,-0.9306,0.3521,-0.2287,0.0814
"C(Age, Treatment(reference='35 - 44 years'))[T.45 - 54 years]",0.1835,0.1000,1.8356,0.0664,-0.0124,0.3795
"C(Age, Treatment(reference='35 - 44 years'))[T.55 - 64 years]",0.2334,0.1149,2.0318,0.0422,0.0083,0.4585
"C(Age, Treatment(reference='35 - 44 years'))[T.65 and older]",0.5391,0.1436,3.7548,0.0002,0.2577,0.8205
"C(Sex, Treatment(reference='Male'))[T.Female]",-0.0630,0.0587,-1.0746,0.2825,-0.1780,0.0519
"C(Race, Treatment(reference='White'))[T.Asian]",-0.1414,0.1168,-1.2107,0.2260,-0.3704,0.0875
"C(Race, Treatment(reference='White'))[T.Black]",-0.0568,0.1155,-0.4917,0.6230,-0.2832,0.1696


Unnamed: 0_level_0,B (log odds),Odds Ratio,95% CI for Odds Ratio,P-value
Predictor,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Intercept,-2.153,0.116,"[0.102, 0.133]",0.0
"C(Age, Treatment(reference='35 - 44 years'))[T.18 - 24 years]",-0.107,0.898,"[0.745, 1.083]",0.261
"C(Age, Treatment(reference='35 - 44 years'))[T.25 - 34 years]",-0.074,0.929,"[0.796, 1.085]",0.352
"C(Age, Treatment(reference='35 - 44 years'))[T.45 - 54 years]",0.184,1.201,"[0.988, 1.462]",0.066
"C(Age, Treatment(reference='35 - 44 years'))[T.55 - 64 years]",0.233,1.263,"[1.008, 1.582]",0.042
"C(Age, Treatment(reference='35 - 44 years'))[T.65 and older]",0.539,1.714,"[1.294, 2.272]",0.0
"C(Sex, Treatment(reference='Male'))[T.Female]",-0.063,0.939,"[0.837, 1.053]",0.283
"C(Race, Treatment(reference='White'))[T.Asian]",-0.141,0.868,"[0.690, 1.091]",0.226
"C(Race, Treatment(reference='White'))[T.Black]",-0.057,0.945,"[0.753, 1.185]",0.623
"C(Race, Treatment(reference='White'))[T.Other or mixed]",-0.08,0.924,"[0.757, 1.126]",0.431
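To read the table, it can help to convert a coefficient back by hand. The sketch below copies the intercept and the "65 and older" coefficient from the output above, recomputes the odds ratio, and converts the log odds into predicted opt-out probabilities for the reference group (35 - 44 years, Male, White) and for the same group aged 65 and older. The helper `inv_logit` is an illustrative name, not part of the analysis.

```python
import math

# Coefficients (log odds) copied from the regression output above
intercept = -2.1528
b_age_65_plus = 0.5391


def inv_logit(x):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


# Odds ratio of opting out, 65 and older vs the 35 - 44 reference group
odds_ratio_65_plus = math.exp(b_age_65_plus)

# Predicted probabilities with Sex and Race at their reference levels
p_reference = inv_logit(intercept)
p_65_plus = inv_logit(intercept + b_age_65_plus)

print(f"odds ratio, 65 and older: {odds_ratio_65_plus:.3f}")
print(f"P(opt out), reference group: {p_reference:.3f}")
print(f"P(opt out), 65 and older: {p_65_plus:.3f}")
```

The odds ratio matches the 1.714 in the table, and the predicted opt-out probability rises from roughly 10% in the reference group to roughly 17% for participants aged 65 and older.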
