# Prescreen preprocessing

Raw prescreen data is private. We preprocess to output a public version.

Preprocessing steps:
- Deduplicate by Prolific/worker ID (some participants submitted more than once)
- Merge survey data across the Qualtrics prescreen surveys
- Merge the Qualtrics data with Prolific demographic data
- Insert a column indicating the data source and survey version
- Strip out Prolific/worker IDs and separate comments from IDs (for privacy)

In [1]:
from datetime import date
import os

from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

PROLIFIC_PARTICIPANT_ID = 'PROLIFIC_ID'

data_path = r'../../data/prescreen-survey/'
input_data_path = data_path + 'raw/'
output_data_path = data_path

## Qualtrics survey data

There were multiple versions of the survey.

A mapping of column ID to question is saved in an output 'fields' csv file.
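
As a rough illustration of that mapping, the sketch below builds a toy export in memory (the column IDs and question text are invented, not taken from the raw files). It assumes the layout visible in the example output further down: the first row after the header holds the question text and the second holds the ImportId metadata.

```python
import io

import pandas as pd

# Toy stand-in for a Qualtrics export (invented columns and questions):
# header, question-text row, ImportId row, then one response.
toy_csv = io.StringIO(
    'Q-a,Q-b\n'
    'Do you like tea?,Do you like coffee?\n'
    '"{""ImportId"":""QID1""}","{""ImportId"":""QID2""}"\n'
    'Yes,No\n'
)
toy_df = pd.read_csv(toy_csv)

# Row 0 maps each column ID to its question text.
toy_fields = toy_df.loc[0].rename('value').to_frame()

# A plain dict is convenient for looking up a question by column ID.
field_lookup = toy_fields['value'].to_dict()
print(field_lookup)
```

The same dict could presumably be rebuilt later from a saved 'fields' csv by reading it with the 'column' column as the index.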

We merge survey data into one file.

In [2]:
# Set up reading in the data files

cloudresearch_name = 'qualtrics-prescreen-cloudresearch'
v1_name = 'qualtrics-prescreen-v1'
v2_name = 'qualtrics-prescreen-v2'

### Handle the fields and values mappings
Save the fields

In [3]:
# Example to show intuition of what this looks like
input_fname = input_data_path + 'qualtrics-prescreen-v1.csv'
df = pd.read_csv(input_fname)
df.drop(['PROLIFIC_PID', 'Q-prolific-ID'], axis=1).head()

Unnamed: 0,PASS_PRESCREEN,FAILED_ATTN_CHECK,RecordedDate,Q-requirements-1,Q-requirements-2,Q-attn-check-1,Q-prolific-mturk,Q-mturk-account,Q-attn-check-2,Q-comments,Duration (in seconds),ResponseId
0,PASS_PRESCREEN,FAILED_ATTN_CHECK,Recorded Date,Do you have an active Amazon account?,Have you been making purchases on your Amazon ...,Since you have made purchases on your Amazon a...,Do you also complete surveys (or HITs) using A...,Is the Amazon.com account that you use to comp...,The following question is very simple. When as...,Thank you for participating.\n\nComments? (opt...,Duration (in seconds),Response ID
1,"{""ImportId"":""PASS_PRESCREEN""}","{""ImportId"":""FAILED_ATTN_CHECK""}","{""ImportId"":""recordedDate"",""timeZone"":""America...","{""ImportId"":""QID90""}","{""ImportId"":""QID91""}","{""ImportId"":""QID108""}","{""ImportId"":""QID107""}","{""ImportId"":""QID96""}","{""ImportId"":""QID109""}","{""ImportId"":""QID48_TEXT""}","{""ImportId"":""duration""}","{""ImportId"":""_recordId""}"
2,,true,2022-11-01 14:49:05,Yes,Yes,Yes,,,,,19,R_V4kBfIAJNlFM7Kh
3,,,2022-11-01 14:50:03,Yes,No,,,,,,10,R_1FmFw7Ov1Cu784g
4,true,,2022-11-01 14:51:21,Yes,Yes,No,No,,No,,28,R_2VDWJck9MB1JQTu


In [4]:
print('Extracting and saving mapping of column keys to semantic values')

for name in [cloudresearch_name, v1_name, v2_name]:
    print('\n%s:' % name)
    input_fname = input_data_path + name + '.csv'
    output_fields_fname = output_data_path + name + '-fields.csv'
    df = pd.read_csv(input_fname)
    # Row 0 holds the question text for each column ID
    fields_df = df.loc[0].rename('value').to_frame()
    display(fields_df)
    # Save fields
    print('saving to %s' % output_fields_fname)
    fields_df.to_csv(output_fields_fname, index_label='column')

Extracting and saving mapping of column keys to semantic values

qualtrics-prescreen-cloudresearch:


Unnamed: 0,value
connect,connect
PASS_PRESCREEN,PASS_PRESCREEN
FAILED_ATTN_CHECK,FAILED_ATTN_CHECK
RecordedDate,Recorded Date
Q-worker-id,What is your worker ID?
hitId,hitId
assignmentId,assignmentId
Duration (in seconds),Duration (in seconds)
ResponseId,Response ID
workerId,workerId


saving to ../../data/prescreen-survey/qualtrics-prescreen-cloudresearch-fields.csv

qualtrics-prescreen-v1:


Unnamed: 0,value
PASS_PRESCREEN,PASS_PRESCREEN
FAILED_ATTN_CHECK,FAILED_ATTN_CHECK
RecordedDate,Recorded Date
Q-prolific-ID,What is your Prolific ID
Q-requirements-1,Do you have an active Amazon account?
Q-requirements-2,Have you been making purchases on your Amazon ...
Q-attn-check-1,Since you have made purchases on your Amazon a...
Q-prolific-mturk,Do you also complete surveys (or HITs) using A...
Q-mturk-account,Is the Amazon.com account that you use to comp...
Q-attn-check-2,The following question is very simple. When as...


saving to ../../data/prescreen-survey/qualtrics-prescreen-v1-fields.csv

qualtrics-prescreen-v2:


Unnamed: 0,value
PASS_PRESCREEN,PASS_PRESCREEN
FAILED_ATTN_CHECK,FAILED_ATTN_CHECK
Q-prolific-ID,What is your Prolific ID
Duration (in seconds),Duration (in seconds)
RecordedDate,Recorded Date
ResponseId,Response ID
Q-requirements-1,Do you have an active Amazon account?
Q-requirements-2,When did you start using your current Amazon a...
Q-attn-check-2,The following question is very simple. It is a...
Q-followup-study,This is a pre-screen for another study that re...


saving to ../../data/prescreen-survey/qualtrics-prescreen-v2-fields.csv


### Handle the survey data

When dropping duplicate Prolific IDs, keep the last (most recent) submission; rows are sorted by RecordedDate before deduplicating.
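
A minimal sketch of the keep-last rule on toy data (the IDs, timestamps, and answers are invented): sorting by the recorded date first means `keep='last'` retains each participant's most recent submission.

```python
import pandas as pd

toy = pd.DataFrame({
    'PROLIFIC_ID': ['a', 'b', 'a'],
    'RecordedDate': [
        '2022-11-01 14:51:21',
        '2022-11-01 14:50:03',
        '2022-11-01 14:49:05',
    ],
    'answer': ['second try', 'only try', 'first try'],
})

# Sort chronologically, then keep the last row seen for each ID.
deduped = (
    toy.sort_values('RecordedDate')
    .drop_duplicates('PROLIFIC_ID', keep='last')
)
print(deduped)
```

Timestamps in the zero-padded `YYYY-MM-DD HH:MM:SS` form sort chronologically even as strings, so no date parsing is needed for this step.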

#### v1

In [5]:
v1_df = (
    pd.read_csv(input_data_path + v1_name + '.csv')
    .drop([0,1]) # First 2 rows are Qualtrics metadata
    .drop(['PROLIFIC_PID'], axis=1)
    .rename({'Q-prolific-ID': PROLIFIC_PARTICIPANT_ID}, axis=1)
)
print('%s rows of data' % len(v1_df))
print('columns:', list(v1_df.columns))
# v1_df.head()

416 rows of data
columns: ['PASS_PRESCREEN', 'FAILED_ATTN_CHECK', 'RecordedDate', 'PROLIFIC_ID', 'Q-requirements-1', 'Q-requirements-2', 'Q-attn-check-1', 'Q-prolific-mturk', 'Q-mturk-account', 'Q-attn-check-2', 'Q-comments', 'Duration (in seconds)', 'ResponseId']


#### v2

In [6]:
v2_df = (
    pd.read_csv(input_data_path + v2_name + '.csv')
    .drop([0,1]) # First 2 rows are Qualtrics metadata
    .drop(['PROLIFIC_PID'], axis=1)
    .rename({'Q-prolific-ID': PROLIFIC_PARTICIPANT_ID}, axis=1)
)
print('%s rows of data' % len(v2_df))
print('columns:', list(v2_df.columns))
# v2_df.head()

17411 rows of data
columns: ['PASS_PRESCREEN', 'FAILED_ATTN_CHECK', 'PROLIFIC_ID', 'Duration (in seconds)', 'RecordedDate', 'ResponseId', 'Q-requirements-1', 'Q-requirements-2', 'Q-attn-check-2', 'Q-followup-study', 'Q-comments']


Combine the v1 and v2 Prolific survey data

In [7]:
v1_df['version'] = 'v1'
v2_df['version'] = 'v2'
prolific_df = pd.concat([v1_df, v2_df]).sort_values('RecordedDate')
print('%s total rows of Prolific data' % len(prolific_df))
# prolific_df.head()

17827 total rows of Prolific data


In [8]:
# Drop duplicates: keep the last (most recent) submission per Prolific ID
n_before = len(prolific_df)
prolific_df = prolific_df.drop_duplicates(PROLIFIC_PARTICIPANT_ID, keep='last')
print('Dropping %s duplicate prolific IDs' % (n_before - len(prolific_df)))
print('%s remaining rows' % len(prolific_df))

Dropping 359 duplicate prolific IDs
17468 remaining rows


In [9]:
# Drop invalid IDs
print('Dropping rows with invalid prolific IDs (length != 24)')
prolific_df = prolific_df[prolific_df[PROLIFIC_PARTICIPANT_ID].apply(lambda s: len(str(s).strip()) == 24)]
print('%s remaining rows' % len(prolific_df))

Dropping rows with invalid prolific IDs (length != 24)
17464 remaining rows
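
The length check above can also be written with vectorized string methods. The sketch below uses invented IDs and adds a stricter variant that assumes Prolific IDs are 24 hexadecimal characters; that format is an assumption, not something this notebook states or relies on.

```python
import pandas as pd

ids = pd.Series([
    '5f3c1a2b9d8e7f6a5b4c3d2e',    # invented ID, 24 hex characters
    ' 5f3c1a2b9d8e7f6a5b4c3d2e ',  # same, with stray whitespace
    'z' * 24,                      # right length, but not hexadecimal
    'test',
    None,
])

# Same rule as the cell above: 24 characters after stripping whitespace.
length_ok = ids.fillna('').str.strip().str.len() == 24

# Hypothetical stricter rule (assumption: IDs are 24 hex characters).
hex_ok = ids.str.strip().str.fullmatch(r'[0-9a-fA-F]{24}', na=False)

print(pd.DataFrame({'id': ids, 'length_ok': length_ok, 'hex_ok': hex_ok}))
```

The two rules differ only on the third value, which passes the length check but would fail the stricter pattern.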


#### Cloudresearch

In [10]:
cloudresearch_df = (
    pd.read_csv(input_data_path + cloudresearch_name + '.csv')
    .drop([0,1]) # First 2 rows are Qualtrics metadata
    .drop(['hitId', 'assignmentId', 'workerId'], axis=1)
)
print('%s rows of data' % len(cloudresearch_df))
cloudresearch_df['version'] = 'cloudresearch'
# cloudresearch_df.head()

4508 rows of data


Drop rows with bad worker ID value.

In [11]:
# Remove test submissions (worker ID contains 'test')
cloudresearch_df = cloudresearch_df[cloudresearch_df['Q-worker-id'].apply(lambda s: 'test' not in str(s))]

In [12]:
# What should we expect from a worker ID? A given length.
print('Worker id length')
print(cloudresearch_df['Q-worker-id'].apply(lambda s: len(str(s).strip())).describe().round())
cloudresearch_df['Q-worker-id'].apply(lambda s: len(str(s).strip())).value_counts().head(6)

Worker id length
count    4505.0
mean       22.0
std         9.0
min         3.0
25%        14.0
50%        14.0
75%        32.0
max        44.0
Name: Q-worker-id, dtype: float64


Q-worker-id
32    1927
14    1903
13     620
12      19
24      12
20       4
Name: count, dtype: int64

In [13]:
# Remove any without expected length
expected_lengths = [12,13,14,32]
print('dropping worker ids without length in', expected_lengths)
def expected_length(s):
    return len(str(s).strip()) in expected_lengths
cloudresearch_df = cloudresearch_df[cloudresearch_df['Q-worker-id'].apply(expected_length)]
print('%s remaining rows' % len(cloudresearch_df))

dropping worker ids without length in [12, 13, 14, 32]
4469 remaining rows


In [14]:
# Drop any duplicate worker IDs (drop_duplicates keeps the first occurrence by default)
n_before = len(cloudresearch_df)
cloudresearch_df = cloudresearch_df.drop_duplicates(['Q-worker-id'])
print('Dropping %s duplicate worker ids' % (n_before - len(cloudresearch_df)))
print('%s remaining rows' % len(cloudresearch_df))

Dropping 37 duplicate worker ids
4432 remaining rows


## Merge Cloudresearch and Prolific data

In [15]:
prescreen_df = pd.concat([prolific_df, cloudresearch_df])
print('%s rows for merged prescreen data' % len(prescreen_df))

21896 rows for merged prescreen data


In [16]:
# Remove rows with 'test' in the comment
prescreen_df = prescreen_df[prescreen_df['Q-comments'].apply(lambda s: 'test' not in str(s))]
print('%s rows after removing tests' % len(prescreen_df))

21892 rows after removing tests
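
The filter above is a case-sensitive substring match, so it also catches words that merely contain "test" and misses "TEST". The sketch below, on invented comments, contrasts it with a hypothetical whole-word, case-insensitive variant; the notebook itself uses the plain substring rule.

```python
import pandas as pd

toy_comments = pd.Series([
    'test',
    'This is a TEST run',
    'Greatest survey ever',
    None,
    'Thanks!',
])

# Equivalent to the substring rule above: case-sensitive, matches inside longer words.
substring_hit = toy_comments.str.contains('test', regex=False, na=False)

# Hypothetical stricter variant: whole word only, any capitalization.
word_hit = toy_comments.str.contains(r'\btest\b', case=False, regex=True, na=False)

print(pd.DataFrame({
    'comment': toy_comments,
    'substring_hit': substring_hit,
    'word_hit': word_hit,
}))
```

Rows to keep would be selected with the negated mask, for example `toy_comments[~word_hit]`.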


## Prolific demographic data

Merge the demographic data exports from Prolific.

In [20]:
prolific_demos_fname_prefix = 'demos-export-prolific-prescreen-'
prolific_demo_export_files = [input_data_path+f for f in os.listdir(input_data_path) if f.endswith('.csv') and prolific_demos_fname_prefix in f]
print('%s prolific demographic data export files to read' % len(prolific_demo_export_files))
sorted(prolific_demo_export_files)

9 prolific demographic data export files to read


['../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v0.csv',
 '../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v1.csv',
 '../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v2.csv',
 '../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v3.csv',
 '../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v4.csv',
 '../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v5.csv',
 '../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v6.csv',
 '../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v7.csv',
 '../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v8.csv']

In [21]:
demos_df = None
for fi, fpath in enumerate(sorted(prolific_demo_export_files)):
    ddf = pd.read_csv(fpath)[[
        'Participant id',
        'Age', 'Sex', 'Ethnicity simplified', 
        'Country of birth', 'Nationality', 
        'Student status', 'Employment status'
    ]].rename({
        'Participant id': PROLIFIC_PARTICIPANT_ID,
    }, axis=1)
    print('%s: %s participants for file %s' % (fi, len(ddf), fpath))
    demos_df = ddf if demos_df is None else pd.concat([demos_df, ddf])

print('%s total rows' % len(demos_df))
demos_df.drop([PROLIFIC_PARTICIPANT_ID], axis=1).head()

0: 10 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v0.csv
1: 406 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v1.csv
2: 5785 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v2.csv
3: 514 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v3.csv
4: 7887 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v4.csv
5: 926 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v5.csv
6: 2166 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v6.csv
7: 61 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v7.csv
8: 52 participants for file ../../data/prescreen-survey/raw/demos-export-prolific-prescreen-v8.csv
17807 total rows


Unnamed: 0,Age,Sex,Ethnicity simplified,Country of birth,Nationality,Student status,Employment status
0,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED
1,35,Male,White,United States,United States,Yes,DATA_EXPIRED
2,28,Male,Other,United States,United States,No,Full-Time
3,27,Male,White,United States,United States,DATA_EXPIRED,DATA_EXPIRED
4,35,Male,White,United States,United States,DATA_EXPIRED,DATA_EXPIRED
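
The loop above grows the frame one file at a time. An alternative sketch, using two invented in-memory exports in place of the real files, collects the per-file frames in a list and concatenates once; `usecols` limits the read to the wanted columns.

```python
import io

import pandas as pd

# Two toy exports with an overlapping participant (all values invented).
export_a = io.StringIO('Participant id,Age,Sex\np1,35,Male\np2,28,Male\n')
export_b = io.StringIO('Participant id,Age,Sex\np2,28,Male\np3,27,Female\n')

wanted = ['Participant id', 'Age']
frames = [
    pd.read_csv(f, usecols=wanted).rename(columns={'Participant id': 'PROLIFIC_ID'})
    for f in (export_a, export_b)
]

# One concat over the list; ignore_index gives the result a fresh 0..n-1 index.
toy_demos = pd.concat(frames, ignore_index=True)
print(toy_demos)
```

The overlapping participant appears twice in the result, which is why a deduplication step follows.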


Drop potential duplicate participant IDs.

Which duplicate row we keep from the Prolific demographic data should not matter.

In [22]:
n_before = len(demos_df)
demos_df = demos_df.drop_duplicates([PROLIFIC_PARTICIPANT_ID])
print('dropping %s duplicate participant ids' % (n_before - len(demos_df)))
print('%s remaining rows' % len(demos_df))
# demos_df.drop([PROLIFIC_PARTICIPANT_ID], axis=1).head()

dropping 354 duplicate participant ids
17453 remaining rows
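
Whether the choice of row matters can be checked directly. The sketch below, on invented data, drops rows that are identical in every column and then lists the IDs that still repeat, meaning their duplicate rows disagree in at least one field.

```python
import pandas as pd

toy_demos = pd.DataFrame({
    'PROLIFIC_ID': ['p1', 'p1', 'p2', 'p2'],
    'Age': ['35', '35', '28', 'DATA_EXPIRED'],
})

# Drop rows that are identical in every column.
distinct_rows = toy_demos.drop_duplicates()

# IDs that still appear more than once have conflicting values.
conflicting_ids = distinct_rows.loc[
    distinct_rows.duplicated('PROLIFIC_ID', keep=False), 'PROLIFIC_ID'
].unique().tolist()
print(conflicting_ids)
```

An empty list would support the claim that it does not matter which duplicate is kept; a non-empty list would mean the kept row affects the merged values for those participants.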


## Merge in Prolific demographic data

Note that this left join leaves many rows without demographic data: Cloudresearch rows have no Prolific ID to match on.

In [23]:
prescreen_df = prescreen_df.merge(demos_df, how='left', on=PROLIFIC_PARTICIPANT_ID)
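
To see how many rows end up without demographic data, the same kind of left merge can be run with pandas' `indicator` and `validate` options. This is a sketch on invented data, not the notebook's actual call.

```python
import pandas as pd

toy_surveys = pd.DataFrame({
    'PROLIFIC_ID': ['p1', 'p2', None],  # None stands in for a Cloudresearch row
    'version': ['v1', 'v2', 'cloudresearch'],
})
toy_demos = pd.DataFrame({'PROLIFIC_ID': ['p1'], 'Age': ['35']})

merged = toy_surveys.merge(
    toy_demos,
    how='left',
    on='PROLIFIC_ID',
    validate='many_to_one',  # raise if the demographic table repeats an ID
    indicator=True,          # add a '_merge' column: 'both' or 'left_only'
)
print(merged['_merge'].value_counts())
```

Rows marked `left_only` are the ones left without demographic values. The `_merge` column would need to be dropped before saving.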

#### Handle comments

As per the IRB protocol, comments connected to survey participants are not published. They are removed from the cleaned data but printed below disconnected from participant IDs. 

In [24]:
comments = prescreen_df['Q-comments'].rename('comments').dropna()

In [25]:
print('%s comments:' % len(comments))
for c in comments.to_list():
    print('"%s"'%c)

1714 comments:
"I do have an active amazon account, I just misclicked. "
"This was stupid. I abhor having to lie to qualify for a stupid survey like this."
"Thanks for the invite!"
"Thanks for the opportunity to participate "
"I have been making 4+ years of purchases my apologies"
"No"
"thanks for the opportunity"
"this was great. "
"none"
"No"
"I do have a Mturk profile up and running, but I prefer Prolific by a long shot.  The ethics of Mturk HITs are questionable at best and Prolific does a much better job of ensuring the quality of data."
"I inadvertantly selected 'No.' I do have an active Amazon account. Sorry!"
"Hope you are having a beautiful day!"
"None. Thank you."
"I hope this is a study about how Amazon flagrantly violates the privacy of mturk workers."
"Sorry i do have an Amazon account "
"Thanks!"
"Thank you!"
"Thank you!"
"Thank you."
"no."
"I accidentally hit no instead of yes and couldn't go back :("
"No comments."
"I used a family member’s Amazon account until recently

## Save the cleaned data

In [26]:
fname = output_data_path + 'cleaned.csv'
prescreen_df.drop([PROLIFIC_PARTICIPANT_ID, 'Q-worker-id', 'Q-comments'], axis=1).to_csv(fname, index=False)
print('saving %s rows of data to %s' % (len(prescreen_df), fname))

saving 21892 rows of data to ../../data/prescreen-survey/cleaned.csv
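
Since the point of this step is privacy, a quick check that no private column reached the output may be worthwhile. The sketch below uses an in-memory buffer and invented values in place of the real file; `PRIVATE_COLUMNS` is a hypothetical name, populated with the three columns dropped in the cell above.

```python
import io

import pandas as pd

# Hypothetical constant: the columns that must never be published.
PRIVATE_COLUMNS = {'PROLIFIC_ID', 'Q-worker-id', 'Q-comments'}

toy = pd.DataFrame({
    'PROLIFIC_ID': ['p1'],
    'Q-worker-id': [None],
    'Q-comments': ['hi'],
    'version': ['v1'],
})

buffer = io.StringIO()  # in-memory stand-in for the output file
toy.drop(columns=sorted(PRIVATE_COLUMNS)).to_csv(buffer, index=False)
buffer.seek(0)

# Re-read what was written and confirm no private column survived.
saved_columns = set(pd.read_csv(buffer).columns)
leaked = PRIVATE_COLUMNS & saved_columns
assert not leaked, 'private columns found in output: %s' % sorted(leaked)
print('columns saved:', sorted(saved_columns))
```

Checking the file as re-read, rather than the frame in memory, guards against an edit to the save step silently reintroducing a column.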


In [102]:
pd.read_csv(fname).head()

Unnamed: 0,PASS_PRESCREEN,FAILED_ATTN_CHECK,RecordedDate,Q-requirements-1,Q-requirements-2,Q-attn-check-1,Q-prolific-mturk,Q-mturk-account,Q-attn-check-2,Duration (in seconds),...,version,Q-followup-study,connect,Age,Sex,Ethnicity simplified,Country of birth,Nationality,Student status,Employment status
0,,True,2022-11-01 14:49:05,Yes,Yes,Yes,,,,19,...,v1,,,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED,CONSENT_REVOKED
1,,,2022-11-01 14:50:03,Yes,No,,,,,10,...,v1,,,35,Male,White,United States,United States,Yes,DATA_EXPIRED
2,True,,2022-11-01 14:51:21,Yes,Yes,No,No,,No,28,...,v1,,,28,Male,Other,United States,United States,No,Full-Time
3,True,,2022-11-01 14:56:22,Yes,Yes,No,No,,No,40,...,v1,,,27,Male,White,United States,United States,DATA_EXPIRED,DATA_EXPIRED
4,True,,2022-11-01 14:56:37,Yes,Yes,No,No,,No,31,...,v1,,,35,Male,White,United States,United States,DATA_EXPIRED,DATA_EXPIRED
