# Preprocessing for the responses data

This notebook processes the raw responses data exported from Qualtrics and creates a "cleaned" version.

The cleaned version can be committed to the repository; the raw version cannot.

###### Note: This is for the responses data, not the uploaded transactions data


#### Handle comments

As per the IRB protocol, comments connected to survey participants are not published. They are removed from the cleaned data. 

#### Drop extra rows

When participants consent to share their data, the data file is uploaded to the same survey as a separate response. This generates an extra row of data for just that response, where only the fields for the data file (special hidden Qualtrics question) are filled.

We drop these rows from the cleaned data.
Note that this does not lose the ability to link the data: each uploaded file is named with the ResponseId of the participant who uploaded it.
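
Because each uploaded file is named after the uploader's ResponseId, the link can be rebuilt from filenames alone. A minimal sketch, assuming names of the form `<ResponseId>.<extension>`; the `.csv` extension, the example IDs, and the helper name `response_id_from_upload` are all hypothetical and not taken from the survey setup:

```python
from pathlib import Path


def response_id_from_upload(filename: str) -> str:
    """Return the ResponseId encoded in an uploaded file's name.

    Assumes the file is named '<ResponseId>.<extension>'; the extension
    is hypothetical and is simply stripped.
    """
    return Path(filename).stem


# Hypothetical upload names (invented IDs, not from the sample data).
uploads = ['R_exampleA.csv', 'R_exampleB.csv']

# Map each ResponseId back to the file that participant uploaded.
linked = {response_id_from_upload(name): name for name in uploads}
print(linked)
```

The resulting keys can then be matched against the `ResponseId` column of the cleaned data.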


#### Drop unused columns

We drop columns that Qualtrics adds automatically and that we do not use.

We also drop fields we added in order to hack on the Qualtrics APIs.

In [1]:
import numpy as np
import pandas as pd

raw_data_filepath = '../data/survey-v0-sample-fake.csv'
# raw_data_filepath = '../data/survey-v0-sample-raw.csv'
cleaned_data_filepath = '../data/sample-preprocessed.csv'
fields_filepath = '../data/fields.csv'

In [2]:
pd.set_option('display.max_columns', None)

df = pd.read_csv(raw_data_filepath)
print('shape:', df.shape)
print('all fields:', df.columns)
df.head(2)
# df.head(10) # To peek but not commit

shape: (8, 72)
all fields: Index(['StartDate', 'EndDate', 'Status', 'IPAddress', 'Progress',
       'Duration (in seconds)', 'Finished', 'RecordedDate', 'ResponseId',
       'RecipientLastName', 'RecipientFirstName', 'RecipientEmail',
       'ExternalReference', 'LocationLatitude', 'LocationLongitude',
       'DistributionChannel', 'UserLanguage', 'Q43_Id', 'Q43_Name', 'Q43_Size',
       'Q43_Type', 'intro-1', 'intro-2', 'download', 'download-fail-expl',
       'download-fail-screen_Id', 'download-fail-screen_Name',
       'download-fail-screen_Size', 'download-fail-screen_Type', 'q-demos-age',
       'Q-demos-hispanic', 'Q-demos-race', 'Q-demos-education',
       'Q-demos-income', 'Q-demos-gender', 'Q-sexual-orientation',
       'Q-demos-state', 'Q-amazon-use-howmany', 'Q-amazon-use-hh-size',
       'Q-amazon-use-how-oft', 'Q-substance-use_1', 'Q-substance-use_2',
       'Q-substance-use_3', 'Q-personal_1', 'Q-personal_2', 'Q-life-changes',
       'Q-fast-completion', 'Q-control', 'Q-

Unnamed: 0,StartDate,EndDate,Status,IPAddress,Progress,Duration (in seconds),Finished,RecordedDate,ResponseId,RecipientLastName,RecipientFirstName,RecipientEmail,ExternalReference,LocationLatitude,LocationLongitude,DistributionChannel,UserLanguage,Q43_Id,Q43_Name,Q43_Size,Q43_Type,intro-1,intro-2,download,download-fail-expl,download-fail-screen_Id,download-fail-screen_Name,download-fail-screen_Size,download-fail-screen_Type,q-demos-age,Q-demos-hispanic,Q-demos-race,Q-demos-education,Q-demos-income,Q-demos-gender,Q-sexual-orientation,Q-demos-state,Q-amazon-use-howmany,Q-amazon-use-hh-size,Q-amazon-use-how-oft,Q-substance-use_1,Q-substance-use_2,Q-substance-use_3,Q-personal_1,Q-personal_2,Q-life-changes,Q-fast-completion,Q-control,Q-altruism,Q-bonus-05,Q-bonus-20,Q-bonus-50,Q-data-value-05,Q-data-value-20,Q-data-value-50,Q-data-value-100,Q-data-value-any,Q-data-value-any_1_TEXT,Q-sell-YOUR-data,Q-sell-consumer-data,Q-small-biz-use,Q-census-use,Q-research-society,Q-attn-check,Q-comments,SurveyID,ResponseID,FQID,API_TOKEN,showdata,incentive,RandomID
0,Start Date,End Date,Response Type,IP Address,Progress,Duration (in seconds),Finished,Recorded Date,Response ID,Recipient Last Name,Recipient First Name,Recipient Email,External Data Reference,Location Latitude,Location Longitude,Distribution Channel,User Language,Hidden file upload question [QID47-UploadedFil...,Hidden file upload question [QID47-UploadedFil...,Hidden file upload question [QID47-UploadedFil...,Hidden file upload question [QID47-UploadedFil...,Amazon Purchases Survey Introduction Thi...,Survey questions and data collection We wi...,Start the data download process Go to the f...,Please explain below why you were not able to ...,Upload a screenshot displaying the issue here....,Upload a screenshot displaying the issue here....,Upload a screenshot displaying the issue here....,Upload a screenshot displaying the issue here....,What is your age group?,"Are you of Spanish, Hispanic, or Latino origin?",Choose one or more races that you consider you...,What is the highest level of education you hav...,What was your total household income before ta...,How do you describe yourself?,Which best describes your sexual orientation?,"50 States, D.C. and Puerto Rico",How many people do you share your Amazon accou...,"How many people are in your ""household""?",How often do you (+ anyone you share your acco...,Are any of the following the case for: You or...,Are any of the following the case for: You or...,Are any of the following the case for: You or...,Are any of the following the case for: You or...,Are any of the following the case for: You or...,"In 2021 did you, or someone you share your Ama...",Click to insert the file from Amazon below. ...,Click to insert the file from Amazon below. ...,Click to insert the file from Amazon below. ...,Click to insert the file from Amazon below. ...,Click to insert the file from Amazon below. ...,Click to insert the file from Amazon below. ...,Would you hypothetically consent to share your...,Would you hypothetically consent to share your...,Would you hypothetically consent to share your...,Would you hypothetically consent to share your...,How much would you share your data for? - Sele...,How much would you share your data for? - More...,Do you think Amazon should be able to sell YOU...,Do you think companies should be able to sell ...,Big companies currently collect and sell consu...,Do you think the U.S. Census Bureau should use...,Do you think researchers should be able to use...,This is an attention check. Help us find peopl...,Comments? (optional),SurveyID,ResponseID,FQID,API_TOKEN,showdata,incentive,RandomID
1,"{""ImportId"":""startDate"",""timeZone"":""America/De...","{""ImportId"":""endDate"",""timeZone"":""America/Denv...","{""ImportId"":""status""}","{""ImportId"":""ipAddress""}","{""ImportId"":""progress""}","{""ImportId"":""duration""}","{""ImportId"":""finished""}","{""ImportId"":""recordedDate"",""timeZone"":""America...","{""ImportId"":""_recordId""}","{""ImportId"":""recipientLastName""}","{""ImportId"":""recipientFirstName""}","{""ImportId"":""recipientEmail""}","{""ImportId"":""externalDataReference""}","{""ImportId"":""locationLatitude""}","{""ImportId"":""locationLongitude""}","{""ImportId"":""distributionChannel""}","{""ImportId"":""userLanguage""}","{""ImportId"":""QID47_FILE_ID""}","{""ImportId"":""QID47_FILE_NAME""}","{""ImportId"":""QID47_FILE_SIZE""}","{""ImportId"":""QID47_FILE_TYPE""}","{""ImportId"":""QID20""}","{""ImportId"":""QID25""}","{""ImportId"":""QID26""}","{""ImportId"":""QID45_TEXT""}","{""ImportId"":""QID46_FILE_ID""}","{""ImportId"":""QID46_FILE_NAME""}","{""ImportId"":""QID46_FILE_SIZE""}","{""ImportId"":""QID46_FILE_TYPE""}","{""ImportId"":""QID35""}","{""ImportId"":""QID76""}","{""ImportId"":""QID34""}","{""ImportId"":""QID80""}","{""ImportId"":""QID77""}","{""ImportId"":""QID79""}","{""ImportId"":""QID89""}","{""ImportId"":""QID88""}","{""ImportId"":""QID37""}","{""ImportId"":""QID38""}","{""ImportId"":""QID39""}","{""ImportId"":""QID36_2""}","{""ImportId"":""QID36_3""}","{""ImportId"":""QID36_4""}","{""ImportId"":""QID81_1""}","{""ImportId"":""QID81_5""}","{""ImportId"":""QID40""}","{""ImportId"":""QID24""}","{""ImportId"":""QID43""}","{""ImportId"":""QID68""}","{""ImportId"":""QID69""}","{""ImportId"":""QID70""}","{""ImportId"":""QID71""}","{""ImportId"":""QID59""}","{""ImportId"":""QID60""}","{""ImportId"":""QID61""}","{""ImportId"":""QID62""}","{""ImportId"":""QID64""}","{""ImportId"":""QID64_1_TEXT""}","{""ImportId"":""QID84""}","{""ImportId"":""QID44""}","{""ImportId"":""QID73""}","{""ImportId"":""QID74""}","{""ImportId"":""QID75""}","{""ImportId"":""QID85""}","{""ImportId"":""QID48_TEXT""}","{""ImportId"":""SurveyID""}","{""ImportId"":""QSEDResponseID""}","{""ImportId"":""FQID""}","{""ImportId"":""API_TOKEN""}","{""ImportId"":""showdata""}","{""ImportId"":""incentive""}","{""ImportId"":""RandomID""}"


In [3]:
cleaned_fields = [
    # Fields created by Qualtrics that we *DO NOT keep*
    #'StartDate', 'EndDate', 'Status', 'IPAddress', 'Progress',
    #'RecipientLastName', 'RecipientFirstName', 'RecipientEmail',
    #'ExternalReference', 'LocationLatitude', 'LocationLongitude',
    #'DistributionChannel', 'UserLanguage', 
    
    # Fields created by Qualtrics that we *DO keep*
    'Duration (in seconds)', 'Finished', 'RecordedDate', 'ResponseId',
    
    # Fields to handle the uploaded file -- do not keep
    # 'Q43_Id', 'Q43_Name', 'Q43_Size', 'Q43_Type', 
    
    # Fields for setup that have consent and continue vs exit Qs
    'intro-1', 'intro-2', 
    
    # Fields for guiding the participant through the download process
    'download', 
    # Fields for download process failure
    'download-fail-expl', 'download-fail-screen_Id', 'download-fail-screen_Name',
    'download-fail-screen_Size', 'download-fail-screen_Type', 
    
    # Fields for personal Qs. e.g. demographics data, amazon usage, life changes
    'q-demos-age', 'Q-demos-hispanic', 'Q-demos-race', 'Q-demos-education',
    'Q-demos-income', 'Q-demos-gender', 'Q-sexual-orientation', 'Q-demos-state', 
    'Q-amazon-use-howmany', 'Q-amazon-use-hh-size', 'Q-amazon-use-how-oft', 
    'Q-substance-use_1', 'Q-substance-use_2', 'Q-substance-use_3', 
    'Q-personal_1', 'Q-personal_2', 
    'Q-life-changes',
    
    # Fields for Q asking if they will share data -- specific to experiment arm
    # 'Q-fast-completion', unused
    'Q-control', 'Q-altruism', 'Q-bonus-05',
    'Q-bonus-20', 'Q-bonus-50', 
    
    # Fields for Qs about perceived data value
    'Q-data-value-05', 'Q-data-value-20', 'Q-data-value-50', 'Q-data-value-100', 
    'Q-data-value-any', 'Q-data-value-any_1_TEXT', 
    
    # Fields for Qs about how your data should be used
    'Q-sell-YOUR-data', 'Q-sell-consumer-data', 'Q-small-biz-use', 
    'Q-census-use', 'Q-research-society', 'Q-attn-check',
    
    # Comments are excluded from the cleaned data (see "Handle comments")
    # 'Q-comments',
    
    # Fields for important embedded data that we set
    # Used to indicate experiment arm:
    'showdata',
    'incentive', 
    # We set this to connect responses to mturk workers we pay
    'RandomID',
    # We set these to make the API hacks work -- do not need for analysis
    # 'SurveyID', 'ResponseID', 'FQID', 'API_TOKEN',
]
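
A hand-maintained keep-list like `cleaned_fields` is easy to mistype, so a quick consistency check before selecting columns can help. A minimal sketch; the helper `check_fields` is hypothetical and not part of the original notebook, and the toy field names below include a deliberate typo:

```python
from collections import Counter


def check_fields(keep, available):
    """Return (missing, duplicated) field names for a keep-list.

    missing:    names in `keep` that are not in `available`
    duplicated: names listed more than once in `keep`
    """
    available = set(available)
    missing = [name for name in keep if name not in available]
    duplicated = [name for name, n in Counter(keep).items() if n > 1]
    return missing, duplicated


# Toy example with one misspelled and one repeated field name.
missing, duplicated = check_fields(
    keep=['ResponseId', 'Finished', 'Finished', 'q-demos-agee'],
    available=['ResponseId', 'Finished', 'q-demos-age'],
)
print(missing)     # ['q-demos-agee']
print(duplicated)  # ['Finished']
```

In this notebook the call would be `check_fields(cleaned_fields, df.columns)`, with both lists expected to come back empty.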

First, save the full list of fields, flagging which ones are kept in the cleaned data, before dropping any.

In [4]:
fields_df = df.loc[0].rename('field').to_frame()
# Match on the column names (the index); the 'field' column holds the question text.
fields_df['cleaned'] = fields_df.index.isin(cleaned_fields)
fields_df.head(10)

Unnamed: 0,field,cleaned
StartDate,Start Date,False
EndDate,End Date,False
Status,Response Type,False
IPAddress,IP Address,False
Progress,Progress,False
Duration (in seconds),Duration (in seconds),True
Finished,Finished,True
RecordedDate,Recorded Date,True
ResponseId,Response ID,True
RecipientLastName,Recipient Last Name,False


In [5]:
fields_df.to_csv(fields_filepath)
print('saved all fields to file %s' % fields_filepath)

saved all fields to file ../data/fields.csv


Drop the two Qualtrics metadata rows (question text and ImportId) that follow the header.

In [6]:
df = df.drop([0,1])
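
The export shown above carries three header lines: the column names, the question text, and a row of ImportId JSON. A toy sketch (the CSV content is invented for illustration) comparing the drop used here with skipping those lines at read time via `skiprows`, which is an alternative when the question text is not needed:

```python
import io

import pandas as pd

# Invented miniature export: column names, question text, ImportId row, data.
toy_csv = io.StringIO(
    'ResponseId,Finished\n'
    'Response ID,Finished\n'
    '"{""ImportId"":""_recordId""}","{""ImportId"":""finished""}"\n'
    'R_example1,1\n'
    'R_example2,0\n'
)

# Approach used in this notebook: load everything, then drop rows 0 and 1.
loaded = pd.read_csv(toy_csv)
dropped = loaded.drop([0, 1])

# Alternative: skip the two extra header lines (file lines 1 and 2) on read.
toy_csv.seek(0)
skipped = pd.read_csv(toy_csv, skiprows=[1, 2])

print(dropped['ResponseId'].tolist())
print(skipped['ResponseId'].tolist())
print(dropped['Finished'].tolist(), skipped['Finished'].tolist())
```

With the drop approach the header rows are parsed as data, so numeric columns stay string-typed; this is why the duration column is cast with `.astype(int)` later in the notebook. The notebook still needs the question-text row for `fields.csv`, so dropping after loading fits its workflow.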

Drop the rows generated for the file uploads

In [7]:
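# Upload rows are the ones where the hidden file-upload question is filled.
is_upload_row = df['Q43_Id'].notna()
print('dropping %s rows generated for file uploads' % is_upload_row.sum())
# Keep all remaining rows (no .head(), which would truncate to 5 rows).
df = df[~is_upload_row]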

dropping 2 rows generated for file uploads


In [8]:
cleaned_df = df[cleaned_fields]
print('cleaned data', cleaned_df.shape)
cleaned_df.head()

cleaned data (4, 49)


Unnamed: 0,Duration (in seconds),Finished,RecordedDate,ResponseId,intro-1,intro-2,download,download-fail-expl,download-fail-screen_Id,download-fail-screen_Name,download-fail-screen_Size,download-fail-screen_Type,q-demos-age,Q-demos-hispanic,Q-demos-race,Q-demos-education,Q-demos-income,Q-demos-gender,Q-sexual-orientation,Q-demos-state,Q-amazon-use-howmany,Q-amazon-use-hh-size,Q-amazon-use-how-oft,Q-substance-use_1,Q-substance-use_2,Q-substance-use_3,Q-personal_1,Q-personal_2,Q-life-changes,Q-control,Q-altruism,Q-bonus-05,Q-bonus-20,Q-bonus-50,Q-data-value-05,Q-data-value-20,Q-data-value-50,Q-data-value-100,Q-data-value-any,Q-data-value-any_1_TEXT,Q-sell-YOUR-data,Q-sell-consumer-data,Q-small-biz-use,Q-census-use,Q-research-society,Q-attn-check,showdata,incentive,RandomID
2,45,1,2022-09-13 08:39:39,R_2ykJU7y9XoyCZTk,4,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False,altruism,
3,288,1,2022-09-13 08:44:33,R_BV7CBQjtGOjHZF7,4,1,2.0,test problem flow by alex,F_YVTItdPN6hSEAkp,Screen Shot 2022-09-12 at 4.56.40 PM.png,53964.0,image/png,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True,altruism,
5,239,1,2022-09-13 08:48:35,R_w4seX0QuU1Lz477,4,1,1.0,,,,,,2.0,2.0,16.0,4.0,2.0,2.0,2.0,22.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,,,,1.0,,,,,,,,4.0,4.0,13.0,3.0,3.0,123.0,False,bonus-20,199989274.0
7,335,1,2022-09-13 11:11:52,R_2rBGztPmiRoLHWj,4,1,1.0,,,,,,2.0,2.0,1.0,4.0,2.0,1.0,1.0,22.0,2.0,2.0,2.0,2.0,2.0,1.0,2.0,2.0,,,,1.0,,,,,,,,,2.0,2.0,1.0,1.0,1.0,123.0,False,bonus-05,689693001.0


How long did the survey take? Summary statistics, in minutes:

In [9]:
# Convert to minutes before describing, so that the count is not divided by 60.
(cleaned_df['Duration (in seconds)'].astype(int) / 60).rename('minutes').describe()

count    4.000000
mean     3.779167
std      2.122471
min      0.750000
25%      3.175000
50%      4.391667
75%      4.995833
max      5.583333
Name: minutes, dtype: float64

Save cleaned data to file.

In [10]:
print('saving data (N=%s) to %s...' % (len(cleaned_df), cleaned_data_filepath))
cleaned_df.to_csv(cleaned_data_filepath, index=False)
print('...saved')

saving data (N=4) to ../data/sample-preprocessed.csv...
...saved


In [11]:
# reading from file
cleaned_df = pd.read_csv(cleaned_data_filepath)
print('N=%s responses' % len(cleaned_df))
cleaned_df.head(3)

N=4 responses


Unnamed: 0,Duration (in seconds),Finished,RecordedDate,ResponseId,intro-1,intro-2,download,download-fail-expl,download-fail-screen_Id,download-fail-screen_Name,download-fail-screen_Size,download-fail-screen_Type,q-demos-age,Q-demos-hispanic,Q-demos-race,Q-demos-education,Q-demos-income,Q-demos-gender,Q-sexual-orientation,Q-demos-state,Q-amazon-use-howmany,Q-amazon-use-hh-size,Q-amazon-use-how-oft,Q-substance-use_1,Q-substance-use_2,Q-substance-use_3,Q-personal_1,Q-personal_2,Q-life-changes,Q-control,Q-altruism,Q-bonus-05,Q-bonus-20,Q-bonus-50,Q-data-value-05,Q-data-value-20,Q-data-value-50,Q-data-value-100,Q-data-value-any,Q-data-value-any_1_TEXT,Q-sell-YOUR-data,Q-sell-consumer-data,Q-small-biz-use,Q-census-use,Q-research-society,Q-attn-check,showdata,incentive,RandomID
0,45,1,2022-09-13 08:39:39,R_2ykJU7y9XoyCZTk,4,2,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,False,altruism,
1,288,1,2022-09-13 08:44:33,R_BV7CBQjtGOjHZF7,4,1,2.0,test problem flow by alex,F_YVTItdPN6hSEAkp,Screen Shot 2022-09-12 at 4.56.40 PM.png,53964.0,image/png,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,True,altruism,
2,239,1,2022-09-13 08:48:35,R_w4seX0QuU1Lz477,4,1,1.0,,,,,,2.0,2.0,16.0,4.0,2.0,2.0,2.0,22.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,,,,1.0,,,,,,,,4.0,4.0,13.0,3.0,3.0,123.0,False,bonus-20,199989274.0
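
Since the cleaned file is the one committed to the repository, it can be worth confirming that no excluded column slipped through. A hedged sketch: the `excluded` list is an illustrative subset of the columns commented out in `cleaned_fields`, and the in-memory CSV is an invented stand-in for the saved file:

```python
import io

import pandas as pd

# Illustrative subset of columns that must not appear in the cleaned file.
excluded = ['IPAddress', 'RecipientEmail', 'LocationLatitude',
            'LocationLongitude', 'Q-comments', 'API_TOKEN']

# Invented stand-in for the saved file; in the notebook, pass
# cleaned_data_filepath to pd.read_csv instead.
saved = io.StringIO('ResponseId,Finished,showdata\nR_example1,1,False\n')

# nrows=0 reads only the header, so the check stays cheap for large files.
saved_columns = pd.read_csv(saved, nrows=0).columns
leaked = sorted(set(excluded) & set(saved_columns))
assert not leaked, 'excluded columns found in cleaned data: %s' % leaked
```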


## Comments

What did people have to say? Comments are printed here without any connection to participant responses.

In [12]:
# No ResponseId index: comments are kept separate from participant IDs.
comments = df['Q-comments'].rename('comments').dropna()
print('%s comments:' % len(comments))
for c in comments.to_list():
    print('\n"%s"'%c)

2 comments:

"test by alex. said yes for 0.20 incentive. showdata=false"

"Test - Do the comments work?"
