# Preprocessing


## Data inputs and outputs

### Inputs

Raw data are downloaded from Qualtrics as two separate datasets:

1. Survey responses in one csv, where each row corresponds to one participant.

2. Browser attributes as a folder of csv files, with each csv containing browser attributes collected from one participant.

The files are linked by the survey `Response ID`, which Qualtrics generates randomly when a participant enters the survey. Each participant has exactly one `Response ID`, and it corresponds to one row in the survey responses file.
If the participant chose to share their browser attributes, there is a csv file whose name ends with this `Response ID`, containing their browser attributes.


### Outputs

The preprocessing produces two datasets.

Both datasets include survey responses with consistent columns corresponding to each survey question. For both datasets, a row corresponds to data from one participant, keyed by the `Response ID`.


1. survey-experiment-data.csv

This dataset is used for experiment analysis. It contains all completed responses, whether or not participants shared their browser attributes data. 
It is used to analyze which factors are most closely associated with participants' likelihood to share that data. It includes a flag indicating which experiment arm (`showdata=true/false`) the participant was in. 
It does not include the web browser attributes.

2. survey-and-browser-attributes-data.csv

This dataset only contains responses from participants who shared their browser attributes data. It contains the browser attributes as well as the survey responses, with a column for each question response and each browser attribute collected.



### Preprocessing steps

#### Read and process survey data.

Drop incomplete responses, responses without consent, and test responses.

Create field representing source: `source`: COLLEAGUE/PROLIFIC
- using the Prolific ID (`Q-prolific` / `PROLIFIC_PID`) and `COLLEAGUE=true/false`
- drop the Prolific ID and `COLLEAGUE` columns

Count and drop responses with failed attention checks.

Drop all information beyond data explicitly collected through the survey and our added variables: `source`, `showdata`. Do not include recorded date (to improve participant privacy).


Record the free-text comments in a separate file and drop these from the public-facing survey data (to improve participant privacy).

#### Produce survey-experiment-data.csv

Save the data to `survey-experiment-data.csv`

#### Produce survey-and-browser-attributes-data.csv

Drop responses from participants who did not share their browser attributes data.

Drop columns related to the experiment (`showdata`), the recruitment source (`source`), and whether browser data were collected (Consent/Decline response).

Read in browser attributes files and merge into survey data where each browser attribute name is a column.

Save the data to `survey-and-browser-attributes-data.csv`


### Notes about survey logic

We use logic within the Qualtrics survey to set embedded variables:
- `consent=false` if participant does not consent to participate at start of survey; otherwise continues
- `passed_attention_check=true` if participant passes attention check; otherwise exits survey
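
These embedded variables can also be applied as explicit filters. The following is a minimal sketch on toy data, assuming the variables arrive as the strings `true`/`false` or as booleans; the `is_true` helper and the toy values are illustrative and not part of the survey export.

```python
import pandas as pd

# Toy stand-in for the survey export; values are illustrative only.
toy = pd.DataFrame({
    'ResponseId': ['R_a', 'R_b', 'R_c', 'R_d'],
    'consent': [None, 'false', None, None],
    'passed_attention_check': ['true', None, None, True],
})

def is_true(value):
    """Hypothetical helper: treat 'true'/'True'/True as true, anything else as false."""
    return str(value).strip().lower() == 'true'

# Keep rows where consent was not explicitly refused and the attention check passed.
consented = toy['consent'].apply(lambda v: str(v).strip().lower() != 'false')
passed = toy['passed_attention_check'].apply(is_true)
kept = toy[consented & passed]
print(kept['ResponseId'].tolist())
```

Normalizing through `str(...).lower()` keeps the filter from depending on whether pandas parsed the column as strings or booleans.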


In [121]:
from datetime import date, datetime
import os

from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# input filepaths
input_data_path = '../data/unprocessed/'
input_survey_data_fpath = input_data_path + 'survey.csv'
input_browser_data_fpath = input_data_path + 'Q-csv-file/'
# output filepaths
output_data_path = '../data/'
fields_fpath = output_data_path + 'fields.csv'
comments_fpath = output_data_path + 'comments.csv'
survey_experiment_fpath = output_data_path + 'survey-experiment-data.csv'
survey_and_browser_attrs_fpath = output_data_path + 'survey-and-browser-attributes-data.csv'

### Read in the survey data 

Save the field descriptions to file for downstream data users, then drop incomplete and test responses.

In [122]:
# load survey data and save useful fields info to file for downstream data users
fields = [
    'ResponseId',
    'showdata', # experiment arm: true/false
    'Q-demos-age', 'Q-demos-gender', 'Q-demos-hispanic', 'Q-demos-race', 'Q-demos-income', 'Q-demos-state',
    'Q-share-browser-data', 
    'Q-perceptions-1', 'Q-perceptions-2',
]

usecols = [
    'Q-prolific', 'PROLIFIC_PID', # These should be the same. First is an explicit Q, latter set by URL parameter.
    'COLLEAGUE', 
    'consent',  # set to false if participant does not consent and exits
    'passed_attention_check', # set to true if passes, otherwise participant exits
] + fields + ['Q-comments']

survey_df = pd.read_csv(input_survey_data_fpath, usecols=usecols)

In [123]:
# save fields data to file
# the first row holds the field descriptions for each column
fields_df = survey_df[fields].loc[0].rename('fields').to_frame()
fields_df.to_csv(fields_fpath)
print('saved fields data to file %s' % fields_fpath)
fields_df

saved fields data to file ../data/fields.csv


Unnamed: 0,fields
ResponseId,Response ID
showdata,showdata
Q-demos-age,What is your age group?
Q-demos-gender,How do you describe yourself?
Q-demos-hispanic,"Are you of Spanish, Hispanic, or Latino origin?"
Q-demos-race,Choose one or more races that you consider you...
Q-demos-income,What was your total household income before ta...
Q-demos-state,"50 States, D.C. and Puerto Rico"
Q-share-browser-data,We are asking to collect attributes from your ...
Q-perceptions-1,“I think that I understand how browser fingerp...


In [124]:
# Now drop the field-description rows, which are the first 2 rows
survey_df = survey_df.drop([0,1])
# Drop any incomplete responses. i.e. if the last mandatory question is nan, drop the row.
last_Q = 'Q-perceptions-2'
survey_df = survey_df[survey_df[last_Q].notna()]
print('%s completed participant responses before filtering' % len(survey_df))

36 completed participant responses before filtering
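
To make the row handling concrete, here is a small sketch on an in-memory toy export. It assumes, as in the cells above, that two descriptive rows follow the column header; the toy column values are invented for illustration.

```python
import io
import pandas as pd

# Toy export: header row, two descriptive rows, then responses (illustrative values).
raw = io.StringIO(
    'ResponseId,Q-perceptions-2\n'
    'Response ID,Question text\n'
    'meta-1,meta-2\n'
    'R_a,Neutral\n'
    'R_b,\n'
)
toy = pd.read_csv(raw)

descriptions = toy.loc[0]        # field description for each column
responses = toy.drop([0, 1])     # remove both descriptive rows
# A response counts as complete when the last mandatory question is answered.
completed = responses[responses['Q-perceptions-2'].notna()]
print(len(responses), len(completed))
```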


#### Remove test data

Remove any rows where "test" appears in the Prolific ID or in a short comment (fewer than 50 characters).

In [125]:
# Drop any test responses
tests = survey_df[
    survey_df['Q-prolific'].apply(lambda pid: 'test' in str(pid)) \
    | survey_df['Q-comments'].apply(lambda c: (len(str(c)) < 50) and ('test' in str(c).lower()))
]
print('Dropping %s tests' % len(tests))
display(tests)
survey_df = survey_df.drop(tests.index)
print('%s rows after dropping tests' % len(survey_df))

Dropping 0 tests


Unnamed: 0,ResponseId,Q-prolific,Q-demos-age,Q-demos-gender,Q-demos-hispanic,Q-demos-race,Q-demos-income,Q-demos-state,Q-share-browser-data,Q-perceptions-1,Q-perceptions-2,Q-comments,PROLIFIC_PID,COLLEAGUE,showdata,consent,passed_attention_check


36 rows after dropping tests
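
The same rule can be written with vectorized string methods. This is a sketch on toy data, not the notebook's implementation; note one deliberate difference, which is an assumption on our part: it matches "test" in the Prolific ID case-insensitively, whereas the cell above is case-sensitive for that column.

```python
import pandas as pd

# Toy rows; the IDs and comments are illustrative only.
toy = pd.DataFrame({
    'Q-prolific': ['5f1c0a9e8b7d6c5a4b3c2d1e', 'test123', None],
    'Q-comments': [None, None, 'Test run'],
})

# na=False treats missing values as non-matches; case=False ignores letter case.
pid_is_test = toy['Q-prolific'].str.contains('test', case=False, na=False)
comment = toy['Q-comments'].fillna('')
comment_is_test = (comment.str.len() < 50) & comment.str.contains('test', case=False)

tests = toy[pid_is_test | comment_is_test]
print(tests.index.tolist())
```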


Create field representing source: `source`: COLLEAGUE/PROLIFIC
- using the Prolific ID (`Q-prolific` / `PROLIFIC_PID`) and `COLLEAGUE=true/false`
- can then drop the Prolific ID and `COLLEAGUE` columns



In [126]:
# Does PROLIFIC_PID ever not equal Q-prolific?

bad_prolific_id = survey_df[survey_df['PROLIFIC_PID'].astype(str) != survey_df['Q-prolific'].astype(str)]
print('Found %s responses with mismatched Prolific IDs' % len(bad_prolific_id))
bad_prolific_id['ResponseId']

Found 0 responses with mismatched Prolific IDs


Series([], Name: ResponseId, dtype: object)

In [127]:
def get_source(row):
    if (row['COLLEAGUE'] == 'true') or (row['COLLEAGUE'] == True):
        return 'COLLEAGUE'
    if len(str(row['Q-prolific'])) > 5:
        return 'PROLIFIC'
    return np.nan

survey_df['source'] = survey_df.apply(get_source, axis=1)
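
A quick check of this rule on toy rows (illustrative values only) shows why the length threshold matters: a missing Prolific ID becomes the string `nan`, which is shorter than 5 characters and therefore is not labeled PROLIFIC.

```python
import numpy as np
import pandas as pd

def get_source(row):
    # Same rule as the cell above: the COLLEAGUE flag wins,
    # otherwise a Prolific ID longer than 5 characters marks a Prolific participant.
    if (row['COLLEAGUE'] == 'true') or (row['COLLEAGUE'] == True):
        return 'COLLEAGUE'
    if len(str(row['Q-prolific'])) > 5:
        return 'PROLIFIC'
    return np.nan

toy = pd.DataFrame({
    'COLLEAGUE': ['true', np.nan, np.nan],
    'Q-prolific': [np.nan, '5f1c0a9e8b7d6c5a4b3c2d1e', np.nan],
})
toy['source'] = toy.apply(get_source, axis=1)
print(toy['source'].tolist())
```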

#### Count and drop responses with failed attention checks.

Upon passing the attention check, responses had a variable set: `passed_attention_check=true`


In [128]:
failed_attn_checks = survey_df[survey_df['passed_attention_check']!='true']
print('Dropping %s/%s=%0.2f responses with failed attention checks' % (
    len(failed_attn_checks), len(survey_df), len(failed_attn_checks)/len(survey_df)))
survey_df = survey_df[survey_df['passed_attention_check']=='true']
print('%s remaining responses' % len(survey_df))

Dropping 0/36=0.00 responses with failed attention checks
36 remaining responses


#### Record free-text comments

Then separate them from the survey data to improve participant privacy.

Shuffle the rows.

In [129]:
# Sample with frac=1 to shuffle the rows.
comments = survey_df['Q-comments'].sample(frac=1).rename('comments').dropna().reset_index(drop=True).to_frame()
print('%s comments' % len(comments))
print('saving comments without response id and with different index to %s' % comments_fpath)
comments.to_csv(comments_fpath, index=False)
comments.head()

1 comments
saving comments without response id and with different index to ../data/comments.csv


Unnamed: 0,comments
0,This was alex filling out the survey for real.
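
The sketch below repeats the shuffle-and-detach step on toy data so the effect is easy to inspect. The `random_state` argument is an assumption added here to make the example repeatable; the cell above shuffles without a fixed seed.

```python
import pandas as pd

# Toy comments indexed by made-up response IDs.
toy = pd.DataFrame(
    {'Q-comments': ['first', None, 'second', 'third']},
    index=['R_a', 'R_b', 'R_c', 'R_d'],
)

comments_toy = (
    toy['Q-comments']
    .dropna()
    .sample(frac=1, random_state=0)   # shuffle; fixed seed only for this illustration
    .rename('comments')
    .reset_index(drop=True)           # discard the original index so rows cannot be linked back
    .to_frame()
)
print(comments_toy)
```

After `reset_index(drop=True)` the comments carry neither the response ID nor their original row order.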


### Save experiment data

Drop all information beyond data explicitly collected through the survey and our added variables: `source`, `showdata`.

Save public data to file.

In [130]:
survey_df = survey_df[fields + ['source']]
print('Saving %s survey responses to file %s' % (len(survey_df), survey_experiment_fpath))
survey_df.to_csv(survey_experiment_fpath, index=False)
survey_df.head()

Saving 36 survey responses to file ../data/survey-experiment-data.csv


Unnamed: 0,ResponseId,showdata,Q-demos-age,Q-demos-gender,Q-demos-hispanic,Q-demos-race,Q-demos-income,Q-demos-state,Q-share-browser-data,Q-perceptions-1,Q-perceptions-2,source
3,R_7oHb2rDZ3Gi98pJ,False,25 - 34 years,Female,No,White or Caucasian,"$25,000 - $49,999",Massachusetts,Consent to share,Somewhat agree,Neutral,COLLEAGUE
5,R_1OvNbWe9xUOSL5p,False,35 - 44 years,Female,No,White or Caucasian,"$150,000 or more",California,Consent to share,Strongly agree,Somewhat agree,COLLEAGUE
7,R_7VKKnaSiufpMWx8,True,25 - 34 years,Male,No,White or Caucasian,"$150,000 or more",Pennsylvania,Consent to share,Somewhat agree,Somewhat agree,COLLEAGUE
9,R_15L6Zce0L2JXX5d,True,25 - 34 years,Female,No,White or Caucasian,"$150,000 or more",North Carolina,Consent to share,Strongly agree,Somewhat disagree,COLLEAGUE
11,R_3WIIond1nGemg4u,False,45 - 54 years,Male,No,White or Caucasian,Prefer not to say,California,Consent to share,Somewhat agree,Neutral,COLLEAGUE


## Create the dataset with browser attributes

Only include data from participants who shared their data.

Drop the experiment arm data (`showdata`), the recruitment source (`source`), and whether browser data were collected (Consent/Decline response).

In [244]:
sdf = survey_df[survey_df['Q-share-browser-data'].astype(str).apply(lambda s: s.lower().startswith('consent'))]
print('%s responses with shared browser data' % len(sdf))
# Drop the experiment arm, the consent/decline response, and the source columns
sdf = sdf.drop(['showdata', 'Q-share-browser-data', 'source'], axis=1)
sdf.head()

34 responses with shared browser data


Unnamed: 0,ResponseId,Q-demos-age,Q-demos-gender,Q-demos-hispanic,Q-demos-race,Q-demos-income,Q-demos-state,Q-perceptions-1,Q-perceptions-2
3,R_7oHb2rDZ3Gi98pJ,25 - 34 years,Female,No,White or Caucasian,"$25,000 - $49,999",Massachusetts,Somewhat agree,Neutral
5,R_1OvNbWe9xUOSL5p,35 - 44 years,Female,No,White or Caucasian,"$150,000 or more",California,Strongly agree,Somewhat agree
7,R_7VKKnaSiufpMWx8,25 - 34 years,Male,No,White or Caucasian,"$150,000 or more",Pennsylvania,Somewhat agree,Somewhat agree
9,R_15L6Zce0L2JXX5d,25 - 34 years,Female,No,White or Caucasian,"$150,000 or more",North Carolina,Strongly agree,Somewhat disagree
11,R_3WIIond1nGemg4u,45 - 54 years,Male,No,White or Caucasian,Prefer not to say,California,Somewhat agree,Neutral


### Read in and attach the browser attributes data

Make a table where each row is for one Response ID.

Each filename ends with the Response ID of the participant who contributed the attributes.

In [245]:
b_data_fnames = [input_browser_data_fpath+f for f in os.listdir(input_browser_data_fpath) if f.endswith('.csv')]
print('%s browser attributes files to read' % len(b_data_fnames))
print('Here are sample filenames')
sorted(b_data_fnames)[:5]

34 browser attributes files to read
Here are sample filenames


['../data/unprocessed/Q-csv-file/R_0Oe0jra21r2z5gi_R_3pjudW2z2bigi89.csv',
 '../data/unprocessed/Q-csv-file/R_207tDum2jXrXaJ5_R_6m3yGklCcmVWFbP.csv',
 '../data/unprocessed/Q-csv-file/R_39t0VtEinEiQyqJ_R_6knOoppIyy0GLvu.csv',
 '../data/unprocessed/Q-csv-file/R_4WEaPOwrJCzxxZv_R_69UIue3mOLh9v2x.csv',
 '../data/unprocessed/Q-csv-file/R_8A8b868IJxSo2ye_R_1qxkaPlpiaX2hoy.csv']
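
As an alternative to slicing by fixed character positions, the trailing ID can be parsed from the filename. This is a sketch under the assumption that every file follows the `R_<id>_R_<response id>.csv` pattern seen in the sample above; `response_id_from_path` is a hypothetical helper, not part of the notebook.

```python
import os

def response_id_from_path(path):
    """Hypothetical helper: return the trailing 'R_...' ID from '<prefix>_R_<id>.csv'."""
    stem = os.path.splitext(os.path.basename(path))[0]
    # The stem looks like 'R_<first id>_R_<response id>'; keep the part after the last '_R_'.
    return 'R_' + stem.rsplit('_R_', 1)[-1]

sample = '../data/unprocessed/Q-csv-file/R_0Oe0jra21r2z5gi_R_3pjudW2z2bigi89.csv'
print(response_id_from_path(sample))
```

For the sample path this gives the same ID as taking the 17 characters before the extension, but it does not depend on the ID having a fixed length.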

In [246]:
files_read = 0
response_ids = []
browser_data_df = pd.DataFrame()
# df = None # for debugging

for i, f in enumerate(b_data_fnames):
    if i%15 == 0:
        print('%s: reading file %s/%s' % 
              (pd.Timestamp.now(), i, len(b_data_fnames)))
    # The Response ID is the 17 characters just before the '.csv' extension
    response_id = f[-21:-4]
    response_ids += [response_id]
    df = pd.read_csv(f).set_index('Attribute')
    df.loc['ResponseId'] = response_id
    browser_data_df = pd.concat([browser_data_df, df.transpose()])
    files_read += 1

2023-12-19 23:12:47.572043: reading file 0/34
2023-12-19 23:12:47.611890: reading file 15/34
2023-12-19 23:12:47.632765: reading file 30/34
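
For larger folders, one option is to collect one row per file and build the table once at the end, rather than concatenating inside the loop. The sketch below uses in-memory toy files with the same `Attribute`/`Value` layout; the attribute values and IDs are invented for illustration.

```python
import io
import pandas as pd

# Toy stand-ins for two per-participant files (illustrative values).
toy_files = {
    'R_a': 'Attribute,Value\nTimezone,America/New_York\nColor depth,24\n',
    'R_b': 'Attribute,Value\nTimezone,America/Chicago\nColor depth,30\n',
}

rows = []
for response_id, text in toy_files.items():
    # One Series per participant: attribute name -> value.
    attrs = pd.read_csv(io.StringIO(text)).set_index('Attribute')['Value'].copy()
    attrs['ResponseId'] = response_id
    rows.append(attrs)

# Build the table once; each Series becomes one row.
browser_toy = pd.DataFrame(rows).reset_index(drop=True)
print(browser_toy)
```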


In [247]:
print('Read browser data for %s unique response ids' % browser_data_df['ResponseId'].nunique())
browser_data_df.head(3)

Read browser data for 34 unique response ids


Attribute,User agent,Languages,Fonts,Plugins,Local storage,Timezone,Screen resolution,Color depth,Platform,Touch points,...,WebGL Unmasked Vendor,WebGL Unmasked Renderer,Hardware concurrency,Device memory,UA high entropy values,Canvas text,Canvas geometry,Canvas text hashed,Canvas geometry hashed,ResponseId
Value,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7...,"en-US,en,zh-CN","[""Arial Unicode MS"",""Gill Sans"",""Helvetica Neu...","[""PDF Viewer"",""Chrome PDF Viewer"",""Chromium PD...",True,America/New_York,"[1728,1117]",30,MacIntel,0,...,Google Inc. (Apple),"ANGLE (Apple, ANGLE Metal Renderer: Apple M1 P...",10,8,"{""architecture"":""arm"",""brands"":[{""brand"":""Not_...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","[1159232804,-959516503,1697727883,147613198,15...","[1369032737,1864480827,-2065969454,-955405403]",R_6rwgMbUatnaFhmN
Value,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7...,"en-US,en","[""Arial Unicode MS"",""Gill Sans"",""Helvetica Neu...","[""PDF Viewer"",""Chrome PDF Viewer"",""Chromium PD...",True,America/New_York,"[2560,1440]",24,MacIntel,0,...,Google Inc. (Intel Inc.),"ANGLE (Intel Inc., Intel(R) Iris(TM) Plus Grap...",8,8,"{""architecture"":""x86"",""brands"":[{""brand"":""Not_...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","[1159232804,-1830017959,-1467935733,-521720078...","[-1339238269,-218892239,25977246,-2047660831]",R_5hgenPKD9GShcbi
Value,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7...,"en-US,en","[""Arial Unicode MS"",""Gill Sans"",""Helvetica Neu...","[""PDF Viewer"",""Chrome PDF Viewer"",""Chromium PD...",True,America/New_York,"[1792,1120]",24,MacIntel,0,...,Google Inc. (ATI Technologies Inc.),"ANGLE (ATI Technologies Inc., AMD Radeon Pro 5...",12,8,"{""architecture"":""x86"",""brands"":[{""brand"":""Not_...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","[1159232804,-296350724,1504461871,1359279396,1...","[-1116732227,-59544541,1939342702,-1925551899]",R_33lJWCtlC5H2d0V


In [248]:
# The two tables share only the ResponseId column, so no suffixes are needed
sdf = sdf.merge(browser_data_df, how='inner', on='ResponseId')
print('%s merged rows' % len(sdf))
sdf.head(3)

33 merged rows


Unnamed: 0,ResponseId,Q-demos-age,Q-demos-gender,Q-demos-hispanic,Q-demos-race,Q-demos-income,Q-demos-state,Q-perceptions-1,Q-perceptions-2,User agent,...,WebGL Renderer,WebGL Unmasked Vendor,WebGL Unmasked Renderer,Hardware concurrency,Device memory,UA high entropy values,Canvas text,Canvas geometry,Canvas text hashed,Canvas geometry hashed
0,R_7oHb2rDZ3Gi98pJ,25 - 34 years,Female,No,White or Caucasian,"$25,000 - $49,999",Massachusetts,Somewhat agree,Neutral,Mozilla/5.0 (X11; CrOS x86_64 14541.0.0) Apple...,...,WebKit WebGL,Google Inc. (Intel),"ANGLE (Intel, Mesa Intel(R) Graphics (ADL GT2)...",12,8,"{""architecture"":""x86"",""brands"":[{""brand"":""Chro...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","[1159232804,37248733,382797749,1769326126,-854...","[-388367737,48470680,132272908,1998385952]"
1,R_1OvNbWe9xUOSL5p,35 - 44 years,Female,No,White or Caucasian,"$150,000 or more",California,Strongly agree,Somewhat agree,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7...,...,WebKit WebGL,Google Inc. (Intel Inc.),"ANGLE (Intel Inc., Intel(R) Iris(TM) Plus Grap...",8,8,"{""architecture"":""x86"",""brands"":[{""brand"":""Not_...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","[1159232804,-1830017959,-1467935733,-521720078...","[-1339238269,-218892239,25977246,-2047660831]"
2,R_7VKKnaSiufpMWx8,25 - 34 years,Male,No,White or Caucasian,"$150,000 or more",Pennsylvania,Somewhat agree,Somewhat agree,Mozilla/5.0 (X11; CrOS x86_64 14541.0.0) Apple...,...,WebKit WebGL,Google Inc. (Intel),"ANGLE (Intel, Mesa Intel(R) UHD Graphics (CML ...",8,8,"{""architecture"":""x86"",""brands"":[{""brand"":""Chro...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA...","[1159232804,1393461380,-1769153213,-1247694839...","[-388367737,48470680,132272908,1998385952]"
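
An inner merge silently drops response IDs that appear in only one table. One way to see which IDs failed to match is an outer merge with pandas' `indicator=True` option. The sketch below uses toy tables with made-up IDs and is meant only as a diagnostic pattern.

```python
import pandas as pd

# Toy tables: 'R_c' has no browser file, 'R_x' has no survey row (made-up IDs).
survey_toy = pd.DataFrame({'ResponseId': ['R_a', 'R_b', 'R_c']})
browser_toy = pd.DataFrame({
    'ResponseId': ['R_a', 'R_b', 'R_x'],
    'Timezone': ['t1', 't2', 't3'],
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
check = survey_toy.merge(browser_toy, how='outer', on='ResponseId', indicator=True)
survey_only = check.loc[check['_merge'] == 'left_only', 'ResponseId'].tolist()
browser_only = check.loc[check['_merge'] == 'right_only', 'ResponseId'].tolist()
print(survey_only, browser_only)
```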


In [249]:
print('saving survey and browser attributes data to %s'% survey_and_browser_attrs_fpath)
sdf.to_csv(survey_and_browser_attrs_fpath, index=False)
print('...saved')

saving survey and browser attributes data to ../data/survey-and-browser-attributes-data.csv
...saved
