# Opinions and Gaze: Data Cleaning

This Jupyter notebook contains the gaze and questionnaire 
data cleaning for "Seeing the other side: Conflict and 
controversy increase gaze coordination" (Paxton, Dale, 
& Richardson, *in preparation*).

To run this file from scratch, you will need:
* `data/01-input/`: Directory of experiment data, available for download from the project's ICPSR directory (link here). Due to data sensitivity (per the Institutional Review Board of the University of California, Merced), only researchers from ICPSR member institutions may access the data.
    * `listener-gaze-raw/*.txt`: Listeners' raw gaze data
    * `listener-responses-raw/`:
        * `*.xml`: Listeners' trial order during experiment
        * `*.tsv`: Listeners' post-experiment questionnaire data
    * `speaker-gaze-raw/*.txt`: Speakers' raw gaze data
    * `speaker-audio_outputs/*.csv`: Speakers' audio segment timing
    * `speaker-segment_key.csv`: List of which speakers provided which stimulus segments
* `supplementary-code/`: Directory of additional functions
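
Before running the notebook, it can help to confirm that the inputs listed above are in place. The following is a minimal sketch, not part of the project's supplementary code: `REQUIRED_INPUTS` and `missing_inputs` are hypothetical names, and the entries are simply the paths from the list above.

```python
import os

# Entries taken from the list above; both names below are hypothetical
# helpers introduced for illustration only.
REQUIRED_INPUTS = [
    'listener-gaze-raw',
    'listener-responses-raw',
    'speaker-gaze-raw',
    'speaker-audio_outputs',
    'speaker-segment_key.csv',
]

def missing_inputs(input_dir, required=REQUIRED_INPUTS):
    """Return the required entries that do not exist under input_dir."""
    return [name for name in required
            if not os.path.exists(os.path.join(input_dir, name))]
```

Calling `missing_inputs('../data/01-input')` from the notebook's directory should return an empty list when every expected file and folder is present.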

## Table of contents

* [Preliminaries](#Preliminaries)
* [Listener gaze data](#Listener-gaze-data)
    * [Convert listeners' raw SMI files](#Convert-listeners'-raw-SMI-files)
    * [Concatenate multifile data from single listeners](#Concatenate-multifile-data-from-single-listeners)
    * [Segment listeners' files by audio clip](#Segment-listeners'-files-by-audio-clip)
* [Listener survey response data](#Listener-survey-response-data)
* [Speaker gaze data](#Speaker-gaze-data)
    * [Convert speakers' raw SMI files](#Convert-speakers'-raw-SMI-files)
    * [Filter speaker gaze data to relevant topic](#Filter-speaker-gaze-data-to-relevant-topic)
* [Clean up interim directories](#Clean-up-interim-directories)

**Written by**: A. Paxton (University of California, Berkeley)

**Date last modified**: 28 March 2018

***

# Preliminaries

Import packages and set global variables.

In [1]:
import re, os, glob
import pandas as pd

Read in bespoke functions.

In [2]:
%run '../supplementary-code/func-clean_responses.py'

In [3]:
%run '../supplementary-code/func-stimulus_order.py'

In [4]:
%run '../supplementary-code/func-clean_gaze_data.py'

Create or specify required paths.

In [5]:
# specify input data file path
input_data_path = os.path.join('../data/01-input')

In [6]:
# create a new clean directory if it doesn't yet exist
cleaned_data_path = os.path.join('../data/02-data_cleaning')
if not os.path.exists(cleaned_data_path):
    os.makedirs(cleaned_data_path)

***

# Listener gaze data

## Convert listeners' raw SMI files

In [7]:
# grab all of our raw gaze data
raw_listener_gaze_files = os.path.join(input_data_path,
                                       'listener-gaze-raw/*.txt')
gazeData = glob.glob(raw_listener_gaze_files)

In [8]:
# create a new clean directory if it doesn't yet exist
prepped_listener_path = os.path.join(cleaned_data_path,
                                     'listener-gaze-prepped')
if not os.path.exists(prepped_listener_path):
    os.makedirs(prepped_listener_path)

In [9]:
# process each file
for gazeFile in gazeData: clean_gaze_data(gazeFile)

Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/53362-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/56653-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/51196-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/58300-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/46060-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/57295-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/44554-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/47734-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/57040-redu_026_Trial084 Samples.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze-prepped/52291-continued-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/listener-gaze

## Concatenate multifile data from single listeners

In [10]:
# get all our processed files' names
processed_listener_files = os.path.join(prepped_listener_path,
                                       '*.csv')
processed_files = glob.glob(processed_listener_files)

In [11]:
# identify files with longer names than expected
possible_multipart_files = [re.findall(r'\d{5}', ID)[0] for ID 
                            in processed_files if len(ID)>48]

In [12]:
# identify possible multipart files
multipart_files = [gaze_id for gaze_id in processed_files if
                  re.findall(r'\d{5}', gaze_id)[0] in possible_multipart_files]
multipart_ids = [re.findall(r'\d{5}', mpf)[0] for mpf in multipart_files]

In [13]:
# figure out how many times these odd IDs appear in our files
from collections import Counter
id_counts = Counter(multipart_ids)
single_ids = [ID for ID in id_counts if id_counts[ID]==1]
multi_ids = [ID for ID in id_counts if id_counts[ID]>1]

In [14]:
# identify the files whose IDs only occur once
single_files = [gaze_id for gaze_id in processed_files if
                  re.findall(r'\d{5}', gaze_id)[0] in single_ids]

In [15]:
# rename the oddly named files with unique IDs
for single_file in single_files:
    participant_id = re.findall(r'\d{5}', single_file)[0]
    new_file_name = os.path.join(prepped_listener_path,
                                 participant_id+'-smi-data.csv')
    os.rename(single_file,new_file_name)

In [16]:
# identify the multi-part participant files
multi_files = [gaze_id for gaze_id in processed_files if
                  re.findall(r'\d{5}', gaze_id)[0] in multi_ids]

In [17]:
# run through all possibly duplicated participants
for duplicate_id in multi_ids:
    
    # identify which files belong to this person
    target_files = [next_file for next_file in processed_files if
                      re.findall(r'\d{5}', next_file)[0] == duplicate_id]
    
    # concatenate the files (pd.concat replaces DataFrame.append,
    # which was removed in pandas 2.0)
    concatenated_df = pd.concat(
                            [pd.read_csv(next_file) for next_file in target_files],
                            ignore_index=True)
        
    # delete the old files
    for next_file in target_files:
        os.remove(next_file)
        
    # if there are any duplicated rows, just keep the first
    concatenated_df = concatenated_df.drop_duplicates()
    
    # save the new file
    new_concat_name = os.path.join(prepped_listener_path,
                                   duplicate_id+'-smi-data.csv')
    concatenated_df.to_csv(new_concat_name, sep=',', 
                           header=True, index=False)
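
The cells above find multipart participants by checking for unusually long file paths, which depends on how long the directory path happens to be. As an alternative sketch (not the approach used in this notebook), files can be grouped directly by the five-digit ID in their file names. `group_by_participant` is a hypothetical helper, and the example paths are illustrative only.

```python
import os
import re
from collections import defaultdict

def group_by_participant(file_paths):
    """Map each five-digit participant ID to the files that carry it."""
    groups = defaultdict(list)
    for path in file_paths:
        # search only the file name, so digits in directory names are ignored
        match = re.search(r'\d{5}', os.path.basename(path))
        if match:
            groups[match.group(0)].append(path)
    return dict(groups)

# illustrative file names, not a listing of the real data
example_files = [
    'listener-gaze-prepped/53362-smi-data.csv',
    'listener-gaze-prepped/52291-smi-data.csv',
    'listener-gaze-prepped/52291-continued-smi-data.csv',
]
example_groups = group_by_participant(example_files)
example_multi_ids = sorted(pid for pid, files in example_groups.items()
                           if len(files) > 1)
print(example_multi_ids)
```

Any ID that maps to more than one file is a candidate for concatenation, regardless of how the extra files were named.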

## Segment listeners' files by audio clip

In [18]:
# create a new clean directory if it doesn't yet exist
cleaned_listener_gaze_path = os.path.join(cleaned_data_path,
                                          'listener-gaze-cleaned')
if not os.path.exists(cleaned_listener_gaze_path):
    os.makedirs(cleaned_listener_gaze_path)

In [19]:
# grab the prepped gaze files
processed_listener_files = os.path.join(prepped_listener_path,
                                       '*.csv')
gazeCleaned = glob.glob(processed_listener_files)

In [20]:
# identify which stimuli only had 1 audio clip
single_stimuli = ['abortion',
                  'gay-marriage',
                  'legal-marijuana',
                  'tax-rich']

In [21]:
# cycle through the participants' data
for nextGaze in gazeCleaned:
    
    # grab the next set of gaze data
    gaze_df = pd.read_csv(nextGaze,sep=',')
    particID = re.findall(r'\d{5}',nextGaze)[0]
    
    # identify which stimuli are present in this participant's data
    available_stimuli = list(set(gaze_df['Stimulus']))
    pic_stimuli = [stim for stim in available_stimuli 
                       if len(re.findall(' Page',str(stim)))>0]
    
    # cycle through the available picture data
    for next_pic in pic_stimuli:
        
        # grab the next picture data and the correct name for it
        base_stim_name = next_pic.split(' ')[0]
        instr_and_gaze_subset = gaze_df[gaze_df['Stimulus'].\
                                            str.contains(base_stim_name)\
                                       ].reset_index(drop=True)
        gaze_subset = gaze_df[gaze_df['Stimulus'].\
                                  str.contains(next_pic)\
                             ].reset_index(drop=True)
        
        # if it's a unimodal issue (only 1 audio clip), we don't have to worry about finding the second listening section
        if base_stim_name in single_stimuli:
            gaze_save_name = os.path.join(cleaned_listener_gaze_path,
                                          'gaze_trial-'+base_stim_name+
                                          '-'+particID+'.csv')
            gaze_subset.to_csv(gaze_save_name, index=False)
        
        # if it is part of a two-opinion topic, make sure we have both sides, and then slice them
        elif len(instr_and_gaze_subset['Stimulus'].unique())>2:
            
            # get first appearance of each stimulus
            first_appearance = instr_and_gaze_subset.\
                                        groupby('Stimulus').\
                                        first().\
                                        sort_values(by='Time',ascending=True).\
                                        reset_index()
            
            # index the first time the participant saw the "for" and "against" instructions
            against_instr = first_appearance['Time']\
                                [first_appearance['Stimulus'].\
                                         str.contains('-against.rtf')\
                                ].item()
            for_instr = first_appearance['Time']\
                                [first_appearance['Stimulus'].\
                                         str.contains('-for.rtf')\
                                ].item()

            # if the "against" instruction came first
            if against_instr < for_instr:
                against_listening = gaze_subset[gaze_subset['Time'] < for_instr]
                for_listening = gaze_subset[gaze_subset['Time'] > for_instr]
            
            # otherwise, the "for" instruction came first
            else:
                for_listening = gaze_subset.loc[gaze_subset['Time'] < against_instr]
                against_listening = gaze_subset.loc[gaze_subset['Time'] > against_instr]
                
            # create the file names
            against_file = os.path.join(cleaned_listener_gaze_path,'gaze_trial-'
                                        +base_stim_name+'-against-'+particID+'.csv')
            for_file = os.path.join(cleaned_listener_gaze_path,'gaze_trial-'
                                    +base_stim_name+'-for-'+particID+'.csv')
            
            # and then save both segments
            against_listening.to_csv(against_file,index=False)
            for_listening.to_csv(for_file,index=False)
        
        # if we don't have them both, let us know
        else:
            print("ERROR for ID "+particID+": Gaze Data for `"+base_stim_name+"` Not Found.")
    
    # print update
    print("Participant "+particID+" Exported.")

Participant 58048 Exported.
ERROR for ID 56644: Gaze Data for `death-penalty` Not Found.
Participant 56644 Exported.
Participant 56932 Exported.
Participant 55300 Exported.
Participant 57505 Exported.
Participant 54112 Exported.
Participant 58303 Exported.
Participant 56866 Exported.
Participant 58435 Exported.
Participant 43423 Exported.
Participant 44554 Exported.
Participant 58186 Exported.
Participant 55891 Exported.
Participant 58049 Exported.
Participant 47734 Exported.
Participant 55168 Exported.
Participant 56827 Exported.
Participant 56878 Exported.
Participant 57595 Exported.
Participant 57538 Exported.
Participant 57856 Exported.
Participant 56575 Exported.
Participant 54190 Exported.
Participant 57295 Exported.
Participant 47794 Exported.
Participant 58126 Exported.
Participant 57145 Exported.
Participant 57142 Exported.
Participant 58156 Exported.
Participant 57160 Exported.
Participant 49072 Exported.
Participant 53362 Exported.
Participant 53446 Exported.
Participant 575
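
The slicing logic for two-opinion topics can be seen on a toy example. This is a simplified sketch under stated assumptions: the column names follow the notebook (`Stimulus`, `Time`), but the values and the two instruction onset times are invented. The key idea is that the later of the two instruction screens marks the boundary between the first and second listening sections.

```python
import pandas as pd

# Toy gaze samples for one picture stimulus; all values are invented.
toy_gaze = pd.DataFrame({
    'Stimulus': ['death-penalty Page 1'] * 6,
    'Time': [10, 20, 30, 60, 70, 80],
})
against_instr, for_instr = 5, 50  # hypothetical instruction onset times

# the later instruction screen separates the two listening sections
boundary = max(against_instr, for_instr)
first_clip = toy_gaze[toy_gaze['Time'] < boundary]
second_clip = toy_gaze[toy_gaze['Time'] > boundary]

# whichever instruction appeared first labels the first section
if against_instr < for_instr:
    against_listening, for_listening = first_clip, second_clip
else:
    for_listening, against_listening = first_clip, second_clip

print(against_listening['Time'].tolist(), for_listening['Time'].tolist())
```

Here the "against" instruction came first, so the samples before the "for" instruction are assigned to the "against" section and the rest to the "for" section.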

***

# Listener survey response data

In [22]:
# cycle through listener raw data
raw_listener_response_path = os.path.join(input_data_path,
                                          'listener-responses-raw/') 
listener_folders = glob.glob(os.path.join(raw_listener_response_path,
                                          '*'))

In [23]:
# create a new clean directory if it doesn't yet exist
cleaned_listener_response_path = os.path.join(cleaned_data_path,
                                              'listener-responses-cleaned')
if not os.path.exists(cleaned_listener_response_path):
    os.makedirs(cleaned_listener_response_path)

In [24]:
# grab listener IDs associated with our gaze files
unique_listeners = glob.glob(os.path.join(cleaned_listener_gaze_path,
                                          '*.csv'))
unique_listeners = list(set([re.findall(r'\d{5}',f_name)[0] 
                             for f_name in unique_listeners]))

In [25]:
# specify our rating categories
l_ratings = ['rat-emot','passionate','know','convince','agree','common']

In [26]:
# cycle through the listeners to grab their questionnaire data
missing_files = pd.DataFrame()
for listener in unique_listeners:

    # identify the questionnaire and XML paths
    q_tsv_data_file = glob.glob(os.path.join(raw_listener_response_path,
                                             listener+'*.tsv'))
    xml_data_file = glob.glob(os.path.join(raw_listener_response_path,
                                           listener+'*.xml'))
    
    # if we've got a questionnaire, process it
    if len(q_tsv_data_file)>0 and len(xml_data_file)>0:
        
        # if we only have 1 per participant...
        if len(q_tsv_data_file)==1:
            
            # export TSV to a new file
            q_csv_data_file = os.path.join(cleaned_listener_response_path,
                                           listener+'_questionnaire.csv')
            clean_q_df = clean_responses(listener,
                                         q_tsv_data_file[0],
                                         q_csv_data_file)
            
            # grab their XML data           
            with open(xml_data_file[0], 'r') as xml_file:
                task_xml_data_file = xml_file.read()
            l_stimulus_order = stimulus_order(task_xml_data_file)
            l_stimulus_order = [re.sub(r'-(for|against)','',stimulus) 
                                if re.sub(r'-(for|against)','',stimulus) in single_stimuli 
                                else stimulus 
                                for stimulus in l_stimulus_order]

            # add the task order data to the dataframe
            # (a single .loc call avoids chained assignment)
            clean_q_df['Topic'] = 'none'
            for rating in l_ratings:
                rating_rows = clean_q_df['Source']==rating
                clean_q_df.loc[rating_rows, 'Topic'] = \
                        l_stimulus_order[0:rating_rows.sum()]

            # then save it and report back
            clean_q_df.to_csv(q_csv_data_file,index=False)
            print('Listener ID '+str(listener)+' Questionnaire Data Exported.')
            
        # if we've got more than 1 file per participant...
        else:
            
            # create an overarching dataframe and filename for the participant
            combined_clean_q_df = pd.DataFrame()
            q_csv_data_file = os.path.join(cleaned_listener_response_path,
                                           listener+'_questionnaire.csv')
            
            # preserve their file names when first processing
            for q_file in q_tsv_data_file:

                # get the requisite files
                next_ID = re.sub('_questionnaire.tsv', '', 
                                         os.path.basename(q_file))
                xml_data_file = os.path.join(raw_listener_response_path,
                                             next_ID+'-stimulus-log.xml')
                
                # clean up the TSV but DO NOT save
                clean_q_df = clean_responses(next_ID, q_file, None)
                
                # process the XML data
                with open(xml_data_file, 'r') as xml_file:
                    task_xml_data_file = xml_file.read()
                l_stimulus_order = stimulus_order(task_xml_data_file)
                l_stimulus_order = [re.sub(r'-(for|against)','',stimulus) 
                                    if re.sub(r'-(for|against)','',stimulus) in single_stimuli 
                                    else stimulus 
                                    for stimulus in l_stimulus_order]

                # add the task order data to the dataframe
                # (a single .loc call avoids chained assignment)
                clean_q_df['Topic'] = 'none'
                for rating in l_ratings:
                    rating_rows = clean_q_df['Source']==rating
                    clean_q_df.loc[rating_rows, 'Topic'] = \
                            l_stimulus_order[0:rating_rows.sum()]
                
                # if this isn't the first dataset, remove the demographic survey data
                if combined_clean_q_df.size!=0:
                    clean_q_df = clean_q_df[clean_q_df['Topic']!='none']

                # add to master dataframe    
                combined_clean_q_df = pd.concat([combined_clean_q_df, clean_q_df])

            # save it and report back
            combined_clean_q_df.to_csv(q_csv_data_file,index=False)
            print('Listener ID '+str(listener)+' Questionnaire Data Exported.')

    # if necessary files don't exist, let us know
    else:
        
        # if we don't have the questionnaire:
        if len(q_tsv_data_file)==0:
            missing_files = pd.concat(
                                [missing_files,
                                 pd.DataFrame([{'listener': listener, 
                                                'missing_file': 'questionnaire_tsv'}])], 
                                ignore_index=True)
            print('ERROR: Listener ID '+str(listener)+' Questionnaire Data Not Found.')

        # if we don't have the order data
        if len(xml_data_file)==0:
            missing_files = pd.concat(
                                [missing_files,
                                 pd.DataFrame([{'listener': listener, 
                                                'missing_file': 'stimulus_order_data'}])], 
                                ignore_index=True)
            print('ERROR: Listener ID '+str(listener)+' XML Data Not Found.')

# save missing data to file
missing_files.to_csv(os.path.join(cleaned_data_path,
                                  'listeners-unadded-missing_files.csv'),
                     index=False)
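
The stimulus-name normalization used in the loop above can be isolated as a small function. This is a sketch only: `normalize_stimulus` is a hypothetical helper, and the example order is invented rather than taken from a real stimulus log. The `-for` or `-against` suffix is dropped only when the remaining name is one of the single-clip topics.

```python
import re

single_stimuli = ['abortion', 'gay-marriage', 'legal-marijuana', 'tax-rich']

def normalize_stimulus(stimulus, single=single_stimuli):
    """Drop a '-for'/'-against' suffix only for single-clip topics."""
    base = re.sub(r'-(for|against)', '', stimulus)
    return base if base in single else stimulus

# invented stimulus order for illustration
example_order = ['abortion-for', 'death-penalty-against', 'tax-rich-for']
normalized_order = [normalize_stimulus(s) for s in example_order]
print(normalized_order)
```

Two-opinion topics such as `death-penalty` keep their suffix, so the "for" and "against" trials remain distinguishable in the `Topic` column.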

***

# Speaker gaze data

Restrict each speaker's gaze data to the target trials.

## Convert speakers' raw SMI files

In [27]:
# grab speakers' raw gaze data
raw_speaker_gaze_path = os.path.join(input_data_path,
                                     'speaker-gaze-raw')
gazeData = glob.glob(os.path.join(raw_speaker_gaze_path,
                                  '*.txt'))

In [28]:
# create a new clean directory if it doesn't yet exist
prepped_speaker_path = os.path.join(cleaned_data_path,
                                    'speaker-gaze-prepped')
if not os.path.exists(prepped_speaker_path):
    os.makedirs(prepped_speaker_path)

In [29]:
# process each file
for gazeFile in gazeData: clean_gaze_data(gazeFile)

Processed SMI Data File: ../data/02-data_cleaning/speaker-gaze-prepped/44881-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/speaker-gaze-prepped/51916-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/speaker-gaze-prepped/53596-smi-data.csv
Processed SMI Data File: ../data/02-data_cleaning/speaker-gaze-prepped/56386-smi-data.csv


## Filter speaker gaze data to relevant topic

In [30]:
# create a new target directory if it doesn't yet exist
cleaned_speaker_path = os.path.join(cleaned_data_path,
                                    'speaker-gaze-cleaned')
if not os.path.exists(cleaned_speaker_path):
    os.makedirs(cleaned_speaker_path)

In [31]:
# read in the key of which speakers provided which audio clips
segment_key = pd.read_csv(os.path.join(input_data_path,
                                       'speaker-segment_key.csv'))

In [32]:
# specify path to speakers' audio data
speaker_audio_path = os.path.join(input_data_path,
                                  'speaker-audio_outputs')

In [33]:
# get unique speakers
speaker_list = segment_key['speaker'].unique()

In [34]:
# cycle through the speakers
for speaker in speaker_list:
    
    # figure out how many segments the speaker did
    speaker_segments = segment_key[segment_key['speaker']==speaker].reset_index()
    
    # read in the speaker's data
    prepped_gaze = pd.read_csv(os.path.join(prepped_speaker_path,
                                            str(speaker)+
                                            '-smi-data.csv'))
    all_audio = pd.read_csv(os.path.join(speaker_audio_path,
                                         str(speaker)+'-winnowed_samples.csv'))
    all_audio['stim'] = all_audio['Stimulus'].replace(' Page 1','',regex=True)
        
    # cycle through the segments
    for segment_number in range(0,speaker_segments.shape[0]):
    
        # grab the next row
        next_segment = speaker_segments.iloc[segment_number]
        topic = next_segment['topic']
        side = next_segment['side']
        speaker = str(speaker)

        # grab only the times that correspond to our target trial
        audio_times = all_audio[all_audio['stim']==topic]
        start_time = min(audio_times['Time'])
        end_time = max(audio_times['Time'])

        # carve out the data between start_time and end_time (inclusive), keeping all columns
        target_gaze = prepped_gaze.loc[prepped_gaze['Time']>=start_time]
        target_gaze = target_gaze.loc[target_gaze['Time']<=end_time]
        if topic in single_stimuli:
            outname = os.path.join(cleaned_speaker_path,
                                   'gaze_speaker-'+topic+'-'+speaker+'.csv')
        else:
            outname = os.path.join(cleaned_speaker_path,
                                   'gaze_speaker-'+topic+'-'+side+'-'+speaker+'.csv')
        # save the data
        target_gaze.to_csv(outname,index=False)

        # print out an update
        print("Speaker Data Exported: "+topic+" ("+side+")")

Speaker Data Exported: death-penalty (against)
Speaker Data Exported: junk-food-tax (for)
Speaker Data Exported: tax-rich (for)
Speaker Data Exported: drinking-age (against)
Speaker Data Exported: gay-marriage (for)
Speaker Data Exported: junk-food-tax (against)
Speaker Data Exported: legal-marijuana (for)
Speaker Data Exported: abortion (for)
Speaker Data Exported: death-penalty (for)
Speaker Data Exported: drinking-age (for)
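
The time-window filter above can be illustrated on toy data. This is a minimal sketch with invented values: the column names `Time` and `stim` follow the notebook, and `Series.between` (inclusive at both ends by default) stands in for the two chained comparisons.

```python
import pandas as pd

# Toy speaker gaze and audio timing data; all values are invented.
toy_gaze = pd.DataFrame({'Time': [0, 5, 10, 15, 20, 25]})
toy_audio = pd.DataFrame({
    'stim': ['tax-rich', 'tax-rich', 'abortion'],
    'Time': [5, 15, 25],
})

# find the first and last audio timestamps for the target topic
audio_times = toy_audio.loc[toy_audio['stim'] == 'tax-rich', 'Time']
start_time, end_time = audio_times.min(), audio_times.max()

# keep only gaze samples within the window, endpoints included
target_gaze = toy_gaze[toy_gaze['Time'].between(start_time, end_time)]
print(target_gaze['Time'].tolist())
```

Only the gaze samples recorded while the target clip was being spoken survive the filter.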


***

# Clean up interim directories

Once we're done, we can delete the interim files.

In [35]:
import shutil

In [36]:
shutil.rmtree(prepped_speaker_path)

In [37]:
shutil.rmtree(prepped_listener_path)