# Opinions and Gaze: Data Processing

This Jupyter notebook contains the gaze and questionnaire data processing for "Seeing the other side: Conflict and controversy increase gaze coordination" (Paxton, Dale, & Richardson, *in preparation*).

To run this file from scratch, you will need:
* `data/`: Directory of experiment data, available for download from the project's ICPSR directory (link here). Due to data sensitivity (per the Institutional Review Board of the University of California, Merced), only researchers from ICPSR member institutions may access the data.
    * `listener-gaze-raw/*.txt`: Listeners' raw gaze data
    * `listener-responses-raw/`:
        * `*.xml`: Listeners' trial order during experiment
        * `*.tsv`: Listeners' post-experiment questionnaire data
    * `speaker-gaze-raw/*.txt`: Speakers' raw gaze data
    * `speaker-audio_outputs/*.csv`: Speakers' audio segment timing
    * `speaker-segment_key.csv`: List of which speakers provided which stimulus segments
* `supplementary-code/`: Directory of additional functions

## Table of contents

* [Preliminaries](#Preliminaries)
* [Listener gaze data](#Listener-gaze-data)
    * [Convert listeners' raw SMI files](#Convert-listeners'-raw-SMI-files)
    * [Concatenate multifile data from single listeners](#Concatenate-multifile-data-from-single-listeners)
    * [Segment listeners' files by audio clip](#Segment-listeners'-files-by-audio-clip)
* [Listener survey response data](#Listener-survey-response-data)
* [Speaker gaze data](#Speaker-gaze-data)
    * [Convert speakers' raw SMI files](#Convert-speakers'-raw-SMI-files)
    * [Filter speaker gaze data to relevant topic](#Filter-speaker-gaze-data-to-relevant-topic)
* [Clean up interim directories](#Clean-up-interim-directories)

**Written by**: A. Paxton
<br>**Date last modified**: 21 February 2017

***

# Preliminaries

Import packages and set global variables.

In [None]:
import re, os, glob
import pandas as pd

Read in bespoke functions.

In [None]:
%run '../supplementary-code/func-clean_responses.py'

In [None]:
%run '../supplementary-code/func-stimulus_order.py'

In [None]:
%run '../supplementary-code/func-clean_gaze_data.py'

# Listener gaze data

## Convert listeners' raw SMI files

In [None]:
# grab all of our raw gaze data
gazeData = glob.glob('../data/listener-gaze-raw/*.txt')

In [None]:
# create a new clean directory if it doesn't yet exist
if not os.path.exists('../data/listener-gaze-prepped/'):
    os.makedirs('../data/listener-gaze-prepped')

In [None]:
# process each file
for gazeFile in gazeData: clean_gaze_data(gazeFile)

## Concatenate multifile data from single listeners

In [None]:
# get all our processed files' names
processed_files = glob.glob('../data/listener-gaze-prepped/*.csv')

In [None]:
# identify files with longer names than expected
possible_multipart_files = [re.findall('\d{5}', ID)[0] for ID 
                            in processed_files if len(ID)>48]

In [None]:
# identify possible multipart files
multipart_files = [gaze_id for gaze_id in processed_files if
                  re.findall('\d{5}', gaze_id)[0] in possible_multipart_files]
multipart_ids = [re.findall('\d{5}', mpf)[0] for mpf in multipart_files]

In [None]:
# count how many of our files each of these odd IDs appears in
from collections import Counter
id_counts = Counter(multipart_ids)
single_ids = [ID for ID in id_counts if id_counts[ID]==1]
multi_ids = [ID for ID in id_counts if id_counts[ID]>1]
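
The counting step above can also be sketched as a single grouping pass. This is only an illustration: the file names and the helper `group_by_participant` are hypothetical, and the one assumption carried over from the notebook is that every file name contains a five-digit participant ID. Unlike the cells above, it groups every file rather than only those with unusually long names.

```python
import re
from collections import defaultdict

def group_by_participant(paths):
    """Map each five-digit participant ID to the files that contain it."""
    groups = defaultdict(list)
    for path in paths:
        match = re.search(r'\d{5}', path)
        if match:  # skip names without a five-digit ID
            groups[match.group(0)].append(path)
    return dict(groups)

# hypothetical file names, for illustration only
example_paths = ['10001-smi-data.csv',
                 '10002-part1-smi-data.csv',
                 '10002-part2-smi-data.csv']
groups = group_by_participant(example_paths)

# IDs with one file only need renaming; IDs with several need concatenating
single_ids = sorted(ID for ID, files in groups.items() if len(files) == 1)
multi_ids = sorted(ID for ID, files in groups.items() if len(files) > 1)
print(single_ids)
print(multi_ids)
```

Keeping the file list next to each ID means the later renaming and concatenation steps would not need to search the file names again.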

In [None]:
# identify the files whose IDs occur only once
single_files = [gaze_id for gaze_id in processed_files if
                  re.findall('\d{5}', gaze_id)[0] in single_ids]

In [None]:
# rename the oddly named files with unique IDs
for single_file in single_files:
    participant_id = re.findall('\d{5}', single_file)[0]
    new_file_name = os.path.join('../data/listener-gaze-prepped',
                                 participant_id+'-smi-data.csv')
    os.rename(single_file,new_file_name)

In [None]:
# identify the multi-part participant files
multi_files = [gaze_id for gaze_id in processed_files if
                  re.findall('\d{5}', gaze_id)[0] in multi_ids]

In [None]:
# run through all possibly duplicated participants
for duplicate_id in multi_ids:
    
    # identify which files belong to this person
    target_files = [next_file for next_file in processed_files if
                      re.findall('\d{5}', next_file)[0] == duplicate_id]
    
    # concatenate the files
    concatenated_df = pd.concat(
                            [pd.read_csv(next_file) for next_file in target_files],
                            ignore_index=True)
        
    # delete the old files
    for next_file in target_files:
        os.remove(next_file)
        
    # if there are any duplicated rows, just keep the first
    concatenated_df = concatenated_df.drop_duplicates()
    
    # save the new file
    new_concat_name = os.path.join('../data/listener-gaze-prepped',
                                     duplicate_id+'-smi-data.csv')
    concatenated_df.to_csv(new_concat_name, sep=',', 
                           header=True, index=False)
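
To see what the merge does to overlapping recordings, here is a minimal sketch on two small in-memory dataframes. The data and column names are made up for the example; it assumes only that a repeated sample shows up as a fully identical row.

```python
import pandas as pd

# two hypothetical partial recordings for one participant;
# the column names are assumptions made for this example
part_one = pd.DataFrame({'Time': [1, 2, 3], 'Stimulus': ['a', 'a', 'b']})
part_two = pd.DataFrame({'Time': [3, 4], 'Stimulus': ['b', 'b']})

# stack the parts, renumber the rows, and keep the first copy of any repeated row
merged = pd.concat([part_one, part_two], ignore_index=True).drop_duplicates()
print(merged)
```

Note that `drop_duplicates` compares whole rows, so two samples that share a timestamp but differ in any other column would both be kept.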

## Segment listeners' files by audio clip

In [None]:
# create a new clean directory if it doesn't yet exist
if not os.path.exists('../data/listener-gaze-cleaned/'):
    os.makedirs('../data/listener-gaze-cleaned')

In [None]:
# open up cleaned ones and take a peek at what's going on
gazeCleaned = glob.glob('../data/listener-gaze-prepped/*.csv')

In [None]:
# identify which stimuli only had 1 audio clip
single_stimuli = ['abortion','gay-marriage','legal-marijuana','tax-rich']

In [None]:
# cycle through the participants' data
for nextGaze in gazeCleaned:
    
    # grab the next set of gaze data
    gaze_df = pd.read_csv(nextGaze,sep=',')
    particID = re.findall('\d{5}',nextGaze)[0]
    
    # identify which stimuli are present in this participant's data
    available_stimuli = list(set(gaze_df['Stimulus']))
    pic_stimuli = [stim for stim in available_stimuli 
                       if len(re.findall(' Page',str(stim)))>0]
    
    # cycle through the available picture data
    for next_pic in pic_stimuli:
        
        # grab the next picture data and the correct name for it
        base_stim_name = next_pic.split(' ')[0]
        instr_and_gaze_subset = gaze_df[gaze_df['Stimulus'].\
                                            str.contains(base_stim_name)\
                                       ].reset_index(drop=True)
        gaze_subset = gaze_df[gaze_df['Stimulus'].\
                                  str.contains(next_pic)\
                             ].reset_index(drop=True)
        
        # if the topic had only one audio clip, we don't have to find a second listening section
        if base_stim_name in single_stimuli:
            gaze_save_name = os.path.join('../data/listener-gaze-cleaned/gaze_trial-'
                                          +base_stim_name+'-'+particID+'.csv')
            gaze_subset.to_csv(gaze_save_name, index=False)
        
        # if it is a two-opinion topic, make sure we have both instruction screens, and then slice the listening data
        elif len(instr_and_gaze_subset['Stimulus'].unique())>2:
            
            # get first appearance of each stimulus
            first_appearance = instr_and_gaze_subset.\
                                        groupby('Stimulus').\
                                        first().\
                                        sort_values(by='Time',ascending=True).\
                                        reset_index()
            
            # index the first time the participant saw the "for" and "against" instructions
            against_instr = first_appearance['Time']\
                                [first_appearance['Stimulus'].\
                                         str.contains('-against.rtf')\
                                ].item()
            for_instr = first_appearance['Time']\
                                [first_appearance['Stimulus'].\
                                         str.contains('-for.rtf')\
                                ].item()

            # if the "against" instruction came first
            if against_instr < for_instr:
                against_listening = gaze_subset[gaze_subset['Time'] < for_instr]
                for_listening = gaze_subset[gaze_subset['Time'] > for_instr]
            
            # otherwise, the "for" instruction came first
            else:
                for_listening = gaze_subset.loc[gaze_subset['Time'] < against_instr]
                against_listening = gaze_subset.loc[gaze_subset['Time'] > against_instr]
                
            # create the file names
            against_file = os.path.join('../data/listener-gaze-cleaned/gaze_trial-'
                                          +base_stim_name+'-against-'+particID+'.csv')
            for_file = os.path.join('../data/listener-gaze-cleaned/gaze_trial-'
                                          +base_stim_name+'-for-'+particID+'.csv')
            
            # and then save them
            against_listening.to_csv(against_file,index=False)
            for_listening.to_csv(for_file,index=False)
        
        # if we don't have them both, let us know
        else:
            print("ERROR for ID "+particID+": Gaze Data for `"+base_stim_name+"` Not Found.")
    
    # print update
    print("Participant "+particID+" Exported.")
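
The slicing for two-opinion topics boils down to one idea: the picture samples recorded before the later instruction screen belong to the first trial, and those recorded after it belong to the second. Here is a minimal sketch of that idea, with made-up times and a hypothetical helper `split_by_second_instruction`.

```python
import pandas as pd

def split_by_second_instruction(gaze, first_onset, second_onset):
    """Split picture-viewing samples at the onset of the later instruction screen."""
    cutoff = max(first_onset, second_onset)
    before = gaze[gaze['Time'] < cutoff]
    after = gaze[gaze['Time'] > cutoff]
    return before, after

# hypothetical samples recorded while the picture was on screen
gaze = pd.DataFrame({'Time': [10, 20, 30, 60, 70, 80]})

# hypothetical onsets: "against" instructions at 5, "for" instructions at 50
against_onset, for_onset = 5, 50
first_block, second_block = split_by_second_instruction(gaze, against_onset, for_onset)

# the "against" instructions came first here, so the first block is the "against" trial
against_listening, for_listening = first_block, second_block
print(against_listening['Time'].tolist())
print(for_listening['Time'].tolist())
```

This works only because the instruction screens and the picture are separate stimuli, so no picture sample shares a timestamp with the cutoff.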

# Listener survey response data

In [None]:
# cycle through listener raw data
listener_folders = glob.glob('../data/listener-responses-raw/*')

In [None]:
# create a new clean directory if it doesn't yet exist
if not os.path.exists('../data/listener-responses-cleaned/'):
    os.makedirs('../data/listener-responses-cleaned')

In [None]:
# grab listener IDs associated with our gaze files
unique_listeners = glob.glob('../data/listener-gaze-cleaned/*.csv')
unique_listeners = list(set([re.findall('\d{5}',f_name)[0] 
                             for f_name in unique_listeners]))

In [None]:
# specify our rating categories
l_ratings = ['rat-emot','passionate','know','convince','agree','common']

In [None]:
# cycle through the listeners to grab their questionnaire data
missing_questionnaires = list()
for listener in unique_listeners:

    # identify the questionnaire and XML paths
    q_tsv_data_file = glob.glob('../data/listener-responses-raw/'+listener+'*.tsv')
    xml_data_file = glob.glob('../data/listener-responses-raw/'+listener+'*.xml')
    
    # if we've got a questionnaire, process it
    if len(q_tsv_data_file)>0:    
        
        # if we only have 1 per participant...
        if len(q_tsv_data_file)==1:
            
            # export TSV to a new file
            q_csv_data_file = os.path.join('../data/listener-responses-cleaned/'
                                   +listener+'_questionnaire.csv')
            clean_q_df = clean_responses(listener,
                                         q_tsv_data_file[0],
                                         q_csv_data_file)
            
            # grab their XML data
            with open(xml_data_file[0], 'r') as xml_file:
                task_xml_data_file = xml_file.read()
            l_stimulus_order = stimulus_order(task_xml_data_file)
            l_stimulus_order = [re.sub('\-(for|against)','',stimulus) 
                                if re.sub('\-(for|against)','',stimulus) in single_stimuli 
                                else stimulus 
                                for stimulus in l_stimulus_order]

            # add the task order data to the dataframe
            clean_q_df['Topic'] = 'none'
            for rating in l_ratings:
                # select this rating's rows once, then assign topics in presentation order
                rating_rows = clean_q_df['Source']==rating
                clean_q_df.loc[rating_rows,'Topic'] = \
                        l_stimulus_order[0:rating_rows.sum()]
                
            # then save it and report back
            clean_q_df.to_csv(q_csv_data_file,index=False)
            print('Listener ID '+str(listener)+' Questionnaire Data Exported.')
            

        # if we've got more than 1 file per participant...
        else:
            
            # create an overarching dataframe and filename for the participant
            combined_clean_q_df = pd.DataFrame()
            q_csv_data_file = os.path.join('../data/listener-responses-cleaned/'
                       +listener+'_questionnaire.csv')
            
            # preserve their file names when first processing
            for q_file in q_tsv_data_file:

                # get the requisite files
                next_ID = re.sub('_questionnaire.tsv', '', 
                                         os.path.basename(q_file))
                xml_data_file = os.path.join('../data/listener-responses-raw/'
                                   +next_ID+'-stimulus-log.xml')
                
                # clean up the TSV but DO NOT save
                clean_q_df = clean_responses(next_ID, q_file, None)
                
                # process the XML data
                with open(xml_data_file, 'r') as xml_file:
                    task_xml_data_file = xml_file.read()
                l_stimulus_order = stimulus_order(task_xml_data_file)
                l_stimulus_order = [re.sub('\-(for|against)','',stimulus) 
                                    if re.sub('\-(for|against)','',stimulus) in single_stimuli 
                                    else stimulus 
                                    for stimulus in l_stimulus_order]

                # add the task order data to the dataframe
                clean_q_df['Topic'] = 'none'
                for rating in l_ratings:
                    # select this rating's rows once, then assign topics in presentation order
                    rating_rows = clean_q_df['Source']==rating
                    clean_q_df.loc[rating_rows,'Topic'] = \
                            l_stimulus_order[0:rating_rows.sum()]
                
                # if this isn't the first dataset, remove the demographic survey data
                if combined_clean_q_df.size!=0:
                    clean_q_df = clean_q_df[clean_q_df['Topic']!='none']

                # add to master dataframe    
                combined_clean_q_df = pd.concat([combined_clean_q_df, clean_q_df])

            # save it and report back
            combined_clean_q_df.to_csv(q_csv_data_file,index=False)
            print('Listener ID '+str(listener)+ ' Questionnaire Data Exported.')

    # if it doesn't exist, let us know
    else:
        missing_questionnaires.append([listener])
        print('ERROR: Listener ID '+str(listener)+ ' Questionnaire Data Not Found.')
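
The stimulus-order cleanup in the loop above strips the `-for`/`-against` suffix only from topics that had a single audio clip. Pulled out on its own, the rule could be sketched as below; `normalize_topic` is a hypothetical helper name, and `death-penalty` is a made-up two-sided topic used purely for illustration.

```python
import re

# topics that had only one audio clip (copied from the notebook)
single_stimuli = ['abortion', 'gay-marriage', 'legal-marijuana', 'tax-rich']

def normalize_topic(stimulus):
    """Drop the '-for'/'-against' suffix, but only for single-clip topics."""
    base = re.sub(r'-(for|against)', '', stimulus)
    return base if base in single_stimuli else stimulus

# hypothetical stimulus order; 'death-penalty' is a made-up two-sided topic
example_order = ['abortion-for', 'death-penalty-against', 'tax-rich-against']
normalized = [normalize_topic(stimulus) for stimulus in example_order]
print(normalized)
```

Two-sided topics keep their suffix so that the questionnaire rows line up with the `-for` and `-against` gaze files.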

# Speaker gaze data

Restrict each speaker's gaze data to the target trials.

## Convert speakers' raw SMI files

In [None]:
# grab speakers' raw gaze data
gazeData = glob.glob('../data/speaker-gaze-raw/*.txt')

In [None]:
# create a new clean directory if it doesn't yet exist
if not os.path.exists('../data/speaker-gaze-prepped/'):
    os.makedirs('../data/speaker-gaze-prepped')

In [None]:
# process each file
for gazeFile in gazeData: clean_gaze_data(gazeFile)

## Filter speaker gaze data to relevant topic

In [None]:
# create a new target directory if it doesn't yet exist
if not os.path.exists('../data/speaker-gaze-cleaned/'):
    os.makedirs('../data/speaker-gaze-cleaned')

In [None]:
# read in the data for speaker audio clips
segment_key = pd.read_table('../data/speaker-segment_key.csv',
                           sep=',')

In [None]:
# get unique speakers
speaker_list = segment_key['speaker'].unique()

In [None]:
# cycle through the speakers
for speaker in speaker_list:
    
    # grab the rows of the segment key that belong to this speaker
    speaker_segments = segment_key[segment_key['speaker']==speaker]
    
    # read in the speaker's data
    prepped_gaze = pd.read_csv('../data/speaker-gaze-prepped/'+
                               str(speaker)+'-smi-data.csv')
    all_audio = pd.read_csv('../data/speaker-audio_outputs/'+
                              str(speaker)+'-winnowed_samples.csv')
    all_audio['stim'] = all_audio['Stimulus'].replace(' Page 1','',regex=True)
        
    # cycle through this speaker's segments (one row each)
    for segment_index in range(0,speaker_segments.shape[0]):
    
        # grab the next row from this speaker's rows
        next_segment = speaker_segments.iloc[segment_index]
        topic = next_segment['topic']
        side = next_segment['side']
        speaker = str(speaker)

        # grab only the times that correspond to our target trial
        audio_times = all_audio[all_audio['stim']==topic]
        start_time = min(audio_times['Time'])
        end_time = max(audio_times['Time'])

        # carve out the data between start_time and end_time and save only the columns we need
        target_gaze = prepped_gaze.loc[prepped_gaze['Time']>=start_time]
        target_gaze = target_gaze.loc[target_gaze['Time']<=end_time]
        if topic in single_stimuli:
            outname = os.path.join('../data/speaker-gaze-cleaned/gaze_speaker-'
                                   +topic+'-'+speaker+'.csv')
        else:
            outname = os.path.join('../data/speaker-gaze-cleaned/gaze_speaker-'
                                   +topic+'-'+side+'-'+speaker+'.csv')
        # save the data
        target_gaze.to_csv(outname,index=False)

        # print out an update
        print("Speaker Data Exported: "+topic+" ("+side+")")
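
The windowing step in the loop above keeps only the gaze samples that fall between the first and last audio sample for a topic. A minimal sketch with made-up numbers follows; the `X` column is an assumption for this example, and `Series.between` is swapped in for the two chained comparisons.

```python
import pandas as pd

# hypothetical gaze samples and audio samples for one speaker
prepped_gaze = pd.DataFrame({'Time': [0, 100, 200, 300, 400],
                             'X': [1, 2, 3, 4, 5]})
audio_times = pd.DataFrame({'Time': [100, 150, 300]})

# the trial spans the first through the last audio sample
start_time = audio_times['Time'].min()
end_time = audio_times['Time'].max()

# Series.between is inclusive at both ends by default,
# matching the >= and <= comparisons in the cell above
target_gaze = prepped_gaze[prepped_gaze['Time'].between(start_time, end_time)]
print(target_gaze)
```

If a topic has no matching audio samples, `min()` and `max()` return `NaN` and the window comes back empty, so it may be worth checking `audio_times` before slicing.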

# Clean up interim directories

Once we're done, we can delete the interim files.

In [None]:
import shutil

In [None]:
shutil.rmtree('../data/speaker-gaze-prepped/')

In [None]:
shutil.rmtree('../data/listener-gaze-prepped/')
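
Because `shutil.rmtree` raises an error when the directory is already gone, re-running these cleanup cells will fail the second time. One way to guard against that is sketched below, with a hypothetical helper `remove_if_present` demonstrated on a throwaway temporary directory rather than on the real data.

```python
import os
import shutil
import tempfile

def remove_if_present(directory):
    """Delete a directory tree, returning True only if it existed."""
    if os.path.isdir(directory):
        shutil.rmtree(directory)
        return True
    return False

# demonstrate on a throwaway directory rather than on the real data
scratch = tempfile.mkdtemp()
first_call = remove_if_present(scratch)   # directory exists, so it is removed
second_call = remove_if_present(scratch)  # already gone, so nothing happens
print(first_call, second_call)
```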