# Opinions and Gaze: Data De-Identification (Step 0 of 3)

This Jupyter notebook contains code for stripping out the original participant 
numeric identifiers and replacing them with even less identifiable ones, 
as part of "Seeing the other side: Conflict and controversy increase gaze 
coordination" (Paxton, Dale, & Richardson, *in preparation*).

This notebook is the **zeroth of three** notebooks for the "Opinions and Gaze"
project. It is numbered zero because the raw data it operates on cannot be
released. The code is nevertheless shared here for maximal openness.

**Note**: Due to data sensitivity (per the Institutional Review Board of the 
University of California, Merced, where these data were collected), only 
researchers from ICPSR member institutions may access the data through the 
approved link.

**Written by**: A. Paxton (University of Connecticut)

**Date last modified**: 03 March 2021

## Table of contents

* [Preliminaries](#Preliminaries)
* [Listener data](#Listener-data)
    * [Swap identifiers from listeners' raw SMI files](#Swap-identifiers-from-listeners'-raw-SMI-files)
    * [Swap listener response data](#Swap-listener-response-data)
* [Speaker data](#Speaker-data)
    * [Swap identifiers from speakers' gaze data](#Swap-identifiers-from-speakers'-gaze-data)
    * [Swap speaker audio data](#Swap-speaker-audio-data)
    * [Swap segment key information](#Swap-segment-key-information)

***

# Preliminaries

Import packages and set global variables.

In [1]:
import re, os, glob
import pandas as pd

Create or specify required paths.

In [2]:
# specify input data file path
input_data_path = os.path.join('../data/00-raw')

In [3]:
# create a new clean directory if it doesn't yet exist
cleaned_data_path = os.path.join('../data/01-input')
if not os.path.exists(cleaned_data_path):
    os.makedirs(cleaned_data_path)

Read in bespoke functions.

In [4]:
%run '../supplementary-code/func-swap_ids.py'

In [5]:
%run '../supplementary-code/func-swap_ids_filename.py'
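
The two helper scripts are not reproduced in this notebook. Purely as an illustration, the sketch below shows one way an identifier swap with a before-and-after count report could work on a string. The function `swap_ids_sketch` and the example text are hypothetical, and the real `swap_ids` (which takes `in_file_path` and `swap_identifier_df`) may work differently. The report keys mirror the `original_new_count` and `swapped_new_count` columns used later in this notebook.

```python
import re

def swap_ids_sketch(text, original_id, new_id):
    """Hypothetical stand-in for swap_ids: swap one ID in a string and report counts."""
    # match the IDs only when they are not embedded in a longer run of digits
    old_pattern = re.compile(r'(?<!\d)' + re.escape(original_id) + r'(?!\d)')
    new_pattern = re.compile(r'(?<!\d)' + re.escape(new_id) + r'(?!\d)')

    # count occurrences of the new ID before and after the swap
    original_new_count = len(new_pattern.findall(text))
    swapped_text = old_pattern.sub(new_id, text)
    swapped_new_count = len(new_pattern.findall(swapped_text))

    report = {'original_id': original_id,
              'new_id': new_id,
              'original_new_count': original_new_count,
              'swapped_new_count': swapped_new_count}
    return swapped_text, report

# made-up file contents in which the original ID appears three times
example_text = "Subject: 48213\nFile: 48213_gaze.idf\nExport: 48213.txt\n"
swapped_text, report = swap_ids_sketch(example_text, '48213', '00001')
```

Counting the new identifier both before and after the swap makes it possible to notice a collision, that is, a case where the replacement string was already present in the file.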

# Listener data

## Swap identifiers from listeners' raw SMI files

In [6]:
# grab all of our original gaze data
raw_listener_gaze_files = os.path.join(input_data_path,
                                       'listener-gaze-raw/*.txt')
gazeData = glob.glob(raw_listener_gaze_files)

In [7]:
# create a new clean directory if it doesn't yet exist
raw_listener_path = os.path.join(cleaned_data_path,
                                     'listener-gaze-raw')
if not os.path.exists(raw_listener_path):
    os.makedirs(raw_listener_path)

In [8]:
# grab all unique 5-digit identifiers from the list of filenames
# (raw string avoids an invalid-escape warning for '\d')
identifier_df = pd.DataFrame({'original_id':[re.search(r'\d{5}',x).group() 
                                              for x in gazeData]})

In [9]:
# create new identifiers
identifier_df['new_id'] = (identifier_df.reset_index().index+1).astype(str)
identifier_df['new_id']= identifier_df['new_id'].apply(lambda x: x.zfill(5))
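
To make the two cells above concrete, here is a minimal sketch on made-up filenames. The filenames and the name `example_df` are assumptions for illustration only; the logic mirrors the extraction of a 5-digit identifier followed by zero-padded sequential numbering.

```python
import re
import pandas as pd

# hypothetical filenames standing in for the real gaze files
example_files = ['listener-gaze-raw/48213_gaze.txt',
                 'listener-gaze-raw/10377_gaze.txt',
                 'listener-gaze-raw/90552_gaze.txt']

# extract the first 5-digit run from each filename
example_df = pd.DataFrame({'original_id': [re.search(r'\d{5}', f).group()
                                           for f in example_files]})

# assign sequential replacement identifiers, zero-padded to 5 characters
example_df['new_id'] = [str(i + 1).zfill(5) for i in range(len(example_df))]
print(example_df)
```

Because new identifiers follow list order, and `glob.glob` does not guarantee any particular order, wrapping the file list in `sorted()` would be one way to make the mapping reproducible across machines.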

In [10]:
# process each file and save a report
swap_reports = []
for gazeFile in gazeData: 
    next_swapped_report = swap_ids(in_file_path = gazeFile, 
                                   swap_identifier_df = identifier_df)
    swap_reports.append(next_swapped_report)
# DataFrame.append was removed in pandas 2.0, so concatenate the reports once
swap_reports_df = pd.concat(swap_reports)

In [11]:
# update index
swap_reports_df = swap_reports_df.reset_index(drop = True)

In [12]:
# confirm that we've got exactly 3 swaps per file
swap_reports_df['diff_new'] = (swap_reports_df['swapped_new_count'] - 
                               swap_reports_df['original_new_count'])
swap_reports_df['diff_new'].unique()

array([3])
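
If the check ever returned more than one value, it would help to see which files deviate. The sketch below does this on a made-up report table; `example_reports` and its values are hypothetical, and only the column names are taken from this notebook.

```python
import pandas as pd

# made-up swap reports; the third file already contained the new ID once
example_reports = pd.DataFrame({'new_id': ['00001', '00002', '00003'],
                                'original_new_count': [0, 0, 1],
                                'swapped_new_count': [3, 3, 3]})

# number of new-ID occurrences introduced by the swap
example_reports['diff_new'] = (example_reports['swapped_new_count'] -
                               example_reports['original_new_count'])

# keep only the files that did not gain exactly 3 occurrences
unexpected = example_reports.loc[example_reports['diff_new'] != 3]
print(unexpected)
```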

## Swap listener response data

In [13]:
# cycle through listener raw data
raw_listener_response_path = os.path.join(input_data_path,
                                          'listener-responses-raw/') 
listener_folders = glob.glob(os.path.join(raw_listener_response_path,
                                          '*'))

In [14]:
# create a new clean directory if it doesn't yet exist
cleaned_listener_response_path = os.path.join(cleaned_data_path,
                                              'listener-responses-raw')
if not os.path.exists(cleaned_listener_response_path):
    os.makedirs(cleaned_listener_response_path)

In [15]:
# grab listener IDs associated with our response folders
unique_responders = list(set([re.findall(r'\d{5}',f_name)[0] 
                             for f_name in listener_folders]))

In [16]:
# grab all unique 5-digit identifiers from the list of filenames
responder_id_df = pd.DataFrame({'original_responder_id':[re.search(r'\d{5}',x).group() 
                                                         for x in unique_responders]})

In [17]:
# merge questionnaire and gaze identifiers to get unique new IDs
gaze_and_questionnaire_ids = (responder_id_df
                              .merge(identifier_df,
                                     left_on='original_responder_id', 
                                     right_on='original_id',
                                     how='outer')
                             .sort_values(by='new_id')
                             .reset_index(drop=True))

In [18]:
# grab folks who were missing and give them new sequential IDs; rows without
# a new_id were sorted last above, so row position + 1 continues the sequence
formerly_missing_df = (gaze_and_questionnaire_ids
                       .loc[gaze_and_questionnaire_ids['new_id'].isna()]
                       .reset_index()
                       .drop(columns="new_id")
                       .rename(columns = {"index":"new_id"}))
formerly_missing_df['new_id']= formerly_missing_df['new_id'].apply(
    lambda x: str(x+1).zfill(5))

In [19]:
# append questionnaire-only participants to gaze and questionnaire dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
gaze_and_questionnaire_ids = (pd.concat([gaze_and_questionnaire_ids,
                                         formerly_missing_df],
                                        sort=False)
                              .dropna(axis=0,
                                      subset=["new_id"])
                              .reset_index(drop=True)
                              .drop(columns="original_id")
                              .rename(columns = {"original_responder_id":"original_id"}))


In [20]:
# process each file and save a report
swap_questionnaire_reports = []
for listener_file in listener_folders: 
    next_swapped_report = swap_ids(in_file_path = listener_file, 
                                   swap_identifier_df = gaze_and_questionnaire_ids)
    swap_questionnaire_reports.append(next_swapped_report)
# DataFrame.append was removed in pandas 2.0, so concatenate the reports once
swap_questionnaire_reports_df = pd.concat(swap_questionnaire_reports)

In [21]:
# update index
swap_questionnaire_reports_df = (swap_questionnaire_reports_df
                                 .sort_values(by="new_id")
                                 .reset_index(drop = True))

# Speaker data

## Swap identifiers from speakers' gaze data

In [22]:
# grab speakers' raw gaze data
raw_speaker_gaze_path = os.path.join(input_data_path,
                                     'speaker-gaze-raw')
gazeData = glob.glob(os.path.join(raw_speaker_gaze_path,
                                  '*.txt'))

In [23]:
# create a new clean directory if it doesn't yet exist
prepped_speaker_path = os.path.join(cleaned_data_path,
                                    'speaker-gaze-raw')
if not os.path.exists(prepped_speaker_path):
    os.makedirs(prepped_speaker_path)

In [24]:
# grab speaker IDs associated with our gaze files
unique_speakers = list(set([re.findall(r'\d{5}',f_name)[0] 
                             for f_name in gazeData]))

In [25]:
# grab all unique 5-digit identifiers from the list of filenames
speaker_id_df = pd.DataFrame({'original_id':[re.search(r'\d{5}',x).group() 
                                             for x in unique_speakers]})

In [26]:
# create new identifiers
speaker_id_df['new_id'] = (speaker_id_df.reset_index().index+99991).astype(str)
speaker_id_df['new_id']= speaker_id_df['new_id'].apply(lambda x: x.zfill(5))
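
Speaker identifiers start at 99991, which keeps them well apart from the listener identifiers that count up from 00001. A small sketch on made-up speaker IDs follows. The names `example_speakers` and `start` are hypothetical, and the 5-digit check is an assumption based on the `\d{5}` pattern used throughout this notebook. Starting at 99991 leaves room for nine speakers (99991 through 99999) before an identifier would need six digits.

```python
import pandas as pd

# hypothetical original speaker IDs
example_speakers = pd.DataFrame({'original_id': ['31415', '27182', '16180']})

# number speakers upward from 99991
start = 99991
example_speakers['new_id'] = [str(start + i)
                              for i in range(len(example_speakers))]

# check that every new ID still matches the 5-digit format
all_five_digits = bool(example_speakers['new_id'].str.len().eq(5).all())
print(example_speakers, all_five_digits)
```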

In [27]:
# process each file and save a report
swap_speaker_reports = []
for gazeFile in gazeData: 
    next_swapped_report = swap_ids(in_file_path = gazeFile, 
                                   swap_identifier_df = speaker_id_df)
    swap_speaker_reports.append(next_swapped_report)
# DataFrame.append was removed in pandas 2.0, so concatenate the reports once
swap_speaker_reports_df = pd.concat(swap_speaker_reports)

In [28]:
# update index
swap_speaker_reports_df = swap_speaker_reports_df.reset_index(drop = True)

In [29]:
# confirm that we've got exactly 3 swaps per file
swap_speaker_reports_df['diff_new'] = (swap_speaker_reports_df['swapped_new_count'] - 
                                       swap_speaker_reports_df['original_new_count'])
swap_speaker_reports_df['diff_new'].unique()

array([3])

## Swap speaker audio data

Speakers' audio data files only include their identifiers in their filename.

In [30]:
# specify path to speakers' audio data
speaker_audio_path = os.path.join(input_data_path,
                                  'speaker-audio_outputs')
audioData = glob.glob(os.path.join(speaker_audio_path,
                                   '*.csv'))

In [31]:
# create a new clean directory if it doesn't yet exist
prepped_speaker_path = os.path.join(cleaned_data_path,
                                    'speaker-audio_outputs')
if not os.path.exists(prepped_speaker_path):
    os.makedirs(prepped_speaker_path)

In [32]:
# process each file (only the filename changes, so no report is collected)
for audioFile in audioData: 
    swap_ids_filename(in_file_path = audioFile, 
                      swap_identifier_df = speaker_id_df)
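
Since `func-swap_ids_filename.py` is not shown here, the following is only a sketch of how a filename-level swap might be computed. The function `swapped_filename_sketch`, the example filename, and the output directory are all hypothetical; the real `swap_ids_filename` may behave differently.

```python
import os
import re

def swapped_filename_sketch(in_file_path, id_map, out_dir):
    """Hypothetical: build an output path with the ID in the filename replaced."""
    file_name = os.path.basename(in_file_path)

    # find the 5-digit ID in the filename and look up its replacement
    original_id = re.search(r'\d{5}', file_name).group()
    new_name = file_name.replace(original_id, id_map[original_id], 1)

    return os.path.join(out_dir, new_name)

# made-up mapping and filename
example_map = {'31415': '99991'}
new_path = swapped_filename_sketch('speaker-audio_outputs/31415-audio.csv',
                                   example_map,
                                   'cleaned')
print(new_path)
```

With the target path in hand, the file contents could then be copied unchanged, for example with `shutil.copyfile`.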

## Swap segment key information

In [33]:
# specify path to speakers' segment data
speaker_segment_data = os.path.join(input_data_path,
                                  'speaker-segment_key.csv')

In [34]:
# read in file
speakers_segment_df = pd.read_csv(speaker_segment_data)
speakers_segment_df['speaker'] = speakers_segment_df['speaker'].astype(str)

In [35]:
# merge with speaker ID data
speakers_swapped_segment_df = (speakers_segment_df
                               .merge(speaker_id_df,
                                      left_on = "speaker",
                                      right_on = "original_id")
                              .drop(columns=["speaker", "original_id"])
                              .rename(columns = {"new_id":"speaker"}))
speakers_swapped_segment_df = speakers_swapped_segment_df[["speaker", "topic", "side"]]
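
The merge above uses the default inner join, so a segment row whose speaker is absent from `speaker_id_df` would be dropped without notice. One way to guard against that, sketched here on made-up data, is a left merge with `indicator=True` followed by a check for unmatched rows. The example values for `topic` and `side` are hypothetical.

```python
import pandas as pd

# hypothetical segment key and speaker ID key (all values are made up)
example_segments = pd.DataFrame({'speaker': [31415, 31415, 27182],
                                 'topic': ['topic_a', 'topic_b', 'topic_a'],
                                 'side': ['pro', 'con', 'pro']})
example_key = pd.DataFrame({'original_id': ['31415', '27182'],
                            'new_id': ['99991', '99992']})

# cast to string so both merge keys share a dtype
example_segments['speaker'] = example_segments['speaker'].astype(str)

# left merge keeps every segment row; the indicator flags unmatched speakers
checked = example_segments.merge(example_key,
                                 left_on='speaker',
                                 right_on='original_id',
                                 how='left',
                                 validate='many_to_one',
                                 indicator=True)
unmatched = checked.loc[checked['_merge'] != 'both']

# drop the original IDs and keep the de-identified speaker column
swapped = (checked
           .drop(columns=['speaker', 'original_id', '_merge'])
           .rename(columns={'new_id': 'speaker'})
           [['speaker', 'topic', 'side']])
print(swapped)
```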

In [36]:
# write to file
speaker_swapped_filepath = os.path.join(cleaned_data_path,
                                        'speaker-segment_key.csv')
speakers_swapped_segment_df.to_csv(path_or_buf=speaker_swapped_filepath,
                                   index=False)