# Opinions and Gaze: Data Processing (Step 2 of 3)

This Jupyter notebook contains the data processing 
for "Seeing the other side: Conflict and controversy 
increase gaze coordination" (Paxton, Dale, & Richardson, 
*in preparation*).

This notebook is the **second of three** notebooks for the
"Opinions and Gaze" project. This must be run **after**
the `oag-data_cleaning.ipynb` file.

To run this notebook completely from scratch, you will need the following
files: 

* `data/02-data_cleaning`: Directory of cleaned study data, produced by 
    `oag-data_cleaning.ipynb`.
    * `listener-gaze-cleaned/*.csv`: Listeners' clean gaze data.
    * `listener-responses-cleaned/*.csv`: Listeners' clean questionnaire data.
    * `speaker-gaze-cleaned/*.csv`: Speakers' clean gaze data.
* `supplementary-code/`: Directory of additional functions and global
    variables.

**Note**: Due to data sensitivity (per the Institutional Review Board of the 
University of California, Merced), only researchers from ICPSR 
member institutions may access study data through the approved link.

## Table of contents

* [Preliminaries](#Preliminaries)
* [Prepare listener and speaker data](#Prepare-listener-and-speaker-data)
    - [Downsample data](#Downsample-data)
    - [Prepare speaker gaze data](#Prepare-speaker-gaze-data)
    - [Prepare listener gaze data](#Prepare-listener-gaze-data)
    - [Remove unusable segments](#Remove-unusable-segments)
* [Cross-recurrence quantification analysis](#Cross-recurrence-quantification-analysis)
    - [Calculate cross-recurrence between speakers and listeners](#Calculate-cross-recurrence-between-speakers-and-listeners)
* [Create baseline data](#Create-baseline-data)
* [Create analysis and plotting dataframes](#Create-analysis-and-plotting-dataframes)
    - [Bind real and baseline data](#Bind-real-and-baseline-data)
    - [Convert text to numeric](#Convert-text-to-numeric)
    - [Center dummy-coded variables](#Center-dummy-coded-variables)
    - [Remove NAs](#Remove-NAs)
    - [Create polynomial time variables](#Create-polynomial-time-variables)
    - [Save the dataset](#Save-the-dataset)

**Written by**: A. Paxton (University of California, Berkeley)

**Date last modified**: 29 March 2018

***

# Preliminaries

In [None]:
# clear the space
rm(list=ls())

In [None]:
# read in the needed files and functions
source('../supplementary-code/libraries_and_functions-oag.r')

In [None]:
# create a folder for processed data, if it doesn't exist
if (!dir.exists(processed_data_path)){
    dir.create(processed_data_path)
}

In [None]:
# specify paths for cleaned data
speaker_gaze_cleaned_path = file.path(cleaned_data_path,
                                     'speaker-gaze-cleaned')
listener_gaze_cleaned_path = file.path(cleaned_data_path,
                                      'listener-gaze-cleaned')
listener_responses_cleaned_path = file.path(cleaned_data_path,
                                      'listener-responses-cleaned')

***

# Prepare listener and speaker data

## Downsample data

In [None]:
# create downsampled time variable
sampling_round_time = create_downsampled_time(sampling_hz)
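
The helper `create_downsampled_time()` and the global `sampling_hz` live in the supplementary code and are not shown in this notebook. As a rough illustration of what downsampling by rounding looks like, here is a minimal sketch. It assumes a hypothetical 10 Hz target rate (`example_hz`), a matching rounding precision (`example_digits`), and timestamps in microseconds (which is what the division by `1000000` later in this notebook suggests). None of these values are taken from the real configuration.

```r
# Hypothetical values: the real ones come from the supplementary code
example_hz <- 10       # assumed target sampling rate (samples per second)
example_digits <- 1    # assumed decimal places, so 10^-digits equals 1/hz

# toy timestamps in microseconds
raw_time <- c(5000000, 5020000, 5040000, 5110000, 5130000, 5210000)

# shift to start at zero, convert to seconds, and snap to the coarser grid
time_sec <- round((raw_time - min(raw_time)) / 1000000, example_digits)

# keep only the first sample that falls into each time slice
first_in_slice <- !duplicated(time_sec)
downsampled <- data.frame(raw_time = raw_time[first_in_slice],
                          time = time_sec[first_in_slice])
print(downsampled)
```

Keeping the first sample per slice mirrors the `group_by(time, ...) %>% slice(1)` step used further down.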

## Prepare speaker gaze data

Here, we read in each speaker's gaze data, remove data
for any segments that were not presented to listeners, downsample,
and then export individual files for each segment.

Due to equipment failure, we do not have any
usable speaker gaze data for `drinking-age-against`.
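
The loop below recovers the speaker ID, topic, and side from each file name using `strapply()`. To make that naming convention easier to follow, here is a base-R sketch of the same idea. The function `parse_speaker_file()` and both example file names (including their IDs and the topic `example-topic`) are hypothetical and only for illustration; the patterns are modeled on the ones used in the loop.

```r
# Hypothetical helper: split a speaker file name into its metadata
parse_speaker_file <- function(file_name) {

    # strip the directory, the prefix, and the extension
    metadata <- sub("^gaze_speaker-(.*)\\.csv$", "\\1", basename(file_name))

    # the speaker ID is the trailing five-digit block
    speaker <- sub("^.*-([[:digit:]]{5})$", "\\1", metadata)

    # two-sided topics carry `for` or `against` right before the ID
    two_sided <- regmatches(metadata,
                            regexec("^(.*)-(for|against)-[[:digit:]]{5}$",
                                    metadata))[[1]]
    if (length(two_sided) > 0) {
        topic <- two_sided[2]
        side <- two_sided[3]
    } else {
        # one-sided topics get the placeholder side `only`
        topic <- sub("-[[:digit:]]{5}$", "", metadata)
        side <- "only"
    }
    list(speaker = speaker, topic = topic, side = side)
}

# made-up file names that follow the assumed convention
two_sided_example <- parse_speaker_file("gaze_speaker-drinking-age-for-12345.csv")
one_sided_example <- parse_speaker_file("gaze_speaker-example-topic-54321.csv")
str(two_sided_example)
str(one_sided_example)
```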

In [None]:
# create a folder, if it doesn't exist
speaker_dataframe_path = file.path(processed_data_path,
                                   'speaker-dataframes')
if (!dir.exists(speaker_dataframe_path)){
    dir.create(speaker_dataframe_path)
}

In [None]:
# get list of cleaned speaker files
speaker_gaze_files = list.files(speaker_gaze_cleaned_path, 
                                recursive=FALSE, 
                                full.names=TRUE,
                                pattern="\\.csv$")

In [None]:
# create dataframes for completed and missing speakers
speaker_times = data.frame()
all_missing_speakers = data.frame()

# cycle through each speaker's file
for (speaker_file in speaker_gaze_files){
    
    # identify metadata from file name
    speaker_metadata = strapply(speaker_file,
                                "gaze_speaker-(.*).csv", c)[[1]]
    speaker_val = strapply(speaker_metadata,
                       "-([[:digit:]]{5})", c)[[1]]
    
    # identify topic and side (if two-sided issue)
    metadata_to_list = strsplit(speaker_metadata,'-')[[1]]
    second_to_last_word = tail(metadata_to_list, n=2)[1]
    if (second_to_last_word %in% c('for', 'against')){
        topic_val = strapply(speaker_metadata,
                         "(.*)-(for|against)-[[:digit:]]{5}", c)[[1]][1]
        side_val = strapply(speaker_metadata,
                        "(.*)-(for|against)-[[:digit:]]{5}", c)[[1]][2]
        topic_and_side = paste(topic_val, '-', side_val, sep="")
    } else {
        topic_val = strapply(speaker_metadata,
                 "(.*)-[[:digit:]]{5}", c)[[1]][1]
        side_val = 'only'
        topic_and_side = topic_val
    }

    # read in file, add metadata, and fix time
    gaze_data = read.csv(speaker_file, 
                         stringsAsFactors=FALSE) %>%
        mutate(speaker = speaker_val) %>%
        mutate(topic = topic_val) %>%
        mutate(side = side_val) %>%
        mutate(time = round((Time - min(Time))/1000000, 
                            sampling_round_time))

    # rename old variables
    setnames(gaze_data,
             old=old_rename_columns, 
             new=new_rename_columns)

    # figure out how many samples we should have
    max_time = max(na.omit(gaze_data)$time)
    total_downsampled_samples = max_time * sampling_hz
    
    # clean up the data
    gaze_data = gaze_data %>% ungroup() %>%
    
        # keep fixations of real AOIs
        dplyr::filter(r_event=='Fixation') %>%
        dplyr::filter(r_aoi!='White Space' & r_aoi!='-') %>%

        # remove duplicate time slices
        group_by(time, speaker) %>%
        slice(1) %>% ungroup() %>%

        # drop unwanted columns
        select(one_of(speaker_keep_columns))
    
    # clean up the gaze data only if we have usable data
    gaze_datatable = as.data.table(gaze_data)
    if (nrow(gaze_datatable)>0){

        # create dataframe of all expected time slices
        setkey(gaze_datatable, speaker, time)
        time_slices = CJ(speaker_val,
                         seq(0.0,
                             max_time,
                             by=sampling_fraction)) %>%
            mutate(V2 = round(V2, sampling_round_time))

        # pad the dataframe with NAs as needed
        gaze_data = gaze_data %>% ungroup() %>%

            # merge with time segments
            full_join(., time_slices,
                              by=c('speaker'='V1',
                                   'time'='V2')) %>%
            dplyr::arrange(time) %>%
        
            # keep the speaker, topic, and side info
            mutate(speaker = speaker_val) %>%
            mutate(topic = topic_val) %>%
            mutate(side = side_val) %>%
            
            # replace all NA with 99
            mutate_all(funs(replace(., 
                                    is.na(.), 
                                    99)))
        
        # warn us if the row count doesn't match the expected number of slices
        if (nrow(gaze_data)!=(total_downsampled_samples+1)){
            print(paste("Speaker ", speaker_val,
                        ", Topic `", topic_val,
                        "`: Improper time slicing -- ERROR",
                    sep=""))
        }

        # save cleaned dataframe to file
        next_file_basename = paste0(speaker_val,'-',
                                    topic_and_side,
                                    '-gaze_df.csv')
        next_file_name = file.path(speaker_dataframe_path,
                                   next_file_basename)
        write.table(gaze_data, next_file_name,
                    append=FALSE, sep=",", row.names=FALSE)

        # append speaker gaze data to dataframe
        speaker_times = rbind.data.frame(speaker_times,
                                         gaze_data)
        
        # print update
        print(paste("Speaker ",speaker_val,
                    ", Topic `",topic_val,
                    "`: Data exported",
                    sep=""))
        
    } else {
        # print update
        print(paste("Speaker ",speaker_val,",",
                    " Topic `",topic_val,
                    "`: No data recorded -- ERROR",
                    sep=""))
        
        # save missing to dataframe
        all_missing_speakers = rbind.data.frame(
                    all_missing_speakers,
                    data.frame(speaker = speaker_val,
                               topic = topic_val,
                               side = side_val,
                               r_event = 'all_missing'))
    }
}

# save any completely missing speakers
missing_speaker_path = file.path(processed_data_path,
                                'speaker-missing_data.csv')
write.table(all_missing_speakers, missing_speaker_path,
            append=FALSE, sep=',', row.names=FALSE)

# write everyone else's data
speaker_time_path = file.path(speaker_dataframe_path,
                              'speaker-master-df.csv')
write.table(speaker_times, speaker_time_path,
            append=FALSE, sep=',', row.names=FALSE)
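
The padding step inside the speaker loop (building every expected time slice, joining it to the observed fixations, and filling the gaps) can be shown in isolation. This is a minimal base-R sketch, not the notebook's `CJ()` and `full_join()` pipeline. The 10 Hz rate (`example_hz`) and the AOI labels `A` and `B` are hypothetical. Unlike the loop above, it joins on an integer slice index rather than on rounded seconds, which sidesteps any floating-point mismatch between join keys.

```r
example_hz <- 10   # assumed sampling rate, for illustration only

# toy fixations that survived filtering: slices 0.2 and 0.3 are absent
observed <- data.frame(time = c(0, 0.1, 0.4, 0.5),
                       r_aoi = c("A", "A", "B", "A"),
                       stringsAsFactors = FALSE)

# convert seconds to an integer slice index
observed$slice <- as.integer(round(observed$time * example_hz))

# every slice we expect between the first and the last sample
expected <- data.frame(slice = 0:max(observed$slice))

# left join: expected slices without a fixation get NA
padded <- merge(expected, observed[, c("slice", "r_aoi")],
                by = "slice", all.x = TRUE)
padded$time <- padded$slice / example_hz

# recode the gaps with the missing-data code used in this notebook
padded$r_aoi[is.na(padded$r_aoi)] <- "99"

# one row per slice: (maximum time * rate) + 1
stopifnot(nrow(padded) == max(observed$slice) + 1)
print(padded)
```

The final `stopifnot()` plays the same role as the row-count check in the loop: if duplicates or gaps slipped through, the number of rows would not equal the number of expected slices.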

## Prepare listener gaze data

Here, we cycle through all of the listeners' data by
segment, perform some basic data checks, downsample,
and then export individual files for each segment.
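
The data checks in the loop below are nested, so the order in which a listener's segment can be rejected is easy to lose track of. Here is a flattened sketch of that order. The function `classify_listener_segment()` and its argument names are hypothetical. The 0.2-second tolerance and the reason labels are taken from the code in this notebook.

```r
# Hypothetical helper: return the first reason a segment is rejected,
# or "complete" if it passes every check
classify_listener_segment <- function(has_questionnaire,
                                      speaker_time,
                                      listener_time,
                                      n_opinion_rows,
                                      n_fixation_rows,
                                      tolerance = 0.2) {

    # 1. we need the listener's questionnaire file
    if (!has_questionnaire) return("missing_questionnaire")

    # 2. the trial must be roughly as long as the speaker's monologue
    if (abs(speaker_time - listener_time) >= tolerance) {
        return("unbalanced_listener")
    }

    # 3. we need the listener's opinion on this topic
    if (n_opinion_rows == 0) return("missing_opinion")

    # 4. at least one usable fixation must remain after filtering
    if (n_fixation_rows == 0) return("all_missing")

    "complete"
}

# toy calls with made-up durations (in seconds) and row counts
ok_segment <- classify_listener_segment(TRUE, 60, 60.1, 1, 500)
long_segment <- classify_listener_segment(TRUE, 60, 61, 1, 500)
print(c(ok_segment, long_segment))
```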

In [None]:
# create a folder, if it doesn't exist
listener_dataframe_path = file.path(processed_data_path,
                                    'listener-dataframes')
if (!dir.exists(listener_dataframe_path)){
    dir.create(listener_dataframe_path)
}

In [None]:
# grab all individual listener files
listener_gaze_files = list.files(listener_gaze_cleaned_path, 
                                 recursive=FALSE, 
                                 full.names=TRUE,
                                 pattern="\\.csv$")

In [None]:
# grab speaker times
speaker_master_df_path = file.path(speaker_dataframe_path,
                                   'speaker-master-df.csv')
speaker_df = read.table(speaker_master_df_path,
                        header=TRUE, row.names=NULL, sep=",")

In [None]:
# identify unique speakers and topics from df
speaker_segment_list = speaker_df %>%
    group_by(speaker, topic, side) %>%
    select(speaker, topic, side) %>%
    distinct()

In [None]:
# keep track of our missing data
missing_listener_data = data.frame()
missing_listener_file = file.path(processed_data_path,
                                  'listener-missing_data.csv')

In [None]:
# keep track of our unbalanced monologues
unbalanced_listeners = data.frame()
unbalanced_listener_file = file.path(processed_data_path,
                                     'listener-unbalanced_segments.csv')

In [None]:
# keep track of our listeners without opinions
missing_opinion_listeners = data.frame()
missing_opinion_file = file.path(processed_data_path,
                                 'listener-missing_opinions.csv')

In [None]:
# keep track of our listeners without questionnaires
missing_questionnaire_listeners = data.frame()
missing_questionnaire_file = file.path(processed_data_path,
                                       'listener-missing_questionnaires.csv')

In [None]:
# clean the listener data
number_segments = nrow(speaker_segment_list)
for (next_row in seq_len(number_segments)){
    
    # track how many complete listeners we have for each segment
    complete_listeners = 0
    
    # isolate the data for this topic
    next_segment = speaker_segment_list[next_row,]
    next_topic = next_segment$topic
    next_speaker = next_segment$speaker
    speaker_side = as.character(next_segment$side)
    next_topic_df = dplyr::filter(speaker_df,
                                  speaker==next_speaker,
                                  topic==next_topic,
                                  side==speaker_side)
    speaker_time = max(next_topic_df$time)
    total_downsampled_samples = speaker_time*sampling_hz

    # concatenate side and topic (if needed)
    if (speaker_side=='only'){
        topic_and_side = next_topic
    } else {
        topic_and_side = paste(next_topic, '-',
                               speaker_side, sep='')
    }
    
    # grab the files for a single topic
    listener_gaze = list.files(listener_gaze_cleaned_path,
                               recursive=FALSE, 
                               full.names=TRUE,
                               pattern=as.character(topic_and_side))
    
    # cycle through the listeners
    for (next_listener in listener_gaze){
        
        # only continue if we have their questionnaire data
        listener_ID = strapply(next_listener, "-([[:digit:]]*).csv", c)[[1]]
        listener_questionnaire_basename = paste0(listener_ID,
                                                 '_questionnaire.csv')
        listener_questionnaire_file = file.path(listener_responses_cleaned_path,
                                                listener_questionnaire_basename)
        if (file.exists(listener_questionnaire_file)){
            
            # find their gaze files
            listener_gaze_path = list.files(listener_gaze_cleaned_path,
                                           full.names=TRUE,
                                           pattern=paste('gaze_trial-',topic_and_side,
                                                         '.*',
                                                         as.character(listener_ID),
                                                         sep=''))
            
            # read in the file and downsample
            listener_gaze_data = read.csv(listener_gaze_path[1],
                                          stringsAsFactors=FALSE) %>%
                mutate(time = round((Time - min(Time))/1000000,
                                    sampling_round_time))
            
            # identify the participant's segment duration
            listener_segment_time = max(listener_gaze_data$time)
            total_listener_slices = listener_segment_time*sampling_hz
            
            # only move forward if we have opinion data on this topic
            listener_questionnaire = read.table(listener_questionnaire_file,
                                         sep=',', header=TRUE)
            listener_opinions = dplyr::filter(listener_questionnaire, 
                                              Topic==as.character(topic_and_side))
            
            # identify user demographics and survey data
            listener_gender = dplyr::filter(listener_questionnaire,
                                             Source=='gender')$Answer
            listener_natlang = set_concat_opinion(listener_questionnaire, 
                                                 'Answer',
                                                 'Source',
                                                 'nativelang')
            listener_age = dplyr::filter(listener_questionnaire,
                                             Source=='age')$Answer
            listener_agree = set_opinion(listener_opinions, 
                                           'Answer',
                                           'Source',
                                           'agree')
            
            # only continue if listeners' trials were roughly the same 
            if (abs(speaker_time - listener_segment_time)<.2) {
                if (nrow(listener_opinions)>0){

                    # rename old variables
                    setnames(listener_gaze_data,
                             old=old_rename_columns, 
                             new=new_rename_columns)

                    # clean up the data
                    listener_gaze_data = listener_gaze_data %>% ungroup() %>%
                    
                        # add demographics
                        select(one_of(listener_keep_columns)) %>%
                        mutate(topic = next_topic) %>%
                        mutate(listener = listener_ID) %>%
                        mutate(side = speaker_side) %>%

                        # keep fixations for real AOIs
                        dplyr::filter(r_event=='Fixation') %>%
                        dplyr::filter(r_aoi!='White Space' & r_aoi!='-') %>%

                        # remove duplicate time slices
                        group_by(time, listener) %>%
                        slice(1) %>% ungroup()
                    
                    # if we still have data, process it
                    if (nrow(listener_gaze_data)>0){
                                                
                        # pad the time variable
                        listener_datatable = as.data.table(listener_gaze_data)
                        setkey(listener_datatable, listener, time)
                        time_slices = CJ(listener_ID,
                                         seq(0,
                                             listener_segment_time,
                                             by=sampling_fraction)) %>%
                            mutate(V2 = round(V2, sampling_round_time))

                        # merge with gaze data
                        listener_gaze_data = listener_gaze_data %>% ungroup() %>%

                            # merge with time segments
                            full_join(., time_slices,
                                              by=c('listener'='V1',
                                                   'time'='V2')) %>%
                            arrange(time) %>%
                        
                            # add metadata
                            mutate(topic = next_topic) %>%
                            mutate(listener = listener_ID) %>%
                            mutate(side = speaker_side) %>%

                            # add survey responses
                            mutate(gender = listener_gender) %>%
                            mutate(age = listener_age) %>%
                            mutate(native_lang = listener_natlang) %>%
                            mutate(agree = listener_agree)
                                
                        # warn us if the row count doesn't match the expected number of slices
                        if (nrow(listener_gaze_data)!=(total_listener_slices+1)){
                            print(paste('`', topic_and_side, '`: Listener ', listener_ID,
                                        " Error -- improper time slicing", sep=""))
                        }

                        # save cleaned dataframe to file
                        outfile_basename = paste0('gaze_trial-',
                                                  topic_and_side,'-',
                                                  listener_ID,'.csv')
                        outfile_name = file.path(listener_dataframe_path,
                                                 outfile_basename)
                        write.table(listener_gaze_data, 
                                    outfile_name,
                                    append=FALSE, sep=',', row.names=FALSE)
                        
                        # get counts of all missing data
                        next_missing_data = listener_gaze_data %>% ungroup() %>%
                            select(one_of(listener_keep_columns)) %>%
                            group_by(r_event) %>%
                            summarize(counts = n())

                        # update our complete listener count
                        complete_listeners = complete_listeners + 1
                        
                    } else {
                    
                        # we only have blinking/missing event data
                        print(paste('`', topic_and_side, '`: Listener ', listener_ID,
                                    ' not added (entirely blink/missing event data)', 
                                    sep=""))
                        
                        # update the proportion missing
                        next_missing_data = data.frame(
                                            counts = total_downsampled_samples,
                                            r_event = 'all_missing')
                    }
                    
                    # add all missing data for listener to same dataframe
                    next_missing_data = next_missing_data %>% ungroup() %>%
                            mutate(listener = listener_ID) %>%
                            mutate(topic = next_topic) %>%
                            mutate(side = speaker_side) %>%
                            mutate(speaker_duration_samples = total_downsampled_samples) %>%
                            mutate(proportion = counts/speaker_duration_samples)
                    
                    # append individual missing data stats to group dataframe
                    missing_listener_data = rbind.data.frame(missing_listener_data,
                                                             next_missing_data)
                    
                } else {
                    # we don't have the data for the listener's opinion
                    print(paste('`',topic_and_side, '`: Listener ', listener_ID,
                                ' not added (opinion not found)', sep=""))
                    
                    # store the dataframe
                    next_missing_opinion = data.frame(listener = listener_ID,
                                                      topic = next_topic,
                                                      side = speaker_side)
                    missing_opinion_listeners = rbind.data.frame(missing_opinion_listeners,
                                                                 next_missing_opinion)
                }
            } else {
                # the listener's time differed from the monologue
                print(paste('`',topic_and_side,'`: Listener ', listener_ID,
                            ' not added (too short/long trial; discrepancy = ',
                            round(abs(speaker_time - max(listener_gaze_data$time)),
                                 2), ' sec)', sep=""))
                
                # store the dataframe
                next_unbalanced_listener = data.frame(listener = listener_ID,
                                                      topic = next_topic,
                                                      side = speaker_side)
                unbalanced_listeners = rbind.data.frame(unbalanced_listeners,
                                                        next_unbalanced_listener)
            }
        }else{ 
            # we can't find the questionnaire
            print(paste('`',topic_and_side,'`: Listener ', listener_ID,
                        ' not added (questionnaire not found)',sep=""))
            
            # store the dataframe
            next_missing_questionnaire = data.frame(listener = listener_ID,
                                                    topic = next_topic,
                                                    side = speaker_side)
            missing_questionnaire_listeners = rbind.data.frame(missing_questionnaire_listeners,
                                                               next_missing_questionnaire)
        }
    }
    
    # print update
    print(paste("`", topic_and_side, "`: ", complete_listeners,
                " complete listeners added", sep=""))
}

# save dataframes to file if the tables have rows
if (nrow(missing_listener_data)>0){
    write.table(missing_listener_data,
                missing_listener_file,
                append=FALSE, sep=',' ,row.names=FALSE)
}
if (nrow(unbalanced_listeners)>0){
    write.table(unbalanced_listeners,
                unbalanced_listener_file,
                append=FALSE, sep=',', row.names=FALSE)
}
if (nrow(missing_opinion_listeners)>0){
    write.table(missing_opinion_listeners,
                missing_opinion_file,
                append=FALSE, sep=',', row.names=FALSE)
}
if (nrow(missing_questionnaire_listeners)>0){
    write.table(missing_questionnaire_listeners,
                missing_questionnaire_file,
                append=FALSE, sep=',', row.names=FALSE)
}

## Remove unusable segments

In this section, we'll figure out which segments have
enough data to be considered for inclusion. Included
segments must be missing no more than 30% of
gaze samples for that segment and must have a
record of their post-listening opinion for that segment.
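
As a small worked example of the 30% rule, the sketch below computes the proportion of missing samples for two toy segments and flags the one that exceeds the threshold. The helper `proportion_missing()` and the toy data are hypothetical. In the notebook itself, the counts come from the padded listener dataframes and the denominator is the speaker's segment length in samples.

```r
# Hypothetical helper: share of expected samples with no usable fixation
proportion_missing <- function(r_event, expected_samples) {
    sum(is.na(r_event)) / expected_samples
}

missing_threshold <- 0.3   # threshold stated in the text above

# toy padded segments: NA marks a time slice without a usable fixation
segment_a <- c("Fixation", NA, "Fixation", "Fixation", NA,
               "Fixation", "Fixation", "Fixation", "Fixation", "Fixation")
segment_b <- c("Fixation", NA, NA, "Fixation", NA,
               NA, "Fixation", "Fixation", "Fixation", "Fixation")

prop_a <- proportion_missing(segment_a, length(segment_a))   # 2 of 10
prop_b <- proportion_missing(segment_b, length(segment_b))   # 4 of 10

# a segment is usable if it is missing no more than 30% of its samples
usable <- c(a = prop_a <= missing_threshold,
            b = prop_b <= missing_threshold)
print(usable)
```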

In [None]:
# create structure for unusable segments
unusable_segment_df = data.frame()

### Identify segments with missing opinions, unbalanced time, or missing questionnaires

In [None]:
if (file.exists(missing_opinion_file)){

    # load in the missing opinion table
    missing_opinion = read.table(missing_opinion_file,
                                 sep=',',
                                 header=TRUE) %>%
        mutate(reason = 'missing_opinion')
    
    # append to dataframe
    unusable_segment_df = rbind.data.frame(unusable_segment_df,
                                           missing_opinion)
}

In [None]:
if (file.exists(missing_questionnaire_file)){

    # load in the missing questionnaire table
    missing_questionnaires = read.table(missing_questionnaire_file,
                                        sep=',',
                                        header=TRUE) %>%
        mutate(reason = 'missing_questionnaire')
    
    # append to dataframe
    unusable_segment_df = rbind.data.frame(unusable_segment_df,
                                           missing_questionnaires)
}

In [None]:
if (file.exists(unbalanced_listener_file)){

    # load in the unbalanced segments table
    unbalanced_listeners = read.table(unbalanced_listener_file,
                                      sep=',',
                                      header=TRUE) %>%
        mutate(reason = 'unbalanced_listener')
    
    # append to dataframe
    unusable_segment_df = rbind.data.frame(unusable_segment_df,
                                           unbalanced_listeners)
}

### Identify segments with more than 30% missing data

In [None]:
# load in the missing data table
missing_data = read.table(missing_listener_file,
                          sep=',',
                          header=TRUE) %>%

    # grab segments with more than 30% missing
    dplyr::filter((is.na(r_event) & proportion>.3) |
                  r_event=='all_missing') %>%

    # remove unneeded columns and add reason
    select(listener, topic, side) %>%
    mutate(reason = 'excessive_missing')

In [None]:
# add to list
unusable_segment_df = rbind.data.frame(unusable_segment_df,
                                       missing_data)

### Remove unusable files from directories

In [None]:
# figure out which files we'll be removing
unusable_segment_df = unusable_segment_df %>% ungroup() %>%

    # identify base path and file name
    mutate(segment_path = listener_dataframe_path) %>%
    mutate(segment_name = ifelse(side=='only',
                                 paste0('gaze_trial-',
                                        topic,
                                        '-',
                                        listener,
                                        '.csv'),
                                 paste0('gaze_trial-',
                                        topic,
                                        '-',
                                        side,
                                        '-',
                                        listener,
                                        '.csv')
                                )) %>%

    # construct full path
    mutate(segment_file = file.path(segment_path,
                                    segment_name)) %>%

    # drop now-unneeded variables
    select(-segment_path, -segment_name)

In [None]:
# check that the segments actually existed before deleting
delete_segments = unusable_segment_df$segment_file
delete_segments = delete_segments[file.exists(delete_segments)]

In [None]:
# remove them (suppressing the printed return value)
invisible(file.remove(delete_segments))

### Save list of unusable segments

In [None]:
# save to file
write.table(unusable_segment_df,
            file.path(processed_data_path,
                      'listener-unusable_segments.csv'),
            append=FALSE,sep=',',row.names=FALSE)

***

# Cross-recurrence quantification analysis

## Calculate cross-recurrence between speakers and listeners

This section reads in the files created in the previous section and 
performs CRQA over each speaker-listener pair. The results are then 
saved to a dataframe for future use in statistical analyses.
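
For readers unfamiliar with what the analysis below produces, a diagonal recurrence profile for categorical data can be thought of as the rate at which two series show the same category at each relative lag. Below is a simplified base-R sketch of that idea. It is not the `crqa` package's implementation, and `drpdfromts()` may differ in details such as normalization and lag sign convention. The function `lagged_match_profile()` and the toy series are hypothetical.

```r
# Hypothetical helper: proportion of matching categories at each lag,
# comparing x[t] with y[t + lag]
lagged_match_profile <- function(x, y, max_lag) {
    stopifnot(length(x) == length(y), max_lag < length(x))
    n <- length(x)
    lags <- -max_lag:max_lag
    match_rate <- vapply(lags, function(lag) {
        if (lag >= 0) {
            mean(x[1:(n - lag)] == y[(1 + lag):n])
        } else {
            mean(x[(1 - lag):n] == y[1:(n + lag)])
        }
    }, numeric(1))
    data.frame(lag = lags, match_rate = match_rate)
}

# toy AOI codes: y repeats x one sample later (9 is a filler code)
x <- c(1, 2, 3, 1, 2, 3, 1, 2)
y <- c(9, 1, 2, 3, 1, 2, 3, 1)

profile <- lagged_match_profile(x, y, max_lag = 2)
peak_lag <- profile$lag[which.max(profile$match_rate)]
print(profile)
print(peak_lag)
```

In the loop below, missing speaker samples are coded as `99` and missing listener samples as `98`, so two missing samples can never count as a match.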

In [None]:
# if we don't have a CRQA results folder, make it
crqa_results_path = file.path(processed_data_path,
                             'crqa-data-results')
if (!dir.exists(crqa_results_path)){
    dir.create(crqa_results_path)
}

In [None]:
# create empty data frame
crqa_frame = data.frame()

In [None]:
# identify all the speaker files
speaker_gaze_files = list.files(speaker_dataframe_path, 
                                recursive=FALSE,
                                full.names=TRUE,
                                pattern="\\d+.*\\.csv")

In [None]:
# perform CRQA over every speaker-listener pair
for (next_speaker_gaze in speaker_gaze_files){
    
    # read speaker file
    speaker_data = read.csv(next_speaker_gaze,
                      sep=',',
                      stringsAsFactors=FALSE,
                      header=TRUE,
                      skipNul = TRUE)
    
    # grab topic and ID
    speaker_ID = unique(speaker_data$speaker)
    speaker_topic = unique(speaker_data$topic)
    speaker_side = unique(speaker_data$side)
    
    # create a temporary data frame for this one topic
    temp_crqa_frame = data.frame()

    # keep only the variables we need for CRQA
    speaker_data = speaker_data %>%
        select(one_of(crqa_compare_columns)) %>%
        dplyr::rename(speaker_aoi = r_aoi)
    
    # identify speaker's topic and side for file name
    if (speaker_side == 'only') {
        topic_and_side = speaker_topic        
    } else {
        topic_and_side = paste(speaker_topic,
                               '-',
                               speaker_side,
                               sep='')
    }
    
    # get the files for this topic and side
    listener_gaze_files = list.files(listener_dataframe_path, 
                                     pattern = (topic_and_side), 
                                     full.names = TRUE)
    
    # cycle through target listener gaze files
    for (next_listener in listener_gaze_files){

        # load in the next listener's data
        listener_data = read.csv(next_listener,
                                 stringsAsFactors=FALSE, 
                                 sep=',', 
                                 header=TRUE, 
                                 skipNul = TRUE)
        
        # grab listener questionnaire info 
        listener_questionnaire = listener_data %>%
            select(one_of(crqa_questionnaire_columns)) %>%
            distinct()
            
        # keep only the listener bits we need
        listener_data = listener_data %>%
            select(one_of(crqa_compare_columns)) %>%
            dplyr::rename(listener_aoi = r_aoi)
        
        # merge with speaker data and refactor missing
        both_data = full_join(speaker_data,
                              listener_data,
                              by='time') %>%
            mutate(listener_aoi = ifelse(is.na(listener_aoi),
                                         98,
                                         listener_aoi)) %>%
            mutate(speaker_aoi = ifelse(is.na(speaker_aoi),
                                        99,
                                        speaker_aoi))
        
        # run categorical CRQA
        gaze_drp = drpdfromts(both_data$speaker_aoi,
                              both_data$listener_aoi,
                              ws=win_size,
                              datatype="categorical")

        # correct maxlag and maxrec if missing data
        if (length(gaze_drp$maxlag)==0){
            
            # grab the maximum rec and lag that aren't missing
            crqa_remaining = gaze_drp$profile[!is.na(gaze_drp$profile)]
            gaze_drp$maxrec = max(crqa_remaining)
            gaze_drp$maxlag = match(gaze_drp$maxrec,
                                    gaze_drp$profile)
        }

        # save to dataframes
        next_crqa = data.frame('speaker' = speaker_ID,
                               'listener' = listener_questionnaire$listener,
                               'topic' = speaker_topic,
                               'side' = speaker_side,
                               'topic_and_side' = topic_and_side,
                               't' = -win_size:win_size, 
                               'r' = gaze_drp$profile,
                               'maxrec' = gaze_drp$maxrec,
                               'maxlag' = gaze_drp$maxlag[1], 
                               'agree' = listener_questionnaire$agree,
                               'gender' = listener_questionnaire$gender,
                               'native_lang' = listener_questionnaire$native_lang,
                               'age' = listener_questionnaire$age,
                               row.names=NULL)
        temp_crqa_frame = rbind.data.frame(temp_crqa_frame,
                                           next_crqa)
    }
    
    # create filename for this topic's dataframe
    temp_out_filename = paste0('crqa_data-',
                               topic_and_side,'.csv')
    
    # save this topic's dataframe
    write.table(temp_crqa_frame,
                file.path(crqa_results_path,
                          temp_out_filename),
                append=FALSE,
                sep=',',
                row.names=FALSE)
    
    # append this topic's dataframe to master dataframe
    crqa_frame = rbind.data.frame(crqa_frame,
                                  temp_crqa_frame)
    
    # print update
    print(paste("`", topic_and_side, "`: CRQA complete", sep=""))
}

# save master CRQA frame
write.table(crqa_frame,
            file.path(crqa_results_path,'crqa_data-all_topics.csv'),
            append=FALSE, sep=',', row.names=FALSE)

***

# Create baseline data

In this section, we'll create the surrogate baselines
to establish the degree to which we would expect gaze
coordination to occur by chance.
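
As a rough illustration of the logic behind a shuffled surrogate, the sketch below uses made-up AOI codes rather than study data. It computes only a simple lag-0 match rate instead of the full diagonal recurrence profile, and `lag0_recurrence` is a hypothetical helper introduced for this example.

```r
# toy illustration with made-up AOI codes (not study data)
set.seed(111)

# hypothetical speaker gaze series: a repeating 1-2-3 pattern of AOI codes
toy_speaker_aoi = rep(c(1, 2, 3), times = 40)

# hypothetical listener who follows the speaker exactly
toy_listener_aoi = toy_speaker_aoi

# lag-0 recurrence: proportion of samples on which both look at the same AOI
lag0_recurrence = function(a, b) mean(a == b)

# real pairing: temporal structure intact
real_rec = lag0_recurrence(toy_speaker_aoi, toy_listener_aoi)

# surrogate pairing: same listener samples, shuffled in time
toy_shuffled_listener = sample(toy_listener_aoi)
surrogate_rec = lag0_recurrence(toy_speaker_aoi, toy_shuffled_listener)

# shuffling keeps how often each AOI occurs but breaks the temporal order
same_samples = identical(sort(toy_listener_aoi), sort(toy_shuffled_listener))
```

Because shuffling preserves how often each AOI is fixated, any recurrence left in the surrogate reflects overall looking preferences rather than moment-to-moment coordination.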

In [None]:
# create a new, timestamped baseline folder for this run
new_baseline_dirname = paste0('crqa-baseline-',
                               as.numeric(Sys.time()[1]))
new_baseline_path = file.path(processed_data_path,
                              new_baseline_dirname)
if (!dir.exists(new_baseline_path)){
    dir.create(new_baseline_path)
}

In [None]:
# identify all the speaker files
speaker_gaze_files = list.files(speaker_dataframe_path, 
                                recursive=FALSE,
                                full.names=TRUE,
                                pattern="\\d+.*\\.csv")

In [None]:
# create empty data frame
crqa_baseline = data.frame()

In [None]:
# perform CRQA over surrogate data
for (next_speaker_gaze in speaker_gaze_files){
    
    # set seed for reproducibility
    set.seed(111)
    
    # read speaker file
    speaker_data = read.csv(next_speaker_gaze,
                            sep=',',
                            stringsAsFactors=FALSE,
                            header=TRUE,
                            skipNul = TRUE)

    # grab topic and ID
    speaker_ID = unique(speaker_data$speaker)
    speaker_topic = unique(speaker_data$topic)
    speaker_side = unique(speaker_data$side)
    
    # create a temporary data frame for this one topic
    temp_crqa_baseline = data.frame()

    # keep only the variables we need for CRQA
    speaker_data = speaker_data %>%
        select(one_of(crqa_compare_columns)) %>%
        dplyr::rename(speaker_aoi = r_aoi)
    
    # identify speaker's topic and side for file name
    if (speaker_side == 'only') {
        topic_and_side = speaker_topic        
    } else {
        topic_and_side = paste(speaker_topic,
                               '-',
                               speaker_side,
                               sep='')
    }
    
    # get the files for this topic and side
    listener_gaze_files = list.files(listener_dataframe_path, 
                                     pattern=(topic_and_side),
                                     full.names=TRUE)
    
    # re-create surrogate dyad and run again
    for (next_listener in listener_gaze_files){
        
        # load in the next listener's data
        listener_data = read.csv(next_listener,
                                 stringsAsFactors=FALSE, 
                                 sep=',', 
                                 header=TRUE, 
                                 skipNul = TRUE)

        # grab listener questionnaire info 
        listener_questionnaire = listener_data %>%
            select(one_of(crqa_questionnaire_columns)) %>%
            distinct()

        # keep only the listener bits we need
        listener_data = listener_data %>%
            select(one_of(crqa_compare_columns)) %>%
            dplyr::rename(listener_aoi = r_aoi)
        
        # merge with speaker data and recode missing samples; the distinct
        # codes (98 vs. 99) ensure missing samples never count as recurrent
        both_data = full_join(speaker_data,
                              listener_data,
                              by='time') %>%
            mutate(listener_aoi = ifelse(is.na(listener_aoi),
                                         98,
                                         listener_aoi)) %>%
            mutate(speaker_aoi = ifelse(is.na(speaker_aoi),
                                        99,
                                        speaker_aoi))

        # shuffle the dyad multiple times
        for (run in 1:10){
            
            # shuffle listener's data
            both_data = both_data %>%
                mutate(listener_aoi = sample(listener_aoi))

            # run categorical CRQA
            gaze_drp = drpdfromts(both_data$speaker_aoi,
                                  both_data$listener_aoi,
                                  ws=win_size,
                                  datatype="categorical")

            # correct maxlag and maxrec if missing data
            if (length(gaze_drp$maxlag)==0){

                # grab the maximum rec and lag that aren't missing
                crqa_remaining = gaze_drp$profile[!is.na(gaze_drp$profile)]
                gaze_drp$maxrec = max(crqa_remaining)
                gaze_drp$maxlag = match(gaze_drp$maxrec,
                                        gaze_drp$profile)
            }

            # save to dataframes
            next_baseline = data.frame('speaker' = speaker_ID,
                                       'listener' = listener_questionnaire$listener,
                                       'topic_and_side' = topic_and_side,
                                       't' = -win_size:win_size, 
                                       'r' = gaze_drp$profile,
                                       'maxrec' = gaze_drp$maxrec,
                                       'maxlag' = gaze_drp$maxlag[1], 
                                       'agree' = listener_questionnaire$agree,
                                       'gender' = listener_questionnaire$gender,
                                       'native_lang' = listener_questionnaire$native_lang,
                                       'age' = listener_questionnaire$age,
                                       'data' = 'baseline',
                                       'run' = run,
                                       row.names=NULL)
            temp_crqa_baseline = rbind.data.frame(temp_crqa_baseline,
                                                  next_baseline)
        }
    }
    
    # create filename for this topic's dataframe
    temp_out_filename = paste0('crqa_baseline_data-',
                               topic_and_side,'.csv')
    
    # save this topic's dataframe
    write.table(temp_crqa_baseline,
                file.path(new_baseline_path,
                          temp_out_filename),
                append=FALSE,
                sep=',',
                row.names=FALSE)
    
    # append this topic's dataframe to master dataframe
    crqa_baseline = rbind.data.frame(crqa_baseline,
                                     temp_crqa_baseline)
    
    # print update
    print(paste("`", topic_and_side, "`: Surrogate complete", sep=""))
}

# save master baseline frame
final_out_file = file.path(new_baseline_path,
                           'crqa_baseline_data-all_topics.csv')
write.table(crqa_baseline,
            final_out_file,
            append=FALSE, sep=',', row.names=FALSE)

***

# Create analysis and plotting dataframes

In this section, we'll bind the real and baseline data, recode the
predictors, and create first- and second-order orthogonal polynomial
variables for lag (`t`). We'll then create separate dataframes to
analyze and plot the data. The *analysis dataframe* will include all
main terms and manually generated interaction terms between the
target terms, all of which will then be centered and standardized so
that we can interpret model estimates as effect sizes. The *plotting
dataframe* will contain only the main terms and will be neither
centered nor standardized.
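
To show what "orthogonal" means for the lag terms, here is a minimal sketch using base R's `poly()` on a toy lag window. The window size and all `toy_*` names are hypothetical and chosen only for illustration.

```r
# hypothetical window size, for illustration only
toy_win_size = 5
toy_lags = -toy_win_size:toy_win_size

# first- and second-order orthogonal polynomials over the lags
toy_poly = poly(toy_lags, 2)

# each column sums to (approximately) zero ...
col_sums = colSums(toy_poly)

# ... and the cross-product matrix is (approximately) the identity,
# i.e., the linear and quadratic terms are uncorrelated with each other
cross_products = crossprod(toy_poly)
```

Because the two columns are uncorrelated, the linear and quadratic effects of lag can be estimated without the collinearity that raw `t` and `t^2` would introduce.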

## Bind real and baseline data

In [None]:
# specify CRQA results path
crqa_results_path = file.path(processed_data_path,
                             'crqa-data-results')

In [None]:
# get most recent baseline directory (the last one listed)
processed_directories = list.dirs(processed_data_path)
all_baseline_directories = processed_directories[
        !is.na(str_match(processed_directories,'baseline'))]
target_baseline_path = tail(all_baseline_directories, n=1)

In [None]:
# read in real data
real_crqa_frame = read.table(file.path(crqa_results_path,
                                       'crqa_data-all_topics.csv'),
                             sep=',',
                             header=TRUE,
                             skipNul=TRUE) %>%
    mutate(data = 'real') %>%
    select(-topic, -side)

In [None]:
# read in baseline data
baseline_crqa_frame = read.table(file.path(target_baseline_path,
                                           'crqa_baseline_data-all_topics.csv'),
                                 sep=',',
                                 header=TRUE,
                                 skipNul=TRUE) %>%
    select(-run)

In [None]:
# create a joint dataframe
crqa_frame = rbind.data.frame(real_crqa_frame,
                              baseline_crqa_frame)

## Convert text to numeric
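
One caveat when converting text columns: `as.numeric()` behaves differently on factors and on character vectors, and whether `read.table()` returns factors depends on `stringsAsFactors` (whose default changed in R 4.0.0). The sketch below uses made-up labels to show the difference; the `toy_*` names are hypothetical.

```r
# made-up labels, for illustration only
toy_labels = c("topic_b", "topic_a", "topic_b")

# on a factor, as.numeric() returns the integer level codes
# (levels are sorted alphabetically by default)
codes_from_factor = as.numeric(factor(toy_labels))

# on a character vector of non-numeric text, as.numeric() returns NA
# (with a warning, suppressed here)
codes_from_character = suppressWarnings(as.numeric(toy_labels))
```

Converting through a factor first gives the same level codes regardless of how the column was read in.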

In [None]:
# create numerical versions of most text variables
crqa_frame = crqa_frame %>% ungroup() %>%
    
    # convert "agree" to binary (1=agree, 0=disagree)
    mutate(agree = as.numeric(agree)) %>%
    mutate(agree = (agree > 2) * 1) %>%

    # identify whether the topic is dominant- (0) or mixed-view (1)
    mutate(viewtype = ifelse(topic_and_side %in% dominant_view_topics,
                             0,
                             1)) %>%
    
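    # convert topic and side to number (via factor, so that this also
    # works when the column was read in as character rather than factor)
    mutate(topic_and_side = as.numeric(as.factor(topic_and_side))) %>%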
    
    # assign numbers to "gender" (alphabetical)
    mutate(gender = ifelse(gender=='Female',
                           0,
                           ifelse(gender=='Male',
                                  1,
                                  2))) %>%

    # assign number to "native_lang" (alphabetical)
    mutate(native_lang = ifelse(native_lang=='English',
                                0,
                                ifelse(native_lang=='Spanish',
                                       1,
                                       2))) %>%

    # set baseline to 0 and real data to 1
    mutate(data = ifelse(data=='real',
                         1,
                         0))

## Center dummy-coded variables
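
As a small illustration of the two kinds of centering used in this step, the sketch below applies them to made-up values. The `toy_*` names are hypothetical and not part of the study data.

```r
# made-up binary predictor (1 = agree, 0 = disagree)
toy_agree = c(1, 0, 0, 1, 1)

# contrast-coding: subtract .5 so the two levels sit at +.5 and -.5
toy_agree_contrast = toy_agree - .5

# made-up multi-level code (e.g., 0, 1, 2)
toy_gender = c(0, 1, 2, 0, 1)

# mean-centering: subtract the sample mean, leaving the scale unchanged
toy_gender_centered = as.numeric(scale(toy_gender,
                                       center = TRUE,
                                       scale = FALSE))
```

Note that subtracting .5 yields a mean of exactly zero only when the two levels are equally frequent, whereas mean-centering always does.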

In [None]:
crqa_frame = crqa_frame %>% ungroup() %>%
    
    # contrast-code the binary variables
    mutate(agree = agree - .5) %>% # agree=.5, disagree=-.5
    mutate(viewtype = viewtype - .5) %>% # mixed=.5, dominant=-.5
    mutate(data = data - .5) %>% # real=.5, baseline=-.5

    # center gender and language
    mutate(gender = as.numeric(scale(gender,
                                     center=TRUE,
                                     scale=FALSE))) %>%
    mutate(native_lang = as.numeric(scale(native_lang,
                                          center=TRUE,
                                          scale=FALSE)))

## Remove NAs

In [None]:
crqa_frame = na.omit(crqa_frame)

## Create polynomial time variables

In [None]:
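# create a polynomial term for time variable:
# shift t from -win_size:win_size to 1:(2*win_size+1) so it can index rows
crqa_frame$t = crqa_frame$t + win_size + 1
# first- and second-order orthogonal polynomials over the unique lags
poly_t = poly((unique(crqa_frame$t)),
              2)
# look up each row's polynomial values by its shifted lag
crqa_frame[, paste("ot",1:2,sep = "")] = poly_t[crqa_frame$t,
                                                1:2]
# shift t back to its original range
crqa_frame$t = crqa_frame$t - win_size - 1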

In [None]:
# shift ot1 and ot2 by a constant so that neither contains exact 0s
# (min() is negative here, so each shift equals 1 - |min|)
crqa_frame = crqa_frame %>% ungroup() %>%
    mutate(ot1 = ot1 + min(ot1) + 1) %>%
    mutate(ot2 = ot2 + min(ot2) + 1)

## Save the dataset

In [None]:
# create a folder for analysis-ready data, if it doesn't exist
if (!dir.exists(analysis_data_path)){
    dir.create(analysis_data_path)
}

In [None]:
write.table(crqa_frame,
            file.path(analysis_data_path,
                      'oag-plotting_df.csv'),
            append=FALSE,
            sep=',',
            row.names=FALSE)