# Dynamics of Explanation Project

## Stage 3: Data Preparation

This notebook is part of the "Dynamics of Explanation" project and prepares data for future analysis.

To run this notebook, you will need the following files:

* **`./supplementary-code/libraries_and_functions-dyn_exp.r`**: Loads in necessary libraries and creates new functions for our analyses.
* **`./supplementary-code/TASA.rda`**: TASA semantic space used by the `LSAfun` package (see Günther et al., 2015, *Behavior Research Methods*).
* **`./data/`**:  Files with participant data. *Due to ethical considerations relating to participant privacy, no participant data may be shared at this time.*
* **`global-warming-transcript-clean.csv`**: Analysis-ready transcript from the stimulus video, ["How Global Warming Works in Under 5 Minutes"](http://www.howglobalwarmingworks.org/) (Ranney, Lamprey, Reinholz, Le, Ranney, & Goldwasser, 2013).

**Table of Contents:**
1. [Preliminaries](#Preliminaries). Reads in all necessary libraries and custom functions.
1. [Prepare gaze data](#Prepare-gaze-data). Concatenate and trim gaze data from both "watch" and "explanation" phases.
1. [Prepare transcript data](#Prepare-transcript-data). Concatenate participants' explanation transcripts.
1. [Identify participants](#Identify-participants). Identify participants who appear in both gaze phases and have complete transcripts.
1. [Calculate LSA for stimulus and participant transcripts](#Calculate-LSA-for-stimulus-and-participant-transcripts). Computes similarity of participants' explanations with the stimulus in multidimensional LSA space (created with TASA corpus).
1. [Recode factors in gaze data](#Recode-factors-in-gaze-data). Winnow down the participants' gaze data to our target variables and recode them to numeric.
1. [Recode questionnaire responses](#Recode-questionnaire-responses). Winnow down the participants' survey responses and recode them to numeric.
1. [Create merged analysis dataframe](#Create-merged-analysis-dataframe). Combine all relevant data into a single dataframe for analysis in the next Jupyter notebook.

**Written by**: A. Paxton (University of California, Berkeley)   
**Date last modified**: 2 August 2016

***

# Preliminaries

This section clears the workspace and reads in all necessary libraries and custom functions.

[To top.](#Dynamics-of-Explanation-Project)

In [1]:
# clear our workspace
rm(list=ls())

In [2]:
# set initial working directory
setwd('./')

In [3]:
# import the source file with all our working directories and custom functions
source('./supplementary-code/libraries_and_functions-dyn_exp.r')

Loading required package: Matrix
: package ‘Matrix’ was built under R version 3.2.4
Loading required package: tseriesChaos
Loading required package: deSolve
: package ‘deSolve’ was built under R version 3.2.4
Attaching package: ‘deSolve’

The following object is masked from ‘package:graphics’:

    matplot

Loading required package: fields
: package ‘fields’ was built under R version 3.2.5
Loading required package: spam
Loading required package: grid
Spam version 1.3-0 (2015-10-24) is loaded.
Type 'help( Spam)' or 'demo( spam)' for a short introduction 
and overview of this package.
Help for individual functions is also obtained by adding the
suffix '.spam' to the function name, e.g. 'help( chol.spam)'.

Attaching package: ‘spam’

The following objects are masked from ‘package:base’:

    backsolve, forwardsolve

Loading required package: maps
: package ‘maps’ was built under R version 3.2.5
Loading required package: plot3D
Loading required package: pracma
: package ‘pracma’ was built und

***

# Prepare gaze data

This section imports the gaze files from both the "watch" and "explanation" phase, concatenating all of the individual participants' data into a single dataframe. 

We're currently having problems with stimulus associations, so this section makes no assumptions about where we'll find the data for each stage.

***

In [None]:
# find all possible files
gaze_files = list.files('./data/gaze_files_clean/',pattern='\\.txt$',recursive=TRUE,full.names=TRUE)

In [None]:
# create an empty dataframe for the watch data
watch_df = data.frame(matrix(vector(), 
                             0, 
                             length(gaze_columns_to_keep)+1,
                             dimnames=list(c(), c('participant',gaze_columns_to_keep))))

In [None]:
# create an empty dataframe for the explain data
exp_df = data.frame(matrix(vector(), 
                           0, 
                           length(gaze_columns_to_keep)+1,
                           dimnames=list(c(), c('participant',gaze_columns_to_keep))))

In [None]:
# create an empty dataframe for the "thank you" data
thanks_df = data.frame(matrix(vector(), 
                           0, 
                           length(gaze_columns_to_keep)+1,
                           dimnames=list(c(), c('participant',gaze_columns_to_keep))))

In [None]:
# create an empty dataframe for problem files, too
error_df = data.frame(participant = numeric(),
                      gaze_file = character(),
                      actual_stimulus = character())

In [None]:
# concatenate all participant gaze data 
for (next_file in gaze_files){
    
    # identify participant ID
    participant = strtoi(str_extract(next_file, "(?<=DE)[0-9]{1,3}"))
    
    # read in participant's file
    new_gaze_df = import_gaze_files(next_file)
        
    # strip out everything that's not a sample
    new_gaze_df = new_gaze_df %>%
        dplyr::filter(type=="SMP")
    
    # identify the stimulus/stimuli associated with the file
    actual_stimulus = gsub(" ",'_',unique(new_gaze_df$stimulus))
    
    # attach gaze data to the appropriate dataframes
    if (sum(!is.na(str_match(string=unique(actual_stimulus),pattern=("(-)|(.avi)"))))){    
       
        # if it's a watch trial, add participant's ID and gaze data to overall df
        watch_df = bind_rows(watch_df,
                             data.frame(participant,new_gaze_df))

    } else if (sum(!is.na(str_match(string=unique(actual_stimulus),pattern=("xplanation"))))) {
        
        # if it's an explanation trial, add participant's ID and gaze data to overall df
        exp_df = bind_rows(exp_df,
                           data.frame(participant,new_gaze_df))

    } else if (sum(!is.na(str_match(string=unique(actual_stimulus),pattern=("rtf"))))) {
        
        # if it's the thank-you screen, add participant's ID and gaze data to overall df
        thanks_df = bind_rows(thanks_df,
                              data.frame(participant,new_gaze_df))    
        
    } else {

        # if we can't figure out where it belongs, add it to the error
        actual_stimulus = paste(actual_stimulus[nzchar(actual_stimulus)],collapse='.')
        gaze_file = next_file
        error_df = bind_rows(error_df,
                             data.frame(participant,gaze_file,actual_stimulus))
    }
}
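
The sorting logic in the loop above can be tried out on its own. The sketch below uses only base R and made-up file paths and stimulus labels (the real naming scheme may differ); `classify_stimulus` is a hypothetical helper that simply reuses the loop's patterns and is not part of the project's source file.

```r
# Hypothetical file paths; the real naming scheme may differ.
example_files = c("./data/gaze_files_clean/DE007-trial1.txt",
                  "./data/gaze_files_clean/DE112-trial2.txt")

# Pull out the 1-3 digits that follow "DE" (the lookbehind requires perl = TRUE).
id_strings = regmatches(example_files,
                        regexpr("(?<=DE)[0-9]{1,3}", example_files, perl = TRUE))
participant_ids = strtoi(id_strings, base = 10L)

# Sort a stimulus label into a phase, using the same patterns as the loop.
classify_stimulus = function(stimulus) {
    if (grepl("(-)|(.avi)", stimulus)) {
        "watch"
    } else if (grepl("xplanation", stimulus)) {
        "explanation"
    } else if (grepl("rtf", stimulus)) {
        "thanks"
    } else {
        "error"
    }
}

# Hypothetical stimulus labels, one per branch.
example_stimuli = c("video.avi", "Explanation_screen", "thankyou.rtf", "unknown")
phases = vapply(example_stimuli, classify_stimulus, character(1), USE.NAMES = FALSE)
```

Anything that matches none of the patterns falls through to the error branch, which is why no assumption about file order or naming is needed.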

## Quick peek at the gaze trial lengths

Let's make sure that our data look as expected. The video from the "watch" phase lasted approximately 4 minutes and 45 seconds, so each participant should have approximately 4.75 minutes of data in their `watch` trial. All participants were then asked to synthesize that information in the 2-minute "explanation" phase, so we expect to see that the `exp` trials last about 2 minutes.

In [26]:
# compute the length of each participant's "watch" phase (in minutes)
watch_breakdown = watch_df %>%
    group_by(participant) %>%
    summarize(trial_time = max(r_mins) - min(r_mins))

In [12]:
# create plot for the "watch" data
watch_gaze_hist = qplot(watch_breakdown$trial_time, geom='histogram', bins=30) +
  geom_histogram(aes(fill = ..count..),bins=30) +
  xlab('Time in Trial (min)') + ylab('Frequency') +
  ggtitle('Length of "Watch" Phase\nfor All Participants') +
  labs(fill="Freq.")

# save it
ggsave(plot = watch_gaze_hist,
       height = 3,
       width = 3,
       filename = './figures/gaze_watch_time-dyn_exp.jpg')

<div style="width:300px; height:200px">
![](figures/gaze_watch_time-dyn_exp.jpg)
</div>
**Figure**. Histogram of the length of time (in minutes) of participants' watch phases.

We see that the majority of participants did spend the anticipated 4.75 minutes to watch the video, but known problems in the data export process appear to have shifted the labeling of some of the early participants. We'll therefore be sure to cut out any participants whose labeled "watch" phases lasted more than 5 minutes.
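
As a minimal sketch of that exclusion rule, the toy example below computes each participant's trial duration and keeps only those under 5 minutes. The data values are invented for illustration, and base R's `aggregate` stands in for the `dplyr` pipeline used in this notebook.

```r
# Toy gaze samples (hypothetical values): running time in minutes per participant.
toy_watch = data.frame(participant = c(1, 1, 1, 2, 2, 2),
                       r_mins = c(0.0, 2.5, 4.75, 0.0, 3.0, 6.2))

# Trial duration = last timestamp minus first timestamp, per participant.
durations = aggregate(r_mins ~ participant, data = toy_watch,
                      FUN = function(x) max(x) - min(x))
names(durations)[2] = "trial_time"

# Keep only participants whose labeled "watch" phase is under 5 minutes.
keep_ids = durations$participant[durations$trial_time < 5]
toy_watch_clean = toy_watch[toy_watch$participant %in% keep_ids, ]
```

Here participant 2's labeled phase spans 6.2 minutes, so only participant 1's rows remain.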

In [32]:
# compute the length of each participant's "explanation" phase (in minutes)
exp_breakdown = exp_df %>%
    group_by(participant) %>%
    summarize(trial_time = max(r_mins) - min(r_mins))

In [33]:
# create plot for the "explanation" data
exp_gaze_hist = qplot(exp_breakdown$trial_time, geom='histogram', bins=30) +
  geom_histogram(aes(fill = ..count..),bins=30) +
  xlab('Time in Trial (min)') + ylab('Frequency') +
  ggtitle('Length of "Explanation" Phase\nfor All Participants') +
  labs(fill="Freq.")

# save it
ggsave(plot = exp_gaze_hist,
       height = 3,
       width = 3,
       filename = './figures/gaze_exp_time-dyn_exp.jpg')

<div style="width:300px; height:200px">
![](figures/gaze_exp_time-dyn_exp.jpg)
</div>

**Figure**. Histogram of the length of time (in minutes) of participants' explanation phases.

We see that the majority of participants spent the expected 2 minutes to complete the explanation phase, with the exception of one participant.

## Save the unified gaze dataframes

In [40]:
# save watch dataframe
watch_df = dplyr::filter(watch_df,
                         participant %in% watch_breakdown$participant[watch_breakdown$trial_time<5])
write.table(watch_df, "./data/analysis_files/watch_phase_gaze_data.csv",row.names=FALSE, sep=',')

In [38]:
# save explain dataframe
write.table(exp_df, "./data/analysis_files/explain_phase_gaze_data.csv",row.names=FALSE, sep=',')

In [None]:
# save "thanks" dataframe
write.table(thanks_df, "./data/analysis_files/thankyou_gaze_data.csv",row.names=FALSE, sep=',')

In [None]:
# save any trials that we couldn't categorize
write.table(error_df, "./data/analysis_files/problematic_gaze_data.csv",row.names=FALSE, sep=',')

In [None]:
# let us know that all processing is finished
beepr::beep("treasure")

***

# Prepare transcript data

After concatenating and winnowing the gaze data, we next turn to participants' transcript data.

***

In [4]:
# find all possible transcript files
transcript_files = list.files('./data/transcript_files_clean',pattern='-timestamped-transcript-data\\.csv$',full.names=TRUE)

In [5]:
# create an empty dataframe for the transcript data
transcript_df = data.frame(matrix(vector(), 
                                  0, 
                                  length(transcript_columns)+1,
                                  dimnames=list(c(), c('participant',transcript_columns))))

In [6]:
# cycle through all of our transcript files
for (next_file in transcript_files){
    
    # identify participant ID
    participant = strtoi(str_extract(next_file, "(?<=DE)[0-9]{1,3}"))
    
    # read in participant's file
    new_transcript_df = import_transcript_files(next_file)
    
    # append to transcript file
    transcript_df = bind_rows(transcript_df, data.frame(participant,new_transcript_df))
}

In [7]:
# save unified transcript to file
write.table(transcript_df, "./data/analysis_files/participant_transcript_data.csv",row.names=FALSE, sep=',')

## Quick peek at the transcript length

Now that we've imported our files, let's take a look at the data.

In [8]:
# create a quick summary of transcript data
transcript_summaries = transcript_df %>%
    group_by(participant) %>%
    summarise(exp_length = mean(nr))

In [9]:
# create a plot for the data
exp_length_hist = qplot(transcript_summaries$exp_length, geom='histogram', bins=30) +
  geom_histogram(aes(fill = ..count..),bins=30) +
  xlab('Number of words') + ylab('Frequency') +
  ggtitle('Length of Explanation\nfor All Participants') +
  labs(fill="Freq.")

# save it
ggsave(plot = exp_length_hist,
       height = 3,
       width = 3,
       filename = './figures/explanation_lengths-dyn_exp.jpg')

<div style="width:300px; height:200px">
![](figures/explanation_lengths-dyn_exp.jpg)
</div>

**Figure**. Histogram of the lengths (in number of words) of participants' explanations.

In [10]:
# let us know that all processing is finished
beepr::beep("fanfare")

***

# Identify participants

For the current analyses, we will only include participants who have reliable data from all three sources (gaze from the "watch" phase, gaze and transcript from the "explanation" phase).

***

In [25]:
# read in transcript data
transcript_df = read.table("./data/analysis_files/participant_transcript_data.csv",header=TRUE, sep=',')

In [5]:
# read in gaze data from watch phase
watch_df = read.table("./data/analysis_files/watch_phase_gaze_data.csv",header=TRUE, sep=',')

In [6]:
# read in gaze data from explanation phase
exp_df = read.table("./data/analysis_files/explain_phase_gaze_data.csv",header=TRUE, sep=',')

In [7]:
# let us know that all processing is finished
beepr::beep("fanfare")

In [8]:
# identify participants who appear in all three
t_participants = unique(transcript_df$participant)
w_participants = unique(watch_df$participant)
e_participants = unique(exp_df$participant)
complete_participants = Reduce(intersect, list(t_participants,w_participants,e_participants))
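
To see how `Reduce(intersect, ...)` narrows the list, here is a small sketch with made-up participant IDs. It folds `intersect` across the three vectors, so only IDs present in every vector survive.

```r
# Hypothetical participant IDs from each data source.
t_ids = c(1, 2, 3, 4)   # transcripts
w_ids = c(2, 3, 4, 5)   # "watch" gaze
e_ids = c(3, 4, 6)      # "explanation" gaze

# Equivalent to intersect(intersect(t_ids, w_ids), e_ids).
complete_ids = Reduce(intersect, list(t_ids, w_ids, e_ids))
# complete_ids is c(3, 4)
```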

In [9]:
# narrow down transcript data to include only our target participants and export
transcript_df = dplyr::filter(transcript_df,participant %in% complete_participants)
write.table(transcript_df,'./data/analysis_files/participant_transcript_data.csv',row.names=FALSE, sep=',')

In [10]:
# narrow down watch gaze data to include only our target participants and export
watch_df = dplyr::filter(watch_df,participant %in% complete_participants)
write.table(watch_df,'./data/analysis_files/watch_phase_gaze_data.csv',row.names=FALSE, sep=',')

In [11]:
# narrow down explanation gaze data to include only our target participants and export
exp_df = dplyr::filter(exp_df,participant %in% complete_participants)
write.table(exp_df,'./data/analysis_files/explain_phase_gaze_data.csv',row.names=FALSE, sep=',')

***

# Calculate LSA for stimulus and participant transcripts 

Now that we have the participant transcripts cleaned and stored, we'll use LSA to determine the similarity of participants' explanations to the stimulus.

***

In [17]:
# read in the TASA LSA space
load("./supplementary-code/TASA.rda")

In [26]:
# load in movie transcript and collapse its words into a single string
movie_transcript = read.table('./global-warming-transcript-clean.csv',sep=',',header=TRUE)
movie_transcript = paste(movie_transcript$word,collapse=' ')

In [27]:
# concatenate each participant's transcript to a single cell and remove any with 10 or fewer words
lsa_df = transcript_df %>%
    group_by(participant) %>%
    summarise(exp_length = n(),
              words = paste(word,collapse=' ')) %>%
    dplyr::filter(exp_length > 10) %>%
    select(-exp_length)

In [28]:
# because `dplyr` is having problems with costring, let's `lapply` it
lsa_df$exp_sim = unlist(lapply(lsa_df$words,function(x) costring(x,movie_transcript,tvectors=TASA)))
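
For intuition about what the similarity score means, the sketch below computes a cosine between two texts by hand. It is not the `LSAfun` implementation: it assumes, as we understand `costring` to do, that each text is represented by the sum of its word vectors, and it uses an invented 3-dimensional space (`toy_space`) in place of TASA. The helpers `text_vector` and `cosine` are hypothetical names.

```r
# Hypothetical 3-dimensional "semantic space"; real LSA spaces have hundreds of dimensions.
toy_space = rbind(earth    = c(0.9, 0.1, 0.0),
                  heat     = c(0.7, 0.3, 0.1),
                  light    = c(0.6, 0.2, 0.2),
                  sandwich = c(0.0, 0.1, 0.9))

# Represent a text as the sum of the vectors of its known words.
text_vector = function(text, space) {
    words = strsplit(tolower(text), "\\s+")[[1]]
    words = words[words %in% rownames(space)]
    colSums(space[words, , drop = FALSE])
}

# Cosine of the angle between two vectors.
cosine = function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

stimulus = text_vector("earth heat light", toy_space)
on_topic = cosine(text_vector("heat earth", toy_space), stimulus)
off_topic = cosine(text_vector("sandwich", toy_space), stimulus)
```

In this toy space, the on-topic text points in nearly the same direction as the stimulus and so gets a much higher cosine than the off-topic one.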

In [29]:
# save to file
write.table(lsa_df,'./data/analysis_files/lsa_explanation_data.csv',
            row.names=FALSE, sep=',')

## Quick peek at the similarity distribution

In [304]:
# create a plot for the data
exp_sim_hist = qplot(lsa_df$exp_sim, geom='histogram', bins=30) +
  geom_histogram(aes(fill = ..count..),bins=30) +
  xlab('Cosine') + ylab('Frequency') +
  ggtitle('Similarity of Explanation\nto Stimulus\nfor All Participants') +
  labs(fill="Freq.")

# save it
ggsave(plot = exp_sim_hist,
       height = 3,
       width = 3,
       filename = './figures/explanation_similarity-dyn_exp.jpg')

<div style="width:300px; height:200px">
![](figures/explanation_similarity-dyn_exp.jpg)
</div>

**Figure**. Histogram of similarity scores (i.e., cosines over multidimensional LSA space) of each participant's explanation to the stimulus.

***

# Recode factors in gaze data

In order to downsample, we need to recode the character factor variables to numeric for both gaze datasets.

***

In [115]:
# identify string variables that we don't need right now
toss_string_vars = c("l_aoi_hit","r_aoi_hit","type")

In [116]:
# identify string variables to recode
keep_string_vars = c("l_event_info","r_event_info")

## Watch phase

In [117]:
# read in gaze data from watch phase
watch_df = read.table("./data/analysis_files/watch_phase_gaze_data.csv",header=TRUE, sep=',',stringsAsFactors=FALSE)

In [118]:
# remove unneeded variables from the watch dataframe and recode stimulus
watch_df = watch_df %>%
    select(-one_of(toss_string_vars)) %>%
    mutate(stimulus = 1)

In [119]:
# identify the `event_info` factors and convert to numeric
eye_event_vars = data.frame(original_factors = c('Blink','Fixation','Saccade','-'),
                            numeric_factors = 1:4)

In [121]:
# create a mapping from string to int
eye_event_map = setNames(eye_event_vars$numeric_factors,eye_event_vars$original_factors)

In [122]:
# map registered datapoints and NAs for watch df
watch_df$l_event_info[] = eye_event_map[watch_df$l_event_info]
watch_df$r_event_info[] = eye_event_map[watch_df$r_event_info]
watch_df$l_event_info[is.na(watch_df$l_event_info)] = 999
watch_df$r_event_info[is.na(watch_df$r_event_info)] = 999
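
The recoding step relies on indexing a named vector by label. The short sketch below shows the idea on invented event labels: any label missing from the map comes back as `NA`, which is then replaced by the `999` placeholder.

```r
# Same label-to-code map as above.
eye_event_map = setNames(1:4, c("Blink", "Fixation", "Saccade", "-"))

# Hypothetical event labels, including one that is not in the map.
toy_events = c("Fixation", "Blink", "Unclassified", "-")

# Look up each label by name; unmatched labels become NA.
toy_codes = unname(eye_event_map[toy_events])
toy_codes[is.na(toy_codes)] = 999
# toy_codes is c(2, 1, 999, 4)
```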

In [131]:
# save watch dataframe
write.table(watch_df,'./data/analysis_files/watch_phase_gaze_data.csv',row.names=FALSE, sep=',')

## Explanation phase

In [123]:
# read in gaze data from explanation phase
exp_df = read.table("./data/analysis_files/explain_phase_gaze_data.csv",header=TRUE, sep=',',stringsAsFactors=FALSE)

In [124]:
# remove unneeded variables from the explanation dataframe and recode stimulus
exp_df = exp_df %>%
    select(-one_of(toss_string_vars)) %>%
    mutate(stimulus = 0)

In [125]:
# map registered datapoints and NAs for explanation df
exp_df$l_event_info[] = eye_event_map[exp_df$l_event_info]
exp_df$r_event_info[] = eye_event_map[exp_df$r_event_info]
exp_df$l_event_info[is.na(exp_df$l_event_info)] = 999
exp_df$r_event_info[is.na(exp_df$r_event_info)] = 999

In [132]:
# save explanation dataframe
write.table(exp_df,'./data/analysis_files/explain_phase_gaze_data.csv',row.names=FALSE, sep=',')

***

# Recode questionnaire responses

Select the questionnaire responses of interest and save them to a new dataframe. For the first analysis, we'll just see whether they think climate change is happening (`CC1`) and how strongly they think it (`CC2`).

In [289]:
# read in questionnaire file and rename participant variable
question_df = fread('./data/questionnaire_files_clean/DE-questionnaire_clean.csv',sep='^')
question_df = plyr::rename(question_df,c("Subject"='participant'))

In [290]:
# specify mappings from text to numeric codes
questionnaire_vars = data.frame(original_factors = c('No',
                                                     "Don\'t Know",
                                                     "Yes",
                                                     'Not at all sure',
                                                     'Somewhat sure',
                                                     'Extremely sure',
                                                     'Very Sure'),
                               numeric_factors = 1:7)

In [291]:
# create a mapping from string to int
questionnaire_map = setNames(questionnaire_vars$numeric_factors,questionnaire_vars$original_factors)

In [292]:
# map registered datapoints and NAs for questionnaire df
question_df$CC1[] = questionnaire_map[question_df$CC1]
question_df$CC2[] = questionnaire_map[question_df$CC2]
question_df$CC1[is.na(question_df$CC1)] = 999
question_df$CC2[is.na(question_df$CC2)] = 999

In [293]:
# select only the questions we'd like to keep and convert all variables to numeric
question_df = question_df %>% 
    select(participant,CC1,CC2) %>%
    mutate(participant = str_replace_all(participant,'DE','')) %>%
    mutate_each(funs(strtoi(.))) %>%
    plyr::rename(.,c("CC1" = "cc_exists","CC2" = "cc_confidence"))

In [294]:
# convert 999 to NAs and remove incomplete cases
question_df[question_df==999] = NA
question_df = question_df[complete.cases(question_df),]
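
A minimal sketch of this placeholder-to-`NA` cleanup, using an invented three-row dataframe with the renamed columns:

```r
# Hypothetical recoded responses; 999 marks a response that could not be mapped.
toy_q = data.frame(participant = c(1, 2, 3),
                   cc_exists = c(3, 999, 3),
                   cc_confidence = c(5, 7, 999))

# Turn every 999 placeholder back into NA, then drop rows with any NA.
toy_q[toy_q == 999] = NA
toy_q = toy_q[complete.cases(toy_q), ]
```

Only participant 1 has both responses, so only that row is kept. Note that this pattern would also blank out a participant whose ID happened to be 999.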

In [296]:
# save questionnaire dataframe
write.table(question_df,'./data/analysis_files/questionnaire_data.csv',row.names=FALSE, sep=',')

## Quick peek at the opinion distributions

Because only 1 participant did not report believing that climate change is occurring (stating `Don't know` instead), we'll only look at the distribution of the strength of participants' opinions about global warming.

In [298]:
# create a plot for the data
cc_confidence_plot = qplot(question_df$cc_confidence, geom='histogram', bins=30) +
  geom_histogram(aes(fill = ..count..),bins=30) +
  xlab('Strength of opinion') + ylab('Frequency') +
  ggtitle('Beliefs about Global Warming') +
  labs(fill="Freq.")

# save it
ggsave(plot = cc_confidence_plot,
       height = 3,
       width = 3,
       filename = './figures/cc_confidence_hist-dyn_exp.jpg')

<div style="width:300px; height:200px">
![](figures/cc_confidence_hist-dyn_exp.jpg)
</div>

**Figure**. Histogram of beliefs -- from 4 (`Not at all sure`) to 7 (`Very sure`) -- in the existence of climate change.

***

# Create merged analysis dataframe

Merge gaze (watch and explanation phase), transcript, and questionnaire data into a single dataframe for analysis in the next notebook.

***

In [31]:
# read in new lsa data
lsa_df = read.table("./data/analysis_files/lsa_explanation_data.csv",header=TRUE, sep=',')

In [32]:
# read in gaze data from watch phase
watch_df = read.table("./data/analysis_files/watch_phase_gaze_data.csv",header=TRUE, sep=',')

In [33]:
# read in gaze data from explanation phase
exp_df = read.table("./data/analysis_files/explain_phase_gaze_data.csv",header=TRUE, sep=',')

In [34]:
# read in the questionnaire data
question_df = read.table("./data/analysis_files/questionnaire_data.csv",header=TRUE, sep=',')

In [35]:
# downsample watch dataframe: keep the first sample in each 0.01-minute (0.6-second) bin
watch_df = watch_df %>% ungroup() %>%
    mutate(r_time = round(r_mins,2)) %>%
    group_by(participant,stimulus,r_time) %>%
    arrange(time) %>%
    filter(row_number()==1) %>%
    arrange(participant,time)
watch_df = data.frame(watch_df)
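
The downsampling step can be sketched in base R on toy data. We assume here that `r_mins` is running time in minutes, so rounding it to two decimals creates bins of 0.01 minutes (0.6 seconds); the sampling rate and values below are invented for illustration.

```r
# Toy data: two participants sampled every 10 ms for 3 seconds (hypothetical values).
time_ms = seq(5, 2995, by = 10)
toy_gaze = data.frame(participant = rep(c(1, 2), each = length(time_ms)),
                      time = rep(time_ms, times = 2))
toy_gaze$r_mins = toy_gaze$time / 60000

# Bin by minutes rounded to two decimals (0.01 min = 0.6 s).
toy_gaze$r_time = round(toy_gaze$r_mins, 2)

# Keep the earliest sample in each participant-by-bin cell.
toy_gaze = toy_gaze[order(toy_gaze$participant, toy_gaze$time), ]
first_in_bin = !duplicated(toy_gaze[, c("participant", "r_time")])
toy_down = toy_gaze[first_in_bin, ]
```

Each participant's 300 samples collapse to one row per bin, mirroring the `group_by` / `filter(row_number()==1)` pattern above.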

In [36]:
# join the watch data with the transcript and questionnaire data
watch_df = plyr::join(watch_df,lsa_df,by="participant")
watch_df = plyr::join(watch_df,question_df,by="participant")
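
`plyr::join` performs a left join by default, so gaze rows from participants with no matching LSA or questionnaire row receive `NA` in the new columns and are later removed by `complete.cases`. The sketch below shows the same idea on invented data, with base R's `merge(..., all.x = TRUE)` standing in for `plyr::join` (unlike `plyr::join`, `merge` may reorder rows).

```r
# Hypothetical downsampled gaze rows and per-participant similarity scores.
toy_gaze = data.frame(participant = c(1, 1, 2, 3),
                      r_time = c(0.00, 0.01, 0.00, 0.00))
toy_lsa = data.frame(participant = c(1, 2),
                     exp_sim = c(0.82, 0.64))

# Left join: every gaze row is kept; participant 3 gets NA for exp_sim.
toy_merged = merge(toy_gaze, toy_lsa, by = "participant", all.x = TRUE)
n_missing = sum(is.na(toy_merged$exp_sim))

# Dropping incomplete cases then removes participant 3 entirely.
toy_merged = toy_merged[complete.cases(toy_merged), ]
```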

In [37]:
# downsample explain dataframe: keep the first sample in each 0.01-minute (0.6-second) bin
exp_df = exp_df %>% ungroup() %>%
    mutate(r_time = round(r_mins,2)) %>%
    group_by(participant,stimulus,r_time) %>%
    arrange(time) %>%
    filter(row_number()==1) %>%
    arrange(participant,time)
exp_df = data.frame(exp_df)

In [38]:
# join the explanation data with the transcript and questionnaire data
exp_df = plyr::join(exp_df,lsa_df,by="participant")
exp_df = plyr::join(exp_df,question_df,by="participant")

In [39]:
# join all together, remove 'words' column, and convert to numeric
all_data_df = bind_rows(watch_df,exp_df)
all_data_df = select(all_data_df,-words)
all_data_df = mutate_each(all_data_df,funs(as.numeric(.)))

In [40]:
# only take complete cases
all_data_df = all_data_df[complete.cases(all_data_df),]

In [41]:
# save combined dataframe
write.table(all_data_df,'./data/analysis_files/final_analysis_data.csv',row.names=FALSE,sep=',')

In [42]:
beepr::beep('fanfare')

***