# Dynamics of Explanation Project

## Stage 2: Data Cleaning

This notebook is part of the "Dynamics of Explanation" project. It cleans each participant's data before any further processing. The code relies on cleaned eye-tracker data produced by another Jupyter notebook for the project, ``dyn_exp-step1-snippets_of_video.ipynb``.\*

To run this file from scratch, you will require the following files:

* **`supplementary-code/DE_transcript_cleanup.py`**: Defines a function to clean up participants' transcripts.
* **`supplementary-code/DE_text_cleanup.py`**: Defines a function to clean up questionnaire files.
* **`global-warming-transcript.txt`**: Transcription of script from the stimulus video, ["How Global Warming Works in Under 5 Minutes"](http://www.howglobalwarmingworks.org/) (Ranney, Lamprey, Reinholz, Le, Ranney, & Goldwasser, 2013).
* **`data/`**: Folder with participant data.\*
    * **`data/transcript_files_raw/`**: Folder with participants' raw transcripts.\*
    * **`data/questionnaire_files_raw/`**: Folder with participants' raw questionnaire data.\*

\* *Due to ethical considerations relating to participant privacy, no participant data may be shared at this time. Files and folders marked with an asterisk contain such data and are therefore not included in the public repository.*
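Before running the notebook, it can help to confirm that the inputs listed above are in place. The following is a minimal sketch, not part of the original notebook: the helper names `missing_inputs` and `ensure_output_folders` are hypothetical, and the two output folders are taken from the paths that the later cells write to.

```python
import os
import tempfile

# Inputs named in the list above, relative to the project root.
REQUIRED_INPUTS = [
    'supplementary-code/DE_transcript_cleanup.py',
    'supplementary-code/DE_text_cleanup.py',
    'global-warming-transcript.txt',
    'data/transcript_files_raw',
    'data/questionnaire_files_raw',
]

# Folders that the later cells write cleaned files into.
OUTPUT_FOLDERS = [
    'data/questionnaire_files_clean',
    'data/transcript_files_clean',
]


def missing_inputs(root):
    """Return the required inputs that do not exist under root (hypothetical helper)."""
    return [path for path in REQUIRED_INPUTS
            if not os.path.exists(os.path.join(root, path))]


def ensure_output_folders(root):
    """Create the output folders under root if they are not there yet (hypothetical helper)."""
    for folder in OUTPUT_FOLDERS:
        path = os.path.join(root, folder)
        if not os.path.isdir(path):
            os.makedirs(path)


# Demonstration on an empty temporary folder: every input is reported missing.
demo_root = tempfile.mkdtemp()
ensure_output_folders(demo_root)
demo_missing = missing_inputs(demo_root)
print(demo_missing)
```

In the project itself, the same two calls would be made with `root = './'`.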

**Table of Contents:**
1. [Preliminaries](#Preliminaries). Reads in all necessary modules.
1. [Identify participants](#Identify-participants). Automatically identifies participants we'll be using for the analysis.
1. [Prepare questionnaire data](#Prepare-questionnaire-data). Cleans Excel spreadsheet of participant questionnaire data and exports as CSV.
1. [Prepare transcript data](#Prepare-transcript-data). Cleans participant and stimulus transcripts.

**Written by**: A. Paxton (University of California, Berkeley)   
**Date last modified**: 1 August 2016

***

# Preliminaries

This section imports all necessary modules, sets the working directory, and loads the project's helper functions.

[To top.](#Dynamics-of-Explanation-Project)

***

In [1]:
# import the necessary modules
import re, os, subprocess, string, enchant, glob
import pandas as pd
from autocorrect import spell
from enchant.checker import SpellChecker

In [2]:
# set working directory
root = './'
os.chdir(root)

In [3]:
# load in spell checking information
d = enchant.Dict("en_US")
chkr = SpellChecker("en_US")

In [4]:
# read in our special functions
%run './supplementary-code/DE_transcript_cleanup.py'
%run './supplementary-code/DE_text_cleanup.py'

***

# Identify participants

This section automatically identifies participants we'll be using for the analysis.

[To top.](#Dynamics-of-Explanation-Project)

***

In [5]:
# grab all transcripts and extract the participant numbers ONLY
transcript_files = glob.glob(root+'data/*.txt')
p_numbers = [re.findall(r'DE_(\d\d)',t_file)[0] for t_file in transcript_files]

In [6]:
p_numbers

['17',
 '18',
 '19',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '49',
 '51',
 '52',
 '53',
 '54',
 '55',
 '56',
 '57',
 '58',
 '59',
 '60',
 '61',
 '62',
 '63',
 '64',
 '65',
 '66',
 '67',
 '68',
 '69',
 '70',
 '71',
 '72',
 '73',
 '74',
 '75',
 '76',
 '77',
 '78',
 '79',
 '80',
 '81',
 '82',
 '83',
 '84',
 '85',
 '86',
 '87',
 '88',
 '89']
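The extraction step can be illustrated without the private data. This is a minimal sketch under stated assumptions: the file names below are invented for illustration, and `extract_participant_numbers` is a hypothetical helper rather than part of the project code. Unlike the cell above, it skips names that do not match the `DE_` pattern instead of raising an `IndexError`, and it sorts the result explicitly because `glob` does not guarantee any order.

```python
import re


def extract_participant_numbers(file_names):
    """Return sorted, de-duplicated two-digit IDs that follow 'DE_' in each name."""
    pattern = re.compile(r'DE_(\d\d)')
    found = set()
    for name in file_names:
        match = pattern.search(name)
        if match:  # skip files that do not follow the naming scheme
            found.add(match.group(1))
    return sorted(found)


# Hypothetical file names, for illustration only.
example_files = [
    './data/DE_18_transcript.txt',
    './data/DE_17_transcript.txt',
    './data/notes.txt',
]
example_ids = extract_participant_numbers(example_files)
print(example_ids)
```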

***

# Prepare questionnaire data

Read in the Excel spreadsheet of participant questionnaire data, truncate variable names, create key between truncated and full variable names, then export questionnaire data and key to CSV.

[To top.](#Dynamics-of-Explanation-Project)

***

In [13]:
# read in spreadsheet to pandas df
question_data = pd.read_excel(root+'data/questionnaire_files_raw/DE_Q_Sheet.xlsx', 'Sheet1', index_col=None)

In [20]:
question_data.columns.values[43:45]

array([ u"CC1: Recently, you may have noticed that global warming has been getting some attention in the news. Global warming refers to the idea that the world's average temperature has been increasing over the past 150 years, may be increasing more in the future, and that the world's climate may change as a result. What do you think? Do you think that global warming is happening?",
       u'CC2: If you answered YES to the last question, how sure are you that global warming is happening? If you answered NO to the last question, how sure are you that global warming is not happening?'], dtype=object)

In [6]:
# create a new dataframe for us to use as a key between old and new datasets
question_column_key = pd.DataFrame(list(question_data.columns.values), columns = ["original_name"])

In [7]:
# remove any trailing spaces from variable names and remove anything after colons
question_data.columns = question_data.columns.str.strip()
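# regex=True requires pandas 0.23 or later; recent versions no longer treat the pattern as a regex by default
question_data.columns = question_data.columns.str.replace(r':.*', '', regex=True)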

In [8]:
# add the newly cleaned column names to the key dataframe
question_column_key['edited_name'] = list(question_data.columns.values)

In [12]:
# save cleaned questionnaire and variable name key
question_data.to_csv(root+'data/questionnaire_files_clean/DE-questionnaire_clean.csv',
                       sep="^", encoding='utf-8', index=False)
question_column_key.to_csv(root+'data/questionnaire_files_clean/DE-questionnaire_key.csv',
                       sep="^", encoding='utf-8', index=False)
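The renaming and key-building steps can be tried on a toy table. This is a sketch only: the column names and values below are invented stand-ins for the real questionnaire, and the round trip uses an in-memory buffer instead of the project's output files. It also shows that a caret-delimited file must be read back with the same `sep`.

```python
import io

import pandas as pd

# Invented stand-in for the questionnaire spreadsheet.
toy = pd.DataFrame({
    'Participant ': [17, 18],
    'CC1: Do you think that global warming is happening?': ['Yes', 'No'],
    'CC2: How sure are you?': [4, 2],
})

# Record the original names before editing them.
key = pd.DataFrame({'original_name': list(toy.columns)})

# Strip surrounding spaces, then drop the colon and everything after it.
toy.columns = toy.columns.str.strip().str.replace(r':.*', '', regex=True)
key['edited_name'] = list(toy.columns)

# Write with a caret delimiter and read it back with the same delimiter.
buffer = io.StringIO()
toy.to_csv(buffer, sep='^', index=False)
buffer.seek(0)
restored = pd.read_csv(buffer, sep='^')

print(key)
print(restored)
```

The key keeps one row per column, so any truncated name can be traced back to the full question text.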

***

# Prepare transcript data

This section reads in and cleans participant transcript data and the stimulus transcript.

[To top.](#Dynamics-of-Explanation-Project)

***

In [22]:
# read in each of the transcripts and export each as a CSV
transcript_files = glob.glob(root+'data/transcript_files_raw/DE*.txt')
for tdata in transcript_files:
    
    # clean up the transcript 
    # pd.read_csv replaces pd.DataFrame.from_csv, which was removed in pandas 1.0
    transcript_df, transcript_text = DE_transcript_cleanup(pd.read_csv(tdata,sep='\t',index_col=None))
    
    # save the edited timestamped transcript file
    p_num = re.findall(r'DE_(\d\d)',tdata)[0]
    new_tfile_name = root+'data/transcript_files_clean/DE'+str(p_num)+'-timestamped-transcript-data.csv'
    transcript_df.to_csv(new_tfile_name,sep=',',index=None)
    
    # save the text-only transcript file
    new_tfile_name = root+'data/transcript_files_clean/DE'+str(p_num)+'-text-transcript-data.csv'
    with open(new_tfile_name,'w') as transcript_text_out:
        transcript_text_out.write('word\n'+'\n'.join(transcript_text)+'\n')
    
    print('Processed Transcript Data: DE '+str(p_num))

Processed Transcript Data: DE 17
Processed Transcript Data: DE 18
Processed Transcript Data: DE 19
Processed Transcript Data: DE 21
Processed Transcript Data: DE 22
Processed Transcript Data: DE 23
Processed Transcript Data: DE 24
Processed Transcript Data: DE 25
Processed Transcript Data: DE 26
Processed Transcript Data: DE 27
Processed Transcript Data: DE 28
Processed Transcript Data: DE 29
Processed Transcript Data: DE 30
Processed Transcript Data: DE 33
Processed Transcript Data: DE 34
Processed Transcript Data: DE 35
Processed Transcript Data: DE 36
Processed Transcript Data: DE 37
Processed Transcript Data: DE 38
Processed Transcript Data: DE 41
Processed Transcript Data: DE 42
Processed Transcript Data: DE 43
Processed Transcript Data: DE 44
Processed Transcript Data: DE 45
Processed Transcript Data: DE 46
Processed Transcript Data: DE 47
Processed Transcript Data: DE 48
Processed Transcript Data: DE 49
Processed Transcript Data: DE 51
Processed Transcript Data: DE 52
Processed 
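The shape of the per-file loop can be sketched without the private transcripts. Everything specific here is an assumption: `toy_transcript_cleanup` is a hypothetical stand-in for `DE_transcript_cleanup` (whose real behavior is defined in the supplementary code and not shown here), the `time` and `text` column names are invented, and the demonstration writes to a temporary folder.

```python
import os
import re
import tempfile

import pandas as pd


def toy_transcript_cleanup(raw_df):
    """Hypothetical stand-in: lowercase the text, drop non-letters, list the words."""
    cleaned = raw_df.copy()
    cleaned['text'] = (cleaned['text'].str.lower()
                       .str.replace(r'[^a-z\s]', '', regex=True))
    words = ' '.join(cleaned['text']).split()
    return cleaned, words


def process_transcript(raw_path, out_dir):
    """Read one tab-delimited transcript and write the two cleaned outputs."""
    raw_df = pd.read_csv(raw_path, sep='\t', index_col=None)
    cleaned_df, words = toy_transcript_cleanup(raw_df)
    p_num = re.findall(r'DE_(\d\d)', os.path.basename(raw_path))[0]

    # timestamped version: one row per transcript line
    stamped_path = os.path.join(out_dir, 'DE' + p_num + '-timestamped-transcript-data.csv')
    cleaned_df.to_csv(stamped_path, sep=',', index=False)

    # text-only version: one word per line under a 'word' header
    text_path = os.path.join(out_dir, 'DE' + p_num + '-text-transcript-data.csv')
    with open(text_path, 'w') as text_out:
        text_out.write('word\n' + '\n'.join(words) + '\n')
    return p_num, stamped_path, text_path


# Demonstration with an invented two-line transcript.
work_dir = tempfile.mkdtemp()
raw_path = os.path.join(work_dir, 'DE_17_transcript.txt')
with open(raw_path, 'w') as raw_out:
    raw_out.write('time\ttext\n0.5\tThe Earth, warms.\n2.0\tIt absorbs light!\n')

p_num, stamped_path, text_path = process_transcript(raw_path, work_dir)
with open(text_path) as text_in:
    text_lines = text_in.read().splitlines()
print(text_lines)
```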

In [353]:
# edit and export the movie transcript
with open(root+'global-warming-transcript.txt','r') as transcript_in:
    datafile = transcript_in.read().lower()

# expand contractions and strip out punctuation
datafile = re.sub('\'ll',' will', datafile)
# word boundaries keep words such as "parent" from matching the "aren't" pattern
datafile = re.sub(r"\b(do|are|is|was)n'?t\b", r'\1 not', datafile)
datafile = re.sub('\'re',' are',datafile)
datafile = re.sub('\'ve',' have',datafile)
datafile = re.sub('(it|there|that)\'s','\\1 is',datafile)
datafile = re.sub('\'s','',datafile)
datafile = re.compile('[%s]' % re.escape(string.punctuation)).sub(' ', datafile)
datafile = re.sub(' +',' ',datafile)

# write it to a CSV file
with open(root+'global-warming-transcript-clean.csv','w') as movie_text_out:
    # split() with no argument splits on any whitespace and drops empty strings
    movie_text_out.write('word\n'+'\n'.join(datafile.split())+'\n')

# print update
print('Processed Movie Transcript')

Processed Movie Transcript
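The cleanup steps for the stimulus transcript can be collected into one function and tried on a sample sentence. This is a sketch, not the project's code: `normalize_transcript` is a hypothetical name, the sample sentence is invented, and the word boundaries around the negation pattern are an addition that stops "apparent" from being read as containing "aren't".

```python
import re
import string

# Character class matching any ASCII punctuation mark.
PUNCTUATION = re.compile('[%s]' % re.escape(string.punctuation))


def normalize_transcript(text):
    """Lowercase, expand common contractions, strip punctuation, return words."""
    text = text.lower()
    text = re.sub(r"'ll", ' will', text)
    text = re.sub(r"\b(do|are|is|was)n'?t\b", r'\1 not', text)
    text = re.sub(r"'re", ' are', text)
    text = re.sub(r"'ve", ' have', text)
    text = re.sub(r"(it|there|that)'s", r'\1 is', text)
    text = re.sub(r"'s", '', text)  # remaining 's is treated as a possessive
    text = PUNCTUATION.sub(' ', text)
    return text.split()  # splits on any whitespace, no empty strings


# Invented sample sentence, for illustration only.
sample = "It's apparent that Earth's surface isn't cooling; we'll see why."
tokens = normalize_transcript(sample)
print(tokens)
```

The order of the substitutions matters: the specific `it's`/`there's`/`that's` rule has to run before the generic `'s` rule, otherwise "it's" would be reduced to "it".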
