# Transcript Formatting
This notebook formats the preprocessed transcripts in several ways for use in linguistic feature calculation. It outputs a series of files containing the processed transcripts with punctuation, without punctuation, split by utterance, and split into single words for local coherence calculations.

In [23]:
# Importing necessary libraries
import pandas as pd
import re
import string
import csv

In [24]:
# Loading the cleaned but unformatted transcripts
unprocessed_transcripts = pd.read_csv('processed_data/transcripts.csv').drop(['Unnamed: 0'], axis = 1)
display(unprocessed_transcripts)

Unnamed: 0,Text,Coherence
0,"3,000 years have I been fighting. Every mornin...",4
1,"“Dad, you 're on TV again !” I heard Eric 's v...",5
2,"When Tyler entered the ward, his daughter Vale...",5
3,His body was failing. He had taken care of it ...,4
4,"I saw the button. It was simple, red, no words...",5
...,...,...
1534,The human body is an amazing thing. It is able...,3
1535,"Once upon a time, a young woman broke her vase...",4
1536,"Once upon a time, two beautiful diamonds and a...",3
1537,Lora and her team had been searching for the l...,5


### Transcripts With Punctuation
The following set of transcripts is processed to retain the original punctuation of the retell, with formatting errors fixed. These transcripts are used for spaCy tokenization and Stanza parsing, both of which can use punctuation to identify sentence boundaries.

In [48]:
# Regex substitutions for processing transcripts (applied in the order listed, so later rules see the output of earlier ones)
regex_sub = {
    re.compile(r'\. ⁇ '): ' ', # Replaces instances of the special character ⁇ after a period with a space
    re.compile(r' ⁇ '): '\'', # Replacing instances of the special character ⁇ between two spaces with an apostrophe
    re.compile(r'([^.?!;,])( )([^iI]?\'[a-z])'): r'\1\3', # Removing spaces that are incorrectly separating contractions
    re.compile(r'([a-z])(-)([a-z])'): r'\1 \3', # Replacing hyphens between lowercase letters with a space
    re.compile(r' ([\.!\?]”)'): r'\1', # Removing unnecessary spaces
    re.compile(r'\\\''): r'\'', # Removing slashes before apostrophes
    re.compile(fr'(?![A-Za-z0-9\s“”’…é{re.escape(string.punctuation)}]).'): '', # Removing non-alphanumeric or punctuation characters
    re.compile(r'[\#\*\+\^\{\}\|~]'): '', # Removing specific punctuation characters
    re.compile(r'(,[\'"“”])'): r'\1 ', # Adding a space after a comma that is followed by a quotation mark
    re.compile(r'([!\?\.,\-:;]\')([^\s])'): r'\1 \2', # Adding a space after punctuation plus apostrophe when no whitespace follows (hyphen escaped so ",-:" is not read as a character range)
    re.compile(r'<.*?>'): '', # Removing anything between <>
    re.compile(r' +|\n'): ' ', # Replacing all instances of multiple spaces or newline characters with a space
    re.compile(r' s '): '\'s ', # Replacing instances of standalone s with 's (a byproduct of previous substitutions)
    re.compile(r' t '): '\'t ' # Replacing instances of standalone t with 't (a byproduct of previous substitutions)
}

# Processing each story using the defined substitutions
processed_transcripts = unprocessed_transcripts.copy()
for i in unprocessed_transcripts.index:
    story_string = unprocessed_transcripts.loc[i, 'Text'] # Label-based lookup, since i is an index label
    for pattern, substitution in regex_sub.items():
        story_string = pattern.sub(substitution, story_string)
    processed_transcripts.loc[i,'Text'] = story_string

display(processed_transcripts)

Unnamed: 0,Text,Coherence
0,"3,000 years have I been fighting. Every mornin...",4
1,"“Dad, you're on TV again!” I heard Eric's voic...",5
2,"When Tyler entered the ward, his daughter Vale...",5
3,His body was failing. He had taken care of it ...,4
4,"I saw the button. It was simple, red, no words...",5
...,...,...
1534,The human body is an amazing thing. It is able...,3
1535,"Once upon a time, a young woman broke her vase...",4
1536,"Once upon a time, two beautiful diamonds and a...",3
1537,Lora and her team had been searching for the l...,5


In [51]:
# Example of cleaned transcripts with punctuation

print(unprocessed_transcripts.loc[860, 'Text'])
print(processed_transcripts.loc[860, 'Text'])

“So you know the answer is “It 's a very important answer for me”, so the answer is “What are the odds the answer is right. So, the answer is right .” So we 've all been at this for about a year now. We 've done everything. Nothing ever happens. We just got ta get to the surface and get there .” The Google 's computer was n't responding. This was a new one. “The answer is right, and I 'm afraid we need to go .” “So, we go. I 'll be right there in the morning .” The Google 's computer was n't responding. “I 'm going to do it.
“So you know the answer is “It's a very important answer for me”, so the answer is “What are the odds the answer is right. So, the answer is right.” So we've all been at this for about a year now. We've done everything. Nothing ever happens. We just got ta get to the surface and get there.” The Google's computer wasn't responding. This was a new one. “The answer is right, and I'm afraid we need to go.” “So, we go. I'll be right there in the morning.” The Google's c
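
Because the substitutions are applied in order, each rule operates on the output of the previous one. The following is a minimal sketch of that behavior using only two of the rules above on a made-up sentence; the `rules` list and the `sample` string are illustrative assumptions, not part of the pipeline.

```python
import re

# Two of the substitutions from the cell above, kept in the same relative order.
rules = [
    (re.compile(r'([^.?!;,])( )([^iI]?\'[a-z])'), r'\1\3'),  # rejoin split contractions
    (re.compile(r' ([\.!\?]”)'), r'\1'),                     # drop the space before closing punctuation
]

# Hypothetical sample text, modelled on the raw transcripts.
sample = "“Dad, you 're on TV again !” I heard Eric 's voice. It was n't loud."

cleaned = sample
for pattern, replacement in rules:
    cleaned = pattern.sub(replacement, cleaned)

print(cleaned)
```

The first rule rejoins `you 're`, `Eric 's`, and `was n't`; the second closes the gap before `!”`, which matches the cleaned form shown for row 1 in the output above.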

In [27]:
# Saving the processed transcripts (cleaned and formatted, but retains original punctuation)
processed_transcripts.to_csv('processed_data/transcripts_spacy_formatted.csv')

### Transcripts Without Punctuation
The following set of transcripts is processed to retain the capitalization and spacing of the original retells while keeping only ASCII letters, digits, and whitespace. These transcripts are used for calculating unique word counts and for preparing the input to the local coherence calculation.

In [28]:
# Processing each story using the defined substitutions
no_punctuation_processed_transcripts = processed_transcripts.copy()
for i in processed_transcripts.index:
    story_string = processed_transcripts.loc[i, 'Text'] # Label-based lookup, since i is an index label
    story_string_replaced = re.sub(r'[^A-Za-z0-9\s]', '', story_string) # Removing every character that is neither an ASCII letter or digit nor whitespace
    story_string_replaced = re.sub(r' +|\n', ' ', story_string_replaced) # Replacing all instances of multiple spaces or newline characters with a space
    no_punctuation_processed_transcripts.loc[i,'Text'] = story_string_replaced

display(no_punctuation_processed_transcripts)

Unnamed: 0,Text,Coherence
0,3000 years have I been fighting Every morning ...,4
1,Dad youre on TV again I heard Erics voice from...,5
2,When Tyler entered the ward his daughter Valer...,5
3,His body was failing He had taken care of it v...,4
4,I saw the button It was simple red no words on...,5
...,...,...
1534,The human body is an amazing thing It is able ...,3
1535,Once upon a time a young woman broke her vase ...,4
1536,Once upon a time two beautiful diamonds and a ...,3
1537,Lora and her team had been searching for the l...,5
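
One side effect of the character class `[^A-Za-z0-9\s]` is worth keeping in mind: it removes every non-ASCII character, including accented letters such as `é` that the punctuation-preserving step kept. The sketch below wraps the same two substitutions in a hypothetical helper, `strip_punctuation`; the helper name and the sample strings are assumptions for illustration only.

```python
import re

def strip_punctuation(text):
    """Apply the same two substitutions as the loop above to a single string."""
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # keep only ASCII letters, digits, and whitespace
    text = re.sub(r' +|\n', ' ', text)          # collapse runs of spaces and newlines to one space
    return text

with_contraction = strip_punctuation("“Dad, you're on TV again!”")  # the apostrophe is removed
with_accent = strip_punctuation("She sat in the café.")             # the accented letter is removed too

print(with_contraction)
print(with_accent)
```

If accented letters should survive, a Unicode-aware class such as `[^\w\s]` is one option, although it also keeps underscores.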


In [29]:
# Saving the processed transcripts (cleaned and formatted with no punctuation)
no_punctuation_processed_transcripts.to_csv('processed_data/transcripts_no_punc_formatted.csv')

### Transcripts for Local Coherence
The following set of transcripts is intended for the calculation of local coherence, using the R code developed by [Hoffman et al., 2018](https://elifesciences.org/articles/38907). This code, which can be found in [this repository](https://osf.io/8atfn/overview), uses the embedding space created with latent semantic analysis to calculate the similarity between adjacent words and utterances in a transcript. This requires the transcripts to be reformatted so that each row holds a single word alongside identifying columns, as produced by the cell below.

In [30]:
# Creating a list to store DataFrames for concatenation
all_data_frames = []

# Creating a DataFrame for each transcript that contains one word per row 
for i in no_punctuation_processed_transcripts.index:
    story = no_punctuation_processed_transcripts.loc[i, 'Text']
    story_list = [word for word in story.split(' ') if word] # Splitting on spaces and dropping empty strings
    response_data_frame = pd.DataFrame({
                                        'study_id': i,
                                        'group': 'transcripts',
                                        'prompt': 'retell',
                                        'response': story_list
                                        })
    all_data_frames.append(response_data_frame)

# Appending all DataFrames together (rows of resulting DataFrame are single words from each indexed transcript)
concatenated_transcripts = pd.concat(all_data_frames, axis = 0)

display(concatenated_transcripts)

Unnamed: 0,study_id,group,prompt,response
0,0,transcripts,retell,3000
1,0,transcripts,retell,years
2,0,transcripts,retell,have
3,0,transcripts,retell,I
4,0,transcripts,retell,been
...,...,...,...,...
85,1538,transcripts,retell,the
86,1538,transcripts,retell,cause
87,1538,transcripts,retell,of
88,1538,transcripts,retell,the
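
The same one-word-per-row layout can also be produced without an explicit loop. The sketch below is an alternative, not the code used to generate the saved file; it assumes a small stand-in DataFrame named `toy` with a `Text` column and relies on pandas' `Series.str.split` and `Series.explode`.

```python
import pandas as pd

# Stand-in for no_punctuation_processed_transcripts (hypothetical data).
toy = pd.DataFrame({'Text': ['3000 years have I', 'Dad youre  on TV']})

# Split on whitespace (repeated spaces yield no empty strings), then give each word its own row.
words = toy['Text'].str.split().explode()

word_rows = pd.DataFrame({
    'study_id': words.index.to_numpy(),  # row label of the transcript each word came from
    'group': 'transcripts',
    'prompt': 'retell',
    'response': words.to_numpy(),
})

print(word_rows)
```

Note that `str.split()` with no arguments splits on any whitespace, whereas the loop above splits on single spaces only, so the two approaches can differ for text containing tabs.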


In [31]:
# Saving the processed transcripts (split by single word for use in coherence code)
concatenated_transcripts.to_csv('processed_data/transcripts_hoffman_coherence_formatted.txt', sep='\t', index = False, header = False)

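### Transcripts Split by Utterance
The following set of transcripts is derived from the punctuation-preserving transcripts by splitting each text at sentence-ending punctuation, semicolons, ellipses, quotation marks, spaced dashes, and apostrophes followed by a space. These marks serve as a proxy for utterance boundaries. The result is one list of utterances per transcript, with commas and other utterance-internal punctuation retained.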

In [32]:
# Creating a list to store utterances
story_lists = []

# Splits each processed text at given punctuation marks, a proxy for utterance boundaries
for i in processed_transcripts.index:
    story_string = processed_transcripts.loc[i, 'Text']
    story_string_split = re.split(r'[;\.?!“”…"]+|[\s]- |\' ', story_string)
    story_string_split = [item for item in story_string_split if item] # Dropping empty strings left by the split
    story_lists.append(story_string_split)

print(story_lists[0])

['3,000 years have I been fighting', ' Every morning, the raccoons scratch at my eyes', ' Every evening, the skunks spray me while the opossums chew at my feet', ' I have never had any tools', ' I have only my hands', ' I don’t remember the place I came from before this', ' All I remember is the daily fight between me and these animals', ' No matter how many times I kill them, they come back the next day', ' No matter how many times I’ve ripped them limb from limb, they are here for their appointment the next day just as eager to tear me apart', ' They want my body to be destroyed beyond recognition, and most days they succeed', ' When I wake up in the morning, all my wounds from the day before are gone', ' Not even a scratch on my little toe', ' Why do these animals want to hurt me so bad', ' What have I done to deserve this fate', ' All I know anymore is fighting', ' The struggle', ' But we aren’t struggling for a purpose, we’re just here', ' No one else has ever peered in to our for

In [33]:
# Saving the processed transcripts (split at utterance boundaries, retaining other punctuation)
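with open("processed_data/sentences.csv", "w", newline="", encoding="utf-8") as f: # newline="" avoids blank rows on Windows; UTF-8 preserves non-ASCII characters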
    wr = csv.writer(f)
    wr.writerows(story_lists)
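
Because transcripts contain different numbers of utterances, the rows written to `sentences.csv` have different lengths. Such ragged rows can be read back with `csv.reader`, whereas `pd.read_csv` can raise a parsing error when a later row has more fields than the first. A minimal sketch, using an in-memory buffer and two made-up transcripts in place of the real file (the `texts` list and variable names are assumptions):

```python
import csv
import io
import re

# Same boundary pattern as the cell above, written as a single expression.
boundary = re.compile(r'[;\.?!“”…"]+|[\s]- |\' ')

# Hypothetical transcripts standing in for processed_transcripts['Text'].
texts = [
    "I saw the button. It was simple, red, no words on it.",
    "His body was failing.",
]
utterance_rows = [[part for part in boundary.split(text) if part] for text in texts]

# Write to an in-memory buffer instead of processed_data/sentences.csv.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(utterance_rows)

# Read the ragged rows back; each row is the list of utterances for one transcript.
buffer.seek(0)
restored = list(csv.reader(buffer))
print(restored)
```

Utterances that contain commas are quoted by the writer, so they survive the round trip as single fields, and the leading space left by the split is preserved.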