# Transcript Processing
This notebook applies several pre-processing steps to the narrative transcripts collected from the [HANNA](https://github.com/dig-team/hanna-benchmark-asg/blob/main/README.md) and [CoheSentia](https://github.com/AviyaMn/CoheSentia/blob/main/README.md) corpora. The HANNA corpus contains 1,056 stories, each written by a human or generated by a large language model (LLM) in response to a given prompt and scored by three human raters for coherence. The CoheSentia corpus contains 500 stories generated by GPT-3 and scored by seven human raters for global coherence, with a holistic score provided as a combination of ratings.

In line with the instructions provided to human scorers, this notebook truncates transcripts to the last full utterance ([Chhun et al., 2022](https://arxiv.org/pdf/2208.11646)). It also derives a single coherence score per story (the rounded mean of the three ratings for HANNA, the provided consensus score for CoheSentia) and checks that all scores fall on a 1 - 5 scale. Finally, it creates a single DataFrame containing the combined transcripts from both corpora and their corresponding coherence scores.

In [39]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import json

In [40]:
def transcript_truncation(data_frame: pd.DataFrame) -> pd.DataFrame:

    """ 
        Truncates transcripts to the last complete utterance, as indicated by punctuation (periods,
        exclamation points, question marks, or speech quotations).

        Parameters 
        ----------
        data_frame: pd.DataFrame
            Input DataFrame with Text column storing transcripts
        
        Returns
        -------
        pd.DataFrame
            DataFrame with Text column replaced with cleaned transcripts
    """

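    ending_char = ['.','!','?','"']
    # Working on a copy so the caller's DataFrame (or a slice of one) is not modified in place
    data_frame = data_frame.copy()
    for i in data_frame.index:
        string = data_frame.loc[i, 'Text']
        # Position of the last utterance-ending character (-1 if none is present)
        index = max([string.rfind(char) for char in ending_char])
        if index == -1:
            # No ending punctuation found, so the full transcript is kept
            index = len(string)
        data_frame.loc[i, 'Text'] = string[0:index+1]

    return data_frame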
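
As a quick illustration of the truncation rule above, the following sketch applies the same last-punctuation logic to a few made-up strings. The helper name `truncate_text` and the sample sentences are assumptions introduced for this example; they are not part of either corpus or of the notebook's pipeline.

```python
import pandas as pd

ENDING_CHARS = ['.', '!', '?', '"']

def truncate_text(text: str) -> str:
    """Hypothetical helper: cut a string after its last ending character."""
    last = max(text.rfind(char) for char in ENDING_CHARS)
    # rfind returns -1 when a character is absent; keep the text unchanged in that case
    return text if last == -1 else text[:last + 1]

# Made-up examples: an unfinished sentence, a trailing quotation, and no punctuation at all
toy = pd.DataFrame({'Text': [
    'Hello there. This is unfin',
    'She said "stop" and',
    'No punctuation here',
]})
toy['Text'] = toy['Text'].map(truncate_text)
print(toy['Text'].tolist())
```

Only the straight double quote `"` is treated as an ending character here, mirroring the list in `transcript_truncation`; a typographic closing quote such as `”` would not end an utterance under this rule.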

In [41]:
def coherence_checks(data_frame: pd.DataFrame, column_name: str) -> None:

    """
        Checks that no coherence value is greater than 5 or less than 1
        
        Parameters 
        ----------
        data_frame: pd.DataFrame
            Input DataFrame with column storing coherence scores
        column_name: str
            Name of column containing coherence scores

        Returns
        -------
        None

        Raises
        ------
        ValueError
            If coherence score is less than 1
        ValueError
            If coherence score is greater than 5     
    """

    if (data_frame[column_name] < 1.0).any():
        raise ValueError('Coherence score cannot be less than 1')
    if (data_frame[column_name] > 5.0).any():
        raise ValueError('Coherence score cannot be greater than 5')
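
The range check above can also be written so that it reports which rows are out of range. The sketch below is one possible variant, not the notebook's own function: the name `check_score_range`, its `low`/`high` parameters, and the toy scores are assumptions made for illustration. Unlike `coherence_checks`, it also flags missing values, because `Series.between` evaluates to `False` for NaN.

```python
import pandas as pd

def check_score_range(scores: pd.Series, low: float = 1.0, high: float = 5.0) -> None:
    """Hypothetical variant of the range check that also reports offending rows."""
    # Casting to float first so object-dtype columns are handled uniformly
    numeric = scores.astype(float)
    # between() is inclusive on both ends by default and is False for NaN
    out_of_range = numeric[~numeric.between(low, high)]
    if not out_of_range.empty:
        raise ValueError(
            f'Scores outside [{low}, {high}] at rows: {out_of_range.index.tolist()}'
        )

check_score_range(pd.Series([1, 3.0, 5]))  # passes silently

try:
    check_score_range(pd.Series([2, 6, 0]))
except ValueError as error:
    print(error)
```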

In [42]:
# Loading HANNA transcript data
hanna_transcripts = pd.read_csv('raw_data/hanna_stories_annotations.csv')

# Dropping unnecessary columns
hanna_transcripts = hanna_transcripts.drop(
    columns=[
        'Name',
        'Worker ID',
        'Assignment ID',
        'Work time in seconds',
        'Relevance',
        'Empathy',
        'Surprise',
        'Engagement',
        'Complexity',
    ]
)

display(hanna_transcripts)

Unnamed: 0,Story ID,Prompt,Human,Story,Model,Coherence
0,0,When you die the afterlife is an arena where y...,"3,000 years have I been fighting. Every mornin...","3,000 years have I been fighting. Every mornin...",Human,4
1,0,When you die the afterlife is an arena where y...,"3,000 years have I been fighting. Every mornin...","3,000 years have I been fighting. Every mornin...",Human,5
2,0,When you die the afterlife is an arena where y...,"3,000 years have I been fighting. Every mornin...","3,000 years have I been fighting. Every mornin...",Human,2
3,1,A new law is enacted that erases soldiers memo...,"“Dad, you 're on TV again !” I heard Eric 's v...","“Dad, you 're on TV again !” I heard Eric 's v...",Human,5
4,1,A new law is enacted that erases soldiers memo...,"“Dad, you 're on TV again !” I heard Eric 's v...","“Dad, you 're on TV again !” I heard Eric 's v...",Human,4
...,...,...,...,...,...,...
3163,1054,"When a new president is elected, they are give...",“Mr President I want you to know I am telling ...,'said a puppet'President Bush stopped the old ...,TD-VAE,1
3164,1054,"When a new president is elected, they are give...",“Mr President I want you to know I am telling ...,'said a puppet'President Bush stopped the old ...,TD-VAE,2
3165,1055,You discover a grand hall filled with legendar...,"Waking with a start, my blankets strewn wildly...",It is YOU. Your mother's greatest love.”...'Oh...,TD-VAE,4
3166,1055,You discover a grand hall filled with legendar...,"Waking with a start, my blankets strewn wildly...",It is YOU. Your mother's greatest love.”...'Oh...,TD-VAE,5


In [43]:
# Empty DataFrame for storing new coherence values
hanna_transcripts_cleaned = pd.DataFrame(columns=hanna_transcripts.columns)

# Calculating the average of the three coherence scores, rounded to the nearest integer
for idx in hanna_transcripts['Story ID'].unique():
    hanna_transcripts_individual = hanna_transcripts[hanna_transcripts['Story ID'] == idx]
    # Appending at the next free row label, then updating that same row
    # (rather than assuming the Story ID matches the row label)
    row = len(hanna_transcripts_cleaned)
    hanna_transcripts_cleaned.loc[row] = hanna_transcripts_individual.iloc[0]
    hanna_transcripts_cleaned.loc[row, 'Coherence'] = np.round(hanna_transcripts_individual['Coherence'].mean())

# Checking that the averaged coherence scores fall within the 1 - 5 range
coherence_checks(hanna_transcripts_cleaned, 'Coherence')

# Dropping unnecessary columns
hanna_transcripts_restructured = hanna_transcripts_cleaned[['Story', 'Coherence']].rename(columns={'Story':'Text'})

# Applying truncation (ending transcripts at the last complete utterance)
hanna_transcripts_restructured = transcript_truncation(hanna_transcripts_restructured)

display(hanna_transcripts_restructured)

Unnamed: 0,Text,Coherence
0,"3,000 years have I been fighting. Every mornin...",4.0
1,"“Dad, you 're on TV again !” I heard Eric 's v...",5.0
2,"When Tyler entered the ward, his daughter Vale...",5.0
3,His body was failing. He had taken care of it ...,4.0
4,"I saw the button. It was simple, red, no words...",5.0
...,...,...
1051,"' I want no more.' she cried, tossing her cloa...",3.0
1052,'it says. 'it repeats in every language the wo...,2.0
1053,opens almost a month after the start of Star T...,4.0
1054,'said a puppet'President Bush stopped the old ...,2.0
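
The per-story averaging above can also be expressed with a `groupby`, which avoids the explicit loop. The following is a minimal sketch on made-up ratings, assuming the same `Story ID`, `Story`, and `Coherence` column names; the toy values are illustrative and do not come from the HANNA file.

```python
import pandas as pd

# Toy annotations: three ratings per story (values are illustrative only)
ratings = pd.DataFrame({
    'Story ID': [0, 0, 0, 1, 1, 1],
    'Story': ['Story A', 'Story A', 'Story A', 'Story B', 'Story B', 'Story B'],
    'Coherence': [4, 5, 2, 5, 4, 5],
})

# One row per story: keep the first copy of the text and average the ratings
averaged = ratings.groupby('Story ID', as_index=False).agg(
    Story=('Story', 'first'),
    Coherence=('Coherence', 'mean'),
)

# Rounding the mean to the nearest integer score
averaged['Coherence'] = averaged['Coherence'].round().astype(int)
print(averaged)
```

Because the mean of three integer ratings always has a fractional part of 0, 1/3, or 2/3, rounding never has to break an exact .5 tie in this setting.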


In [44]:
# Example of a transcript before (first print) and after (second print) truncation

print(hanna_transcripts_cleaned.loc[510, 'Story'])
print(hanna_transcripts_restructured.loc[510, 'Text'])

“Dear everyone. My name is Thomas Reed. As I was saying, I'm an astronaut, so you all know my story, so you know how I've come to this decision. I know that I've tried to be brave for all of you and kept everyone happy in my words, but I've come a long way since then. I've been doing my best to understand the human condition, but it seems that, once you take one step closer to this final step, you will find that there is still a chance, somewhere, of an abyss in which humanity shall die. Please remember to look after your fellow brothers and sisters. Please take care of yourself and your fellow man.” And with that, the whole group burst out laughing. They were all so proud of their plan, and so fascinated with the modern day business that they took to calling the lottery. After all, who couldn't guess that the stars would be sparkling in response to that one last television broadcast? Those unlucky enough to have the opportunity to pick one, or all, of the human race to die, felt a bit

In [45]:
# Loading the two CoheSentia data files
with open('raw_data/cohesentia_stories_1.json', 'r') as file:
    cohesentia_stories_1 = json.load(file)
with open('raw_data/cohesentia_stories_2.json', 'r') as file:
    cohesentia_stories_2 = json.load(file)

# Creating a DataFrame with both files
cohesentia_transcripts = pd.concat(
    [pd.DataFrame(cohesentia_stories_1).T, pd.DataFrame(cohesentia_stories_2).T]
).sort_values(by='StoryID')

# Extracting the consensus score (average global coherence) for each transcript
cohesentia_transcripts['Coherence'] = [holistic['consensus_score'] for holistic in cohesentia_transcripts['HolisticData']]

# Checking that the consensus scores fall within the 1 - 5 range
coherence_checks(cohesentia_transcripts, 'Coherence')

# Dropping unnecessary columns (copying so later edits do not act on a slice of the original)
cohesentia_transcripts_restructured = cohesentia_transcripts[['Text','Coherence']].copy()

# Applying truncation (ending transcripts at the last complete utterance)
cohesentia_transcripts_restructured = transcript_truncation(cohesentia_transcripts_restructured)

display(cohesentia_transcripts_restructured)

Unnamed: 0,Text,Coherence
0,David always knew he had a lot of weight to lo...,2
1,I was curious about the world and I wanted to ...,2
2,"Once upon a time, there were seven deadly sins...",2
3,Sally was a very special little girl. She had ...,5
4,Once upon a time there was an ocean of lies. S...,5
...,...,...
478,The human body is an amazing thing. It is able...,3
479,"Once upon a time, a young woman broke her vase...",4
480,"Once upon a time, two beautiful diamonds and a...",3
481,Lora and her team had been searching for the l...,5
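
To make the nested-score extraction above easier to follow, here is a small sketch on made-up records. It assumes only the fields the notebook itself reads (`StoryID`, `Text`, and `HolisticData` with a `consensus_score` entry); the top-level keys `'a'` and `'b'` and all values are hypothetical, and the real files may contain additional fields.

```python
import pandas as pd

# Toy records mimicking the fields the notebook reads; keys and values are made up
stories_1 = {
    'a': {'StoryID': 2, 'Text': 'Second story.', 'HolisticData': {'consensus_score': 4}},
}
stories_2 = {
    'b': {'StoryID': 1, 'Text': 'First story.', 'HolisticData': {'consensus_score': 2}},
}

# Each top-level key becomes a column, so transposing yields one row per story
frames = [pd.DataFrame(stories).T for stories in (stories_1, stories_2)]
toy = pd.concat(frames).sort_values(by='StoryID')

# Pulling the nested score out of each HolisticData dictionary
toy['Coherence'] = toy['HolisticData'].map(lambda holistic: holistic['consensus_score'])
print(toy[['StoryID', 'Text', 'Coherence']])
```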


In [46]:
# Concatenating and formatting both datasets to contain only the story text and an integer coherence value from 1 to 5
transcripts = pd.concat([hanna_transcripts_restructured,cohesentia_transcripts_restructured])
transcripts.reset_index(drop = True, inplace = True)
transcripts['Coherence'] = transcripts['Coherence'].astype(int)

# Checking that all scores remain within the 1 - 5 range after conversion
coherence_checks(transcripts, 'Coherence')

display(transcripts)

Unnamed: 0,Text,Coherence
0,"3,000 years have I been fighting. Every mornin...",4
1,"“Dad, you 're on TV again !” I heard Eric 's v...",5
2,"When Tyler entered the ward, his daughter Vale...",5
3,His body was failing. He had taken care of it ...,4
4,"I saw the button. It was simple, red, no words...",5
...,...,...
1534,The human body is an amazing thing. It is able...,3
1535,"Once upon a time, a young woman broke her vase...",4
1536,"Once upon a time, two beautiful diamonds and a...",3
1537,Lora and her team had been searching for the l...,5
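
The combination step above can be sketched on toy data as follows. The two small frames and their values are assumptions made for illustration; `ignore_index=True` is used here as a shorthand for concatenating and then calling `reset_index(drop=True)`.

```python
import pandas as pd

# Toy stand-ins for the two restructured frames (values are illustrative only)
hanna_like = pd.DataFrame({'Text': ['Story A.', 'Story B.'], 'Coherence': [4.0, 2.0]})
cohesentia_like = pd.DataFrame({'Text': ['Story C.'], 'Coherence': [5]}, index=['c'])

# ignore_index=True discards the original labels and renumbers rows from 0
combined = pd.concat([hanna_like, cohesentia_like], ignore_index=True)

# Converting the already-rounded scores to integers
combined['Coherence'] = combined['Coherence'].astype(int)
print(combined)
```

Note that `astype(int)` truncates toward zero instead of rounding, so it is only safe here because the scores were rounded to whole numbers beforehand.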


In [47]:
# Saving the preprocessed transcripts
transcripts.to_csv('processed_data/transcripts.csv')
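
When the saved file is read back later, the row index written by `to_csv` appears as an unnamed first column. The sketch below shows one way to handle that, using an in-memory buffer and made-up rows in place of `processed_data/transcripts.csv`; it is an illustration under those assumptions, not a step of this notebook.

```python
import io
import pandas as pd

# Made-up rows, including a quoted comma to show that the text survives the round trip
toy = pd.DataFrame({'Text': ['He said "no," twice.', 'Short story.'], 'Coherence': [3, 5]})

# Writing to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)  # the row index is written as an unnamed first column
buffer.seek(0)

# index_col=0 restores that first column as the index instead of an 'Unnamed: 0' column
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```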