## Environment setup

In [None]:
!pip install torch
!pip install -qq transformers

## Read the files

In [274]:
import os, sys, re
import pandas as pd

sys.path.insert(0, '/home/evelyn/Documents/digiphil/scripts')
from utils import *
from data_processing import *


datapath = "/home/evelyn/Documents/digiphil/data/gutenberg/dysto"
dystro_str = read_file_str(datapath)
book_names = get_names(datapath)

## Preprocessing

In [279]:
# Clean the texts, including replacing negative contraction ('t) to its full form (not), and normalize punctuations
preprocessed_dystro = mass_text_preprocessing(dystro_str)
print('Original: ', dystro_str[1][:1000])
print(" --------------------------------------- ")
print('Processed: ', preprocessed_dystro[1][:1000])

Original:  Everything was perfectly swell.  There were no prisons, no slums, no insane asylums, no cripples, no poverty, no wars.  All diseases were conquered. So was old age.  Death, barring accidents, was an adventure for volunteers.  The population of the United States was stabilized at forty-million souls.  One bright morning in the Chicago Lying-in Hospital, a man named Edward K. Wehling, Jr.,  waited for his wife to give birth. He was the only man waiting.  Not many people were born a day any more.  Wehling was fifty-six, a mere stripling in a population whose average age was one hundred and twenty-nine.  X-rays had revealed that his wife was going to have triplets. The children would be his first.  Young Wehling was hunched in his chair, his head in his hand. He was so rumpled, so still and colorless as to be virtually invisible. His camouflage was perfect, since the waiting room had a disorderly and demoralized air, too. Chairs and ashtrays had been moved away from the walls.  

In [288]:
# Segement the texts in to list of strings
book_sents = mass_sent_tokenizer(preprocessed_dystro)
# Remove empty lines
book_sents = remove_dot_line(book_sents)
book_sents[0][:10]

[' If it was good enough for your grandfather, forget it .',
 'it is much too good for anyone else!',
 'Gramps Ford, his chin resting on his hands, his hands on the crook of his cane, was staring irascibly at the five-foot television screen that dominated the room.',
 "On the screen, a news commentator was summarizing the day's happenings.",
 'Every thirty seconds or so, Gramps would jab the floor with his cane-tip and shout, "Hell, we did that a hundred years ago!"',
 "Emerald and Lou, coming in from the balcony, where they had been seeking that 2185 A. D. rarity--privacy--were obliged to take seats in the back row, behind Lou's father and mother, brother and sister-in-law, son and daughter-in-law, grandson and wife, granddaughter and husband, great-grandson and wife, nephew and wife, grandnephew and wife, great-grandniece and husband, great-grandnephew and wife--and, of course, Gramps, who was in front of everybody.",
 'All save Gramps, who was somewhat withered and bent, seemed, by 

In [289]:
# Convert the list of sentences into dataframe for easier data handling
df = pd.DataFrame(book_sents)
df = df.T # transpose  dataframe
df.columns = book_names #rename columns
df = df.replace(np.nan, '')
# combine all books into one column
df_ori = df.melt(value_name='all')['all']
df_ori = pd.DataFrame(df_ori)
df_ori

Unnamed: 0,all
0,"If it was good enough for your grandfather, f..."
1,it is much too good for anyone else!
2,"Gramps Ford, his chin resting on his hands, hi..."
3,"On the screen, a news commentator was summariz..."
4,"Every thirty seconds or so, Gramps would jab t..."
...,...
59895,The magnitude of the task may be understood wh...
59896,This is the end of the Everhard Manuscript.
59897,It breaks off abruptly in the middle of a...
59898,She must have received warning of the com...


## Sentiment Analysis

### Whole collection

In [663]:
# Perform sentiment analysis using BERT
whole_collect = list(df_ori["all"])
result_all_data = sentiment_analyzer(whole_collect)

# Get the classified sentiment                         
positive_num, negative_num = count_polarity(result_all_data)
print('Number of postive labels vs negative labels: ', positive_num, " vs ", negative_num)
print('Ratios of postive labels vs negative labels in the whole collection: ', polarity_ratio(result_all_data, positive_num, negative_num))

loading configuration file https://huggingface.co/siebert/sentiment-roberta-large-english/resolve/main/config.json from cache at /home/evelyn/.cache/huggingface/transformers/228e83e1ade2247aebc5f0725e330fa58dedee3d9eec36c9249f25084a946130.1aece0680a18a95d51d6e1a5f83631412da37b87db65380c52052161354505ba
Model config RobertaConfig {
  "_name_or_path": "siebert/sentiment-roberta-large-english",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers":

Number of postive labels vs negative labels:  14433  vs  45467
Ratios of postive labels vs negative labels in the whole collection:  (0.24095158597662772, 0.7590484140233723)


### for 2BR02B only

In [653]:
# Perform sentiment analysis using BERT
result_to_be_book = sentiment_analyzer(book_sents[1])

# Get the classified sentiment     
to_be_positive_num, to_be_negative_num = count_polarity(result_to_be_book)
print('Number of postive labels vs negative labels: ', to_be_positive_num, " vs ", to_be_negative_num)
print('Ratios of postive labels vs negative labels in the whole collection: ', polarity_ratio(result_to_be_book, to_be_positive_num, to_be_negative_num))

Number of postive labels vs negative labels:  115  vs  144
Ratios of postive labels vs negative labels in the whole collection:  (0.444015444015444, 0.555984555984556)


### for 2BR02B characters

##### Ed

In [655]:
# Perform sentiment analysis using BERT
result_to_be_ed = sentiment_analyzer(ed_dialog)

# Get the classified sentiment     
ed_positive_num, ed_negative_num = count_polarity(result_to_be_ed)
print('Number of postive labels vs negative labels: ', ed_positive_num, " vs ", ed_negative_num)
print('Ratios of postive labels vs negative labels in the whole collection: ', polarity_ratio(result_to_be_ed, ed_positive_num, ed_negative_num))

loading configuration file https://huggingface.co/siebert/sentiment-roberta-large-english/resolve/main/config.json from cache at /home/evelyn/.cache/huggingface/transformers/228e83e1ade2247aebc5f0725e330fa58dedee3d9eec36c9249f25084a946130.1aece0680a18a95d51d6e1a5f83631412da37b87db65380c52052161354505ba
Model config RobertaConfig {
  "_name_or_path": "siebert/sentiment-roberta-large-english",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers":

Number of postive labels vs negative labels:  7  vs  16
Ratios of postive labels vs negative labels in the whole collection:  (0.30434782608695654, 0.6956521739130435)





In [656]:
result_to_be_ed

Unnamed: 0,text,pred,label,score
0,one bright morning in the chicago lying-in hos...,1,POSITIVE,0.996789
1,"wehling was fifty-six, a mere stripling in a p...",0,NEGATIVE,0.996718
2,"young wehling was hunched in his chair, his he...",0,NEGATIVE,0.998107
3,"wehling, the waiting father, mumbled something...",0,NEGATIVE,0.998563
4,"""wehling, "" said the waiting father, sitting u...",0,NEGATIVE,0.996208
5,"""edward k. wehling, jr. , is the name of the h...",1,POSITIVE,0.998001
6,"""oh, mr. wehling, "" said dr. hitz, ""i didn not...",0,NEGATIVE,0.998021
7,"""the invisible man, "" said wehling.",0,NEGATIVE,0.989193
8,"""hooray, "" said wehling emptily.",0,NEGATIVE,0.998388
9,said wehling.,1,POSITIVE,0.978962


##### Dr

In [673]:
# Perform sentiment analysis using BERT
result_to_be_dr = sentiment_analyzer(dr_dialog)

dr_positive_num, dr_negative_num = count_polarity(result_to_be_dr)
print('Number of postive labels vs negative labels: ', dr_positive_num, " vs ", dr_negative_num)
print('Ratios of postive labels vs negative labels in the whole collection: ', polarity_ratio(result_to_be_dr, dr_positive_num, dr_negative_num))

loading configuration file https://huggingface.co/siebert/sentiment-roberta-large-english/resolve/main/config.json from cache at /home/evelyn/.cache/huggingface/transformers/228e83e1ade2247aebc5f0725e330fa58dedee3d9eec36c9249f25084a946130.1aece0680a18a95d51d6e1a5f83631412da37b87db65380c52052161354505ba
Model config RobertaConfig {
  "_name_or_path": "siebert/sentiment-roberta-large-english",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers":

Number of postive labels vs negative labels:  14  vs  7
Ratios of postive labels vs negative labels in the whole collection:  (0.6666666666666666, 0.3333333333333333)





##### Leora

In [675]:
# Perform sentiment analysis using BERT
result_to_be_leora = sentiment_analyzer(leora_dialog)

leora_positive_num, leora_negative_num = count_polarity(result_to_be_leora)
print('Number of postive labels vs negative labels: ', leora_positive_num, " vs ", leora_negative_num)
print('Ratios of postive labels vs negative labels in the whole collection: ', polarity_ratio(result_to_be_leora, leora_positive_num, leora_negative_num))

loading configuration file https://huggingface.co/siebert/sentiment-roberta-large-english/resolve/main/config.json from cache at /home/evelyn/.cache/huggingface/transformers/228e83e1ade2247aebc5f0725e330fa58dedee3d9eec36c9249f25084a946130.1aece0680a18a95d51d6e1a5f83631412da37b87db65380c52052161354505ba
Model config RobertaConfig {
  "_name_or_path": "siebert/sentiment-roberta-large-english",
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 16,
  "num_hidden_layers":

Number of postive labels vs negative labels:  4  vs  6
Ratios of postive labels vs negative labels in the whole collection:  (0.4, 0.6)





## Emotion Classification

#### Functions

In [None]:
!pip install NRCLex

In [None]:
from collections import Counter, defaultdict

def flatten_list(nested_list):
    return [item for items in nested_list for item in items]

def emo_count(text):
    """ 
    count the total number of different emotions in the given text
    
    param: list of text (string) 
    return dict proportion of the emotions in the text
    """
    emo_count_dict = defaultdict(int)
    
    for sent in text:
        data = NRCLex(sent)
        emo_d = data.raw_emotion_scores
        for emo, score in emo_d.items():
          if emo in emo_count_dict:
              emo_count_dict[emo] += score
          else: 
              emo_count_dict[emo] = score
    return emo_count_dict


def sort_dict(ind):
    """ Function to sort dictionary by value in descending order """
    return sorted(ind.items(), key=lambda x: x[1], reverse=True)


import plotly.express as px

def dict_to_df(in_df):
    """ Convert multiclass dict to multiclass dataframe
    arg: dictionary
    return dataframe
    """
    emo_df = pd.DataFrame.from_dict(in_df, orient='index')
    emo_df = emo_df.reset_index()
    emo_df = emo_df.rename(columns={'index' : 'Emotion Classification' , 0: 'Emotion Count'})
    emo_df = emo_df.sort_values(by=['Emotion Count'], ascending=False)
    return emo_df

def label_classifier(label):
    if label == "trust":
        return 0 # positive
    if label == "positive":
        return 0 # positive
    if label == "joy":
        return 0 # positive
    if label == "anticipation":
        return 0 # positive 
    if label == "surprise":
        return 0 # positive
    if label == "fear":
        return 1 # negative
    if label == "negative":
        return 1 # negative
    if label == "sadness":
        return 1 # negative
    if label == "anger":
        return 1 # negative
    if label == "disgust":
        return 1 # negative


def multiclass_to_binary_count(df, binarynum=0):
    """ 
    Function to classify the emotion into binary (pos/neg) and count their total sum 
    """
    df['Emotion Classification']= df['Emotion Classification'].apply(lambda x: label_classifier(x))
    binary = df[df['Emotion Classification']==binarynum] #positve emotion
    return binary["Emotion Count"].sum().astype(np.int32)


def multi_2_binary_df(df, pos_num=27):
    neg_num = df["Emotion Count"].sum() - pos_num
    new_df = pd.DataFrame([pos_num, neg_num], index= ["positive", "negative"], columns=["Count"])
    new_df = new_df.reset_index()
    return new_df
    
    
def visualize_df(in_df, x, y, title):
    """ Plot dataframe to bar chart"""
    fig = px.bar(in_df, x = x, y = y, color = y, title = title, orientation='h', width = 800, height = 400)
    fig.show()

#### Whole collection

In [481]:
collection_sents = flatten_list(book_sents)
collection_emo = emo_count(collection_sents)
collection_emo

defaultdict(int,
            {'anticipation': 8911,
             'joy': 7125,
             'positive': 18018,
             'surprise': 4944,
             'trust': 10674,
             'negative': 15508,
             'anger': 6239,
             'fear': 8706,
             'sadness': 7923,
             'disgust': 3803})

In [644]:
all_emo_df = dict_to_df(collection_emo)
print(visualize_df(all_emo_df, x='Emotion Count', y='Emotion Classification', title="Emotional distribution in the entire collection")) #27/45 = 0.6
print(visualize_df(multi_2_binary_df(all_emo_df, multiclass_to_binary_count(all_emo_df, 0)), x="Count", y= "index", title="Binary version of emotional distribution in the entire collection"))

None


None


#### To Be Book

In [668]:
to_be_emo = emo_count(book_sents[1])
to_be_df = dict_to_df(to_be_emo)
print(visualize_df(to_be_df, x='Emotion Count', y='Emotion Classification', title="Emotional distribution in the 2BR02B book"))
print(visualize_df(multi_2_binary_df(to_be_df, multiclass_to_binary_count(to_be_df, 0)), x="Count", y= "index", title="Binary version of emotional distribution in the 2BR02B book"))

None


None


#### Characters

In [881]:
def emo_count_print(text):
    """ 
    count the total number of different emotions in the given text
    
    param: list of text (string) 
    return dict proportion of the emotions in the text
    """    
    for sent in text:
        data = NRCLex(sent)
        emo_d = data.affect_list
        print(sent)
        print(emo_d)
        print()
        
emo_count_print(dr_dialog)

"that's good of dr. hitz, " said the orderly.
['anticipation', 'joy', 'positive', 'surprise', 'trust', 'positive']

he was referring to one of the male figures in white, whose head was a portrait of dr. benjamin hitz, the hospital's chief obstetrician.
['anticipation', 'joy', 'positive', 'trust', 'fear', 'sadness', 'trust', 'trust']

hitz was a blindingly handsome man.
[]

"gosh--" she said, and she blushed and became humble--"that--that puts me right next to dr. hitz. "
['disgust', 'negative', 'positive', 'sadness']

she said, worshiping the portrait of hitz.
[]

and, while leora duncan was posing for her portrait, into the waitingroom bounded dr. hitz himself.
[]

said dr. hitz heartily.
['joy', 'positive']

"last i heard, " said dr. hitz, "they had one, and were trying to scrape another two up. "
[]

"oh, mr. wehling, " said dr. hitz, "i didn not see you. "
[]

"they just phoned me that your triplets have been born, " said dr. hitz.
[]

"you don not sound very happy, " said dr. hitz

In [636]:
ed_emo = emo_count(ed_dialog)
dr_emo = emo_count(dr_dialog)
leora_emo = emo_count(leora_dialog)
print("Edward's emotion distribution is", sort_dict(ed_emo))
print("Dr's emotion distribution is", sort_dict(dr_emo))
print("Leora's emotion distribution is", sort_dict(leora_emo))

Edward's emotion distribution is [('trust', 8), ('positive', 7), ('fear', 6), ('joy', 5), ('negative', 5), ('sadness', 4), ('anticipation', 4), ('surprise', 3), ('anger', 3)]
Dr's emotion distribution is [('positive', 10), ('trust', 9), ('joy', 6), ('anticipation', 5), ('sadness', 3), ('negative', 3), ('surprise', 2), ('fear', 2), ('disgust', 1), ('anger', 1)]
Leora's emotion distribution is [('negative', 2), ('anger', 1), ('fear', 1), ('sadness', 1), ('surprise', 1)]


##### Ed

In [875]:
ed_emo_df = dict_to_df(ed_emo)
print(ed_emo_df)
print(visualize_df(ed_emo_df, x='Emotion Count', y='Emotion Classification', title="Edward Emotional Distribution")) #27/45 = 0.6
print(visualize_df(multi_2_binary_df(ed_emo_df, multiclass_to_binary_count(ed_emo_df, 0)), x="Count", y= "index", title="Binary version of Edward Emotional Distribution"))

  Emotion Classification  Emotion Count
2                  trust              8
5               positive              7
0                   fear              6
4                    joy              5
7               negative              5
1                sadness              4
3           anticipation              4
6               surprise              3
8                  anger              3


None


None


##### Dr

In [876]:
dr_emo_df = dict_to_df(dr_emo)
print(dr_emo_df)
print(visualize_df(dr_emo_df, x='Emotion Count', y='Emotion Classification', title="Dr Emotional Distribution")) #27/45 = 0.6
print(visualize_df(multi_2_binary_df(dr_emo_df, multiclass_to_binary_count(dr_emo_df, 0)), x="Count", y= "index", title="Binary version of Dr Emotional Distribution"))

  Emotion Classification  Emotion Count
2               positive             10
4                  trust              9
1                    joy              6
0           anticipation              5
6                sadness              3
8               negative              3
3               surprise              2
5                   fear              2
7                disgust              1
9                  anger              1


None


None


##### Leora

In [880]:
leora_emo_df = dict_to_df(leora_emo)
print(leora_emo_df)
print(visualize_df(leora_emo_df, x='Emotion Count', y='Emotion Classification', title="Leora's Emotional Distribution")) #27/45 = 0.6
print(visualize_df(multi_2_binary_df(leora_emo_df, multiclass_to_binary_count(leora_emo_df, 0)), x="Count", y= "index", title="Binary version of Leora's Emotional Distribution"))

  Emotion Classification  Emotion Count
0               negative              2
1                  anger              1
2                   fear              1
3                sadness              1
4               surprise              1


None


None


In [850]:
def emo_df_for_eval(text):
    """ 
    create dataframe for predicted emotions in the given text
    
    param: list of text (string) 
    return df
    """
    emo_final_list = []
    
    for sent in text:
        data = NRCLex(sent)
        # emo_d = data.raw_emotion_scores
        emo_list = data.affect_list
    
        if sent not in emo_final_list:
            if len(emo_list) > 0:
                emo_final_list.append([sent, emo_list])
            else: 
                emo_final_list.append([sent, ["NaN"]])
        else:
            emo_final_list.append([sent, emo_list])
    emo_df_char = pd.DataFrame(emo_final_list, columns= ['sentence', 'emo_pred'])
    return emo_df_char

leora_emo_pred_df = emo_df_for_eval(leora_dialog)
dr_emo_pred_df = emo_df_for_eval(dr_dialog)
ed_emo_pred_df = emo_df_for_eval(ed_dialog)

In [851]:
leora_emo_pred_df

Unnamed: 0,sentence,emo_pred
0,"""my name's leora duncan. """,[NaN]
1,"""duncan, duncan, duncan, "" he said, scanning t...",[NaN]
2,"""well, "" said leora duncan, ""that's more the d...",[negative]
3,"and, while leora duncan was posing for her por...",[NaN]
4,"""well, miss duncan!",[NaN]
5,"miss duncan!""",[NaN]
6,said leora duncan.,[NaN]
7,"""i wish people wouldn not call it that, "" said...",[NaN]
8,"""that sounds so much better, "" said leora duncan.",[NaN]
9,and then he shot leora duncan.,"[anger, fear, negative, sadness, surprise]"


## Evaluation

In [826]:
# Load eval data

eval_bi_leora = pd.read_csv("/home/evelyn/Documents/digiphil/final_project/data_label/labeled_leora.csv", converters={'actual_emo': lambda x: x[:].split(',')})
eval_bi_dr = pd.read_csv("/home/evelyn/Documents/digiphil/final_project/data_label/labeled_dr.csv", converters={'actual_emo': lambda x: x[:].split(',')})
eval_bi_ed= pd.read_csv("/home/evelyn/Documents/digiphil/final_project/data_label/labeled_ed.csv", converters={'actual_emo': lambda x: x[:].split(',')})

In [827]:
eval_bi_leora

Unnamed: 0,text,actual_label,pred,label,actual_emo
0,"""my name's leora duncan. """,1,1,POSITIVE,[positive]
1,"""duncan, duncan, duncan, "" he said, scanning t...",0,0,NEGATIVE,[anticipation]
2,"""well, "" said leora duncan, ""that's more the d...",0,0,NEGATIVE,[negative]
3,"and, while leora duncan was posing for her por...",1,1,POSITIVE,"[positive, surprise]"
4,"""well, miss duncan!",1,0,NEGATIVE,"[positive, surprise]"
5,"miss duncan!""",1,0,NEGATIVE,[positive]
6,said leora duncan.,1,1,POSITIVE,[neutral]
7,"""i wish people wouldn not call it that, "" said...",0,0,NEGATIVE,[sadness]
8,"""that sounds so much better, "" said leora duncan.",1,1,POSITIVE,[positive]
9,and then he shot leora duncan.,0,0,NEGATIVE,"[sadness, anger, negative]"


In [686]:
from sklearn.metrics import f1_score

In [752]:
# Binary sentiment analysis F1 score 
print("Leora F1 score: ", f1_score(list(eval_bi_leora["actual_label"]), list(result_to_be_leora['pred']), average='macro'))
print("Dr F1 score: ", f1_score(list(eval_bi_dr["actual_label"]), list(result_to_be_dr['pred']), average='macro'))
print("Ed F1 score: ",f1_score(list(eval_bi_ed["actual_label"]), list(result_to_be_ed['pred']), average='macro'))

Leora F1 score:  0.8
Dr F1 score:  0.8444444444444444
Ed F1 score:  0.8083333333333333


In [874]:
# Micro
print("Leora F1 score: ", f1_score(list(eval_bi_leora["actual_label"]), list(result_to_be_leora['pred']), average='micro'))
print("Dr F1 score: ", f1_score(list(eval_bi_dr["actual_label"]), list(result_to_be_dr['pred']), average='micro'))
print("Ed F1 score: ",f1_score(list(eval_bi_ed["actual_label"]), list(result_to_be_ed['pred']), average='micro'))

Leora F1 score:  0.8000000000000002
Dr F1 score:  0.8571428571428571
Ed F1 score:  0.8260869565217391


In [753]:
# # Binary sentiment analysis accuracy score 

import numpy as np

def compute_accuracy(y_true, y_pred):
    return np.sum(np.equal(y_true, y_pred)) / len(y_true)

print("Leora accuracy score: ", compute_accuracy(list(eval_bi_leora["actual_label"]), list(result_to_be_leora['pred'])))
print("Dr accuracy score: ",compute_accuracy(list(eval_bi_dr["actual_label"]), list(result_to_be_dr['pred'])))
print("Ed accuracy score: ",compute_accuracy(list(eval_bi_ed["actual_label"]), list(result_to_be_ed['pred'])))

Leora accuracy score:  0.8
Dr accuracy score:  0.8571428571428571
Ed accuracy score:  0.8260869565217391


### evaluating the emotion classification 


In [828]:
# Create df for evaluating the emotion classification 

#for_emo_eval_leora = eval_bi_leora.assign(actual_emo=eval_bi_leora["actual_emo"].str.split(',')).explode("actual_emo")
#for_emo_eval_dr = eval_bi_dr.assign(actual_emo=eval_bi_dr["actual_emo"].str.split(',')).explode("actual_emo")
#for_emo_eval_ed = eval_bi_ed.assign(actual_emo=eval_bi_ed["actual_emo"].str.split(',')).explode("actual_emo")


def clean_df_for_emoeval(indf, columns=["actual_label", "pred", "label"], axis=1):
    # drop columns that are not needed
    indf = indf.drop(columns, axis=1)
    # rename the existing columns so that they are the same as the prediction df
    indf.columns = ["sentence", "emo_pred"]
    return indf


def find_difference_in_df(df_true, df_pred):
     """ find the different between the two df """
     diff_list = [x for x in list(df_true['sentence'].unique()) if x not in list(df_pred['sentence'].unique())]
     dff_df = df_true[(df_true['sentence'].isin(diff_list))]
     return dff_df


def normalize_NA_pred(df_true, df_pred):
    """ Function to normalize the difference between the two 
    by appending the missing sentences found in the find_diff_in_df function
    to the current prediction df to get a fully complete df 
    with all the sentences as it is in the original dataset

    """
    missing_sent_df = find_difference_in_df(df_true, df_pred)
    current_df = df_pred
    return missing_sent_df.append(current_df, ignore_index=True) 

In [None]:
for_emo_eval_leora = clean_df_for_emoeval(eval_bi_leora)
for_emo_eval_dr = clean_df_for_emoeval(eval_bi_dr)
for_emo_eval_ed = clean_df_for_emoeval(eval_bi_ed)

normalized_emo_pred_df_leora = normalize_NA_pred(for_emo_eval_leora, leora_emo_pred_df)
normalized_emo_pred_df_dr = normalize_NA_pred(for_emo_eval_dr, dr_emo_pred_df)
normalized_emo_pred_df_ed = normalize_NA_pred(for_emo_eval_ed, ed_emo_pred_df)

In [872]:
def compute_accruacy_multilabel(true_df, pred_df):
    true_dict = true_df.to_dict()
    pred_dict = pred_df.to_dict()
     
    accuracy = []
    
    assert len(true_dict["sentence"]) == len(pred_dict["sentence"])
    assert len(true_dict["emo_pred"]) == len(pred_dict["emo_pred"])
    
    for idx, tru_label in true_dict["emo_pred"].items():
        for i, pred_label in pred_dict["emo_pred"].items():
            if idx == i:
                interiem_accruacy = []
                for label in pred_label:
                    if label in tru_label:
                      #  print(label)
                        interiem_accruacy.append(1)
                    else:
                        interiem_accruacy.append(0)
                cal_interiem_acc = sum(interiem_accruacy) / len(tru_label)
                accuracy.append(cal_interiem_acc)
               # print(cal_interiem_acc)
                
    return np.mean(accuracy) 

In [873]:
print("Leora accuracy score: ", compute_accruacy_multilabel(for_emo_eval_leora, normalized_emo_pred_df_leora))
print("Dr accuracy score: ", compute_accruacy_multilabel(for_emo_eval_dr, normalized_emo_pred_df_dr))
print("Ed accuracy score: ", compute_accruacy_multilabel(for_emo_eval_ed, normalized_emo_pred_df_ed))

Leora accuracy score:  0.13333333333333333
Dr accuracy score:  0.39682539682539686
Ed accuracy score:  0.25362318840579706


## Character Frequency

Sentence segmentation
Tokenization

1. Find name entities
2. Analyze top 3 characters by occurrence count

In [330]:
import spacy

nlp = spacy.load('en_core_web_sm')

def list_name_entities(text):
    unique_name = set()
    name_entities = []
    for sentence in text:
        doc = nlp(sentence)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                unique_name.add(ent.text)
                name_entities.append(ent.text)
    print(name_entities)
    name_counts = pd.value_counts(name_entities)
    return name_counts

list_name_entities(book_sents[1])

['Edward K. Wehling', 'Young Wehling', 'Hitz', 'Benjamin Hitz', 'Hitz', 'Cannery', 'Mother', 'Lucky Pierre', 'Leora Duncan', 'Duncan', 'Duncan', 'Duncan', 'Leora Duncan', 'Hitz', 'Hitz', 'Zeus', 'Leora Duncan', 'Hitz', 'Duncan', 'Duncan', 'Hitz', 'Leora Duncan', 'Hitz', 'Edward K. Wehling', 'Wehling', 'Hitz', 'Hitz', 'Hitz', 'Hitz', 'Hitz', 'Wehling', 'Hitz', 'Hitz', 'Hitz', 'Hitz', 'Leora Duncan', 'Hitz', 'Hitz', 'Leora Duncan', 'Wehling', 'Hitz', 'Hitz', 'Leora Duncan']


Hitz                 20
Leora Duncan          7
Duncan                5
Wehling               3
Edward K. Wehling     2
Young Wehling         1
Benjamin Hitz         1
Cannery               1
Mother                1
Lucky Pierre          1
Zeus                  1
dtype: int64

In [298]:
to_be_or_not_sents = list(df["2_B_R_0_2_B"])
to_be_or_not_tokens = [word_tokenize(t) for t in to_be_or_not_sents] # tokenize the sentence to word level
to_be_or_not_tokens = [item for items in to_be_or_not_tokens for item in items] # flatten the nest list
to_be_or_not_tokens

['Everything',
 'was',
 'perfectly',
 'swell',
 '.',
 'There',
 'were',
 'no',
 'prisons',
 ',',
 'no',
 'slums',
 ',',
 'no',
 'insane',
 'asylums',
 ',',
 'no',
 'cripples',
 ',',
 'no',
 'poverty',
 ',',
 'no',
 'wars',
 '.',
 'All',
 'diseases',
 'were',
 'conquered',
 '.',
 'So',
 'was',
 'old',
 'age',
 '.',
 'Death',
 ',',
 'barring',
 'accidents',
 ',',
 'was',
 'an',
 'adventure',
 'for',
 'volunteers',
 '.',
 'The',
 'population',
 'of',
 'the',
 'United',
 'States',
 'was',
 'stabilized',
 'at',
 'forty-million',
 'souls',
 '.',
 'One',
 'bright',
 'morning',
 'in',
 'the',
 'Chicago',
 'Lying-in',
 'Hospital',
 ',',
 'a',
 'man',
 'named',
 'Edward',
 'K.',
 'Wehling',
 ',',
 'Jr.',
 ',',
 'waited',
 'for',
 'his',
 'wife',
 'to',
 'give',
 'birth',
 '.',
 'He',
 'was',
 'the',
 'only',
 'man',
 'waiting',
 '.',
 'Not',
 'many',
 'people',
 'were',
 'born',
 'a',
 'day',
 'any',
 'more',
 '.',
 'Wehling',
 'was',
 'fifty-six',
 ',',
 'a',
 'mere',
 'stripling',
 'in',
 'a',

In [370]:
character_dict = {
                  "ed": ["edward k. wehling jr.", "wehling", "edward k. wehling", 'young wehling', "edward", "mr wehling"],
                  "painter": ["the painter", "painter"],
                  "dr": ["dr. ritz", "dr. benjamin hitz", "hitz", "benjamin"],
                  "leora": ["leora duncan", "leora", "duncan", "miss duncan", "ms duncan"],
                  }

def count_characters(txt, in_dict): # take list of strings (words)
    count_dict = {}
    for word in txt:
        lower_word = word.lower()
        for name, var in in_dict.items():
            if lower_word in var:    
                if name in count_dict:
                    count_dict[name] += 1
                else:
                    count_dict[name] = 0
    return count_dict

character_freq = count_characters(to_be_or_not_tokens, character_dict)
name_count_df = pd.DataFrame.from_dict(character_freq, orient='index', columns=["count"])
name_count_df = name_count_df.reset_index()
name_count_df.columns= ["name", "count"]
name_count_df.sort_values("count",ascending=False)

Unnamed: 0,name,count
0,ed,24
2,dr,20
3,leora,17
1,painter,14


In [None]:
def find_name(name, text):
    m = re.search(rf"\b(?=\w){name}\b(?!\w)", text, re.IGNORECASE)
  #  print(m)
    try:
        return m.group(0)
    except:
        pass

def book_find_name(name, inlist):
    """ 
    """
    name_dialog = []
    for i in range(len(inlist)):
        sent = inlist[i]
        find_n = find_name(name, sent)
        if find_n is not None:
            name_dialog.append(find_n)
    return name_dialog

#find_name("said wehling", to_be_or_not_sents[1])
book_find_name("wehling", to_be_or_not_sents)

In [369]:
lowercase_book = [x.lower() for x in to_be_or_not_sents]
for sent in lowercase_book:
    if "wehling" in sent:
        print(sent)

one bright morning in the chicago lying-in hospital, a man named edward k. wehling, jr. ,  waited for his wife to give birth.
wehling was fifty-six, a mere stripling in a population whose average age was one hundred and twenty-nine.
young wehling was hunched in his chair, his head in his hand.
wehling, the waiting father, mumbled something without raising his head.
"wehling, " said the waiting father, sitting up, red-eyed and frowzy.
"edward k. wehling, jr. , is the name of the happy father-to-be. "
"oh, mr. wehling, " said dr. hitz, "i didn not see you. "
"the invisible man, " said wehling.
"hooray, " said wehling emptily.
said wehling.
dr. hitz became rather severe with wehling, towered over him.
wehling?"
"i think it's perfectly keen, " said wehling tautly.
wehling?"
"nope, " said wehling sulkily.
"a drupelet, mr. wehling, is one of the little knobs, one of the little pulpy grains of a blackberry, " said dr. hitz.
wehling continued to stare at the same spot on the wall.
"i want thos

In [388]:
def get_character_dialogue(book, character_dict, character):
    """
    Get the dialog of a specific character for the given book
    Args:
        book ([list of strings]): [a book containing sentences as strings]
    """
    character_name_variations = character_dict[character]
        
    lowercase_book = [x.lower() for x in book]
    dialog_list = []
    
    for sent in lowercase_book:
        chara_dialog = [sent for name in character_name_variations if name in sent]
        
        if len(chara_dialog) < 2 and len(chara_dialog) > 0:
            dialog_list.append(chara_dialog)
        if len(chara_dialog) > 1:
            dialog_list.append([chara_dialog[0]])
            
    return dialog_list

In [392]:
ed_dialog = flatten_list(get_character_dialogue(to_be_or_not_sents, character_dict, "ed"))
dr_dialog = flatten_list(get_character_dialogue(to_be_or_not_sents, character_dict, "dr"))
leora_dialog = flatten_list(get_character_dialogue(to_be_or_not_sents, character_dict, "leora"))

In [393]:
ed_dialog

['one bright morning in the chicago lying-in hospital, a man named edward k. wehling, jr. ,  waited for his wife to give birth.',
 'wehling was fifty-six, a mere stripling in a population whose average age was one hundred and twenty-nine.',
 'young wehling was hunched in his chair, his head in his hand.',
 'wehling, the waiting father, mumbled something without raising his head.',
 '"wehling, " said the waiting father, sitting up, red-eyed and frowzy.',
 '"edward k. wehling, jr. , is the name of the happy father-to-be. "',
 '"oh, mr. wehling, " said dr. hitz, "i didn not see you. "',
 '"the invisible man, " said wehling.',
 '"hooray, " said wehling emptily.',
 'said wehling.',
 'dr. hitz became rather severe with wehling, towered over him.',
 'wehling?"',
 '"i think it\'s perfectly keen, " said wehling tautly.',
 'wehling?"',
 '"nope, " said wehling sulkily.',
 '"a drupelet, mr. wehling, is one of the little knobs, one of the little pulpy grains of a blackberry, " said dr. hitz.',
 'we

In [394]:
leora_dialog

['"my name\'s leora duncan. "',
 '"duncan, duncan, duncan, " he said, scanning the list.',
 '"well, " said leora duncan, "that\'s more the disposal people, isn not it?',
 'and, while leora duncan was posing for her portrait, into the waitingroom bounded dr. hitz himself.',
 '"well, miss duncan!',
 'miss duncan!"',
 'said leora duncan.',
 '"i wish people wouldn not call it that, " said leora duncan.',
 '"that sounds so much better, " said leora duncan.',
 'and then he shot leora duncan.']