A word's arousal is a measure of how exciting, interesting, or attention-grabbing it is. In this notebook, we will assign an arousal score to every word used in the texts, and then explore whether arousal influences how memorable sentences are.

**Possible Measures**

1. [The Glasgow Norms](https://link.springer.com/article/10.3758/s13428-018-1099-3) - This study had 1000 British participants rate `5,000 words` in person on a scale from 1-9 (boring --> exciting).
   * This is one of the best word pools out there because it takes into account that words have multiple meanings (*e.g.* a romantic date is much more exciting than a calendar date)
2. [Warriner, Kuperman, and Brysbaert Emotional Norms](https://link.springer.com/article/10.3758/s13428-012-0314-x) - This study had 2000 online raters score `13,000 words` on a scale from 1-9 (boring --> exciting).

# (0) Load in Data

In [47]:
import pandas as pd
import numpy as np
import json
import string

passages = ["A1", "A2", "A3", "A4", "A5"]  
df = pd.DataFrame()

for pas_id in passages:
    pas_df = pd.read_excel("data/ListA_TextProperties.xlsx", sheet_name=pas_id).drop(columns=["max_arousal_Glasgow", "arousal_Glasgow"], errors='ignore')
    df = pd.concat([df, pas_df], ignore_index= True)

df.head(1) 

Unnamed: 0,sentence,rec_prob,concreteness_Brysbaert,sentence_length,word_length,word_frequency_cmx,word_concreteness_cmx,grade_level,input,paragraph,passage_id
0,Small and large abscesses may need to be treat...,0.84,2.94,10,1.5,2.14,446.33,6.01,1,1,A1


In [48]:
## This will be the dictionary we store each (word, arousal) pair in
word_to_arousal_dictionary = {} 

## This dictionary maps each inflected passage word to its root word in the word pool
with open("data/Arousal_equivalent_words.json","r") as dic:           
    equivalence_dict = json.load(dic)

# (1) Glasgow Norms

This word pool is interesting as it contains scores for ambiguous words. Let's load in the scores, and see what percent of words are covered by this word pool.
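As a rough illustration of what "covered" means here, the sketch below strips the sense tag from entries such as `wound (cut)` and reports the fraction of passage words that match at least one entry. The helper names `base_word` and `coverage` and the toy word lists are hypothetical; only the `split(' (')` matching rule is taken from the cell below.

```python
def base_word(entry: str) -> str:
    """Strip a trailing ' (sense)' tag, e.g. 'wound (cut)' -> 'wound'."""
    return entry.split(" (")[0]


def coverage(passage_words, pool_entries):
    """Return (fraction of unique passage words covered, sorted missing words)."""
    pool_bases = {base_word(entry) for entry in pool_entries}
    unique_words = set(passage_words)
    if not unique_words:                      # nothing to score
        return float("nan"), []
    missing = sorted(w for w in unique_words if w not in pool_bases)
    covered = len(unique_words) - len(missing)
    return covered / len(unique_words), missing


# Toy data (hypothetical): a handful of pool entries and passage words
toy_pool = ["process", "process (modify)", "wound (cut)", "pain"]
toy_words = ["process", "wound", "the", "abscess"]

fraction, missing = coverage(toy_words, toy_pool)
print(f"{fraction:.0%} covered; missing: {missing}")
```

Matching on the bare word means an ambiguous word counts as covered as soon as any one of its senses is scored.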

In [49]:
# (0) Load in word pool
glasgow_arousal_df = pd.read_csv("data/Glasgow_Norms.csv")
glasgow_arousal_df = glasgow_arousal_df[["Words", "AROU"]]
glasgow_arousal_df = glasgow_arousal_df.rename(columns={"Words": "word", "AROU":"arousal"}).drop(0)

glasgow_arousal_df.head()

Unnamed: 0,word,arousal
1,abattoir,4.2
2,abbey,3.125
3,abbreviate,3.273
4,abdicate,4.194
5,abdication,3.846


In [50]:
# (0) Get a list of unique words used in each passage
passage_words = " ".join(df["sentence"])                            # Passage string
passage_words = passage_words.replace("%"," percent")
passage_words = passage_words.replace("18"," eighteen")
passage_words = passage_words.replace("36"," thirty six")
passage_words = passage_words.replace("9"," nine")
passage_words = passage_words.translate(str.maketrans('', '', string.punctuation))      # Removes punctuation
passage_words = passage_words.lower().split()                       # Turns lower case and to list
passage_words = list(set(passage_words))                            # Get unique words

# (1) Find % of words immediately covered by word pool
word_pool = glasgow_arousal_df["word"].to_list()
arousal_lst = glasgow_arousal_df["arousal"].to_list()
num_words_in_word_pool = 0
not_in_pool = []
words_with_multiple_meanings = []

for passage_word in passage_words:                                          # For each word used in the passage, check whether it has an arousal score
    scored_word_lst = []
    origWord = passage_word                                                    # Save original word

    for _ in range(2):                                                         # For the case:   passage_word -> root word -> word (meaning)
        if passage_word in equivalence_dict:                                   # If the passage word needs to be stemmed, stem it.
            passage_word = equivalence_dict[passage_word]


    for scored_word, arousal in zip(word_pool, arousal_lst): 
        if scored_word.split(' (')[0] == passage_word or scored_word == passage_word:
            scored_word_lst.append(scored_word)
            word_to_arousal_dictionary[scored_word] = float(arousal)


    if len(scored_word_lst) == 0:
        not_in_pool.append(origWord)

    else:                                                                      # At least one scored entry matched
        num_words_in_word_pool += 1
        if len(scored_word_lst) >= 2:
            words_with_multiple_meanings.append((passage_word, scored_word_lst))


percent = num_words_in_word_pool / len(passage_words)

print(f"Total # of Passage Words {len(passage_words)}")
print(f"Number of Words Covered: {num_words_in_word_pool}")
print(f"Missing words: {sorted(not_in_pool)}")
print(f"Ambiguous words ({len(words_with_multiple_meanings)} words): {words_with_multiple_meanings}")

Total # of Passage Words 406
Number of Words Covered: 187
Missing words: ['a', 'about', 'abrasions', 'abscess', 'abscesses', 'accessible', 'additional', 'airway', 'along', 'also', 'an', 'and', 'anesthetic', 'another', 'antibiotic', 'any', 'are', 'area', 'areas', 'around', 'as', 'ask', 'asking', 'assess', 'assessment', 'assessments', 'at', 'avpu', 'back', 'backward', 'be', 'because', 'before', 'bleeding', 'both', 'breathing', 'by', 'calculate', 'calculated', 'can', 'cause', 'caused', 'causes', 'chemical', 'chemicals', 'coma', 'compress', 'consciousness', 'considered', 'contacted', 'contamination', 'contusion', 'contusions', 'could', 'cravats', 'day', 'degree', 'different', 'differently', 'do', 'drainage', 'drapes', 'dressings', 'dry', 'each', 'eighteen', 'electrical', 'elevate', 'enough', 'environment', 'excess', 'exit', 'explosions', 'exposed', 'extended', 'extensive', 'eyeball', 'eyebrows', 'eyelashes', 'feeling', 'find', 'finding', 'five', 'flame', 'flush', 'for', 'found', 'fracture'

Let's keep the ratings for the 46% of words (187 of 406) covered by this word pool, and try to cover the remaining words using the Warriner, Kuperman, and Brysbaert (WKB) ratings.
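The two-pool strategy can be sketched as a single lookup that tries the Glasgow scores first and falls back to the second pool only when the word is absent. This is a minimal sketch, not the notebook's actual implementation: `lookup_arousal` is a hypothetical helper, the scores are copied from outputs printed in this notebook, and the `abandoned -> abandon` mapping is made up for illustration.

```python
import math


def lookup_arousal(word, primary, fallback, equivalents=None):
    """Return (score, source) for a word, trying the primary pool first."""
    equivalents = equivalents or {}
    word = equivalents.get(word, word)        # map an inflected form to its root, if listed
    if word in primary:
        return primary[word], "primary"
    if word in fallback:
        return fallback[word], "fallback"
    return math.nan, "missing"


glasgow = {"pain": 5.515, "severe": 5.206}    # Glasgow scores as printed in this notebook
wkb = {"abandon": 3.73, "abdomen": 3.68}      # WKB scores as printed in this notebook
equivalents = {"abandoned": "abandon"}        # hypothetical mapping

for w in ["pain", "abandoned", "avpu"]:
    print(w, lookup_arousal(w, glasgow, wkb, equivalents))
```

Returning the source alongside the score makes it easy to check later how many sentence scores depend on the fallback pool.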

# (2) WKB Emotional Norms

In [51]:
# (0) Load in Word Pool
wkb_arousal_df = pd.read_excel("data/WKB_Emotion_Norms.xlsx")
wkb_arousal_df = wkb_arousal_df[["Word", "A.Mean.Sum"]]
wkb_arousal_df = wkb_arousal_df.rename(columns={"Word": "word", "A.Mean.Sum":"arousal"}).drop(0)

wkb_arousal_df.head()


Unnamed: 0,word,arousal
1,abalone,2.65
2,abandon,3.73
3,abandonment,4.95
4,abbey,2.2
5,abdomen,3.68


In [52]:
# (1) Find the number of words covered by this word pool, along with the words that are still missing
scored_words = wkb_arousal_df.word.to_list()
arousal_lst = wkb_arousal_df.arousal.to_list()
covered_by_pool_lst = []

still_missing = []

for missing_word in not_in_pool:                                # For each word that's missing
    lookup_word = equivalence_dict.get(missing_word, missing_word)   # Use the equivalent root word if one is listed

    if lookup_word in scored_words:
        index = scored_words.index(lookup_word)
        covered_by_pool_lst.append(lookup_word)
        word_to_arousal_dictionary[lookup_word] = float(arousal_lst[index])
    else:
        still_missing.append(missing_word)                      # Neither the word nor its equivalent is scored

not_in_pool = still_missing                                     # Rebuild the list rather than removing items while iterating over it

print(f"Number of words covered: {len(covered_by_pool_lst)}")
print(f"Covered Words: {sorted(covered_by_pool_lst)}")
print(f"Number of missing words: {len(not_in_pool)}")
print(f"Missing words: {sorted(not_in_pool)}")

Number of words covered: 154
Covered Words: ['abscess', 'abscess', 'accessible', 'additional', 'airway', 'anesthetic', 'antibiotic', 'area', 'area', 'ask', 'ask', 'assess', 'assessment', 'assessment', 'back', 'back', 'be', 'bleed', 'breath', 'calculate', 'calculate', 'can', 'cause', 'cause', 'cause', 'chemical', 'chemical', 'coma', 'compress', 'consciousness', 'consider', 'contact', 'contamination', 'contusion', 'contusion', 'day', 'degree', 'different', 'different', 'do', 'drainage', 'drape', 'dry', 'eighteen', 'electrical', 'elevate', 'environment', 'excess', 'exit', 'explosion', 'expose', 'extend', 'extensive', 'eyeball', 'eyebrow', 'eyelash', 'feeling', 'find', 'find', 'find', 'five', 'flame', 'flush', 'fracture', 'gauze', 'get', 'get', 'healing', 'hoarse', 'hot', 'immediate', 'incision', 'include', 'include', 'increase', 'inflammation', 'inhalation', 'inject', 'insert', 'internal', 'involvement', 'jagged', 'level', 'look', 'look', 'loss', 'may', 'medical', 'millimeter', 'minimize'

Let's check our word pool so far!

In [53]:
print(word_to_arousal_dictionary)

{'steam': 4.419, 'object (thing)': 3.118, 'note': 4.065, 'promote': 5.516, 'symptom': 4.226, 'wrap': 4.059, 'treat': 6.879, 'forceful': 5.912, 'other': 2.742, 'wound (cut)': 5.206, 'opening': 5.0, 'roll (turn over)': 4.364, 'injury': 5.091, 'danger': 7.147, 'warm': 5.857, 'kind (type)': 3.758, 'verbal': 4.706, 'process': 3.471, 'process (modify)': 3.571, 'process (procedure)': 3.088, 'process (understand)': 3.97, 'electricity': 5.849, 'give': 4.625, 'clear': 4.667, 'scalp': 3.636, 'measure': 4.125, 'second (position)': 4.152, 'killer': 6.303, 'near': 4.546, 'eye': 3.886, 'record (information)': 3.6, 'patient (medical)': 4.324, 'make': 4.765, 'cover': 4.226, 'type (category)': 3.343, 'apply': 3.971, 'infection': 4.471, 'carry': 3.563, 'motor': 3.941, 'hard (not soft)': 5.121, 'small': 3.424, 'slight (little)': 3.057, 'pain': 5.515, 'severe': 5.206, 'check': 3.546, 'stop': 4.355, 'safe': 4.912, 'test': 4.242, 'high': 5.879, 'black': 3.657, 'sure': 4.914, 'victim': 5.486, 'entry': 4.594, 

# (3) Score Sentences

In [54]:
%%capture
def arousal(item_string: str, word_pool: dict, equivalence_dict: dict = {}):
    """Returns the average arousal of the scored words in a text

    Args:
        item_string (str): 
            A sentence to compute arousal of
        
        word_pool (dict[str, float]):
            Dictionary of (word, arousal) pairs

        equivalence_dict (dict[str, str]):
            Dictionary mapping words to equivalent words in the word pool

    Returns:
        avg_arousal (float): 
            Mean arousal of the words in the sentence that are in the word pool
            (np.nan if none of them are)

        missing_words (list): 
            List of words not in word pool
    """

    # (0) Clean up string -- change '%' -> 'percent', remove punctuation, lowercase letters
    item_string = item_string.replace("%"," percent")
    item_string = item_string.translate(str.maketrans('', '', string.punctuation))
    item_string = item_string.lower()

    # (1) Compute arousal of each word in string
    arousal_lst = []
    missing_words = []

    for word in item_string.split():
        if word in equivalence_dict:
            word = equivalence_dict[word]
            
        if word in word_pool: 
            arousal_lst.append(word_pool[word])
        else:
            missing_words.append(word)

    if arousal_lst != []:
        avg_arousal = sum(arousal_lst)/len(arousal_lst)      # watch out for division by zero!
    else:
        avg_arousal = np.nan

    return avg_arousal, missing_words


In [55]:
%%capture
# (1) Load in word pool and dictionary equivalent words
word_pool = word_to_arousal_dictionary

with open("data/Arousal_equivalent_words.json","r") as dic:           
    equivalence_dict = json.load(dic)
    
# (2) Assign an arousal to each sentence and save results
missing_words = []
with pd.ExcelWriter("data/ListA_TextProperties.xlsx", mode="a", if_sheet_exists="replace") as writer:
    for pas_id in passages:
        pas_df = df.query(f"passage_id == '{pas_id}'").copy()     # Copy so the new column is not set on a slice of df
        results = [arousal(sentence, word_pool, equivalence_dict) for sentence in pas_df.sentence]   # Score each sentence once
        missing_words.append([missing for _, missing in results])
        pas_df["arousal_Glasgow"] = [avg for avg, _ in results]
        pas_df = pas_df[["sentence", "rec_prob", "arousal_Glasgow"] + pas_df.columns[2:-1].to_list()]    
        pas_df.round(2).to_excel(writer, sheet_name=pas_id, index=False)  

# (4) Check Measures

Now let's check whether these arousal ratings align with intuition, and that the fact that words have multiple meanings doesn't mess up the scoring too much.
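One rough way to gauge how much multiple meanings could matter is to group sense-tagged entries by their bare word and look at the spread of scores across senses. The sketch below does this for a few entries copied from the dictionary printed above; the grouping rule and the name `sense_spread` are assumptions made for illustration.

```python
from collections import defaultdict

# Scores copied from the dictionary printed earlier in this notebook
scores = {
    "process": 3.471,
    "process (modify)": 3.571,
    "process (procedure)": 3.088,
    "process (understand)": 3.97,
    "pain": 5.515,
}

# Group every entry under its bare word, e.g. 'process (modify)' -> 'process'
by_base = defaultdict(list)
for entry, score in scores.items():
    by_base[entry.split(" (")[0]].append(score)

# Range of scores (max - min) for words that have more than one scored sense
sense_spread = {
    base: round(max(vals) - min(vals), 3)
    for base, vals in by_base.items()
    if len(vals) > 1
}
print(sense_spread)
```

A spread that is small relative to the 1-9 scale suggests that picking the wrong sense would shift a sentence average only slightly.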

In [56]:
# (1) Let's create a dataframe that associates each word, with a sentence, and an arousal

validation_df = pd.DataFrame(columns=["Word", "Arousal", "Sentence","Probability of Recall"])

for i, row in df.iterrows():        # For each sentence
    sentence = row.sentence
    for word in sentence.split():   # For each word in a sentence
        validation_df.loc[len(validation_df)] = [word, arousal( word, word_pool, equivalence_dict)[0] , sentence, row.rec_prob]

validation_df = validation_df.sort_values("Arousal", ascending=False).dropna(subset=["Arousal"])
validation_df.head(10)


Unnamed: 0,Word,Arousal,Sentence,Probability of Recall
869,sensation,7.677,Test the sensation and movement around the wou...,0.11
610,dangers.,7.147,"Secure a safe environment, clear of any close ...",0.38
270,lightning,7.032,Victims of lightning strikes may need extended...,0.33
224,lightning,7.032,Explosions and lightning may cause burn and ad...,0.54
59,treating,6.879,Make sure everything is ready before treating ...,0.24
8,treated,6.879,Small and large abscesses may need to be treat...,0.84
296,treat,6.879,"Once the burn process is stopped, begin to tre...",0.38
566,contamination,6.71,Completing this step also prevents further con...,0.33
216,"flame,",6.6,"Possible causes include open flame, hot liquid...",0.72
222,Explosions,6.35,Explosions and lightning may cause burn and ad...,0.54


It looks like sentences containing highly arousing words aren't particularly well remembered. Let's investigate this.

In [69]:
# (0) Load in Data
import matplotlib.pyplot as plt

passages = ["A1", "A2", "A3", "A4", "A5"]  
df = pd.DataFrame()

for pas_id in passages:
    pas_df = pd.read_excel("data/ListA_TextProperties.xlsx", sheet_name=pas_id)
    df = pd.concat([df, pas_df], ignore_index= True)

df = df.sort_values("arousal_Glasgow", ascending=False)
df.head(1)


Unnamed: 0,sentence,rec_prob,arousal_Glasgow,concreteness_Brysbaert,sentence_length,word_length,word_frequency_cmx,word_concreteness_cmx,grade_level,input,paragraph,passage_id
5,Make sure everything is ready before treating ...,0.24,5.09,2.23,9,1.67,2.71,338.2,7.59,6,2,A1


In [59]:
upper_percentile = np.percentile(df.arousal_Glasgow, q=75)
lower_percentile = np.percentile(df.arousal_Glasgow, q=25)
arousing_sentences = df.query(f"arousal_Glasgow > {upper_percentile}")
boring_sentences = df.query(f"arousal_Glasgow < {lower_percentile}")
print("Mean Recall of  Sentences:", df.rec_prob.mean())
print("Mean Recall of Exciting Sentences:", arousing_sentences.rec_prob.mean())
print("Mean Recall of Boring Sentences:", boring_sentences.rec_prob.mean())

Mean Recall of  Sentences: 0.4886
Mean Recall of Exciting Sentences: 0.4364
Mean Recall of Boring Sentences: 0.5456
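
The quartile comparison above only looks at the top and bottom quarters of the sentences. As a possible cross-check over all sentences, a Spearman rank correlation between arousal and recall could be computed. The sketch below is a plain-Python version run on made-up toy values, not on the notebook's data; `average_ranks` and `spearman_rho` are hypothetical helpers.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1   # mean of 1-based positions start..end
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5       # undefined if either input is constant


# Toy values (hypothetical), shaped like the arousal_Glasgow and rec_prob columns
toy_arousal = [3.9, 4.2, 4.5, 4.8, 5.1]
toy_recall = [0.70, 0.55, 0.60, 0.40, 0.30]
rho = spearman_rho(toy_arousal, toy_recall)
print(f"Spearman rho = {rho:.2f}")
```

A clearly negative value on the real columns would agree with the quartile means; a value near zero would suggest the difference comes mostly from the extremes.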
