# 1. Preparation

**1.0 Import Lexicons** <br>
Initially we intended to use LIWC lexicon dictionairies (download [here](https://pypi.org/project/liwc/), and install using `!pip install -U liwc`). But it would require considerable fee. Therefore, we turned to a free equivalent called EMPATH whose guideline could be accessed [here](https://github.com/Ejhfast/empath-client). If it's still not working sufficiently, we will try the [SEANCE](https://www.linguisticanalysistools.org/seance.html). <br>

**1.1 Explore EMPATHY, finding what existing lexicons from EMPATHY could be adopted directly.** <br>
Next we explore the EMPATHY. In [Yarkoni (2011)](https://www.sciencedirect.com/science/article/pii/S0092656610000541), the referential article, its Table 1 displays a correlation between LIWC lexicons and the five dimensions of Big-Five personalities. We use this table as a benchmark to filter the EMPATHY, find the available labels, and then apply them to our dataset. <br>

**1.2 Use Spacy to add lexicons that we need but missing from EMPATHY.** <br>
For those missing, some of them such as 1st, 2nd, 3rd person pronouns could be added by parsing with Spacy, but for some of them there is a lack of instrument. We will take that as a limitation of this study. This new lexicon as a substitute of LIWC, we will call it EMPATHYe (Empathy extended).<br> 

**The results shows that:** <br>
**EMPATHY has:** <br>
affect, positive_emotions, negative_emotions, anger, sadness, hearing, communication, friends, family, swearing_terms <br>
**Need Spacy for:** <br>
pronouns(PRON), articles(DET), prepositions(PREP), numbers(NUM) <br>
1st person sg/pl, 2nd person, 3rd person pronouns <br>
Past/present/future tense vb. <br>
And we **neglect the rest.** <br>


In [67]:
### 1.0 Import Lexicon pkg ###
# !pip install empath spacy pandas numpy scipy statsmodels
import pandas as pd
import spacy
from empath import Empath
lexicon = Empath()

### 1.1 EXPLORING EMPATHY ###
### WHAT EXISTING CLASSES IN EMPATHY COULD BE ADOPTED DIRECTLY ###
# Print all category (class) names
print(list(lexicon.cats.keys()))
print()

# Define the list of class (category) names we're looking for
categories_to_check = ["total pronouns", "pron", "first person sing.", "first person", "second person", "third person",
 "negation", "assent", "articles", "prep", "prepositions", "number",
 "affect", "positive", "optimism", "negative", "anxiety", "anger", "sadness", 
 "cognitive", "causation", "insight", "discrepancy", "inhibition", "tentative", "certainty", 
 "sensory", "seeing", "hearing", "feeling", "social", "communication", "references",
 "friend", "family", "human", "time", "tense", "space", "up", "down", 
 "inclusive", "exclusive", "motion", "occupation", "school", "job", "work", "achieve", 
 "leisure", "home", "sport", "tv", "movie", "music", "sound", "money", "finance",
 "metaphysics", "religion", "death", "physical", "body", "sexuality", "sex", "eat", "drink", "sleep", "groom", "swear"]

# Convert categories_to_check to lowercase for case-insensitive comparison
categories_to_check_lower = [cat.lower() for cat in categories_to_check]

# Find matching categories (substring match, case insensitive)
matched_categories = [cat for cat in lexicon.cats if any(search_term in cat.lower() for search_term in categories_to_check_lower)]
not_matched_categories = [cat for cat in categories_to_check if not any(search_term in cat.lower() for search_term in lexicon.cats.keys())]

# Output matched and not matched categories
print("Matched categories:", matched_categories)
print()
print("Not matched categories:", not_matched_categories)


['help', 'office', 'dance', 'money', 'wedding', 'domestic_work', 'sleep', 'medical_emergency', 'cold', 'hate', 'cheerfulness', 'aggression', 'occupation', 'envy', 'anticipation', 'family', 'vacation', 'crime', 'attractive', 'masculine', 'prison', 'health', 'pride', 'dispute', 'nervousness', 'government', 'weakness', 'horror', 'swearing_terms', 'leisure', 'suffering', 'royalty', 'wealthy', 'tourism', 'furniture', 'school', 'magic', 'beach', 'journalism', 'morning', 'banking', 'social_media', 'exercise', 'night', 'kill', 'blue_collar_job', 'art', 'ridicule', 'play', 'computer', 'college', 'optimism', 'stealing', 'real_estate', 'home', 'divine', 'sexual', 'fear', 'irritability', 'superhero', 'business', 'driving', 'pet', 'childish', 'cooking', 'exasperation', 'religion', 'hipster', 'internet', 'surprise', 'reading', 'worship', 'leader', 'independence', 'movement', 'body', 'noise', 'eating', 'medieval', 'zest', 'confusion', 'water', 'sports', 'death', 'healing', 'legend', 'heroic', 'celebr

In [60]:
# do an example analysis for light testing
result = lexicon.analyze("he kiss the other person", normalize=True)
filtered_result = {category: value for category, value in result.items() if value > 0}
print(filtered_result)

{'sexual': 0.2, 'love': 0.2}


# 2. Data Processing

**Recall the hypotheses for word level**

We want to see if the correlations between LIWC categories and Big Five personlity traits align with the trend in Yarkoni(2011). That is: <br>
|   EMPATHYe   |Label Name| Neuroticism  | Extroversion |   Openness   |Agreeableness |Conscientiousness|
|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| pronouns     |*pronouns|      +       |      +       |       --     |       ++     |       -      |
| 1st person sing.|*first_person_sg|   ++      |      +       |       -      |       +      |       0      |
| 1st person plural|*first_person_pl|   -      |     ++       |       --     |       ++     |       +      |
| 1st person   |*first_person||++|+|--|++|+|
| 2nd person   |*second_person||--|++|--|+|0|
| 3rd person   |*third_person|+|+|-|+|-|
| negations    |*negations|++|-|--|-|--|
| articles     |*articles|--|-|++|+|++|
| prepositions |*prepositions|-|-|++|+|+|
| numbers      |*numbers|-|--|--|++|+|
| affect       |affection|+|+|--|+|-|
| positive emotions|positive_emotion|-|++|--|++|+|
| optimism    |optimism|--|+|0|++|++|
| negative emotions|negative_emotion|++|+|0|--|--|
| anger        |anger|++|+|+|--|--|
| sadness      |sadness|++|+|-|+|--|
| hearing      |hearing|+|++|--|+|--|
| communication|communication|0|++|-|+|-|
| friends      |friends|--|++|-|++|+|
| family       |family|-|+|--|++|+|
| past tense vb.|*past_tense|+|-|--|+|0|
| present tense vb.|*present_tense|+|-|--|0|-|
| future tense vb.|*future_tense|-|-|-|-|-|
| occupation   |occupation|+|--|+|-|+|
| school       |school|+|-|+|-|-|
| job/work     |work|+|--|+|-|+|
| achievement  |achievement|+|--|-|+|--|
| leisure      |leisure|-|++|--|++|+|
| home         |home|0|+|--|++|+|
| sports       |sports|-|+|--|+|0|
| music        |music|-|++|+|+|--|
| money        |money|+|-|-|--|-|
| religion     |religion|-|++|+|+|-|
| death        |death|+|+|++|--|--|
| body states  |body|+|++|-|++|-|
| sexuality    |sexuality|+|++|0|++|-|
| eating       |eating|-|+|--|+|-|
| sleep        |sleep|++|-|--|++|-|
| swearing words|swearing_terms|++|+|+|--|--|
(_Label Name_ refers to its new name in our _EMPATHYe_)

**Re-classification Needed**:<br>
The following labels from _EMPATHY_ will be renamed/reclassified in our _EMPATHYe_ <br>
1)Work: domestick_work, blue_collar_job, white_collar_job, work <br>
2)Music: music, sound, musical <br>
3)Sexuality: sexual <br>

**Dataset**<br>
Our dataset use the collection of the complete 8 series of *Harry Potter* film series. Originally we use only the first film but turns out it's not sufficient for a significant result, therefore we applied them all. <br>

**Build our own lexicon: *EMPATHYe*** <br>
After the pre-processing stage, each character's lines form a separate dataset. Currently each dataset has the following labels: tokens, frequencies, postags. Now we need to add a new label called "*empathye*", which contains the needed lexicon information. Some could be proceeded directly by EMPATHY, some as stated before, need further processing using Spacy. <br>

We follow the steps below:

**2.1 Handle Empathy** <br>

**2.2 Apply Spacy** <br>

In [68]:
### 2.1 START WITH EMPATHY ###

# labels to keep
labels_to_keep = [
    'money', 'domestic_work', 'sleep', 'occupation', 'family', 'swearing_terms', 'leisure', 'school',
    'blue_collar_job', 'optimism', 'home', 'sexual', 'superhero', 'religion', 'body', 'eating', 'sports',
    'death', 'communication', 'hearing', 'music', 'sound', 'work', 'sadness', 'emotional', 'affection',
    'anger', 'white_collar_job', 'negative_emotion', 'friends', 'achievement', 'positive_emotion', 'musical'
]

# merging rules:
merge_rules = {
    'work': ['domestic_work', 'blue_collar_job', 'white_collar_job', 'work'],
    'music': ['music', 'sound', 'musical'],
    'sexuality': ['sexual']
}

temp_lexicon = {} # temporarily store the lexicon

# filter and merge based on the rules above
for label in labels_to_keep:
    # 检查标签是否需要合并
    merged = False
    for new_label, old_labels in merge_rules.items():
        if label in old_labels:
            # 如果是要合并的标签，将内容合并至新标签
            if new_label not in temp_lexicon:
                temp_lexicon[new_label] = set()
            temp_lexicon[new_label].update(lexicon.cats[label])
            merged = True
            break
    # 如果标签不在合并规则内，直接添加到临时存储中
    if not merged:
        temp_lexicon[label] = lexicon.cats[label]

# Clear original lexicon contents
lexicon.cats.clear()

# Reassign the updated content to lexicon
lexicon.cats.update(temp_lexicon)

In [69]:
### 2.2 USE SPACY TO PROCEED MORE

# [1] PERSONAL PRONOUNS
nlp = spacy.load("en_core_web_lg") # load the English model
# 定义代词标签
pronouns = {
    "first_person_sg": ["I", "me", "my", "mine"],
    "first_person_pl": ["we", "us", "our", "ours"],
    "first_person": ["I", "me", "my", "mine", "we", "us", "our", "ours"],
    "second_person": ["you", "your", "yours"],
    "third_person": ["he", "him", "his", "she", "her", "hers", "they", "them", "their", "theirs"],
}

# 将 pronouns 添加到 lexicon
for label, words in pronouns.items():
    lexicon.cats[label] = words

# 检查添加后的 lexicon
print(list(lexicon.cats.keys()))

['money', 'work', 'sleep', 'occupation', 'family', 'swearing_terms', 'leisure', 'school', 'optimism', 'home', 'sexuality', 'superhero', 'religion', 'body', 'eating', 'sports', 'death', 'communication', 'hearing', 'music', 'sadness', 'emotional', 'affection', 'anger', 'negative_emotion', 'friends', 'achievement', 'positive_emotion', 'first_person_sg', 'first_person_pl', 'first_person', 'second_person', 'third_person']


In [70]:
# define the function for finding past, present, future tense words;
# for finding numbers, prepositions, articles, negations

# tense verbs
def label_tenses(file_path):
    # 读取 CSV 文件，不指定列名
    df = pd.read_csv(file_path, header=None)
    
    # 存储结果
    labeled_verbs = {
        'past_tense': [],
        'present_tense': [],
        'future_tense': []
    }

    # 遍历每一行文本
    for index in range(len(df)):
        # 使用 spaCy 处理每一行文本
        doc = nlp(df.iloc[index, 0])  # 访问第一列（每一行的文本）

        # 查找动词
        for token in doc:
            # if token.pos_ == "VERB":  # 确保是动词
                # 根据时态分类
                if token.tag_ in ['VBD', 'VBN']:  # 过去时动词
                    labeled_verbs['past_tense'].append(token.text)
                elif token.tag_ in ['VBZ', 'VBP', 'VBG']:  # 现在时动词
                    labeled_verbs['present_tense'].append(token.text)
                elif token.tag_ == 'MD':  # 将来时动词（情态动词）
                    # labeled_verbs['future_tense'].append(token.nbor().text)
                    # 需要检查下一个词是否为动词以确定将来时
                     if token.nbor().pos_ == "VERB":
                        labeled_verbs['future_tense'].append(token.nbor().text)

    return labeled_verbs

# numbers
def label_numbers(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_numbers = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # 访问第一列（每一行的文本）

        for token in doc:
            if token.like_num:  # 判断是否是数字
                labeled_numbers.append(token.text)

    return labeled_numbers


# prepositions

def label_prepositions(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_prepositions = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # 访问第一列（每一行的文本）

        for token in doc:
            if token.pos_ == "ADP":  # 介词的 POS 标签是 ADP
                labeled_prepositions.append(token.text)

    return labeled_prepositions


# articles

def label_articles(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_articles = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # 访问第一列（每一行的文本）

        for token in doc:
            if token.pos_ == "DET":  # 冠词的 POS 标签是 DET
                labeled_articles.append(token.text)

    return labeled_articles


# negations

def label_negations(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_negations = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # 访问第一列（每一行的文本）

        for token in doc:
            if token.dep_ == "neg":  # 否定词的依存关系标签是 neg
                labeled_negations.append(token.text)

    return labeled_negations


def label_pronouns(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_pronouns = {"first_person": [], "second_person": [], "third_person": []}

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # 访问第一列（每一行的文本）

        for token in doc:
            if token.text in pronouns["first_person"]:
                labeled_pronouns["first_person"].append(token.text)
            elif token.text in pronouns["second_person"]:
                labeled_pronouns["second_person"].append(token.text)
            elif token.text in pronouns["third_person"]:
                labeled_pronouns["third_person"].append(token.text)

    return labeled_pronouns

def label_empathy(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_empathy = {label: [] for label in lexicon.cats.keys()}

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # 访问每一行的文本
        doc_text = doc.text.lower()  # 将文本转换为小写以匹配词汇表中的单词

        for label, words in lexicon.cats.items():
            for word in words:
                if word in doc_text:
                    labeled_empathy[label].append(word)

    return labeled_empathy

# 3. Data Processing

Based on the lexicons, we start the data analysis of correlation between characters' personalties and their lines. <br>
At this stage, we need to count the frequency and percentage of all labels per characters. Later on we will make the comparison based on their personality. <br>

*percentage = freq / number_of_tokens

Here is the **personality scores** from the reference study:

|   Character   | Neuroticism  | Extroversion |   Openness   |Agreeableness |Conscientiousness|
|--------------|--------------|--------------|--------------|--------------|--------------|
|Ron Wesley|3.22|4.9|4.02|3.76|3.01|
|Hermine Granger|4.22|4.65|5.12|4.07|6.22|
|Albus Dumbledore|5.52|4.36|5.52|5.07|5.73|
|Lord Voldmort|3|4.36|4.27|1.95|4.88|
|Draco Malfoy|3.15|4.23|3.86|2.15|4.16|
|Harry Potter|3.85|3.92|5.13|4.11|4.36|
|Severus Snape|4.43|2.65|4.08|2.6|5.49|

In [71]:
from collections import Counter

def analysis_tense(file_path):
    # 调用不同的函数并打印结果
    tenses_result = label_tenses(file_path)

    # 计算总单词数
    total_words = 0

    # 计算每个句子的总单词数
    df = pd.read_csv(file_path, header=None)
    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # 访问第一列（每一行的文本）
        total_words += len(doc)  # 统计当前句子的单词数

    # 获取各类动词的数量
    past_count = len(tenses_result['past_tense'])
    present_count = len(tenses_result['present_tense'])
    future_count = len(tenses_result['future_tense'])

    # 计算百分比
    past_percentage = (past_count / total_words) * 100 if total_words > 0 else 0
    present_percentage = (present_count / total_words) * 100 if total_words > 0 else 0
    future_percentage = (future_count / total_words) * 100 if total_words > 0 else 0

    # 打印结果
    print(f"Past Tense Verbs: {past_count} ({past_percentage:.2f}%)")
    print(f"Present Tense Verbs: {present_count} ({present_percentage:.2f}%)")
    print(f"Future Tense Verbs: {future_count} ({future_percentage:.2f}%)")

def analysis_numbers(file_path):
    labeled_numbers = label_numbers(file_path)
    
    # 统计数字token的总数量
    total_numbers = len(labeled_numbers)
    
    # 计算总词数（包含数字和非数字的总词数）
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # 计算数字token的占比
    number_percentage = (total_numbers / total_words) * 100 if total_words > 0 else 0

    # 打印结果
    print(f"Total number words: {total_numbers} ({number_percentage:.2f}%)")
    # print(f"Total words containing numbers: {total_numbers}")
    # print(f"Percentage of numbers in all tokens: {percentage:.2f}%")


# Analysis for prepositions
def analysis_prepositions(file_path):
    labeled_prepositions = label_prepositions(file_path)
    
    # 统计介词的总数量
    total_prepositions = len(labeled_prepositions)
    
    # 计算总词数
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # 计算介词的占比
    preposition_percentage = (total_prepositions / total_words) * 100 if total_words > 0 else 0

    # 打印结果
    print(f"Total prepositions: {total_prepositions} ({preposition_percentage:.2f}%)")


# Analysis for articles
def analysis_articles(file_path):
    labeled_articles = label_articles(file_path)
    
    # 统计冠词的总数量
    total_articles = len(labeled_articles)
    
    # 计算总词数
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # 计算冠词的占比
    article_percentage = (total_articles / total_words) * 100 if total_words > 0 else 0

    # 打印结果
    print(f"Total articles: {total_articles} ({article_percentage:.2f}%)")


# Analysis for negations
def analysis_negations(file_path):
    labeled_negations = label_negations(file_path)
    
    # 统计否定词的总数量
    total_negations = len(labeled_negations)
    
    # 计算总词数
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # 计算否定词的占比
    negation_percentage = (total_negations / total_words) * 100 if total_words > 0 else 0

    # 打印结果
    print(f"Total negations: {total_negations} ({negation_percentage:.2f}%)")

def analysis_pronouns(file_path):
    labeled_pronouns = label_pronouns(file_path)

    # 统计每种代词的数量
    first_person_count = len(labeled_pronouns["first_person"])
    second_person_count = len(labeled_pronouns["second_person"])
    third_person_count = len(labeled_pronouns["third_person"])

    # 计算总词数
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # 计算各代词的百分比
    first_person_percentage = (first_person_count / total_words) * 100 if total_words > 0 else 0
    second_person_percentage = (second_person_count / total_words) * 100 if total_words > 0 else 0
    third_person_percentage = (third_person_count / total_words) * 100 if total_words > 0 else 0

    # 打印结果
    print(f"First Person Pronouns: {first_person_count} ({first_person_percentage:.2f}%)")
    print(f"Second Person Pronouns: {second_person_count} ({second_person_percentage:.2f}%)")
    print(f"Third Person Pronouns: {third_person_count} ({third_person_percentage:.2f}%)")

def analysis_empathy(file_path):
    labeled_empathy = label_empathy(file_path)

    # 统计每个类别的词频
    category_counts = {label: len(words) for label, words in labeled_empathy.items()}
    total_empathy_words = sum(category_counts.values())

    # 计算总词数
    df = pd.read_csv(file_path, header=None)
    total_words = sum(len(nlp(row[0])) for row in df.values)
    
    # 打印每个类别的频率和百分比
    print(f"Total empathy-related words: {total_empathy_words} ({(total_empathy_words / total_words) * 100:.2f}% of total words)")
    for label, count in category_counts.items():
        percentage = (count / total_words) * 100 if total_words > 0 else 0
        print(f"{label}, {count} ({percentage:.2f}%)")


In [72]:
# implement analysis and output results:

def analysis_all(file_path):
    
    print("Tense Analysis:")
    analysis_tense(file_path)
    
    print("\nNumber Analysis:")
    analysis_numbers(file_path)
    
    print("\nPreposition Analysis:")
    analysis_prepositions(file_path)
    
    print("\nArticle Analysis:")
    analysis_articles(file_path)
    
    print("\nNegation Analysis:")
    analysis_negations(file_path)

    print("\nPronoun Analysis:")
    analysis_pronouns(file_path)
    
    print("\nEmpathy Analysis:")
    analysis_empathy(file_path)


In [74]:
print("Counts per character:")

print("\nAlbus Dumbledore")
analysis_all("Tokens/Dumbledore.csv")

print("\nHarry Potter")
analysis_all("Tokens/Harry.csv")

print("\nHermione Granger")
analysis_all("Tokens/Hermione.csv")

print("\nDraco Malfoy")
analysis_all("Tokens/Malfoy.csv")

print("\nRon Wesley")
analysis_all("Tokens/Ron.csv")

print("\nSeverus Snape")
analysis_all("Tokens/Snape.csv")

print("\nLord Voldemort")
analysis_all("Tokens/Voldemort.csv")

Counts per character:

Albus Dumbledore
Tense Analysis:
Past Tense Verbs: 386 (4.04%)
Present Tense Verbs: 708 (7.42%)
Future Tense Verbs: 109 (1.14%)

Number Analysis:
Total number words: 95 (1.00%)

Preposition Analysis:
Total prepositions: 619 (6.49%)

Article Analysis:
Total articles: 575 (6.03%)

Negation Analysis:
Total negations: 123 (1.29%)

Pronoun Analysis:
First Person Pronouns: 406 (4.25%)
Second Person Pronouns: 297 (3.11%)
Third Person Pronouns: 147 (1.54%)

Empathy Analysis:
Total empathy-related words: 5391 (56.49% of total words)
money, 54 (0.57%)
work, 1047 (10.97%)
sleep, 48 (0.50%)
occupation, 8 (0.08%)
family, 92 (0.96%)
swearing_terms, 31 (0.32%)
leisure, 28 (0.29%)
school, 76 (0.80%)
optimism, 46 (0.48%)
home, 62 (0.65%)
sexuality, 10 (0.10%)
superhero, 8 (0.08%)
religion, 48 (0.50%)
body, 82 (0.86%)
eating, 121 (1.27%)
sports, 111 (1.16%)
death, 100 (1.05%)
communication, 140 (1.47%)
hearing, 118 (1.24%)
music, 149 (1.56%)
sadness, 51 (0.53%)
emotional, 47 (0.49

IndexError: [E042] Error accessing `doc[0].nbor(1)`, for doc of length 1.

# Results on Language Use per Character

| Label                          | Albus Dumbledore | Harry Potter | Hermione Granger | Draco Malfoy | Ron Wesley | Severus Snape | Lord Voldemort |
|--------------------------------|------------------|--------------|------------------|--------------|------------|---------------|----------------|
| **Tense Analysis**             |                  |              |                  |              |            |               |                |
| Past Tense Verbs               | 390 (4.09%)     | 907 (4.65%)  | 469 (4.50%)      | 75 (4.32%)   | 389 (4.02%)| 109 (3.63%)   | 80 (3.69%)     |
| Present Tense Verbs            | 702 (7.36%)     | 1711 (8.78%) | 957 (9.18%)      | 156 (8.98%)  | 892 (9.21%)| 225 (7.49%)   | 149 (6.88%)    |
| Future Tense Verbs             | 110 (1.15%)     | 159 (0.82%)  | 83 (0.80%)       | 16 (0.92%)   | 50 (0.52%) | 34 (1.13%)    | 31 (1.43%)     |
| **Number Analysis**            |                  |              |                  |              |            |               |                |
| Total number words             | 95 (1.00%)      | 133 (0.68%)  | 61 (0.59%)       | 12 (0.69%)   | 76 (0.78%) | 22 (0.73%)    | 14 (0.65%)     |
| **Preposition Analysis**       |                  |              |                  |              |            |               |                |
| Total prepositions             | 627 (6.57%)     | 910 (4.67%)  | 527 (5.06%)      | 90 (5.18%)   | 471 (4.86%)| 217 (7.22%)   | 111 (5.12%)    |
| **Article Analysis**           |                  |              |                  |              |            |               |                |
| Total articles                 | 575 (6.03%)     | 772 (3.96%)  | 447 (4.29%)      | 82 (4.72%)   | 399 (4.12%)| 175 (5.82%)   | 84 (3.88%)     |
| **Negation Analysis**          |                  |              |                  |              |            |               |                |
| Total negations                | 123 (1.29%)     | 375 (1.92%)  | 208 (2.00%)      | 39 (2.24%)   | 186 (1.92%)| 38 (1.26%)    | 37 (1.71%)     |
| **Pronoun Analysis**           |                  |              |                  |              |            |               |                |
| First Person Pronouns          | 406 (4.25%)     | 1030 (5.28%) | 386 (3.70%)      | 90 (5.18%)   | 425 (4.39%)| 102 (3.39%)   | 141 (6.51%)    |
| Second Person Pronouns         | 297 (3.11%)     | 486 (2.49%)  | 268 (2.57%)      | 65 (3.74%)   | 239 (2.47%)| 119 (3.96%)   | 83 (3.83%)     |
| Third Person Pronouns          | 147 (1.54%)     | 358 (1.84%)  | 169 (1.62%)      | 21 (1.21%)   | 191 (1.97%)| 41 (1.36%)    | 37 (1.71%)     |
| **Empathy Analysis**           |                  |              |                  |              |            |               |                |
| Total empathy-related words    | 5391 (56.49%)   | 10603 (54.39%)| 5464 (52.42%)    | 1079 (62.08%)| 5179 (53.47%)| 1662 (55.31%) | 1371 (63.30%)  |
| money                          | 54 (0.57%)      | 72 (0.37%)   | 47 (0.45%)       | 3 (0.17%)    | 28 (0.29%) | 12 (0.40%)    | 11 (0.51%)     |
| work                           | 1047 (10.97%)   | 3105 (15.93%)| 1469 (14.09%)    | 241 (13.87%) | 1498 (15.47%)| 336 (11.18%) | 272 (12.56%)   |
| sleep                          | 48 (0.50%)      | 80 (0.41%)   | 44 (0.42%)       | 4 (0.23%)    | 47 (0.49%) | 16 (0.53%)    | 6 (0.28%)      |
| occupation                     | 8 (0.08%)       | 7 (0.04%)    | 11 (0.11%)       | 0 (0.00%)    | 7 (0.07%)  | 4 (0.13%)     | 2 (0.09%)      |
| family                         | 92 (0.96%)      | 123 (0.63%)  | 67 (0.64%)       | 25 (1.44%)   | 72 (0.74%) | 22 (0.73%)    | 24 (1.11%)     |
| swearing_terms                 | 31 (0.32%)      | 58 (0.30%)   | 35 (0.34%)       | 9 (0.52%)    | 62 (0.64%) | 11 (0.37%)    | 3 (0.14%)      |
| leisure                        | 28 (0.29%)      | 24 (0.12%)   | 23 (0.22%)       | 5 (0.29%)    | 17 (0.18%) | 8 (0.27%)     | 3 (0.14%)      |
| school                         | 76 (0.80%)      | 44 (0.23%)   | 49 (0.47%)       | 11 (0.63%)   | 17 (0.18%) | 21 (0.70%)    | 4 (0.18%)      |
| optimism                       | 46 (0.48%)      | 49 (0.25%)   | 49 (0.47%)       | 9 (0.52%)    | 31 (0.32%) | 16 (0.53%)    | 12 (0.55%)     |
| home                           | 62 (0.65%)      | 147 (0.75%)  | 58 (0.56%)       | 22 (1.27%)   | 76 (0.78%) | 23 (0.77%)    | 26 (1.20%)     |
| sexuality                      | 10 (0.10%)      | 9 (0.05%)    | 14 (0.13%)       | 0 (0.00%)    | 6 (0.06%)  | 2 (0.07%)     | 4 (0.18%)      |
| superhero                      | 8 (0.08%)       | 10 (0.05%)   | 11 (0.11%)       | 1 (0.06%)    | 4 (0.04%)  | 2 (0.07%)     | 1 (0.05%)      |
| religion                       | 48 (0.50%)      | 79 (0.41%)   | 21 (0.20%)       | 6 (0.35%)    | 10 (0.10%) | 12 (0.40%)    | 5 (0.23%)      |
| body                           | 82 (0.86%)      | 120 (0.62%)  | 75 (0.72%)       | 24 (1.38%)   | 66 (0.68%) | 31 (1.03%)    | 25 (1.15%)     |
| eating                         | 121 (1.27%)     | 117 (0.60%)  | 73 (0.70%)       | 9 (0.52%)    | 74 (0.76%) | 20 (0.67%)    | 25 (1.15%)     |
| sports                         | 111 (1.16%)     | 125 (0.64%)  | 49 (0.47%)       | 11 (0.63%)   | 45 (0.46%) | 16 (0.53%)    | 10 (0.46%)     |
| death                          | 100 (1.05%)     | 119 (0.61%)  | 98 (0.94%)       | 6 (0.35%)    | 87 (0.90%) | 39 (1.30%)    | 14 (0.65%)     |


# 4. Data Analysis

Now we run the statistical tests to test our hypotheses. <br>

**Normal Distribution** <br>
Before deciding which correlation test (Spearman or Pearson) to apply later, we first need to explore if our data is normally distributed. <br>

**Test on Differences Between Characters**


**Test on Correlation**
As the result shows that only "total prepositions" and "sports" are not normally distributed, we will use _Spearman_ test for these two, and _Pearson_ for the rest to test the correlation. <br>

In [None]:
# import packages for statistical analysis
import pandas as pd
from scipy import stats
from scipy.stats import shapiro
from scipy.stats import pearsonr, spearmanr

In [None]:
### SIGNIFICANT TEST ON CORRELATION ###


# 数据字典
data = {
    'Character': ['Albus Dumbledore', 'Harry Potter', 'Hermione Granger', 'Draco Malfoy', 
                  'Ron Wesley', 'Severus Snape', 'Lord Voldemort'],
    'Past Tense Verbs': [4.09, 4.65, 4.50, 4.32, 4.02, 3.63, 3.69],
    'Present Tense Verbs': [7.36, 8.78, 9.18, 8.98, 9.21, 7.49, 6.88],
    'Future Tense Verbs': [1.15, 0.82, 0.80, 0.92, 0.52, 1.13, 1.43],
    'Total number words': [1.00, 0.68, 0.59, 0.69, 0.78, 0.73, 0.65],
    'Total prepositions': [6.57, 4.67, 5.06, 5.18, 4.86, 7.22, 5.12],
    'Total articles': [6.03, 3.96, 4.29, 4.72, 4.12, 5.82, 3.88],
    'Total negations': [1.29, 1.92, 2.00, 2.24, 1.92, 1.26, 1.71],
    'First Person Pronouns': [4.25, 5.28, 3.70, 5.18, 4.39, 3.39, 6.51],
    'Second Person Pronouns': [3.11, 2.49, 2.57, 3.74, 2.47, 3.96, 3.83],
    'Third Person Pronouns': [1.54, 1.84, 1.62, 1.21, 1.97, 1.36, 1.71],
    'Total empathy-related words': [56.49, 54.39, 52.42, 62.08, 53.47, 55.31, 63.30],
    'money': [0.57, 0.37, 0.45, 0.17, 0.29, 0.40, 0.51],
    'work': [10.97, 15.93, 14.09, 13.87, 15.47, 11.18, 12.56],
    'sleep': [0.50, 0.41, 0.42, 0.23, 0.49, 0.53, 0.28],
    'occupation': [0.08, 0.04, 0.11, 0.00, 0.07, 0.13, 0.09],
    'family': [0.96, 0.63, 0.64, 1.44, 0.74, 0.73, 1.11],
    'swearing_terms': [0.32, 0.30, 0.34, 0.52, 0.64, 0.37, 0.14],
    'leisure': [0.29, 0.12, 0.22, 0.29, 0.18, 0.27, 0.14],
    'school': [0.80, 0.23, 0.47, 0.63, 0.18, 0.70, 0.18],
    'optimism': [0.48, 0.25, 0.47, 0.52, 0.32, 0.53, 0.55],
    'home': [0.65, 0.75, 0.56, 1.27, 0.78, 0.77, 1.20],
    'sexuality': [0.10, 0.05, 0.13, 0.00, 0.06, 0.07, 0.18],
    'superhero': [0.08, 0.05, 0.11, 0.06, 0.04, 0.07, 0.05],
    'religion': [0.50, 0.41, 0.20, 0.35, 0.10, 0.40, 0.23],
    'body': [0.86, 0.62, 0.72, 1.38, 0.68, 1.03, 1.15],
    'eating': [1.27, 0.60, 0.70, 0.52, 0.76, 0.67, 1.15],
    'sports': [1.16, 0.64, 0.47, 0.63, 0.46, 0.53, 0.46],
    'death': [1.05, 0.61, 0.94, 0.35, 0.90, 1.30, 0.65],
}

# 创建数据框
df = pd.DataFrame(data)

# 定义一个函数来进行ANOVA测试
def run_anova(df, column):
    f_value, p_value = stats.f_oneway(*[df[column][df['Character'] == character] for character in df['Character']])
    return f_value, p_value

# 选择需要进行ANOVA测试的列
columns_to_test = df.columns[1:]

# 存储结果
results = {}

# 对每一列进行ANOVA测试
for column in columns_to_test:
    f_value, p_value = run_anova(df, column)
    results[column] = {'F-value': f_value, 'p-value': p_value}

# 打印结果
for column, result in results.items():
    print(f"{column}: F-value = {result['F-value']:.4f}, p-value = {result['p-value']:.4f}")


In [None]:
### DISTRIBUTION TEST ON LEXICON DATA ###

# Defining the data for each category based on the percentages provided for each character
data = {
    "Past Tense Verbs": [4.09, 4.65, 4.50, 4.32, 4.02, 3.63, 3.69],
    "Present Tense Verbs": [7.36, 8.78, 9.18, 8.98, 9.21, 7.49, 6.88],
    "Future Tense Verbs": [1.15, 0.82, 0.80, 0.92, 0.52, 1.13, 1.43],
    "Total number words": [1.00, 0.68, 0.59, 0.69, 0.78, 0.73, 0.65],
    "Total prepositions": [6.57, 4.67, 5.06, 5.18, 4.86, 7.22, 5.12],
    "Total articles": [6.03, 3.96, 4.29, 4.72, 4.12, 5.82, 3.88],
    "Total negations": [1.29, 1.92, 2.00, 2.24, 1.92, 1.26, 1.71],
    "First Person Pronouns": [4.25, 5.28, 3.70, 5.18, 4.39, 3.39, 6.51],
    "Second Person Pronouns": [3.11, 2.49, 2.57, 3.74, 2.47, 3.96, 3.83],
    "Third Person Pronouns": [1.54, 1.84, 1.62, 1.21, 1.97, 1.36, 1.71],
    "Total empathy-related words": [56.49, 54.39, 52.42, 62.08, 53.47, 55.31, 63.30],
    "money": [0.57, 0.37, 0.45, 0.17, 0.29, 0.40, 0.51],
    "work": [10.97, 15.93, 14.09, 13.87, 15.47, 11.18, 12.56],
    "sleep": [0.50, 0.41, 0.42, 0.23, 0.49, 0.53, 0.28],
    "occupation": [0.08, 0.04, 0.11, 0.00, 0.07, 0.13, 0.09],
    "family": [0.96, 0.63, 0.64, 1.44, 0.74, 0.73, 1.11],
    "swearing_terms": [0.32, 0.30, 0.34, 0.52, 0.64, 0.37, 0.14],
    "leisure": [0.29, 0.12, 0.22, 0.29, 0.18, 0.27, 0.14],
    "school": [0.80, 0.23, 0.47, 0.63, 0.18, 0.70, 0.18],
    "optimism": [0.48, 0.25, 0.47, 0.52, 0.32, 0.53, 0.55],
    "home": [0.65, 0.75, 0.56, 1.27, 0.78, 0.77, 1.20],
    "sexuality": [0.10, 0.05, 0.13, 0.00, 0.06, 0.07, 0.18],
    "superhero": [0.08, 0.05, 0.11, 0.06, 0.04, 0.07, 0.05],
    "religion": [0.50, 0.41, 0.20, 0.35, 0.10, 0.40, 0.23],
    "body": [0.86, 0.62, 0.72, 1.38, 0.68, 1.03, 1.15],
    "eating": [1.27, 0.60, 0.70, 0.52, 0.76, 0.67, 1.15],
    "sports": [1.16, 0.64, 0.47, 0.63, 0.46, 0.53, 0.46],
    "death": [1.05, 0.61, 0.94, 0.35, 0.90, 1.30, 0.65],
}

# Perform Shapiro-Wilk test for each dataset
for label, values in data.items():
    stat, p_value = shapiro(values)
    print(f"{label}: p-value = {p_value:.5f}")
    if p_value > 0.05:
        print("  The data is normally distributed (p > 0.05)\n")
    else:
        print("  The data is NOT normally distributed (p <= 0.05)\n")


In [None]:
### DISTRIBUTION TEST ON PERSONALITY DATA ###

# Big Five 各个性格特征的评分数据
neuroticism = [3.22, 4.22, 5.52, 3.0, 3.15, 3.85, 4.43]
extroversion = [4.9, 4.65, 4.36, 4.36, 4.23, 3.92, 2.65]
openness = [4.02, 5.12, 5.52, 4.27, 3.86, 5.13, 4.08]
agreeableness = [3.76, 4.07, 5.07, 1.95, 2.15, 4.11, 2.6]
conscientiousness = [3.01, 6.22, 5.73, 4.88, 4.16, 4.36, 5.49]

# 将各维度的结果存储在字典中，方便批量检验
big_five_data = {
    "Neuroticism": neuroticism,
    "Extroversion": extroversion,
    "Openness": openness,
    "Agreeableness": agreeableness,
    "Conscientiousness": conscientiousness
}

# 对每个性格特征进行 Shapiro-Wilk 检验并打印 p-value 和判断结果
for trait, scores in big_five_data.items():
    stat, p_value = shapiro(scores)
    print(f"{trait}: p-value = {p_value:.3f}")
    if p_value > 0.05:
        print(f"  The data is normally distributed (p > 0.05)\n")
    else:
        print(f"  The data is NOT normally distributed (p <= 0.05)\n")


In [None]:
### CORRELATION TEST ###

# Language feature data
data = {
    "Past Tense Verbs": [4.09, 4.65, 4.50, 4.32, 4.02, 3.63, 3.69],
    "Present Tense Verbs": [7.36, 8.78, 9.18, 8.98, 9.21, 7.49, 6.88],
    "Future Tense Verbs": [1.15, 0.82, 0.80, 0.92, 0.52, 1.13, 1.43],
    "Total number words": [1.00, 0.68, 0.59, 0.69, 0.78, 0.73, 0.65],
    "Total prepositions": [6.57, 4.67, 5.06, 5.18, 4.86, 7.22, 5.12],  # Not normal
    "Total articles": [6.03, 3.96, 4.29, 4.72, 4.12, 5.82, 3.88],
    "Total negations": [1.29, 1.92, 2.00, 2.24, 1.92, 1.26, 1.71],
    "First Person Pronouns": [4.25, 5.28, 3.70, 5.18, 4.39, 3.39, 6.51],
    "Second Person Pronouns": [3.11, 2.49, 2.57, 3.74, 2.47, 3.96, 3.83],
    "Third Person Pronouns": [1.54, 1.84, 1.62, 1.21, 1.97, 1.36, 1.71],
    "Total empathy-related words": [56.49, 54.39, 52.42, 62.08, 53.47, 55.31, 63.30],
    "money": [0.57, 0.37, 0.45, 0.17, 0.29, 0.40, 0.51],
    "work": [10.97, 15.93, 14.09, 13.87, 15.47, 11.18, 12.56],
    "sleep": [0.50, 0.41, 0.42, 0.23, 0.49, 0.53, 0.28],
    "occupation": [0.08, 0.04, 0.11, 0.00, 0.07, 0.13, 0.09],
    "family": [0.96, 0.63, 0.64, 1.44, 0.74, 0.73, 1.11],
    "swearing_terms": [0.32, 0.30, 0.34, 0.52, 0.64, 0.37, 0.14],
    "leisure": [0.29, 0.12, 0.22, 0.29, 0.18, 0.27, 0.14],
    "school": [0.80, 0.23, 0.47, 0.63, 0.18, 0.70, 0.18],
    "optimism": [0.48, 0.25, 0.47, 0.52, 0.32, 0.53, 0.55],
    "home": [0.65, 0.75, 0.56, 1.27, 0.78, 0.77, 1.20],
    "sexuality": [0.10, 0.05, 0.13, 0.00, 0.06, 0.07, 0.18],
    "superhero": [0.08, 0.05, 0.11, 0.06, 0.04, 0.07, 0.05],
    "religion": [0.50, 0.41, 0.20, 0.35, 0.10, 0.40, 0.23],
    "body": [0.86, 0.62, 0.72, 1.38, 0.68, 1.03, 1.15],
    "eating": [1.27, 0.60, 0.70, 0.52, 0.76, 0.67, 1.15],
    "sports": [1.16, 0.64, 0.47, 0.63, 0.46, 0.53, 0.46],  # Not normal
    "death": [1.05, 0.61, 0.94, 0.35, 0.90, 1.30, 0.65],
}

# Big Five personality scores
big_five_data = {
    "Neuroticism": [3.22, 4.22, 5.52, 3.0, 3.15, 3.85, 4.43],
    "Extroversion": [4.9, 4.65, 4.36, 4.36, 4.23, 3.92, 2.65],
    "Openness": [4.02, 5.12, 5.52, 4.27, 3.86, 5.13, 4.08],
    "Agreeableness": [3.76, 4.07, 5.07, 1.95, 2.15, 4.11, 2.6],
    "Conscientiousness": [3.01, 6.22, 5.73, 4.88, 4.16, 4.36, 5.49],
}

# Correlation results storage
correlation_results = {}

# Perform correlation tests
for trait, scores in big_five_data.items():
    correlation_results[trait] = {}
    for label, values in data.items():
        # Use Spearman for non-normal distributions, otherwise use Pearson
        if label in ["Total prepositions", "sports"]:
            corr, p_value = spearmanr(scores, values)
            test_type = "Spearman"
        else:
            corr, p_value = pearsonr(scores, values)
            test_type = "Pearson"
        
        # Store results in dictionary
        correlation_results[trait][label] = (test_type, corr, p_value)

# Print correlation results
for trait, results in correlation_results.items():
    print(f"\n{trait} Correlations:")
    for label, (test_type, corr, p_value) in results.items():
        print(f"  {label}: {test_type} correlation = {corr:.3f}, p-value = {p_value:.5f}")


# Results on Correlation Between Language Use and Personality

| Lexicon Label                     | Neuroticism | Extroversion | Openness | Agreeableness | Conscientiousness |
|-----------------------------------|-------------|--------------|---------|----------------|-------------------|
| Past Tense Verbs                  | 0.60782     | 0.11027      | 0.37594 | 0.51510        | 0.30459           |
| Present Tense Verbs               | 0.93158     | 0.23155      | 0.57914 | 0.97234        | 0.49114           |
| Future Tense Verbs                | 0.81733     | 0.17556      | 0.79429 | 0.95029        | 0.85915           |
| Total number words                | -0.658      | 0.453        | -0.554  | -0.119         | **0.00695**       |
| Total prepositions                | -0.250      | -0.144       | 0.071   | 0.036          | -0.571            |
| Total articles                    | -0.392      | 0.377        | -0.026  | 0.232          | **0.04242**       |
| Total negations                   | 0.104       | 0.050        | 0.042   | -0.338         | 0.586              |
| First Person Pronouns             | -0.052      | -0.553       | -0.452  | -0.552         | 0.395              |
| Second Person Pronouns            | -0.228      | -0.550       | -0.174  | -0.298         | -0.185            |
| Third Person Pronouns             | 0.206       | -0.072       | -0.142  | 0.010          | 0.218              |
| Total empathy-related words       | -0.279      | -0.585       | -0.504  | -0.631         | 0.027              |
| money                             | 0.418       | -0.167       | 0.057   | 0.537          | -0.193            |
| work                              | 0.120       | 0.189        | 0.121   | -0.161         | 0.600              |
| sleep                             | -0.029      | 0.410        | 0.193   | 0.499          | -0.480            |
| occupation                        | 0.505       | -0.302       | 0.350   | 0.586          | -0.149            |
| family                            | -0.494      | -0.244       | -0.567  | -0.702         | -0.187            |
| swearing_terms                   | -0.571      | 0.451        | -0.254  | -0.429         | -0.339            |
| leisure                           | -0.411      | 0.394        | -0.067  | 0.012          | -0.672            |
| school                            | -0.277      | 0.453        | 0.110   | 0.263          | -0.600            |
| optimism                          | 0.051       | -0.470       | -0.088  | -0.065         | -0.259            |
| home                              | -0.342      | -0.574       | -0.473  | **-0.777**     | 0.168              |
| sexuality                         | 0.627       | -0.606       | 0.015   | 0.305          | 0.122              |
| superhero                         | 0.570       | 0.303        | 0.588   | 0.751          | -0.022            |
| religion                          | -0.266      | 0.422        | 0.130   | 0.289          | -0.281            |
| body                              | -0.330      | -0.432       | -0.306  | -0.526         | -0.111            |
| eating                            | -0.047      | -0.260       | -0.522  | -0.024         | -0.491            |
| sports                            | -0.252      | **0.836**    | 0.108   | 0.126          | -0.090            |
| death                             | 0.129       | 0.105        | 0.245   | 0.544          | -0.492            |
