# 1. Preparation

**1.0 Import Lexicons** <br>
Initially we intended to use the LIWC lexicon dictionaries (download [here](https://pypi.org/project/liwc/), and install using `!pip install -U liwc`), but using them would require a considerable fee. We therefore turned to a free equivalent called EMPATH, whose guidelines can be accessed [here](https://github.com/Ejhfast/empath-client). If it proves insufficient, we will try [SEANCE](https://www.linguisticanalysistools.org/seance.html). <br>

**1.1 Explore EMPATHY to find which existing lexicons can be adopted directly.** <br>
Next we explore EMPATHY. Table 1 of our reference article, [Yarkoni (2011)](https://www.sciencedirect.com/science/article/pii/S0092656610000541), reports correlations between LIWC lexicons and the five dimensions of the Big Five personality model. We use this table as a benchmark to filter EMPATHY, find the available labels, and then apply them to our dataset. <br>

**1.2 Use spaCy to add the lexicons that we need but that are missing from EMPATHY.** <br>
Some of the missing lexicons, such as 1st, 2nd, and 3rd person pronouns, can be added by parsing with spaCy, but for others we have no suitable instrument. We treat that as a limitation of this study. We call the resulting lexicon, our substitute for LIWC, EMPATHYe (Empathy extended).<br> 

**The results show that:** <br>
**EMPATHY has:** <br>
affect, positive_emotions, negative_emotions, anger, sadness, hearing, communication, friends, family, swearing_terms <br>
**Need Spacy for:** <br>
pronouns (PRON), articles (DET), prepositions (ADP), numbers (NUM) <br>
1st person sg/pl, 2nd person, 3rd person pronouns <br>
Past/present/future tense vb. <br>
We **leave out the rest.** <br>


In [24]:
### 1.0 Import Lexicon pkg ###
# !pip install empath spacy pandas
import pandas as pd
import spacy
from empath import Empath
lexicon = Empath()

### 1.1 EXPLORING EMPATHY ###
### WHAT EXISTING CLASSES IN EMPATHY COULD BE ADOPTED DIRECTLY ###
# Print all category (class) names
print(list(lexicon.cats.keys()))
print()

# Define the list of class (category) names we're looking for
categories_to_check = ["total pronouns", "pron", "first person sing.", "first person", "second person", "third person",
 "negation", "assent", "articles", "prep", "prepositions", "number",
 "affect", "positive", "optimism", "negative", "anxiety", "anger", "sadness", 
 "cognitive", "causation", "insight", "discrepancy", "inhibition", "tentative", "certainty", 
 "sensory", "seeing", "hearing", "feeling", "social", "communication", "references",
 "friend", "family", "human", "time", "tense", "space", "up", "down", 
 "inclusive", "exclusive", "motion", "occupation", "school", "job", "work", "achieve", 
 "leisure", "home", "sport", "tv", "movie", "music", "sound", "money", "finance",
 "metaphysics", "religion", "death", "physical", "body", "sexuality", "sex", "eat", "drink", "sleep", "groom", "swear"]

# Convert categories_to_check to lowercase for case-insensitive comparison
categories_to_check_lower = [cat.lower() for cat in categories_to_check]

# Find matching categories (substring match, case insensitive)
matched_categories = [cat for cat in lexicon.cats if any(search_term in cat.lower() for search_term in categories_to_check_lower)]
# A search term is unmatched when it is not a substring of any Empath category name
not_matched_categories = [term for term in categories_to_check if not any(term.lower() in cat.lower() for cat in lexicon.cats.keys())]

# Output matched and not matched categories
print("Matched categories:", matched_categories)
print()
print("Not matched categories:", not_matched_categories)


['help', 'office', 'dance', 'money', 'wedding', 'domestic_work', 'sleep', 'medical_emergency', 'cold', 'hate', 'cheerfulness', 'aggression', 'occupation', 'envy', 'anticipation', 'family', 'vacation', 'crime', 'attractive', 'masculine', 'prison', 'health', 'pride', 'dispute', 'nervousness', 'government', 'weakness', 'horror', 'swearing_terms', 'leisure', 'suffering', 'royalty', 'wealthy', 'tourism', 'furniture', 'school', 'magic', 'beach', 'journalism', 'morning', 'banking', 'social_media', 'exercise', 'night', 'kill', 'blue_collar_job', 'art', 'ridicule', 'play', 'computer', 'college', 'optimism', 'stealing', 'real_estate', 'home', 'divine', 'sexual', 'fear', 'irritability', 'superhero', 'business', 'driving', 'pet', 'childish', 'cooking', 'exasperation', 'religion', 'hipster', 'internet', 'surprise', 'reading', 'worship', 'leader', 'independence', 'movement', 'body', 'noise', 'eating', 'medieval', 'zest', 'confusion', 'water', 'sports', 'death', 'healing', 'legend', 'heroic', 'celebr
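
The substring test used above can be tried on a small scale without loading Empath. The sketch below uses a hypothetical handful of category names (`toy_categories`) and search terms (`toy_terms`); neither list is the real lexicon output, and the helper name `split_matches` is an assumption made for illustration.

```python
def split_matches(search_terms, category_names):
    """Split search terms into those found inside some category name and those not found."""
    names = [name.lower() for name in category_names]
    matched = {}
    unmatched = []
    for term in search_terms:
        needle = term.lower()
        # A term matches every category name that contains it as a substring
        hits = [name for name in names if needle in name]
        if hits:
            matched[term] = hits
        else:
            unmatched.append(term)
    return matched, unmatched


# Hypothetical inputs for illustration only
toy_categories = ["positive_emotion", "swearing_terms", "family", "sports"]
toy_terms = ["positive", "swear", "family", "negation"]

matched, unmatched = split_matches(toy_terms, toy_categories)
print(matched)
print(unmatched)
```

Plain substring matching can also produce false positives: the search term `"up"` occurs inside `superhero` and `"motion"` inside `emotional`, so the matched names are worth reviewing by hand.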

In [10]:
# do an example analysis for light testing
result = lexicon.analyze("he kiss the other person", normalize=True)
filtered_result = {category: value for category, value in result.items() if value > 0}
print(filtered_result)

{'sexual': 0.2, 'love': 0.2}
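
The values in this output are consistent with each category count being divided by the number of tokens (one hit in five tokens gives 0.2). The sketch below reproduces that arithmetic with a hypothetical two-category lexicon (`toy_lexicon`) and plain whitespace tokenization; it is a simplified stand-in written for illustration, not Empath's own code.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration; not the real Empath word lists
toy_lexicon = {
    "sexual": {"kiss"},
    "love": {"kiss", "adore"},
}


def analyze_normalized(text, lexicon):
    """Count lexicon hits per category and divide each count by the token count."""
    tokens = text.lower().split()
    if not tokens:
        return {}
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return {category: counts[category] / len(tokens) for category in lexicon}


scores = analyze_normalized("he kiss the other person", toy_lexicon)
print(scores)  # {'sexual': 0.2, 'love': 0.2}
```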


# 2. Data Processing

**Recall the hypotheses for word level**

We want to see whether the correlations between LIWC categories and Big Five personality traits align with the trends in Yarkoni (2011). That is: <br>
|   EMPATHYe   |Label Name| Neuroticism  | Extroversion |   Openness   |Agreeableness |Conscientiousness|
|--------------|--------------|--------------|--------------|--------------|--------------|--------------|
| pronouns     |*pronouns|      +       |      +       |       --     |       ++     |       -      |
| 1st person sing.|*first_person_sg|   ++      |      +       |       -      |       +      |       0      |
| 1st person plural|*first_person_pl|   -      |     ++       |       --     |       ++     |       +      |
| 1st person   |*first_person|++|+|--|++|+|
| 2nd person   |*second_person|--|++|--|+|0|
| 3rd person   |*third_person|+|+|-|+|-|
| negations    |*negations|++|-|--|-|--|
| articles     |*articles|--|-|++|+|++|
| prepositions |*prepositions|-|-|++|+|+|
| numbers      |*numbers|-|--|--|++|+|
| affect       |affection|+|+|--|+|-|
| positive emotions|positive_emotion|-|++|--|++|+|
| optimism    |optimism|--|+|0|++|++|
| negative emotions|negative_emotion|++|+|0|--|--|
| anger        |anger|++|+|+|--|--|
| sadness      |sadness|++|+|-|+|--|
| hearing      |hearing|+|++|--|+|--|
| communication|communication|0|++|-|+|-|
| friends      |friends|--|++|-|++|+|
| family       |family|-|+|--|++|+|
| past tense vb.|*past_tense|+|-|--|+|0|
| present tense vb.|*present_tense|+|-|--|0|-|
| future tense vb.|*future_tense|-|-|-|-|-|
| occupation   |occupation|+|--|+|-|+|
| school       |school|+|-|+|-|-|
| job/work     |work|+|--|+|-|+|
| achievement  |achievement|+|--|-|+|--|
| leisure      |leisure|-|++|--|++|+|
| home         |home|0|+|--|++|+|
| sports       |sports|-|+|--|+|0|
| music        |music|-|++|+|+|--|
| money        |money|+|-|-|--|-|
| religion     |religion|-|++|+|+|-|
| death        |death|+|+|++|--|--|
| body states  |body|+|++|-|++|-|
| sexuality    |sexuality|+|++|0|++|-|
| eating       |eating|-|+|--|+|-|
| sleep        |sleep|++|-|--|++|-|
| swearing words|swearing_terms|++|+|+|--|--|

(_Label Name_ is the label's name in our _EMPATHYe_; labels marked with `*` are those not taken directly from _EMPATHY_.)
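
For the later comparison it may help to hold these expectations in a machine-readable form. The sketch below copies three rows from the table and maps the symbols to integers. The numeric scale (`++` = 2 down to `--` = -2) and the names `SIGN_SCALE`, `HYPOTHESES`, and `expected_direction` are assumptions made for illustration.

```python
TRAITS = ["Neuroticism", "Extroversion", "Openness", "Agreeableness", "Conscientiousness"]

# Assumed numeric scale for the table symbols
SIGN_SCALE = {"++": 2, "+": 1, "0": 0, "-": -1, "--": -2}

# Three rows copied from the table (label -> symbols in TRAITS order)
HYPOTHESES = {
    "anger": ["++", "+", "+", "--", "--"],
    "friends": ["--", "++", "-", "++", "+"],
    "sleep": ["++", "-", "--", "++", "-"],
}


def expected_direction(label, trait):
    """Return the hypothesised direction for a label and trait as an integer in [-2, 2]."""
    symbols = HYPOTHESES[label]
    return SIGN_SCALE[symbols[TRAITS.index(trait)]]


print(expected_direction("friends", "Extroversion"))
print(expected_direction("anger", "Agreeableness"))
```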

**Re-classification Needed**:<br>
The following labels from _EMPATHY_ will be renamed or merged in our _EMPATHYe_: <br>
1) Work: domestic_work, blue_collar_job, white_collar_job, work <br>
2) Music: music, sound, musical <br>
3) Sexuality: sexual <br>

**Dataset**<br>
Our dataset covers all eight films of the *Harry Potter* film series. Originally we used only the first film, but it turned out to be insufficient for a significant result, so we included them all. <br>

**Build our own lexicon: *EMPATHYe*** <br>
After the pre-processing stage, each character's lines form a separate dataset. Currently each dataset has the following labels: tokens, frequencies, postags. We now add a new label called "*empathye*", which holds the required lexicon information. Some categories can be processed directly with EMPATHY; others, as stated before, need further processing with spaCy. <br>
**2.1 Handle Empathy** <br>
**2.2 Apply Spacy** <br>

In [21]:
### 2.1 START WITH EMPATHY ###

# labels to keep
labels_to_keep = [
    'money', 'domestic_work', 'sleep', 'occupation', 'family', 'swearing_terms', 'leisure', 'school',
    'blue_collar_job', 'optimism', 'home', 'sexual', 'superhero', 'religion', 'body', 'eating', 'sports',
    'death', 'communication', 'hearing', 'music', 'sound', 'work', 'sadness', 'emotional', 'affection',
    'anger', 'white_collar_job', 'negative_emotion', 'friends', 'achievement', 'positive_emotion', 'musical'
]

# merging rules:
merge_rules = {
    'work': ['domestic_work', 'blue_collar_job', 'white_collar_job', 'work'],
    'music': ['music', 'sound', 'musical'],
    'sexuality': ['sexual']
}

temp_lexicon = {} # temporarily store the lexicon

# filter and merge based on the rules above
for label in labels_to_keep:
    # Check whether this label has to be merged
    merged = False
    for new_label, old_labels in merge_rules.items():
        if label in old_labels:
            # If so, merge its words into the new label
            if new_label not in temp_lexicon:
                temp_lexicon[new_label] = set()
            temp_lexicon[new_label].update(lexicon.cats[label])
            merged = True
            break
    # Labels outside the merge rules are copied over unchanged
    if not merged:
        temp_lexicon[label] = lexicon.cats[label]

# Clear original lexicon contents
lexicon.cats.clear()

# Reassign the updated content to lexicon
lexicon.cats.update(temp_lexicon)

In [23]:
### 2.2 USE SPACY FOR FURTHER PROCESSING

# [1] PERSONAL PRONOUNS
nlp = spacy.load("en_core_web_sm") # load the English model
# Define the pronoun labels
pronouns = {
    "first_person_sg": ["I", "me", "my", "mine"],
    "first_person_pl": ["we", "us", "our", "ours"],
    "first_person": ["I", "me", "my", "mine", "we", "us", "our", "ours"],
    "second_person": ["you", "your", "yours"],
    "third_person": ["he", "him", "his", "she", "her", "hers", "they", "them", "their", "theirs"],
}

# Add the pronoun labels to the lexicon
for label, words in pronouns.items():
    lexicon.cats[label] = words

# Check the lexicon after the additions
print(list(lexicon.cats.keys()))

['money', 'work', 'sleep', 'occupation', 'family', 'swearing_terms', 'leisure', 'school', 'optimism', 'home', 'sexuality', 'superhero', 'religion', 'body', 'eating', 'sports', 'death', 'communication', 'hearing', 'music', 'sadness', 'emotional', 'affection', 'anger', 'negative_emotion', 'friends', 'achievement', 'positive_emotion', 'first_person_sg', 'first_person_pl', 'first_person', 'second_person', 'third_person']
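
The pronoun lists mix an upper-case `I` with lower-case entries, so how tokens are cased before lookup affects the counts. The sketch below shows one case-insensitive way to count these labels on a pre-tokenized line. The function `count_pronoun_labels` and the sample tokens are hypothetical, and the sketch does not use Empath's own tokenizer.

```python
from collections import Counter

pronoun_labels = {
    "first_person_sg": ["I", "me", "my", "mine"],
    "first_person_pl": ["we", "us", "our", "ours"],
    "second_person": ["you", "your", "yours"],
    "third_person": ["he", "him", "his", "she", "her", "hers",
                     "they", "them", "their", "theirs"],
}

# Lower-case every entry once so that lookups ignore case
lookup = {label: {w.lower() for w in words} for label, words in pronoun_labels.items()}


def count_pronoun_labels(tokens):
    """Count how many tokens fall under each pronoun label, ignoring case."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        for label, words in lookup.items():
            if word in words:
                counts[label] += 1
    return counts


# Hypothetical sample line, already split into tokens
sample_tokens = ["I", "think", "We", "should", "tell", "him", "my", "plan"]
counts = count_pronoun_labels(sample_tokens)
print(dict(counts))
```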


In [52]:
# define the function for finding past, present, future tense words;
# for finding numbers, prepositions, articles, negations

# tense verbs
def label_tenses(file_path):
    # Read the CSV file without a header row
    df = pd.read_csv(file_path, header=None)

    # Collected verb forms per tense
    labeled_verbs = {
        'past_tense': [],
        'present_tense': [],
        'future_tense': []
    }

    # Iterate over every row of text
    for index in range(len(df)):
        # Parse the text in the first column of this row with spaCy
        doc = nlp(df.iloc[index, 0])

        for token in doc:
            # Classify by Penn Treebank tag
            if token.tag_ in ['VBD', 'VBN']:  # past tense / past participle
                labeled_verbs['past_tense'].append(token.text)
            elif token.tag_ in ['VBZ', 'VBP', 'VBG']:  # present tense / gerund
                labeled_verbs['present_tense'].append(token.text)
            elif token.tag_ == 'MD':  # modal auxiliary (covers all modals, not only "will")
                # Count as future only when the next token is a verb; the bounds
                # check avoids an IndexError when the modal is the last token
                if token.i + 1 < len(doc) and token.nbor().pos_ == "VERB":
                    labeled_verbs['future_tense'].append(token.nbor().text)

    return labeled_verbs

# numbers
def label_numbers(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_numbers = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of this row

        for token in doc:
            if token.like_num:  # token looks like a number
                labeled_numbers.append(token.text)

    return labeled_numbers


# prepositions

def label_prepositions(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_prepositions = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of this row

        for token in doc:
            if token.pos_ == "ADP":  # spaCy tags prepositions as ADP (adpositions)
                labeled_prepositions.append(token.text)

    return labeled_prepositions


# articles

def label_articles(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_articles = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of this row

        for token in doc:
            if token.pos_ == "DET":  # DET covers all determiners, not only articles
                labeled_articles.append(token.text)

    return labeled_articles


# negations

def label_negations(file_path):
    df = pd.read_csv(file_path, header=None)
    labeled_negations = []

    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of this row

        for token in doc:
            if token.dep_ == "neg":  # negation words carry the dependency label neg
                labeled_negations.append(token.text)

    return labeled_negations
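
The tag-to-tense mapping in `label_tenses` can be isolated from file reading and parsing, which makes it easier to check by hand. The sketch below works on already tagged `(text, tag)` pairs with Penn Treebank tags. As a deliberate variation that is an assumption of this sketch and not the notebook's method, it treats only `will`, `shall`, `'ll`, and `wo` (the modal half of `won't`) as future markers, because the `MD` tag also covers modals such as `can` or `should`. The names `classify_tenses` and `FUTURE_MODALS` are hypothetical.

```python
PAST_TAGS = {"VBD", "VBN"}
PRESENT_TAGS = {"VBZ", "VBP", "VBG"}
# Assumption of this sketch: only these modals signal future tense
FUTURE_MODALS = {"will", "shall", "'ll", "wo"}


def classify_tenses(tagged_tokens):
    """Group verb forms by tense from a list of (text, tag) pairs."""
    result = {"past_tense": [], "present_tense": [], "future_tense": []}
    for i, (text, tag) in enumerate(tagged_tokens):
        if tag in PAST_TAGS:
            result["past_tense"].append(text)
        elif tag in PRESENT_TAGS:
            result["present_tense"].append(text)
        elif tag == "MD" and text.lower() in FUTURE_MODALS:
            # Look ahead for the base-form verb, skipping adverbs such as "not"
            for next_text, next_tag in tagged_tokens[i + 1:]:
                if next_tag == "RB":
                    continue
                if next_tag == "VB":
                    result["future_tense"].append(next_text)
                break
    return result


# Hand-tagged example (tags written by hand, not produced by a parser)
tagged = [("He", "PRP"), ("left", "VBD"), ("and", "CC"), ("she", "PRP"),
          ("knows", "VBZ"), ("we", "PRP"), ("will", "MD"), ("not", "RB"),
          ("follow", "VB"), ("but", "CC"), ("can", "MD"), ("wait", "VB")]
tenses = classify_tenses(tagged)
print(tenses)
```

Here `follow` is counted as future because it follows `will not`, while `wait` after `can` is not counted.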

# 3. Data Analysis

**Find the personality** <br> 
Based on the lexicons, we start the analysis of each character's personality. <br>
Step 1: count the frequencies per lexicon. <br>
Step 2: compare the lexicon frequencies between characters, using percentage = freq / number_of_tokens × 100 <br>
Step 3: verify the hypotheses, i.e. use the percentages to check whether more extroverted characters tend to use certain words more often than less extroverted ones. <br>
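
Steps 1 and 2 can be sketched as follows, assuming each character's lines are already tokenized and each lexicon label is a set of words. The token lists in `toy_lines`, the `toy_lexicon`, and the helper `label_percentages` are hypothetical placeholders, not data from this study.

```python
def label_percentages(tokens, lexicon):
    """Return, per label, the share of tokens (in percent) that belong to that label."""
    total = len(tokens)
    if total == 0:
        return {label: 0.0 for label in lexicon}
    lowered = [t.lower() for t in tokens]
    return {
        label: 100 * sum(1 for t in lowered if t in words) / total
        for label, words in lexicon.items()
    }


# Hypothetical lexicon and token lists for illustration only
toy_lexicon = {"family": {"mother", "father"}, "anger": {"hate", "furious"}}
toy_lines = {
    "character_a": ["my", "mother", "and", "father"],
    "character_b": ["i", "hate", "this", "place", "hate", "it", "all", "now"],
}

percentages = {name: label_percentages(toks, toy_lexicon) for name, toks in toy_lines.items()}
for name, scores in percentages.items():
    print(name, scores)
```

For Step 3, the per-label percentages of two characters would then be compared against the direction expected for the trait in question.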

In [53]:
def process_tense_analysis(file_path):
    # Collect the verbs per tense for this file
    tenses_result = label_tenses(file_path)

    # Running total of tokens in the file
    total_words = 0

    # Add up the token count of every row
    df = pd.read_csv(file_path, header=None)
    for index in range(len(df)):
        doc = nlp(df.iloc[index, 0])  # text in the first column of this row
        total_words += len(doc)  # number of tokens in the current row

    # Number of verbs found per tense
    past_count = len(tenses_result['past_tense'])
    present_count = len(tenses_result['present_tense'])
    future_count = len(tenses_result['future_tense'])

    # Convert the counts to percentages of all tokens
    past_percentage = (past_count / total_words) * 100 if total_words > 0 else 0
    present_percentage = (present_count / total_words) * 100 if total_words > 0 else 0
    future_percentage = (future_count / total_words) * 100 if total_words > 0 else 0

    # Print the results
    print(f"Past Tense Verbs: {past_count} ({past_percentage:.2f}%)")
    print(f"Present Tense Verbs: {present_count} ({present_percentage:.2f}%)")
    print(f"Future Tense Verbs: {future_count} ({future_percentage:.2f}%)")

In [55]:
### DUMBLEDORE ###
process_tense_analysis('tokens/Dumbledore.csv')

Past Tense Verbs: 281 (2.94%)
Present Tense Verbs: 362 (3.79%)
Future Tense Verbs: 0 (0.00%)


In [None]:
### HARRY ###
process_tense_analysis('tokens/Harry.csv')

In [None]:
### HERMIONE ###