# 1. Preparation

**1.0 Import Lexicons** <br>
Initially we intended to use the LIWC lexicon dictionaries (download [here](https://pypi.org/project/liwc/), and install using `!pip install -U liwc`). However, they require a subscription that we could not afford, so we turned to a free equivalent called Empath, whose guidelines can be accessed [here](https://github.com/Ejhfast/empath-client). If Empath proves insufficient, we will try [SEANCE](https://www.linguisticanalysistools.org/seance.html). <br>

**1.1 Explore Empath to find which existing lexicons can be adopted directly.** <br>
Next we explore Empath. In [Yarkoni (2011)](https://www.sciencedirect.com/science/article/pii/S0092656610000541), our reference article, Table 1 reports the correlations between LIWC lexicons and the five dimensions of the Big Five personality model. We use this table as a benchmark to filter the Empath categories, find those that can be used, and apply them to our dataset. <br>

**The results show that:** <br>
**Empath has:** <br>
affect, positive_emotion, negative_emotion, anger, sadness, hearing, communication, friends, family, swearing_terms <br>
**Need spaCy for:** <br>
pronouns (PRON), articles (DET), prepositions (ADP), numbers (NUM) <br>
1st person sg/pl, 2nd person, 3rd person pronouns <br>
Past/present/future tense verbs <br>
**Neglect the rest.** <br>


**1.2 Use spaCy to add the lexicons that we need but that are missing from Empath.** <br>
Some of the missing categories, such as 1st, 2nd, and 3rd person pronouns, can be added with spaCy's token annotations (part-of-speech tags and dependency labels). For others we lack a suitable instrument, which we take as a limitation of this study. <br>


In [9]:
### 1.0 Import Lexicon pkg ###
!pip install empath spacy
from empath import Empath
lexicon = Empath()

### 1.1 EXPLORING EMPATH ###
### WHICH EXISTING CLASSES IN EMPATH COULD BE ADOPTED DIRECTLY ###
# Print all category (class) names
print(list(lexicon.cats.keys()))
print()

# Define the list of class (category) names we're looking for
categories_to_check = ["total pronouns", "pron", "first person sing.", "first person sing.", "first person", "secon person", "third person",
 "negation", "assent", "articles", "prep", "prepositions", "number",
 "affect", "positive","optimism", "negative", "anxiety", "anger", "sadness", 
 "cognitive", "causation", "insight", "discrepancy", "inhibition", "tentative", "certainty", 
 "sensory", "seeing", "hearing", "feeling", "social", "communication", "references",
 "friend", "family", "human", "time", "tense", "space", "up", "down", 
 "inclusive", "exclusive", "motion", "occupation", "school", "job", "work", "achieve", 
 "leisure", "home", "sport", "tv", "movie", "music", "sound", "money", "finance",
 "metaphysics", "religion", "death", "physical", "body", "sexuality", "eat", "drink", "sleep"] 
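# Lower-case the search terms once so the sub-string match is case insensitive
categories_to_check_lower = [term.lower() for term in categories_to_check]
# Empath categories whose name contains any search term (sub-string match, so
# short terms also hit unrelated names, e.g. "up" in "superhero")
matched_categories = [cat for cat in lexicon.cats if any(term in cat.lower() for term in categories_to_check_lower)]
# Search terms that are not an exact Empath category name
not_matched_categories = [term for term in categories_to_check if term.lower() not in lexicon.cats]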
# Output matched and not matched categories
print("Matched categories:", matched_categories)
print()
print("Not matched categories:", not_matched_categories)

Matched categories: ['money', 'domestic_work', 'sleep', 'occupation', 'family', 'leisure', 'school', 'social_media', 'blue_collar_job', 'optimism', 'home', 'superhero', 'religion', 'body', 'eating', 'sports', 'death', 'communication', 'hearing', 'weather', 'music', 'sound', 'work', 'sadness', 'emotional', 'affection', 'anger', 'white_collar_job', 'negative_emotion', 'friends', 'achievement', 'positive_emotion', 'musical']

Not matched categories: ['total pronouns', 'pron', 'first person sing.', 'first person sing.', 'first person', 'secon person', 'third person', 'negation', 'assent', 'articles', 'prep', 'prepositions', 'number', 'affect', 'positive', 'negative', 'anxiety', 'cognitive', 'causation', 'insight', 'discrepancy', 'inhibition', 'tentative', 'certainty', 'sensory', 'seeing', 'feeling', 'social', 'references', 'friend', 'human', 'time', 'tense', 'space', 'up', 'down', 'inclusive', 'exclusive', 'motion', 'job', 'achieve', 'sport', 'tv', 'movie', 'finance', 'metaphysics', 'physi

In [10]:
# do an example analysis for light testing
result = lexicon.analyze("he kiss the other person", normalize=True)
filtered_result = {category: value for category, value in result.items() if value > 0}
print(filtered_result)

{'sexual': 0.2, 'love': 0.2}


In [12]:
### 1.2 USE SPACY TO ADD MORE ### 
# [1] PERSONAL PRONOUNS
import spacy
nlp = spacy.load("en_core_web_sm") # load the English model
# Define a function that groups the personal pronouns in a text by person and number
def identify_personal_pronouns(text):
    doc = nlp(text)
    pronouns = {
        "1st person singular": [],
        "1st person plural": [],
        "2nd person": [],
        "3rd person singular": [],
        "3rd person plural": []
    }
    for token in doc:
        if token.pos_ == "PRON":  # only check pronouns
            if token.text.lower() in ["i", "me", "my", "mine"]:  # 1st singular
                pronouns["1st person singular"].append(token.text)
            elif token.text.lower() in ["we", "us", "our", "ours"]:  # 1st plural
                pronouns["1st person plural"].append(token.text)
            elif token.text.lower() in ["you", "your", "yours"]:  # 2nd person
                pronouns["2nd person"].append(token.text)
            elif token.text.lower() in ["he", "him", "his", "she", "her", "hers", "it", "its"]:  # 3rd singular
                pronouns["3rd person singular"].append(token.text)
            elif token.text.lower() in ["they", "them", "their", "theirs"]:  # 3rd plural
                pronouns["3rd person plural"].append(token.text)
    
    return pronouns

# light testing with an example sentence
text = "I have a friend, and he said that we should meet you and them."
result = identify_personal_pronouns(text)
print(result)

{'1st person singular': ['I'], '1st person plural': ['we'], '2nd person': ['you'], '3rd person singular': ['he'], '3rd person plural': ['them']}
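
The chain of `elif` branches above can also be written as a lookup table, which may be easier to extend. The following is only a sketch under assumptions: `PRONOUN_GROUPS`, `PRONOUN_TO_GROUP`, and `group_pronouns` are hypothetical names, the reflexive forms are an addition that the word lists above do not contain, and the input is a hand-tagged list of `(text, pos)` pairs standing in for `[(t.text, t.pos_) for t in nlp(text)]` so that the block runs without the spaCy model.

```python
# Hypothetical lookup table: each group maps to the surface forms it covers.
# The reflexive forms are an assumption; the original word lists omit them.
PRONOUN_GROUPS = {
    "1st person singular": {"i", "me", "my", "mine", "myself"},
    "1st person plural": {"we", "us", "our", "ours", "ourselves"},
    "2nd person": {"you", "your", "yours", "yourself", "yourselves"},
    "3rd person singular": {"he", "him", "his", "himself", "she", "her",
                            "hers", "herself", "it", "its", "itself"},
    "3rd person plural": {"they", "them", "their", "theirs", "themselves"},
}

# Invert the table once: surface form -> group label.
PRONOUN_TO_GROUP = {
    form: group for group, forms in PRONOUN_GROUPS.items() for form in forms
}


def group_pronouns(tagged_tokens):
    """Group pronouns by person/number from (text, pos) pairs."""
    grouped = {group: [] for group in PRONOUN_GROUPS}
    for text, pos in tagged_tokens:
        if pos != "PRON":  # only check pronouns
            continue
        group = PRONOUN_TO_GROUP.get(text.lower())
        if group is not None:
            grouped[group].append(text)
    return grouped


# Hand-tagged stand-in for the test sentence used above.
example = [
    ("I", "PRON"), ("have", "VERB"), ("a", "DET"), ("friend", "NOUN"),
    (",", "PUNCT"), ("and", "CCONJ"), ("he", "PRON"), ("said", "VERB"),
    ("that", "SCONJ"), ("we", "PRON"), ("should", "AUX"), ("meet", "VERB"),
    ("you", "PRON"), ("and", "CCONJ"), ("them", "PRON"), (".", "PUNCT"),
]
pronoun_groups = group_pronouns(example)
print(pronoun_groups)
```

With real data, the pairs would come from the spaCy `doc` instead of the hand-tagged list.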


In [13]:
# [2] VERB TENSE
def identify_verb_tenses(text):
    doc = nlp(text)
    tenses = {
        "past": [],
        "present": [],
        "future": []
    }
    for token in doc:
        if token.pos_ == "VERB":  # verbs only
            # check the tense
            if token.tag_ in ["VBD", "VBN"]:  # past
                tenses["past"].append(token.text)
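            elif token.tag_ in ["VBZ", "VBP", "VB"]:  # present (VB also covers base forms such as "go" in "will go")
                tenses["present"].append(token.text)
            # NOTE: not reached in the test below: spaCy tags "will" as an auxiliary
            # (pos_ "AUX", tag_ "MD"), so it fails the pos_ == "VERB" check and "future" stays empty
            elif token.text.lower().startswith("will") or token.tag_ == "MD":  # future
                tenses["future"].append(token.text)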
    return tenses

# light testing
text = "I will go to the store tomorrow. He went there yesterday, and I am here now."
result = identify_verb_tenses(text)
print(result)

{'past': ['went'], 'present': ['go'], 'future': []}
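
The output above has an empty `future` list and no `am`, because auxiliaries never pass the `pos_ == "VERB"` filter. One possible way around this, sketched under assumptions, is to classify on the fine-grained tags alone and to treat a base-form verb that follows a future modal as future. `classify_tenses` and `FUTURE_MODALS` are hypothetical names. The input is a hand-tagged list of `(text, tag)` pairs standing in for `[(t.text, t.tag_) for t in nlp(text)]`. The token forms `'ll` and `wo` (from "won't") assume Penn Treebank style tokenisation.

```python
PAST_TAGS = {"VBD", "VBN"}      # past tense, past participle
PRESENT_TAGS = {"VBZ", "VBP"}   # 3rd person singular, other present forms
# Assumed surface forms of future modals after tokenisation.
FUTURE_MODALS = {"will", "shall", "'ll", "wo"}


def classify_tenses(tagged_tokens):
    """Sort verbs into past/present/future from (text, tag) pairs."""
    tenses = {"past": [], "present": [], "future": []}
    pending_future = False  # True after a future modal, until its verb is seen
    for text, tag in tagged_tokens:
        if tag == "MD":
            # Only "will"-type modals open a future construction.
            pending_future = text.lower() in FUTURE_MODALS
        elif tag == "VB" and pending_future:
            tenses["future"].append(text)
            pending_future = False
        elif tag in PAST_TAGS:
            tenses["past"].append(text)
            pending_future = False
        elif tag in PRESENT_TAGS:
            tenses["present"].append(text)
            pending_future = False
        # Any other tag (adverbs, "n't", ...) leaves pending_future unchanged,
        # so "will not go" and "will probably go" still count "go" as future.
    return tenses


# Hand-tagged stand-in for the test sentence used above.
tagged = [
    ("I", "PRP"), ("will", "MD"), ("go", "VB"), ("to", "IN"), ("the", "DT"),
    ("store", "NN"), ("tomorrow", "NN"), (".", "."), ("He", "PRP"),
    ("went", "VBD"), ("there", "RB"), ("yesterday", "NN"), (",", ","),
    ("and", "CC"), ("I", "PRP"), ("am", "VBP"), ("here", "RB"),
    ("now", "RB"), (".", "."),
]
tense_result = classify_tenses(tagged)
print(tense_result)
```

In this sketch a base form that does not follow a future modal (for example an infinitive) is left uncounted rather than counted as present. That is a design choice, not something the original code does.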


In [14]:
# [3] NEGATIONS
def identify_negations(text):
    doc = nlp(text)
    negations = []  
    for token in doc:
        if token.dep_ == "neg": 
            negations.append(token.text)  
    return negations
# light testing
text = "I do not like apples, but I love oranges. He isn't coming to the party."
result = identify_negations(text)
print(result)

['not', "n't"]


# 2. Data Processing

> Recall the hypotheses at the word level:
> We want to see whether the correlations between LIWC categories and Big Five personality traits align with the trends in Yarkoni (2011).
> That is:
> N

Our dataset covers all eight films of the *Harry Potter* series. Originally we used only the first film, but it turned out to be insufficient for a significant result, so we used all of them. <br>

At the word level, we follow the steps below: <br>

**2.1 Add the lexicons** <br>
After the pre-processing stage, each character's lines form a separate dataset. Currently each dataset has the following labels: tokens, frequencies, postags. Now we need to add a new label called "*lexicon*", which contains the required LIWC lexicon information. Some lexicons can be obtained directly from Empath; others, as stated above, need further processing with spaCy. <br>

**2.2 Find the personality** <br>
Based on the lexicons, we start the data analysis of a character's personality. <br>
Step 1: count the frequencies per lexicon. <br>
Step 2: compare the lexicon frequencies between characters, using percentage = freq / number_of_tokens. <br>
Step 3: verify the hypotheses, i.e. use the percentages to check whether more extroverted characters tend to use words from certain lexicons more often than less extroverted characters. <br>
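
Steps 1 and 2 can be sketched as follows. This is an illustration only: `lexicon_percentages` is a hypothetical helper, the two characters and their lexicon labels are made-up toy input rather than results from our dataset, and we assume that the "*lexicon*" label stores a list of category names for each token (empty when the token matches none).

```python
from collections import Counter


def lexicon_percentages(token_lexicons):
    """Return freq / number_of_tokens for every lexicon label.

    token_lexicons holds one list of lexicon labels per token;
    a token that matches no lexicon contributes an empty list.
    """
    number_of_tokens = len(token_lexicons)
    if number_of_tokens == 0:
        return {}
    # Step 1: count the frequencies per lexicon.
    freq = Counter(label for labels in token_lexicons for label in labels)
    # Step 2: normalise by the character's token count.
    return {label: count / number_of_tokens for label, count in freq.items()}


# Made-up toy input: four tokens per hypothetical character.
toy_characters = {
    "character_a": [["family"], [], ["family", "positive_emotion"], []],
    "character_b": [[], ["anger"], [], []],
}
percentages = {
    name: lexicon_percentages(tokens) for name, tokens in toy_characters.items()
}
print(percentages)
```

Normalising by the token count keeps characters with many lines comparable to characters with few lines, which is what Step 2 requires.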

In [None]:
###  Data processing ###

# Database: lines per person
# Put the sentences in, give each a lexicon marker, this will be a new category
# Use previous studies' conclusion on characters' BIG-5, 
#   we explore if correlations between Big Five personality traits and LIWC categories
#   align with conclusions from Yarkoni (2011) Table 1

In [None]:
'''
Word / lexicon / character / Freq.

character personality tendency

coefficient
'''
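
The "coefficient" in the plan above could be computed as below, assuming that Pearson's r is the intended measure and that each character has one score per Big Five trait and one percentage per lexicon. `pearson_r` is a hypothetical helper, and the numbers are made-up toy values chosen to be perfectly linear, not results.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up toy values: one entry per hypothetical character.
trait_scores = [1.0, 2.0, 3.0, 4.0]           # e.g. a trait rating
lexicon_percent = [0.01, 0.02, 0.03, 0.04]    # e.g. share of one lexicon

r = pearson_r(trait_scores, lexicon_percent)
print(round(r, 6))
```

The sign of each coefficient could then be compared with the sign of the corresponding entry in Table 1 of Yarkoni (2011).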