# BIG FIVE Model

### WORD AND PHRASE CORRELATIONS
This notebook instantiates the **word and phrase correlations model** `[2: H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Stephanie M. Ramones, Megha Agrawal, Achal Shah, Michal Kosinski, David Stillwell, Martin E. P. Seligman, Lyle H. Ungar. 2013. Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach]`, `[4: M. L. Kern, J. C. Eichstaedt, H. A. Schwartz, L. Dziurzynski, L. H. Ungar, D. J. Stillwell, M. Kosinski, S. M. Ramones, M. E. P. Seligman. 2014. The Online Social Self: An Open Vocabulary Approach to Personality]`.

The model is based on open-vocabulary differential language analysis (DLA), which consists of three key steps:
1. Linguistic Feature Extraction
<center> $p(phrase|subject) = \large{\frac{freq(phrase, subject)}{\sum_{phrase' \in vocab(subject)} freq(phrase', subject)}}$ </center>

> where vocab(subject) returns a list of all words and phrases used by the subject

<center> $p(topic|subject) = \sum_{word \in topic} p(topic|word) * p(word|subject)$ </center>

> where p(word|subject) is the subject's normalised use of the word; p(topic|word) is the probability of the topic given the word

2. Correlational Analysis: using ordinary least squares regression

3. Visualization: word clouds in which words are scaled according to their frequency and to the strength of their correlation with the demographic or psychological measure of interest
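
The two feature-extraction formulas of step 1 can be illustrated with a minimal sketch. The tokens, the topic and its `p(topic|word)` values are made-up toy data, and the helper names `relative_frequencies` and `topic_usage` are hypothetical; this is an illustration of the formulas, not the original DLA implementation.

```python
from collections import Counter

def relative_frequencies(tokens):
    """p(phrase|subject): count of each phrase divided by the subject's total count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {phrase: c / total for phrase, c in counts.items()}

def topic_usage(p_topic_given_word, p_word_given_subject):
    """p(topic|subject): sum over the topic's words of p(topic|word) * p(word|subject)."""
    return sum(p_tw * p_word_given_subject.get(word, 0.0)
               for word, p_tw in p_topic_given_word.items())

# toy data, assumed for illustration only
tokens = ["math", "exam", "math", "coffee"]
p_word = relative_frequencies(tokens)        # math: 0.5, exam: 0.25, coffee: 0.25
school_topic = {"math": 0.8, "exam": 0.6}    # hypothetical p(topic|word) values
p_topic = topic_usage(school_topic, p_word)  # 0.8 * 0.5 + 0.6 * 0.25
print(p_word, p_topic)
```

Words of the topic that the subject never used contribute 0 to the sum, which is why `topic_usage` falls back to `0.0` for missing words.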

#### Big-Five prediction
* To predict a user's personality, I estimate the incidence of each trait: Extroversion, Conscientiousness, Openness to Experience, Agreeableness and Neuroticism (high and low, respectively). To do this I use the `top100.1to3grams.gender_age_controlled.rmatrix` [2][4] models of each personality trait. Since the models give the correlation between each phrase and the personality trait (ordinary least squares linear regression, with significance at a Bonferroni-corrected p < 0.001), I use these correlations to predict the high and low traits of each `user_id`:

<center>$\overline{correlation}(category|document) = \large{\frac{\sum_{phrase \in category} correlation(phrase, document)}{n}}$</center>

> where $correlation(phrase, document)$ is the model correlation of the phrase, counted when all $words \in phrase$ are present in the document in any order; $n$ is the number of phrases of the category matched in the document, $n = |\{phrase \in category : phrase \text{ matches } document\}|$.

* For the case study **female neuroticism for mathematics** `[56: M. Lunardon, T. Cerni, R. I. Rumiati. 2022. Numeracy Gender Gap in STEM Higher Education: The Role of Neuroticism and Math Anxiety]`, I consider:


> 1. a document is a user (represented as a Bag-Of-Words);
> 2. $\overline{correlation}(category|user) = \large{\frac{\sum_{phrase \in category} correlation(phrase, user)}{n}}$, where $n$ is the number of phrases of the category matched in the user's Bag-Of-Words and $category \in \{A, C, E, N, O\}$;
> 3. ($phrase \in category$) matches ($phrase \in user$) if $\forall\, word \in phrase$: $word \in BOW_{user}$, for $user \in User\_id$.

* The dataset on which the Big-Five personality prediction is run is `BAG-OF-WORDS DATASET`.
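
The averaged correlation and the matching rule described above can be sketched as follows. The phrases, correlations and BOW are made-up toy values, and `mean_correlation` is a hypothetical helper name; the sketch assumes that a phrase matches when every one of its words occurs in the user's Bag-Of-Words, in any order.

```python
def mean_correlation(category_model, bow_words):
    """Average the correlations of the phrases whose words all occur in the user's BOW.

    Returns (mean, n), where n is the number of matched phrases; the mean is 0.0 when n == 0.
    """
    matched = [corr for phrase, corr in category_model.items()
               if all(word in bow_words for word in phrase.split())]
    n = len(matched)
    return (sum(matched) / n if n else 0.0), n

# toy model for one trait and toy BOW (values are made up)
model_N = {"sick of": 0.08, "depressed": 0.06, "weekend": -0.04}
bow_user = {"sick", "of", "weekend", "math"}

mean_N, n = mean_correlation(model_N, bow_user)
print(mean_N, n)
```

Here "sick of" and "weekend" match, "depressed" does not, so the mean is taken over two correlations.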


### CONFIGURATION

#### Libraries and packages

In [2]:
import csv
import numpy as np
import pandas as pd
import re
import time

#### Project Directory

In [3]:
%cd "/home/eleonora/Desktop/TESI Code" 
!ls 

/home/eleonora/Desktop/TESI Code
'BAG-OF-WORDS DATASET.ipynb'   OLD
'BIG FIVE Model.ipynb'	       RESULTS
 DATASETS		      'TWITTER OCCUPATION DATASET'
'GENDER Model.ipynb'	      'WORD CLOUDS'
 MODELS


### BAG-OF-WORDS DATASETS

*BAG-OF-WORDS DATASET* stores the Bag-Of-Words of each user of the `Twitter Occupation Dataset` in a .csv file, in which:
- column 1: key
- column 2: user_id
- column 3: data

In [4]:
# path of BAG-OF-WORDS DATASET
path_BOWs_DATASET = "./DATASETS/BAG-OF-WORDS DATASET.csv"

From the file `BAG-OF-WORDS DATASET.csv` I instantiate the `df_BOWs` dataframe, indexed by `key`:

In [5]:
df_BOWs  = pd.read_csv(path_BOWs_DATASET)
new_index  = df_BOWs['Unnamed: 0']
df_BOWs = df_BOWs[['user_id', 'data']]
df_BOWs = df_BOWs.set_index(new_index)
df_BOWs = df_BOWs.rename_axis("key")
print(df_BOWs.head())

        user_id                                               data
key                                                               
0     206749819  "#1":0.0003209757663296421:2 "#12":0.000160487...
1     706405455  "#20":3.470173855710171e-05:1 "#25":3.47017385...
2     185699053  "#1":0.0004828585224529213:1 "#2":0.0009657170...
3    1134832688  "#":0.00026518164942985947:4 "#9":6.6295412357...
4      36040783  "#":6.967670011148272e-05:1 "#1":0.00034838350...


> *Example: in their Twitter posts, user 206749819 used the word '#1' 2 times, with a frequency ratio over all words of 0.000320976; the word '#12' 1 time, with a ratio of 0.000160488; the word '#17' 2 times, with a ratio of 0.000160488; and so on.*
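
A minimal sketch of how one `data` string could be parsed into a dictionary. The sample string is made up and only follows the `"word":ratio:freq` layout shown above; `parse_bow` is a hypothetical helper, and it assumes that words contain no spaces.

```python
def parse_bow(data):
    """Parse a BOW string of space-separated '"word":ratio:freq' entries into a dict."""
    bow = {}
    for entry in data.split(' '):
        if not entry:  # skip empty pieces produced by repeated spaces
            continue
        # split from the right so that a word containing ':' stays intact
        word, ratio, freq = entry.rsplit(':', 2)
        bow[word.strip('"')] = (float(ratio), int(freq))
    return bow

# made-up sample in the same layout as the 'data' column
sample = '"#1":0.0004:2 "#12":0.0002:1 ":)":0.0002:1'
bow = parse_bow(sample)
print(bow["#1"], bow[":)"])
```

Splitting with `rsplit(':', 2)` is a design choice for this sketch: a plain `split(':')` would break entries such as `":)"`, whose word itself contains a colon.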


A few printouts of the dataframe contents:

In [6]:
user_id  = df_BOWs.iloc[2]['user_id'] # user_id
data_user = df_BOWs.iloc[2]['data'] # data for user 'user_id'
data_line = data_user.split(' ') # BOW of 'user_id'
data_entry = data_line[0].split(':') # first word of user_id

print("user_id: %s " % user_id)
print("word: %s " % data_entry[0])
print("x_word: %f " % float(data_entry[1]))
print("freq: %i " % int(data_entry[2]))

user_id: 185699053 
word: "#1" 
x_word: 0.000483 
freq: 1 


Number of rows in the dataframe (header excluded):

In [7]:
TOT_users = int(df_BOWs['user_id'].shape[0])
print("Numbers lines present in BOWs DATASET: %i" %TOT_users)

Numbers lines present in BOWs DATASET: 5189


### FUNCTIONS

In [None]:
# method that transforms a personality model into a dictionary {phrase: correlation}

def transformInDict(personality_modelLines):
    dic_personalityModel = dict()
    # csv.reader keeps a quoted phrase that contains a comma in a single field
    for i, row in enumerate(csv.reader(personality_modelLines)):
        if i >= 1 and len(row) >= 2: # avoid header and incomplete lines
            word_dic = row[0] # key
            correlation_dic = float(row[1]) # value
            dic_personalityModel.update({word_dic : correlation_dic})

    return dic_personalityModel
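
Some model lines contain a quoted phrase that ends with a comma (such as `"the lord,"`), which a plain `split(",")` cuts in two. A minimal sketch of how Python's `csv` module parses such lines is shown below. The lines and correlation values are made up, and the two-column layout (phrase, correlation) is an assumption about the model files.

```python
import csv

# made-up model lines in the assumed layout: phrase, correlation
lines = [
    'phrase,correlation\n',
    '"the lord,",0.05\n',
    'weekend,-0.04\n',
]

model = {}
for i, row in enumerate(csv.reader(lines)):
    if i == 0 or len(row) < 2:  # skip the header and incomplete rows
        continue
    model[row[0]] = float(row[1])

print(model)
```

`csv.reader` accepts any iterable of strings, so it can be applied directly to the list returned by `readlines()`.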

In [None]:
# method that returns the content of "./MODELS/BF/" as a list of dictionaries

def extract_personalityModels(keywords_personality):
    personality_dictList = list()
    for key in keywords_personality:
        path_model = "./MODELS/BF/" + key + ".top100.1to3grams.gender_age_controlled.rmatrix.csv"
        with open(path_model, "r") as personality: # the file is closed automatically
            personality_modelLines = personality.readlines()
        personality_dictList.append(transformInDict(personality_modelLines))

    return personality_dictList

In [None]:
# method that searches for a match between the words of the phrase, from position 'idx_phrase' on, and the words in the BOW

def matchWord_phrase(idx_phrase, phrase_bf, aux, data_line, count_word):
    for i in range(0, count_word): # for each "word":x_word:freq entry in the user's BOW
        data_entry = data_line[i].rsplit(':', 2) # i-th entry of the BOW (a word may itself contain ':')
        word_bow = str(data_entry[0])
        word = re.sub('"', "", word_bow) # word without the surrounding quotes

        if word == phrase_bf[idx_phrase]:
            if idx_phrase == len(phrase_bf) - 1: # complete match between BOW_user and phrase
                aux.append(word)
                return True, aux
            else: # partial match: look for the next word of the phrase
                match, aux = matchWord_phrase(idx_phrase + 1, phrase_bf, aux, data_line, count_word)
                if match:
                    aux.append(word)
                    return True, aux
                else:
                    return False, aux

    return False, aux # no match found

In [None]:
# method that returns the phrase and its correlation, if they exist, from a line of the personality model

def personalityPhrase(line, j):

    if j >= 1 and len(line) > 3: # avoid header and the case , ,
        bf = line.split(',')
        phrase_bf = bf[0].strip('\n') # select phrase in personality model

        if '"' in phrase_bf:
            phrase_bf = phrase_bf.replace('""', '"')
            phrase_bf = phrase_bf[1:-1]

        if len(phrase_bf) == 0:
            phrase_bf = '"'

        # special case: the phrase ends with a comma, so the line has the form "_,", _, _
        idx_corr = 1
        if bf[idx_corr] == '"':
            idx_corr += 1
            phrase_bf = bf[0] + ","
            phrase_bf = phrase_bf[1:]

        correlation = float(bf[idx_corr].strip('\n')) # phrase correlation in personality model
        return phrase_bf, correlation

    return None, 0

In [None]:
# method that computes, for each category, the mean of correlation(phrase, user) over the matched phrases and the support, for 'user_id'

def sumBOWUserCorrCategoryPersonalityWords(personality_dictList, user_id):
    norma_correlationList = list()
    supportList = list()

    data_user = df_BOWs.loc[df_BOWs['user_id'] == user_id, 'data'].iloc[0] # data column for 'user_id'
    data_line = data_user.split(' ') # BOW of 'user_id'
    count_word = len(data_line)

    for k, personality_dict in enumerate(personality_dictList): # for each personality trait
        occuranceModelMatch = 0
        sum_correlation = 0

        for phrase, correlation in personality_dict.items():
            phrase_bf = phrase.split(' ') # words of the phrase (1 to 3 grams)

            # (phrase ∈ category) matches (phrase ∈ vocab(user)) if (∀ word ∈ phrase) word ∈ BOW_user
            match, aux = matchWord_phrase(0, phrase_bf, [], data_line, count_word)

            if match: # for each matched phrase
                occuranceModelMatch += 1
                sum_correlation += float(correlation) # the correlation belongs to the whole phrase
                #print("%s %s %f" % (phrase, aux, float(correlation)))

        if occuranceModelMatch != 0:
            norma_correlationList.insert(k, float(sum_correlation / occuranceModelMatch))
        else: # no correlation found for personality trait k
            norma_correlationList.insert(k, 0)

        supportList.insert(k, (occuranceModelMatch * 100) / count_word)

    return supportList, norma_correlationList

In [None]:
# method that computes the mean of the values in a list

def mean_statistics(values, n):
    mean_list = 0
    for v in values:
        mean_list += v

    return float(mean_list / n)

In [None]:
# method that builds the one-row dataframe with the mean correlation and the support of each personality trait for 'user_id'

def resultsPredictionPersonality(supportList, norma_correlationList, user_id, highlow_traits): 
    ## compute the mean correlation and the mean support over the five traits
    mean_correlation = mean_statistics(norma_correlationList, len(norma_correlationList))
    mean_support = mean_statistics(supportList, len(supportList))

    return pd.DataFrame([[norma_correlationList[0], supportList[0], norma_correlationList[1], supportList[1], norma_correlationList[2], supportList[2], norma_correlationList[3], supportList[3], 
                      norma_correlationList[4], supportList[4], mean_correlation, mean_support, highlow_traits]], 
                      columns = ["A norma-correlation", "A support (%)", "C norma-correlation", "C support (%)", "E norma-correlation", "E support (%)", "N norma-correlation", "N support (%)",
                     "O norma-correlation", "O support (%)", "mean correlation", "mean support (%)", "high/low traits"], index = [user_id])

In [None]:
# method that defines the high/low personality traits for 'user_id' and returns them, if any, as a string

def highLowTraitPersonality(norma_correlationList, user_id):
    A = norma_correlationList[0]
    C = norma_correlationList[1]
    E = norma_correlationList[2]
    N = norma_correlationList[3]
    O = norma_correlationList[4]

    # - low extroversion: [-0.089 to -0.036]
    # - high extroversion: [0.059 to 0.111]
    # - low agreeableness: [-0.123 to -0.034]
    # - high agreeableness: [0.032 to 0.059]
    # - low conscientiousness: [-0.105 to -0.039]
    # - high conscientiousness: [0.035 to 0.069]
    # - low emotional stability: [-0.086 to -0.042]
    # (- high neuroticism: [0.042 to 0.086])
    # - high emotional stability: [0.023 to 0.047]
    # (- low neuroticism: [-0.047 to -0.023])
    # - low openness: [-0.090 to -0.039]
    # - high openness: [0.072 to 0.124]
  
    highlow_traits = ""
    if A >= -0.123 and A <= -0.034:
        highlow_traits += "LA "
    if A >= 0.032 and A <= 0.059:
        highlow_traits += "HA "

    if C >= -0.105 and C <= -0.039:
        highlow_traits += "LC " 
    if C >= 0.035 and C <= 0.069:
        highlow_traits += "HC "

    if E >= -0.089 and E <= -0.036:
        highlow_traits += "LE "
    if E >= 0.059 and E <= 0.111:
        highlow_traits += "HE "

    if O >= -0.090 and O <= -0.039:
        highlow_traits += "LO "
    if O >= 0.072 and O <= 0.124:
        highlow_traits += "HO "

    if N >= -0.047 and N <= -0.023:
        highlow_traits += "LN "
    if N >= 0.042 and N <= 0.086:
        highlow_traits += "HN "
        
    highlow_traits = highlow_traits[:-1] # remove last character (" ")
    return highlow_traits
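
The interval checks of `highLowTraitPersonality` can also be written as a lookup table, which keeps the thresholds in one place. The sketch below uses the same intervals and label order; `THRESHOLDS`, `high_low_traits` and the example scores are hypothetical names and values introduced only for illustration.

```python
# (label, lower bound, upper bound), same intervals as in highLowTraitPersonality
THRESHOLDS = {
    "A": [("LA", -0.123, -0.034), ("HA", 0.032, 0.059)],
    "C": [("LC", -0.105, -0.039), ("HC", 0.035, 0.069)],
    "E": [("LE", -0.089, -0.036), ("HE", 0.059, 0.111)],
    "O": [("LO", -0.090, -0.039), ("HO", 0.072, 0.124)],
    "N": [("LN", -0.047, -0.023), ("HN", 0.042, 0.086)],
}

def high_low_traits(scores):
    """Return the space-separated labels whose interval contains the trait score."""
    labels = []
    for trait, intervals in THRESHOLDS.items():
        for label, low, high in intervals:
            if low <= scores[trait] <= high:
                labels.append(label)
    return " ".join(labels)

# made-up scores for one user
example = {"A": 0.04, "C": 0.0, "E": -0.05, "N": 0.05, "O": 0.2}
print(high_low_traits(example))
```

A score outside both intervals of a trait (such as `C = 0.0` or `O = 0.2` here) produces no label for that trait.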

In [None]:
# main method that predicts the personality traits of each user

def prediction_personality(keywords_personality):
    personality_dictList = extract_personalityModels(keywords_personality)

    rows_BF = list() # one-row dataframes with correlations and supports
    rows_HL = list() # [user_id, high/low traits]

    ## extract user_id and compute the mean correlation between personality models and BOWs DATASET
    for key in range(0, TOT_users): # for each row of BOWs DATASET
        user_id = int(df_BOWs.iloc[key]['user_id'])

        ## mean of correlation(phrase, user) and support for each category/trait
        supportList, norma_correlationList = sumBOWUserCorrCategoryPersonalityWords(personality_dictList, user_id)
        highlow_traits = highLowTraitPersonality(norma_correlationList, user_id)

        ## collect the result for user_id
        rows_BF.append(resultsPredictionPersonality(supportList, norma_correlationList, user_id, highlow_traits))
        rows_HL.append([user_id, highlow_traits])

    df_outBF = pd.concat(rows_BF)
    df_outHL = pd.DataFrame(rows_HL, columns = ["user_id", "high/low traits"]).set_index("user_id")

    outRESULT(df_outBF, df_outHL) # write both dataframes to ./RESULTS/
    return df_outBF, df_outHL

In [None]:
# method that writes the prediction dataframes to 'Prediction BIG FIVE Model.csv' and 'Low High Personality Traits BIG FIVE Model.csv'

def outRESULT(df_outBF, df_outHL):
    path_csv = "./RESULTS/Predictions BIG FIVE Model.csv"
    # path of the high/low traits file: assumed from the file name in the message below
    path_HL = "./RESULTS/Low High Personality Traits BIG FIVE Model.csv"
    df_outBF.to_csv(path_csv)
    df_outHL.to_csv(path_HL)
    print("'Prediction BIG FIVE Model.csv' successfully created")
    print("'Low High Personality Traits BIG FIVE Model.csv' successfully created")

### MAIN 

Big Five Predictions with Low and High traits

In [None]:
keywords_personality = ["A", "C", "E", "N", "O"]
prediction_personality(keywords_personality)
### calculation of efficiency (read files): 1 create_outputCSV + 1 create_outputTraitsText + 5 extract_personalityModels + 1 user_id + |user_id| sumBOWUserCorrCategoryPersonalityWords + |user_id| print_resultsPredictionPersonality + |user_id| print_highlowTraitPersonalit

'Prediction BIG FIVE Model.csv' successfully created
'Low High Personality Traits BIG FIVE Model.csv' successfully created
