# GENDER Model

### AGE AND GENDER LEXICA

This notebook instantiates the **Age and Gender Lexica** model `[87: M. Sap, G. Park, J. C. Eichstaedt, M. L. Kern, D. J. Stillwell, M. Kosinski, L. H. Ungar, H. A. Schwartz. 2014. Developing age and gender predictive lexica over social media.]`.

The model is built on a weighted lexicon: the usage score of a document/user is the sum, over all words in the lexicon, of each word's weight multiplied by its relative frequency in that document/user:

<center>
$usage_{lex} = \sum_{word \in lex} w_{lex}(word) \ast \large{\frac{freq(word, doc)}{freq(\ast, doc)}}$
</center>

> where $w_{lex}(word)$ is the weight of $word$ in the lexicon $lex$; $freq(word, doc)$ is the frequency of $word$ in the document/user; $freq(\ast,doc)$ is the total number of words counted in the document/user.
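
The usage formula above can be sketched in a few lines of Python. This is only a minimal illustration: the helper name `lexicon_usage`, the toy lexicon and its weights are assumptions made up for the example and are not taken from the published lexica.

```python
from collections import Counter

def lexicon_usage(tokens, lexicon):
    """Weighted lexicon usage: sum of w_lex(word) * freq(word, doc) / freq(*, doc)."""
    counts = Counter(tokens)        # freq(word, doc)
    total = sum(counts.values())    # freq(*, doc)
    if total == 0:
        return 0.0                  # empty document: no usage
    # Words missing from the lexicon contribute a weight of 0.
    return sum(lexicon.get(word, 0.0) * n / total for word, n in counts.items())

# Hypothetical lexicon and document, for illustration only.
toy_lexicon = {"love": 0.5, "game": -0.25}
toy_doc = ["love", "love", "game", "today"]
usage = lexicon_usage(toy_doc, toy_lexicon)
```

Here "love" contributes 0.5 * 2/4 and "game" contributes -0.25 * 1/4, while "today" is not in the toy lexicon and adds nothing.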


#### Gender classification
* To classify by gender I use the `emnlp14gender` sub-model of [87], which applies a multivariate linear model to separate the female and male classes:

<center>
$y = (\sum_{f \in features} w_f \ast x_f) + w_0$
</center>

> where $x_f$ is the value of feature $f$; $w_f$ is the coefficient of feature $f$; $w_0$ is the intercept. In classification, $y$ is used as the class separator (e.g., $y \geq 0$ is female, $y < 0$ is male).

* For the case study **female neuroticism for mathematics** `[56: M. Lunardon, T. Cerni, R. I. Rumiati. 2022. Numeracy Gender Gap in STEM Higher Education: The Role of Neuroticism and Math Anxiety]`, I consider:


> 1. a document is a user (represented as a Bag-Of-Words)
> 2. $x_f = x_{word} = \large{\frac{freq(word, user)}{freq(\ast, user)}}$ $\qquad$ with $\qquad$
>    $freq(\ast, user) = \sum_{freq \in Freq} freq$ $\qquad$ and
>    $\quad$ $Freq = \bigcup_{i \in I} freq_i = \bigcup_{\forall user \in User\_id, \forall word \in BOW_{user}} \{freq(word,user) \mid freq(word,user) = freq_i \text{ for some } i \in I \}$

* The dataset used for gender classification is `BAG-OF-WORDS DATASET`.
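
The linear decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `classify_gender`, the weights and the intercept are invented for the example, and the features are taken to be relative word frequencies $x_{word}$ that have already been computed.

```python
def classify_gender(features, weights, intercept):
    """Return (label, y) for y = sum_f w_f * x_f + w_0."""
    # Features missing from the model get w_f = 0 and do not change the sum.
    y = intercept + sum(weights.get(f, 0.0) * x for f, x in features.items())
    label = "Female" if y >= 0 else "Male"
    return label, y

# Hypothetical weights, intercept and features, for illustration only.
toy_weights = {"love": 2.0, "game": -4.0}
toy_intercept = -0.25
toy_features = {"love": 0.5, "game": 0.125, "today": 0.375}  # x_f values
label, y = classify_gender(toy_features, toy_weights, toy_intercept)
```

With these toy numbers the sum of the weighted features is 1.0 - 0.5 = 0.5, so adding the intercept gives a non-negative $y$ and the user falls on the female side of the separator.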


### CONFIGURATION

#### Libraries and packages

In [158]:
import csv
import numpy as np
import pandas as pd
import re
import time

#### Project Directory

In [159]:
%cd "/home/eleonora/Desktop/TESI Code" 
!ls 

/home/eleonora/Desktop/TESI Code
'BAG-OF-WORDS DATASET (copy).ipynb'   MODELS
'BAG-OF-WORDS DATASET.csv'	      OLD
'BAG-OF-WORDS DATASET.ipynb'	      RESULTS
'BIG FIVE Model.ipynb'		     'TWITTER OCCUPATION DATASET'
 DATASETS			     'WORD CLOUDS'
'GENDER Model.ipynb'


### BAG-OF-WORDS DATASET

*BAG-OF-WORDS DATASET* is a .csv file that stores the Bag-Of-Words of each user of the `Twitter Occupation Dataset`, with the following columns:
- column 1: key
- column 2: user_id
- column 3: data



In [160]:
# path of BAG-OF-WORDS DATASET
path_BOWs_DATASET = "./DATASETS/BAG-OF-WORDS DATASET.csv"

From the file `BAG-OF-WORDS DATASET.csv` I build the `df_BOWs` dataframe, indexed by `key`:

In [161]:
df_BOWs  = pd.read_csv(path_BOWs_DATASET)
new_index  = df_BOWs['Unnamed: 0']
df_BOWs = df_BOWs[['user_id', 'data']]
df_BOWs = df_BOWs.set_index(new_index)
df_BOWs = df_BOWs.rename_axis("key")
print(df_BOWs.head())

        user_id                                               data
key                                                               
0     206749819  "#1":0.0003209757663296421:2 "#12":0.000160487...
1     706405455  "#20":3.470173855710171e-05:1 "#25":3.47017385...
2     185699053  "#1":0.0004828585224529213:1 "#2":0.0009657170...
3    1134832688  "#":0.00026518164942985947:4 "#9":6.6295412357...
4      36040783  "#":6.967670011148272e-05:1 "#1":0.00034838350...


> *Example: user 206749819 (belongs to group 0) used in their Twitter posts the word '#1' with a ratio over all words of 0.000320976 and a frequency of 2, the word '#12' with a ratio of 0.000160488 and a frequency of 1, the word '#17' with a ratio of 0.000160488 and a frequency of 2, and so on.*



A few printouts of the dataframe contents:

In [162]:
user_id = df_BOWs.iloc[2, 0] # user_id of the third row
data_user = df_BOWs.iloc[2, 1] # data string for user 'user_id'
data_line = data_user.split(' ') # BOW entries of 'user_id'
data_entry = data_line[0].split(':') # first entry of user_id: [word, x_word, freq]

print("user_id: %s " % user_id)
print("word: %s " % data_entry[0])
print("x_word: %f " % float(data_entry[1]))
print("freq: %i " % int(data_entry[2]))

user_id: 185699053 
word: "#1" 
x_word: 0.000483 
freq: 1 
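
The printout above splits a single entry by hand. A small helper that parses a whole `data` string could look like the sketch below. The function name `parse_bow` and the sample string are assumptions made for the example; the sketch also assumes that words never contain spaces, and it splits from the right so that a colon inside a word would not break the parsing.

```python
def parse_bow(data):
    """Parse space-separated '"word":ratio:freq' entries into {word: (ratio, freq)}."""
    bow = {}
    for entry in data.split(" "):
        if not entry:
            continue  # skip empty pieces produced by repeated spaces
        # Split from the right: the last two fields are always ratio and freq.
        word, ratio, freq = entry.rsplit(":", 2)
        bow[word.strip('"')] = (float(ratio), int(freq))
    return bow

# Made-up sample in the same layout as the 'data' column.
sample = '"#1":0.25:1 ":)":0.75:3'
parsed = parse_bow(sample)
```

Note that `strip('"')` only removes the enclosing quotes, whereas the notebook code removes every double quote in the word; the two agree whenever the word itself contains no quote character.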


Number of lines present in the dataframe, except the header:

In [163]:
TOT_users = int(df_BOWs['user_id'].shape[0])
print("Number of rows in BOWs DATASET: %i" % TOT_users)

Number of rows in BOWs DATASET: 5189


### FUNCTIONS

In [164]:
# Convert the lines of the gender model file into a dictionary {word: weight}

def transformInDict(gender_modelLines):
    dic_genderModel = dict()
    for i, line in enumerate(gender_modelLines):
        if i >= 1: # skip the header
            dic = line.split(",")
            word_dic = dic[0].strip('"') # key
            weight_dic = (dic[1].strip('\n')).strip('"') # value (kept as a string)
            dic_genderModel[word_dic] = weight_dic

    return dic_genderModel

In [165]:
# Return the content of ageGenderLexica (gender) as a dictionary

def extract_genderModel():
    # the context manager closes the file even if reading fails
    with open("./MODELS/gender/emnlp2014_ageGenderLexica/emnlp14gender.csv", "r") as gender_model:
        gender_modelLines = gender_model.readlines()

    return transformInDict(gender_modelLines)
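
As an alternative sketch, the same dictionary could be built with the standard `csv` module, which handles quoted fields (for example a term that is itself a comma) and converts the weights to floats once. The layout assumed here, a header row followed by two columns (term, weight), is inferred from the parsing code above; the function name `load_lexicon`, the column names and the toy rows are hypothetical.

```python
import csv
import io

def load_lexicon(handle):
    """Read a two-column CSV (term, weight) with a header row into {term: float}."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header row
    # Ignore rows that do not have both a term and a weight.
    return {row[0]: float(row[1]) for row in reader if len(row) >= 2}

# In-memory stand-in for the lexicon file; terms and weights are made up.
toy_csv = io.StringIO('term,weight\n_intercept,-0.5\n"love",1.25\n",",0.75\n')
toy_lexicon = load_lexicon(toy_csv)
```

With a real file, the handle would come from `open(path, newline="")` inside a `with` block, as the `csv` documentation recommends.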

In [166]:
# Extract the intercept from `emnlp14gender.csv` once transformed into a dictionary

def extract_intercept(dic_genderModel):
    return float(dic_genderModel['_intercept'])

In [167]:
# Find w_f for the word 'word'

def extract_weightWord(word, dic_genderModel):
    weight = 0 # if the word is not in the gender model then w_f = 0 (the partial sum of the dot product is unchanged)
    match = False

    if word in dic_genderModel:
        match = True
        weight = float(dic_genderModel[word])

    return weight, match

In [168]:
# Compute \sum w_f * x_f over the words of the user at row 'key'

def dotProductForUser(dic_genderModel, key):
    start_dot = time.perf_counter()

    y_par = 0 # partial dot product
    data_user = df_BOWs.iloc[key, 1] # data column for the user at row 'key'
    data_line = data_user.split(' ') # BOW entries of the user
    count_word = len(data_line)
    count_wordMatch = 0

    ## compute \sum w_f * x_f over the words
    for entry in data_line: # each entry is word:x_word:freq_word
        data_entry = entry.split(':')
        word = re.sub('"', "", str(data_entry[0])) # drop the quotes around the word

        x_word = float(data_entry[1]) # x_f

        weight, match = extract_weightWord(word, dic_genderModel) # w_f for the word

        if match:
            count_wordMatch += 1
        y_par += weight * x_word # w_f * x_f

    end_dot = time.perf_counter()
    print("dot product time: ", end_dot - start_dot)

    # partial sum and percentage of the user's words found in the gender model
    return y_par, ((count_wordMatch * 100) / count_word)

In [169]:
# Main method for gender classification

def prediction_gender():

    ## load the `emnlp14gender.csv` model
    dic_genderModel = extract_genderModel()
    ## intercept (w_0) used by the classifier
    intercept = extract_intercept(dic_genderModel)

    rows = [] # one [user_id, gender, y, support] row per user

    ## for each user, compute the dot product (w_f * x_f) between emnlp14gender.csv and the user's BOW
    for key in range(0, TOT_users):
        user_id = int(df_BOWs.loc[key, 'user_id'])

        start = time.perf_counter()

        y_par, ratio_match = dotProductForUser(dic_genderModel, key) # \sum w_f * x_f over the words
        y = y_par + intercept # decision value

        ## y >= 0 is classified as female, otherwise as male
        if y >= 0:
            gender = 'Female'
        else:
            gender = 'Male'

        # appending to a list avoids re-copying the dataframe at every iteration
        rows.append([user_id, gender, y, ratio_match])

        end = time.perf_counter()

        print(user_id)
        print("The time classification is: ", end - start)

    return pd.DataFrame(rows, columns = ['user_id', 'gender', 'y', 'support (%)'])

In [170]:
# Write the gender classification dataframe to 'Classification Gender Model.csv'

def outRESULT(path_resultcsv, df_outGender):
    df_outGender.to_csv(path_resultcsv) 
    print("'Classification Gender Model.csv' successfully created")

### MAIN

Gender classification

In [171]:
## create Classification Gender Model.csv (output)
start = time.perf_counter()
path_resultcsv = "./RESULTS/Classification Gender Model.csv"
df_outGender = prediction_gender()

startcsv = time.perf_counter()
outRESULT(path_resultcsv, df_outGender)  
endcsv = time.perf_counter()
end = time.perf_counter()
print("CSV time: ", endcsv - startcsv)
print("Execution time: ", end - start) 

### Notation: |gender_model| is the number of words in the gender model; |user_id| is the number of users;
### N_words is the size of each user's BOW; opening the file is O(1)
### Cost: 1 + |gender_model| to read the gender model and convert it into a dictionary
### + |user_id| x (N_words + 1 for the dot product + 1 for the y computation + 1 for the dataframe) + 1 for outRESULT
### = (1 + |gender_model|) + |user_id| x (N_words + 1) = O(|gender_model| + |user_id| x N_words)

dot product time:  0.015599610982462764
206749819
The time classification is:  0.018478718993719667
dot product time:  0.034040439990349114
706405455
The time classification is:  0.0364887360483408
dot product time:  0.007683642965275794
185699053
The time classification is:  0.009481279004830867
dot product time:  0.027887228992767632
1134832688
The time classification is:  0.02970442094374448
dot product time:  0.01888689398765564
36040783
The time classification is:  0.020834108989220113
dot product time:  0.019075629010330886
18741793
The time classification is:  0.02067556994734332
dot product time:  0.02687697799410671
72824493
The time classification is:  0.028710677986964583
dot product time:  0.0220629179966636
49522996
The time classification is:  0.02372117101913318
dot product time:  0.01151803694665432
68400904
The time classification is:  0.01427217002492398
dot product time:  0.05094072996871546
27211922
The time classification is:  0.05340268096188083
dot product time: 