# Appendix 6: Gender Inference - Rule-Based Approach

### Part 1: Username

Using the human-classified sample of 1000 random rows from our dataset of tweets, this program crosschecks all substrings of the User field against a list of known male and female first names, and makes a gender classification where it finds a match.

#### Names dataset & adjustments made to it
We took a dataset (https://github.com/organisciak/names/blob/master/data/us-likelihood-of-gender-by-name-in-2014.csv) of the top 1000 female names and top 1000 male names in the US in 2014. For any name that featured in both the male and female lists, we cross-checked its gender against the probability list and assigned it to the gender with the higher probability for that name. In addition, based on some early experiments, we added some words that are not names but are clear signifiers of gender commonly used in usernames, eg "girl", "chick", "momma".
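
The deduplication rule described above can be sketched as follows. This is a minimal illustration, not the code used to build `top2000names2014final.xlsx`: the function name `resolve_gender` is hypothetical, and the probabilities in `sample_rows` are made-up values chosen only to show the rule.

```python
# Hypothetical sketch: keep one gender per name, choosing the more probable one.
# The sample rows are illustrative values, NOT figures from the names dataset.

def resolve_gender(rows):
    """rows: iterable of (name, gender, probability) tuples."""
    best = {}
    for name, gender, prob in rows:
        key = name.lower()
        # Keep the entry with the higher probability for each name
        if key not in best or prob > best[key][1]:
            best[key] = (gender, prob)
    return {name: gender for name, (gender, _) in best.items()}

sample_rows = [
    ("Riley", "F", 0.58),  # illustrative probability
    ("Riley", "M", 0.42),  # illustrative probability
    ("Emma", "F", 1.00),   # illustrative probability
]

# Non-name gender signifiers are appended with a fixed gender
extra_signifiers = [("girl", "F", 1.0), ("chick", "F", 1.0), ("momma", "F", 1.0)]

lookup = resolve_gender(sample_rows + extra_signifiers)
print(lookup["riley"])  # F
```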

#### Other details
We matched names in case insensitive terms, as many usernames do not use standard name capitalisation.

Further, we removed names of 3 characters or fewer, as these led to a high proportion of inaccurate matches (eg username AzanianSea contained a match for the female name "Nia" and username XMASTIMEblog contained a match for the male name "Tim", when in reality neither username should be associated with a particular gender).
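
To make the effect of the length cut-off concrete, here is a small sketch using the two usernames mentioned above. The helper `names_in` and its `min_length` parameter are hypothetical names introduced for illustration; the short name list is a hand-picked subset, not the full dataset.

```python
def names_in(username, names, min_length=4):
    """Return the names (of at least min_length characters) found inside the username."""
    lowered = username.lower()
    return [n for n in names if len(n) >= min_length and n.lower() in lowered]

# Hand-picked subset of names for illustration
names = ["Nia", "Tim", "Erica", "Eric"]

# Without a cut-off, short names match by accident
print(names_in("AzanianSea", names, min_length=1))    # ['Nia']
print(names_in("XMASTIMEblog", names, min_length=1))  # ['Tim']

# With names of 3 characters or fewer excluded, neither username matches
print(names_in("AzanianSea", names))    # []
print(names_in("XMASTIMEblog", names))  # []
```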

In [1]:
import pandas as pd
import numpy as np

# Import the human-classified 1000-row sample
tweets_sample1000_withgender = pd.read_excel('tweets_sample1000.xlsx')
tweets_sample1000_withgender

Unnamed: 0,index,description,user,gender_username_human,gender_description_human,gender_human_final
193,123,I didn't mean to call you an angry mob. Mama a...,LM_Shepard,U,F,F
270,377,"Economist, opinion columnist, libertarian, Geo...",DorfmanJeffrey,M,U,M
104,1664,#mediadopedealer A Willy Wonka creation 👀 and ...,Ashlee_Ray,F,U,F
366,2087,Author of the book Meridian Hill Park. License...,feejaysee,M,U,M
906,2202,"I bleed Red, White and Blue. God bless Texas. ...",lapadooza,U,U,U
397,2260,,DanKronstadt,M,U,M
482,2310,"Wife, Mom, Teacher. Just trying to make sense ...",JLaufe,U,F,F
332,2622,bon vivant/raconteur/troubadour \nopinionated ...,MichaelSalamone,M,U,M
20,3002,Just an introvert forever trying to break out ...,MissElsa86,F,F,F
359,3346,I enjoy being around my Twitter family on here...,tom_lewisville,M,U,M


In [2]:
# Import dataset of names and their genders
namesfinal = pd.read_excel("top2000names2014final.xlsx")

# Sort names dataframe so longest names come first (so that Erica will then match against Erica and not Eric)
namesfinal["namelength"] = namesfinal["name"].str.len()
namesfinal = namesfinal.sort_values(by=['namelength'], ascending=False)
namesfinal = namesfinal.reset_index(drop=True)
print(namesfinal.shape)

# Remove any names 3 characters or less
namesfinal.drop(namesfinal[namesfinal.namelength < 4].index, inplace=True)
print(namesfinal.shape)
namesfinal

(2013, 4)
(1944, 4)


Unnamed: 0,name,gender,prob,namelength
0,Maximiliano,M,1.000000,11
1,Christopher,M,0.996485,11
2,Bernadette,F,1.000000,10
3,Cristopher,M,1.000000,10
4,Alessandra,F,1.000000,10
5,Antoinette,F,1.000000,10
6,Alexandria,F,0.999682,10
7,Maximilian,M,1.000000,10
8,Marguerite,F,1.000000,10
9,Kristopher,M,0.999799,10


In [4]:
# Create a column with username converted to lowercase, on both datasets, to enable case insensitive matching
tweets_sample1000_withgender["user_lower"] = tweets_sample1000_withgender["user"].str.lower()
namesfinal["name_lower"] = namesfinal["name"].str.lower()

# Create 2 columns on our dataset of tweets for the algorithm to make its gender classifications in
tweets_sample1000_withgender["gender_username_algo"] = ""
tweets_sample1000_withgender["namematched_algo"] = ""

In [5]:
# External loop iterates through our dataset of tweets
for i in range(len(tweets_sample1000_withgender)):
    # Internal loop iterates through the list of names, longest first
    for j in range(len(namesfinal)):
        # If a name is found within the user field, the corresponding gender is assigned and (for evaluation purposes) the name matched on is recorded
        if namesfinal.at[j, "name_lower"] in tweets_sample1000_withgender.at[i, "user_lower"]:
            tweets_sample1000_withgender.at[i, "gender_username_algo"] = namesfinal.at[j, "gender"]
            tweets_sample1000_withgender.at[i, "namematched_algo"] = namesfinal.at[j, "name"]
            # Stop at the first match so that shorter names do not override longer names
            break
    # Just to get a sense of progress while processing
    if i % 300 == 0:
        print(i)

0
300
600
900


#### Evaluation of the algorithm's username-based classifications

We analysed the algorithm's username based classifications in terms of quantity and accuracy against the human's:

In [6]:
# Method which loops through the data to compare the algorithm's gender classifications to the human's
def label_gendermatch (row):
    if row["gender_username_algo"] == row["gender_username_human"]:
        return 'OK'
    elif (row["gender_username_algo"] == "") & (row["gender_username_human"]=='U'):
        return 'OK'
    else:
        return 'notOK'

# Populate the gendermatch column with 'OK' if the human and algorithm classifications are the same, else 'notOK' if they are different
tweets_sample1000_withgender['gendermatch']=tweets_sample1000_withgender.apply (lambda row: label_gendermatch (row), axis=1)

# Group the data to aggregate the algorithm's classifications based on count
print (tweets_sample1000_withgender.groupby('gender_username_algo').agg({'user': 'size'})
      .rename(columns={'user': 'Count of classifications'}), '\n')

# Group the data to aggregate the gendermatch column, based on count
print (tweets_sample1000_withgender.groupby("gendermatch").agg({'user': 'size'})
      .rename(columns={'user': 'Count of classifications in agreement with human'}), '\n')

# Pivot the data to aggregate the classifications where the gendermatch was not OK
print("Analysis of 193 notOK rows")
pivot2=pd.pivot_table(tweets_sample1000_withgender[tweets_sample1000_withgender['gendermatch']!='OK'], index=["gender_username_algo"],values=["user"], aggfunc=[len], margins=True)
pivot2.columns=['count']
pivot2.index.names = ['algo gender']
pivot2["%age"] = round(pivot2["count"]/193*100, 1)
print(pivot2, '\n')

                      Count of classifications
gender_username_algo                          
                                           562
F                                          179
M                                          259 

             Count of classifications in agreement with human
gendermatch                                                  
OK                                                        807
notOK                                                     193 

Analysis of 193 notOK rows
             count   %age
algo gender              
               109   56.5
F               29   15.0
M               55   28.5
All            193  100.0 



The first pivot table shows us that the algorithm assigns a gender to (179+259=438) 44% of the 1000 usernames.

The second pivot shows that 81% of its decisions (to assign M, F or nothing) are 'OK' (the same as the human) and 19% not the same ('notOK'). 

The third pivot shows that of the 193 'notOK' rows, 56.5% are instances where the algorithm made no assignment but the human did. Upon inspection of the data, these were often cases of excluded 3-letter names, or of words that we prevented the algorithm from matching against in order to increase the accuracy of the assignments it did make. Abbreviated and unknown names were also a factor here.
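
The percentages quoted for the three pivots follow directly from the counts in the output above. As a quick arithmetic check (the variable names are ours, the counts are copied from the tables):

```python
# Counts copied from the pivot output above
total = 1000
assigned = 179 + 259      # F + M assignments made by the algorithm
ok = 807
not_ok = 193
no_assignment = 109       # notOK rows where the algorithm assigned nothing
conflicting = 29 + 55     # notOK rows where the algorithm assigned F or M

print(round(assigned / total * 100, 1))        # 43.8
print(round(ok / total * 100, 1))              # 80.7
print(round(no_assignment / not_ok * 100, 1))  # 56.5
print(round(conflicting / not_ok * 100, 1))    # 43.5
```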

The other 84 instances (43.5%), where the algorithm made a prediction of M or F that conflicted with the human's classification of M/F/U, were of importance to us. We examined them further by exporting the data, evaluating both the reason for the algorithm's error and the severity of its error, and re-importing the evaluated data:

In [8]:
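# Export the dataset for evaluation
# (the context manager saves and closes the file; ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('tweets_sample1000_withgender_username.xlsx') as writer:
    tweets_sample1000_withgender.to_excel(writer, sheet_name='Sheet1')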

In [9]:
# Reimport the evaluated dataset
probtweets = pd.read_excel("84probtweets.xlsx")

pivot4=pd.pivot_table(probtweets, index=["Problem"],values=["index"], aggfunc=[len])
pivot4.columns=['count']
pivot4 = pivot4.reset_index().sort_values(['count'], ascending=False).set_index(['count'])
print(pivot4, '\n')

pivot5=pd.pivot_table(probtweets, index=["Severity"],values=["index"], aggfunc=[len], margins=True)
pivot5.columns=['count']
print(pivot5, '\n')

                        Problem
count                          
46                   wordinword
19                      surname
14               otheruseofname
5      unknownname & wordinword 

                  count
Severity               
gender incorrect     15
gender unknown       69
All                  84 



The first pivot shows us that the most common reason for an inaccurate gender assignment was "word in word", ie a name from the list matching against the username but, to a human eye, only as part of a longer word. For example, username "cgtnamerica" was a match for female name Erica and so got assigned as female by the algorithm. Other issues were surnames being matched as first names (eg username "meg_andrews" matching for male name Andrew), other uses of names than self-identification confusing the algorithm (eg username "MyManjimmyjack" matching for male name "Jimmy") and names unknown to the algorithm leading to incorrect matches of shorter words (eg username "miss_scarlet" was a match for male name Carl, as Scarlet wasn't on the names list).
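
The error categories above can be reproduced with a compact version of the matcher. This is a sketch under stated assumptions: `match_name` is a hypothetical helper that mirrors the longest-first substring rule, and the name list is a small hand-picked subset rather than the full dataset.

```python
def match_name(username, names, min_length=4):
    """Return the longest listed name found inside the username, or None."""
    lowered = username.lower()
    # Longest names first, so "Erica" is tried before "Eric"
    for name in sorted(names, key=len, reverse=True):
        if len(name) >= min_length and name.lower() in lowered:
            return name
    return None

# Hand-picked subset of the names list (illustration only)
names = ["Erica", "Andrew", "Jimmy", "Carl", "Eric"]

print(match_name("cgtnamerica", names))     # Erica  (word in word)
print(match_name("meg_andrews", names))     # Andrew (surname)
print(match_name("MyManjimmyjack", names))  # Jimmy  (other use of name)
print(match_name("miss_scarlet", names))    # Carl   (unknown name + word in word)
```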

The second pivot shows an assessment of the severity of the inaccurate matches. Fifteen are cases where it is clear that the algorithm has matched the wrong gender (eg F where a human can see the gender is clearly M). Sixty-nine are cases where the algorithm has matched a gender of F or M where the human can make no reasonable gender inference one way or the other from the username.

### Part 2: Description

Using the same human-classified sample of 1000 random rows from our dataset of tweets, this program crosschecks tokenised words from the description field against a list of keywords known to be reflective of gender, and makes a gender classification where it finds a match.

#### Keyword List
When inferring gender from the description field (which is essentially the user's self-written profile), a number of oft-repeated keywords allowed the human to classify the gender of the writer (eg "wife", "mom", "father"). We made a list of 35 such keywords that the human had used when identifying gender from the description.
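
As a rough illustration of keyword matching on whole tokens, the sketch below uses `re.findall(r"\w+", ...)` from the standard library as a stand-in for the NLTK tokenizer, on the assumption that the two split text the same way for this pattern. The function names and the four-word keyword subset are hypothetical.

```python
import re

def tokenize(description):
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", description.lower())

# Small illustrative subset of the keyword list, longest first
keywords = [("businesswoman", "F"), ("father", "M"), ("wife", "F"), ("mom", "F")]

def classify_description(description, keywords):
    """Return the gender of the first keyword that appears as a whole token."""
    tokens = tokenize(description)
    for word, gender in keywords:
        if word in tokens:
            return gender
    return ""

print(classify_description("Wife, Mom, Teacher. Just trying to make sense", keywords))  # F
# Matching on tokens avoids word-in-word hits: "momentum" is not "mom"
print(repr(classify_description("Momentum investor", keywords)))  # ''
```

Matching against tokens rather than raw substrings is what keeps the description program from repeating the "word in word" errors seen with usernames.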

In [10]:
# Import keyword list
descr_keywords = pd.read_excel("descr_keywords.xlsx")
descr_keywords

Unnamed: 0,word,gender
0,businesswoman,F
1,businessman,M
2,congressman,M
3,sisterhood,F
4,fisherman,M
5,gentleman,M
6,godmother,F
7,daughter,F
8,feminism,F
9,feminist,F


In [11]:
# Add column to main dataset for new gender prediction based on description
tweets_sample1000_withgender["gender_description_algo"] = ""

# Make description field lowercase
tweets_sample1000_withgender["description"] = tweets_sample1000_withgender["description"].str.lower()

# Import NLTK to tokenize the description field
from nltk.tokenize import RegexpTokenizer

tweets_sample1000_withgender["description_tokenized"] = ""
tokenizer = RegexpTokenizer(r'\w+')

# Run the tokenizing algorithm (missing descriptions are left as an empty string)
for i in range(len(tweets_sample1000_withgender)):
    if pd.isnull(tweets_sample1000_withgender.at[i, 'description']):
        tweets_sample1000_withgender.at[i, 'description_tokenized'] = ""
    else:
        tweets_sample1000_withgender.at[i, 'description_tokenized'] = tokenizer.tokenize(tweets_sample1000_withgender.at[i, 'description'])

# Rename the keyword list columns
descr_keywords.columns = ['word', 'gender']

In [12]:
# Run the gender-assigning algorithm
# External loop iterates through the dataset of tweets
for i in range(len(tweets_sample1000_withgender)):
    # Internal loop iterates through the list of keywords
    for j in range(len(descr_keywords)):
        # Where a keyword matches a tokenised word in the description, the corresponding gender is assigned.
        # The first match wins, so lengthier keywords are favoured when the list is ordered longest first.
        if descr_keywords.at[j, 'word'] in tweets_sample1000_withgender.at[i, 'description_tokenized']:
            tweets_sample1000_withgender.at[i, 'gender_description_algo'] = descr_keywords.at[j, 'gender']
            break

#### Evaluation of the algorithm's description-based classifications

We analysed the algorithm's description-based classifications in terms of quantity and accuracy against the human's:

In [13]:
def label_gendermatchdescr (row):
    if row["gender_description_algo"] == row["gender_description_human"]:
        return 'OK'
    elif (row["gender_description_algo"] == "") & (row["gender_description_human"]=='U'):
        return 'OK'
    else:
        return 'notOK'

tweets_sample1000_withgender['gendermatchdescr']=tweets_sample1000_withgender.apply (lambda row: label_gendermatchdescr (row), axis=1)

pivot7=pd.pivot_table(tweets_sample1000_withgender, index=["gender_description_algo"],values=["user"], aggfunc=[len], margins=True)
pivot7.columns=['count']
print(pivot7, '\n')

pivot8=pd.pivot_table(tweets_sample1000_withgender, index=["gendermatchdescr"],values=["user"], aggfunc=[len], margins=True)
pivot8.columns=['count']
print(pivot8, '\n')

pivot9=pd.pivot_table(tweets_sample1000_withgender[tweets_sample1000_withgender['gendermatchdescr']!='OK'], index=["gender_description_algo"],values=["user"], aggfunc=[len])
pivot9.columns=['count']
pivot9.index.names = ['algo gender']
pivot9["%age"] = round(pivot9["count"]/27*100, 1)
print(pivot9, '\n')

                         count
gender_description_algo       
                           821
F                          107
M                           72
All                       1000 

                  count
gendermatchdescr       
OK                  973
notOK                27
All                1000 

             count  %age
algo gender             
                 8  29.6
F               10  37.0
M                9  33.3 



The first pivot shows us that the algorithm makes (107F + 72M) 179 gender assignments. 

The second pivot shows us that the algorithm's gender assigning decisions are 97% OK (the same as the human) and 3% not OK (different to the human).

The third pivot shows us that of the 27 'notOK' decisions the algorithm made, 30% were where it assigned no gender when the human did, and 70% were where it assigned an M/F gender that was in conflict with the human's gender assignment based on description.

We looked at the 27 'notOK' decisions by exporting the data, marking it up with reasons for the conflict, and re-importing:

In [14]:
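# Export data
# (the context manager saves and closes the file; ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('tweets_sample1000_withgender_description.xlsx') as writer:
    tweets_sample1000_withgender.to_excel(writer, sheet_name='Sheet1')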

# Import evaluated data, now marked with reasons for the misclassification
descrprobtweets = pd.read_excel("27probtweets.xlsx")

# Pivot the data to aggregate the reasons for misclassification
pivot10=pd.pivot_table(descrprobtweets, index=["reason"],values=["description_tokenized"], aggfunc=[len])
pivot10.columns=['count']
pivot10 = pivot10.reset_index().sort_values(['count'], ascending=False).set_index(['count'])
print(pivot10, '\n')

              reason
count               
19          otheruse
2        his/her/him
2          otherword
2      pluralisation
2          runonword 



The majority of these incorrect matches are due to 'otheruse': gender-significant words being used in the description in such a way that the human, but not the algorithm, recognised they were not a marker of the user's own gender, eg "I love my mom" as opposed to "I'm a mom". Other issues that caused problems were "his/her/him" (a description such as "just an introvert forever trying to break out of her shell" allowed the human to identify that the writer was female, but not the algorithm, as these gender pronouns were not in its list of keywords), "otherword" (a keyword unknown to the algorithm, eg "damsel", "gentlemen"), "pluralisation" (pluralised keywords in the description, eg "sisters", "aunts", were not matched) and "runonword" (a side effect of tokenising the description is that when words run into one another, eg "latinafeminist", "bigmomma", they do not return a match for the gender-significant words).
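
Several of these failure modes can be reproduced with a few lines of token matching. The sketch below is illustrative only: `classify` is a hypothetical helper, `re.findall` stands in for the NLTK tokenizer, the keyword dictionary is an assumed subset of the real list, and the "pluralisation" sentence is an invented example.

```python
import re

# Assumed subset of the keyword list (illustration only)
KEYWORDS = {"feminist": "F", "sister": "F", "momma": "F", "mom": "F"}

def classify(description):
    """Return the gender of the first keyword present as a whole token, else ''."""
    tokens = re.findall(r"\w+", description.lower())
    for word, gender in KEYWORDS.items():
        if word in tokens:
            return gender
    return ""

examples = {
    "otheruse": "I love my mom",
    "his/her/him": "Just an introvert forever trying to break out of her shell",
    "pluralisation": "One of three sisters",  # invented example text
    "runonword": "latinafeminist bigmomma",
}

for reason, text in examples.items():
    print(reason, repr(classify(text)))
# otheruse 'F'        -> assigned, although "mom" does not describe the writer
# his/her/him ''      -> pronouns are not in the keyword list
# pluralisation ''    -> "sisters" is not the token "sister"
# runonword ''        -> run-on words never equal a keyword
```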

### Part 3: Combining Username & Description

The final program combines the username and description programs to assign a gender to as many rows as possible. In the 1.5% of cases where the two subprograms produced conflicting M/F classifications, the description program's classification takes precedence, as it was found to be more accurate.

In [15]:
# Initial aggregated comparison of the classifications made by the username and description programs
print(pd.pivot_table(tweets_sample1000_withgender, index=["gender_username_algo", "gender_description_algo"],values=["user"], aggfunc=[len]))

                                              len
                                             user
gender_username_algo gender_description_algo     
                                              466
                     F                         56
                     M                         40
F                                             133
                     F                         41
                     M                          5
M                                             222
                     F                         10
                     M                         27


The pivot above shows the 5 cases where the username-based program classified F and the description-based program classified M, plus the 10 cases where the username-based program classified M and the description-based program classified F, for a total of 15 conflicting classifications.
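
The precedence rule can be expressed as two small functions. This is a minimal sketch with hypothetical names (`is_conflict`, `combine`); an empty string stands for "no classification", as in the dataframe columns.

```python
def is_conflict(username_gender, description_gender):
    """True when both programs made a classification and they disagree."""
    return bool(username_gender) and bool(description_gender) \
        and username_gender != description_gender

def combine(username_gender, description_gender):
    """Description takes precedence; fall back to username when it is empty."""
    return description_gender or username_gender

pairs = [("F", "M"), ("M", "F"), ("M", ""), ("", "F"), ("", "")]
for user_g, descr_g in pairs:
    print(repr(user_g), repr(descr_g),
          "conflict" if is_conflict(user_g, descr_g) else "no conflict",
          "->", repr(combine(user_g, descr_g)))
```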

In [16]:
# Create a new column for the final gender classification based on the two programs
tweets_sample1000_withgender["gender_algo_final"] = ""

for i in range(len(tweets_sample1000_withgender)):
    # Where the description program has made a classification, put that in the final classification column
    tweets_sample1000_withgender.at[i, "gender_algo_final"] = tweets_sample1000_withgender.at[i, "gender_description_algo"] 
    # If the description program didn't make a classification, use the username program's classification
    if tweets_sample1000_withgender.at[i, "gender_algo_final"]=="":
        tweets_sample1000_withgender.at[i, "gender_algo_final"]=tweets_sample1000_withgender.at[i, "gender_username_algo"]

#### Evaluation of the final classifications made

We analyse the final classifications as follows:

In [17]:
# Pivot to view the number of classifications made
print(pd.pivot_table(tweets_sample1000_withgender, index=["gender_algo_final"],values=["user"], aggfunc=[len]))

# Method to compare the human's final gender classification against the algorithm's final classification
def label_gendermatch_overall (row):
    if row["gender_algo_final"] == row["gender_human_final"]:
        return 'OK'
    elif (row["gender_algo_final"] == "") & (row["gender_human_final"]=='U'):
        return 'OK'
    else:
        return 'notOK'

# Apply the above method to each row to mark the classifications as correct or not
tweets_sample1000_withgender["gendermatchfinal"] = ""    
tweets_sample1000_withgender['gendermatchfinal']=tweets_sample1000_withgender.apply (lambda row: label_gendermatch_overall (row), axis=1)

# Pivot to view the accuracy of classifications made
print(pd.pivot_table(tweets_sample1000_withgender, index=["gendermatchfinal"],values=["user"], aggfunc=[len]))

                   len
                  user
gender_algo_final     
                   466
F                  240
M                  294
                  len
                 user
gendermatchfinal     
OK                825
notOK             175


The first pivot table shows us that the algorithm can assign a gender to 53.4% of the dataset (240F + 294M = 534) using a combination of description and username.

The second table shows us the final gender assignment produced by the algorithm (the combination of username and description) compared to the human's final gender assignment. We have an overall accuracy rate of 82.5%, which is sufficient for our purposes.
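
As a consistency check, the final counts can be recomputed from the username/description cross-tabulation shown earlier by applying the description-first rule to each cell. The dictionary below copies the counts from that pivot; the variable names are ours.

```python
# (username classification, description classification) -> number of rows,
# copied from the cross-tabulation of the two programs
crosstab = {
    ("", ""): 466, ("", "F"): 56, ("", "M"): 40,
    ("F", ""): 133, ("F", "F"): 41, ("F", "M"): 5,
    ("M", ""): 222, ("M", "F"): 10, ("M", "M"): 27,
}

final_counts = {}
for (user_g, descr_g), n in crosstab.items():
    final = descr_g or user_g  # description takes precedence
    final_counts[final] = final_counts.get(final, 0) + n

total = sum(crosstab.values())
assigned = final_counts["F"] + final_counts["M"]

print(final_counts)            # {'': 466, 'F': 240, 'M': 294}
print(assigned * 100 / total)  # 53.4
```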


The final rule-based program, run over the dataset in Appendix 7, is as follows:

In [None]:
#PREPARATION OF NAME DATA

# Import dataset of names and their genders
namesfinal = pd.read_excel("top2000names2014final.xlsx")

# Sort names dataframe so longest names come first (so that Erica will then match against Erica and not Eric)
namesfinal["namelength"] = namesfinal["name"].str.len()
namesfinal = namesfinal.sort_values(by=['namelength'], ascending=False)
namesfinal = namesfinal.reset_index(drop=True)
print(namesfinal.shape)

# Remove any names 3 characters or less
namesfinal.drop(namesfinal[namesfinal.namelength < 4].index, inplace=True)
print(namesfinal.shape)

# Create a column with name converted to lowercase
namesfinal["name_lower"] = namesfinal["name"].str.lower()

In [None]:
#ASSIGNMENT OF GENDER BASED ON USERNAME

# Create a column with username converted to lowercase
OURDATASET["user_lower"] = OURDATASET["user"].str.lower()

# Create 2 columns for the algorithm to make its gender prediction in
OURDATASET["gender_username_algo"] = ""
OURDATASET["namematched_algo"] = ""

# Run the name matching algorithm
for i in range(len(OURDATASET)):
    # Names are sorted longest first, so the first match is the longest one
    for j in range(len(namesfinal)):
        if namesfinal.at[j, "name_lower"] in OURDATASET.at[i, "user_lower"]:
            OURDATASET.at[i, "gender_username_algo"] = namesfinal.at[j, "gender"]
            OURDATASET.at[i, "namematched_algo"] = namesfinal.at[j, "name"]
            # Stop at the first match so that shorter names do not override longer names
            break
    # Just to get a sense of progress while processing
    if i % 300 == 0:
        print(i)

In [None]:
# ASSIGNMENT OF GENDER BASED ON DESCRIPTION

descr_keywords = pd.read_excel("descr_keywords.xlsx")

# Add column for new gender prediction based on description
OURDATASET["gender_description_algo"] = ""

# Make description field lowercase
OURDATASET["description"] = OURDATASET["description"].str.lower()

# Tokenize the description field using natural language toolkit
from nltk.tokenize import RegexpTokenizer

OURDATASET["description_tokenized"] = ""
tokenizer = RegexpTokenizer(r'\w+')

# Run the tokenizing algorithm (missing descriptions are left as an empty string)
for i in range(len(OURDATASET)):
    if pd.isnull(OURDATASET.at[i, 'description']):
        OURDATASET.at[i, 'description_tokenized'] = ""
    else:
        OURDATASET.at[i, 'description_tokenized'] = tokenizer.tokenize(OURDATASET.at[i, 'description'])

descr_keywords.columns = ['word', 'gender']

# Run the gender assigning algorithm
for i in range(len(OURDATASET)):
    for j in range(len(descr_keywords)):
        # The first matching keyword wins
        if descr_keywords.at[j, 'word'] in OURDATASET.at[i, 'description_tokenized']:
            OURDATASET.at[i, 'gender_description_algo'] = descr_keywords.at[j, 'gender']
            break

In [None]:
#ASSIGNMENT OF FINAL GENDER VALUE

OURDATASET["gender_final"] = ""

for i in range(len(OURDATASET)):
    OURDATASET.at[i, "gender_final"] = OURDATASET.at[i, "gender_description_algo"] 
    if OURDATASET.at[i, "gender_final"]=="":
        OURDATASET.at[i, "gender_final"]=OURDATASET.at[i, "gender_username_algo"]