# Appendix 5: Gender Inference - Machine Learning

Dataset used for training: https://www.kaggle.com/crowdflower/twitter-user-gender-classification
The dataset holds 20051 tweets with description, user, location and text data, plus additional fields, most notably **gender** and **gender confidence**. These fields were populated by contributors who were asked to view a Twitter profile and judge whether the user was male, female, or a brand (non-individual).

Code used to build the machine learning classifiers: Dibaka Saha's https://www.kaggle.com/evilport/classify-gender-with-description-and-text

We train this program on the aforementioned Kaggle dataset, then run it against our own dataset, checking its accuracy in both cases.

In [1]:
import nltk
from nltk.corpus import stopwords
import random
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.utils import shuffle
import string
import pandas as pd
import math

In [2]:
def find_features(top_words, text):
    feature = {}
    for word in top_words:
        feature[word] = word in text.lower()
    return feature
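
As a quick illustration of what this feature function produces, the sketch below restates it under a different name (`find_features_sketch`, a hypothetical name used only here) and applies it to a made-up vocabulary and sentence. Note that the check is a substring test on the lowercased text, so a vocabulary word such as "date" is also marked present when the text only contains "updates".

```python
def find_features_sketch(top_words, text):
    """Map each vocabulary word to True if it occurs anywhere in the lowercased text."""
    lowered = text.lower()
    return {word: word in lowered for word in top_words}


# Hypothetical mini-vocabulary and text, chosen only for illustration
toy_vocabulary = ["date", "love", "weather"]
toy_features = find_features_sketch(toy_vocabulary, "Weather UPDATES every hour")
print(toy_features)  # {'date': True, 'love': False, 'weather': True}
```

This may help explain why closely related strings such as "update", "updates", "date" and "dates" can all appear as separate informative features.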

In [3]:
df = pd.read_csv('gender-classifier-DFE-791531.csv', encoding = 'latin1')
#df = shuffle(shuffle(shuffle(df)))
df.head(10)

Unnamed: 0,_unit_id,_golden,_unit_state,_trusted_judgments,_last_judgment_at,gender,gender:confidence,profile_yn,profile_yn:confidence,created,...,profileimage,retweet_count,sidebar_color,text,tweet_coord,tweet_count,tweet_created,tweet_id,tweet_location,user_timezone
0,815719226,False,finalized,3,10/26/15 23:24,male,1.0,yes,1.0,12/05/2013 01:48,...,https://pbs.twimg.com/profile_images/414342229...,0,FFFFFF,Robbie E Responds To Critics After Win Against...,,110964,10/26/15 12:40,6.59e+17,main; @Kan1shk3,Chennai
1,815719227,False,finalized,3,10/26/15 23:30,male,1.0,yes,1.0,10/01/2012 13:51,...,https://pbs.twimg.com/profile_images/539604221...,0,C0DEED,ÛÏIt felt like they were my friends and I was...,,7471,10/26/15 12:40,6.59e+17,,Eastern Time (US & Canada)
2,815719228,False,finalized,3,10/26/15 23:33,male,0.6625,yes,1.0,11/28/14 11:30,...,https://pbs.twimg.com/profile_images/657330418...,1,C0DEED,i absolutely adore when louis starts the songs...,,5617,10/26/15 12:40,6.59e+17,clcncl,Belgrade
3,815719229,False,finalized,3,10/26/15 23:10,male,1.0,yes,1.0,06/11/2009 22:39,...,https://pbs.twimg.com/profile_images/259703936...,0,C0DEED,Hi @JordanSpieth - Looking at the url - do you...,,1693,10/26/15 12:40,6.59e+17,"Palo Alto, CA",Pacific Time (US & Canada)
4,815719230,False,finalized,3,10/27/15 1:15,female,1.0,yes,1.0,4/16/14 13:23,...,https://pbs.twimg.com/profile_images/564094871...,0,0,Watching Neighbours on Sky+ catching up with t...,,31462,10/26/15 12:40,6.59e+17,,
5,815719231,False,finalized,3,10/27/15 1:47,female,1.0,yes,1.0,03/11/2010 18:14,...,https://pbs.twimg.com/profile_images/656336865...,0,0,"Ive seen people on the train with lamps, chair...",,20036,10/26/15 12:40,6.59e+17,New York Gritty,Central Time (US & Canada)
6,815719232,False,finalized,3,10/27/15 1:57,brand,1.0,yes,1.0,4/24/08 13:03,...,https://pbs.twimg.com/profile_images/528547133...,0,0,@BpackEngineer Thank you for your patience whi...,,13354,10/26/15 12:40,6.59e+17,Worldwide,Eastern Time (US & Canada)
7,815719233,False,finalized,3,10/26/15 23:48,male,1.0,yes,1.0,12/03/2012 21:54,...,https://pbs.twimg.com/profile_images/508875440...,0,C0DEED,Gala Bingo clubs bought for å£241m: The UK's l...,,112117,10/26/15 12:40,6.59e+17,,
8,815719234,False,finalized,3,10/27/15 1:52,female,1.0,yes,1.0,09/08/2015 04:50,...,https://pbs.twimg.com/profile_images/658670112...,0,0,@_Aphmau_ the pic defines all mcd fangirls/fan...,,482,10/26/15 12:40,6.59e+17,,
9,815719235,False,finalized,3,10/27/15 1:49,female,1.0,yes,1.0,5/13/11 3:32,...,https://pbs.twimg.com/profile_images/513327289...,0,FFFFFF,@Evielady just how lovely is the tree this yea...,,26085,10/26/15 12:40,6.59e+17,"Nottingham, England.",Amsterdam


In [4]:
all_descriptions = df['description']
all_tweets = df['text']
all_genders = df['gender']
all_gender_confidence = df['gender:confidence']
description_tweet_gender = []

In [5]:
# Build the bag of words from the tweet text and the description
bag_of_words = []
c = 0  # for the index of the row
stop = stopwords.words('english')
for tweet in all_tweets:
    description = all_descriptions[c]
    gender = all_genders[c]
    gender_confidence = all_gender_confidence[c]
    
    # Skip rows where both the tweet and the description are empty
    # Skip rows with an unknown or empty gender
    # Skip rows with gender:confidence < 80%
    if (str(tweet) == 'nan' and str(description) == 'nan') or str(gender) == 'nan' or str(gender) == 'unknown' or float(gender_confidence) < 0.8:
        c+=1
        continue
    
    if str(tweet) == 'nan':
        tweet = ''
    if str(description) == 'nan':
        description = ''
    
    # Removal of punctuations
    for punct in string.punctuation:
        if punct in tweet:
            tweet = tweet.replace(punct, " ")
        if punct in description:
            description = description.replace(punct, " ")
            
    # Adding the word to the bag except stopwords
    for word in tweet.split():
        if word.isalpha() and word.lower() not in stop:
            bag_of_words.append(word.lower())
    for word in description.split():
        if word.isalpha() and word.lower() not in stop:
            bag_of_words.append(word.lower())
    
    # Using tweet and description for classification
    description_tweet_gender.append((tweet+" "+description , gender))
    c += 1

print(len(bag_of_words))
print(len(description_tweet_gender))

234140
13817


In [6]:
# Get top 4000 words which will act as our features of each sentence
bag_of_words = nltk.FreqDist(bag_of_words)
top_words = []
for word in bag_of_words.most_common(4000):
    top_words.append(word[0])

top_words[:10]

['co', 'https', 'get', 'love', 'weather', 'like', 'http', 'one', 'life', 'new']

In [7]:
# Creating the feature set, training set and the testing set
feature_set = [(find_features(top_words, text), gender) for (text, gender) in description_tweet_gender]
training_set = feature_set[:int(len(feature_set)*4/5)]
testing_set = feature_set[int(len(feature_set)*4/5):]

print("Length of feature set", len(feature_set))
print("Length of training set", len(training_set))
print("Length of testing set", len(testing_set))

Length of feature set 13817
Length of training set 11053
Length of testing set 2764
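
The split sizes above follow from cutting the list at `int(len(feature_set)*4/5)`. The sketch below reproduces that arithmetic on a plain range of 13817 items. The helper name `split_four_fifths` and its optional `seed` argument are assumptions added for illustration: the notebook splits the rows in file order, and the shuffled variant is only one possible alternative.

```python
import random


def split_four_fifths(items, seed=None):
    """Return (training, testing) with the first 4/5 of the items used for training.

    If a seed is given, a shuffled copy is split instead (an assumption for
    illustration; the notebook itself splits the rows in file order).
    """
    items = list(items)
    if seed is not None:
        random.Random(seed).shuffle(items)
    cut = int(len(items) * 4 / 5)
    return items[:cut], items[cut:]


training, testing = split_four_fifths(range(13817))
print(len(training), len(testing))  # 11053 2764
```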


In [8]:
# Creating a naive bayes classifier
NB_classifier = nltk.NaiveBayesClassifier.train(training_set)
accuracy = nltk.classify.accuracy(NB_classifier, testing_set)*100
print("Naive Bayes Classifier accuracy =", accuracy)
NB_classifier.show_most_informative_features(20)

Naive Bayes Classifier accuracy = 63.06078147612156
Most Informative Features
                 updates = True            brand : female =    168.2 : 1.0
                   dates = True            brand : female =     89.8 : 1.0
                 weather = True            brand : female =     74.8 : 1.0
                 channel = True            brand : female =     67.2 : 1.0
              continuous = True            brand : female =     59.6 : 1.0
                  update = True            brand : female =     59.2 : 1.0
                  latest = True            brand : female =     34.5 : 1.0
               subscribe = True            brand : female =     30.6 : 1.0
               promoting = True            brand : female =     30.4 : 1.0
                  secure = True            brand : female =     30.4 : 1.0
                register = True            brand : female =     29.6 : 1.0
                    date = True            brand : female =     28.4 : 1.0
            photograph

In [9]:
# Creating a logistic regression classifier
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
accuracy = nltk.classify.accuracy(LogisticRegression_classifier, testing_set)*100
print("Logistic Regression classifier accuracy =", accuracy)

Logistic Regression classifier accuracy = 64.616497829233


Note that on the testing data, the Naive Bayes classifier is 63% accurate, and the Logistic Regression classifier is 65% accurate. 

We now run it on our own sample of 1,000 tweets from our dataset to see whether it achieves similar accuracy levels:

In [10]:
# Testing with random user-entered data
description = "."
text = ""
features = find_features(top_words, description+" "+text)
print(NB_classifier.classify(features))
print(LogisticRegression_classifier.classify(features))

female
female


In [11]:
# Testing with the same 1000 rows of our Twitter data that the human classified. Note that rows with blank description fields have been removed.
tweet_descrandtext = pd.read_excel("tweets_sample1000_forML.xlsx")
tweet_descrandtext["genderML_NB"] = ""
tweet_descrandtext["genderML_LR"] = ""

In [12]:
tweet_descrandtext = tweet_descrandtext.reset_index()

In [13]:
import re
# Write regex pattern to remove all punctuation (note: this ML exploration was carried out before pre-processing had been finished on our original dataset)
remove = string.punctuation
remove = remove + "“”‘’"
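punct_pattern = r"[{}]".format(re.escape(remove))  # escape characters such as ], \ and ^ so they are treated literally inside the character class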

# Loop through dataset to remove punctuation from description and text fields
for i in range(len(tweet_descrandtext)):
    tweet_descrandtext.at[i,"description"] = re.sub(punct_pattern, " ", tweet_descrandtext.at[i,"description"]) # remove punctuation
    tweet_descrandtext.at[i,"text"] = re.sub(punct_pattern, " ", tweet_descrandtext.at[i,"text"]) # remove punctuation


In [14]:
print(tweet_descrandtext.head)

<bound method NDFrame.head of      level_0   index                                        description  \
0        305   30582   Don t get mad  get even     John F  Kennedy  ...   
1        961  177398                                               Gra    
2        351  131345   I find your answer vague and unconvincing    ...   
3        106  286077   If your heart is filled with patriotism  ther...   
4        176  206944   Seek first the Kingdom    and all things shal...   
5        318   23769   Atheist   FreeSpeech   Science   Politics   S...   
6        837  126384   BlueDress  HeyBlueDress Aspiring  friendofthepod   
7        595  188785   Digital  organizer   underground revolutionar...   
8        755  184635   DigitalMarketing  married to  LeighPollack  r...   
9        330  280110   EdDStudent    CriminalJusticeAdjunct  Addicte...   
10       834  109520   LittleMonster  Debater  Democrat and a Minority    
11       171  187080   MAGA ⚜️New Orleans•Destin   Alcohol Drug Ment..

In [15]:
# Loop through dataset to classify each row with the trained Naive Bayes and Logistic Regression classifiers
for i in range(len(tweet_descrandtext)):
    description=  tweet_descrandtext.at[i, "description"]
    text= tweet_descrandtext.at[i, "text"]
    features = find_features(top_words, description+" "+text)
    tweet_descrandtext.at[i, "genderML_NB"]=(NB_classifier.classify(features))
    tweet_descrandtext.at[i, "genderML_LR"]=(LogisticRegression_classifier.classify(features))

In [16]:
# Checking accuracy of naive bayes (NB) classification against human classifications of M or F

countTotal = (len(tweet_descrandtext))
countOK = 0

# Loop through data and compare human classification with NB classifications. Where they match, mark "OK".
for i in range(len(tweet_descrandtext)):
    if ((tweet_descrandtext.at[i, "gender_final"] == "U")):
        countTotal-=1
        continue
    elif ((tweet_descrandtext.at[i, "genderML_NB"] == "female") & (tweet_descrandtext.at[i, "gender_final"] == "F")):
        countOK+=1
    elif ((tweet_descrandtext.at[i, "genderML_NB"] == "male") & (tweet_descrandtext.at[i, "gender_final"] == "M")):
        countOK+=1
        
brandCount=0
for i in range(len(tweet_descrandtext)):
    if ((tweet_descrandtext.at[i, "genderML_NB"] == "brand")):
        brandCount+=1
        
print(countOK)  # The number of rows where the human's classification matched the algorithm's
print(countTotal)  # The total number of rows the human classified with a gender
accuracy = (countOK/countTotal)*100  # The percentage accuracy of the algorithm's classifications (taking the human's classifications to be correct)
print(accuracy)
print(len(tweet_descrandtext))  # Number of classifications the algorithm made in total
print(brandCount)  # Number of classifications the algorithm made as 'brand' (neither male nor female)

271
512
52.9296875
904
124


This shows the following stats:

- The Naive Bayes algorithm made 780 M/F classifications in total (904 - 124; the remaining 124 were classified as 'brand'). The human made only 512 M/F classifications.

- For 52.93% (271/512) of the rows the human classified as M or F, the algorithm's classification matched the human's.
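
The counting logic behind these figures can be sketched without pandas as follows. The function name `agreement_summary` and the toy label lists are hypothetical. The sketch assumes, as in the loop above, that rows the human marked "U" are excluded from the denominator and that a 'brand' prediction on a row the human marked M or F counts as a mismatch.

```python
def agreement_summary(human_labels, predicted_labels):
    """Return (matched, scored, percentage) for rows the human labelled M or F."""
    label_map = {"female": "F", "male": "M"}  # classifier label -> human label
    scored = 0
    matched = 0
    for human, predicted in zip(human_labels, predicted_labels):
        if human == "U":  # human could not assign a gender: leave the row out
            continue
        scored += 1
        if label_map.get(predicted) == human:  # 'brand' never matches M or F
            matched += 1
    percentage = 100 * matched / scored if scored else 0.0
    return matched, scored, percentage


# Toy data for illustration only
human = ["F", "M", "U", "M", "F"]
predicted = ["female", "brand", "male", "male", "male"]
print(agreement_summary(human, predicted))  # (2, 4, 50.0)

# The arithmetic reported above
print(904 - 124)        # 780
print(100 * 271 / 512)  # 52.9296875
```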

In [17]:
# Checking accuracy of logistic regression (LR) classification

countTotal2 = (len(tweet_descrandtext))
countOK2 = 0

# Loop through data and compare human classification with LR classifications. Where they match, mark "OK".
for i in range(len(tweet_descrandtext)):
    if ((tweet_descrandtext.at[i, "gender_final"] == "U")):
        countTotal2-=1
        continue
    elif ((tweet_descrandtext.at[i, "genderML_LR"] == "female") & (tweet_descrandtext.at[i, "gender_final"] == "F")):
        countOK2+=1
    elif ((tweet_descrandtext.at[i, "genderML_LR"] == "male") & (tweet_descrandtext.at[i, "gender_final"] == "M")):
        countOK2+=1
        
print(countOK2)  # The number of rows where the human's classification matched the algorithm's
print(countTotal2)  # The total number of rows the human classified with a gender
accuracy2 = (countOK2/countTotal2)*100  # The percentage accuracy of the algorithm's classifications (taking the human's classifications to be correct)
print(accuracy2)

299
512
58.3984375


This shows that the LR classifier's classification matched the human's for 299 of the 512 rows the human classified as M or F, or 58%.

In [18]:
# Export the tweet subset with the new gender columns
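# The context manager saves and closes the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('tweets_with_ML_gender.xlsx') as writer:
    tweet_descrandtext.to_excel(writer, sheet_name='Sheet1')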