# Appendix 4: Gender Inference - Feasibility

We obtained a random sample of 1000 tweets from the raw dataset and exported it for human gender classification as follows:

In [None]:
# NOTE: RE-RUNNING THIS SECTION OVERWRITES THE ORIGINAL SAMPLE OBTAINED AND FLAGGED.
import pandas as pd

# Import the user(name) and description columns from the dataset
tweets = pd.read_excel("raw_sotu_tweet_data.xlsx", usecols = "C,K")

# Get a random sample of 1000 tweets
tweets_sample1000 = tweets.sample(1000)
tweets_sample1000 = tweets_sample1000.reset_index()

# Create extra columns
tweets_sample1000["gender_username_human"] = ""
tweets_sample1000["gender_description_human"] = ""

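# Export the tweet subset to Excel so a human can read through the fields and make gender classifications.
# The context manager saves and closes the file on exit (writer.save() was removed in pandas 2.0).
with pd.ExcelWriter('tweets_sample1000.xlsx') as writer:
    tweets_sample1000.to_excel(writer, sheet_name='Sheet1')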
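
Because the sample above is drawn without a seed, re-running the cell selects different rows. A minimal sketch of a reproducible draw follows, assuming a fixed `random_state`; the seed value and the stand-in DataFrame `tweets_demo` are hypothetical and were not used for the original sample.

```python
import pandas as pd

# Hypothetical stand-in for the raw dataset (the real data is read from raw_sotu_tweet_data.xlsx)
tweets_demo = pd.DataFrame({
    "description": [f"description {i}" for i in range(5000)],
    "user": [f"user_{i}" for i in range(5000)],
})

SEED = 42  # assumed value; the original sample was drawn without a seed

# Two independent draws with the same seed
sample_a = tweets_demo.sample(1000, random_state=SEED).reset_index()
sample_b = tweets_demo.sample(1000, random_state=SEED).reset_index()

# With the same seed, both draws select the same rows in the same order
print(sample_a.equals(sample_b))
```

Fixing the seed would not recover the sample that was already flagged; it only makes future draws repeatable.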

In Excel, we flagged the exported sample with gender classifications based on the username and description fields. See a snapshot below of the file after classification:

Note: classifications based on username were made where a recognisable male or female name, or a gender-specific word such as 'momma', 'girl' or 'guy', formed part of the "user" field. When in doubt (e.g. for more obscure names), we searched online to check whether the name was predominantly male or female. When no reasonable gender inference could be made from the username, a gender of "U" (unknown) was assigned.

Classifications based on description were made where the user had indicated their gender in the 'description' field, e.g. by referring to themselves as "Mom", "Lady", "Guy", "Father", "Girl", "Lesbian", "Red-blooded male", etc. Where no reasonable inference could be made from the description, a gender of "U" was assigned.

In [1]:
import pandas as pd
tweets_sample1000_withgender = pd.read_excel('tweets_sample1000.xlsx')
tweets_sample1000_withgender

Unnamed: 0,index,description,user,gender_username_human,gender_description_human
193,123,I didn't mean to call you an angry mob. Mama a...,LM_Shepard,U,F
270,377,"Economist, opinion columnist, libertarian, Geo...",DorfmanJeffrey,M,U
104,1664,#mediadopedealer A Willy Wonka creation 👀 and ...,Ashlee_Ray,F,U
366,2087,Author of the book Meridian Hill Park. License...,feejaysee,M,U
906,2202,"I bleed Red, White and Blue. God bless Texas. ...",lapadooza,U,U
397,2260,,DanKronstadt,M,U
482,2310,"Wife, Mom, Teacher. Just trying to make sense ...",JLaufe,U,F
332,2622,bon vivant/raconteur/troubadour \nopinionated ...,MichaelSalamone,M,U
20,3002,Just an introvert forever trying to break out ...,MissElsa86,F,F
359,3346,I enjoy being around my Twitter family on here...,tom_lewisville,M,U


We then use pandas' groupby function to count how many rows received each classification:

In [2]:
# Group the dataset in order to aggregate how many rows have been classified on the username field
table1a = tweets_sample1000_withgender.groupby('gender_username_human').agg({'user': 'size'}).rename(columns={'user': 'Username Gender Classifications'})
# Put this aggregated information into a new table
table1b = pd.DataFrame(columns=['Username Gender Classifications'])
table1b.loc['F/M'] = table1a.loc['F'] + table1a.loc['M']
table1b.loc['U'] = table1a.loc['U']
table1b.loc['TOTAL'] = table1b.loc['U'] + table1b.loc['F/M']
print(table1b)

# Group the dataset in order to aggregate how many rows have been classified on the description field
table2a = tweets_sample1000_withgender.groupby('gender_description_human').agg({'user': 'size'}).rename(columns={'user': 'Description Gender Classifications'})
# Put this aggregated information into a new table
table2b = pd.DataFrame(columns=['Description Gender Classifications'])
table2b.loc['F/M'] = table2a.loc['F'] + table2a.loc['M']
table2b.loc['U'] = table2a.loc['U']
table2b.loc['TOTAL'] = table2b.loc['U'] + table2b.loc['F/M']
print(table2b)

      Username Gender Classifications
F/M                               472
U                                 528
TOTAL                            1000
      Description Gender Classifications
F/M                                  168
U                                    832
TOTAL                               1000


Based on the username, we can assign a gender to 472 of the 1000 rows in our sample, i.e. 47%.

Based on the description, we can assign a gender to 168 of the 1000 rows in our sample, i.e. 17%.
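
The same shares can be read directly from a label column instead of from the aggregated tables. The sketch below assumes a hypothetical helper `classified_share` and toy labels; on the real data it would be called with `tweets_sample1000_withgender["gender_username_human"]` or the description column.

```python
import pandas as pd

def classified_share(labels: pd.Series) -> float:
    """Return the fraction of labels that are not 'U' (unknown)."""
    return float(labels.ne("U").mean())

# Toy labels for illustration only, not taken from the sample
toy = pd.Series(["F", "U", "M", "U", "U"])

# 2 of the 5 toy labels are F or M
print(classified_share(toy))
```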

For what percentage of those assignments are there conflicts between the assigned genders based on username and descriptions?

In [3]:
# Pivot to view aggregated information regarding the two classification categories:
print ("len user represents aggregated row count")
print(pd.pivot_table(tweets_sample1000_withgender, index=["gender_username_human", "gender_description_human"],values=["user"], aggfunc=[len]))

len user represents aggregated row count
                                                len
                                               user
gender_username_human gender_description_human     
F                     F                          52
                      M                           1
                      U                         156
M                     F                           2
                      M                          27
                      U                         234
U                     F                          49
                      M                          37
                      U                         442


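Only 3 rows conflict: 1 row is classified F by username but M by description, and 2 rows are classified M by username but F by description. Of the 82 rows where both fields yield a gender (52 + 1 + 2 + 27), that is about 3.7%.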
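
The conflict rate can be checked from the pivot counts. The sketch below copies the nine counts from the output above into a plain dictionary (the variable names are illustrative) and counts the rows where both fields assign a gender but disagree.

```python
# Row counts copied from the pivot output: (username label, description label) -> rows
pivot_counts = {
    ("F", "F"): 52, ("F", "M"): 1, ("F", "U"): 156,
    ("M", "F"): 2, ("M", "M"): 27, ("M", "U"): 234,
    ("U", "F"): 49, ("U", "M"): 37, ("U", "U"): 442,
}

# Keep only the combinations where both fields were given a gender (neither is 'U')
both = {pair: n for pair, n in pivot_counts.items() if "U" not in pair}
both_known = sum(both.values())

# A conflict is a row whose two labels are both known but differ
conflicts = sum(n for (username, description), n in both.items() if username != description)

conflict_pct = round(100 * conflicts / both_known, 1)
print(both_known, conflicts, conflict_pct)
```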

We now combine both gender values to assign a final gender to each row: where gender_username_human is unknown we use gender_description_human, and vice versa (some rows remain unknown because both values are unknown). Where the two values conflict, the username value takes precedence over the description value:

In [4]:
# Create a new column to hold the final human gender classification
tweets_sample1000_withgender["gender_human_final"] = ""

# Loop through the rows, assigning username-based gender classification into the "final" column
for i in range(len(tweets_sample1000_withgender)):
    tweets_sample1000_withgender.at[i, "gender_human_final"] = tweets_sample1000_withgender.at[i, "gender_username_human"] 
    # Where the username-based gender is unknown, use the description-based classification in the "final" column
    if tweets_sample1000_withgender.at[i, "gender_human_final"]=='U':
        tweets_sample1000_withgender.at[i, "gender_human_final"]=tweets_sample1000_withgender.at[i, "gender_description_human"]
        
# Group the data to view the final classifications made
print (tweets_sample1000_withgender.groupby('gender_human_final').agg({'user': 'size'})
      .rename(columns={'user': 'Count of classifications'}))

                    Count of classifications
gender_human_final                          
F                                        258
M                                        300
U                                        442


From the grouped counts above, we see that we can assign a gender to 258 (F) + 300 (M) = 558 of the 1000 rows, or 56% of our sample dataset.
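
The combination rule used for the final column can also be written without an explicit loop. The following is a sketch of one alternative using `Series.where`, run on a small hypothetical frame rather than the real sample; it is not the code that produced the counts above.

```python
import pandas as pd

# Hypothetical rows covering: username only, description only, both unknown, and a conflict
demo = pd.DataFrame({
    "gender_username_human":    ["F", "U", "U", "M"],
    "gender_description_human": ["U", "M", "U", "F"],
})

username = demo["gender_username_human"]
description = demo["gender_description_human"]

# Keep the username label unless it is 'U'; otherwise fall back to the description label.
# In a conflict the username label is known, so it takes precedence.
demo["gender_human_final"] = username.where(username != "U", description)

print(demo["gender_human_final"].tolist())
```

Unlike the loop, this form does not depend on the DataFrame having a default integer index.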