## Gender guesser do file  

#### Purpose: take list of contacts provided by GIZ and guess their gender based on first names
Outline:   
Part 1: Set up paths, import necessary packages  
Part 2: Load datasets (contact lists provided by GIZ and gendered name lists)  
Part 3: Fuzzy matching with existing gendered names lists to guess gender  
Part 4: Run gender-guesser package on the remaining names  
Part 5: Export results

## Part 1
#### Download any necessary packages, import and set up paths

In [47]:
## Install these packages if you don't have them already (remove the #)

#!pip install gender_guesser
#!pip install earthpy
#!pip install fuzzywuzzy
#!pip install python-Levenshtein

In [10]:
import csv
import pandas as pd
import re
import gender_guesser.detector as gender
import os
import earthpy as et
import fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [11]:
github_path = os.path.join(et.io.HOME, "Documents","GitHub","giz-pema-ecommerce","sampling-email-experiment")

try:
    gdrive_path = os.path.join(et.io.HOME, "Google Drive", "Research_GIZ_Tunisia_exportpromotion","1. Intervention I – E-commerce","data","0-sampling-email-experiment")
    # os.chdir raises an OSError if the folder does not exist, which triggers the fallback below
    os.chdir(gdrive_path)
except OSError:
    # Fall back to the folder layout that has an extra "My Drive" level
    gdrive_path = os.path.join(et.io.HOME, "Google Drive","My Drive", "Research_GIZ_Tunisia_exportpromotion","1. Intervention I – E-commerce","data","0-sampling-email-experiment")
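
The same fallback can be written without changing the working directory. A minimal sketch, assuming only that one of several candidate folders exists on the machine; `first_existing_dir` is a hypothetical helper, not part of earthpy or of this notebook:

```python
from pathlib import Path

def first_existing_dir(candidates):
    """Return the first candidate that is an existing directory, else raise."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    raise FileNotFoundError(f"None of these folders exist: {list(candidates)}")
```

Passing the two Google Drive locations as candidates would return whichever one is present, and a missing folder would fail loudly here instead of later at `pd.read_csv`.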

## Part 2

#### Load datasets 

In [12]:
## GIZ contact list (ungendered)

contacts = pd.read_csv(os.path.join(gdrive_path,'intermediate','giz_contact_list_ungendered.csv'), dtype='string' )
contacts.head()

Unnamed: 0.1,Unnamed: 0,firmname,name,email,firstname,lastname,origin,governorate,town,sector,fte,export
0,0,3dwave,Ferid kamel,feridkamel@gmail.com,Ferid,Kamel,pema,,,,,
1,1,abaplast,Akram Ben amor,abaplast@topnet.tn,Akram,Ben amor,pema,,,,,
2,2,abin consulting,Elyes Grar,elyesgrar@gmail.com,Elyes,Grar,pema,,,,,
3,3,abp,Ayda Bouassida,aydabouassidaa@gmail.com,Ayda,Bouassida,pema,,,,,
4,4,abshore,Asma Mechri,asma.mechri@abshore.com,Asma,Mechri,pema,,,,,


In [13]:
## Gendered names list

names = pd.read_csv(os.path.join(gdrive_path,'intermediate','gendered_names.csv'), dtype='string' )
names = names[['firstname','gender']]
names.head()

Unnamed: 0,firstname,gender
0,'آمال قاسم حرم,female
1,'أنيس,male
2,'رابح,male
3,سحر,unknown
4,'قرفال,unknown


## Part 3
#### Fuzzy matching with existing gendered names lists to guess gender  
Start with raw matching, then fuzzy matching

In [34]:
df_new = pd.merge(contacts, names, how='left', on='firstname')
df_new['gender'].value_counts()

male           3058
female          704
unknown          36
incomplete        8
not_a_name        4
male,female       1
Name: gender, dtype: Int64
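
The left merge keeps every contact and leaves `gender` missing wherever the first name has no exact match in the names list. Note that `value_counts()` hides those missing rows by default. A minimal sketch on made-up data (all `_demo` names are hypothetical):

```python
import pandas as pd

# Made-up stand-ins for the contact list and the gendered names list
contacts_demo = pd.DataFrame({"firstname": ["Asma", "Akram", "Ferid"]}, dtype="string")
names_demo = pd.DataFrame(
    {"firstname": ["Asma", "Akram"], "gender": ["female", "male"]}, dtype="string"
)

# how='left' keeps every contact; names without an exact match get a missing gender
merged_demo = pd.merge(contacts_demo, names_demo, how="left", on="firstname")

# dropna=False also counts the contacts that found no match
counts_demo = merged_demo["gender"].value_counts(dropna=False)
```

If the names list contained the same first name more than once, the left merge would repeat that contact's row once per duplicate, so it can be worth checking the row count after merging.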

The exact merge assigns male or female to 3,762 contacts. That leaves 1,239 rows whose gender is missing or ambiguous. For these we first try fuzzy matching:

In [35]:
# Pick rows for which gender is still unknown and drop the current gender column

unresolved = ['unknown', 'incomplete', 'not_a_name', 'male,female']
df_unknown = df_new[df_new['gender'].isnull() | df_new['gender'].isin(unresolved)]
df_unknown = df_unknown.drop(columns=['gender']) 
df_unknown.shape

(1239, 13)

In [39]:
# Now the fuzzy matching:
# Casting the first name columns into lists

df1_names = list(df_unknown['firstname'].unique())
df2_names = list(names['firstname'].unique())

In [40]:
# Return the best match for `name` in `list_names` together with its fuzz.ratio() similarity score (0-100).
# Only candidates scoring strictly above `min_score` are considered; if none qualify, ('', -1) is returned.
def match_names(name, list_names, min_score=0):
    max_score = -1
    max_name = ''
    for x in list_names:
        score = fuzz.ratio(name, x)
        if score > min_score and score > max_score:
            max_name = x
            max_score = score
    return (max_name, max_score)
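
To see what the scorer does on a single name, here is a minimal sketch that uses the standard library's `difflib.SequenceMatcher` as a stand-in for `fuzz.ratio()`, so it runs without fuzzywuzzy installed. The scores may differ slightly from fuzzywuzzy's, and `similarity` and `best_match` are hypothetical names used only for illustration:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale (stand-in for fuzz.ratio, not the same implementation)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def best_match(name, candidates, min_score=0):
    """Return (candidate, score) for the highest score above min_score, or ('', -1)."""
    best_name, best_score = "", -1
    for candidate in candidates:
        score = similarity(name, candidate)
        if score > min_score and score > best_score:
            best_name, best_score = candidate, score
    return best_name, best_score

# 'Oussama' differs from 'Oussema' by one letter, so it clears the threshold of 75
match, score = best_match("Oussama", ["Oussema", "Asma", "Kamel"], min_score=75)
```

A threshold of 75 is permissive for short names: the notebook's own matches include pairs such as 'Fethi' and 'Fethia', where the closest spelling can belong to a different gender.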

In [41]:
# Build (contact name, best match) pairs for every still-unmatched first name whose best score is at least 75,
# then cast the list of pairs to a dictionary. Note that the added '(' and ')' become part of each key and value.
firstnames = []
for x in df1_names:
    match = match_names(x, df2_names, 75)
    if match[1] >= 75:
        name = ('(' + str(x), str(match[0]) + ')')
        firstnames.append(name)
name_dict = dict(firstnames)
name_dict

{'(Ferid': 'Feri)',
 '(Elyes': 'Elyess)',
 '(Ayda': 'Ada)',
 '(Bilel': 'Bile)',
 '(Fadhel': 'Fadheela)',
 '(Achraf': 'Ichraf)',
 '(Hatem': 'Haithem)',
 '(Kais': 'Kaies)',
 '(Mr': 'Mor)',
 '(Nedia': 'Fedia)',
 '(Zied': 'Zed)',
 '(Dhaker': 'Dhafer)',
 '(Fethi': 'Fethia)',
 '(Luca': 'Lucas)',
 '(Saber': 'Sabeur)',
 '(Sabrine': 'Sabrin)',
 '(Faycal': 'Faryal)',
 '(Intissar': 'Intisar)',
 '(Oussama': 'Oussema)',
 '(Amine': 'Amineh)',
 '(Imed': 'Imelda)',
 '(Brahmi': 'Brahma)',
 '(Mehdi': 'Mehndi)',
 '(El': 'El)',
 '(Bechir': 'Bechr)',
 '(Cherif': 'Cherifa)',
 '(Kamel': 'Kameel)',
 '(Alaeddine': 'Ala eddine)',
 '(Donia': 'Dona)',
 '(Nouha': 'Nuha)',
 '(Mohsen': 'Mohsin)',
 '(Slim': 'Salim)',
 '(Mokhtar': 'Mukhtar)',
 '(Chakib': 'Shakib)',
 '(Brahim': 'Brahma)',
 '(Chokri': 'Chaouki)',
 '(Lotfi': 'Lutfi)',
 '(Fouzia': 'Faouzia)',
 '(Najiba': 'Najibah)',
 '(Jamel': 'Jameel)',
 '(Lamia': 'Lamiah)',
 '(Nacer': 'Naceur)',
 '(Faouzi': 'Faouzia)',
 '(Mouhamed': 'Mohamed)',
 '(Mechlia': 'Mehalia)',


In [42]:
# Replace the names with the correct equivalent in the long list of names

names['firstname'] = names['firstname'].replace(name_dict)

In [45]:
# And try matching contacts and names again: 

df_solved = pd.merge(df_unknown, names, how='left', on='firstname')
df_solved['gender'].value_counts()

unknown        36
incomplete      8
not_a_name      4
male,female     1
Name: gender, dtype: Int64

That wasn't useful: the second merge did not assign male or female to a single additional name.
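
One likely reason is visible in the `name_dict` output: every key starts with `(` and every value ends with `)`, so `replace()` most likely has nothing to match in `names['firstname']`. A minimal sketch of how the fuzzy matches could be applied instead, assuming the dictionary maps each contact name to its closest name in the gendered list as plain strings (made-up data; `matched_name` and all `_demo` names are hypothetical):

```python
import pandas as pd

# Made-up data: one contact whose spelling differs from the gendered list
unknown_demo = pd.DataFrame({"firstname": ["Oussama", "Zied"]})
names_demo = pd.DataFrame(
    {"firstname": ["Oussema", "Asma"], "gender": ["male", "female"]}
)

# Fuzzy matches as plain strings, without added parentheses (hard-coded here)
name_dict_demo = {"Oussama": "Oussema"}

# Look up each contact's closest list name; contacts without a match keep their own name
unknown_demo["matched_name"] = (
    unknown_demo["firstname"].map(name_dict_demo).fillna(unknown_demo["firstname"])
)

# Merge on the matched spelling so the gender from the list carries over
solved_demo = unknown_demo.merge(
    names_demo.rename(columns={"firstname": "matched_name"}),
    how="left",
    on="matched_name",
)
```

This leaves the gendered names list untouched and keeps the contact's original spelling in `firstname`, which makes each fuzzy match easy to review by eye before trusting it.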

## Part 4
#### Run gender-guesser package
To complete the list, we use the `gender-guesser` package, which looks up first names in its built-in name list.

In [48]:
d = gender.Detector()

df_unknown['gender'] = df_unknown['firstname'].apply(lambda x: d.get_gender(x))
df_unknown['gender'].value_counts()

male             958
female           143
unknown          130
mostly_female      5
andy               3
Name: gender, dtype: int64

##### Now the same on the surnames, for rows that are still unknown

In [49]:
df_unknown.loc[df_unknown['gender']== 'unknown', 'gender'] = df_unknown['lastname'].apply(lambda x: d.get_gender(x))
df_unknown['gender'].value_counts()

male             1046
female            159
unknown            26
mostly_female       5
andy                3
Name: gender, dtype: int64
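
The surname step relies on pandas aligning the right-hand side to the rows selected by `.loc`, so only rows still marked 'unknown' are overwritten. A minimal sketch on made-up data, with a small dictionary lookup (`guess_demo`, hypothetical) standing in for `d.get_gender` so it runs without the gender-guesser package:

```python
import pandas as pd

# Hypothetical stand-in for d.get_gender: known names map to a label, anything else is 'unknown'
KNOWN_DEMO = {"Maher": "male", "Asma": "female"}

def guess_demo(name):
    return KNOWN_DEMO.get(name, "unknown")

df_demo = pd.DataFrame(
    {"firstname": ["Maher", "Ferjani", "Xyz"], "lastname": ["Ferjani", "Asma", "Abc"]}
)
df_demo["gender"] = df_demo["firstname"].apply(guess_demo)

# Only rows still 'unknown' are overwritten; the lookup runs on those rows' surnames only
still_unknown = df_demo["gender"] == "unknown"
df_demo.loc[still_unknown, "gender"] = df_demo.loc[still_unknown, "lastname"].apply(guess_demo)
```

This can help when first and last names were entered in swapped order. It can also mislabel a contact whose surname happens to coincide with a given name, so the rows changed in this step deserve a manual check.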

In [50]:
# One quick fix: several contacts have 'Mr' recorded as their first name, so we mark them as male

df_unknown.loc[df_unknown['firstname']== 'Mr', 'gender'] = 'male'
df_unknown['gender'].value_counts()

male             1055
female            159
unknown            17
mostly_female       5
andy                3
Name: gender, dtype: int64

In [54]:
# Inspect the rows labelled 'andy' (gender-guesser's label for androgynous names)
# df_unknown[df_unknown['gender'] == "andy"]

## Part 5
#### Merge and export files  
Using Excel to avoid spacing issues 

In [55]:
# First, keep only the rows of df_new whose gender was already resolved by the exact merge

df_guessed = df_new[((df_new['gender']=='male') | (df_new['gender']=='female'))]
df_guessed.shape

(3762, 13)

In [56]:
# Now merge

frames = [df_guessed, df_unknown]

df_names = pd.concat(frames)
df_names.head()

Unnamed: 0.1,Unnamed: 0,firmname,name,email,firstname,lastname,origin,governorate,town,sector,fte,export,gender
1,1,abaplast,Akram Ben amor,abaplast@topnet.tn,Akram,Ben amor,pema,,,,,,male
4,4,abshore,Asma Mechri,asma.mechri@abshore.com,Asma,Mechri,pema,,,,,,female
6,6,acem plus,BEN SALEM,acemplus@gmail.com,Ben,Salem,pema,,,,,,male
8,8,actia,Yemen Zegneni,yemen.zegneni@actia.engineering.tn,Yemen,zegneni,pema,,,,,,male
9,9,adactim,Maher Ferjani,maher.ferjani@adactim.com,Maher,Ferjani,pema,,,,,,male
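
Because `pd.concat` stacks the frames without re-sorting, the head shows rows from `df_guessed` first, with their original index labels. A quick consistency check can confirm that the two subsets together cover each contact exactly once; a minimal sketch on made-up data (all `_demo` names are hypothetical):

```python
import pandas as pd

# Made-up stand-ins for df_guessed and df_unknown, keeping their original row labels
guessed_demo = pd.DataFrame(
    {"firstname": ["Akram", "Asma"], "gender": ["male", "female"]}, index=[1, 4]
)
unknown_demo = pd.DataFrame({"firstname": ["Ferid"], "gender": ["male"]}, index=[0])

combined_demo = pd.concat([guessed_demo, unknown_demo])

# No row lost or duplicated: the lengths add up and each original label appears once
rows_add_up = len(combined_demo) == len(guessed_demo) + len(unknown_demo)
labels_unique = combined_demo.index.is_unique
```

In the notebook itself the same arithmetic holds: 3,762 guessed rows plus 1,239 remaining rows give the 5,001 rows reported after the merge.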


In [65]:
# Drop the leftover 'Unnamed: 0' index column and check the final shape

df_names = df_names.drop(columns=['Unnamed: 0'])
df_names.shape

(5001, 12)

In [66]:
# Export to Excel

df_names.to_excel(os.path.join(gdrive_path,'final','giz_contact_list.xlsx'))