## Project 3 - Names and Genders

#### Summer 2021
**Authors:** GOAT Team (Esteban Aramayo, Ethan Haley, Claire Meyer, and Tyler Frankenburg)

For Project 3, we'll try to predict genders of unseen names, based on genders of seen names.  

We'll start with nltk's names corpus, complete some data preparations, then test different model types.

In [114]:
import nltk
import random

In [158]:
names = nltk.corpus.names

##### Exploration

First, let's explore the names dataset a bit, and learn more about what's included. We can start by checking if the classes, Male and Female, will be balanced.

In [118]:
print(f"{len(names.words('male.txt'))} male names and {len(names.words('female.txt'))} female names.")

2943 male names and 5001 female names.


The dataset skews female: 63% of the names are female. So, a majority-class baseline would score 63% accuracy.
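The baseline figure follows directly from the class counts printed above. Here is a minimal sketch of that arithmetic, with the counts hard-coded from the output (the variable names are just for illustration):

```python
# Class counts copied from the printed output above
n_male, n_female = 2943, 5001

# A majority-class classifier always predicts the larger class,
# so its accuracy equals that class's share of the data.
majority_share = max(n_male, n_female) / (n_male + n_female)
print(f"Majority baseline accuracy: {majority_share:.1%}")
```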

In [119]:
# Put all names in one list, with gender labels
names = [(n, 'M') for n in names.words('male.txt')] + \
        [(n, 'F') for n in names.words('female.txt')]

# Shuffle names for ML, setting the random seed so we get the same results every time
random.seed(620)
random.shuffle(names)
names[2345]

('Ahmet', 'M')

What sorts of names are we dealing with here? Let's sample 20.

In [120]:
print(names[600:620])

[('Rahul', 'M'), ('Aida', 'F'), ('Emelda', 'F'), ('Cherey', 'F'), ('Tessi', 'F'), ('Marchelle', 'F'), ('Bobina', 'F'), ('Dewey', 'M'), ('Gavra', 'F'), ('Angie', 'F'), ('Emelina', 'F'), ('Jessalin', 'F'), ('Genevieve', 'F'), ('Giacinta', 'F'), ('Cloris', 'F'), ('Herman', 'M'), ('Vilhelm', 'M'), ('Kristine', 'F'), ('Elysha', 'F'), ('Marna', 'F')]


Looks like a mix of many cultures, but weighted towards European languages.

In [121]:
# shortest name
min(names, key=lambda n: len(n[0]))

('Em', 'F')

In [122]:
# longest name
max(names, key=lambda n: len(n[0]))

('Helen-Elizabeth', 'F')

We can replace those hyphens with spaces, make names lowercase, and add spaces at the start and end of each name too, to signify start and end.

In [123]:
names = [(' ' + n[0].lower().replace('-', ' ') + ' ', n[1]) for n in names]
max(names, key=lambda n: len(n[0]))

(' helen elizabeth ', 'F')
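The same cleanup can be wrapped in a small function so it is easy to apply to any other list of raw names. This is only a sketch: `normalize_name` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def normalize_name(raw_name):
    """Lowercase a name, turn hyphens into spaces, and pad with boundary spaces."""
    return ' ' + raw_name.lower().replace('-', ' ') + ' '

# repr() makes the leading and trailing spaces visible
print(repr(normalize_name('Helen-Elizabeth')))
print(repr(normalize_name('Em')))
```

The boundary spaces let a letter pair such as `'a '` mean "ends in a", which is different from an `a` somewhere in the middle of the name.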

##### Data Preparation for Models

Let's split names into train, dev, and test sets as specified.

In [30]:
tests, devs, trains = names[:500], names[500:1000], names[1000:]

The initial example from the book uses a Naive Bayes classifier, and ends with 76% accuracy.

In [182]:
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

In [184]:
train_set = [(gender_features(n), gender) for (n, gender) in trains]
devtest_set = [(gender_features(n), gender) for (n, gender) in devs]
test_set = [(gender_features(n), gender) for (n, gender) in tests]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.76
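It's worth checking how the book's suffix features behave on names that carry the boundary spaces added earlier. A small sketch, assuming `gender_features` is given the padded, lowercased names:

```python
def gender_features(word):
    return {'suffix1': word[-1:],
            'suffix2': word[-2:]}

# With a trailing boundary space, suffix1 is always ' ' and
# suffix2 holds the last letter plus that space.
print(gender_features(' aida '))

# Without padding, the two features are the last one and two letters.
print(gender_features('aida'))
```

If the padded names are what gets passed in, `suffix1` is constant and this baseline is effectively a last-letter model.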


As an example of feature engineering, and to extend the book's example, we could break every name down into its letter pairs. With 26 letters plus the space, that gives 27 * 27 = 729 possible pairs.

In [143]:
def getPairs(name):
    '''
    Given a lowercase name as input, this returns a dictionary 
    showing which pairs of letters are in the name.
    Spaces are allowed, and only binary values are returned, not counts.
    '''
    letters = ' abcdefghijklmnopqrstuvwxyz'
    pairdict = {p1+p2:False for p1 in letters for p2 in letters}
    
    for i in range(len(name)-1):
        pairdict[name[i:i+2]] = True
        
    return pairdict

In [144]:
getPairs(' jo jo ')['o ']

True
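To make the shape of these features concrete, here is a short sketch that repeats the `getPairs` definition so it runs on its own, then counts the keys and lists the pairs that are switched on for one name:

```python
def getPairs(name):
    letters = ' abcdefghijklmnopqrstuvwxyz'
    pairdict = {p1 + p2: False for p1 in letters for p2 in letters}
    for i in range(len(name) - 1):
        pairdict[name[i:i + 2]] = True
    return pairdict

feats = getPairs(' jo jo ')

# Every name gets the same 729 keys; only the values differ.
print(len(feats))

# Repeated pairs collapse, because values are booleans rather than counts.
present = sorted(pair for pair, seen in feats.items() if seen)
print(present)
```

One thing to keep in mind: a pair containing a character outside the 27-symbol alphabet (an uppercase letter, say) would be added as a new key rather than rejected, so names should be lowercased before they reach this function.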

In [145]:
# Use that function to make the features
trainfeats = [(getPairs(n[0]), n[1]) for n in trains]
devfeats = [(getPairs(n[0]), n[1]) for n in devs]

##### Train and check accuracy of the models

Then we create the classifiers themselves. We'll train both a Decision Tree and a Naive Bayes classifier and compare the two, starting with **Naive Bayes**. We train the classifier on the training set, then check its accuracy on our dev set.

In [146]:
nb_model = nltk.NaiveBayesClassifier.train(trainfeats)

In [147]:
nltk.classify.accuracy(nb_model, devfeats)

0.816

Just that one type of feature gets the model to 81.6% accuracy.  How much did it overfit on the training data? We can check by looking at the accuracy on the training set and comparing to dev.

In [148]:
# check model accuracy on training set
nltk.classify.accuracy(nb_model, trainfeats)

0.8224366359447005

It doesn't look like the Naive Bayes model overfit during training (82.2% on the training set versus 81.6% on dev), so when we check it on the held-out test names later, we should expect something similar to the dev results.

Next let's try a **Decision Tree Classifier**. 

In [49]:
tree_model = nltk.DecisionTreeClassifier.train(trainfeats)

In [51]:
# check tree model accuracy on dev set
nltk.classify.accuracy(tree_model, devfeats)

0.8

In [52]:
# how had it done on the training?
nltk.classify.accuracy(tree_model, trainfeats)

0.8649193548387096

The decision tree model overfit during training (86.5% on the training set versus 80.0% on dev), and has slightly worse dev accuracy than the Naive Bayes model.
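The overfitting comparison comes down to the gap between training and dev accuracy. A minimal sketch using the accuracies reported above (`generalization_gap` is a hypothetical helper, introduced only for this illustration):

```python
# Accuracies copied from the cell outputs above
scores = {
    'Naive Bayes':   {'train': 0.8224366359447005, 'dev': 0.816},
    'Decision Tree': {'train': 0.8649193548387096, 'dev': 0.8},
}

def generalization_gap(train_acc, dev_acc):
    """Training accuracy minus dev accuracy; larger values suggest more overfitting."""
    return train_acc - dev_acc

gaps = {name: generalization_gap(s['train'], s['dev']) for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train-dev gap = {gap:.3f}")
```

The Naive Bayes gap is under one percentage point, while the decision tree's is about six and a half points.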

##### Learning from the Models

Let's see what the models learned: of all our paired features, which were most informative?

In [58]:
nb_model.show_most_informative_features(20)

Most Informative Features
                      a  = True                F : M      =     34.7 : 1.0
                      f  = True                M : F      =     28.6 : 1.0
                      k  = True                M : F      =     28.1 : 1.0
                      rv = True                M : F      =     27.5 : 1.0
                      fo = True                M : F      =     22.5 : 1.0
                      lt = True                M : F      =     21.9 : 1.0
                      hu = True                M : F      =     21.9 : 1.0
                      iu = True                M : F      =     18.5 : 1.0
                      rw = True                M : F      =     17.4 : 1.0
                      v  = True                M : F      =     16.2 : 1.0
                      p  = True                M : F      =     15.1 : 1.0
                      sp = True                M : F      =     15.1 : 1.0
                      rk = True                M : F      =     14.5 : 1.0
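Each ratio in this listing compares how likely a feature value is under one label versus the other. The sketch below shows the idea with made-up counts. It assumes add-0.5 smoothing over two bins (True/False), which is our understanding of NLTK's default estimator but is an assumption here; the function names and counts are hypothetical.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature value | label); gamma and bins are assumed values."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature value is under label A than under label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the pair is present in 1500 of 4400 names with label A,
# but in only 25 of 2500 names with label B.
ratio = likelihood_ratio(1500, 4400, 25, 2500)
print(f"A : B = {ratio:.1f} : 1.0")
```

A pair that is common under one label and rare under the other produces a large ratio, which is why endings like `'a '` rise to the top.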

In [62]:
print(tree_model.pseudocode(depth=5))

if d  == False: 
  if r  == False: 
    if s  == False: 
      if o  == False: 
        if n  == False: return 'M'
        if n  == True: return 'F'
      if o  == True: 
        if ko == False: return 'M'
        if ko == True: return 'F'
    if s  == True: 
      if is == False: 
        if ys == False: return 'F'
        if ys == True: return 'F'
      if is == True: 
        if ti == False: return 'M'
        if ti == True: return 'M'
  if r  == True: 
    if ni == False: 
      if no == False: 
        if mb == False: return 'M'
        if mb == True: return 'F'
      if no == True: 
        if ra == False: return 'F'
        if ra == True: return 'M'
    if ni == True: return 'F'
if d  == True: 
  if sa == False: 
    if ag == False: 
      if dr == False: 
        if ng == False: return 'M'
        if ng == True: return 'F'
      if dr == True: return 'F'
    if ag == True: return 'F'
  if sa == True: 
    if ra == False: return 'F'
    if ra == True: return 'M'



Based on the most informative features and the splits of the decision tree, the last letter is more important than the first letter: the top features are mostly a letter followed by the end-of-name space.  Also notice that all these informative features other than the top one are indicators of male names.  This may be partly because female names account for 63% of the names here. An attempt to squeeze a little more accuracy out of the model by balancing the training set, i.e. undersampling female names, led to a small dropoff in performance:

In [159]:
print(f"{len(names.words('male.txt'))} male names and {len(names.words('female.txt'))} female names.")

2943 male names and 5001 female names.


In [167]:
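# Downsample the female names to roughly match the male count
f_names = names.words('female.txt')
random.seed(620)
random.shuffle(f_names)
f_dsample = f_names[:2942]  # 2942 female names vs. 2943 male names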

In [168]:
# Put all names in one list, with gender labels
names_dsample = [(n, 'M') for n in names.words('male.txt')] + \
        [(n, 'F') for n in f_dsample]

# Shuffle names for ML, setting the random seed so we get the same results every time
random.seed(620)
random.shuffle(names_dsample)
names_dsample[2345]

('Wileen', 'F')

In [170]:
tests_d, devs_d, trains_d = names_dsample[:500], names_dsample[500:1000], names_dsample[1000:]

In [171]:
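# Note: names_dsample holds the raw corpus names (capitalized, no boundary
# spaces); the earlier lowercase/padding step is not re-applied here
trainfeats_d = [(getPairs(n[0]), n[1]) for n in trains_d]
devfeats_d = [(getPairs(n[0]), n[1]) for n in devs_d]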

In [175]:
nb_model_d = nltk.NaiveBayesClassifier.train(trainfeats_d)

In [176]:
# scored on the original dev features (devfeats), not devfeats_d
nltk.classify.accuracy(nb_model_d, devfeats)

0.752

In [177]:
# scored on the original training features (trainfeats), not trainfeats_d
nltk.classify.accuracy(nb_model_d, trainfeats)

0.7600806451612904

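As we can see, accuracy actually dropped with balanced classes, to a result in line with the textbook's example. This comparison is not quite like-for-like, though: the downsampled names were not lowercased or padded with spaces before feature extraction, and the model was scored against the original `devfeats` and `trainfeats`, so the drop may not be due to class balance alone.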
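One way to isolate the effect of class balance would be to undersample only the already-preprocessed training split, leaving the dev and test names untouched. The following is a minimal sketch on toy data, not what was run above; `undersample_majority` is a hypothetical helper.

```python
import random

def undersample_majority(labeled, seed=620):
    """Return a shuffled copy of `labeled` with every class cut to the minority size."""
    rng = random.Random(seed)

    # Group the (name, label) pairs by label
    by_label = {}
    for item in labeled:
        by_label.setdefault(item[1], []).append(item)

    # Sample each class down to the size of the smallest one
    target = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, target))

    rng.shuffle(balanced)
    return balanced

# Toy stand-in for a preprocessed training split
toy_trains = [(' aida ', 'F'), (' angie ', 'F'), (' marna ', 'F'),
              (' dewey ', 'M'), (' rahul ', 'M')]
balanced = undersample_majority(toy_trains)
print(balanced)
```

Because the input is already lowercased and padded, the resulting features line up with the ones in the dev and test sets, and no dev or test name can leak into the balanced training set.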

##### Implementing on our Test Sets

Finally, we can look at the accuracy for our test set and see how it compares to the training and development sets. We expect comparable performance from our Naive Bayes model, and poorer performance from the Decision Tree classifier, which demonstrated overfitting.

Even so, all three versions of this model improve on the textbook's example, though only very slightly in the case of the downsampled model.

In [178]:
testfeats = [(getPairs(n[0]), n[1]) for n in tests]

In [185]:
print('Textbook example: ', nltk.classify.accuracy(classifier, test_set))
print('Naive Bayes, all data: ', nltk.classify.accuracy(nb_model, testfeats))
print('Decision Tree, all data: ', nltk.classify.accuracy(tree_model, testfeats))
print('Naive Bayes, downsampled: ', nltk.classify.accuracy(nb_model_d, testfeats))

Textbook example:  0.754
Naive Bayes, all data:  0.822
Decision Tree, all data:  0.796
Naive Bayes, downsampled:  0.762
