## Project 3

#### Summer 2021
**Authors:** GOAT Team (Esteban Aramayo, Ethan Haley, Claire Meyer, and Tyler Frankenburg)

### For Project 3, we'll try to predict the gender of unseen names, based on the gender labels of names the model has already seen.  

**We'll start with nltk's names corpus**

In [1]:
import nltk
import random

In [36]:
names = nltk.corpus.names

In [37]:
# are genders balanced?
print(f"{len(names.words('male.txt'))} male names and {len(names.words('female.txt'))} female names.")

2943 male names and 5001 female names.


So a majority-class baseline, which always predicts female, has an accuracy of 5001 / 7944, or about 63%.
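The baseline figure follows directly from the two counts printed above. A minimal sketch of the arithmetic, with the counts hard-coded from that output rather than read from the corpus:

```python
# Counts copied from the printed output above.
n_male, n_female = 2943, 5001

# A majority-class classifier always predicts the more frequent label ('F'),
# so its accuracy is that label's share of all names.
baseline = max(n_male, n_female) / (n_male + n_female)
print(f"Majority baseline accuracy: {baseline:.3f}")  # 0.630
```

Any model we train should beat this number to be worth keeping.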

In [38]:
# Put all names in one list, with gender labels
names = [(n, 'M') for n in names.words('male.txt')] + \
        [(n, 'F') for n in names.words('female.txt')]

# Shuffle names for ML, setting random seed so we get same results every time
random.seed(620)
random.shuffle(names)
names[2345]

('Ahmet', 'M')

In [39]:
# shortest name
min(names, key=lambda n: len(n[0]))

('Em', 'F')

In [40]:
# longest name
max(names, key=lambda n: len(n[0]))

('Helen-Elizabeth', 'F')

Next we normalize the names: replace hyphens with spaces, make everything lowercase, and pad each name with a space at the start and end to mark where the name begins and ends.

In [42]:
names = [(' ' + n[0].lower().replace('-', ' ') + ' ', n[1]) for n in names]

max(names, key=lambda n: len(n[0]))

(' helen elizabeth ', 'F')

**Split the shuffled names into test, dev, and train sets as specified: the first 500 for test, the next 500 for dev, and the remaining 6,944 for training.**

In [43]:
tests, devs, trains = names[:500], names[500:1000], names[1000:]

#### As an example of feature engineering, we can break every name down into its pairs of adjacent characters.  With 26 letters plus the space, that gives 27 * 27 = 729 possible pairs.

In [30]:
def getPairs(name):
    '''
    Given a lowercase name as input, this returns a dictionary
    showing which pairs of adjacent characters are in the name.
    Spaces are allowed, and only binary values are returned, not counts.
    '''
    letters = ' abcdefghijklmnopqrstuvwxyz'
    # start with every possible pair marked as absent
    pairdict = {p1+p2:False for p1 in letters for p2 in letters}

    # mark each adjacent pair that actually occurs in the name
    for i in range(len(name)-1):
        pairdict[name[i:i+2]] = True

    return pairdict

In [45]:
getPairs(' jo jo ')['o ']

True
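To make the pair features concrete, here is a small sketch that lists only the pairs present in a padded name, alongside the size of the full feature space. `padded_pairs` is a hypothetical helper written for illustration; it is not part of the notebook's pipeline, which uses `getPairs`.

```python
def padded_pairs(name):
    """Return the set of adjacent character pairs in a padded, lowercase name."""
    return {name[i:i + 2] for i in range(len(name) - 1)}

letters = ' abcdefghijklmnopqrstuvwxyz'
print(len(letters) ** 2)                # 729 possible pairs
print(sorted(padded_pairs(' jo jo ')))  # [' j', 'jo', 'o ']
```

Only 3 of the 729 features are `True` for this name, so each feature dictionary is mostly `False` values.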

In [47]:
# use that function to make the features
trainfeats = [(getPairs(n[0]), n[1]) for n in trains]
devfeats = [(getPairs(n[0]), n[1]) for n in devs]

In [48]:
# train Naive Bayes model
model = nltk.NaiveBayesClassifier.train(trainfeats)

In [49]:
# check model accuracy on dev set
nltk.classify.accuracy(model, devfeats)

0.816

In [50]:
# see what the model learned
model.show_most_informative_features(20)

Most Informative Features
                      a  = True                F : M      =     34.7 : 1.0
                      f  = True                M : F      =     28.6 : 1.0
                      k  = True                M : F      =     28.1 : 1.0
                      rv = True                M : F      =     27.5 : 1.0
                      fo = True                M : F      =     22.5 : 1.0
                      lt = True                M : F      =     21.9 : 1.0
                      hu = True                M : F      =     21.9 : 1.0
                      iu = True                M : F      =     18.5 : 1.0
                      rw = True                M : F      =     17.4 : 1.0
                      v  = True                M : F      =     16.2 : 1.0
                      p  = True                M : F      =     15.1 : 1.0
                      sp = True                M : F      =     15.1 : 1.0
                      rk = True                M : F      =     14.5 : 1.0

The last letter appears to matter more than the first: five of the features shown (`a `, `f `, `k `, `v `, `p `) are a final letter followed by the end-of-name space, while none of them pairs the start-of-name space with a first letter.  Also notice that every informative feature shown other than the top one indicates a male name.  This may be partly because female names account for 63% of the names here.  We might squeeze a little more accuracy out of the model by balancing the training set, i.e. undersampling female names.
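One way the undersampling idea could be tried is sketched below. `undersample` is a hypothetical helper, not something the notebook defines, and the toy list stands in for `trains`; whether balancing actually improves dev accuracy has not been tested here.

```python
import random
from collections import defaultdict

def undersample(labeled, seed=620):
    """Randomly downsample every class to the size of the smallest class.

    `labeled` is a list of (item, label) tuples, shaped like `trains`.
    """
    by_label = defaultdict(list)
    for item, label in labeled:
        by_label[label].append((item, label))

    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # local generator, so global shuffles are unaffected

    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy data standing in for the real training split.
toy = [(' anna ', 'F'), (' beth ', 'F'), (' cora ', 'F'), (' dave ', 'M')]
balanced = undersample(toy)
print(len(balanced))  # 2: one 'F' and one 'M'
```

On the real data this would be called as `undersample(trains)` before building `trainfeats`, leaving the dev and test sets unbalanced so that accuracy stays comparable with the 0.816 reported above.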