## Project 3

### Alice Ding, Shoshana Farber, Christian Uriostegui

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

For this project, we'll use supervised classification to build a name gender classifier. According to the book, "a classifier is called *supervised* if it is built on training corpora containing the correct label for each input." Following this framework, we take a list of labeled names, split it into training and testing data, train a model on the training data, and then use that model to classify the testing data.

### Importing Packages and the Data

In [1]:
import nltk
import numpy as np
from nltk.classify import apply_features

# grab all of the names
from nltk.corpus import names
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

# shuffle the names; random.shuffle draws from the standard-library RNG,
# so it is seeded with random.seed (np.random.seed has no effect on it)
import random
random.seed(135)
random.shuffle(names)
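
Because every later slice of `names` depends on the shuffle order, it helps to confirm that the shuffle is repeatable. The sketch below is only an illustration: `shuffled_copy` and `toy_names` are hypothetical, and it uses a dedicated `random.Random` instance so the result does not depend on any global seeding.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded generator."""
    rng = random.Random(seed)  # independent of the global random state
    out = list(items)
    rng.shuffle(out)
    return out

# Hypothetical stand-in for the labeled name list.
toy_names = ['Ada', 'Ben', 'Cleo', 'Dev', 'Eve', 'Finn']

first = shuffled_copy(toy_names, seed=135)
second = shuffled_copy(toy_names, seed=135)
print(first == second)  # True: same seed, same order
```

With a fixed order, the train, dev-test, and test slices contain the same names on every run, which keeps accuracy figures comparable between experiments.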

### Feature Identification

#### Last Letter

Female and male names have distinctive characteristics: names ending in a, e, and i are likely to be female, while names ending in s, r, o, and k are likely to be male. Using this code from chapter 6, we can build a dictionary containing that information for a given name.

In [2]:
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('Shrek')

{'last_letter': 'k'}
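
One way to check a claim like "names ending in a are usually female" is to tally final letters by gender. The following is a minimal sketch on a hypothetical hand-made sample (`sample` and `last_letter_counts` are not part of the original notebook); the same function could be pointed at the full labeled list.

```python
from collections import Counter, defaultdict

# Hypothetical labeled sample; the real corpus is far larger.
sample = [('Anna', 'female'), ('Maria', 'female'), ('Sophie', 'female'),
          ('Mark', 'male'), ('Frank', 'male'), ('Carlos', 'male'),
          ('Jamie', 'female'), ('Jamie', 'male')]

def last_letter_counts(labeled_names):
    """Map each final letter to a Counter of the genders seen with it."""
    counts = defaultdict(Counter)
    for name, gender in labeled_names:
        counts[name[-1].lower()][gender] += 1
    return counts

counts = last_letter_counts(sample)
for letter in sorted(counts):
    print(letter, dict(counts[letter]))
```

Letters whose counts are split between both genders, such as `e` in this sample, are the ones a last-letter feature alone cannot resolve.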

Now, we'll apply this function to our names dataset and split the data into a training set and a test set. We'll then train a naive Bayes classifier on the training set.

In [3]:
# using apply_features to avoid using a large amount of memory
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[:500])
classifier = nltk.NaiveBayesClassifier.train(train_set)

What if we try using this classifier on a few names that aren't in either dataset?

In [4]:
print('Tyrion:', classifier.classify(gender_features('Tyrion')))
print('Cersei:', classifier.classify(gender_features('Cersei')))

Tyrion: male
Cersei: female


For these Game of Thrones characters, this seemed to work well.
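
To make the decision rule concrete, here is a rough sketch of naive Bayes with a single feature: pick the label that maximizes P(label) × P(value | label). This is not NLTK's implementation. The function names and toy data are hypothetical, and add-one smoothing is an assumption chosen for simplicity that may differ from the estimator NLTK uses.

```python
import math
from collections import Counter, defaultdict

def train_single_feature_nb(labeled_values):
    """Count label frequencies and feature-value frequencies per label."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)
    for value, label in labeled_values:
        label_counts[label] += 1
        value_counts[label][value] += 1
    vocab = {value for value, _ in labeled_values}
    return label_counts, value_counts, vocab

def classify(model, value):
    """Return the label maximizing log P(label) + log P(value | label)."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in sorted(label_counts.items()):
        prior = n / total
        # add-one smoothing (an assumption) so unseen values never get probability zero
        likelihood = (value_counts[label][value] + 1) / (n + len(vocab) + 1)
        score = math.log(prior) + math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical (last_letter, gender) pairs.
toy = [('a', 'female'), ('a', 'female'), ('e', 'female'), ('n', 'female'),
       ('k', 'male'), ('n', 'male'), ('n', 'male')]
model = train_single_feature_nb(toy)
print(classify(model, 'a'), classify(model, 'k'), classify(model, 'n'))
```

With more than one feature, the same rule multiplies one likelihood term per feature, which is where the "naive" independence assumption comes in.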

How is the accuracy of our model?

In [5]:
nltk.classify.accuracy(classifier, test_set)

0.76

76% is not too bad. What does the classifier say about which features it found most effective for classifying each name's gender?

In [6]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.8 : 1.0
             last_letter = 'k'              male : female =     32.0 : 1.0
             last_letter = 'f'              male : female =     15.8 : 1.0
             last_letter = 'p'              male : female =     11.8 : 1.0
             last_letter = 'v'              male : female =     11.1 : 1.0


Interestingly, this is telling us that, in the training set:

- a final a is 35.8 times more common among female names than among male names
- a final k is 32 times more common among male names than among female names
- a final f is 15.8 times more common among male names than among female names
- a final p is 11.8 times more common among male names than among female names
- a final v is 11.1 times more common among male names than among female names
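
Each ratio in that output compares how often a feature value appears among names of one label with how often it appears among names of the other. The sketch below illustrates the arithmetic with raw relative frequencies. The counts are hypothetical, not taken from the Names Corpus, and NLTK works from smoothed estimates, so its figures will not match a raw-count calculation exactly.

```python
def likelihood_ratio(count_value_a, total_a, count_value_b, total_b):
    """Ratio of P(value | label a) to P(value | label b), from raw counts."""
    p_a = count_value_a / total_a
    p_b = count_value_b / total_b
    return p_a / p_b

# Hypothetical counts: 1800 of 5000 female names end in 'a',
# versus 30 of 3000 male names.
ratio = likelihood_ratio(1800, 5000, 30, 3000)
print(round(ratio, 1))  # 36.0
```

Note that this is a ratio of likelihoods. It is not the probability that a name with that ending belongs to a given gender, which would also depend on how many names of each gender there are.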

What other features can we add to this model to make it a little better?

#### Adding First Letter and Count of All Letters

Let's make a new function that records the first letter, the last letter, and, for each letter of the alphabet, how many times it occurs in the name and whether it occurs at all. We'll then use it to train a new model and measure its accuracy.

In [7]:
def gender_features_full(name):
    # lowercase once so every feature is case-insensitive
    name = name.lower()
    features = {}
    features["firstletter"] = name[0]
    features["lastletter"] = name[-1]
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        # how many times the letter occurs, and whether it occurs at all
        features["count(%s)" % letter] = name.count(letter)
        features["has(%s)" % letter] = (letter in name)
    return features

# using apply_features to avoid using a large amount of memory
train_set = apply_features(gender_features_full, names[500:])
test_set = apply_features(gender_features_full, names[:500])
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.796


At 79.6%, this improves on our previous 76%!
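
To see what the richer feature set looks like for a single name, the sketch below repeats the function definition so it runs on its own and then inspects the result for one example name.

```python
def gender_features_full(name):
    # same features as the cell above, repeated so this snippet is self-contained
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

features = gender_features_full('Shrek')
# 2 positional features + 26 counts + 26 presence flags
print(len(features))  # 54
print(features['firstletter'], features['lastletter'],
      features['count(e)'], features['has(z)'])  # s k 1 False
```

Each name now yields 54 features instead of one, and most of them are zero or `False` for any given name.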

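The next step is to refine the feature set through error analysis. This requires setting aside a dev-test set: we train on the training set, inspect the errors on the dev-test set, and keep the test set untouched for the final evaluation. Note that the slice used here, `names[500:1500]`, holds out 1,000 names for dev-test rather than the 500 the assignment calls for.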
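
A three-way split of this kind can be sketched with a small helper. Here `three_way_split` is a hypothetical function, and the integer list is a placeholder sized to match the round numbers in the assignment text (500 test, 500 dev-test, 6900 training), not the actual corpus.

```python
def three_way_split(items, n_test, n_devtest):
    """Slice an already-shuffled list into train, dev-test, and test parts."""
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return train, devtest, test

# Placeholder data standing in for the shuffled, labeled names.
placeholder = list(range(7900))
train, devtest, test = three_way_split(placeholder, n_test=500, n_devtest=500)

print(len(train), len(devtest), len(test))  # 6900 500 500
# the three parts should never share an item
print(set(train) & set(devtest), set(devtest) & set(test))  # set() set()
```

Checking that the slices do not overlap is a cheap guard against accidentally evaluating on names the classifier was trained on.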

In [8]:
train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

train_set = [(gender_features(n), g) for (n,g) in train_names]
devtest_set = [(gender_features(n), g) for (n,g) in devtest_names]
test_set = [(gender_features(n), g) for (n,g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.763


With 76.3% accuracy on the dev-test set, let's look at which names were misclassified.

In [9]:
errors = []
# for each name and its given tag
for (name, tag) in devtest_names:
    # get the classifier's guess
    guess = classifier.classify(gender_features(name))
    # if the guess does not match the correct classification
    if guess != tag:
        # add it to the list of errors
        errors.append((tag, guess, name))

for (tag, guess, name) in sorted(errors):
    print('actual=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

actual=female   guess=male     name=Adel                          
actual=female   guess=male     name=Alison                        
actual=female   guess=male     name=Amabel                        
actual=female   guess=male     name=Amargo                        
actual=female   guess=male     name=Anett                         
actual=female   guess=male     name=Angil                         
actual=female   guess=male     name=Ardis                         
actual=female   guess=male     name=Babs                          
actual=female   guess=male     name=Beatriz                       
actual=female   guess=male     name=Beau                          
actual=female   guess=male     name=Beitris                       
actual=female   guess=male     name=Berget                        
actual=female   guess=male     name=Bidget                        
actual=female   guess=male     name=Brandais                      
actual=female   guess=male     name=Brit                      