## Project 3

### Alice Ding, Shoshana Farber, Christian Uriostegui

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

For this project, we'll use supervised classification to build a name gender classifier. According to the book, "a classifier is called *supervised* if it is built on training corpora containing the correct label for each input." Within this framework, we take a list of labeled names, split it into training, dev-test, and test sets, build a model on the training data, and then use that model to classify the held-out names.

We'll be taking a lot of inspiration from the book in order to do this analysis.

### Importing Packages and the Data

In [1]:
import nltk
import numpy as np
from nltk.classify import apply_features
import pandas as pd

# grab all of the names
from nltk.corpus import names
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

# use the random package to shuffle the names
import random
rng = random.Random()
rng.seed(1358)
rng.shuffle(names)

# create the subsets of names
train_names = names[1000:]
devtest_names = names[500:1000]
test_names = names[:500]
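The slicing above can be wrapped in a small helper so that the subset sizes are explicit. This is only a minimal sketch: `split_names` and the `toy` list are hypothetical stand-ins that are not part of the notebook, used here so the block runs without downloading the corpus. Note that the Names Corpus holds 7,944 names, so the training slice ends up with 6,944 names rather than the roughly 6,900 mentioned in the exercise.

```python
import random

def split_names(labeled_names, n_test=500, n_devtest=500, seed=1358):
    """Shuffle a copy of labeled_names and return (train, devtest, test)."""
    shuffled = list(labeled_names)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Made-up stand-in data (hypothetical), not the real corpus.
toy = [("name%d" % i, "female" if i % 2 else "male") for i in range(1200)]
train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))  # 200 500 500
```

Seeding a private `random.Random` instance, as the notebook does, keeps the split reproducible without touching the global random state.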

### Feature Identification

#### Last Letter

Female and male names have distinctive characteristics: names ending in a, e, and i are likely to be female, while names ending in s, r, o, and k are likely to be male. Borrowing the code from chapter 6 of Natural Language Processing with Python, we can write a function that takes a name and returns a dictionary holding its last letter.

In [2]:
def gender_features(word):
    return {'last_letter': word[-1]}
gender_features('Shrek')

{'last_letter': 'k'}

Now, we'll apply this function to the training, dev-test, and test names we split off earlier. We'll then train a Naive Bayes classifier on the training set.

In [3]:
# using apply_features to avoid using a large amount of memory
train_set = apply_features(gender_features, train_names)
test_set = apply_features(gender_features, test_names)
devtest_set = apply_features(gender_features, devtest_names)
classifier1 = nltk.NaiveBayesClassifier.train(train_set)
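The memory saving mentioned in the comment comes from computing features on demand rather than storing a feature dictionary for every name up front. The class below is a rough illustration of that idea only; `LazyFeatures` is a hypothetical name and this is not NLTK's actual implementation of `apply_features`.

```python
class LazyFeatures:
    """Computes (features, label) pairs on access instead of storing them all."""

    def __init__(self, extractor, labeled_names):
        self.extractor = extractor
        self.labeled_names = labeled_names
        self.calls = 0  # how many times the extractor has been run

    def __len__(self):
        return len(self.labeled_names)

    def __getitem__(self, index):
        # Integer indices only; slicing is left out of this sketch.
        name, label = self.labeled_names[index]
        self.calls += 1
        return (self.extractor(name), label)


def last_letter(name):
    return {"last_letter": name[-1]}

pairs = [("Shrek", "male"), ("Fiona", "female")]
lazy = LazyFeatures(last_letter, pairs)
print(lazy.calls)  # 0: nothing has been computed yet
print(lazy[1])     # ({'last_letter': 'a'}, 'female')
print(lazy.calls)  # 1: only the requested item was computed
```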

What if we try using this classifier on a few names that aren't in either dataset?

In [4]:
print('Tyrion:', classifier1.classify(gender_features('Tyrion')))
print('Cersei:', classifier1.classify(gender_features('Cersei')))

Tyrion: male
Cersei: female


For these Game of Thrones characters, this seemed to work well.

How does the accuracy of our model look?

In [5]:
nltk.classify.accuracy(classifier1, test_set)

0.752
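An accuracy figure is easier to judge against a baseline that ignores the name entirely and always guesses the most common label. Here is a minimal sketch using made-up labels rather than the corpus; `majority_baseline` is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

def majority_baseline(train_labels, eval_labels):
    """Accuracy of always guessing the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, hits / len(eval_labels)

# Invented labels for illustration only.
train_labels = ["female"] * 6 + ["male"] * 4
eval_labels = ["female", "male", "female", "female", "male"]
print(majority_baseline(train_labels, eval_labels))  # ('female', 0.6)
```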

75.2%, not too bad. Which features did the classifier find most effective for classifying each name's gender?

In [6]:
classifier1.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     43.1 : 1.0
             last_letter = 'k'              male : female =     29.9 : 1.0
             last_letter = 'f'              male : female =     15.3 : 1.0
             last_letter = 'd'              male : female =     10.6 : 1.0
             last_letter = 'm'              male : female =      9.0 : 1.0


Interestingly, these are likelihood ratios, so they tell us that in the training set:

- A final a is 43.1 times more common among female names than among male names
- A final k is 29.9 times more common among male names than among female names
- A final f is 15.3 times more common among male names than among female names
- A final d is 10.6 times more common among male names than among female names
- A final m is 9.0 times more common among male names than among female names
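The ratios in this output compare how often a feature value occurs under each label. Below is a rough sketch of that calculation with add-0.5 smoothing. The smoothing constant mirrors what we understand to be NLTK's default estimator, which is an assumption, and the counts are invented for illustration; `likelihood_ratio` is a hypothetical helper.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, bins, gamma=0.5):
    """Ratio P(value | label a) / P(value | label b) with add-gamma smoothing."""
    p_a = (count_a + gamma) / (total_a + gamma * bins)
    p_b = (count_b + gamma) / (total_b + gamma * bins)
    return p_a / p_b

# Hypothetical counts: 1500 of 4000 female names and 20 of 2500 male names end in 'a'.
ratio = likelihood_ratio(1500, 4000, 20, 2500, bins=26)
print(round(ratio, 1))
```

Because the ratio is computed per label, it does not account for there being more female than male names overall. It describes how distinctive the ending is, not the final predicted odds.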

What other features can we add to this model to make it a little better?

#### Adding First Letter and Count of All Letters

Let's make a new function that records the first letter, the last letter, and, for every letter of the alphabet, how many times it occurs in the name and whether it occurs at all. We'll then use this to train a new model and determine its accuracy.

In [7]:
def gender_features2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

# using apply_features to avoid using a large amount of memory
train_set = apply_features(gender_features2, train_names)
test_set = apply_features(gender_features2, test_names)
devtest_set = apply_features(gender_features2, devtest_names)
classifier2 = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier2, test_set))

0.756


At 75.6%, this is slightly higher than our previous 75.2%, a gain of just 2 names out of the 500 in the test set.

Next, we can refine the feature set through error analysis. Let's start with the classifier's accuracy on the dev-test set.

In [8]:
print(nltk.classify.accuracy(classifier2, devtest_set))

0.75


With a 75% accuracy, let's see what some of the errors were.

In [9]:
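errors = []
# for each name and its given tag
for (name, tag) in devtest_names:
    # get the classifier's guess
    # note: the features here come from gender_features (last letter only), not the
    # gender_features2 extractor that classifier2 was trained with. classifier2 has
    # never seen a 'last_letter' feature, so every guess falls back to the most common
    # training label, which is why each error below is a male name guessed as female.
    # Passing gender_features2(name) instead would show classifier2's actual errors.
    guess = classifier2.classify(gender_features(name))
    # if the guess does not match the correct classification
    if guess != tag:
        # add it to the list of errors
        errors.append((tag, guess, name))

errors = pd.DataFrame(sorted(errors), columns=['actual', 'guess', 'name'])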

errors.head(20)

| | actual | guess | name |
|---|---|---|---|
| 0 | male | female | Abdel |
| 1 | male | female | Abelard |
| 2 | male | female | Addie |
| 3 | male | female | Allah |
| 4 | male | female | Amadeus |
| 5 | male | female | Ambrose |
| 6 | male | female | Anatole |
| 7 | male | female | Andie |
| 8 | male | female | Antin |
| 9 | male | female | Armando |


The error list gives us a place to start refining. As the book notes, multi-letter suffixes can be more indicative of gender than a single letter: names that end with `yn` are likely to be female, while names that end with just `n` are typically male. To account for this, let's add a feature for the second-to-last letter alongside the last letter. It may also help to add bigrams (pairs of consecutive letters).
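To make the suffix and bigram ideas concrete, here is a small sketch of what these features look like for one name. `suffix_and_bigrams` is a hypothetical helper written in plain Python, not the notebook's feature extractor. Note that indexing `name[-2]` picks out only the second-to-last letter, whereas the slice `name[-2:]` keeps the two-letter ending such as `yn`.

```python
def suffix_and_bigrams(name):
    """Return the last letter, last two letters, and letter bigrams of a name."""
    lowered = name.lower()
    bigrams = [a + b for a, b in zip(lowered, lowered[1:])]
    return {
        "suffix1": lowered[-1:],        # last letter
        "penultimate": lowered[-2:-1],  # second-to-last letter on its own
        "suffix2": lowered[-2:],        # last two letters together
        "bigrams": bigrams,             # consecutive letter pairs, in order
    }

print(suffix_and_bigrams("Kathryn"))
# {'suffix1': 'n', 'penultimate': 'y', 'suffix2': 'yn',
#  'bigrams': ['ka', 'at', 'th', 'hr', 'ry', 'yn']}
```

Slices are used instead of single indices so that very short names do not raise an `IndexError`.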

#### Adding Combinations of Letters

In [10]:
def gender_features3(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["suffix1"] = name[-1].lower()
    # second-to-last letter on its own (name[-2:] would give the two-letter ending)
    features["suffix2"] = name[-2].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    # positional bigrams: b0 is the first pair of letters, b1 the second, and so on
    for i, w in enumerate(nltk.bigrams(name)):
        features['b' + str(i)] = (w[0] + w[1]).lower()
    return features

# using apply_features to avoid using a large amount of memory
train_set = apply_features(gender_features3, train_names)
devtest_set = apply_features(gender_features3, devtest_names)
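# note: this evaluation set is built from train_names, so the score below is a training-set accuracy
test_set = apply_features(gender_features3, train_names)
classifier3 = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier3, test_set))

0.8355414746543779


This reports 83.6%, the highest figure we've seen so far, but it is not a held-out score: `test_set` in this cell is built from `train_names`, so the classifier is being scored on the 6,944 names it was trained on. The number therefore can't be compared directly with the 75.2% and 75.6% measured earlier on the 500 test names.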

What are some of the most informative features?

In [11]:
classifier3.show_most_informative_features(20)

Most Informative Features
                 suffix1 = 'a'            female : male   =     43.1 : 1.0
                 suffix1 = 'k'              male : female =     29.9 : 1.0
                      b5 = 'ta'           female : male   =     20.9 : 1.0
                      b5 = 'na'           female : male   =     16.6 : 1.0
                 suffix1 = 'f'              male : female =     15.3 : 1.0
                      b6 = 'rd'             male : female =     15.2 : 1.0
                      b4 = 'us'             male : female =     13.9 : 1.0
                      b3 = 'to'             male : female =     13.5 : 1.0
                      b2 = 'sa'           female : male   =     13.5 : 1.0
                      b5 = 'rd'             male : female =     12.9 : 1.0
                      b4 = 'na'           female : male   =     12.6 : 1.0
                      b5 = 'ra'           female : male   =     12.1 : 1.0
                      b2 = 'rk'             male : female =     11.6 : 1.0

Suffixes and bigrams seem to be the most informative here, which is very interesting!
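The keys `b2`, `b5`, `b6`, and so on in this output are positions counted from the start of the name, which is why the same ending can show up under different keys (for example `rd` under both `b5` and `b6`). The sketch below reproduces that keying in plain Python; `positional_bigrams` is a hypothetical helper, and the two names are our own examples.

```python
def positional_bigrams(name):
    """Map 'b0', 'b1', ... to successive lowercase letter pairs of the name."""
    lowered = name.lower()
    pairs = zip(lowered, lowered[1:])
    return {"b%d" % i: a + b for i, (a, b) in enumerate(pairs)}

for name in ("Richard", "Clifford"):
    features = positional_bigrams(name)
    last_key = "b%d" % (len(name) - 2)  # key of the final bigram
    print(name, last_key, features[last_key])
# Richard b5 rd
# Clifford b6 rd
```

With this scheme, the final bigram of a seven-letter name and that of an eight-letter name count as different features, even when the letters are the same.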

Let's go over the overall performance of the three models we created here.

#### Model Comparisons

In [12]:
# note: test_set still holds the gender_features3 features of train_names from the previous
# section, so all three classifiers are scored on the training names using classifier3's features
classifier1_accuracy = nltk.classify.accuracy(classifier1, test_set)
classifier2_accuracy = nltk.classify.accuracy(classifier2, test_set)
classifier3_accuracy = nltk.classify.accuracy(classifier3, test_set)

print("Accuracy with Last Letter Classifier check:                                                {}".format(classifier1_accuracy))
print("Accuracy with First and Last Letter with Letter Counts Classifier check:                   {}".format(classifier2_accuracy))
print("Accuracy with First and Last Letter with Letter Counts Classifier and Bigrams check:       {}".format(classifier3_accuracy))

Accuracy with Last Letter Classifier check:                                                0.6304723502304147
Accuracy with First and Last Letter with Letter Counts Classifier check:                   0.7111175115207373
Accuracy with First and Last Letter with Letter Counts Classifier and Bigrams check:       0.8355414746543779


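Taken at face value, the last model does much better at predicting gender from names. These three numbers are not a like-for-like comparison, though. At this point `test_set` holds `gender_features3` features computed from `train_names`, so classifier3 is scored on the names it was trained on, while classifier1 and classifier2 receive features they were never trained on. That is why they fall from 75.2% and 75.6% on the held-out test names to 63.0% and 71.1% here. A fair comparison would score each classifier with its own feature extractor on the same held-out names.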
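One way to keep such a comparison consistent is to pair every classifier with the feature extractor it was trained with and score all pairs on the same held-out names. The sketch below assumes only that a classifier exposes a `classify(featureset)` method, as NLTK classifiers do. `accuracy`, `LastLetterRule`, and the `held_out` list are hypothetical stand-ins so the block runs without NLTK or the corpus.

```python
def accuracy(classifier, extractor, labeled_names):
    """Share of names whose predicted label matches the true label."""
    hits = sum(1 for name, label in labeled_names
               if classifier.classify(extractor(name)) == label)
    return hits / len(labeled_names)


class LastLetterRule:
    """Toy stand-in for a trained classifier: a final vowel means female."""

    def classify(self, features):
        return "female" if features["last_letter"] in "aeiou" else "male"


def last_letter(name):
    return {"last_letter": name[-1].lower()}

# Invented held-out names for illustration only.
held_out = [("Anna", "female"), ("Mark", "male"),
            ("Luke", "male"), ("Ruth", "female")]

# Each entry pairs a classifier with its own feature extractor.
models = {"last letter": (LastLetterRule(), last_letter)}
for label, (clf, extractor) in models.items():
    print(label, accuracy(clf, extractor, held_out))  # last letter 0.5
```

In the notebook, this would pair `classifier1` with `gender_features`, `classifier2` with `gender_features2`, and `classifier3` with `gender_features3`, all scored on `test_names`.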