# Project 3

Group Member: Bryan Persaud, Matthew Baker, Zhi Ying Chen

For this project, please work with the entire class as one collaborative group! Your project should be
submitted (as a Jupyter Notebook via GitHub) by end of the due date. The group should present their
code and findings in our meetup.

In [1]:
import nltk
from collections import Counter
from nltk import download
from nltk.corpus import names
from nltk.util import ngrams
from nltk.tokenize.sonority_sequencing import SyllableTokenizer
from nltk import NaiveBayesClassifier
from nltk import DecisionTreeClassifier
from nltk import classify
from nltk.classify import apply_features
import matplotlib.pyplot as plt
import seaborn as sns
import random
import pandas as pd
import numpy as np

# jupyter setup
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%matplotlib inline

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python,
and any features you can think of, build the best name gender classifier you can. 

In [2]:
download('names')
males = [(name.lower().strip(), 'male') for name in names.words('male.txt')]
females = [(name.lower().strip(), 'female') for name in names.words('female.txt')]
names = males + females
random.shuffle(names)

[nltk_data] Downloading package names to
[nltk_data]     C:\Users\Bryan\AppData\Roaming\nltk_data...
[nltk_data]   Package names is already up-to-date!


True

In [74]:
# List of Top 10 names
names[0:10]

[('maryjo', 'female'),
 ('nanice', 'female'),
 ('daphene', 'female'),
 ('ceciley', 'female'),
 ('hope', 'female'),
 ('mack', 'male'),
 ('pat', 'male'),
 ('sunshine', 'female'),
 ('charin', 'female'),
 ('maggi', 'female')]

Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the devtest set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.

How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

In [38]:
# Length of names
len(names)

7944

In [39]:
train_set = names[1000:]
test_set = names[:500]
devtest_names = names[500:1000]

In [40]:
# Length of train dataset
len(train_set)

6944

In [41]:
# length of test dataset
len(test_set)

500

In [42]:
# length of devtest set
len(devtest_names)

500

We take a look at the data by looking at the number of names in the dataset and then split the data into a train/test set.

## Naive Baiyer Classifier

In Naive Bayes classifiers, every feature gets a say in determining which label should be assigned to a given input value. To choose a label for an input value, the Naive Bayes classifier begins by calculating the prior probability of each label, which is determined by checking the frequency of each label in the training set. The contribution from each feature is then combined with this prior probability, to arrive at a likelihood estimate for each label.

### Feature 1

The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll only include only last character of word.

In [75]:
# Book's classifier - baseline
def gender_features(word):
    w = word.lower()
    return {'last_letter': w[-1]} 

The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. For reasons discussed below, it is important that we employ a separate dev-test set for error analysis, rather than just using the test set.

In [76]:
featuresets = [(gender_features(n), g) for (n,g) in names]
train_featuresets = [(gender_features(n), g) for (n, g) in train_set]
test_featuresets = [(gender_features(n), g) for (n, g) in test_set]
devtest_featuresets = [(gender_features(n), g) for (n, g) in devtest_names]

We apply Naive Baiyer Classifier to calculate the accuracy for the initial model which takes into consideration the last letter of each name.

In [162]:
nb1 = nltk.NaiveBayesClassifier.train(train_featuresets)

print ('Accuracy: %4.2f' %nltk.classify.accuracy(nb1, devtest_featuresets))

Accuracy: 0.76


In [175]:
# Top 5 of most effective for distinguishing the names’ genders:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     41.3 : 1.0
             last_letter = 'a'            female : male   =     37.2 : 1.0
             last_letter = 'f'              male : female =     15.4 : 1.0
             last_letter = 'd'              male : female =     10.7 : 1.0
             last_letter = 'm'              male : female =     10.3 : 1.0


This listing shows that the names in the training set that end in "a" are female 41.3 times more than they are male, but names that end in "k" are male 37.2 times more than they are female. These ratios are known as likelihood ratios, and can be useful for comparing different feature-outcome relationships.

Using the dev-test set, we generate a list of the errors that the classifier makes when predicting name genders:

In [164]:
def error_analysis(FEATURES):        
    errors = [] 
    for (name, tag) in devtest_names:
        guess = classifier.classify(FEATURES(name))
        if guess != tag:
            errors.append((tag, guess, name))
    print("Number of Errors: ", len(errors))    
    for (tag, guess, name) in sorted(errors):
        print('correct = {:<8} guess = {:<8s} name = {:<30}'.format(tag, guess, name))

In [165]:
error_analysis(gender_features)

Number of Errors:  122
correct = female   guess = male     name = aileen                        
correct = female   guess = male     name = alisun                        
correct = female   guess = male     name = alys                          
correct = female   guess = male     name = arabel                        
correct = female   guess = male     name = bird                          
correct = female   guess = male     name = bridget                       
correct = female   guess = male     name = brynn                         
correct = female   guess = male     name = carol                         
correct = female   guess = male     name = caroleen                      
correct = female   guess = male     name = caroljean                     
correct = female   guess = male     name = charis                        
correct = female   guess = male     name = christin                      
correct = female   guess = male     name = clair                         
correct = femal

Number of Errors from Feature 1: 122

### Feature 2

For Feature 2, we adjust our feature extractor to include features for two-letter suffixes and whether last letter is a vowel (aeiou).

In [109]:
def gender_features2(word):
    return {'suffix1': word[-1],
            'suffix2': word[-2:],
            'last_is_vowel' : (word[-1] in 'aeiou')}

In [110]:
train_featuresets2 = [(gender_features2(n), g) for (n, g) in train_set]
test_featuresets2 = [(gender_features2(n), g) for (n, g) in test_set]
devtest_featuresets2 = [(gender_features2(n), g) for (n, g) in devtest_names]

In [166]:
nb2 = nltk.NaiveBayesClassifier.train(train_featuresets2)

print ('Accuracy: %4.2f' %nltk.classify.accuracy(nb2, devtest_featuresets2))

Accuracy: 0.75


In [167]:
# Top 5 of most effective for distinguishing the names’ genders:
classifier2.show_most_informative_features(5)

Most Informative Features
                 suffix2 = 'na'           female : male   =    157.3 : 1.0
                 suffix2 = 'la'           female : male   =     69.8 : 1.0
                 suffix2 = 'ta'           female : male   =     43.3 : 1.0
                 suffix1 = 'k'              male : female =     41.3 : 1.0
                 suffix2 = 'ld'             male : female =     39.7 : 1.0


This listing shows that the names in the training set that suffix in "na" are female 157.3 times more than they are male, but names that suffix in "la" are male 69.8 times more than they are female. 

In [168]:
error_analysis(gender_features2)

Number of Errors:  181
correct = male     guess = female   name = adolphe                       
correct = male     guess = female   name = adrian                        
correct = male     guess = female   name = agustin                       
correct = male     guess = female   name = ahmad                         
correct = male     guess = female   name = alaa                          
correct = male     guess = female   name = alan                          
correct = male     guess = female   name = aldrich                       
correct = male     guess = female   name = alic                          
correct = male     guess = female   name = alister                       
correct = male     guess = female   name = allah                         
correct = male     guess = female   name = andre                         
correct = male     guess = female   name = archie                        
correct = male     guess = female   name = armstrong                     
correct = male 

The accuracy result of feature 2 has similar accuracy as feature 1, and the number of Errors has been increased to 181.

### Feature 3

For Feature 3, we use first letter, last letter and two suffixes as its feature

In [120]:
def gender_features3(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in "aeiou":
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    features["suffix2"] = name[-2:].lower()
    features["suffix3"] = name[-3:].lower()
    return features

In [121]:
train_featuresets3 = [(gender_features3(n), g) for (n, g) in train_set]
test_featuresets3 = [(gender_features3(n), g) for (n, g) in test_set]
devtest_featuresets3 = [(gender_features3(n), g) for (n, g) in devtest_names]

In [169]:
nb3 = nltk.NaiveBayesClassifier.train(train_featuresets3)

print ('Accuracy: %4.2f' %nltk.classify.accuracy(nb3, devtest_featuresets3))

Accuracy: 0.81


In [170]:
# Top 5 of most effective for distinguishing the names’ genders:
classifier3.show_most_informative_features(5)

Most Informative Features
                 suffix2 = 'na'           female : male   =    157.3 : 1.0
                 suffix2 = 'la'           female : male   =     69.8 : 1.0
                 suffix2 = 'ta'           female : male   =     43.3 : 1.0
             last_letter = 'k'              male : female =     41.3 : 1.0
                 suffix2 = 'ld'             male : female =     39.7 : 1.0


In [171]:
error_analysis(gender_features3)

Number of Errors:  122
correct = female   guess = male     name = aileen                        
correct = female   guess = male     name = alisun                        
correct = female   guess = male     name = alys                          
correct = female   guess = male     name = arabel                        
correct = female   guess = male     name = bird                          
correct = female   guess = male     name = bridget                       
correct = female   guess = male     name = brynn                         
correct = female   guess = male     name = carol                         
correct = female   guess = male     name = caroleen                      
correct = female   guess = male     name = caroljean                     
correct = female   guess = male     name = charis                        
correct = female   guess = male     name = christin                      
correct = female   guess = male     name = clair                         
correct = femal

The result is much better and has the best accuracy among the 3 feature extractors that we have built. And the number of Errors is 122 which is similar as Feature 1.

## Decision Tree Classifier

A decision tree is a flowchart-like tree structure where an internal node represents feature(or attribute), the branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. It learns to partition on the basis of the attribute value. It partitions the tree in recursively manner call recursive partitioning. This flowchart-like structure helps you in decision making. It's visualization like a flowchart diagram which easily mimics the human level thinking. That is why decision trees are easy to understand and interpret.

### Decision Tree Feature 1

Decision Tree Feature 1 , we do the same way as Navie Baiyer which only include only last character of word.

In [172]:
dt1 = nltk.classify.DecisionTreeClassifier.train(train_featureset)

print ('Accuracy: %4.2f' %nltk.classify.accuracy(dt1, devtest_featureset))

Accuracy: 0.76


In [208]:
errors2 = []

for (name,tag) in devtest_names:
    guess = dt1.classify(gender_features(name))
    if guess != tag:
        errors2.append((tag,guess,name))

In [210]:
for (tag,guess,name) in sorted(errors2):
    print('correct={:<8} guess={:<8s} name={:30}'.format(tag,guess,name))

correct=female   guess=male     name=aileen                        
correct=female   guess=male     name=alisun                        
correct=female   guess=male     name=alys                          
correct=female   guess=male     name=arabel                        
correct=female   guess=male     name=bird                          
correct=female   guess=male     name=bridget                       
correct=female   guess=male     name=brynn                         
correct=female   guess=male     name=carol                         
correct=female   guess=male     name=caroleen                      
correct=female   guess=male     name=caroljean                     
correct=female   guess=male     name=charis                        
correct=female   guess=male     name=christin                      
correct=female   guess=male     name=clair                         
correct=female   guess=male     name=claribel                      
correct=female   guess=male     name=darb       

In [211]:
print("Number of Errors: ", len(errors2)) 

Number of Errors:  122


### Decision Tree Feature 2

Decision Tree Feature 2, we add last two characters as suffix and whether last letter is a vowel (aeiou).

In [197]:
dt2 = nltk.classify.DecisionTreeClassifier.train(train_featuresets2)

print ('Accuracy: %4.2f' %nltk.classify.accuracy(dt2, devtest_featuresets2))

Accuracy: 0.79


In [215]:
errors2 = []

for (name,tag) in devtest_names:
    guess = dt2.classify(gender_features(name))
    if guess != tag:
        errors2.append((tag,guess,name))

In [216]:
for (tag,guess,name) in sorted(errors2):
    print('correct={:<8} guess={:<8s} name={:30}'.format(tag,guess,name))

correct=male     guess=female   name=adolphe                       
correct=male     guess=female   name=adrian                        
correct=male     guess=female   name=agustin                       
correct=male     guess=female   name=ahmad                         
correct=male     guess=female   name=alaa                          
correct=male     guess=female   name=alan                          
correct=male     guess=female   name=aldrich                       
correct=male     guess=female   name=alic                          
correct=male     guess=female   name=alister                       
correct=male     guess=female   name=allah                         
correct=male     guess=female   name=andre                         
correct=male     guess=female   name=archie                        
correct=male     guess=female   name=armstrong                     
correct=male     guess=female   name=artur                         
correct=male     guess=female   name=aubert     

In [217]:
print("Number of Errors: ", len(errors2))

Number of Errors:  181


The accuracy result of feature 2 is higher than feature 1, and the number of Errors has been increased to 181.

### Decision Tree Feature 3

Decision Tree Feature 3, we use first letter, last letter and two suffixes as its feature

In [199]:
dt3 = nltk.classify.DecisionTreeClassifier.train(train_featuresets3)

print ('Accuracy: %4.2f' %nltk.classify.accuracy(dt3, devtest_featuresets3))

Accuracy: 0.77


In [220]:
errors2 = []

for (name,tag) in devtest_names:
    guess = dt3.classify(gender_features(name))
    if guess != tag:
        errors2.append((tag,guess,name))

In [221]:
for (tag,guess,name) in sorted(errors2):
    print('correct={:<8} guess={:<8s} name={:30}'.format(tag,guess,name))

correct=male     guess=female   name=adolphe                       
correct=male     guess=female   name=adrian                        
correct=male     guess=female   name=agustin                       
correct=male     guess=female   name=ahmad                         
correct=male     guess=female   name=alaa                          
correct=male     guess=female   name=alan                          
correct=male     guess=female   name=aldrich                       
correct=male     guess=female   name=alic                          
correct=male     guess=female   name=alister                       
correct=male     guess=female   name=allah                         
correct=male     guess=female   name=andre                         
correct=male     guess=female   name=archie                        
correct=male     guess=female   name=armstrong                     
correct=male     guess=female   name=artur                         
correct=male     guess=female   name=aubert     

In [219]:
print("Number of Errors: ", len(errors2))

Number of Errors:  181


The accuracy result of feature 3 has similar accuracy as feature 1, and has the same number of errors as feature 2.

# Summary

Combine the result of accuracy and show the difference for Navie Baiyer

In [194]:
accuracy_dev_test_df = pd.DataFrame()
accuracy_dev_test_df = accuracy_dev_test_df.append({
    "Features": nltk.classify.accuracy(nb1, devtest_featuresets),
    "Features2": nltk.classify.accuracy(nb2, devtest_featuresets2),
    "Features3": nltk.classify.accuracy(nb3, devtest_featuresets3)
    }, ignore_index=True)
accuracy_dev_test_df

Unnamed: 0,Features,Features2,Features3
0,0.756,0.75,0.808


Testing Accuracy

Navie Baiyer: Let's compare the features with features2 and feature3 using the test set. 

In [195]:
accuracy_test_df = pd.DataFrame()
accuracy_test_df = accuracy_test_df.append({
     "Features": nltk.classify.accuracy(nb1, test_featuresets),
     "Features2":  nltk.classify.accuracy(nb2, test_featuresets2),
     "Features3": nltk.classify.accuracy(nb3, test_featuresets3)
    }, ignore_index=True)
accuracy_test_df

Unnamed: 0,Features,Features2,Features3
0,0.776,0.776,0.806


Combine the result of accuracy and show the difference for Decision Tree

In [201]:
accuracy_dev_test_df_dt = pd.DataFrame()
accuracy_dev_test_df_dt = accuracy_dev_test_df_dt.append({
    "Features": nltk.classify.accuracy(dt1, devtest_featuresets),
    "Features2": nltk.classify.accuracy(dt2, devtest_featuresets2),
    "Features3": nltk.classify.accuracy(dt3, devtest_featuresets3)
    }, ignore_index=True)
accuracy_dev_test_df_dt

Unnamed: 0,Features,Features2,Features3
0,0.756,0.792,0.774


Decision Tree: Let's compare the features with features2 and feature3 using the test set. 

In [202]:
accuracy_test_df_dt = pd.DataFrame()
accuracy_test_df_dt = accuracy_test_df_dt.append({
     "Features": nltk.classify.accuracy(dt1, test_featuresets),
     "Features2":  nltk.classify.accuracy(dt2, test_featuresets2),
     "Features3": nltk.classify.accuracy(dt3, test_featuresets3)
    }, ignore_index=True)
accuracy_test_df_dt

Unnamed: 0,Features,Features2,Features3
0,0.776,0.76,0.736


# Conclusion

We have created a gender classification models using Naive Bayes and Decision Tree with using different feature sets. Each feature sets has their own accuracy, number of error and testing accuracy. From the result, we can tell that Feature 3 of Navie Baiyer has the best result with highest accuracy and testing accuracy. Also, from the summary, we also can tell that Feature 2 of Decision Tree has the highest accuracy but does not get the highet testing accuracy from that three features.

# Video Presentation

The video presentation can be found [here](https://youtu.be/Ac6Xg4Tlrow).