Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Project is due 10/28.

Source: Natural Language Processing with Python, exercise 6.10.2.

In [98]:
from nltk.corpus import names 
import pandas as pd
import nltk
import random


Below is a feature function based on the textbook example, with name length added as suggested.

In [111]:
def gender_features(word):
    # Features: the final letter and the length of the name
    return {'last_letter': word[-1], 'name_length': len(word)}


In [112]:
name_list = ([(name, 'male') for name in names.words('male.txt')] + [(name,'female') for name in names.words('female.txt')])
random.shuffle(name_list)


Below is a dataframe used to explore the names and see which other characteristics might improve the classifier.

In [113]:
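# dict() keeps one gender per name, so a name listed under both genders appears once
data = pd.DataFrame.from_dict(dict(name_list), orient='index', columns=['gender']).reset_index()
data.columns = ['name', 'gender']

data['length'] = data['name'].str.len()
data['first_letter'] = data['name'].str[0]
data['last_letter'] = data['name'].str[-1]

# The vowel tests below are lowercase only, so a capitalized first letter never matches
data['vowel_last_letter'] = data['last_letter'].isin([*'aeiou'])
data['vowel_first_letter'] = data['first_letter'].isin([*'aeiou'])
# regex=True is required for pattern replacement in recent pandas versions
data['vowels'] = data['name'].str.replace(r'[^aeiou]', '', regex=True)
data['consonants'] = data['name'].str.replace(r'[aeiou]', '', regex=True)
data['vowel_count'] = data['vowels'].str.len()
data['consonant_count'] = data['consonants'].str.len()
data.head()
# numeric_only=True skips the string columns when averaging
data.groupby('gender').mean(numeric_only=True)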



gender    length  vowel_last_letter  vowel_first_letter  vowel_count  consonant_count
female  6.148994           0.724632                 0.0     2.498029         3.650965
male    5.942029           0.234783                 0.0         2.05         3.892029


Next, the feature function is tested by training a Naive Bayes classifier and scoring it on the dev-test set.

In [114]:
featuresets = [(gender_features(n), g) for (n, g) in name_list]
# Slices as run for the output below. They differ from the 500/500/rest split in the
# exercise: the test slices start at index 500 and run to the end, so they also
# contain the dev-test and training names, and the dev-test slice holds 501 names.
test_names = name_list[500:]
dev_test_names = name_list[501:1002]
train_names = name_list[1003:]
test_set, dev_test_set, train_set = featuresets[500:], featuresets[501:1002], featuresets[1003:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, dev_test_set))
# show_most_informative_features prints its own table and returns None
classifier.show_most_informative_features(5)

0.7345309381237525
Most Informative Features
             last_letter = 'k'              male : female =     42.2 : 1.0
             last_letter = 'a'            female : male   =     34.7 : 1.0
             last_letter = 'p'              male : female =     17.4 : 1.0
             last_letter = 'f'              male : female =     16.5 : 1.0
             last_letter = 'd'              male : female =     10.2 : 1.0
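
The exercise asks for three non-overlapping subsets: 500 names for the test set, 500 for the dev-test set, and the rest for training. The following is a minimal sketch of such a split, not the code that produced the output in this notebook. The helper name `three_way_split`, the fixed seed, and the stand-in data are assumptions made for illustration.

```python
import random


def three_way_split(items, n_test=500, n_dev=500, seed=0):
    """Shuffle a copy of items and return (test, dev_test, train) with no overlap."""
    shuffled = list(items)
    # A seeded Random instance makes the split repeatable across runs
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    dev_test = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return test, dev_test, train


# Stand-in data; in the notebook this would be name_list (or featuresets)
labeled = [("name%d" % i, "male" if i % 2 else "female") for i in range(2000)]
test, dev_test, train = three_way_split(labeled)
print(len(test), len(dev_test), len(train))
```

Splitting once with a fixed seed and reusing the same three subsets for every feature set would also make the dev-test accuracies of successive runs directly comparable.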


The misclassified names are printed as (actual, guessed, name) tuples to look for patterns.

In [115]:
errors = []
for (name, tag) in dev_test_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag,guess,name))
print(errors[:20])


[('female', 'male', 'Sharleen'), ('male', 'female', 'Wesley'), ('male', 'female', 'Kendall'), ('female', 'male', 'Michal'), ('female', 'male', 'Wynn'), ('male', 'female', 'Jessie'), ('female', 'male', 'Marillin'), ('female', 'male', 'Jannel'), ('male', 'female', 'Parnell'), ('female', 'male', 'Avril'), ('female', 'male', 'Rosamund'), ('male', 'female', 'Alley'), ('male', 'female', 'Chauncey'), ('male', 'female', 'Bucky'), ('male', 'female', 'Murray'), ('female', 'male', 'Alleen'), ('female', 'male', 'Austin'), ('male', 'female', 'Willey'), ('female', 'male', 'Janis'), ('male', 'female', 'Piggy')]
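
Scanning a long list of tuples by eye is error-prone, so it can help to tally the errors. The sketch below counts errors by direction and by two-letter suffix, using the first few tuples from the output above. Grouping by suffix is an assumption about what might be informative, not a step the notebook performed.

```python
from collections import Counter

# First few (actual, guessed, name) tuples copied from the printed error list
errors = [
    ('female', 'male', 'Sharleen'),
    ('male', 'female', 'Wesley'),
    ('male', 'female', 'Kendall'),
    ('female', 'male', 'Michal'),
    ('female', 'male', 'Wynn'),
    ('male', 'female', 'Jessie'),
]

# How often each (actual, guessed) direction occurs
by_direction = Counter((tag, guess) for tag, guess, _ in errors)
# Which two-letter endings show up among the misclassified names
by_suffix = Counter(name[-2:].lower() for _, _, name in errors)

print(by_direction.most_common())
print(by_suffix.most_common(3))
```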


Based on the dataframe, first-letter characteristics were added to the function and the classifier was retested.

In [104]:
def gender_features(word):
    first_letter = word[0]
    # Names in the corpus are capitalized, so this lowercase test is always False here
    vowel_first_letter = first_letter in 'aeiou'
    return {'last_letter': word[-1], 'name_length': len(word),
            'first_letter': first_letter, 'vowel_first_letter': vowel_first_letter}
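
The names in the corpus are capitalized (for example 'Avril' and 'Austin' in the error list above), so the test `first_letter in 'aeiou'` never succeeds, which matches the 0.0 in the vowel_first_letter column of the dataframe summary. A case-insensitive variant might look like the sketch below. The function name `gender_features_ci` is hypothetical, and the accuracies reported in this notebook were not produced with it.

```python
VOWELS = set('aeiou')


def gender_features_ci(word):
    """Hypothetical variant of gender_features that lowercases before testing."""
    lowered = word.lower()
    return {
        'last_letter': lowered[-1],
        'name_length': len(word),
        'first_letter': lowered[0],
        # True for names such as 'Avril' that start with a capital vowel
        'vowel_first_letter': lowered[0] in VOWELS,
    }


print(gender_features_ci('Avril'))
print(gender_features_ci('Wesley'))
```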




In [105]:
name_list = ([(name, 'male') for name in names.words('male.txt')] + [(name,'female') for name in names.words('female.txt')])
# Reshuffling here means this run is scored on a different dev-test sample than the previous one
random.shuffle(name_list)

featuresets = [(gender_features(n), g) for (n, g) in name_list]
# Same slices as in the first run: the test slices overlap the dev-test and training data
test_names = name_list[500:]
dev_test_names = name_list[501:1002]
train_names = name_list[1003:]
test_set, dev_test_set, train_set = featuresets[500:], featuresets[501:1002], featuresets[1003:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, dev_test_set))
# show_most_informative_features prints its own table and returns None
classifier.show_most_informative_features(5)

0.7784431137724551
Most Informative Features
             last_letter = 'a'            female : male   =     34.4 : 1.0
             last_letter = 'k'              male : female =     32.1 : 1.0
             last_letter = 'f'              male : female =     23.4 : 1.0
             last_letter = 'v'              male : female =     15.4 : 1.0
             last_letter = 'p'              male : female =     10.6 : 1.0


In [106]:
errors = []
for (name, tag) in dev_test_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag,guess,name))
print(errors[:20])


[('female', 'male', 'Clio'), ('female', 'male', 'Chloris'), ('female', 'male', 'Dian'), ('male', 'female', 'Brodie'), ('male', 'female', 'Ricky'), ('female', 'male', 'Eden'), ('male', 'female', 'Griffith'), ('female', 'male', 'Talyah'), ('female', 'male', 'Jonis'), ('female', 'male', 'Mel'), ('female', 'male', 'Rosamund'), ('male', 'female', 'Lay'), ('male', 'female', 'Rudie'), ('male', 'female', 'Logan'), ('male', 'female', 'Vinny'), ('male', 'female', 'Carson'), ('male', 'female', 'Geoffrey'), ('male', 'female', 'Aubrey'), ('male', 'female', 'Jeramie'), ('male', 'female', 'Agustin')]


Based on the errors, the second-to-last letter is examined and added as a feature to see whether it improves the classifier.

In [107]:
data['second_last_letter'] = data['name'].str[-2]
data['vowel_second_last_letter'] = data['second_last_letter'].isin([*'aeiou'])
# numeric_only=True skips the string columns when averaging
data.groupby('gender').mean(numeric_only=True)


gender    length  vowel_last_letter  vowel_first_letter  vowel_count  consonant_count  vowel_second_last_letter
female  6.113777           0.712058                 0.0     2.475705         3.638072                  0.320736
male    5.995733           0.224593                 0.0     2.061676         3.934057                  0.549263


In [108]:
def gender_features(word):
    first_letter = word[0]
    # Names in the corpus are capitalized, so this lowercase test is always False here
    vowel_first_letter = first_letter in 'aeiou'
    second_last = word[-2]  # assumes every name has at least two letters
    vowel_second_last = second_last in 'aeiou'
    return {'last_letter': word[-1], 'name_length': len(word),
            'first_letter': first_letter, 'vowel_first_letter': vowel_first_letter,
            'vowel_second_last': vowel_second_last}
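
Indexing `word[-2]` raises an IndexError for a one-letter string, and the empty string counts as "in" any Python string, so a guarded version needs to handle both cases. The sketch below shows one way to do that. The helper name `second_last_features` is hypothetical, and it adds the letter itself as a feature, which the notebook's function does not do.

```python
def second_last_features(word):
    """Hypothetical helper: second-to-last-letter features that tolerate short names."""
    lowered = word.lower()
    # Fall back to an empty string when the name has fewer than two letters
    second_last = lowered[-2] if len(lowered) >= 2 else ''
    return {
        'second_last_letter': second_last,
        # '' in 'aeiou' is True in Python, so the empty case is excluded explicitly
        'vowel_second_last': second_last != '' and second_last in 'aeiou',
    }


print(second_last_features('Eden'))
print(second_last_features('Wynn'))
print(second_last_features('A'))
```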




In [109]:
name_list = ([(name, 'male') for name in names.words('male.txt')] + [(name,'female') for name in names.words('female.txt')])
# Reshuffling here means this run is scored on a different dev-test sample than the previous one
random.shuffle(name_list)

featuresets = [(gender_features(n), g) for (n, g) in name_list]
# Same slices as in the first run: the test slices overlap the dev-test and training data
test_names = name_list[500:]
dev_test_names = name_list[501:1002]
train_names = name_list[1003:]
test_set, dev_test_set, train_set = featuresets[500:], featuresets[501:1002], featuresets[1003:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, dev_test_set))
# show_most_informative_features prints its own table and returns None
classifier.show_most_informative_features(5)

0.7884231536926147
Most Informative Features
             last_letter = 'a'            female : male   =     36.9 : 1.0
             last_letter = 'k'              male : female =     31.2 : 1.0
             last_letter = 'f'              male : female =     16.6 : 1.0
             last_letter = 'p'              male : female =     10.5 : 1.0
             last_letter = 'd'              male : female =      9.4 : 1.0


In [110]:
print(nltk.classify.accuracy(classifier,dev_test_set))
print(nltk.classify.accuracy(classifier,test_set))


0.7884231536926147
0.7798226759806556
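
The accuracy reported by `nltk.classify.accuracy` is the fraction of items whose predicted label matches the gold label. The dev-test figure of 0.7884231536926147, for instance, corresponds to 395 correct out of 501. The sketch below computes the same quantity by hand. The `accuracy` helper and the toy `last_letter_rule` classifier are assumptions for illustration, and the four labeled names are taken from the error lists above.

```python
def accuracy(classify, labeled_names):
    """Fraction of (name, tag) pairs for which classify(name) equals tag."""
    correct = sum(1 for name, tag in labeled_names if classify(name) == tag)
    return correct / len(labeled_names)


def last_letter_rule(name):
    # Toy stand-in for a trained classifier: guess from the final letter only
    return 'female' if name[-1].lower() in 'aeiy' else 'male'


sample = [
    ('Wesley', 'male'),
    ('Kendall', 'male'),
    ('Avril', 'female'),
    ('Rosamund', 'female'),
]
print(accuracy(last_letter_rule, sample))
```

Only 'Kendall' is guessed correctly here, which is unsurprising because the sample is drawn from names the notebook's classifier also got wrong.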


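The classifier improved with each added set of features: dev-test accuracy rose from 0.735 (last letter and name length) to 0.778 (adding the first-letter features) and 0.788 (adding the second-to-last-letter vowel flag). Because name_list is reshuffled before each run, these figures come from different dev-test samples, so part of the difference may be sampling noise.
The dev-test and test accuracies (0.788 and 0.780) are similar. A slightly lower test score is what would be expected, since the features were tuned against the dev-test set. Note, however, that test_set is taken as featuresets[500:], which overlaps the dev-test and training data, so this agreement is not an independent check of the classifier.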