<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#About-this-project" data-toc-modified-id="About-this-project-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>About this project</a></span></li><li><span><a href="#Data-load-and-prepare" data-toc-modified-id="Data-load-and-prepare-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data load and prepare</a></span></li><li><span><a href="#Data-Metrics" data-toc-modified-id="Data-Metrics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Metrics</a></span></li><li><span><a href="#Features" data-toc-modified-id="Features-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Features</a></span></li><li><span><a href="#Naive-Bayes-Model" data-toc-modified-id="Naive-Bayes-Model-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Naive Bayes Model</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

### About this project

Using Decision Trees, Naive Bayes, and/or Maximum Entropy, build the best name gender classifier you can with the NLTK Names corpus data.  
<br>
Split the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.
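
The three-way split described above can be sketched in plain Python. This is only an illustration, not the code used in the cells below: `three_way_split`, its parameters, and the placeholder list are hypothetical names, and the fixed seed is an assumption added to make the split repeatable. With the 7,944 names the corpus reports later in this notebook, two 500-name holdouts leave 6,944 names for training.

```python
import random

def three_way_split(items, n_test=500, n_devtest=500, seed=0):
    """Shuffle a copy of items and return (train, devtest, test)."""
    if n_test + n_devtest > len(items):
        raise ValueError("not enough items for the requested split")
    shuffled = list(items)
    # A seeded Random instance keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Placeholder data with the same size as the full Names corpus
placeholder = [f"name_{i}" for i in range(7944)]
train, devtest, test = three_way_split(placeholder)
print(len(train), len(devtest), len(test))
```

Keeping the three slices disjoint matters: the dev-test set guides feature changes, so only the untouched test set gives an unbiased final number.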

### Data load and prepare

In [88]:
from nltk.corpus import names
import pandas as pd 
from collections import defaultdict

from nltk import NaiveBayesClassifier
from nltk import classify

from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB


import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [89]:
%cd C:\Users\user\Documents\00_Applications_DataScience\CUNY\DATA620\KJW_CUNY_DATA_620\KJW_Project3_DS620

C:\Users\user\Documents\00_Applications_DataScience\CUNY\DATA620\KJW_CUNY_DATA_620\KJW_Project3_DS620


In [90]:
#Get the male and female names, then create a dataframe with two columns, name and gender, that has all the
#male and female names
male_names = names.words('male.txt')
male_gender = ['male'] * len(male_names)

female_names = names.words('female.txt')
female_gender = ['female'] * len(female_names)

data = pd.DataFrame()
data['name'] = male_names + female_names
data['gender'] = male_gender + female_gender

In [91]:
#Shuffle the dataframe
data = data.sample(frac=1).reset_index(drop=True)

data.head()

| | name | gender |
|---|---|---|
| 0 | Anthea | female |
| 1 | Koral | female |
| 2 | Marji | female |
| 3 | Steward | male |
| 4 | Peter | male |


### Data Metrics

In [92]:
total_count = len(male_names) + len(female_names)
print(f'There are {len(male_names)} male names out of a total of {total_count} names')
print(f'There are {len(female_names)} female names')
print()

There are 2943 male names out of a total of 7944 names
There are 5001 female names



In [93]:
#Check for duplicate names, i.e. names that are in both the male and female datasets
duplicates_check = data[['name']]
duplicates = duplicates_check[duplicates_check.duplicated()]
print(f'There are {len(duplicates)} duplicate names.')
print()

print('Shown below is a count of 730 names with 365 unique names, so these are all names in both the male and female datasets')
print('Since this project is to predict binomial gender classification, these names will be removed')

#Create a list of the duplicate names and validate that they are male and female entries
duplicates_list = list(duplicates.name)

duplicates_df = data[data.name.isin(duplicates_list)]
duplicates_df = duplicates_df.sort_values(['name','gender'])
duplicates_df.describe()

There are 365 duplicate names.

Shown below is a count of 730 names with 365 unique names, so these are all names in both the male and female datasets
Since this project is to predict binomial gender classification, these names will be removed


| | name | gender |
|---|---|---|
| count | 730 | 730 |
| unique | 365 | 2 |
| top | Wallis | male |
| freq | 2 | 365 |


In [94]:
#Remove the duplicate names
data = data[~data.name.isin(duplicates_list)]

print(f'Removing duplicate names drops {total_count - len(data)} names; total names went from {total_count} to {len(data)}')

Removing duplicate names drops 730 names; total names went from 7944 to 7214
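
The same removal can also be expressed with set operations, which makes the intent explicit: drop only names that carry both labels. The sketch below is an alternative illustration, not the notebook's method; `drop_ambiguous` and the toy lists are hypothetical. One difference worth noting is that `DataFrame.duplicated()` would also flag a name repeated within a single file, whereas the set intersection only flags names present in both files.

```python
def drop_ambiguous(male_names, female_names):
    """Return (male_only, female_only, shared) with shared names removed from both lists."""
    shared = set(male_names) & set(female_names)
    male_only = [n for n in male_names if n not in shared]
    female_only = [n for n in female_names if n not in shared]
    return male_only, female_only, shared

# Toy lists standing in for names.words('male.txt') and names.words('female.txt')
toy_male = ["Alex", "Bob", "Chris"]
toy_female = ["Alex", "Dana", "Chris"]
male_only, female_only, shared = drop_ambiguous(toy_male, toy_female)
print(male_only, female_only, sorted(shared))
```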


### Features

The Names corpus consists of two text files, one of male names and one of female names. That is not much to build a classifier on, but these are the features we came up with:

1. last letter of the name
2. first letter of the name
3. length of the name
4. number of repeated letters in the name (the highest count of any single letter, minus one)
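
As a compact illustration of the four features listed above, the sketch below computes them for a single name. `name_features` is a hypothetical helper, not a function from the notebook, and it uses `collections.Counter` where the notebook's cell builds the tally by hand; the feature names match the dataframe columns created below.

```python
from collections import Counter

def name_features(name):
    """Build the four features for one name as a dict."""
    letter_counts = Counter(name.lower())
    return {
        "last_letter": name[-1],
        "first_letter": name[0],
        "name_length": len(name),
        # Count of the most frequent letter, minus its first occurrence
        "repeat_letters_count": max(letter_counts.values()) - 1,
    }

print(name_features("Anthea"))
print(name_features("Koral"))
```

A dict of this shape is what `NaiveBayesClassifier.train` expects as the first element of each `(features, label)` pair.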

In [95]:
#Add last_letter, first_letter, and name_length columns to the dataframe
data['last_letter'] = [name[-1] for name in data.name]
data['first_letter'] = [name[0] for name in data.name]
data['name_length'] = [len(name) for name in data.name]

In [96]:
#Count the max number of repeating letters in a name and add to dataframe
repeat_letters_list = []

#loop thru the dataframe
for name in data.name:
    rl_dict = defaultdict(int)

    #tally each character of the lowercased name
    for l in str(name).lower():
        rl_dict[l] += 1

    #count of the most frequent letter, minus its first occurrence
    repeat_letters_list.append(max(rl_dict.values()) - 1)
                 
data['repeat_letters_count'] = repeat_letters_list
data.head()

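| | name | gender | last_letter | first_letter | name_length | repeat_letters_count |
|---|---|---|---|---|---|---|
| 0 | Anthea | female | a | A | 6 | 1 |
| 1 | Koral | female | l | K | 5 | 0 |
| 2 | Marji | female | i | M | 5 | 0 |
| 3 | Steward | male | d | S | 7 | 0 |
| 4 | Peter | male | r | P | 5 | 1 |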


In [97]:
#Format the dataset for modeling
model_data = data[['last_letter', 'first_letter', 'name_length', 'repeat_letters_count', 'gender']]
model_data.head()

| | last_letter | first_letter | name_length | repeat_letters_count | gender |
|---|---|---|---|---|---|
| 0 | a | A | 6 | 1 | female |
| 1 | l | K | 5 | 0 | female |
| 2 | i | M | 5 | 0 | female |
| 3 | d | S | 7 | 0 | male |
| 4 | r | P | 5 | 1 | male |


### Naive Bayes Model

In [98]:
#Model using all features
features_list = []

for m in model_data.values:
    feature_dict = {'last_letter': m[0], 'first_letter': m[1], 'name_length': m[2], 'repeat_letters_count': m[3]}
    feature_set = (feature_dict, m[4])
    features_list.append(feature_set)
    
#7214 names remain after removing duplicates: 6214 train, 500 dev-test, 500 test
train_data = features_list[:6214]
devtest_data = features_list[6214:6714]
test_data = features_list[6714:]
    
nb_classifier = NaiveBayesClassifier.train(train_data)

# Check progress on the dev-test set; the test set is held out for the final evaluation
print(classify.accuracy(nb_classifier, devtest_data))

0.794


In [99]:
#Model using last letter
features_list = []

for m in model_data.values:
    feature_dict = {'last_letter': m[0]}
    feature_set = (feature_dict, m[4])
    features_list.append(feature_set)
    
#7214 names remain after removing duplicates: 6214 train, 500 dev-test, 500 test
train_data = features_list[:6214]
devtest_data = features_list[6214:6714]
test_data = features_list[6714:]
    
nb_classifier = NaiveBayesClassifier.train(train_data)

# Check progress on the dev-test set; the test set is held out for the final evaluation
print(classify.accuracy(nb_classifier, devtest_data))

0.76
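
For context on the two accuracies above, it helps to know what a classifier would score by always predicting the more common label. The sketch below derives that majority-class baseline from the counts printed earlier in this notebook (2943 male names, 5001 female names, 365 names present in both lists and removed from each). The variable names are illustrative, and the calculation assumes the class balance of any 500-name slice is close to that of the full deduplicated set.

```python
# Counts reported earlier in the notebook
male_total, female_total = 2943, 5001
shared = 365  # names that appeared in both lists and were removed from each

male_kept = male_total - shared
female_kept = female_total - shared

# Accuracy of always predicting the larger class
baseline = max(male_kept, female_kept) / (male_kept + female_kept)
print(f"majority-class baseline: {baseline:.3f}")
```

Any feature set is only useful to the extent that it beats this number on the dev-test set.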


### Conclusion

In [100]:
### Example to eventually delete
def gender_features(word):
    return {"last_letter": word[-1]}  # feature set

labeled_names = ([(name, "male") for name in names.words("male.txt")] +
                 [(name, "female") for name in names.words("female.txt")])

print(len(labeled_names)) # 7944 names

# Shuffle the names in the list
import random
random.shuffle(labeled_names)

feature_sets = [(gender_features(n), gender)
                    for (n, gender) in labeled_names]

# Divide the feature sets into training and test sets
train_set, test_set = feature_sets[500:], feature_sets[:500]

classifier = NaiveBayesClassifier.train(train_set)

# Test out the classifier with few samples outside of training set
print(classifier.classify(gender_features("neo")))  # returns male
print(classifier.classify(gender_features("trinity")))  # returns female

# Test the accuracy of the classifier on the test data
print(classify.accuracy(classifier, test_set))  # varies with the shuffle, roughly 0.77 to 0.78

# examine classifier to determine which features are most effective for distinguishing the name's gender
# (show_most_informative_features prints its own table and returns None, so it is not wrapped in print)
classifier.show_most_informative_features(5)

7944
male
female
0.77
[({'last_letter': 'h'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': ' '}, 'female'), ({'last_letter': 'm'}, 'female'), ({'last_letter': 'l'}, 'male'), ({'last_letter': 'x'}, 'male'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 's'}, 'male'), ({'last_letter': 'n'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'd'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'n'}, 'female'), ({'last_letter': 'e'}, 'male'), ({'last_letter': 'd'}, 'female'), ({'last_letter': 'i'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'o'}, 'male'), ({'last_letter': 's'}, 'male'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'a'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'last_letter': 'o'}, 'female'), ({'last_letter': 'e'}, 'female'), ({'la

<I>How does the performance on the test set compare to the performance on the dev-test set? </I>
<br><br>
    <I>Is this what you'd expect? </I>
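
When comparing dev-test and test accuracy, it is worth estimating how much two 500-name samples can differ by chance alone. The sketch below uses the normal approximation to the binomial to put an approximate 95% interval around an accuracy of 0.794 measured on 500 names, the figure reported for the all-features model. `accuracy_interval` is a hypothetical helper, and the approximation is an assumption that is reasonable at this sample size but not exact.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples."""
    # Standard error of a proportion under the normal approximation
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se

low, high = accuracy_interval(0.794, 500)
print(f"{low:.3f} to {high:.3f}")
```

The interval spans roughly 0.76 to 0.83, so a gap of a few points between the dev-test and test scores would be within sampling noise. A larger drop on the test set would instead suggest the features were tuned too closely to the dev-test names.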