<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#About-this-project" data-toc-modified-id="About-this-project-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>About this project</a></span></li><li><span><a href="#Data-load-and-prepare" data-toc-modified-id="Data-load-and-prepare-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data load and prepare</a></span></li><li><span><a href="#Data-Metrics" data-toc-modified-id="Data-Metrics-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Data Metrics</a></span></li><li><span><a href="#Features" data-toc-modified-id="Features-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Features</a></span></li><li><span><a href="#Naive-Bayes-Model-train-and-devtest" data-toc-modified-id="Naive-Bayes-Model-train-and-devtest-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Naive Bayes Model train and devtest</a></span></li><li><span><a href="#Naive-Bayes-model-test-data" data-toc-modified-id="Naive-Bayes-model-test-data-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Naive Bayes model test data</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusion</a></span></li><li><span><a href="#Footnotes" data-toc-modified-id="Footnotes-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Footnotes</a></span></li></ul></div>

In [1]:
from IPython.display import Video

Video("KJW_Project3_DS620_video.mkv")

### About this project

Using Decision Trees, Naive Bayes, and/or Maximum Entropy, build the best name gender classifier you can with the NLTK Names corpus data.  We chose Naive Bayes.
<br><br>
Split the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set.
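
As a rough illustration of the split described above, the sketch below shuffles once and slices three non-overlapping subsets. The function name `split_three_way`, the seed value, and the integer stand-in data are assumptions for illustration, not part of the assignment or of NLTK.

```python
import random

def split_three_way(items, n_test=500, n_devtest=500, seed=620):
    """Shuffle once, then slice into (train, devtest, test) with no overlap."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed (hypothetical) for repeatability
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Stand-in data: integer IDs in place of the corpus names.
train, devtest, test = split_three_way(range(7944))
print(len(train), len(devtest), len(test))
```

Because the test slice is cut before any modeling happens, nothing done on the training or dev-test data can touch it.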

### Data load and prepare

In [21]:
from nltk.corpus import names
import pandas as pd 
from collections import defaultdict

from nltk import NaiveBayesClassifier
from nltk import classify


import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [22]:
%cd C:\Users\user\Documents\00_Applications_DataScience\CUNY\DATA620\KJW_CUNY_DATA_620\KJW_Project3_DS620

C:\Users\user\Documents\00_Applications_DataScience\CUNY\DATA620\KJW_CUNY_DATA_620\KJW_Project3_DS620


In [23]:
#Get the male and female names, then create a dataframe with two columns, name and gender,
#that holds all the male and female names
male_names = names.words('male.txt')
male_gender = ['male'] * len(male_names)

female_names = names.words('female.txt')
female_gender = ['female'] * len(female_names)

data = pd.DataFrame()
data['name'] = male_names + female_names
data['gender'] = male_gender + female_gender

In [24]:
#Shuffle the dataframe
data = data.sample(frac=1).reset_index(drop=True)

data.head()

| | name | gender |
|---|---|---|
| 0 | Megan | female |
| 1 | Cam | female |
| 2 | Bailie | male |
| 3 | Hanna | female |
| 4 | Antonio | male |


### Data Metrics

In [25]:
total_count = len(male_names) + len(female_names)
print(f'There are {len(male_names)} male names out of a total of {total_count} names')
print(f'There are {len(female_names)} female names')
print()

There are 2943 male names out of a total of 7944 names
There are 5001 female names



In [26]:
#Check for duplicate names, i.e. names that are in both the male and female datasets
duplicates_check = data[['name']]
duplicates = duplicates_check[duplicates_check.duplicated()]
print(f'There are {len(duplicates)} duplicate names.')
print()

print('Shown below is a count of 730 names with 365 unique names, so these are all names in both the male and female datasets')
print('Since this project predicts a binary gender classification, these names will be removed')

#Create a list of the duplicate names and validate that they are male and female entries
duplicates_list = list(duplicates.name)

duplicates_df = data[data.name.isin(duplicates_list)]
duplicates_df = duplicates_df.sort_values(['name','gender'])
duplicates_df.describe()

There are 365 duplicate names.

Shown below is a count of 730 names with 365 unique names, so these are all names in both the male and female datasets
Since this project predicts a binary gender classification, these names will be removed


| | name | gender |
|---|---|---|
| count | 730 | 730 |
| unique | 365 | 2 |
| top | Addie | female |
| freq | 2 | 365 |


In [27]:
#Remove the duplicate names
data = data[~data.name.isin(duplicates_list)]

print(f'Removing duplicate names drops {total_count - len(data)} names; total names went from {total_count} to {len(data)}')

Removing duplicate names drops 730 names; total names went from 7944 to 7214
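
A useful yardstick for any accuracy figure on this data is the score obtained by always guessing the more common gender. The short calculation below is a sketch built only from the counts printed above; it assumes each of the 365 shared names appears exactly once in each file, which is what the `describe()` output suggests. The variable names are illustrative.

```python
# Counts taken from the printed output above.
male_total, female_total, shared = 2943, 5001, 365

# Assumption: every shared name is removed once from each gender.
male_kept = male_total - shared
female_kept = female_total - shared
total_kept = male_kept + female_kept

# Accuracy of a model that always predicts the majority class.
baseline = max(male_kept, female_kept) / total_kept
print(f'{total_kept} names kept; majority-class baseline is {baseline:.1%}')
```

Under that assumption the baseline is roughly 64%, so a feature only adds information if its accuracy clearly exceeds that level.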


### Features

The Names corpus consists of two text files, one of male names and one of female names.  A bare name gives little to build a classifier on, so we derived four features from its spelling:

1. last letter of name
2. first letter of name
3. length of name
4. number of repeated letters in the name (occurrences of the most frequent letter, minus one)
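
The four features listed above can be sketched for a single name as follows. The helper `name_features` is a hypothetical name used only for illustration; the notebook itself builds these values as dataframe columns.

```python
from collections import Counter

def name_features(name):
    """Return the four spelling-based features for one name."""
    counts = Counter(name.lower())  # letter frequencies, case-insensitive
    return {
        'last_letter': name[-1],
        'first_letter': name[0],
        'name_length': len(name),
        # occurrences of the most frequent letter beyond its first
        'repeat_letters_count': max(counts.values()) - 1,
    }

print(name_features('Hanna'))
```

For `'Hanna'` the letters `a` and `n` each occur twice, so the repeat count is 1; a name with no repeated letter, such as `'Megan'`, gets 0.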

In [28]:
#Add last_letter, first_letter, and name_length columns to the dataframe
data['last_letter'] = [name[-1] for name in data.name]
data['first_letter'] = [name[0] for name in data.name]
data['name_length'] = [len(name) for name in data.name]

In [29]:
#Count how many extra times the most frequent letter occurs in each name and add to dataframe
repeat_letters_list = []

#loop thru the dataframe
for name in data.name:
    rl_dict = defaultdict(int)
    
    #count each character of the lowercased name
    for l in str(name).lower():
        rl_dict[l] += 1
    
    #occurrences of the most frequent letter beyond its first
    repeat_letters_list.append(max(rl_dict.values()) - 1)
                 
data['repeat_letters_count'] = repeat_letters_list
data.head()

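| | name | gender | last_letter | first_letter | name_length | repeat_letters_count |
|---|---|---|---|---|---|---|
| 0 | Megan | female | n | M | 5 | 0 |
| 2 | Bailie | male | e | B | 6 | 1 |
| 3 | Hanna | female | a | H | 5 | 1 |
| 4 | Antonio | male | o | A | 7 | 1 |
| 5 | Valida | female | a | V | 6 | 1 |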


In [30]:
#Format the dataset for modeling
model_data = data[['last_letter', 'first_letter', 'name_length', 'repeat_letters_count', 'gender']]
model_data.head()

| | last_letter | first_letter | name_length | repeat_letters_count | gender |
|---|---|---|---|---|---|
| 0 | n | M | 5 | 0 | female |
| 2 | e | B | 6 | 1 | male |
| 3 | a | H | 5 | 1 | female |
| 4 | o | A | 7 | 1 | male |
| 5 | a | V | 6 | 1 | female |


### Naive Bayes Model train and devtest

In [31]:
#Data for model using only the last letter
def create_last_letter_features_list(indata):
    features_list = []

    for m in indata.values:
        feature_dict = {'last_letter': m[0]}
        feature_set = (feature_dict, m[4])
        features_list.append(feature_set)
    
    return features_list

In [32]:
#Data for model using only the first letter
def create_first_letter_features_list(indata):
    features_list = []

    for m in indata.values:
        feature_dict = {'first_letter': m[1]}
        feature_set = (feature_dict, m[4])
        features_list.append(feature_set)
    
    return features_list

In [33]:
#Data for model using only the name length
def create_length_features_list(indata):
    features_list = []

    for m in indata.values:
        feature_dict = {'name_length': m[2]}
        feature_set = (feature_dict, m[4])
        features_list.append(feature_set)
    
    return features_list

In [34]:
#Data for model using only the repeat letters count
def create_repeat_letters_features_list(indata):
    features_list = []

    for m in indata.values:
        feature_dict = {'repeat_letters_count': m[3]}
        feature_set = (feature_dict, m[4])
        features_list.append(feature_set)
    
    return features_list

In [35]:
#Data for model using all features
def create_all_features_list(indata):
    features_list = []

    for m in indata.values:
        feature_dict = {'last_letter': m[0], 'first_letter': m[1], 'name_length': m[2], 'repeat_letters_count': m[3]}
        feature_set = (feature_dict, m[4])
        features_list.append(feature_set)
    
    return features_list

In [36]:
#Data for model using first and last letter features
def create_first_last_features_list(indata):
    features_list = []

    for m in indata.values:
        feature_dict = {'last_letter': m[0], 'first_letter': m[1]}
        feature_set = (feature_dict, m[4])
        features_list.append(feature_set)
    
    return features_list

In [37]:
#Run the model and return the accuracy score
def run_model(in_features_list):
    train_data = in_features_list[:6214]
    devtest_data = in_features_list[6214:6714]

    nb_classifier = NaiveBayesClassifier.train(train_data)

    model_accuracy = classify.accuracy(nb_classifier, devtest_data)
    
    return [model_accuracy, nb_classifier]

Six feature combinations are tried below:
1. Last letter only
2. First letter only
3. Name length only
4. Repeating letters only
5. First, last letters
6. All features

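For each of the six feature combinations, the data is reshuffled 100 times, the Naive Bayes model is retrained on each shuffle, and the dev-test accuracies are averaged to identify the best combination overall. The code calls this bootstrapping; strictly speaking it is repeated random splitting, since rows are shuffled rather than resampled with replacement.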
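
The procedure described above can be sketched generically, with one feature builder that takes a list of column names instead of six separate functions. All names here (`make_featuresets`, `repeated_split_accuracy`, `lookup_scorer`) are hypothetical, and `lookup_scorer` is a deliberately simple stand-in, not Naive Bayes, so that the sketch runs without NLTK.

```python
import random
from collections import Counter, defaultdict

def make_featuresets(rows, feature_names):
    """Build NLTK-style (feature_dict, label) pairs from a list of dict rows."""
    return [({f: row[f] for f in feature_names}, row['gender']) for row in rows]

def repeated_split_accuracy(featuresets, train_and_score, n_train, n_runs=100, seed=0):
    """Average dev-test accuracy over n_runs random train/dev-test splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_runs):
        shuffled = featuresets[:]
        rng.shuffle(shuffled)
        scores.append(train_and_score(shuffled[:n_train], shuffled[n_train:]))
    return sum(scores) / len(scores)

def lookup_scorer(train, dev):
    """Stand-in classifier: predict the most common training label for an
    identical feature dict, falling back to the overall majority label."""
    by_features = defaultdict(Counter)
    overall = Counter()
    for feats, label in train:
        by_features[tuple(sorted(feats.items()))][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]
    correct = 0
    for feats, label in dev:
        counts = by_features.get(tuple(sorted(feats.items())))
        guess = counts.most_common(1)[0][0] if counts else default
        correct += guess == label
    return correct / len(dev)

# Toy rows copied from the dataframe preview; far too few for a real estimate.
rows = [
    {'last_letter': 'n', 'first_letter': 'M', 'gender': 'female'},
    {'last_letter': 'e', 'first_letter': 'B', 'gender': 'male'},
    {'last_letter': 'a', 'first_letter': 'H', 'gender': 'female'},
    {'last_letter': 'o', 'first_letter': 'A', 'gender': 'male'},
    {'last_letter': 'a', 'first_letter': 'V', 'gender': 'female'},
]
avg = repeated_split_accuracy(make_featuresets(rows, ['last_letter']), lookup_scorer, n_train=3)
print(f'Average dev-test accuracy on toy data: {avg:.2f}')
```

To use the notebook's model instead, the scorer could be replaced with `lambda train, dev: classify.accuracy(NaiveBayesClassifier.train(train), dev)`, using the `classify` and `NaiveBayesClassifier` imports from the top of the notebook.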

In [38]:
last_letter_results = []
first_letter_results = []
name_length_results = []
repeat_letters_results = []
first_last_letters_results = []
all_features_results = []

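#Repeat the shuffle/train/dev-test cycle 100 times
#Only the first 6714 rows are shuffled, so the last 500 rows are never trained on and remain a held-out test set
bootstrap_data = model_data.iloc[:6714,:]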

for r in range(1,101):
    #Shuffle
    indata = bootstrap_data.sample(frac=1).reset_index(drop=True)
    
    #1 Last letter
    features_list = create_last_letter_features_list(indata)
    accuracy = run_model(features_list)
    last_letter_results.append(accuracy[0])

    #2 First letter
    features_list = create_first_letter_features_list(indata)
    accuracy = run_model(features_list)
    first_letter_results.append(accuracy[0])
    
    #3 Name length
    features_list = create_length_features_list(indata)
    accuracy = run_model(features_list)
    name_length_results.append(accuracy[0])

    #4 Repeating letters count
    features_list = create_repeat_letters_features_list(indata)
    accuracy = run_model(features_list)
    repeat_letters_results.append(accuracy[0])

    #5 first and last letters features
    features_list = create_first_last_features_list(indata)
    accuracy = run_model(features_list)
    first_last_letters_results.append(accuracy[0])    
    
    #6 All features
    features_list = create_all_features_list(indata)
    accuracy = run_model(features_list)
    all_features_results.append(accuracy[0])

print('RESULTS')
print('-------')

last_letter_avg = sum(last_letter_results)/len(last_letter_results)
print(f'Last letter accuracy is {round(last_letter_avg * 100)}%')

first_letter_avg = sum(first_letter_results)/len(first_letter_results)
print(f'First letter accuracy is {round(first_letter_avg * 100)}%')

name_length_avg = sum(name_length_results)/len(name_length_results)
print(f'Name length accuracy is {round(name_length_avg * 100)}%')

repeat_letters_avg = sum(repeat_letters_results)/len(repeat_letters_results)
print(f'Repeat letters count accuracy is {round(repeat_letters_avg * 100)}%')

first_last_features_avg = sum(first_last_letters_results)/len(first_last_letters_results)
print(f'First, last features average accuracy is {round(first_last_features_avg * 100)}%')

all_features_avg = sum(all_features_results)/len(all_features_results)
print(f'All features average accuracy is {round(all_features_avg * 100)}%')

RESULTS
-------
Last letter accuracy is 79%
First letter accuracy is 66%
Name length accuracy is 65%
Repeat letters count accuracy is 64%
First, last features average accuracy is 80%
All features average accuracy is 81%


### Naive Bayes model test data

The dev-test results show that using all four features gives the highest average accuracy (81%), narrowly ahead of first and last letter together (80%) and last letter alone (79%).  Now we'll score the all-features model on the test dataset.

In [39]:
test_data = model_data.iloc[6714:,:]

features_list = create_all_features_list(test_data)

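#Reuse the classifier returned by the last run_model call in the loop above, which was trained on all features
nb_classifier = accuracy[1]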

model_accuracy = classify.accuracy(nb_classifier, features_list)

print(f'All features accuracy using test data is {round(model_accuracy * 100)}%')

All features accuracy using test data is 82%


### Conclusion

<I>How does the performance on the test set compare to the performance on the dev-test set? </I> <br> The dev-test accuracy, averaged over the 100 runs, was 81%; the accuracy on the test data, from a single run, was 82%.  
<br>
<I>Is this what you'd expect?</I> <br> Yes, a difference of a few percentage points between a dev-test set and a test set of 500 names each is reasonable.
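
One rough way to judge whether a one-point gap is meaningful is the binomial standard error of an accuracy measured on 500 names. The sketch below uses the rounded figures reported above and treats each prediction as an independent trial, which is a simplifying assumption.

```python
import math

n_test = 500        # names in the test set
devtest_acc = 0.81  # average dev-test accuracy reported above
test_acc = 0.82     # test accuracy reported above

# Standard error of a proportion: sqrt(p * (1 - p) / n)
std_err = math.sqrt(devtest_acc * (1 - devtest_acc) / n_test)
gap = test_acc - devtest_acc

print(f'standard error is about {std_err:.3f}; observed gap is {gap:.2f}')
```

The standard error comes out near 0.018, so a gap of 0.01 is well within the sampling noise expected from a 500-name test set.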

### Footnotes

This GitHub gist, ["Classifier to determine the gender of a name using NLTK library"](https://gist.github.com/vinovator/6e5bf1e1bc61687a1e809780c30d6bf6#file-nltk_name_classifier-py-L66), was helpful for understanding NLTK's Naive Bayes classifier.