This notebook takes the gender names dataset from Kaggle (sourced from Social Security records) and uses it as training data for a character-bigram Naive Bayes classifier. By learning the letter patterns that distinguish male and female names, the classifier can estimate the likely gender of each name and, from that, the likely gender split within a list of names.

Here, we import all of the needed libraries. 

In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
import requests
from bs4 import BeautifulSoup
import matplotlib as mlp
# %matplotlib inline selects the backend, so no explicit mlp.use(...) call is needed
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
mlp.rcParams.update({'font.family': "Open Sans", 'font.size' : 16})



Now we import the data file (the Social Security database), do some basic data cleaning, and display the first five entries to get a sense of the data.

In [2]:
#import social security names database from Kaggle in int form
names = pd.read_csv("../input/NationalNames.csv", dtype = {'Count': np.int32})
names = names.fillna(0)
names.head()

Unnamed: 0,Id,Name,Year,Gender,Count
0,1,Mary,1880,F,7065
1,2,Anna,1880,F,2604
2,3,Emma,1880,F,2003
3,4,Elizabeth,1880,F,1939
4,5,Minnie,1880,F,1746


Now let's group by name and gender so that, instead of separate counts for each year, we get the total count of female and male occurrences for each name.

In [3]:
namechart = names.groupby(['Name', 'Gender'], as_index = False)['Count'].sum()
namechart.head(5)

Unnamed: 0,Name,Gender,Count
0,Aaban,M,72
1,Aabha,F,21
2,Aabid,M,5
3,Aabriella,F,10
4,Aadam,M,196
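
To make the aggregation concrete, here is a minimal sketch on a toy table. The rows and counts are made up for illustration and the variable names (`toy`, `toy_totals`) are not part of the notebook; only the `groupby` call mirrors the cell above.

```python
import pandas as pd

# Toy data (made up for illustration): one row per name, year, and gender.
toy = pd.DataFrame({
    "Name":   ["Mary", "Mary", "Leslie", "Leslie", "Leslie"],
    "Year":   [1880, 1881, 1880, 1880, 1881],
    "Gender": ["F", "F", "F", "M", "M"],
    "Count":  [10, 12, 3, 5, 4],
})

# Collapse the yearly rows into one total per (Name, Gender) pair.
toy_totals = toy.groupby(["Name", "Gender"], as_index=False)["Count"].sum()
print(toy_totals)
```

A name recorded for both genders, like the toy "Leslie" here, keeps two rows after grouping, one per gender.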


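Now let's add a column that puts each name into a male or female bucket. The Mpercent score is the normalized difference (M - F) / (M + F), which ranges from -1 (only female records) to 1 (only male records). A name is labeled male when the score is above 0.001, that is, when male records outnumber female records, and female otherwise.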

In [4]:
# Keyword arguments are required by recent pandas versions
namechartdiff = namechart.reset_index().pivot(index='Name', columns='Gender', values='Count')
namechartdiff = namechartdiff.fillna(0)
namechartdiff["Mpercent"] = ((namechartdiff["M"] - namechartdiff["F"])/(namechartdiff["M"] + namechartdiff["F"]))
namechartdiff['gender'] = np.where(namechartdiff['Mpercent'] > 0.001, 'male', 'female')
namechartdiff.head()

Name,F,M,Mpercent,gender
Aaban,0.0,72.0,1.0,male
Aabha,21.0,0.0,-1.0,female
Aabid,0.0,5.0,1.0,male
Aabriella,10.0,0.0,-1.0,female
Aadam,0.0,196.0,1.0,male
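
The labeling rule is easiest to see on names that are not exclusively one gender. The sketch below applies the same formula and threshold to a toy table; the counts for "Leslie" and "Jordan" are invented for illustration, and `toy_counts` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Totals per name; the Leslie and Jordan rows are made up for illustration.
toy_counts = pd.DataFrame(
    {"F": [21.0, 0.0, 60.0, 50.0], "M": [0.0, 72.0, 40.0, 50.0]},
    index=["Aabha", "Aaban", "Leslie", "Jordan"],
)

# Normalized difference: +1 means only male records, -1 only female, 0 an even split.
toy_counts["Mpercent"] = (toy_counts["M"] - toy_counts["F"]) / (toy_counts["M"] + toy_counts["F"])

# Same rule as the cell above: anything not clearly positive is labeled female.
toy_counts["gender"] = np.where(toy_counts["Mpercent"] > 0.001, "male", "female")
print(toy_counts)
```

Note that an exactly even split (score 0) falls into the female bucket under this rule, since only scores above the threshold count as male.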


Let's now break the name strings down into character bigrams (blocks of two adjacent characters) with CountVectorizer.

In [5]:
char_vectorizer = CountVectorizer(analyzer='char', ngram_range=(2, 2))
X = char_vectorizer.fit_transform(namechartdiff.index)
X = X.tocsc()
# np.int was removed from recent NumPy versions; the builtin int works the same way here
y = (namechartdiff.gender == 'male').values.astype(int)
print(X)

  (0, 0)	1
  (1, 0)	1
  (2, 0)	1
  (3, 0)	1
  (4, 0)	1
  (5, 0)	1
  (6, 0)	1
  (7, 0)	1
  (8, 0)	1
  (9, 0)	1
  (10, 0)	1
  (11, 0)	1
  (12, 0)	1
  (13, 0)	1
  (14, 0)	1
  (15, 0)	1
  (16, 0)	1
  (17, 0)	1
  (18, 0)	1
  (19, 0)	1
  (20, 0)	1
  (21, 0)	1
  (22, 0)	1
  (23, 0)	1
  (24, 0)	1
  :	:
  (64876, 616)	1
  (67519, 616)	1
  (67520, 616)	1
  (67521, 616)	1
  (72287, 616)	1
  (73357, 616)	1
  (73358, 616)	1
  (76118, 616)	1
  (81252, 616)	1
  (81253, 616)	1
  (81254, 616)	1
  (81255, 616)	1
  (81256, 616)	1
  (83577, 616)	1
  (88001, 616)	1
  (88002, 616)	1
  (88270, 616)	1
  (91145, 616)	1
  (91333, 616)	1
  (91334, 616)	1
  (91335, 616)	1
  (91336, 616)	1
  (91604, 616)	1
  (92000, 616)	1
  (93888, 616)	1
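
Each printed entry above reads as (row, column) followed by a count: the row is a name and the column is one bigram. As a small sketch of what the vectorizer produces, the block below fits the same settings on just two names; `toy_vectorizer`, `toy_X`, and `toy_bigrams` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as char_vectorizer above, fitted on two example names only.
toy_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
toy_X = toy_vectorizer.fit_transform(["Mary", "Anna"])

# vocabulary_ maps each bigram to its column index (input is lowercased by default).
toy_bigrams = sorted(toy_vectorizer.vocabulary_)
print(toy_bigrams)

# Dense view: one row per name, one column per bigram, entries are counts.
print(toy_X.toarray())
```

"Mary" contributes the bigrams ma, ar, and ry, while "Anna" contributes an, nn, and na, so the two names share no columns in this tiny example.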


Let's split our training and test data now. 

In [6]:
# Split the row positions 70/30, then build a boolean mask (True = training row)
n_names = namechartdiff.shape[0]
itrain, itest = train_test_split(range(n_names), train_size=0.7)
mask = np.zeros(n_names, dtype=bool)
mask[itrain] = True
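
As an alternative sketch, `train_test_split` can also split the feature matrix and the labels directly, which avoids building a mask by hand. The data below is a stand-in, and `random_state=0` is an addition not used in the notebook; it only makes the split repeatable.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

# Stand-in data: 10 samples with 3 count features each, plus binary labels.
toy_X = csr_matrix(np.arange(30).reshape(10, 3))
toy_y = np.array([0, 1] * 5)

# Split rows of X and y together; random_state (an assumption here) fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    toy_X, toy_y, train_size=0.7, random_state=0
)
print(X_train.shape, X_test.shape)
```

Without a fixed `random_state`, each run draws a different split, so the accuracy figures reported below will vary slightly from run to run.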

Now we train the model.

In [7]:
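# Select the training and test rows with the boolean mask
Xtrainthis = X[mask]
Ytrainthis = y[mask]
Xtestthis = X[~mask]
Ytestthis = y[~mask]

# Fit multinomial Naive Bayes with add-one (Laplace) smoothing
clf = MultinomialNB(alpha=1)
clf.fit(Xtrainthis, Ytrainthis)

# Mean accuracy on the training set and on the held-out test set
training_accuracy = clf.score(Xtrainthis, Ytrainthis)
test_accuracy = clf.score(Xtestthis, Ytestthis)

print(training_accuracy)
print(test_accuracy)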

0.74180639664
0.735080058224
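
Accuracy can hide how well each class is handled, and `f1_score` is imported at the top of the notebook but never used. One possible use is sketched below on a tiny made-up sample; the names, labels, and the `toy_` variables are assumptions for illustration, so the resulting score says nothing about the real model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up sample: 1 = male, 0 = female.
toy_train_names = ["John", "James", "Robert", "Mary", "Anna", "Emma"]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test_names = ["Jon", "Maria", "Roberta", "Ann"]
toy_test_labels = [1, 0, 0, 0]

# Same vectorizer and classifier settings as the notebook.
toy_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
toy_model = MultinomialNB(alpha=1)
toy_model.fit(toy_vectorizer.fit_transform(toy_train_names), toy_train_labels)

# F1 for the positive (male) class combines precision and recall into one number.
toy_pred = toy_model.predict(toy_vectorizer.transform(toy_test_names))
toy_f1 = f1_score(toy_test_labels, toy_pred)
print(toy_f1)
```

On the real data, the equivalent call would be `f1_score(Ytestthis, clf.predict(Xtestthis))`.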


Now let's define a function that will allow us to easily look up and predict individual names. 

In [8]:
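def lookup(x):
    # Coerce the input to a string and vectorize it with the fitted bigram vocabulary
    name = str(x)
    new = char_vectorizer.transform([name])
    # predict returns a one-element array: 1 = male, 0 = female
    y_pred = clf.predict(new)
    if y_pred[0] == 1:
        print("This is most likely a male name!")
    else:
        print("This is most likely a female name!")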

I've looked up my own name and determined that it's most likely male!

In [9]:
lookup("Roger")

This is most likely a male name!
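
The same idea extends from single names to a whole list. The sketch below averages the predicted male probability over a list to estimate its gender split; `estimate_male_share` is a hypothetical helper, and the model is trained on a tiny made-up sample so that the block runs on its own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training sample (1 = male, 0 = female), standing in for the full dataset.
toy_names = ["John", "James", "Robert", "Mary", "Anna", "Emma"]
toy_labels = np.array([1, 1, 1, 0, 0, 0])

toy_vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
toy_clf = MultinomialNB(alpha=1).fit(toy_vectorizer.fit_transform(toy_names), toy_labels)

def estimate_male_share(names, vectorizer, model):
    """Hypothetical helper: average P(male) over a list of names."""
    proba = model.predict_proba(vectorizer.transform(names))
    # classes_ gives the column order of predict_proba; find the column for label 1.
    male_column = list(model.classes_).index(1)
    return float(proba[:, male_column].mean())

share = estimate_male_share(["Roger", "Maria", "Joanna"], toy_vectorizer, toy_clf)
print(f"Estimated male share: {share:.2f}")
```

With the notebook's fitted objects, the call would be `estimate_male_share(list_of_names, char_vectorizer, clf)`.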
