Final Project

PROJECT TOPIC Predict gender of baby name base on state

Authors Vincent Q. Do, Dang H. Tran, and Ryan N. Treftz


# I. Introduction:

This project is intended to make a prediction of a baby’s gender and its state.
Predicting the baby’s gender will help the local government to have a better idea
about the future workforce. This should give us some insight on how certain states
like to name their children.


# II. Data Source:

The data source we are using is ‘US Baby Names’ dataset. Only names with at
least 5 babies born in the same year/state are included in this dataset for privacy.
Which means very unique names will be excluded.

# III  Outcome and Expectations:

Create a model that will accurately predict the gender of a baby based on its
name and the state it’s borned in.
We expect the model to have somewhere around 80-90% accuracy because
there are other variables that we are disregarding such as current time period, pop
culture, and ethnicity.

Using the name, we can predict what state the babies will be born in (use state population) => Professor suggestion

# Task 1

Read in the data

In [3]:
#Proposal Solution
import pandas as pd
import math

df = pd.read_csv("StateNames.csv", usecols = ['Name', 'Year', 'Gender','State','Count'])

In [42]:
import pprint
pp = pprint.PrettyPrinter(indent="2")

# Task 2

Decide on data structure to use

# Task 3

Split the data into `train, test`.
- Train names are from prior to 2000
- Test names are from 2000 to 2014

In [340]:
#Split the data into a training, devtest and test sets. (52% train, 24% devtest and 24% test)
train_names = df[(df['Year'] < 1980)]
train_names_data = pd.DataFrame(train_names.groupby(["State", "Name"])["Count"].sum()).reset_index()

devtest_names = df[(df['Year'] > 1979) & (df['Year'] < 2000)]
test_names = df[(df['Year'] >= 2000)]

In [23]:
#Data source stored in a data frame
baby_data = df[(df["State"] == "CA") | (df["State"] == "NY") | (df["State"] == "TX")]
state_df = pd.DataFrame(baby_data.groupby(["State", "Name"])["Count"].sum()).reset_index()

In [13]:
size = len(set(state_data.Name))
size

25241

In [87]:
def baby_features(name, count):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["size"] = len(name)
    features["count"] = count
    return features

In [88]:
#Data source stored in a dictionary
state_dict = state_data.groupby('State')[['Name','Count']].apply(lambda x: x.set_index('Name').to_dict()).to_dict()

featuresets = []
for state in state_dict:
    for name in state_dict[state]["Count"]:
        count = state_dict[state]["Count"][name]
        featuresets.append((baby_features(name, count), state))

In [92]:
import random
random.shuffle(featuresets)
#(training: 50%, devtest: 30% and test:20)
train_set = featuresets[:24500]
devtest_set = featuresets[24500:39200]
test_set = featuresets[39200:]

# Task 4

Create the classifier:
- how many times a give name and gender appears
- calculate population of states
- prior prob

In [101]:
def population(data):
    #Get the total number of times each name appears for each state.
    state = pd.DataFrame(data.groupby(['State'])['Count'].sum()).reset_index()
    #retain the total number of records.
    country = state.Count.sum()
    return (state, country)

#Store the state dataset and country count for the training set.
state_population, country_population = population(state_df)

In [106]:
def prior_probabilities(state, total):
    return int(state_data[state_data.State == state].Count) / total

In [109]:
prior_probabilities("NY", country_population)

0.3186979768152494

In [446]:
#def conditional_probability(data, state, name, num_of_names, vocab_size):
    #P(w|c) = count(w,c) + 1 / count(c) + |V|
#    if data[data.State == state].any().State and data[data.Name == name].any().Name:
#        count = int(data[(data.State == state) & (data.Name == name)].Count)
#        result = (count + 1) / (num_of_names + vocab_size)
#    else:
#        result = 1 / (num_of_names + vocab_size)
#    return result

In [451]:
def conditional_probability(state, name, count, unique_names):
    #P(w|c) = count(w,c) + 1 / count(c) + |V|
    #if name in state_dict[state]["Count"]:
    #    result = (state_dict[state]["Count"][name] + 1) / (count + unique_names)
    #else:
    #    result = 1 / (count + unique_names)
    #return result

In [268]:
#params:
#state = state to calculate the conditional prob
#name = name to query the featureset counts
#count = the total number of names(count) in the provided state
#unique_names = number of unique names in the country
def conditional_probability(state, name, count, unique_names):
    #P(w|c) = count(w,c) + 1 / count(c) + |V|
    result = 0
    state_set = [feature[0] for feature in train_set if feature[1] == state]
    for features in state_set:
        if features["first_letter"] == name[0].lower() and features["last_letter"] == name[-1] and features["size"] == len(name):
            print("WHY???")
            result += features["count"]
        return (result + 1) / (count + unique_names)

In [269]:
test_count = int(state_population[state_population.State == "CA"].Count)
conditional_probability("CA", "Ryan", test_count, country_population)

9.59533240044953e-09

In [208]:
print(train_set[0])

({'first_letter': 'f', 'last_letter': 'n', 'size': 7, 'count': 5}, 'TX')


In [127]:
int(state_population[state_population.State == "CA"].Count)

29252805

In [452]:
conditional_probability("PA", "Ryan", country_population, size)

0.00016982499290764958

In [289]:
states = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DC", "DE", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

In [508]:
def nb_classifier(name):
    state_prob = {}
    
    for state in states:
        state_prior = prior_probabilities(state, country_population) 
        state_cp = conditional_probability(state, name, country_population, size)
        state_prob[state] = math.log10(state_prior) + math.log10(state_cp)

    guess = [key for key, value in state_prob.items() if value == max(state_prob.values())]
    percentage = state_prob[guess[0]] / int(sum([value for key, value in state_prob.items()])) * 100
    return guess, percentage

In [509]:
test_guess, test_percentage = nb_classifier("Ryan")
print(test_guess, test_percentage)
#type(sum([value for key, value in state_test.items()]))

['CA'] 1.3687313665758


In [461]:
#nb_classifier("Ryan")
test_prior = prior_probabilities("CA", country_population)
test_cp = conditional_probability("CA", "Ryan", country_population, size)
print(test_prior)
print(test_cp)

0.09787365990433337
0.00037552991901338713


In [447]:
def test_classifier(data):
    total = correct = 0
    for index, row in data.iterrows():
        print(row["Name"])
        guess = nb_classifier(row["Name"], size)
        print(guess, row["State"])
        if guess == row["State"]:
            correct += 1
        total += 1
    return (correct/total) * 100

In [386]:
test_classifier(devtest_names[0:5])

Jessica
['NY'] AK
Jennifer
['CA'] AK
Sarah
['NY'] AK
Amanda
['TX'] AK
Melissa
['NY'] AK


0.0

In [224]:
devtest_names[:5].Name
#test = nb_classifier(set(devtest_names.Name), state_data)
#train_names[train_names.State == "PA"].any().State and train_names[train_names.Name == "Ryan"].any().Name

0     Jessica
1    Jennifer
2       Sarah
3      Amanda
4     Melissa
Name: Name, dtype: object

In [375]:
#size = len(set(train_names.Name))
conditional_probabilities(train_names, "PA", "Aaron", country_population, size)

3.8860460805709485e-05

In [168]:
#Example of how to include gender in the returned label. Given the name predict the state and gender.
#This would require that the male and female conditional probabilites are calculated for each name in the classifier.
#example_data = pd.DataFrame(train_names.groupby(['State', 'Name', 'Gender'])['Count'].sum()).reset_index()
#state_count = int(name_data[(name_data.State == "AK") & (name_data.Name == "Aaron") & (name_data.Gender == "M")].Count)

In [370]:
#baseline
baseline = {}
for state in train_names.State:
    if state not in baseline:
        baseline[state] = 0
    baseline[state] += 1
    
max_value = [(key,value) for key, value in baseline.items() if value == max(baseline.values())]
print(max_value[0][1] / len(train_names.Name))

0.052522673656265555
