# Name Data


## Gathering

If there is a "gold standard" for a first name and gender dataset, it must be the [National data](https://www.ssa.gov/oact/babynames/limits.html) provided by the U.S. Social Security Administration.

> All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.
>
> All data are from a 100% sample of our records on Social Security card applications as of March 2019.

Let's get the dataset:

In [1]:
import requests
import io
import re
from zipfile import ZipFile


url = "https://www.ssa.gov/oact/babynames/names.zip"
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed

raw = io.BytesIO(r.content)
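data = {}  # year of birth -> list of raw "name,sex,count" lines

with ZipFile(raw) as myzip:
    # Keep only the yearly data files and skip anything else in the archive
    file_names = [f for f in myzip.namelist() if 'yob' in f]
    for f in file_names:
        # The four-digit year is embedded in the file name
        year_of_birth = int(re.search(r"[0-9]{4}", f).group())
        with myzip.open(f) as myfile:
            data[year_of_birth] = myfile.read().decode().splitlines()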
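
The extraction logic can be checked without touching the network by running the same steps on a small in-memory archive. This is only a sketch: the file names and contents below are made-up stand-ins, and it assumes the archive holds `yobYYYY.txt` files alongside other files.

```python
import io
import re
from zipfile import ZipFile

# Build a tiny stand-in archive in memory (contents are made up for illustration)
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("yob1999.txt", "Emily,F,10\nJacob,M,12\n")
    archive.writestr("yob2000.txt", "Emily,F,7\n")
    archive.writestr("readme.pdf", "not a data file")

sample_data = {}
with ZipFile(buffer) as archive:
    for file_name in archive.namelist():
        if "yob" not in file_name:
            continue  # skip anything that is not a yearly data file
        year = int(re.search(r"[0-9]{4}", file_name).group())
        with archive.open(file_name) as handle:
            sample_data[year] = handle.read().decode().splitlines()

print(sample_data)
```

Each year maps to a list of raw `name,sex,count` lines, which is the shape the preprocessing step expects.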


### Preprocessing
Since we want to aggregate these counts by name across several years, let's first choose a starting year and then sum the counts from that year onward.




In [2]:
name_counts = {}
start_year = 1970

for year, names in data.items():
    if year < start_year:
        continue
    for line in names:
        name, gender, count = line.split(",")
        count = int(count)
        current_counts = name_counts.get(name, {'M': 0, 'F': 0})
        current_counts[gender] += count
        name_counts[name] = current_counts
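
To make the resulting structure concrete, here is the same aggregation run on a few made-up lines. The helper name `aggregate_counts` and the sample data are illustrative assumptions, and the sketch swaps the `dict.get` pattern for a `collections.defaultdict`.

```python
from collections import defaultdict

def aggregate_counts(data, start_year):
    """Sum M/F counts per name over all years >= start_year."""
    totals = defaultdict(lambda: {"M": 0, "F": 0})
    for year, lines in data.items():
        if year < start_year:
            continue
        for line in lines:
            name, gender, count = line.split(",")
            totals[name][gender] += int(count)
    return dict(totals)

sample = {
    1969: ["Alex,M,100"],            # ignored: before start_year
    1970: ["Alex,M,5", "Alex,F,2"],
    1971: ["Alex,M,3", "Sam,F,4"],
}
counts = aggregate_counts(sample, start_year=1970)
print(counts)  # {'Alex': {'M': 8, 'F': 2}, 'Sam': {'M': 0, 'F': 4}}
```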
        

In [3]:
import pandas as pd

df = pd.DataFrame(list(name_counts.items()), columns=['Name', 'Counts'])
df = df.join(df['Counts'].apply(pd.Series))
del df['Counts']

In [4]:
df.sample(5)

|       | Name      | M   | F  |
|-------|-----------|-----|----|
| 51915 | Rayqwan   | 11  | 0  |
| 20838 | Neshawn   | 64  | 10 |
| 59181 | Zamier    | 120 | 0  |
| 31805 | Starkesha | 0   | 35 |
| 72986 | Jahiyah   | 0   | 18 |


The SSA mentions a point that is of interest to us:
> Name data are not edited. For example, the sex associated with a name may be incorrect. Entries such as "Unknown" and "Baby" are not removed from the lists.

We will remove such placeholder entries manually.


In [5]:
# Manually removing data that are not 'real' names
import numpy as np
import operator

# Helper functions for chaining multiple filters

def make_filter(df, col, value, op):
    return op(df[col], value)

def combine_filters(df, boolean, *filters):
    return df[(boolean(*filters))]

def remove_names(names, df=df, col='Name', op=operator.ne):
    current_count = len(df)
    filters = []
    for name in names:
        f1 = make_filter(df, col, name, op)
        filters.append(f1)
        
    # np.logical_or.reduce == "matching any of these conditions"
    # np.logical_and.reduce == "matching all of these conditions"
    
    df1 = combine_filters(df, np.logical_and.reduce, filters)
    print("Removed {} Names".format(current_count - len(df1)))
    return df1

names_to_remove = ['Person', 'Baby', 'Unknown', 'First', 'Firstname', 'First Name', 'Boy', 'Girl',
                   'Man', 'Woman']

In [6]:
df = remove_names(names_to_remove)

Removed 5 Names


### Compiling

With our collected names and their respective male and female counts, we can now couple the data with a method that searches text for these names and returns their counts.

I find that [flashtext](https://github.com/vi3k6i5/flashtext) is well suited for this task. It implements a [trie dictionary](https://en.wikipedia.org/wiki/Trie), which helps avoid the performance bottleneck of searching for this many names one by one.
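
To give an intuition for why a trie helps, here is a minimal character-level sketch. It is not flashtext's actual implementation; the `NameTrie` class, its methods, and the counts are hypothetical and used only for illustration. Looking up a name takes time proportional to the length of the name rather than to the number of stored names.

```python
class NameTrie:
    """Toy trie mapping names (case-insensitively) to a stored value."""

    _END = "_end_"  # marker key holding the value at the end of a name

    def __init__(self):
        self.root = {}

    def add(self, name, value):
        node = self.root
        for char in name.lower():
            node = node.setdefault(char, {})
        node[self._END] = value

    def get(self, name, default=None):
        node = self.root
        for char in name.lower():
            if char not in node:
                return default
            node = node[char]
        return node.get(self._END, default)

trie = NameTrie()
trie.add("John", (10, 1))  # made-up counts
trie.add("Joan", (0, 7))
print(trie.get("john"))  # (10, 1)
print(trie.get("Jo"))    # None: "Jo" is only a prefix, not a stored name
```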

To install flashtext from a notebook:
```
!pip install flashtext
```
   
Let's create a `flashtext.keyword.KeywordProcessor` instance and load our data into it:


In [7]:
from flashtext.keyword import KeywordProcessor
kp = KeywordProcessor()
for i, row in df.iterrows():
    name, n_male, n_female = row['Name'], row['M'], row['F']
    kp.add_keyword(name, (n_male, n_female))


Note that we are storing the counts in a tuple, so their order matters. Whenever we mention M/F counts, they will always follow the convention `(n_male, n_female)`.
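
One optional way to make this convention harder to get wrong is a named tuple, which behaves like a plain tuple but also exposes the fields by name. The `GenderCounts` type below is a hypothetical illustration and is not used elsewhere in this notebook.

```python
from collections import namedtuple

# Hypothetical record type; field order mirrors the (n_male, n_female) convention
GenderCounts = namedtuple("GenderCounts", ["n_male", "n_female"])

counts = GenderCounts(n_male=120, n_female=30)  # made-up numbers
n_male, n_female = counts            # still unpacks like a plain tuple
print(counts.n_female, sum(counts))  # 30 150
```

If such a type were stored in a pickled object, its definition would need to be importable wherever the pickle is loaded, which is one reason to stay with plain tuples here.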

Now let's try our new `KeywordProcessor`:

In [8]:
test_names = ["John", "john", "smith, john", "john smith", "John Smith III"]
for name in test_names:
    # Several keywords may match; keep the tuple with the highest total count
    result = max(kp.extract_keywords(name), key=sum)
    print("{} , n_male : {}, n_female: {}".format(name, result[0], result[1]))
    

John , n_male : 1218600, n_female: 5448
john , n_male : 1218600, n_female: 5448
smith, john , n_male : 1218600, n_female: 5448
john smith , n_male : 1218600, n_female: 5448
John Smith III , n_male : 1218600, n_female: 5448


As we can see, "John" is detected in every string above, regardless of case or position, and its counts are returned. This works because `KeywordProcessor` is case-insensitive by default.
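
One caveat: `max` raises `ValueError` on an empty sequence, so a string containing no known name would fail in the loop above. A hedged sketch of a safer lookup follows. `best_match` is a hypothetical helper, and `fake_extract` with its made-up counts stands in for `kp.extract_keywords` so that the example runs on its own.

```python
def best_match(extract, text, default=(0, 0)):
    """Return the (n_male, n_female) tuple with the largest total, or default."""
    return max(extract(text), key=sum, default=default)

# Stand-in for kp.extract_keywords, with made-up counts
known = {"john": (100, 2), "smith": (5, 0)}

def fake_extract(text):
    words = text.lower().replace(",", " ").split()
    return [known[w] for w in words if w in known]

print(best_match(fake_extract, "John Smith"))  # (100, 2)
print(best_match(fake_extract, "xyzzy"))       # (0, 0)
```

With the real processor the call would presumably look like `best_match(kp.extract_keywords, name)`.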

Now we can persist our `KeywordProcessor` with `pickle` so that we can use it in future notebooks:

In [9]:
import pickle
with open(r'data/keyword_processor.pkl', 'wb') as fp:
    pickle.dump(kp, fp)
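
In a later notebook the object can be loaded back with `pickle.load`. The sketch below round-trips a plain dictionary through a temporary directory so that it runs without flashtext installed; the dictionary is only a placeholder. With the real `KeywordProcessor`, flashtext must be importable when unpickling, and only trusted pickle files should be loaded, since unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile

stand_in = {"john": (100, 2)}  # placeholder for the KeywordProcessor, made-up counts

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "keyword_processor.pkl")
    with open(path, "wb") as fp:
        pickle.dump(stand_in, fp)
    with open(path, "rb") as fp:  # note the binary read mode
        restored = pickle.load(fp)

print(restored == stand_in)  # True
```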