# Name Data


## Gathering

If there is a "gold standard" for a first name and gender dataset, it must be the [National data](https://www.ssa.gov/oact/babynames/limits.html) provided by the U.S. Social Security Administration (SSA).

> All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.
> All data are from a 100% sample of our records on Social Security card applications as of March 2019.

Let's download the dataset and examine it.

In [1]:
import io
import re
from zipfile import ZipFile

import requests

url = "https://www.ssa.gov/oact/babynames/names.zip"
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed

raw = io.BytesIO(r.content)
data = {}  # year of birth -> list of "Name,Sex,Count" lines

with ZipFile(raw) as myzip:
    # Keep only the per-year files (their names contain 'yob')
    file_names = [f for f in myzip.namelist() if 'yob' in f]
    for f in file_names:
        # The four-digit year is taken from the file name
        year_of_birth = int(re.search(r"[0-9]{4}", f).group())
        with myzip.open(f) as myfile:
            data[year_of_birth] = myfile.read().decode().splitlines()
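
To see what the loop above produces without downloading anything, here is a minimal sketch that runs the same steps on a tiny in-memory zip file. The file names and rows are invented for illustration; the sketch only assumes what the code above implies, namely that per-year files have names containing `yob` and a four-digit year.

```python
import io
import re
from zipfile import ZipFile

# Build a tiny in-memory archive. File names and rows are made up.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("yob1990.txt", "Jessica,F,3\nMichael,M,5\n")
    archive.writestr("yob1991.txt", "Ashley,F,4\n")
    archive.writestr("ReadMe.txt", "not a data file")
buffer.seek(0)

sample = {}
with ZipFile(buffer) as archive:
    for file_name in archive.namelist():
        if "yob" not in file_name:
            continue  # skip anything that is not a per-year file
        year = int(re.search(r"[0-9]{4}", file_name).group())
        with archive.open(file_name) as handle:
            sample[year] = handle.read().decode().splitlines()

print(sample)
```

Each key is a year and each value is the list of raw lines for that year, which is the shape the `data` dictionary has after the cell above runs.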


Since we want to aggregate these counts by name across several years, we first choose a starting year and then sum the counts for every year from that point on.




In [2]:
name_counts = {}  # name -> {'M': total, 'F': total}
start_year = 1970

for year, names in data.items():
    # Skip years before the chosen range
    if year < start_year:
        continue
    for line in names:
        # Each line has the form "Name,Sex,Count"
        name, gender, count = line.split(",")
        current_counts = name_counts.setdefault(name, {'M': 0, 'F': 0})
        current_counts[gender] += int(count)
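
As a small worked example of this aggregation, the sketch below applies the same logic to a few made-up rows. It uses `collections.defaultdict` instead of a plain dictionary, which is a stylistic substitution and not what the cell above does; the names and counts are invented.

```python
from collections import defaultdict

# Made-up rows keyed by year, in the same "Name,Sex,Count" form.
sample = {
    1969: ["Jordan,M,7"],
    1970: ["Jordan,M,10", "Jordan,F,2"],
    1971: ["Jordan,F,3", "Casey,F,6"],
}
start_year = 1970

# Every new name starts with zero counts for both genders.
totals = defaultdict(lambda: {"M": 0, "F": 0})
for year, lines in sample.items():
    if year < start_year:
        continue  # outside the chosen range
    for line in lines:
        name, gender, count = line.split(",")
        totals[name][gender] += int(count)

print(dict(totals))
# {'Jordan': {'M': 10, 'F': 5}, 'Casey': {'M': 0, 'F': 6}}
```

The 1969 row is ignored because it falls before `start_year`, and a name that appears under both genders keeps a separate total for each.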
        

In [3]:
import pandas as pd

# One row per name, with the per-gender totals in the 'M' and 'F' columns
df = pd.DataFrame.from_dict(name_counts, orient='index')
df = df.rename_axis('Name').reset_index()[['Name', 'M', 'F']]

In [4]:
df.sample(5)

|       | Name         | M | F  |
|-------|--------------|---|----|
| 77634 | Leonydus     | 5 | 0  |
| 87001 | Kaelahni     | 0 | 6  |
| 43032 | Marielos     | 0 | 10 |
| 60933 | Juanfernando | 5 | 0  |
| 51795 | Jevontay     | 5 | 0  |


The SSA mentions a point that is of interest to us:
> Name data are not edited. For example, the sex associated with a name may be incorrect. Entries such as "Unknown" and "Baby" are not removed from the lists.

We will remove such placeholder entries manually.


In [5]:
# Manually removing data that are not 'real' names
import numpy as np
import operator

# Helper functions for chaining multiple filters

def make_filter(df, col, value, op):
    """Return a boolean mask: `op` applied to df[col] and `value`."""
    return op(df[col], value)

def combine_filters(df, boolean, *filters):
    """Apply `boolean` to the filters and return the rows it selects."""
    return df[(boolean(*filters))]

def remove_names(names, df=df, col='Name', op=operator.ne):
    # Note: the default `df` is bound once, when this function is defined
    current_count = len(df)
    # One "Name != value" mask per name to remove
    filters = [make_filter(df, col, name, op) for name in names]

    # np.logical_or.reduce == "matching any of these conditions"
    # np.logical_and.reduce == "matching all of these conditions"
    # A row is kept only if its name differs from every entry in `names`
    df1 = combine_filters(df, np.logical_and.reduce, filters)
    print("Removed {} Names".format(current_count - len(df1)))
    return df1

names_to_remove = ['Person', 'Baby', 'Unknown', 'First', 'Firstname', 'First Name', 'Boy', 'Girl',
                   'Man', 'Woman']

In [6]:
df = remove_names(names_to_remove)

Removed 5 Names
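
The same kind of filter can also be sketched with pandas' `Series.isin`, which avoids building one mask per name. This is an alternative to the helper functions above, not the method used in this notebook, and the toy frame and placeholder list below are made up for illustration.

```python
import pandas as pd

# Made-up frame with the same columns as `df`.
toy = pd.DataFrame({
    "Name": ["Baby", "Maria", "Unknown", "James"],
    "M": [12, 0, 7, 30],
    "F": [9, 25, 8, 0],
})
placeholders = ["Baby", "Unknown", "Person"]

# Keep rows whose Name is not in the list of placeholders.
cleaned = toy[~toy["Name"].isin(placeholders)]
print("Removed {} Names".format(len(toy) - len(cleaned)))
```

Matching with `isin` is exact and case-sensitive, so a variant such as `baby` would not be removed unless it is listed explicitly.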
