# Name Data


## Gathering

If there is a "gold standard" for a first name and gender dataset, it is arguably the [National data](https://www.ssa.gov/oact/babynames/limits.html) provided by the U.S. Social Security Administration. The site describes the data as follows:

> All names are from Social Security card applications for births that occurred in the United States after 1879. Note that many people born before 1937 never applied for a Social Security card, so their names are not included in our data. For others who did apply, our records may not show the place of birth, and again their names are not included in our data.
> All data are from a 100% sample of our records on Social Security card applications as of March 2019.

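Let's examine the dataset. The cell below downloads the zip archive, keeps the files whose names contain `yob` (year of birth), and stores the lines of each file keyed by the four-digit year in its name.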

In [6]:
import requests
import io
import re
from zipfile import ZipFile


url = "https://www.ssa.gov/oact/babynames/names.zip"
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # fail early if the download did not succeed

raw = io.BytesIO(r.content)
data = {}

with ZipFile(raw) as myzip:
    file_names = myzip.namelist()
    file_names = filter(lambda x: 'yob' in x, file_names)
    for f in file_names:
        year_of_birth = int(re.search(r"[0-9]{4}", f).group())
        with myzip.open(f) as myfile:
            data[year_of_birth] = myfile.read().decode().splitlines()
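
The same reading logic can be tried offline on a tiny archive built in memory. This is only a sketch: the file names and records in `sample_files` are made up for illustration, and `read_yearly_files` is a hypothetical helper that wraps the loop above, not part of the original notebook.

```python
import io
import re
from zipfile import ZipFile

# Hypothetical archive contents, invented for illustration only.
sample_files = {
    "yob1999.txt": "Alex,M,10\nAlex,F,4\n",
    "yob2000.txt": "Alex,M,7\nSam,F,5\n",
    "ReadMe.pdf": "not a data file",
}

# Build a small zip archive entirely in memory.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    for file_name, text in sample_files.items():
        archive.writestr(file_name, text)
buffer.seek(0)


def read_yearly_files(raw):
    """Return {year_of_birth: [lines]} for every 'yob' file in the archive."""
    result = {}
    with ZipFile(raw) as myzip:
        for f in myzip.namelist():
            if "yob" not in f:
                continue  # skip anything that is not a yearly data file
            year_of_birth = int(re.search(r"[0-9]{4}", f).group())
            with myzip.open(f) as myfile:
                result[year_of_birth] = myfile.read().decode().splitlines()
    return result


sample_data = read_yearly_files(buffer)
print(sorted(sample_data))
print(sample_data[2000])
```

Each value is a list of raw `name,gender,count` strings, which is the shape the aggregation step further down expects.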


We are interested in the total count per name and gender across several years, so we first choose a starting year and then sum the counts for every year from that point on.


In [7]:
name_counts = {}
start_year = 1970

for year, names in data.items():
    if year < start_year:
        continue
    for line in names:
        name, gender, count = line.split(",")
        count = int(count)
        current_counts = name_counts.get(name, {'M': 0, 'F': 0})
        current_counts[gender] += count
        name_counts[name] = current_counts
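
To make the result of this aggregation concrete, here is a minimal sketch of the same loop run on toy data. The helper name `aggregate_counts` and the records in `toy_data` are assumptions made for illustration; `collections.defaultdict` stands in for the `dict.get` pattern used above and should yield the same totals.

```python
from collections import defaultdict


def aggregate_counts(data, start_year):
    """Sum per-name counts by gender over all years >= start_year."""
    totals = defaultdict(lambda: {"M": 0, "F": 0})
    for year, lines in data.items():
        if year < start_year:
            continue  # ignore years before the chosen range
        for line in lines:
            name, gender, count = line.split(",")
            totals[name][gender] += int(count)
    return dict(totals)


# Invented records in the same "name,gender,count" format.
toy_data = {
    1969: ["Alex,M,100"],
    1970: ["Alex,M,10", "Alex,F,4"],
    1971: ["Alex,M,7", "Sam,F,5"],
}

toy_counts = aggregate_counts(toy_data, start_year=1970)
print(toy_counts)
# {'Alex': {'M': 17, 'F': 4}, 'Sam': {'M': 0, 'F': 5}}
```

The 1969 record is skipped because it falls before `start_year`, and a name seen with only one gender still gets a zero for the other.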
        

In [12]:
# make a pandas DataFrame
import pandas as pd

df = pd.DataFrame(list(name_counts.items()), columns=['Name', 'Counts'])
df = df.join(df['Counts'].apply(pd.Series))
del df['Counts']
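
As an alternative sketch, a nested dictionary of this shape can also be turned into a frame with `pd.DataFrame.from_dict(..., orient="index")`, which avoids expanding a column of dictionaries row by row. The `toy_counts` values below are invented for illustration; on the real `name_counts` the approach should produce the same `Name`, `M`, `F` columns.

```python
import pandas as pd

# Invented counts in the same {name: {'M': ..., 'F': ...}} shape.
toy_counts = {
    "Alex": {"M": 17, "F": 4},
    "Sam": {"M": 0, "F": 5},
}

toy_df = (
    pd.DataFrame.from_dict(toy_counts, orient="index")  # names become the index
    .rename_axis("Name")   # label the index so it becomes a 'Name' column
    .reset_index()         # move names into a regular column
)
print(toy_df)
```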

In [14]:
df.sample(100)


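|       | Name      | M  | F   |
|-------|-----------|----|-----|
| 43824 | Rondie    | 5  | 0   |
| 80752 | Adithri   | 0  | 23  |
| 82917 | Kingman   | 6  | 0   |
| 71163 | Zamyiah   | 0  | 90  |
| 62735 | Darreion  | 12 | 0   |
| 36164 | Corddaryl | 5  | 0   |
| 61825 | Cadynce   | 0  | 272 |
| 76697 | Armonnie  | 0  | 5   |
| 39312 | Kevona    | 0  | 113 |
| 83813 | Liyan     | 19 | 5   |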
