In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

Now let's use our knowledge of pandas basics to explore the large baby names dataset. We'll start by loading the dataset from the Social Security Administration's website.

To keep the data small enough to avoid crashing DataHub, we'll look only at New York State rather than the national dataset.

In [2]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # skip the download if the file already exists
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ny_name = 'NY.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ny_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.sample(5)

Unnamed: 0,State,Sex,Year,Name,Count
43692,NY,F,1959,Deena,10
159751,NY,F,2015,Heather,31
17712,NY,F,1934,Noel,7
136467,NY,F,2006,Jadyn,34
30454,NY,F,1949,Kay,77


## Goal 1: Find the most popular baby name in New York in 2018

In [12]:
babynames[babynames["Year"] == 2018].sort_values("Count", ascending=False).head(5)

Unnamed: 0,State,Sex,Year,Name,Count
301439,NY,M,2018,Liam,1515
301440,NY,M,2018,Noah,1273
166710,NY,F,2018,Emma,1099
166711,NY,F,2018,Olivia,1092
301441,NY,M,2018,Jacob,1020
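
The filter-then-sort pattern used above can be tried on a tiny hand-made table. The sketch below is only an illustration: the DataFrame `toy` and its values are invented, not taken from the SSA file. It also shows `nlargest`, which should give the same top rows without sorting the whole frame.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only (not from the SSA file).
toy = pd.DataFrame({
    "Year":  [2017, 2018, 2018, 2018],
    "Name":  ["Ava", "Liam", "Emma", "Noah"],
    "Count": [50, 30, 20, 25],
})

# The boolean mask keeps only the 2018 rows; sorting by Count in
# descending order puts the most popular name first.
top_sorted = toy[toy["Year"] == 2018].sort_values("Count", ascending=False).head(2)

# nlargest selects the rows with the largest Count directly.
top_nlargest = toy[toy["Year"] == 2018].nlargest(2, "Count")

print(top_sorted)
print(top_nlargest)
```

Both expressions select the same two rows here. On the real data, rows with tied counts may come back in a different order depending on the method.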


## Goal 2: Find baby names that start with T

In [38]:
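# boolean series: True where the name begins with a capital T
starts_with_T = babynames['Name'].str.startswith('T')
# keep each matching name once, then show a random sample of 10
pd.DataFrame(babynames[starts_with_T]['Name'].unique()).sample(10)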

Unnamed: 0,0
638,Teofilo
291,Tyiesha
698,Taran
137,Tarsha
295,Tasheba
613,Tyreek
10,Thomas
514,Tobias
408,Tasnia
428,Tehya
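
To see what the boolean mask from `str.startswith` looks like, here is a minimal sketch on a made-up series. The variable `names` and its contents are assumptions for illustration. Note that the match is case-sensitive, so a lowercase "tara" would not count.

```python
import pandas as pd

# Hypothetical names, invented for illustration only.
names = pd.Series(["Thomas", "Tobias", "Emma", "Thomas", "tara"])

# One True/False per element; the comparison is case-sensitive.
mask = names.str.startswith("T")

# Indexing with the mask keeps the True rows; unique() removes repeats.
unique_t = names[mask].unique()

print(mask.tolist())
print(list(unique_t))
```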


## Goal 3: Sort names by their length.

Approach 1a: Create a new series containing only the lengths, add that series to the dataframe as a column, sort by that column, and then drop the column.

In [43]:
# create a new series of only the lengths (map the len function over the Name series)
babyname_lengths = babynames["Name"].map(len)

# add that series to the dataframe as a column
babynames["name_length"] = babyname_lengths

# sort by that column
babynames_by_length = babynames.sort_values(by="name_length")

# drop that column from the sorted copy
babynames_by_length = babynames_by_length.drop("name_length", axis=1)
babynames_by_length.head(5)

Unnamed: 0,State,Sex,Year,Name,Count
182248,NY,M,1920,Wm,9
215753,NY,M,1964,Ty,13
196804,NY,M,1942,Ed,6
255658,NY,M,1994,Al,6
258394,NY,M,1996,Ty,24


Approach 1b: Same as 1a, but use str.len() to generate the lengths of the strings.

In [44]:
# create a new series of only the lengths
babyname_lengths = babynames["Name"].str.len()

# add that series to the dataframe as a column
babynames["name_length"] = babyname_lengths

# sort by that column
babynames_by_length = babynames.sort_values(by="name_length")

# drop that column from the sorted copy
babynames_by_length = babynames_by_length.drop("name_length", axis=1)
babynames_by_length.head(5)

Unnamed: 0,State,Sex,Year,Name,Count
182248,NY,M,1920,Wm,9
215753,NY,M,1964,Ty,13
196804,NY,M,1942,Ed,6
255658,NY,M,1994,Al,6
258394,NY,M,1996,Ty,24
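
As an aside, the add-sort-drop sequence in Approaches 1a and 1b can be collapsed by using the `key` argument of `sort_values`. This is a sketch under the assumption that pandas 1.1 or later is installed (the `key` parameter is not available in earlier versions), and the `toy` DataFrame is invented for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({"Name": ["Thomas", "Ty", "Emma", "Al"],
                    "Count": [5, 13, 7, 6]})

# key receives the Name column as a Series and must return a Series of
# sort keys; here the keys are the string lengths. mergesort is stable,
# so names of equal length keep their original relative order.
by_length = toy.sort_values("Name", key=lambda s: s.str.len(), kind="mergesort")

print(by_length)
```

No temporary column is created, so the original frame is left unchanged.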


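Approach 2: Generate an index that is in the order we want, then pass that index to loc. (The name_length column appears in the outputs below because Approaches 1a and 1b added it to babynames itself; only the sorted copy had it dropped.)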

In [45]:
babynames.loc[babynames["Name"].str.len().sort_values().index].head(5)

Unnamed: 0,State,Sex,Year,Name,Count,name_length
182248,NY,M,1920,Wm,9,2
215753,NY,M,1964,Ty,13,2
196804,NY,M,1942,Ed,6,2
255658,NY,M,1994,Al,6,2
258394,NY,M,1996,Ty,24,2


How does this work exactly? Let's break it into pieces.

In [46]:
lengths_sorted_by_length = babynames["Name"].str.len().sort_values()
lengths_sorted_by_length.head(5)

182248    2
215753    2
196804    2
255658    2
258394    2
Name: Name, dtype: int64

In [47]:
index_sorted_by_length = lengths_sorted_by_length.index
index_sorted_by_length

Int64Index([182248, 215753, 196804, 255658, 258394, 264975,  17395, 298476,
            233432, 286723,
            ...
             53107,  78092, 272534,  58087, 279907,  48458,  70613,  66973,
             54835, 250865],
           dtype='int64', length=309532)

In [48]:
# now pass the index to loc. This is yet another way 
# that loc can be used that we did not discuss in lecture!
babynames.loc[index_sorted_by_length].head(5)

Unnamed: 0,State,Sex,Year,Name,Count,name_length
182248,NY,M,1920,Wm,9,2
215753,NY,M,1964,Ty,13,2
196804,NY,M,1942,Ed,6,2
255658,NY,M,1994,Al,6,2
258394,NY,M,1996,Ty,24,2
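
The same three steps can be traced on a table small enough to check by hand. This is a minimal sketch: the `toy` DataFrame and its index labels 10, 20, 30 are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with explicit index labels, invented for illustration.
toy = pd.DataFrame({"Name": ["Thomas", "Ty", "Emma"]}, index=[10, 20, 30])

# Step 1: lengths keep the original index labels (10 -> 6, 20 -> 2, 30 -> 4).
lengths = toy["Name"].str.len()

# Step 2: sorting the lengths reorders the labels along with the values.
order = lengths.sort_values().index

# Step 3: loc looks rows up by label, in the order the labels are given.
reordered = toy.loc[order]

print(list(order))
print(reordered)
```

Because loc selects by label rather than by position, passing the sorted index returns the rows in sorted order.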
