## US Baby Names 1880–2010

The United States Social Security Administration (SSA) has made available data on
the frequency of baby names from 1880 through the present. Hadley Wickham, an
author of several popular R packages, has used this dataset to illustrate data
manipulation in R.

In [1]:
!head -n 10 datasets/babynames/yob1880.txt

Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
Ida,F,1472
Alice,F,1414
Bertha,F,1320
Sarah,F,1288


There are many things you might want to do with the dataset:
- Visualize the proportion of babies given a particular name (your own, or another name) over time
- Determine the relative rank of a name
- Determine the most popular names in each year or the names whose popularity has advanced or declined the most
- Analyze trends in names: vowels, consonants, length, overall diversity, changes in spelling, first and last letters
- Analyze external sources of trends: biblical names, celebrities, demographics

As of this writing, the US Social Security Administration makes available data files,
one per year, containing the total number of births for each sex/name combination.
You can download the [raw](https://www.ssa.gov/oact/babynames/limits.html) archive of these files. 

If this page has been moved by the time you’re reading this, it can most likely be
located again with an internet search. After downloading the **“National data”** file
**names.zip** and unzipping it, you will have a directory containing a series of files like `yob1880.txt`. 

In [4]:
# code
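As a minimal sketch of loading one of these files, the example below reads the ten rows shown earlier from an in-memory buffer so that it runs without the download. The column names `name`, `sex`, and `births` are labels chosen for this walkthrough; the files themselves have no header row.

```python
import io
import pandas as pd

# First ten rows of yob1880.txt, copied from the `head` output above.
sample = io.StringIO(
    "Mary,F,7065\nAnna,F,2604\nEmma,F,2003\nElizabeth,F,1939\nMinnie,F,1746\n"
    "Margaret,F,1578\nIda,F,1472\nAlice,F,1414\nBertha,F,1320\nSarah,F,1288\n"
)

# The file has no header row, so the column names are supplied explicitly.
# With the real data you would pass the file path instead of the buffer,
# for example "datasets/babynames/yob1880.txt".
names1880 = pd.read_csv(sample, names=["name", "sex", "births"])
print(names1880)
```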

These files only contain names with **at least five occurrences in each year**, so for
simplicity’s sake we can use the sum of the births column by sex as the total number
of births in that year:

In [3]:
# code
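One possible way to compute that total is a `groupby` on the sex column. The rows below are toy values with made-up counts, not SSA figures, and the column names are the assumed ones used throughout this walkthrough.

```python
import pandas as pd

# Toy rows with made-up counts, standing in for a single year's file.
names1880 = pd.DataFrame({
    "name": ["Mary", "Anna", "John", "William"],
    "sex": ["F", "F", "M", "M"],
    "births": [70, 26, 96, 95],
})

# Sum the births column separately for each sex.
births_by_sex = names1880.groupby("sex")["births"].sum()
print(births_by_sex)
```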

Since the dataset is **split into files by year**, one of the first things to do is to assemble all of the data into a single DataFrame and add a `year` field. You can do this using `pandas.concat`.

There are a couple of things to **note** here. First, remember that `concat` combines the
DataFrame objects by **row** by default. Second, you have to pass `ignore_index=True`
because we’re **not** interested in preserving the original row numbers returned from
`pandas.read_csv`.

In [None]:
# code
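The assembly step might look like the sketch below. To keep it runnable without the download, two hypothetical in-memory "files" with made-up counts stand in for `yob1880.txt` and `yob1881.txt`; the `fake_files` dictionary is purely illustrative.

```python
import io
import pandas as pd

# Hypothetical stand-ins for yob1880.txt and yob1881.txt (made-up counts).
fake_files = {
    1880: "Mary,F,70\nJohn,M,96\n",
    1881: "Mary,F,69\nJohn,M,87\nAnna,F,27\n",
}

pieces = []
for year, text in fake_files.items():
    # With the real files, pass that year's file path to read_csv instead.
    frame = pd.read_csv(io.StringIO(text), names=["name", "sex", "births"])
    frame["year"] = year  # record which file each row came from
    pieces.append(frame)

# Stack the yearly frames row-wise and renumber the rows 0..n-1.
names = pd.concat(pieces, ignore_index=True)
print(names)
```

Without `ignore_index=True`, the row labels would restart at 0 for every year, so the combined index would contain duplicates.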

With this data in hand, we can already start **aggregating the data at the year and sex** level using `groupby` or `pivot_table`:

In [None]:
# code
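Both routes can be sketched on a toy frame (made-up counts, assumed column names). The `pivot_table` call puts years in the rows and sexes in the columns; the `groupby` version reaches the same shape by unstacking the sex level.

```python
import pandas as pd

# Toy combined frame with made-up counts.
names = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "John", "Anna"],
    "sex": ["F", "M", "F", "M", "F"],
    "births": [70, 96, 69, 87, 27],
    "year": [1880, 1880, 1881, 1881, 1881],
})

# Total births per year (rows) and sex (columns).
total_births = names.pivot_table(
    values="births", index="year", columns="sex", aggfunc="sum"
)

# The same totals via groupby, reshaped so that sex becomes the columns.
same_totals = names.groupby(["year", "sex"])["births"].sum().unstack("sex")

print(total_births)
```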

Next, let’s **insert a column** `prop` with the **fraction of babies given each name relative to the total number of births**. A `prop` value of 0.02 would indicate that 2 out of every 100 babies were given a particular name. 

Thus, we group the data by year and sex, then add the new column to each group:

In [None]:
# code
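One way to add `prop`, sketched here with `transform` rather than a per-group function, is to broadcast each group's total back onto its rows and divide. This is an alternative formulation chosen for brevity, not necessarily the approach used in the elided cell; the data is a toy frame with made-up counts.

```python
import pandas as pd

# Toy combined frame with made-up counts.
names = pd.DataFrame({
    "name": ["Mary", "John", "Mary", "John", "Anna"],
    "sex": ["F", "M", "F", "M", "F"],
    "births": [70, 96, 69, 87, 27],
    "year": [1880, 1880, 1881, 1881, 1881],
})

# transform("sum") returns one value per original row: the total births
# of the (year, sex) group that the row belongs to.
group_totals = names.groupby(["year", "sex"])["births"].transform("sum")

# Fraction of that group's births accounted for by each name.
names["prop"] = names["births"] / group_totals
print(names)
```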

**NOTE:** When performing a group operation like this, it’s often valuable to do a **sanity check**, like verifying that the `prop` column sums to 1 within all the groups:

In [None]:
# code
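A sketch of such a check follows. Because `prop` holds floating-point values, the comparison uses `numpy.allclose` instead of exact equality. The small frame and its `prop` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a precomputed prop column (made-up values).
names = pd.DataFrame({
    "year": [1880, 1880, 1880, 1881, 1881],
    "sex": ["F", "F", "M", "F", "M"],
    "prop": [0.7, 0.3, 1.0, 1.0, 1.0],
})

# Sum prop within each (year, sex) group; every sum should be 1.
group_sums = names.groupby(["year", "sex"])["prop"].sum()

# Compare with a tolerance because of floating-point rounding.
assert np.allclose(group_sums, 1.0)
print(group_sums)
```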

Now that this is done, I’m going to ***extract a subset of the data*** to facilitate further analysis: **the top 1,000 names for each sex/year combination**.

In [None]:
# code
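One way to take the top names per group, sketched with `sort_values` followed by `groupby(...).head(n)`. The toy frame uses made-up counts, and `TOP_N` is a hypothetical constant set to 2 so the effect is visible on a handful of rows; it would be 1,000 on the real data.

```python
import pandas as pd

# Toy combined frame with made-up counts.
names = pd.DataFrame({
    "name": ["Mary", "Anna", "Emma", "John", "William", "Mary", "Anna", "John"],
    "sex": ["F", "F", "F", "M", "M", "F", "F", "M"],
    "births": [70, 26, 20, 96, 95, 69, 27, 87],
    "year": [1880, 1880, 1880, 1880, 1880, 1881, 1881, 1881],
})

TOP_N = 2  # hypothetical; use 1000 for the real dataset

top = (
    names.sort_values("births", ascending=False)  # most popular first
    .groupby(["year", "sex"])
    .head(TOP_N)                                   # keep the first TOP_N rows of each group
    .sort_values(["year", "sex", "births"], ascending=[True, True, False])
    .reset_index(drop=True)                        # discard the old row labels
)
print(top)
```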

We can drop the group index since we don’t need it for our analysis.

### Analyzing Naming Trends
With the full dataset and the top one thousand dataset in hand, we can start analyzing
various naming trends of interest. First, we can **split the top one thousand names into the boy and girl portions**:


In [None]:
# code

Simple time series, like the **number of Johns or Marys for each year**, can be plotted
but require some manipulation to be more useful. 

Let’s form a **pivot table** of the **total number of births by year and name**:

In [None]:
# code
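A sketch of this pivot on a toy frame with made-up counts. Each name becomes a column and each year a row, so selecting a few columns yields one time series per name.

```python
import pandas as pd

# Toy top-names frame with made-up counts.
top1000 = pd.DataFrame({
    "name": ["John", "Mary", "John", "Mary", "Harry"],
    "sex": ["M", "F", "M", "F", "M"],
    "births": [96, 70, 87, 69, 21],
    "year": [1880, 1880, 1881, 1881, 1881],
})

# Rows are years, columns are names, values are total births.
total_births = top1000.pivot_table(
    values="births", index="year", columns="name", aggfunc="sum"
)

# Pick the names of interest; each column is now a yearly series.
subset = total_births[["John", "Mary"]]
print(subset)
```

Names absent in a given year show up as `NaN` in the pivoted table. If matplotlib is installed, `subset.plot(subplots=True)` is one way to draw each series in its own panel.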

### Measuring the increase in naming diversity
One explanation for the **decrease seen in such plots** is that fewer parents are choosing common
names for their children. This hypothesis can be explored and confirmed in the data.
One measure is the **proportion of births represented by the top 1,000 most popular names**, which I aggregate and plot by year and sex:

In [None]:
# code
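The aggregation can be sketched as a pivot that sums `prop` by year and sex. The frame below is a toy with made-up proportions in which only a few names per group are kept, so the sums fall below 1 just as they would for a top-1,000 subset.

```python
import pandas as pd

# Toy top-names frame with made-up proportions.
top1000 = pd.DataFrame({
    "year": [1900, 1900, 1900, 2010, 2010, 2010],
    "sex": ["F", "M", "M", "F", "M", "M"],
    "prop": [0.9, 0.5, 0.45, 0.7, 0.4, 0.35],
})

# Share of all births covered by the retained names, per year and sex.
share = top1000.pivot_table(
    values="prop", index="year", columns="sex", aggfunc="sum"
)
print(share)
```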

You can see that, indeed, there **appears to be increasing name diversity** (decreasing
total proportion in the top one thousand). 

Another interesting metric is the **number of distinct names, taken in order of popularity from highest to lowest, in the top 50% of births**. This number is trickier to compute. 

Let’s consider just the boy names from 2010:

In [None]:
# code

After **sorting prop** in descending order, we want to know **how many of the most popular names it takes to reach 50%**. You could write a for loop to do this, but a
vectorized NumPy way is more computationally efficient. Taking the cumulative sum,
`cumsum`, of `prop` and then calling the method `searchsorted` returns the position in
the cumulative sum at which 0.5 would need to be inserted to keep it in sorted order.

Since arrays are zero-indexed, adding 1 to the returned position gives the number of names, **117**. By contrast, in 1900 this number was much smaller:

In [None]:
# code
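The `cumsum` and `searchsorted` steps can be traced on a handful of made-up proportions. This toy series does not reproduce the 2010 figure; it only shows the mechanics.

```python
import pandas as pd

# Made-up proportions for six names; they sum to 1.
props = pd.Series([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])

# Sort from most to least popular, then accumulate:
# 0.30, 0.55, 0.75, 0.85, 0.95, 1.00
prop_cumsum = props.sort_values(ascending=False).cumsum()

# Zero-based position at which 0.5 would be inserted to keep the order.
position = prop_cumsum.searchsorted(0.5)

# Convert the zero-based position into a count of names.
count = int(position) + 1
print(position, count)
```

Here the running total first reaches 0.5 at the second name, so `position` is 1 and `count` is 2.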

You can now **apply this operation to each year/sex combination**: group by those fields and apply a function returning the count for each group:

In [None]:
# code
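A sketch of the grouped version follows. The helper name `get_quantile_count` and its `q` parameter are chosen for this example, and the proportions are made up. The function is applied to the `prop` column of each group, and `unstack` turns the sex level into columns.

```python
import pandas as pd

def get_quantile_count(prop, q=0.5):
    """Count how many of the most popular names are needed to reach fraction q."""
    cumulative = prop.sort_values(ascending=False).cumsum()
    return int(cumulative.searchsorted(q)) + 1

# Toy frame with made-up proportions; each (year, sex) group sums to 1.
top1000 = pd.DataFrame({
    "year": [1900] * 4 + [2010] * 8,
    "sex": ["F", "F", "M", "M"] + ["F"] * 5 + ["M"] * 3,
    "prop": [0.6, 0.4, 0.7, 0.3] + [0.2] * 5 + [0.4, 0.35, 0.25],
})

# One count per (year, sex) group, reshaped so each sex is a column.
diversity = (
    top1000.groupby(["year", "sex"])["prop"]
    .apply(get_quantile_count)
    .unstack("sex")
)
print(diversity)
```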

The resulting DataFrame `diversity` now has two time series, one for each sex,
indexed by year. This can be inspected and plotted as before:

In [None]:
# code

As you can see, **girl names have always been more diverse than boy names**, and they
have only become more so over time. 

### The “last letter” revolution
In 2007, baby name researcher Laura Wattenberg pointed out that **the distribution of boy names by final letter has changed significantly over the last 100 years**. To see this, we first aggregate all of the births in the full dataset by year, sex, and final letter:

In [None]:
# code
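One way to sketch this aggregation is to extract the final letter into a helper column and pivot on it. The `last_letter` column name is an assumption made for this example, and the counts are made up.

```python
import pandas as pd

# Toy frame with made-up counts.
names = pd.DataFrame({
    "name": ["John", "Aiden", "Mary", "Emma", "Ethan", "Emily"],
    "sex": ["M", "M", "F", "F", "M", "F"],
    "births": [90, 10, 60, 40, 50, 30],
    "year": [1910, 1910, 1910, 2010, 2010, 2010],
})

# Final character of each name, stored in a helper column.
names = names.assign(last_letter=names["name"].str[-1])

# Rows: final letter. Columns: (sex, year). Values: total births.
table = names.pivot_table(
    values="births", index="last_letter", columns=["sex", "year"], aggfunc="sum"
)
print(table)
```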

Then we select three representative years spanning the history and print the first few
rows:

In [None]:
# code

Next, normalize the table by total births to compute a new table containing the
proportion of total births for each sex ending in each letter:

In [None]:
# code
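The normalization is a column-wise division, sketched here on a simplified table for a single sex with made-up counts. Dividing a DataFrame by the Series returned from `sum()` aligns on column labels, so each column is divided by its own total.

```python
import pandas as pd

# Made-up births by final letter (rows) and year (columns) for one sex.
subtable = pd.DataFrame(
    {1910: [20, 30, 50], 2010: [10, 60, 30]},
    index=pd.Index(["d", "n", "y"], name="last_letter"),
)

# subtable.sum() holds one total per column; the division aligns on
# column labels, turning counts into within-column proportions.
letter_prop = subtable / subtable.sum()
print(letter_prop)
```

On the full table, whose columns carry both sex and year, the same expression normalizes each sex/year column separately.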

With the letter proportions now in hand, we can make bar plots for each sex, broken
down by year:

In [None]:
# code

As you can see, **boy names ending in n have experienced significant growth since the 1960s**. 

Going back to the full table created before, I again **normalize by year and sex and select a subset of letters for the boy names, finally transposing to make each column a time series**:

In [None]:
# code
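These three steps can be sketched on a small table with (sex, year) columns. The counts are made up, and the letters d, n, and y are simply the subset picked for this illustration.

```python
import pandas as pd

# Made-up births by final letter, with (sex, year) column labels.
columns = pd.MultiIndex.from_product(
    [["F", "M"], [1910, 1960, 2010]], names=["sex", "year"]
)
table = pd.DataFrame(
    [[10, 10, 10, 20, 20, 10],
     [30, 30, 30, 30, 40, 70],
     [60, 60, 60, 50, 40, 20]],
    index=pd.Index(["d", "n", "y"], name="last_letter"),
    columns=columns,
)

# Normalize every (sex, year) column by its own total.
letter_prop = table / table.sum()

# Keep the chosen letters for boys, then transpose so that each
# letter is a column indexed by year, that is, a time series.
dny_ts = letter_prop.loc[["d", "n", "y"], "M"].T
print(dny_ts)
```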

### Boy names that became girl names (and vice versa)
Another fun trend is looking at names that were more popular with one gender
earlier in the sample but have become preferred as a name for the other gender
over time. One example is the name Lesley or Leslie. Going back to the `top1000`
DataFrame, I compute a list of names occurring in the dataset starting with “Lesl”:

In [None]:
# code

From there, we can filter down to just those names and sum births grouped by name
to see the relative frequencies:

In [None]:
# code
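The name lookup and the filtering step can be sketched together on a toy frame with made-up counts. The example uses `str.startswith` to match the "starting with" wording; `str.contains` would also match the substring elsewhere in a name.

```python
import pandas as pd

# Toy top-names frame with made-up counts.
top1000 = pd.DataFrame({
    "name": ["Leslie", "Leslie", "Lesley", "Mary", "Leslie", "Lesley"],
    "sex": ["M", "F", "F", "F", "F", "F"],
    "births": [80, 20, 5, 70, 90, 30],
    "year": [1900, 1900, 1900, 1900, 2000, 2000],
})

# Distinct names, then those beginning with "Lesl".
all_names = pd.Series(top1000["name"].unique())
lesley_like = all_names[all_names.str.startswith("Lesl")]

# Keep only rows for those names and total the births per spelling.
filtered = top1000[top1000["name"].isin(lesley_like)]
births_by_name = filtered.groupby("name")["births"].sum()
print(births_by_name)
```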

Next, let’s aggregate by sex and year, and normalize within year:

In [None]:
# code
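A sketch of the aggregation and the within-year normalization, again on made-up counts. `DataFrame.div` with `axis="index"` divides each row by that row's total, so the two sex columns sum to 1 in every year.

```python
import pandas as pd

# Toy rows for Lesley-like names only (made-up counts).
filtered = pd.DataFrame({
    "name": ["Leslie", "Leslie", "Lesley", "Leslie", "Leslie", "Lesley"],
    "sex": ["M", "F", "F", "M", "F", "F"],
    "births": [80, 15, 5, 10, 60, 30],
    "year": [1900, 1900, 1900, 2000, 2000, 2000],
})

# Total births per year (rows) and sex (columns).
table = filtered.pivot_table(
    values="births", index="year", columns="sex", aggfunc="sum"
)

# Divide each row by its own total to get the split by sex in each year.
table = table.div(table.sum(axis="columns"), axis="index")
print(table)
```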

Lastly, it’s now possible to make a plot of the breakdown by sex over time:

In [None]:
# code