In [None]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt

# Part I: Analyzing US baby name trends

The US Social Security Administration (SSA) publishes data on the frequency of baby names from 1880 through 2021 (at the time of this writing).
The raw data can be obtained from [the SSA webpage](https://www.ssa.gov/oact/babynames/limits.html) (there is one file per year).

**Part 0:** Download the [National Data](https://www.ssa.gov/oact/babynames/names.zip) file *names.zip* and unzip it.

**Part 1:** Assemble all of the data into a single DataFrame and add a *year* field.
You can do this using [pandas.concat](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html).

In [None]:
# range() excludes its end point, so use 2022 to include the 2021 file
years = list(range(1880, 2022))

In [None]:
df_list = []
for year in years:
    # load that year's file into a DataFrame (the files have no header row)
    df = pd.read_csv(f'../../../Data/names/yob{year}.txt', header=None, names=['name', 'sex', 'births'])
    # add year column
    df['year'] = year
    # collect the per-year DataFrame
    df_list.append(df)
# stack all years into one DataFrame with a fresh 0..n-1 index
names = pd.concat(df_list, ignore_index=True)
names.head()

**Part 2:** Plot the total births by sex and year

In [None]:
names.groupby(['sex', 'year']).births.sum().unstack(level=0).plot()

**Part 3:** Plot the number of babies given a particular name (your own, or another name) by year.

In [None]:
names[names['name'] == 'Jacob'].groupby('year').births.sum().plot()
plt.title('Births per year for the name Jacob')
plt.ylabel('Births')
plt.xlabel('Year')

**Part 4:** Insert a column 'prop' with the relative frequency of each name in each of the years.

In [None]:
# relative frequency: births for each name divided by the total births in that year

names['prop'] = names['births']/names.groupby(['year'])['births'].transform('sum')
names.head()

**Part 5**: Create a DataFrame 'top1000_names' that contains the top 1000 names for each sex/year combination.
You will use this top 1000 dataset in the following investigations into the data.

In [None]:
# keep only the top 1000 names for each year/sex combination, ranked by births
top1000_names = (
    names.sort_values('births', ascending=False)
         .groupby(['year', 'sex'])
         .head(1000)
         .sort_values(['year', 'sex', 'births'], ascending=[True, True, False])
         .reset_index(drop=True)
)

top1000_names.head()

**Part 6**: Plot the number of Johns, Harrys, Marys, and Marilyns by year.

In [None]:
# dataframe with only the names we want to plot
names_to_plot = top1000_names[top1000_names['name'].isin(['John','Harry','Mary','Marilyn'])]

# plot the number of births per year for each name
names_to_plot.groupby(['year','name']).births.sum().unstack(level=1).plot()
plt.title('Births per year for the names John, Harry, Mary, and Marilyn')
plt.ylabel('Births')
plt.xlabel('Year')

Looking at your plots, you might conclude that these names have fallen out of favor with the American population.
But the story is more complicated than that, as you will explore in the next part.

## Measuring the increase in naming diversity

One explanation for the decline seen in the plots is that fewer parents are choosing common names for their children.
One measure of naming diversity is the proportion of births represented by the top 1000 most popular names.

**Part 7**: Plot the proportion of births accounted for by the top 1000 names, by year and sex
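
One possible way to approach this, sketched on a small made-up DataFrame rather than the real SSA data: the `toy` frame, its values, and `TOP_N = 2` (standing in for 1000) are all assumptions for illustration. The sketch measures the share within each year/sex group; other definitions of the proportion are possible.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (hypothetical values, not SSA data)
toy = pd.DataFrame({
    'name':   ['Mary', 'Anna', 'Emma', 'John', 'Will',
               'Mary', 'Anna', 'Emma', 'John', 'Will'],
    'sex':    ['F', 'F', 'F', 'M', 'M', 'F', 'F', 'F', 'M', 'M'],
    'births': [60, 30, 10, 80, 20, 40, 40, 20, 50, 50],
    'year':   [1900] * 5 + [1901] * 5,
})

TOP_N = 2  # the exercise uses 1000; 2 keeps the toy example readable

# total births per year/sex
totals = toy.groupby(['year', 'sex'])['births'].sum()

# births covered by the TOP_N most popular names per year/sex
top = (toy.sort_values('births', ascending=False)
          .groupby(['year', 'sex'])
          .head(TOP_N))
top_births = top.groupby(['year', 'sex'])['births'].sum()

# share of births captured by the top names, one column per sex
top_share = (top_births / totals).unstack('sex')
print(top_share)
```

Calling `top_share.plot()` on the resulting table would draw one line per sex.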

## 10 most popular 2017 names through the ages

**Part 8**: Find the 10 most popular female names in 2017
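
A minimal sketch of one approach, using `DataFrame.nlargest` on a filtered frame. The `toy` data and `TOP_N = 2` (standing in for 10) are hypothetical and only illustrate the shape of the computation.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (hypothetical values, not SSA data)
toy = pd.DataFrame({
    'name':   ['Emma', 'Olivia', 'Ava', 'Liam', 'Emma', 'Olivia'],
    'sex':    ['F', 'F', 'F', 'M', 'F', 'F'],
    'births': [300, 250, 200, 400, 100, 350],
    'year':   [2017, 2017, 2017, 2017, 2016, 2016],
})

TOP_N = 2  # the exercise asks for 10

# keep only female names from 2017, then take the rows with the most births
mask = (toy['year'] == 2017) & (toy['sex'] == 'F')
top_female_2017 = toy[mask].nlargest(TOP_N, 'births')['name'].tolist()
print(top_female_2017)
```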

**Part 9**: Plot the proportions of the 10 most popular female names in 2017 by year
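
One way this could be set up is to pivot the `prop` column into a year-by-name table, which plots as one line per name. The sketch below assumes a hypothetical `toy` frame and a hand-picked `top_names` list in place of the real top 10.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (hypothetical values, not SSA data)
toy = pd.DataFrame({
    'name':   ['Emma', 'Olivia', 'Ava', 'Emma', 'Olivia', 'Ava'],
    'sex':    ['F'] * 6,
    'births': [50, 30, 20, 20, 60, 20],
    'year':   [2016] * 3 + [2017] * 3,
})
# same definition of 'prop' as in Part 4: share of that year's births
toy['prop'] = toy['births'] / toy.groupby('year')['births'].transform('sum')

top_names = ['Olivia', 'Emma']  # assumed result of the previous step

# rows = years, columns = names, values = proportion of births
subset = toy[(toy['sex'] == 'F') & toy['name'].isin(top_names)]
prop_by_year = subset.pivot_table(index='year', columns='name',
                                  values='prop', aggfunc='sum')
print(prop_by_year)
```

`prop_by_year.plot()` would then draw the trend for each selected name.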

## Similarity between decades

Here, you will explore the similarity between the set of names given in one particular year and the set of names given 10 years previously.

The **Jaccard similarity** between sets A and B is the number of
elements in both A and B divided by the number of elements in either A or
B.
If we let |A| denote the number of elements in the set A, then the Jaccard
similarity is

$$
J(A,B)=\frac{|A \cap B|}{|A\cup B|}
$$

**Part 10**: Find the Jaccard similarity between the following two sets:

In [None]:
set1 = {'John','Daniel','Drogo'}
set2 = {'Robert', 'John'}
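
The definition above maps directly onto Python's set operators (`&` for intersection, `|` for union). The helper name `jaccard` below is an assumption for illustration, as is the choice to return 0.0 when both sets are empty.

```python
def jaccard(a, b):
    """Return |a & b| / |a | b| for two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

set1 = {'John', 'Daniel', 'Drogo'}
set2 = {'Robert', 'John'}

# one shared element ('John') out of four distinct elements overall
similarity = jaccard(set1, set2)
print(similarity)  # 0.25
```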

**Part 11**: Compute the Jaccard similarity between the set of male names given in 2017 and the set of male names given in 2007

**Part 12**: Plot, by year, the Jaccard similarity between the set of male names given in that year and the set of male names given 10 years previously
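
A sketch of how the year-over-decade comparison might be organized: build one set of names per year, then compare each year with the year `LAG` earlier. The `toy` frame, the `jaccard` helper, and the `LAG` constant are hypothetical names chosen for this example.

```python
import pandas as pd

def jaccard(a, b):
    """Return |a & b| / |a | b| for two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy stand-in for the `names` DataFrame (hypothetical values, not SSA data)
toy = pd.DataFrame({
    'name': ['John', 'Adam', 'John', 'Noah', 'Noah', 'Liam', 'Liam', 'Noah'],
    'sex':  ['M'] * 8,
    'year': [1990, 1990, 2000, 2000, 2010, 2010, 2020, 2020],
})

LAG = 10  # compare each year with the year 10 years earlier

# one set of male names per year
males = toy[toy['sex'] == 'M']
male_sets = {year: set(group) for year, group in males.groupby('year')['name']}

# similarity for every year that has a counterpart LAG years earlier
jaccard_by_year = pd.Series({
    year: jaccard(male_sets[year], male_sets[year - LAG])
    for year in sorted(male_sets)
    if (year - LAG) in male_sets
})
print(jaccard_by_year)
```

The single comparison asked for in Part 11 corresponds to one entry of such a series, and `jaccard_by_year.plot()` would draw the full trend.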

## The last letter revolution

It has been argued (see [here](https://www.babynamewizard.com/archives/2007/7/where-all-boys-end-up-nowadays), for example) that the distribution of boy names by final letter has changed significantly over the last 100 years.

**Part 13:** Extract the last letter from the "name" column
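
One straightforward option is the pandas `.str` accessor, which supports Python-style indexing on each string. The `toy` frame and the column name `last_letter` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (hypothetical values, not SSA data)
toy = pd.DataFrame({'name': ['John', 'Mary', 'Aiden']})

# .str[-1] picks the final character of every name
toy['last_letter'] = toy['name'].str[-1]
print(toy)
```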

**Part 14**: Plot the proportion of male names by the last letter for the years 1910, 1960, and 2010

**Part 15**: Plot the proportions of male names ending in "e", "n", "d", "s" and "y" by year.
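
Both last-letter plots can be derived from a single table of births by last letter and year, normalized so that each year sums to 1. The sketch below assumes a hypothetical `toy` frame with only two years and three ending letters; the real data would have many more of both.

```python
import pandas as pd

# Toy stand-in for the `names` DataFrame (hypothetical values, not SSA data)
toy = pd.DataFrame({
    'name':   ['John', 'Ethan', 'James', 'Fred',
               'Aiden', 'Jayden', 'James', 'Fred'],
    'sex':    ['M'] * 8,
    'births': [60, 10, 20, 10, 50, 30, 15, 5],
    'year':   [1910] * 4 + [2010] * 4,
})
toy['last_letter'] = toy['name'].str[-1]

# rows = last letter, columns = year, values = total male births
males = toy[toy['sex'] == 'M']
letter_births = males.pivot_table(index='last_letter', columns='year',
                                  values='births', aggfunc='sum', fill_value=0)

# divide each column by its total so every year sums to 1
letter_prop = letter_births / letter_births.sum()
print(letter_prop)

# trend for selected letters: rows = year, columns = letter
selected = letter_prop.loc[['d', 'n', 's']].T
print(selected)
```

`letter_prop[[1910, 2010]].plot(kind='bar')` would compare the chosen years side by side, and `selected.plot()` would show the per-letter trend over time.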