---
title: "Why does everyone I work with have the same name? A guide on how to not pick a unique baby name in America."
author: ~
date: '2019-10-05'
slug: baby-names
thumbnailImagePosition: left
thumbnailImage: img/20190909/top10boynames.png
categories: ['analysis']
tags: ['r', 'plotly', 'ggplot2', 'dataviz', 'data-journalism']
codefolding_show: hide
---
I've taken a particular interest in names since I'm thinking of a name for my to-be-born son. I did a little digging through the [Social Security Administration names database](https://www.ssa.gov/oact/babynames/limits.html), which lists the names given to baby boys and girls in America[^1]. I began this exercise just to get a quality list of ideas, but my curiosity got the better of me.
[^1]: It even breaks it down by state. It includes any name that is reported at least 5 times in a given year. So if the name "Juniper" is given to only four babies in the US in a given year, Juniper won't appear in the database.
# Name Trends Since 1950
Which boy name has been the most popular since 1950? Michael - it dominated. Michael was the top boy name in 44 of the 45 years from 1954 through 1998 (every year except 1960). Let's look at the top 10 names for boys across time:
_Note: These interactive plots aren't optimized for mobile devices. Please check them out on your desktop, or keep reading to see non-interactive plots. For those on desktop, you can hover your mouse and click to see different trends._
```{r, echo=F, warning=F, message=F}
knitr::opts_chunk$set(echo = F, warning = F, message = F)
```
```{r, include=T, echo=F, warning=F, eval=FALSE}
# Read each year's SSA file (yobYYYY.txt: name, sex, count; no header) and stack them
df_all = data.frame()
prefix = '/Users/bryanwhiting/Downloads/names/yob'
for (y in 1880:2018){
  path = paste0(prefix, y, '.txt')
  df_new = read.csv(path, header = F)
  df_new$year = y
  df_all = rbind(df_all, df_new)
}
colnames(df_all) <- c('name', 'sex', 'number', 'year')
# Save the combined data so later chunks can load it without re-reading every file
save(df_all, file="~/github/blog/static/img/20190909/names-all.Rda")
```
```{r}
load("~/github/blog/static/img/20190909/names-all.Rda")
install.load::install_load(c('dplyr', 'tidyverse', 'plotly'))
# Within each year and sex: share of recorded births, cumulative share,
# rate per 10,000, and rank (ties broken by order of appearance)
df2 <- df_all %>%
  group_by(year, sex) %>%
  arrange(year, sex, desc(number)) %>%
  mutate(pct = number/sum(number) * 100,
         cum_pct = cumsum(pct),
         tenk = pct/100 * 10000,
         rank = rank(desc(number), ties.method='first')) %>%
  ungroup()
```
```{r}
# Interactive rank plot of the top 10 names per year for one sex ('M' or 'F')
plot_top10 <- function(df2, s){
  if (s == 'M'){
    title = 'Top 10 Boy Names Since 1950'
    default = c('Christopher', 'Michael')
  } else {
    title = 'Top 10 Girl Names Since 1950'
    default = c('Jennifer', 'Jessica')
  }
  # Key the data by name so clicking one point highlights that name's whole line
  tmp <- df2 %>%
    filter(rank <= 10,
           year >= 1950,
           sex == s) %>%
    highlight_key(~name)
  # Working ggplotly example: https://plotcon17.cpsievert.me/workshop/day2/#18
  p <- ggplot(tmp, aes(year, rank, group=name)) +
    geom_line() +
    geom_point() +
    labs(x = "Year", y="Name Rank", title=title) +
    scale_y_reverse(breaks = 1:10) +
    theme_minimal()
  gg <- ggplotly(p, tooltip = c("name", "rank", 'year'))
  highlight(gg, 'plotly_click', defaultValues = default)
}
plot_top10(df2, s='M')
```
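If you want to check a "years at #1" figure like Michael's yourself, a tally along these lines should do it. This is only a sketch: `toy` is a made-up stand-in for the post's `df2` (same `year`, `sex`, `name`, `rank` columns, illustrative rows), and `years_at_top` is a name I've invented for the example.

```r
library(dplyr)

# Toy stand-in for df2: one row per year/sex/name with its rank.
# These rows are illustrative, not the real SSA data.
toy <- data.frame(
  year = c(1959, 1959, 1960, 1960, 1961, 1961),
  sex  = "M",
  name = c("Michael", "David", "David", "Michael", "Michael", "David"),
  rank = c(1, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Count how many years each name spent at rank 1
years_at_top <- toy %>%
  filter(sex == "M", rank == 1) %>%
  group_by(name) %>%
  summarise(years_at_1 = n()) %>%
  arrange(desc(years_at_1))

print(years_at_top)
```

Run against the full dataset instead of `toy`, the same pipeline would give one row per name that ever held the top spot.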
For curiosity's sake, let's look at female names over time:
```{r}
plot_top10(df2, s='F')
```
Female names don't appear to be as dominant - holding on to the #1 spot seems like a rarer feat.
This gets me thinking - how many names have attained that #1 status since 1950? And how does that differ between male and female names? The table below shows how many unique names have ranked number 1 (and in the top 3 and top 10) between 1950 and 2018. For boys, it's 7 different names. For girls, it's 10. The top female names change more frequently.
```{r, eval=F, include=F}
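# Total unique male names since 1950
# (swap in sex == 'F', or add a rank filter, for the other cells of the table)
df2 %>%
  filter(year >= 1950,
         sex == 'M') %>%
  select(name) %>%
  unique() %>%
  count()
# Male: 41 names in top 10, 15 in top 3, 7 top 1, 38,670 in total
# Female: 60 names in top 10, 29 in top 3, 10 top 1, 63,331 in total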
```
| Rank tier (1950-2018) | Male | Female |
|----|----|----|
| # Unique names in Top 1 | 7 | 10 |
| # Unique names in Top 3 | 15 | 29 |
| # Unique names in Top 10 | 41 | 60 |
| Total Unique (All Names) | 38,670 | 63,331 |
The fact that there are about 64% more female names than male names in the US is kind of shocking. Why aren't male names as creative?
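For anyone who wants to reproduce the tallies, here's a rough sketch of how the per-tier counts and the percentage gap could be computed. The `toy` data frame and the `unique_in_top()` helper are made up for illustration (they aren't part of the post's code); the two totals are the ones from the table above.

```r
library(dplyr)

# Toy stand-in for df2 (illustrative rows only, not the real SSA data)
toy <- data.frame(
  year = c(2017, 2017, 2017, 2018, 2018, 2018),
  sex  = "F",
  name = c("Emma", "Olivia", "Ava", "Emma", "Olivia", "Isabella"),
  rank = c(1, 2, 3, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of distinct names that ever ranked k or better
unique_in_top <- function(df, k) {
  df %>%
    filter(rank <= k) %>%
    summarise(n = n_distinct(name)) %>%
    pull(n)
}

unique_in_top(toy, 1)  # distinct names that reached #1
unique_in_top(toy, 3)  # distinct names that reached the top 3

# Percentage gap between the totals reported in the table
female_total <- 63331
male_total   <- 38670
pct_more <- (female_total - male_total) / male_total * 100
round(pct_more, 1)
```

The gap is measured relative to the male total, which is why the denominator is `male_total`.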
Perhaps names are like wardrobes in America. Looking at my closet, I see a lot of blue shirts. Why? Because blue is a great color. And I think others agree. It's become a joke in the workplace - I'll show up to work in the same shirt as other male colleagues. Even that one red shirt I own - just last week someone teased me because their boss wears that shirt too. He called me a copycat or something, like my shirt was unoriginal. Turns out my closet isn't as diverse as I think it is. I haven't run into that problem with my female colleagues, though. They seem to be more creative.
# The Recent Top Names Back to 1880
Let's dive into the boy names that have reached #1 since 1950:
```{r}
# Boy names that ranked #1 at least once since 1950
top1 <- df2 %>%
  filter(sex == 'M', rank == 1, year >= 1950) %>%
  select(name) %>%
  unique() %>%
  pull()
# Plot those names' ranks across the full history, limited to the top 100
df2 %>%
  filter(name %in% top1, sex == 'M') %>%
  ggplot(aes(year, rank, color=name)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y="Name Rank",
       title='The Battle for Most Popular Male Name',
       subtitle = 'Showing 7 names that achieved 1st since 1950, but showing data back to 1880') +
  scale_y_reverse(limits = c(100, 0), breaks = c(1, seq(10,100, 10))) +
  ggthemes::scale_color_pander() +
  theme_minimal()
```
For a while there, the boy names that reached the top had already been popular at some point in the past. However, there seems to be a new trend among millennial parents, with a name like Liam coming out of nowhere...
Female names follow a completely different track. Once a name becomes popular, it falls off a cliff years later. Girl names come and go. This plot kind of looks like long hair...a fashion choice common among women?
```{r}
# Girl names that ranked #1 at least once since 1950
top1 <- df2 %>%
  filter(sex == 'F', rank == 1, year >= 1950) %>%
  select(name) %>%
  unique() %>%
  pull()
# Plot those names' ranks across the full history, limited to the top 100
df2 %>%
  filter(name %in% top1, sex == 'F') %>%
  ggplot(aes(year, rank, color=name)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y="Name Rank",
       title='The Battle for Most Popular Female Name',
       subtitle = 'Showing 10 names that achieved 1st since 1950, but showing data back to 1880') +
  scale_y_reverse(limits = c(100, 0), breaks = c(1, seq(10,100, 10))) +
  ggthemes::scale_color_pander() +
  theme_minimal()
```
Wow - Emma made a huge comeback. To go through such a decline, and then see such a resurgence to the coveted #1 spot - the other names must be jealous. And it's pretty crazy that Mary held on to 1st place from 1880 through pretty much the 1960s. Part of me wonders how accurate the SSA data is going that far back[^3].
[^3]: To check this, I looked at the CDC's reported population count for 2017 and matched it with the number of names I could find in the SSA's database. Turns out the SSA database covers about 92% of all names in America. So it's not 100% of the names.
# Uniqueness in the Name
So, I think it's obvious from above, but just one more check: which sex is more creative in their naming? Let's dive into the number of unique names given each year. Female names achieve peak creativity in 2007 with 20,568 unique names. Male creativity spikes a year later in 2008 with 14,615 unique names.
```{r}
# One row per year/sex/name, then count rows to get unique names per year and sex
df_uniq <- df2 %>%
  count(year, sex, name)
df_uniq %>%
  count(year, sex) %>%
  ggplot(aes(year, n, color=sex)) +
  geom_line() +
  geom_point() +
  # geom_smooth() +
  labs(x = "Year", y="Number of Unique Names Used in a Year",
       title='Creativity is Actually on the Decline',
       subtitle = 'Showing number of unique names each year by sex') +
  # Mark 2007; the label text now matches the year the line is drawn at
  geom_vline(xintercept = 2007)+
  annotate("text", x = 2013, y = 3000, label= "2007")+
  ggthemes::scale_color_pander() +
  theme_minimal()
```
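The peak years quoted above can be pulled out of the yearly counts directly rather than read off the plot. A minimal sketch, assuming a table shaped like the output of `count(year, sex)`; the `toy` numbers and the `n_unique` column name are made up for illustration:

```r
library(dplyr)

# Toy yearly counts of unique names (illustrative numbers, not the SSA figures)
toy <- data.frame(
  year     = rep(c(2006, 2007, 2008), each = 2),
  sex      = rep(c("F", "M"), times = 3),
  n_unique = c(100, 60, 120, 70, 110, 80),
  stringsAsFactors = FALSE
)

# For each sex, keep the year(s) with the largest number of unique names
peaks <- toy %>%
  group_by(sex) %>%
  filter(n_unique == max(n_unique)) %>%
  ungroup()

print(peaks)
```

If two years tie for the maximum, both rows are kept, which seems like the honest thing to show.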
I was surprised that names have gotten less creative since 2007/2008. I thought that millennials would have gotten increasingly unique with the internet and access to highly useful information like this blog post. Perhaps, instead, we're getting stuck in analysis paralysis, and people mostly just give up and pick whatever name is in the top 1000?
What else could explain the rise in unique names for both sexes? Perhaps the data got more accurate as it became easier to report names. Maybe the number of recorded names took off in the 1970s because that's when computers became popular and it became easier to record things.
# Conclusions
* Female names have far greater variety than male names.
* Female names tend to fluctuate more in the upper ranks.
* Both male and female top names are holding first place for shorter durations than they have in the past.
* Name creativity really started accelerating in the 1970s, but peaked in 2007/2008.
* Creativity in male names is strongly correlated with creativity in female names.
* A name popular today will quickly decline in popularity later.
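
The point about shorter first-place reigns could be made more precise by measuring each consecutive run at #1. Here's one way to sketch it with base R's `rle()`. The `top_by_year` table is an illustrative stand-in for "the #1 name in each year" (which you'd get by filtering `df2` to `rank == 1` for one sex), and `reigns` is a name I've made up for the result.

```r
# Illustrative stand-in: the #1 name for each year, sorted by year
top_by_year <- data.frame(
  year = 2010:2018,
  name = c("Jacob", "Jacob", "Jacob", "Noah", "Noah", "Noah", "Noah",
           "Liam", "Liam"),
  stringsAsFactors = FALSE
)
top_by_year <- top_by_year[order(top_by_year$year), ]

# rle() collapses consecutive repeats into (value, run length) pairs
runs <- rle(top_by_year$name)

# Index of the first row of each run, used to look up its starting year
first_row <- cumsum(runs$lengths) - runs$lengths + 1

reigns <- data.frame(
  name       = runs$values,
  start_year = top_by_year$year[first_row],
  years      = runs$lengths,
  stringsAsFactors = FALSE
)

print(reigns)
```

A name that loses the top spot and later regains it shows up as two separate reigns, which is what you want when the question is how long a name holds on at a stretch.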
# How did I do this analysis?
* Step 1: The Social Security Administration reports all baby names given each year in the United States, provided the name occurs at least 5 times. I downloaded the yearly files from [here](https://www.ssa.gov/oact/babynames/limits.html).
* Step 2: Combine the data across years into one dataset and plot using the code from [here](https://github.com/bryanwhiting/blog/tree/master/content/post/2019-10-05-baby-names.Rmd).
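
If you'd like to try the combine step without downloading anything, here's a self-contained sketch. It writes two tiny files into a temporary folder that mimic the SSA layout (`yobYYYY.txt`, with name, sex, and count separated by commas and no header), then reads and stacks them. The file contents are made up, and `read_year()` is a hypothetical helper, not the function used in the post.

```r
# Create a temp folder with two tiny files in the SSA layout (made-up counts)
tmp <- tempfile("names")
dir.create(tmp)
writeLines(c("Mary,F,10", "John,M,8"),    file.path(tmp, "yob1880.txt"))
writeLines(c("Mary,F,12", "William,M,9"), file.path(tmp, "yob1881.txt"))

# Hypothetical helper: read one year's file and tag it with its year
read_year <- function(path) {
  # Pull the four-digit year out of the file name, e.g. "yob1880.txt" -> 1880
  year <- as.integer(sub("^yob([0-9]{4})\\.txt$", "\\1", basename(path)))
  df <- read.csv(path, header = FALSE,
                 col.names = c("name", "sex", "number"),
                 # Force text columns so "F" is never parsed as logical FALSE
                 colClasses = c("character", "character", "integer"))
  df$year <- year
  df
}

# Read every yearly file and stack the results in one step
paths  <- list.files(tmp, pattern = "^yob[0-9]{4}\\.txt$", full.names = TRUE)
df_all <- do.call(rbind, lapply(paths, read_year))

print(df_all)
```

Reading everything into a list and calling `rbind` once avoids growing the data frame inside a loop, which gets slow as the number of files climbs.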