/
2019-05-13-popular-letters-in-baby-names-animated.Rmd
498 lines (428 loc) · 16.2 KB
/
2019-05-13-popular-letters-in-baby-names-animated.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
---
title: Popular Letters in Baby Names, Animated
author: Garrick Aden-Buie
date: '2019-05-17'
slug: popular-letters-in-babynames
categories:
- Blog
tags:
- R
- gganimate
- ggplot2
- Visualization
- Baby Names
description: "An animation of letter popularity in baby names"
twitterImage: "/blog/2019/2019-05-13-popular-letters-in-baby-names-animated_files/figure-html/babynames-any-letter-1.gif"
rmd_source: https://github.com/gadenbuie/garrickadenbuie-com/blob/master/content/blog/2019/2019-05-13-popular-letters-in-baby-names-animated.Rmd
keywords: rstats
editor_options:
chunk_output_type: console
---
<!-- Links -->
[tidyverse]: https://tidyverse.org/
[gganimate]: https://gganimate.com/
[babynames]: https://github.com/hadley/babynames
[tidyexplain]: https://github.com/gadenbuie/tidyexplain
[hadley]: https://hadley.nz/
[grrrck]: https://twitter.com/grrrck
[showtext]: https://cran.r-project.org/web/packages/showtext/
[sysfonts]: https://cran.r-project.org/web/packages/sysfonts/
[google-fonts]: https://fonts.google.com/
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = FALSE, warning = FALSE, message = FALSE, cache = TRUE,
fig.width = 9, fig.height = 10
)
options(htmltools.dir.version = TRUE)
# If using lightbox for plots, set `fig.show = FALSE`
# Usage: lightbox_img(knitr::fig_chunk("chunk-name", "png"))
lightbox_img <- function(url, alt = "", caption = "", preview = TRUE) {
if (preview) {
glue::glue(
'<a href="{url}" data-featherlight="image">
<div class="figure">
<img src="{url}" alt="{alt}">
<p class="caption">{caption}</p>
</div>
</a>
'
)
} else {
if (alt == "") alt <- "static image of the plot"
glue::glue('<a href="{url}" data-featherlight="image">{alt}</a>')
}
}
```
`r lightbox_img(knitr::fig_chunk("babynames-any-letter", "gif"))`
Earlier this week, a tweet from Kieran Healy caught my attention with a neat [animation of the popularity of the final letters](https://twitter.com/kjhealy/status/1127951257392623616) of baby names.
```{r}
blogdown::shortcode("twitter", "1127951257392623616")
```
I'm also a big fan of [[gganimate]]{.pkg} - check out my first project with [gganimate]{.pkg}, a collection of [join animations called _tidyexplain_][tidyexplain].
And the [[babynames]]{.pkg} package by [Hadley Wickham][hadley] makes it pleasantly easy to work with the baby names data reported by the U.S. Social Security Administration.
Kieran's animations inspired several questions that I hope to answer (or at least visualize) in this post:
1. What about _any_ letter within a baby's name? I understand why first and last letters would be interesting, but how has the popularity of any letter used in a baby's name changed over time?
1. Can we visualize both _male_ and _female_ names in the same animation without overloading the animation?
1. While I love [gganimate]{.pkg} and animated plots, are static plots more effective at displaying the same information?
Rather than answer these questions definitively or scientifically, I've stuck with the fun parts and made a few visualizations.
I'll let you decide how effective they are.
(And feel free to let me know on Twitter at [@grrrck][grrrck].)
## Getting Started
To get started, I loaded the following packages, all installed from CRAN with `install.packages()`.
```{r echo=TRUE}
library(tidyverse)
library(gganimate)
library(babynames)
```
Next, I set up a base [ggplot2]{.pkg} theme that I'll use throughout.
Note that you can also set the `ggplot2` theme globally with `theme_set()`, but I'm not doing that here for complicated reasons related to my use of [knitr caching](https://yihui.name/knitr/demo/cache/) for faster rendering between post drafts.
I also used the [[showtext]]{.pkg} and [[sysfonts]]{.pkg} packages.
These two sister packages are my go-to packages for reliably being able to use [Google Fonts][google-fonts] with [ggplot2]{.pkg} on any system.
```{r theme, echo=TRUE}
showtext::showtext_auto()
sysfonts::font_add_google("PT Sans")
sysfonts::font_add_google("PT Sans Narrow")
base_theme <-
theme_minimal(base_size = 16, base_family = "PT Sans") +
theme(
legend.position = c(0.5, 0.9),
legend.text = element_text(margin = margin(r = 10)),
legend.background = element_rect(fill = "white", color = "white"),
legend.direction = "horizontal",
legend.justification = "center",
panel.grid.major.x = element_blank(),
panel.grid.minor.x = element_blank(),
axis.text = element_text(family = "PT Sans Narrow"),
axis.text.x = element_text(vjust = 0.9, face = "bold"),
axis.title.y = element_text(margin = margin(r = 20)),
plot.subtitle = element_text(
size = rel(1.5), hjust = 0.5, margin = margin(t = 10)
),
plot.caption = element_text(color = "grey40", lineheight = 1.1)
)
sex_colors <- c("Male" = "#00589A", "Female" = "#EB1455")
caption_text <- paste(
"Source: babynames, U.S. Social Security Administration",
"@grrrck", "garrickadenbuie.com",
sep = " | "
)
```
## How Popular are Letters Used Anywhere in Baby Names
### Letters in Baby Names
To answer the first question, I needed to first establish what letters are used in a name, regardless of the letter's position in the name, and not counting duplicates.
Rather than repeatedly calculate this information for every year, sex, and name pair in `babynames`, I used `distinct()` from [dplyr]{.pkg} to obtain the list of unique names.
```{r babynames-and-letters, echo = TRUE}
babynames_and_letters <-
babynames %>%
distinct(name) %>%
mutate(letter = strsplit(tolower(name), character())) %>%
# strsplit() returns a list for each name,
# so 'letter' is a list-column that can be converted to a
# normal column with unnest()
unnest(letter) %>%
# count each letter once
distinct()
babynames_and_letters %>%
filter(name == "Garrick")
```
Then I joined `babynames` with the `babynames_and_letters` table to get the sum of the proportions of the population (by year and sex) that have each letter in their name.
```{r babynames-containing, echo = TRUE}
babynames_containing <-
left_join(babynames, babynames_and_letters, by = "name") %>%
group_by(letter, year, sex) %>%
summarize(prop = sum(prop)) %>%
ungroup() %>%
filter(year >= 1900) %>%
mutate(sex = recode(sex, M = "Male", F = "Female"))
babynames_containing %>%
filter(letter == "g") %>%
head()
```
### Plot and Animate
At this point, my typical [gganimate]{.pkg} workflow is to first create a static [ggplot2]{.pkg} plot as a sanity check for the animation.
If the static plot works with `facet_wrap(~ state_column)`, then using `transition_state(state_column)` is likely to work well (although you may need to filter the data to preview only a few states).
In this case, I used `year` as the state column.
I generally try to structure my [ggplot2]{.pkg} code in a consistent way (future blog post?) so I prefer having the [gganimate]{.pkg} specific parts at the end.
I also directly called `gganimate::animate()` so that I could control specific parameters of the output, namely the number of frames and the size of the image.
The default number of frames is 100 but I'm visualizing 117 years and [gganimate]{.pkg} will complain (i.e. throw an error) if there aren't enough frames to cover the number of states.
```{r babynames-animated-popular-letter-names, echo=TRUE, fig.show="hide"}
ga_pop_letter <-
ggplot(babynames_containing) +
aes(letter, prop, fill = sex) +
geom_col(position = "identity", alpha = 0.6) +
scale_y_continuous(
labels = scales::percent_format(accuracy = 10),
expand = c(0, 0)
) +
scale_fill_manual(values = sex_colors) +
labs(
x = NULL,
y = "Percent of Population",
fill = NULL,
title = "How many babies have the letter ____ in their name?",
subtitle = "{closest_state}",
caption = caption_text
) +
base_theme +
ease_aes("linear") +
transition_states(
year, transition_length = 1, state_length = 0, wrap = FALSE
)
ga_pop_letter_animated <- animate(
ga_pop_letter,
nframes = 117*2+10,
end_pause = 10,
width = 1024,
height = 512
)
```
```{r babynames-any-letter, include=FALSE}
ga_pop_letter_animated
```
The resulting animation, below, shows the proportion of babies given a name containing (at least one of) each letter of the alphabet since 1900.
`r lightbox_img(knitr::fig_chunk("babynames-any-letter", "gif"))`
```{r eval=FALSE}
g2 <- ggplot(babynames_containing) +
aes(year, prop, color = sex) +
geom_text(
data = babynames_containing %>%
distinct(year, letter) %>%
mutate(prop = 0, year2 = 1967),
aes(label = letter, x = year2),
size = rel(15),
vjust = -0.4,
color = "grey75"
) +
geom_line(aes(group = sex)) +
facet_wrap(~ letter, ncol = 6) +
labs(x = "1917 - 2017") +
scale_x_continuous(
breaks = seq(1900, 2017, by = 50),
expand = c(0, 0)
) +
scale_y_continuous(
labels = scales::percent_format(10),
expand = c(0, 0, 0.05, 0)
) +
scale_color_manual(values = c("Male" = "#00589A", "Female" = "#EB1455")) +
theme(
legend.position = "bottom",
axis.text.x = element_blank(),
strip.text = element_blank(),
panel.grid.minor.y = element_blank(),
panel.border = element_rect(fill = NA, color = "grey90")
)
```
## First and Last Letters of Baby Names
To extract the first and last letter of each name, I wrote two small functions, `first_letter()` and `last_letter()` that use `substring()` to pull out the first and last letter of a string.
These are reasonably fast when applied to all of the names using `map_chr()` from [purrr]{.pkg} within `mutate()` to add new columns `first_letter` and `last_letter`.
```{r babynames_first_last, echo=TRUE}
first_letter <- function(x) substring(x, 1, 1)
last_letter <- function(x) substring(x, nchar(x), nchar(x))
babynames_first_last <-
babynames %>%
mutate(
first_letter = map_chr(name, first_letter),
last_letter = map_chr(name, last_letter),
) %>%
mutate_at(vars(contains("letter")), tolower)
set.seed(42)
babynames_first_last %>% sample_n(6)
```
### Animated
First, let's look at the animated version of my remix of Kieran's plots.
You can still see the change in the use of **N** as a final letter in the names given to baby boys, as he described, but you also see the changes in female names as well.
`r lightbox_img(knitr::fig_chunk("babynames-last-animation", "gif"))`
Here's the code I used to produce the plot above.
```{r echo=TRUE}
gb_last <-
babynames_first_last %>%
filter(year >= 1900) %>%
mutate(sex = recode(sex, M = "Male", F = "Female")) %>%
group_by(year, sex, last_letter) %>%
summarize(prop = sum(prop)) %>%
ggplot() +
aes(last_letter, prop, fill = sex) +
geom_col(position = "identity", alpha = 0.6) +
scale_y_continuous(
labels = scales::percent_format(accuracy = 5),
expand = c(0, 0)
) +
scale_fill_manual(values = sex_colors) +
labs(
x = NULL,
y = "Percent of Population",
fill = NULL,
title = "How many baby names end with the letter ____?",
subtitle = "{closest_state}",
caption = caption_text
) +
base_theme +
theme(legend.position = c(0.8, 0.9)) +
ease_aes("linear") +
transition_states(
year,
transition_length = 1,
state_length = 0, wrap = FALSE
)
gb_last_animated <- animate(
gb_last,
nframes = 117*2+10,
end_pause = 10,
width = 1024, height = 512
)
```
```{r babynames-last-animation, include=FALSE}
gb_last_animated
```
Then I also created a similar visualization for the starting letters of baby's names.
`r lightbox_img(knitr::fig_chunk("babynames-first-animation", "gif"))`
And here's the code to create the above animation.
```{r echo=TRUE}
gb_first <-
babynames_first_last %>%
filter(year >= 1900) %>%
mutate(sex = recode(sex, M = "Male", F = "Female")) %>%
group_by(year, sex, first_letter) %>%
summarize(prop = sum(prop)) %>%
ggplot() +
aes(first_letter, prop, fill = sex) +
geom_col(position = "identity", alpha = 0.6) +
scale_y_continuous(
labels = scales::percent_format(accuracy = 5),
expand = c(0, 0)
) +
scale_fill_manual(values = sex_colors) +
labs(
x = NULL,
y = "Percent of Population",
fill = NULL,
title = "How many baby names start with the letter ____?",
subtitle = "{closest_state}",
caption = caption_text
) +
base_theme +
theme(legend.position = c(0.8, 0.9)) +
ease_aes("linear") +
transition_states(
year,
transition_length = 1,
state_length = 0,
wrap = FALSE
)
gb_first_animated <- animate(
gb_first,
nframes = 117*2+10,
end_pause = 10,
width = 1024, height = 512
)
```
```{r babynames-first-animation, include=FALSE}
gb_first_animated
```
### Static
I'll admit, I think the animated plots are cool,
but they also make it hard to reason about overall trends.
You have to watch the animation loop over and over, trying to watch one letter or one sex specifically.
I feel like I'm seeing the movement but missing the picture.
I thought it would be interesting to compare the animated plots with line charts showing the same information, so I swapped [gganimate]{.pkg}'s `transition_state(year)` for [ggplot2]{.pkg}'s `facet_wrap(~ letter)`.
I somewhat like these plots more than their animated versions.
I get the sense that I'm seeing a fuller picture (or more easy-to-compare picture) of the overall trends in starting and ending letters of baby names.
`r lightbox_img(knitr::fig_chunk("babynames-last-static", "png"))`
`r lightbox_img(knitr::fig_chunk("babynames-first-static", "png"))`
Here's the code to produce the static image of the trends in the last letter of baby names.
```{r babynames-last-static, echo=TRUE, fig.width=10, fig.height=5, dependson="theme", fig.show="hide"}
babynames_first_last %>%
filter(year >= 1900) %>%
mutate(sex = recode(sex, M = "Male", F = "Female")) %>%
group_by(year, sex, last_letter) %>%
summarize(prop = sum(prop)) %>%
ungroup() %>%
mutate(last_letter = toupper(last_letter)) %>%
ggplot() +
aes(year, prop) +
geom_text(
data = tibble(
last_letter = LETTERS,
x = 1900+117/2,
prop = 0
),
aes(label = last_letter, x = x),
size = rel(15),
vjust = -0.12,
color = "grey75",
family = "PT Sans"
) +
geom_line(aes(color = sex)) +
facet_wrap(~ last_letter) +
scale_y_continuous(
labels = scales::percent_format(5),
breaks = c(0, 0.3)
) +
scale_x_continuous(breaks = c(1900, 2000)) +
scale_color_manual(values = sex_colors) +
labs(
x = NULL,
y = "Proportion of Population",
title = "How many babies have names ending with the letter ____?",
caption = caption_text,
color = NULL
) +
base_theme +
theme(
strip.text = element_blank(),
axis.text.x = element_text(face = "plain"),
legend.position = c(0.7, 0.025),
panel.grid.minor.y = element_blank()
)
```
And here is the code to produce the plot of trends in the first letter of baby names.
```{r babynames-first-static, echo=TRUE, fig.width=10, fig.height=5, dependson="theme", fig.show="hide"}
babynames_first_last %>%
filter(year >= 1900) %>%
mutate(sex = recode(sex, M = "Male", F = "Female")) %>%
group_by(year, sex, first_letter) %>%
summarize(prop = sum(prop)) %>%
ungroup() %>%
mutate(first_letter = toupper(first_letter)) %>%
ggplot() +
aes(year, prop) +
geom_text(
data = tibble(
first_letter = LETTERS,
x = 1900+117/2,
prop = 0
),
aes(label = first_letter, x = x),
size = rel(15),
vjust = -0.2,
color = "grey75",
family = "PT Sans"
) +
geom_line(aes(color = sex)) +
facet_wrap(~ first_letter) +
scale_y_continuous(
labels = scales::percent_format(5),
breaks = c(0, 0.2)
) +
scale_x_continuous(breaks = c(1900, 2000)) +
scale_color_manual(values = sex_colors) +
labs(
x = NULL,
y = "Proportion of Population",
title = "How many babies have names starting with the letter ____?",
caption = caption_text,
color = NULL
) +
base_theme +
theme(
strip.text = element_blank(),
axis.text.x = element_text(face = "plain"),
legend.position = c(0.7, 0.025),
panel.grid.minor.y = element_blank()
)
```
***
If you made it this far, thanks for reading!
I'd love to hear your opinion on these plots, or see your own versions -- animated or not!
Just drop me a line on Twitter at [@grrrck][grrrck].