/
guess_the_age.Rmd
161 lines (129 loc) · 6.36 KB
/
guess_the_age.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
---
title: "How to Tell Someone's Age When All You Know Is Her Name"
author: "Emily Kothe"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
keep_md: yes
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
This vignette is based on this FiveThirtyEight article [How to Tell Someone’s Age When All You Know Is Her Name](https://fivethirtyeight.com/features/how-to-tell-someones-age-when-all-you-know-is-her-name/#fn-1). That article demonstrated how you could use actuarial life tables and a database of baby names to estimate the age of living Americans with a given name.
In this vignette we use the nzbabynames data set and the NZ complete cohort actuarial life tables to calculate the median age of individuals with a given first name.^[Like FiveThirtyEight we assume that people of the same sex die at the same rate regardless of their name - in practice will vary by a number of factors related to both name and life expectancy such as ethnicity].
## An "old" name
Let's start with one name that is stereotypically 'old' - Ethel
```{r message=FALSE, warning = FALSE}
library(ggplot2)
library(matrixStats)
library(dplyr)
library(nzbabynames)
library(stringr)
library(lubridate)
name_of_interest <- "Ethel"
df <- life_tables %>%
filter(percentile == "median") %>%
filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>%
mutate(prop_still_alive = lx/100000,
Year = yearofbirth,
Sex = stringr::str_to_title(sex)) %>%
right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>%
mutate(prop_still_alive = case_when(
Year < 1918 ~ 0,
TRUE ~ prop_still_alive),
Name = case_when(
is.na(Name) ~ name_of_interest,
TRUE ~ Name),
Count = case_when(
is.na(Count) ~ 0,
TRUE ~ as.numeric(Count)),
alive_count = Count * prop_still_alive)
med_age <- df %>%
summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE))
max <- max(df$Count)
peak <- df %>%
filter(Count == max)
ggplot(data = df, aes(x = Year)) +
geom_col(aes(y = Count), fill="grey")+
geom_col(aes(y = alive_count), fill="cornflowerblue") +
scale_x_continuous(limits = c(1900, 2016)) +
coord_cartesian(expand = FALSE) +
theme_classic()
```
`r name_of_interest` peaked in `r peak$Year` when `r max` `r name_of_interest`s were born. Based on the actuarial data, `r round(sum(df$alive_count, na.rm = TRUE)/sum(df$Count, na.rm = TRUE)*100,0)`% of `r name_of_interest`s born between 1900 and 2017 would still be alive in `r lubridate::year(Sys.Date())`. The median age of a living `r name_of_interest` would be `r round(med_age$wted_median, 0)` years.
## And a young one
Let's contrast that with a 'younger' name - Brittany.
```{r message=FALSE, warning = FALSE}
name_of_interest <- "Brittany"
df <- life_tables %>%
filter(percentile == "median") %>%
filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>%
mutate(prop_still_alive = lx/100000,
Year = yearofbirth,
Sex = stringr::str_to_title(sex)) %>%
right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>%
mutate(prop_still_alive = case_when(
Year < 1918 ~ 0,
TRUE ~ prop_still_alive),
Name = case_when(
is.na(Name) ~ name_of_interest,
TRUE ~ Name),
Count = case_when(
is.na(Count) ~ 0,
TRUE ~ as.numeric(Count)),
alive_count = Count * prop_still_alive)
med_age <- df %>%
summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE))
max <- max(df$Count)
peak <- df %>%
filter(Count == max)
ggplot(data = df, aes(x = Year)) +
geom_col(aes(y = Count), fill="grey")+
geom_col(aes(y = alive_count), fill="cornflowerblue") +
scale_x_continuous(limits = c(1900, 2016)) +
coord_cartesian(expand = FALSE) +
theme_classic()
```
`r name_of_interest` peaked in `r peak$Year` when `r max` `r name_of_interest`s were born. Based on the actuarial data, `r round(sum(df$alive_count, na.rm = TRUE)/sum(df$Count, na.rm = TRUE)*100,0)`% of `r name_of_interest`s born between 1900 and 2017 would still be alive in `r lubridate::year(Sys.Date())`. The median age of a living `r name_of_interest` would be `r round(med_age$wted_median, 0)` years.
## Finally a less informative name
Ethel and Brittany are both names with pretty clear trajectories. But of course some names rise and fall over time. Let's pick one of these - Joseph
```{r message=FALSE, warning = FALSE}
name_of_interest <- "Joseph"
df <- life_tables %>%
filter(percentile == "median") %>%
filter(age == (lubridate::year(Sys.Date()) - yearofbirth)) %>%
mutate(prop_still_alive = lx/100000,
Year = yearofbirth,
Sex = stringr::str_to_title(sex)) %>%
right_join(filter(nzbabynames, Name == name_of_interest), by= c("Year", "Sex")) %>%
mutate(prop_still_alive = case_when(
Year < 1918 ~ 0,
TRUE ~ prop_still_alive),
Name = case_when(
is.na(Name) ~ name_of_interest,
TRUE ~ Name),
Count = case_when(
is.na(Count) ~ 0,
TRUE ~ as.numeric(Count)),
alive_count = Count * prop_still_alive)
med_age <- df %>%
summarise(wted_median = weightedMedian(age, w = alive_count, na.rm = TRUE))
max <- max(df$Count)
peak <- df %>%
filter(Count == max)
ggplot(data = df, aes(x = Year)) +
geom_col(aes(y = Count), fill="grey")+
geom_col(aes(y = alive_count), fill="cornflowerblue") +
scale_x_continuous(limits = c(1900, 2016)) +
coord_cartesian(expand = FALSE) +
theme_classic()
```
`r name_of_interest` peaked in `r peak$Year` when `r max` `r name_of_interest`s were born. Based on the actuarial data, `r round(sum(df$alive_count, na.rm=TRUE)/sum(df$Count, na.rm=TRUE)*100,0)`% of `r name_of_interest`s born between 1900 and 2017 would still be alive in `r lubridate::year(Sys.Date())`. The median age of a living `r name_of_interest` would be `r round(med_age$wted_median, 0)` years.
However, note that while there is a clear peak in `r peak$Year` this name has had much more consistent consistent staying power than either Ethel or Brittany, meaning that you might be less confident about guessing the age of any given `r name_of_interest` based on name alone. Put more formally, the interquartile range (and thus uncertainty) around the estimated age for `r name_of_interest` is much greater than around Ethel or Brittany.