-
Notifications
You must be signed in to change notification settings - Fork 8
/
25-many-models.Rmd
624 lines (492 loc) · 22.1 KB
/
25-many-models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
```{r setup25, include=FALSE, message=FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
library(ggplot2)
library(ggbeeswarm)
library(tidyverse)
library(gapminder)
library(modelr)
library(broom)
library(nycflights13)
library(babynames)
library(nasaweather)
library(lubridate)
```
# Ch. 25: Many models
```{block2, type='rmdimportant'}
**Key questions:**
* 25.2.5 #1, 2
```
```{block2, type='rmdtip'}
**Functions and notes:**
```
* `nest` creates a list-column with default key value `data`. Each row value becomes a dataframe with all non-grouping columns and all rows corresponding with a particular group
```{r, eval = FALSE}
iris %>%
group_by(Species) %>%
nest()
```
* `unnest` unnest any list-column in your dataframe.
Notes on `unnest` behavior:
* if the atomic components of the elements of the list column are length > 1, the non-nested row columns will be duplicated when the list-column is unnested
```{r, eval = FALSE}
# atomic components of elements of list-col == 3 --> (will see duplicates of `x`)
tibble(x = 1:100) %>%
mutate(test1 = list(tibble(a = c(1, 2, 3)))) %>%
unnest(test1)
# atomic components of elements of list-col == 1 --> (will not see duplicates of `x`)
tibble(x = 1:100) %>%
mutate(test1 = list(tibble(a = 1, b = 2))) %>%
unnest(test1)
```
* if there are multiple list-cols, specify the column to unnest or default behavior will be to unnest all
* when unnesting a single column but multiple list-cols exist, the default behavior is to drop the other list columns. To override this use `.drop = FALSE`. ^[Note that if using `.drop = FALSE` in the latter case that you are creating replicated rows for list-col values]
```{r, eval = FALSE}
tibble(x = 1:100) %>%
mutate(test1 = list(c(1, 2)),
test2 = list(c(3, 4))) %>%
unnest(test1, .drop = FALSE) # change to default, i.e. `.drop = TRUE` to drop `test2` column
```
* when unnesting multiple columns, all must be the same length or you will get an error, e.g. below fails:
```{r, error = TRUE}
tibble(x = 1:100) %>%
mutate(test1 = list(c(1)),
test2 = list(c(2,3))) %>%
unnest()
# # to successfully unnest this could have added another unnest, e.g.:
# tibble(x = 1:100) %>%
# mutate(test1 = list(c(1)),
# test2 = list(c(2,3))) %>%
# unnest(test1) %>%
# unnest(test2)
```
* Method for nesting individual vectors: `group_by() %>% summarise()`, e.g.:
```{r}
iris %>%
group_by(Species) %>%
summarise_all(list)
```
* the above has the advantage of producing atomic vectors rather than dataframes as the types inside of the lists
* `broom::glance` takes a model as input and outputs a one row tibble with columns for each of several model evalation statistics (note that these metrics are geared towards evaluating the training)
* `broom::tidy` creates a tibble with columns `term`, `estimate`, `std.error`, `statistic` (t-statistic) and `p.value`. A new row is created for each `term` type, e.g. intercept, x1, x2, etc.
* `ggtitle()`, alternative to `labs(title = "type title here")`
* see [25.4.5] number 3 for a useful way of wrapping certain functions in `list` functions to take advantage of the list-col format
## 25.2: gapminder
The set-up example Hadley goes through is important, below is a slightly altered copy of his example.
__Nested Data__
```{r}
by_country <- gapminder::gapminder %>%
group_by(country, continent) %>%
nest()
```
__List-columns__
```{r}
country_model <- function(df) {
lm(lifeExp ~ year, data = df)
}
```
Want to apply this function over every data frame, the dataframes are in a list, so do this by:
```{r}
by_country2 <- by_country %>%
mutate(model = purrr::map(data, country_model))
```
Advantage with keeping things in the dataframe is that when you filter, or move things around, everything stays in sync, as do new summary values you might add.
```{r}
by_country2 %>%
arrange(continent, country)
by_country2 %>%
mutate(summaries = purrr::map(model, summary)) %>%
mutate(r_squared = purrr::map2_dbl(model, data, rsquare))
```
__unnesting__, another dataframe with the residuals included and then unnest
```{r}
by_country3 <- by_country2 %>%
mutate(resids = purrr::map2(data, model, add_residuals))
```
```{r}
resids <- by_country3 %>%
unnest(resids)
resids
```
### 25.2.5
1. A linear trend seems to be slightly too simple for the overall trend. Can you do better with a quadratic polynomial? How can you interpret the coefficients of the quadratic? (Hint you might want to transform `year` so that it has mean zero.)
*Create functions*
```{r}
# funciton to center value
center_value <- function(df){
df %>%
mutate(year_cent = year - mean(year))
}
# this function allows me to input any text to "var" to customize the inputs
# to the model, default are a linear and quadratic term for year (centered)
lm_quad_2 <- function(df, var = "year_cent + I(year_cent^2)"){
lm(as.formula(paste("lifeExp ~ ", var)), data = df)
}
```
*Create dataframe with evaluation metrics*
```{r}
by_country3_quad <- by_country3 %>%
mutate(
# create centered data
data_cent = purrr::map(data, center_value),
# create quadratic models
mod_quad = purrr::map(data_cent, lm_quad_2),
# get model evaluation stats from original model
glance_mod = purrr::map(model, broom::glance),
# get model evaluation stats from quadratic model
glance_quad = purrr::map(mod_quad, broom::glance))
```
*Create plots*
```{r}
by_country3_quad %>%
unnest(glance_mod, glance_quad, .sep = "_", .drop = TRUE) %>%
gather(glance_mod_r.squared, glance_quad_r.squared,
key = "order", value = "r.squared") %>%
ggplot(aes(x = continent, y = r.squared, colour = continent)) +
geom_boxplot() +
facet_wrap(~order)
```
* The quadratic trend seems to do better --> indicated by the distribution of the R^2 values being closer to one. The level of improvement seems especially pronounced for African countries.
Let's check this closer by looking at percentage point improvement in R^2 in chart below
```{r}
by_country3_quad %>%
mutate(quad_coefs = map(mod_quad, broom::tidy)) %>%
unnest(glance_mod, .sep = "_") %>%
unnest(glance_quad) %>%
mutate(bad_fit = glance_mod_r.squared < 0.25,
R.squ_ppt_increase = r.squared - glance_mod_r.squared) %>%
ggplot(aes(x = continent, y = R.squ_ppt_increase))+
# geom_quasirandom(aes(alpha = bad_fit), colour = "black")+
geom_boxplot(alpha = 0.1, colour = "dark grey")+
geom_quasirandom(aes(colour = continent))+
labs(title = "Percentage point (PPT) improvement in R squared value",
subtitle = "(When adding a quadratic term to the linear regression model)")
```
*View predictions from linear model with quadratic term*
(of countries where linear trend did not capture relationship)
```{r}
bad_fit <- by_country3 %>%
mutate(glance = purrr::map(model, broom::glance)) %>%
unnest(glance, .drop = TRUE) %>%
filter(r.squared < 0.25)
#solve with join with bad_fit
by_country3_quad %>%
semi_join(bad_fit, by = "country") %>%
mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
unnest(data_preds) %>%
ggplot(aes(x = year, group = country))+
geom_point(aes(y = lifeExp, colour = country))+
geom_line(aes(y = pred, colour = country))+
facet_wrap(~country)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
* while the quadratic model does a better job fitting the model than a linear term does, I wouldn't say it does a good job of fitting the model
* it looks like the trends are generally consistent rates of improvement and then there is a sudden drop-off associated with some event, hence an intervention variable may be a more appropriate method for modeling this pattern
*Quadratic model parameters*
```{r}
by_country3_quad %>%
mutate(quad_coefs = map(mod_quad, broom::tidy)) %>%
unnest(glance_mod, .sep = "_") %>%
unnest(glance_quad) %>%
unnest(quad_coefs) %>%
mutate(bad_fit = glance_mod_r.squared < 0.25) %>%
ggplot(aes(x = continent, y = estimate, alpha = bad_fit))+
geom_boxplot(alpha = 0.1, colour = "dark grey")+
geom_quasirandom(aes(colour = continent))+
facet_wrap(~term, scales = "free")+
labs(caption = "Note that 'bad fit' represents a bad fit on the initial model \nthat did not contain a quadratic term)")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
* The quadratic term (in a linear function, trained with the x-value centered at the mean, as in this dataset) has a few important notes related to interpretation
* If the coefficient is positive the output will be convex, if it is negative it will be concave (i.e. smile vs. frown shape)
* The value on the coefficient represents 1/2 the rate at which the relationship between `lifeExp` and `year` is changing for every one unit change from the mean / expected value of `lifeExp` in the dataset.
* Hence if the coefficient is near 0, that means the relationship between `lifeExp` and `year` does not change (or at least does not change at a constant rate) when moving in either direction from `lifeExp`s mean value.
To better understand this, let's look look at a specific example. Excluding Rwanda, Botswana was the `country` that the linear model without the quadratic term performed the worst on. We'll use this as our example for interpreting the coefficients.
*Plots of predicted and actual values for Botswanian life expectancy by year*
```{r}
by_country3_quad %>%
filter(country == "Botswana") %>%
mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
unnest(data_preds) %>%
ggplot(aes(x = year, group = country))+
geom_point(aes(y = lifeExp))+
geom_line(aes(y = pred, colour = "prediction"))+
labs(title = "Data and quadratic trend of predictions for Botswana")
```
*(note that the centered value for year in the 'centered' dataset is 1979.5)*
In the model for Botswana, coefficents are:
Intercept: ~ 59.81
year (centered): ~ 0.0607
year (centered)^2: ~ -0.0175
Hence for every one year we move away from the central year (1979.5), the rate of change between year and price decreases by *~0.035*.
Below I show this graphically by plotting the lines tangent to the models output.
```{r}
botswana_coefs <- by_country3_quad %>%
filter(country == "Botswana") %>%
with(map(mod_quad, coef)) %>%
flatten_dbl()
```
Helper functions to find tangent points
```{r}
find_slope <- function(x){
2*botswana_coefs[[3]]*x + botswana_coefs[[2]]
}
find_y1 <- function(x){
botswana_coefs[[3]]*(x^2) + botswana_coefs[[2]]*x + botswana_coefs[[1]]
}
find_intercept <- function(x, y, m){
y - x*m
}
tangent_lines <- tibble(x1 = seq(-20, 20, 10)) %>%
mutate(slope = find_slope(x1),
y1 = find_y1(x1),
intercept = find_intercept(x1, y1, slope),
slope_change = x1*2*botswana_coefs[[3]]) %>%
select(slope, intercept, everything())
```
```{r}
by_country3_quad %>%
filter(country == "Botswana") %>%
mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
unnest(data_preds) %>%
ggplot(aes(x = year_cent))+
geom_line(aes(x = year_cent, y = pred), colour = "red")+
geom_abline(aes(intercept = intercept, slope = slope),
data = tangent_lines)+
coord_fixed()
```
Below is the relevant output in a table.
`x1`: represents the change in x value from 1979.5
`slope`: slope of the tangent line at particular `x1` value
`slope_diff_central`: the amount the slope is different from the slope of the tangent line at the central year
```{r}
select(tangent_lines, x1, slope, slope_diff_central = slope_change)
```
* notice that for every 10 year increase in `x1` we see the slope of the tangent line has decreased by 0.35. If we'd looked at just one year we would have seen the change was 0.035, this correspondig with 2 multiplied by the coefficient on the quadratic term of our model.
1. Explore other methods for visualising the distribution of $R^2$ per continent. You might want to try the ggbeeswarm package, which provides similar methods for avoiding overlaps as jitter, but uses deterministic methods.
*visualisations of linear model*
```{r}
by_country3_quad %>%
unnest(glance_mod) %>%
ggplot(aes(x = continent, y = r.squared, colour = continent))+
geom_boxplot(alpha = 0.1, colour = "dark grey")+
ggbeeswarm::geom_quasirandom()
```
* I like `geom_quasirandom()` the best as an overlay on boxplot, it keeps things centered and doesn't have the gravitational pull affect that makes `geom_beeswarm()` become a little misaligned, it also works well here over `geom_jitter()` as the points stay better around their true value
1. To create the last plot (showing the data for the countries with the worst model fits), we needed two steps: we created a data frame with one row per country and then semi-joined it to the original dataset. It's possible to avoid this join if we use `unnest()` instead of `unnest(.drop = TRUE)`. How?
```{r}
#first filter by r.squared and then unnest
by_country3_quad %>%
mutate(data_preds = purrr::map2(data_cent, mod_quad, add_predictions)) %>%
unnest(glance_mod) %>%
mutate(bad_fit = r.squared < 0.25) %>%
filter(bad_fit) %>%
unnest(data_preds) %>%
ggplot(aes(x = year, group = country))+
geom_point(aes(y = lifeExp, colour = country))+
geom_line(aes(y = pred, colour = country))+
facet_wrap(~country)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## 25.4: Creating list-columns
### 25.4.5
1. List all the functions that you can think of that take an atomic vector and return a list.
* `stringr::str_extract_all` + other `stringr` functions
(however the below can also take types that are not atomic and are probably not really what is being looked for)
* `list`
* `tibble`
* `map` / `lapply`
1. Brainstorm useful summary functions that, like `quantile()`, return multiple values.
* `summary`
* `range`
* ...
1. What's missing in the following data frame? How does `quantile()` return that missing piece? Why isn't that helpful here?
```{r}
mtcars %>%
group_by(cyl) %>%
summarise(q = list(quantile(mpg))) %>%
unnest()
```
* need to capture probabilities of quantiles to make useful...
```{r}
probs <- c(0.01, 0.25, 0.5, 0.75, 0.99)
mtcars %>%
group_by(cyl) %>%
summarise(p = list(probs), q = list(quantile(mpg, probs))) %>%
unnest()
```
* see [list(quantile()) examples] for related method that captures names of quantiles (rather than requiring th user to manually input a vector of probabilities)
1. What does this code do? Why might it be useful?
```{r, eval = FALSE}
mtcars %>%
select(1:3) %>%
group_by(cyl) %>%
summarise_all(funs(list))
```
* It turns each row into an atomic vector grouped by the particular `cyl` value. It is different from `nest` in that each column creates a new list-column representing an atomic vector. If `nest` had been used, this would have created a single dataframe that all the values woudl have been in. Could be useful for running purr through particular columns...
* e.g. let's say we want to find the number of unique items in each column for each grouping, we could do that like so
```{r}
mtcars %>%
group_by(cyl) %>%
select(1:5) %>%
summarise_all(funs(list)) %>%
mutate_all(funs(unique = map_int(., ~length(unique(.x)))))
# we could also simply overwrite the values (rather than make new columns)
mtcars %>%
group_by(cyl) %>%
select(1:5) %>%
summarise_all(funs(list)) %>%
mutate_all(funs(map_int(., ~length(unique(.x)))))
```
## 25.5: Simplifying list-columns
### 25.5.3
1. Why might the `lengths()` function be useful for creating atomic
vector columns from list-columns?
* perhaps you want to measure the number of elements (or unique elements) in an individual element of a list column
```{r}
mpg %>%
group_by(cyl) %>%
summarise(displ_list = list(displ)) %>%
mutate(num_unique = map_int(displ_list, ~unique(.x) %>% length()))
```
1. List the most common types of vector found in a data frame. What makes lists different?
* the atomic types: char, int, double, factor, date are all more common, they are atomic, whereas lists are not atomic vectors and can contain any type of data within them (e.g. a list of atomic vectors, list of lists, etc.).
## Appendix
### Models in lists
This is the more traditional way you might store models in a list
```{r}
models_countries <- purrr::map(by_country$data, country_model)
names(models_countries) <- by_country$country
models_countries[1:3]
```
### List-columns for sampling
say you want to sample all the flights on 50 days out of the year. List-cols can be used to generate a sample like this:
```{r}
flights %>%
mutate(create_date = make_date(year, month, day)) %>%
select(create_date, 5:8) %>%
group_by(create_date) %>%
nest() %>%
sample_n(50) %>%
unnest()
```
Alternatively you could use a `semi_join()`, e.g.
```{r}
flights_samp <- flights %>%
mutate(create_date = make_date(year, month, day)) %>%
distinct(create_date) %>%
sample_n(50)
flights %>%
mutate(create_date = make_date(year, month, day)) %>%
select(create_date, 5:8) %>%
semi_join(flights_samp, by = "create_date")
```
* In some situations I find the `nest`, `unnest` method more elegant though the `semi_join` method seems to run goes faster on large dataframes
* There are also other more specialized functions in the tidyverse to help with various sampling strategies
### 25.2.5.1
#### Include cubic term
Let's look at this example if we had allowed year to be a 3rd order polynomial. We're really stretching our degrees of freedom (in relation to our number of observations) in this case -- these might be less likely to generalize to other data well.
```{r}
by_country3 %>%
semi_join(bad_fit, by = "country") %>%
mutate(
# create centered data
data_cent = purrr::map(data, center_value),
# create cubic (3rd order) data
mod_cubic = purrr::map(data_cent, lm_quad_2, var = "year_cent + I(year_cent^2) + I(year_cent^3)"),
# get predictions for 3rd order model
data_cubic = purrr::map2(data_cent, mod_cubic, add_predictions)) %>%
unnest(data_cubic) %>%
ggplot(aes(x = year, group = country))+
geom_point(aes(y = lifeExp, colour = country))+
geom_line(aes(y = pred, colour = country))+
facet_wrap(~country)+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
* interpretibility of coefficients beyond quadratic term becomes less strait forward to explain
### Multiple graphs in chunk
There are a variety of ways to have multiple graphs outputted and aligned side by side:
* build graphs separately and use `gridExtra::grid.arrange()`
* Ensure metrics have been gathered into a single column and then use `facet_wrap()`/`facet_grid()` (`ggforce` is a helpful extension package to ggplot2 that gives more functionality to these faceting functions)
* manipulate chunk options, e.g. figures below have the following options set in the R code chunk: `out.width = "33%", fig.asp = 1, fig.width = 3, fig.show='hold',, fig.align='default'`
```{r, out.width = "33%", fig.asp = 1, fig.width = 3, fig.show='hold',, fig.align='default'}
nz <- filter(gapminder, country == "New Zealand")
nz %>%
ggplot(aes(year, lifeExp)) +
geom_line() +
ggtitle("Full data = ")
nz_mod <- lm(lifeExp ~ year, data = nz)
nz %>%
add_predictions(nz_mod) %>%
ggplot(aes(year, pred)) +
geom_line() +
ggtitle("Linear trend + ")
nz %>%
add_residuals(nz_mod) %>%
ggplot(aes(year, resid)) +
geom_hline(yintercept = 0, colour = "white", size = 3) +
geom_line() +
ggtitle("Remaining pattern")
```
### list(quantile()) examples
Some of these examples may not represent best practices.
```{r}
prob_vals <- c(0, .25, .5, .75, 1)
iris %>%
group_by(Species) %>%
summarise(Petal.Length_q = list(quantile(Petal.Length))) %>%
mutate(probs = list(prob_vals)) %>%
unnest()
```
*Example for using quantile across range of columns*
*Also notice dynamic method for extracting names*
```{r}
iris %>%
group_by(Species) %>%
summarise_all(funs(list(quantile(., probs = prob_vals)))) %>%
mutate(probs = map(Petal.Length, names)) %>%
unnest()
```
### Extracting names
Maybe not best practice:
```{r}
quantile(1:100) %>%
as.data.frame() %>%
rownames_to_column()
```
Better would be to use `enframe()` here:
```{r}
quantile(1:100) %>%
tibble::enframe()
```
### `invoke_map` example (book)
I liked Hadley's example with invoke_map and wanted to save it:
```{r}
sim <- tribble(
~f, ~params,
"runif", list(min = -1, max = -1),
"rnorm", list(sd = 5),
"rpois", list(lambda = 10)
)
sim %>%
mutate(sims = invoke_map(f, params, n = 10))
```
### named list example (book)
I liked Hadley's example where you have a list of named vectors that you need to iterate over both the values as well as the names and the use of enframe to facilitate this.
Below is the copied example and notes:
```{r}
x <- list(
a = 1:5,
b = 3:4,
c = 5:6
)
df <- enframe(x)
df
```
The advantage of this structure is that it generalises in a straightforward way - names are useful if you have character vector of metadata, but don't help if you have other types of data, or multiple vectors.
Now if you want to iterate over names and values in parallel, you can use `map2()`:
```{r}
df %>%
mutate(
smry = map2_chr(name, value, ~ stringr::str_c(.x, ": ", .y[1]))
)
```