/
2019-07-22-automate-model-fitting-with-loops.Rmd
270 lines (186 loc) · 11.8 KB
/
2019-07-22-automate-model-fitting-with-loops.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
---
title: 'Many similar models - Part 2: Automate model fitting with purrr::map() loops'
author: Ariel Muldoon
date: '2019-07-22'
slug: automate-model-fitting-with-loops
categories:
- r
- statistics
tags:
- analysis
- purrr
- ggplot2
keywords:
- purrr
- rstats
draft: FALSE
description: "The task of fitting many similar models can be automated by looping through variables. I show an example of fitting the same model for multiple different response variables and then making residual plots for all models prior to extracting model results."
---
```{r setup, include = FALSE, message = FALSE, purl = FALSE}
knitr::opts_chunk$set(comment = "#")
devtools::source_gist("2500a85297b742c6f2fb3a14549f5851",
filename = 'render_toc.R')
# Make dataset, with 3 response variables
# and 1 2-level categorical explanatory
set.seed(16)
dat = data.frame(group = rep(letters[1:2], each = 15),
resp = rnorm(30, mean = 10, sd = 1),
slp = rnorm(30, mean = 45, sd = 10) )
dat$grad = with(dat, exp(.25*(group == "b") + rnorm(30, mean = 0, sd = 1 ) ) ) # Log scale
dat = dplyr::mutate_if(dat, is.numeric, round, 2)
```
When we have many similar models to fit, automating at least some portions of the task can be a real time saver. In [my last post](https://aosmith.rbind.io/2019/06/24/function-for-model-fitting/) I demonstrated how to make a function for model fitting. Once you have made such a function it's possible to loop through variable names and fit a model for each one.
In this post I am specifically focusing on having many response variables with the same explanatory variables, using `purrr::map()` and friends for the looping.
## Table of Contents
```{r toc, echo = FALSE, purl = FALSE}
input = knitr::current_input()
render_toc(input)
```
# R packages
I'll be using **purrr** for looping and will make residual plots with **ggplot2** and **patchwork**. I'll use **broom** to extract tidy results from models.
```{r, message = FALSE, warning = FALSE}
library(purrr) # v. 0.3.2
library(ggplot2) # v. 3.2.0
library(patchwork) # v. 0.0.1, github only
library(broom) # v. 0.5.2
```
# The dataset
I made a dataset with three response variables, `resp`, `slp`, and `grad`, along with a 2-level explanatory variable `group`.
```{r}
dat = structure(list(group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a", "b"), class = "factor"),
resp = c(10.48, 9.87, 11.1, 8.56, 11.15, 9.53, 8.99, 10.06,
11.02, 10.57, 11.85, 10.11, 9.25, 11.66, 10.72, 8.34, 10.58,
10.47, 9.46, 11.13, 8.35, 9.69, 9.82, 11.47, 9.13, 11.53,
11.05, 11.03, 10.84, 10.22), slp = c(38.27, 46.33, 44.29,
35.57, 34.78, 47.81, 50.45, 46.31, 47.82, 42.07, 31.75, 65.65,
47.42, 41.51, 38.69, 47.84, 46.22, 50.66, 50.69, 44.09, 47.3,
52.53, 53.63, 53.38, 27.34, 51.83, 56.63, 32.99, 77.5, 38.24
), grad = c(0.3, 0.66, 0.57, 0.23, 0.31, 0.48, 0.5, 0.49,
2.41, 0.6, 0.27, 0.89, 2.43, 1.02, 2.17, 1.38, 0.17, 0.47,
1.1, 3.28, 6.14, 3.8, 4.35, 0.85, 1.13, 1.11, 2.93, 1.13,
4.52, 0.13)), class = "data.frame", row.names = c(NA, -30L) )
head(dat)
```
# A function for model fitting
The analysis in the example I'm using today amounts to a two-sample t-test. I will fit this as a linear model with `lm()`.
Since the response variable needs to vary among models but the dataset and explanatory variable do not, my function will have a single argument for setting the response variable. Building the model formula in my function `ttest_fun()` relies on `paste()` and `as.formula()`.
```{r}
ttest_fun = function(response) {
form = paste(response, "~ group")
lm(as.formula(form), data = dat)
}
```
This function takes the response variable as a string and returns a model object.
```{r}
ttest_fun(response = "resp")
```
# Looping through the response variables
I'll make a vector of the response variable names as strings so I can loop through them and fit a model for each one. I pull my response variable names out of the dataset with `names()`. This step may take more work for you if you have many response variables that aren't neatly listed all in a row like mine are. `r emo::ji("stuck_out_tongue_winking_eye")`
```{r}
vars = names(dat)[2:4]
vars
```
I want to keep track of which variable goes with which model. This can be accomplished by naming the vector I'm going to loop through. I name the vector of strings with itself using `purrr::set_names()`.
```{r}
vars = set_names(vars)
vars
```
Now I'm ready to loop through the variables and fit a model for each one with `purrr::map()`. Since my function takes a single argument, the response variable, I can list the function by name within `map()` without using a formula (`~`) or an anonymous function.
```{r}
models = vars %>%
map(ttest_fun)
```
The output is a list containing three models, one for each response variable. Notice that the output list is a *named* list, where the names of each list element is the response variable used in that model. This is the reason I took the time to name the response variable vector.
```{r}
models
```
Note I could have done the `set_names()` step within the pipe chain rather than as a separate step.
```{r, eval = FALSE}
vars %>%
set_names() %>%
map(ttest_fun)
```
# Create residual plots for each model
I'm working with a simple model fitting function, where the output only contains the fitted model. To extract other output I can loop through the list of models in a separate step. An alternative is to create all the output within the modeling function and then pull whatever you want out of the list of results.
In this case, my next step is to loop through the models and make residual plots. I want to look at a residuals vs fitted values plot as well as a plot to look at residual normality (like a boxplot, a histogram, or a quantile-quantile normal plot). In more complicated models I might also make plots of residuals vs explanatory variables.
I'll make a function to build the two residuals plots. My function takes a model and the model name as arguments. I extract residuals and fitted values via `broom::augment()` and make the two plots with **ggplot2** functions. I combine the plots via **patchwork**. I add a title to the combined plot with the name of the response variable from each model to help me keep track of things.
```{r}
resid_plots = function(model, modelname) {
output = augment(model)
res.v.fit = ggplot(output, aes(x = .fitted, y = .resid) ) +
geom_point() +
theme_bw(base_size = 16)
res.box = ggplot(output, aes(x = "", y = .resid) ) +
geom_boxplot() +
theme_bw(base_size = 16) +
labs(x = NULL)
res.v.fit + res.box +
plot_annotation(title = paste("Residuals plots for", modelname) )
}
```
The output of this function is a combined plot of the residuals. Here is an example for one model (printed at 8" wide by 4" tall).
```{r, fig.height = 4, fig.width = 8}
resid_plots(model = models[[1]], modelname = names(models)[1])
```
I can use `purrr::imap()` to loop through all models and the model names simultaneously to make the plots with the title for each variable.
```{r}
residplots = imap(models, resid_plots)
```
## Examining the plots
In a situation where I have many response variables, I like to save my plots out into a PDF so I can easily page through them outside of R. You can see some approaches for saving plots [in a previous post](https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/#saving-the-plots).
Since I have only a few plots I can print them in R. The last plot, shown below, looks potentially problematic. I see the variance increasing with the mean and right skew in the residuals.
```{r, fig.height = 4, fig.width = 8}
residplots[[3]]
```
# Re-fitting a model
If you find a problematic model fit you'll need to spend some time working with that variable to find a more appropriate model.
Once you have a model you're happy with, you can manually add the new model to the list (if needed). In my example, let's say the `grad` model needed a log transformation.
```{r}
gradmod = ttest_fun("log(grad)")
```
If I'm happy with the fit of the new model I add it to the list with the other models to automate extracting results.
```{r}
models$log_grad = gradmod
```
I remove the original model by setting it to `NULL`. I don't want any results from that model and if I leave it in I know I'll ultimately get confused about which model is the final model. `r emo::ji("confused")`
```{r}
models$grad = NULL
```
Now the output list has three models, with the new `log_grad` model and the old `grad` model removed.
```{r}
models
```
I could have removed models from the list via subsetting by name. Here's an example, showing what the list looks like if I remove the `slp` model.
```{r}
models[!names(models) %in% "slp"]
```
# Getting model results
Once you are happy with model fit of all models it's time to extract any output of interest. For a t-test we would commonly want the estimated difference between the two groups, which is in the `summary()` output. I'll pull this information from the model as a data.frame with `broom::tidy()`. This returns the estimated coefficients, statistical tests, and (optionally) confidence intervals for coefficients
I switch to `map_dfr()` for looping to get the output combined into a single data.frame. I use the `.id` argument to add the response variable name to the output dataset.
Since some of the response variables are log transformed, it would make sense to back-transform coefficients in this step. I don't show this here, but would likely approach this using an `if()` statement based on log transformed variables containing `"log"` in their names.
```{r}
res_anova = map_dfr(models, tidy, conf.int = TRUE, .id = "variable")
res_anova
```
The primary interest in this output would be in the `groupb` row for each variable. Since the output is a data frame (thanks `broom::tidy()`!) you can use standard data manipulation tools to pull out only rows and columns of interest.
Other output, like, e.g., estimated marginal means for more complicated models, can be extracted and saved in a similar way.
# Alternative approach to fitting many models
When I am working with many response variables with widely varying ranges, it feels most natural to me to keep the different variables in different columns and loop through them as I have shown above. However, a reasonable alternative is to *reshape* your dataset so all the values of all variables are in a single column. A second, categorical column will contain the variable names so we know which variable each row is associated with. Such reshaping is an example of making a *wide* dataset into a *long* dataset.
Once your data are in a long format, you can use a list-columns approach for the analysis. You can see an example of this in [Chapter 25: Many models](https://r4ds.had.co.nz/many-models.html#introduction-17) of Grolemund and Wickham's [R for Data Science book](https://r4ds.had.co.nz/).
# Just the code, please
```{r getlabels, echo = FALSE, purl = FALSE}
labs = knitr::all_labels()
labs = labs[!labs %in% c("setup", "toc", "getlabels", "allcode", "makescript")]
```
```{r makescript, include = FALSE, purl = FALSE}
options(knitr.duplicate.label = "allow") # Needed to purl like this
input = knitr::current_input() # filename of input document
scriptpath = paste(tools::file_path_sans_ext(input), "R", sep = ".")
output = here::here("static", "script", scriptpath)
knitr::purl(input, output, documentation = 0, quiet = TRUE)
```
Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](`r paste0("/script/", scriptpath)`).
```{r allcode, ref.label = labs, eval = FALSE, purl = FALSE}
```