-
Notifications
You must be signed in to change notification settings - Fork 4
Expand file tree
/
Copy path2019-09-27-more-exploratory-plots.Rmd
More file actions
373 lines (290 loc) · 15.8 KB
/
2019-09-27-more-exploratory-plots.Rmd
File metadata and controls
373 lines (290 loc) · 15.8 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
---
title: 'More exploratory plots with ggplot2 and purrr: Adding conditional elements'
author: Ariel Muldoon
date: '2019-09-27'
slug: more-exploratory-plots
categories:
- r
tags:
- ggplot2
- purrr
- loops
description: Following up on a previous post, I show how I add conditional elements to automated ggplot2 plots through the use of if() statements within my plotting function.
draft: FALSE
---
```{r setup, include = FALSE, message = FALSE, purl = FALSE}
knitr::opts_chunk$set(comment = "#")
devtools::source_gist("2500a85297b742c6f2fb3a14549f5851",
filename = 'render_toc.R')
# Make datasets
set.seed(16)
dat = data.frame(cov_plant = round( rlnorm(n = 60, meanlog = 0.75, sdlog = 1.15), digits = 1),
cov_oth = round( runif(n = 60, min = 0.3, max = 100), digits = 1),
gap = round( rlnorm(n = 60, meanlog = 2, sdlog = 1.5), digits = 1),
year = rep( paste("Year", 1:3), each = 20),
trt = rep(letters[1:2], each = 10, times = 3) )
# Add in 0 values to cov_plant
dat$cov_plant = with(dat, ifelse(cov_plant <= 0.3, 0, cov_plant) )
# Metadata
resp_dat = data.frame(variable = c("cov_plant", "cov_oth", "gap"),
description = c("Plant cover", "Other cover", "Gap size"),
units = c("%", "%", "m"),
transformation = c("log", "identity", "log"),
constant = c(0.3, 0, 0) )
```
This summer I was asked to collaborate on an analysis project with many response variables. As usual, I planned on automating my initial graphical data exploration through the use of functions and `purrr::map()` as [I've written about previously](https://aosmith.rbind.io/2018/08/20/automating-exploratory-plots/).
However, this particular project was a follow-up to a previous analysis. In the original analysis, different variables were analyzed on different scales. I wanted to put the new plots on whatever scale they were analyzed in that analysis. If I was going to automate the plotting, which I definitely wanted to do with so many variables `r emo::ji("smile")`, I needed to add conditional elements.
This post demonstrates how I used `if()` statements within my plotting function to use different plotting elements depending on which variable I was plotting.
## Table of Contents
```{r toc, echo = FALSE, purl = FALSE}
input = knitr::current_input()
render_toc(input)
```
# R packages
I will use **ggplot2** for plotting and **purrr** for looping through variables.
```{r}
library(ggplot2) # v. 3.2.1
library(purrr) # v. 0.3.2
```
# The data
My simplified example dataset, `dat`, contains three response variables, `cov_plant`, `cov_oth`, and `gap`. I created two categorical explanatory variables, `year` with 3 levels and `trt` with two levels.
```{r}
dat = structure(list(cov_plant = c(3.7, 1.8, 7.5, 0.4, 7.9, 1.2, 0.7,
2.3, 6.9, 4.1, 17.7, 2.4, 0.9, 14.3, 4.9, 0, 4.1, 3.6, 1.1, 7.7,
0, 1.5, 1.7, 11.5, 0.8, 12.3, 7.1, 6.9, 5.6, 2.7, 1, 2.5, 2,
0.7, 0.7, 2.9, 4, 2.5, 2.9, 1.5, 0.5, 22.8, 2.8, 1.4, 1, 2.9,
2.4, 4.1, 4.1, 1.9, 2.8, 5, 5.7, 5.6, 0, 4.6, 8.1, 0.5, 88.9,
1), cov_oth = c(11.5, 63.2, 34, 65.5, 28.8, 8.6, 7.1, 65.5, 12.1,
3, 23.6, 3.8, 24.9, 55.9, 24.2, 78.2, 81.1, 10.7, 30.7, 23.5,
10.1, 4.6, 45.7, 37.6, 81.3, 39.1, 50.8, 75.8, 78.2, 23.9, 53,
51.1, 2.5, 40.2, 15.9, 91.3, 44, 72.9, 82.7, 42.4, 94.1, 23,
86.2, 50.1, 88.9, 80.5, 34.2, 68.7, 45, 13.9, 44.2, 85, 79.6,
1, 45.3, 69.5, 89.6, 22.2, 1.3, 88), gap = c(2.8, 11.8, 0.3,
17.2, 18.3, 1.4, 19.6, 19.4, 2.6, 66.3, 97.1, 17, 381.5, 15.7,
8.3, 2.4, 3.8, 3.8, 246.6, 43.2, 16.7, 6.6, 3.1, 2.4, 3.2, 4.3,
0.3, 2.1, 41.7, 68.9, 5.1, 5.7, 0.4, 35.5, 1.1, 10.8, 5, 11.8,
75.5, 5.4, 12.6, 5.2, 11.4, 6.8, 5.3, 1.1, 3.2, 2.9, 5.2, 0.2,
1.5, 0.6, 7.4, 18.6, 11.7, 1.6, 13.7, 7.1, 19.9, 16.8), year = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Year 1",
"Year 2", "Year 3"), class = "factor"), trt = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor")), row.names = c(NA, -60L), class = "data.frame")
```
Here are the first six rows of this dataset.
```{r}
head(dat)
```
I spent a fair amount of time sleuthing out which variables were used and which transformations were done in the original analysis. It turned out many (but not all) of the variables in that analysis were log transformed. Variables that contained 0 values were shifted by a fixed constant prior to the log transformation. A different constant was used for different variables.
I made a dataset of variable metadata to help me keep all this information organized. This dataset contains a row for each variable along with a description of what that variable was, the units the variable was measured in, the transformation used for analysis, and the constant used to shift the variable. I'll call this dataset `resp_dat`.
This metadata dataset was key in adding conditional elements to my plotting function.
```{r}
resp_dat = structure(list(variable = structure(c(2L, 1L, 3L), .Label = c("cov_oth",
"cov_plant", "gap"), class = "factor"), description = structure(3:1, .Label = c("Gap size",
"Other cover", "Plant cover"), class = "factor"), units = structure(c(1L,
1L, 2L), .Label = c("%", "m"), class = "factor"), transformation = structure(c(2L,
1L, 2L), .Label = c("identity", "log"), class = "factor"), constant = c(0.3,
0, 0)), class = "data.frame", row.names = c(NA, -3L))
resp_dat
```
# Initial plotting function
I set up my initial plotting function to make a scatterplot of the raw data per `year` for each `trt`. Different `trt` were indicated by shapes and colors, and I added group means as larger symbols connected by lines.
You can see that my plotting code ended up fairly complicated. I'm skipping the (many!) steps it took to get to this point. While I don't show the process here, you can rest assured that I did a **lot** of testing to work out the plot structure prior to making the plotting function below. `r emo::ji("wink")`
In addition to the plotting code, my `plot_fun()` function includes a line where I subset the `resp_dat` dataset to only the row of metadata for the response variable used in the plot. I use this information to add the constant to `y` and make a plot title with a description of the variable plus the units.
```{r}
plot_fun = function(data = dat, respdata = resp_dat, response) {
respvar = subset(respdata, variable == response)
ggplot(data = data, aes(x = year,
y = .data[[response]] + respvar$constant,
shape = trt,
color = trt,
group = trt) ) +
geom_point(position = position_dodge(width = 0.5),
alpha = 0.25,
size = 2, show.legend = FALSE) +
stat_summary(fun.y = mean, geom = "point",
position = position_dodge(width = 0.5),
size = 4, show.legend = FALSE) +
stat_summary(fun.y = mean, geom = "line",
position = position_dodge(width = 0.5),
size = 1, key_glyph = "rect") +
theme_bw(base_size = 14) +
theme(legend.position = "bottom",
legend.direction = "horizontal",
legend.box.spacing = unit(0, "cm"),
legend.text = element_text(margin = margin(l = -.2, unit = "cm") ),
panel.grid.minor.y = element_blank() ) +
scale_color_grey(name = "",
label = c("A Treatment", "B Treatment"),
start = 0, end = 0.5) +
labs(x = "Year since treatment",
title = paste0(respvar$descrip, " (", respvar$units, ")"),
y = NULL)
}
```
Here is what the plot looks like for `cov_plant`.
```{r}
plot_fun(response = "cov_plant")
```
# Adding a conditional axis scale
I wanted to put variables that were log transformed in the original analysis on the log scale. Since not all variables were transformed, I wanted to use `scale_y_log10()` for log transformed variables and the standard scale for everything else.
To achieve this, I will assign my base plot a name within the function so I can add on to it conditionally. I name it `g1`.
I use the `transformation` column in the variable metadata to check if a log transformation was done or not via `grepl()`. If it was done, I add `scale_y_log10()` to the existing plot. Otherwise I return the plot on the original scale.
This is what that code looks like that I'll add to the end of my function.
```{r, eval = FALSE}
if( grepl("log", respvar$transformation) ) {
g1 + scale_y_log10()
} else {
g1
}
```
I used the metadata I created to use in the `if()` statement, but you could do something similar if you had variables with the transformation as part of the variable name.
Here is what the plotting function looks like now.
```{r}
plot_fun2 = function(data = dat, respdata = resp_dat, response) {
respvar = subset(respdata, variable == response)
g1 = ggplot(data = data, aes(x = year,
y = .data[[response]] + respvar$constant,
shape = trt,
color = trt,
group = trt) ) +
geom_point(position = position_dodge(width = 0.5),
alpha = 0.25,
size = 2, show.legend = FALSE) +
stat_summary(fun.y = mean, geom = "point",
position = position_dodge(width = 0.5),
size = 4, show.legend = FALSE) +
stat_summary(fun.y = mean, geom = "line",
position = position_dodge(width = 0.5),
size = 1, key_glyph = "rect") +
theme_bw(base_size = 14) +
theme(legend.position = "bottom",
legend.direction = "horizontal",
legend.box.spacing = unit(0, "cm"),
legend.text = element_text(margin = margin(l = -.2, unit = "cm") ),
panel.grid.minor.y = element_blank() ) +
scale_color_grey(name = "",
label = c("A Treatment", "B Treatment"),
start = 0, end = 0.5) +
labs(x = "Year since treatment",
title = paste0(respvar$descrip, " (", respvar$units, ")"),
y = NULL)
if( grepl("log", respvar$transformation) ) {
g1 + scale_y_log10()
} else {
g1
}
}
```
Now when I use the function on a log transformed variable like `cov_plant`, the y axis is on the log scale.
```{r}
plot_fun2(response = "cov_plant")
```
But the y axis for `cov_oth`, which was analyzed on the original scale, is not.
```{r}
plot_fun2(response = "cov_oth")
```
# Adding a conditional caption
After I changed the y axis scale for some variables, I decided I should make sure that the scale of the axis is clear to the reader. In addition, I wanted to highlight the fact that some variables were shifted prior to transformation. I decided to create a conditional caption with this information, which can then be then added to the plot in `labs()`.
Since I ended up with three conditions, log transformation, log transformation with added constant, or no transformation, I ended up using `if()`-`else if()`-`else` to do this.
```{r, eval = FALSE}
caption_text = {
if(respvar$constant != 0 ) {
paste0("Y axis on log scale ",
"(added constant ",
respvar$constant, ")")
} else if(!grepl("log", respvar$transformation) ) {
"Y axis on original scale"
} else {
"Y axis on log scale"
}
}
```
The function is getting pretty long and complicated now.
```{r}
plot_fun3 = function(data = dat, respdata = resp_dat, response) {
respvar = subset(respdata, variable == response)
caption_text = {
if(respvar$constant != 0 ) {
paste0("Y axis on log scale ",
"(added constant ",
respvar$constant, ")")
} else if(!grepl("log", respvar$transformation) ) {
"Y axis on original scale"
} else {
"Y axis on log scale"
}
}
g1 = ggplot(data = data, aes(x = year,
y = .data[[response]] + respvar$constant,
shape = trt,
color = trt,
group = trt) ) +
geom_point(position = position_dodge(width = 0.5),
alpha = 0.25,
size = 2, show.legend = FALSE) +
stat_summary(fun.y = mean, geom = "point",
position = position_dodge(width = 0.5),
size = 4, show.legend = FALSE) +
stat_summary(fun.y = mean, geom = "line",
position = position_dodge(width = 0.5),
size = 1, key_glyph = "rect") +
theme_bw(base_size = 14) +
theme(legend.position = "bottom",
legend.direction = "horizontal",
legend.box.spacing = unit(0, "cm"),
legend.text = element_text(margin = margin(l = -.2, unit = "cm") ),
panel.grid.minor.y = element_blank() ) +
scale_color_grey(name = "",
label = c("A Treatment", "B Treatment"),
start = 0, end = 0.5) +
labs(x = "Year since treatment",
title = paste0(respvar$descrip, " (", respvar$units, ")"),
y = NULL,
caption = caption_text)
if( grepl("log", respvar$transformation) ) {
g1 +
scale_y_log10()
} else {
g1
}
}
```
But it does what I want. The plots now have captions with information added at the bottom in addition to the conditional y axis scale.
```{r}
plot_fun3(response = "cov_plant")
```
# Looping through the variables
Once I have worked out the details of the function I can loop through all the variables and make plots with `purrr::map()`. I've set this up to loop through the vector of variable names, stored in `vars` as strings.
```{r}
vars = names(dat)[1:3]
vars
```
```{r}
all_plots = map(vars, ~plot_fun3(response = .x) )
```
Here are the plots, with captions showing that two plots are on the log scale, one is on the original scale, and one has an added constant.
I'm showing the plots all together here, but I actually saved them in a PDF with one plot per page so collaborators could easily page through them.
```{r, fig.height = 10}
cowplot::plot_grid(plotlist = all_plots)
```
# Just the code, please
```{r getlabels, echo = FALSE, purl = FALSE}
labs = knitr::all_labels()
labs = labs[!labs %in% c("setup", "toc", "getlabels", "allcode", "makescript")]
```
```{r makescript, include = FALSE, purl = FALSE}
options(knitr.duplicate.label = "allow") # Needed to purl like this
input = knitr::current_input() # filename of input document
scriptpath = paste(tools::file_path_sans_ext(input), "R", sep = ".")
output = here::here("static", "script", scriptpath)
knitr::purl(input, output, documentation = 0, quiet = TRUE)
```
Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](`r paste0("/script/", scriptpath)`).
```{r allcode, ref.label = labs, eval = FALSE, purl = FALSE}
```