Skip to content
Permalink
master
Switch branches/tags
Go to file
 
 
Cannot retrieve contributors at this time
---
title: Automating exploratory plots with ggplot2 and purrr
author: Ariel Muldoon
date: '2018-08-20'
slug: automating-exploratory-plots
categories:
- r
tags:
- ggplot2
- purrr
- loops
draft: FALSE
description: "In this post I show an example of how to automate the process of making many exploratory plots in ggplot2 with multiple continuous response and explanatory variables. To loop through both x and y variables involves nested looping. In the latter section of the post I go over options for saving the resulting plots, either together in a single document, separately, or by creating combined plots prior to saving."
---
```{r setup, echo = FALSE, message = FALSE, purl = FALSE}
knitr::opts_chunk$set(comment = "#")
devtools::source_gist("2500a85297b742c6f2fb3a14549f5851",
filename = 'render_toc.R')
```
*This post was last updated on `r format(Sys.Date(), "%Y-%m-%d")`.*
When you have a lot of variables and need to make a lot exploratory plots it's usually worthwhile to automate the process in R instead of manually copying and pasting code for every plot. However, the coding approach needed to automate plots can look pretty daunting to a beginner R user. It can look so daunting, in fact, that it can appear easier to manually make the plots (like in Excel) rather than using R at all.
Unfortunately making plots manually can backfire. The efficiency of using a software program you already know is quickly out-weighed by being unable to easily reproduce the plots when needed. I know I invariably have to re-make even exploratory plots, and it'd be a bummer if I had to remake them all manually rather than re-running some code.
So while I often assure students working under time constraints that it is perfectly OK to use software they already know rather than spending the time to learn how to do something in R, making many plots is a special case. To get them started I will provide students who need to automate plotting in R some example code (with explanation).
This post is based on an example that involves plotting bivariate relationships between many continuous variables.
## Table of Contents
```{r toc, echo = FALSE, purl = FALSE}
render_toc("2018-08-16-automating-exploratory-plots.Rmd")
```
# Load R packages
I'll be plotting with **ggplot2** and looping with **purrr**. I'll also be using package **cowplot** later to combine individual plots into one, but will use the package functions via `cowplot::` instead of loading the package.
```{r}
library(ggplot2) # v. 3.3.3
library(purrr) # v. 0.3.4
```
# The set-up
Today I'm going to make an example dataset with 3 response (`y`) variables and 4 explanatory (`x`) variables for plotting. (The real dataset had 9 response and 9 explanatory variables.)
```{r}
set.seed(16)
dat = data.frame(elev = round( runif(20, 100, 500), 1),
resp = round( runif(20, 0, 10), 1),
grad = round( runif(20, 0, 1), 2),
slp = round( runif(20, 0, 35),1),
lat = runif(20, 44.5, 45),
long = runif(20, 122.5, 123.1),
nt = rpois(20, lambda = 25) )
head(dat)
```
The goal is to make scatterplots for every response variable vs every explanatory variable. I've deemed the first three variables in the dataset to be the response variables (`elev`, `resp`, `grad`).
The plan is to loop through the variables and make the desired plots. I'm going to use vectors of the variable names for this, one vector for the response variables and one for the explanatory variables.
If all of your response or explanatory variables share some unique pattern in the variable names there are some clever ways to pull out the names with some of the select helper functions in `dplyr::select()`. Alas, my variable names are all unique. My options are to either write the vectors out manually or pull the names out by index. I'll do the latter since the different types of variables are grouped together.
```{r}
response = names(dat)[1:3]
expl = names(dat)[4:7]
```
When I know I'm going to be looping through character vectors I like to use *named* vectors. This helps me keep track of things in the output.
The `set_names()` function in **purrr** is super handy for naming character vectors, since it can use the values of the vector as names (i.e., the vector will be named by itself). (I don't recommend trying this with lists of data.frames like I have in the past, though, since it turns out that naming a data.frame with a data.frame isn't so useful. `r emo::ji("laughing")`)
```{r}
response = set_names(response)
response
```
```{r}
expl = set_names(expl)
expl
```
# Create a plotting function
Since I'm going to make a bunch of plots that will all have the same basic form, I will make a plotting function. I am going to make a function where only the `x` and `y` variables can vary (so are arguments to the function).
Since I'm making a function to plot variables from a single dataset I'm going to hard-code the dataset into the function. If you have multiple datasets or you are making a function for use across projects you'll probably want to add the dataset as a function argument.
My functions inputs are based on the variable names, so I need to pass strings into the **ggplot2** functions. Strings cannot be used directly in `aes()`, but can be used with the [`.data` pronoun](https://rlang.r-lib.org/reference/tidyeval-data.html).
I'm making pretty basic graphs since these are exploratory plots, not publication-ready plots. I will make a scatterplot and add locally weighted regression (loess) lines via `geom_smooth()`. I use such lines with great caution, as it can be easy to get too attached any pattern the loess line shows.
```{r}
scatter_fun = function(x, y) {
ggplot(dat, aes(x = .data[[x]], y = .data[[y]]) ) +
geom_point() +
geom_smooth(method = "loess", se = FALSE, color = "grey74") +
theme_bw()
}
```
If using older versions of **ggplot2** (and/or **rlang**), use the now deprecated `aes_string()` for working with strings.
```{r, eval = FALSE}
scatter_fun = function(x, y) {
ggplot(dat, aes_string(x = x, y = y ) ) +
geom_point() +
geom_smooth(method = "loess", se = FALSE, color = "grey74") +
theme_bw()
}
```
Here's an example of the function output, passing in `x` and `y` as strings.
```{r}
scatter_fun(x = "lat", y = "elev")
```
# Looping through one vector of variables
One way to make all the plots I want is to loop through each explanatory variable for a fixed response variable. With this approach I would need a separate loop for each response variable.
I will use `map()` from package **purrr** for the looping.
I pass each explanatory variable to the first argument in `scatter_fun()` and I fix the second argument to `"elev"`. I use the formula coding in `map()` and so refer to the element of the explanatory vector via `.x` within `scatter_fun()`.
```{r}
elev_plots = map(expl, ~scatter_fun(.x, "elev") )
```
The output is a list of 4 plots (since there are 4 explanatory variables). You'll notice that each element of the list has the variable name associated with it. This is why I used `set_names()` earlier, since this is convenient for printing the plots and, you'll see later, is convenient when saving the plots in files with understandable names.
```{r, message = FALSE}
elev_plots
```
# Looping through both vectors
For only a few response variables we could easily copy and paste the code above, changing the hard-coded response variable each time. This process can get burdensome if there are a lot of response variables, though.
## Using a nested loop
Another option is to loop through both vectors of variables and make all the plots at once. Because we want a plot for each combination of variables, this is a job for a *nested* loop. This means one `map()` loop will be nested inside another. I will refer to the first `map()` loop as the *outer* loop and the second one as the *inner* loop.
I'm going to run through the response variables in the outer loop and the explanatory variables in the inner loop. That way I can graph all of the explanatory variables for each response variable before moving on to the next response variable. This puts the output, a nested list, in a logical order.
A nested loop involves more complicated code. It took some effort for me to wrap my head around how to refer to the list element from the outer loop within the inner loop when using the `map()` formula coding with the tilde (`~`). I found the answers/comments to [this question](https://stackoverflow.com/questions/48847613/purrr-map-equivalent-of-nested-for-loop) on Stack Overflow to be helpful.
Since my scatterplot function is so simple I ended up using formula coding for the outer loop and the function as-is in the inner loop. The inner list elements are fed to the first argument of `scatter_fun()` by default, which works out great since the first argument is the `x` variable and the inner loop loops through the explanatory variables. The `.x` then refers to the outer list elements (the response variable names), and is passed to the `y` argument of the function in the inner loop.
```{r}
all_plots = map(response,
~map(expl, scatter_fun, y = .x) )
```
To be honest, I think it is fairly hard to follow what the above code is doing. And I've found it difficult to use such an approach if adding additional nested loops. At the end of the day, using anonymous functions involves more code but the results may be easier to understand. Well, as easy to follow as nested loops ever are. `r emo::ji("stuck_out_tongue_winking_eye")` Here's the same plots using anonymous functions in each `map()` loop.
```{r, eval = FALSE}
all_plots2 = map(response, function(resp) {
map(expl, function(expl) {
scatter_fun(x = expl, y = resp)
})
})
```
The output is a list of lists. Each sublist contains all the plots for a single response variable. Because I set the names for both vectors of variable names, the inner and outer lists both have names. These names can be used to pull out individual plots.
For example, if I want to see all the plots for the `grad` response variable I can print that sublist by name. (I'm going to display only two of four `grad` plots here to save space.)
```{r}
all_plots$grad[1:2]
```
If I want to print a single plot, I can first extract one of the sublists using an outer list name and then extract the individual plot via an inner list name.
```{r}
all_plots$grad$long
```
I find the names convenient, but you can also extract plots via position. Here's the same graph, the third element of the third list.
```{r}
all_plots[[3]][[3]]
```
## Combining variables for a single loop
Since nested loops can be complicated, another option is to create all combinations of the two input vectors and then loop through those in a single loop. This can be done, for example, using `tidyr::expand_grid()`. The main thing I like about `tidyr::expand_grid()` over `expand.grid()` from base R is the order of the output. Both put the results into a dataset (data.frame or tibble).
Here's how that would look, putting the response variables first to match the output order above. I'm using **tidyr** version 1.1.2 here.
```{r}
resp_expl = tidyr::expand_grid(response, expl)
resp_expl
```
I can now use `pmap()` from **purrr** to loop through the rows of this tibble. I have to be careful with my argument order, though, since I have the y variable first in the tibble but it is the second argument in my function.
```{r}
allplots2 = pmap(resp_expl, ~scatter_fun(x = .y, y = .x) )
```
I'm not going to print them here, but one minor downside of this approach is that only the first variable name is carried along to the output list. If I wanted to save these plots separately and use both the response and explanatory variable in the file name I'd need to create the file names based on `resp_expl`. I won't show this in the next section on saving plots, so here is a quick example of creating basic file names based on both variables.
```{r}
allplots2_names = pmap(resp_expl, ~paste0(.x, "_", .y, ".png"))
allplots2_names[1:2]
```
# Saving the plots
Once all the graphs are made we can look at them in R by printing the list or parts of the list as above. But if you want to peruse them at your leisure later or send them to a collaborator you'll want to save them outside of R.
This next section is dedicated to exploring some of the ways you can do this.
## Saving all plots to one PDF
If you want to save every plot as a separate page in a PDF, you can do so with the `pdf()` function. The code below shows an example of how this works. First, a graphics device to save the plots into is created and given a name via `pdf()`. Then all the plots are put into that device. Finally, the device is turned off with `dev.off()`. The last step is important, as you can't open the file until the device is turned off.
This is a pretty coarse way to save everything, but it allows you to easily page through all the plots. I've used this method when I had many exploratory plots for a single response variable that I wanted to share with collaborators.
In this example code I save the file, which I name `all_scatterplots.pdf`, into the working directory.
```{r, eval = FALSE}
pdf("all_scatterplots.pdf")
all_plots
dev.off()
```
## Saving groups of plots together
Another option is to save each group of plots in a separate document. This might make sense in a case like this where there are a set of plots for each response variable and we might want a separate file for each set.
To save each sublist separately we'll need to loop through `all_plots` and save the plots for each response variable into a separate file. The list names can be used in the file names to keep the output organized.
The functions in **purrr** that start with `i` are special functions that loop through a list and the names of that list simultaneously. This is useful here where we want to use the list names to identify the output files while we save them.
The `walk()` function is part of the `map` family, to be used when you want a function for its side effect instead of for a return value. Saving plots is a classic example of when we want `walk()` instead of `map()`.
Combining the `i` and the `walk` gives us the `iwalk()` function. In the formula interface, `.x` refers to the list elements and `.y` refers to the names of the list. You can see I create the plot file names using the list name combined with "scatterplots.pdf", using `_` as the separator.
The code below makes three files, one for each response variable, with four plots each. The files are named "elev_scatterplots.pdf", "resp_scatterplots.pdf", and "grad_scatterplots.pdf".
```{r, eval = FALSE}
iwalk(all_plots, ~{
pdf(paste0(.y, "_scatterplots.pdf") )
print(.x)
dev.off()
})
```
## Saving all plots separately
All plots can be saved separately instead of combined in a single document. This might be necessary if you want to insert the plots into some larger document later.
We'll want to use the names of both the outer and inner lists to appropriately identify each plot we save. I decided to do this by looping through the `all_plots` list and the names of the list via `imap()` to make the file names in a separate step. This time I'm going to save these as PNG files so use `.png` at the end of the file name.
The result is a list of lists, so I flatten this into a single list via `flatten()`. If I were to use `flatten()` earlier in the process I'd lose the names of the outer list. This process of combining names prior to flattening should be simplified once [the proposed `flatten_names()` function](https://github.com/tidyverse/purrr/issues/525) is added to **purrr**.
```{r}
plotnames = imap(all_plots, ~paste0(.y, "_", names(.x), ".png")) %>%
flatten()
plotnames
```
Once the file names are created I can loop through all the file names and plots simultaneously with `walk2()` and save things via `ggsave()`. The height and width of each output file can be set as needed in `ggsave()`.
You can see I flattened the nested list of plots into a single list to use in `walk2()`.
```{r, eval = FALSE}
walk2(plotnames, flatten(all_plots), ~ggsave(filename = .x, plot = .y,
height = 7, width = 7))
```
## Combining plots
Another way to get a set of plots together is to combine them into one plot. How useful this is will depend on how many plots you have per set. This option is a lot like faceting, except we didn't reshape our dataset to allow the use faceting.
I like the package **cowplot** function `plot_grid()` for quickly combining multiple plots into one. A list of plots can be passed via the `plotlist` argument.
Here's what that looks like for the first response variable, `elev`.
```{r, message = FALSE}
cowplot::plot_grid(plotlist = all_plots[[1]])
```
We can use a loop to combine the plots for each response variable sublist. The result could then be saved using any of the approaches shown above. If you have many subplots per combined plot you likely will want to save the plots at a larger size so the individual plots can be clearly seen.
```{r, message = FALSE}
response_plots = map(all_plots, ~cowplot::plot_grid(plotlist = .x))
response_plots
```
# Just the code, please
```{r getlabels, echo = FALSE, purl = FALSE}
labs = knitr::all_labels()
labs = labs[!labs %in% c("setup", "toc", "getlabels", "allcode", "makescript")]
```
```{r makescript, include = FALSE, purl = FALSE}
options(knitr.duplicate.label = "allow") # Needed to purl like this
input = knitr::current_input() # filename of input document
output = here::here("static", "script", paste(tools::file_path_sans_ext(input), "R", sep = ".") )
knitr::purl(input, output, documentation = 0, quiet = TRUE)
```
Here's the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](/script/2018-08-16-automating-exploratory-plots.R).
```{r allcode, ref.label = labs, eval = FALSE, purl = FALSE}
```