# <div style="text-align: right"> Chapter __17__</div>

# __Iteration with purrr__

In [1]:
# config
repr_html.tbl_df <- function(obj, ..., rows = 6) repr:::repr_html.data.frame(obj, ..., rows = rows)
options(dplyr.summarise.inform = FALSE)

Functions are one tool for reducing duplication: they identify repeated patterns of code and extract them
into independent pieces that can be easily reused and updated.
Another tool for reducing duplication is iteration, which helps you
when you need to do the same thing to multiple inputs: repeating
the same operation on different columns, or on different datasets. In
this chapter you’ll learn about two important iteration paradigms:
imperative programming and functional programming. On the
imperative side you have tools like for loops and while loops, which
are a great place to start because they make iteration very explicit, so
it’s obvious what’s happening. However, for loops are quite verbose,
and require quite a bit of bookkeeping code that is duplicated for
every for loop. Functional programming (FP) offers tools to extract
out this duplicated code, so each common for loop pattern gets its
own function. Once you master the vocabulary of FP, you can solve
many common iteration problems with less code, more ease, and
fewer errors.

In [2]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.1
✔ tidyr   1.1.1     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



## __For loops__

Imagine we have this simple tibble:

In [4]:
df <- tibble(
    a = rnorm(10),
    b = rnorm(10),
    c = rnorm(10),
    d = rnorm(10)
)

We want to compute the median of each column. You could do it
with copy-and-paste:

In [9]:
median(df$a)

In [10]:
median(df$b)

In [11]:
median(df$c)

In [12]:
median(df$d)

But that breaks our rule of thumb: never copy and paste more than
twice. Instead, we could use a for loop:

In [13]:
output <- vector('double', ncol(df))
for (i in seq_along(df)) {
    output[[i]] <- median(df[[i]])
}
output

Exercise

Write for-loops to:

    Compute the mean of every column in mtcars.
    Determine the type of each column in nycflights13::flights.
    Compute the number of unique values in each column of iris.
    Generate 10 random normals for each of μ = -10, 0, 10, and 100.

In [18]:
# compute the mean of every column in mtcars
output <- vector('double', ncol(mtcars))
names(output) <- names(mtcars)
for (i in names(mtcars)) {
    output[i] <- mean(mtcars[[i]])
}

output

In [19]:
# determine the type of each column in
# nycflights13::flights
output <- vector('list', ncol(nycflights13::flights))
names(output) <- names(nycflights13::flights)
for (i in names(nycflights13::flights)) {
    output[[i]] <- class(nycflights13::flights[[i]])
}
output

In [22]:
# compute the number of unique values in
# each column in the iris dataset
data('iris')
iris_uniq <- vector('double', ncol(iris))
names(iris_uniq) <- names(iris)
for (i in names(iris)) {
    iris_uniq[i] <- n_distinct(iris[[i]])
}

iris_uniq

In [23]:
# generate 10 random normals for each mu
n <- 10
mu <- c(-10, 0, 10, 100)
normals <- vector('list', length(mu))
for (i in seq_along(normals)) {
    normals[[i]] <- rnorm(n, mean = mu[i])
}
normals

Exercise

Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

In [29]:
out <- ''
for (x in letters) {
    out <- stringr::str_c(out, x)
}
out

In [30]:
str_c(letters, collapse = '')

In [32]:
x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
    sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))
sd

In [33]:
sd(x)

In [34]:
x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
    out[i] <- out[i - 1] + x[i]
}
out

In [35]:
cumsum(x)

In [37]:
all.equal(cumsum(x), out)

It’s common to see for loops that don’t preallocate the output
and instead increase the length of a vector at each step:

In [38]:
output <- vector('integer', 0)
for (i in seq_along(x)) {
    output <- c(output, lengths(x[[i]]))
}
output
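One way to see what growing a vector costs is to time it against a preallocated version. The following is only a rough sketch: the helper names `grow()` and `prealloc()` and the size `n` are made up for illustration, and the timings will differ from machine to machine.

```r
# Grow the output with c() on every iteration.
grow <- function(n) {
  output <- integer(0)
  for (i in seq_len(n)) {
    output <- c(output, i)
  }
  output
}

# Allocate the full output once, then fill it by position.
prealloc <- function(n) {
  output <- vector("integer", n)
  for (i in seq_len(n)) {
    output[[i]] <- i
  }
  output
}

n <- 1e4

# Both approaches produce the same vector...
identical(grow(n), prealloc(n))

# ...but they do a different amount of copying along the way.
system.time(grow(n))
system.time(prealloc(n))
```

Increasing `n` and rerunning the two `system.time()` calls is a simple way to watch how each approach scales.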

## __For Loop Variations__

Once you have the basic for loop under your belt, there are some
variations that you should be aware of. These variations are important
regardless of how you do iteration, so don’t forget about them
once you’ve mastered the FP techniques you’ll learn about in the
next section.

There are four variations on the basic theme of the for loop:

* Modifying an existing object, instead of creating a new object.
* Looping over names or values, instead of indices.
* Handling outputs of unknown length.
* Handling sequences of unknown length.

### Modifying an Existing Object

Sometimes you want to use a for loop to modify an existing object.
For example, remember our challenge from Chapter 15. We wanted
to rescale every column in a data frame:

In [41]:
df <- tibble(
    a = rnorm(10),
    b = rnorm(10),
    c = rnorm(10),
    d = rnorm(10)
)

In [42]:
rescale_01 <- function(x) {
    rng <- range(x, na.rm = TRUE)
    (x - rng[1]) / (rng[2] - rng[1])
}

In [43]:
df$a <- rescale_01(df$a)
df$b <- rescale_01(df$b)
df$c <- rescale_01(df$c)
df$d <- rescale_01(df$d)

To solve this with a for loop we again think about the three components:

_Output_: We already have the output—it’s the same as the input!

_Sequence_: We can think about a data frame as a list of columns, so we can
iterate over each column with `seq_along(df)`.

_Body_: Apply `rescale_01()`.

In [46]:
for (i in seq_along(df)) {
    df[[i]] <- rescale_01(df[[i]])
}

Typically you’ll be modifying a list or data frame with this sort of
loop, so remember to use `[[`, not `[`. You might have spotted that I
used `[[` in all my for loops: I think it’s better to use `[[` even for
atomic vectors because it makes it clear that I want to work with a
single element.

### Looping Patterns

There are three basic ways to loop over a vector. So far I’ve shown
you the most general: looping over the numeric indices with
`for (i in seq_along(xs))`, and extracting the value with `xs[[i]]`. There
are two other forms:

* Loop over the elements: `for (x in xs)`. This is most useful if
you only care about side effects, like plotting or saving a file,
because it’s difficult to save the output efficiently.

* Loop over the names: `for (nm in names(xs))`. This gives you a
name, which you can use to access the value with `xs[[nm]]`. This
is useful if you want to use the name in a plot title or a filename.

If you’re creating named output, make sure to name the results
vector like so:

In [47]:
results <- vector('list', length(x))
names(results) <- names(x)

Iteration over the numeric indices is the most general form, because
given the position you can extract both the name and the value:

In [48]:
for (i in seq_along(x)) {
    name <- names(x)[[i]]
    value <- x[[i]]
}
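To make the three looping forms concrete, here is a small sketch that runs each of them over the same named vector. The vector `xs` and the output name `doubled` are invented for the example.

```r
xs <- c(a = 1.5, b = 2.5, c = 3.5)

# 1. Over elements: handy for side effects such as printing.
for (x in xs) {
  print(x)
}

# 2. Over names: the name is available for labels or filenames.
for (nm in names(xs)) {
  cat(nm, "=", xs[[nm]], "\n")
}

# 3. Over indices: both name and value are reachable, and the
#    preallocated, named output is filled by position.
doubled <- vector("double", length(xs))
names(doubled) <- names(xs)
for (i in seq_along(xs)) {
  doubled[[i]] <- xs[[i]] * 2
}
doubled
```

Only the third form saves its results, which is why it is the one to reach for when you need output rather than side effects.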

### Unknown Output Length

Sometimes you might not know how long the output will be. For
example, imagine you want to simulate some random vectors of
random lengths. You might be tempted to solve this problem by
progressively growing the vector:

In [49]:
means <- c(0, 1, 2)

output <- double()
for (i in seq_along(means)) {
    n <- sample(100, 1)
    output <- c(output, rnorm(n, means[[i]]))
}

str(output)

 num [1:234] -1.734 2.083 0.654 -0.538 1.081 ...


But this is not very efficient because in each iteration, R has to copy
all the data from the previous iterations. In technical terms you get
“quadratic” ($O(n^2)$) behavior, which means that a loop with three
times as many elements would take nine ($3^2$) times as long to run.

A better solution is to save the results in a list, and then combine
into a single vector after the loop is done:

In [50]:
out <- vector('list', length(means))
for (i in seq_along(means)) {
    n <- sample(100, 1)
    out[[i]] <- rnorm(n, means[[i]])
}
str(out)

List of 3
 $ : num [1:92] -0.47 0.439 0.138 0.755 -1.365 ...
 $ : num [1:47] 1.169 1.713 0.754 -1.016 2.13 ...
 $ : num [1:74] 4.226 1.462 1.92 0.801 1.213 ...


In [51]:
str(unlist(out))

 num [1:213] -0.47 0.439 0.138 0.755 -1.365 ...


Here I’ve used `unlist()` to flatten a list of vectors into a single
vector. A stricter option is to use `purrr::flatten_dbl()`: it will throw
an error if the input isn’t a list of doubles.

This pattern occurs in other places too:

* You might be generating a long string. Instead of `paste()`ing
together each iteration with the previous, save the output in a
character vector and then combine that vector into a single
string with `paste(output, collapse = "")`.
* You might be generating a big data frame. Instead of sequentially
`rbind()`ing in each iteration, save the output in a list, then
use `dplyr::bind_rows(output)` to combine the output into a
single data frame.

Watch out for this pattern. Whenever you see it, switch to a more
complex result object, and then combine in one step at the end.
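Both cases follow the same "collect, then combine once" shape. Here is a minimal sketch of each, assuming dplyr is installed; the names `pieces`, `rows`, `combined`, and `result` are placeholders made up for the example.

```r
library(dplyr)

# Strings: store each piece in a preallocated character vector,
# then collapse once after the loop.
pieces <- vector("character", 3)
for (i in seq_along(pieces)) {
  pieces[[i]] <- paste0("step", i)
}
combined <- paste(pieces, collapse = "")
combined

# Data frames: store each small tibble in a list,
# then bind all rows in a single step after the loop.
rows <- vector("list", 3)
for (i in seq_along(rows)) {
  rows[[i]] <- tibble(id = i, value = i^2)
}
result <- bind_rows(rows)
result
```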

### Unknown Sequence Length

Sometimes you don’t even know how long the input sequence
should be. This is common when doing simulations. For example,
you might want to loop until you get three heads in a row. You can’t
do that sort of iteration with the `for` loop. Instead, you can use a
`while` loop. A `while` loop is simpler than a for loop because it only
has two components, a condition and a body:

```r
while (condition) {
    # body...
}
```

A while loop is also more general than a for loop, because you can
rewrite any for loop as a while loop, but you can’t rewrite every
while loop as a for loop:

In [52]:
for (i in seq_along(x)) {
    # body
}

In [53]:
# equivalent to
i <- 1
while (i <= length(x)) {
    # body
    i <- i + 1
}
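As a concrete check of that equivalence, here is a small sketch that computes the same sum both ways. The vector `x` and the accumulator names are made up for illustration. Note that the while version has to advance its counter itself; without `i <- i + 1` the loop would never end.

```r
x <- c(10, 20, 30)

# Sum with a for loop: the sequence takes care of the counter.
total_for <- 0
for (i in seq_along(x)) {
  total_for <- total_for + x[[i]]
}

# The same sum with a while loop: the counter is advanced by hand.
total_while <- 0
i <- 1
while (i <= length(x)) {
  total_while <- total_while + x[[i]]
  i <- i + 1
}

identical(total_for, total_while)
```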

Here’s how we could use a while loop to find how many tries it takes
to get three heads in a row:

In [57]:
flip <- function() sample(c('T', 'H'), 1)

flips <- 0
n_heads <- 0

while (n_heads < 3) {
    if (flip() == 'H') {
        n_heads <- n_heads + 1
    } else {
        n_heads <- 0
    }
    flips <- flips + 1
}

(flips)

I mention while loops only briefly, because I hardly ever use them.
They’re most often used for simulation, which is outside the scope
of this book. However, it is good to know they exist so that you’re
prepared for problems where the number of iterations is not known
in advance.

Exercise

What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique?

Let’s try it out and see what happens. When there are no names for the vector, it does not run the code in the loop. In other words, it runs zero iterations of the loop.

In [59]:
x <- c(11, 12, 13)
print(names(x))

NULL


In [60]:
for (nm in names(x)) {
    print(nm)
    print(x[[nm]])
}

Note that the length of NULL is zero:

In [61]:
length(NULL)

If only some elements are named, then we get an error when trying to access an element without a name.

In [62]:
x <- c(a = 11, 12, c = 13)
names(x)

In [63]:
for (nm in names(x)) {
  print(nm)
  print(x[[nm]])
}

[1] "a"
[1] 11
[1] ""


ERROR: Error in x[[nm]]: subscript out of bounds


Finally, if the vector contains duplicate names, then `x[[nm]]` returns the first element with that name.

In [64]:
x <- c(a = 11, a = 12, c = 13)
names(x)

In [65]:
for (nm in names(x)) {
  print(nm)
  print(x[[nm]])
}

[1] "a"
[1] 11
[1] "a"
[1] 11
[1] "c"
[1] 13


Exercise

Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, `show_mean(iris)` would print:

```r
show_mean(iris)
# > Sepal.Length: 5.84
# > Sepal.Width:  3.06
# > Petal.Length: 3.76
# > Petal.Width:  1.20
```

In [74]:
show_mean <- function(df, digits = 2) {
  # Get max length of all variable names in the dataset
  maxstr <- max(str_length(names(df)))
  for (nm in names(df)) {
    if (is.numeric(df[[nm]])) {
      cat(
        str_c(str_pad(str_c(nm, ":"), maxstr + 1L, side = "right"),
          format(mean(df[[nm]]), digits = digits, nsmall = digits),
          sep = " "
        ),
        "\n"
      )
    }
  }
}
show_mean(iris)

Sepal.Length: 5.84 
Sepal.Width:  3.06 
Petal.Length: 3.76 
Petal.Width:  1.20 


## For Loops Versus Functionals

For loops are not as important in R as they are in other languages
because R is a functional programming language. This means that
it’s possible to wrap up for loops in a function, and call that function
instead of using the for loop directly.

To see why this is important, consider (again) this simple data
frame:

In [76]:
df <- tibble(
    a = rnorm(10),
    b = rnorm(10),
    c = rnorm(10),
    d = rnorm(10)
)

Imagine you want to compute the mean of every column. You could
do that with a for loop:

In [77]:
output <- vector('double', length(df))
for (i in seq_along(df)) {
    output[[i]] <- mean(df[[i]])
}
output

You realize that you’re going to want to compute the means of every
column pretty frequently, so you extract it out into a function:

In [78]:
col_mean <- function(df) {
    output <- vector('double', length(df))
    for (i in seq_along(df)) {
        output[i] <- mean(df[[i]])
    }
    output
}

But then you think it’d also be helpful to be able to compute the
median, and the standard deviation, so you copy and paste your
`col_mean()` function and replace the `mean()` with `median()` and
`sd()` :

In [79]:
col_median <- function(df) {
    output <- vector('double', length(df))
    for (i in seq_along(df)) {
        output[i] <- median(df[[i]])
    }
    output
}

In [80]:
col_sd <- function(df) {
    output <- vector('double', length(df))
    for (i in seq_along(df)) {
        output[i] <- sd(df[[i]])
    }
    output
}

Uh oh! You’ve copied and pasted this code twice, so it’s time to think
about how to generalize it. Notice that most of this code is for-loop
boilerplate and it’s hard to see the one thing (`mean()`, `median()`,
`sd()`) that is different between the functions.

What would you do if you saw a set of functions like this?

In [81]:
f1 <- function(x) abs(x - mean(x)) ^ 1
f2 <- function(x) abs(x - mean(x)) ^ 2
f3 <- function(x) abs(x - mean(x)) ^ 3

Hopefully, you’d notice that there’s a lot of duplication, and extract it
out into an additional argument:

In [82]:
f <- function(x, i) abs(x - mean(x)) ^ i

You’ve reduced the chance of bugs (because you now have one-third of
the original code), and made it easy to generalize to new situations.

We can do exactly the same thing with `col_mean()`, `col_median()`,
and `col_sd()` by adding an argument that supplies the function to
apply to each column:

In [83]:
col_summary <- function(df, fun) {
    out <- vector('double', length(df))
    for (i in seq_along(df)) {
        out[i] <- fun(df[[i]])
    }
    out
}

In [84]:
col_summary(df, median)

In [85]:
col_summary(df, mean)
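Because `fun` is an ordinary argument, any function that returns a single number per column should work, including one written on the spot. The sketch below repeats the definition of `col_summary()` so it runs on its own; the small data frame and the anonymous range-width function are made up for illustration.

```r
col_summary <- function(df, fun) {
  out <- vector("double", length(df))
  for (i in seq_along(df)) {
    out[i] <- fun(df[[i]])
  }
  out
}

df <- data.frame(a = c(1, 2, 9), b = c(4, 5, 6))

# A built-in summary function
col_summary(df, sd)

# An anonymous function: the width of each column's range
col_summary(df, function(x) max(x) - min(x))
```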

The idea of passing a function to another function is an extremely
powerful idea, and it’s one of the behaviors that makes R a functional
programming language. It might take you a while to wrap
your head around the idea, but it’s worth the investment. In the rest
of the chapter, you’ll learn about and use the purrr package, which
provides functions that eliminate the need for many common for
loops. The apply family of functions in base R (`apply()`, `lapply()`,
`tapply()`, etc.) solve a similar problem, but purrr is more consistent
and thus is easier to learn.

The goal of using purrr functions instead of for loops is to allow you
to break common list manipulation challenges into independent
pieces:

* How can you solve the problem for a single element of the list?
Once you’ve solved that problem, purrr takes care of generalizing
your solution to every element in the list.
* If you’re solving a complex problem, how can you break it down
into bite-sized pieces that allow you to advance one small step
toward a solution? With purrr, you get lots of small pieces that
you can compose together with the pipe.

This structure makes it easier to solve new problems. It also makes it
easier to understand your solutions to old problems when you reread
your old code.

Exercise

Read the documentation for `apply()`. In the 2D case, what two for loops does it generalize?


For an object with two dimensions, such as a matrix or data frame, `apply()` replaces looping over its rows or columns. The `apply()` function is used like `apply(X, MARGIN, FUN, ...)`, where `X` is a matrix or array, `FUN` is a function to apply, and `...` are additional arguments passed to `FUN`.

When `MARGIN = 1`, the function is applied to each row. For example, the following code calculates the row means of a matrix.

In [88]:
X <- matrix(rnorm(15), nrow = 5)
X

           [,1]        [,2]       [,3]
[1,] -1.1902592   1.3028807  0.5865743
[2,] -1.0687685  0.12366869   -1.03425
[3,] -0.6136274 -0.78680261  0.5268122
[4,] -2.4034874  0.04724041  0.7469081
[5,]  0.2579339 -1.38771856  0.5472409


In [89]:
apply(X, 1, mean)

That is equivalent to this for-loop.

In [90]:
X_row_means <- vector('numeric', length = nrow(X))
for (i in seq_len(nrow(X))) {
    X_row_means[[i]] <- mean(X[i, ])
}
X_row_means

When `MARGIN = 2`, `apply()` is equivalent to a for-loop looping over columns.

In [93]:
apply(X, 2, mean)

In [95]:
X_col_means <- vector('numeric', length = ncol(X))
for (i in seq_len(ncol(X))) {
    X_col_means[[i]] <- mean(X[, i])
}
X_col_means

Exercise

Adapt `col_summary()` so that it only applies to numeric columns.
You might want to start with an `is_numeric()` function
that returns a logical vector that has a `TRUE` corresponding to
each numeric column.

In [107]:
# as of Sep 2020 , is_numeric from purrr is deprecated
col_summary_2 <- function(df, fun) {
    # create an empty vector which will store
    # whether each column is numeric
    numeric_cols <- vector('logical', length(df))
    # test whether each column is numeric
    for (i in seq_along(df)) {
        numeric_cols[[i]] <- is.numeric(df[[i]])
    }
    # find the indexes of the numeric columns
    idxs <- which(numeric_cols)
    # find the number of numeric columns
    n <- sum(numeric_cols)
    # create a vector to hold the results
    out <- vector('double', n)
    # apply the function only to numeric vectors
    for (i in seq_along(idxs)) {
        out[[i]] <- fun(df[[idxs[[i]]]])
    }
    # name the vector
    names(out) <- names(df)[idxs]
    out
}

In [108]:
df <- tibble(
  X1 = c(1, 2, 3),
  X2 = c("A", "B", "C"),
  X3 = c(0, -1, 5),
  X4 = c(TRUE, FALSE, TRUE)
)
col_summary_2(df, mean)

## The Map Functions

The pattern of looping over a vector, doing something to each element, and saving the results is so common that the purrr package
provides a family of functions to do it for you. There is one function
for each type of output:

* `map()` makes a list.
* `map_lgl()` makes a logical vector.
* `map_int()` makes an integer vector.
* `map_dbl()` makes a double vector.
* `map_chr()` makes a character vector.

Each function takes a vector as input, applies a function to each
piece, and then returns a new vector that’s the same length (and has
the same names) as the input. The type of the vector is determined
by the suffix to the map function.
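To connect this back to the for loops from earlier, here is a rough sketch of what a function like `map_dbl()` does conceptually. The name `simple_map_dbl()` is hypothetical and this is not purrr's actual implementation; it only illustrates the three promises: same length as the input, same names, and exactly one double per element.

```r
# A simplified, hypothetical stand-in for purrr::map_dbl().
simple_map_dbl <- function(x, f, ...) {
  out <- vector("double", length(x))  # same length as the input
  names(out) <- names(x)              # same names as the input
  for (i in seq_along(x)) {
    res <- f(x[[i]], ...)
    # Enforce the "one number per element" contract.
    stopifnot(is.numeric(res), length(res) == 1)
    out[[i]] <- res
  }
  out
}

simple_map_dbl(list(a = 1:4, b = c(2, 4, 6)), mean)
```

The other suffixes would differ only in the type of the preallocated vector and the check applied to each result.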

Once you master these functions, you’ll find it takes much less time
to solve iteration problems. But you should never feel bad about
using a for loop instead of a map function. The map functions are a
step up a tower of abstraction, and it can take a long time to get your
head around how they work. The important thing is that you solve
the problem that you’re working on, not write the most concise and
elegant code (although that’s definitely something you want to strive
toward!).

Some people will tell you to avoid for loops because they are slow.
They’re wrong! (Well at least they’re rather out of date, as for loops
haven’t been slow for many years). The chief benefit of using functions like `map()` is not speed, but clarity: they make your code easier
to write and to read.

We can use these functions to perform the same computations as the
last for loop. Those summary functions returned doubles, so we
need to use `map_dbl()` :

In [135]:
df <- tibble(
  a = rnorm(10), 
  b = rnorm(10), 
  c = rnorm(10), 
  d = rnorm(10)
)

In [136]:
map_dbl(df, mean)

In [137]:
map_dbl(df, median)

In [140]:
df %>% map_dbl(mean)

In [141]:
df %>% map_dbl(median)

There are a few differences between `map_*()` and `col_summary()`:

* All purrr functions are implemented in C. This makes them a
little faster at the expense of readability.
* The second argument, `.f`, the function to apply, can be a formula,
a character vector, or an integer vector. You’ll learn about
those handy shortcuts in the next section.

* `map_*()` uses `...` (“dot-dot-dot”) to pass along
additional arguments to `.f` each time it’s called:

In [143]:
map_dbl(df, mean, trim = 0.4)

* The map functions also preserve names:

In [144]:
z <- list(x = 1:3, y = 4:5)
map_int(z, length)

## Shortcuts

There are a few shortcuts that you can use with .f in order to save a
little typing. Imagine you want to fit a linear model to each group in
a dataset. The following toy example splits up the mtcars dataset
into three pieces (one for each value of cylinder) and fits the same
linear model to each piece:

In [145]:
models <- mtcars %>%
    split(.$cyl) %>%
    map(function(df) lm(mpg ~ wt, data = df))

The syntax for creating an anonymous function in R is quite verbose
so purrr provides a convenient shortcut—a one-sided formula:

In [146]:
models <- mtcars %>%
    split(.$cyl) %>%
    map(~lm(mpg ~ wt, data = .))

Here I’ve used `.` as a pronoun: it refers to the current list element (in
the same way that i referred to the current index in the for loop).

When you’re looking at many models, you might want to extract a
summary statistic like the $R^2$. To do that we need to first run `summary()`
and then extract the component called `r.squared`. We could
do that using the shorthand for anonymous functions:

In [147]:
models %>%
    map(summary) %>%
    map_dbl(~.$r.squared)

But extracting named components is a common operation, so purrr
provides an even shorter shortcut: you can use a string.

In [148]:
models %>%
    map(summary) %>%
    map_dbl('r.squared')

You can also use an integer to select elements by position:

In [149]:
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
x %>% map_dbl(2)
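The shortcuts are interchangeable ways of writing the same extraction. As a small side-by-side sketch, assuming purrr is installed (the `people` list is made up for the example), all four calls below should return the same vector:

```r
library(purrr)

people <- list(
  list(name = "Ada", age = 36),
  list(name = "Grace", age = 45)
)

ages_fun     <- map_dbl(people, function(p) p[["age"]])  # full anonymous function
ages_formula <- map_dbl(people, ~ .[["age"]])            # one-sided formula
ages_name    <- map_dbl(people, "age")                   # extract by name
ages_pos     <- map_dbl(people, 2)                       # extract by position

ages_fun
```

Extracting by name is usually the most robust of the four, since it keeps working if the order of the components changes.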

## Base R

If you’re familiar with the apply family of functions in base R, you
might have noticed some similarities with the purrr functions:

* `lapply()` is basically identical to `map()`, except that `map()` is
consistent with all the other functions in purrr, and you can use
the shortcuts for `.f`.

* Base `sapply()` is a wrapper around `lapply()` that automatically
simplifies the output. This is useful for interactive work but is
problematic in a function because you never know what sort of
output you’ll get:

In [150]:
x1 <- list(
    c(0.27, 0.37, 0.91, 0.20),
    c(0.90, 0.94, 0.66, 0.63),
    c(0.21, 0.18, 0.69, 0.38)
)

x2 <- list(
    c(0.50, 0.72, 0.99, 0.38, 0.78),
    c(0.93, 0.21, 0.65, 0.13, 0.27),
    c(0.39, 0.01, 0.38, 0.87, 0.34)
)

In [151]:
threshold <- function(x, cutoff = 0.8) x[x > cutoff]
x1 %>% sapply(threshold) %>% str()

List of 3
 $ : num 0.91
 $ : num [1:2] 0.9 0.94
 $ : num(0) 


In [152]:
x2 %>% sapply(threshold) %>% str()

 num [1:3] 0.99 0.93 0.87


* `vapply()` is a safe alternative to `sapply()` because you supply
an additional argument that defines the type. The only problem
with `vapply()` is that it’s a lot of typing:
`vapply(df, is.numeric, logical(1))` is equivalent to
`map_lgl(df, is.numeric)`. One advantage of `vapply()` over purrr’s map
functions is that it can also produce matrices; the map functions only ever produce vectors.
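Here is a short base R sketch of both points about `vapply()`. It reuses the `threshold()` idea from the `sapply()` example; the variable names `strict` and `ranges` are made up for illustration.

```r
x1 <- list(
  c(0.27, 0.37, 0.91, 0.20),
  c(0.90, 0.94, 0.66, 0.63),
  c(0.21, 0.18, 0.69, 0.38)
)
threshold <- function(x, cutoff = 0.8) x[x > cutoff]

# vapply() rejects results that do not match the declared template,
# so the inconsistent lengths surface as an error rather than as a
# silently different output type.
strict <- tryCatch(
  vapply(x1, threshold, numeric(1)),
  error = function(e) "error"
)
strict

# With a length-2 template, vapply() returns a matrix
# with one column per input element.
ranges <- vapply(x1, range, numeric(2))
dim(ranges)
```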

Exercise

Write code that uses one of the map functions to:

    Compute the mean of every column in mtcars.
    Determine the type of each column in nycflights13::flights.
    Compute the number of unique values in each column of iris.
    Generate 10 random normals for each of μ=−10, 0, 10, and 100.

In [157]:
map_dbl(mtcars, mean)
typeof(map_dbl(mtcars, mean))

In [158]:
(map(mtcars, mean))
typeof(map(mtcars, mean))

In [159]:
map_chr(nycflights13::flights, typeof)

In [160]:
map_int(iris, n_distinct)

In [162]:
map(c(-10, 0, 10, 100), ~rnorm(n = 10, mean = .))

Exercise

How can you create a single vector that for each column in a data frame indicates whether or not it’s a factor?

In [163]:
map_lgl(diamonds, is.factor)

Exercise

What happens when you use the map functions on vectors that aren’t lists? What does map(1:5, runif) do? Why?

In [164]:
map(c(TRUE, FALSE, TRUE), ~ !.)

In [165]:
map(c("Hello", "World"), str_to_upper)

In [166]:
map(1:5, ~ rnorm(.))

In [167]:
map(c(-0.5, 0, 1), ~ rnorm(1, mean = .))

It is important to be aware that while the input of `map()` can be any vector, the output is always a `list`.

In [168]:
map(1:5, runif)

Exercise



What does map(-2:2, rnorm, n = 5) do? Why?

What does map_dbl(-2:2, rnorm, n = 5) do? Why?


In [169]:
map(-2:2, rnorm, n = 5)

This expression takes samples of size five from five normal distributions, with means of -2, -1, 0, 1, and 2, but the same standard deviation (1). It returns a list in which each element is a numeric vector of length 5.

However, if instead, we use `map_dbl()`, the expression raises an error.

In [170]:
map_dbl(-2:2, rnorm, n = 5)

ERROR: Error: Result 1 must be a single double, not a double vector of length 5




This is because the `map_dbl()` function requires the function it applies to each element to return a numeric vector of length one. If the function returns either a non-numeric vector or a numeric vector with a length greater than one, `map_dbl()` will raise an error. The reason for this strictness is that `map_dbl()` guarantees that it will return a numeric vector of the same length as its input vector.

This concept applies to the other `map_*()` functions. The function `map_chr()` requires that the function always return a character vector of length one; `map_int()` requires that the function always return an integer vector of length one; `map_lgl()` requires that the function always return a logical vector of length one. Use the `map()` function if the function will return values of varying types or lengths.

To return a numeric vector, use `flatten_dbl()` to coerce the list returned by `map()` to a numeric vector.

In [171]:
map(-2:2, rnorm, n = 5) %>%
  flatten_dbl()

## Dealing with Failure

When you use the map functions to repeat many operations, the
chances are much higher that one of those operations will fail.
When this happens, you’ll get an error message, and no output. This
is annoying: why does one failure prevent you from accessing all the
other successes? How do you ensure that one bad apple doesn’t ruin
the whole barrel?

In this section you’ll learn how to deal with this situation with a new
function: `safely()` . `safely()` is an adverb: it takes a function (a
verb) and returns a modified version. In this case, the modified
function will never throw an error. Instead, it always returns a list
with two elements:

`result`
The original result. If there was an error, this will be `NULL`.

`error`
An error object. If the operation was successful, this will be
`NULL`.

In [172]:
safe_log <- safely(log)
str(safe_log(10))

List of 2
 $ result: num 2.3
 $ error : NULL


In [173]:
str(safe_log('a'))

List of 2
 $ result: NULL
 $ error :List of 2
  ..$ message: chr "non-numeric argument to mathematical function"
  ..$ call   : language .Primitive("log")(x, base)
  ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"


When the function succeeds, the result element contains the result
and the error element is `NULL`. When the function fails, the result
element is `NULL` and the error element contains an error object.
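To demystify the idea of an adverb, here is a rough base R sketch of how a function with this behavior could be written using `tryCatch()`. The name `simple_safely()` is hypothetical, and this is not purrr's implementation; it only reproduces the result/error contract described above.

```r
# A simplified, hypothetical stand-in for purrr::safely().
simple_safely <- function(f) {
  # Return a new function that wraps every call to f in tryCatch().
  function(...) {
    tryCatch(
      list(result = f(...), error = NULL),
      error = function(e) list(result = NULL, error = e)
    )
  }
}

my_safe_log <- simple_safely(log)

ok <- my_safe_log(10)
bad <- my_safe_log("a")

ok$result              # the usual value of log(10)
is.null(bad$result)    # TRUE: no result when the call fails
conditionMessage(bad$error)
```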

`safely()` is designed to work with `map()`:

In [174]:
x <- list(1, 10, 'a')
y <- x %>% map(safely(log))

In [175]:
str(y)

List of 3
 $ :List of 2
  ..$ result: num 0
  ..$ error : NULL
 $ :List of 2
  ..$ result: num 2.3
  ..$ error : NULL
 $ :List of 2
  ..$ result: NULL
  ..$ error :List of 2
  .. ..$ message: chr "non-numeric argument to mathematical function"
  .. ..$ call   : language .Primitive("log")(x, base)
  .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"


This would be easier to work with if we had two lists: one of all the
errors and one of all the output. That’s easy to get with
`purrr::transpose()` :

In [176]:
y <- y %>% transpose()

In [177]:
str(y)

List of 2
 $ result:List of 3
  ..$ : num 0
  ..$ : num 2.3
  ..$ : NULL
 $ error :List of 3
  ..$ : NULL
  ..$ : NULL
  ..$ :List of 2
  .. ..$ message: chr "non-numeric argument to mathematical function"
  .. ..$ call   : language .Primitive("log")(x, base)
  .. ..- attr(*, "class")= chr [1:3] "simpleError" "error" "condition"


It’s up to you how to deal with the errors, but typically you’ll either
look at the values of `x` where `y` is an error, or work with the values of
`y` that are OK:

In [178]:
is_ok <- y$error %>%
    map_lgl(is_null)
x[!is_ok]

In [179]:
y$result[is_ok] %>%
    flatten_dbl()

purrr provides two other useful adverbs:

* Like `safely()`, `possibly()` always succeeds. It’s simpler than
`safely()`, because you give it a default value to return when
there is an error:

In [180]:
x <- list(1, 10, 'a')
x %>% map_dbl(possibly(log, NA_real_))

* `quietly()` performs a similar role to `safely()`, but instead of
capturing errors, it captures printed output, messages, and
warnings:

In [181]:
x <- list(1, -1)
x %>% map(quietly(log)) %>% str()

List of 2
 $ :List of 4
  ..$ result  : num 0
  ..$ output  : chr ""
  ..$ warnings: chr(0) 
  ..$ messages: chr(0) 
 $ :List of 4
  ..$ result  : num NaN
  ..$ output  : chr ""
  ..$ warnings: chr "NaNs produced"
  ..$ messages: chr(0) 


## Mapping over Multiple Arguments

So far we’ve mapped along a single input. But often you have multiple related inputs that you need to iterate along in parallel. That’s the
job of the `map2()` and `pmap()` functions. For example, imagine you
want to simulate some random normals with different means. You
know how to do that with `map()` :

In [182]:
mu <- list(5, 10, -3)

In [183]:
mu %>%
    map(rnorm, n = 5) %>%
    str()

List of 3
 $ : num [1:5] 6.72 5.54 5.02 5.83 4.54
 $ : num [1:5] 9.5 9.2 10.07 10.94 9.72
 $ : num [1:5] -3.45 -4.31 -3.44 -3.55 -2.04


What if you also want to vary the standard deviation? One way to do
that would be to iterate over the indices and index into vectors of
means and sds:

In [184]:
sigma <- list(1, 5, 10)
seq_along(mu) %>%
    map(~rnorm(5, mu[[.]], sigma[[.]])) %>%
    str()

List of 3
 $ : num [1:5] 4.85 4.13 4.58 4.02 5.99
 $ : num [1:5] 17.7 9.77 11.29 2.49 17.61
 $ : num [1:5] -6.29 -9.48 23.75 -1.61 -1.97


But that obfuscates the intent of the code. Instead we could use
`map2()` , which iterates over two vectors in parallel:

In [185]:
map2(mu, sigma, rnorm, n = 5) %>%
    str()

List of 3
 $ : num [1:5] 4.93 3.63 2.29 4.88 4.99
 $ : num [1:5] 13.16 4.8 14.65 16.55 5.37
 $ : num [1:5] -7.18 -7.58 -4.26 -5.26 20.16


Like `map()` , `map2()` is just a wrapper around a for loop:
```r
map2 <- function(x, y, f, ...) {
    out <- vector('list', length(x))
    for (i in seq_along(x)) {
        out[[i]] <- f(x[[i]], y[[i]], ...)
    }
    out
}
```

You could also imagine `map3()` , `map4()` , `map5()` , `map6()` , etc., but
that would get tedious quickly. Instead, purrr provides `pmap()` ,
which takes a list of arguments. You might use that if you wanted to
vary the mean, standard deviation, and number of samples:

In [186]:
n <- list(1, 3, 5)
args1 <- list(n, mu, sigma)
args1 %>%
    pmap(rnorm) %>%
    str()

List of 3
 $ : num 4.4
 $ : num [1:3] 5.79 17.71 7.52
 $ : num [1:5] 0.444 -7.26 2.842 11.07 -3.872


If you don’t name the elements of the list, `pmap()` will use positional
matching when calling the function. That’s a little fragile, and makes
the code harder to read, so it’s better to name the arguments:

In [187]:
args2 <- list(mean = mu, sd = sigma, n = n)
args2 %>%
    pmap(rnorm) %>%
    str()

List of 3
 $ : num 5.84
 $ : num [1:3] 12.7 13.9 23.1
 $ : num [1:5] -7.16 -6.96 -6.86 -16.69 4.45


Since the arguments are all the same length, it makes sense to store
them in a data frame:

In [188]:
params <- tribble(
    ~mean, ~sd, ~n,
    5, 1, 1,
    10, 5, 3,
    -3, 10, 5
)

params %>%
    pmap(rnorm)

As soon as your code gets complicated, I think a data frame is a
good approach because it ensures that each column has a name and
is the same length as all the other columns.
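
One possible extension, sketched here under the assumption that you want to keep each simulation next to its parameters, is to store the `pmap()` result as a list-column. The column name `draws` is hypothetical.

```r
library(tibble)
library(dplyr)
library(purrr)

params <- tribble(
    ~mean, ~sd, ~n,
    5, 1, 1,
    10, 5, 3,
    -3, 10, 5
)

# Inside mutate(), n, mean, and sd refer to the columns of params,
# so each row supplies the arguments for one call to rnorm().
sims <- params %>%
    mutate(draws = pmap(list(n = n, mean = mean, sd = sd), rnorm))

lengths(sims$draws)  # each element has as many draws as its n
```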

## Invoking Different Functions

There’s one more step up in complexity—as well as varying the
arguments to the function you might also vary the function itself:

In [189]:
f <- c('runif', 'rnorm', 'rpois')
param <- list(
    list(min = -1, max = 1),
    list(sd = 5),
    list(lambda = 10)
)

To handle this case, you can use `invoke_map()`:

In [190]:
invoke_map(f, param, n = 5) %>%
    str()

List of 3
 $ : num [1:5] -0.454 -0.683 0.718 -0.671 0.568
 $ : num [1:5] -3.418 2.05 1.153 2.553 0.701
 $ : int [1:5] 15 7 9 12 9


The first argument is a list of functions or a character vector of function names. The second argument is a list of lists giving the arguments that vary for each function. The subsequent arguments are
passed on to every function.

And again, you can use `tribble()` to make creating these matching
pairs a little easier:

In [191]:
sim <- tribble(
    ~f,     ~params,
    'runif', list(min = -1, max = 1),
    'rnorm', list(sd = 5),
    'rpois', list(lambda = 10)
)

sim %>%
    mutate(sim = invoke_map(f, params, n = 10))

f,params,sim
<chr>,<list>,<list>
runif,"-1, 1","0.6418576, 0.4270354, 0.1707565, 0.9633596, -0.6784058, -0.1566551, -0.2001263, -0.8362209, -0.1078683, -0.4953311"
rnorm,5,"-1.65927346, -4.13028336, -2.22735888, -0.84661681, -4.52464737, 4.51057010, 0.06553342, -6.59964444, 1.22340575, 1.99553023"
rpois,10,"15, 9, 10, 11, 17, 7, 6, 9, 11, 9"
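
The examples above rely on purrr 0.3.4, the version loaded at the start of this chapter. In purrr 1.0.0 and later, `invoke_map()` is deprecated. A rough equivalent, assuming rlang is installed (it is a dependency of purrr), combines `map2()` with `rlang::exec()`. The argument names `fn` and `args` are illustrative.

```r
library(purrr)

f <- c('runif', 'rnorm', 'rpois')
param <- list(
    list(min = -1, max = 1),
    list(sd = 5),
    list(lambda = 10)
)

# For each function name and its argument list, call the function with
# the shared argument n = 5 plus the per-function arguments spliced in.
sims <- map2(f, param, function(fn, args) rlang::exec(fn, n = 5, !!!args))

str(sims)
```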


## Walk

Walk is an alternative to map that you use when you want to call a
function for its side effects, rather than for its return value. You typically do this because you want to render output to the screen or save
files to disk—the important thing is the action, not the return value.
Here’s a very simple example:

In [192]:
x <- list(1, 'a', 4)
x %>%
    walk(print)

[1] 1
[1] "a"
[1] 4


`walk()` is generally not that useful compared to `walk2()` or `pwalk()`.
For example, if you had a list of plots and a vector of filenames, you
could use `pwalk()` to save each plot to the corresponding location on
disk:

In [193]:
library(ggplot2)
plots <- mtcars %>%
    split(.$cyl) %>%
    map(~ggplot(., aes(mpg, wt)) + geom_point())

paths <- str_c(names(plots), '.pdf')

pwalk(list(paths, plots), ggsave, path = tempdir())

Saving 6.67 x 6.67 in image

Saving 6.67 x 6.67 in image

Saving 6.67 x 6.67 in image
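
A smaller sketch of the same idea with `walk2()`, using plain text files so that it does not depend on ggplot2. The directory name `walk2-demo` and the file names are hypothetical. It also shows that the walk functions return their first input invisibly, which makes them usable in the middle of a pipeline.

```r
library(purrr)

# Hypothetical output directory under the session's temporary directory.
out_dir <- file.path(tempdir(), "walk2-demo")
dir.create(out_dir, showWarnings = FALSE)

contents <- list("first file", "second file")
paths <- file.path(out_dir, c("a.txt", "b.txt"))

# writeLines(text, con) is called once per pair for its side effect.
returned <- walk2(contents, paths, writeLines)

file.exists(paths)
identical(returned, contents)  # the first input comes back unchanged
```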



## Other Patterns of For Loops

purrr provides a number of other functions that abstract over other
types of for loops. You’ll use them less frequently than the map functions, but they’re useful to know about. The goal here is to briefly
illustrate each function, so hopefully it will come to mind if you see
a similar problem in the future. Then you can go look up the documentation for more details.

### Predicate Functions

A number of functions work with predicate functions that return
either a single `TRUE` or `FALSE`.

`keep()` and `discard()` keep elements of the input where the predicate is `TRUE` or `FALSE`, respectively:

In [194]:
iris %>%
    keep(is.factor) %>%
    str()

'data.frame':	150 obs. of  1 variable:
 $ Species: Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


In [195]:
iris %>%
    discard(is.factor) %>%
    str()

'data.frame':	150 obs. of  4 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...


`some()` and `every()` determine if the predicate is true for any or for
all of the elements:

In [196]:
x <- list(1:5, letters, list(10))

In [197]:
x %>%
    some(is_character)

In [198]:
x %>%
    every(is_vector)
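
As a compact illustration on a plain list rather than a data frame, the following sketch applies the same four functions with base R predicates. The result names are illustrative.

```r
library(purrr)

x <- list(1:5, letters, list(10))

kept <- keep(x, is.numeric)        # only the integer vector remains
dropped <- discard(x, is.numeric)  # the character vector and the list remain

any_chr <- some(x, is.character)   # TRUE: letters is a character vector
all_num <- every(x, is.numeric)    # FALSE: not every element is numeric
```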

`detect()` finds the first element where the predicate is true;
`detect_index()` returns its position:


In [210]:
x <- sample(10)
x

In [211]:
x %>%
    detect(~ . > 5)

In [212]:
x %>%
    detect_index(~ . > 5)

`head_while()` and `tail_while()` take elements from the start or
end of a vector while a predicate is true:

In [213]:
x %>%
    head_while(~ . > 5)

In [214]:
x %>%
    tail_while(~ . > 5)
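
Because `sample()` gives a different vector on each run, here is a sketch with a fixed, made-up vector so the behavior is easy to trace by hand.

```r
library(purrr)

x <- c(6, 9, 2, 8, 1, 7, 10)  # fixed input chosen for illustration

first_small <- detect(x, ~ . < 5)        # 2, the first value below 5
first_small_at <- detect_index(x, ~ . < 5)  # 3, its position

leading <- head_while(x, ~ . > 5)   # 6 9: stops at the first value <= 5
trailing <- tail_while(x, ~ . > 5)  # 7 10: works backward from the end
```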

### Reduce and Accumulate

Sometimes you have a complex list that you want to reduce to a simple
list by repeatedly applying a function that reduces a pair to a singleton.
This is useful if you want to apply a two-table __dplyr__ verb to
multiple tables. For example, you might have a list of data frames,
and you want to reduce to a single data frame by joining the elements
together:

In [215]:
dfs <- list(
    age = tibble(name = 'John', age = 30),
    sex = tibble(name = c('John', 'Mary'), sex = c('M', 'F')),
    trt = tibble(name = 'Mary', treatment = 'A')
)

In [216]:
dfs %>%
    reduce(full_join)

Joining, by = "name"

Joining, by = "name"



name,age,sex,treatment
<chr>,<dbl>,<chr>,<chr>
John,30.0,M,
Mary,,F,A


Or maybe you have a list of vectors, and want to find the intersection:

In [217]:
vs <- list(
    c(1, 3, 5, 6, 10),
    c(1, 2, 3, 7, 8, 10),
    c(1, 2, 3, 4, 8, 9, 10)
)

vs %>%
    reduce(intersect)

`reduce()` takes a “binary” function (i.e., a function with
two primary inputs), and applies it repeatedly to a list until there is
only a single element left.
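
To make the repeated application concrete, this sketch writes out by hand the nested calls that `reduce()` performs for a three-element list, and an equivalent for loop. The names `by_reduce`, `by_hand`, and `by_loop` are illustrative.

```r
library(purrr)

vs <- list(
    c(1, 3, 5, 6, 10),
    c(1, 2, 3, 7, 8, 10),
    c(1, 2, 3, 4, 8, 9, 10)
)

by_reduce <- reduce(vs, intersect)

# The same computation spelled out as nested binary calls.
by_hand <- intersect(intersect(vs[[1]], vs[[2]]), vs[[3]])

# And as a for loop that carries the running result forward.
by_loop <- vs[[1]]
for (v in vs[-1]) {
    by_loop <- intersect(by_loop, v)
}

by_reduce  # 1 3 10
```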

`accumulate()` is similar but it keeps all the interim results. You could
use it to implement a cumulative sum:

In [218]:
x <- sample(10)
x

In [219]:
x %>%
    accumulate(`+`)
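
With a fixed, made-up input the interim results can be checked against base R's cumulative functions. This is only a sketch; the variable names are illustrative.

```r
library(purrr)

x <- c(3, 1, 4, 1, 5)  # fixed input chosen for illustration

running_sum <- accumulate(x, `+`)  # 3 4 8 9 14
running_max <- accumulate(x, max)  # 3 3 4 4 5

all.equal(running_sum, cumsum(x))
all.equal(running_max, cummax(x))
```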