# Loop functions - `lapply()` and `sapply()`

* Writing `for` and `while` loops is useful when programming but cumbersome when working interactively on the command line;
* There are some functions which implement looping to make life easier.


1. `lapply()`: Loop over a list and evaluate a function on each element;
2. `sapply()`: Same as `lapply()` but tries to simplify the result;
3. `apply()`: Apply a function over the margins of an array;
4. `tapply()`: Apply a function over subsets of a vector;
5. `mapply()`: Multivariate version of `lapply()`.


* An auxiliary function `split()` is also useful, particularly in conjunction with `lapply()`.

## `lapply()`

* It takes three arguments: (1) **a list X**, (2) **a function** (or the name of a function) **FUN**, (3) **other arguments** passed on to **FUN** via its `...` argument;
* If **X** is not a list, it will be coerced to a list using `as.list()`;
* The actual looping is done internally in C code.
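
As a minimal sketch of the third argument, the call below forwards `min` and `max` through `...` to `runif()`. The seed and the bounds are illustrative assumptions, not values from these notes.

```r
# Extra named arguments placed after FUN are forwarded to FUN through `...`.
set.seed(1)  # illustrative seed so the draws are repeatable
res <- lapply(1:4, runif, min = 0, max = 10)

# Element i holds i uniform draws between 0 and 10.
print(lengths(res))

# A non-list input behaves as if it had been passed through as.list() first.
same <- identical(lapply(1:3, function(i) i^2),
                  lapply(as.list(1:3), function(i) i^2))
print(same)
```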


In [1]:
lapply

* It always returns a list, regardless of the class of the input.

In [2]:
x <- list(a=1:5, b=rnorm(10))
x
lapply(x, mean)

In [3]:
x <- list(a=1:4, b=rnorm(10), c=rnorm(20,1), d=rnorm(100,5))
x
lapply(x, mean)

In [5]:
x <- 1:4
x
lapply(x, runif)

* `lapply()` and friends make heavy use of *anonymous functions*.

In [7]:
x <- list(a=matrix(1:4,2,2), b=matrix(1:6,3,2))
x

$a
     [,1] [,2]
[1,]    1    3
[2,]    2    4

$b
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


* Example of an anonymous function for extracting the first column of each matrix:

In [8]:
lapply(x, function(elt) elt[,1])

## `sapply()`

* It will try to simplify the result of `lapply()` if possible:


1. If the result is a list where every element is length 1, then a vector is returned;
2. If the result is a list where every element is a vector of the same length (> 1), a matrix is returned;
3. If it can't figure things out, a list is returned.
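
The three cases can be sketched with one small list. The list `x` and the anonymous function below are illustrative assumptions chosen so that each rule is triggered once.

```r
x <- list(a = 1:4, b = 5:10)

# Case 1: every result has length 1, so a named numeric vector is returned.
v <- sapply(x, mean)

# Case 2: every result has the same length (> 1), so a matrix is returned,
# with one column per list element.
m <- sapply(x, range)

# Case 3: results differ in length, so sapply() falls back to a list.
l <- sapply(x, function(elt) elt[elt > 2])

print(class(v))
print(dim(m))
print(class(l))
```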

In [9]:
x <- list(a=1:4, b=rnorm(10), c=rnorm(20,1), d=rnorm(100,5))
lapply(x, mean)

In [12]:
sapply(x, mean) # returns a vector

In [11]:
mean(x)

“argument is not numeric or logical: returning NA”

---
## `apply()`

* It is used to evaluate a function (often an anonymous one) over the margins of an array;
* The main reason to use it is less typing.


1. It is most often used to apply a function to the rows or columns of a matrix;
2. It can be used with general arrays (e.g., taking the average of an array of matrices);
3. It is not really faster than writing a loop, but it works in one line.
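
To make the comparison with a loop concrete, here is a minimal sketch that computes column means both ways. The matrix, the seed, and the variable names are illustrative assumptions.

```r
set.seed(42)  # illustrative seed
x <- matrix(rnorm(200), 20, 10)

# Explicit loop: preallocate the result, then fill one column mean at a time.
col_means_loop <- numeric(ncol(x))
for (j in seq_len(ncol(x))) {
    col_means_loop[j] <- mean(x[, j])
}

# The same computation in one line with apply().
col_means_apply <- apply(x, 2, mean)

print(all.equal(col_means_loop, col_means_apply))
```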



In [13]:
str(apply)

function (X, MARGIN, FUN, ...)  


* **X** is an array;
* **MARGIN** is an integer vector indicating which margins should be *retained*;
* **FUN** is a function to be applied;
* **...** is for other arguments to be passed to **FUN**.

In [25]:
x <- matrix(rnorm(200), 20, 10)
print(apply(x, 2, mean))  # 2 == columns
print('---')
print(apply(x, 1, sum))  # 1 == rows

 [1] -0.45860586 -0.04821500 -0.10902990 -0.06394621  0.36047431 -0.23218423
 [7]  0.08775499  0.19499142 -0.38224370  0.33730712
[1] "---"
 [1]  0.53440815  8.15617231  2.09784355  3.68881127  2.93681205 -4.38584805
 [7] -0.65427916 -1.27904362  1.61203779  0.40545089 -0.62957700 -3.48513927
[13] -0.47367387 -0.50262289  0.08750964 -2.73329140 -0.94222103 -5.93470970
[19]  1.64671962 -6.41930029


## col/row sums and means

* For sums and means of matrix dimensions, we have some shortcuts:


1. `rowSums()` = `apply(x, 1, sum)`
2. `rowMeans()` = `apply(x, 1, mean)`
3. `colSums()` = `apply(x, 2, sum)`
4. `colMeans()` = `apply(x, 2, mean)`


* The shortcut functions are **much** faster, but you won't notice unless you're using a large matrix.
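
A quick way to check the equivalences listed above, and to get a rough feel for the speed difference, is sketched below. The matrix sizes and seed are arbitrary assumptions, and the timings will vary by machine, so no timing output is shown.

```r
set.seed(42)  # illustrative seed
x <- matrix(rnorm(200), 20, 10)

# Each shortcut should agree with its apply() counterpart.
checks <- c(
    rowSums  = isTRUE(all.equal(rowSums(x),  apply(x, 1, sum))),
    rowMeans = isTRUE(all.equal(rowMeans(x), apply(x, 1, mean))),
    colSums  = isTRUE(all.equal(colSums(x),  apply(x, 2, sum))),
    colMeans = isTRUE(all.equal(colMeans(x), apply(x, 2, mean)))
)
print(checks)

# Rough timing on a larger matrix; results depend on the machine.
big <- matrix(rnorm(1e6), 1000, 1000)
print(system.time(rowMeans(big)))
print(system.time(apply(big, 1, mean)))
```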

## Other ways to apply

* Quantiles of the rows of a matrix.

In [27]:
x <- matrix(rnorm(200), 20, 10)
print(apply(x, 1, quantile, probs=c(0.25, 0.75)))

          [,1]       [,2]       [,3]       [,4]       [,5]       [,6]      [,7]
25% -1.1229583 -0.7201730 -0.4798213 -0.6320166 -1.0089346 -0.7584882 0.3959865
75%  0.6286197  0.9856762  0.1340930  1.3958415  0.3423727  0.4916413 1.0270753
          [,8]       [,9]      [,10]      [,11]      [,12]      [,13]
25% -1.0320469 -1.4248100 -0.6639273 -0.6500948 -0.3627748 -0.2240410
75% -0.2537827 -0.2202204  0.6593582  0.3034206  1.0147624  0.6909585
         [,14]      [,15]      [,16]      [,17]      [,18]      [,19]
25% -0.7003952 -0.7707818 -0.5530322 0.02971026 -0.3929981 -0.2323060
75%  0.7379277  0.6205212  0.8758019 1.09297908  0.5296029  0.2858213
         [,20]
25% -0.9432065
75%  0.1439631


* Average matrix in an array:

In [37]:
a <- array(rnorm(2*2*10), c(2,2,10))  # 40 random numbers arranged as ten 2x2 matrices
print(apply(a, c(1,2), mean))
print('---')
print(rowMeans(a, dims=2))

           [,1]       [,2]
[1,] -0.1269778 -0.1335441
[2,] -0.5702521  0.6789206
[1] "---"
           [,1]       [,2]
[1,] -0.1269778 -0.1335441
[2,] -0.5702521  0.6789206


---
## `mapply()`

* It is a multivariate apply of sorts, which applies a function in parallel over a set of arguments;

In [38]:
str(mapply)

function (FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)  


* **FUN** is a function to apply;
* **...** contains arguments to apply over;
* **MoreArgs** is a list of other arguments to **FUN**;
* **SIMPLIFY** indicates whether the result should be simplified.
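
The roles of `...`, **MoreArgs**, and **SIMPLIFY** can be sketched as follows. The inputs are illustrative assumptions: `times` is iterated over, while `x = "a"` stays fixed for every call.

```r
# Arguments in `...` are iterated over in parallel; MoreArgs holds
# arguments that are passed unchanged to every call of FUN.
res <- mapply(rep, times = 1:3, MoreArgs = list(x = "a"))
print(res)  # results differ in length, so a list is returned

# When every call returns a single value, SIMPLIFY = TRUE gives a vector.
s <- mapply(function(x, y) x + y, 1:3, 4:6)
print(s)

# SIMPLIFY = FALSE always keeps the list form.
s_list <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)
print(class(s_list))
```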

> The following is tedious to type:

In [41]:
print(list(rep(1,4), rep(2,3), rep(3,2), rep(4,1)))

[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4



> Instead we can do:

In [42]:
print(mapply(rep, 1:4, 4:1))

[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4



## - Vectorizing a function


In [47]:
noise <- function(n, mean, sd) {
    rnorm(n, mean, sd)
}

print(noise(5,1,2))
print('---')
print(noise(1:5, 1:5, 2)) # not what we want: one draw per mean instead of n draws; `mapply()` is needed

[1]  1.3787703  4.7100678  1.9037110 -0.0045433  2.5863238
[1] "---"
[1] 2.134702 2.884352 1.474108 1.756641 4.358317


In [46]:
print(mapply(noise, 1:5, 1:5, 2))

[[1]]
[1] -0.7074835

[[2]]
[1] 0.8753265 1.4578100

[[3]]
[1] 1.6181635 0.1841707 2.7530993

[[4]]
[1] 4.713487 2.964363 4.008731 4.495350

[[5]]
[1] 3.752158 6.074472 7.887333 9.122761 5.093533



* Which is the same as:

In [48]:
print(list(noise(1,1,2), noise(2,2,2), noise(3,3,2), noise(4,4,2), noise(5,5,2)))

[[1]]
[1] 2.358179

[[2]]
[1] 3.707294 1.571609

[[3]]
[1] 3.786568 1.017364 3.778746

[[4]]
[1] 3.168267 2.861526 5.477760 3.930258

[[5]]
[1] 3.763899 2.681291 5.970280 8.151980 5.228499



---
## `tapply()`

* It is used to apply a function over subsets of a vector;

In [49]:
str(tapply)

function (X, INDEX, FUN = NULL, ..., default = NA, simplify = TRUE)  


* **X** is a vector;
* **INDEX** is a factor or a list of factors (or else they are coerced to factors);
* **FUN** is a function to be applied;
* **...** contains other arguments to be passed to **FUN**;
* **simplify** indicates whether the result should be simplified.
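
As a small sketch of how **FUN** affects the shape of the result, the example below uses a hand-made vector and factor (both illustrative assumptions). A function that returns one value per group gives a named array, while one that returns several values per group gives a list.

```r
x <- c(1, 2, 3, 10, 20, 30)
f <- gl(2, 3, labels = c("low", "high"))

# One value per group: the result is simplified to a named array.
m <- tapply(x, f, mean)
print(m)

# Two values per group (min and max): the result cannot be simplified
# to a vector, so a list indexed by group is returned.
r <- tapply(x, f, range)
print(r[["low"]])
print(r[["high"]])
```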

In [58]:
x <- c(rnorm(10), runif(10), rnorm(10,1))
f <- gl(3, 10)
print(x)
print(f)

print(tapply(x, f, mean))

 [1]  0.73251796  0.09171256 -0.39233184  0.15589176 -0.20850068  1.57831916
 [7] -0.56046739  1.47781611 -1.10053625 -0.22404224  0.71290907  0.80418300
[13]  0.68092608  0.50336359  0.50087476  0.20034499  0.84248251  0.40531634
[19]  0.09277781  0.35403662  0.50260783  2.20195048  1.94859538  0.79318807
[25] -0.40947244  0.45125235  1.77408893  0.61831982  1.32273434  0.20748295
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
        1         2         3 
0.1550379 0.5097215 0.9410748 


* Takes group means without simplification

In [60]:
print(tapply(x, f, mean, simplify=FALSE))

$`1`
[1] 0.1550379

$`2`
[1] 0.5097215

$`3`
[1] 0.9410748



---
## `split()`

* It takes a vector or other object and splits it into groups determined by a factor or list of factors.

In [61]:
str(split)

function (x, f, drop = FALSE, ...)  


* **x** is a vector (or list) or data frame;
* **f** is a factor (or coerced to one) or a list of factors;
* **drop** indicates whether empty factor levels should be dropped.

In [64]:
x <- c(rnorm(10), runif(10), rnorm(10,1))
f <- gl(3, 10)
print(split(x, f))

$`1`
 [1]  1.72840066  0.26888565  0.10244064 -1.25632116  0.46814248  0.06448892
 [7] -1.13188332 -3.12751256 -0.77302421  0.48145892

$`2`
 [1] 0.4907943 0.8835422 0.2051925 0.3416405 0.5532782 0.1433677 0.5975480
 [8] 0.1990694 0.7812303 0.2547684

$`3`
 [1]  0.4398967  1.3803923  0.5416893 -0.7331889  0.4167218  1.5948983
 [7] -1.7745058 -0.4972197  2.7039631  2.0676326



* A common idiom is `split()` followed by an `lapply()`

In [65]:
lapply(split(x,f), mean)
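
For a single factor and a scalar summary, this idiom gives the same group means as `tapply()`. The sketch below checks that on simulated data; the seed and variable names are illustrative assumptions.

```r
set.seed(1)  # illustrative seed
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f <- gl(3, 10)

# split() then sapply(): a named numeric vector of group means.
by_split <- sapply(split(x, f), mean)

# tapply(): the same means, returned as a one-dimensional array.
by_tapply <- tapply(x, f, mean)

print(all.equal(unname(by_split), as.vector(by_tapply)))
print(identical(names(by_split), names(by_tapply)))
```

The `split()` form is more general, since the pieces can be data frames or can be passed to functions that return more than one value.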

## - Splitting a data frame

In [66]:
library(datasets)
head(airquality)

Ozone,Solar.R,Wind,Temp,Month,Day
41.0,190.0,7.4,67,5,1
36.0,118.0,8.0,72,5,2
12.0,149.0,12.6,74,5,3
18.0,313.0,11.5,62,5,4
,,14.3,56,5,5
28.0,,14.9,66,5,6


In [68]:
s <- split(airquality, airquality$Month)
print(lapply(s, function(x) colMeans(x[,c("Ozone", "Solar.R", "Wind")])))

$`5`
   Ozone  Solar.R     Wind 
      NA       NA 11.62258 

$`6`
    Ozone   Solar.R      Wind 
       NA 190.16667  10.26667 

$`7`
     Ozone    Solar.R       Wind 
        NA 216.483871   8.941935 

$`8`
   Ozone  Solar.R     Wind 
      NA       NA 8.793548 

$`9`
   Ozone  Solar.R     Wind 
      NA 167.4333  10.1800 



In [69]:
print(sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")])))

               5         6          7        8        9
Ozone         NA        NA         NA       NA       NA
Solar.R       NA 190.16667 216.483871       NA 167.4333
Wind    11.62258  10.26667   8.941935 8.793548  10.1800


In [70]:
print(sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm=TRUE)))

                5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000


## Splitting on more than one level

In [72]:
x <- rnorm(10)
f1 <- gl(2,5)
f2 <- gl(5,2)
print(f1)
print(f2)
print(interaction(f1,f2))

 [1] 1 1 1 1 1 2 2 2 2 2
Levels: 1 2
 [1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
 [1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5
Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4 1.5 2.5


* Interactions can create empty levels

In [73]:
str(split(x, list(f1, f2)))

List of 10
 $ 1.1: num [1:2] 0.00884 -0.12451
 $ 2.1: num(0) 
 $ 1.2: num [1:2] 0.168 1.677
 $ 2.2: num(0) 
 $ 1.3: num 0.362
 $ 2.3: num -0.781
 $ 1.4: num(0) 
 $ 2.4: num [1:2] -2.36 1.62
 $ 1.5: num(0) 
 $ 2.5: num [1:2] -2.611 -0.662


* Empty levels can be dropped

In [74]:
str(split(x, list(f1, f2), drop=TRUE))

List of 6
 $ 1.1: num [1:2] 0.00884 -0.12451
 $ 1.2: num [1:2] 0.168 1.677
 $ 1.3: num 0.362
 $ 2.3: num -0.781
 $ 2.4: num [1:2] -2.36 1.62
 $ 2.5: num [1:2] -2.611 -0.662


---
# Debugging tools - diagnosing the problem

* Indications that something's not right:


1. `message`: A generic notification/diagnostic message produced by the `message` function, execution of the function continues;
2. `warning`: An indication that something is wrong but not necessarily fatal (generated by the `warning` function), execution of the function continues;
3. `error`: An indication that a fatal problem has occurred (produced by the `stop` function), execution stops;
4. `condition`: A generic concept for indicating that something unexpected can occur, programmers can create their own conditions.
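
The four kinds of signal can be sketched in one place. The function `check_value()` and the class name `myCondition` are hypothetical names introduced only for illustration; `tryCatch()` is used here to capture each condition's message instead of letting it reach the console.

```r
# Hypothetical function that can emit a message, a warning, or an error.
check_value <- function(x) {
    message("checking x")                              # message: execution continues
    if (!is.numeric(x)) stop("x must be numeric")      # error: execution stops
    if (x < 0) warning("x is negative, using abs(x)")  # warning: execution continues
    sqrt(abs(x))
}

# Normal call: only the message is emitted, and the value is returned.
res_ok <- check_value(4)

# tryCatch() intercepts a condition by class and returns the handler's value.
res_warn <- tryCatch(check_value(-4), warning = function(w) conditionMessage(w))
res_err  <- tryCatch(check_value("a"), error = function(e) conditionMessage(e))

# A custom condition: build a condition object, give it its own class,
# signal it, and catch it by that class.
cond <- simpleCondition("custom signal")
class(cond) <- c("myCondition", "condition")
res_custom <- tryCatch(signalCondition(cond),
                       myCondition = function(c) conditionMessage(c))

print(res_ok)
print(res_warn)
print(res_err)
print(res_custom)
```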

## - Something's wrong!

* Warning

In [75]:
log(-1)

“NaNs produced”

In [77]:
printmessage <- function(x) {
    if(x>0)
        print("x is greater than zero")
    else
        print("x is less than or equal to zero")
    invisible(x)
}

printmessage(1)
printmessage(NA)

[1] "x is greater than zero"


ERROR: Error in if (x > 0) print("x is greater than zero") else print("x is less than or equal to zero"): missing value where TRUE/FALSE needed


In [79]:
printmessage2 <- function(x) {
    if(is.na(x))
        print("x is a missing value!")
    else if(x>0)
        print("x is greater than zero")
    else
        print("x is less than or equal to zero")
    invisible(x)
}

x <- log(-1)
printmessage2(x)

“NaNs produced”

[1] "x is a missing value!"


* How do you know that something is wrong with your function?


1. What was your input? How did you call the function?
2. What were you expecting? Output, messages, other results?
3. What did you get?
4. How does what you get differ from what you were expecting?
5. Were your expectations correct in the first place?
6. Can you reproduce the problem (exactly?)

---
# Debugging tools - basic tools

* `traceback`: prints out the function call stack after an error occurs, does nothing if there's no error;
* `debug`: flags a function for **debug** mode, which allows you to step through execution of a function one line at a time;
* `browser`: suspends the execution of a function wherever it is called and puts the function in debug mode;
* `trace`: allows you to insert debugging code into a function in specific places;
* `recover`: allows you to modify the error behavior so that you can browse the function call stack.


> These are interactive tools specifically designed to allow you to pick through a function;

> There's also the more blunt technique of inserting print/cat statements in the function.
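
The print/cat approach can be sketched as below. The function `scale_to_unit()` and its `verbose` switch are hypothetical, introduced only to show where such statements might go; they are not part of any package.

```r
# Hypothetical function with optional cat() statements for debugging.
scale_to_unit <- function(v, verbose = FALSE) {
    rng <- range(v, na.rm = TRUE)
    if (verbose) cat("range:", rng, "\n")    # inspect an intermediate value
    width <- rng[2] - rng[1]
    if (verbose) cat("width:", width, "\n")  # confirm the divisor is not zero
    (v - rng[1]) / width
}

out <- scale_to_unit(c(2, 4, 6), verbose = TRUE)
print(out)
```

Guarding the statements with a flag keeps the function quiet in normal use while leaving the diagnostics available when something looks wrong.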

---
## traceback


In [83]:
mean(o)

traceback()

ERROR: Error in mean(o): object 'o' not found


In [86]:
lm(y-x)
traceback()

ERROR: Error in stats::model.frame(formula = y - x, drop.unused.levels = TRUE): object 'y' not found


## debug

In [87]:
debug(lm)
lm(y-x)

debugging in: lm(y - x)
debug: {
    ret.x <- x
    ret.y <- y
    cl <- match.call()
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "subset", "weights", "na.action", 
        "offset"), names(mf), 0L)
    mf <- mf[c(1L, m)]
    mf$drop.unused.levels <- TRUE
    mf[[1L]] <- quote(stats::model.frame)
    mf <- eval(mf, parent.frame())
    if (method == "model.frame") 
        return(mf)
    else if (method != "qr") 
        warning(gettextf("method = '%s' is not supported. Using 'qr'", 
            method), domain = NA)
    mt <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- as.vector(model.weights(mf))
    if (!is.null(w) && !is.numeric(w)) 
        stop("'weights' must be a numeric vector")
    offset <- as.vector(model.offset(mf))
    if (!is.null(offset)) {
        if (length(offset) != NROW(y)) 
            stop(gettextf("number of offsets is %d, should equal %d (number of observations)", 
                length(offset), NROW(y)), domain = NA)
    }
    if (is.empty.model(mt)) {
        x <- NULL
     

ERROR: Error in stats::model.frame(formula = y - x, drop.unused.levels = TRUE): object 'y' not found


## recover


In [89]:
options(error=recover)
read.csv("nosuchfile")

“cannot open file 'nosuchfile': No such file or directory”

ERROR: Error in file(file, "rt"): cannot open the connection


## Debugging - summary

* There are three main indications of a problem/condition: `message`, `warning` and `error` (only `error` is fatal);
* When analyzing a function with a problem, make sure you can reproduce the problem, clearly state your expectations and how the output differs from your expectation;
* Interactive debugging tools `traceback`, `debug`, `browser`, `trace` and `recover` can be used to find problematic code in functions;
* Debugging tools are not a substitute for thinking!