# **Lab 9: Functions, vectors and lists**

Derek Hansen

See https://r4ds.had.co.nz/functions.html for a good reference.

In [1]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# Why functions?

Consider the following synthetic data frame. Each column is randomly generated from a different probability distribution.

In [10]:
set.seed(1111)
df <- tibble(x = rnorm(1000),y = rgamma(1000, shape = 1), z = rpois(1000, lambda = 5))
print(df)

# A tibble: 1,000 x 3
         x     y     z
     <dbl> <dbl> <int>
 1 -0.0866 0.171     5
 2  1.32   1.12      2
 3  0.640  0.836     1
 4  1.17   0.374     1
 5  0.116  0.499     3
 6 -2.93   1.94      4
 7  0.678  0.346    11
 8  1.12   0.392     6
 9  1.38   0.542     6
10  1.28   2.72      6
# … with 990 more rows


Suppose we want to normalize each column so that its minimum value is 0 and its maximum value is 1. We could do this manually...

In [15]:
df_norm1 <- mutate(df,x = (x - min(x)) /  (max(x) - min(x)), 
                   y = (y - min(y)) /  (max(y) - min(y)), 
                   z = (z - min(z)) /  (max(z) - min(z)))
print(df_norm1)

# A tibble: 1,000 x 3
         x      y      z
     <dbl>  <dbl>  <dbl>
 1 0.483   0.0212 0.385 
 2 0.721   0.139  0.154 
 3 0.606   0.103  0.0769
 4 0.696   0.0463 0.0769
 5 0.517   0.0617 0.231 
 6 0.00277 0.241  0.308 
 7 0.612   0.0428 0.846 
 8 0.687   0.0485 0.462 
 9 0.732   0.0670 0.462 
10 0.715   0.336  0.462 
# … with 990 more rows


What are the problems with this approach?
-  Doesn't scale well if I have hundreds of columns.
-  I copy-pasted the same code for each column, which could lead to errors.
-  If I decide later to normalize in a different way, I need to make changes in three places.

## Anatomy of a function

From R for Data Science: 

>Writing a function has three big advantages over using copy-and-paste:
>1.  You can give a function an evocative name that makes your code easier to understand.
>1.  As requirements change, you only need to update code in one place, instead of many.
>1.  You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).


You've been using functions this whole time, but have not needed to write your own. 
You can think of a function as having three main ingredients:
1.  Input: The variables passed to the function
2.  Body: The code block that runs
3.  Output: What the function returns

For example:

In [None]:
calc_power3 <- function(x) {
    y <- x^3
    return(y)
}

Here, ```x``` is the input, 
```y <- x^3``` is the body, and ```return(y)``` specifies ```y``` as the output.

In [19]:
print(calc_power3(2))

[1] 8


We can re-write our earlier expression as a function:

In [None]:
rescale01 <- function(x) {
    (x - min(x)) / (max(x) - min(x))
}

In [25]:
rescale01(1:10)
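
One limitation worth knowing: `min()` and `max()` return `NA` when the input contains a missing value, so `rescale01` would then return all `NA`. Below is a minimal sketch of a more forgiving variant, assuming we simply want to ignore missing values. The name `rescale01_na` is made up for this illustration.

```r
# Hypothetical variant of rescale01 that skips missing values
# when computing the minimum and maximum.
rescale01_na <- function(x) {
  rng <- range(x, na.rm = TRUE)  # c(min, max), ignoring NA
  (x - rng[1]) / (rng[2] - rng[1])
}

# The NA stays NA; the other entries are rescaled to [0, 1].
rescale01_na(c(0, 5, NA, 10))
```

Because the logic lives in one function, a change like this only has to be made once.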

Using ```rescale01``` leads to simpler code in our dplyr example:

In [28]:
df_norm2 <- mutate(df,x = rescale01(x), 
                   y = rescale01(y), 
                   z = rescale01(z))
print(df_norm2)

# A tibble: 1,000 x 3
         x      y      z
     <dbl>  <dbl>  <dbl>
 1 0.483   0.0212 0.385 
 2 0.721   0.139  0.154 
 3 0.606   0.103  0.0769
 4 0.696   0.0463 0.0769
 5 0.517   0.0617 0.231 
 6 0.00277 0.241  0.308 
 7 0.612   0.0428 0.846 
 8 0.687   0.0485 0.462 
 9 0.732   0.0670 0.462 
10 0.715   0.336  0.462 
# … with 990 more rows


However, we can still do better.  ```across``` from ```dplyr``` allows us to apply the same function to multiple columns at once.

In [39]:
df_norm3 <- mutate(df, across(everything(), .fns = rescale01, .names = "{.col}_rescaled"))
print(df_norm3)

# A tibble: 1,000 x 6
         x     y     z x_rescaled y_rescaled z_rescaled
     <dbl> <dbl> <int>      <dbl>      <dbl>      <dbl>
 1 -0.0866 0.171     5    0.483       0.0212     0.385 
 2  1.32   1.12      2    0.721       0.139      0.154 
 3  0.640  0.836     1    0.606       0.103      0.0769
 4  1.17   0.374     1    0.696       0.0463     0.0769
 5  0.116  0.499     3    0.517       0.0617     0.231 
 6 -2.93   1.94      4    0.00277     0.241      0.308 
 7  0.678  0.346    11    0.612       0.0428     0.846 
[38;5;250m 8[39m  1.12   0.392     6    0.6

This is powerful when you want to apply multiple functions to multiple columns at once. Suppose we want to "standardize" our variables by subtracting the mean and dividing by standard deviation.

In [40]:
standardize <- function(x) {
    y <- (x - mean(x))/sd(x)
    return(y)
}

In [45]:
df_norm4 <- mutate(df, 
                   across(everything(), 
                          .fns = list(rescaled=rescale01, standardized=standardize), 
                          .names = "{.col}_{.fn}"))
print(df_norm4)
print(select(df_norm4, -x,-y,-z))

# A tibble: 1,000 x 9
         x     y     z x_rescaled x_standardized y_rescaled y_standardized
     <dbl> <dbl> <int>      <dbl>          <dbl>      <dbl>          <dbl>
 1 -0.0866 0.171     5    0.483          -0.0608     0.0212        -0.867 
 2  1.32   1.12      2    0.721           1.36       0.139          0.0991
 3  0.640  0.836     1    0.606           0.671      0.103         -0.191 
 4  1.17   0.374     1    0.696           1.21       0.0463        -0.661 
 5  0.116  0.499     3    0.517           
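
To build intuition for what `across` does with a named list of functions, here is a rough base R sketch that applies every function to every column and names the results `{column}_{function}`. It is only a conceptual illustration on a made-up data frame (`toy`), not how dplyr implements `across`.

```r
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))
standardize <- function(x) (x - mean(x)) / sd(x)

# Small stand-in data frame with made-up values (not the lab's df).
toy <- data.frame(x = c(1, 2, 3), y = c(10, 30, 20))
fns <- list(rescaled = rescale01, standardized = standardize)

out <- toy
for (col in names(toy)) {
  for (fn in names(fns)) {
    # Look the function up by name, call it on one column,
    # and store the result under a combined name such as "x_rescaled".
    out[[paste0(col, "_", fn)]] <- fns[[fn]](toy[[col]])
  }
}
names(out)
```

The point is that functions are ordinary objects in R: they can be stored in a list, looked up by name, and called later.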

## Conditions

### Common Pitfalls

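The condition part of the if statement must evaluate to a single TRUE or FALSE. If it has length greater than one, R versions before 4.2.0 use only the first element and give the warning shown here; R 4.2.0 and later raise an error instead: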

In [46]:
if (c(T, F)) { 
    1 
}

“the condition has length > 1 and only the first element will be used”


Similarly, a condition of NA will generate an error:

In [5]:
if (NA) {
    1 
}

ERROR: Error in if (NA) {: missing value where TRUE/FALSE needed
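
One way to guard against a missing condition is to wrap it in `isTRUE()`, which returns `TRUE` only for a single, non-missing `TRUE`. A minimal sketch, assuming a missing value should be handled by the `else` branch:

```r
x <- NA

# x > 0 evaluates to NA here, which `if` cannot handle on its own.
# isTRUE() turns anything that is not a single TRUE into FALSE.
if (isTRUE(x > 0)) {
  result <- "positive"
} else {
  result <- "not positive or missing"
}
print(result)
```

Whether missing values should be treated as false depends on the problem; sometimes stopping with an error is the safer choice.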


### Logical operators and "short-circuiting"

Often you will need to combine multiple logical conditions in an if statement. To do this we have the `&&` and `||` operators, which compute the logical AND and OR, respectively, of several logical conditions:

In [6]:
TRUE && FALSE && TRUE

In [7]:
FALSE || TRUE || FALSE

There is a subtle but important difference between the single and double versions of these operators. The single `&` performs entrywise AND over logical vectors:

In [8]:
c(T, T, F) & c(F, T, F)

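In contrast, the double ampersand `&&` is meant for single logical values. In the R version used here, it looks only at the first element of each operand and ignores the rest, so this returns `F` because the first element on the right is `F`:

In [9]:
c(T, T, T) && c(F, T, F)

It returns `T` whenever both first elements are `T`, whatever the remaining entries are. Since R 4.3.0, passing a vector of length greater than one to `&&` or `||` is an error:

In [10]:
c(T, T, T) && c(T, T, T)

The double operators also "short-circuit": they evaluate their operands from left to right and stop as soon as the result is known. For `&&`, one false value is enough to make the whole expression false, so anything to its right is never evaluated: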

In [11]:
f = function() { print("f called"); F }
g = function() { print("g called"); T }
f() && g()

g() && f()

[1] "f called"


[1] "g called"
[1] "f called"


The `||` operator works similarly, stopping at the first true value:

In [12]:
g() || f()

f() || g()

[1] "g called"


[1] "f called"
[1] "g called"
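
Short-circuiting is useful for guarding a check that would fail on bad input. In the sketch below (the function name `is_positive_number` is made up for illustration), the comparison `x > 0` runs only if all the checks to its left have passed:

```r
# Hypothetical helper: TRUE only for a single, non-missing, positive number.
is_positive_number <- function(x) {
  # Each check runs only if everything to its left was TRUE,
  # so x > 0 is never evaluated for NULL, NA, vectors, or strings.
  is.numeric(x) && length(x) == 1 && !is.na(x) && x > 0
}

is_positive_number(3)
is_positive_number(NULL)
is_positive_number(NA_real_)
is_positive_number("a")
```

Without the guards, `NULL > 0` has length zero and `NA > 0` is `NA`, and either one makes an `if` statement fail.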


### Testing for equality

Be careful when testing for equality in conditionals. The `==` operator compares element by element and returns a vector of logicals. If you want to check that any or all entries of a vector are TRUE, wrap the comparison in `any()` or `all()`:

In [13]:
v1 = c(1, 2, 3)
v2 = c(1, 1, 2)
if (v1 == v2) { print("Wrong!") }
if (all(v1 == v2)) { print("All!") }
if (any(v1 == v2)) { print("Any!") }

“the condition has length > 1 and only the first element will be used”

[1] "Wrong!"
[1] "Any!"


Also be wary of testing floating point numbers for equality:

In [14]:
2 == sqrt(2) ^ 2

In [15]:
sqrt(2) ^ 2

If you need to do this, use the `near()` function from dplyr instead, which compares within a small tolerance:

In [16]:
near(2, sqrt(2) ^ 2)
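
The same idea can be sketched in base R. The tolerance `1e-8` below is an arbitrary choice for illustration; `all.equal()` is the base R function for approximate comparison, and it should be wrapped in `isTRUE()` when used in a condition because it returns a character description rather than `FALSE` when the values differ.

```r
a <- sqrt(2)^2

exact    <- a == 2                  # exact comparison, affected by rounding error
by_hand  <- abs(a - 2) < 1e-8       # compare within a chosen tolerance
with_all <- isTRUE(all.equal(a, 2)) # base R approximate comparison

print(c(exact = exact, by_hand = by_hand, with_all = with_all))
```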

### Multiple conditions

Sometimes you will want to check multiple conditions using an if statement. For example, let's define the function:
$$
\operatorname{sign}(x)=\begin{cases}
-1, & x<0\\
0, & x=0\\
1, & x>0
\end{cases}$$

The general form is

```
if (this) {
   do that
} else if (that) {
   do something else
} else {
   
}
```
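
Applied to the sign function, the general form might look like the following sketch. It is named `my_sign` here (a made-up name) to avoid masking base R's built-in `sign()`, and it assumes `x` is a single non-missing number.

```r
# Sketch of the sign function using if / else if / else.
# Assumes x is a single, non-missing number.
my_sign <- function(x) {
  if (x < 0) {
    -1
  } else if (x == 0) {
    0
  } else {
    1
  }
}

my_sign(-3)
my_sign(0)
my_sign(2.5)
```

The conditions are checked in order, and only the first branch whose condition is true runs.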

## Function arguments

Functions can take multiple arguments. Generally they fall into one of two categories:

*   Data to be processed by the function, and
*   Options, which affect how the data gets processed.


### Rules for function arguments

Generally:

*   The data parameters should come first; and
*   The options should come second, and have sensible defaults.

Default parameter values are specified by the option=default notation:

In [None]:
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}


When you call a function, you can omit any argument that has a default value. To override a default, name the argument and give the new value after an `=`:



```
mean_ci(c(1, 2, 3, 4), conf = .99)  # yes: the overridden option is named
mean_ci(c(1, 2, 3, 4), .99)         # no: it runs, but the meaning of .99 is unclear

```
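
As a quick check of how defaults behave, the sketch below repeats the `mean_ci` definition so that it runs on its own, then calls it with and without the `conf` option. The input vector is made up.

```r
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- c(1, 2, 3, 4)

ci95 <- mean_ci(x)               # conf falls back to its default, 0.95
ci99 <- mean_ci(x, conf = 0.99)  # default overridden by name

print(ci95)
print(ci99)
```

The 99% interval is wider than the 95% one, and both are centered on `mean(x)`.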

## Validation

When writing functions it's a good idea to validate the input -- that is, make sure it matches your assumptions about what is being passed to the function. Consider the following function which returns the weighted average of a vector:

In [17]:
w_mean = function(x, w) {
    sum(x * w) / sum(w)
}

This function relies implicitly on the fact that the weight vector `w` is the same length as the input vector `x`. If it's not, you'll get a warning and unexpected behavior.

In [18]:
w_mean(c(1,2,3), w=c(1, 2))

“longer object length is not a multiple of shorter object length”


It's best to make the assumption of equal length explicit by checking it:

In [19]:
w_mean = function(x, w) {
    stopifnot(length(w) == length(x))
    sum(x * w) / sum(w)
}

In [20]:
w_mean(c(1,2,3), w=c(1, 2))

ERROR: Error in w_mean(c(1, 2, 3), w = c(1, 2)): length(w) == length(x) is not TRUE
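
The message produced by `stopifnot()` is terse. An alternative is an explicit `if` with `stop()`, which lets you write the message yourself. A minimal sketch, with a hypothetical name `w_mean2` and a message wording of our own choosing:

```r
# Hypothetical variant of w_mean with a hand-written error message.
w_mean2 <- function(x, w) {
  if (length(x) != length(w)) {
    # call. = FALSE leaves the function call out of the error message.
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(x * w) / sum(w)
}

w_mean2(c(1, 2, 3), w = c(1, 1, 2))
```

The trade-off is more code per check, so `stopifnot()` remains a reasonable choice when a short message is enough.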


## ...

Some functions are designed to take a variable number of inputs. We saw this for example with the str_c function:

In [21]:
stringr::str_c("a", "b")
stringr::str_c("a", "b", "c", "d")

To construct a function that takes a variable number of arguments we use the `...` notation:

```
f = function(...) {
    <do something with variable arguments>
}

```
One thing you can do with `...` is pass it on to another function:

In [22]:
commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
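
Another option is to capture the arguments with `list(...)` and work with them like any other list. A small sketch, using the made-up function name `count_args`:

```r
# Hypothetical helper: report how many arguments were passed.
count_args <- function(...) {
  args <- list(...)  # collect the variable arguments into a list
  length(args)
}

count_args("a", "b", "c")
count_args()
```

Note that `list(...)` evaluates every argument, so this is best suited to cases where all the inputs are ordinary values.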