# <div style="text-align: right"> Chapter __15__</div>

# __Functions__

In [53]:
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ dplyr   1.0.1
✔ tibble  3.0.3     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0
✔ purrr   0.3.4     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



In [2]:
# config
repr_html.tbl_df <- function(obj, ..., rows = 6) repr:::repr_html.data.frame(obj, ..., rows = rows)
options(dplyr.summarise.inform = FALSE)

If you have a family of functions that do similar things, make sure
they have consistent names and arguments. Use a common prefix to
indicate that they are connected. That’s better than a common suffix
because autocomplete allows you to type the prefix and see all the
members of the family:

```r
# good
input_select()
input_checkbox()
input_text()

# not so good
select_input()
checkbox_input()
text_input()
```

## Conditional Execution

In [3]:
# simple function that uses an if statement
has_name <- function(x) {
    nms <- names(x)
    if (is.null(nms)) {
        rep(FALSE, length(x))
    } else {
        !is.na(nms) & nms != ''
    }
}

This function takes advantage of the standard return rule: a function
returns the last value that it computed. Here that is either one of the
two branches of the if statement.

### Conditions

The condition must evaluate to either `TRUE` or `FALSE`. If it’s a vector,
you’ll get a warning message (an error as of R 4.2.0); if it’s an `NA`,
you’ll get an error. Watch out for these messages in your own code:

In [4]:
if (c(TRUE, FALSE)) {}

“the condition has length > 1 and only the first element will be used”


NULL

In [5]:
if (NA) {}

ERROR: Error in if (NA) {: missing value where TRUE/FALSE needed


You can use `||` (or) and `&&` (and) to combine multiple logical
expressions. These operators are “short-circuiting”: as soon as `||`
sees the first `TRUE` it returns `TRUE` without computing anything else.
As soon as `&&` sees the first `FALSE` it returns `FALSE`. You should never
use `|` or `&` in an `if` statement: these are vectorized operations that
apply to multiple values (that’s why you use them in `filter()`). If
you do have a logical vector, you can use `any()` or `all()` to collapse
it to a single value.
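A minimal sketch of both points, in base R only. The variable names (`or_result`, `and_result`, `x`) are made up for illustration; the `stop()` calls are there only to show that the right-hand side is never evaluated.

```r
# `||` stops at the first TRUE, so the stop() on the right never runs.
or_result <- TRUE || stop("never reached")

# `&&` stops at the first FALSE, for the same reason.
and_result <- FALSE && stop("never reached")

# Collapse a logical vector to a single value before using it in `if`.
x <- c(1, 5, 10)
if (any(x > 8)) {
    message("at least one value exceeds 8")
}
if (all(x > 0)) {
    message("every value is positive")
}
```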

Be careful when testing for equality. `==` is vectorized, which means
that it’s easy to get more than one output. Either check the length is
already 1, collapse with `all()` or `any()`, or use the nonvectorized
`identical()`. `identical()` is very strict: it always returns either a
single `TRUE` or a single `FALSE`, and doesn’t coerce types. This means
that you need to be careful when comparing integers and doubles:

In [6]:
identical(0L, 0)

You also need to be wary of floating-point numbers:

In [7]:
x <- sqrt(2) ^ 2
x

In [8]:
x == 2

In [9]:
x - 2

Instead use `dplyr::near()` for comparisons:

In [10]:
dplyr::near(x, 2)
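If dplyr isn’t available, the same idea can be sketched in base R: compare the absolute difference against a small tolerance. The tolerance below, `sqrt(.Machine$double.eps)`, is my choice for illustration (it matches the default I’d expect from `near()`), and `tol` is a made-up name.

```r
x <- sqrt(2) ^ 2

# Tolerance for "close enough"; an assumption, adjust to your problem.
tol <- sqrt(.Machine$double.eps)

exact_equal <- x == 2                    # exact comparison fails
within_tol <- abs(x - 2) < tol           # tolerance-based comparison succeeds
all_equal <- isTRUE(all.equal(x, 2))     # base R helper with its own tolerance
```

Wrapping `all.equal()` in `isTRUE()` matters, because on a mismatch it returns a description of the difference rather than `FALSE`.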

__Exercise__

Write a greeting function that says “good morning”, “good afternoon”, or “good evening”, depending on the time of day. (Hint: use a time argument that defaults to lubridate::now(). That will make it easier to test your function.)

In [13]:
greeting <- function(time = lubridate::now()) {
    hr <- lubridate::hour(time)
    if (hr < 12) {
        print('good morning')
    } else if (hr < 17) {
        print('good afternoon')
    } else {
        print('good evening')
    }
}

greeting()

[1] "good evening"


__Exercise__

Implement a fizzbuzz function. It takes a single number as
input. If the number is divisible by three, it returns “fizz”. If it’s
divisible by five it returns “buzz”. If it’s divisible by three and
five, it returns “fizzbuzz”. Otherwise, it returns the number.
Make sure you first write working code before you create the
function.

In [15]:
fizzbuzz <- function(x) {
    # these two lines check that x is a valid input
    stopifnot(length(x) == 1)
    stopifnot(is.numeric(x))
    if (!(x %% 3) && !(x %% 5)) {
        return('fizzbuzz')
    } else if (!(x %% 3)) {
        return('fizz')
    } else if (!(x %% 5)) {
        'buzz'
    } else {
        # ensure that the function returns a char vector
        return(as.character(x))
    }
}

In [17]:
(fizzbuzz(6))
(fizzbuzz(10))
(fizzbuzz(15))
(fizzbuzz(2))

Instead of only accepting one number as an input, we could write a FizzBuzz function that works on a vector. The `case_when()` function vectorizes multiple if-else conditions, so it is perfect for this task. In fact, fizz-buzz is used in the examples in the documentation of `case_when()`.

In [20]:
vect_fizzbuzz <- function(x) {
    dplyr::case_when(
        !(x %% 3) & !(x %% 5) ~ 'fizzbuzz',
        !(x %% 3) ~ 'fizz',
        !(x %% 5) ~ 'buzz',
        TRUE ~ as.character(x)
    )
}

In [21]:
vect_fizzbuzz(c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 14, 15))

## Function Arguments

In [22]:
# compute confidence interval around
# mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
    se <- sd(x) / sqrt(length(x))
    alpha <- 1 - conf
    mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

In [23]:
x <- runif(100)
mean_ci(x)

In [24]:
mean_ci(x, conf = 0.99)

The default value should almost always be the most common value.
The few exceptions to this rule have to do with safety. For example,
it makes sense for na.rm to default to FALSE because missing values
are important. Even though na.rm = TRUE is what you usually put
in your code, it’s a bad idea to silently ignore missing values by
default.
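A small illustration of why the safe default matters, using only base R’s `mean()`. The vector `x` is made up for the example.

```r
x <- c(1, NA, 3)

# With the default na.rm = FALSE, the missing value propagates,
# which makes the problem visible instead of hiding it.
default_mean <- mean(x)

# Dropping missing values has to be an explicit decision.
clean_mean <- mean(x, na.rm = TRUE)
```

Here `default_mean` is `NA` and `clean_mean` is 2, so a missing value can’t slip through unnoticed.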

When you call a function, you typically omit the names of the data
arguments, because they are used so commonly. If you override the
default value of a detail argument, you should use the full name:

In [25]:
# Good
mean(1:10, na.rm = TRUE)

In [26]:
# bad
mean(x = 1:10, , FALSE)

In [28]:
# bad
mean(, TRUE, x = c(1:10, NA))

## Choosing Names

The names of the arguments are also important. R doesn’t care, but
the readers of your code (including future-you!) will. Generally you
should prefer longer, more descriptive names, but there are a handful of very common, very short names. It’s worth memorizing these:

* `x`, `y`, `z`: vectors.
* `w`: a vector of weights.
* `df`: a data frame.
* `i`, `j`: numeric indices (typically rows and columns).
* `n`: length, or number of rows.
* `p`: number of columns.

Otherwise, consider matching names of arguments in existing R
functions. For example, use `na.rm` to determine if missing values
should be removed.

## Checking Values
As you start to write more functions, you’ll eventually get to the
point where you don’t remember exactly how your function works.
At this point it’s easy to call your function with invalid inputs. To
avoid this problem, it’s often useful to make constraints explicit. For
example, imagine you’ve written some functions for computing
weighted summary statistics:

In [29]:
wt_mean <- function(x, w) {
    sum(x * w) / sum(w)
}

In [30]:
wt_var <- function(x, w) {
    mu <- wt_mean(x, w)
    sum(w * (x - mu) ^2 ) / sum(w)
}

In [31]:
wt_sd <- function(x, w) {
    sqrt(wt_var(x, w))
}

What happens if `x` and `w` are not the same length?
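One way to find out is to try it. The sketch below uses a stand-alone, unchecked weighted mean; `wt_mean_unchecked()` is a hypothetical name chosen so it doesn’t clash with the functions in this chapter, and it divides by the sum of the weights.

```r
# Unchecked weighted mean: no validation of the inputs at all.
wt_mean_unchecked <- function(x, w) {
    sum(x * w) / sum(w)
}

# x has six elements, w has three: w is silently recycled to
# c(1, 2, 3, 1, 2, 3) in the numerator, but sum(w) only sees three weights.
result <- wt_mean_unchecked(1:6, 1:3)
result
```

The call returns a number (46 / 6, about 7.67) with no warning, because 6 is a multiple of 3.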

In this case, because of R’s vector recycling rules, we don’t get an
error.
It’s good practice to check important preconditions, and throw an
error (with `stop()` ) if they are not true:

In [32]:
wt_mean <- function(x, w) {
    if (length(x) != length(w)) {
        stop('x and w must be the same length', call. = FALSE)
    }
    sum(w * x) / sum(w)
}

In [33]:
wt_mean(1:9, 9:18)

ERROR: Error: x and w must be the same length


In [37]:
wt_mean(1:5, 5:9)

Be careful not to take this too far. There’s a trade-off between how
much time you spend making your function robust, versus how
long you spend writing it. For example, if you also added a na.rm
argument, I probably wouldn’t check it carefully:

In [38]:
wt_mean <- function(x, w, na.rm = FALSE) {
    if (!is.logical(na.rm)) {
        stop('na.rm must be logical')
    }
    if (length(na.rm) != 1) {
        stop('na.rm must be length 1')
    }
    if (length(x) != length(w)) {
        stop('x and w must be the same length', call. = FALSE)
    }
    
    if (na.rm) {
        miss <- is.na(x) | is.na(w)
        x <- x[!miss]
        w <- w[!miss]
    }
    sum(w * x) / sum(w)
}

This is a lot of extra work for little additional gain. A useful
compromise is the built-in `stopifnot()`; it checks that each argument is
`TRUE`, and produces a generic error message if not:

In [39]:
wt_mean <- function(x, w, na.rm = FALSE) {
    stopifnot(is.logical(na.rm), length(na.rm) == 1)
    stopifnot(length(x) == length(w))
    
    if (na.rm) {
        miss <- is.na(x) | is.na(w)
        x <- x[!miss]
        w <- w[!miss]
    }
    sum(w * x) / sum(w)
}

wt_mean(1:6, 6:1, na.rm = 'foo')

ERROR: Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE


Note that when using `stopifnot()` you assert what should be true
rather than checking for what might be wrong.
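The difference in direction is easiest to see side by side. This is only a sketch: `wt_mean_stop()` and `wt_mean_assert()` are hypothetical names for two versions of the same length check.

```r
# Style 1: test for the failure case and call stop() yourself.
wt_mean_stop <- function(x, w) {
    if (length(x) != length(w)) {
        stop('x and w must be the same length', call. = FALSE)
    }
    sum(w * x) / sum(w)
}

# Style 2: state the condition that must hold; stopifnot() raises
# a generic error if it does not.
wt_mean_assert <- function(x, w) {
    stopifnot(length(x) == length(w))
    sum(w * x) / sum(w)
}

# Both accept valid input and give the same answer.
wt_mean_stop(1:3, 3:1)
wt_mean_assert(1:3, 3:1)
```

The condition flips from `!=` in the `stop()` version to `==` in the `stopifnot()` version, which is an easy thing to get backward when converting one to the other.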

## Dot-Dot-Dot (...)

Many functions in R take an arbitrary number of inputs:

In [40]:
sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 0)

In [41]:
stringr::str_c('a', 'b', 'c', 'd', 'e', 'f', 'g')

How do these functions work? They rely on a special argument: `...`
(pronounced dot-dot-dot). This special argument captures any
number of arguments that aren’t otherwise matched.

It’s useful because you can then send those `...` on to another function. This is a useful catch-all if your function primarily wraps
another function. For example, I commonly create these helper
functions that wrap around `str_c()`:

In [42]:
commas <- function(...) stringr::str_c(..., collapse = ', ')
commas(letters[1:10])

In [43]:
rule <- function(..., pad = '-') {
    title <- paste0(...)
    width <- getOption('width') - nchar(title) - 5
    cat(title, ' ', stringr::str_dup(pad, width), '\n', sep = '')
}
rule('Important output')

Important output -----------------------------------------------------------


Here `...` lets me forward on any arguments that I don’t want to deal
with to `str_c()`. It’s a very convenient technique. But it does come
at a price: any misspelled arguments will not raise an error. This
makes it easy for typos to go unnoticed:

In [44]:
x <- c(1, 2)
sum(x, na.mr = TRUE)
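To see what the typo actually does, compare the misspelled call with the correct one. Base R only; the variable names are made up for the comparison.

```r
x <- c(1, 2)

# `na.mr` matches no named argument of sum(), so it lands in `...`
# and TRUE is simply added to the total as 1.
typo_total <- sum(x, na.mr = TRUE)

# The correctly spelled argument is consumed by sum() itself.
correct_total <- sum(x, na.rm = TRUE)
```

`typo_total` is 4 while `correct_total` is 3, and neither call produces a warning.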

If you just want to capture the values of the `...`, use `list(...)`.
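A minimal sketch of that pattern. The helper names `count_args()` and `arg_names()` are hypothetical.

```r
# Capture everything passed through `...` as an ordinary list.
count_args <- function(...) {
    args <- list(...)
    length(args)
}

# Names given at the call site are preserved in the list.
arg_names <- function(...) {
    names(list(...))
}

count_args(1, 'a', TRUE)
arg_names(a = 1, b = 2)
```

Once the arguments are in a list you can inspect, validate, or loop over them like any other list.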

## Lazy Evaluation
Arguments in R are lazily evaluated: they’re not computed until
they’re needed. That means if they’re never used, they’re never
called. This is an important property of R as a programming
language, but is generally not important when you’re writing your own
functions for data analysis.
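A small demonstration, assuming nothing beyond base R. `first_only()` is a hypothetical function that ignores its second argument.

```r
# `y` is never used inside the body, so it is never evaluated.
first_only <- function(x, y) {
    x
}

# The stop() call is passed as `y` but never runs, so no error is raised.
value <- first_only(1, stop('this is never evaluated'))
value
```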

## Writing Pipeable Functions

If you want to write your own pipeable functions, thinking about
the return value is important. There are two main types of pipeable
functions: transformation and side-effect.

In __transformation__ functions, there’s a clear “primary” object that is
passed in as the first argument, and a modified version is returned
by the function. For example, the key objects for dplyr and tidyr are
data frames. If you can identify what the object type is for your
domain, you’ll find that your functions just work with the pipe.

__Side-effect__ functions are primarily called to perform an action, like
drawing a plot or saving a file, not transforming an object. These
functions should “invisibly” return the first argument, so they’re not
printed by default, but can still be used in a pipeline. For example,
this simple function prints out the number of missing values in a
data frame:

In [46]:
show_missings <- function(df) {
    n <- sum(is.na(df))
    cat('Missing values: ', n, '\n', sep = '')
    
    invisible(df)
}

If we call it interactively, the `invisible()` means that the input df
doesn’t get printed out:

In [47]:
show_missings(mtcars)

Missing values: 0


But it’s still there; it’s just not printed by default:

In [48]:
(x <- show_missings(mtcars))
(class(x))
(dim(x))

Missing values: 0


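| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.62 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.9 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.44 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.46 | 20.22 | 1 | 0 | 3 | 1 |
| Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 |
| Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.0 | 1 | 0 | 4 | 2 |
| Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.15 | 22.9 | 1 | 0 | 4 | 2 |
| Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.44 | 18.3 | 1 | 0 | 4 | 4 |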


And we can still use it in a pipe:

In [54]:
mtcars %>%
    show_missings() %>%
    mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>%
    show_missings()

Missing values: 0
Missing values: 18
