# Robust functions

## What do these calls do?

```
df[, vars]

subset(df, x == y)

data.frame(x = "a")
```

* Interactive analysis: Helpful
    - Iterate as quickly as possible and check the result as you go
    - The functions can guess what you want, and it's no big deal if they guess wrong.
* Programming: Strict
    - The functions for programming should be robust since you're not working with them interactively
    
## Three main problems
    
* Type-unstable functions
* Non-standard evaluation
* Hidden arguments - Most notorious

## Throwing errors

```
x <- 1:10

stopifnot(is.character(x)) # Throws an error message if the condition isn't met.

# Some template
if (condition) {
    stop("Error", call. = FALSE)
}

# Some example
if (!is.character(x)) {
    stop("`x` should be a character vector", call. = FALSE)
}
```
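
The effect of `call. = FALSE` can be checked directly. The sketch below uses two hypothetical helpers, `f_with_call()` and `f_no_call()`, which are not part of these notes. It only illustrates that the flag drops the calling function from the error condition, which tends to give end users a shorter message.

```r
# Hypothetical helpers, for illustration only
f_with_call <- function() stop("boom")
f_no_call   <- function() stop("boom", call. = FALSE)

# Capture the call stored in each error condition instead of halting
call_kept    <- tryCatch(f_with_call(), error = function(e) conditionCall(e))
call_dropped <- tryCatch(f_no_call(),   error = function(e) conditionCall(e))

is.null(call_kept)     # FALSE: the condition records f_with_call()
is.null(call_dropped)  # TRUE: no call is attached to the condition
```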

### An error is better than a surprise

In [1]:
# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
  # Add stopifnot() to check length of x and y
  stopifnot(length(x) == length(y))
  
  sum(is.na(x) & is.na(y))
}

# Call both_na() on x and y
both_na(x, y)

ERROR: Error: length(x) == length(y) is not TRUE


### An informative error is even better

In [2]:
# Define troublesome x and y
x <- c(NA, NA, NA)
y <- c( 1, NA, NA, NA)

both_na <- function(x, y) {
  # Replace condition with logical
  if (!(length(x) == length(y))) {
    # Replace "Error" with better message
    stop("x and y must have the same length", call. = FALSE)
  }  
  
  sum(is.na(x) & is.na(y))
}

# Call both_na() 
both_na(x, y)

ERROR: Error: x and y must have the same length
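
When a script should keep running after such a failure, the error can be caught rather than left to halt execution. The following is a minimal sketch, assuming base R's `tryCatch()` and reusing `both_na()` from above. The names `ok` and `msg` are introduced here for illustration.

```r
both_na <- function(x, y) {
  if (length(x) != length(y)) {
    stop("x and y must have the same length", call. = FALSE)
  }
  sum(is.na(x) & is.na(y))
}

# Equal lengths: counts the positions where both vectors are NA
ok <- both_na(c(NA, 1, NA), c(NA, NA, 2))

# Unequal lengths: capture the message instead of stopping the script
msg <- tryCatch(
  both_na(c(NA, NA, NA), c(1, NA, NA, NA)),
  error = function(e) conditionMessage(e)
)
```

Here `ok` is 1 (only the first position is `NA` in both vectors) and `msg` holds the informative message.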


## Unstable types

### Surprises due to unstable types

* Type-inconsistent: the type of the return object depends on the input
* Surprises occur when you've used a type-inconsistent function inside your own function

### What will df[1, ] return?

```
# Sometimes you get a data frame
df <- data.frame(z = 1:3, y = 2:4)
str(df[1, ]) # Will return a data frame

# And sometimes you get a vector
df <- data.frame(z = 1:3)
str(df[1, ]) # Will return a vector
```

### [ is a common source of surprises

```
last_row <- function(df) {
  df[nrow(df), ]
}
df <- data.frame(x = 1:3)

# Not a row, just a vector
str(last_row(df))
```

### Two common solutions for [

1. Use drop = FALSE: df[x, , drop = FALSE]
```
last_row <- function(df) {
  df[nrow(df), , drop = FALSE]
}
```
2. Subset the data frame like a list: df[x]
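
The two solutions can be compared side by side on a one-column data frame. This is a small sketch. The variable names `as_vector`, `as_row` and `as_column` are chosen here for illustration.

```r
df <- data.frame(x = 1:3)

# Default behaviour: a single column is simplified to a plain vector
as_vector <- df[nrow(df), ]

# Solution 1: drop = FALSE keeps a one-row data frame
as_row <- df[nrow(df), , drop = FALSE]

# Solution 2: list-style subsetting (one index, no comma) keeps a data frame
as_column <- df["x"]

str(as_vector)
str(as_row)
str(as_column)
```

Note that list-style subsetting selects columns, not rows, so it is a fix for column selection rather than a drop-in replacement for `last_row()`.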

### What to do?

* Write your own functions to be type-stable
* Learn the common type-inconsistent functions in R:
    - [
    - sapply
* Avoid using type-inconsistent functions inside your own functions
* Build a vocabulary of type-consistent functions

### sapply is another common culprit

In [5]:
df <- data.frame(
  a = 1L,
  b = 1.5,
  y = Sys.time(),
  z = ordered(1)
)

A <- sapply(df[1:4], class) 
B <- sapply(df[3:4], class)

A
B

y,z
POSIXct,ordered
POSIXt,factor


### Using purrr solves the problem

In [6]:
library(purrr)

# sapply calls
A <- sapply(df[1:4], class) 
B <- sapply(df[3:4], class)
C <- sapply(df[1:2], class) 

# Demonstrate type inconsistency
str(A)
str(B)
str(C)

# Use map() to define X, Y and Z
X <- map(df[1:4], class)
Y <- map(df[3:4], class)
Z <- map(df[1:2], class)

# Use str() to check type consistency
str(X)
str(Y)
str(Z)

List of 4
 $ a: chr "integer"
 $ b: chr "numeric"
 $ y: chr [1:2] "POSIXct" "POSIXt"
 $ z: chr [1:2] "ordered" "factor"
 chr [1:2, 1:2] "POSIXct" "POSIXt" "ordered" "factor"
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:2] "y" "z"
 Named chr [1:2] "integer" "numeric"
 - attr(*, "names")= chr [1:2] "a" "b"
List of 4
 $ a: chr "integer"
 $ b: chr "numeric"
 $ y: chr [1:2] "POSIXct" "POSIXt"
 $ z: chr [1:2] "ordered" "factor"
List of 2
 $ y: chr [1:2] "POSIXct" "POSIXt"
 $ z: chr [1:2] "ordered" "factor"
List of 2
 $ a: chr "integer"
 $ b: chr "numeric"


### A type consistent solution

In [7]:
col_classes <- function(df) {
  # Assign list output to class_list
  # map(df, class)
  class_list <- map(df, class)
  
  # Use map_chr() to extract first element in class_list
  map_chr(class_list, 1)
}

# Check that our new function is type consistent
df %>% col_classes() %>% str()
df[3:4] %>% col_classes() %>% str()
df[1:2] %>% col_classes() %>% str()

 Named chr [1:4] "integer" "numeric" "POSIXct" "ordered"
 - attr(*, "names")= chr [1:4] "a" "b" "y" "z"
 Named chr [1:2] "POSIXct" "ordered"
 - attr(*, "names")= chr [1:2] "y" "z"
 Named chr [1:2] "integer" "numeric"
 - attr(*, "names")= chr [1:2] "a" "b"
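
For comparison, base R's `vapply()` is also type-stable, because the caller declares the expected shape of each result. The sketch below is an alternative to the purrr version, not the approach used in these notes. `col_classes_base()` is a hypothetical name.

```r
col_classes_base <- function(df) {
  # character(1) declares that every column must yield exactly one string;
  # vapply() throws an error if any result has a different type or length
  vapply(df, function(col) class(col)[1], character(1))
}

df <- data.frame(a = 1L, b = 1.5, y = Sys.time(), z = ordered(1))

all_cols <- col_classes_base(df)
two_cols <- col_classes_base(df[1:2])

# Edge case: zero columns
empty_vapply <- col_classes_base(data.frame())  # still a character vector
empty_sapply <- sapply(data.frame(), class)     # sapply() returns a list here
```

The zero-column case is where `sapply()` tends to surprise: it has nothing to simplify, so it hands back an empty list.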


### Or fail early if something goes wrong

In [9]:
col_classes <- function(df) {
  class_list <- map(df, class)
  
  # Add a check that no element of class_list has length > 1
  if (any(map_dbl(class_list, length) > 1)) {
    stop("Some columns have more than one class", call. = FALSE)
  }
  
  # Use flatten_chr() to return a character vector
  flatten_chr(class_list)
}

In [10]:
# Check that our new function is type consistent
df %>% col_classes() %>% str()

ERROR: Error: Some columns have more than one class


In [11]:
df[3:4] %>% col_classes() %>% str()

ERROR: Error: Some columns have more than one class


In [12]:
df[1:2] %>% col_classes() %>% str()

 chr [1:2] "integer" "numeric"


## Non-standard evaluation

### What is non-standard evaluation

```
subset(mtcars, disp > 400) # disp is looked up inside mtcars, not in the calling environment

disp > 400 # Will throw an error: object 'disp' not found

disp       # Will throw an error: object 'disp' not found
```

* Non-standard evaluation functions don't use the usual lookup rules
* Great for data analysis, because they save typing
* When it's used inside your own function, it can cause problems
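
To make the difference in lookup rules concrete, the same filter can be written with standard evaluation, where the column is named explicitly through `mtcars$disp`. This is a sketch using only base R. `nse_rows` and `se_rows` are names introduced here.

```r
# Non-standard evaluation: disp is resolved inside mtcars by subset()
nse_rows <- subset(mtcars, disp > 400)

# Standard evaluation: every name is looked up by the usual rules,
# so the column has to be referenced through the data frame
se_rows <- mtcars[mtcars$disp > 400, , drop = FALSE]

rownames(nse_rows)
rownames(se_rows)
```

Both calls select the same rows. The second is more typing, but nothing in it depends on where the function is called from.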

### Other NSE (non-standard evaluation) functions

```
library(ggplot2)
ggplot(mpg, aes(displ, cty)) + geom_point()

library(dplyr)
filter(mtcars, disp > 400)

disp_threshold <- 400
filter(mtcars, disp > disp_threshold) # disp_threshold value in the global environment
```

### What to do?

* Using non-standard evaluation functions inside your own functions can cause surprises
* Avoid using non-standard evaluation functions inside your functions
* Or, learn the surprising cases and protect against them

### Programming with NSE functions

In [39]:
big_x <- function(df, threshold) {
  dplyr::filter(df, x > threshold)
}

library(ggplot2)

sample_idx <- sample.int(n = nrow(diamonds), size = 20, replace = FALSE)
diamonds_sub <- diamonds[sample_idx, ]

big_x(diamonds_sub, 7)

carat,cut,color,clarity,depth,table,price,x,y,z
1.78,Premium,H,SI2,59.1,60,11262,7.93,7.87,4.67
2.01,Very Good,D,SI2,61.8,59,17079,8.07,8.15,5.01
2.08,Ideal,H,SI2,61.5,57,15065,8.26,8.2,5.06


In [43]:
# Remove the x column from diamonds
diamonds_sub$x <- NULL

# Create variable x with value 1
x <- 1

# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, 7)

# Create a threshold column with value 100
diamonds_sub$threshold <- 100

# Use big_x() to find rows in diamonds_sub where x > 7
big_x(diamonds_sub, threshold = 7)

carat,cut,color,clarity,depth,table,price,y,z,threshold


carat,cut,color,clarity,depth,table,price,y,z,threshold


### What to do?

In [46]:
big_x <- function(df, threshold) {
  # Write a check for x not being in df
  if (!"x" %in% names(df)) {
    stop("df must contain variable called x", call. = FALSE)
  }
  
  # Write a check for threshold being in df
  if ("threshold" %in% names(df)) {
    stop("df must not contain variable called threshold", call. = FALSE)
  }
  
  dplyr::filter(df, x > threshold)
}
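
The other option listed earlier, avoiding NSE functions altogether, might look like the sketch below. `big_x_base()` is a hypothetical base R variant, not part of the original exercise. Because `df$x` and `threshold` are resolved by the usual lookup rules, a column named `threshold` can no longer shadow the argument.

```r
big_x_base <- function(df, threshold) {
  if (!"x" %in% names(df)) {
    stop("df must contain variable called x", call. = FALSE)
  }
  # Explicit column reference; rows where x is NA are excluded
  keep <- !is.na(df$x) & df$x > threshold
  df[keep, , drop = FALSE]
}

# Hypothetical data with a column that would confuse the NSE version
toy <- data.frame(x = c(5, 8, 9), threshold = 100)
res <- big_x_base(toy, threshold = 7)

# The missing-column case still fails early with a clear message
err <- tryCatch(
  big_x_base(data.frame(y = 1), 7),
  error = function(e) conditionMessage(e)
)
```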

## Hidden arguments

### Pure functions

1. Their output only depends on their inputs
2. They don't affect the outside world except through their return value

* Hidden arguments are function inputs that may be different for different users or sessions
* Common example: argument defaults that depend on global options

### View global options

* Settings that affect your entire R session
* Run `options()` to list them


### Getting and setting options

```
# Getting options
getOption("digits") # The number of digits to show

# Setting options
options(digits = 5)

```
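
Since `options()` invisibly returns the previous values of whatever it changes, a setting can be saved and restored afterwards. This is a short sketch. The names `old` and `during` are introduced here.

```r
# options() returns the old value(s) as a list when setting
old <- options(digits = 5)

during <- getOption("digits")  # 5 while the change is in effect

# Passing the saved list back restores the earlier setting
options(old)
getOption("digits")
```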

### Relying on options in your code

* You could set them inside a function, but you shouldn't, because this would violate the second principle of pure functions.
* The return value of a function should **never** depend on a global option
* Side effects may be controlled by global options

### Hidden dependence

In [51]:
# Read in the swimming_pools.csv to pools
pools <- read.csv("swimming_pools.csv")

# Examine the structure of pools
str(pools)

# Change the global stringsAsFactors option to FALSE
options(stringsAsFactors = FALSE)

# Read in the swimming_pools.csv to pools2
pools2 <- read.csv("swimming_pools.csv")

# Examine the structure of pools2
str(pools2)

'data.frame':	20 obs. of  4 variables:
 $ Name     : Factor w/ 20 levels "Acacia Ridge Leisure Centre",..: 1 2 3 4 5 6 19 7 8 9 ...
 $ Address  : Factor w/ 20 levels "1 Fairlead Crescent, Manly",..: 5 20 18 10 9 11 6 15 12 17 ...
 $ Latitude : num  -27.6 -27.6 -27.6 -27.5 -27.4 ...
 $ Longitude: num  153 153 153 153 153 ...

The output above is for `pools`, read while `stringsAsFactors` was still `TRUE`. With the option set to `FALSE`, `str(pools2)` reports `Name` and `Address` as `chr` columns instead of factors, even though the `read.csv()` call itself is unchanged. From R 4.0.0 onward the default is already `FALSE`, so both calls return character columns.
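
One way to remove the hidden dependence is to pass the argument explicitly, so the result no longer varies with the session's options. The sketch below writes a small hypothetical CSV to a temporary file in place of `swimming_pools.csv`. The file contents and the `read_pools()` wrapper are assumptions made for illustration.

```r
# Hypothetical two-row file standing in for swimming_pools.csv
path <- tempfile(fileext = ".csv")
writeLines(c("Name,Latitude", "Pool A,-27.6", "Pool B,-27.5"), path)

read_pools <- function(path) {
  # State the behaviour explicitly instead of inheriting it from options()
  read.csv(path, stringsAsFactors = FALSE)
}

pools <- read_pools(path)
str(pools)
```

`Name` comes back as a character vector whatever the global option is set to.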


### Legitimate use of options

In [52]:
# Fit a regression model
fit <- lm(mpg ~ wt, data = mtcars)

# Look at the summary of the model
summary(fit)

# Set the global digits option to 2
options(digits = 2)

# Take another look at the summary
summary(fit)


Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,	Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10



Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-4.543 -2.365 -0.125  1.410  6.873 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.285      1.878   19.86  < 2e-16 ***
wt            -5.344      0.559   -9.56  1.3e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3 on 30 degrees of freedom
Multiple R-squared:  0.753,	Adjusted R-squared:  0.745 
F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10


## Wrap-up

### Writing functions

* If you have copy-and-pasted twice, it's time to write a function
* Solve a simple problem before writing the function
* A good function is both correct and understandable

### Functional Programming

* Abstract away the pattern, so you can focus on the data and actions
* Solve iteration problems more easily
* Have more understandable code

### Remove duplication and improve readability

```
df$a <- (df$a - min(df$a, na.rm = TRUE)) /
        (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) /
        (max(df$b, na.rm = TRUE) - min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) /
        (max(df$c, na.rm = TRUE) - min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) /
        (max(df$d, na.rm = TRUE) - min(df$d, na.rm = TRUE))

library(purrr)
df[] <- map(df, rescale01)
```
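
The block above calls `rescale01()` without defining it. A plausible definition, inferred from the repeated expression it replaces, is sketched below. The toy data frame is hypothetical, and base R's `lapply()` stands in for `map()` so the example runs without purrr.

```r
rescale01 <- function(x) {
  # range() returns c(min, max), so each is computed only once
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical data for illustration
df <- data.frame(a = c(1, 5, 9), b = c(10, NA, 30))

# Assigning into df[] keeps the data frame structure
df[] <- lapply(df, rescale01)
df
```

Each column now runs from 0 to 1, and missing values stay missing.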

### Unusual inputs and outputs

* Deal with failure using safely()
* Iterate over two or more arguments
* Iterate functions for their side effects

### Write functions that don't surprise

* Use stop() and stopifnot() to fail early
* Avoid using type-inconsistent functions in your own functions
* Avoid using non-standard evaluation functions in your own functions
* Never rely on global options for computational details

### Wrapping up

* Solve the problem that you're working on
* Never feel bad about using a for loop!
* Get a function that works right, for the easiest 80% of the problem
* In time, you'll learn how to get to 99% with minimal extra effort
* Concise and elegant code is something to strive towards!