# When and how you should write a function

## Why should you write a function?

### What does this code do?

```
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) -
        min(df$a, na.rm = TRUE))
df$b <- (df$b - min(df$b, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) -
        min(df$b, na.rm = TRUE))
df$c <- (df$c - min(df$c, na.rm = TRUE)) / (max(df$c, na.rm = TRUE) -
        min(df$c, na.rm = TRUE))
df$d <- (df$d - min(df$d, na.rm = TRUE)) / (max(df$d, na.rm = TRUE) -
        min(df$d, na.rm = TRUE))
```

* The same operation is repeated for each column.
* The code rescales the values in each column to the range 0 to 1.
* There is an error in the second statement: the denominator uses `max(df$a)` where it should use `max(df$b)`.
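To see why this kind of slip matters, here is a small sketch on a made-up two-column data frame (the name `toy` and its values are illustrative and not from the original example). Mixing `max()` of one column with `min()` of another silently produces values outside the 0-1 range:

```r
# Hypothetical toy data, chosen so the effect is easy to see
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Copy-and-paste slip: max() is taken from column a instead of b
b_buggy <- (toy$b - min(toy$b, na.rm = TRUE)) /
  (max(toy$a, na.rm = TRUE) - min(toy$b, na.rm = TRUE))

# Intended computation: every term refers to column b
b_fixed <- (toy$b - min(toy$b, na.rm = TRUE)) /
  (max(toy$b, na.rm = TRUE) - min(toy$b, na.rm = TRUE))

range(b_buggy)  # not 0 to 1
range(b_fixed)  # 0 to 1
```

No error or warning is raised for the buggy version, which is what makes this class of mistake hard to spot.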

### How was this code written?

* Complete one line of code
* Copy-and-paste the rest

### When should you write a function?

* If you have copied-and-pasted twice, it's time to write a function.

### Writing a function makes the intent clearer

```
df$a <- rescale01(df$a)
df$b <- rescale01(df$b)
df$c <- rescale01(df$c)
df$d <- rescale01(df$d)
```

* Reduces mistakes from copying and pasting
* Makes updating code easier

### Functional programming further reduces duplication

```
library(purrr)
df[] <- map(df, rescale01)
```
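If purrr is not available, base R's `lapply()` can play the same role. A minimal sketch, assuming `rescale01()` is defined as in the steps that follow and using a made-up data frame:

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical example data
df <- data.frame(a = c(1, 5, 9), b = c(2, 4, 10))

# Assigning into df[] keeps the data frame structure
# while replacing every column with its rescaled version
df[] <- lapply(df, rescale01)
df
```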

## Start with a snippet of code

```
# Define example vector x
x <- 1:10

# Rewrite this snippet to refer to x
# (df$a - min(df$a, na.rm = TRUE)) /
#   (max(df$a, na.rm = TRUE) - min(df$a, na.rm = TRUE))
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
```

## Rewrite for clarity

```
# Define example vector x
x <- 1:10

# Define rng
rng <- range(x, na.rm = TRUE)

# Rewrite this snippet to refer to the elements of rng
# (x - min(x, na.rm = TRUE)) /
#   (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
(x - rng[1]) / (rng[2] - rng[1])
```

## Finally turn it into a function!

```
# Define example vector x
x <- 1:10

# Use the function template to create the rescale01 function
# my_fun <- function(arg1, arg2) {
#   # body
#
# }
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Test your function, call rescale01 using the vector x as the argument
rescale01(x)
```
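A few quick checks can build confidence in the function. The inputs below are made up for illustration. The constant-vector case shows a limitation worth knowing about: the range is zero, so the division yields `NaN`.

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))   # smallest value maps to 0, largest to 1
rescale01(c(1, NA, 3))   # NA stays NA; the other values are rescaled
rescale01(c(2, 2, 2))    # zero range: 0 / 0 gives NaN
```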

## How should you write a function?

### Start with a simple problem

```
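# Example data: four numeric columns
df <- data.frame(a = rnorm(10),
                 b = rnorm(10),
                 c = rnorm(10),
                 d = rnorm(10))

# Goal: rescale the 'a' column in df to a 0-1 range.
# With a = 1:11 the expected answer is easy to check:
# the rescaled column should be 0, 0.1, ..., 1.
df <- data.frame(a = 1:11,
                 b = rnorm(11),
                 c = rnorm(11),
                 d = rnorm(11))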
```

### Get a working snippet of code

```
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) -
         min(df$a, na.rm = TRUE))
```

### Rewrite to use temporary variables

```
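# Working snippet from the previous step
df$a <- (df$a - min(df$a, na.rm = TRUE)) / (max(df$a, na.rm = TRUE) -
         min(df$a, na.rm = TRUE))

# Pull the input into a temporary variable with a general name
x <- df$a

# Same computation, now written in terms of x
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))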
```

### Rewrite for clarity

```
x <- df$a

rng <- range(x, na.rm = TRUE)

(x - rng[1]) / (rng[2] - rng[1])
```

### Finally, turn it into a function

```
x <- df$a

rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # Divide (not subtract) by the width of the range
  (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(x)
```

### How should you write a function?

* Start with a simple problem
* Get a working snippet of code
* Rewrite to use temporary variables
* Rewrite for clarity
* Finally, turn it into a function

## Start with a simple problem

```
# Define example vectors x and y
x <- c( 1, 2, NA, 3, NA)
y <- c(NA, 3, NA, 3,  4)

# Count how many elements are missing in both x and y
sum(is.na(x) & is.na(y))
```

## Rewrite snippet as function

```
# Define example vectors x and y
x <- c( 1, 2, NA, 3, NA)
y <- c(NA, 3, NA, 3,  4)

# Turn this snippet into a function: both_na()
#  sum(is.na(x) & is.na(y))
both_na <- function(x, y) {
  sum(is.na(x) & is.na(y))
}
both_na(x, y)
```

## Put our function to use

```
# Define x, y1 and y2
x  <- c(NA, NA, NA)
y1 <- c(1, NA, NA)
y2 <- c(1, NA, NA, NA)

# Call both_na on x, y1
both_na(x, y1)

# Call both_na on x, y2
both_na(x, y2)
```

The second call triggers the warning "longer object length is not a multiple of shorter object length": `x` and `y2` have different lengths, so R recycles the shorter vector.

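One possible way to guard against mismatched lengths, sketched under the assumption that they should be treated as an error (the name `both_na_strict()` is hypothetical and not part of the original exercise):

```r
both_na_strict <- function(x, y) {
  # Fail early instead of silently recycling the shorter vector
  stopifnot(length(x) == length(y))
  sum(is.na(x) & is.na(y))
}

x  <- c(NA, NA, NA)
y1 <- c(1, NA, NA)
y2 <- c(1, NA, NA, NA)

both_na_strict(x, y1)

# Mismatched lengths now stop with an error; try() lets the script continue
result <- try(both_na_strict(x, y2), silent = TRUE)
inherits(result, "try-error")
```

Whether an error is the right response is a design choice; the point is that the function's behaviour on unequal lengths becomes explicit.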
## How can you write a good function?

### What makes a good function?

* Correct: Your function solves the problem correctly.
* Understandable: Other people can read your code and follow what it does.
* Functions are for humans and computers
* Correct + Understandable = Obviously correct

### What does this code do?

> baz <- foo(bar, qux)

Who knows? What about this one?

> df2 <- arrange(df, qux)

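Good names make code understandable with minimal context. Even without seeing its definition, the name `arrange()` suggests that the rows of `df` are being ordered by `qux`.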

### Naming principles

**The same principles apply to objects, functions, and arguments.**

* Pick a consistent style for long names

```
# Good
col_mins()
row_maxes()

# Bad
newData <- c(old.data, todays_log)
```


* Do not override existing variables or functions, especially predefined ones!


```
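# Bad: each assignment masks a built-in
T <- FALSE                  # T is normally shorthand for TRUE
c <- 10                     # c() is the function that combines values
mean <- function(x) sum(x)  # mean() no longer computes a mean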
```
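The danger is that later code silently picks up the new definitions. A short sketch of the effect, and of how `rm()` removes the masking objects so the built-ins are visible again (the variable names `masked`, `wrong_mean`, and `restored` are illustrative):

```r
# Mask the built-in shorthand T (an ordinary variable, unlike TRUE)
T <- FALSE
masked <- isTRUE(T)              # FALSE: T no longer means TRUE

# Mask base::mean with a function that computes something else
mean <- function(x) sum(x)
wrong_mean <- mean(c(1, 2, 3))   # 6, not 2

# Remove the masking objects to get the built-ins back
rm(T, mean)
restored <- isTRUE(T) && mean(c(1, 2, 3)) == 2
restored
```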

### Function names

* Should generally be verbs


```
# Good
impute_missing()

# Bad
imputed()
```


* Should be descriptive: The function name should describe what it does.


```
# Good
collapse_years()

# Bad
f()
my_awesome_function()
```


### Argument names

* Should generally be nouns
* Use the very common short names when appropriate:
    - x, y, z: vectors
    - df: a data frame
    - i, j: numeric indices (typically rows and columns)
    - n: length, or number of rows
    

### Argument order

> mean(x, trim = 0, na.rm = FALSE, ...)

* Data arguments: supply the data to compute on (`x` in `mean()`)
* Detail arguments: control the details of the computation (`trim` and `na.rm` in `mean()`)

> t.test(x, y = NULL,
alternative = c("two.sided", "less", "greater"),
mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)


* Data arguments come first
* Detail arguments should have sensible defaults
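For instance, `mean()` can be called with only its data argument, and the detail arguments are named only when non-default behaviour is wanted. The vectors below are made up for illustration:

```r
x <- c(1, 2, 3, 4, 100)

mean(x)               # data argument only; details use their defaults
mean(x, trim = 0.2)   # drop the lowest and highest 20% before averaging

y <- c(1, NA, 3)
mean(y)                 # NA, because na.rm defaults to FALSE
mean(y, na.rm = TRUE)   # missing value removed first
```

Because the data argument comes first, it can be passed by position, while naming the detail arguments keeps each call readable.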


### What makes a good function?

* Use good names for functions and arguments
* Use an intuitive argument order and reasonable default values
* Make it clear what the function returns
* Use good style inside the body of the function

## Argument names

```
# Rewrite mean_ci to take arguments named level and x
# mean_ci <- function(c, nums) {
#   se <- sd(nums) / sqrt(length(nums))
#   alpha <- 1 - c
#   mean(nums) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
# }

mean_ci <- function(level, x) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - level
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
```

## Argument order

```
# Alter the arguments to mean_ci
# mean_ci <- function(level, x) {
#   se <- sd(x) / sqrt(length(x))
#   alpha <- 1 - level
#   mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
# }

# Change the argument order and set the default of level to 0.95
mean_ci <- function(x, level = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - level
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
```
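With the data argument first and a default for `level`, the common case becomes a one-argument call. A brief usage sketch on made-up sample values (the vector `x` here is hypothetical):

```r
mean_ci <- function(x, level = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - level
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

# Hypothetical sample data
x <- c(4.1, 5.3, 4.8, 5.9, 5.0, 4.6)

ci_95 <- mean_ci(x)                # uses the default level of 0.95
ci_99 <- mean_ci(x, level = 0.99)  # override the detail argument by name

ci_95
ci_99
```

The 99% interval should come out wider than the 95% one, which is a quick sanity check on the function.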

## Return statements

```
mean_ci <- function(x, level = 0.95) {
  if (length(x) == 0) {
    warning("`x` was empty", call. = FALSE)
    interval <- c(-Inf, Inf)
  } else { 
    se <- sd(x) / sqrt(length(x))
    alpha <- 1 - level
    interval <- mean(x) + 
      se * qnorm(c(alpha / 2, 1 - alpha / 2))
  }
  interval
}
```


Edit `mean_ci()` so that an `if` statement checks whether `x` is empty. If it is, issue the same warning as in the code above and immediately `return()` the value `c(-Inf, Inf)`.

```
# Alter the mean_ci function
# mean_ci <- function(x, level = 0.95) {
#   se <- sd(x) / sqrt(length(x))
#   alpha <- 1 - level
#   mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
# }

mean_ci <- function(x, level = 0.95) {
  # Handle the empty case up front: warn, then return early
  if (length(x) == 0) {
    warning("`x` was empty", call. = FALSE)
    return(c(-Inf, Inf))
  }
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - level
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}
```
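A quick way to exercise the early-return branch is to pass an empty vector. The sketch below assumes the version of `mean_ci()` that warns before returning, and uses `withCallingHandlers()` only to record the warning text for inspection (the names `caught` and `result` are illustrative):

```r
mean_ci <- function(x, level = 0.95) {
  if (length(x) == 0) {
    warning("`x` was empty", call. = FALSE)
    return(c(-Inf, Inf))
  }
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - level
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

# Capture the warning message while still getting the return value
caught <- NULL
result <- withCallingHandlers(
  mean_ci(numeric(0)),
  warning = function(w) {
    caught <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

result   # the interval returned for empty input
caught   # the warning text
```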

## What does this function do?

```
f <- function(x, y) {
  x[is.na(x)] <- y
  cat(sum(is.na(x)), y, "\n")
  x
}
```


* Define a numeric vector x with the values: 1, 2, NA, 4 and 5.
* Call f() with the arguments x = x, and y = 3.
* Call f() with the arguments x = x, and y = 10.

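```
# Define the function f
f <- function(x, y) {
  x[is.na(x)] <- y
  cat(sum(is.na(x)), y, "\n")
  x
}

# Define a numeric vector x with the values 1, 2, NA, 4 and 5
x <- c(1, 2, NA, 4, 5)

# Call f() with the arguments x = x and y = 3
f(x, 3)

# Call f() with the arguments x = x and y = 10
f(x, 10)
```

The two calls print `0 3` and `0 10` via `cat()`, and each call also returns the filled-in vector. The first number is always 0 here because the missing values are counted after they have been replaced.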

## Let's make it clear from its name

```
# Define a data frame df
df <- data.frame(z = c(-0.39, -0.01, 3.53, NA, NA,
                       -0.31, 0.12, -0.52, 0.55, 0.70))

# Rename the function f() to replace_missings()
# f <- function(x, y) {
#   # Change the name of the y argument to replacement
#   x[is.na(x)] <- y
#   cat(sum(is.na(x)), y, "\n")
#   x
# }

replace_missings <- function(x, replacement) {
  # Change the name of the y argument to replacement
  x[is.na(x)] <- replacement
  cat(sum(is.na(x)), replacement, "\n")
  x
}

# Rewrite the call on df$z to match our new names
df$z <- replace_missings(df$z, 0)
```

Output: `0 0`


## Make the body more understandable

```
# replace_missings <- function(x, replacement) {
#   # Define is_miss
#
#
#   # Rewrite rest of function to refer to is_miss
#   x[is.na(x)] <- replacement
#   cat(sum(is.na(x)), replacement, "\n")
#   x
# }

replace_missings <- function(x, replacement) {
  # Define is_miss
  is_miss <- is.na(x)

  # Rewrite rest of function to refer to is_miss
  x[is_miss] <- replacement
  cat(sum(is_miss), replacement, "\n")
  x
}
```

## Much better! But a few more tweaks...

```
# replace_missings <- function(x, replacement) {
#   is_miss <- is.na(x)
#   x[is_miss] <- replacement
#
#   # Rewrite to use message()
#   cat(sum(is_miss), replacement, "\n")
#   x
# }

replace_missings <- function(x, replacement) {
  is_miss <- is.na(x)
  x[is_miss] <- replacement

  # Rewrite to use message(); message() pastes its arguments together
  # without separators, so the spaces belong inside the string
  message(sum(is_miss), " missings replaced by the value ", replacement)
  x
}

# Check your new function by running on df$z
replace_missings(df$z, replacement = 0)
```
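One practical benefit of `message()` over `cat()` is that callers can silence it with `suppressMessages()`. A small sketch with made-up data (the vector `z` is hypothetical), assuming the `message()` call is written with explicit spaces around the text:

```r
replace_missings <- function(x, replacement) {
  is_miss <- is.na(x)
  x[is_miss] <- replacement
  message(sum(is_miss), " missings replaced by the value ", replacement)
  x
}

# Hypothetical input with two missing values
z <- c(-0.39, NA, 3.53, NA)

# The note is emitted as a message and the cleaned vector is returned
cleaned <- replace_missings(z, replacement = 0)

# Callers who do not want the note can silence it
quiet <- suppressMessages(replace_missings(z, replacement = 0))
identical(cleaned, quiet)
```

The return value is the same either way; only the side-channel note is affected.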
