In [2]:
library(tidyverse)

# [Programming with dplyr](https://dplyr.tidyverse.org/articles/programming.html)

# Introduction

Most dplyr verbs use tidy evaluation in some way. Tidy evaluation is a special type of non-standard evaluation used throughout the tidyverse. There are two basic forms found in dplyr:

* `arrange()`, `count()`, `filter()`, `group_by()`, `mutate()`, and `summarise()` use __data masking__ so that you can use data variables as if they were variables in the environment (i.e. you write `my_variable` not `df$my_variable`).

* `across()`, `relocate()`, `rename()`, `select()`, and `pull()` use __tidy selection__ so you can easily choose variables based on their position, name, or type (e.g. `starts_with("x")` or `where(is.numeric)`).

To determine whether a function argument uses data masking or tidy selection, look at the documentation: in the arguments list, you’ll see `<data-masking>` or `<tidy-select>`.
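
As a quick illustration of the two forms, here is a minimal sketch using the built-in `mtcars` and `iris` data sets (chosen only for illustration; the object names are hypothetical). `filter()` evaluates its arguments with data masking, while `select()` interprets them with tidy selection.

```r
library(dplyr)

# Data masking: `cyl` and `mpg` are looked up as columns of mtcars.
efficient <- mtcars %>% filter(cyl == 4, mpg > 30)

# Tidy selection: columns are chosen by name pattern or by type.
d_cols   <- mtcars %>% select(starts_with("d"))
num_cols <- iris %>% select(where(is.numeric))

colnames(d_cols)
colnames(num_cols)
```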

# Data masking

Data masking makes data manipulation faster because it requires less typing. In most (but not all) base R functions you need to refer to variables with `$`, leading to code that repeats the name of the data frame many times:

In [11]:
iris[iris$Species == 'setosa', ]

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


In [12]:
iris %>% filter(Species == 'setosa')

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


### Data- and env-variables

The key idea behind data masking is that it blurs the line between the two different meanings of the word “variable”:

* `env-variables` are “programming” variables that live in an environment. They are usually created with `<-`.

* `data-variables` are “statistical” variables that live in a data frame. They usually come from data files (e.g. `.csv`, `.xls`), or are created by manipulating existing variables.

In [13]:
df <- data.frame(x = runif(3), y = runif(3))
df$x

This code creates an env-variable, `df`, that contains two data-variables, `x` and `y`. It then extracts the data-variable `x` from the env-variable `df` using `$`.
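
The blurring matters when a data-variable and an env-variable share a name. The sketch below uses a small made-up data frame (`scores`) and shows that the data-variable wins inside a data-masking verb, and that the `.data` and `.env` pronouns can be used to say explicitly which one is meant. All object names here are hypothetical.

```r
library(dplyr)

scores <- data.frame(x = c(1, 5, 10), y = c(2, 4, 6))
x <- 4  # env-variable with the same name as the data-variable `x`

# Inside filter(), `x` refers to the column scores$x, not the env-variable.
masked <- scores %>% filter(y > x)

# The pronouns remove the ambiguity: column `y` against env-variable `x`.
explicit <- scores %>% filter(.data$y > .env$x)

masked
explicit
```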

### Indirection

* When you have the data-variable in a function argument (i.e. an env-variable that holds a promise), you need to embrace the argument by surrounding it in doubled braces, like `filter(df, {{ var }})`.

In [15]:
my_count <- function(data, var) data %>% count({{var}})

mpg %>% my_count(manufacturer)

manufacturer,n
audi,18
chevrolet,19
dodge,37
ford,25
honda,9
hyundai,14
jeep,8
land rover,4
lincoln,3
mercury,4


* When you have an env-variable that is a character vector, you need to index into the `.data` pronoun with `[[`, like `summarise(df, mean = mean(.data[[var]]))`.

In [18]:
for(var in colnames(mtcars))
    mtcars %>% count(.data[[var]]) %>% print()

    mpg n
1  10.4 2
2  13.3 1
3  14.3 1
4  14.7 1
5  15.0 1
6  15.2 2
7  15.5 1
8  15.8 1
9  16.4 1
10 17.3 1
11 17.8 1
12 18.1 1
13 18.7 1
14 19.2 2
15 19.7 1
16 21.0 2
17 21.4 2
18 21.5 1
19 22.8 2
20 24.4 1
21 26.0 1
22 27.3 1
23 30.4 2
24 32.4 1
25 33.9 1
  cyl  n
1   4 11
2   6  7
3   8 14
    disp n
1   71.1 1
2   75.7 1
3   78.7 1
4   79.0 1
5   95.1 1
6  108.0 1
7  120.1 1
8  120.3 1
9  121.0 1
10 140.8 1
11 145.0 1
12 146.7 1
13 160.0 2
14 167.6 2
15 225.0 1
16 258.0 1
17 275.8 3
18 301.0 1
19 304.0 1
20 318.0 1
21 350.0 1
22 351.0 1
23 360.0 2
24 400.0 1
25 440.0 1
26 460.0 1
27 472.0 1
    hp n
1   52 1
2   62 1
3   65 1
4   66 2
5   91 1
6   93 1
7   95 1
8   97 1
9  105 1
10 109 1
11 110 3
12 113 1
13 123 2
14 150 2
15 175 3
16 180 3
17 205 1
18 215 1
19 230 1
20 245 2
21 264 1
22 335 1
   drat n
1  2.76 2
2  2.93 1
3  3.00 1
4  3.07 3
5  3.08 2
6  3.15 2
7  3.21 1
8  3.23 1
9  3.54 1
10 3.62 1
11 3.69 1
12 3.70 1
13 3.73 1
14 3.77 1
15 3.85 1
16 3.90 2
17 3.92 3
18 4.08 2

Note that `.data` is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with `.data$x` or indirectly with `.data[[var]]`. Don’t expect other functions to work with it.
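
The same pattern can be wrapped in a function that takes the column name as a string. This is a minimal sketch; `mean_of()` is a hypothetical helper, not part of dplyr.

```r
library(dplyr)

# `var` is a length-one character vector naming the column to summarise.
mean_of <- function(data, var) {
  data %>% summarise(mean = mean(.data[[var]]))
}

mpg_mean <- mean_of(mtcars, "mpg")
mpg_mean
```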

# Tidy selection

Underneath all functions that use tidy selection is the **`tidyselect`** package. 

### Indirection

* When you have the data-variable in an env-variable that is a function argument, you use the same technique as data masking: you embrace the argument by surrounding it in doubled braces.

In [20]:
my_select <- function(data, cols) data %>% select({{cols}})

iris %>% my_select(!Species) %>% colnames()

* When you have an env-variable that is a character vector, you need to use `all_of()` or `any_of()` depending on whether you want the function to error if a variable is not found.

In [23]:
my_select_str_any <- function(data, cols) data %>% select(any_of(cols))

iris %>% my_select_str_any(c('Species', 'not_a_column')) %>% colnames()
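
For contrast, here is a sketch of the strict variant. `all_of()` raises an error when a requested column is missing, whereas `any_of()` silently skips it. The helper name `my_select_str_all()` and the column name `"not_a_column"` are hypothetical.

```r
library(dplyr)

my_select_str_all <- function(data, cols) data %>% select(all_of(cols))

# Works when every requested column exists.
ok <- iris %>% my_select_str_all(c("Species", "Sepal.Length")) %>% colnames()

# Errors when one is missing; tryCatch() turns the error into a marker value.
strict <- tryCatch(
  iris %>% my_select_str_all(c("Species", "not_a_column")) %>% colnames(),
  error = function(e) "error"
)

ok
strict
```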

# How to

See the article for the full set of recipes. Most of them are already familiar; the ones below are worth noting.

If you want to use the names of variables in the output, you can use **`glue`** syntax in conjunction with `:=`:

In [26]:
my_summary <- function(data, mean_var, sd_var)
    data %>% summarize(
        'mean_{{mean_var}}' := mean({{mean_var}}), 
        'sd_{{sd_var}}' := sd({{sd_var}}))

iris %>% my_summary(Sepal.Length, Petal.Length)

mean_Sepal.Length,sd_Petal.Length
5.843333,1.765298
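
When the variable name arrives as a string rather than as a bare name, single braces interpolate the env-variable into the output name, and `.data[[var]]` retrieves the column. This is a minimal sketch with a hypothetical helper, `my_summary_str()`.

```r
library(dplyr)

# "{var}" inserts the string held in `var`; "{{ var }}" is for embraced arguments.
my_summary_str <- function(data, var) {
  data %>% summarise("mean_{var}" := mean(.data[[var]]))
}

res <- my_summary_str(mtcars, "mpg")
res
```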


If you want to take an arbitrary number of user-supplied expressions, use `...`:

In [38]:
my_count <- function(data, ...)
    data %>% count(..., sort = TRUE)

In [39]:
mpg %>% my_count(manufacturer)

manufacturer,n
dodge,37
toyota,34
volkswagen,27
ford,25
chevrolet,19
audi,18
hyundai,14
subaru,14
nissan,13
honda,9


In [42]:
mpg %>% my_count(manufacturer, model)

manufacturer,model,n
dodge,caravan 2wd,11
dodge,ram 1500 pickup 4wd,10
dodge,dakota pickup 4wd,9
ford,mustang,9
honda,civic,9
volkswagen,jetta,9
audi,a4 quattro,8
jeep,grand cherokee 4wd,8
subaru,impreza awd,8
audi,a4,7
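
Because `...` forwards expressions unchanged, the caller can also pass named or computed expressions, not just bare column names. The sketch below shows this on the built-in `mtcars` data; `count_sorted()` is a hypothetical name for the same wrapper.

```r
library(dplyr)

count_sorted <- function(data, ...) data %>% count(..., sort = TRUE)

# Several columns at once.
by_cyl_gear <- mtcars %>% count_sorted(cyl, gear)

# A named, computed expression is forwarded as-is to count().
by_flag <- mtcars %>% count_sorted(high_mpg = mpg > 20)

by_cyl_gear
by_flag
```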


If you want the user to provide a set of data-variables that are then transformed, use `across()`:

In [45]:
my_summarise <- function(data, summary_vars) {
  data %>%
    summarise(across({{ summary_vars }}, ~ mean(., na.rm = TRUE)))
}

starwars %>% 
  group_by(species) %>% 
  my_summarise(c(mass, height))

`summarise()` ungrouping output (override with `.groups` argument)


species,mass,height
Aleena,15.0,79.0
Besalisk,102.0,198.0
Cerean,82.0,198.0
Chagrian,,196.0
Clawdite,55.0,168.0
Droid,69.75,131.2
Dug,40.0,112.0
Ewok,20.0,88.0
Geonosian,80.0,183.0
Gungan,74.0,208.6667
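
If the columns are supplied as a character vector instead, `across()` can be combined with `all_of()`, and its `.names` argument controls the output names. This is a minimal sketch; `my_summarise_chr()` is a hypothetical helper.

```r
library(dplyr)

# `vars` is a character vector of column names.
my_summarise_chr <- function(data, vars) {
  data %>%
    summarise(across(all_of(vars), ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"))
}

means <- my_summarise_chr(mtcars, c("mpg", "hp"))
means
```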
