[Advanced R, Metaprogramming](https://adv-r.hadley.nz/meta.html)

Author for this note: Yvette Li    
Record date: March 28, 2018

# Motivation

The rlang library is a powerful tool for writing functions in R. Because the tidyverse uses non-standard evaluation, passing a variable whose value is a column name does not work as usual: the variable name is interpreted literally and its value is never extracted. Scoping issues can also arise when such a variable is evaluated. Base R offers quite a few ways to write programs (rather than just analyze data interactively), and a more detailed explanation can be found [here](http://adv-r.had.co.nz/Computing-on-the-language.html).

I found *Advanced R* helpful, but if you want to read just one section or chapter, the terminology can be rather confusing, since the book defines quite a few terms to explain the behaviour of R, its structure and the functions in rlang. I highly recommend reading the whole section before using it. This note is a condensed version of the section, but it does **not** summarize all the information from the book. It is meant for looking up definitions with examples when you encounter unfamiliar terms in the book, or for a quicker (but not thorough) introduction to the library.

The newest version of rlang (0.2.0) was released on February 28, 2018, and some functions are soft-deprecated in this version. There are not many posts about the updates to this library at the time of writing. Although I was aware of the danger of using strings to manipulate expressions, I initially used the newest rlang library together with stringr to write many expressions. After reading the book and some old blog posts once again, I decided to rewrite some of the code here to show safer practices and more robust programs. If you only wish to see some common practices I used, please go to section **"Common Practice in Survey Data"**.


---

# Metaprogramming

In R, functions that use metaprogramming are commonly said to use non-standard evaluation (NSE).   

Two primary uses of metaprogramming are:
- trade precision for concision in functions like ```subset()``` and ```dplyr::filter()``` that make interactive data exploration faster at the cost of introducing some ambiguity
- build <strong>domain specific languages </strong> (DSLs) that tailor R’s semantics to specific problem domains like visualisation or data manipulation.

---

# 1. Introduction

## 1.1 Trading Precision for Concision

Metaprogramming allows you to use names of variables in a dataframe as if they were objects in the environment. This makes interactive exploration more fluid at the cost of introducing some minor ambiguity.

In [81]:
## eg
data("diamonds", package="ggplot2")
subset(diamonds, x==0 & y ==0 & z==0)
diamonds[diamonds$x == 0 & diamonds$y == 0 & diamonds$z == 0, ]

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
11964,1.0,Very Good,H,VS2,63.3,53,5139,0,0,0
15952,1.14,Fair,G,VS1,57.5,67,6381,0,0,0
24521,1.56,Ideal,G,VS2,62.2,54,12800,0,0,0
26244,1.2,Premium,D,VVS1,62.1,59,15686,0,0,0
27430,2.25,Premium,H,SI2,62.8,59,18034,0,0,0
49557,0.71,Good,F,SI2,64.1,60,2130,0,0,0
49558,0.71,Good,F,SI2,64.1,60,2130,0,0,0


Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
11964,1.0,Very Good,H,VS2,63.3,53,5139,0,0,0
15952,1.14,Fair,G,VS1,57.5,67,6381,0,0,0
24521,1.56,Ideal,G,VS2,62.2,54,12800,0,0,0
26244,1.2,Premium,D,VVS1,62.1,59,15686,0,0,0
27430,2.25,Premium,H,SI2,62.8,59,18034,0,0,0
49557,0.71,Good,F,SI2,64.1,60,2130,0,0,0
49558,0.71,Good,F,SI2,64.1,60,2130,0,0,0


In the above example, x, y and z are columns in the diamonds data frame. The ```subset()``` call is also considerably shorter than the equivalent code using \[ and \$.
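The ambiguity shows up as soon as a column name is held in a variable. The following is a minimal base R sketch on a small made-up data frame; `toy` and `col` are illustrative names that do not come from the book.

```r
# Toy data; `toy` and `col` are hypothetical names used only for this sketch.
toy <- data.frame(x = c(0, 1, 2), y = c(0, 5, 0))
col <- "x"

# subset() quotes its condition. `col` is not a column of `toy`, so it is
# looked up in the calling environment and the string "x" is compared with 0.
nse_rows <- subset(toy, col == 0)

# Standard evaluation: extract the column named by the string, then compare.
se_rows <- toy[toy[[col]] == 0, ]

nrow(nse_rows)  # no rows match, because "x" == 0 is FALSE
nrow(se_rows)   # one row matches
```

This is the situation that the quotation and unquotation tools of rlang are meant to handle.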

## 1.2 Domain Specific Languages

DSLs are useful because they make it possible to translate R code into another language.

In [6]:
library(dplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
mtcars_db <- copy_to(con, mtcars)

mtcars_db %>%
  filter(cyl > 2) %>%
  select(mpg:hp) %>%
  head(10) %>%
  show_query()
#> <SQL>
#> SELECT `mpg`, `cyl`, `disp`, `hp`
#> FROM `mtcars`
#> WHERE (`cyl` > 2.0)
#> LIMIT 10

DBI::dbDisconnect(con)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

<SQL>
SELECT `mpg`, `cyl`, `disp`, `hp`
FROM `mtcars`
WHERE (`cyl` > 2.0)
LIMIT 10


ggplot2 and dplyr are known as <strong>embedded DSLs</strong>, because they take advantage of R’s parsing and execution framework, but tailor R’s semantics for specific tasks.

# 2. Expressions

## 2.1 Abstract Syntax Trees

An AST is built from the following components:
- Constants and symbols form the leaves of the tree.
- Calls form the branches of the tree.
- Pairlists are a largely historical data structure that are now only used for function arguments.

Quoted expressions are called Abstract Syntax Trees (AST) because the structure of code is hierarchical and can be naturally represented as a tree.

In [11]:
## example
lobstr::ast(f(x, "y", 1))

o-f 
+-x 
+-"y" 
\-1 
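The same tree can be inspected without lobstr. A small sketch in base R, assuming only `quote()` and `as.list()`: it lists the type of each node of the call, where the function name and `x` are symbols and the other two arguments are constants.

```r
# Capture the call without evaluating it.
e <- quote(f(x, "y", 1))

# A call can be converted to a list: the first element is the function name,
# the remaining elements are the arguments.
parts <- as.list(e)

# Symbols and constants are the leaves of the tree.
types <- vapply(parts, typeof, character(1))
types
```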

### 2.1.1 Infix vs prefix calls
- Infix functions come in between their arguments, e.g. <- and \*.
- Prefix functions: the name of the function comes before its arguments.

In [16]:
## Two equivalent lines

# y <- x * 10
# `<-`(y, `*`(x, 10))
lobstr::ast(y <- x *10)

o-`<-` 
+-y 
\-o-`*` 
  +-x 
  \-10 
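A quick way to confirm that the infix and prefix forms are the same expression is to compare the quoted calls. This is a base R sketch; the variable names are illustrative.

```r
# Quote both spellings of the same assignment.
infix  <- quote(y <- x * 10)
prefix <- quote(`<-`(y, `*`(x, 10)))

# Both produce the same syntax tree.
same_tree <- identical(infix, prefix)

# The prefix form also runs like the infix form.
x <- 4
`<-`(y, `*`(x, 10))
y
```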

### 2.1.2 Associativity

In R, most operators are left-associative: operations on the left are evaluated first.

In [17]:
lobstr::ast(1+2+3)

o-`+` 
+-o-`+` 
| +-1 
| \-2 
\-3 
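Associativity can also be read off the quoted call by subsetting it. The sketch below uses base R only and contrasts `+` with `^`, which is one of the right-associative operators.

```r
# Left-associative: the nested call sits in the first argument.
left <- quote(1 + 2 + 3)
left[[2]]   # the inner call 1 + 2
left[[3]]   # the constant 3

# Right-associative: the nested call sits in the second argument.
right <- quote(2^3^2)
right[[2]]  # the constant 2
right[[3]]  # the inner call 3^2

# Evaluated as 2^(3^2), not (2^3)^2.
eval(right)
```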

## 2.2 Data Structures

Note 1: In base R, "expression" is a special type that is basically equivalent to a list of what we call expressions. We call these "expression objects".    
Note 2: Base R does not have an equivalent term for "expression". The closest is "language object".

<strong> Constants </strong><br>
Constants occur in the leaves of the AST. They are the simplest data structure found in the AST because they are atomic vectors of length 1. Constants are “self-quoting” in the sense that the expression used to represent a constant is the constant itself:

In [8]:
library(rlang)

identical(expr("x"), "x")    
identical(expr(TRUE), TRUE)     
identical(expr(1), 1)        
identical(expr(2), 2)      

<strong>Symbols</strong> <br>
- Symbols represent variable names. They are basically a single string stored in a special way.
- Symbols are scalars (to hold multiple symbols, put them in a list).
- Difference between strings and symbols:
    - evaluating a string returns the string itself
    - evaluating a symbol returns the value associated with the symbol in the current environment
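The difference between a string and a symbol can be checked directly with `eval()`. A minimal sketch in base R, where `x` is just an illustrative variable:

```r
x <- 42

# A string is a constant, so evaluating it gives the string back.
from_string <- eval("x")

# A symbol is looked up in the environment, so evaluating it gives the value.
from_symbol <- eval(as.symbol("x"))

from_string
from_symbol
```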

In [4]:
## Exception Case
library(rlang)

missing_arg()
invisible(typeof(missing_arg()))
#> [1] "symbol"
invisible(as_string(missing_arg()))
#> [1] ""
# Check whether you have a missing symbol with rlang::is_missing():

invisible(is_missing(missing_arg()))
#> [1] TRUE
# This symbol has a peculiar property: if you bind it to a variable, then access that variable, you will get an error:

m1 <- missing_arg()
# m1
#> Error in eval(expr, envir, enclos): argument "m1" is missing, with no default
# But you won’t get an error if it’s stored inside another data structure!

m2 <- list(missing_arg())
m2[[1]]


<strong> Calls </strong> <br>
Calls define the tree in AST. A call behaves similarly to a list:

- It has a length().
- You can extract elements with \[\[, \[, and \$.
- Calls can contain other calls

In [10]:
x <- expr(read.table("important.csv", row = FALSE))
invisible(lobstr::ast(!!x))
# o-read.table 
# +-"important.csv" 
# \-row = FALSE 

## Get the number of arguments
invisible(length(x) - 1)
# 2

## The names of a call are empty, except for named arguments:
invisible(names(x))
# '' '' 'row'

## Extract the leaves of the call by position and by name using [[ and $ in the usual way:
invisible(x[[1]])
# read.table
invisible(x[[2]])
# 'important.csv'
invisible(x[[3]])
# FALSE
invisible(x$row)
# FALSE
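Because a call behaves like a list, it can also be modified in place. The following base R sketch adds a named argument and swaps the function; the call is only built, never run, so the file does not need to exist.

```r
cl <- quote(read.table("important.csv", row = FALSE))

# Add a new named argument, as you would with a list.
cl$header <- TRUE

# Replace the function being called (the first element of the call).
cl[[1]] <- quote(read.csv)

cl
length(cl) - 1  # number of arguments after the change
```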

## 2.3 Parsing and Deparsing

### 2.3.1 Parsing

In [16]:
x1 <- " y <- x+10"
lobstr::ast(!!x1)

" y <- x+10" 

In [17]:
x2 <- rlang::parse_expr(x1)
invisible(x2)
#> y <- x + 10
lobstr::ast(!!x2)

o-`<-` 
+-y 
\-o-`+` 
  +-x 
  \-10 
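For comparison, base R parses strings with `parse(text = )`, which returns an expression object (a list-like vector of expressions) rather than a single expression. A minimal sketch, with made-up code in the string:

```r
# Two statements in one string, separated by a semicolon.
code <- "y <- x + 10; z <- y * 2"

# parse() returns one element per top-level expression.
parsed <- parse(text = code, keep.source = FALSE)
length(parsed)

# Evaluate the expressions one after another.
x <- 1
for (e in parsed) eval(e)
z
```

When the string holds exactly one expression, `rlang::parse_expr()`, as used above, returns it directly.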

### 2.3.2 Deparsing
- Deparsing is used when you need a string from an expression.    
- Parsing and deparsing are not perfectly symmetric: parsing the deparsed text does not always give back an identical expression.
- Deparsing is often used to provide default names for data structures (like data frames), and default labels for messages or other output.

In [18]:
z <- expr(y <- x + 10)
expr_text(z)

In [26]:
z <- expr(f(x, y, z))
expr_name(z)
expr_label(z)

Base R has a function, deparse(), which returns a character vector with one element for each line.
- Remember that the length of the output might be greater than 1.
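Both points can be illustrated in base R. In this sketch, `label_of()` is a hypothetical helper that builds a default label from an argument, and the long call shows why the result of `deparse()` should be collapsed before it is used as a single string.

```r
# Hypothetical helper: turn whatever the caller typed into a label.
label_of <- function(x) deparse(substitute(x))
short_label <- label_of(a + b)

# A long call is split over several elements by deparse().
long_call <- quote(g(a + b + c + d + e + f + g + h + i + j + k + l + m +
  n + o + p + q + r + s + t + u + v + w + x + y + z))
pieces <- deparse(long_call)

# Collapse the pieces when a single string is required.
one_string <- paste(pieces, collapse = "\n")

short_label
length(pieces)
length(one_string)
```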

<strong> Long Expressions </strong> <br>
deparse() produces a character vector with more than one element when the input is long.

In [33]:
expr <- expr(g(a + b + c + d + e + f + g + h + i + j + k + l + m +
  n + o + p + q + r + s + t + u + v + w + x + y + z))

invisible(expr_text(expr))
# 'g(a + b + c + d + e + f + g + h + i + j + k + l + m + n + o + \n    p + q + r + s + t + u + v + w + x + y + z)'
invisible(expr_name(expr))
# 'g(...)'
invisible(expr_label(expr))
# '`g(...)`'

# 3. Quasiquotation

Two sides of quasiquotation
- <strong>Quotation </strong> allows you to capture the AST associated with an argument. As a function author, this gives you a lot of power to influence how expressions are evaluated.

- <strong>Unquotation </strong> allows you to selectively evaluate parts of a quoted expression. This is a powerful tool that makes it easy to build up a complex AST from simpler fragments.

- An <strong> evaluated </strong> argument obeys R's usual evaluation rules.
- A <strong> quoted </strong> argument is captured by the function as an expression and processed in some custom way instead of being evaluated right away.

## 3.1 Quotation

### Quotation with rlang

Quoting functions come in two flavours:
1. capturing an expression directly (functions without en-)
2. capturing from a lazily evaluated function argument (functions with en-)

|                     | Expression | Symbol   |
|---------------------|------------|----------|
| direct (one)        | expr()     | sym()    |
| direct (many)       | exprs()    | syms()   |
| function arg (one)  | enexpr()   | ensym()  |
| function arg (many) | enexprs()  | ensyms() |

###  Capturing an expression directly     
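In the following function f1, x is passed to expr(). expr() captures its argument exactly as typed, so f1() always returns the symbol x, no matter what the caller supplies.
This makes expr() great for interactive exploration, because it captures what the developer types, but less useful inside a function.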

In [9]:
f1 <- function(x) expr(x)
f1(a+b+c)

x

### Capture multiple expressions 
Use exprs() if you wish to capture more than one argument.

In [13]:
f11 <- function(...) exprs(...)
f11(x=1, y=x*z)

$x
[1] 1

$y
x * z


### Capture from a lazily evaluated function argument
In the following function f2, x is passed to enexpr(). enexpr() captures what the user supplied to the function by looking at the internal **promise** object.

A **promise** captures the expression needed to compute the value and the environment in which to compute it. You are usually not aware of promises because the moment you access a promise, its code is evaluated and yields a value. 

In [10]:
f2 <- function(x) enexpr(x)
f2(a+b+c)

a + b + c
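The lazy behaviour of promises, and the base R counterpart of `enexpr()`, can be seen in a short sketch. The function names below are made up for illustration; `substitute()` is the base function that reads the expression stored in a promise.

```r
# The argument is never used, so its promise is never forced
# and the error inside it is never raised.
never_forced <- function(x) "x was not evaluated"
result <- never_forced(stop("this error is never raised"))

# substitute() returns the expression held by the promise.
capture_base <- function(x) substitute(x)
captured <- capture_base(a + b + c)

result
captured
```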

### Capture the promises from multiple expressions

In [15]:
f22 <- function(...) enexprs(...)
f22(x=a+b+c, y=1+2+x)

$x
a + b + c

$y
1 + 2 + x


## 3.2 Unquotation

Unquotation allows the caller of a quoting function to selectively evaluate parts of an expression that would otherwise be quoted.    
* !! (bang-bang) unquotes one expression
* !!! (bang-bang-bang) unquotes more than one expression

In [16]:
x <- expr(a+b+c)
expr(f(!!x, y))

f(a + b + c, y)

In [17]:
x <- exprs(1, 2, 3, y = 10)
expr(f(!!!x, z = z))

f(1, 2, 3, y = 10, z = z)
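Unquoting is what makes it practical to build a larger expression from pieces. The sketch below, assuming rlang 0.2.0 or later, folds a vector of column-like names into a single sum; the names `a`, `b`, `c` and the values used for evaluation are made up.

```r
library(rlang)

# Turn strings into symbols.
vars <- syms(c("a", "b", "c"))

# Fold the symbols into one call: each step unquotes the expression
# built so far and the next symbol.
total <- Reduce(function(acc, v) expr(!!acc + !!v), vars)
total

# Evaluate the generated expression against some illustrative values.
value <- eval(total, list(a = 1, b = 2, c = 3))
value
```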

### !!x and !(!x)

If you wish to use !! for unquotation and run into an operator precedence issue, you can put parentheses around the expression (i.e. **(!!x)**). If you wish to doubly negate a value inside a quasiquotation, you can use **!(!x)**.


If you need to manually deparse an expression that contains ```!!```, ```rlang::expr_deparse()``` can be used.

In [18]:
x <- quote(!!x + !!y)
expr_deparse(x)

### Missing Arguments

It might be useful to unquote a missing argument; ```maybe_missing()``` and unquote-splice (```!!!```) can come in handy.

In [23]:
args <- list(missing_arg(), missing_arg())
expr(foo(!!!args))

foo(, )

### Dot-dot-dot(...)

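```do.call(what, args)``` has two main arguments: ```what``` is the function to call, and ```args``` is a list of arguments to pass to it.


The first example below binds together data frames stored in a list, however many there are. The second supplies an argument name that is held in a variable.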

In [27]:
dfs <- list(
  a = data.frame(x = 1, y = 2),
  b = data.frame(x = 3, y = 4)
)

do.call("rbind", dfs)

Unnamed: 0,x,y
a,1,2
b,3,4


In [28]:
var <- "x"
val <- c(4, 3, 9)
args <- list(val)
names(args) <- var
do.call("data.frame", args)

x
4
3
9
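The second `do.call()` example can also be written with rlang's own tools. This is a sketch that assumes a reasonably recent rlang in which `list2()` and `exec()` are available (they may not exist in the 0.2.0 release this note was written against).

```r
library(rlang)

var <- "x"
val <- c(4, 3, 9)

# `:=` allows the argument name to be unquoted on the left-hand side.
args <- list2(!!var := val)

# exec() calls a function with the spliced arguments, like do.call().
df <- exec(data.frame, !!!args)
df
```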


# 4. Evaluation
## 4.1 Evaluation Basics
```rlang::eval_bare()```    
In ```rlang::eval_bare()```, the first argument ```expr``` is the object to evaluate, which will usually be an expression or a symbol. The second argument, ```env```, gives the environment in which the expression should be evaluated.

In [40]:
x <- 10
eval_bare(expr(x))

In [41]:
y <- 2
eval_bare(expr(x+y), env(x=1000))

In [42]:
eval_bare(
  expr(x + y), 
  env(`+` = function(x, y) paste0(x, " + ", y))
)

However, ```eval_bare()``` may produce unexpected results when the captured expression uses a variable name that also exists inside the function. Here is an example.

In [73]:
foo <- function(x) {
  y <- 100
  x <- enexpr(x)
  
  eval_bare(x)
}

In [74]:
z <- 100
foo(z * 2)

When the variable name does not conflict with a variable inside the function, the expression evaluates as expected. However, if there is a conflict:

In [75]:
y <- 10
foo(y * 2)

Because y also exists inside the function, the expression is evaluated with y = 100 and the result is 200 rather than 20.

### Use of local
```local()``` removes the intermediate variables. In the following example, x and y are created inside local() and no longer exist once the block has finished.

In [43]:
rm(x, y)

foo <- local({
  x <- 10
  y <- 200
  x + y
})

foo
x
y

ERROR: Error in eval(expr, envir, enclos): object 'x' not found


## 4.2 Quosures (quoting and closure)

A quosure bundles an expression together with the environment in which it should be evaluated. 

### Create and Manipulate Quosure

|                 | quo      |
|-----------------|----------|
| direct          | quo()    |
| direct (plural) | quos()   |
| func arg        | enquo()  |
| func args       | enquos() |

In [45]:
quo(x + y + z)
quos(x + 1, y + 2)

<quosure>
  expr: ^x + y + z
  env:  global

[[1]]
<quosure>
  expr: ^x + 1
  env:  global

[[2]]
<quosure>
  expr: ^y + 2
  env:  global


In [46]:
foo <- function(x) enquo(x)
foo(a + b)

<quosure>
  expr: ^a + b
  env:  global

In [58]:
x <- quo(y)
quo(x)
quo(!!x)

<quosure>
  expr: ^x
  env:  global

<quosure>
  expr: ^y
  env:  global

When quosures are printed, each expression starts with ```^```, which helps to spot nested quosures when you unquote one quosure inside another.

### Evaluating

```new_quosure()``` takes an expression and an environment as its arguments.

In [67]:
x <- new_quosure(expr(x + y), env(x = 1, y = 10))
eval_tidy(x)

Quosures rely on R’s internal representation of function arguments as a special type of object called a promise. Promises themselves are not accessible from R code, because the code in a promise is evaluated in its environment the moment you access the promise. 

Also, a promise is evaluated once, when you access it for the first time. Every time you access it subsequently it will return the same value. A quosure must be evaluated explicitly, and each evaluation is independent of the previous evaluations.

In [68]:
foo <- function(x_arg) {
  list(x_arg, x_arg)
}
foo(runif(3))

In this example, both elements contain the same numbers, but if we use a quosure, each evaluation produces different numbers.

In [69]:
x_quo <- quo(runif(3))
eval_tidy(x_quo)
eval_tidy(x_quo)
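The same contrast can be made deterministic by counting evaluations instead of drawing random numbers. In this sketch `counter`, `bump()` and `use_twice()` are made-up helpers.

```r
library(rlang)

# A side effect that records how many times the code has run.
counter <- 0
bump <- function() {
  counter <<- counter + 1
  counter
}

# A promise is evaluated once, so both uses see the same value.
use_twice <- function(x) c(x, x)
from_promise <- use_twice(bump())

# A quosure is re-evaluated on every call to eval_tidy().
q <- quo(bump())
from_quosure <- c(eval_tidy(q), eval_tidy(q))

from_promise
from_quosure
```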

### Data Masks

The power of ```eval_tidy()``` is that it can take a data frame to define where the evaluation takes place.
- A **data mask** is a data frame where the evaluated code will look first for variable definitions.
- A data mask introduces ambiguity, so to remove that ambiguity when necessary, we introduce **pronouns**.
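A short sketch of the ambiguity and of the two pronouns, assuming a reasonably recent rlang in which both `.data` and `.env` are available inside `eval_tidy()`. The data frame `scores` and the variable `x` are made up.

```r
library(rlang)

scores <- data.frame(x = 1:3)
x <- 100

# Ambiguous: `x` exists in both places, and the data mask wins.
masked <- eval_tidy(quo(x), scores)

# Explicit: .data looks only in the data frame.
from_df <- eval_tidy(quo(.data$x), scores)

# Explicit: .env looks only in the environment of the quosure.
from_env <- eval_tidy(quo(.env$x), scores)

masked
from_df
from_env
```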

**Enquote using enquo() or enquos()**    
Capturing an argument with ```enquo()``` and evaluating it with ```eval_tidy()``` has the same effect as capturing it with ```enexpr()``` and supplying ```caller_env()``` as the evaluation environment. The following examples produce the same result.

In [78]:
foo1 <- function(x) {
  y <- 100
  x <- enquo(x)
  
  eval_tidy(x)
}
y <- 10
foo1(y * 2)

In [82]:
foo2 <- function(x) {
  y <- 100
  x <- enexpr(x)
  
  eval_bare(x, caller_env())
}

y <- 10
foo2(y * 2)

---
```eval_tidy()``` takes a data frame as its second argument, and the data frame behaves like an environment. We can also turn that data frame into an environment and attach it to another quosure.    


The first example shows how ```eval_tidy()``` can be used with a data frame. The second example builds an environment from the data frame, with the environment of the first quosure as its parent, and attaches it with ```quo_set_env()```.

In [70]:
x <- 10
df <- data.frame(y = 1:10)
q1 <- quo(x * y)

eval_tidy(q1, df)

In [71]:
df_env <- as_env(df, parent = quo_get_env(q1))
q2 <- quo_set_env(q1, df_env)

eval_tidy(q2)

In [72]:
q3 <- quo_set_env(quo(x-y), df_env)
eval_tidy(q3)

# Common Practice in Survey Data

Some survey data sets store each respondent as one row and each question as a column. We sometimes have to gather information from one or multiple columns, and we do not know in advance how many columns are needed. Also, if we wish to summarize data with different functions, such as ```mean=mean(col_1)```, ```sum=sum(col_2)``` or other self-defined calculations, we do not know what or how many columns we need to pass to ```dplyr``` functions like ```summarise()```, ```mutate()``` and ```group_by()```. I use the ```stringr``` library to deal with the column names and the ```rlang``` library to generate the expressions.

Originally, I used ```stringr``` to generate expressions as well; however, using ```str_interp()``` can be dangerous. When the variable string passed in is wrong, the program may not catch the error before evaluation.
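One way in which string-built expressions are fragile is shown in the sketch below, which uses a made-up column name containing a space. Pasting the name into code produces text that no longer parses, while converting it to a symbol and unquoting it works. The data frame `answers` is illustrative only.

```r
library(rlang)

# Illustrative data with a non-syntactic column name.
answers <- data.frame(`my col` = c(1, 2, 3), check.names = FALSE)
col <- "my col"

# String approach: "mean(my col)" is not valid R code.
from_string <- tryCatch(
  parse_expr(paste0("mean(", col, ")")),
  error = function(e) "parse error"
)

# Symbol approach: the name is never re-parsed as code.
from_symbol <- expr(mean(!!sym(col)))
result <- eval_tidy(from_symbol, answers)

from_string
result
```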

In [4]:
library(dplyr)
library(rlang)
library(purrr)

In [5]:
set.seed(1)
survey <- data.frame(q1a = floor(runif(20, min=1, max=5)),
                    q1b = floor(runif(20, min=1, max=3)),
                    q1c = floor(runif(20, min=1, max=6)),
                    q2a = floor(runif(20, min=1, max=3)),
                    q2c = floor(runif(20, min=1, max=5)),
                    q3a = floor(runif(20, min=1, max=4)))
row.names(survey) <- paste0("student", 1:20)
survey

Unnamed: 0,q1a,q1b,q1c,q2a,q2c,q3a
student1,2,2,5,2,2,2
student2,2,1,4,1,3,2
student3,3,2,4,1,2,1
student4,4,1,3,1,2,3
student5,1,1,3,2,4,2
student6,4,1,4,1,1,1
student7,4,1,1,1,3,1
student8,3,1,3,2,1,2
student9,3,2,4,1,1,3
student10,1,1,4,2,1,2


Each column corresponds to a sub-question, and the number in each cell is the answer given by that student.

The first step is often to count how many people selected each answer.

---
If you know the specific column names, the following function can be used to obtain a frequency table for the column.

**Single Question**

In [46]:
get_one_col <- function(question, dataframe){
    q <- sym(question)
    dataframe[, question] <-  as.factor(dataframe[, question])
    df <- dataframe %>% group_by(!!q) %>% summarise(Freq=n())
    df
}

get_one_col(c("q1a"), survey)


q1a,Freq
1,4
2,5
3,5
4,6


**Multiple Questions**

In [84]:
get_more_col <- function(questions, dataframe){
  qs <- syms(questions)
  dataframe[, questions] <- map_dfc(dataframe[, questions], as.factor)
  df_lst <- lapply(1:length(questions), function(x) dataframe %>% group_by(!!qs[[x]]) %>% summarise(Freq=n()))
  df_lst
}

get_more_col(c("q1a", "q1b"), survey)

q1a,Freq
1,4
2,5
3,5
4,6

q1b,Freq
1,12
2,8


If you wish to use quosures instead of strings, that is fine too. Note, however, that in the following examples every column of the data frame is converted to a factor, not just the requested ones.

**Single Question**

In [45]:
get_one_col <- function(question, dataframe){
    q <- quo(!!ensym(question))
    ## ensym() returns a symbol; to pair it with an environment,
    ##  unquote it inside quo(), as otherwise quo() would capture ensym(question) literally
    ##  and the result is a quosure wrapping the symbol
    dataframe <- map_dfc(dataframe, as.factor)
    df <- dataframe %>% group_by(!!q) %>% summarise(Freq=n())
    df
}

get_one_col(q1a, survey)

q1a,Freq
1,4
2,5
3,5
4,6


In [65]:
get_more_col <- function(..., dataframe){
  qs <- enquos(...)
  len_dots <- dots_n(...)
  dataframe <- map_dfc(dataframe, as.factor)
  df_lst <- lapply(1:len_dots, function(x) dataframe %>% group_by(!!qs[[x]]) %>% summarise(Freq=n()))
  df_lst
}

get_more_col(q1a, q1b, dataframe=survey)

q1a,Freq
1,4
2,5
3,5
4,6

q1b,Freq
1,12
2,8
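The same pattern extends to the summaries mentioned at the start of this section, where neither the columns nor the function is known in advance. The following is a sketch, not the code I used for the survey: `summarise_cols()` and the small `toy` data frame are hypothetical, and it assumes that `call2()` from rlang 0.2.0 or later is available.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: apply one summary function to several columns.
summarise_cols <- function(dataframe, questions, fun = "mean") {
  # Build one call per column, e.g. mean(q1a).
  calls <- lapply(syms(questions), function(q) call2(fun, q))
  # Name the calls so the output columns read mean_q1a, mean_q1b, ...
  names(calls) <- paste0(fun, "_", questions)
  # Splice the named calls into summarise().
  summarise(dataframe, !!!calls)
}

toy <- data.frame(q1a = c(1, 2, 3), q1b = c(2, 2, 2))
out <- summarise_cols(toy, c("q1a", "q1b"))
out
```

Passing `fun = "sum"` would generate `sum(q1a)` and `sum(q1b)` in the same way.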


In [25]:
quo <- quo(x+y)
quo

<quosure>
  expr: ^x + y
  env:  global

In [27]:
squashed_quo <- quo_squash(quo)
squashed_quo

x + y

In [54]:
quo_label(quo)

In [31]:
quo_name(quo)

In [35]:
quo_text(quo)

In [32]:
expr_label("a\nb")

In [33]:
expr_name("a\nb")

In [34]:
expr_text("a\nb")

In [39]:
quo(!!expr(x))

<quosure>
  expr: ^x
  env:  global

In [47]:
expr_print(1:3)
expr_print(function() NULL)

<int: 1L, 2L, 3L>
<function() NULL>


In [49]:
expr_print(quote(1:3))
expr_print(quote(function() NULL))



In [50]:
f <- ~list(!!(1 + 2))
expr_interp(f)

~list(3)

In [51]:
f <- ~list(~!!(1 + 2), !!(1 + 2))
expr_interp(f)


~list(~3, 3)

In [52]:
other_fn <- function(x) toupper(x)
fn <- expr_interp(function(x) {
x <- paste0(x, "_suffix")
!!! body(other_fn)
})
fn
fn("foo")


In [53]:
is_syntactic_literal("string")

In [82]:
q1 <- expr("1")
is_expression(q1)
is_syntactic_literal(q1)
is_symbol(q1)
is_quosure(q1)

In [83]:
q2 <- quo(x)
is_expression(q2)
is_syntactic_literal(q2)
is_symbol(q2)
is_quosure(q2)

In [73]:
q3 <- quote(x + 1)
is_expression(q3)
is_call(q3)

In [75]:
is_syntactic_literal("string")
is_syntactic_literal(NULL)
is_syntactic_literal(letters)
is_syntactic_literal(quote(call()))


In [76]:
is_formula(~10)

In [77]:
is_formula(10)

In [85]:
quo <- quo(wrapper(!!quo(wrappee)))
quo
quo_squash(quo)

<quosure>
  expr: ^wrapper(^wrappee)
  env:  global

wrapper(wrappee)