`replyr` is post-maintenance; we are no longer bug-fixing or updating this package. It has been pointless to track shifting `dplyr`/`dbplyr`/`rlang` APIs and data structures post `dplyr` 0.5. Most of what it does is now done better in one of our newer non-monolithic packages:

- Programming and meta-programming tools: `wrapr`.
- Big-data data manipulation: `rquery` and `cdata`.
- Adapting to standard evaluation interfaces: `seplyr`.
This document describes `replyr`, an R package available from GitHub and CRAN.
It comes as a bit of a shock for R `dplyr` users when they switch from using a `tbl` implementation based on R in-memory `data.frame`s to one based on a remote database or service. A lot of the power and convenience of the `dplyr` notation is hard to maintain with these more restricted data service providers. Things that work locally can't always be used remotely at scale. It is emphatically not yet the case that one can practice with `dplyr` in one modality and hope to move to another back-end without significant debugging and work-arounds. The `replyr` package attempts to provide practical data manipulation affordances to make code perform similarly on local or remote (big) data.
Note: `replyr` is meant only for "tame data frames", that is, data frames with non-duplicate column names that are also valid simple (without quotes) R variable names, and with columns that are simple R vector types (numbers, strings, and such).
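For illustration (our own example, not from the package documentation), the distinction looks like this:

``` r
# tame: unique column names that are valid unquoted R names,
# and columns that are simple vectors
d_tame <- data.frame(user_id = 1:2, score = c(0.5, 0.9))

# not tame: a non-syntactic column name (would need back-quotes in R code)
# and a list column (not a simple vector type)
d_wild <- data.frame(x = 1:2)
names(d_wild) <- 'mean(x)'
d_wild$notes <- list(1:3, letters)
```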
`replyr` supplies methods to get a grip on working with remote `tbl` sources (SQL databases, Spark) through `dplyr`. The idea is to add convenience functions to make such tasks more like working with an in-memory `data.frame`. Results still do depend on which `dplyr` service you use, but with `replyr` you have fairly uniform access to some useful functions. The rule of thumb is: try `dplyr` first, and if that does not work check if `replyr` has researched a work-around.
`replyr` uniformly uses standard or parametric interfaces (names of variables as strings) instead of name capture, so that you can easily program over `replyr`.
Primary `replyr` services include:

- Join Controller
- Join Planner
- `replyr::replyr_split`
- `replyr::replyr_bind_rows`
- `replyr::gapply`
- `replyr::replyr_summary`
- `replyr::replyr_apply_f_mapped`
- `wrapr::let`
`wrapr::let` allows execution of arbitrary code with substituted variable names (note this is subtly different from binding values for names, as with `base::substitute` or `base::with`). This allows the user to write arbitrary `dplyr` code in the case of "parametric variable names" (that is, when variable names are not known at coding time, but will become available later at run time as values in other variables) without directly using the `dplyr` "underbar forms" (and without direct use of `lazyeval::interp`, `.dots=stats::setNames`, or `rlang`/`tidyeval`).
Example:
``` r
library('dplyr')

# nice parametric function we write
ComputeRatioOfColumns <- function(d,
                                  NumeratorColumnName,
                                  DenominatorColumnName,
                                  ResultColumnName) {
  wrapr::let(
    alias = list(NumeratorColumn = NumeratorColumnName,
                 DenominatorColumn = DenominatorColumnName,
                 ResultColumn = ResultColumnName),
    expr = {
      # (pretend) large block of code written with concrete column names.
      # due to the let wrapper in this function it will behave as if it was
      # using the specified parametric column names.
      d %>% mutate(ResultColumn = NumeratorColumn/DenominatorColumn)
    })
}

# example data
d <- data.frame(a = 1:5, b = 3:7)

# example application
d %>% ComputeRatioOfColumns('a', 'b', 'c')
#   a b         c
# 1 1 3 0.3333333
# 2 2 4 0.5000000
# 3 3 5 0.6000000
# 4 4 6 0.6666667
# 5 5 7 0.7142857
```
`wrapr::let` makes construction of abstract functions over `dplyr`-controlled data much easier. It is designed for the case where the "`expr`" block is a large sequence of statements and pipelines.
`wrapr::let` was only the secondary proposal in the original 2016 "Parametric variable names" article. What we really wanted was a stack of views, so the data pretended to have names that matched the code (i.e., re-mapping the data, not the code). With a bit of thought we can achieve this if we associate the data re-mapping with a function environment instead of with the data. So a re-mapping is active as long as a given controlling function is in control. In our case that function is `replyr::replyr_apply_f_mapped()`, which works as follows:
Suppose the operation we wish to use is a rank-reducing function that has been supplied as a function from somewhere else that we do not have control of (such as a package). The function could be as simple as the following, but we are going to assume we want to use it without alteration (including without the small alteration of introducing `wrapr::let()`).
``` r
# an external function with hard-coded column names
DecreaseRankColumnByOne <- function(d) {
  d$RankColumn <- d$RankColumn - 1
  d
}
```
To apply this function to `d` (which doesn't have the expected column names!) we use `replyr::replyr_apply_f_mapped()` to create a new parametrized adapter as follows:
``` r
# our data
d <- data.frame(Sepal_Length = c(5.8, 5.7),
                Sepal_Width = c(4.0, 4.4),
                Species = 'setosa',
                rank = c(1, 2))

# a wrapper to introduce parameters
DecreaseRankColumnByOneNamed <- function(d, ColName) {
  replyr::replyr_apply_f_mapped(d,
                                f = DecreaseRankColumnByOne,
                                nmap = c(RankColumn = ColName),
                                restrictMapIn = FALSE,
                                restrictMapOut = FALSE)
}

# use
dF <- DecreaseRankColumnByOneNamed(d, 'rank')
print(dF)
#   Sepal_Length Sepal_Width Species rank
# 1          5.8         4.0  setosa    0
# 2          5.7         4.4  setosa    1
```
`replyr::replyr_apply_f_mapped()` renames the columns to the names expected by `DecreaseRankColumnByOne` (the mapping specified in `nmap`), applies `DecreaseRankColumnByOne`, and then inverts the mapping before returning the value.
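For intuition, the core rename/apply/invert idea can be sketched for a local `data.frame` as follows (a simplified illustration only, not `replyr`'s actual implementation, which also handles remote `tbl`s and the `restrictMapIn`/`restrictMapOut` options):

``` r
# minimal sketch of the rename / apply / invert-rename idea
apply_f_mapped_sketch <- function(d, f, nmap) {
  # nmap is of the form c(NameTheCodeExpects = 'name_actually_in_data')
  idx <- match(nmap, colnames(d))
  colnames(d)[idx] <- names(nmap)  # rename data to the names f expects
  d <- f(d)                        # apply the foreign function
  idx <- match(names(nmap), colnames(d))
  colnames(d)[idx] <- nmap         # invert the mapping on the result
  d
}

d_small <- data.frame(rank = c(1, 2))
apply_f_mapped_sketch(d_small, DecreaseRankColumnByOne, c(RankColumn = 'rank'))
#   rank
# 1    0
# 2    1
```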
`replyr::replyr_split` and `replyr::replyr_bind_rows` work over many remote data types, including Spark. This allows code like the following:
``` r
suppressPackageStartupMessages(library("dplyr"))
library("replyr")

sc <- sparklyr::spark_connect(version = '2.0.2',
                              master = "local")
diris <- copy_to(sc, iris, 'diris')

f2 <- . %>%
  arrange(Sepal_Length, Sepal_Width, Petal_Length, Petal_Width) %>%
  head(2)

diris %>%
  replyr_split('Species') %>%
  lapply(f2) %>%
  replyr_bind_rows()
## Source:   query [6 x 5]
## Database: spark connection master=local[4] app=sparklyr local=TRUE
##
## # A tibble: 6 x 5
##      Species Sepal_Length Sepal_Width Petal_Length Petal_Width
##        <chr>        <dbl>       <dbl>        <dbl>       <dbl>
## 1 versicolor          5.0         2.0          3.5         1.0
## 2 versicolor          4.9         2.4          3.3         1.0
## 3     setosa          4.3         3.0          1.1         0.1
## 4     setosa          4.4         2.9          1.4         0.2
## 5  virginica          4.9         2.5          4.5         1.7
## 6  virginica          5.6         2.8          4.9         2.0

sparklyr::spark_disconnect(sc)
```
`replyr::gapply` is a "grouped ordered apply" data operation. Many calculations can be written in terms of this primitive, including per-group rank calculation (assuming your data service supports window functions), per-group summaries, and per-group selections. It is meant to be a specialization of the "Split-Apply-Combine" strategy with all three steps wrapped into a single operator.
Example:
``` r
library('dplyr')

d <- data.frame(group = c(1, 1, 2, 2, 2),
                order = c(.1, .2, .3, .4, .5))

rank_in_group <- . %>% mutate(constcol = 1) %>%
  mutate(rank = cumsum(constcol)) %>% select(-constcol)

d %>% replyr::gapply('group', rank_in_group, ocolumn = 'order', decreasing = TRUE)
#   group order rank
# 1     1   0.2    1
# 2     1   0.1    2
# 3     2   0.5    1
# 4     2   0.4    2
# 5     2   0.3    3
```
The user supplies a function or pipeline that is meant to be applied per-group, and the `replyr::gapply` wrapper orchestrates the calculation. In this example `rank_in_group` was assumed to know the column names in our data, so we directly used them instead of abstracting through `wrapr::let`. `replyr::gapply` defaults to using `dplyr::group_by` as its splitting or partitioning control, but can also perform actual splits using 'split' (`base::split`) or 'extract' (sequential extraction). Semantics are slightly different between cases given how `dplyr` treats grouping columns; the issue is illustrated in the difference between the definitions of `sumgroupS` and `sumgroupG` in this example.
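To make the 'split' style of partitioning concrete, here is the same per-group ranking done by hand with `base::split` (a purely local illustration of the split-apply-combine steps, not `replyr`'s code):

``` r
# split by group, rank within each piece in decreasing order, re-combine
parts <- split(d, d$group)
parts <- lapply(parts, function(di) {
  di <- di[order(di$order, decreasing = TRUE), , drop = FALSE]
  di$rank <- seq_len(nrow(di))
  di
})
do.call(rbind, parts)
```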
The `replyr::replyr_*` functions are all convenience functions supplying common functionality (such as `replyr::replyr_nrow`) that works across many data service providers. These are prefixed (instead of being S3 or S4 methods) so they do not interfere with common methods. Many of these functions can be expensive (which is why `dplyr` does not provide them as a default), or are patching around corner cases (which is why these functions appear to duplicate `base::` and `dplyr::` capabilities). The issues the `replyr::replyr_*` functions claim to patch around have all been filed as issues on the appropriate R packages and are documented here (to confirm they are not phantoms).
Example: `replyr::replyr_summary` working on a database service (when `base::summary` does not).
``` r
d <- data.frame(x = rep(c(1, 2, 2), 5),
                y = c(3, 5, NA),
                z = c(NA, 'a', 'b'),
                stringsAsFactors = FALSE)

my_db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::initExtension(my_db) # filed as dplyr issue https://github.com/tidyverse/dplyr/issues/3150
dRemote <- replyr::replyr_copy_to(my_db, d, 'd')

summary(dRemote)
#     Length Class                Mode
# src 2      src_SQLiteConnection list
# ops 2      op_base_remote       list

glimpse(dRemote)
# Rows: ??
# Columns: 3
# Database: sqlite 3.33.0 [:memory:]
# $ x <dbl> 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2
# $ y <dbl> 3, 5, NA, 3, 5, NA, 3, 5, NA, 3, 5, NA, 3, 5, NA
# $ z <chr> NA, "a", "b", NA, "a", "b", NA, "a", "b", NA, "a", "b", NA, "a", "b"

replyr::replyr_summary(dRemote)
#   column index     class nrows nna nunique min max     mean       sd lexmin lexmax
# 1      x     1   numeric    15   0      NA   1   2 1.666667 0.487950   <NA>   <NA>
# 2      y     2   numeric    15   5      NA   3   5 4.000000 1.054093   <NA>   <NA>
# 3      z     3 character    15   5      NA  NA  NA       NA       NA      a      b

cdata::qlook(my_db, 'd')
# table `d` SQLiteConnection
#  nrow: 15
#  NOTE: "obs" below is count of sample, not number of rows of data.
# 'data.frame':  10 obs. of  3 variables:
#  $ x: num  1 2 2 1 2 2 1 2 2 1
#  $ y: num  3 5 NA 3 5 NA 3 5 NA 3
#  $ z: chr  NA "a" "b" NA ...
```
Data types, capabilities, and row-orders all vary a lot as we switch remote data services. But the point of `replyr` is to provide at least some convenient version of typical functions such as: `summary`, `nrow`, unique values, and filtering rows by values in a set.
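For example, basic shape queries can be issued directly against the remote handle (a quick sketch, continuing with `dRemote` from the example above; printed forms may vary slightly by version):

``` r
# row count and dimensions computed on the database side, not in R
replyr::replyr_nrow(dRemote)
# [1] 15
replyr::replyr_dim(dRemote)
# [1] 15  3
```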
This is a very new package with no guarantees or claims of fitness for purpose. Some implemented operations are going to be slow and expensive (part of why they are not exposed in `dplyr` itself).
We will probably only ever cover:

- Native `data.frame`s (and `tbl`/`tibble`)
- `sparklyr` (Spark 2.0.0 or greater)
- `RPostgreSQL`
- `SQLite`
- `RMySQL` (limited support in some cases)
Additional `replyr` functions include:

- `replyr::replyr_filter`
- `replyr::replyr_inTest`

These are designed to subset data based on a column's values being in a given set, allowing selection of rows by testing membership in a set (very useful for partitioning data). Example below:
``` r
library('dplyr')

values <- c(2)
dRemote %>% replyr::replyr_filter('x', values)
# Warning: `as.tbl()` is deprecated as of dplyr 1.0.0.
# Please use `tibble::as_tibble()` instead.
# This warning is displayed once every 8 hours.
# Call `lifecycle::last_warnings()` to see where this warning was generated.
# # Source:   table<replyr_filter_70965071119671872805_0000000001> [?? x 3]
# # Database: sqlite 3.33.0 [:memory:]
#        x     y z
#    <dbl> <dbl> <chr>
#  1     2     5 a
#  2     2    NA b
#  3     2     5 a
#  4     2    NA b
#  5     2     5 a
#  6     2    NA b
#  7     2     5 a
#  8     2    NA b
#  9     2     5 a
# 10     2    NA b
```
There are a few goals for `replyr`:

- Providing missing convenience functions that work well over all common `dplyr` service providers. Examples include `replyr_summary`, `replyr_filter`, and `replyr_nrow`.
- Providing a basis for "row number free" data analysis. SQL back-ends don't commonly supply row number indexing (or even deterministic order of rows), so a lot of tasks you could do in memory by adjoining columns have to be done through formal key-based joins (see the sketch after this list).
- Providing emulations of functionality missing from non-favored service providers (such as windowing functions, `quantile`, `sample_n`, and `cumsum`; missing from `SQLite` and `RMySQL`).
- Working around corner case issues, and some variations in semantics.
- Sheer bull-headedness in emulating operations that don't quite fit into the pure `dplyr` formulation.
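To illustrate the second bullet: instead of relying on row order to adjoin a derived column, carry an explicit key and join on it (our own small example; the same pattern works locally and against remote sources):

``` r
library('dplyr')

d1 <- data.frame(key = c(1, 2, 3), x = c(10, 20, 30))
d2 <- data.frame(key = c(3, 1, 2), y = c('c', 'a', 'b'))

# order-independent: correct even if a remote source scrambles row order
left_join(d1, d2, by = 'key')
#   key  x y
# 1   1 10 a
# 2   2 20 b
# 3   3 30 c
```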
Good code should fill one important gap and work on a variety of `dplyr` back-ends (you can test `RMySQL` and `RPostgreSQL` using docker, as mentioned here and here; `sparklyr` can be tried in local mode as described here). I am especially interested in clever "you wouldn't think this was efficiently possible, but" solutions (which give us an expanded grammar of useful operators), and in replacing current hacks with more efficient general solutions. Targets of interest include `sample_n` (which isn't currently implemented for `tbl_sqlite`), `cumsum`, and `quantile` (currently we have an expensive implementation of `quantile` based on binary search: `replyr::replyr_quantile`).
`replyr` services include:

- Moving data into or out of the remote data store (including adding optional row numbers): `replyr_copy_to` and `replyr_copy_from` (see the sketch after this list).
- Basic summary info: `replyr_nrow`, `replyr_dim`, and `replyr_summary`.
- Random row sampling (like `dplyr::sample_n`, but working with more service providers). Some of this functionality is provided by `replyr_filter` and `replyr_inTest`.
- Emulating the Split-Apply-Combine strategy, which is the purpose of `gapply`, `replyr_split`, and `replyr_bind_rows`.
- Emulating `tidyr` gather/spread (or pivoting and anti-pivoting).
- Patching around differences in `dplyr` service providers (and documenting the reasons for the patches).
- Making use of "parameterized names" much easier (that is: writing code that does not know the name of the column it is expected to work over, but instead takes the column name from a user-supplied variable).
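The data-movement helpers in the first bullet give a simple local/remote round trip. A small sketch, reusing the SQLite connection `my_db` from the earlier example (exact argument details may vary by `replyr` version):

``` r
# push a local data.frame to the database as a new table,
# then pull a local copy of the remote table back into R
dRT <- replyr::replyr_copy_to(my_db, d, 'd_round_trip')
dLocal <- replyr::replyr_copy_from(dRT)
```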
Additional desired capabilities of interest include:

- `cumsum` or row numbering (interestingly enough, if you have row numbering you can implement cumulative sum in log-n rounds using joins to implement pointer chasing/jumping ideas, though that is unlikely to be practical; `lag` is enough to generate next pointers, which can be boosted to row-numberings). See the sketch after this list.
- Inserting random values (or, even better, random unique values) in a remote column. Most service providers have a pseudo-random source you can use.
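The "log-n rounds" remark in the first bullet can be made concrete with a doubling (pointer-jumping) prefix sum: each round joins the table to a copy of itself shifted by a power of two, so after about log2(n) joins every row holds its cumulative sum. A local illustration of the idea only (`i` is an assumed row-number column):

``` r
library('dplyr')

d <- data.frame(i = 1:8, v = c(1, 2, 3, 4, 5, 6, 7, 8))
d$s <- d$v  # running partial sums, initialized to the raw values

k <- 1
while (k < nrow(d)) {
  # adjoin to each row i the partial sum of row i - k, then add it in
  shifted <- transmute(d, i = i + k, s_prev = s)
  d <- d %>%
    left_join(shifted, by = 'i') %>%
    mutate(s = s + coalesce(s_prev, 0)) %>%
    select(-s_prev)
  k <- 2 * k
}

d$s
# [1]  1  3  6 10 15 21 28 36   (equals cumsum(d$v))
```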
`replyr` is a package for speeding up reliable data manipulation using `dplyr` (especially on databases and Spark). It is also a good central place to collect patches and fixes needed to work around corner cases and semantic variations between versions of data sources.
``` r
rm(list = ls())
gc()
#           used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
# Ncells 1026656 54.9    2076977  111         NA  1383707 73.9
# Vcells 1925172 14.7    8388608   64      49152  2686998 20.6
```
Note: `replyr` is targeted at data with "tame column names" (column names that are valid both in databases and as R unquoted variable names) and basic types (column values that are simple R types such as `character`, `numeric`, `logical`, and so on).
Also, `replyr` tries to be a "source agnostic" package, meaning it minimizes the number of places where it checks the data source and uses specialized code; this can mean some operations are slow. For example, `replyr` does not (yet) use `sparklyr::sdf_pivot()`.