rqdatatable
is an
implementation of the rquery
piped Codd-style relational algebra hosted on
data.table
. rquery
allow the expression of
complex transformations as a series of relational operators and
rqdatatable
implements the operators using data.table
.
A Python
version of rquery
/rqdatatable
is under initial
development as
data_algebra
.
For example scoring a logistic regression model (which requires grouping, ordering, and ranking) is organized as follows. For more on this example please see “Let’s Have Some Sympathy For The Part-time R User”.
library("rqdatatable")
## Loading required package: rquery
# data example
dL <- build_frame(
"subjectID", "surveyCategory" , "assessmentTotal" |
1 , "withdrawal behavior", 5 |
1 , "positive re-framing", 2 |
2 , "withdrawal behavior", 3 |
2 , "positive re-framing", 4 )
scale <- 0.237
# example rquery pipeline
rquery_pipeline <- local_td(dL) %.>%
extend_nse(.,
probability :=
exp(assessmentTotal * scale)) %.>%
normalize_cols(.,
"probability",
partitionby = 'subjectID') %.>%
pick_top_k(.,
k = 1,
partitionby = 'subjectID',
orderby = c('probability', 'surveyCategory'),
reverse = c('probability', 'surveyCategory')) %.>%
rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
select_columns(., c('subjectID',
'diagnosis',
'probability')) %.>%
orderby(., cols = 'subjectID')
We can show the expanded form of query tree.
cat(format(rquery_pipeline))
mk_td("dL", c(
"subjectID",
"surveyCategory",
"assessmentTotal")) %.>%
extend(.,
probability := exp(assessmentTotal * 0.237)) %.>%
extend(.,
probability := probability / sum(probability),
partitionby = c('subjectID'),
orderby = c(),
reverse = c()) %.>%
extend(.,
row_number := row_number(),
partitionby = c('subjectID'),
orderby = c('probability', 'surveyCategory'),
reverse = c('probability', 'surveyCategory')) %.>%
select_rows(.,
row_number <= 1) %.>%
rename_columns(.,
c('diagnosis' = 'surveyCategory')) %.>%
select_columns(., c(
"subjectID", "diagnosis", "probability")) %.>%
order_rows(.,
c('subjectID'),
reverse = c(),
limit = NULL)
And execute it using data.table
.
ex_data_table(rquery_pipeline)
## subjectID diagnosis probability
## 1: 1 withdrawal behavior 0.6706221
## 2: 2 positive re-framing 0.5589742
One can also apply the pipeline to new tables.
build_frame(
"subjectID", "surveyCategory" , "assessmentTotal" |
7 , "withdrawal behavior", 5 |
7 , "positive re-framing", 20 ) %.>%
rquery_pipeline
## subjectID diagnosis probability
## 1: 7 positive re-framing 0.9722128
Initial bench-marking of rqdatatable
is very favorable (notes
here).
Note rqdatatable
has an “immediate mode” which allows direct
application of pipelines stages without pre-assembling the pipeline.
“Immediate mode” is a convenience for ad-hoc analyses, and has some
negative performance impact, so we encourage users to build pipelines
for most work. Some notes on the issue can be found
here.
rqdatatable
implements the rquery
grammar in the style of a “Turing
or Cook reduction” (implementing the result in terms of multiple oracle
calls to the related system). For some related work on translating
dplyr
to data.table
in the style of “Karp reductions” (implementing
the result as a translation into a single problem in the related system)
please see table.express
and also dtplyr
(though the
0.0.3
and prior versions of dtplyr
have severe performance issues
due to excess table copying,
ref,
ref).
To install rqdatatable
please use install.packages("rqdatatable")
.
–
Note rqdatatable
is intended for “simple column names”, in particular
as rqdatatable
often uses eval()
to work over data.table
escape
characters such as “\
” and “\\
” are not reliable in column names.
Also rqdatatable
does not support tables with no columns.