Skip to content
Implement the rquery piped query algebra in R using data.table. Distributed under choice of GPL-2 or GPL-3 license.
R
Branch: master
Clone or download
Latest commit 061bd72 Nov 12, 2019
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
R check row count on project Nov 4, 2019
docs rebuild site Nov 12, 2019
extras rebuild and recheck Nov 12, 2019
inst/unit_tests check row count on project Nov 4, 2019
man simplify package Oct 24, 2019
tests fix test runner Feb 17, 2019
tools try to fix image Jul 8, 2018
vignettes add link Jun 23, 2019
.Rbuildignore move wrapping to rquery Oct 24, 2019
.gitignore add CRAN-RELEASE Feb 17, 2019
DESCRIPTION rebuild and recheck Nov 12, 2019
LICENSE adjust license Jul 23, 2019
NAMESPACE simplify package Oct 24, 2019
NEWS.md rebuild and recheck Nov 12, 2019
README.Rmd add data-algebra link Aug 27, 2019
README.md rebuild site Sep 26, 2019
cran-comments.md rebuild and recheck Nov 12, 2019
rqdatatable.Rproj rename project May 31, 2018

README.md

rqdatatable is an implementation of the rquery piped Codd-style relational algebra hosted on data.table. rquery allow the expression of complex transformations as a series of relational operators and rqdatatable implements the operators using data.table.

A Python version of rquery/rqdatatable is under initial development as data_algebra.

For example scoring a logistic regression model (which requires grouping, ordering, and ranking) is organized as follows. For more on this example please see “Let’s Have Some Sympathy For The Part-time R User”.

library("rqdatatable")
## Loading required package: rquery
# data example
dL <- build_frame(
   "subjectID", "surveyCategory"     , "assessmentTotal" |
   1          , "withdrawal behavior", 5                 |
   1          , "positive re-framing", 2                 |
   2          , "withdrawal behavior", 3                 |
   2          , "positive re-framing", 4                 )
scale <- 0.237

# example rquery pipeline
rquery_pipeline <- local_td(dL) %.>%
  extend_nse(.,
             probability :=
               exp(assessmentTotal * scale))  %.>% 
  normalize_cols(.,
                 "probability",
                 partitionby = 'subjectID') %.>%
  pick_top_k(.,
             k = 1,
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory'),
             reverse = c('probability', 'surveyCategory')) %.>% 
  rename_columns(., c('diagnosis' = 'surveyCategory')) %.>%
  select_columns(., c('subjectID', 
                      'diagnosis', 
                      'probability')) %.>%
  orderby(., cols = 'subjectID')

We can show the expanded form of query tree.

cat(format(rquery_pipeline))
mk_td("dL", c(
  "subjectID",
  "surveyCategory",
  "assessmentTotal")) %.>%
 extend(.,
  probability := exp(assessmentTotal * 0.237)) %.>%
 extend(.,
  probability := probability / sum(probability),
  partitionby = c('subjectID'),
  orderby = c(),
  reverse = c()) %.>%
 extend(.,
  row_number := row_number(),
  partitionby = c('subjectID'),
  orderby = c('probability', 'surveyCategory'),
  reverse = c('probability', 'surveyCategory')) %.>%
 select_rows(.,
   row_number <= 1) %.>%
 rename_columns(.,
  c('diagnosis' = 'surveyCategory')) %.>%
 select_columns(., c(
   "subjectID", "diagnosis", "probability")) %.>%
 order_rows(.,
  c('subjectID'),
  reverse = c(),
  limit = NULL)

And execute it using data.table.

ex_data_table(rquery_pipeline)
##    subjectID           diagnosis probability
## 1:         1 withdrawal behavior   0.6706221
## 2:         2 positive re-framing   0.5589742

One can also apply the pipeline to new tables.

build_frame(
   "subjectID", "surveyCategory"     , "assessmentTotal" |
   7          , "withdrawal behavior", 5                 |
   7          , "positive re-framing", 20                ) %.>%
  rquery_pipeline
##    subjectID           diagnosis probability
## 1:         7 positive re-framing   0.9722128

Initial bench-marking of rqdatatable is very favorable (notes here).

Note rqdatatable has an “immediate mode” which allows direct application of pipelines stages without pre-assembling the pipeline. “Immediate mode” is a convenience for ad-hoc analyses, and has some negative performance impact, so we encourage users to build pipelines for most work. Some notes on the issue can be found here.

rqdatatable implements the rquery grammar in the style of a “Turing or Cook reduction” (implementing the result in terms of multiple oracle calls to the related system). For some related work on translating dplyr to data.table in the style of “Karp reductions” (implementing the result as a translation into a single problem in the related system) please see table.express and also dtplyr (though the 0.0.3 and prior versions of dtplyr have severe performance issues due to excess table copying, ref, ref).

To install rqdatatable please use install.packages("rqdatatable").

Note rqdatatable is intended for “simple column names”, in particular as rqdatatable often uses eval() to work over data.table escape characters such as “\” and “\\” are not reliable in column names. Also rqdatatable does not support tables with no columns.

You can’t perform that action at this time.