ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code #10191

nealrichardson · 2021-04-28T21:29:15Z

Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't `dictionary_encode` a dataset column: `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* With the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.
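
The wrapping approach described above can be sketched roughly as follows. This is a minimal illustration rather than code from the PR: it assumes an arrow R build with Dataset support, and the column names `x` and `y` are made up for the example.

```r
library(arrow)
library(dplyr)

# A small in-memory Table; the columns are illustrative only.
tab <- Table$create(x = 1:5, y = c("a", "b", "c", "d", "e"))

# Wrap the Table so the Dataset machinery can evaluate the query.
ds <- InMemoryDataset$create(tab)

# filter() and select() are recorded on the query and evaluated on collect().
result <- ds %>%
  filter(x > 2) %>%
  select(y) %>%
  collect()
```

Here `result` is an ordinary R data frame holding only the rows and columns the query asked for.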

There are a lot of changes here, so the diff is big, but I've tried to group the main changes into distinct commits. Highlights:

* 5b501c5 is the main switch to use InMemoryDataset
* b31fb5e deletes `array_expression`
* 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
* 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
* a0914f6 + eee491a contain ARROW-12696

ianmcook · 2021-05-03T15:17:01Z

The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct?

nealrichardson · 2021-05-03T15:21:06Z

> The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct?

Correct. This also puts us in better shape to use the C++ query engine (which will use Expressions, and which Datasets will feed) when it becomes available 🔜. But the most immediate effect for us is that we get to simplify our code (less FUN passed around) and delete duplicate tests.

westonpace · 2021-05-10T20:15:58Z

I think you snagged the wrong Jira :). Or else I haven't been following this issue closely enough.

nealrichardson · 2021-05-10T22:24:01Z

Yeah I may have transposed the numbers, will confirm

github-actions · 2021-05-11T12:13:23Z

https://issues.apache.org/jira/browse/ARROW-12731

jonkeane

This looks good, a few (small) comments

nealrichardson · 2021-05-12T22:57:55Z

@github-actions crossbow submit -g r

github-actions · 2021-05-12T22:58:51Z

Revision: bc6e356

Submitted crossbow builds: ursacomputing/crossbow @ actions-407

Task	Status
conda-linux-gcc-py36-cpu-r36
conda-linux-gcc-py37-cpu-r40
conda-osx-clang-py36-r36
conda-osx-clang-py37-r40
conda-win-vs2017-py36-r36
conda-win-vs2017-py37-r40
homebrew-r-autobrew
test-r-devdocs
test-r-install-local
test-r-linux-as-cran
test-r-linux-valgrind
test-r-minimal-build
test-r-rhub-ubuntu-gcc-release-latest
test-r-rocker-r-base-latest
test-r-rstudio-r-base-3.6-bionic
test-r-rstudio-r-base-3.6-centos7-devtoolset-8
test-r-rstudio-r-base-3.6-centos8
test-r-rstudio-r-base-3.6-opensuse15
test-r-rstudio-r-base-3.6-opensuse42
test-r-version-compatibility
test-r-versions
test-r-without-arrow
test-ubuntu-18.04-r-sanitizer

ianmcook · 2021-05-13T03:04:23Z

This looks great.

The increased dependency on Dataset means we will have to skip many more tests to make test-r-minimal-build pass.
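
One way to guard such tests is to skip them when the build lacks Dataset support. The following is a sketch only: `arrow_with_dataset()` is the arrow package's capability check, while the test name and body are hypothetical and not taken from the PR.

```r
library(testthat)
library(arrow)
library(dplyr)

test_that("filter on a Table (hypothetical example test)", {
  # Skip rather than fail on minimal builds without the Dataset feature.
  skip_if_not(arrow_with_dataset(), "arrow was built without Dataset support")

  tab <- Table$create(x = 1:5)
  result <- tab %>%
    filter(x > 3) %>%
    collect()
  expect_equal(result$x, 4:5)
})
```

A skipped test is reported as skipped rather than failed, so a minimal build can still pass while full builds run the test.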

nealrichardson · 2021-05-13T14:47:53Z

@github-actions crossbow submit test-r-minimal-build

github-actions · 2021-05-13T15:07:38Z

Revision: fa5731e

Submitted crossbow builds: ursacomputing/crossbow @ actions-410

Task	Status
test-r-minimal-build

Closes apache#10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

github-actions bot added the Component: R label Apr 28, 2021

nealrichardson mentioned this pull request Apr 30, 2021

ARROW-12614: [C++][Compute] Remove support for Tables in ExecuteScalarExpression #10213

Closed

nealrichardson force-pushed the dplyr-in-memory branch from 3bbe342 to 9640cff Compare April 30, 2021 22:50

bkietz self-requested a review May 3, 2021 15:07

bkietz reviewed May 3, 2021

r/tests/testthat/test-dplyr-mutate.R (outdated, resolved)

r/tests/testthat/test-dplyr.R (outdated, resolved)

nealrichardson force-pushed the dplyr-in-memory branch 2 times, most recently from eb6848f to 4f854b8 Compare May 6, 2021 17:52

bkietz reviewed May 7, 2021

r/R/dplyr.R (outdated, resolved)

nealrichardson force-pushed the dplyr-in-memory branch 2 times, most recently from 27c62f5 to 795e1f9 Compare May 8, 2021 00:07

nealrichardson marked this pull request as ready for review May 10, 2021 19:44

nealrichardson changed the title ~~[R] [WIP] Use InMemoryDataset for Table/RecordBatch in dplyr code~~ [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 10, 2021

nealrichardson changed the title ~~[R] Use InMemoryDataset for Table/RecordBatch in dplyr code~~ ARROW-12371: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 10, 2021

nealrichardson changed the title ~~ARROW-12371: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code~~ ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 11, 2021

apache deleted a comment from github-actions bot May 12, 2021

thisisnic reviewed May 12, 2021

r/R/dplyr-functions.R (resolved)

jonkeane reviewed May 12, 2021

r/R/dplyr-functions.R (resolved)

thisisnic reviewed May 12, 2021

r/tests/testthat/test-dplyr-string-functions.R (resolved)

jonkeane approved these changes May 12, 2021

nealrichardson added 2 commits May 13, 2021 07:29

Use InMemoryDataset for Table/RecordBatch in dplyr code

b16f84f

Fix some failures, find another?

e47fb0e

nealrichardson and others added 15 commits May 13, 2021 07:29

Debugging and skipping

40d0d37

Clean up now that dataset writing issue is fixed

37ba677

Remove skip

27a367e

Add JIRA issue to skip

457dcc4

Some simplifications

fdbbb27

Add Expression(schema) method and improve adq print method

d2104db

Remove 'array_expression' class

4258e85

Move dplyr function definitions to own file and simplify (no more FUN)

1a38f01

Split dplyr.R

47dd897

Update test that you can't add vectors to a dataset

e0df53c

Delete some duplicated tests in test-dataset.R

a56ad09

ARROW-12696: [R] Improve testing of error messages converted to warnings

501a1a4

Apply same message handling in filter(); refactor

cc55d47

s/=/<-/g

52643dd

Co-authored-by: Jonathan Keane <jkeane@gmail.com>

Skip dplyr tests if dataset not in build

fa5731e

nealrichardson force-pushed the dplyr-in-memory branch from bc6e356 to fa5731e Compare May 13, 2021 14:47

nealrichardson closed this in 9347731 May 13, 2021

nealrichardson deleted the dplyr-in-memory branch May 13, 2021 15:47

thisisnic mentioned this pull request Nov 3, 2021

R blog post apache/arrow-site#158

Merged

This was referenced May 5, 2021

[C++] Dataset writing can only include projected columns if input columns are also included #18647

Closed

[R] Use InMemoryDataset for Table/RecordBatch in dplyr code #28473

Closed
