Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code #10191

Closed
wants to merge 17 commits into from

Conversation

nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Apr 28, 2021

Discussing with @bkietz on #10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

  • You can't dictionary_encode a dataset column. Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})} (ARROW-12632). I will remove the as.factor method and leave a TODO to restore it after that JIRA is resolved.
  • with the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing mutate(); this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:

  • 5b501c5 is the main switch to use InMemoryDataset
  • b31fb5e deletes array_expression
  • 0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
  • 2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
  • d12f584 just splits up dplyr.R into many files; 34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
  • a0914f6 + eee491a contain ARROW-12696

@ianmcook
Copy link
Member

ianmcook commented May 3, 2021

The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct?

@nealrichardson
Copy link
Member Author

The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct?

Correct. This also puts us in better shape to use the C++ query engine (which will use Expressions, and which Datasets will feed) when it becomes available 🔜 . But the most immediate effect for us is that we get to simplify our code (less FUN passed around) and delete duplicate tests.

r/tests/testthat/test-dplyr-mutate.R Outdated Show resolved Hide resolved
r/tests/testthat/test-dplyr.R Outdated Show resolved Hide resolved
@nealrichardson nealrichardson force-pushed the dplyr-in-memory branch 2 times, most recently from eb6848f to 4f854b8 Compare May 6, 2021 17:52
r/R/dplyr.R Outdated Show resolved Hide resolved
@nealrichardson nealrichardson force-pushed the dplyr-in-memory branch 2 times, most recently from 27c62f5 to 795e1f9 Compare May 8, 2021 00:07
@nealrichardson nealrichardson marked this pull request as ready for review May 10, 2021 19:44
@nealrichardson nealrichardson changed the title [R] [WIP] Use InMemoryDataset for Table/RecordBatch in dplyr code [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 10, 2021
@nealrichardson nealrichardson changed the title [R] Use InMemoryDataset for Table/RecordBatch in dplyr code ARROW-12371: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 10, 2021
@westonpace
Copy link
Member

I think you snagged the wrong Jira :). Or else I haven't been following this issue closely enough.

@nealrichardson
Copy link
Member Author

Yeah I may have transposed the numbers, will confirm

@nealrichardson nealrichardson changed the title ARROW-12371: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code May 11, 2021
@github-actions
Copy link

@apache apache deleted a comment from github-actions bot May 12, 2021
@apache apache deleted a comment from github-actions bot May 12, 2021
Copy link
Member

@jonkeane jonkeane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, a few (small) comments

r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Outdated Show resolved Hide resolved
r/R/dplyr-functions.R Show resolved Hide resolved
@nealrichardson
Copy link
Member Author

@github-actions crossbow submit -g r

@github-actions
Copy link

Revision: bc6e356

Submitted crossbow builds: ursacomputing/crossbow @ actions-407

Task Status
conda-linux-gcc-py36-cpu-r36 Azure
conda-linux-gcc-py37-cpu-r40 Azure
conda-osx-clang-py36-r36 Azure
conda-osx-clang-py37-r40 Azure
conda-win-vs2017-py36-r36 Azure
conda-win-vs2017-py37-r40 Azure
homebrew-r-autobrew Github Actions
test-r-devdocs Github Actions
test-r-install-local Github Actions
test-r-linux-as-cran Github Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-3.6-bionic Azure
test-r-rstudio-r-base-3.6-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-3.6-centos8 Azure
test-r-rstudio-r-base-3.6-opensuse15 Azure
test-r-rstudio-r-base-3.6-opensuse42 Azure
test-r-version-compatibility Github Actions
test-r-versions Github Actions
test-r-without-arrow Azure
test-ubuntu-18.04-r-sanitizer Azure

@ianmcook
Copy link
Member

ianmcook commented May 13, 2021

This looks great.

The increased dependency on Dataset means we will have to skip many more tests to make test-r-minimal-build pass.

@nealrichardson
Copy link
Member Author

@github-actions crossbow submit test-r-minimal-build

@github-actions
Copy link

Revision: fa5731e

Submitted crossbow builds: ursacomputing/crossbow @ actions-410

Task Status
test-r-minimal-build Azure

@nealrichardson nealrichardson deleted the dplyr-in-memory branch May 13, 2021 15:47
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Discussing with @bkietz on apache#10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't dictionary_encode a dataset column. `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* with the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group into distinct commits the main action. Highlights:

* apache@5b501c5 is the main switch to use InMemoryDataset
* apache@b31fb5e deletes `array_expression`
* apache@0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
* apache@2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* apache@d12f584 just splits up dplyr.R into many files; apache@34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface).
* apache@a0914f6 + apache@eee491a contain ARROW-12696

Closes apache#10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants