ARROW-12731: [R] Use InMemoryDataset for Table/RecordBatch in dplyr code #10191
Conversation
The purpose of this is to simplify the dplyr implementation by having only one API (the Dataset API) that we are coding against, correct?

Correct. This also puts us in better shape to use the C++ query engine (which will use Expressions, and which Datasets will feed) when it becomes available 🔜. But the most immediate effect for us is that we get to simplify our code (less …)
I think you snagged the wrong Jira :). Or else I haven't been following this issue closely enough.

Yeah, I may have transposed the numbers; will confirm.
This looks good, a few (small) comments
@github-actions crossbow submit -g r

Revision: bc6e356
Submitted crossbow builds: ursacomputing/crossbow @ actions-407
This looks great. The increased dependency on Dataset means we will have to skip many more tests to make …
@github-actions crossbow submit test-r-minimal-build

Revision: fa5731e
Submitted crossbow builds: ursacomputing/crossbow @ actions-410
Discussing with @bkietz on apache#10166, we realized that we could already evaluate filter/project on Table/RecordBatch by wrapping it in InMemoryDataset and using the Dataset machinery, so I wanted to see how well that worked. Mostly it does, with a couple of caveats:

* You can't `dictionary_encode` a dataset column: `Error: Invalid: ExecuteScalarExpression cannot Execute non-scalar expression {x=dictionary_encode(x, {NON-REPRESENTABLE OPTIONS})}` (ARROW-12632). I will remove the `as.factor` method and leave a TODO to restore it after that JIRA is resolved.
* With the existing array_expressions, you could supply an additional Array (or R data convertible to an Array) when doing `mutate()`; this is not implemented for Datasets, and that's ok. For Tables/RecordBatches, the behavior in this PR is to pull the data into R, which is fine.

There are a lot of changes here, which means the diff is big, but I've tried to group the main action into distinct commits. Highlights:

* apache@5b501c5 is the main switch to use InMemoryDataset
* apache@b31fb5e deletes `array_expression`
* apache@0d31938 simplifies the interface for adding functions to the dplyr data_mask; definitely check this one out and see what you think of the new way--I hope it's much simpler to add new functions
* apache@2e6374f improves the print method for queries by showing both the expression and the expected type of the output column, per suggestion from @bkietz
* apache@d12f584 just splits up dplyr.R into many files; apache@34dc1e6 deletes tests that are duplicated between test-dplyr*.R and test-dataset.R (since they're now going through a common C++ interface)
* apache@a0914f6 + apache@eee491a contain ARROW-12696

Closes apache#10191 from nealrichardson/dplyr-in-memory

Authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>