-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-17738: [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table #14160
Conversation
|
d756178
to
c02c856
Compare
@@ -119,7 +119,7 @@ test_that("collect(as_data_frame=FALSE)", { | |||
filter(int > 5) %>% | |||
group_by(int) %>% | |||
collect(as_data_frame = FALSE) | |||
expect_s3_class(b4, "arrow_dplyr_query") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need a test (and matching code change) that a Table with groups in the $metadata
prints that it is grouped. Currently, the print method does not reveal that you have groupings, even though they get restored when you as.data.frame()
:
> tab <- arrow_table(mtcars)
> tab$metadata$r$attributes$.group_vars <- "cyl"
> tab
Table
32 rows x 11 columns
$mpg <double>
$cyl <double>
$disp <double>
$hp <double>
$drat <double>
$wt <double>
$qsec <double>
$vs <double>
$am <double>
$gear <double>
$carb <double>
See $metadata for additional Schema metadata
I wonder what else is lost or concealed by this change. This feature of reading groups from $metadata$r$attributes$.group_vars
was a bit of a hack and not something we're generally relying on, so I don't think it's a simple change. Or at least, it's not something we can just rely that the test suite already covers everything we care about. (Search for ".group_vars" in the code, it's not touched in many places, and only in one test.)
Not a reason not to do it, just need to be thorough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your review.
I think we need a test (and matching code change) that a Table with groups in the
$metadata
prints that it is grouped. Currently, the print method does not reveal that you have groupings, even though they get restored when youas.data.frame()
:
I agree that the print results need improvement.
But would it be better to create a separate ticket and do it there?
This occurs even in arrow 9.0.0, regardless of the behavior with compute
.
mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table()
#> Table
#> 32 rows x 11 columns
#> $mpg <double>
#> $cyl <double>
#> $disp <double>
#> $hp <double>
#> $drat <double>
#> $wt <double>
#> $qsec <double>
#> $vs <double>
#> $am <double>
#> $gear <double>
#> $carb <double>
#>
#> See $metadata for additional Schema metadata
Created on 2022-09-20 with reprex v2.0.2
This feature of reading groups from
$metadata$r$attributes$.group_vars
was a bit of a hack and not something we're generally relying on, so I don't think it's a simple change. Or at least, it's not something we can just rely that the test suite already covers everything we care about.
Since dplyr::group_vars
reads this field, isn't this function generally used?
b3813e7
to
aa0be7e
Compare
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
This reverts commit 83eafbe.
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any problems with the implementation, but I don't happen to think think that restoring groups without user consent is a good idea (feel free to convince me otherwise, though!). I know that there is precedent here...this is something we do in a few places (e.g., I think you can even write/read a Parquet file and have the groupings restored). Unintended groupings are often left over after a summarise()
and it's always safer to do an explicit group_by()
before a dplyr operation anyway (again, feel free to alert me to a use-case I'm missing).
The fact that I happen to think that shouldn't block that PR if @nealrichardson doesn't think it's a problem...I'm a new contributor to this package in the grand scheme of things!
Thank you for taking a look at this. mtcars |> arrow::arrow_table() |> dplyr::group_by(vs, am) |> dplyr::summarise(wt = mean(wt)) |> dplyr::compute()
#> Table (query)
#> vs: double
#> am: double
#> wt: double
#>
#> * Grouped by vs
#> See $.data for the source Arrow object I noticed while creating this PR that using |
Agreed! I personally think it would be safer not to save the grouping variables with the table; however, I can also see the argument for consistency since there are other places where this information gets saved. |
I believe saving groups is necessary to do so at this stage because it is counter-intuitive to have different results for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You convinced me! I will save my grievances with metadata restoration for another time.
In that case, I think the change that is needed is a test_that("collect() is identical to compute() %>% collect()", {})
with grouped and ungrouped Tables and Datasets. Does that sound reasonable?
I am not personally worried about the printing of groups but am happy to defer to Neal's earlier review...automatic metadata restoration does lots of other things we could warn users about too and it's reasonable that it's best done together.
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Thank you for your suggestion. I have added tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great from my end! Thanks!
Thanks for merging! |
Benchmark runs are scheduled for baseline = 45a008d and contender = 12667cd. 12667cd is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
No longer needed thanks to apache#14160 and @eitsupi's work
No longer needed thanks to apache#14160 and @eitsupi's work
…yr_query to arrow Table (apache#14160) Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com> Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
No description provided.