Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17738: [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table #14160

Merged
merged 12 commits into from
Oct 7, 2022

Conversation

eitsupi
Copy link
Contributor

@eitsupi eitsupi commented Sep 17, 2022

No description provided.

@github-actions
Copy link

@github-actions
Copy link

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@eitsupi eitsupi changed the title ARROW-17738: [R] dplyr::compute does not work for grouped arrow dplyr query ARROW-17738: [R] dplyr::compute should convert from grouped arrow_dplyr_query to arrow Table Sep 17, 2022
@eitsupi eitsupi force-pushed the r-fix-compute branch 2 times, most recently from d756178 to c02c856 Compare September 20, 2022 03:55
r/tests/testthat/test-dplyr-collect.R Outdated Show resolved Hide resolved
@@ -119,7 +119,7 @@ test_that("collect(as_data_frame=FALSE)", {
filter(int > 5) %>%
group_by(int) %>%
collect(as_data_frame = FALSE)
expect_s3_class(b4, "arrow_dplyr_query")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a test (and matching code change) that a Table with groups in the $metadata prints that it is grouped. Currently, the print method does not reveal that you have groupings, even though they get restored when you as.data.frame():

> tab <- arrow_table(mtcars)
> tab$metadata$r$attributes$.group_vars <- "cyl"
> tab
Table
32 rows x 11 columns
$mpg <double>
$cyl <double>
$disp <double>
$hp <double>
$drat <double>
$wt <double>
$qsec <double>
$vs <double>
$am <double>
$gear <double>
$carb <double>

See $metadata for additional Schema metadata

I wonder what else is lost or concealed by this change. This feature of reading groups from $metadata$r$attributes$.group_vars was a bit of a hack and not something we're generally relying on, so I don't think it's a simple change. Or at least, it's not something we can just rely that the test suite already covers everything we care about. (Search for ".group_vars" in the code, it's not touched in many places, and only in one test.)

Not a reason not to do it, just need to be thorough.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review.

I think we need a test (and matching code change) that a Table with groups in the $metadata prints that it is grouped. Currently, the print method does not reveal that you have groupings, even though they get restored when you as.data.frame():

I agree that the print results need improvement.
But would it be better to create a separate ticket and do it there?
This occurs even in arrow 9.0.0, regardless of the behavior with compute.

mtcars |> dplyr::group_by(cyl) |> arrow::arrow_table()
#> Table
#> 32 rows x 11 columns
#> $mpg <double>
#> $cyl <double>
#> $disp <double>
#> $hp <double>
#> $drat <double>
#> $wt <double>
#> $qsec <double>
#> $vs <double>
#> $am <double>
#> $gear <double>
#> $carb <double>
#>
#> See $metadata for additional Schema metadata

Created on 2022-09-20 with reprex v2.0.2

This feature of reading groups from $metadata$r$attributes$.group_vars was a bit of a hack and not something we're generally relying on, so I don't think it's a simple change. Or at least, it's not something we can just rely that the test suite already covers everything we care about.

Since dplyr::group_vars reads this field, isn't this function generally used?

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any problems with the implementation, but I don't happen to think think that restoring groups without user consent is a good idea (feel free to convince me otherwise, though!). I know that there is precedent here...this is something we do in a few places (e.g., I think you can even write/read a Parquet file and have the groupings restored). Unintended groupings are often left over after a summarise() and it's always safer to do an explicit group_by() before a dplyr operation anyway (again, feel free to alert me to a use-case I'm missing).

The fact that I happen to think that shouldn't block that PR if @nealrichardson doesn't think it's a problem...I'm a new contributor to this package in the grand scheme of things!

@eitsupi
Copy link
Contributor Author

eitsupi commented Oct 5, 2022

Thank you for taking a look at this.
I am more concerned with dplyr::compute not returning a Table than whether the group will continue to remain.

mtcars |> arrow::arrow_table() |> dplyr::group_by(vs, am) |> dplyr::summarise(wt = mean(wt)) |> dplyr::compute()
#> Table (query)
#> vs: double
#> am: double
#> wt: double
#>
#> * Grouped by vs
#> See $.data for the source Arrow object

I noticed while creating this PR that using as_arrow_table instead of compute makes it a Table, but I think it would be difficult for users to notice that unless they read the documentation thoroughly.
I also believe that compute should always return Table for consistency with the behavior of dbplyr and dtplyr.

@paleolimbot
Copy link
Member

Agreed! I personally think it would be safer not to save the grouping variables with the table; however, I can also see the argument for consistency since there are other places where this information gets saved.

@eitsupi
Copy link
Contributor Author

eitsupi commented Oct 5, 2022

I believe saving groups is necessary to do so at this stage because it is counter-intuitive to have different results for data |> compute() |> collect() and data |> collect().

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You convinced me! I will save my grievances with metadata restoration for another time.

In that case, I think the change that is needed is a test_that("collect() is identical to compute() %>% collect()", {}) with grouped and ungrouped Tables and Datasets. Does that sound reasonable?

I am not personally worried about the printing of groups but am happy to defer to Neal's earlier review...automatic metadata restoration does lots of other things we could warn users about too and it's reasonable that it's best done together.

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
@eitsupi
Copy link
Contributor Author

eitsupi commented Oct 6, 2022

In that case, I think the change that is needed is a test_that("collect() is identical to compute() %>% collect()", {}) with grouped and ungrouped Tables and Datasets. Does that sound reasonable?

Thank you for your suggestion. I have added tests.
Is this also necessary for datasets?

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great from my end! Thanks!

@paleolimbot paleolimbot merged commit 12667cd into apache:master Oct 7, 2022
@eitsupi eitsupi deleted the r-fix-compute branch October 7, 2022 20:56
@eitsupi
Copy link
Contributor Author

eitsupi commented Oct 7, 2022

Thanks for merging!

@ursabot
Copy link

ursabot commented Oct 7, 2022

Benchmark runs are scheduled for baseline = 45a008d and contender = 12667cd. 12667cd is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Failed ⬇️1.1% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.36% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 12667cd0 ec2-t3-xlarge-us-east-2
[Failed] 12667cd0 test-mac-arm
[Failed] 12667cd0 ursa-i9-9960x
[Finished] 12667cd0 ursa-thinkcentre-m75q
[Finished] 45a008d7 ec2-t3-xlarge-us-east-2
[Failed] 45a008d7 test-mac-arm
[Failed] 45a008d7 ursa-i9-9960x
[Finished] 45a008d7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

amoeba added a commit to amoeba/arrow that referenced this pull request Oct 8, 2022
No longer needed thanks to apache#14160 and @eitsupi's work
amoeba added a commit to amoeba/arrow that referenced this pull request Oct 11, 2022
No longer needed thanks to apache#14160 and @eitsupi's work
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
…yr_query to arrow Table (apache#14160)

Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants