ARROW-17737: [R] Groups before conversion to a Table must not be restored after collect() #14175

Merged 22 commits on Oct 14, 2022

Conversation

eitsupi
Contributor

@eitsupi eitsupi commented Sep 20, 2022

If a grouped data.frame is converted to an arrow dplyr query and then back to a data.frame, the groups from the original data.frame are restored, even if the query was ungrouped.

mtcars |>
  dplyr::group_by(cyl) |>
  arrow::arrow_table() |>
  dplyr::group_by() |>
  dplyr::ungroup() |>
  dplyr::collect() |>
  dplyr::group_vars()
#> [1] "cyl"

This PR ensures that the arrow dplyr query's own groups are the ones applied on compute() or collect().
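A minimal sketch of the intended behavior, assuming an arrow build that includes this change and that dplyr is installed. The names `ungrouped` and `regrouped` are illustrative only and do not come from the PR.

```r
library(dplyr, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)

# Group the data.frame first, then drop the groups inside the arrow query.
# With this change, the data.frame-era grouping on cyl should not come back.
ungrouped <- mtcars |>
  group_by(cyl) |>
  arrow_table() |>
  ungroup() |>
  collect()

# Regroup inside the arrow query. The query's grouping (gear) is the one
# that should be applied to the collected result.
regrouped <- mtcars |>
  group_by(cyl) |>
  arrow_table() |>
  group_by(gear) |>
  collect()

group_vars(ungrouped)
group_vars(regrouped)
```

On a build without the fix, the first call is the one expected to still report "cyl".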

r/R/dplyr-collect.R (outdated)
@@ -182,7 +182,7 @@ dim.arrow_dplyr_query <- function(x) {
     # Query on in-memory Table, so evaluate the filter
     # Don't need any columns
     x <- select.arrow_dplyr_query(x, NULL)
-    rows <- nrow(compute.arrow_dplyr_query(x))
+    rows <- nrow(as_arrow_table(x))
Contributor Author

@eitsupi eitsupi Oct 7, 2022

This is because manipulating the metadata of a table with no columns causes its size to be updated to 0 x 0.

mtcars |> arrow::arrow_table() |> dplyr::select(NULL) |> arrow::as_arrow_table()
#> Table
#> 32 rows x 0 columns
#>
#>
#> See $metadata for additional Schema metadata
mtcars |> arrow::arrow_table() |> dplyr::select(NULL) |> arrow::as_arrow_table() |> dplyr::ungroup()
#> Table
#> 0 rows x 0 columns
#>
#>
#> See $metadata for additional Schema metadata

Created on 2022-10-07 with reprex v2.0.2

I don't know whether this handling of tables with no columns is a problem.
A table with rows but 0 columns appears to be quite exceptional, since creating a table from a data frame with no columns results in 0 x 0.

mtcars |> dplyr::select(NULL) |> arrow::arrow_table()
#> Table
#> 0 rows x 0 columns
#>
#>
#> See $metadata for additional Schema metadata

Created on 2022-10-07 with reprex v2.0.2
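For comparison, a small base R sketch (not part of the PR; `no_cols` is an illustrative name) showing that a data frame with no columns still reports its rows. This suggests the row count in the reprex above is lost when converting to a Table, not in the data frame itself.

```r
# A data frame with zero columns still carries its row count via row names.
no_cols <- mtcars[, 0, drop = FALSE]

ncol(no_cols)  # number of columns remaining
nrow(no_cols)  # row count is preserved on the data frame side
```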

@eitsupi eitsupi changed the title from "ARROW-17737: [R] Continue to retain grouping metadata even if ungroup arrow dplyr query" to "ARROW-17737: [R] Groups before conversion to a Table must not be restored after collect()" on Oct 7, 2022
Member

@nealrichardson nealrichardson left a comment

One suggestion for simplifying this change. Thanks for taking this on; it will be nice to get this into the upcoming release along with your other changes around here.

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
…ibutes$.group_vars should not character(0)

because of Table with 0 columns handling

@nealrichardson nealrichardson merged commit d008c17 into apache:master Oct 14, 2022
@eitsupi eitsupi deleted the r-group-convert branch October 14, 2022 13:13
@ursabot

ursabot commented Oct 16, 2022

Benchmark runs are scheduled for baseline = d1a8f4b and contender = d008c17. d008c17 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.11% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] d008c17e ec2-t3-xlarge-us-east-2
[Failed] d008c17e test-mac-arm
[Failed] d008c17e ursa-i9-9960x
[Finished] d008c17e ursa-thinkcentre-m75q
[Finished] d1a8f4ba ec2-t3-xlarge-us-east-2
[Failed] d1a8f4ba test-mac-arm
[Failed] d1a8f4ba ursa-i9-9960x
[Finished] d1a8f4ba ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Oct 16, 2022

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm
