ARROW-17737: [R] Groups before conversion to a Table must not be restored after `collect()` #14175

eitsupi · 2022-09-20T14:24:23Z

If a grouped data.frame is converted to an arrow dplyr query and then back to a data.frame, the groups from the original data.frame are restored, even if the query was ungrouped.

mtcars |>
  dplyr::group_by(cyl) |>
  arrow::arrow_table() |>
  dplyr::group_by() |>
  dplyr::ungroup() |>
  dplyr::collect() |>
  dplyr::group_vars()
#> [1] "cyl"

This PR ensures that the groups of the arrow dplyr query are the ones applied on `compute()` or `collect()`.
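The intended behavior can be sketched as follows. This is a minimal illustration, assuming the arrow and dplyr packages are installed and that the arrow version in use includes this change; the variable names `collected` and `regrouped` are only for illustration.

```r
# Assumption: arrow and dplyr are installed, and arrow includes this fix.

# Case 1: grouped on the data.frame side, ungrouped in the arrow query.
collected <- mtcars |>
  dplyr::group_by(cyl) |>
  arrow::arrow_table() |>
  dplyr::ungroup() |>
  dplyr::collect()

# Intended result: no grouping variables come back from the original data.frame.
dplyr::group_vars(collected)

# Case 2: regrouped inside the arrow query.
regrouped <- mtcars |>
  dplyr::group_by(cyl) |>
  arrow::arrow_table() |>
  dplyr::group_by(gear) |>
  dplyr::collect()

# Intended result: only the query's own grouping ("gear") survives collect().
dplyr::group_vars(regrouped)
```

In both cases the grouping state of the query, not the metadata carried over from the original data.frame, should decide what `collect()` returns.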

github-actions · 2022-09-20T14:39:51Z

https://issues.apache.org/jira/browse/ARROW-17737

eitsupi · 2022-10-07T22:42:03Z

r/R/dplyr.R

@@ -182,7 +182,7 @@ dim.arrow_dplyr_query <- function(x) {
    # Query on in-memory Table, so evaluate the filter
    # Don't need any columns
    x <- select.arrow_dplyr_query(x, NULL)
-    rows <- nrow(compute.arrow_dplyr_query(x))
+    rows <- nrow(as_arrow_table(x))


This change is needed because manipulating the metadata of a table with no columns resets its size to 0 x 0.

mtcars |> arrow::arrow_table() |> dplyr::select(NULL) |> arrow::as_arrow_table()
#> Table
#> 32 rows x 0 columns
#>
#>
#> See $metadata for additional Schema metadata

mtcars |> arrow::arrow_table() |> dplyr::select(NULL) |> arrow::as_arrow_table() |> dplyr::ungroup()
#> Table
#> 0 rows x 0 columns
#>
#>
#> See $metadata for additional Schema metadata

Created on 2022-10-07 with reprex v2.0.2

I don't know whether this handling of tables with no columns is a problem.
A table with rows but 0 columns appears to be quite exceptional, since creating a table from a data frame with no columns results in 0 x 0.

mtcars |> dplyr::select(NULL) |> arrow::arrow_table()
#> Table
#> 0 rows x 0 columns
#>
#>
#> See $metadata for additional Schema metadata

Created on 2022-10-07 with reprex v2.0.2
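For comparison, a plain data.frame keeps its row count when every column is dropped, which is why the 0 x 0 result above stands out. A minimal base R sketch (the name `no_cols` is only for illustration):

```r
# Base R only: selecting zero columns keeps all 32 rows of mtcars.
no_cols <- mtcars[, 0]
dim(no_cols)
#> [1] 32  0
```

So the row count is lost at the conversion to an Arrow Table, not on the data.frame side.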

nealrichardson

One suggestion for simplifying this change. Thanks for taking this on, will be nice to get this in the upcoming release along with your other changes around here.

ursabot · 2022-10-16T04:12:30Z

Benchmark runs are scheduled for baseline = d1a8f4b and contender = d008c17. d008c17 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.11% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.18% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] d008c17e ec2-t3-xlarge-us-east-2
[Failed] d008c17e test-mac-arm
[Failed] d008c17e ursa-i9-9960x
[Finished] d008c17e ursa-thinkcentre-m75q
[Finished] d1a8f4ba ec2-t3-xlarge-us-east-2
[Failed] d1a8f4ba test-mac-arm
[Failed] d1a8f4ba ursa-i9-9960x
[Finished] d1a8f4ba ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot · 2022-10-16T04:12:48Z

['Python', 'R'] benchmarks have high level of regressions.
test-mac-arm

eitsupi force-pushed the r-group-convert branch from 56d8238 to a97b53e Compare September 20, 2022 14:35

github-actions bot added the Component: R label Sep 20, 2022

eitsupi commented Sep 20, 2022

r/R/dplyr-collect.R (outdated, resolved)

eitsupi force-pushed the r-group-convert branch from b9fb2e3 to 42a95d0 Compare September 23, 2022 11:11

eitsupi force-pushed the r-group-convert branch 2 times, most recently from 379c3b5 to 1a41638 Compare October 7, 2022 22:17

eitsupi marked this pull request as ready for review October 7, 2022 22:23

eitsupi mentioned this pull request Oct 7, 2022

Needs a maintainer to approve running workflows every time? #14349

Closed

eitsupi commented Oct 7, 2022

eitsupi changed the title ~~ARROW-17737: [R] Continue to retain grouping metadata even if ungroup arrow dplyr query~~ ARROW-17737: [R] Groups before conversion to a Table must not be restored after collect() Oct 7, 2022

nealrichardson requested changes Oct 10, 2022

r/R/dplyr-collect.R (outdated, resolved)

nealrichardson reviewed Oct 12, 2022

r/R/dplyr-collect.R (outdated, resolved)

r/R/dplyr.R (outdated, resolved)

r/R/dplyr.R (outdated, resolved)

nealrichardson force-pushed the r-group-convert branch from 225975e to 943d603 Compare October 13, 2022 13:41

eitsupi added 16 commits October 13, 2022 21:34

add tests for compute

6659080

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

rename file to match name of file to be tested

55d803d

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

fix tests

cd8fa1d

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

remove unused line

9e1c73b

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

Revert "rename file to match name of file to be tested"

71c170f

This reverts commit 83eafbe.

move test to the other file and rename the test case

fd4b15e

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

add tests for compute and collect

2fab634

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

more test

7830d2c

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

use NULL for empty group vars, and remove group vars metadata from .data

dc19476

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

restore group vars from arrow dplyr query to Table

cae1f26

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

arrow_dplyr_query$group_by_vars should character, and metadata$r$attributes$.group_vars should not character(0)

73edf12

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

fix

f6d2a0d

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

use as_arrow_table instead of collect

da80d76

because of Table with 0 columns handling

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

fix typo

3012a54

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

add tests

d0c545f

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

add tests

ce5d581

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

eitsupi added 6 commits October 13, 2022 21:34

ensure not to convert to a grouped_df

e117e8c

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

change to manipulate groups only when necessary

36674d1

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

separate the test case

8b5ef46

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

fix rebasing

8a1f7cd

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

ensure ungroup .data if it is a Table

72a5ab6

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

more simplify

a706268

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>

nealrichardson force-pushed the r-group-convert branch from 943d603 to a706268 Compare October 14, 2022 01:34

nealrichardson approved these changes Oct 14, 2022

nealrichardson merged commit d008c17 into apache:master Oct 14, 2022

eitsupi deleted the r-group-convert branch October 14, 2022 13:13

asfimport mentioned this pull request Oct 16, 2022

[R] Groups before conversion to a Table must not be restored after collect() #32971

Closed

eitsupi mentioned this pull request Aug 3, 2023

Manually converting LazyGroupBy to LazyFrame breaks printing pola-rs/r-polars#338

Closed
