GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr #35473

Merged on May 18, 2023 (9 commits)

@eitsupi (Contributor) commented May 8, 2023

Rationale for this change

The argument .cols of the dplyr::across function has the following description.

You can't select grouping columns because they are already automatically handled by the verb (i.e. summarise() or mutate()).

However, the arrow package does not currently reproduce this behavior: selecting the grouping column with everything() raises an error.

mtcars |>
  arrow::as_arrow_table() |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum)) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Multiple matches for FieldRef.Name(cyl) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: double
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::collect(...)
#>  2. └─arrow:::collect.arrow_dplyr_query(...)
#>  3.   └─arrow:::compute.arrow_dplyr_query(x)
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
#>  9.               └─rlang::abort(msg, call = call)

Created on 2023-05-05 with reprex v2.0.2

This PR changes arrow's behavior to match dplyr's.
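
For reference, the dplyr behavior being matched can be reproduced on a plain data frame. This is a minimal sketch that assumes only that dplyr is installed; `res` is a name introduced here for illustration. The grouping column cyl is skipped by across(everything()), so it appears only once, as the group key.

```r
library(dplyr)

# Reference behavior on a plain data frame: across() skips the grouping column.
res <- mtcars |>
  group_by(cyl) |>
  summarise(across(everything(), sum))

# One row per cyl value; cyl appears once (as the key) next to the summed columns.
names(res)
dim(res)
```

After this PR, the same pipeline with arrow::as_arrow_table() and dplyr::collect() is expected to return the same columns instead of the error above.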

What changes are included in this PR?

  • Automatically exclude grouping columns from across() in mutate, transmute, and summarise.
  • The .data argument of the internal function expand_across must now be an arrow_dplyr_query.
    Some tests have been slightly modified to accommodate this change.
  • mutate, transmute, arrange, and filter always return an arrow_dplyr_query.
    Currently, an arrow_dplyr_query is not returned in the following case, which is inconsistent.
    mtcars |> arrow::arrow_table() |> dplyr::mutate()
  • Correct the order of columns in the result of group_by(foo) |> mutate(.keep = "none")
    Currently, the following query moves the grouping columns to the end of the result, which differs from dplyr.
    mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::mutate(am, .keep = "none") |> dplyr::collect()
  • Correct the order of columns in the result of group_by(foo) |> transmute()
    Currently, the following query moves the grouping columns to the end of the result, which differs from dplyr.
    mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::transmute(mpg) |> dplyr::collect()
    After transmute, the grouping columns should move to the left. (This differs from mutate(.keep = "none"), which keeps each column in its original position.)
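
The column-order difference described in the last two items can be checked against dplyr on a plain data frame. This is a sketch that assumes dplyr 1.1.0 or later, where mutate(.keep = "none") keeps the original column positions; `kept` and `transmuted` are illustrative names. It uses mpg rather than am because mpg comes before cyl in mtcars, which makes the two orderings visibly different.

```r
library(dplyr)

# mpg precedes cyl in mtcars, so the two verbs order the result differently.
kept <- mtcars |>
  group_by(cyl) |>
  mutate(mpg, .keep = "none")

transmuted <- mtcars |>
  group_by(cyl) |>
  transmute(mpg)

names(kept)        # columns stay in their original positions
names(transmuted)  # the grouping column is moved to the left
```

With this PR, the equivalent queries on an Arrow table are expected to produce the same orderings after dplyr::collect().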

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

github-actions bot commented May 8, 2023

⚠️ GitHub issue #35445 has been automatically assigned in GitHub to PR creator.

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
@eitsupi changed the title from "GH-35445 : [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr" to "GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr" on May 8, 2023
@thisisnic (Member) left a comment

Thanks for catching this! Would you mind updating the tests to use the example_data dataset, for consistency with other tests?

@github-actions bot added the "awaiting changes" label and removed the "awaiting review" label on May 9, 2023
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
@github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on May 10, 2023
@eitsupi marked this pull request as draft on May 10, 2023 15:10
@eitsupi

This comment was marked as resolved.

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
@eitsupi marked this pull request as ready for review on May 10, 2023 15:41
@eitsupi

This comment was marked as off-topic.

Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
@thisisnic (Member) left a comment

One tiny change to the tests, otherwise this is good to go. Thanks!

r/tests/testthat/test-dplyr-mutate.R (outdated, resolved)
@github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label on May 13, 2023
Signed-off-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
@thisisnic merged commit 6bd0050 into apache:main on May 18, 2023
@eitsupi deleted the r-group-vars-across branch on May 18, 2023 13:28
ursabot commented May 20, 2023

Benchmark runs are scheduled for baseline = 3e4eaa9 and contender = 6bd0050. 6bd0050 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.42% ⬆️0.0%] test-mac-arm
[Finished ⬇️1.31% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.57% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 6bd00508 ec2-t3-xlarge-us-east-2
[Finished] 6bd00508 test-mac-arm
[Finished] 6bd00508 ursa-i9-9960x
[Finished] 6bd00508 ursa-thinkcentre-m75q
[Finished] 3e4eaa91 ec2-t3-xlarge-us-east-2
[Finished] 3e4eaa91 test-mac-arm
[Finished] 3e4eaa91 ursa-i9-9960x
[Finished] 3e4eaa91 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented May 20, 2023

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

Successfully merging this pull request may close these issues.

[R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr
3 participants