[R] Behavior something like `group_by(foo) |> across(everything())` is different from dplyr #35445

eitsupi · 2023-05-05T09:12:08Z

Describe the bug, including details regarding any error messages, version, and platform.

In dplyr, I believe that using across(everything()) on a grouped data frame will not select the column used for grouping.

mtcars |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum))
#> # A tibble: 3 × 11
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  293. 1156.   909  44.8  25.1  211.    10     8    45    17
#> 2     6  138. 1283.   856  25.1  21.8  126.     4     3    27    24
#> 3     8  211. 4943.  2929  45.2  56.0  235.     0     2    46    49

Created on 2023-05-05 with reprex v2.0.2
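As a rough check of the dplyr behavior described above, the sketch below compares the summarised columns with the non-grouping columns of `mtcars`. It assumes only that `dplyr` is installed; the variable names (`grouped`, `expected_cols`, `result`) are illustrative and not part of the original report.

```r
library(dplyr)

grouped <- group_by(mtcars, cyl)

# Columns that across(everything()) should operate on:
# every column except the grouping column.
expected_cols <- setdiff(names(mtcars), group_vars(grouped))

result <- summarise(grouped, across(everything(), sum))

# The grouping column should appear once, as the key,
# followed by the summarised columns in their original order.
identical(names(result), c(group_vars(grouped), expected_cols))
```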

However, arrow does not seem to exclude the columns used for grouping. The following example results in an error.
(I installed arrow 12.0.0.20230503 from R-universe)

mtcars |>
  arrow::as_arrow_table() |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum)) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Multiple matches for FieldRef.Name(cyl) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: double
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::collect(...)
#>  2. └─arrow:::collect.arrow_dplyr_query(...)
#>  3.   └─arrow:::compute.arrow_dplyr_query(x)
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
#>  9.               └─rlang::abort(msg, call = call)

Created on 2023-05-05 with reprex v2.0.2

Component(s)

R


thisisnic · 2023-05-05T16:19:30Z

Thanks for reporting this @eitsupi; can confirm this is reproducible and is a bug we should fix.

…ing()) is different from dplyr (#35473)

### Rationale for this change

The argument `.cols` of the `dplyr::across` function has the following description.

> You can't select grouping columns because they are already automatically handled by the verb (i.e. summarise() or mutate()).

However, this behavior is currently not reproduced in the `arrow` package and an error occurs when selecting the column used for grouping with `everything()`.

``` r
mtcars |>
  arrow::as_arrow_table() |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum)) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Multiple matches for FieldRef.Name(cyl) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: double
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::collect(...)
#>  2. └─arrow:::collect.arrow_dplyr_query(...)
#>  3.   └─arrow:::compute.arrow_dplyr_query(x)
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
#>  9.               └─rlang::abort(msg, call = call)
```

<sup>Created on 2023-05-05 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

This PR fixes this behavior to match with dplyr's original behavior.

### What changes are included in this PR?

- Auto exclude grouping columns in `across` in `mutate`, `transmute`, and `summarise`.
- The `.data` argument of internal function `expand_across` should be `arrow_dplyr_query`. Some tests have been slightly modified to accommodate this change.
- `mutate`, `transmute`, `arrange`, `filter` always return `arrow_dplyr_query`. Currently, `arrow_dplyr_query` is not returned in the following cases, which was not consistent.
  ```r
  mtcars |>
    arrow::arrow_table() |>
    dplyr::mutate()
  ```
- Correct the order of columns in results of `group_by(foo) |> mutate(.keep = "none")`
  Currently, the results of the following query show that the columns used for grouping have moved to the tail and differ from the behavior of dplyr.
  ```r
  mtcars |>
    arrow::arrow_table() |>
    dplyr::group_by(cyl) |>
    dplyr::mutate(am, .keep = "none") |>
    dplyr::collect()
  ```
- Correct the order of columns in results of `group_by(foo) |> transmute()`
  Currently, the results of the following query show that the columns used for grouping have moved to the tail and differ from the behavior of dplyr.
  ```r
  mtcars |>
    arrow::arrow_table() |>
    dplyr::group_by(cyl) |>
    dplyr::transmute(mpg) |>
    dplyr::collect()
  ```
  After `transmute`, the group columns should move to the left. (This is a different behavior from `mutate(.keep = "none")`, which keeps the original position.)

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* Closes: #35445

Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
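The column-order difference between `transmute()` and `mutate(.keep = "none")` mentioned in the commit message can be inspected in plain dplyr with a sketch like the one below. It assumes dplyr 1.1.0 or later, where the two verbs are understood to differ in ordering. The variable names (`grouped`, `transmute_cols`, `mutate_cols`) are illustrative only.

```r
library(dplyr)

grouped <- group_by(mtcars, cyl)

# transmute(): the grouping column is expected to be moved to the left.
transmute_cols <- names(transmute(grouped, mpg))

# mutate(.keep = "none"): retained columns are expected to stay in their
# original positions (assumption: dplyr >= 1.1.0).
mutate_cols <- names(mutate(grouped, mpg, .keep = "none"))

print(transmute_cols)
print(mutate_cols)
```

`mpg` is used for both calls because it sits to the left of `cyl` in `mtcars`, so the two orderings can be told apart. With `am`, which sits to the right of `cyl`, both verbs would place `cyl` first.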

eitsupi added the Type: bug label May 5, 2023

github-actions bot added the Component: R label May 5, 2023

thisisnic added the Priority: Critical label May 5, 2023

github-actions bot mentioned this issue May 8, 2023

GH-35445: [R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr #35473

Merged

github-actions bot assigned eitsupi May 8, 2023

thisisnic closed this as completed in #35473 May 18, 2023

thisisnic added this to the 13.0.0 milestone May 18, 2023
