[R] Behavior something like group_by(foo) |> across(everything()) is different from dplyr #35445

Closed
eitsupi opened this issue May 5, 2023 · 1 comment · Fixed by #35473

Comments

@eitsupi
Contributor

eitsupi commented May 5, 2023

Describe the bug, including details regarding any error messages, version, and platform.

In dplyr, I believe that using across(everything()) on a grouped data frame will not select the column used for grouping.

mtcars |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum))
#> # A tibble: 3 × 11
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  293. 1156.   909  44.8  25.1  211.    10     8    45    17
#> 2     6  138. 1283.   856  25.1  21.8  126.     4     3    27    24
#> 3     8  211. 4943.  2929  45.2  56.0  235.     0     2    46    49

Created on 2023-05-05 with reprex v2.0.2

However, arrow does not seem to exclude the columns used for grouping. The following example results in an error.
(I installed arrow 12.0.0.20230503 from R-universe)

mtcars |>
  arrow::as_arrow_table() |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum)) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Multiple matches for FieldRef.Name(cyl) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: double
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::collect(...)
#>  2. └─arrow:::collect.arrow_dplyr_query(...)
#>  3.   └─arrow:::compute.arrow_dplyr_query(x)
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
#>  9.               └─rlang::abort(msg, call = call)

Created on 2023-05-05 with reprex v2.0.2

Component(s)

R

@thisisnic
Member

Thanks for reporting this @eitsupi; can confirm this is reproducible and is a bug we should fix.

thisisnic pushed a commit that referenced this issue May 18, 2023
…ing()) is different from dplyr (#35473)

### Rationale for this change

The argument `.cols` of the `dplyr::across` function has the following description.

> You can't select grouping columns because they are already automatically handled by the verb (i.e. summarise() or mutate()).

However, the `arrow` package does not currently reproduce this behavior: selecting the grouping column with `everything()` raises an error.

``` r
mtcars |>
  arrow::as_arrow_table() |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum)) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Multiple matches for FieldRef.Name(cyl) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: double
#> Backtrace:
#>     ▆
#>  1. ├─dplyr::collect(...)
#>  2. └─arrow:::collect.arrow_dplyr_query(...)
#>  3.   └─arrow:::compute.arrow_dplyr_query(x)
#>  4.     └─base::tryCatch(...)
#>  5.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  6.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  7.           └─value[[3L]](cond)
#>  8.             └─arrow:::augment_io_error_msg(e, call, schema = schema())
#>  9.               └─rlang::abort(msg, call = call)
```

<sup>Created on 2023-05-05 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

This PR changes `arrow` so that it matches dplyr's behavior.
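
As a rough way to check the intended outcome, the sketch below runs the query from the report on both a plain data frame and an Arrow Table and compares the results. It assumes an `arrow` build that already includes this change; the variable names `expected`, `actual`, and `same` are only illustrative, and this is not the test added in the PR.

```r
library(dplyr)
library(arrow)

# Reference result from plain dplyr: `cyl` is the grouping key and is
# not passed to `sum` by across(everything()).
expected <- mtcars |>
  group_by(cyl) |>
  summarise(across(everything(), sum))

# Same query on an Arrow Table (assumes an arrow build with this fix).
actual <- mtcars |>
  arrow_table() |>
  group_by(cyl) |>
  summarise(across(everything(), sum)) |>
  collect() |>
  arrange(cyl) # row order of Arrow aggregation results is not guaranteed

# all.equal() tolerates small floating-point differences in the sums.
same <- all.equal(as.data.frame(actual), as.data.frame(expected))
print(same)
```

On an `arrow` version without the fix, the second pipeline is expected to fail with the `Multiple matches for FieldRef.Name(cyl)` error shown above instead of reaching the comparison.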

### What changes are included in this PR?

- Automatically exclude grouping columns from `across` in `mutate`, `transmute`, and `summarise`.
- The `.data` argument of the internal function `expand_across` must now be an `arrow_dplyr_query`.
  Some tests have been slightly modified to accommodate this change.
- `mutate`, `transmute`, `arrange`, and `filter` now always return an `arrow_dplyr_query`.
  Currently, an `arrow_dplyr_query` is not returned in cases such as the following, which is inconsistent.
  ```r
  mtcars |> arrow::arrow_table() |> dplyr::mutate()
  ```
- Correct the column order in the results of `group_by(foo) |> mutate(.keep = "none")`.
  Currently, the following query moves the grouping columns to the end of the result, which differs from dplyr.
  ```r
  mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::mutate(am, .keep = "none") |> dplyr::collect()
  ```
- Correct the column order in the results of `group_by(foo) |> transmute()`.
  Currently, the following query moves the grouping columns to the end of the result, which differs from dplyr.
  ```r
  mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::transmute(mpg) |> dplyr::collect()
  ```
  After `transmute`, the grouping columns should move to the left. (This differs from `mutate(.keep = "none")`, which keeps columns in their original positions.)
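
The difference in column order between `transmute()` and `mutate(.keep = "none")` is easiest to see in plain dplyr, which is the reference behavior here. The following is a minimal sketch that assumes a recent dplyr release (1.1.0 or later is assumed) and uses `mpg`, which sits to the left of `cyl` in `mtcars`, so that the two verbs give visibly different orders. The variable names are illustrative only.

```r
library(dplyr)

grouped <- mtcars |> group_by(cyl)

# transmute(): grouping columns are placed first, then the requested columns.
transmuted_names <- grouped |> transmute(mpg) |> names()

# mutate(.keep = "none"): the retained columns keep their original positions,
# so `mpg` stays ahead of `cyl`.
kept_names <- grouped |> mutate(mpg, .keep = "none") |> names()

print(transmuted_names)
print(kept_names)
```

Replacing `mtcars` with `arrow::arrow_table(mtcars)` and adding `dplyr::collect()` before `names()` should give the same two orders once this change is in place.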

### Are these changes tested?

Yes.

### Are there any user-facing changes?

Yes.

* Closes: #35445

Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
@thisisnic thisisnic added this to the 13.0.0 milestone May 18, 2023