…ing()) is different from dplyr (#35473)
### Rationale for this change
The argument `.cols` of the `dplyr::across` function has the following description.
> You can't select grouping columns because they are already automatically handled by the verb (i.e. summarise() or mutate()).
However, the `arrow` package does not currently reproduce this behavior, and an error occurs when `everything()` selects the column used for grouping.
``` r
mtcars |>
  arrow::as_arrow_table() |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(dplyr::across(everything(), sum)) |>
  dplyr::collect()
#> Error in `compute.arrow_dplyr_query()`:
#> ! Invalid: Multiple matches for FieldRef.Name(cyl) in mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> cyl: double
#> Backtrace:
#> ▆
#> 1. ├─dplyr::collect(...)
#> 2. └─arrow:::collect.arrow_dplyr_query(...)
#> 3. └─arrow:::compute.arrow_dplyr_query(x)
#> 4. └─base::tryCatch(...)
#> 5. └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#> 6. └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#> 7. └─value[[3L]](cond)
#> 8. └─arrow:::augment_io_error_msg(e, call, schema = schema())
#> 9. └─rlang::abort(msg, call = call)
```
<sup>Created on 2023-05-05 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
This PR changes `arrow` to match dplyr's behavior.
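For comparison, here is a minimal sketch of the dplyr behavior being targeted, run on a plain data frame with no arrow involved. The variable name `by_cyl` is only illustrative.

```r
library(dplyr)

# dplyr reference behavior on a plain data frame:
# the grouping column `cyl` is not selected by everything() inside across(),
# so it is not summed and appears once, as the group key.
by_cyl <- mtcars |>
  group_by(cyl) |>
  summarise(across(everything(), sum))

# One row per group; `cyl` comes first, followed by the remaining columns.
names(by_cyl)
nrow(by_cyl)
```

With the fix, the arrow query shown above should return a result of the same shape instead of raising the `Multiple matches for FieldRef.Name(cyl)` error.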
### What changes are included in this PR?
- Auto exclude grouping columns in `across` in `mutate`, `transmute`, and `summarise`.
- The `.data` argument of the internal function `expand_across` should be an `arrow_dplyr_query`.
Some tests have been slightly modified to accommodate this change.
- `mutate`, `transmute`, `arrange`, and `filter` always return an `arrow_dplyr_query`.
Currently, an `arrow_dplyr_query` is not returned in cases such as the following, which is inconsistent.
```r
mtcars |> arrow::arrow_table() |> dplyr::mutate()
```
- Correct the order of columns in the results of `group_by(foo) |> mutate(.keep = "none")`
Currently, in the result of the following query, the grouping columns are moved to the end, which differs from dplyr's behavior.
```r
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::mutate(am, .keep = "none") |> dplyr::collect()
```
- Correct the order of columns in the results of `group_by(foo) |> transmute()`
Currently, in the result of the following query, the grouping columns are moved to the end, which differs from dplyr's behavior.
```r
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::transmute(mpg) |> dplyr::collect()
```
After `transmute`, the group columns should move to the left. (This is a different behavior from `mutate(.keep = "none")`, which keeps the original position.)
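The difference in column order between the two verbs can be checked in dplyr itself on a plain data frame. This is a minimal sketch; it assumes dplyr 1.1.0 or later, where `mutate(.keep = "none")` no longer reorders existing columns, and the variable names are illustrative.

```r
library(dplyr)

grouped <- group_by(mtcars, cyl)

# transmute(): grouping columns come first, then the supplied expressions.
cols_transmute <- names(transmute(grouped, mpg))

# mutate(.keep = "none"): retained columns keep their original positions
# (assumption: dplyr >= 1.1.0).
cols_mutate <- names(mutate(grouped, mpg, .keep = "none"))

cols_transmute
cols_mutate
```

Since `mpg` precedes `cyl` in `mtcars`, the two results should contain the same columns in a different order, which is the behavior the arrow queries above are meant to reproduce.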
### Are these changes tested?
Yes.
### Are there any user-facing changes?
Yes.
* Closes: #35445
Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
### Describe the bug, including details regarding any error messages, version, and platform.
In dplyr, I believe that using `across(everything())` on a grouped data frame will not select the column used for grouping.
However, arrow does not seem to exclude the columns used for grouping. The following example results in an error.
(I installed arrow 12.0.0.20230503 from R-universe)
### Component(s)
R