Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] Column order after group_by(foo) |> select(...) is different from dplyr #35534

Closed
eitsupi opened this issue May 10, 2023 · 1 comment · Fixed by #36305
Closed

[R] Column order after group_by(foo) |> select(...) is different from dplyr #35534

eitsupi opened this issue May 10, 2023 · 1 comment · Fixed by #36305
Assignees
Labels
Milestone

Comments

@eitsupi
Copy link
Contributor

eitsupi commented May 10, 2023

Describe the bug, including details regarding any error messages, version, and platform.

Related to #35473

I corrected the column order after mutate and transmute in #35473, but it seems that it needs to be corrected in select as well.

The columns used for groups should to be moved to the left.

mtcars |> dplyr::group_by(cyl) |> dplyr::select(mpg)
#> Adding missing grouping variables: `cyl`
#> # A tibble: 32 × 2
#> # Groups:   cyl [3]
#>      cyl   mpg
#>    <dbl> <dbl>
#>  1     6  21
#>  2     6  21
#>  3     4  22.8
#>  4     6  21.4
#>  5     8  18.7
#>  6     6  18.1
#>  7     8  14.3
#>  8     4  24.4
#>  9     4  22.8
#> 10     6  19.2
#> # … with 22 more rows
mtcars |> arrow::arrow_table() |> dplyr::group_by(cyl) |> dplyr::select(mpg) |> dplyr::collect()
#> # A tibble: 32 × 2
#> # Groups:   cyl [3]
#>      mpg   cyl
#>    <dbl> <dbl>
#>  1  21       6
#>  2  21       6
#>  3  22.8     4
#>  4  21.4     6
#>  5  18.7     8
#>  6  18.1     6
#>  7  14.3     8
#>  8  24.4     4
#>  9  22.8     4
#> 10  19.2     6
#> # … with 22 more rows

Created on 2023-05-10 with reprex v2.0.2

Component(s)

R

@nealrichardson
Copy link
Member

Thanks. This would be in ensure_group_vars(), and it looks like we might also want to mimic the Adding missing grouping variables message to be explicit when this happens.

paleolimbot added a commit that referenced this issue Jun 27, 2023
…nning of the variable list (#36305)

### Rationale for this change

As reported by @ eitsupi, dplyr adds missing grouping variables to the beginning of the variable list; however, we add them to the *end* of the variable list. Our general policy is to match dplyr's behaviour everywhere.

Before this PR:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

tibble::tibble(int = 1:4, chr = letters[1:4]) |> 
  group_by(chr) |> 
  select(int) |> 
  collect()
#> Adding missing grouping variables: `chr`
#> # A tibble: 4 × 2
#> # Groups:   chr [4]
#>   chr     int
#>   <chr> <int>
#> 1 a         1
#> 2 b         2
#> 3 c         3
#> 4 d         4

arrow_table(int = 1:4, chr = letters[1:4]) |> 
  group_by(chr) |> 
  select(int) |> 
  collect()
#> # A tibble: 4 × 2
#> # Groups:   chr [4]
#>     int chr  
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     3 c    
#> 4     4 d
```

After this PR:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

tibble::tibble(int = 1:4, chr = letters[1:4]) |> 
  group_by(chr) |> 
  select(int) |> 
  collect()
#> Adding missing grouping variables: `chr`
#> # A tibble: 4 × 2
#> # Groups:   chr [4]
#>   chr     int
#>   <chr> <int>
#> 1 a         1
#> 2 b         2
#> 3 c         3
#> 4 d         4

arrow_table(int = 1:4, chr = letters[1:4]) |> 
  group_by(chr) |> 
  select(int) |> 
  collect()
#> # A tibble: 4 × 2
#> # Groups:   chr [4]
#>   chr     int
#>   <chr> <int>
#> 1 a         1
#> 2 b         2
#> 3 c         3
#> 4 d         4
```

### Are these changes tested?

Yes, a test was added.

### Are there any user-facing changes?

Yes, column ordering will be different. This could be a breaking change because existing code that refers to columns by index may change; however, referring to a column by name is much more common.
* Closes: #35534

Authored-by: Dewey Dunnington <dewey@voltrondata.com>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
@paleolimbot paleolimbot added this to the 13.0.0 milestone Jun 27, 2023
@raulcd raulcd added the Breaking Change Includes a breaking change to the API label Sep 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants