Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R] arrow returns wrong variable content when multiple group_by/summarise statements are used #14872

Closed
DavZim opened this issue Dec 7, 2022 · 2 comments · Fixed by #14905
Closed
Assignees
Milestone

Comments

@DavZim
Copy link

DavZim commented Dec 7, 2022

Describe the bug, including details regarding any error messages, version, and platform.

When collecting a query with multiple group_by + summarise statements, one variable gets wrongly assigned values from another variable. When an ungroup is inserted, everything works fine again.

To reproduce, consider the following:
In the examples below, the variable gender should be F, or M and not Group X.
When the ungroup() is inserted (second part), gender is again F/M and not Group X.

library(dplyr)
library(arrow)

# Create sample dataset
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)
write_dataset(orig_data, "example")

# Query and replicate the error
(ds <- open_dataset("example/"))
#> FileSystemDataset with 1 Parquet file
#> code_group: string
#> year: int32
#> gender: string
#> value: double

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 2 × 4
#> # Groups:   code_group [2]
#>   code_group gender  value    NN
#>   <chr>      <chr>   <dbl> <int>
#> 1 Group 1    Group 1  724.     4
#> 2 Group 2    Group 2  661.     4

ERROR the gender variable is replaced by the values of the group variable

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  ungroup() |>                                             #< Added this line...
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 4 × 4
#> # Groups:   code_group [2]
#>   code_group gender value    NN
#>   <chr>      <chr>  <dbl> <int>
#> 1 Group 1    F       724.     2
#> 2 Group 2    M       627.     2
#> 3 Group 1    M       658.     2
#> 4 Group 2    F       661.     2

Note now after inserting the ungroup() between the group-by - summarise calls, gender is not replaced

Quick look at the query (note Node 4 where "gender": code_group)

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |> 
  show_query()
#> ExecPlan with 8 nodes:
#> 7:SinkNode{}
#>   6:ProjectNode{projection=[code_group, gender, value, NN]}
#>     5:GroupByNode{keys=["code_group", "gender"], aggregates=[
#>      hash_max(value, {skip_nulls=false, min_count=0}),
#>      hash_sum(NN, {skip_nulls=true, min_count=1}),
#>     ]}
#>       4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": code_group]}       #< gender is wrongfully mapped to code_group! 
#>         3:ProjectNode{projection=[year, code_group, gender, value]}
#>           2:GroupByNode{keys=["year", "code_group", "gender"], aggregates=[
#>              hash_sum(value, {skip_nulls=false, min_count=0}),
#>           ]}
#>             1:ProjectNode{projection=[value, year, code_group, gender]}
#>               0:SourceNode{}

Note that this was also asked here on SO

Component(s)

R

@eitsupi
Copy link
Contributor

eitsupi commented Dec 8, 2022

A simpler example.

mtcars |>
  arrow::arrow_table() |>
  dplyr::group_by(mpg, cyl) |>
  dplyr::summarise(value = "foo") |>
  dplyr::group_by(mpg, value) |>
  dplyr::collect()
#> # A tibble: 27 × 3
#> # Groups:   mpg, value [25]
#>      mpg   cyl value
#>    <dbl> <dbl> <dbl>
#>  1  21       6  21
#>  2  22.8     4  22.8
#>  3  21.4     6  21.4
#>  4  18.7     8  18.7
#>  5  18.1     6  18.1
#>  6  14.3     8  14.3
#>  7  19.2     6  19.2
#>  8  17.3     8  17.3
#>  9  14.7     8  14.7
#> 10  33.9     4  33.9
#> # … with 17 more rows

Created on 2022-12-08 with reprex v2.0.2

By the way, I don't think this is a summarise problem, because the same thing happens if we use mutate instead of summarise.

mtcars |>
  arrow::arrow_table() |>
  dplyr::group_by(mpg, cyl) |>
  dplyr::mutate(value = "foo") |>
  dplyr::group_by(mpg, value) |>
  dplyr::collect()
#> # A tibble: 32 × 12
#> # Groups:   mpg, value [25]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb value
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4  21
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4  21
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1  22.8
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1  21.4
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2  18.7
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1  18.1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4  14.3
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2  24.4
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2  22.8
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4  19.2
#> # … with 22 more rows

Created on 2022-12-08 with reprex v2.0.2

Edit: There seems to be a problem in the process of adding group(s).

mtcars |>
  arrow::arrow_table() |>
  dplyr::group_by(mpg) |>
  dplyr::group_by(mpg, cyl, disp) |>
  dplyr::collect()
#> # A tibble: 32 × 11
#> # Groups:   mpg, cyl, disp [25]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21    21    21     110  3.9   2.62  16.5     0     1     4     4
#>  2  21    21    21     110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8  22.8  22.8    93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4  21.4  21.4   110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7  18.7  18.7   175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1  18.1  18.1   105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3  14.3  14.3   245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4  24.4  24.4    62  3.69  3.19  20       1     0     4     2
#>  9  22.8  22.8  22.8    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2  19.2  19.2   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

Created on 2022-12-08 with reprex v2.0.2

@paleolimbot
Copy link
Member

Thanks @DavZim and @eitsupi for the awesome reprexes!

The problem is there some code that silently assumed that there were zero existing groups. The relevant chunk is here, I think:

# set up group names and check which are new
gbp <- dplyr::group_by_prepare(.data, !!!expression_list, .add = .add)
existing_groups <- dplyr::group_vars(gbp$data)
new_group_names <- setdiff(gbp$group_names, existing_groups)
names(new_groups) <- new_group_names

Here, new_groups has a different length than new_group_names which is causing the weirdness.

paleolimbot added a commit that referenced this issue Dec 13, 2022
…p_by/summarise statements are used (#14905)

Reprex using CRAN arrow:

``` r
library(arrow, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()
#> # A tibble: 32 × 4
#> # Groups:   cyl, value [3]
#>      mpg   cyl value `"foo"`
#>    <dbl> <dbl> <dbl> <chr>  
#>  1  21       6     6 foo    
#>  2  21       6     6 foo    
#>  3  22.8     4     4 foo    
#>  4  21.4     6     6 foo    
#>  5  18.7     8     8 foo    
#>  6  18.1     6     6 foo    
#>  7  14.3     8     8 foo    
#>  8  24.4     4     4 foo    
#>  9  22.8     4     4 foo    
#> 10  19.2     6     6 foo    
#> # … with 22 more rows
```

<sup>Created on 2022-12-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>

After this PR:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)

mtcars |>
  arrow_table() |>
  select(mpg, cyl) |> 
  group_by(mpg, cyl) |>
  group_by(cyl, value = "foo") |>
  collect()
#> # A tibble: 32 × 3
#> # Groups:   cyl, value [3]
#>      mpg   cyl value
#>    <dbl> <dbl> <chr>
#>  1  21       6 foo  
#>  2  21       6 foo  
#>  3  22.8     4 foo  
#>  4  21.4     6 foo  
#>  5  18.7     8 foo  
#>  6  18.1     6 foo  
#>  7  14.3     8 foo  
#>  8  24.4     4 foo  
#>  9  22.8     4 foo  
#> 10  19.2     6 foo  
#> # … with 22 more rows
```

<sup>Created on 2022-12-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
* Closes: #14872

Authored-by: Dewey Dunnington <dewey@fishandwhistle.net>
Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>
@paleolimbot paleolimbot added this to the 11.0.0 milestone Dec 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants