[R] arrow returns wrong variable content when multiple group_by/summarise statements are used #14872

DavZim · 2022-12-07T15:55:24Z

Describe the bug, including details regarding any error messages, version, and platform.

When collecting a query with multiple group_by + summarise statements, one variable gets wrongly assigned values from another variable. When an ungroup is inserted, everything works fine again.

To reproduce, consider the following:
In the examples below, the variable gender should be F, or M and not Group X.
When the ungroup() is inserted (second part), gender is again F/M and not Group X.

library(dplyr)
library(arrow)

# Create sample dataset
N <- 1000
set.seed(123)
orig_data <- tibble(
  code_group = sample(paste("Group", 1:2), N, replace = TRUE),
  year = sample(2015:2016, N, replace = TRUE),
  gender = sample(c("F", "M"), N, replace = TRUE),
  value = runif(N, 0, 10)
)
write_dataset(orig_data, "example")

# Query and replicate the error
(ds <- open_dataset("example/"))
#> FileSystemDataset with 1 Parquet file
#> code_group: string
#> year: int32
#> gender: string
#> value: double

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 2 × 4
#> # Groups:   code_group [2]
#>   code_group gender  value    NN
#>   <chr>      <chr>   <dbl> <int>
#> 1 Group 1    Group 1  724.     4
#> 2 Group 2    Group 2  661.     4

ERROR the gender variable is replaced by the values of the group variable

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  ungroup() |>                                             #< Added this line...
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |>
  collect()
#> # A tibble: 4 × 4
#> # Groups:   code_group [2]
#>   code_group gender value    NN
#>   <chr>      <chr>  <dbl> <int>
#> 1 Group 1    F       724.     2
#> 2 Group 2    M       627.     2
#> 3 Group 1    M       658.     2
#> 4 Group 2    F       661.     2

Note now after inserting the ungroup() between the group-by - summarise calls, gender is not replaced

Quick look at the query (note Node 4 where "gender": code_group)

ds |>
  group_by(year, code_group, gender) |>
  summarise(value = sum(value)) |>
  group_by(code_group, gender) |>
  summarise(value = max(value), NN = n()) |> 
  show_query()
#> ExecPlan with 8 nodes:
#> 7:SinkNode{}
#>   6:ProjectNode{projection=[code_group, gender, value, NN]}
#>     5:GroupByNode{keys=["code_group", "gender"], aggregates=[
#>      hash_max(value, {skip_nulls=false, min_count=0}),
#>      hash_sum(NN, {skip_nulls=true, min_count=1}),
#>     ]}
#>       4:ProjectNode{projection=[value, "NN": 1, code_group, "gender": code_group]}       #< gender is wrongfully mapped to code_group! 
#>         3:ProjectNode{projection=[year, code_group, gender, value]}
#>           2:GroupByNode{keys=["year", "code_group", "gender"], aggregates=[
#>              hash_sum(value, {skip_nulls=false, min_count=0}),
#>           ]}
#>             1:ProjectNode{projection=[value, year, code_group, gender]}
#>               0:SourceNode{}

Note that this was also asked here on SO

Component(s)

R

The text was updated successfully, but these errors were encountered:

eitsupi · 2022-12-08T12:30:46Z

A simpler example.

mtcars |>
  arrow::arrow_table() |>
  dplyr::group_by(mpg, cyl) |>
  dplyr::summarise(value = "foo") |>
  dplyr::group_by(mpg, value) |>
  dplyr::collect()
#> # A tibble: 27 × 3
#> # Groups:   mpg, value [25]
#>      mpg   cyl value
#>    <dbl> <dbl> <dbl>
#>  1  21       6  21
#>  2  22.8     4  22.8
#>  3  21.4     6  21.4
#>  4  18.7     8  18.7
#>  5  18.1     6  18.1
#>  6  14.3     8  14.3
#>  7  19.2     6  19.2
#>  8  17.3     8  17.3
#>  9  14.7     8  14.7
#> 10  33.9     4  33.9
#> # … with 17 more rows

^{Created on 2022-12-08 with reprex v2.0.2}

By the way, I don't think this is a summarise problem, because the same thing happens if we use mutate instead of summarise.

mtcars |>
  arrow::arrow_table() |>
  dplyr::group_by(mpg, cyl) |>
  dplyr::mutate(value = "foo") |>
  dplyr::group_by(mpg, value) |>
  dplyr::collect()
#> # A tibble: 32 × 12
#> # Groups:   mpg, value [25]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb value
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4  21
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4  21
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1  22.8
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1  21.4
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2  18.7
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1  18.1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4  14.3
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2  24.4
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2  22.8
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4  19.2
#> # … with 22 more rows

^{Created on 2022-12-08 with reprex v2.0.2}

Edit: There seems to be a problem in the process of adding group(s).

mtcars |>
  arrow::arrow_table() |>
  dplyr::group_by(mpg) |>
  dplyr::group_by(mpg, cyl, disp) |>
  dplyr::collect()
#> # A tibble: 32 × 11
#> # Groups:   mpg, cyl, disp [25]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21    21    21     110  3.9   2.62  16.5     0     1     4     4
#>  2  21    21    21     110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8  22.8  22.8    93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4  21.4  21.4   110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7  18.7  18.7   175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1  18.1  18.1   105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3  14.3  14.3   245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4  24.4  24.4    62  3.69  3.19  20       1     0     4     2
#>  9  22.8  22.8  22.8    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2  19.2  19.2   123  3.92  3.44  18.3     1     0     4     4
#> # … with 22 more rows

^{Created on 2022-12-08 with reprex v2.0.2}

paleolimbot · 2022-12-08T20:10:48Z

Thanks @DavZim and @eitsupi for the awesome reprexes!

The problem is there some code that silently assumed that there were zero existing groups. The relevant chunk is here, I think:

arrow/r/R/dplyr-group-by.R

Lines 40 to 45 in 7d85764

    
           # set up group names and check which are new 
        
           gbp <- dplyr::group_by_prepare(.data, !!!expression_list, .add = .add) 
        
           existing_groups <- dplyr::group_vars(gbp$data) 
        
           new_group_names <- setdiff(gbp$group_names, existing_groups) 
        
           names(new_groups) <- new_group_names

Here, new_groups has a different length than new_group_names which is causing the weirdness.

…p_by/summarise statements are used (#14905) Reprex using CRAN arrow: ``` r library(arrow, warn.conflicts = FALSE) library(dplyr, warn.conflicts = FALSE) mtcars |> arrow_table() |> select(mpg, cyl) |> group_by(mpg, cyl) |> group_by(cyl, value = "foo") |> collect() #> # A tibble: 32 × 4 #> # Groups: cyl, value [3] #> mpg cyl value `"foo"` #> <dbl> <dbl> <dbl> <chr> #> 1 21 6 6 foo #> 2 21 6 6 foo #> 3 22.8 4 4 foo #> 4 21.4 6 6 foo #> 5 18.7 8 8 foo #> 6 18.1 6 6 foo #> 7 14.3 8 8 foo #> 8 24.4 4 4 foo #> 9 22.8 4 4 foo #> 10 19.2 6 6 foo #> # … with 22 more rows ``` <sup>Created on 2022-12-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup> After this PR: ``` r library(arrow, warn.conflicts = FALSE) #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information. library(dplyr, warn.conflicts = FALSE) mtcars |> arrow_table() |> select(mpg, cyl) |> group_by(mpg, cyl) |> group_by(cyl, value = "foo") |> collect() #> # A tibble: 32 × 3 #> # Groups: cyl, value [3] #> mpg cyl value #> <dbl> <dbl> <chr> #> 1 21 6 foo #> 2 21 6 foo #> 3 22.8 4 foo #> 4 21.4 6 foo #> 5 18.7 8 foo #> 6 18.1 6 foo #> 7 14.3 8 foo #> 8 24.4 4 foo #> 9 22.8 4 foo #> 10 19.2 6 foo #> # … with 22 more rows ``` <sup>Created on 2022-12-09 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup> * Closes: #14872 Authored-by: Dewey Dunnington <dewey@fishandwhistle.net> Signed-off-by: Dewey Dunnington <dewey@voltrondata.com>

DavZim added the Type: bug label Dec 7, 2022

paleolimbot added the Component: R label Dec 8, 2022

github-actions bot mentioned this issue Dec 9, 2022

GH-14872: [R] arrow returns wrong variable content when multiple group_by/summarise statements are used #14905

Merged

github-actions bot assigned paleolimbot Dec 9, 2022

paleolimbot closed this as completed in #14905 Dec 13, 2022

paleolimbot added this to the 11.0.0 milestone Dec 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R] arrow returns wrong variable content when multiple group_by/summarise statements are used #14872

[R] arrow returns wrong variable content when multiple group_by/summarise statements are used #14872

DavZim commented Dec 7, 2022

eitsupi commented Dec 8, 2022 •

edited

Loading

paleolimbot commented Dec 8, 2022

[R] arrow returns wrong variable content when multiple group_by/summarise statements are used #14872

[R] arrow returns wrong variable content when multiple group_by/summarise statements are used #14872

Comments

DavZim commented Dec 7, 2022

Describe the bug, including details regarding any error messages, version, and platform.

Component(s)

eitsupi commented Dec 8, 2022 • edited Loading

paleolimbot commented Dec 8, 2022

eitsupi commented Dec 8, 2022 •

edited

Loading