GH-41323: [R] Redo how summarize() evaluates expressions #41223

Merged
merged 11 commits into from Apr 22, 2024

@nealrichardson nealrichardson commented Apr 15, 2024

Rationale for this change

Previously, the NSE expression handling in summarize() worked differently from filter/mutate/etc. Among the implications, it would not have been possible to define bindings for other aggregation functions that can be translated into some combination of supported aggregations, such as weighted.mean().

What changes are included in this PR?

  • Expressions in summarize() can now be evaluated with "regular" arrow_eval(). Aggregation bindings stick the contents of the aggregation data they previously returned into an ..aggregations list that lives in an enclosing environment, and then return a FieldRef pointing to that. This makes the code in e.g. summarize_eval() a little harder to follow, since it's grabbing and pointing to objects out of its immediate scope, but I've tried to comment thoroughly and am happy to add more.
  • arrow_eval() inspects the expression it receives for any functions that are not in the NSE mask and not in some other package's namespace (i.e. hopefully just user functions) and inserts them into the NSE mask, setting the enclosing environment for that copy of the function to be the mask, so that if the function calls other functions that we do have bindings for, the bindings get called. This is the approach I suggested back in [R] Try to arrow_eval user-defined functions #29667 (comment), and it is what fixes [R] Try to arrow_eval user-defined functions #29667 and [R] Improve user experience when mixing R code with Arrow dplyr pipelines #40938.

Are these changes tested?

Existing tests, which are pretty comprehensive, pass. But it would be good to try to be more evil in manual testing with the user-defined R function support.

Are there any user-facing changes?

Yes.

Member

@jonkeane jonkeane left a comment

Thanks for this! I haven't pulled it locally (but should sometime soon). A few comments, mostly around comments

r/R/dplyr-summarize.R
Comment on lines -242 to +299
- .data$aggregations <- ctx$aggregations
+ .data$aggregations <- ..aggregations
Member

I'm curious to know more about going from the ctx object to storing these as ..aggregations. I'm not at all opposed, and think this looks more natural given some of our other machinery — but can't tell directly here if/why that's necessary

Contributor Author

I'll explain here, and then I'll add more in comments.

summarize() is complicated because you can do a mixture of scalar operations and aggregations, but that's not how Acero works. So we have to pull out the aggregations, collect them in one list (that will become an Aggregate ExecNode), and in the expressions, replace them with FieldRefs so that further operations can happen (in what will become a ProjectNode that works on the result of the Aggregate).

In "normal" arrow_eval, like in mutate(), each expression/quosure results in a single Arrow Expression. But in summarize(), it could generate one or more aggregations that go into Aggregate, and then an Expression after that. Example from the comments in do_arrow_summarize():

  # For example,
  #   summarize(mean = sum(x) / n())
  # is effectively implemented as
  #   summarize(..temp0 = sum(x), ..temp1 = n()) %>%
  #   mutate(mean = ..temp0 / ..temp1) %>%
  #   select(-starts_with("..temp"))

So, each aggregation binding needs to push a ..tempN aggregation onto some list somewhere and return a FieldRef, so that any projections that happen afterwards evaluate "normally".
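
To make the decomposition concrete, here is a small in-memory sketch using plain dplyr (no Arrow involved). It only illustrates that the two forms give the same answer; the data frame `df` is made up for the example, and the `..tempN` names simply mirror the comment quoted above.

```r
library(dplyr)

df <- tibble(x = c(1, 2, 3, 10))

# What the user writes: aggregations mixed with scalar arithmetic
direct <- df %>%
  summarize(mean = sum(x) / n())

# Roughly what happens underneath: aggregate first, then project
decomposed <- df %>%
  summarize(..temp0 = sum(x), ..temp1 = n()) %>%
  mutate(mean = ..temp0 / ..temp1) %>%
  select(-starts_with("..temp"))

# Both routes produce a one-row tibble with the same value
stopifnot(isTRUE(all.equal(direct, decomposed)))
```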

ctx was useful previously because we weren't calling arrow_eval() on the expressions, we were walking them, looking for known aggregation functions, pulling them out and inserting into the expression the ..tempN symbols, and then once it was all scalar functions left, calling arrow_eval(). But that doesn't allow for defining a weighted.mean binding as function(x, w) sum(x * w) / sum(w) because you can't just substitute into the call to weighted.mean(x, w), you really just want to evaluate it and let the usual Arrow Expression logic work.
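
As a quick sanity check of that definition in plain R (outside Arrow), a weighted mean really is just two `sum()` aggregations and a division. The vectors below are invented for the example.

```r
# Weighted mean expressed only in terms of sum(), as in the binding idea above
wtd_mean <- function(x, w) sum(x * w) / sum(w)

x <- c(1, 2, 3, 4)
w <- c(4, 3, 2, 1)

# Agrees with the base R implementation
stopifnot(isTRUE(all.equal(wtd_mean(x, w), stats::weighted.mean(x, w))))
```

Because the body contains only supported aggregations plus arithmetic, evaluating it (rather than pattern-matching on the call) is enough to produce the aggregations and the projection.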

The challenge was: where is that "somewhere" to collect the aggregations? It needs to be somewhere where the binding functions can find it, we can't pass it in as an argument everywhere.

Generally, R looks up symbols in the environment where the function was defined, and then in each of that environment's parents, like this:

> x <- 1
> f <- function() x
> f()
[1] 1

So you'd think that you could just define ..aggregations in this function and that the bindings, when called while evaluating the expressions, would find it in the environment they were called from. But no. It's like this:

> g <- function() {
+   x <- 2
+   f()
+ }
> g()
[1] 1

f() looks up symbols in the environment where it was defined (and that environment's parents), not in the environment of g(), which called it.

So, I put ..aggregations in this environment, and then set this environment as the parent for each of the aggregation bindings in arrow_mask(). That way, they can find it and assign into it. This seemed better than the alternatives I almost gave up and fell back to, like something global/in the package namespace.
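
A minimal base R sketch of that re-parenting trick, not the actual arrow code: `agg_binding()` and `run_summarize()` are hypothetical names, and the binding returns a plain string where the real one returns a FieldRef.

```r
# Stand-in for an aggregation binding: record the aggregation in
# ..aggregations and return a placeholder name.
agg_binding <- function(fun, arg) {
  name <- paste0("..temp", length(..aggregations))
  # <<- assigns into the enclosing environment where ..aggregations lives
  ..aggregations[[name]] <<- list(fun = fun, arg = arg)
  name
}

run_summarize <- function() {
  # Lives only in this call's environment, so there is nothing to clean up
  ..aggregations <- list()
  # Re-parent a copy of the binding so it looks symbols up here
  binding <- agg_binding
  environment(binding) <- environment()
  refs <- c(binding("sum", "x"), binding("n", NULL))
  list(refs = refs, aggregations = ..aggregations)
}

res <- run_summarize()
res$refs
names(res$aggregations)
```

Calling `agg_binding()` directly, without re-parenting, would fail with an "object not found" error, because its defining environment has no `..aggregations`.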

After doing that, I realized that I could do a similar thing with user functions: copy them into the mask and set their environment to be the mask, so they'd find the other bindings there.
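
The same idea for user functions, again as a base R sketch under assumptions: `mask` here is an ordinary environment standing in for the NSE mask, and the fake `toupper` "binding" is invented so the effect is visible.

```r
# A toy "mask": an environment holding a replacement for toupper()
mask <- new.env(parent = globalenv())
assign("toupper", function(x) paste0("<binding:", x, ">"), envir = mask)

# A user function that calls toupper(); normally this finds base::toupper
shout <- function(x) toupper(x)

# Copy the function into the mask and make the mask its enclosing environment
fn_copy <- shout
environment(fn_copy) <- mask
assign("shout", fn_copy, envir = mask)

original <- shout("a")                       # uses base::toupper
masked <- eval(quote(shout("a")), mask)      # uses the binding in the mask
```

Only the copy is changed; the user's original `shout()` keeps its own environment and behaves as before.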

Does that make sense?

Member

Aaah yes, got it. I didn't totally put together that ..aggregations wasn't at the package scope, but that's really clever. And because it's transient within the call we don't have to worry about flushing it at the end / cleaning it up / managing state, yeah?

Contributor Author

Correct

Contributor Author

I added/reworked the comments in this function, and otherwise did some simplification so that I hope the code is more readable too. LMK what you think.

Comment on lines 307 to 316
# We can tell the expression is invalid if it references fields not in
# the schema of the data after summarize(). Evaluating its type will
# throw an error if it's invalid.
tryCatch(..post_mutate[[post]]$type(out$.data$schema), error = function(e) {
  msg <- paste(
    "Expression", as_label(exprs[[post]]),
    "is not a valid aggregation expression or is"
  )
  arrow_not_supported(msg)
})
Member

Sneaky!
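
For readers unfamiliar with the pattern, a base R sketch of "validate by asking for the type and translating the error" might look like the following. Everything here is a stand-in: `type_of()` is a hypothetical replacement for the Expression's type method, and the schema is a plain named list.

```r
# Hypothetical stand-in for asking an expression for its type against a
# schema: errors if a referenced field is absent.
type_of <- function(fields_used, schema) {
  absent <- setdiff(fields_used, names(schema))
  if (length(absent) > 0) {
    stop("field not found: ", paste(absent, collapse = ", "))
  }
  "double"
}

# Turn the low-level error into a message about the user's expression
validate_post_agg <- function(label, fields_used, schema) {
  tryCatch(
    type_of(fields_used, schema),
    error = function(e) {
      stop("Expression ", label, " is not a valid aggregation expression",
        call. = FALSE
      )
    }
  )
}

# Schema after the aggregation step: only the temporaries exist
schema <- list(..temp0 = "double", ..temp1 = "int32")

ok <- validate_post_agg("sum(x) / n()", c("..temp0", "..temp1"), schema)
bad <- tryCatch(
  validate_post_agg("sum(x) / y", c("..temp0", "y"), schema),
  error = function(e) conditionMessage(e)
)
```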

@@ -279,7 +277,17 @@ test_that("purrr-style lambda functions are supported", {
)
})

- test_that("ARROW-14071 - function(x)-style lambda functions are not supported", {
+ test_that("ARROW-14071 - user-defined R functions", {
Member

Suggested change
- test_that("ARROW-14071 - user-defined R functions", {
+ test_that("ARROW-14071 - R functions from a user's environment", {

Just to be super clear this isn't about UDFs

# We can also define functions that call supported aggregation functions
# and it just works
wtd_mean <- function(x, w) sum(x * w) / sum(w)
withr::local_options(list(arrow.debug = TRUE))
Member

Nice, this is a helpful catch / test that honestly I could see being helpful when debugging this too — but makes this super clear that it's hitting this code and not some other path

@thisisnic
Member

This is cool. A few additional tests it might be worth chucking in to show how it works (or in the docs somewhere?) that are illustrative of a few things I wanted to check:

library(dplyr)
library(arrow)

single_transform <- function(x){
  str_remove_all(x, "[aeiou]")
}

multistep_transform <- function(x){
  y = stringr::str_replace_all(x, "B", "c")
  z = str_remove_all(y, "[aeiou]")
  z2 = str_to_upper(z)
  z2
}

multistep_transform_in_one <- function(x){
  str_to_upper(str_remove_all(stringr::str_replace_all(x, "B", "c"), "[aeiou]"))
}

tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
  arrow_table() %>%
  mutate(y = single_transform(x)) %>%
  collect()
#> # A tibble: 4 × 2
#>   x     y    
#>   <chr> <chr>
#> 1 Foo   F    
#> 2 Bar   Br   
#> 3 Baz   Bz   
#> 4 Qux   Qx

tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
  arrow_table() %>%
  mutate(y = multistep_transform(x)) %>%
  collect()
#> # A tibble: 4 × 2
#>   x     y    
#>   <chr> <chr>
#> 1 Foo   F    
#> 2 Bar   CR   
#> 3 Baz   CZ   
#> 4 Qux   QX

tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
  arrow_table() %>%
  mutate(y = multistep_transform_in_one(x)) %>%
  collect()
#> # A tibble: 4 × 2
#>   x     y    
#>   <chr> <chr>
#> 1 Foo   F    
#> 2 Bar   CR   
#> 3 Baz   CZ   
#> 4 Qux   QX

@thisisnic
Member

thisisnic commented Apr 18, 2024

This is what I was referring to with "not supported in arrow" @nealrichardson though I hadn't realised it already works really well when calling functions directly in terms of reporting the failed function, so things might be fine as-is, though printing out which function doesn't have bindings would be a nice-to-have.

library(arrow)
library(dplyr)

single_transform <- function(x){
  # this function does have bindings
  stringr::str_remove_all(x, "[aeiou]")
}

# succeeds
tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
  arrow_table() %>%
  mutate(y = single_transform(x)) %>%
  collect()
#> # A tibble: 4 × 2
#>   x     y    
#>   <chr> <chr>
#> 1 Foo   F    
#> 2 Bar   Br   
#> 3 Baz   Bz   
#> 4 Qux   Qx

single_transform2 <- function(x){
  # this function doesn't have bindings
  stringr::str_to_sentence(x)
}

tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
  arrow_table() %>%
  mutate(y = stringr::str_to_sentence(x)) %>%
  collect()
#> Warning: Expression stringr::str_to_sentence(x) not supported in Arrow; pulling
#> data into R
#> # A tibble: 4 × 2
#>   x     y    
#>   <chr> <chr>
#> 1 Foo   Foo  
#> 2 Bar   Bar  
#> 3 Baz   Baz  
#> 4 Qux   Qux

tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
  arrow_table() %>%
  mutate(y = single_transform2(x)) %>%
  collect()
#> Warning: Expression single_transform2(x) not supported in Arrow; pulling data
#> into R
#> # A tibble: 4 × 2
#>   x     y    
#>   <chr> <chr>
#> 1 Foo   Foo  
#> 2 Bar   Bar  
#> 3 Baz   Baz  
#> 4 Qux   Qux

@nealrichardson nealrichardson changed the title [R] Redo how summarize() evaluates expressions GH-41323: [R] Redo how summarize() evaluates expressions Apr 21, 2024
@nealrichardson nealrichardson linked an issue Apr 21, 2024 that may be closed by this pull request
⚠️ GitHub issue #41323 has been automatically assigned in GitHub to PR creator.

@nealrichardson nealrichardson marked this pull request as ready for review April 22, 2024 13:46
@nealrichardson
Contributor Author

This is ready to go AFAIK @jonkeane @thisisnic, LMK if you have any more feedback.

Member

@thisisnic thisisnic left a comment

Looks good though probably worth addressing this comment in a follow-up ticket perhaps?

@nealrichardson
Contributor Author

> Looks good though probably worth addressing this comment in a follow-up ticket perhaps?

I don't think I can do any better than what it does now. With single_transform2(x) failing, there's not anything in that expression that would give away what is not supported, you have to step through the function, and it could be doing anything so IDK that looking for unsupported functions named within that would do much. And unfortunately the error that stringr::str_to_sentence raises isn't obvious either: if you set options(arrow.debug=TRUE), the warning shown includes the actual error, saying: "Warning: In single_transform2(x), cannot coerce type 'environment' to vector of type 'character'; pulling data into R".

@nealrichardson nealrichardson merged commit 5865e96 into apache:main Apr 22, 2024
12 checks passed
@nealrichardson nealrichardson deleted the redo-summarize-eval branch April 22, 2024 20:20
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5865e96.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

zanmato1984 pushed a commit to zanmato1984/arrow that referenced this pull request Apr 24, 2024
amoeba added a commit that referenced this pull request Apr 24, 2024
### Rationale for this change

This clarifies the language added in #41223, as discussed in a post-merge review in #41223 (comment).

### What changes are included in this PR?

Just a tweak to R's NEWS.md file.

### Are these changes tested?

No.

### Are there any user-facing changes?

No.

Authored-by: Bryce Mecum <petridish@gmail.com>
Signed-off-by: Bryce Mecum <petridish@gmail.com>
nealrichardson added a commit that referenced this pull request Apr 29, 2024
### Rationale for this change

Since it doesn't look like Acero will be getting window functions any
time soon, implement support in `mutate()` for transformations that
involve aggregations, like `x - mean(x)`, via left_join.

### What changes are included in this PR?

Following #41223, I realized I could reuse that evaluation path in
`mutate()`. Evaluating expressions accumulates `..aggregations` and
`mutate_stuff`; in summarize() we apply aggregations and then mutate on
the result. If expressions in the `mutate_stuff` reference columns in
the original data and not just the result of aggregations, we reject it.

Here, if there are aggregations, we apply them on a copy of the query up
to that point, and join the result back onto the query, then apply the
mutations on that. It's not a problem for those mutate expressions to
reference both columns in the original data and the results of the
aggregations because both are present.
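
A rough in-memory sketch of that rewrite, using plain dplyr rather than Arrow (an illustration only, not this PR's implementation; the data and the `..temp0` column name are made up):

```r
library(dplyr)

df <- tibble(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))

# What the user writes: a mutate() that involves an aggregation
direct <- df %>%
  group_by(g) %>%
  mutate(centered = x - mean(x)) %>%
  ungroup()

# The rewrite: aggregate on a copy of the query ...
aggregated <- df %>%
  group_by(g) %>%
  summarize(..temp0 = mean(x))

# ... join the result back on, then apply the mutation
rewritten <- df %>%
  left_join(aggregated, by = "g") %>%
  mutate(centered = x - ..temp0) %>%
  select(-starts_with("..temp"))

stopifnot(isTRUE(all.equal(direct, rewritten)))
```

In-memory dplyr keeps the row order of the left table here, which is why the comparison is exact; that ordering is not something to rely on with the Acero join.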

There are ~three~ two caveats:

* Join has non-deterministic order, so while `mutate()` doesn't
generally affect row order, if this code path is activated, row order
may not be stable. With datasets, it's not guaranteed anyway.
* ~Acero's join seems to have a limitation currently where missing
values are not joined to each other. If your join key has NA in it, and
you do a left_join, your new columns will all be NA, even if there is a
corresponding value in the right dataset. I made
#41358 to address that, and in the
meantime, I've added a workaround
(b9de504)
that's not awesome but has the right behavior.~ Fixed and rebased.
* I believe it is possible in dplyr to get this behavior in other verbs:
filter, arrange, even summarize. I've only done this for mutate. Are we
ok with that?

### Are these changes tested?

Yes

### Are there any user-facing changes?

This works now:

``` r
library(arrow)
library(dplyr)

mtcars |>
  arrow_table() |>
  select(cyl, mpg, hp) |>
  group_by(cyl) |>
  mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |>
  collect()
#> # A tibble: 32 × 4
#> # Groups:   cyl [3]
#>      cyl   mpg    hp stdize_mpg
#>    <dbl> <dbl> <dbl>      <dbl>
#>  1     6  21     110      0.865
#>  2     6  21     110      0.865
#>  3     4  22.8    93     -0.857
#>  4     6  21.4   110      1.14 
#>  5     8  18.7   175      1.41 
#>  6     6  18.1   105     -1.13 
#>  7     8  14.3   245     -0.312
#>  8     4  24.4    62     -0.502
#>  9     4  22.8    95     -0.857
#> 10     6  19.2   123     -0.373
#> # ℹ 22 more rows
```

<sup>Created on 2024-04-23 with [reprex
v2.1.0](https://reprex.tidyverse.org)</sup>

* GitHub Issue: #29537
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
tolleybot pushed a commit to tmct/arrow that referenced this pull request May 4, 2024
Successfully merging this pull request may close these issues.

[R] Redo how summarize() evaluates
[R] Try to arrow_eval user-defined functions
4 participants