New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-41323: [R] Redo how summarize() evaluates expressions #41223
GH-41323: [R] Redo how summarize() evaluates expressions #41223
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this! I haven't pulled it locally (but should sometime soon). A few comments, mostly around comments
.data$aggregations <- ctx$aggregations | ||
.data$aggregations <- ..aggregations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious to know more about going from the ctx
object to storing these as ..aggregations
. I'm not at all opposed, and think this looks more natural given some of our other machinery — but can't tell directly here if/why that's necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll explain here, and then I'll add more in comments.
summarize()
is complicated because you can do a mixture of scalar operations and aggregations, but that's not how acero works. So we have to pull out the aggregations, collect them in one list (that will become an Aggregate ExecNode), and in the expressions, replace them with FieldRefs so that further operations can happen (in what will become a ProjectNode that works on the result of the Aggregate).
In "normal" arrow_eval, like in mutate()
, each expression/quosure results in a single Arrow Expression. But in summarize()
, it could generate one or more aggregations that go into Aggregate, and then an Expression after that. Example from the comments in do_arrow_summarize()
:
# For example,
# summarize(mean = sum(x) / n())
# is effectively implemented as
# summarize(..temp0 = sum(x), ..temp1 = n()) %>%
# mutate(mean = ..temp0 / ..temp1) %>%
# select(-starts_with("..temp"))
So, each aggregation binding needs to push a ..tempN
aggregation onto some list somewhere and return a FieldRef so that any projections happening after evaluate "normally".
ctx
was useful previously because we weren't calling arrow_eval()
on the expressions, we were walking them, looking for known aggregation functions, pulling them out and inserting into the expression the ..tempN
symbols, and then once it was all scalar functions left, calling arrow_eval()
. But that doesn't allow for defining a weighted.mean
binding as function(x, w) sum(x * w) / sum(w)
because you can't just substitute into the call to weighted.mean(x, w)
, you really just want to evaluate it and let the usual Arrow Expression logic work.
The challenge was: where is that "somewhere" to collect the aggregations? It needs to be somewhere where the binding functions can find it, we can't pass it in as an argument everywhere.
Generally, R looks up symbols in each parent frame of where the function is defined, like this:
> x <- 1
> f <- function() x
> f()
[1] 1
So you'd think that you could just have ..aggregations
in this function, and when you evaluate the expressions, they would find them in the enclosing environment. But no. It's like this:
> g <- function() {
+ x <- 2
+ f()
+ }
> g()
[1] 1
f()
has its own environment and looks up symbols from its parents.
So, I put ..aggregations
in this environment, and then set this environment as the parent for each of the aggregation bindings in arrow_mask()
. That way, they can find it and assign into it. This seemed better than the alternatives I almost gave up and fell back to, like something global/in the package namespace.
After doing that, I realized that I could do a similar thing with user functions: copy them into the mask and set their environment to be the mask, so they'd find the other bindings there.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aaah yes, got it. I didn't totally put together that ..aggregations
wasn't at the package scope, but that's really clever. And because it's transient within the call we don't have to worry about flushing it at the end / cleaning it up / managing state, yeah?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added/reworked the comments in this function, and otherwise did some simplification so that I hope the code is more readable too. LMK what you think.
r/R/dplyr-summarize.R
Outdated
# We can tell the expression is invalid if it references fields not in | ||
# the schema of the data after summarize(). Evaulating its type will | ||
# throw an error if it's invalid. | ||
tryCatch(..post_mutate[[post]]$type(out$.data$schema), error = function(e) { | ||
msg <- paste( | ||
"Expression", as_label(exprs[[post]]), | ||
"is not a valid aggregation expression or is" | ||
) | ||
arrow_not_supported(msg) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sneaky!
r/tests/testthat/test-dplyr-across.R
Outdated
@@ -279,7 +277,17 @@ test_that("purrr-style lambda functions are supported", { | |||
) | |||
}) | |||
|
|||
test_that("ARROW-14071 - function(x)-style lambda functions are not supported", { | |||
test_that("ARROW-14071 - user-defined R functions", { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_that("ARROW-14071 - user-defined R functions", { | |
test_that("ARROW-14071 - R functions from a user's environment", { |
Just to be super clear this isn't about UDFs
# We can also define functions that call supported aggregation functions | ||
# and it just works | ||
wtd_mean <- function(x, w) sum(x * w) / sum(w) | ||
withr::local_options(list(arrow.debug = TRUE)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice, this is a helpful catch / test that honestly I could see being helpful when debugging this too — but makes this super clear that it's hitting this code and not some other path
This is cool. A few additional tests it might be worth chucking in to show how it works (or in the docs somewhere?) that are illustrative of a few things I wanted to check: library(dplyr)
library(arrow)
single_transform <- function(x){
str_remove_all(x, "[aeiou]")
}
multistep_transform <- function(x){
y = stringr::str_replace_all(x, "B", "c")
z = str_remove_all(y, "[aeiou]")
z2 = str_to_upper(z)
z2
}
multistep_transform_in_one <- function(x){
str_to_upper(str_remove_all(stringr::str_replace_all(x, "B", "c"), "[aeiou]"))
}
tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
arrow_table() %>%
mutate(y = single_transform(x)) %>%
collect()
#> # A tibble: 4 × 2
#> x y
#> <chr> <chr>
#> 1 Foo F
#> 2 Bar Br
#> 3 Baz Bz
#> 4 Qux Qx
tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
arrow_table() %>%
mutate(y = multistep_transform(x)) %>%
collect()
#> # A tibble: 4 × 2
#> x y
#> <chr> <chr>
#> 1 Foo F
#> 2 Bar CR
#> 3 Baz CZ
#> 4 Qux QX
tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
arrow_table() %>%
mutate(y = multistep_transform_in_one(x)) %>%
collect()
#> # A tibble: 4 × 2
#> x y
#> <chr> <chr>
#> 1 Foo F
#> 2 Bar CR
#> 3 Baz CZ
#> 4 Qux QX |
This is what I was referring to with "not supported in arrow" @nealrichardson though I hadn't realised it already works really well when calling functions directly in terms of reporting the failed function, so things might be fine as-is, though printing out which function doesn't have bindings would be a nice-to-have. library(arrow)
library(dplyr)
single_transform <- function(x){
# this function does have bindings
stringr::str_remove_all(x, "[aeiou]")
}
# succeeds
tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
arrow_table() %>%
mutate(y = single_transform(x)) %>%
collect()
#> # A tibble: 4 × 2
#> x y
#> <chr> <chr>
#> 1 Foo F
#> 2 Bar Br
#> 3 Baz Bz
#> 4 Qux Qx
single_transform2 <- function(x){
# this function doesn't have bindings
stringr::str_to_sentence(x)
}
tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
arrow_table() %>%
mutate(y = stringr::str_to_sentence(x)) %>%
collect()
#> Warning: Expression stringr::str_to_sentence(x) not supported in Arrow; pulling
#> data into R
#> # A tibble: 4 × 2
#> x y
#> <chr> <chr>
#> 1 Foo Foo
#> 2 Bar Bar
#> 3 Baz Baz
#> 4 Qux Qux
tibble::tibble(x = c("Foo", "Bar", "Baz", "Qux")) %>%
arrow_table() %>%
mutate(y = single_transform2(x)) %>%
collect()
#> Warning: Expression single_transform2(x) not supported in Arrow; pulling data
#> into R
#> # A tibble: 4 × 2
#> x y
#> <chr> <chr>
#> 1 Foo Foo
#> 2 Bar Bar
#> 3 Baz Baz
#> 4 Qux Qux |
2b44bd1
to
b59cda7
Compare
|
This is ready to go AFAIK @jonkeane @thisisnic, LMK if you have any more feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good though probably worth addressing this comment in a follow-up ticket perhaps?
I don't think I can do any better than what it does now. With |
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit 5865e96. There was 1 benchmark result indicating a performance regression:
The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…e#41223) ### Rationale for this change Previously, the NSE expression handling in `summarize()` worked differently from filter/mutate/etc. Among the implications, it would not have been possible to define bindings for other aggregation functions that can be translated into some combination of supported aggregations, such as `weighted.mean()`. ### What changes are included in this PR? * Expressions in `summarize()` can now be evaluated with "regular" `arrow_eval()`. Aggregation bindings stick the contents of the aggregation data they previously returned into an `..aggregations` list that lives in an enclosing environment, and then return a FieldRef pointing to that. This makes the code in e.g. `summarize_eval()` a little harder to follow, since it's grabbing and pointing to objects out of its immediate scope, but I've tried to comment thoroughly and am happy to add more. * `arrow_eval()` inspects the expression it receives for any functions that are not in the NSE mask and not in some other package's namespace (i.e. hopefully just user functions) and inserts them into the NSE mask, setting the enclosing environment for that copy of the function to be the mask, so that if the function calls other functions that we do have bindings for, the bindings get called. This is the approach I suggested back in apache#29667 (comment), and it is what fixes apache#29667 and apache#40938. ### Are these changes tested? Existing tests, which are pretty comprehensive, pass. But it would be good to try to be more evil in manual testing with the user-defined R function support. ### Are there any user-facing changes? Yes. * GitHub Issue: apache#41323
### Rationale for this change This clarifies the language added in #41223, as discussed in a post-merge review in #41223 (comment). ### What changes are included in this PR? Just a tweak to R's NEWS.md file. ### Are these changes tested? No. ### Are there any user-facing changes? No. Authored-by: Bryce Mecum <petridish@gmail.com> Signed-off-by: Bryce Mecum <petridish@gmail.com>
### Rationale for this change Since it doesn't look like Acero will be getting window functions any time soon, implement support in `mutate()` for transformations that involve aggregations, like `x - mean(x)`, via left_join. ### What changes are included in this PR? Following #41223, I realized I could reuse that evaluation path in `mutate()`. Evaluating expressions accumulates `..aggregations` and `mutate_stuff`; in summarize() we apply aggregations and then mutate on the result. If expressions in the `mutate_stuff` reference columns in the original data and not just the result of aggregations, we reject it. Here, if there are aggregations, we apply them on a copy of the query up to that point, and join the result back onto the query, then apply the mutations on that. It's not a problem for those mutate expressions to reference both columns in the original data and the results of the aggregations because both are present. There are ~three~ two caveats: * Join has non-deterministic order, so while `mutate()` doesn't generally affect row order, if this code path is activated, row order may not be stable. With datasets, it's not guaranteed anyway. * ~Acero's join seems to have a limitation currently where missing values are not joined to each other. If your join key has NA in it, and you do a left_join, your new columns will all be NA, even if there is a corresponding value in the right dataset. I made #41358 to address that, and in the meantime, I've added a workaround (b9de504) that's not awesome but has the right behavior.~ Fixed and rebased. * I believe it is possible in dplyr to get this behavior in other verbs: filter, arrange, even summarize. I've only done this for mutate. Are we ok with that? ### Are these changes tested? Yes ### Are there any user-facing changes? This works now: ``` r library(arrow) library(dplyr) mtcars |> arrow_table() |> select(cyl, mpg, hp) |> group_by(cyl) |> mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |> collect() #> # A tibble: 32 × 4 #> # Groups: cyl [3] #> cyl mpg hp stdize_mpg #> <dbl> <dbl> <dbl> <dbl> #> 1 6 21 110 0.865 #> 2 6 21 110 0.865 #> 3 4 22.8 93 -0.857 #> 4 6 21.4 110 1.14 #> 5 8 18.7 175 1.41 #> 6 6 18.1 105 -1.13 #> 7 8 14.3 245 -0.312 #> 8 4 24.4 62 -0.502 #> 9 4 22.8 95 -0.857 #> 10 6 19.2 123 -0.373 #> # ℹ 22 more rows ``` <sup>Created on 2024-04-23 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup> * GitHub Issue: #29537
…e#41223) ### Rationale for this change Previously, the NSE expression handling in `summarize()` worked differently from filter/mutate/etc. Among the implications, it would not have been possible to define bindings for other aggregation functions that can be translated into some combination of supported aggregations, such as `weighted.mean()`. ### What changes are included in this PR? * Expressions in `summarize()` can now be evaluated with "regular" `arrow_eval()`. Aggregation bindings stick the contents of the aggregation data they previously returned into an `..aggregations` list that lives in an enclosing environment, and then return a FieldRef pointing to that. This makes the code in e.g. `summarize_eval()` a little harder to follow, since it's grabbing and pointing to objects out of its immediate scope, but I've tried to comment thoroughly and am happy to add more. * `arrow_eval()` inspects the expression it receives for any functions that are not in the NSE mask and not in some other package's namespace (i.e. hopefully just user functions) and inserts them into the NSE mask, setting the enclosing environment for that copy of the function to be the mask, so that if the function calls other functions that we do have bindings for, the bindings get called. This is the approach I suggested back in apache#29667 (comment), and it is what fixes apache#29667 and apache#40938. ### Are these changes tested? Existing tests, which are pretty comprehensive, pass. But it would be good to try to be more evil in manual testing with the user-defined R function support. ### Are there any user-facing changes? Yes. * GitHub Issue: apache#41323
…he#41368) ### Rationale for this change This clarifies the language added in apache#41223, as discussed in a post-merge review in apache#41223 (comment). ### What changes are included in this PR? Just a tweak to R's NEWS.md file. ### Are these changes tested? No. ### Are there any user-facing changes? No. Authored-by: Bryce Mecum <petridish@gmail.com> Signed-off-by: Bryce Mecum <petridish@gmail.com>
…he#41350) ### Rationale for this change Since it doesn't look like Acero will be getting window functions any time soon, implement support in `mutate()` for transformations that involve aggregations, like `x - mean(x)`, via left_join. ### What changes are included in this PR? Following apache#41223, I realized I could reuse that evaluation path in `mutate()`. Evaluating expressions accumulates `..aggregations` and `mutate_stuff`; in summarize() we apply aggregations and then mutate on the result. If expressions in the `mutate_stuff` reference columns in the original data and not just the result of aggregations, we reject it. Here, if there are aggregations, we apply them on a copy of the query up to that point, and join the result back onto the query, then apply the mutations on that. It's not a problem for those mutate expressions to reference both columns in the original data and the results of the aggregations because both are present. There are ~three~ two caveats: * Join has non-deterministic order, so while `mutate()` doesn't generally affect row order, if this code path is activated, row order may not be stable. With datasets, it's not guaranteed anyway. * ~Acero's join seems to have a limitation currently where missing values are not joined to each other. If your join key has NA in it, and you do a left_join, your new columns will all be NA, even if there is a corresponding value in the right dataset. I made apache#41358 to address that, and in the meantime, I've added a workaround (apache@b9de504) that's not awesome but has the right behavior.~ Fixed and rebased. * I believe it is possible in dplyr to get this behavior in other verbs: filter, arrange, even summarize. I've only done this for mutate. Are we ok with that? ### Are these changes tested? Yes ### Are there any user-facing changes? This works now: ``` r library(arrow) library(dplyr) mtcars |> arrow_table() |> select(cyl, mpg, hp) |> group_by(cyl) |> mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |> collect() #> # A tibble: 32 × 4 #> # Groups: cyl [3] #> cyl mpg hp stdize_mpg #> <dbl> <dbl> <dbl> <dbl> #> 1 6 21 110 0.865 #> 2 6 21 110 0.865 #> 3 4 22.8 93 -0.857 #> 4 6 21.4 110 1.14 #> 5 8 18.7 175 1.41 #> 6 6 18.1 105 -1.13 #> 7 8 14.3 245 -0.312 #> 8 4 24.4 62 -0.502 #> 9 4 22.8 95 -0.857 #> 10 6 19.2 123 -0.373 #> # ℹ 22 more rows ``` <sup>Created on 2024-04-23 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup> * GitHub Issue: apache#29537
…e#41223) ### Rationale for this change Previously, the NSE expression handling in `summarize()` worked differently from filter/mutate/etc. Among the implications, it would not have been possible to define bindings for other aggregation functions that can be translated into some combination of supported aggregations, such as `weighted.mean()`. ### What changes are included in this PR? * Expressions in `summarize()` can now be evaluated with "regular" `arrow_eval()`. Aggregation bindings stick the contents of the aggregation data they previously returned into an `..aggregations` list that lives in an enclosing environment, and then return a FieldRef pointing to that. This makes the code in e.g. `summarize_eval()` a little harder to follow, since it's grabbing and pointing to objects out of its immediate scope, but I've tried to comment thoroughly and am happy to add more. * `arrow_eval()` inspects the expression it receives for any functions that are not in the NSE mask and not in some other package's namespace (i.e. hopefully just user functions) and inserts them into the NSE mask, setting the enclosing environment for that copy of the function to be the mask, so that if the function calls other functions that we do have bindings for, the bindings get called. This is the approach I suggested back in apache#29667 (comment), and it is what fixes apache#29667 and apache#40938. ### Are these changes tested? Existing tests, which are pretty comprehensive, pass. But it would be good to try to be more evil in manual testing with the user-defined R function support. ### Are there any user-facing changes? Yes. * GitHub Issue: apache#41323
…he#41350) ### Rationale for this change Since it doesn't look like Acero will be getting window functions any time soon, implement support in `mutate()` for transformations that involve aggregations, like `x - mean(x)`, via left_join. ### What changes are included in this PR? Following apache#41223, I realized I could reuse that evaluation path in `mutate()`. Evaluating expressions accumulates `..aggregations` and `mutate_stuff`; in summarize() we apply aggregations and then mutate on the result. If expressions in the `mutate_stuff` reference columns in the original data and not just the result of aggregations, we reject it. Here, if there are aggregations, we apply them on a copy of the query up to that point, and join the result back onto the query, then apply the mutations on that. It's not a problem for those mutate expressions to reference both columns in the original data and the results of the aggregations because both are present. There are ~three~ two caveats: * Join has non-deterministic order, so while `mutate()` doesn't generally affect row order, if this code path is activated, row order may not be stable. With datasets, it's not guaranteed anyway. * ~Acero's join seems to have a limitation currently where missing values are not joined to each other. If your join key has NA in it, and you do a left_join, your new columns will all be NA, even if there is a corresponding value in the right dataset. I made apache#41358 to address that, and in the meantime, I've added a workaround (apache@b9de504) that's not awesome but has the right behavior.~ Fixed and rebased. * I believe it is possible in dplyr to get this behavior in other verbs: filter, arrange, even summarize. I've only done this for mutate. Are we ok with that? ### Are these changes tested? Yes ### Are there any user-facing changes? This works now: ``` r library(arrow) library(dplyr) mtcars |> arrow_table() |> select(cyl, mpg, hp) |> group_by(cyl) |> mutate(stdize_mpg = (mpg - mean(mpg)) / sd(mpg)) |> collect() #> # A tibble: 32 × 4 #> # Groups: cyl [3] #> cyl mpg hp stdize_mpg #> <dbl> <dbl> <dbl> <dbl> #> 1 6 21 110 0.865 #> 2 6 21 110 0.865 #> 3 4 22.8 93 -0.857 #> 4 6 21.4 110 1.14 #> 5 8 18.7 175 1.41 #> 6 6 18.1 105 -1.13 #> 7 8 14.3 245 -0.312 #> 8 4 24.4 62 -0.502 #> 9 4 22.8 95 -0.857 #> 10 6 19.2 123 -0.373 #> # ℹ 22 more rows ``` <sup>Created on 2024-04-23 with [reprex v2.1.0](https://reprex.tidyverse.org)</sup> * GitHub Issue: apache#29537
Rationale for this change
Previously, the NSE expression handling in
summarize()
worked differently from filter/mutate/etc. Among the implications, it would not have been possible to define bindings for other aggregation functions that can be translated into some combination of supported aggregations, such asweighted.mean()
.What changes are included in this PR?
summarize()
can now be evaluated with "regular"arrow_eval()
. Aggregation bindings stick the contents of the aggregation data they previously returned into an..aggregations
list that lives in an enclosing environment, and then return a FieldRef pointing to that. This makes the code in e.g.summarize_eval()
a little harder to follow, since it's grabbing and pointing to objects out of its immediate scope, but I've tried to comment thoroughly and am happy to add more.arrow_eval()
inspects the expression it receives for any functions that are not in the NSE mask and not in some other package's namespace (i.e. hopefully just user functions) and inserts them into the NSE mask, setting the enclosing environment for that copy of the function to be the mask, so that if the function calls other functions that we do have bindings for, the bindings get called. This is the approach I suggested back in [R] Try to arrow_eval user-defined functions #29667 (comment), and it is what fixes [R] Try to arrow_eval user-defined functions #29667 and [R] Improve user experience when mixing R code with Arrow dplyr pipelines #40938.Are these changes tested?
Existing tests, which are pretty comprehensive, pass. But it would be good to try to be more evil in manual testing with the user-defined R function support.
Are there any user-facing changes?
Yes.