-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-11699: [R] Implement dplyr::across() for mutate() #13786
Conversation
|
r/tests/testthat/test-dplyr-mutate.R
Outdated
compare_dplyr_binding( | ||
.input %>% | ||
select(name, homeworld, species) %>% | ||
mutate(across(!name, as.factor)) %>% | ||
collect(), | ||
starwars, | ||
warning = "Expression across.*not supported in Arrow" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test now fails with the following error:
Error in `collect()`:
! Invalid: ExecuteScalarExpression cannot Execute non-scalar expression dictionary_encode(homeworld, {null_encoding_behavior=MASK})
/home/nic2/arrow/cpp/src/arrow/compute/exec/project_node.cc:91 ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
/home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:559 iterator_.Next()
/home/nic2/arrow/cpp/src/arrow/record_batch.cc:337 ReadNext(&batch)
/home/nic2/arrow/cpp/src/arrow/record_batch.cc:351 ToRecordBatches()
Run `rlang::last_error()` to see where the error occurred.
TODO: investigate more, open JIRA, wrap test in expect_error with comment linking to JIRA
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There already is at least one issue for as.factor
in Acero: ARROW-12632, and I think there was another.
I'd change this test to use a different function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Though we should probably make sure that as.factor
isn't mapped to anything in the dplyr funcs for now, so that this fails in the right way rather than try to collect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've changed the test, and then opened a ticket and associated PR to remove the existing mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So cool! A few thoughts although I learned dplyr before across()
existed and I admittedly have never used it in real life.
r/R/dplyr-mutate.R
Outdated
new_quos <- quosures_from_func_list(func_list, cols) | ||
} | ||
|
||
quos_out <- append(quos_out, new_quos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still working my head around this but I think that somewhere you need to rlang::quo_set_env(quo_out, rlang::quo_get_env(quo_in))
to make sure that symbol references that are not columns are fetched from the calling environment. I'm struggling to come up with an example where that can happen so maybe this isn't relevant here, but it seems like somehow rlang::quo_get_env(quo_in))
should be passed on to the output quosures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, updated now.
r/R/dplyr-mutate.R
Outdated
new_quo_names <- map2_chr( | ||
names(func_list_full), cols_list_full, | ||
~ paste(.y, .x, sep = "_") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you implement the .names
argument here? There's almost certainly a cleaner way to do this but something like:
withr::with_environment(as.environment(list(.col = "something", .fn = "something-else")), glue::glue("{.col}_{.fn}"))
#> something_something-else
Created on 2022-08-11 by the reprex package (v2.0.1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've deferred it to another ticket - ARROW-17364
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now implemented here
collect(), | ||
tbl | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this will work but what about across(1:dbl2, list("fun1" = round, "fun2" = exp))
? Does putting something obnoxious like across(1:dbl2, list("fun1" = round(this_is_not_cool(something_else)), "fun2" = exp))
result in an interpretable error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catches there, have added in a couple of failing tests for those that I will make pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
across(1:dbl2, list("fun1" = round, "fun2" = exp))
Good catch! I've now fixed it to properly check for the name from the list and not just the function name. -
across(1:dbl2, list("fun1" = round(this_is_not_cool(something_else)), "fun2" = exp))
This doesn't work in R either. We get the "not supported in Arrow" error, which then pushes it to R, which then gives the error it would have given anyway. I've added a test in for it, but it's a bit clunky and I'm not sure if we need it. What do you think? (It's the test which hasround(sqrt(dbl))
in it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for missing this comment...I think it's a good test to have in there because you're walking the syntax tree yourself. When I wrote this, I was worried somebody would get something like object of type LANGSXP is not subsettable
or something.
ca784d9
to
76024a7
Compare
collect(), | ||
tbl | ||
), | ||
"Unsupported selection helper" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this unsupported? Do we have JIRA(s) for this? Seems like it should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where()
isn't yet supported - have added a comment which links to the JIRA for it
r/R/dplyr-mutate.R
Outdated
if (!is_list(funcs) && as.character(funcs)[[1]] == "~") { | ||
abort( | ||
paste( | ||
"purrr-style lambda functions as `.fns` argument to `across()`", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? We already import purrr::as_mapper
, is this more complicated than funcs <- map(funcs, as_mapper)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, if I try to do that, I get the error:
Error in as_mapper(funcs) :
Formula must carry an environment.
Currently trying to work out how to fix that, f_env()
looks promising!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that works; as_mapper()
get us the function, but what I want is the name of the function so that I can make a quosure, which later turns into something which a later step turns into something that Arrow can work with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't make sense until ARROW-14071 is merged so deferring it for now.
r/R/dplyr-mutate.R
Outdated
@@ -24,7 +24,9 @@ mutate.arrow_dplyr_query <- function(.data, | |||
.before = NULL, | |||
.after = NULL) { | |||
call <- match.call() | |||
exprs <- ensure_named_exprs(quos(...)) | |||
|
|||
expression_list <- unfold_across(.data, quos(...)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you've deferred this from filter and summarize, but is that necessary? Is it more involved than adding in this unfold_across
in them? There's no other modifications to mutate() so it seems reasonable to guess that it would be that straightforward in the other functions too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not as straightforward as I need to delve into the internals of do_arrow_summarize()
and swap some bits around. Given this PR is not tiny and I'd be wanting to add a lot of tests for all 3, I'd like to keep them separate to try to keep things simpler for review.
This PR now also implements the |
…ummarise tests in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so close! I'm definitely in favor of deferring filter()
and summarise()
to another PR (although it will be confusing if they are not implemented before the next release). I really like how you split up the functions in dplyr-across.R...made it easy to review!
The one thing I think this needs is a set of tests directly on expand_across()
. I think you can very easily convert the existing tests to something like:
library(rlang, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
testthat::expect_identical(
arrow:::expand_across(
data.frame(dbl = double(), dbl2 = double()),
rlang::quos(across(c(dbl, dbl2), round))
),
list(
dbl = quo(round(dbl)),
dbl2 = quo(round(dbl2))
)
)
Created on 2022-08-26 by the reprex package (v2.0.1)
(With apologies for not suggesting that in my first review! I think those tests would help to communicate exactly what expand_across()
is doing and highlight the fact that you did a really good job separating the implementation into highly testable pieces.)
@paleolimbot Thanks for the review! Quick question for you; I want to keep in the tests which have |
I think you nailed it - the I would personally just copy/paste all the places where |
I absolutely agree with the argument that this is a big function and it'd be worth having tests for it, though also wondering if copying and pasting every single test might be a bit excessive - we don't necessarily test intermediate functions in some of our other dplyr verbs (though perhaps we should). I might just check with @nealrichardson or @jonkeane to see if they see any middle ground in what needs to be tested - if there isn't, I'm happy to copy and paste everything and approach it that way. |
Haven't looked at the PR lately but, as I understand it,
|
@paleolimbot I've updated the tests as per your and Neals' useful suggestions. Mind giving it another review now? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look great! Thank you!
collect(), | ||
tbl | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for missing this comment...I think it's a good test to have in there because you're walking the syntax tree yourself. When I wrote this, I was worried somebody would get something like object of type LANGSXP is not subsettable
or something.
The `rlang::quo` function used in lines added in PR #13786 (d5f80cb) but not imported. ```r mtcars |> arrow::arrow_table() |> dplyr::mutate(dplyr::across(starts_with("c"), as.character)) |> dplyr::collect() #> Error in quo(!!call2(.x, sym(.y))) : could not find function "quo" ``` Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
This PR introduces a partial implementation for `dplyr::across()` when called within `dplyr::mutate()`. ``` r arrow_table(iris) %>% mutate(across(starts_with("Sepal"), list(round, sqrt))) #> Table (query) #> Sepal.Length: double #> Sepal.Width: double #> Petal.Length: double #> Petal.Width: double #> Species: dictionary<values=string, indices=int8> #> Sepal.Length_1: double (round(Sepal.Length, {ndigits=0, round_mode=HALF_TO_EVEN})) #> Sepal.Length_2: double (sqrt_checked(Sepal.Length)) #> Sepal.Width_1: double (round(Sepal.Width, {ndigits=0, round_mode=HALF_TO_EVEN})) #> Sepal.Width_2: double (sqrt_checked(Sepal.Width)) #> #> See $.data for the source Arrow object ``` I've opened a number of follow-up tickets for the tasks needed to be done to provide a more complete implementation: * [ARROW-17362: [R] Implement dplyr::across() inside summarise()](https://issues.apache.org/jira/browse/ARROW-17362) * [ARROW-17387: [R] Implement dplyr::across() inside filter()](https://issues.apache.org/jira/browse/ARROW-17387) * ~[ARROW-17364: [R] Implement .names argument inside across()](https://issues.apache.org/jira/browse/ARROW-17364)~ (now done in this PR, will close it once this is merged) * [ARROW-17366: [R] Support purrr-style lambda functions in .fns argument to across()](https://issues.apache.org/jira/browse/ARROW-17366) Closes apache#13786 from thisisnic/ARROW-11699_across Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
This PR introduces a partial implementation for `dplyr::across()` when called within `dplyr::mutate()`. ``` r arrow_table(iris) %>% mutate(across(starts_with("Sepal"), list(round, sqrt))) #> Table (query) #> Sepal.Length: double #> Sepal.Width: double #> Petal.Length: double #> Petal.Width: double #> Species: dictionary<values=string, indices=int8> #> Sepal.Length_1: double (round(Sepal.Length, {ndigits=0, round_mode=HALF_TO_EVEN})) #> Sepal.Length_2: double (sqrt_checked(Sepal.Length)) #> Sepal.Width_1: double (round(Sepal.Width, {ndigits=0, round_mode=HALF_TO_EVEN})) #> Sepal.Width_2: double (sqrt_checked(Sepal.Width)) #> #> See $.data for the source Arrow object ``` I've opened a number of follow-up tickets for the tasks needed to be done to provide a more complete implementation: * [ARROW-17362: [R] Implement dplyr::across() inside summarise()](https://issues.apache.org/jira/browse/ARROW-17362) * [ARROW-17387: [R] Implement dplyr::across() inside filter()](https://issues.apache.org/jira/browse/ARROW-17387) * ~[ARROW-17364: [R] Implement .names argument inside across()](https://issues.apache.org/jira/browse/ARROW-17364)~ (now done in this PR, will close it once this is merged) * [ARROW-17366: [R] Support purrr-style lambda functions in .fns argument to across()](https://issues.apache.org/jira/browse/ARROW-17366) Closes apache#13786 from thisisnic/ARROW-11699_across Authored-by: Nic Crane <thisisnic@gmail.com> Signed-off-by: Nic Crane <thisisnic@gmail.com>
The `rlang::quo` function used in lines added in PR apache#13786 (d5f80cb) but not imported. ```r mtcars |> arrow::arrow_table() |> dplyr::mutate(dplyr::across(starts_with("c"), as.character)) |> dplyr::collect() #> Error in quo(!!call2(.x, sym(.y))) : could not find function "quo" ``` Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
This PR introduces a partial implementation for
dplyr::across()
when called withindplyr::mutate()
.I've opened a number of follow-up tickets for the tasks needed to be done to provide a more complete implementation:
ARROW-17364: [R] Implement .names argument inside across()(now done in this PR, will close it once this is merged)