Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11699: [R] Implement dplyr::across() for mutate() #13786

Closed
wants to merge 77 commits into from

Conversation

thisisnic
Copy link
Member

@thisisnic thisisnic commented Aug 2, 2022

This PR introduces a partial implementation for dplyr::across() when called within dplyr::mutate().

arrow_table(iris) %>%
  mutate(across(starts_with("Sepal"), list(round, sqrt)))
#> Table (query)
#> Sepal.Length: double
#> Sepal.Width: double
#> Petal.Length: double
#> Petal.Width: double
#> Species: dictionary<values=string, indices=int8>
#> Sepal.Length_1: double (round(Sepal.Length, {ndigits=0, round_mode=HALF_TO_EVEN}))
#> Sepal.Length_2: double (sqrt_checked(Sepal.Length))
#> Sepal.Width_1: double (round(Sepal.Width, {ndigits=0, round_mode=HALF_TO_EVEN}))
#> Sepal.Width_2: double (sqrt_checked(Sepal.Width))
#> 
#> See $.data for the source Arrow object

I've opened a number of follow-up tickets for the tasks needed to be done to provide a more complete implementation:

@github-actions
Copy link

github-actions bot commented Aug 2, 2022

@github-actions
Copy link

github-actions bot commented Aug 2, 2022

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Comment on lines 282 to 289
compare_dplyr_binding(
.input %>%
select(name, homeworld, species) %>%
mutate(across(!name, as.factor)) %>%
collect(),
starwars,
warning = "Expression across.*not supported in Arrow"
)
Copy link
Member Author

@thisisnic thisisnic Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test now fails with the following error:

Error in `collect()`:
! Invalid: ExecuteScalarExpression cannot Execute non-scalar expression dictionary_encode(homeworld, {null_encoding_behavior=MASK})
/home/nic2/arrow/cpp/src/arrow/compute/exec/project_node.cc:91  ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
/home/nic2/arrow/cpp/src/arrow/compute/exec/exec_plan.cc:559  iterator_.Next()
/home/nic2/arrow/cpp/src/arrow/record_batch.cc:337  ReadNext(&batch)
/home/nic2/arrow/cpp/src/arrow/record_batch.cc:351  ToRecordBatches()
Run `rlang::last_error()` to see where the error occurred.

TODO: investigate more, open JIRA, wrap test in expect_error with comment linking to JIRA

Copy link
Member

@nealrichardson nealrichardson Aug 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There already is at least one issue for as.factor in Acero: ARROW-12632, and I think there was another.

I'd change this test to use a different function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though we should probably make sure that as.factor isn't mapped to anything in the dplyr funcs for now, so that this fails in the right way rather than try to collect.

Copy link
Member Author

@thisisnic thisisnic Aug 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the test, and then opened a ticket and associated PR to remove the existing mapping.

@thisisnic thisisnic marked this pull request as ready for review August 10, 2022 08:27
@thisisnic thisisnic changed the title ARROW-11699: [R] Implement dplyr::across() ARROW-11699: [R] Implement dplyr::across() for mutate() Aug 10, 2022
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So cool! A few thoughts although I learned dplyr before across() existed and I admittedly have never used it in real life.

r/R/dplyr-mutate.R Outdated Show resolved Hide resolved
new_quos <- quosures_from_func_list(func_list, cols)
}

quos_out <- append(quos_out, new_quos)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still working my head around this but I think that somewhere you need to rlang::quo_set_env(quo_out, rlang::quo_get_env(quo_in)) to make sure that symbol references that are not columns are fetched from the calling environment. I'm struggling to come up with an example where that can happen so maybe this isn't relevant here, but it seems like somehow rlang::quo_get_env(quo_in)) should be passed on to the output quosures?

Copy link
Member Author

@thisisnic thisisnic Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, updated now.

r/R/dplyr-mutate.R Outdated Show resolved Hide resolved
r/tests/testthat/test-dplyr-mutate.R Outdated Show resolved Hide resolved
Comment on lines 235 to 236
new_quo_names <- map2_chr(
names(func_list_full), cols_list_full,
~ paste(.y, .x, sep = "_")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you implement the .names argument here? There's almost certainly a cleaner way to do this but something like:

withr::with_environment(as.environment(list(.col = "something", .fn = "something-else")), glue::glue("{.col}_{.fn}"))
#> something_something-else

Created on 2022-08-11 by the reprex package (v2.0.1)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've deferred it to another ticket - ARROW-17364

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now implemented here

collect(),
tbl
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this will work but what about across(1:dbl2, list("fun1" = round, "fun2" = exp))? Does putting something obnoxious like across(1:dbl2, list("fun1" = round(this_is_not_cool(something_else)), "fun2" = exp)) result in an interpretable error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catches there, have added in a couple of failing tests for those that I will make pass.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. across(1:dbl2, list("fun1" = round, "fun2" = exp))
    Good catch! I've now fixed it to properly check for the name from the list and not just the function name.

  2. across(1:dbl2, list("fun1" = round(this_is_not_cool(something_else)), "fun2" = exp))
    This doesn't work in R either. We get the "not supported in Arrow" error, which then pushes it to R, which then gives the error it would have given anyway. I've added a test in for it, but it's a bit clunky and I'm not sure if we need it. What do you think? (It's the test which has round(sqrt(dbl)) in it).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing this comment...I think it's a good test to have in there because you're walking the syntax tree yourself. When I wrote this, I was worried somebody would get something like object of type LANGSXP is not subsettable or something.

collect(),
tbl
),
"Unsupported selection helper"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this unsupported? Do we have JIRA(s) for this? Seems like it should work.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where() isn't yet supported - have added a comment which links to the JIRA for it

if (!is_list(funcs) && as.character(funcs)[[1]] == "~") {
abort(
paste(
"purrr-style lambda functions as `.fns` argument to `across()`",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? We already import purrr::as_mapper, is this more complicated than funcs <- map(funcs, as_mapper)?

Copy link
Member Author

@thisisnic thisisnic Aug 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, if I try to do that, I get the error:

Error in as_mapper(funcs) : 
  Formula must carry an environment.

Currently trying to work out how to fix that, f_env() looks promising!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that works; as_mapper() get us the function, but what I want is the name of the function so that I can make a quosure, which later turns into something which a later step turns into something that Arrow can work with.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense until ARROW-14071 is merged so deferring it for now.

@@ -24,7 +24,9 @@ mutate.arrow_dplyr_query <- function(.data,
.before = NULL,
.after = NULL) {
call <- match.call()
exprs <- ensure_named_exprs(quos(...))

expression_list <- unfold_across(.data, quos(...))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you've deferred this from filter and summarize, but is that necessary? Is it more involved than adding in this unfold_across in them? There's no other modifications to mutate() so it seems reasonable to guess that it would be that straightforward in the other functions too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not as straightforward as I need to delve into the internals of do_arrow_summarize() and swap some bits around. Given this PR is not tiny and I'd be wanting to add a lot of tests for all 3, I'd like to keep them separate to try to keep things simpler for review.

r/R/dplyr-mutate.R Outdated Show resolved Hide resolved
r/R/dplyr-mutate.R Outdated Show resolved Hide resolved
@thisisnic
Copy link
Member Author

This PR now also implements the .names argument.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so close! I'm definitely in favor of deferring filter() and summarise() to another PR (although it will be confusing if they are not implemented before the next release). I really like how you split up the functions in dplyr-across.R...made it easy to review!

The one thing I think this needs is a set of tests directly on expand_across(). I think you can very easily convert the existing tests to something like:

library(rlang, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

testthat::expect_identical(
  arrow:::expand_across(
    data.frame(dbl = double(), dbl2 = double()),
    rlang::quos(across(c(dbl, dbl2), round))
  ),
  list(
    dbl = quo(round(dbl)),
    dbl2 = quo(round(dbl2))
  )
)

Created on 2022-08-26 by the reprex package (v2.0.1)

(With apologies for not suggesting that in my first review! I think those tests would help to communicate exactly what expand_across() is doing and highlight the fact that you did a really good job separating the implementation into highly testable pieces.)

r/R/dplyr-across.R Outdated Show resolved Hide resolved
r/R/dplyr-across.R Outdated Show resolved Hide resolved
@thisisnic
Copy link
Member Author

This is so close! I'm definitely in favor of deferring filter() and summarise() to another PR (although it will be confusing if they are not implemented before the next release). I really like how you split up the functions in dplyr-across.R...made it easy to review!

The one thing I think this needs is a set of tests directly on expand_across(). I think you can very easily convert the existing tests to something like:

library(rlang, warn.conflicts = FALSE)
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

testthat::expect_identical(
  arrow:::expand_across(
    data.frame(dbl = double(), dbl2 = double()),
    rlang::quos(across(c(dbl, dbl2), round))
  ),
  list(
    dbl = quo(round(dbl)),
    dbl2 = quo(round(dbl2))
  )
)

Created on 2022-08-26 by the reprex package (v2.0.1)

(With apologies for not suggesting that in my first review! I think those tests would help to communicate exactly what expand_across() is doing and highlight the fact that you did a really good job separating the implementation into highly testable pieces.)

@paleolimbot Thanks for the review! Quick question for you; I want to keep in the tests which have across() inside of mutate() as it makes it a lot easier to see "how are we matching dplyr's functionality" which I imagine will come up if when there are bug reports about this. Any tips as to which functionality to include in the expand_across() tests vs. the mutate(across(...) tests? Or is this going to be a pain, and I should just swap them all over?

@paleolimbot
Copy link
Member

I think you nailed it - the mutate() tests are about matching dplyr's functionality (e.g., they will fail if dplyr changes something and makes this code out of sync); the expand_across() tests are about expand_across(); and we need both.

I would personally just copy/paste all the places where across() appears in your mutate tests but expect_identical(expand_across(quo(across(...)), ...) instead. If you feel this is out of scope for this PR I am OK with that, too (but I think it's worth trying).

@thisisnic
Copy link
Member Author

I think you nailed it - the mutate() tests are about matching dplyr's functionality (e.g., they will fail if dplyr changes something and makes this code out of sync); the expand_across() tests are about expand_across(); and we need both.

I would personally just copy/paste all the places where across() appears in your mutate tests but expect_identical(expand_across(quo(across(...)), ...) instead. If you feel this is out of scope for this PR I am OK with that, too (but I think it's worth trying).

I absolutely agree with the argument that this is a big function and it'd be worth having tests for it, though also wondering if copying and pasting every single test might be a bit excessive - we don't necessarily test intermediate functions in some of our other dplyr verbs (though perhaps we should). I might just check with @nealrichardson or @jonkeane to see if they see any middle ground in what needs to be tested - if there isn't, I'm happy to copy and paste everything and approach it that way.

@nealrichardson
Copy link
Member

Haven't looked at the PR lately but, as I understand it, expand_across() takes a list of quosures and returns a list of quosures. So you should be able to unit-test all of its behavior in isolation, and then you want one or two tests in mutate() that it does the right thing in practice (integration tests). Of course, doing everything in mutate() has its conveniences; you could make it easier to write the unit tests with a test helper, something like:

expect_across_equal <- function(actual, expected) {
  expect_identical(expand_across(actual), expected)
}

expect_across_equal(
  quos(lgl, across(is.numeric, ~ . * 5)), 
  quos(lgl, int * 5, dbl * 5)
)

@thisisnic
Copy link
Member Author

@paleolimbot I've updated the tests as per your and Neals' useful suggestions. Mind giving it another review now?

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look great! Thank you!

collect(),
tbl
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for missing this comment...I think it's a good test to have in there because you're walking the syntax tree yourself. When I wrote this, I was worried somebody would get something like object of type LANGSXP is not subsettable or something.

@thisisnic thisisnic closed this in d5f80cb Sep 1, 2022
nealrichardson pushed a commit that referenced this pull request Sep 27, 2022
The `rlang::quo` function used in lines added in PR #13786 (d5f80cb) but not imported.

```r
mtcars |> arrow::arrow_table() |> dplyr::mutate(dplyr::across(starts_with("c"), as.character)) |> dplyr::collect()
#> Error in quo(!!call2(.x, sym(.y))) : could not find function "quo"
```

Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
zagto pushed a commit to zagto/arrow that referenced this pull request Oct 7, 2022
This PR introduces a partial implementation for `dplyr::across()` when called within `dplyr::mutate()`.

``` r
arrow_table(iris) %>%
  mutate(across(starts_with("Sepal"), list(round, sqrt)))
#> Table (query)
#> Sepal.Length: double
#> Sepal.Width: double
#> Petal.Length: double
#> Petal.Width: double
#> Species: dictionary<values=string, indices=int8>
#> Sepal.Length_1: double (round(Sepal.Length, {ndigits=0, round_mode=HALF_TO_EVEN}))
#> Sepal.Length_2: double (sqrt_checked(Sepal.Length))
#> Sepal.Width_1: double (round(Sepal.Width, {ndigits=0, round_mode=HALF_TO_EVEN}))
#> Sepal.Width_2: double (sqrt_checked(Sepal.Width))
#>
#> See $.data for the source Arrow object
```

I've opened a number of follow-up tickets for the tasks needed to be done to provide a more complete implementation:
* [ARROW-17362: [R] Implement dplyr::across() inside summarise()](https://issues.apache.org/jira/browse/ARROW-17362)
* [ARROW-17387: [R] Implement dplyr::across() inside filter()](https://issues.apache.org/jira/browse/ARROW-17387)
* ~[ARROW-17364: [R] Implement .names argument inside across()](https://issues.apache.org/jira/browse/ARROW-17364)~ (now done in this PR, will close it once this is merged)
* [ARROW-17366: [R] Support purrr-style lambda functions in .fns argument to across()](https://issues.apache.org/jira/browse/ARROW-17366)

Closes apache#13786 from thisisnic/ARROW-11699_across

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
This PR introduces a partial implementation for `dplyr::across()` when called within `dplyr::mutate()`.

``` r
arrow_table(iris) %>%
  mutate(across(starts_with("Sepal"), list(round, sqrt)))
#> Table (query)
#> Sepal.Length: double
#> Sepal.Width: double
#> Petal.Length: double
#> Petal.Width: double
#> Species: dictionary<values=string, indices=int8>
#> Sepal.Length_1: double (round(Sepal.Length, {ndigits=0, round_mode=HALF_TO_EVEN}))
#> Sepal.Length_2: double (sqrt_checked(Sepal.Length))
#> Sepal.Width_1: double (round(Sepal.Width, {ndigits=0, round_mode=HALF_TO_EVEN}))
#> Sepal.Width_2: double (sqrt_checked(Sepal.Width))
#>
#> See $.data for the source Arrow object
```

I've opened a number of follow-up tickets for the tasks needed to be done to provide a more complete implementation:
* [ARROW-17362: [R] Implement dplyr::across() inside summarise()](https://issues.apache.org/jira/browse/ARROW-17362)
* [ARROW-17387: [R] Implement dplyr::across() inside filter()](https://issues.apache.org/jira/browse/ARROW-17387)
* ~[ARROW-17364: [R] Implement .names argument inside across()](https://issues.apache.org/jira/browse/ARROW-17364)~ (now done in this PR, will close it once this is merged)
* [ARROW-17366: [R] Support purrr-style lambda functions in .fns argument to across()](https://issues.apache.org/jira/browse/ARROW-17366)

Closes apache#13786 from thisisnic/ARROW-11699_across

Authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>
fatemehp pushed a commit to fatemehp/arrow that referenced this pull request Oct 17, 2022
The `rlang::quo` function used in lines added in PR apache#13786 (d5f80cb) but not imported.

```r
mtcars |> arrow::arrow_table() |> dplyr::mutate(dplyr::across(starts_with("c"), as.character)) |> dplyr::collect()
#> Error in quo(!!call2(.x, sym(.y))) : could not find function "quo"
```

Authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants