ARROW-17387: [R] Implement dplyr::across() inside filter() #14281
Conversation
One suggested cleanup but otherwise LGTM!
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Benchmark runs are scheduled for baseline = ece5b65 and contender = a1d7a44. a1d7a44 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
I got notified in another PR (which was merged after this one) about a potential perf regression, but I think it is this PR that introduces a slight slowdown in some benchmarks. The biggest regression (TPCH-08) seems to be a flaky measurement, but for TPCH-16 the slowdown seems persistent (although it's only a very small slowdown; I have no idea whether it is significant).
What's a good way to investigate this? I wouldn't expect a slowdown from this PR, but I can't rule it out entirely.
I suppose the easiest approach would be to first get the code behind that benchmark running locally (I'm not familiar with the R benchmarks, so I don't know the easiest way to do this). If you can run it locally, check whether you can reproduce the slowdown, and if so, profile both cases (though again, I don't know how to do that in R).
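For reference, a minimal way to profile arbitrary R code is base R's sampling profiler, `utils::Rprof`. This is only a generic sketch (the workload here is a placeholder, not the actual benchmark code from this PR):

```r
# Sketch: profile a placeholder workload with base R's sampling profiler.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
for (i in 1:200) sort(runif(1e4))  # placeholder workload, not the real benchmark
Rprof(NULL)
# Summarise time spent per function; by.self shows time excluding callees.
head(summaryRprof(prof_file)$by.self)
```

Packages like profvis build an interactive view on top of the same `Rprof` output, which can make comparing two runs easier.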
Thinking about this more, I'd expect it to be a tiny bit slower but not enough to affect the benchmarks, and given that this block of code has been added to other operations which are used in the benchmarks, I'd expect to see a regression in those too if it was problematic. @jonkeane, I don't suppose you'd mind helping me take a look at this? |
I don't think this is concerning. None of the TPC-H queries use the features that this PR adds support for, so there's not much extra work being added that would cost meaningful time. I did a quick benchmark of the "fast" path of the function in question here, and it costs under 1 ms.
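A quick per-call timing like the one described above can be done with `system.time` over many iterations. This is a generic sketch: `identity(i)` stands in for the fast path being measured, which isn't shown here:

```r
# Sketch: estimate mean per-call cost by timing many iterations.
n <- 1e4
elapsed <- system.time(
  for (i in seq_len(n)) identity(i)  # stand-in for the fast-path call
)[["elapsed"]]
elapsed / n * 1000  # mean cost per call, in milliseconds
```

For more precise measurements with statistical summaries, `bench::mark()` is commonly used in the arrow codebase's own benchmarks.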
Conbench also only identified the 0.01 and 0.1 scale factors as a slowdown. The same benchmark at scale factor 1 or 10 didn't show a consistent slowdown: https://conbench.ursa.dev/compare/benchmarks/44eef0b11f204c09bebde7b2a4050c98...07d114936d9c4b94b69aa712f0a8423e/ (which matches what Neal is saying).
Yeah, I've separately found those low-scale-factor benchmarks to be sensitive/flaky. They're useful for alerting when we add 10ms to the query assembly time (see also #13985), and those micro-regressions do add up if not addressed. But they're not all that noticeable if you're working with larger data.
The implementation here differs from dplyr's in that some steps are removed: dplyr evaluates functions sooner, and so requires extra steps that aren't needed here.