ARROW-17387: [R] Implement dplyr::across() inside filter() #14281
Conversation
One suggested cleanup but otherwise LGTM!
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Benchmark runs are scheduled for baseline = ece5b65 and contender = a1d7a44. a1d7a44 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
['Python', 'R'] benchmarks have a high level of regressions.
I got notified in another PR (which was merged after this one) about a potential perf regression, but I think it is this PR that introduces a slight slowdown in some benchmarks. The biggest regression (TPCH-08) seems to be a flaky measurement, but for TPCH-16 the slowdown seems persistent (although it's only a very small slowdown; I have no idea whether it is significant).
What's a good way to investigate this? I wouldn't expect a slowdown from this PR, but I can't rule it out entirely.
I suppose the easiest approach would be to first get the code behind that benchmark running locally (I'm not familiar with the R benchmarks, so I don't know the easiest way to do this). If you can run it locally, check whether you can reproduce the slowdown, and if so, profile both cases (though again, I don't know how to do that in R).
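For reference, a minimal way to profile arbitrary R code is base R's sampling profiler, `utils::Rprof`. This is only a generic sketch (the workload here is a placeholder, not the actual benchmark code from this PR):

```r
# Sketch: profile a placeholder workload with base R's sampling profiler.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file)
for (i in 1:200) sort(runif(1e4))  # placeholder workload, not the real benchmark
Rprof(NULL)
# Summarise time spent per function; by.self shows time excluding callees.
head(summaryRprof(prof_file)$by.self)
```

Packages like profvis build an interactive view on top of the same `Rprof` output, which can make comparing two runs easier.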
Thinking about this more, I'd expect it to be a tiny bit slower but not enough to affect the benchmarks, and given that this block of code has been added to other operations which are used in the benchmarks, I'd expect to see a regression in those too if it was problematic. @jonkeane, I don't suppose you'd mind helping me take a look at this? |
I don't think this is concerning. None of the TPC-H queries use the features that this PR adds support for, so there's not much extra work being added that would cost meaningful time. I did a quick benchmark of the "fast" path of the function in question here, and it costs under 1 ms.
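A quick per-call timing like the one described above can be done with `system.time` over many iterations. This is a generic sketch: `identity(i)` stands in for the fast path being measured, which isn't shown here:

```r
# Sketch: estimate mean per-call cost by timing many iterations.
n <- 1e4
elapsed <- system.time(
  for (i in seq_len(n)) identity(i)  # stand-in for the fast-path call
)[["elapsed"]]
elapsed / n * 1000  # mean cost per call, in milliseconds
```

For more precise measurements with statistical summaries, `bench::mark()` is commonly used in the arrow codebase's own benchmarks.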
Conbench also only identified the 0.01 and 0.1 scale factors as a slowdown. The same benchmark at scale factor 1 or 10 didn't show a consistent slowdown: https://conbench.ursa.dev/compare/benchmarks/44eef0b11f204c09bebde7b2a4050c98...07d114936d9c4b94b69aa712f0a8423e/ (which matches what Neal is saying).
Yeah, I've separately found those low-scale-factor benchmarks to be sensitive/flaky. They're useful for alerting when we add 10ms to the query assembly time (see also #13985), and those micro-regressions do add up if not addressed. But they're not all that noticeable if you're working with larger data.
The implementation here differs from dplyr's in that some steps are removed: dplyr evaluates functions sooner, and so requires extra steps that aren't needed here.