ARROW-17462: [R] Cast scalars to type of field in Expression building #13985
Conversation
@ursabot help
Supported benchmark command examples:

- To run all benchmarks
- To filter benchmarks by language
- To filter Python and R benchmarks by name
- To filter C++ benchmarks by archery `--suite-filter` and `--benchmark-filter`
- For other …
@ursabot please benchmark lang=R

Benchmark runs are scheduled for baseline = 80bba29 and contender = ed6fce6. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have high level of regressions.
Benchmark regressions, at least the worst of them, are due to ARROW-17601. By keeping the computation on Decimal types instead of casting to double, we hit an expression that by our current logic would need to promote to a scale that can't fit in Decimal128, so the evaluation errors out, and because these queries are evaluated on an Arrow Table, we fall back to pulling all the data into an R data.frame and doing the work there, hence the regression. I'll see what I can do to mitigate or work around this in this PR. The most extreme option would be to not cast scalars to decimal, i.e. restore the status quo, where most queries on decimal data would end up getting coerced to float, but hopefully we can do better than that. We have very few tests for queries on decimal types, but decimals are all over the TPC-H data, which is why we only observed this in the benchmarks. That should probably get rectified too.
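To make the overflow concrete, here is a minimal Python sketch of the decimal promotion involved. It assumes the commonly documented rule for decimal multiplication (result precision = p1 + p2 + 1, result scale = s1 + s2, with Decimal128 capped at 38 digits); the function name and the example precisions are illustrative, not taken from the PR.

```python
# Hypothetical sketch of decimal type promotion for multiplication,
# assuming: precision = p1 + p2 + 1, scale = s1 + s2.
# Decimal128 holds at most 38 digits of precision.
DECIMAL128_MAX_PRECISION = 38

def multiply_result_type(p1, s1, p2, s2):
    precision = p1 + p2 + 1
    scale = s1 + s2
    if precision > DECIMAL128_MAX_PRECISION:
        # This is the failure mode described above: the promoted type
        # doesn't fit, evaluation errors, and the query falls back to R.
        raise ValueError(f"decimal({precision}, {scale}) does not fit in Decimal128")
    return (precision, scale)

# One multiplication of two decimal(12, 2) columns still fits:
print(multiply_result_type(12, 2, 12, 2))   # (25, 4)
# Multiplying two such results would need precision 51 and raises.
```

Chaining even two multiplications on typical TPC-H decimal columns is enough to exceed the 38-digit cap under this rule, which is why the regression only surfaced on that data.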
@ursabot please benchmark lang=R

Benchmark runs are scheduled for baseline = 80bba29 and contender = 3d52485. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have high level of regressions.
Some of the remaining benchmark regressions are spurious (file-write and dataframe-to-table, neither of which is affected by this change). The other TPC-H ones are legitimate, but they're on tiny scale factors of data (0.1 and 0.01, i.e. 100MB and 10MB), so the extra 10-15ms of type checking this PR introduces shows up as statistically significant. IMO the tradeoff is worth it: we preserve the types of the original data better (especially for integers and, after ARROW-17601, decimals), we can more conveniently pass strings for dates/timestamps in expressions, and by avoiding unnecessary casts, we should get performance benefits on some queries. As it turns out, we aren't currently benchmarking queries where the performance benefit would show. I'd like to add a benchmark for the case shown on this issue, but it fails for me locally due to ARROW-17556.
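As a loose illustration of the "strings for dates/timestamps in expressions" convenience mentioned above, here is a Python sketch. The PR itself is in the R bindings; `wrap_date_scalar` and the `"date32"` tag are hypothetical names used only for illustration.

```python
from datetime import date

# Hypothetical sketch: when the field is a date type, parse a string
# scalar to the field's type, so the comparison needs no cast on the
# (much larger) column side.
def wrap_date_scalar(value, field_type):
    if field_type == "date32" and isinstance(value, str):
        return date.fromisoformat(value)  # scalar adapts to the field
    return value

print(wrap_date_scalar("1995-01-01", "date32"))  # 1995-01-01 as a date
print(wrap_date_scalar(7, "int32"))              # non-date fields untouched
```

The design point is the same as for integers: adapt the cheap side (one scalar) rather than the expensive side (a whole column).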
A few specific comments... I agree that for certain functions this is more correct whatever the performance cost, although it should be clearly fenced to the functions where this makes sense.
args <- wrap_scalars(args, FUN)
Do you want to whitelist the functions this applies to? (Or maybe you already do this and I'm not reading this correctly?) This logic is awesome and very appropriate for most math functions, but I wonder if there are some compute functions (maybe `binary_repeat`) that will stop working when used with `build_expr()`. I think that user-defined functions also generate their bindings through `build_expr()` (although they don't have to).
There's a blocklist rather than an allowlist, and `binary_repeat` is on it (L285, below). If there are compute functions that don't work with this change, we don't test them.

Do you think we should exclude UDFs from the type matching too?

For functions that do go through `build_expr()`, the way to skip the type-matching logic is to wrap the value in `Expression$scalar()`. Only non-Expressions are cast.
I really think you should whitelist here... in theory one can use `build_expr()` for any compute function, although many bindings choose to go directly through `Expression$create()` instead. Using a blocklist would mean you can only use `build_expr()` safely for specific functions (in which case you should probably compute what those functions are so that they can be documented).
I went through the function list on https://arrow.apache.org/docs/cpp/compute.html and evaluated whether you should try to cast scalar inputs to the type of the corresponding column (and remember, if you can't cast the scalar without loss of precision, it doesn't add the cast, so for `int + 4.2`, 4.2 won't get cast to int and that expression will go to `cast(int, float64) + 4.2` in Acero). For the non-unary scalar functions, all but 4 make sense to try to convert scalars like this. The 4 functions that don't are `binary_repeat`, `list_element`, `binary_join` (kind of an odd case, which we don't use; we use `binary_join_element_wise` instead), and `make_struct`. It's around 40-50 functions on the allow side, so it seems that the "don't cast" functions are the exception.

Does that persuade you in favor of blocklist instead of allowlist?
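For concreteness, the blocklist approach under discussion could look like this minimal Python sketch. The set contents come from the four functions named above; the helper name is illustrative and this is not the R implementation.

```python
# The four compute functions named above whose scalar argument should
# not be cast to the field's type (illustrative sketch only):
NO_CAST_FUNCTIONS = {"binary_repeat", "list_element", "binary_join", "make_struct"}

def should_wrap_scalars(fun_name):
    # Everything else (roughly 40-50 non-unary scalar functions) gets
    # the scalar-casting treatment.
    return fun_name not in NO_CAST_FUNCTIONS

print(should_wrap_scalars("add"))            # True
print(should_wrap_scalars("binary_repeat"))  # False
```

The reviewer's counterargument is that an allowlist fails safe: an unlisted function keeps the old behavior, rather than silently getting casts it can't handle.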
I still think a whitelist is safer, although feel free to make the change. The `build_expr()` in the user-defined function code (https://github.com/apache/arrow/blob/master/r/R/compute.R#L384) would have to change to something approaching the previous behaviour, since we have no guarantees about those functions.
I just pulled UDFs out of `build_expr()` in d54de48, and in a followup I'll go further: reduce the usage of `build_expr()` to places where the type matching matters (more of an allowlist, in that sense), pull out the special cases inside of it, and rename it to something like `build_simple_expr()` to make clear that it is a special case and not the default path you should choose.

Sound ok to you?
Windows failure is due to the R 4.2.2 release today: mirrors haven't been updated yet. I'll take up the refactoring @paleolimbot requested in ARROW-18203.
Benchmark runs are scheduled for baseline = 8066c5e and contender = d045fc5. d045fc5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Logic is encapsulated in `wrap_scalars()` in expression.R. Most test updating (that is not linting) is changing some printed output types, because `int * 2` now stays `int32` and the printed ExecPlans don't have as many `cast`s in them. The tests added in `test-dplyr-query.R` are the explicit tests of the feature.
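The "only cast when lossless" behavior that `wrap_scalars()` encapsulates can be sketched in Python like so. Types are represented as strings and only the int32 case is shown; this is an illustrative analogue, not the R implementation.

```python
# Sketch: cast a scalar to the field's type only when the cast loses
# nothing; otherwise leave the scalar alone and let the field be
# promoted instead (so int32_field * 2 stays int32, while
# int32_field + 4.2 is evaluated as float64).
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def wrap_scalar(value, field_type):
    if field_type == "int32" and isinstance(value, (int, float)):
        if float(value).is_integer() and INT32_MIN <= value <= INT32_MAX:
            return ("int32", int(value))   # lossless: scalar adopts field type
        return ("float64", float(value))   # lossy: field gets cast up
    return (field_type, value)

print(wrap_scalar(2, "int32"))    # ('int32', 2)
print(wrap_scalar(4.2, "int32"))  # ('float64', 4.2)
```

This is why the printed ExecPlans in the updated tests contain fewer `cast` nodes: the scalar side of the expression adapts whenever it can do so without losing information.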