ARROW-17462: [R] Cast scalars to type of field in Expression building #13985
Conversation
@ursabot help
Supported benchmark command examples:

- To run all benchmarks
- To filter benchmarks by language
- To filter Python and R benchmarks by name
- To filter C++ benchmarks by archery `--suite-filter` and `--benchmark-filter`
- For other …
@ursabot please benchmark lang=R

Benchmark runs are scheduled for baseline = 80bba29 and contender = ed6fce6. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have high level of regressions.
Benchmark regressions, at least the worst of them, are due to ARROW-17601. By keeping the computation on Decimal types instead of casting to double, we hit an expression that by our current logic would need to promote to a scale that can't fit in Decimal128, so the evaluation errors out, and because these queries are evaluated on an Arrow Table, we fall back to pulling all the data into an R data.frame and doing the work there, hence the regression. I'll see what I can do to mitigate or work around this in this PR. The most extreme option would be to not cast scalars to decimal, i.e. restore the status quo, where most queries on decimal data would end up getting coerced to float, but hopefully we can do better than that. We have very few tests for queries on decimal types, but decimals are all over the TPC-H data, which is why we only observed this in the benchmarks. That should probably get rectified too.
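To make the overflow concrete, here is a minimal Python sketch of the decimal promotion involved. It assumes the commonly documented rule for decimal multiplication (result precision = p1 + p2 + 1, result scale = s1 + s2, with Decimal128 capped at 38 digits); the function name and the example precisions are illustrative, not taken from the PR.

```python
# Hypothetical sketch of decimal type promotion for multiplication,
# assuming: precision = p1 + p2 + 1, scale = s1 + s2.
# Decimal128 holds at most 38 digits of precision.
DECIMAL128_MAX_PRECISION = 38

def multiply_result_type(p1, s1, p2, s2):
    precision = p1 + p2 + 1
    scale = s1 + s2
    if precision > DECIMAL128_MAX_PRECISION:
        # This is the failure mode described above: the promoted type
        # doesn't fit, evaluation errors, and the query falls back to R.
        raise ValueError(f"decimal({precision}, {scale}) does not fit in Decimal128")
    return (precision, scale)

# One multiplication of two decimal(12, 2) columns still fits:
print(multiply_result_type(12, 2, 12, 2))   # (25, 4)
# Multiplying two such results would need precision 51 and raises.
```

Chaining even two multiplications on typical TPC-H decimal columns is enough to exceed the 38-digit cap under this rule, which is why the regression only surfaced on that data.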
@ursabot please benchmark lang=R

Benchmark runs are scheduled for baseline = 80bba29 and contender = 3d52485. Results will be available as each benchmark for each run completes.

['Python', 'R'] benchmarks have high level of regressions.
Some of the remaining benchmark regressions are spurious (file-write and dataframe-to-table, neither of which is affected by this change). The other TPC-H ones are legitimate, but they're on tiny scale factors of data (0.1 and 0.01, i.e. 100MB and 10MB), so the extra 10-15ms of type checking this PR introduces shows up as statistically significant. IMO the tradeoff is worth it: we preserve the types of the original data better (especially for integers and, after ARROW-17601, decimals), we can more conveniently pass strings for dates/timestamps in expressions, and by avoiding unnecessary casts, we should get performance benefits on some queries. As it turns out, we aren't currently benchmarking queries where the performance benefit would show. I'd like to add a benchmark for the case shown on this issue, but it fails for me locally due to ARROW-17556.
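As a loose illustration of the "strings for dates/timestamps in expressions" convenience mentioned above, here is a Python sketch. The PR itself is in the R bindings; `wrap_date_scalar` and the `"date32"` tag are hypothetical names used only for illustration.

```python
from datetime import date

# Hypothetical sketch: when the field is a date type, parse a string
# scalar to the field's type, so the comparison needs no cast on the
# (much larger) column side.
def wrap_date_scalar(value, field_type):
    if field_type == "date32" and isinstance(value, str):
        return date.fromisoformat(value)  # scalar adapts to the field
    return value

print(wrap_date_scalar("1995-01-01", "date32"))  # 1995-01-01 as a date
print(wrap_date_scalar(7, "int32"))              # non-date fields untouched
```

The design point is the same as for integers: adapt the cheap side (one scalar) rather than the expensive side (a whole column).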
A few specific comments... I agree that for certain functions this is more correct whatever the performance cost, although it should be clearly fenced to the functions where this makes sense.
args <- wrap_scalars(args, FUN)
Do you want to whitelist the functions this applies to? (Or maybe you already do this and I'm not reading this correctly?) This logic is awesome and very appropriate for most math functions, but I wonder if there are some compute functions (maybe `binary_repeat`) that will stop working when used with `build_expr()`. I think that user-defined functions also generate their bindings through `build_expr()` (although they don't have to).
There's a blocklist rather than an allowlist, and `binary_repeat` is on it (L285, below). If there are compute functions that don't work with this change, we don't test them.

Do you think we should exclude UDFs from the type matching too?

For functions that do go through `build_expr()`, the way to skip the type-matching logic is to wrap the value in `Expression$scalar()`. Only non-Expressions are cast.
I really think you should whitelist here... in theory one can use `build_expr()` for any compute function, although many bindings choose to go directly through `Expression$create()` instead. Using a blocklist would mean you can only use `build_expr()` safely for specific functions (in which case you should probably compute what those functions are so that they can be documented).
I went through the function list on https://arrow.apache.org/docs/cpp/compute.html and evaluated whether you should try to cast scalar inputs to the type of the corresponding column (and remember, if you can't cast the scalar without loss of precision, it doesn't add the cast, so for `int + 4.2`, 4.2 won't get cast to int and that expression will go to `cast(int, float64) + 4.2` in Acero). For the non-unary scalar functions, all but 4 make sense to try to convert scalars like this. The 4 functions that don't are `binary_repeat`, `list_element`, `binary_join` (kind of an odd case, which we don't use; we use `binary_join_element_wise` instead), and `make_struct`. It's around 40-50 functions on the allow side, so it seems that the "don't cast" functions are the exception.

Does that persuade you in favor of blocklist instead of allowlist?
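For concreteness, the blocklist approach under discussion could look like this minimal Python sketch. The set contents come from the four functions named above; the helper name is illustrative and this is not the R implementation.

```python
# The four compute functions named above whose scalar argument should
# not be cast to the field's type (illustrative sketch only):
NO_CAST_FUNCTIONS = {"binary_repeat", "list_element", "binary_join", "make_struct"}

def should_wrap_scalars(fun_name):
    # Everything else (roughly 40-50 non-unary scalar functions) gets
    # the scalar-casting treatment.
    return fun_name not in NO_CAST_FUNCTIONS

print(should_wrap_scalars("add"))            # True
print(should_wrap_scalars("binary_repeat"))  # False
```

The reviewer's counterargument is that an allowlist fails safe: an unlisted function keeps the old behavior, rather than silently getting casts it can't handle.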
I still think a whitelist is safer, although feel free to make the change. The `build_expr()` in the user-defined function code (https://github.com/apache/arrow/blob/master/r/R/compute.R#L384) would have to change to something approaching the previous behaviour, since we have no guarantees about those functions.
I just pulled UDFs out of `build_expr()` in d54de48, and in a followup I'll go further: reduce the usage of `build_expr()` to places where the type matching matters (more of an allowlist, in that sense), pull out the special cases inside of it, and rename it to something like `build_simple_expr()` to make clear that it is a special case and not the default path you should choose.

Sound ok to you?
Windows failure is due to the R 4.2.2 release today: mirrors haven't been updated yet. I'll take up the refactoring @paleolimbot requested in ARROW-18203.
Benchmark runs are scheduled for baseline = 8066c5e and contender = d045fc5. d045fc5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Logic is encapsulated in `wrap_scalars()` in expression.R. Most test updating (that is not linting) is changing some printed output types, because `int * 2` now stays `int32` and the printed ExecPlans don't have as many `cast`s in them. The tests added in `test-dplyr-query.R` are the explicit tests of the feature.
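The "only cast when lossless" behavior that `wrap_scalars()` encapsulates can be sketched in Python like so. Types are represented as strings and only the int32 case is shown; this is an illustrative analogue, not the R implementation.

```python
# Sketch: cast a scalar to the field's type only when the cast loses
# nothing; otherwise leave the scalar alone and let the field be
# promoted instead (so int32_field * 2 stays int32, while
# int32_field + 4.2 is evaluated as float64).
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def wrap_scalar(value, field_type):
    if field_type == "int32" and isinstance(value, (int, float)):
        if float(value).is_integer() and INT32_MIN <= value <= INT32_MAX:
            return ("int32", int(value))   # lossless: scalar adopts field type
        return ("float64", float(value))   # lossy: field gets cast up
    return (field_type, value)

print(wrap_scalar(2, "int32"))    # ('int32', 2)
print(wrap_scalar(4.2, "int32"))  # ('float64', 4.2)
```

This is why the printed ExecPlans in the updated tests contain fewer `cast` nodes: the scalar side of the expression adapts whenever it can do so without losing information.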