Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-17462: [R] Cast scalars to type of field in Expression building #13985

Merged
merged 8 commits into from Oct 31, 2022

Conversation

nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Aug 27, 2022

Logic is encapsulated in wrap_scalars() in expression.R. Most test updating (that is not linting) is changing some printed output types because int * 2 now stays int32, and the printed ExecPlans don't have as many casts in them. The tests added in test-dplyr-query.R are the explicit tests of the feature.

@github-actions
Copy link

r/R/expression.R Show resolved Hide resolved
r/R/expression.R Outdated Show resolved Hide resolved
r/R/expression.R Outdated Show resolved Hide resolved
@nealrichardson nealrichardson marked this pull request as ready for review August 30, 2022 15:41
@nealrichardson
Copy link
Member Author

@ursabot help

@ursabot
Copy link

ursabot commented Sep 1, 2022

Supported benchmark command examples:

@ursabot benchmark help

To run all benchmarks:
@ursabot please benchmark

To filter benchmarks by language:
@ursabot please benchmark lang=Python
@ursabot please benchmark lang=C++
@ursabot please benchmark lang=R
@ursabot please benchmark lang=Java
@ursabot please benchmark lang=JavaScript

To filter Python and R benchmarks by name:
@ursabot please benchmark name=file-write
@ursabot please benchmark name=file-write lang=Python
@ursabot please benchmark name=file-.*

To filter C++ benchmarks by archery --suite-filter and --benchmark-filter:
@ursabot please benchmark command=cpp-micro --suite-filter=arrow-compute-vector-selection-benchmark --benchmark-filter=TakeStringRandomIndicesWithNulls/262144/2 --iterations=3

For other command=cpp-micro options, please see https://github.com/voltrondata-labs/benchmarks/blob/main/benchmarks/cpp_micro_benchmarks.py

@nealrichardson
Copy link
Member Author

@ursabot please benchmark lang=R

@ursabot
Copy link

ursabot commented Sep 1, 2022

Benchmark runs are scheduled for baseline = 80bba29 and contender = ed6fce6. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️14.79% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] ed6fce67 test-mac-arm
[Failed] ed6fce67 ursa-i9-9960x
[Finished] 80bba299 ec2-t3-xlarge-us-east-2
[Failed] 80bba299 test-mac-arm
[Failed] 80bba299 ursa-i9-9960x
[Finished] 80bba299 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Sep 1, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@nealrichardson
Copy link
Member Author

Benchmark regressions, at least the worst of them, are due to ARROW-17601. By keeping the computation on Decimal types instead of casting to double, we hit an expression that by our current logic would need to promote to a scale that can't fit in Decimal128, so the evaluation errors somewhere, and because these are evaluating on Arrow Table, it falls back to pulling all the data into an R data.frame and doing the work there--hence the regression.

I'll see what I can do to mitigate/work around this in this PR. Most extreme case would be to not cast scalars to decimal, i.e. restore the status quo, where most queries on decimal data would end up getting coerced to float. But hopefully we can do better than that.

We have very few tests for queries on decimal types, but they're all over the TPC-H data, so that's why we only observed this in the benchmarks. That should probably get rectified too.

@nealrichardson
Copy link
Member Author

@ursabot please benchmark lang=R

@ursabot
Copy link

ursabot commented Sep 3, 2022

Benchmark runs are scheduled for baseline = 80bba29 and contender = 3d52485. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Failed] test-mac-arm
[Failed ⬇️8.49% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
[Failed] 3d52485d test-mac-arm
[Failed] 3d52485d ursa-i9-9960x
[Finished] 80bba299 ec2-t3-xlarge-us-east-2
[Failed] 80bba299 test-mac-arm
[Failed] 80bba299 ursa-i9-9960x
[Finished] 80bba299 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot
Copy link

ursabot commented Sep 3, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@nealrichardson
Copy link
Member Author

Some of the remaining benchmark regressions are spurious (file-write, dataframe-to-table, neither of which are affected by this change). The other TPC-H ones are legitimate, but they're on tiny scale factors of data (0.1, 0.01, i.e. 100mb and 10mb), so the extra 10-15ms that the type checking this PR introduces shows up as statistically significant.

IMO the tradeoff is worth it: we preserve the types of the original data better (especially for integers, and after ARROW-17601, decimals), we have more convenient passing of strings for dates/timestamps in expressions, and by avoiding unnecessary casts, we should get performance benefits on some queries. As it turns out, we aren't currently benchmarking queries where the performance benefit would show. I'd like to add a benchmark for the case shown on this issue, but it fails for me locally due to ARROW-17556.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few specific comments...I agree that for certain functions this is more correct whatever the performance cost, although it should be clearly fenced to certain functions where this makes sense.

}
x
})
args <- wrap_scalars(args, FUN)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to whitelist the functions this applies to? (Or maybe you already do this and I'm not reading this correctly?) This logic is awesome and very appropriate for most math functions but I wonder if there are some compute functions (maybe binary_repeat) that will stop working when used with build_expr(). I think that user-defined functions also generate their bindings through build_expr() (although they don't have to).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a blocklist rather than an allowlist, and binary_repeat is on it (L285, below). If there are compute functions that don't work with this change, we don't test them.

Do you think we should exclude UDFs from the type matching too?

For functions that do go through build_expr(), the way to skip the type-matching logic is to wrap the value in Expression$scalar(). Only non-Expressions are cast.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think you should whitelist here...in theory one can use build_expr() for any compute function, although many bindings choose to go directly through Expression$create() instead. Using a blocklist would mean you can only use build_expr() safely for specific functions (in which case you should probably compute what those functions are so that can be documented).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went through the function list on https://arrow.apache.org/docs/cpp/compute.html and evaluated whether you should try to cast scalar inputs to the type of the corresponding column (and remember, if you can't cast the scalar without loss of precision, it doesn't add the cast, so for int + 4.2, 4.2 won't get cast to int so that will go to cast(int, float64) + 4.2 in Acero). For the non-unary scalar functions, all but 4 make sense to try to convert scalars like this. The 4 functions that don't are binary_repeat, list_element, binary_join (kind of an odd case, which we don't use, we use binary_join_element_wise instead), and make_struct. It's around 40-50 functions on the allow side, so it seems that the "don't cast" functions are the exception.

Does that persuade you in favor of blocklist instead of allowlist?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think a whitelist is safer, although feel free to make the change. The build_expr() in the user-defined function code ( https://github.com/apache/arrow/blob/master/r/R/compute.R#L384 ) would have to change to something approaching the previous behaviour since we have no guarantees about those functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just pulled UDFs out of build_expr in d54de48, and in a followup I'll go further to reduce the usage of build_expr to places where the type matching matters (more of an allowlist, in that sense), pull out the special cases inside of it, and rename it to something like build_simple_expr to make clear that it is a special case and not the default path you should choose.

Sound ok to you?

r/tests/testthat/test-dplyr-funcs-datetime.R Outdated Show resolved Hide resolved
@nealrichardson
Copy link
Member Author

Windows failure is due to the R 4.2.2 release today: mirrors haven't been updated yet.

I'll take up the refactoring @paleolimbot requested in ARROW-18203.

@nealrichardson nealrichardson merged commit d045fc5 into apache:master Oct 31, 2022
@nealrichardson nealrichardson deleted the cast-scalars branch October 31, 2022 15:39
@ursabot
Copy link

ursabot commented Oct 31, 2022

Benchmark runs are scheduled for baseline = 8066c5e and contender = d045fc5. d045fc5 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.27% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.07% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] d045fc5d ec2-t3-xlarge-us-east-2
[Failed] d045fc5d test-mac-arm
[Finished] d045fc5d ursa-i9-9960x
[Finished] d045fc5d ursa-thinkcentre-m75q
[Finished] 8066c5e1 ec2-t3-xlarge-us-east-2
[Failed] 8066c5e1 test-mac-arm
[Finished] 8066c5e1 ursa-i9-9960x
[Finished] 8066c5e1 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants