GH-18481: [C++] prefer casting literal over casting field ref #15180


westonpace
Member

@westonpace westonpace commented Jan 4, 2023

I ran into this problem while trying to work out partition pruning in the new scan node. I feel like this is a somewhat naive approach but it seems to work.

I think it would fail if a DispatchBest existed for an n-ary kernel whose argument types are not all equal. For example, if there were a function foo(int8, int32) and it had a dispatch best of some kind.

github-actions bot commented Jan 4, 2023

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@@ -368,6 +368,153 @@ bool Expression::IsSatisfiable() const {

namespace {

TypeHolder SmallestTypeFor(const arrow::Datum& value) {
Member Author

I feel like this is a rather naive approach but it seems to work.

}
case Type::DOUBLE: {
double doub = value.scalar_as<DoubleScalar>().value;
if (double(float(doub)) == doub) {
Member Author

Is this the best way to determine if a double can be exactly represented by a float?

Member

This doesn't catch nans. I'd recommend including an explicit clause for nans and infs. For the value itself, though, I think this roundtrip is reasonable.

Member

What do you think of checking fmod(doub, 1.0) == 0 and demoting to integer? This would simplify expressions like i32 > 0.0 to an integer comparison. Perhaps not worth it?

Member Author

At the moment I'm thinking no for floating->integral. In general it should be easy enough to convert from one literal to another in just about any serialization of expressions (e.g. change 1.0 to 1).
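
For reference, the floating->integral demotion discussed in this thread (and not adopted here) could be sketched roughly as follows. This is a standalone illustration, not Arrow code: `AsExactInt32` is a hypothetical helper name, and the choice of int32 as the target is an assumption taken from the `i32 > 0.0` example.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical helper, not part of Arrow: returns the int32 equal to `value`,
// or std::nullopt when the double is not finite, has a fractional part, or is
// out of range for int32.
std::optional<int32_t> AsExactInt32(double value) {
  if (!std::isfinite(value)) return std::nullopt;
  // std::trunc drops the fractional part; for finite values this is the same
  // test as std::fmod(value, 1.0) == 0.
  if (std::trunc(value) != value) return std::nullopt;
  // Both int32 limits are exactly representable as double.
  constexpr double kMin = std::numeric_limits<int32_t>::min();
  constexpr double kMax = std::numeric_limits<int32_t>::max();
  if (value < kMin || value > kMax) return std::nullopt;
  return static_cast<int32_t>(value);
}
```

With such a helper, a literal like 0.0 or -4.0 could be rewritten as an integer literal, while 1.5 or 3e9 would be left alone.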

Member Author

I've added a check for nan/inf
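
A minimal standalone sketch of the round-trip check with an explicit clause for NaN and infinities, as requested in this thread. The helper name `IsExactlyRepresentableAsFloat` is hypothetical, and treating non-finite values as representable is an assumption (float32 has NaN and inf too); the thread does not say which policy the PR settled on.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper, not Arrow's actual code: true when `value` can be
// stored in a float without changing its numeric value.
bool IsExactlyRepresentableAsFloat(double value) {
  // NaN never compares equal to itself, so the round trip below would always
  // reject it. Assumption: NaN and +/-inf count as representable because
  // float has those special values as well.
  if (!std::isfinite(value)) return true;
  // Check the range first so the narrowing conversion below is well defined.
  if (std::fabs(value) > std::numeric_limits<float>::max()) return false;
  // Round trip: narrowing to float and widening again must give the same value.
  return static_cast<double>(static_cast<float>(value)) == value;
}
```

Under this sketch 1.5 is accepted, while 0.1, 16777217.0 (2^24 + 1) and 1e300 are rejected.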

@westonpace westonpace requested a review from bkietz January 4, 2023 00:29
Member

@bkietz bkietz left a comment

I think the general approach is acceptable, modulo a few nits.

For future reference, my own plan for this issue was to write this as a simplification pass. This approach is much simpler and only introduces a few minor edge cases. Nicely done

  ExpectBindsTo(cmp(field_ref("dict_i32"), literal(int64_t(4))),
-               cmp(cast(field_ref("dict_i32"), int64()), literal(int64_t(4))));
+               cmp(field_ref("dict_i32"), literal(int32_t(4))));
  ExpectBindsTo(cmp(field_ref("dict_i32"), literal(int64_t(4))),
Member

Please add a case which demonstrates the behavior in the presence of unsigned integers. SmallestTypeFor preserves signedness information, which can produce some odd edge cases which we should be explicit about. For example:

Suggested change
- ExpectBindsTo(cmp(field_ref("dict_i32"), literal(int64_t(4))),
+ ExpectBindsTo(cmp(field_ref("i8"), literal(uint8_t(4))),
+               cmp(cast(field_ref("i8"), int16()), literal(int16_t(4))));
+ ExpectBindsTo(cmp(field_ref("dict_i32"), literal(int64_t(4))),

Member Author

Done.
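
The unsigned edge case raised in this thread can be illustrated with a toy model. This is a sketch, not Arrow code: `IntType`, `SmallestUnsignedFor`, `SmallestSignedFor` and `CommonType` are hypothetical names, and the promotion rule (mixed signedness needs a signed type strictly wider than the unsigned operand) is a simplified assumption chosen to match the int8/uint8 -> int16 result in the suggested test.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>

// Toy model of an integer type: signedness plus bit width (8, 16, 32 or 64).
struct IntType {
  bool is_signed;
  int bits;
};

// Narrowest unsigned type that holds `value`; the literal stays unsigned.
IntType SmallestUnsignedFor(uint64_t value) {
  for (int bits : {8, 16, 32}) {
    if (value <= (uint64_t{1} << bits) - 1) return {false, bits};
  }
  return {false, 64};
}

// Narrowest signed type that holds `value`.
IntType SmallestSignedFor(int64_t value) {
  for (int bits : {8, 16, 32}) {
    const int64_t hi = (int64_t{1} << (bits - 1)) - 1;
    const int64_t lo = -hi - 1;
    if (value >= lo && value <= hi) return {true, bits};
  }
  return {true, 64};
}

// Simplified promotion rule (an assumption for this sketch): equal signedness
// takes the wider width; mixed signedness takes a signed type twice as wide as
// the unsigned operand (or the signed width if larger), capped at 64 bits.
IntType CommonType(IntType a, IntType b) {
  if (a.is_signed == b.is_signed) {
    return {a.is_signed, std::max(a.bits, b.bits)};
  }
  const IntType& s = a.is_signed ? a : b;
  const IntType& u = a.is_signed ? b : a;
  return {true, std::min(64, std::max(s.bits, u.bits * 2))};
}
```

In this model an int8 field compared with a uint8 literal 4 ends up at int16, so the field is still cast even though the value 4 would fit in int8; a signed literal 4 shrinks to int8 and no cast is needed.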

@westonpace westonpace force-pushed the feature/ARROW-11402--avoid-pointless-casts branch from 2870fb2 to cadf4ae Compare February 1, 2023 23:37
@westonpace
Member Author

I've addressed the review comments. In doing so I encountered another problem (#33990) and addressed it here as well.

Closes #33990

@westonpace
Member Author

The test failures revealed another issue which I addressed here. In DispatchBest for simple arithmetic we were declaring float32 to be the common type between decimal and float32. However, I believe float64 is a more accurate choice. This also matches the rule we use for trigonometric functions. CC @lidavidm for second opinion.

@nealrichardson there was another R test failure which I deemed inevitable and I have updated the test. It is a slight behavior change. Does this seem acceptable: https://github.com/apache/arrow/pull/15180/files#diff-45391fbe156c77f99e14090369faed8e11ec49583634fa1a51b634ef8b9eb5bf

@nealrichardson
Member

Let me take a look. If this PR does what I think it's doing, there's probably a bunch of R code I can delete now that attempted the same.

@lidavidm
Member

lidavidm commented Feb 2, 2023

float64 sounds fine to me (I'm surprised it was float32 actually)

@@ -280,7 +280,9 @@ test_that("infer_type() gets the right type for Expression", {
expect_equal(y$type(), infer_type(y))
expect_equal(infer_type(y), float64())
expect_equal(add_xy$type(), infer_type(add_xy))
expect_equal(infer_type(add_xy), float64())
# even though 10 is a float64, arrow will clamp it to the narrowest
Member

This is only because both are scalars? If the float64 were corresponding to a field in the data, I would think that downcasting to float32 would be undesirable.

Member Author

Correct.

@nealrichardson
Member

It would be good to run the macrobenchmarks on this. I would expect some speedup, mainly in the python queries (assuming we have any that would trigger this) since much of this is already handled in the R code, because we should be avoiding some big casts. Perhaps more importantly though, I ran into some issues with decimal types in #14553 and had to exclude decimals from the type conversion. I noticed this because a TPC-H query errored as a result of trying to keep operations on decimals.

@westonpace
Member Author

> It would be good to run the macrobenchmarks on this. I would expect some speedup

I will, but I'm not certain we will get any speedup.

> mainly in the python queries (assuming we have any that would trigger this)

We don't. The python queries are extremely basic.

> I ran into some issues with decimal types

Yes. In fact, for some of the simpler TPC-H queries, that conversion from decimal to double accounts for the majority of the execution time. Unfortunately, this PR does not fix that particular case: because we don't support decimal arithmetic, a cast is inevitable.

The implementation is a bit subtle. All of our non-decimal implicit casts go from a "narrow type" to a "wider type" (i.e. they are safe casts). For example, if we have add<int8, int16> it will become add<int16, int16>. So by shrinking the literal as much as possible we ensure that it is the operand that gets cast (or that no cast is needed at all).

With decimal implicit casts we actually go in the opposite direction: from decimal (a wide type) to float64 (a narrower type). Since we don't shrink a float64 literal into a decimal literal, we still end up casting the array and not the literal.
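
The narrow-to-wide reasoning above can be sketched with a small model that only tracks signed integer bit widths. This is an illustration under stated assumptions, not the PR's implementation: `CastPlan`, `PlanWideningCasts` and `SmallestSignedBitsFor` are hypothetical names, and the model ignores unsigned, floating-point and decimal types.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>

// Which operands need an implicit cast when both are widened to a common width.
struct CastPlan {
  int common_bits;    // width both operands end up with
  bool cast_field;    // true if the whole column must be cast
  bool cast_literal;  // true if only the single literal value must be cast
};

// Implicit casts only widen: the common width is the larger of the two.
CastPlan PlanWideningCasts(int field_bits, int literal_bits) {
  const int common = std::max(field_bits, literal_bits);
  return {common, field_bits < common, literal_bits < common};
}

// Narrowest signed width (8, 16, 32 or 64) that holds `value`.
int SmallestSignedBitsFor(int64_t value) {
  for (int bits : {8, 16, 32}) {
    const int64_t hi = (int64_t{1} << (bits - 1)) - 1;
    const int64_t lo = -hi - 1;
    if (value >= lo && value <= hi) return bits;
  }
  return 64;
}
```

For an int32 field and an int64 literal 4, `PlanWideningCasts(32, 64)` casts the field, whereas `PlanWideningCasts(32, SmallestSignedBitsFor(4))` casts only the literal. Casting one scalar is much cheaper than casting every value in a column.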

@westonpace
Member Author

@ursabot please benchmark lang=R

@ursabot

ursabot commented Feb 3, 2023

Benchmark runs are scheduled for baseline = 7eeb74e and contender = c8250e3. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c8250e36 test-mac-arm
[Finished] c8250e36 ursa-i9-9960x
[Finished] 7eeb74ed ec2-t3-xlarge-us-east-2
[Finished] 7eeb74ed test-mac-arm
[Finished] 7eeb74ed ursa-i9-9960x
[Finished] 7eeb74ed ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@westonpace westonpace changed the title ARROW-11402: [C++] prefer casting literal over casting field ref GH-18481: [C++] prefer casting literal over casting field ref Feb 4, 2023
@westonpace westonpace merged commit b56b91e into apache:master Feb 4, 2023
github-actions bot commented Feb 4, 2023

⚠️ GitHub issue #18481 has been automatically assigned in GitHub to PR creator.

@ursabot

ursabot commented Feb 4, 2023

Benchmark runs are scheduled for baseline = 838d0da and contender = b56b91e. b56b91e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.28% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.41% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] b56b91e1 ec2-t3-xlarge-us-east-2
[Failed] b56b91e1 test-mac-arm
[Finished] b56b91e1 ursa-i9-9960x
[Finished] b56b91e1 ursa-thinkcentre-m75q
[Finished] 838d0daf ec2-t3-xlarge-us-east-2
[Failed] 838d0daf test-mac-arm
[Finished] 838d0daf ursa-i9-9960x
[Finished] 838d0daf ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@nealrichardson
Member

nealrichardson commented Feb 4, 2023

I pulled this, removed the R code that tries to make scalar inputs match the type of the corresponding fields where safe and appropriate, and ran the tests. I was surprised to see that field<int32> * 1<float64> resulted in float32. The R code I would be removing checks that 1<float64> can be represented as int32 (the field's type) without loss, and if so, it casts it to match the field. It's fine with me if the acero code doesn't want to use that rule, but I found the result of float32, i.e. casting both inputs to something else, surprising.

@westonpace
Member Author

> The R code I would be removing checks that 1<float64> can be represented as int32 (the field's type) without loss, and if so, it casts it to match the field.

@bkietz had suggested a similar rule. I am a little bit unsure about switching over types like that, but we can add it if we want.

> I found the result of float32, i.e. casting both inputs to something else, surprising.

I found this surprising as well (I added a test case for it to draw attention to it). I think it is the correct thing to do, though (although I may be biased, as I think avoiding this situation would end up being tricky). Plus, if a user wants, they can always cast explicitly to get their desired behavior. Also, even if we adopted the above rule, this scenario could still occur with something like field<int32> * 1.5<float64> (1.5 can safely be represented as float32 but not as an integer).

Yes, the downside is that the result of 1.5 * x might not be exactly representable in float32 for all values of x, but that would be true if the right-hand side were an array too. In other words, if field<int32> * 1<float64> should be float64 for precision purposes, then one could argue that field<int32> * field<float32> should be float64 by the same reasoning. That is, if we are going to cast implicitly, we should cast to the target that has the best chance of representing all results.
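
The precision concern can be made concrete with a standalone sketch. The function names are hypothetical and the code only uses plain C++ arithmetic, not Arrow kernels; it assumes IEEE 754 float and double, which is what mainstream platforms provide.

```cpp
#include <cstdint>

// Models field<int32> * literal<float32>: both operands are cast to float32.
float MultiplyAsFloat32(int32_t x, float factor) {
  return static_cast<float>(x) * factor;
}

// Models the same product with float64 as the common type.
double MultiplyAsFloat64(int32_t x, double factor) {
  return static_cast<double>(x) * factor;
}
```

float32 has a 24-bit significand, so an int32 such as 16777217 (2^24 + 1) cannot be represented exactly and is rounded to a neighbouring value before the multiplication even happens, whereas float64 holds every int32 exactly. For small inputs such as 3 * 1.5 both versions agree.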

sjperkins pushed a commit to sjperkins/arrow that referenced this pull request Feb 10, 2023
…pache#15180)

I ran into this problem while trying to work out partition pruning in the new scan node.  I feel like this is a somewhat naive approach but it seems to work.

I think it would fail if a `DispatchBest` existed where a n-ary kernel existed with non-equal types.  For example, if there was a function foo(int8, int32) and it had a dispatch best of some kind.

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>
pitrou pushed a commit that referenced this pull request Aug 14, 2023
### Rationale for this change

This patch (#15180) added `SmallestTypeFor` for handling expression types. However, it dropped the timezone of timestamp types.

### What changes are included in this PR?

Add `timezone` in `SmallestTypeFor`

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Yeah it's a bugfix

* Closes: #37110

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Development

Successfully merging this pull request may close these issues.

[C++][Dataset] Allow more aggresive implicit casts for literals
5 participants