GH-18481: [C++] prefer casting literal over casting field ref #15180

westonpace · 2023-01-04T00:28:13Z

I ran into this problem while trying to work out partition pruning in the new scan node. I feel like this is a somewhat naive approach but it seems to work.

I think it would fail if a DispatchBest existed where an n-ary kernel existed with non-equal types. For example, if there was a function foo(int8, int32) and it had a dispatch best of some kind.

Closes: [C++][Dataset] Allow more aggresive implicit casts for literals #18481

github-actions · 2023-01-04T00:28:31Z

https://issues.apache.org/jira/browse/ARROW-11402

github-actions · 2023-01-04T00:28:33Z

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

westonpace · 2023-01-04T00:28:35Z

cpp/src/arrow/compute/exec/expression.cc

@@ -368,6 +368,153 @@ bool Expression::IsSatisfiable() const {

 namespace {

+TypeHolder SmallestTypeFor(const arrow::Datum& value) {


I feel like this is a rather naive approach but it seems to work.

westonpace · 2023-01-04T00:28:57Z

cpp/src/arrow/compute/exec/expression.cc

+    }
+    case Type::DOUBLE: {
+      double doub = value.scalar_as<DoubleScalar>().value;
+      if (double(float(doub)) == doub) {


Is this the best way to determine if a double can be exactly represented by a float?

This doesn't catch nans. I'd recommend including an explicit clause for nans and infs. For the value itself, though, I think this roundtrip is reasonable.

What do you think of checking fmod(doub, 1.0) == 0 and demoting to integer? This would simplify expressions like i32 > 0.0 to an integer comparison. Perhaps not worth it?

At the moment I'm thinking no for floating->integral. In general it should be easy enough to convert from one literal to another in just about any serialization of expressions (e.g. change 1.0 to 1).

I've added a check for nan/inf
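For readers following the thread, the round-trip check with explicit NaN/inf handling might look roughly like the sketch below. `FitsInFloat` is a hypothetical name, not the function in this PR, and treating NaN and infinities as representable in float is an assumption; the thread only says a check was added.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not the code from this PR): reports whether `value`
// can be narrowed to float without changing what it represents.
bool FitsInFloat(double value) {
  // NaN never compares equal to itself, so the round trip below would reject
  // it. Assumption: NaN and +/-inf count as representable, since float has
  // its own NaN and infinities.
  if (std::isnan(value) || std::isinf(value)) {
    return true;
  }
  // Finite values beyond float's range cannot be narrowed; checking first
  // also avoids an out-of-range double-to-float conversion.
  if (std::fabs(value) >
      static_cast<double>(std::numeric_limits<float>::max())) {
    return false;
  }
  // In-range finite value: exact only if the round trip preserves it.
  return static_cast<double>(static_cast<float>(value)) == value;
}
```

Under these assumptions, 1.5 narrows to float while 0.1 and 16777217.0 stay double.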

bkietz

I think the general approach is acceptable, modulo a few nits.

For future reference, my own plan for this issue was to write this as a simplification pass. This approach is much simpler and only introduces a few minor edge cases. Nicely done

bkietz · 2023-01-27T22:12:15Z

cpp/src/arrow/compute/exec/expression.cc

+    }
+    case Type::DOUBLE: {
+      double doub = value.scalar_as<DoubleScalar>().value;
+      if (double(float(doub)) == doub) {


This doesn't catch nans. I'd recommend including an explicit clause for nans and infs. For the value itself, though, I think this roundtrip is reasonable.

bkietz · 2023-01-27T22:19:24Z

cpp/src/arrow/compute/exec/expression.cc

+    }
+    case Type::DOUBLE: {
+      double doub = value.scalar_as<DoubleScalar>().value;
+      if (double(float(doub)) == doub) {


What do you think of checking fmod(doub, 1.0) == 0 and demoting to integer? This would simplify expressions like i32 > 0.0 to an integer comparison. Perhaps not worth it?
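A minimal sketch of the floating-to-integer demotion being proposed here, for illustration only. `TryDemoteToInt32` is a hypothetical helper, and the int32 target is an assumption chosen to match the `i32 > 0.0` example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper illustrating the suggestion above.
// Writes the demoted value to *out and returns true when `value` is a whole
// number that fits in int32; otherwise leaves *out untouched.
bool TryDemoteToInt32(double value, int32_t* out) {
  assert(out != nullptr);
  if (!std::isfinite(value)) return false;         // NaN and +/-inf
  if (std::fmod(value, 1.0) != 0.0) return false;  // has a fractional part
  if (value < std::numeric_limits<int32_t>::min() ||
      value > std::numeric_limits<int32_t>::max()) {
    return false;                                  // whole, but out of range
  }
  *out = static_cast<int32_t>(value);
  return true;
}
```

One wrinkle with this rule: -0.0 would demote to 0, dropping the sign of zero.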

bkietz · 2023-01-27T22:24:30Z

cpp/src/arrow/compute/exec/expression_test.cc

    ExpectBindsTo(cmp(field_ref("dict_i32"), literal(int64_t(4))),
-                  cmp(cast(field_ref("dict_i32"), int64()), literal(int64_t(4))));
+                  cmp(field_ref("dict_i32"), literal(int32_t(4))));
+    ExpectBindsTo(cmp(field_ref("dict_i32"), literal(int64_t(4))),


Please add a case which demonstrates the behavior in the presence of unsigned integers. SmallestTypeFor preserves signedness information, which can produce some odd edge cases which we should be explicit about. For example:

Suggested change

-    ExpectBindsTo(cmp(field_ref("dict_i32"), literal(int64_t(4))),
+    ExpectBindsTo(cmp(field_ref("i8"), literal(uint8_t(4))),
+                  cmp(cast(field_ref("i8"), int16()), literal(int16_t(4))));
+    ExpectBindsTo(cmp(field_ref("dict_i32"), literal(int64_t(4))),
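The signedness edge case can be modeled with a toy type system, sketched below. `IntType`, `SmallestUnsignedFor`, `SmallestSignedFor`, and `CommonType` are all hypothetical, and the widening rule is a simplification rather than Arrow's actual `DispatchBest` logic. It does reproduce the outcome in the suggested test: a `uint8` literal compared with an `i8` field ends up at `int16` on both sides.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an integer type: signedness plus bit width.
struct IntType {
  bool is_signed;
  int bits;
};

// Narrowest unsigned type holding `v` (signedness is preserved).
IntType SmallestUnsignedFor(uint64_t v) {
  if (v <= UINT8_MAX) return {false, 8};
  if (v <= UINT16_MAX) return {false, 16};
  if (v <= UINT32_MAX) return {false, 32};
  return {false, 64};
}

// Narrowest signed type holding `v`.
IntType SmallestSignedFor(int64_t v) {
  if (v >= INT8_MIN && v <= INT8_MAX) return {true, 8};
  if (v >= INT16_MIN && v <= INT16_MAX) return {true, 16};
  if (v >= INT32_MIN && v <= INT32_MAX) return {true, 32};
  return {true, 64};
}

// Simplified widening rule (an assumption, not Arrow's implementation).
IntType CommonType(IntType a, IntType b) {
  assert(a.bits > 0 && b.bits > 0);
  if (a.is_signed == b.is_signed) {
    return {a.is_signed, std::max(a.bits, b.bits)};
  }
  const IntType& s = a.is_signed ? a : b;
  const IntType& u = a.is_signed ? b : a;
  // A signed type needs more bits than the unsigned one it must cover, so
  // step up to the next width, capped at 64.
  return {true, std::min(64, std::max(s.bits, u.bits * 2))};
}
```

With a signed literal the same model leaves an `int32` field alone, because `SmallestSignedFor(4)` is 8 bits wide and widens to the field's type.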


westonpace · 2023-02-01T23:40:49Z

I've addressed review. In doing so I encountered a problem #33990 and addressed that here as well.

Closes #33990


westonpace · 2023-02-02T14:52:06Z

The test failures revealed another issue which I addressed here. In DispatchBest for simple arithmetic we were declaring float32 to be the common type between decimal and float32. However, I believe float64 is a more accurate choice. This also matches the rule we use for trigonometric functions. CC @lidavidm for second opinion.

@nealrichardson there was another R test failure which I deemed inevitable and I have updated the test. It is a slight behavior change. Does this seem acceptable: https://github.com/apache/arrow/pull/15180/files#diff-45391fbe156c77f99e14090369faed8e11ec49583634fa1a51b634ef8b9eb5bf

nealrichardson · 2023-02-02T15:13:47Z

Let me take a look. If this PR does what I think it's doing, there's probably a bunch of R code I can delete now that attempted the same.

lidavidm · 2023-02-02T18:06:22Z

float64 sounds fine to me (I'm surprised it was float32 actually)

nealrichardson · 2023-02-02T20:26:09Z

r/tests/testthat/test-type.R

@@ -280,7 +280,9 @@ test_that("infer_type() gets the right type for Expression", {
  expect_equal(y$type(), infer_type(y))
  expect_equal(infer_type(y), float64())
  expect_equal(add_xy$type(), infer_type(add_xy))
-  expect_equal(infer_type(add_xy), float64())
+  # even though 10 is a float64, arrow will clamp it to the narrowest


This is only because both are scalars? If the float64 were corresponding to a field in the data, I would think that downcasting to float32 would be undesirable.

nealrichardson · 2023-02-02T21:25:02Z

It would be good to run the macrobenchmarks on this. I would expect some speedup, mainly in the python queries (assuming we have any that would trigger this) since much of this is already handled in the R code, because we should be avoiding some big casts. Perhaps more importantly though, I ran into some issues with decimal types in #14553 and had to exclude decimals from the type conversion. I noticed this because a TPC-H query errored as a result of trying to keep operations on decimals.

westonpace · 2023-02-03T00:54:24Z

It would be good to run the macrobenchmarks on this. I would expect some speedup

I will but I'm not certain we will get any speedup.

mainly in the python queries (assuming we have any that would trigger this)

We don't. The python queries are extremely basic.

I ran into some issues with decimal types

Yes. In fact, for some of the simpler TPC-H queries, that conversion from decimal to double is a majority of the execution time. Unfortunately, this PR does not fix that particular case because we don't support decimal arithmetic so a cast is inevitable.

The implementation is a bit subtle. All of our non-decimal implicit casts go from "narrow type" to "wider type" (i.e. they are safe casts). For example, if we have add<int8, int16> it will become add<int16, int16>. So by shrinking the literal as much as possible we ensure it is the one that will be cast (or we avoid the cast entirely).

With decimal implicit casts we actually go in the opposite direction. We go from decimal (a wide type) to float64 (a narrower type). Since we don't shrink a float64 literal into a decimal literal we still end up casting the array and not the literal.
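The narrow-to-wide argument can be sketched with a small model in which a signed integer type is reduced to its bit width. `NarrowestSignedWidth`, `CastPlan`, and `PlanCasts` are hypothetical names for illustration; none of this is the PR's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model: a signed integer type is just its bit width.
int NarrowestSignedWidth(int64_t v) {
  if (v >= INT8_MIN && v <= INT8_MAX) return 8;
  if (v >= INT16_MIN && v <= INT16_MAX) return 16;
  if (v >= INT32_MIN && v <= INT32_MAX) return 32;
  return 64;
}

// Which operands need an implicit cast, and to what width.
struct CastPlan {
  int common_width;
  bool cast_field;
  bool cast_literal;
};

// Implicit casts only widen, so the common type is the wider operand and
// whichever side is narrower is the one that gets cast.
CastPlan PlanCasts(int field_width, int literal_width) {
  assert(field_width > 0 && literal_width > 0);
  const int common = std::max(field_width, literal_width);
  return {common, field_width != common, literal_width != common};
}
```

In this model, `PlanCasts(32, 64)` (an int32 field against an unshrunk int64 literal) casts the whole field, while `PlanCasts(32, NarrowestSignedWidth(4))` casts only the literal.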

westonpace · 2023-02-03T01:05:55Z

@ursabot please benchmark lang=R

ursabot · 2023-02-03T01:06:01Z

Benchmark runs are scheduled for baseline = 7eeb74e and contender = c8250e3. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Only ['Python'] langs are supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.0% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Skipped ⚠️ Only ['C++', 'Java'] langs are supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] c8250e36 test-mac-arm
[Finished] c8250e36 ursa-i9-9960x
[Finished] 7eeb74ed ec2-t3-xlarge-us-east-2
[Finished] 7eeb74ed test-mac-arm
[Finished] 7eeb74ed ursa-i9-9960x
[Finished] 7eeb74ed ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

github-actions · 2023-02-04T05:55:06Z

Closes: [C++][Dataset] Allow more aggresive implicit casts for literals #18481

github-actions · 2023-02-04T05:55:09Z

⚠️ GitHub issue #18481 has been automatically assigned in GitHub to PR creator.

ursabot · 2023-02-04T14:03:14Z

Benchmark runs are scheduled for baseline = 838d0da and contender = b56b91e. b56b91e is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.28% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.41% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] b56b91e1 ec2-t3-xlarge-us-east-2
[Failed] b56b91e1 test-mac-arm
[Finished] b56b91e1 ursa-i9-9960x
[Finished] b56b91e1 ursa-thinkcentre-m75q
[Finished] 838d0daf ec2-t3-xlarge-us-east-2
[Failed] 838d0daf test-mac-arm
[Finished] 838d0daf ursa-i9-9960x
[Finished] 838d0daf ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

nealrichardson · 2023-02-04T15:34:56Z

I pulled this, removed the R code that tries to make scalar inputs match the type of the corresponding fields where safe and appropriate, and ran the tests. I was surprised to see that field<int32> * 1<float64> resulted in float32. The R code I would be removing checks that 1<float64> can be represented as int32 (the field's type) without loss, and if so, it casts it to match the field. It's fine with me if the acero code doesn't want to use that rule, but I found the result of float32, i.e. casting both inputs to something else, surprising.

westonpace · 2023-02-06T15:02:28Z

The R code I would be removing checks that 1<float64> can be represented as int32 (the field's type) without loss, and if so, it casts it to match the field.

@bkietz had suggested a similar rule. I am a little bit unsure about switching over types like that but we can add it if we want.

I found the result of float32, i.e. casting both inputs to something else, surprising.

I found this surprising as well (I added a test case for it to draw attention to it). I think it is the correct thing to do though (although I may be biased as I think avoiding this situation would end up being tricky). Plus, if a user wants, they can always explicitly cast to get their desired behavior. Also, even if we adopted the above rule, this scenario could still occur with something like field<int32> * 1.5<float64> (1.5 can safely be represented as float32 but not an integer).

Yes, you have the negative consequence that the result of that 1.5x might not be easily represented in float32 for all values of x but that would be true if the right hand side were an array too. In other words, if field<int32> * 1<float64> should be float64 for precision purposes then one could argue that field<int32> * field<float32> should be float64 with similar reasoning. That is, if we are going to implicitly cast we should implicitly cast to the target that has the best chance of representing all results.
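To make the precision trade-off concrete, the sketch below multiplies an int32 value by a literal once in float32 and once in float64. `MultiplyAsFloat32` and `MultiplyAsFloat64` are hypothetical helpers that only mimic the two choices of common type; they are not Arrow kernels.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: both operands are cast to float32 before multiplying, as if
// float32 were the common type. The result is widened only for comparison.
double MultiplyAsFloat32(int32_t x, double literal) {
  const float product = static_cast<float>(x) * static_cast<float>(literal);
  return static_cast<double>(product);
}

// Hypothetical: both operands are cast to float64 before multiplying.
double MultiplyAsFloat64(int32_t x, double literal) {
  return static_cast<double>(x) * literal;
}
```

For small inputs the two agree (3 * 1.5 is 4.5 either way). For x = 16777217, which is 2^24 + 1 and has no exact float32 representation, the float32 path rounds the input before multiplying and the results differ. The same loss would occur with a float32 array on the right-hand side, which is the point made above.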

…pache#15180)

I ran into this problem while trying to work out partition pruning in the new scan node. I feel like this is a somewhat naive approach but it seems to work.

I think it would fail if a `DispatchBest` existed where an n-ary kernel existed with non-equal types. For example, if there was a function foo(int8, int32) and it had a dispatch best of some kind.

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Weston Pace <weston.pace@gmail.com>

### Rationale for this change

This patch (#15180) adds `SmallestTypeFor` for handling expression types. However, it lost the timezone in the process.

### What changes are included in this PR?

Add `timezone` in `SmallestTypeFor`

### Are these changes tested?

Currently not

### Are there any user-facing changes?

Yes, it's a bugfix

* Closes: #37110

Authored-by: mwish <maplewish117@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>


github-actions bot added the Component: C++ label Jan 4, 2023

westonpace commented Jan 4, 2023


westonpace requested a review from bkietz January 4, 2023 00:29

bkietz requested changes Jan 27, 2023


ARROW-11402: prefer casting literal over casting field ref

128da07

westonpace mentioned this pull request Feb 1, 2023

[C++] I know NAN != NAN but shouldn't literal(NAN) == literal(NAN)? #33990

Closed

Added more tests to address PR review. Added special case for NaN so that literal(NaN) == literal(NaN)

cadf4ae

westonpace force-pushed the feature/ARROW-11402--avoid-pointless-casts branch from 2870fb2 to cadf4ae Compare February 1, 2023 23:37

westonpace added 2 commits February 1, 2023 15:41

Remove accidental inclusion of cmath

eec7f14

Fix bug where dispatch best was dispatching decimal x float32 to float32 instead of float64. Fix test bug in R

a0bc865

westonpace requested review from paleolimbot and thisisnic as code owners February 2, 2023 14:45

github-actions bot added the Component: R label Feb 2, 2023

nealrichardson reviewed Feb 2, 2023


Fix type-narrowing compiler warning for MSVC

c8250e3

westonpace changed the title ~~ARROW-11402: [C++] prefer casting literal over casting field ref~~ GH-18481: [C++] prefer casting literal over casting field ref Feb 4, 2023

westonpace merged commit b56b91e into apache:master Feb 4, 2023

asfimport mentioned this pull request Jan 4, 2023

[C++][Dataset] Allow more aggresive implicit casts for literals #18481

Closed

mapleFU mentioned this pull request Aug 12, 2023

GH-37110: [C++] Expression: SmallestTypeFor lost tz for Scalar #37135

Merged
