
[SPARK-26218][SQL][Follow up] Fix the corner case when casting float to Integer. #27151

Closed

Conversation

turboFei
Member

@turboFei turboFei commented Jan 9, 2020

What changes were proposed in this pull request?

When spark.sql.ansi.enabled is true, for the statement:

```
select cast(cast(2147483648 as Float) as Integer) // result is 2147483647
```

Its result is 2147483647 and does not throw ArithmeticException.

The root cause is that the code below does not work for some corner cases:

```
override def toInt(x: Float): Int = {
  // When casting floating values to integral types, Spark uses the method `Numeric.toInt`
  // or `Numeric.toLong` directly. For positive floating values, it is equivalent to `Math.floor`;
  // for negative floating values, it is equivalent to `Math.ceil`.
  // So, we can use the condition `Math.floor(x) <= upperBound && Math.ceil(x) >= lowerBound`
  // to check if the floating value x is in the range of an integral type after rounding.
  // This condition applies to converting a Float/Double value to any integral type.
  if (Math.floor(x) <= intUpperBound && Math.ceil(x) >= intLowerBound) {
    x.toInt
  } else {
    overflowException(x, "int")
  }
}
```

For example:

[screenshot]

In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly.
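To make the corner case concrete, here is a minimal standalone Java sketch (not the actual Spark source; the class and method names are mine) contrasting the buggy float-typed bound with the fixed int-typed bound. Java and Scala share the same IEEE-754 numerics here.

```java
public class FloatToIntBound {
    // Buggy check: Integer.MAX_VALUE is first rounded to a float. The nearest
    // float to 2147483647 is 2147483648.0f (it rounds UP), so x = 2147483648f
    // satisfies Math.floor(x) <= upper and slips through.
    static boolean fitsInIntBuggy(float x) {
        float upper = (float) Integer.MAX_VALUE; // 2.14748365E9f, i.e. 2147483648
        float lower = (float) Integer.MIN_VALUE;
        return Math.floor(x) <= upper && Math.ceil(x) >= lower;
    }

    // Fixed check: Integer.MAX_VALUE is promoted to double exactly
    // (2147483647.0), so 2147483648f is correctly rejected as overflow.
    static boolean fitsInIntFixed(float x) {
        return Math.floor(x) <= Integer.MAX_VALUE && Math.ceil(x) >= Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        float x = 2147483648f;                  // cast(2147483648 as Float)
        System.out.println(fitsInIntBuggy(x));  // true  -> no exception raised
        System.out.println(fitsInIntFixed(x));  // false -> overflow detected
        System.out.println((int) x);            // 2147483647: narrowing saturates
    }
}
```

Note that the lower bound was never the problem: Int.MinValue is -2^31, which is exactly representable as a float.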

Why are the changes needed?

Incorrect results.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added Unit test.

@turboFei
Member Author

turboFei commented Jan 9, 2020

For long:
[screenshot]

@turboFei
Member Author

turboFei commented Jan 9, 2020

For pgsql:
[screenshot]
For teradata:
[screenshot]

@turboFei
Member Author

turboFei commented Jan 9, 2020

@turboFei turboFei changed the title SPARK-26218: [Follow up] throw exception on overflow for integers [SPARK-26218][Follow up] Fix the conner case when cast float to Integer. Jan 9, 2020
@turboFei turboFei changed the title [SPARK-26218][Follow up] Fix the conner case when cast float to Integer. [SPARK-26218][Follow up] Fix the conner case when casting float to Integer. Jan 9, 2020
@turboFei turboFei changed the title [SPARK-26218][Follow up] Fix the conner case when casting float to Integer. [SPARK-26218][Follow up] Fix the corner case when casting float to Integer. Jan 9, 2020
@turboFei turboFei force-pushed the SPARK-26218-follow-up-int-overflow branch from 3a12066 to 5e7b1ff Compare January 9, 2020 14:37
@cloud-fan
Contributor

OK to test

@srowen
Member

srowen commented Jan 9, 2020

Hm, but:

```
scala> (Int.MaxValue.toFloat+1).toInt
res13: Int = 2147483647

scala> (Int.MaxValue.toFloat+1).toInt == Int.MaxValue
res14: Boolean = true
```

Those values do correctly cast to an int. The cast does lose precision of course, but according to Scala/Java, the result is correct, no?

@turboFei
Member Author

turboFei commented Jan 9, 2020

> Hm, but:
>
> ```
> scala> (Int.MaxValue.toFloat+1).toInt
> res13: Int = 2147483647
>
> scala> (Int.MaxValue.toFloat+1).toInt == Int.MaxValue
> res14: Boolean = true
> ```
>
> Those values do correctly cast to an int. The cast does lose precision of course, but according to Scala/Java, the result is correct, no?

Yes, the behavior is consistent with Scala/Java: when the value exceeds Int.MaxValue, casting it to Int yields Int.MaxValue.
But when spark.sql.ansi.enabled is true, we should throw an exception to stay consistent with the ANSI standard.
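That saturating behavior can be verified in plain Java, which follows the same JLS narrowing rules as Scala (the class name here is mine, for illustration only):

```java
public class SaturationDemo {
    public static void main(String[] args) {
        // One above Int.MaxValue, as a float.
        float overMax = (float) Integer.MAX_VALUE + 1;
        // Java's float -> int narrowing conversion saturates at the bounds
        // instead of throwing, which is exactly what ANSI mode must not do.
        System.out.println((int) overMax);              // 2147483647

        // Large ints generally do not survive the int -> float -> int
        // round trip either: floats near 2^31 are 128 apart.
        int big = Integer.MAX_VALUE - 100;              // 2147483547
        System.out.println((int) (float) big == big);   // false
    }
}
```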

@srowen
Member

srowen commented Jan 9, 2020

Is this code path only used for ANSI mode? and is that defined by ANSI? I wouldn't expect the result of the cast to retain that much accuracy. You're not in general going to get the same int out when the int is large, after the round-trip - right?

```diff
@@ -121,8 +121,8 @@ object FloatExactNumeric extends FloatIsFractional {
   private def overflowException(x: Float, dataType: String) =
     throw new ArithmeticException(s"Casting $x to $dataType causes overflow")

-  private val intUpperBound = Int.MaxValue.toFloat
-  private val intLowerBound = Int.MinValue.toFloat
+  private val intUpperBound = Int.MaxValue
```
Member
Hm, I'm also not clear how this helps - won't it just promote to a float in the comparison below anyway?
Do we want floorDiv, etc, instead?

Contributor
Math.floor returns double, so it's promoted to double

Member Author

@turboFei turboFei Jan 9, 2020
As mentioned by cloud-fan, it seems that casting an Int to Float and then to Double is not the same as casting it to Double directly.
[screenshot]

Contributor
it's true

```
scala> Int.MaxValue.toDouble
res2: Double = 2.147483647E9

scala> Int.MaxValue.toFloat.toDouble
res3: Double = 2.147483648E9
```

```diff
-  private val intUpperBound = Int.MaxValue.toFloat
-  private val intLowerBound = Int.MinValue.toFloat
+  private val intUpperBound = Int.MaxValue
+  private val intLowerBound = Int.MinValue
   private val longUpperBound = Long.MaxValue.toFloat
   private val longLowerBound = Long.MinValue.toFloat
```
Contributor
seems we can remove toFloat here too? also the toDouble in DoubleExactNumeric. They will be promoted anyway.

Member Author
Agreed. It looks cleaner with a consistent style.

@turboFei
Member Author

turboFei commented Jan 9, 2020

> Is this code path only used for ANSI mode? and is that defined by ANSI? I wouldn't expect the result of the cast to retain that much accuracy. You're not in general going to get the same int out when the int is large, after the round-trip - right?

Yes, this code path is only invoked in ANSI mode.
ANSI requires throwing an exception on overflow.
I have attached the corresponding behaviors of pgsql and teradata above.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-26218][Follow up] Fix the corner case when casting float to Integer. [SPARK-26218][SQL][Follow up] Fix the corner case when casting float to Integer. Jan 9, 2020
@cloud-fan
Contributor

ok to test

@SparkQA

SparkQA commented Jan 10, 2020

Test build #4990 has finished for PR 27151 at commit 477408d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116483 has finished for PR 27151 at commit 477408d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 10, 2020

Test build #116476 has finished for PR 27151 at commit 477408d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei
Member Author

turboFei commented Jan 10, 2020

Thanks for your review. Maybe we should close this PR.
Float is not an exact type: its precision is only about 7-8 significant decimal digits (a 24-bit significand), which means that 21234567890f == 21234567800f.
And the Math.floor(floatValue) operation also widens the float value to a double, so it seems reasonable to compare Math.floor(floatValue) with Int.MaxValue.toFloat.

Thanks for your review again.
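A quick sanity check of the precision claim above, in plain Java (illustrative only; `Math.ulp` reports the spacing between a float and the next larger one):

```java
public class FloatPrecision {
    public static void main(String[] args) {
        // Both literals round to the same nearest float, so they compare equal.
        System.out.println(21234567890f == 21234567800f); // true

        // Near 2^31 adjacent floats are already 128 apart, so the last few
        // decimal digits of a large int cannot be represented.
        System.out.println(Math.ulp(2147483520f));         // 128.0
    }
}
```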

@cloud-fan
Contributor

We are talking about SQL semantics, not the IEEE floating-point definition.

For pgsql

```
cloud0fan=# SELECT CAST(CAST(2147483648 as FLOAT) as Int);
ERROR:  integer out of range
```

I think the fix makes sense.

@srowen
Member

srowen commented Jan 12, 2020

(OK I'm into the idea, yes)

@turboFei turboFei closed this Jan 12, 2020
@turboFei turboFei reopened this Jan 12, 2020
@cloud-fan
Contributor

ok to test

@cloud-fan
Contributor

@turboFei can you fix the conflicts?

@cloud-fan
Contributor

If you look at the pgsql result, the new result is actually correct: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/float4.out#L299

Can you just re-generate the answer files? You can look at the doc of SQLQueryTestSuite to see how to do it.

@turboFei
Member Author

turboFei commented Feb 4, 2020

I will do it later, thanks.

@turboFei turboFei force-pushed the SPARK-26218-follow-up-int-overflow branch from 477408d to 5ee6ba1 Compare February 4, 2020 17:09
@SparkQA

SparkQA commented Feb 4, 2020

Test build #117851 has finished for PR 27151 at commit 5ee6ba1.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2020

Test build #117853 has finished for PR 27151 at commit 34ddb1e.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

unfortunately it conflicts. Can you fix it by git rebase? thanks!

@turboFei turboFei force-pushed the SPARK-26218-follow-up-int-overflow branch from 34ddb1e to 0534eb6 Compare February 5, 2020 03:58
@turboFei
Member Author

turboFei commented Feb 5, 2020

Apart from Int.MaxValue itself, the largest Integer i that satisfies i.toFloat.toInt == i is 2147483520.

It has already been added to the UT (query-34/float4.sql), so I just removed query-35.
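The claim above can be checked directly in plain Java (same float semantics as Scala; the class name is mine): 2147483520 is Integer.MAX_VALUE - 127, and floats near 2^31 are 128 apart, so it is the last int below Int.MaxValue whose round trip is exact. Int.MaxValue itself only "round-trips" because the narrowing cast saturates.

```java
public class RoundTripMax {
    public static void main(String[] args) {
        int i = 2147483520;                                 // Integer.MAX_VALUE - 127
        System.out.println((int) (float) i == i);           // true: exact round trip
        // i + 1 rounds back down to i when converted to float,
        // so the round trip is no longer exact.
        System.out.println((int) (float) (i + 1) == i + 1); // false
    }
}
```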

```diff
@@ -106,7 +106,6 @@ SELECT smallint(float('32767.6'));
 SELECT smallint(float('-32768.4'));
 SELECT smallint(float('-32768.6'));
 SELECT int(float('2147483520'));
-SELECT int(float('2147483647'));
```
Contributor
These tests are copied from pgsql and we shouldn't change them. We just need to re-generate the answer file and keep the actual result as it is.

```diff
@@ -106,7 +106,7 @@ SELECT smallint(float('32767.6'));
 SELECT smallint(float('-32768.4'));
 SELECT smallint(float('-32768.6'));
 SELECT int(float('2147483520'));
-SELECT int(float('2147483647'));
+SELECT int(float('2147483392'));
```
Contributor
let's NOT change the pgsql tests. They are used to verify the difference between Spark and pgsql. We should respect the test result, whatever it is.

@SparkQA

SparkQA commented Feb 5, 2020

Test build #117885 has finished for PR 27151 at commit 0534eb6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2020

Test build #117898 has finished for PR 27151 at commit ccad7e6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2020

Test build #117911 has finished for PR 27151 at commit 0ac6e47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

The last 2 commits are empty and were just to trigger Jenkins. The last effective commit has passed tests. I'm merging it to master/3.0, thanks!

@cloud-fan cloud-fan closed this in 6d507b4 Feb 5, 2020
cloud-fan pushed a commit that referenced this pull request Feb 5, 2020
…to Integer

### What changes were proposed in this pull request?
When spark.sql.ansi.enabled is true, for the statement:
```
select cast(cast(2147483648 as Float) as Integer) //result is 2147483647
```
Its result is 2147483647 and does not throw `ArithmeticException`.

The root cause is that, the below code does not work for some corner cases.
https://github.com/apache/spark/blob/94fc0e3235162afc6038019eed6ec546e3d1983e/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala#L129-L141

For example:

![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png)

In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly.

### Why are the changes needed?
Result corrupt.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

Added Unit test.

Closes #27151 from turboFei/SPARK-26218-follow-up-int-overflow.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6d507b4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA

SparkQA commented Feb 5, 2020

Test build #117918 has finished for PR 27151 at commit 4d47a49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Dec 3, 2020
…ting float to Integer

### What changes were proposed in this pull request?
This is a followup of [#27151](#27151). It fixes the same issue for the codegen path.

### Why are the changes needed?
Result corrupt.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added Unit test.

Closes #30585 from luluorta/SPARK-26218.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>