
[SPARK-22036][SQL] Decimal multiplication with high precision/scale often returns NULL #20023

Closed
wants to merge 17 commits

Conversation

mgaido91
Contributor

@mgaido91 mgaido91 commented Dec 19, 2017

What changes were proposed in this pull request?

When there is an operation between Decimals and the result is a number which is not representable exactly with the result's precision and scale, Spark is returning NULL. This was done to reflect Hive's behavior, but it is against SQL ANSI 2011, which states that "If the result cannot be represented exactly in the result type, then whether it is rounded or truncated is implementation-defined". Moreover, Hive now changed its behavior in order to respect the standard, thanks to HIVE-15331.

Therefore, the PR proposes to:

  • update the rules for determining the result precision and scale according to the new Hive rules introduced in HIVE-15331;
  • round the result of the operations when it is not representable exactly with the result's precision and scale, instead of returning NULL;
  • introduce a new config spark.sql.decimalOperations.allowPrecisionLoss, which defaults to true (i.e. the new behavior), in order to allow users to switch back to the previous one.

Hive's behavior reflects SQLServer's. The only difference is that Hive adjusts the precision and scale for all arithmetic operations, while SQL Server's documentation says it does so only for multiplications and divisions. This PR follows Hive's behavior.

A more detailed explanation is available here: https://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCAEorWNAJ4TxJR9NBcgSFMD_VxTg8qVxusjP%2BAJP-x%2BJV9zH-yA%40mail.gmail.com%3E.
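As a quick illustration, here is a hypothetical spark-shell snippet (values match the comparison tables posted later in this thread; output shapes are abbreviated):

// decimal(38,18) * decimal(38,18) needs precision 77 and scale 36, which does not fit in 38 digits.

// Before this PR: the full scale is kept, the integral part does not fit, and Spark returns NULL.
spark.sql("select cast(100.0 as decimal(38,18)) * cast(999.0 as decimal(38,18))").show()
// null

// After this PR (spark.sql.decimalOperations.allowPrecisionLoss=true, the default):
// the scale is reduced (never below 6) and the result is rounded instead.
spark.sql("select cast(100.0 as decimal(38,18)) * cast(999.0 as decimal(38,18))").show()
// 99900.000000   -- the result type is decimal(38,6)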

How was this patch tested?

modified and added UTs. Comparisons with results of Hive and SQLServer.

@SparkQA

SparkQA commented Dec 19, 2017

Test build #85116 has finished for PR 20023 at commit 3037d4a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mgaido91
Contributor Author

@cloud-fan @dongjoon-hyun @gatorsmile @rxin @viirya I saw you worked on these files; maybe you can help review the PR. For further details about the reasons for this PR, please refer to the e-mail I sent to the dev mailing list. Thank you.

@cloud-fan
Contributor

Ideally we should avoid behavior changes as much as we can, but since this behavior comes from Hive and Hive has also changed it, it might be OK to follow Hive and change it too? cc @hvanhovell too

@mgaido91
Contributor Author

@cloud-fan yes, Hive changed it and, more importantly, at the moment we are not compliant with the SQL standard. So currently Spark returns results which are different from Hive's and not compliant with the SQL standard. This is why I proposed this change.

@hvanhovell
Contributor

I am generally in favor of following the SQL standard. How about we do this: let's make the standard behavior the default, and add a flag to revert to the old behavior. This allows us to ease users into the new behavior, and it can provide some data points on when we can remove the old behavior. I hope we can remove it for Spark 2.4 or later.

At the end of the day it will be a bit more work, as I'd definitely make an effort to isolate the two behaviors as much as possible.

@mgaido91
Contributor Author

thanks for looking at this @hvanhovell. The reasons why I didn't introduce a configuration variable for this behavior are:

  1. As far as I know, there is currently no way to reliably read a configuration in catalyst;
  2. Also in Hive, the behavior was changed without introducing any configuration to switch back to the previous behavior;
  3. Many people are complaining about the current Spark behavior in the JIRA, so it seems that the existing behavior is neither desired nor useful to users.

Let me know if you don't agree with these arguments. Thanks.

insert into decimals_test values(1, 100.0, 999.0);
insert into decimals_test values(2, 12345.123, 12345.123);
insert into decimals_test values(3, 0.1234567891011, 1234.1);
insert into decimals_test values(4, 123456789123456789.0, 1.123456789123456789);
Member

nit. How about making it into one SQL statement?

insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789)

@hvanhovell
Contributor

I don't fully agree :)...

  1. You can use SQLConf.get for this. Or you can wire up the rules using the SessionStateBuilder.
  2. I am reluctant to change this for a minor version. I don't think the Hive approach is very good for UX.
  3. People are also going to complain if we do change it.

@@ -0,0 +1,16 @@
-- tests for decimals handling in operations
-- Spark draws its inspiration byt Hive implementation
Member

The hyperlinks in the PR come from Microsoft, and the primary purpose is SQL compliance. Can we remove this line?

@@ -1526,15 +1526,15 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
checkAnswer(sql("select 10.300000000000000000 * 3.000000000000000000"),
Row(BigDecimal("30.900000000000000000000000000000000000", new MathContext(38))))
checkAnswer(sql("select 10.300000000000000000 * 3.0000000000000000000"),
Row(null))
Member

Two cases (2 and 3) were mentioned in the email. If this is the only NULL-return test case from previous behavior, can we have another test case?

Currently, Spark behaves as follows:

   1. It follows some rules taken from the initial Hive implementation;
   2. it returns NULL;
   3. it returns NULL.

Contributor Author

The third case, i.e. when we go out of the representable range of values, is never checked in the current codebase. I haven't added a test for it because I was waiting for feedback from the community about how to handle the 3rd case, and I focused this PR only on points 1 and 2. But I can add a test case for it and eventually change it in a future PR addressing the 3rd point in the e-mail. Thanks.

@dongjoon-hyun
Member

Thank you for pinging me, @mgaido91. The approach of the PR looks good to me.
BTW, do we need to borrow more Hive 2.2 test cases?

@mgaido91
Contributor Author

@hvanhovell, as far as point 1 is concerned, I was referring to this comment and this PR, where it is explicitly stated that using SQLConf here is not safe and it shouldn't be done. Let me know if I am missing something.
I am sorry, but I think I haven't fully understood what you meant by

Or you can wire up the rules using the SessionStateBuilder

May I kindly ask you to elaborate on this sentence a bit more? Thank you very much.

@mgaido91
Contributor Author

Thank you for your review @dongjoon-hyun. I think what we can do is add more tests to the whitelist in HiveCompatibilitySuite, updating them according to HIVE-15331. Were you thinking of this or something different? Thanks.

@dongjoon-hyun
Member

I was thinking of adding more cases to decimals.sql. :) But, for now, never mind about that~

private[sql] def adjustPrecisionScale(precision: Int, scale: Int): DecimalType = {
// Assumptions:
// precision >= scale
// scale >= 0
Member

Use an assert to make sure the assumptions hold?

Contributor Author

I can add it even though it is not needed... there is no way we can violate those constraints. If you believe it is better to use assert, I will do that.
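For reference, a minimal sketch of what the suggested guard could look like (a hypothetical helper; only the assertions, not the full adjustment logic from the diff):

def checkAdjustPrecisionScaleArgs(precision: Int, scale: Int): Unit = {
  // Guards for the documented assumptions of adjustPrecisionScale.
  assert(precision >= scale, s"precision ($precision) must be >= scale ($scale)")
  assert(scale >= 0, s"scale ($scale) must not be negative")
}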

* Type coercion for BinaryOperator in which one side is a non-decimal literal numeric, and the
* other side is a decimal.
*/
private def nondecimalLiteralAndDecimal(
Member

Is this rule newly introduced?

Contributor Author

@mgaido91 mgaido91 Dec 21, 2017

Yes, it is. If we don't introduce this, we get a failure in the Hive compatibility tests, because Hive uses the exact precision and scale needed by the literals, while before this change we were using conservative values for each type. For instance, if we have a select 123.12345 * 3, before this change 3 would have been interpreted as Decimal(10, 0), which is the type for integers. After the change, 3 becomes Decimal(1, 0), as Hive does. This prevents requiring more precision than what is actually needed.
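A tiny sketch of the idea (illustrative only, not the exact Spark code; the helper name is made up):

// Derive the decimal type of an integral literal from the digits it actually needs,
// instead of the conservative default for its data type (e.g. (10, 0) for IntegerType).
def literalDecimalType(v: Long): (Int, Int) = (BigDecimal(v).precision, 0)

literalDecimalType(3L)    // (1, 0) -- so 123.12345 * 3 no longer pays for the full (10, 0)
literalDecimalType(123L)  // (3, 0)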

@@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType {
case DoubleType => DoubleDecimal
}

private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match {
case v: Short => fromBigDecimal(BigDecimal(v))
Member

Can't we just use ShortDecimal, IntDecimal...?

Contributor Author

No, please see my comments above.

@@ -136,10 +137,54 @@ object DecimalType extends AbstractDataType {
case DoubleType => DoubleDecimal
}

private[sql] def forLiteral(literal: Literal): DecimalType = literal.value match {
Member

Is this different than forType if applied on Literal.dataType?

Contributor Author

yes, please see my comment above for an example. Thanks.

// scale >= 0
if (precision <= MAX_PRECISION) {
// Adjustment only needed when we exceed max precision
DecimalType(precision, scale)
Member

Shouldn't we also prevent scale > MAX_SCALE?

Contributor Author

this is prevented outside this function.

val intDigits = precision - scale
// If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise
// preserve at least MINIMUM_ADJUSTED_SCALE fractional digits
val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE)
Member

Sounds like MAXIMUM_ADJUSTED_SCALE instead of MINIMUM_ADJUSTED_SCALE.

Contributor Author

@mgaido91 mgaido91 Dec 21, 2017

It is the MINIMUM_ADJUSTED_SCALE. We can't have a scale lower than that, even when we would need one in order not to lose precision. Please see the comments above.

Member

We can't have a scale lower than that...

Don't you get a scale lower than MINIMUM_ADJUSTED_SCALE by Math.min(scale, MINIMUM_ADJUSTED_SCALE)?

Contributor Author

Yes, sorry, my answer was very poor; let me rephrase. scale contains the scale we need to represent the values without any precision loss. What we are doing here is saying that the lower bound for the scale is either the scale we need to represent the value exactly or MINIMUM_ADJUSTED_SCALE, whichever is smaller. Then, in the line below, the scale we actually use is the maximum between the room left after reserving space for the integral digits (MAX_PRECISION - intDigits) and this minScaleValue: i.e. even when the number of digits needed on the left of the dot would force us to use a scale lower than MINIMUM_ADJUSTED_SCALE, we still guarantee at least MINIMUM_ADJUSTED_SCALE fractional digits. We can't let the scale go below this threshold, even though that would be needed to ensure we don't lose digits on the left of the dot. Please refer also to the blog post linked in the comment above for a further (hopefully better) explanation.
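To make the interplay of the two lines concrete, here is a small self-contained sketch of the adjustment described above (constants as in the PR: MAX_PRECISION = 38, MINIMUM_ADJUSTED_SCALE = 6; an illustration, not the production code):

val MAX_PRECISION = 38
val MINIMUM_ADJUSTED_SCALE = 6

def adjust(precision: Int, scale: Int): (Int, Int) = {
  if (precision <= MAX_PRECISION) {
    (precision, scale)                    // everything fits, nothing to adjust
  } else {
    val intDigits = precision - scale     // digits needed on the left of the dot
    // lower bound for the scale: the scale we need, capped at MINIMUM_ADJUSTED_SCALE
    val minScaleValue = math.min(scale, MINIMUM_ADJUSTED_SCALE)
    // give priority to the integral digits, but never go below minScaleValue
    val adjustedScale = math.max(MAX_PRECISION - intDigits, minScaleValue)
    (MAX_PRECISION, adjustedScale)
  }
}

adjust(77, 36)  // (38, 6) -- decimal(38,18) * decimal(38,18): the scale bottoms out at 6
adjust(39, 10)  // (38, 9) -- only one digit of scale is sacrificed to keep the integral part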

// If original scale less than MINIMUM_ADJUSTED_SCALE, use original scale value; otherwise
// preserve at least MINIMUM_ADJUSTED_SCALE fractional digits
val minScaleValue = Math.min(scale, MINIMUM_ADJUSTED_SCALE)
val adjustedScale = Math.max(MAX_PRECISION - intDigits, minScaleValue)
Member

Sounds like Math.min?

Contributor Author

It is max because we take either the scale which would prevent a loss of "space" for intDigits, i.e. the part on the left of the dot, or minScaleValue, which is the scale we guarantee to provide at least.

* corresponding scale is reduced to prevent the integral part of a result from being truncated.
*
* For further reference, please see
* https://blogs.msdn.microsoft.com/sqlprogrammability/2006/03/29/multiplication-and-division-with-numerics/.
Member

Not sure this blog link will remain available for a long time.

@viirya
Member

viirya commented Dec 21, 2017

I think we should be careful about suddenly changing behavior.

@gatorsmile
Member

Thanks for your efforts! This change needs more careful review and investigation.

Could you post the outputs of Oracle, DB2, SQL server and Hive? Are their results consistent?

@mgaido91
Contributor Author

@gatorsmile, please refer to the e-mail on the dev mailing list for further details. I ran the script I added to the tests in this PR; the results are:

  • Hive behaves exactly as Spark after this PR;
  • SQLServer behaves the same, even though on additions and subtractions it seems to maintain one more precision digit in some cases (I am running SQLServer 2017; since the Hive implementation, and therefore this one too, is inspired by SQLServer 2005, there might have been a small behavior change in this case). Anyway, differently from Hive and Spark, it throws an exception in case 3 described in the email (this is compliant with the SQL standard; point 3 of the email is out of the scope of this PR, I will create another PR for it once we agree on how to handle that case);
  • Oracle and Postgres have nearly infinite precision, thus it is nearly impossible to provoke a rounding on them. If we force a precision loss on them (point 3 of the email, out of the scope of this PR), they throw an exception (compliant with the SQL standard and SQLServer).

Here are the outputs of the queries.

Hive 2.3.0 (same as Spark after PR)

0: jdbc:hive2://localhost:10000> create table decimals_test(id int, a decimal(38,18), b decimal(38,18));
No rows affected (2.085 seconds)
0: jdbc:hive2://localhost:10000> insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789);
No rows affected (14.054 seconds)
0: jdbc:hive2://localhost:10000> select id, a+b, a-b, a*b, a/b from decimals_test order by id;
+-----+---------------------------------------+---------------------------------------+----------------------------+----------------------------+
| id  |                  _c1                  |                  _c2                  |            _c3             |            _c4             |
+-----+---------------------------------------+---------------------------------------+----------------------------+----------------------------+
| 1   | 1099.00000000000000000                | -899.00000000000000000                | 99900.000000               | 0.100100                   |
| 2   | 24690.24600000000000000               | 0E-17                                 | 152402061.885129           | 1.000000                   |
| 3   | 1234.22345678910110000                | -1233.97654321089890000               | 152.358023                 | 0.000100                   |
| 4   | 123456789123456790.12345678912345679  | 123456789123456787.87654321087654321  | 138698367904130467.515623  | 109890109097814272.043109  |
+-----+---------------------------------------+---------------------------------------+----------------------------+----------------------------+

SQLServer 2017

1> create table decimals_test(id int, a decimal(38,18), b decimal(38,18));
2> insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789);
3> select id, a+b, a-b, a*b, a/b from decimals_test order by id;
4> GO

(4 rows affected)
id                                                                                                                                                                             
----------- ---------------------------------------- ---------------------------------------- ---------------------------------------- ----------------------------------------
          1                  1099.000000000000000000                  -899.000000000000000000                             99900.000000                                  .100100
          2                 24690.246000000000000000                      .000000000000000000                         152402061.885129                                 1.000000
          3                  1234.223456789101100000                 -1233.976543210898900000                               152.358023                                  .000100
          4    123456789123456790.123456789123456789    123456789123456787.876543210876543211                138698367904130467.515623                109890109097814272.043109

Postgres and Oracle

postgres=# create table decimals_test(id int, a decimal(38,18), b decimal(38,18));
CREATE TABLE
postgres=# insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789);
INSERT 0 4
postgres=# select id, a+b, a-b, a*b, a/b from decimals_test order by id;
 id |               ?column?                |               ?column?                |                        ?column?                         |               ?column?                
----+---------------------------------------+---------------------------------------+---------------------------------------------------------+---------------------------------------
  1 |               1099.000000000000000000 |               -899.000000000000000000 |              99900.000000000000000000000000000000000000 |                0.10010010010010010010
  2 |              24690.246000000000000000 |                  0.000000000000000000 |          152402061.885129000000000000000000000000000000 |                1.00000000000000000000
  3 |               1234.223456789101100000 |              -1233.976543210898900000 |                152.358023429667510000000000000000000000 |            0.000100037913541123085649
  4 | 123456789123456790.123456789123456789 | 123456789123456787.876543210876543211 | 138698367904130467.515622620750190521000000000000000000 | 109890109097814272.043109406191131436
(4 rows)

Spark before the PR

scala> sql("create table decimals_test(id int, a decimal(38,18), b decimal(38,18)) using parquet")
res0: org.apache.spark.sql.DataFrame = []

scala> sql("insert into decimals_test values (1, 100.0, 999.0), (2, 12345.123, 12345.123), (3, 0.1234567891011, 1234.1), (4, 123456789123456789.0, 1.123456789123456789)")
res1: org.apache.spark.sql.DataFrame = []                                       

scala> sql("select id, a+b, a-b, a*b, a/b from decimals_test order by id").show(truncate = false)
+---+-------------------------------------+-------------------------------------+-------+-------------------------------------+
|id |(a + b)                              |(a - b)                              |(a * b)|(a / b)                              |
+---+-------------------------------------+-------------------------------------+-------+-------------------------------------+
|1  |1099.000000000000000000              |-899.000000000000000000              |null   |0.100100100100100100                 |
|2  |24690.246000000000000000             |0E-18                                |null   |1.000000000000000000                 |
|3  |1234.223456789101100000              |-1233.976543210898900000             |null   |0.000100037913541123                 |
|4  |123456789123456790.123456789123456789|123456789123456787.876543210876543211|null   |109890109097814272.043109406191131436|
+---+-------------------------------------+-------------------------------------+-------+-------------------------------------+

@gatorsmile
Member

In DB2, a * b will just stop with an error message: SQL0802N Arithmetic overflow or other arithmetic exception occurred.

Thus, it might be more straightforward for us to follow what DB2 does.

@gatorsmile
Member

Regarding the rules for deciding precision and scales, DB2 z/OS also has its own rules: https://www.ibm.com/support/knowledgecenter/en/SSEPEK_10.0.0/sqlref/src/tpc/db2z_witharithmeticoperators.html

Could you compare it with MS SQL Server?

@cloud-fan
Contributor

It looks like other databases are very careful about precision loss. However, following DB2 and throwing an exception is pretty bad for big data applications, but returning null is also bad as it violates the SQL standard.

A new proposal: can we increase the max decimal precision to 76 and keep the max scale as 38? Then we can avoid precision loss, IIUC.

@mgaido91
Contributor Author

mgaido91 commented Dec 22, 2017

@gatorsmile I answered your comments about DB2 in the e-mail.

@cloud-fan that would help, but it would not solve the problem. It would just mean the problem shows up with bigger numbers.

As you can see from the e-mail, DB2's behavior is actually in accordance with the SQL standard and the other DBs; it just has a smaller maximum precision. The case of throwing an exception is point 3 of my e-mails and is out of the scope of this PR, because I think we should first discuss which is the right approach for that case, and then I can eventually create a PR for it.
Therefore, DB2's behavior is also aligned with the other SQL engines and the SQL standard, and Spark is the only one which is currently behaving differently.

@gatorsmile
Member

gatorsmile commented Dec 23, 2017

db2 => create table decimals_test(id int, a decimal(31,18), b decimal(31,18))
DB20000I  The SQL command completed successfully.
db2 => insert into decimals_test values (1, 2.33, 1.12)
DB20000I  The SQL command completed successfully.
db2 => select a * b from decimals_test

1                                
---------------------------------
SQL0802N  Arithmetic overflow or other arithmetic exception occurred.  
SQLSTATE=22003

I might not be getting your point. Above is the result I got. Is this your scenario 3 or 2?

@mgaido91
Contributor Author

mgaido91 commented Dec 23, 2017

@gatorsmile that is scenario 3. I will explain why, and afterwards I will post a correction of the summary in my last e-mail, because I made a mistake about how DB2 computes the result precision and scale; sorry for that.

Anyway, what you showed is an example of point 3, because DB2 computes the result type as DECIMAL( MIN(31, p1 + p2), MIN(31, s1 + s2) ). Therefore, in your case the result type was DECIMAL(31, 31). Since your result had a non-zero integral part, it was out of the range of the representable values and an overflow exception was thrown.
You can reproduce case 2 as follows:

db2 => create table decimals_test (id int, a decimal(31,31), b decimal(31,31))
DB20000I  The SQL command completed successfully.
db2 => insert into decimals_test values(1, 0.12345678912345678912345689, 0.12345678912345678912345)             
DB20000I  The SQL command completed successfully.
db2 => select a*b from dd

1                                
---------------------------------
 .0152415787806736785461049526020

As you can see, a truncation occurred.
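The rule above can be spelled out as a tiny sketch (for illustration; 31 is DB2's maximum precision as discussed here):

// DB2's result type for a decimal multiplication, as described above.
def db2MultiplyType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) =
  (math.min(31, p1 + p2), math.min(31, s1 + s2))

db2MultiplyType(31, 18, 31, 18)  // (31, 31): 2.33 * 1.12 has an integral digit -> SQL0802N overflow
db2MultiplyType(31, 31, 31, 31)  // (31, 31): purely fractional operands fit, extra digits are truncated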

Now, let me amend my table to summarize the behavior of the many DBs:

  1. Rules to determine precision and scale

    • Hive, SQLServer (and Spark after the PR): I won't include the exact formulas; the relevant part is that when the needed precision is higher than the maximum value, we use the maximum available value (38) as the precision, and as the scale we use the maximum between the needed scale (computed according to the relevant formula) and a guaranteed minimum scale, which is 6 (a worked example follows this list).
    • DB2: computes the result type as DECIMAL( MIN(31, p1 + p2), MIN(31, s1 + s2) ).
    • Postgres and Oracle: NA
    • SQL ANSI 2011: no indication
    • Spark now: if the precision needed is more than 38, use 38 as precision; use the needed scale without any adjustment.
  2. Behavior in case of precision loss but result in the range of the representable values

    • Oracle, Hive, SQLServer (and Spark after the PR): round the result.
    • DB2: truncates the result (and sets a warning flag).
    • Postgres: NA, it has infinite precision...
    • SQL ANSI 2011: either truncate or round the value.
    • Spark now: returns NULL.
  3. Behavior in case of a result out of the range of the representable values (i.e. overflow)

    • Oracle, DB2, SQLServer: throw an exception.
    • Postgres: NA, they have nearly infinite precision...
    • SQL ANSI 2011: an exception should be raised
    • Spark now, Hive: return NULL (for Hive, there is an open ticket to make it compliant with the SQL standard).
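As a worked example tying these rules to the a * b column in the outputs above, consider decimal(38,18) * decimal(38,18), which needs precision 38 + 38 + 1 = 77 and scale 18 + 18 = 36 (a sketch, using the constants discussed in this thread):

// Hive / SQLServer / Spark after the PR: cap the precision at 38, shrink the scale but never below 6.
val hiveScale = math.max(38 - (77 - 36), math.min(36, 6))     // 6 -> decimal(38, 6), value rounded
// Spark before the PR: cap the precision at 38 but keep scale 36 -> decimal(38, 36);
// only two integral digits remain, so a * b cannot be represented and NULL is returned.
// DB2 (with gatorsmile's decimal(31,18) operands): (min(31, 62), min(31, 36)) = (31, 31),
// so any product with a non-zero integral part overflows with SQL0802N.
val db2Type = (math.min(31, 31 + 31), math.min(31, 18 + 18))  // (31, 31)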

@gatorsmile
Member

gatorsmile commented Dec 24, 2017

Thanks for your detailed summary!

We do not have a SQLCA. Thus, it is hard for us to send a warning message back like DB2 does. Silently losing precision looks scary to me. Oracle sounds like it follows the rule: if a value exceeds the precision, then Oracle returns an error; if a value exceeds the scale, then Oracle rounds it.

SQL ANSI 2011 does not document many details. For example, the result type of DB2's division is different from both our existing rule and the rule you changed it to. The rule you mentioned above about DB2 is just for multiplication.

I am not sure whether we can finalize our default type coercion rule DecimalPrecision now. However, for Hive compliance, we can add a new rule after we introduce the new conf spark.sql.typeCoercion.mode. See PR #18853 for details. The new behavior will be added if and only if spark.sql.typeCoercion.mode is set to hive.

Could you first help us improve the test cases added in #20008 ? Thanks!

@mgaido91
Contributor Author

Thanks for your analysis @gatorsmile. Actually, the rule you specified for Oracle is what it uses when casting, rather than when doing arithmetic operations. Yes, DB2 has rather different rules for defining the output type of operations. Anyway, we can get a behavior practically identical to DB2's by changing the value of MINIMUM_ADJUSTED_SCALE to 31. Therefore, instead of using the configuration you pointed out, I'd propose a configuration for MINIMUM_ADJUSTED_SCALE: by changing it we can get both the Hive/SQLServer behavior and the DB2 one (see the sketch below). What do you think?
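A sketch of what the proposal would mean in practice (reusing the adjustment logic discussed above, parameterized on the minimum adjusted scale; illustrative only):

def adjustWithMinScale(precision: Int, scale: Int, minAdjustedScale: Int): (Int, Int) = {
  val maxPrecision = 38
  if (precision <= maxPrecision) (precision, scale)
  else {
    val intDigits = precision - scale
    val adjustedScale = math.max(maxPrecision - intDigits, math.min(scale, minAdjustedScale))
    (maxPrecision, adjustedScale)
  }
}

// decimal(38,18) * decimal(38,18) needs (77, 36):
adjustWithMinScale(77, 36, 6)   // (38, 6)  -- Hive / SQLServer behavior
adjustWithMinScale(77, 36, 31)  // (38, 31) -- close to DB2's min(31, s1 + s2) handling of the scale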

The reason why I am suggesting this is that my first concern is not Hive compliance, but SQL standard compliance. Indeed, as you can see from the summary, on point 1 there is no uniform behavior (but this is fine with respect to the SQL standard, since it leaves freedom there). On point 2, however, we are the only ones who are not compliant with the SQL standard, and having this behavior by default doesn't look like the right thing to do IMHO. On point 3, only we and Hive are not compliant, so I think that should be changed too. But in that case we can't use the same flag, because it would be inconsistent. What do you think?

I can understand and agree that losing precision looks scary. But to me returning NULL is even scarier, if possible: indeed, NULL is what should be returned when either of the two operands is NULL, so queries ported from other DBs which rely on this might return very bad results. For instance, think of a report where we join a prices table and a sold_product table per country. In this use case we can assume that if the result is NULL there was no sold product in that country, and then coalesce the output of the multiplication to 0. This would work well on any DB but Spark. With my proposal of tuning MINIMUM_ADJUSTED_SCALE, each customer can decide (query by query) how much precision loss they can tolerate. And if we agree to change the point 3 behavior to the SQL standard, when it is not possible to meet their needs we throw an exception, giving them the choice about what to do: allow more precision loss, change their input data type, etc. This is the safer way IMHO.

I would be happy to help improve the test cases. May I just kindly ask how you meant to do that? What would you like to be tested more? Would you like me to add more test cases within the scope of this PR, or to open a new one for that?

Thank you for your time reading my long messages. I just want us to make the best choice, and to give you all the elements I have so that we can decide for the best together.
Thank you.

@gatorsmile
Member

Following ANSI SQL compliance sounds good to me. However, many details are vendor-specific. That means the query results may still vary even if we are 100% ANSI SQL compliant.

To avoid frequently introducing breaking behavior changes, we can also introduce a new mode, strict, for spark.sql.typeCoercion.mode (Hive is also not 100% ANSI SQL compliant). Instead of inventing a completely new one, we can try to follow one of the mainstream open-source databases, for example Postgres.

Before introducing the new mode, we first need to understand the differences between Spark SQL and the others. That is why we need to write the test cases first; then we can run them against different systems. This PR clearly shows that the current test cases do not cover scenarios 2 and 3.

@mgaido91
Contributor Author

Thanks @gatorsmile. Then, should I create a follow-up PR for #20008 to cover cases 2 and 3 before going on with this PR, or can we proceed with this PR and the test cases it adds?

@cloud-fan
Contributor

LGTM.

One thing we can improve is the golden file test framework. I found we sometimes repeat the test cases with a config on and off. We should write the test cases once, list the configs we want to try, and ask the test framework to handle it. This can be a follow-up.

@mgaido91 thanks for your great work!

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86272 has finished for PR 20023 at commit 03644fe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86276 has finished for PR 20023 at commit 7653e6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86277 has finished for PR 20023 at commit 2b66098.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 17, 2018

Test build #86271 has finished for PR 20023 at commit cf3b372.

  • This patch fails from timeout after a configured wait of `300m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

- Since Spark 2.3, by default arithmetic operations between decimals return a rounded value if an exact representation is not possible. This is compliant to SQL standards and Hive's behavior introduced in HIVE-15331. This involves the following changes
- The rules to determine the result type of an arithmetic operation have been updated. In particular, if the precision / scale needed are out of the range of available values, the scale is reduced up to 6, in order to prevent the truncation of the integer part of the decimals.
- Literal values used in SQL operations are converted to DECIMAL with the exact precision and scale needed by them.
- The configuration `spark.sql.decimalOperations.allowPrecisionLoss` has been introduced. It defaults to `true`, which means the new behavior described here; if set to `false`, Spark will use the previous rules and behavior.
Member

We also need to explain what the previous behavior is.

Member

At least, we need to say that NULL will be returned in that case.
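For instance, a hypothetical spark-shell snippet of the kind of example the doc could include (config name from the diff above; outputs abbreviated):

// New default: the result is rounded when an exact representation is not possible.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "true")
spark.sql("select 10.300000000000000000 * 3.0000000000000000000").show()
// 30.9000...  (rounded to the adjusted result scale)

// Previous behavior: NULL is returned instead.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql("select 10.300000000000000000 * 3.0000000000000000000").show()
// null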

@@ -1795,6 +1795,11 @@ options.

- Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`.

- Since Spark 2.3, by default arithmetic operations between decimals return a rounded value if an exact representation is not possible. This is compliant to SQL standards and Hive's behavior introduced in HIVE-15331. This involves the following changes
- The rules to determine the result type of an arithmetic operation have been updated. In particular, if the precision / scale needed are out of the range of available values, the scale is reduced up to 6, in order to prevent the truncation of the integer part of the decimals.
Member

We need to explicitly document which arithmetic operations are affected.

.doc("When true (default), establishing the result type of an arithmetic operation " +
"happens according to Hive behavior and SQL ANSI 2011 specification, ie. rounding the " +
"decimal part of the result if an exact representation is not possible. Otherwise, NULL " +
"is returned in those cases, as previously.")
Member

Yeah. This is better.

@@ -1795,6 +1795,11 @@ options.

- Since Spark 2.3, when all inputs are binary, SQL `elt()` returns an output as binary. Otherwise, it returns as a string. Until Spark 2.3, it always returns as a string despite of input types. To keep the old behavior, set `spark.sql.function.eltOutputAsString` to `true`.

- Since Spark 2.3, by default arithmetic operations between decimals return a rounded value if an exact representation is not possible. This is compliant to SQL standards and Hive's behavior introduced in HIVE-15331. This involves the following changes
Member

This is the new behavior introduced in Hive 2.2. We have to emphasize it.

@gatorsmile
Member

LGTM except a few comments about the doc

@cloud-fan
Contributor

LGTM, pending jenkins

@SparkQA

SparkQA commented Jan 18, 2018

Test build #86330 has finished for PR 20023 at commit b4b0350.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/2.3!

asfgit pushed a commit that referenced this pull request Jan 18, 2018
…ften returns NULL

## What changes were proposed in this pull request?

When there is an operation between Decimals and the result is a number which is not representable exactly with the result's precision and scale, Spark is returning `NULL`. This was done to reflect Hive's behavior, but it is against SQL ANSI 2011, which states that "If the result cannot be represented exactly in the result type, then whether it is rounded or truncated is implementation-defined". Moreover, Hive now changed its behavior in order to respect the standard, thanks to HIVE-15331.

Therefore, the PR propose to:
 - update the rules to determine the result precision and scale according to the new Hive's ones introduces in HIVE-15331;
 - round the result of the operations, when it is not representable exactly with the result's precision and scale, instead of returning `NULL`
 - introduce a new config `spark.sql.decimalOperations.allowPrecisionLoss` which default to `true` (ie. the new behavior) in order to allow users to switch back to the previous one.

Hive behavior reflects SQLServer's one. The only difference is that the precision and scale are adjusted for all the arithmetic operations in Hive, while SQL Server is said to do so only for multiplications and divisions in the documentation. This PR follows Hive's behavior.

A more detailed explanation is available here: https://mail-archives.apache.org/mod_mbox/spark-dev/201712.mbox/%3CCAEorWNAJ4TxJR9NBcgSFMD_VxTg8qVxusjP%2BAJP-x%2BJV9zH-yA%40mail.gmail.com%3E.

## How was this patch tested?

modified and added UTs. Comparisons with results of Hive and SQLServer.

Author: Marco Gaido <marcogaido91@gmail.com>

Closes #20023 from mgaido91/SPARK-22036.

(cherry picked from commit e28eb43)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@asfgit asfgit closed this in e28eb43 Jan 18, 2018
asfgit pushed a commit that referenced this pull request Sep 27, 2018
… integral literals

## What changes were proposed in this pull request?

#20023 proposed to allow precision lose during decimal operations, to reduce the possibilities of overflow. This is a behavior change and is protected by the DECIMAL_OPERATIONS_ALLOW_PREC_LOSS config. However, that PR introduced another behavior change: pick a minimum precision for integral literals, which is not protected by a config. This PR add a new config for it: `spark.sql.literal.pickMinimumPrecision`.

This can allow users to work around issue in SPARK-25454, which is caused by a long-standing bug of negative scale.

## How was this patch tested?

a new test

Closes #22494 from cloud-fan/decimal.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
(cherry picked from commit d0990e3)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
daspalrahul pushed a commit to daspalrahul/spark that referenced this pull request Sep 29, 2018
… integral literals

## What changes were proposed in this pull request?

apache#20023 proposed to allow precision lose during decimal operations, to reduce the possibilities of overflow. This is a behavior change and is protected by the DECIMAL_OPERATIONS_ALLOW_PREC_LOSS config. However, that PR introduced another behavior change: pick a minimum precision for integral literals, which is not protected by a config. This PR add a new config for it: `spark.sql.literal.pickMinimumPrecision`.

This can allow users to work around issue in SPARK-25454, which is caused by a long-standing bug of negative scale.

## How was this patch tested?

a new test

Closes apache#22494 from cloud-fan/decimal.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
cloud-fan added a commit that referenced this pull request Nov 15, 2023
…ain integral digits first

### What changes were proposed in this pull request?

This is kind of a followup of #20023 .

It's simply wrong to cut the decimal precision to 38 if a wider decimal type exceeds the max precision, which drops the integral digits and makes the decimal value very likely to overflow.

In #20023 , we fixed this issue for arithmetic operations, but many other operations suffer from the same issue: Union, binary comparison, in subquery, create_array, coalesce, etc.

This PR fixes all the remaining operators, without the min scale limitation, which should be applied to division and multiple only according to the SQL server doc: https://learn.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15

### Why are the changes needed?

To produce reasonable wider decimal type.

### Does this PR introduce _any_ user-facing change?

Yes, the final data type of these operators will be changed if it's decimal type and its precision exceeds the max and the scale is not 0.

### How was this patch tested?

updated tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43781 from cloud-fan/decimal.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>