
[SPARK-28018][SQL] Allow upcasting decimal to double/float #24849

Closed

Conversation

@gengliangwang (Contributor) commented Jun 12, 2019

What changes were proposed in this pull request?

Currently, Spark only allows upcasting DecimalType to IntegralType or DecimalType.

This PR proposes that if the source DecimalType is tighter than DoubleType or FloatType, the upcast should be allowed as well.
This matters because the missing upcast blocks #24806. For example, if there is a table "t" with only one column of double type,

INSERT INTO TABLE t SELECT 10.0

The literal 10.0 is typed as DecimalType(3, 1), and with the current code Spark can't upcast that decimal to the double column.
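As a plain-Scala sketch (not the patch itself) of the fidelity claim for this particular literal, the decimal value round-trips through Double with nothing lost:

// DecimalType(3, 1) value from the example above
val original  = BigDecimal("10.0")
val asDouble  = original.toDouble        // the proposed upcast
val roundTrip = BigDecimal(asDouble)     // back to decimal for comparison
assert(roundTrip == original)            // no precision lost for this literal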

How was this patch tested?

Unit test

@@ -77,6 +77,7 @@ case class DecimalType(precision: Int, scale: Int) extends FractionalType {
(precision - scale) >= (dt.precision - dt.scale) && scale >= dt.scale
case dt: IntegralType =>
isWiderThan(DecimalType.forType(dt))
// For DoubleType/FloatType, the value can be NaN, PositiveInfinity or NegativeInfinity.

@gengliangwang (Author, Contributor) commented on the diff, Jun 12, 2019
This PR also adds a comment explaining why we can't upcast double/float to decimal.
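As a plain-Scala illustration (not Spark code) of that comment: the special values have no decimal representation at all, so double/float -> decimal can never be lossless.

import scala.util.Try

val specials = Seq(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity)
specials.foreach { d =>
  // BigDecimal("NaN"), BigDecimal("Infinity") and BigDecimal("-Infinity")
  // all throw NumberFormatException
  assert(Try(BigDecimal(d.toString)).isFailure)
}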

@gengliangwang (Contributor, Author) commented Jun 12, 2019

@cloud-fan @ueshin @maropu @dongjoon-hyun @rxin @rdblue @gatorsmile @HyukjinKwon
Actually, I am not entirely confident about this; I would like to hear your thoughts.

@rxin (Contributor) commented Jun 12, 2019

We shouldn't do this. Decimal guarantees no loss of precision (which is important for financial calculations). This type of cast is very dangerous.

@gengliangwang (Contributor, Author) commented Jun 12, 2019

@rxin Thanks, +1.
I think we can close this one.

@SparkQA commented Jun 12, 2019

Test build #106411 has finished for PR 24849 at commit 44b9fef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@rdblue (Contributor) commented Jun 12, 2019

@wangyum

@wangyum (Contributor) commented Jun 27, 2019

At least PostgreSQL, Vertica, SQL Server, and DB2 allow casting decimal to double/float.

PostgreSQL:

postgres=# select version();
                                                             version
----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.3 (Debian 11.3-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
(1 row)

postgres=# create table test_spark_27856(c1 float8);
CREATE TABLE
postgres=#  insert into test_spark_27856 values(cast(1123456789.01234567890123456 as numeric(38, 20)));
INSERT 0 1
postgres=# select c1, cast(1123456789.01234567890123456 as numeric(38, 20)) from test_spark_27856;
        c1        |             numeric
------------------+---------------------------------
 1123456789.01235 | 1123456789.01234567890123456000
(1 row)

Vertica:

dbadmin=> select version();
              version
------------------------------------
 Vertica Analytic Database v9.1.1-0
(1 row)

dbadmin=> create table test_spark_27856(c1 float8);
CREATE TABLE
dbadmin=> insert into test_spark_27856 values(cast(1123456789.01234567890123456 as numeric(38, 20)));
 OUTPUT
--------
      1
(1 row)

dbadmin=> select c1, cast(1123456789.01234567890123456 as numeric(38, 20)) from test_spark_27856;
        c1        |            ?column?
------------------+---------------------------------
 1123456789.01235 | 1123456789.01234567890123456000
(1 row)

SQL Server:

1> select @@version
2> go

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Microsoft SQL Server 2019 (CTP3.0) - 15.0.1600.8 (X64)
	May 17 2019 00:56:19
	Copyright (C) 2019 Microsoft Corporation
	Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS) <X64>

(1 rows affected)
1> create table test_spark_27856(c1 real)
2> go
1> insert into test_spark_27856 values(cast(112345678901234567890.123456 as decimal(38,10)));
2> go

(1 rows affected)
1> select c1, cast(112345678901234567890.123456 as decimal(38,10)) from test_spark_27856
2> go
c1
-------------- ----------------------------------------
 1.1234568E+20         112345678901234567890.1234560000

(1 rows affected)

DB2:

[db2inst1@2f3c821d36b7 ~]$ db2 "select service_level from table (sysproc.env_get_inst_info())"

SERVICE_LEVEL
--------------------------------------------------------------------------------------------------------------------------------
DB2 v10.5.0.5

  1 record(s) selected.

[db2inst1@2f3c821d36b7 ~]$ db2 "create table test_spark_27856(c1 double)"
DB20000I  The SQL command completed successfully.
[db2inst1@2f3c821d36b7 ~]$ db2 "insert into test_spark_27856 values(cast(112345678901234567890.123456 as decimal(31,10)))"
DB20000I  The SQL command completed successfully.
[db2inst1@2f3c821d36b7 ~]$ db2 "select c1, cast(112345678901234567890.123456 as decimal(31,10)) from test_spark_27856"

C1                       2
------------------------ ---------------------------------
  +1.12345678901235E+020  112345678901234567890.1234560000

  1 record(s) selected.
@cloud-fan (Contributor) commented Jun 27, 2019

I think this makes sense for table insertion. In Postgres, we can even insert a float/double value into an int/long column.

Truncation seems not that dangerous for table insertion.

@gengliangwang (Contributor, Author) commented Jun 27, 2019

@wangyum @cloud-fan Thanks a lot for the investigation.
I have also investigated MySQL, Oracle, and PostgreSQL; all of them allow casting a float/double value to an int/long column. We can't use their behavior as a reference, since their data type casting allows precision loss.
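As a quick plain-Scala sketch of the kind of loss those databases accept, converting a fractional double to a long simply drops the fraction:

assert(1115.3249.toLong == 1115L)   // fractional part silently dropped
assert((-0.9).toLong == 0L)         // truncation toward zero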

The safe casting we are proposing for Spark 3.0 is entirely new behavior: it checks data types statically in the analyzer. However, considering that most popular DBMSs use implicit casting, are we being too strict for this specific case, decimal -> double/float?
It is true that the computation results can differ between the following:

scala> BigDecimal("1115.32") + BigDecimal("0.0049")
res0: scala.math.BigDecimal = 1115.3249

scala> 1115.32 +0.0049
res1: Double = 1115.3248999999998

But there is no precision loss in the cast itself:

scala> BigDecimal(1115.3249).toDouble
res2: Double = 1115.3249
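For contrast, a plain-Scala sketch of a decimal that is wider than a double losing digits in the same round trip (using the value from the PostgreSQL example above):

val wide     = BigDecimal("1123456789.01234567890123456")   // 27 significant digits
val narrowed = BigDecimal(wide.toDouble)                     // at most ~17 survive
assert(narrowed != wide)                                     // trailing digits are gone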

I think we should reconsider the proposal in this PR unless there is a better solution.

BTW, I have been thinking about other solutions to this problem, e.g. allowing such casting only for LocalRelation, or only for table insertion. But allowing special cases makes the casting behavior inconsistent. The proposal in this PR is simple and explainable.

@cloud-fan (Contributor) commented Jun 27, 2019

Regarding Upcast, it seems both float/double -> int/long and decimal -> float/double should be forbidden. But table insertion should allow more cases than Upcast. How about we update DataType.canWrite?

@gengliangwang (Contributor, Author) commented Jun 27, 2019

Regarding Upcast, it seems both float/double -> int/long and decimal -> float/double should be forbidden.

A float/double value can be Infinity; I don't think allowing float/double -> int/long is a good idea.
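As a plain-Scala sketch of that hazard (JVM narrowing rules, not Spark's Cast), the special values are silently clamped or zeroed:

assert(Double.PositiveInfinity.toLong == Long.MaxValue)
assert(Double.NegativeInfinity.toLong == Long.MinValue)
assert(Double.NaN.toLong == 0L)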

How about we update DataType.canWrite?

I have tried creating a patch that allows such conversion for LocalRelation: https://github.com/apache/spark/compare/master...gengliangwang:allowUpcastDecimalLocalRelation?expand=1
The code is actually quite ugly. If canWrite returns true, the conversion is created as an Upcast. In that case, we can't simply update DataType.canWrite; we would have to mark some of the unsafe conversions as Cast.

I still suggest that we should simply allow upcasting tighter decimal to double/float.

@rdblue (Contributor) commented Jun 27, 2019

@wangyum, the problem is with implicit casts inserted by Spark, not explicit casts included in the query.

@rdblue (Contributor) commented Jun 27, 2019

@cloud-fan, DataType.canWrite should not allow decimal to double conversion. That loses information and must not be inserted automatically.

@rdblue (Contributor) commented Jun 27, 2019

I still suggest that we should simply allow upcasting tighter decimal to double/float.

This is not a safe cast.

In the DSv2 sync we discussed options that would work. Why not go with those?

@cloud-fan (Contributor) commented Jun 28, 2019

the problem is with implicit casts inserted by Spark, not explicit casts included in the query.

@rdblue it seems you misread the query.

The query insert into test_spark_27856 values(cast(112345678901234567890.123456 as decimal(31,10))) inserts a decimal value into a float column, so an implicit cast does happen. We could replace the explicit cast with a decimal literal to make this clearer.

@gengliangwang (Contributor, Author) commented Jun 28, 2019

This is not a safe cast.

I think it is explainable: there is no precision loss in the cast itself.

In the DSv2 sync we discussed options that would work. Why not go with those?

As I remember, we didn't reach a solution in the DSv2 sync, and we decided to come up with solutions offline.

@rdblue (Contributor) commented Jun 28, 2019

@cloud-fan, thanks for pointing that out. I see that it is inserting the cast.

@rdblue (Contributor) commented Jun 28, 2019

@gengliangwang, there were two suggestions in the DSv2 sync:

  • Add a decimal type for SQL literals that can be cast to float because the intended type of the literal is not known, or use some analysis rule that matches literals for the same purpose
  • Parse literals as floats and insert an implicit cast from float to decimal

Both of those options are still valid.

I'm surprised to see that decimals can be implicitly cast to float columns in most databases. Is this part of the SQL spec? If so, then I think we can go with that. Otherwise, I'd prefer one of the other options.

@gengliangwang (Contributor, Author) commented Jul 1, 2019

  • Add a decimal type for SQL literals that can be cast to float because the intended type of the literal is not known, or use some analysis rule that matches literals for the same purpose
  • Parse literals as floats and insert an implicit cast from float to decimal

Sorry, I meant that there was no conclusion in the sync. I came up with those proposals during the sync, but I don't think they are good enough.

@gengliangwang (Contributor, Author) commented Jul 1, 2019

@rdblue I have checked the 2011 edition of Information technology — Database languages — SQL — Part 2: Foundation (SQL/Foundation).

In section 15.10: Effect of inserting tables into base tables

So we can refer to the rules in section 9.2: Store assignment

From the ISO SQL standard: "If a value of the declared type of T can be obtained from V by rounding or truncation, then the value of T is set to that value".
Checking in the analyzer whether data types are strictly upcastable doesn't match the standard. I think we can consider a new approach:

  1. Always use Cast in table insertion when the input data type doesn't match the data type of the table column.
  2. Have Cast throw a runtime exception on failure, instead of returning null results (the two failure behaviors are contrasted in the sketch below).

The new approach would be configurable.
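A minimal plain-Scala sketch (hypothetical helpers, not Spark's Cast implementation) contrasting the two failure behaviors for a value that cannot be cast:

import scala.util.Try

// Policy 1 (current behaviour): the bad value becomes NULL in the target column.
def lenientCast(s: String): Option[Int] = Try(s.trim.toInt).toOption

// Policy 2 (proposed, configurable): the bad value fails the query at runtime.
def strictCast(s: String): Int =
  Try(s.trim.toInt).getOrElse(throw new NumberFormatException(s"Cannot cast '$s' to INT"))

lenientCast("abc")    // None, i.e. NULL
// strictCast("abc")  // would throw NumberFormatException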
