
[SPARK-28018][SQL] Allow upcasting decimal to double/float #24849

Closed

Conversation

@gengliangwang (Contributor) commented Jun 12, 2019

What changes were proposed in this pull request?

Currently, Spark only allows upcasting DecimalType to IntegralType or DecimalType.

This PR proposes that if the source DecimalType is tighter than DoubleType or FloatType, the upcast should be allowed as well.
This matters because the missing upcast blocks #24806. For example, if there is a table "t" with only one column of double type,

INSERT INTO TABLE t SELECT 10.0

The literal 10.0 is typed as DecimalType(3, 1), and with the current code Spark can't upcast that decimal to the double column.
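As a plain-Scala sketch (not the patch itself) of the fidelity claim for this particular literal, the decimal value round-trips through Double with nothing lost:

// DecimalType(3, 1) value from the example above
val original  = BigDecimal("10.0")
val asDouble  = original.toDouble        // the proposed upcast
val roundTrip = BigDecimal(asDouble)     // back to decimal for comparison
assert(roundTrip == original)            // no precision lost for this literal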

How was this patch tested?

Unit test

@@ -77,6 +77,7 @@ case class DecimalType(precision: Int, scale: Int) extends FractionalType {
(precision - scale) >= (dt.precision - dt.scale) && scale >= dt.scale
case dt: IntegralType =>
isWiderThan(DecimalType.forType(dt))
// For DoubleType/FloatType, the value can be NaN, PositiveInfinity or NegativeInfinity.

@gengliangwang (Author, Contributor) commented on the diff, Jun 12, 2019
This PR also adds a comment explaining why we can't upcast double/float to decimal.
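As a plain-Scala illustration (not Spark code) of that comment: the special values have no decimal representation at all, so double/float -> decimal can never be lossless.

import scala.util.Try

val specials = Seq(Double.NaN, Double.PositiveInfinity, Double.NegativeInfinity)
specials.foreach { d =>
  // BigDecimal("NaN"), BigDecimal("Infinity") and BigDecimal("-Infinity")
  // all throw NumberFormatException
  assert(Try(BigDecimal(d.toString)).isFailure)
}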

@gengliangwang (Contributor, Author) commented Jun 12, 2019

@cloud-fan @ueshin @maropu @dongjoon-hyun @rxin @rdblue @gatorsmile @HyukjinKwon
Actually, I am not entirely confident about this; I would like to hear your thoughts.

@rxin (Contributor) commented Jun 12, 2019

We shouldn't do this. Decimal guarantees no loss of precision (which is important for financial calculations). This type of cast is very dangerous.

@gengliangwang (Contributor, Author) commented Jun 12, 2019

@rxin Thanks, +1.
I think we can close this one.

@SparkQA commented Jun 12, 2019

Test build #106411 has finished for PR 24849 at commit 44b9fef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@rdblue (Contributor) commented Jun 12, 2019

@wangyum

@wangyum (Contributor) commented Jun 27, 2019

At least PostgreSQL, Vertica, SQL Server, and DB2 allow casting decimal to double/float.

PostgreSQL:

postgres=# select version();
                                                             version
----------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.3 (Debian 11.3-1.pgdg90+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516, 64-bit
(1 row)

postgres=# create table test_spark_27856(c1 float8);
CREATE TABLE
postgres=#  insert into test_spark_27856 values(cast(1123456789.01234567890123456 as numeric(38, 20)));
INSERT 0 1
postgres=# select c1, cast(1123456789.01234567890123456 as numeric(38, 20)) from test_spark_27856;
        c1        |             numeric
------------------+---------------------------------
 1123456789.01235 | 1123456789.01234567890123456000
(1 row)

Vertica:

dbadmin=> select version();
              version
------------------------------------
 Vertica Analytic Database v9.1.1-0
(1 row)

dbadmin=> create table test_spark_27856(c1 float8);
CREATE TABLE
dbadmin=> insert into test_spark_27856 values(cast(1123456789.01234567890123456 as numeric(38, 20)));
 OUTPUT
--------
      1
(1 row)

dbadmin=> select c1, cast(1123456789.01234567890123456 as numeric(38, 20)) from test_spark_27856;
        c1        |            ?column?
------------------+---------------------------------
 1123456789.01235 | 1123456789.01234567890123456000
(1 row)

SQL Server:

1> select @@version
2> go

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Microsoft SQL Server 2019 (CTP3.0) - 15.0.1600.8 (X64)
	May 17 2019 00:56:19
	Copyright (C) 2019 Microsoft Corporation
	Developer Edition (64-bit) on Linux (Ubuntu 16.04.6 LTS) <X64>

(1 rows affected)
1> create table test_spark_27856(c1 real)
2> go
1> insert into test_spark_27856 values(cast(112345678901234567890.123456 as decimal(38,10)));
2> go

(1 rows affected)
1> select c1, cast(112345678901234567890.123456 as decimal(38,10)) from test_spark_27856
2> go
c1
-------------- ----------------------------------------
 1.1234568E+20         112345678901234567890.1234560000

(1 rows affected)

DB2:

[db2inst1@2f3c821d36b7 ~]$ db2 "select service_level from table (sysproc.env_get_inst_info())"

SERVICE_LEVEL
--------------------------------------------------------------------------------------------------------------------------------
DB2 v10.5.0.5

  1 record(s) selected.

[db2inst1@2f3c821d36b7 ~]$ db2 "create table test_spark_27856(c1 double)"
DB20000I  The SQL command completed successfully.
[db2inst1@2f3c821d36b7 ~]$ db2 "insert into test_spark_27856 values(cast(112345678901234567890.123456 as decimal(31,10)))"
DB20000I  The SQL command completed successfully.
[db2inst1@2f3c821d36b7 ~]$ db2 "select c1, cast(112345678901234567890.123456 as decimal(31,10)) from test_spark_27856"

C1                       2
------------------------ ---------------------------------
  +1.12345678901235E+020  112345678901234567890.1234560000

  1 record(s) selected.
@cloud-fan (Contributor) commented Jun 27, 2019

I think this makes sense for table insertion. In Postgres, we can even insert a float/double value into an int/long column.

Truncation seems not that dangerous for table insertion.

@gengliangwang (Contributor, Author) commented Jun 27, 2019

@wangyum @cloud-fan Thanks a lot for the investigation.
I have also investigated MySQL, Oracle, and PostgreSQL; all of them allow casting a float/double value to an int/long column. We can't use their behavior as a reference, since their data type casting allows precision loss.
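As a quick plain-Scala sketch of the kind of loss those databases accept, converting a fractional double to a long simply drops the fraction:

assert(1115.3249.toLong == 1115L)   // fractional part silently dropped
assert((-0.9).toLong == 0L)         // truncation toward zero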

The safe casting we are proposing for Spark 3.0 is entirely new behavior: it checks data types statically in the analyzer. However, considering that most popular DBMSs use implicit casting, are we being too strict for this specific case, decimal -> double/float?
It is true that the computation results can differ between the following:

scala> BigDecimal("1115.32") + BigDecimal("0.0049")
res0: scala.math.BigDecimal = 1115.3249

scala> 1115.32 +0.0049
res1: Double = 1115.3248999999998

But there is no precision loss in the cast itself:

scala> BigDecimal(1115.3249).toDouble
res2: Double = 1115.3249
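For contrast, a plain-Scala sketch of a decimal that is wider than a double losing digits in the same round trip (using the value from the PostgreSQL example above):

val wide     = BigDecimal("1123456789.01234567890123456")   // 27 significant digits
val narrowed = BigDecimal(wide.toDouble)                     // at most ~17 survive
assert(narrowed != wide)                                     // trailing digits are gone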

I think we should reconsider the proposal in this PR unless there is a better solution.

BTW, I have been thinking about other solutions to this problem, e.g. allowing such casting only for LocalRelation, or only for table insertion. But allowing special cases makes the casting behavior inconsistent. The proposal in this PR is simple and explainable.

@cloud-fan (Contributor) commented Jun 27, 2019

Regarding Upcast, it seems both float/double -> int/long and decimal -> float/double should be forbidden. But table insertion should allow more cases than Upcast. How about we update DataType.canWrite?

@gengliangwang (Contributor, Author) commented Jun 27, 2019

Regarding Upcast, it seems both float/double -> int/long and decimal -> float/double should be forbidden.

A float/double value can be Infinity; I don't think allowing float/double -> int/long is a good idea.
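As a plain-Scala sketch of that hazard (JVM narrowing rules, not Spark's Cast), the special values are silently clamped or zeroed:

assert(Double.PositiveInfinity.toLong == Long.MaxValue)
assert(Double.NegativeInfinity.toLong == Long.MinValue)
assert(Double.NaN.toLong == 0L)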

How about we update DataType.canWrite?

I have tried creating a patch that allows such conversion for LocalRelation: https://github.com/apache/spark/compare/master...gengliangwang:allowUpcastDecimalLocalRelation?expand=1
The code is actually quite ugly. If canWrite returns true, the conversion is created as an Upcast. In that case, we can't simply update DataType.canWrite; we would have to mark some of the unsafe conversions as Cast.

I still suggest that we should simply allow upcasting tighter decimal to double/float.

@rdblue (Contributor) commented Jun 27, 2019

@wangyum, the problem is with implicit casts inserted by Spark, not explicit casts included in the query.

@rdblue (Contributor) commented Jun 27, 2019

@cloud-fan, DataType.canWrite should not allow decimal to double conversion. That loses information and must not be inserted automatically.

@rdblue (Contributor) commented Jun 27, 2019

I still suggest that we should simply allow upcasting tighter decimal to double/float.

This is not a safe cast.

In the DSv2 sync we discussed options that would work. Why not go with those?

@cloud-fan (Contributor) commented Jun 28, 2019

the problem is with implicit casts inserted by Spark, not explicit casts included in the query.

@rdblue it seems you misread the query.

The query insert into test_spark_27856 values(cast(112345678901234567890.123456 as decimal(31,10))) inserts a decimal value into a float column, so an implicit cast does happen. We could replace the explicit cast with a decimal literal to make this clearer.

@gengliangwang (Contributor, Author) commented Jun 28, 2019

This is not a safe cast.

I think it is explainable: there is no precision loss in the cast itself.

In the DSv2 sync we discussed options that would work. Why not go with those?

As I remember, we didn't reach a solution in the DSv2 sync, and we decided to come up with solutions offline.

@rdblue (Contributor) commented Jun 28, 2019

@cloud-fan, thanks for pointing that out. I see that it is inserting the cast.

@rdblue (Contributor) commented Jun 28, 2019

@gengliangwang, there were two suggestions in the DSv2 sync:

  • Add a decimal type for SQL literals that can be cast to float because the intended type of the literal is not known, or use some analysis rule that matches literals for the same purpose
  • Parse literals as floats and insert an implicit cast from float to decimal

Both of those options are still valid.

I'm surprised to see that decimals can be implicitly cast to float columns in most databases. Is this part of the SQL spec? If so, then I think we can go with that. Otherwise, I'd prefer one of the other options.

@gengliangwang (Contributor, Author) commented Jul 1, 2019

  • Add a decimal type for SQL literals that can be cast to float because the intended type of the literal is not known, or use some analysis rule that matches literals for the same purpose
  • Parse literals as floats and insert an implicit cast from float to decimal

Sorry, I meant that there was no conclusion in the sync. I came up with those proposals during the sync, but I don't think they are good enough.

@gengliangwang (Contributor, Author) commented Jul 1, 2019

@rdblue I have checked the 2011 edition of Information technology — Database languages — SQL — Part 2: Foundation (SQL/Foundation).

In section 15.10: Effect of inserting tables into base tables

So we can refer to the rules in section 9.2: Store assignment

From the ISO SQL standard: "If a value of the declared type of T can be obtained from V by rounding or truncation, then the value of T is set to that value".
Checking in the analyzer whether data types are strictly upcastable doesn't match the standard. I think we can consider a new approach:

  1. Always use Cast in table insertion when the input data type doesn't match the data type of the table column.
  2. Have Cast throw a runtime exception on failure, instead of returning null results (the two failure behaviors are contrasted in the sketch below).

The new approach would be configurable.
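A minimal plain-Scala sketch (hypothetical helpers, not Spark's Cast implementation) contrasting the two failure behaviors for a value that cannot be cast:

import scala.util.Try

// Policy 1 (current behaviour): the bad value becomes NULL in the target column.
def lenientCast(s: String): Option[Int] = Try(s.trim.toInt).toOption

// Policy 2 (proposed, configurable): the bad value fails the query at runtime.
def strictCast(s: String): Int =
  Try(s.trim.toInt).getOrElse(throw new NumberFormatException(s"Cannot cast '$s' to INT"))

lenientCast("abc")    // None, i.e. NULL
// strictCast("abc")  // would throw NumberFormatException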
