
Conversation

@Ngone51 (Member) commented Dec 13, 2019

What changes were proposed in this pull request?

This PR proposes to disallow negative scale of Decimal in Spark. It brings two behavior changes:

  1. for literals like 1.23E4BD or 1.23E4 (with spark.sql.legacy.exponentLiteralAsDecimal.enabled=true, see SPARK-29956), the (precision, scale) is now set to (5, 0) rather than (3, -2), as shown in the sketch below;
  2. a negative-scale check is added inside Decimal methods that expose setting the scale explicitly; if the check fails, an AnalysisException is thrown.

Users can still set spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true to restore the previous behavior.
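
As a rough illustration of change 1 (using plain java.math.BigDecimal rather than this PR's code), rescaling a negative-scale value to scale 0 widens its precision accordingly:

scala> val d = new java.math.BigDecimal("1.23E4")   // unscaled value 123
d: java.math.BigDecimal = 1.23E+4

scala> (d.precision, d.scale)                       // the old (precision, scale)
res0: (Int, Int) = (3,-2)

scala> val r = d.setScale(0)                        // 12300 with no fractional digits
r: java.math.BigDecimal = 12300

scala> (r.precision, r.scale)                       // matches the (5, 0) reported after this PR
res1: (Int, Int) = (5,0)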

Why are the changes needed?

According to the SQL standard,

4.4.2 Characteristics of numbers
An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer.

the scale of Decimal should always be non-negative. Other mainstream databases, such as Presto and PostgreSQL, also don't allow negative scale.

Presto:

presto:default> create table t (i decimal(2, -1));
Query 20191213_081238_00017_i448h failed: line 1:30: mismatched input '-'. Expecting: <integer>, <type>
create table t (i decimal(2, -1))

PostgreSQL:

postgres=# create table t(i decimal(2, -1));
ERROR:  NUMERIC scale -1 must be between 0 and precision 2
LINE 1: create table t(i decimal(2, -1));
                         ^

In fact, Spark itself already disallows creating a table with a negative-scale decimal type using SQL:

scala> spark.sql("create table t(i decimal(2, -1))");
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'create table t(i decimal(2, -'(line 1, pos 28)

== SQL ==
create table t(i decimal(2, -1))
----------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 35 elided

However, it is still possible to create such a table or DataFrame using the Spark SQL programmatic API:

scala> val tb = CatalogTable(
         TableIdentifier("test", None),
         CatalogTableType.MANAGED,
         CatalogStorageFormat.empty,
         StructType(StructField("i", DecimalType(2, -1)) :: Nil))
scala> spark.sql("SELECT 1.23E4BD")
res2: org.apache.spark.sql.DataFrame = [1.23E+4: decimal(3,-2)]

These two inconsistent behaviors could confuse users.

On the other hand, even if a user creates such a table or DataFrame with a negative-scale decimal type, the data can't be written out using formats like Parquet or ORC, because these formats have their own checks for negative scale and fail on it.

scala> spark.sql("SELECT 1.23E4BD").write.saveAsTable("parquet")
19/12/13 17:37:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Invalid DECIMAL scale: -2
	at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.decimalMetadata(Types.java:495)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:403)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:309)
	at org.apache.parquet.schema.Types$Builder.named(Types.java:290)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:428)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:334)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$2(ParquetSchemaConverter.scala:326)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convert(ParquetSchemaConverter.scala:326)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:97)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

So, I think it would be better to disallow negative scale entirely and make the behaviors above consistent.

Does this PR introduce any user-facing change?

Yes. When spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false (the default), users can no longer create a Decimal value with a negative scale.
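
A sketch of the intended round trip (output omitted; the expected types follow from the description above, and the config is the legacy flag added by this PR):

scala> spark.sql("SELECT 1.23E4BD")      // default: the literal is now typed as decimal(5,0)

scala> spark.sql("SET spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true")

scala> spark.sql("SELECT 1.23E4BD")      // legacy mode: decimal(3,-2), the pre-PR behavior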

How was this patch tested?

Added new tests in ExpressionParserSuite and DecimalSuite;
Updated SQLQueryTestSuite.

@SparkQA commented Dec 13, 2019

Test build #115295 has finished for PR 26881 at commit caafaa6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Dec 14, 2019

Query 13 of subquery/in-subquery/in-set-operations.sql passed locally. And it seems it shouldn't be affected.

@Ngone51 (Member Author) commented Dec 14, 2019

Jenkins, retest this please.

@SparkQA commented Dec 14, 2019

Test build #115327 has finished for PR 26881 at commit caafaa6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Dec 14, 2019

I may need to update with the newest master branch and test again. Let me see...

@Ngone51 (Member Author) commented Jan 8, 2020

cc @cloud-fan

private[sql] def fromJavaBigDecimal(d: JavaBigDecimal): DecimalType = {
  val (precision, scale) = if (d.scale < 0 && SQLConf.get.ansiEnabled) {
    (d.precision - d.scale, 0)
  } else {
added 9d41ea2

@SparkQA commented Jan 9, 2020

Test build #116356 has finished for PR 26881 at commit 9d41ea2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2020

Test build #116363 has finished for PR 26881 at commit 7bd8478.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 10, 2020

Test build #116447 has finished for PR 26881 at commit f3f34f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

also cc @viirya @maropu

@maropu (Member) commented Jan 10, 2020

Oh, I see. Do we need to keep this Spark-specific behaviour for future releases? If no DBMS-like system accepts negative scales, I think it is worth making non-negative scale the default and keeping the old behaviour (negative scale) behind a legacy option.

@cloud-fan (Contributor) commented:

@maropu ah good point! I think we should disallow it by default.

@Ngone51 can you create a new legacy config instead of using the ansi mode?

@SparkQA commented Jan 10, 2020

Test build #116481 has finished for PR 26881 at commit 2c2df5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Jan 10, 2020

@maropu Thanks for the suggestion! Addressed.

-- !query 21
select 0.3, -0.8, .5, -.18, 0.1111, .1111
-- !query 21 schema
struct<0.3:decimal(1,1),-0.8:decimal(1,1),0.5:decimal(1,1),-0.18:decimal(2,2),0.1111:decimal(4,4),0.1111:decimal(4,4)>
@Ngone51 (Member Author) commented on the diff:

Actually, I think this is an existing bug in Spark: for a number less than 1, its precision and scale differ between Decimal and DecimalType when it is created from a literal (because precision and scale are defined separately). This PR fixes it with:

private[sql] def fromDecimal(d: Decimal): DecimalType = DecimalType(d.precision, d.scale)

A Member replied:

Is this a bug? For example:

hive> create table testrel as select 0.3;
hive> describe testrel;
OK
_c0                 	decimal(1,1)   

Is it difficult to keep the current behaviour?
cc: @gatorsmile @cloud-fan

@Ngone51 (Member Author) replied:

It's true that other DBMSs would ignore the leftmost zero, which could result in a larger scale for values less than 1. I don't know whether Spark intentionally follows this. But AFAIK, for a number like 0.3, Spark gives it (precision, scale) of (2, 1) in Decimal but (1, 1) in DecimalType.

Maybe we should add this as a new feature in a follow-up PR.

A Contributor replied:

The precision/scale in the decimal type should be the ones we expect. Shall we update the underlying Decimal and correct the precision?

@Ngone51 (Member Author) replied:

@maropu @cloud-fan I've opened a separate PR #27217 to address this issue. PTAL.

@SparkQA commented Jan 10, 2020

Test build #116500 has finished for PR 26881 at commit d08789a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Jan 10, 2020

@Ngone51 Could you update the migration guide, too?

@viirya (Member) commented Jan 11, 2020

Now it is not controlled by ansi mode? Then we should update the PR title and description.

finally:
self.spark.sql("set spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false")

def test_create_dataframe_from_objects(self):
A Contributor commented:

indentation is wrong

buildConf("spark.sql.legacy.allowNegativeScaleOfDecimal.enabled")
.internal()
.doc("When set to true, negative scale of Decimal type is allowed. For example, " +
"the type of number 1E10 under legacy mode is DecimalType(2, -9), but is " +
A Contributor commented:

1E10 BD

}

test("SPARK-30252: Decimal should set zero scale rather than negative scale by default") {
  withSQLConf(SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key -> "false") {
A Contributor commented:

since the test name says "by default", we should not set config here.

}

test("SPARK-30252: Negative scale is not allowed by default") {
  withSQLConf(SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key -> "false") {
A Contributor commented:

ditto

@SparkQA commented Jan 16, 2020

Test build #116850 has finished for PR 26881 at commit 64704dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) left a comment:

The code looks fine to me if the tests passed.

@Ngone51 (Member Author) commented Jan 17, 2020

I reverted the max-precision check that had been added in set(decimal: BigDecimal) because it can break the overflow check. That is, Spark is allowed to create a decimal whose precision is larger than 38, and the overflow check then decides whether to return null or throw an exception, depending on the ansi setting. So, if we add the check early in set, we get the exception too early, before the overflow check runs.

For example, for the query below:

spark.sql("select cast(11111111111111111111.123 as decimal(23, 3)) * cast(99999999999999999999.123 as decimal(23, 3))").show

Without the max-precision check, we get null; with it, we get an exception.
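
To see where the overflow comes from, here is a rough digit count with plain java.math.BigDecimal (outside Spark, just to illustrate the reasoning above):

scala> val a = new java.math.BigDecimal("11111111111111111111.123")
scala> val b = new java.math.BigDecimal("99999999999999999999.123")
scala> a.multiply(b).precision   // far more digits than DecimalType.MAX_PRECISION (38), so the
                                 // overflow check has to decide between null and an exception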

@SparkQA commented Jan 17, 2020

Test build #116897 has finished for PR 26881 at commit 156c31f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 17, 2020

Test build #116921 has finished for PR 26881 at commit 603aed0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/sql/types.py", line 871, in __main__._parse_datatype_json_string
Failed example:
    check_datatype(DecimalType(1,-1))
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib64/pypy-2.5.1/lib-python/2.7/doctest.py", line 1315, in __run
        compileflags, 1) in test.globs
      File "<doctest __main__._parse_datatype_json_string[15]>", line 1, in <module>
        check_datatype(DecimalType(1,-1))
      File "<doctest __main__._parse_datatype_json_string[1]>", line 4, in check_datatype
        scala_datatype = spark._jsparkSession.parseDataType(datatype.json())
      File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/sql/utils.py", line 102, in deco
        raise converted
    AnalysisException: Negative scale is not allowed: -1. You can use spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true to enable legacy mode to allow it.;
**********************************************************************
   1 of  16 in __main__._parse_datatype_json_string
***Test Failed*** 1 failures.

@SparkQA commented Jan 17, 2020

Test build #116948 has finished for PR 26881 at commit 563853b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Jan 21, 2020

@cloud-fan @maropu @viirya Any more comments?

@cloud-fan closed this in ff39c92 on Jan 21, 2020
@cloud-fan (Contributor) commented:

thanks, merging to master!

Comment on lines +162 to +163
s"You can use spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true " +
s"to enable legacy mode to allow it.")
A Member commented:

nit: no need s"".

@viirya (Member) left a comment:

late LGTM.

@Ngone51 (Member Author) commented Jan 22, 2020

Thanks all!!!
