
Conversation

@Ngone51 (Member) commented Dec 13, 2019

What changes were proposed in this pull request?

This PR proposes to disallow negative scale of Decimal in Spark. It brings two behavior changes:

  1. for literals like 1.23E4BD or 1.23E4 (with spark.sql.legacy.exponentLiteralAsDecimal.enabled=true, see SPARK-29956), the (precision, scale) is now set to (5, 0) rather than (3, -2), as shown in the sketch below;
  2. a negative-scale check is added inside Decimal methods that expose setting the scale explicitly; if the check fails, an AnalysisException is thrown.

Users can still set spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true to restore the previous behavior.
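
As a rough illustration of change 1 (using plain java.math.BigDecimal rather than this PR's code), rescaling a negative-scale value to scale 0 widens its precision accordingly:

scala> val d = new java.math.BigDecimal("1.23E4")   // unscaled value 123
d: java.math.BigDecimal = 1.23E+4

scala> (d.precision, d.scale)                       // the old (precision, scale)
res0: (Int, Int) = (3,-2)

scala> val r = d.setScale(0)                        // 12300 with no fractional digits
r: java.math.BigDecimal = 12300

scala> (r.precision, r.scale)                       // matches the (5, 0) reported after this PR
res1: (Int, Int) = (5,0)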

Why are the changes needed?

According to the SQL standard,

4.4.2 Characteristics of numbers
An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer.

the scale of Decimal should always be non-negative. Other mainstream databases, such as Presto and PostgreSQL, also don't allow negative scale.

Presto:

presto:default> create table t (i decimal(2, -1));
Query 20191213_081238_00017_i448h failed: line 1:30: mismatched input '-'. Expecting: <integer>, <type>
create table t (i decimal(2, -1))

PostgreSQL:

postgres=# create table t(i decimal(2, -1));
ERROR:  NUMERIC scale -1 must be between 0 and precision 2
LINE 1: create table t(i decimal(2, -1));
                         ^

In fact, Spark itself already disallows creating a table with a negative-scale decimal type using SQL:

scala> spark.sql("create table t(i decimal(2, -1))");
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'create table t(i decimal(2, -'(line 1, pos 28)

== SQL ==
create table t(i decimal(2, -1))
----------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 35 elided

However, it is still possible to create such a table or DataFrame using the Spark SQL programmatic API:

scala> val tb = CatalogTable(
         TableIdentifier("test", None),
         CatalogTableType.MANAGED,
         CatalogStorageFormat.empty,
         StructType(StructField("i", DecimalType(2, -1)) :: Nil))
scala> spark.sql("SELECT 1.23E4BD")
res2: org.apache.spark.sql.DataFrame = [1.23E+4: decimal(3,-2)]

These two inconsistent behaviors could confuse users.

On the other hand, even if a user creates such a table or DataFrame with a negative-scale decimal type, the data can't be written out using formats like Parquet or ORC, because these formats have their own checks for negative scale and fail on it.

scala> spark.sql("SELECT 1.23E4BD").write.saveAsTable("parquet")
19/12/13 17:37:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Invalid DECIMAL scale: -2
	at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.decimalMetadata(Types.java:495)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:403)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:309)
	at org.apache.parquet.schema.Types$Builder.named(Types.java:290)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:428)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:334)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$2(ParquetSchemaConverter.scala:326)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convert(ParquetSchemaConverter.scala:326)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:97)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

So, I think it would be better to disallow negative scale entirely and make the behaviors above consistent.

Does this PR introduce any user-facing change?

Yes. When spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false (the default), users can no longer create a Decimal value with a negative scale.
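
A sketch of the intended round trip (output omitted; the expected types follow from the description above, and the config is the legacy flag added by this PR):

scala> spark.sql("SELECT 1.23E4BD")      // default: the literal is now typed as decimal(5,0)

scala> spark.sql("SET spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true")

scala> spark.sql("SELECT 1.23E4BD")      // legacy mode: decimal(3,-2), the pre-PR behavior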

How was this patch tested?

Added new tests in ExpressionParserSuite and DecimalSuite;
Updated SQLQueryTestSuite.

@SparkQA commented Dec 13, 2019

Test build #115295 has finished for PR 26881 at commit caafaa6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Dec 14, 2019

Query 13 of subquery/in-subquery/in-set-operations.sql passed locally. And it seems it shouldn't be affected.

@Ngone51 (Member Author) commented Dec 14, 2019

Jenkins, retest this please.

@SparkQA commented Dec 14, 2019

Test build #115327 has finished for PR 26881 at commit caafaa6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Dec 14, 2019

I may need to update with the newest master branch and test again. Let me see...

@Ngone51 (Member Author) commented Jan 8, 2020

cc @cloud-fan

private[sql] def fromJavaBigDecimal(d: JavaBigDecimal): DecimalType = {
  val (precision, scale) = if (d.scale < 0 && SQLConf.get.ansiEnabled) {
    (d.precision - d.scale, 0)
  } else {
added 9d41ea2

@SparkQA commented Jan 9, 2020

Test build #116356 has finished for PR 26881 at commit 9d41ea2.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 9, 2020

Test build #116363 has finished for PR 26881 at commit 7bd8478.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 10, 2020

Test build #116447 has finished for PR 26881 at commit f3f34f1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

also cc @viirya @maropu

@maropu (Member) commented Jan 10, 2020

Oh, I see. Do we need to keep this Spark-specific behaviour for future releases? If no DBMS-like system accepts negative scales, I think it is worth making non-negative scale the default and keeping the old behaviour (negative scale) behind a legacy option.

@cloud-fan (Contributor) commented:

@maropu ah good point! I think we should disallow it by default.

@Ngone51 can you create a new legacy config instead of using the ansi mode?

@SparkQA commented Jan 10, 2020

Test build #116481 has finished for PR 26881 at commit 2c2df5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Jan 10, 2020

@maropu Thanks for the suggestion! Addressed.

-- !query 21
select 0.3, -0.8, .5, -.18, 0.1111, .1111
-- !query 21 schema
struct<0.3:decimal(1,1),-0.8:decimal(1,1),0.5:decimal(1,1),-0.18:decimal(2,2),0.1111:decimal(4,4),0.1111:decimal(4,4)>
@Ngone51 (Member Author) commented on the diff:

Actually, I think this is an existing bug in Spark: for a number less than 1, its precision and scale differ between Decimal and DecimalType when it is created from a literal (because precision and scale are defined separately). This PR fixes it with:

private[sql] def fromDecimal(d: Decimal): DecimalType = DecimalType(d.precision, d.scale)

A Member replied:

Is this a bug? For example:

hive> create table testrel as select 0.3;
hive> describe testrel;
OK
_c0                 	decimal(1,1)   

Is it difficult to keep the current behaviour?
cc: @gatorsmile @cloud-fan

@Ngone51 (Member Author) replied:

It's true that other DBMSs would ignore the leftmost zero, which could result in a larger scale for values less than 1. I don't know whether Spark intentionally follows this. But AFAIK, for a number like 0.3, Spark gives it (precision, scale) of (2, 1) in Decimal but (1, 1) in DecimalType.

Maybe we should add this as a new feature in a follow-up PR.

A Contributor replied:

The precision/scale in the decimal type should be the ones we expect. Shall we update the underlying Decimal and correct the precision?

@Ngone51 (Member Author) replied:

@maropu @cloud-fan I've opened a separate PR #27217 to address this issue. PTAL.

@SparkQA commented Jan 10, 2020

Test build #116500 has finished for PR 26881 at commit d08789a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Jan 10, 2020

@Ngone51 Could you update the migration guide, too?

@viirya (Member) commented Jan 11, 2020

Now it is not controlled by ansi mode? Then we should update the PR title and description.

finally:
self.spark.sql("set spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false")

def test_create_dataframe_from_objects(self):
A Contributor commented:

indentation is wrong

buildConf("spark.sql.legacy.allowNegativeScaleOfDecimal.enabled")
.internal()
.doc("When set to true, negative scale of Decimal type is allowed. For example, " +
"the type of number 1E10 under legacy mode is DecimalType(2, -9), but is " +
A Contributor commented:

1E10 BD

}

test("SPARK-30252: Decimal should set zero scale rather than negative scale by default") {
  withSQLConf(SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key -> "false") {
A Contributor commented:

since the test name says "by default", we should not set config here.

}

test("SPARK-30252: Negative scale is not allowed by default") {
  withSQLConf(SQLConf.LEGACY_ALLOW_NEGATIVE_SCALE_OF_DECIMAL_ENABLED.key -> "false") {
A Contributor commented:

ditto

@SparkQA commented Jan 16, 2020

Test build #116850 has finished for PR 26881 at commit 64704dd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) left a comment:

The code looks fine to me if the tests passed.

@Ngone51 (Member Author) commented Jan 17, 2020

I reverted the max-precision check that had been added in set(decimal: BigDecimal) because it can break the overflow check. That is, Spark is allowed to create a decimal whose precision is larger than 38, and the overflow check then decides whether to return null or throw an exception, depending on the ansi setting. So, if we add the check early in set, we get the exception too early, before the overflow check runs.

For example, for the query below:

spark.sql("select cast(11111111111111111111.123 as decimal(23, 3)) * cast(99999999999999999999.123 as decimal(23, 3))").show

Without the max-precision check, we get null; with it, we get an exception.
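
To see where the overflow comes from, here is a rough digit count with plain java.math.BigDecimal (outside Spark, just to illustrate the reasoning above):

scala> val a = new java.math.BigDecimal("11111111111111111111.123")
scala> val b = new java.math.BigDecimal("99999999999999999999.123")
scala> a.multiply(b).precision   // far more digits than DecimalType.MAX_PRECISION (38), so the
                                 // overflow check has to decide between null and an exception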

@SparkQA commented Jan 17, 2020

Test build #116897 has finished for PR 26881 at commit 156c31f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 17, 2020

Test build #116921 has finished for PR 26881 at commit 603aed0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented:

File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/sql/types.py", line 871, in __main__._parse_datatype_json_string
Failed example:
    check_datatype(DecimalType(1,-1))
Exception raised:
    Traceback (most recent call last):
      File "/usr/lib64/pypy-2.5.1/lib-python/2.7/doctest.py", line 1315, in __run
        compileflags, 1) in test.globs
      File "<doctest __main__._parse_datatype_json_string[15]>", line 1, in <module>
        check_datatype(DecimalType(1,-1))
      File "<doctest __main__._parse_datatype_json_string[1]>", line 4, in check_datatype
        scala_datatype = spark._jsparkSession.parseDataType(datatype.json())
      File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/sql/utils.py", line 102, in deco
        raise converted
    AnalysisException: Negative scale is not allowed: -1. You can use spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true to enable legacy mode to allow it.;
**********************************************************************
   1 of  16 in __main__._parse_datatype_json_string
***Test Failed*** 1 failures.

@SparkQA commented Jan 17, 2020

Test build #116948 has finished for PR 26881 at commit 563853b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Ngone51 (Member Author) commented Jan 21, 2020

@cloud-fan @maropu @viirya Any more comments?

@cloud-fan closed this in ff39c92 on Jan 21, 2020
@cloud-fan (Contributor) commented:

thanks, merging to master!

Comment on lines +162 to +163
s"You can use spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true " +
s"to enable legacy mode to allow it.")
A Member commented:

nit: no need s"".

@viirya (Member) left a comment:

late LGTM.

@Ngone51 (Member Author) commented Jan 22, 2020

Thanks all!!!
