[SPARK-34246][SQL] New type coercion syntax rules in ANSI mode #31349

gengliangwang · 2021-01-26T15:45:16Z

What changes were proposed in this pull request?

In Spark ANSI mode, the type coercion rules are based on the type precedence lists of the input data types.
As per the section "Type precedence list determination" of "ISO/IEC 9075-2:2011 Information technology - Database languages - SQL - Part 2: Foundation (SQL/Foundation)", the type precedence lists of primitive data types are as follows:

Byte: Byte, Short, Int, Long, Decimal, Float, Double
Short: Short, Int, Long, Decimal, Float, Double
Int: Int, Long, Decimal, Float, Double
Long: Long, Decimal, Float, Double
Decimal: Any wider Numeric type
Float: Float, Double
Double: Double
String: String
Date: Date, Timestamp
Timestamp: Timestamp
Binary: Binary
Boolean: Boolean
Interval: Interval
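
As an illustration only, the lists above can be written as an ordered lookup table and combined with the implicit-cast rule from this description (S may be cast to T iff T is in the precedence list of S). This is a standalone Python model, not Spark code: the names `PRECEDENCE` and `can_implicitly_cast` are hypothetical, and "any wider numeric type" for Decimal is simplified to Float and Double, ignoring precision and scale.

```python
# Standalone model of the precedence lists in this PR description.
# Not Spark's implementation; all names here are hypothetical.
PRECEDENCE = {
    "Byte": ["Byte", "Short", "Int", "Long", "Decimal", "Float", "Double"],
    "Short": ["Short", "Int", "Long", "Decimal", "Float", "Double"],
    "Int": ["Int", "Long", "Decimal", "Float", "Double"],
    "Long": ["Long", "Decimal", "Float", "Double"],
    # Assumption: "any wider numeric type" is reduced to Float and Double,
    # ignoring decimal precision and scale.
    "Decimal": ["Decimal", "Float", "Double"],
    "Float": ["Float", "Double"],
    "Double": ["Double"],
    "String": ["String"],
    "Date": ["Date", "Timestamp"],
    "Timestamp": ["Timestamp"],
    "Binary": ["Binary"],
    "Boolean": ["Boolean"],
    "Interval": ["Interval"],
}


def can_implicitly_cast(source: str, target: str) -> bool:
    """S may be implicitly cast to T iff T is in the precedence list of S."""
    return target in PRECEDENCE[source]


print(can_implicitly_cast("Int", "Double"))      # True
print(can_implicitly_cast("Double", "Int"))      # False
print(can_implicitly_cast("Date", "Timestamp"))  # True
print(can_implicitly_cast("Int", "String"))      # False
```

Keeping each list ordered from narrowest to widest means the same table can later be reused to pick the tightest common type.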

As for complex data types, Spark determines the precedence list recursively based on their sub-types.

With the definition of the type precedence list, the general type coercion rules are as follows:

Data type S is allowed to be implicitly cast to type T iff T is in the precedence list of S.
Comparison is allowed iff the precedence lists of both sides have at least one common element. When evaluating the comparison, Spark casts both sides to the tightest common data type of their precedence lists.
For the following operators, there should be at least one common data type among all the children's precedence lists. The data type of the operator is the tightest common data type of those lists:

In, Except(odd), Intersect, Greatest, Least, Union, If, CaseWhen, CreateArray, Array Concat, Sequence, MapConcat, CreateMap
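
The comparison and operator rules above can be sketched as follows. This is a minimal Python model under stated assumptions, not the code in `AnsiTypeCoercion.scala`: the helper names (`precedence_list`, `tightest_common_type`, `operator_result_type`) are hypothetical, primitive types are plain strings, and Decimal is treated as a single type without precision or scale.

```python
from functools import reduce
from typing import List, Optional

# Numeric types ordered from narrowest to widest, as in the PR description.
NUMERIC_CHAIN = ["Byte", "Short", "Int", "Long", "Decimal", "Float", "Double"]


def precedence_list(t: str) -> List[str]:
    """Return the precedence list of a primitive type name."""
    if t in NUMERIC_CHAIN:
        # A numeric type's list is itself plus every wider numeric type.
        return NUMERIC_CHAIN[NUMERIC_CHAIN.index(t):]
    if t == "Date":
        return ["Date", "Timestamp"]
    return [t]


def tightest_common_type(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """First type in a's list that also appears in b's list, or None."""
    if a is None or b is None:
        return None  # an earlier mismatch stays a mismatch
    b_list = precedence_list(b)
    for candidate in precedence_list(a):
        if candidate in b_list:
            return candidate
    return None


def operator_result_type(child_types: List[str]) -> Optional[str]:
    """Result type for a multi-child operator; expects a non-empty list."""
    return reduce(tightest_common_type, child_types)


print(tightest_common_type("Int", "Float"))            # Float
print(tightest_common_type("Date", "Timestamp"))       # Timestamp
print(tightest_common_type("Int", "String"))           # None
print(operator_result_type(["Byte", "Long", "Float"])) # Float
```

A result of `None` corresponds to the case where the analyzer would reject the query with a data type mismatch.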

For complex types (struct, array, map), Spark recursively looks into the element type and applies the rules above. If the element nullability is converted from true to false, Spark adds a runtime null check on the elements.
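
The recursive step for complex types might look like the sketch below. It is a Python model with a hypothetical type encoding (primitives as strings; `("array", elem)`, `("map", key, value)` and `("struct", ((name, type), ...))` as tuples). Matching struct fields by position and name is an assumption made for this example, and nullability and the runtime null check are not modelled.

```python
NUMERIC_CHAIN = ["Byte", "Short", "Int", "Long", "Decimal", "Float", "Double"]


def precedence_list(t):
    """Precedence list of a primitive type name (Decimal is simplified)."""
    if t in NUMERIC_CHAIN:
        return NUMERIC_CHAIN[NUMERIC_CHAIN.index(t):]
    return ["Date", "Timestamp"] if t == "Date" else [t]


def tightest_primitive(a, b):
    """First type in a's list that also appears in b's list, or None."""
    b_list = precedence_list(b)
    return next((c for c in precedence_list(a) if c in b_list), None)


def tightest_common_type(a, b):
    """Recursively find the tightest common type of two (possibly complex) types."""
    # Primitive types are plain strings.
    if isinstance(a, str) and isinstance(b, str):
        return tightest_primitive(a, b)
    # Complex types are tuples; both sides must be the same kind.
    if not (isinstance(a, tuple) and isinstance(b, tuple)) or a[0] != b[0]:
        return None
    kind = a[0]
    if kind == "array":
        elem = tightest_common_type(a[1], b[1])
        return None if elem is None else ("array", elem)
    if kind == "map":
        key = tightest_common_type(a[1], b[1])
        value = tightest_common_type(a[2], b[2])
        return None if key is None or value is None else ("map", key, value)
    if kind == "struct":
        fields_a, fields_b = a[1], b[1]
        # Assumption: fields must match by position and name.
        if [n for n, _ in fields_a] != [n for n, _ in fields_b]:
            return None
        merged = []
        for (name, ta), (_, tb) in zip(fields_a, fields_b):
            t = tightest_common_type(ta, tb)
            if t is None:
                return None
            merged.append((name, t))
        return ("struct", tuple(merged))
    return None


print(tightest_common_type(("array", "Int"), ("array", "Long")))
# ('array', 'Long')
print(tightest_common_type(("map", "String", "Int"), ("map", "String", "Double")))
# ('map', 'String', 'Double')
print(tightest_common_type(("array", "Int"), ("array", "String")))
# None
```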

Note: this new type coercion system allows string literals to be implicitly converted to other primitive types, to avoid breaking too many existing Spark SQL queries. This is a special rule and does not come from the ANSI SQL standard.
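
To make the special rule concrete, here is a small Python model of it. It assumes an expression can be described by its data type plus a flag saying whether it is a literal; the `Expr` class, the `is_literal` flag and `PRIMITIVE_TYPES` are hypothetical names for this sketch and do not mirror Spark's classes.

```python
from dataclasses import dataclass

NUMERIC_CHAIN = ["Byte", "Short", "Int", "Long", "Decimal", "Float", "Double"]
PRIMITIVE_TYPES = set(NUMERIC_CHAIN) | {
    "String", "Date", "Timestamp", "Binary", "Boolean", "Interval",
}


def precedence_list(t):
    """Precedence list of a primitive type name (Decimal is simplified)."""
    if t in NUMERIC_CHAIN:
        return NUMERIC_CHAIN[NUMERIC_CHAIN.index(t):]
    return ["Date", "Timestamp"] if t == "Date" else [t]


@dataclass(frozen=True)
class Expr:
    """Hypothetical stand-in for an expression: a type and a literal flag."""
    data_type: str
    is_literal: bool = False


def can_implicitly_cast(expr: Expr, target: str) -> bool:
    # General rule: the target must be in the source's precedence list.
    if target in precedence_list(expr.data_type):
        return True
    # Special rule (not from the ANSI standard): a string *literal* may be
    # converted to another primitive type.
    return (
        expr.is_literal
        and expr.data_type == "String"
        and target in PRIMITIVE_TYPES
    )


print(can_implicitly_cast(Expr("String", is_literal=True), "Date"))  # True
print(can_implicitly_cast(Expr("String"), "Date"))                   # False
print(can_implicitly_cast(Expr("Int", is_literal=True), "String"))   # False
```

The last call shows that the rule is one-directional in this model: only string literals get special treatment, while numeric literals are not promoted to strings.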

Why are the changes needed?

The current type coercion rules are complex and very hard to describe and understand. For details, please refer to the attached document "Default Type coercion rules of Spark":
Default Type coercion rules of Spark.pdf

This PR creates a new, strict type coercion system under ANSI mode. The rules are simple and clean, so that users can follow them easily.

Does this PR introduce any user-facing change?

Yes, new implicit cast syntax rules in ANSI mode. All the details are in the first section of this description.

How was this patch tested?

Unit tests

gengliangwang · 2021-01-26T16:13:08Z

Note: I have added some follow-ups for this new feature https://issues.apache.org/jira/browse/SPARK-34246.
Breaking down the implementation can make the review easier.

gengliangwang · 2021-01-26T16:28:44Z

Here are the syntax rules for type precedence list determination from the ANSI SQL standard

SparkQA · 2021-01-26T16:40:56Z

Test build #134509 has started for PR 31349 at commit dfabacc.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

cloud-fan · 2021-01-27T15:19:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

... promotion to string seems not true for ansi mode?

SparkQA · 2021-01-27T19:36:46Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39148/

SparkQA · 2021-01-27T20:07:45Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39148/

SparkQA · 2021-01-27T20:38:17Z

Test build #134562 has finished for PR 31349 at commit 4a8ce27.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu

This update is pretty nice, thanks, @gengliangwang! Btw, could you summarize the major behaviour changes from the existing rules in "Does this PR introduce any user-facing change?" The description of the new behaviour looks pretty clear, but it is not easy to tell which implicit cast behaviours will actually change.

maropu · 2021-01-28T00:53:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala

maropu · 2021-01-28T01:36:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala

The standard says something about this behaviour, too?

No, this is not from the standard. It is learned from PostgreSQL. I planned to put this feature in a follow-up PR, but there were multiple test failures in "SQLQueryTestSuite", so I did it in this PR as well.
What do you think of it?

I just updated the comment to state that it is not from the standard.

Also, this is for passing TPCDSQueryANSISuite.

E.g. without promoting string literals, q83 will fail with

org.apache.spark.sql.AnalysisException: cannot resolve '(spark_catalog.default.date_dim.`d_date` IN ('2000-06-30', '2000-09-27', '2000-11-17'))' due to data type mismatch: Arguments must be same type but were: date != string; line 12 pos 17;

maropu · 2021-01-28T01:49:59Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

abstract? https://github.com/databricks/scala-style-guide#java-traits

SparkQA · 2021-01-28T07:39:25Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39182/

SparkQA · 2021-01-28T08:15:21Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39182/

SparkQA · 2021-01-28T08:49:49Z

Test build #134595 has finished for PR 31349 at commit 22a8979.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
abstract class TypeCoercionBase

SparkQA · 2021-01-28T09:33:47Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39196/

SparkQA · 2021-01-28T09:59:18Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39196/

SparkQA · 2021-01-28T13:24:04Z

Test build #134608 has finished for PR 31349 at commit 8283e98.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala

maropu · 2021-01-29T15:12:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala

What's the relationship between AnsiTypeCoercion and DecimalPrecision? In the existing code, it seems all the (decimal, decimal) cases are handled in DecimalPrecision.

Here I am merging findWiderTypeForDecimal into findTightestCommonType to make the logic clearer.
The DecimalPrecision rule decides the result data type of various operators involving decimals, e.g. Add, Subtract.

Ah, ok. The latest change looks good.

maropu · 2021-01-29T15:13:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala

nit: _ @ StringType() => StringType()

maropu · 2021-01-29T15:15:37Z

sql/core/src/test/resources/sql-tests/results/postgreSQL/text.sql.out

hm, string literal=>numeric literals is allowed, but numeric literals=>string literals is disallowed?

Yes, only string literals are specially handled.

SparkQA · 2021-02-03T05:55:50Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39397/

SparkQA · 2021-02-03T06:14:12Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39397/

SparkQA · 2021-02-03T10:04:30Z

Test build #134809 has finished for PR 31349 at commit 462eea1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-03T16:07:02Z

Test build #134839 has finished for PR 31349 at commit 51ca988.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2021-02-03T16:13:06Z

retest this please.

SparkQA · 2021-02-03T16:41:53Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39429/

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala

sql/core/src/test/resources/sql-tests/inputs/postgreSQL/with.sql

cloud-fan · 2021-02-23T16:14:19Z

sql/core/src/test/resources/sql-tests/results/postgreSQL/with.sql.out


 -- !query
-CREATE TABLE withz USING parquet AS SELECT i AS k, CAST(i AS string) || ' v' AS v FROM (SELECT EXPLODE(SEQUENCE(1, 16, 3)) i)
+CREATE TABLE withz USING parquet AS SELECT i AS k, CAST(i || ' v' AS string) v FROM (SELECT EXPLODE(SEQUENCE(1, 16, 3)) i)


I see what's going on now. This test is for the SQL WITH feature, not string concat, and we lose test coverage because we failed to create the table and following queries all fail.

Let's change it to CAST(i AS string) || ' v'.

yeah, that was my purpose.

SparkQA · 2021-02-23T17:04:43Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39961/

SparkQA · 2021-02-23T17:33:58Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39961/

SparkQA · 2021-02-23T18:14:06Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39962/

SparkQA · 2021-02-23T18:41:22Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39962/

SparkQA · 2021-02-23T18:43:25Z

Test build #135382 has finished for PR 31349 at commit 1c8a911.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-02-23T18:47:05Z

Test build #135381 has finished for PR 31349 at commit 025f915.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2021-02-24T00:30:10Z

It looks like a valid failure.


maropu

I've checked the latest changes and it looks fine.

gengliangwang · 2021-02-24T05:40:10Z

GA passes. Merging to master.
@maropu @cloud-fan thank you so much for the review!

SparkQA · 2021-02-24T09:16:17Z

Test build #135403 has finished for PR 31349 at commit 1f5ecdc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

…ype` for backward compatibility

### What changes were proposed in this pull request?

Change the definition of `findTightestCommonType` from
```
def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType]
```
to
```
val findTightestCommonType: (DataType, DataType) => Option[DataType]
```

### Why are the changes needed?

For backward compatibility. When running a MongoDB connector (built with Spark 3.1.1) with the latest master, there is such an error
```
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2
```
from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150

In the previous release, the function was
```
static public scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType ()
```
After #31349, the function becomes:
```
static public scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2)
```

This PR is to reduce the unnecessary API change.

### Does this PR introduce _any_ user-facing change?

Yes, the definition of `TypeCoercion.findTightestCommonType` is consistent with previous release again.

### How was this patch tested?

Existing unit tests

Closes #32493 from gengliangwang/typecoercion.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>


### What changes were proposed in this pull request?

Add documentation for the ANSI implicit cast rules which are introduced from #31349

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build and preview in local:
![image](https://user-images.githubusercontent.com/1097932/127149039-f0cc4766-8eca-4061-bc35-c8e67f009544.png)
![image](https://user-images.githubusercontent.com/1097932/127149072-1b65ef56-65ff-4327-9a5e-450d44719073.png)
![image](https://user-images.githubusercontent.com/1097932/127033375-b4536854-ca72-42fa-8ea9-dde158264aa5.png)
![image](https://user-images.githubusercontent.com/1097932/126950445-435ba521-92b8-44d1-8f2c-250b9efb4b98.png)
![image](https://user-images.githubusercontent.com/1097932/126950495-9aa4e960-60cd-4b20-88d9-b697ff57a7f7.png)

Closes #33516 from gengliangwang/addDoc.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Serge Rielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

### What changes were proposed in this pull request?

Add documentation for the ANSI implicit cast rules which are introduced from #31349

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build and preview in local:
![image](https://user-images.githubusercontent.com/1097932/127149039-f0cc4766-8eca-4061-bc35-c8e67f009544.png)
![image](https://user-images.githubusercontent.com/1097932/127149072-1b65ef56-65ff-4327-9a5e-450d44719073.png)
![image](https://user-images.githubusercontent.com/1097932/127033375-b4536854-ca72-42fa-8ea9-dde158264aa5.png)
![image](https://user-images.githubusercontent.com/1097932/126950445-435ba521-92b8-44d1-8f2c-250b9efb4b98.png)
![image](https://user-images.githubusercontent.com/1097932/126950495-9aa4e960-60cd-4b20-88d9-b697ff57a7f7.png)

Closes #33516 from gengliangwang/addDoc.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Serge Rielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df98d5b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>


gengliangwang force-pushed the ansiImplicitConversion branch from 5233edb to 0faed1f Compare January 26, 2021 15:45

gengliangwang requested review from maropu, cloud-fan and HyukjinKwon January 26, 2021 15:46

github-actions bot added the SQL label Jan 26, 2021

gengliangwang changed the title ~~[SPARK-34246][SQL] New implicit cast syntax rules in ANSI mode~~ [SPARK-34246][SQL] New type coercion syntax rules in ANSI mode Jan 26, 2021

cloud-fan reviewed Jan 27, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

cloud-fan reviewed Jan 27, 2021

gengliangwang force-pushed the ansiImplicitConversion branch from dfabacc to 4a8ce27 Compare January 27, 2021 17:49

maropu reviewed Jan 28, 2021

maropu reviewed Jan 29, 2021

cloud-fan reviewed Feb 23, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala

cloud-fan reviewed Feb 23, 2021

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AnsiTypeCoercion.scala

cloud-fan reviewed Feb 23, 2021

sql/core/src/test/resources/sql-tests/inputs/postgreSQL/with.sql

address comments

025f915

cloud-fan reviewed Feb 23, 2021

update with.sql

1c8a911

cloud-fan approved these changes Feb 23, 2021

gengliangwang added 2 commits February 24, 2021 11:58

fix query outputs

7a57144

Merge remote-tracking branch 'upstream/master' into ansiImplicitConve…

1f5ecdc

…rsion

maropu approved these changes Feb 24, 2021

gengliangwang closed this in 5d9cfd7 Feb 24, 2021

cloud-fan mentioned this pull request Mar 4, 2021

[SPARK-27658][SQL] Add FunctionCatalog API #24559

Closed

gengliangwang mentioned this pull request May 10, 2021

[SPARK-34246][SQL][FOLLOWUP] Change the definition of findTightestCommonType for backward compatibility #32493

Closed

gengliangwang mentioned this pull request Jul 26, 2021

[SPARK-34249][DOCS] Add documentation for ANSI implicit cast rules #33516

Closed

vitaliili-db mentioned this pull request May 5, 2022

[SPARK-39060][SQL][3.2] Typo in error messages of decimal overflow #36458

Closed
