Skip to content

Conversation

gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Jan 26, 2021

What changes were proposed in this pull request?

In Spark ANSI mode, the type coercion rules are based on the type precedence lists of the input data types.
As per the section "Type precedence list determination" of "ISO/IEC 9075-2:2011
Information technology — Database languages - SQL — Part 2: Foundation (SQL/Foundation)", the type precedence lists of primitive data types are as following:

  • Byte: Byte, Short, Int, Long, Decimal, Float, Double
  • Short: Short, Int, Long, Decimal, Float, Double
  • Int: Int, Long, Decimal, Float, Double
  • Long: Long, Decimal, Float, Double
  • Decimal: Any wider Numeric type
  • Float: Float, Double
  • Double: Double
  • String: String
  • Date: Date, Timestamp
  • Timestamp: Timestamp
  • Binary: Binary
  • Boolean: Boolean
  • Interval: Interval

As for complex data types, Spark will determine the precedent list recursively based on their sub-types.

With the definition of type precedent list, the general type coercion rules are as following:

  • Data type S is allowed to be implicitly cast as type T iff T is in the precedence list of S
  • Comparison is allowed iff the data type precedence list of both sides has at least one common element. When evaluating the comparison, Spark casts both sides as the tightest common data type of their precedent lists.
  • There should be at least one common data type among all the children's precedence lists for the following operators. The data type of the operator is the tightest common precedent data type.
 In, Except(odd), Intersect, Greatest, Least, Union, If, CaseWhen, CreateArray, Array Concat,Sequence, MapConcat, CreateMap
  • For complex types (struct, array, map), Spark recursively looks into the element type and applies the rules above. If the element nullability is converted from true to false, add runtime null check to the elements.

Note: this new type coercion system will allow implicit converting String type literals as other primitive types, in case of breaking too many existing Spark SQL queries. This is a special rule and it is not from the ANSI SQL standard.

Why are the changes needed?

The current type coercion rules are complex. Also, they are very hard to describe and understand. For details please refer the attached documentation "Default Type coercion rules of Spark"
Default Type coercion rules of Spark.pdf

This PR is to create a new and strict type coercion system under ANSI mode. The rules are simple and clean, so that users can follow them easily

Does this PR introduce any user-facing change?

Yes, new implicit cast syntax rules in ANSI mode. All the details are in the first section of this description.

How was this patch tested?

Unit tests

@gengliangwang
Copy link
Member Author

Note: I have added some follow-ups for this new feature https://issues.apache.org/jira/browse/SPARK-34246.
Breaking down the implementation can make the review easier.

@gengliangwang
Copy link
Member Author

Here are the syntax rules of Type precedence list determination from ANSI SQL standard
image
image
image

@gengliangwang gengliangwang changed the title [SPARK-34246][SQL] New implicit cast syntax rules in ANSI mode [SPARK-34246][SQL] New type coercion syntax rules in ANSI mode Jan 26, 2021
@SparkQA
Copy link

SparkQA commented Jan 26, 2021

Test build #134509 has started for PR 31349 at commit dfabacc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... promotion to string seems not true for ansi mode?

@gengliangwang gengliangwang force-pushed the ansiImplicitConversion branch from dfabacc to 4a8ce27 Compare January 27, 2021 17:49
@SparkQA
Copy link

SparkQA commented Jan 27, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39148/

@SparkQA
Copy link

SparkQA commented Jan 27, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39148/

@SparkQA
Copy link

SparkQA commented Jan 27, 2021

Test build #134562 has finished for PR 31349 at commit 4a8ce27.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This update is pretty nice, thanks, @gengliangwang ! Btw, could you summarize major behaviour changes from the existing one in Does this PR introduce any user-facing change? The description about the new behaivour looks pretty clear, but it is not easy to understand which implicit cast behavior will change, actually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private

Comment on lines 75 to 76
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The standard says something about this behaviour, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is not from the standard. It is learned from PostgreSQL. I planed to put this feature in another follow-up PR. But there were multiple test failures from the "SQLQueryTestSuite", so I did it in this PR as well.
What do you think of it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just updated the comment to claim that it is not from the standard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is for passing TPCDSQueryANSISuite.

E.g. without promoting string literals, q83 will fail with

org.apache.spark.sql.AnalysisException: cannot resolve '(spark_catalog.default.date_dim.`d_date` IN ('2000-06-30', '2000-09-27', '2000-11-17'))' due to data type mismatch: Arguments must be same type but were: date != string; line 12 pos 17;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39182/

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39182/

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Test build #134595 has finished for PR 31349 at commit 22a8979.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • abstract class TypeCoercionBase

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39196/

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39196/

@SparkQA
Copy link

SparkQA commented Jan 28, 2021

Test build #134608 has finished for PR 31349 at commit 8283e98.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 132 to 133
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a relationship between AnsiTypeCoercion and DecimalPrecision? In the existing DecimalPrecision, it seems all the (decimal, decimal) cases are handled in DecimalPrecision though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here it I am merging findWiderTypeForDecimal into findTightestCommonType, to make the logic more clear.
For the rule DecimalPrecision, it decides the result data type of various operators involving decimals, e.g. Add, Subtract.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok. The latest change looks good.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: _ @ StringType() => StringType()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm, string literal=>numeric literals is allowed, but numeric literals=>string literals is disallowed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, only string literals are specially handled.

@SparkQA
Copy link

SparkQA commented Feb 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39397/

@SparkQA
Copy link

SparkQA commented Feb 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39397/

@SparkQA
Copy link

SparkQA commented Feb 3, 2021

Test build #134809 has finished for PR 31349 at commit 462eea1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 3, 2021

Test build #134839 has finished for PR 31349 at commit 51ca988.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member Author

retest this please.

@SparkQA
Copy link

SparkQA commented Feb 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39429/


-- !query
CREATE TABLE withz USING parquet AS SELECT i AS k, CAST(i AS string) || ' v' AS v FROM (SELECT EXPLODE(SEQUENCE(1, 16, 3)) i)
CREATE TABLE withz USING parquet AS SELECT i AS k, CAST(i || ' v' AS string) v FROM (SELECT EXPLODE(SEQUENCE(1, 16, 3)) i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what's going on now. This test is for the SQL WITH feature, not string concat, and we lose test coverage because we failed to create the table and following queries all fail.

Let's change it to CAST(i AS string) || ' v'.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that was my purpose.

@SparkQA
Copy link

SparkQA commented Feb 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39961/

@SparkQA
Copy link

SparkQA commented Feb 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39961/

@SparkQA
Copy link

SparkQA commented Feb 23, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39962/

@SparkQA
Copy link

SparkQA commented Feb 23, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39962/

@SparkQA
Copy link

SparkQA commented Feb 23, 2021

Test build #135382 has finished for PR 31349 at commit 1c8a911.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 23, 2021

Test build #135381 has finished for PR 31349 at commit 025f915.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Feb 24, 2021

It looks like a valid failure.

Copy link
Member

@maropu maropu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the latest changes and it looks fine.

@gengliangwang
Copy link
Member Author

GA passes. Merging to master.
@maropu @cloud-fan thank you so much for the review!

@SparkQA
Copy link

SparkQA commented Feb 24, 2021

Test build #135403 has finished for PR 31349 at commit 1f5ecdc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

gengliangwang added a commit that referenced this pull request May 10, 2021
…ype` for backward compatibility

### What changes were proposed in this pull request?

Change the definition of `findTightestCommonType` from
```
def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType]
```
to
```
val findTightestCommonType: (DataType, DataType) => Option[DataType]
```

### Why are the changes needed?

For backward compatibility.
When running a MongoDB connector (built with Spark 3.1.1) with the latest master, there is such an error
```
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2
```
from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150

In the previous release, the function was
```
static public  scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType ()
```
After #31349, the function becomes:
```
static public  scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2)
```

This PR is to reduce the unnecessary API change.
### Does this PR introduce _any_ user-facing change?

Yes, the definition of `TypeCoercion.findTightestCommonType`  is consistent with previous release again.

### How was this patch tested?

Existing unit tests

Closes #32493 from gengliangwang/typecoercion.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request May 10, 2021
…ype` for backward compatibility

### What changes were proposed in this pull request?

Change the definition of `findTightestCommonType` from
```
def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType]
```
to
```
val findTightestCommonType: (DataType, DataType) => Option[DataType]
```

### Why are the changes needed?

For backward compatibility.
When running a MongoDB connector (built with Spark 3.1.1) with the latest master, there is such an error
```
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2
```
from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150

In the previous release, the function was
```
static public  scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType ()
```
After apache/spark#31349, the function becomes:
```
static public  scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2)
```

This PR is to reduce the unnecessary API change.
### Does this PR introduce _any_ user-facing change?

Yes, the definition of `TypeCoercion.findTightestCommonType`  is consistent with previous release again.

### How was this patch tested?

Existing unit tests

Closes #32493 from gengliangwang/typecoercion.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
cloud-fan pushed a commit that referenced this pull request Jul 27, 2021
### What changes were proposed in this pull request?

Add documentation for the ANSI implicit cast rules which are introduced from #31349

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Build and preview in local:
![image](https://user-images.githubusercontent.com/1097932/127149039-f0cc4766-8eca-4061-bc35-c8e67f009544.png)
![image](https://user-images.githubusercontent.com/1097932/127149072-1b65ef56-65ff-4327-9a5e-450d44719073.png)

![image](https://user-images.githubusercontent.com/1097932/127033375-b4536854-ca72-42fa-8ea9-dde158264aa5.png)
![image](https://user-images.githubusercontent.com/1097932/126950445-435ba521-92b8-44d1-8f2c-250b9efb4b98.png)
![image](https://user-images.githubusercontent.com/1097932/126950495-9aa4e960-60cd-4b20-88d9-b697ff57a7f7.png)

Closes #33516 from gengliangwang/addDoc.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Serge Rielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Jul 27, 2021
### What changes were proposed in this pull request?

Add documentation for the ANSI implicit cast rules which are introduced from #31349

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Build and preview in local:
![image](https://user-images.githubusercontent.com/1097932/127149039-f0cc4766-8eca-4061-bc35-c8e67f009544.png)
![image](https://user-images.githubusercontent.com/1097932/127149072-1b65ef56-65ff-4327-9a5e-450d44719073.png)

![image](https://user-images.githubusercontent.com/1097932/127033375-b4536854-ca72-42fa-8ea9-dde158264aa5.png)
![image](https://user-images.githubusercontent.com/1097932/126950445-435ba521-92b8-44d1-8f2c-250b9efb4b98.png)
![image](https://user-images.githubusercontent.com/1097932/126950495-9aa4e960-60cd-4b20-88d9-b697ff57a7f7.png)

Closes #33516 from gengliangwang/addDoc.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Serge Rielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df98d5b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
a0x8o added a commit to a0x8o/spark that referenced this pull request Jul 27, 2021
### What changes were proposed in this pull request?

Add documentation for the ANSI implicit cast rules which are introduced from apache/spark#31349

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Build and preview in local:
![image](https://user-images.githubusercontent.com/1097932/127149039-f0cc4766-8eca-4061-bc35-c8e67f009544.png)
![image](https://user-images.githubusercontent.com/1097932/127149072-1b65ef56-65ff-4327-9a5e-450d44719073.png)

![image](https://user-images.githubusercontent.com/1097932/127033375-b4536854-ca72-42fa-8ea9-dde158264aa5.png)
![image](https://user-images.githubusercontent.com/1097932/126950445-435ba521-92b8-44d1-8f2c-250b9efb4b98.png)
![image](https://user-images.githubusercontent.com/1097932/126950495-9aa4e960-60cd-4b20-88d9-b697ff57a7f7.png)

Closes #33516 from gengliangwang/addDoc.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Serge Rielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
xuanyuanking pushed a commit to xuanyuanking/spark that referenced this pull request Sep 29, 2021
### What changes were proposed in this pull request?

Add documentation for the ANSI implicit cast rules which are introduced from apache#31349

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Build and preview in local:
![image](https://user-images.githubusercontent.com/1097932/127149039-f0cc4766-8eca-4061-bc35-c8e67f009544.png)
![image](https://user-images.githubusercontent.com/1097932/127149072-1b65ef56-65ff-4327-9a5e-450d44719073.png)

![image](https://user-images.githubusercontent.com/1097932/127033375-b4536854-ca72-42fa-8ea9-dde158264aa5.png)
![image](https://user-images.githubusercontent.com/1097932/126950445-435ba521-92b8-44d1-8f2c-250b9efb4b98.png)
![image](https://user-images.githubusercontent.com/1097932/126950495-9aa4e960-60cd-4b20-88d9-b697ff57a7f7.png)

Closes apache#33516 from gengliangwang/addDoc.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Serge Rielau <serge@rielau.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df98d5b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants