Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34769][SQL]AnsiTypeCoercion: return closest convertible type among TypeCollection #31859

Closed

Conversation

gengliangwang
Copy link
Member

@gengliangwang gengliangwang commented Mar 17, 2021

What changes were proposed in this pull request?

Currently, when implicit casting a data type to a TypeCollection, Spark returns the first convertible data type among TypeCollection.
In ANSI mode, we can make the behavior more reasonable by returning the closet convertible data type in TypeCollection.

In details, we first try to find the all the expected types we can implicitly cast:

  1. if there is no convertible data types, return None;
  2. if there is only one convertible data type, cast input as it;
  3. otherwise if there are multiple convertible data types, find the closet data
    type among them. If there is no such closet data type, return None.

Note that if the closet type is Float type and the convertible types contains Double type, simply return Double type as the closet type to avoid potential
precision loss on converting the Integral type as Float type.

Why are the changes needed?

Make the type coercion rule for TypeCollection more reasonable and ANSI compatible.
E.g. returning Long instead of Double forimplicast(int, TypeCollect(Double, Long)).

From ANSI SQL Spec section 4.33 "SQL-invoked routines"
Screen Shot 2021-03-17 at 4 05 06 PM

Section 9.6 "Subject routine determination"
Screen Shot 2021-03-17 at 1 36 55 PM

Section 10.4 "routine invocation"
Screen Shot 2021-03-17 at 4 08 41 PM

Does this PR introduce any user-facing change?

Yes, in ANSI mode, implicit casting to a TypeCollection returns the narrowest convertible data type instead of the first convertible one.

How was this patch tested?

Unit tests.

@github-actions github-actions bot added the SQL label Mar 17, 2021
@gengliangwang gengliangwang changed the title [SPARK-34769][SQL]AnsiTypeCoercion: return narrowest convertible type among TypeCollection [SPARK-34769][SQL]AnsiTypeCoercion: return closest convertible type among TypeCollection Mar 17, 2021
@gengliangwang
Copy link
Member Author

BTW I will add Spark document for the ANSI Type Coercion after this one.

@SparkQA
Copy link

SparkQA commented Mar 17, 2021

Test build #136138 has finished for PR 31859 at commit 13f774a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment on lines 212 to 216
val narrowestCommonType = convertibleTypes.find { dt =>
convertibleTypes.forall { target =>
implicitCast(dt, target, isInputFoldable = false).isDefined
}
}
Copy link
Member

@maropu maropu Mar 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there is no explicit definition of narrowest data types. I'm not 100% sure about this logic though, does the find method (finding a first data type) works fine for any input?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu I have updated the related context in PR description. This part is quite complicated.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the find method (finding a first data type) works fine for any input?

The logic here is to find a data type that can be converted to all other convertible types. If there is one, it is the closest convertible type.

@HyukjinKwon HyukjinKwon self-assigned this Mar 18, 2021
@cloud-fan
Copy link
Contributor

Shall we simply fail if there are multiple matches in the type collection? It's a bit tricky to define "closest"

@gengliangwang
Copy link
Member Author

Shall we simply fail if there are multiple matches in the type collection? It's a bit tricky to define "closest"

That sounds not reasonable. E.g., failing an input of Integer type for TypeCollection[Long, Double].

@cloud-fan
Copy link
Contributor

Is there any function using TypeCollection[Long, Double]?

@gengliangwang
Copy link
Member Author

Is there any function using TypeCollection[Long, Double]?

There are similar expressions:
Divide: TypeCollection(DoubleType, DecimalType)
IntegralDivide: TypeCollection(LongType, DecimalType)

shouldNotCastStringInput(TypeCollection(NumericType, BinaryType))
// When there are multiple convertible types in the `TypeCollection` and there is no closest
// convertible data type among the convertible types.
shouldNotCastStringLiteral(TypeCollection(NumericType, BinaryType))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicates the previous case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is for String Literal, which can convert to any of (NumericType, BinaryType). The previous one is for String type column, which can't do the conversion.

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41018/

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41018/

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Test build #136434 has finished for PR 31859 at commit 60c4684.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Test build #136439 has started for PR 31859 at commit 4cea569.

// When there are multiple convertible types in the `TypeCollection`, use the closet convertible
// data type among convertible types.
shouldCast(IntegerType, TypeCollection(BinaryType, FloatType, LongType), LongType)
shouldCast(IntegerType, TypeCollection(BinaryType, LongType, NumericType), IntegerType)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't go to the code path to pick the closest convertible type. This hits case _ if expectedType.acceptsType(inType) => Some(inType)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right

@@ -377,10 +368,31 @@ class AnsiTypeCoercionSuite extends AnalysisTest {
ArrayType(StringType, true),
TypeCollection(ArrayType(StringType), StringType),
ArrayType(StringType, true))

// When there are multiple convertible types in the `TypeCollection`, use the closet convertible
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: closet -> closest

@@ -73,8 +73,8 @@ select left(null, -2), left("abcd", -2), left("abcd", 0), left("abcd", 'a')
-- !query schema
struct<>
-- !query output
java.lang.NumberFormatException
invalid input syntax for type numeric: a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move left(null, -2) out to a new query? otherwise we can't test the error message of left("abcd", 'a')

@@ -92,7 +92,7 @@ select right(null, -2), right("abcd", -2), right("abcd", 0), right("abcd", 'a')
struct<>
-- !query output
org.apache.spark.sql.AnalysisException
cannot resolve 'substring('abcd', (- CAST('a' AS DOUBLE)), 2147483647)' due to data type mismatch: argument 2 requires int type, however, '(- CAST('a' AS DOUBLE))' is of double type.; line 1 pos 61
cannot resolve 'substring(NULL, (- -2), 2147483647)' due to data type mismatch: argument 1 requires (string or binary) type, however, 'NULL' is of null type.; line 1 pos 7
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41023/

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41023/

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41032/

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41032/

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@SparkQA
Copy link

SparkQA commented Mar 24, 2021

Test build #136448 has finished for PR 31859 at commit a72dcfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan closed this in abfd9b2 Mar 24, 2021
@maropu
Copy link
Member

maropu commented Mar 24, 2021

late lgtm.

@HyukjinKwon HyukjinKwon removed their assignment Mar 29, 2021
@HyukjinKwon
Copy link
Member

I just found out that I mistakenly assigned it to myself .. I removed it back now ..

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
5 participants