New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27638][SQL]: Cast string to date/timestamp in binary comparisons with dates/timestamps #24567
Conversation
Some corner cases like |
case (StringType, DateType) => Some(StringType) | ||
case (DateType, StringType) => Some(StringType) | ||
case (StringType, DateType) => Some(DateType) | ||
case (DateType, StringType) => Some(DateType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this mean we always find the common type as date when any arbitrary strings are compared to any dates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it there any issue in your opinion?
@@ -123,8 +123,8 @@ object TypeCoercion { | |||
// We should cast all relative timestamp/date/string comparison into string comparisons | |||
// This behaves as a user would expect because timestamp strings sort lexicographically. | |||
// i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks we should update this comments too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks for pointing it out
It will be super weird if |
It returns null indeed. Sorry for any misunderstanding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change discussed with @gatorsmile before, he has a long-term planning.
// We should cast all relative timestamp/date/string comparison into string comparisons | ||
case (StringType, DateType) => Some(DateType) | ||
case (DateType, StringType) => Some(DateType) | ||
// We should cast all relative timestamp/string comparison into string comparisons |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it's justified to do it for date, I think we should do it for timestamp as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@pengbo Could you change the title of the PR to reflect the change. For example:
|
@@ -3024,6 +3024,20 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { | |||
sql("reset") | |||
} | |||
} | |||
|
|||
test("string date comparison") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, add test cases for <=>
, =
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -120,11 +120,11 @@ object TypeCoercion { | |||
*/ | |||
private def findCommonTypeForBinaryComparison( | |||
dt1: DataType, dt2: DataType, conf: SQLConf): Option[DataType] = (dt1, dt2) match { | |||
// We should cast all relative timestamp/date/string comparison into string comparisons | |||
case (StringType, DateType) => Some(DateType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have removed these 2 cases but the added test still completes successfully. We need a few tests that fail without the changes in findCommonTypeForBinaryComparison
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is explained by https://github.com/apache/spark/pull/24567/files#issuecomment-491579283
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Thanks for pointing it out. It simplifies the code a lot.
It seems the updated cases: case (StringType, DateType) => Some(DateType)
case (DateType, StringType) => Some(DateType) are sub-set of other cases in the same method spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala Lines 147 to 148 in 72a572f
I think the cases for pairs of |
Done |
@cloud-fan @HyukjinKwon Could you trigger build & tests. |
ok to test |
Test build #105342 has finished for PR 24567 at commit
|
A general note: eventually we do need to support different type coercion modes for different users. But the type coercion rule should be reasonable under any mode. It's not reasonable to compare date/timestamp and string by casting to string, as it returns the wrong result when the string is a partial date/timestamp. For safety, we can add a legacy config to fallback to the old behavior. |
…tamp comparison behavior & failed unit tests correction
@@ -1760,6 +1760,13 @@ object SQLConf { | |||
.internal() | |||
.intConf | |||
.createWithDefault(Int.MaxValue) | |||
|
|||
val LEGACY_CAST_DATE_TIMESTAMP_TO_STRING = | |||
buildConf("spark.sql.legacy.binaryComparison.castDateTimestampToString") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark.sql.legacy.typeCoercion.datetimeToString
|
||
val LEGACY_CAST_DATE_TIMESTAMP_TO_STRING = | ||
buildConf("spark.sql.legacy.binaryComparison.castDateTimestampToString") | ||
.doc("If it is set to true, date/timestamp will cast to string in binary comparisons " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also add a migration guide for this behavior change and mention this config?
retest it please |
Test build #105352 has finished for PR 24567 at commit
|
LGTM |
Test build #105354 has finished for PR 24567 at commit
|
thanks, merging to master! |
@@ -128,6 +128,8 @@ license: | | |||
|
|||
- Since Spark 3.0, if `hive.default.fileformat` is not found in `Spark SQL configuration` then it will fallback to hive-site.xml present in the `Hadoop configuration` of `SparkContext`. | |||
|
|||
- Since Spark 3.0, Spark will cast `String` to `Date/TimeStamp` in binary comparisons with dates/timestamps. The previous behaviour of casting `Date/Timestamp` to `String` can be restored by setting `spark.sql.legacy.typeCoercion.datetimeToString` to `true`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark.sql.legacy.typeCoercion.datetimeToString
-> spark.sql.legacy.typeCoercion.datetimeToString.enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @Ngone51
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What changes were proposed in this pull request?
The below example works with both Mysql and Hive, however not with spark.
The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420.
Based on some tests, the behavior of Date and String comparison in Hive and Mysql:
Hive: Cast to Date, partial date is not supported
Mysql: Cast to Date, certain "partial date" is supported by defining certain date string parse rules. Check out str_to_datetime in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
As below date patterns have been supported, the PR is to cast string to date when comparing string and date:
How was this patch tested?
UT has been added