[SPARK-27638][SQL]: Cast string to date/timestamp in binary comparisons with dates/timestamps #24567

Closed
wants to merge 7 commits into from

pengbo
Contributor

@pengbo pengbo commented May 9, 2019

What changes were proposed in this pull request?

The example below works in both MySQL and Hive, but not in Spark.

mysql> select * from date_test where date_col >= '2000-1-1';
+------------+
| date_col   |
+------------+
| 2000-01-01 |
+------------+

The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420.
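To make the failure concrete, here is a minimal plain-JVM sketch (no Spark involved) of the two comparison strategies. The object and method names are illustrative assumptions, not Spark APIs; the string variant only models "cast both sides to String and compare lexicographically".

```scala
import java.time.LocalDate

object StringVsDateComparison {
  // Modelled old behaviour: render the date as a string, then compare lexicographically.
  def geAsStrings(dateCol: LocalDate, literal: String): Boolean =
    dateCol.toString.compareTo(literal) >= 0

  // Modelled proposed behaviour: compare as dates (the literal is assumed already parsed).
  def geAsDates(dateCol: LocalDate, literal: LocalDate): Boolean =
    !dateCol.isBefore(literal)

  def main(args: Array[String]): Unit = {
    val d = LocalDate.of(2000, 1, 1)
    // "2000-01-01" vs "2000-1-1": the strings first differ at index 5, where '0' < '1'.
    println(geAsStrings(d, "2000-1-1"))             // false
    println(geAsDates(d, LocalDate.of(2000, 1, 1))) // true
  }
}
```

The string comparison rejects the row only because of zero padding, which is why the MySQL query above returns a row while the string-based comparison does not.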

Based on some tests, Date and String comparison behaves as follows in Hive and MySQL:
Hive: casts to Date; partial dates are not supported.
MySQL: casts to Date; certain "partial dates" are supported through specific date-string parsing rules. See str_to_datetime in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

Since the date patterns below are already supported, this PR casts the string to a date when comparing a string with a date:

`yyyy`
`yyyy-[m]m`
`yyyy-[m]m-[d]d`
`yyyy-[m]m-[d]d `
`yyyy-[m]m-[d]d *`
`yyyy-[m]m-[d]dT*`
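As a rough illustration of what accepting these patterns means, the sketch below parses them with a regular expression. It is not Spark's actual string-to-date cast: `LenientDate` and `parse` are hypothetical names, the year is restricted to four digits, and defaulting a missing month or day to 1 is an assumption of this sketch.

```scala
import java.time.LocalDate
import scala.util.Try

object LenientDate {
  // Year, optional month, optional day, then optionally a space or 'T' followed by anything.
  private val Pattern = """(\d{4})(?:-(\d{1,2})(?:-(\d{1,2})(?:[ T].*)?)?)?""".r

  // Returns None when the string matches none of the patterns or names an impossible date.
  def parse(s: String): Option[LocalDate] = s match {
    case Pattern(y, m, d) =>
      // Groups that did not take part in the match are null, hence Option(...).
      val month = Option(m).map(_.toInt).getOrElse(1)
      val day = Option(d).map(_.toInt).getOrElse(1)
      Try(LocalDate.of(y.toInt, month, day)).toOption
    case _ => None
  }
}
```

Under these assumptions `LenientDate.parse("2000-1-1")` yields the date 2000-01-01, while strings such as `"2000-"` or `"19999"` yield `None`.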

How was this patch tested?

Unit tests have been added.

@pengbo
Contributor Author

pengbo commented May 9, 2019

cc @cloud-fan @MaxGekk

With this solution, corner cases such as date_col > 'invalid_date_string' (for example date_col > '2000-' or '19999') will always return false, which seems reasonable in my opinion (MySQL behaves the same way).

- case (StringType, DateType) => Some(StringType)
- case (DateType, StringType) => Some(StringType)
+ case (StringType, DateType) => Some(DateType)
+ case (DateType, StringType) => Some(DateType)
Member

Doesn't this mean we always find the common type as date when any arbitrary strings are compared to any dates?

Contributor Author

Yes. Is there any issue with that, in your opinion?

@@ -123,8 +123,8 @@ object TypeCoercion {
// We should cast all relative timestamp/date/string comparison into string comparisons
// This behaves as a user would expect because timestamp strings sort lexicographically.
// i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
Member

Looks like we should update this comment too.

Contributor Author

Done. Thanks for pointing it out

@cloud-fan
Contributor

> Some corner cases like date_col > 'invalid_date_string' will always return false in this solution seems to be reasonable

It will be super weird if date_col > 'invalid_date_string', date_col < 'invalid_date_string' and date_col = 'invalid_date_string' all return false. I think returning null is more reasonable.

@pengbo
Contributor Author

pengbo commented May 10, 2019

> Some corner cases like date_col > 'invalid_date_string' will always return false in this solution seems to be reasonable
>
> It will be super weird if date_col > 'invalid_date_string', date_col < 'invalid_date_string' and date_col = 'invalid_date_string' all return false. I think returning null is more reasonable.

It does indeed return null. Sorry for the confusion.
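The distinction between "returns false" and "returns null" can be sketched with `Option[Boolean]`, where `None` stands in for SQL NULL. This is only a model under stated assumptions: `castToDate` is a hypothetical, strict ISO-only stand-in for the string-to-date cast, and `keptByFilter` models the standard SQL rule that a WHERE clause keeps a row only when its predicate is TRUE.

```scala
import java.time.LocalDate
import scala.util.Try

object NullableComparison {
  // Stand-in for the string-to-date cast: None models NULL for an unparseable string.
  def castToDate(s: String): Option[LocalDate] = Try(LocalDate.parse(s)).toOption

  // Each comparison yields None (NULL) whenever the cast fails.
  def greater(d: LocalDate, s: String): Option[Boolean] =
    castToDate(s).map(parsed => d.isAfter(parsed))
  def less(d: LocalDate, s: String): Option[Boolean] =
    castToDate(s).map(parsed => d.isBefore(parsed))
  def equal(d: LocalDate, s: String): Option[Boolean] =
    castToDate(s).map(parsed => d.isEqual(parsed))

  // A filter keeps a row only when the predicate is TRUE, so NULL drops the row.
  def keptByFilter(predicate: Option[Boolean]): Boolean = predicate.contains(true)
}
```

In this model all three comparisons against `'invalid_date_string'` evaluate to NULL rather than false, yet none of them lets a row through a filter, which is likely why the result first looked like "always false".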

Member

@wangyum wangyum left a comment

This change was discussed with @gatorsmile before; he has a long-term plan.

// We should cast all relative timestamp/date/string comparison into string comparisons
case (StringType, DateType) => Some(DateType)
case (DateType, StringType) => Some(DateType)
// We should cast all relative timestamp/string comparison into string comparisons
Contributor

If it's justified to do this for date, I think we should do it for timestamp as well.

Contributor Author

Done

@MaxGekk
Member

MaxGekk commented May 12, 2019

@pengbo Could you change the title of the PR to reflect the change. For example:

[SPARK-27638][SQL] Cast string to date in binary comparisons with dates

@@ -3024,6 +3024,20 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
sql("reset")
}
}

test("string date comparison") {
Member

Please, add test cases for <=>, =

Contributor Author

Done

@@ -120,11 +120,11 @@ object TypeCoercion {
*/
private def findCommonTypeForBinaryComparison(
dt1: DataType, dt2: DataType, conf: SQLConf): Option[DataType] = (dt1, dt2) match {
// We should cast all relative timestamp/date/string comparison into string comparisons
case (StringType, DateType) => Some(DateType)
Member

I have removed these 2 cases but the added test still completes successfully. We need a few tests that fail without the changes in findCommonTypeForBinaryComparison.

Contributor Author

Done.
Thanks for pointing it out. It simplifies the code a lot.

@MaxGekk
Member

MaxGekk commented May 12, 2019

It seems the updated cases:

case (StringType, DateType) => Some(DateType)
case (DateType, StringType) => Some(DateType)

are a subset of other cases in the same method:

case (l: StringType, r: AtomicType) if r != StringType => Some(r)
case (l: AtomicType, r: StringType) if l != StringType => Some(l)

I think the cases for pairs of StringType and DateType can simply be removed.
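The subsumption argument can be checked with a small standalone model. The types below are simplified stand-ins defined only for this sketch (they are not Spark's `DataType` hierarchy), and `commonType` contains just the two generic cases quoted above, not the full `findCommonTypeForBinaryComparison`.

```scala
object CoercionModel {
  // Simplified stand-ins for the Spark types involved; assumptions of this sketch.
  sealed trait DataType
  sealed trait AtomicType extends DataType
  case object StringType extends AtomicType
  case object DateType extends AtomicType
  case object TimestampType extends AtomicType

  // Only the two generic String/Atomic cases; no dedicated String/Date cases.
  def commonType(dt1: DataType, dt2: DataType): Option[DataType] = (dt1, dt2) match {
    case (StringType, r: AtomicType) if r != StringType => Some(r)
    case (l: AtomicType, StringType) if l != StringType => Some(l)
    case _ => None
  }
}
```

In this model `commonType(StringType, DateType)` and `commonType(DateType, StringType)` both resolve to `Some(DateType)` without any dedicated case, and the timestamp pairs resolve to `Some(TimestampType)` the same way.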

@pengbo pengbo changed the title from "[SPARK-27638][SQL]: date format 'yyyy-M-dd' string comparison not handled properly" to "[SPARK-27638][SQL]: Cast string to date/timestamp in binary comparisons with dates/timestamps" on May 12, 2019
@pengbo
Contributor Author

pengbo commented May 12, 2019

> @pengbo Could you change the title of the PR to reflect the change. For example:
>
> [SPARK-27638][SQL] Cast string to date in binary comparisons with dates

Done

@MaxGekk
Member

MaxGekk commented May 12, 2019

@cloud-fan @HyukjinKwon Could you trigger build & tests.

@kiszk
Member

kiszk commented May 12, 2019

ok to test

@SparkQA

SparkQA commented May 12, 2019

Test build #105342 has finished for PR 24567 at commit ac82eb2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

A general note: eventually we do need to support different type coercion modes for different users. But the type coercion rule should be reasonable under any mode. It's not reasonable to compare date/timestamp and string by casting to string, as it returns the wrong result when the string is a partial date/timestamp.

For safety, we can add a legacy config to fall back to the old behavior.

@@ -1760,6 +1760,13 @@ object SQLConf {
.internal()
.intConf
.createWithDefault(Int.MaxValue)

val LEGACY_CAST_DATE_TIMESTAMP_TO_STRING =
buildConf("spark.sql.legacy.binaryComparison.castDateTimestampToString")
Contributor

spark.sql.legacy.typeCoercion.datetimeToString


val LEGACY_CAST_DATE_TIMESTAMP_TO_STRING =
buildConf("spark.sql.legacy.binaryComparison.castDateTimestampToString")
.doc("If it is set to true, date/timestamp will cast to string in binary comparisons " +
Contributor

Can we also add a migration guide entry for this behavior change and mention this config?

@pengbo
Contributor Author

pengbo commented May 13, 2019

retest it please

@SparkQA

SparkQA commented May 13, 2019

Test build #105352 has finished for PR 24567 at commit 34b00e9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM

@SparkQA

SparkQA commented May 13, 2019

Test build #105354 has finished for PR 24567 at commit c3e80f7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 66f5a42 May 14, 2019
@@ -128,6 +128,8 @@ license: |

- Since Spark 3.0, if `hive.default.fileformat` is not found in `Spark SQL configuration` then it will fallback to hive-site.xml present in the `Hadoop configuration` of `SparkContext`.

- Since Spark 3.0, Spark will cast `String` to `Date/TimeStamp` in binary comparisons with dates/timestamps. The previous behaviour of casting `Date/Timestamp` to `String` can be restored by setting `spark.sql.legacy.typeCoercion.datetimeToString` to `true`.
Member

spark.sql.legacy.typeCoercion.datetimeToString -> spark.sql.legacy.typeCoercion.datetimeToString.enabled
