Skip to content
Permalink
Browse files

[SPARK-27638][SQL] Cast string to date/timestamp in binary comparison…

…s with dates/timestamps

## What changes were proposed in this pull request?

The below example works with both Mysql and Hive, however not with spark.

```
mysql> select * from date_test where date_col >= '2000-1-1';
+------------+
| date_col   |
+------------+
| 2000-01-01 |
+------------+
```
The reason is that Spark casts both sides to String type during date and string comparison for partial date support. Please find more details in https://issues.apache.org/jira/browse/SPARK-8420.

Based on some tests, the behavior of Date and String comparison in Hive and Mysql:
Hive: Cast to Date, partial date is not supported
Mysql: Cast to Date, certain "partial date" is supported by defining certain date string parse rules. Check out str_to_datetime in https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c

As below date patterns have been supported, the PR is to cast string to date when comparing string and date:
```
`yyyy`
`yyyy-[m]m`
`yyyy-[m]m-[d]d`
`yyyy-[m]m-[d]d `
`yyyy-[m]m-[d]d *`
`yyyy-[m]m-[d]dT*
```

## How was this patch tested?
UT has been added

Closes #24567 from pengbo/SPARK-27638.

Authored-by: mingbo.pb <mingbo.pb@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
  • Loading branch information...
mingbo.pb authored and cloud-fan committed May 14, 2019
1 parent 8b0bdaa commit 66f5a42ca5d259038f0749ae2b9a04cc2f658880
@@ -128,6 +128,8 @@ license: |

- Since Spark 3.0, if `hive.default.fileformat` is not found in `Spark SQL configuration` then it will fallback to hive-site.xml present in the `Hadoop configuration` of `SparkContext`.

- Since Spark 3.0, Spark will cast `String` to `Date/TimeStamp` in binary comparisons with dates/timestamps. The previous behaviour of casting `Date/Timestamp` to `String` can be restored by setting `spark.sql.legacy.typeCoercion.datetimeToString` to `true`.

## Upgrading from Spark SQL 2.4 to 2.4.1

- The value of `spark.executor.heartbeatInterval`, when specified without units like "30" rather than "30s", was
@@ -120,13 +120,14 @@ object TypeCoercion {
*/
private def findCommonTypeForBinaryComparison(
dt1: DataType, dt2: DataType, conf: SQLConf): Option[DataType] = (dt1, dt2) match {
// We should cast all relative timestamp/date/string comparison into string comparisons
// This behaves as a user would expect because timestamp strings sort lexicographically.
// i.e. TimeStamp(2013-01-01 00:00 ...) < "2014" = true
case (StringType, DateType) => Some(StringType)
case (DateType, StringType) => Some(StringType)
case (StringType, TimestampType) => Some(StringType)
case (TimestampType, StringType) => Some(StringType)
case (StringType, DateType)
=> if (conf.castDatetimeToString) Some(StringType) else Some(DateType)
case (DateType, StringType)
=> if (conf.castDatetimeToString) Some(StringType) else Some(DateType)
case (StringType, TimestampType)
=> if (conf.castDatetimeToString) Some(StringType) else Some(TimestampType)
case (TimestampType, StringType)
=> if (conf.castDatetimeToString) Some(StringType) else Some(TimestampType)
case (StringType, NullType) => Some(StringType)
case (NullType, StringType) => Some(StringType)

@@ -1760,6 +1760,13 @@ object SQLConf {
.internal()
.intConf
.createWithDefault(Int.MaxValue)

val LEGACY_CAST_DATETIME_TO_STRING =
buildConf("spark.sql.legacy.typeCoercion.datetimeToString")
.doc("If it is set to true, date/timestamp will cast to string in binary comparisons " +
"with String")
.booleanConf
.createWithDefault(false)
}

/**
@@ -2211,6 +2218,8 @@ class SQLConf extends Serializable with Logging {
def setCommandRejectsSparkCoreConfs: Boolean =
getConf(SQLConf.SET_COMMAND_REJECTS_SPARK_CORE_CONFS)

def castDatetimeToString: Boolean = getConf(SQLConf.LEGACY_CAST_DATETIME_TO_STRING)

/** ********************** SQLConf functionality methods ************ */

/** Set Spark SQL configuration properties. */
@@ -85,7 +85,7 @@ false
-- !query 10
select to_date('2009-07-30 04:17:52') > '2009-07-30 04:17:52'
-- !query 10 schema
struct<(CAST(to_date('2009-07-30 04:17:52') AS STRING) > 2009-07-30 04:17:52):boolean>
struct<(to_date('2009-07-30 04:17:52') > CAST(2009-07-30 04:17:52 AS DATE)):boolean>
-- !query 10 output
false

@@ -141,9 +141,9 @@ true
-- !query 17
select to_date('2009-07-30 04:17:52') >= '2009-07-30 04:17:52'
-- !query 17 schema
struct<(CAST(to_date('2009-07-30 04:17:52') AS STRING) >= 2009-07-30 04:17:52):boolean>
struct<(to_date('2009-07-30 04:17:52') >= CAST(2009-07-30 04:17:52 AS DATE)):boolean>
-- !query 17 output
false
true


-- !query 18
@@ -197,9 +197,9 @@ false
-- !query 24
select to_date('2009-07-30 04:17:52') < '2009-07-30 04:17:52'
-- !query 24 schema
struct<(CAST(to_date('2009-07-30 04:17:52') AS STRING) < 2009-07-30 04:17:52):boolean>
struct<(to_date('2009-07-30 04:17:52') < CAST(2009-07-30 04:17:52 AS DATE)):boolean>
-- !query 24 output
true
false


-- !query 25
@@ -253,7 +253,7 @@ true
-- !query 31
select to_date('2009-07-30 04:17:52') <= '2009-07-30 04:17:52'
-- !query 31 schema
struct<(CAST(to_date('2009-07-30 04:17:52') AS STRING) <= 2009-07-30 04:17:52):boolean>
struct<(to_date('2009-07-30 04:17:52') <= CAST(2009-07-30 04:17:52 AS DATE)):boolean>
-- !query 31 output
true

0 comments on commit 66f5a42

Please sign in to comment.
You can’t perform that action at this time.