
[SPARK-19496][SQL] to_date udf to return null when input date is invalid #16870

Closed
wants to merge 13 commits into apache:master from windpiger:to_date

Conversation

@windpiger
Contributor

What changes were proposed in this pull request?

Currently the UDF `to_date` returns different results for an invalid date input:

```
SELECT to_date('2015-07-22', 'yyyy-dd-MM')   -> returns 2016-10-07
SELECT to_date('2014-31-12')                 -> returns null
```

As discussed in JIRA [SPARK-19496](https://issues.apache.org/jira/browse/SPARK-19496), we should return null in both cases when the input date is invalid.
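For context, the surprising `2016-10-07` comes from `java.text.SimpleDateFormat`'s lenient mode, which rolls out-of-range fields forward instead of rejecting them. A minimal standalone Scala sketch of the difference (illustration only, not part of this patch):

```
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

val lenient = new SimpleDateFormat("yyyy-dd-MM", Locale.US)
lenient.setLenient(true)
// '22' lands in the MM slot; lenient mode rolls the 21 surplus months
// past January 2015 and ends up at 2016-10-07.
println(lenient.parse("2015-07-22"))

val strict = new SimpleDateFormat("yyyy-dd-MM", Locale.US)
strict.setLenient(false)
// Strict mode throws a ParseException for month 22, which the caller
// can catch and map to null.
println(Try(strict.parse("2015-07-22")))
```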

How was this patch tested?

unit test added

@windpiger
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72639 has finished for PR 16870 at commit de28fd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72637 has finished for PR 16870 at commit 8db6253.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72646 has finished for PR 16870 at commit de28fd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2017

Test build #72669 has finished for PR 16870 at commit ad6528c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

windpiger commented Feb 10, 2017

@hvanhovell (Contributor) left a comment

The general direction is good. I left a few comments.

```
-def newDateFormat(formatString: String, timeZone: TimeZone): DateFormat = {
+def newDateFormat(formatString: String,
+    timeZone: TimeZone,
+    isLenient: Boolean = true): DateFormat = {
```
Contributor

Let's not make this a default parameter.

Contributor Author

OK, let me modify it.

```
checkAnswer(df1.select(unix_timestamp(col("x"), "yyyy-dd-MM HH:mm:ss")), Seq(
  Row(ts2.getTime / 1000L), Row(null), Row(null), Row(null)))
checkAnswer(df1.selectExpr(s"unix_timestamp(x, 'yyyy-MM-dd mm:HH:ss')"), Seq(
  Row(ts3.getTime / 1000L), Row(ts4.getTime / 1000L), Row(null), Row(null)))
```
Contributor

Shouldn't the order be Row(ts4.getTime / 1000L), Row(null), Row(ts3.getTime / 1000L), Row(null)? It does not matter for testing since we sort results, but it makes it less confusing.

Contributor Author

you are right~ thanks a lot!

```
checkAnswer(df1.selectExpr("to_unix_timestamp(x)"), Seq(
  Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null)))
checkAnswer(df1.selectExpr(s"to_unix_timestamp(x, 'yyyy-MM-dd mm:HH:ss')"), Seq(
  Row(ts3.getTime / 1000L), Row(ts4.getTime / 1000L), Row(null), Row(null)))
```
Contributor

Shouldn't the order be Row(ts4.getTime / 1000L), Row(null), Row(ts3.getTime / 1000L), Row(null)? It does not matter for testing since we sort results, but it makes it less confusing.

```
    UTF8String.fromString(df.format(new java.util.Date(timestamp.asInstanceOf[Long] / 1000)))
  }

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val dtu = DateTimeUtils.getClass.getName.stripSuffix("$")
    val tz = ctx.addReferenceMinorObj(timeZone)
    defineCodeGen(ctx, ev, (timestamp, format) => {
-     s"""UTF8String.fromString($dtu.newDateFormat($format.toString(), $tz)
+     s"""UTF8String.fromString($dtu.newDateFormat($format.toString(), $tz, false)
```
@hvanhovell (Contributor) commented Feb 10, 2017

These are inconsistent.

Contributor Author

oh, sorry let me fix it, thanks!

@SparkQA

SparkQA commented Feb 10, 2017

Test build #72710 has finished for PR 16870 at commit a279889.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2017

Test build #72730 has finished for PR 16870 at commit 2dc241e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
val sdf = new SimpleDateFormat(formatString, Locale.US)
sdf.setTimeZone(timeZone)
sdf.setLenient(isLenient)
```
Contributor

what if we always set lenient to false?

Contributor Author

We can test it with lenient set to false. This is a util function, though; if the tests pass, should we always set it to false?

Contributor

Yes, it would be good if we don't need to introduce a new parameter.

Contributor Author

ok thanks~

@tejasapatil
Contributor

The format could also be invalid. Since the model we are going with is to return null for bad inputs, the same could be done for the format. Please add a test case for this.
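A rough sketch of what such a test could look like, in the suite's existing `checkAnswer` style (the pattern string `not_a_format` is a hypothetical illegal format; the merged test may differ):

```
// 'n' is not a legal SimpleDateFormat pattern letter, so building the
// formatter fails; under the null-for-bad-inputs model every row
// should come back null.
checkAnswer(
  df1.selectExpr("unix_timestamp(x, 'not_a_format')"),
  Seq(Row(null), Row(null), Row(null), Row(null)))
```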

@windpiger
Contributor Author

@tejasapatil thanks for your review!

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72768 has finished for PR 16870 at commit 21046f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72769 has finished for PR 16870 at commit 3779984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

@cloud-fan I removed the isLenient param and the tests passed, so it seems removing isLenient is OK.

```
@@ -98,6 +98,7 @@ object DateTimeUtils {
   def newDateFormat(formatString: String, timeZone: TimeZone): DateFormat = {
     val sdf = new SimpleDateFormat(formatString, Locale.US)
     sdf.setTimeZone(timeZone)
+    sdf.setLenient(false)
```
Member

Please add a one-line comment:

```
// Enable strict parsing, inputs must precisely match this object's format.
```
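Combining the hunk above with the requested comment, the helper would presumably end up as follows (a sketch; the trailing `sdf` return is assumed from the surrounding method body, which is not shown here):

```
def newDateFormat(formatString: String, timeZone: TimeZone): DateFormat = {
  val sdf = new SimpleDateFormat(formatString, Locale.US)
  sdf.setTimeZone(timeZone)
  // Enable strict parsing, inputs must precisely match this object's format.
  sdf.setLenient(false)
  sdf
}
```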


```
val df1 = Seq(x1, x2, x3, x4).toDF("x")
checkAnswer(df1.select(unix_timestamp(col("x"))), Seq(
  Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null)))
```
Member

ts1?

Contributor Author

Yes, it is ts1; the timestamp of x1 is ts1.

Contributor

@gatorsmile the ts1 var is defined at the beginning of the test.

Member

uh, got it. Thanks!


```
val df1 = Seq(x1, x2, x3, x4).toDF("x")
checkAnswer(df1.selectExpr("to_unix_timestamp(x)"), Seq(
  Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null)))
```
Member

The same issue here

Contributor Author

the same with above~

@gatorsmile
Member

Could you also add one more case for verifying to_date on "2016-02-29" and "2017-02-29"?
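A sketch of the requested check: 2016 is a leap year and 2017 is not, so strict parsing should accept the first date and null out the second (the exact formulation here is hypothetical; `Date` is `java.sql.Date`):

```
checkAnswer(
  sql("SELECT to_date('2016-02-29'), to_date('2017-02-29')"),
  Row(Date.valueOf("2016-02-29"), null))
```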

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72798 has finished for PR 16870 at commit 7238e94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72799 has finished for PR 16870 at commit 3b1cfd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM - merging to master. Thanks!

@hvanhovell
Contributor

@windpiger can you open a backport to branch-2.1? Thanks!

@asfgit asfgit closed this in 04ad822 Feb 13, 2017
@windpiger
Contributor Author

ok~ I am glad to take this! thanks~

@windpiger
Contributor Author

windpiger commented Feb 14, 2017

@hvanhovell branch-2.1 has no to_date function with a format param; should the backport contain it?
If so, SPARK-16609 will also need to be backported to branch-2.1.

Here is a branch that backports both SPARK-16609 and this current PR to branch-2.1:
https://github.com/windpiger/spark/commits/backport-2.1-todate (the last three commits)
Is it ok?

@hvanhovell
Contributor

Yeah, you are right. Let's leave this as it currently is.

@windpiger
Contributor Author

OK, so is this work finished?

@hvanhovell
Contributor

Yes, it is.

@windpiger
Contributor Author

ok~

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

Currently the UDF `to_date` returns different results for an invalid date input.

```
SELECT to_date('2015-07-22', 'yyyy-dd-MM') ->  return `2016-10-07`
SELECT to_date('2014-31-12')    -> return null
```

As discussed in JIRA [SPARK-19496](https://issues.apache.org/jira/browse/SPARK-19496), we should return null in both cases when the input date is invalid.

## How was this patch tested?
unit test added

Author: windpiger <songjun@outlook.com>

Closes apache#16870 from windpiger/to_date.