
[SPARK-19496][SQL] to_date udf to return null when input date is invalid #16870

Closed
wants to merge 13 commits into apache:master from windpiger:to_date

Conversation

@windpiger
Contributor

What changes were proposed in this pull request?

Currently the UDF `to_date` returns different results for an invalid date input:

```
SELECT to_date('2015-07-22', 'yyyy-dd-MM')   -> returns 2016-10-07
SELECT to_date('2014-31-12')                 -> returns null
```

As discussed in JIRA [SPARK-19496](https://issues.apache.org/jira/browse/SPARK-19496), we should return null in both cases when the input date is invalid.
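For context, the surprising `2016-10-07` comes from `java.text.SimpleDateFormat`'s lenient mode, which rolls out-of-range fields forward instead of rejecting them. A minimal standalone Scala sketch of the difference (illustration only, not part of this patch):

```
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

val lenient = new SimpleDateFormat("yyyy-dd-MM", Locale.US)
lenient.setLenient(true)
// '22' lands in the MM slot; lenient mode rolls the 21 surplus months
// past January 2015 and ends up at 2016-10-07.
println(lenient.parse("2015-07-22"))

val strict = new SimpleDateFormat("yyyy-dd-MM", Locale.US)
strict.setLenient(false)
// Strict mode throws a ParseException for month 22, which the caller
// can catch and map to null.
println(Try(strict.parse("2015-07-22")))
```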

How was this patch tested?

unit test added

@windpiger
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72639 has finished for PR 16870 at commit de28fd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72637 has finished for PR 16870 at commit 8db6253.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

retest this please

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72646 has finished for PR 16870 at commit de28fd4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 10, 2017

Test build #72669 has finished for PR 16870 at commit ad6528c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

windpiger commented Feb 10, 2017

@hvanhovell (Contributor) left a comment

The general direction is good. I left a few comments.

```
-def newDateFormat(formatString: String, timeZone: TimeZone): DateFormat = {
+def newDateFormat(formatString: String,
+    timeZone: TimeZone,
+    isLenient: Boolean = true): DateFormat = {
```
Contributor

Let's not make this a default parameter.

Contributor Author

OK, let me modify it.

```
checkAnswer(df1.select(unix_timestamp(col("x"), "yyyy-dd-MM HH:mm:ss")), Seq(
  Row(ts2.getTime / 1000L), Row(null), Row(null), Row(null)))
checkAnswer(df1.selectExpr(s"unix_timestamp(x, 'yyyy-MM-dd mm:HH:ss')"), Seq(
  Row(ts3.getTime / 1000L), Row(ts4.getTime / 1000L), Row(null), Row(null)))
```
Contributor

Shouldn't the order be Row(ts4.getTime / 1000L), Row(null), Row(ts3.getTime / 1000L), Row(null)? It does not matter for testing since we sort results, but it makes it less confusing.

Contributor Author

you are right~ thanks a lot!

```
checkAnswer(df1.selectExpr("to_unix_timestamp(x)"), Seq(
  Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null)))
checkAnswer(df1.selectExpr(s"to_unix_timestamp(x, 'yyyy-MM-dd mm:HH:ss')"), Seq(
  Row(ts3.getTime / 1000L), Row(ts4.getTime / 1000L), Row(null), Row(null)))
```
Contributor

Shouldn't the order be Row(ts4.getTime / 1000L), Row(null), Row(ts3.getTime / 1000L), Row(null)? It does not matter for testing since we sort results, but it makes it less confusing.

```
    UTF8String.fromString(df.format(new java.util.Date(timestamp.asInstanceOf[Long] / 1000)))
  }

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val dtu = DateTimeUtils.getClass.getName.stripSuffix("$")
    val tz = ctx.addReferenceMinorObj(timeZone)
    defineCodeGen(ctx, ev, (timestamp, format) => {
-     s"""UTF8String.fromString($dtu.newDateFormat($format.toString(), $tz)
+     s"""UTF8String.fromString($dtu.newDateFormat($format.toString(), $tz, false)
```
@hvanhovell (Contributor) commented Feb 10, 2017

These are inconsistent.

Contributor Author

oh, sorry let me fix it, thanks!

@SparkQA

SparkQA commented Feb 10, 2017

Test build #72710 has finished for PR 16870 at commit a279889.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 11, 2017

Test build #72730 has finished for PR 16870 at commit 2dc241e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```
val sdf = new SimpleDateFormat(formatString, Locale.US)
sdf.setTimeZone(timeZone)
sdf.setLenient(isLenient)
```
Contributor

what if we always set lenient to false?

Contributor Author

We can test it with lenient set to false. This is a util function, though; if the tests pass, should we always set it to false?

Contributor

Yes, it would be good if we don't need to introduce a new parameter.

Contributor Author

ok thanks~

@tejasapatil
Contributor

The format could also be invalid. Since the model we are going with is to return null for bad inputs, the same could be done for the format. Please add a test case for this.
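A rough sketch of what such a test could look like, in the suite's existing `checkAnswer` style (the pattern string `not_a_format` is a hypothetical illegal format; the merged test may differ):

```
// 'n' is not a legal SimpleDateFormat pattern letter, so building the
// formatter fails; under the null-for-bad-inputs model every row
// should come back null.
checkAnswer(
  df1.selectExpr("unix_timestamp(x, 'not_a_format')"),
  Seq(Row(null), Row(null), Row(null), Row(null)))
```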

@windpiger
Contributor Author

@tejasapatil thanks for your review!

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72768 has finished for PR 16870 at commit 21046f0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 12, 2017

Test build #72769 has finished for PR 16870 at commit 3779984.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@windpiger
Contributor Author

@cloud-fan I removed the isLenient param and the tests passed, so it seems removing isLenient is OK.

```
@@ -98,6 +98,7 @@ object DateTimeUtils {
   def newDateFormat(formatString: String, timeZone: TimeZone): DateFormat = {
     val sdf = new SimpleDateFormat(formatString, Locale.US)
     sdf.setTimeZone(timeZone)
+    sdf.setLenient(false)
```
Member

Please add a one-line comment:

```
// Enable strict parsing, inputs must precisely match this object's format.
```
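Combining the hunk above with the requested comment, the helper would presumably end up as follows (a sketch; the trailing `sdf` return is assumed from the surrounding method body, which is not shown here):

```
def newDateFormat(formatString: String, timeZone: TimeZone): DateFormat = {
  val sdf = new SimpleDateFormat(formatString, Locale.US)
  sdf.setTimeZone(timeZone)
  // Enable strict parsing, inputs must precisely match this object's format.
  sdf.setLenient(false)
  sdf
}
```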


```
val df1 = Seq(x1, x2, x3, x4).toDF("x")
checkAnswer(df1.select(unix_timestamp(col("x"))), Seq(
  Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null)))
```
Member

ts1?

Contributor Author

Yes, it is ts1; the timestamp of x1 is ts1.

Contributor

@gatorsmile the ts1 var is defined at the beginning of the test.

Member

uh, got it. Thanks!


```
val df1 = Seq(x1, x2, x3, x4).toDF("x")
checkAnswer(df1.selectExpr("to_unix_timestamp(x)"), Seq(
  Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null)))
```
Member

The same issue here

Contributor Author

the same with above~

@gatorsmile
Member

Could you also add one more case for verifying to_date on "2016-02-29" and "2017-02-29"?
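A sketch of the requested check: 2016 is a leap year and 2017 is not, so strict parsing should accept the first date and null out the second (the exact formulation here is hypothetical; `Date` is `java.sql.Date`):

```
checkAnswer(
  sql("SELECT to_date('2016-02-29'), to_date('2017-02-29')"),
  Row(Date.valueOf("2016-02-29"), null))
```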

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72798 has finished for PR 16870 at commit 7238e94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72799 has finished for PR 16870 at commit 3b1cfd4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM - merging to master. Thanks!

@hvanhovell
Contributor

@windpiger can you open a backport to branch-2.1? Thanks!

@asfgit asfgit closed this in 04ad822 Feb 13, 2017
@windpiger
Contributor Author

ok~ I am glad to take this! thanks~

@windpiger
Contributor Author

windpiger commented Feb 14, 2017

@hvanhovell branch-2.1 has no to_date function with a format param; should the backport contain it?
If so, SPARK-16609 will also need to be backported to branch-2.1.

Here is a branch that backports both SPARK-16609 and this current PR to branch-2.1:
https://github.com/windpiger/spark/commits/backport-2.1-todate (the last three commits)
Is it ok?

@hvanhovell
Contributor

Yeah, you are right. Let's leave this as it currently is.

@windpiger
Contributor Author

OK, so is this work finished?

@hvanhovell
Contributor

Yes, it is.

@windpiger
Contributor Author

ok~

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

Currently the UDF `to_date` returns different results for an invalid date input.

```
SELECT to_date('2015-07-22', 'yyyy-dd-MM') ->  return `2016-10-07`
SELECT to_date('2014-31-12')    -> return null
```

As discussed in JIRA [SPARK-19496](https://issues.apache.org/jira/browse/SPARK-19496), we should return null in both cases when the input date is invalid.

## How was this patch tested?
unit test added

Author: windpiger <songjun@outlook.com>

Closes apache#16870 from windpiger/to_date.