-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19496][SQL]to_date udf to return null when input date is invalid #16870
Conversation
retest this please |
Test build #72639 has finished for PR 16870 at commit
|
Test build #72637 has finished for PR 16870 at commit
|
retest this please |
Test build #72646 has finished for PR 16870 at commit
|
Test build #72669 has finished for PR 16870 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general direction is good. I left a few comments.
def newDateFormat(formatString: String, timeZone: TimeZone): DateFormat = { | ||
def newDateFormat(formatString: String, | ||
timeZone: TimeZone, | ||
isLenient: Boolean = true): DateFormat = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not make this a default parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok let me modify it
checkAnswer(df1.select(unix_timestamp(col("x"), "yyyy-dd-MM HH:mm:ss")), Seq( | ||
Row(ts2.getTime / 1000L), Row(null), Row(null), Row(null))) | ||
checkAnswer(df1.selectExpr(s"unix_timestamp(x, 'yyyy-MM-dd mm:HH:ss')"), Seq( | ||
Row(ts3.getTime / 1000L), Row(ts4.getTime / 1000L), Row(null), Row(null))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the order be Row(ts4.getTime / 1000L), Row(null), Row(ts3.getTime / 1000L), Row(null)
? It does not matter for testing since we sort results, but it makes it less confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right~ thanks a lot!
checkAnswer(df1.selectExpr("to_unix_timestamp(x)"), Seq( | ||
Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null))) | ||
checkAnswer(df1.selectExpr(s"to_unix_timestamp(x, 'yyyy-MM-dd mm:HH:ss')"), Seq( | ||
Row(ts3.getTime / 1000L), Row(ts4.getTime / 1000L), Row(null), Row(null))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the order be Row(ts4.getTime / 1000L), Row(null), Row(ts3.getTime / 1000L), Row(null)
? It does not matter for testing since we sort results, but it makes it less confusing.
UTF8String.fromString(df.format(new java.util.Date(timestamp.asInstanceOf[Long] / 1000))) | ||
} | ||
|
||
override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = { | ||
val dtu = DateTimeUtils.getClass.getName.stripSuffix("$") | ||
val tz = ctx.addReferenceMinorObj(timeZone) | ||
defineCodeGen(ctx, ev, (timestamp, format) => { | ||
s"""UTF8String.fromString($dtu.newDateFormat($format.toString(), $tz) | ||
s"""UTF8String.fromString($dtu.newDateFormat($format.toString(), $tz, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are inconsistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, sorry let me fix it, thanks!
Test build #72710 has finished for PR 16870 at commit
|
Test build #72730 has finished for PR 16870 at commit
|
val sdf = new SimpleDateFormat(formatString, Locale.US) | ||
sdf.setTimeZone(timeZone) | ||
sdf.setLenient(isLenient) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if we always set lenient
to false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can test it with lenient false. this is a util func, if test is ok, should we always set it to false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it will be good if we don't need to introduce new parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok thanks~
|
@tejasapatil thanks for your review! |
Test build #72768 has finished for PR 16870 at commit
|
Test build #72769 has finished for PR 16870 at commit
|
@cloud-fan I remove the islenient param, the tests passed, it seems remove islenient is ok |
@@ -98,6 +98,7 @@ object DateTimeUtils { | |||
def newDateFormat(formatString: String, timeZone: TimeZone): DateFormat = { | |||
val sdf = new SimpleDateFormat(formatString, Locale.US) | |||
sdf.setTimeZone(timeZone) | |||
sdf.setLenient(false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add one line comment:
// Enable strict parsing, inputs must precisely match this object's format.
|
||
val df1 = Seq(x1, x2, x3, x4).toDF("x") | ||
checkAnswer(df1.select(unix_timestamp(col("x"))), Seq( | ||
Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ts1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is ts1, the timestamp of x1
is ts1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile the ts1 var is defined at the beginning of the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uh, got it. Thanks!
|
||
val df1 = Seq(x1, x2, x3, x4).toDF("x") | ||
checkAnswer(df1.selectExpr("to_unix_timestamp(x)"), Seq( | ||
Row(ts1.getTime / 1000L), Row(null), Row(null), Row(null))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same issue here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the same with above~
Could you also add one more case for verifying |
Test build #72798 has finished for PR 16870 at commit
|
Test build #72799 has finished for PR 16870 at commit
|
LGTM - merging to master. Thanks! |
@windpiger can you open a backport to branch-2.1? Thanks! |
ok~ I am glad to take this! thanks~ |
@hvanhovell branch-2.1 has no func to_date with formate param, the backport should contain it? This is a branch that both backport the SPAR-16609 and this current PR to branch-2.1. |
Yeah, you are right. Lets leave this as it currently is. |
ok, so this work finished? |
Yes it is |
ok~ |
## What changes were proposed in this pull request? Currently the udf `to_date` has different return value with an invalid date input. ``` SELECT to_date('2015-07-22', 'yyyy-dd-MM') -> return `2016-10-07` SELECT to_date('2014-31-12') -> return null ``` As discussed in JIRA [SPARK-19496](https://issues.apache.org/jira/browse/SPARK-19496), we should return null in both situations when the input date is invalid ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes apache#16870 from windpiger/to_date.
What changes were proposed in this pull request?
Currently the udf
to_date
has different return value with an invalid date input.As discussed in JIRA SPARK-19496, we should return null in both situations when the input date is invalid
How was this patch tested?
unit test added