[SPARK-16609] Add to_date/to_timestamp with format functions #16138
Conversation
@@ -2666,7 +2666,18 @@ object functions {
   * @group datetime_funcs
   * @since 1.5.0
   */
-  def to_date(e: Column): Column = withExpr { ToDate(e.expr) }
+  def to_date(e: Column): Column = withExpr { ToDate(e.expr, Literal("yyyy-MM-dd")) }
this maintains previous capabilities.
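For reference, a minimal sketch of the overload pair this change implies; the `ParseToDate` wiring in the second definition is an assumption pieced together from later diff fragments in this conversation, not necessarily the exact merged code:

```scala
// Existing single-argument form keeps its previous behavior by pinning
// the default pattern, so current callers see no change.
def to_date(e: Column): Column = withExpr { ToDate(e.expr, Literal("yyyy-MM-dd")) }

// New overload: the caller supplies a SimpleDateFormat pattern used to
// parse the input column into a date.
def to_date(e: Column, fmt: String): Column = withExpr {
  new ParseToDate(e.expr, Literal(fmt))
}
```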
   * (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html])
   *
   * @group datetime_funcs
   * @since 2.2.0
Will need to make sure this version is correct.
   * (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html])
   * to Unix time stamp (in seconds), return null if fail.
   * @group datetime_funcs
   * @since 1.5.0
will need to change this version to the correct branch.
Test build #69645 has finished for PR 16138 at commit
Can you use "SPARK-16609" instead of "Spark-16609"?
   * @group datetime_funcs
   * @since 1.5.0
   */
  def to_timestamp(s: Column, p: String): Column = withExpr { UnixTimestamp(s.expr, Literal(p)) }
still missing a cast here...
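A hedged sketch of what the reviewer is pointing at: `UnixTimestamp` yields seconds since the epoch as a long, so the result still needs a cast back to `TimestampType` for the function to actually return a timestamp column (the exact merged expression may differ):

```scala
import org.apache.spark.sql.types.TimestampType

def to_timestamp(s: Column, p: String): Column = withExpr {
  // Parse with the supplied pattern, then cast the resulting
  // seconds-since-epoch long back to a timestamp.
  Cast(UnixTimestamp(s.expr, Literal(p)), TimestampType)
}
```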
Test build #69647 has finished for PR 16138 at commit
Test build #69648 has finished for PR 16138 at commit
      Row(Date.valueOf("2014-12-31"))))
    checkAnswer(
      df.select(to_date(col("s"), "yyyy-MM-dd")),
      Seq(Row(Date.valueOf("2015-07-22")), Row(Date.valueOf("2014-12-31")), Row(null)))
scala> val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
format: java.text.SimpleDateFormat = java.text.SimpleDateFormat@f67a0200
scala> format.parse("2014-31-12")
res1: java.util.Date = Tue Jul 12 00:00:00 PDT 2016
This is strange to me but is expected behavior.
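For background (not part of this PR): `SimpleDateFormat` is lenient by default, so an out-of-range month like 31 rolls the calendar forward instead of failing. Strict parsing rejects the same input:

```scala
import java.text.SimpleDateFormat

val strict = new SimpleDateFormat("yyyy-MM-dd")
strict.setLenient(false)    // disable calendar rollover for out-of-range fields
strict.parse("2014-31-12")  // now throws java.text.ParseException: Unparseable date
```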
It's strange that this test does not pass on my local machine. Is there something that I am missing that is causing this?
Never mind, this does fail here too. Seems like I was just looking at the wrong result.
Test build #69649 has finished for PR 16138 at commit
Test build #69650 has finished for PR 16138 at commit
  }

  override def flatArguments: Iterator[Any] = Iterator(left, format)
  override def sql: String = s"$prettyName(${left.sql}, ${format.sql})"
This is the problem: you set `format` to the empty string at https://github.com/apache/spark/pull/16138/files#diff-b83497f7bc11578a0b63a814a2a30f48R1072, but this is not safe. Although the `format` will be ignored, we use it to build the SQL string of this expression, which may end up as something like `to_date('2012-12-12 12:12:12', '')`. We should make `format` an `Option[String]` and generate the correct SQL string here.
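A minimal sketch of the suggested shape, assuming the field names from the diff fragments above and eliding the rest of the `RuntimeReplaceable` plumbing:

```scala
case class ParseToDate(left: Expression, format: Option[Expression], child: Expression)
  extends RuntimeReplaceable {

  override def flatArguments: Iterator[Any] = Iterator(left, format)

  // Mention the format in the generated SQL only when one was supplied,
  // so to_date('2012-12-12 12:12:12') never prints a bogus '' argument.
  override def sql: String = format match {
    case Some(f) => s"$prettyName(${left.sql}, ${f.sql})"
    case None    => s"$prettyName(${left.sql})"
  }
}
```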
Fixed!
Test build #72301 has started for PR 16138 at commit
  }

  override def flatArguments: Iterator[Any] = Iterator(left, format)
  override def sql: String = s"$prettyName(${left.sql}, ${format.sql})"
Does it have the same issue as `ParseToDate`?
No, `to_timestamp` doesn't exist at all in Hive; it's a brand-new Spark SQL-only function. I also can't actually find a failure in the test results. Do you know what actually failed?
retest this please
Test build #72311 has finished for PR 16138 at commit
Test build #72394 has finished for PR 16138 at commit
Test build #72396 has finished for PR 16138 at commit
Test build #72397 has finished for PR 16138 at commit
this is the CRAN check error:
you probably need to change it to
@felixcheung thanks, made those changes :). Hopefully this will start passing sometime :P
    // now switch format
    checkAnswer(
      df.select(to_date(col("s"), "yyyy-dd-MM")),
      Seq(Row(Date.valueOf("2016-10-07")), Row(Date.valueOf("2016-07-12")),
I'm surprised: converting `2015-07-22 10:00:00` to a date with format `yyyy-dd-MM` results in `2016-10-07`? When we convert `2014-31-12` to a date with the default format, it returns null.
Test build #72422 has finished for PR 16138 at commit
@cloud-fan great questions. I thought that was strange too. However, this is the current behavior, and Java's as well. For example, today:

```scala
spark.sql("""
  SELECT
    cast(
      cast(
        unix_timestamp('2016-30-12', 'yyyy-MM-dd')
      as timestamp)
    as date)
  """).take(5)
// res4: Array[org.apache.spark.sql.Row] = Array([2018-06-12])
```

Which seems wrong to me, but that's the logic that exists. The issue seems to be that `unix_timestamp` returns an incorrect value when it should just return null. The source is the behavior of `SimpleDateFormat`:

```scala
scala> import java.text.SimpleDateFormat
// import java.text.SimpleDateFormat

scala> val x = new SimpleDateFormat("yyyy-MM-dd")
// x: java.text.SimpleDateFormat = java.text.SimpleDateFormat@f67a0200

scala> x.parse("2016-30-12")
// res5: java.util.Date = Tue Jun 12 00:00:00 PDT 2018
```
Manual Correctness tests:

Python

R
Since this is an existing behavior, this PR LGTM.
LGTM - merging to master.
## What changes were proposed in this pull request? This pull request adds two new user facing functions: - `to_date` which accepts an expression and a format and returns a date. - `to_timestamp` which accepts an expression and a format and returns a timestamp. For example, Given a date in format: `2016-21-05`. (YYYY-dd-MM) ### Date Function *Previously* ``` to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")) ``` *Current* ``` to_date(lit("2016-21-05"), "yyyy-dd-MM") ``` ### Timestamp Function *Previously* ``` unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp") ``` *Current* ``` to_timestamp(lit("2016-21-05"), "yyyy-dd-MM") ``` ### Tasks - [X] Add `to_date` to Scala Functions - [x] Add `to_date` to Python Functions - [x] Add `to_date` to SQL Functions - [X] Add `to_timestamp` to Scala Functions - [x] Add `to_timestamp` to Python Functions - [x] Add `to_timestamp` to SQL Functions - [x] Add function to R ## How was this patch tested? - [x] Add Functions to `DateFunctionsSuite` - Test new `ParseToTimestamp` Expression (*not necessary*) - Test new `ParseToDate` Expression (*not necessary*) - [x] Add test for R - [x] Add test for Python in test.py Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Author: Bill Chambers <bill@databricks.com> Author: anabranch <bill@databricks.com> Closes apache#16138 from anabranch/SPARK-16609.
@anabranch I met a failure case when using `to_date`, like this:

    spark-sql> select func(to_date('2017-08-25 12:00:00'));

`func` is a custom UDF which extends org.apache.hadoop.hive.ql.udf.generic.GenericUDF. And the following exception happened:

    Error in query: No handler for Hive UDF 'xxx.xxx.func': java.lang.UnsupportedOperationException: Cannot evaluate expression: to_date(2017-08-25 12:00:00, None);

I located this problem. When resolving `func(to_date('2017-08-25 12:00:00'))` (at this point Spark has already finished resolving `func`'s child `to_date('2017-08-25 12:00:00')`), it calls `func`'s FunctionBuilder, which is defined in HiveSessionCatalog.makeFunctionBuilder (https://github.com/apache/spark/blob/v2.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala). In it, the GenericUDF instance is forced to check its input data types. During this process, all of the child catalyst expressions that are parameters of the GenericUDF subclass need to be mapped to an ObjectInspector (see HiveInspectors.toInspector, https://github.com/apache/spark/blob/v2.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala). Inside, the child expression is first evaluated through eval() if it doesn't match a Literal:

    def toInspector(expr: Expression): ObjectInspector = expr match {
      case Literal(value, XXXType) =>
      //...
      // ideally, we don't test the foldable here(but in optimizer), however, some of the
      // Hive UDF / UDAF requires its argument to be constant objectinspector, we do it eagerly.
      case _ if expr.foldable => toInspector(Literal.create(expr.eval(), expr.dataType))
    }

In my case, the child expression is `to_date('2017-08-25 12:00:00')`, which is an instance of ParseToDate (extends RuntimeReplaceable). When calling its eval(), we hit Unevaluable's eval(), which throws the exception mentioned above:

    trait Unevaluable extends Expression {
      final override def eval(input: InternalRow = null): Any =
        throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")

      final override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
        throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
    }

Considering that we barely use the second parameter `pattern`, I remapped `to_date` to `ToDate` again to fix the problem.

So, does my case relate to this PR?
I don't think this is necessarily my call, but you're effectively writing a Hive UDF, not a Spark one that depends on this. Writing a Spark UDF would allow this to work just fine and the expression to be evaluated. Your workaround seems fine in your local version, though.
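To illustrate the suggestion (hypothetical names, not code from this PR; assumes a spark-shell session where `spark` and its implicits are in scope): a native Spark UDF keeps `to_date(...)` inside Catalyst, so nothing forces an eager eval() the way the Hive ObjectInspector path does:

```scala
import java.sql.Date
import org.apache.spark.sql.functions.{col, to_date, udf}
import spark.implicits._

// Hypothetical stand-in for whatever logic the custom `func` performs.
val func = udf((d: Date) => if (d == null) null else d.toString)

val df = Seq("2017-08-25 12:00:00").toDF("s")
df.select(func(to_date(col("s")))).show()
```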
## What changes were proposed in this pull request?

This pull request adds two new user-facing functions:

- `to_date` which accepts an expression and a format and returns a date.
- `to_timestamp` which accepts an expression and a format and returns a timestamp.

For example, given a date in the format `2016-21-05` (yyyy-dd-MM):

### Date Function

*Previously*
```
to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))
```

*Current*
```
to_date(lit("2016-21-05"), "yyyy-dd-MM")
```

### Timestamp Function

*Previously*
```
unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")
```

*Current*
```
to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
```

### Tasks

- [x] Add `to_date` to Scala functions
- [x] Add `to_date` to Python functions
- [x] Add `to_date` to SQL functions
- [x] Add `to_timestamp` to Scala functions
- [x] Add `to_timestamp` to Python functions
- [x] Add `to_timestamp` to SQL functions
- [x] Add function to R

## How was this patch tested?

- [x] Add functions to `DateFunctionsSuite`
- Test new `ParseToTimestamp` expression (*not necessary*)
- Test new `ParseToDate` expression (*not necessary*)
- [x] Add test for R
- [x] Add test for Python in test.py

Please review http://spark.apache.org/contributing.html before opening a pull request.
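A quick usage sketch of the merged API (assuming a spark-shell session with `spark` in scope):

```scala
import org.apache.spark.sql.functions.{lit, to_date, to_timestamp}

val df = spark.range(1)

// Both overloads take a SimpleDateFormat pattern, replacing the old
// unix_timestamp(...).cast("timestamp") round-trip.
df.select(to_date(lit("2016-21-05"), "yyyy-dd-MM")).show()      // 2016-05-21
df.select(to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")).show() // 2016-05-21 00:00:00
```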