
[SPARK-16609] Add to_date/to_timestamp with format functions #16138

Closed
wants to merge 56 commits

Conversation

bllchmbrs
Contributor

@bllchmbrs bllchmbrs commented Dec 4, 2016

What changes were proposed in this pull request?

This pull request adds two new user facing functions:

  • to_date which accepts an expression and a format and returns a date.
  • to_timestamp which accepts an expression and a format and returns a timestamp.

For example, given the date 2016-21-05 in the format yyyy-dd-MM:

Date Function

Previously

to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))

Current

to_date(lit("2016-21-05"), "yyyy-dd-MM")

Timestamp Function

Previously

unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")

Current

to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
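The old and new forms are semantically equivalent. A stdlib-Python sketch of that equivalence (note that Python's strptime format letters differ from Java's SimpleDateFormat, and strptime is strict where SimpleDateFormat is lenient):

```python
from datetime import datetime, timezone

def unix_timestamp(s, fmt):
    """Seconds since epoch, or None on parse failure (SQL null)."""
    try:
        return datetime.strptime(s, fmt).replace(tzinfo=timezone.utc).timestamp()
    except ValueError:
        return None

def to_date(s, fmt):
    # The new one-step form: equivalent to the old
    # to_date(unix_timestamp(s, fmt).cast("timestamp")) chain.
    ts = unix_timestamp(s, fmt)
    return None if ts is None else datetime.fromtimestamp(ts, tz=timezone.utc).date()

print(to_date("2016-21-05", "%Y-%d-%m"))  # 2016-05-21
```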

Tasks

  • Add to_date to Scala Functions
  • Add to_date to Python Functions
  • Add to_date to SQL Functions
  • Add to_timestamp to Scala Functions
  • Add to_timestamp to Python Functions
  • Add to_timestamp to SQL Functions
  • Add function to R

How was this patch tested?

  • Add Functions to DateFunctionsSuite
  • Test new ParseToTimestamp Expression (not necessary)
  • Test new ParseToDate Expression (not necessary)
  • Add test for R
  • Add test for Python in test.py

Please review http://spark.apache.org/contributing.html before opening a pull request.

@@ -2666,7 +2666,18 @@ object functions {
* @group datetime_funcs
* @since 1.5.0
*/
-  def to_date(e: Column): Column = withExpr { ToDate(e.expr) }
+  def to_date(e: Column): Column = withExpr { ToDate(e.expr, Literal("yyyy-MM-dd")) }
Contributor Author

This maintains the previous behavior.

* (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html])
*
* @group datetime_funcs
* @since 2.2.0
Contributor Author

Will need to make sure this version is correct.

* (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html])
* to Unix time stamp (in seconds), return null if fail.
* @group datetime_funcs
* @since 1.5.0
Contributor Author

Will need to change this version to match the target branch.

@SparkQA

SparkQA commented Dec 4, 2016

Test build #69645 has finished for PR 16138 at commit b0f1ced.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Dec 4, 2016

Can you use "SPARK-16609" instead of "Spark-16609"?

@bllchmbrs bllchmbrs changed the title [WIP][Spark-16609] Add to_date with format function. [WIP][SPARK-16609] Add to_date/to_timestamp with format function Dec 4, 2016
@bllchmbrs bllchmbrs changed the title [WIP][SPARK-16609] Add to_date/to_timestamp with format function [WIP][SPARK-16609] Add to_date/to_timestamp with format functions Dec 4, 2016
* @group datetime_funcs
* @since 1.5.0
*/
def to_timestamp(s: Column, p: String): Column = withExpr { UnixTimestamp(s.expr, Literal(p)) }
Contributor Author

still missing a cast here...

@SparkQA

SparkQA commented Dec 4, 2016

Test build #69647 has finished for PR 16138 at commit 9a10a34.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 4, 2016

Test build #69648 has finished for PR 16138 at commit ecde7b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Row(Date.valueOf("2014-12-31"))))
checkAnswer(
df.select(to_date(col("s"), "yyyy-MM-dd")),
Seq(Row(Date.valueOf("2015-07-22")), Row(Date.valueOf("2014-12-31")), Row(null)))
Contributor Author

scala> val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
format: java.text.SimpleDateFormat = java.text.SimpleDateFormat@f67a0200

scala> format.parse("2014-31-12")
res1: java.util.Date = Tue Jul 12 00:00:00 PDT 2016

This is strange to me but is expected behavior.
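The lenient resolution can be reproduced by hand: SimpleDateFormat rolls out-of-range months over into years. A small illustrative sketch (the helper is hypothetical, not Spark code):

```python
def lenient_resolve(year, month, day):
    # Java's lenient SimpleDateFormat rolls excess months into years:
    # "2014-31-12" parsed as yyyy-MM-dd means year 2014 plus 19 extra months.
    year += (month - 1) // 12
    month = (month - 1) % 12 + 1
    return year, month, day

print(lenient_resolve(2014, 31, 12))  # (2016, 7, 12) -> Jul 12, 2016
```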

Contributor Author

It's strange that this test does not pass on my local machine. Is there something that I am missing that is causing this?

Contributor Author

Never mind, this does fail here too. Seems like I was just looking at the wrong result.

@SparkQA

SparkQA commented Dec 4, 2016

Test build #69649 has finished for PR 16138 at commit 89dccbd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastInputTypes

@SparkQA

SparkQA commented Dec 5, 2016

Test build #69650 has finished for PR 16138 at commit c17c0bd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

override def flatArguments: Iterator[Any] = Iterator(left, format)
override def sql: String = s"$prettyName(${left.sql}, ${format.sql})"
Contributor

This is the problem: you set format to an empty string at https://github.com/apache/spark/pull/16138/files#diff-b83497f7bc11578a0b63a814a2a30f48R1072 , but this is not safe. Although the format will be ignored, we use it to build the SQL string of this expression, which may end up as something like to_date('2012-12-12 12:12:12', '').

We should make format an Option[String] and generate the correct SQL string here.
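The suggested fix can be sketched as follows (hypothetical Python pseudocode for the SQL-generation logic, not the actual Scala):

```python
from typing import Optional

def parse_to_date_sql(child_sql: str, fmt: Optional[str] = None) -> str:
    # With the format held as an Option, omit the argument entirely when it
    # is absent, instead of emitting a bogus empty-string literal.
    if fmt is None:
        return f"to_date({child_sql})"
    return f"to_date({child_sql}, '{fmt}')"

print(parse_to_date_sql("'2012-12-12 12:12:12'"))
# to_date('2012-12-12 12:12:12')  -- no trailing , ''
```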

Contributor Author

Fixed!

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72301 has started for PR 16138 at commit a2d0221.

}

override def flatArguments: Iterator[Any] = Iterator(left, format)
override def sql: String = s"$prettyName(${left.sql}, ${format.sql})"
Contributor

does it have the same issue as ParseToDate?

Contributor Author

No, to_timestamp doesn't exist at all in Hive; it's a brand-new Spark SQL-only function. I also can't actually find a failure in the test results. Do you know what actually failed?

@hvanhovell
Contributor

retest this please

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72311 has finished for PR 16138 at commit a2d0221.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ParseToDate(left: Expression, format: Option[Expression], child: Expression)

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72394 has finished for PR 16138 at commit 479243f.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72396 has finished for PR 16138 at commit e00b446.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72397 has finished for PR 16138 at commit da16c66.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

felixcheung commented Feb 5, 2017

this is the CRAN check error:

* checking for missing documentation entries ... WARNING
Undocumented S4 methods:
  generic 'to_date' and siglist 'Column,missing'
  generic 'to_timestamp' and siglist 'Column,missing'
All user-level objects in a package (including S4 classes and methods)
should have documentation entries.
See the chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... WARNING
Objects in \usage without \alias in documentation object 'to_date':
  '\S4method{to_date}{Column,missing}'

Objects in \usage without \alias in documentation object 'to_timestamp':
  '\S4method{to_timestamp}{Column,missing}'

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See the chapter 'Writing R documentation files' in the 'Writing R
Extensions' manual.

You probably need to change it to @aliases to_date,Column,missing-method

@bllchmbrs
Contributor Author

@felixcheung thanks, made those changes :). Hopefully this will start passing sometime :P

// now switch format
checkAnswer(
df.select(to_date(col("s"), "yyyy-dd-MM")),
Seq(Row(Date.valueOf("2016-10-07")), Row(Date.valueOf("2016-07-12")),
Contributor

I'm surprised that converting 2015-07-22 10:00:00 to a date with format yyyy-dd-MM results in 2016-10-07, while converting 2014-31-12 with the default format returns null.

@SparkQA

SparkQA commented Feb 6, 2017

Test build #72422 has finished for PR 16138 at commit c3063a2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bllchmbrs
Contributor Author

@cloud-fan great questions.

I thought that was strange too. However, this is the current behavior, and it matches Java SimpleDateFormat's behavior; I did not implement that logic.

For example, today:

spark.sql("""
   SELECT 
   cast(
	   cast(
	   	unix_timestamp('2016-30-12', 'yyyy-MM-dd') 
	   as timestamp) 
   as date)
   """).take(5)
// res4: Array[org.apache.spark.sql.Row] = Array([2018-06-12])

This seems wrong to me, but that's the logic that exists. The issue is that unix_timestamp returns an incorrect value when it should just return null; the root cause is the lenient behavior of SimpleDateFormat.

scala> import java.text.SimpleDateFormat
// import java.text.SimpleDateFormat

scala> val x = new SimpleDateFormat("yyyy-MM-dd")
// x: java.text.SimpleDateFormat = java.text.SimpleDateFormat@f67a0200

scala> x.parse("2016-30-12")
// res5: java.util.Date = Tue Jun 12 00:00:00 PDT 2018
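For contrast, a strict parser rejects such inputs instead of rolling the date forward. A stdlib-Python sketch of the null-returning behavior one might expect (Python's strptime is strict, unlike SimpleDateFormat):

```python
from datetime import datetime

def parse_strict(s, fmt="%Y-%m-%d"):
    """Return a date if s matches fmt exactly, else None (SQL null)."""
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None

print(parse_strict("2016-12-30"))  # 2016-12-30
print(parse_strict("2016-30-12"))  # None: month 30 is rejected, not rolled over
```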

@bllchmbrs
Contributor Author

Manual Correctness tests:

Python

>>> from pyspark.sql.functions import to_date, to_timestamp, lit
>>> spark.range(1).select(to_date(lit('2016-01-02'),'yyyy-MM-dd').alias("dt")).collect()
# [Row(dt=datetime.date(2016, 1, 2))]
>>> spark.range(1).select(to_timestamp(lit('2016-01-02'),'yyyy-MM-dd').alias("dt")).collect()
# [Row(dt=datetime.datetime(2016, 1, 2, 0, 0))]

R

> collect(select(people, to_date(lit('2016-01-02'), 'yyyy-MM-dd')))
  to_date('2016-01-02', 'yyyy-MM-dd')
#1                         2016-01-02
#2                         2016-01-02
#3                         2016-01-02
> collect(select(people, to_timestamp(lit('2016-01-02'), 'yyyy-MM-dd')))
  to_timestamp('2016-01-02', 'yyyy-MM-dd')
#1                               2016-01-02
#2                               2016-01-02
#3                               2016-01-02
> select(people, to_timestamp(lit('2016-01-02'), 'yyyy-MM-dd'))

@cloud-fan
Contributor

Since this is an existing behavior, this PR LGTM

@hvanhovell
Contributor

LGTM - merging master.

@asfgit asfgit closed this in 7a7ce27 Feb 7, 2017
windpiger pushed a commit to windpiger/spark that referenced this pull request Feb 14, 2017
Author: anabranch <wac.chambers@gmail.com>
Author: Bill Chambers <bill@databricks.com>
Author: anabranch <bill@databricks.com>

Closes apache#16138 from anabranch/SPARK-16609.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
@YannByron

@anabranch

I met a failure case when using to_date, like this:

 spark-sql> select func(to_date('2017-08-25 12:00:00'));

func is a custom udf which extends org.apache.hadoop.hive.ql.udf.generic.GenericUDF.

And this following exception happened:

Error in query: No handler for Hive UDF 'xxx.xxx.func': java.lang.UnsupportedOperationException: Cannot evaluate expression: to_date(2017-08-25 12:00:00, None);

I located the problem. When resolving the function func(to_date('2017-08-25 12:00:00')) (at this point Spark has already finished resolving func's child to_date('2017-08-25 12:00:00')), it calls func's FunctionBuilder, which is defined in HiveSessionCatalog.makeFunctionBuilder.
In it, the GenericUDF instance is forced to check its input data types. During this process, all of the child Catalyst expressions that are parameters of the GenericUDF subclass need to be mapped to an ObjectInspector (see HiveInspector.toInspector for details). Inside, a child expression is first evaluated through eval() if it does not match a Literal.

def toInspector(expr: Expression): ObjectInspector = expr match {
    case Literal(value, XXXType) =>
        //...
    // ideally, we don't test the foldable here(but in optimizer), however, some of the
    // Hive UDF / UDAF requires its argument to be constant objectinspector, we do it eagerly.
    case _ if expr.foldable => toInspector(Literal.create(expr.eval(), expr.dataType))
}

In my case, the child expression is to_date('2017-08-25 12:00:00'), an instance of ParseToDate (which extends RuntimeReplaceable). Calling its eval() hits Unevaluable's eval(), which throws the exception mentioned above.

trait Unevaluable extends Expression {

  final override def eval(input: InternalRow = null): Any =
    throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")

  final override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
}
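The clash between eager constant folding and RuntimeReplaceable expressions can be modeled in a few lines (illustrative Python mirroring the Scala structure; all names are stand-ins, not Spark APIs):

```python
class Unevaluable:
    """Expressions that must be rewritten by the planner before evaluation."""
    def eval(self, row=None):
        raise NotImplementedError(f"Cannot evaluate expression: {type(self).__name__}")

class ParseToDate(Unevaluable):
    # RuntimeReplaceable: the optimizer swaps it for an evaluable form,
    # but it still reports foldable, so eager folding code may call eval() early.
    foldable = True

def to_inspector(expr):
    # Mirrors HiveInspector.toInspector's eager constant-folding branch.
    if getattr(expr, "foldable", False):
        return expr.eval()  # raises for not-yet-replaced RuntimeReplaceable nodes
    raise TypeError("non-foldable expressions are handled elsewhere")

try:
    to_inspector(ParseToDate())
except NotImplementedError as e:
    print(e)  # Cannot evaluate expression: ParseToDate
```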

Considering that we barely use the two-parameter form, I remapped to_date to ToDate again to fix the problem.

So, does my case relate to this PR?

@bllchmbrs
Contributor Author

bllchmbrs commented Aug 30, 2017 via email
