Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-8995][SQL] cast date strings like '2015-01-01 12:15:31' to date #7353

Closed
wants to merge 13 commits into from

Conversation

tarekbecker
Copy link
Contributor

Jira https://issues.apache.org/jira/browse/SPARK-8995

In PR #6981we noticed that we cannot cast date strings that contains a time, like '2015-03-18 12:39:40' to date. Besides it's not possible to cast a string like '18:03:20' to a timestamp.

If a time is passed without a date, today is inferred as date.

@cloud-fan
Copy link
Contributor

This was by intentional as hive doesn't support it. However, our new design about date time says we should support cast string yyyy­-mm­-dd * to date, which means your case should be supported.
It would be better if you can fix the string to date cast completely, i.e. including all the formats defined in the design doc :)

@tarekbecker
Copy link
Contributor Author

Okay, yes I'm going to extend it. What Hive does support is parsing the minute from 13:20:08. Our minute method requires a TimeStamp. Either we don't support the minute expression on pure time information or we define a cast from a time string to timestamp. But even if we do that, we are still not compatible with Hive. Because we would parse a pure date like 2015-10-20 to 2015-10-20 00:00:00 and we could extract the minute 0, Hive would return null

@tarekbecker
Copy link
Contributor Author

@cloud-fan
The following strings can be parsed now: (String --> Date)
yyyy, yyyy-[m]m, yyyy-[m]m-[d]d, yyyy-[m]m-[d]d, yyyy-[m]m-[d]d *, yyyy-[m]m-[d]dT*

(String -> Timestamp)
yyyy, yyyy-[m]m, yyyy-[m]m-[d]d, yyyy-[m]m-[d]d, yyyy-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms], yyyy-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]Z, yyyy-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m, yyyy-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m, yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms], yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]Z, yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]-[h]h:[m]m, yyyy-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][ms]+[h]h:[m]m

@yhuai
Copy link
Contributor

yhuai commented Jul 13, 2015

ok to test

@SparkQA
Copy link

SparkQA commented Jul 13, 2015

Test build #37124 has finished for PR 7353 at commit 71f89c1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 13, 2015

Test build #1051 has finished for PR 7353 at commit 71f89c1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 13, 2015

Test build #37129 has finished for PR 7353 at commit cfbaed7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

try DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf(n))
catch { case _: java.lang.IllegalArgumentException => null }
val parsedDateString = DateTimeUtils.stringToTimestamp(utfs)
if (parsedDateString == null) null else DateTimeUtils.fromJavaTimestamp(parsedDateString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a Timestamp object, should we have a better name? parsedTime?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to adjust this.

@davies
Copy link
Contributor

davies commented Jul 13, 2015

@tarekauel Thanks for working on this, it's in a good shape, just a few comments.

As you mentioned in the description, 18:03:20 could be parsed as timestamp, but it's still not done yet?

@tarekbecker
Copy link
Contributor Author

@davies thanks for all your good comments. I'm going to incorporate your suggestions.

I don't accept 18:03:20 , yet. The design document doesn't allow this. I think we should parse a pure time string to today + time. But I wanted to double check this with you guys.

@SparkQA
Copy link

SparkQA commented Jul 14, 2015

Test build #37168 has finished for PR 7353 at commit 71622c0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 14, 2015

Test build #37171 has finished for PR 7353 at commit 34ec573.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tarekbecker
Copy link
Contributor Author

@davies How should we deal with this? I don't know the value of 'value, but it seems to be something that the parser can parse can be parsed to 1.1.1970.
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/37171/testReport/org.apache.spark.sql.hive.execution/HiveQuerySuite/Cast_Timestamp_to_Timestamp_in_UDF/

Update: My assumption is that value is a numeric value (see links below). Because of that 238 could
be interpreted as long value since 1970. So I don't know why Hives answer is null

https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/scripts/q_test_init.sql#L4-L8
https://github.com/apache/spark/blob/master/sql/hive/src/test/resources/data/files/kv1.txt#L1

@SparkQA
Copy link

SparkQA commented Jul 14, 2015

Test build #37177 has finished for PR 7353 at commit 01c9ff3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Jul 15, 2015

In master, we can't cast '238' as timestamp,

spark-sql> select DATEDIFF(CAST(238 AS timestamp), CAST('2002-03-21 00:00:00' AS timestamp));
-11767
Time taken: 0.171 seconds, Fetched 1 row(s)
spark-sql> select DATEDIFF(CAST('238' AS timestamp), CAST('2002-03-21 00:00:00' AS timestamp));
NULL
Time taken: 0.1 seconds, Fetched 1 row(s)

I think you should check that the year is larger than 1000, or we do not support casting from YYYY

@tarekbecker
Copy link
Contributor Author

@davies somehow Jenkins wasn't able to fetch from GitHub. Could you trigger Jenkins, again?

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #1075 has finished for PR 7353 at commit ef05753.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 15, 2015

Test build #37415 has finished for PR 7353 at commit d20b8b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2015

Test build #37454 has finished for PR 7353 at commit ca1ae69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

c.getTimeInMillis * 1000)
c.set(2015, 0, 1, 0, 0, 0)
c.set(Calendar.MILLISECOND, 0)
assert(DateTimeUtils.stringToTimestamp(UTF8String.fromString("2015")).get ==
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use === in tests (for better failure message).

@davies
Copy link
Contributor

davies commented Jul 16, 2015

LGTM, just some minor comments.

@SparkQA
Copy link

SparkQA commented Jul 16, 2015

Test build #37476 has finished for PR 7353 at commit 14f333b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Jul 16, 2015

Thanks for the update, merging this into master!

@asfgit asfgit closed this in 4ea6480 Jul 16, 2015
asfgit pushed a commit that referenced this pull request Jul 17, 2015
a follow up of #7353

1. we should use `Calendar.HOUR_OF_DAY` instead of `Calendar.HOUR`(this is for AM, PM).
2. we should call `c.set(Calendar.MILLISECOND, 0)` after `Calendar.getInstance`

I'm not sure why the tests didn't fail in jenkins, but I ran latest spark master branch locally and `DateTimeUtilsSuite` failed.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7473 from cloud-fan/datetime and squashes the following commits:

66cdaf2 [Wenchen Fan] fix several bugs in DateTimeUtils.stringToTimestamp
asfgit pushed a commit that referenced this pull request Jul 18, 2015
fix 2 bugs introduced in #7353

1. we should use UTC Calendar when cast string to date . Before #7353 , we use `DateTimeUtils.fromJavaDate(Date.valueOf(s.toString))` to cast string to date, and `fromJavaDate` will call `millisToDays` to avoid the time zone issue. Now we use `DateTimeUtils.stringToDate(s)`, we should create a Calendar with UTC in the begging.
2. we should not change the default time zone in test cases. The `threadLocalLocalTimeZone` and `threadLocalTimestampFormat` in `DateTimeUtils` will only be evaluated once for each thread, so we can't set the default time zone back anymore.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #7488 from cloud-fan/datetime and squashes the following commits:

9cd6005 [Wenchen Fan] address comments
21ef293 [Wenchen Fan] fix 2 bugs in datetime
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants