Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31221][SQL] Rebase any date-times in conversions to/from Java types #27980

Closed
wants to merge 11 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Mar 22, 2020

What changes were proposed in this pull request?

In the PR, I propose to apply rebasing for all dates/timestamps in conversion functions fromJavaDate(), toJavaDate(), toJavaTimestamp() and fromJavaTimestamp(). The rebasing is performed via building a local date-time in an original calendar, extracting date-time fields from the result, and creating new local date-time in the target calendar.

Why are the changes needed?

The changes are need to be compatible with previous Spark version (2.4.5 and earlier versions) not only before the Gregorian cutover date 1582-10-15 but also for dates after the date. For instance, Gregorian calendar implementation in Java 7 java.util.GregorianCalendar is not accurate in resolving time zone offsets as Gregorian calendar introduced since Java 8.

Does this PR introduce any user-facing change?

Yes, this PR can introduce behavior changes for dates after 1582-10-15, in particular conversions of zone ids to zone offsets will be much more accurate.

How was this patch tested?

By existing test suites DateTimeUtilsSuite, DateFunctionsSuite, DateExpressionsSuite, CollectionExpressionsSuite, HiveOrcHadoopFsRelationSuite, ParquetIOSuite.

@SparkQA
Copy link

SparkQA commented Mar 22, 2020

Test build #120166 has finished for PR 27980 at commit b466cac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Timestamp.valueOf("2018-10-28 03:00:00"),
Timestamp.valueOf("2018-10-28 03:30:00")))
ts("2018-10-28 01:30:00"),
ts("2018-10-28 02:00:00", noDST = true),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to change the test because of the 2 lines. I wasn't able to build 2 timestamps via Timestamp.valueOf after rebasing. Timestamp class does normalization underneath, and replaces milliseconds since the epoch, and as a consequence of that checking of the results fails. Even textual representation, and year, month, ... nanos are the same.

@@ -508,8 +504,8 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers with SQLHelper {
test("2016-03-13 03:00:00", "PST", "2016-03-13 10:00:00.0")
test("2016-11-06 00:59:59", "PST", "2016-11-06 07:59:59.0")
// 2016-11-06 01:00:00 PST could be 2016-11-06 08:00:00 UTC or 2016-11-06 09:00:00 UTC
test("2016-11-06 01:00:00", "PST", "2016-11-06 09:00:00.0")
test("2016-11-06 01:59:59", "PST", "2016-11-06 09:59:59.0")
test("2016-11-06 01:00:00", "PST", "2016-11-06 08:00:00.0")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment above. 08 or 09 depends on the implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my only concern.

From what I know, Pacific Standard Time (PST) is UTC-8:00, and Pacific Daylight Time (PDT) is UTC-7:00. So 2016-11-06 09:00:00 UTC is the only corrected answer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, I have found the reason this behavior. While resolving PST to zone offsets, this short name is converted to America/Los_Angeles by the map:
https://github.com/frohoff/jdk8u-jdk/blob/master/src/share/classes/java/time/ZoneId.java#L248
And as a consequence, PST is not a constant offset:

   val pst = getZoneId("PST")
    println(pst.getRules.getOffset(LocalDateTime.of(2016, 11, 5, 23, 59, 59)))
    println(pst.getRules.getOffset(LocalDateTime.of(2016, 11, 6, 0, 0, 0)))
    println(pst.getRules.getOffset(LocalDateTime.of(2016, 11, 6, 1, 0, 0)))
    println(pst.getRules.getOffset(LocalDateTime.of(2016, 11, 6, 2, 0, 0)))
    println(pst.getRules.getOffset(LocalDateTime.of(2016, 11, 6, 3, 0, 0)))
-07:00
-07:00
-07:00
-08:00
-08:00

At the test local timestamp 2016-11-06 01:00:00, the offset is -07:00.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to work around this JDK bug in the test? e.g. not use PST

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan I got 2016-11-06 09:00:00.0 in the test by modifying the map SHORT_IDS:

  def getZoneId(timeZoneId: String): ZoneId = {
    import scala.collection.JavaConverters._

    val m = Map("PST" -> "-08:00")
    ZoneId.of(timeZoneId, m.asJava)
  }
  def getDefaultZoneId(): ZoneId = {
    getZoneId(defaultTimeZone().getID)
  }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following works:

    withDefaultTimeZone(TimeZone.getTimeZone("-08:00")) {
      test("2016-11-06 01:00:00", "-08:00", "2016-11-06 09:00:00.0")
      test("2016-11-06 01:59:59", "-08:00", "2016-11-06 09:59:59.0")
    }

@cloud-fan Should I replace PST by -08:00 in the test?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes please

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

def checkFromToJavaDate(d1: Date): Unit = {
val d2 = toJavaDate(fromJavaDate(d1))
assert(format(d2) === format(d1))
assert(d2.toString === d1.toString)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SparkQA
Copy link

SparkQA commented Mar 22, 2020

Test build #120169 has finished for PR 27980 at commit 0d4bc10.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 23, 2020

Test build #120170 has finished for PR 27980 at commit db89c92.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SQL] Rebase any date-times in conversions to/from Java types [SPARK-31221][SQL] Rebase any date-times in conversions to/from Java types Mar 23, 2020
@MaxGekk
Copy link
Member Author

MaxGekk commented Mar 23, 2020

@cloud-fan @HyukjinKwon Please, review this PR.

@SparkQA
Copy link

SparkQA commented Mar 23, 2020

Test build #120197 has finished for PR 27980 at commit 6e8dba0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 24, 2020

Test build #120251 has finished for PR 27980 at commit 46b6613.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

The last commit updated a scala test, which is not related to the pyspark tests. The pip is a known issue.

Thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 1fd4607 Mar 24, 2020
cloud-fan pushed a commit that referenced this pull request Mar 24, 2020
…types

In the PR, I propose to apply rebasing for all dates/timestamps in conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed via building a local date-time in an original calendar, extracting date-time fields from the result, and creating new local date-time in the target calendar.

The changes are need to be compatible with previous Spark version (2.4.5 and earlier versions) not only before the Gregorian cutover date `1582-10-15` but also for dates after the date. For instance, Gregorian calendar implementation in Java 7 `java.util.GregorianCalendar` is not accurate in resolving time zone offsets as Gregorian calendar introduced since Java 8.

Yes, this PR can introduce behavior changes for dates after `1582-10-15`, in particular conversions of zone ids to zone offsets will be much more accurate.

By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`.

Closes #27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 1fd4607)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…types

### What changes were proposed in this pull request?
In the PR, I propose to apply rebasing for all dates/timestamps in conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed via building a local date-time in an original calendar, extracting date-time fields from the result, and creating new local date-time in the target calendar.

### Why are the changes needed?
The changes are need to be compatible with previous Spark version (2.4.5 and earlier versions) not only before the Gregorian cutover date `1582-10-15` but also for dates after the date. For instance, Gregorian calendar implementation in Java 7 `java.util.GregorianCalendar` is not accurate in resolving time zone offsets as Gregorian calendar introduced since Java 8.

### Does this PR introduce any user-facing change?
Yes, this PR can introduce behavior changes for dates after `1582-10-15`, in particular conversions of zone ids to zone offsets will be much more accurate.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`.

Closes apache#27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the reuse-rebase-funcs-in-java-funcs branch June 5, 2020 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants