
[SPARK-27401][SQL] Refactoring conversion of Timestamp to/from java.sql.Timestamp #24311

Closed · wants to merge 7 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Apr 6, 2019

What changes were proposed in this pull request?

In this PR, I propose a simpler implementation of toJavaTimestamp()/fromJavaTimestamp() that reuses existing functions of DateTimeUtils. This allows us to:

  • Simplify the implementation of toJavaTimestamp(), and handle negative inputs properly.
  • Detect Long overflow in the conversion of milliseconds (java.sql.Timestamp) to microseconds (Catalyst's Timestamp).
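To illustrate the second point: `java.lang.Math.multiplyExact` throws on overflow instead of silently wrapping, which is what makes the detection possible. A minimal standalone sketch (the values here are illustrative, not taken from the PR):

```java
public class MillisToMicros {
    public static void main(String[] args) {
        // Converting milliseconds to microseconds multiplies by 1000;
        // multiplyExact throws instead of silently wrapping around.
        System.out.println(Math.multiplyExact(1_554_508_800_000L, 1_000L));

        try {
            // This product exceeds Long.MAX_VALUE, so it overflows:
            Math.multiplyExact(Long.MAX_VALUE / 500, 1_000L);
        } catch (ArithmeticException e) {
            System.out.println("long overflow detected");
        }
    }
}
```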

How was this patch tested?

By the existing test suites DateTimeUtilsSuite, DateFunctionsSuite, DateExpressionsSuite and CastSuite, and by a new benchmark for exporting/importing timestamps added to DateTimeBenchmark:

Before:

To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             290            335          49         17.2          58.0       1.0X
Collect longs                                      1234           1681         487          4.1         246.8       0.2X
Collect timestamps                                 1718           1755          63          2.9         343.7       0.2X

After:

To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             283            301          19         17.7          56.6       1.0X
Collect longs                                      1048           1087          36          4.8         209.6       0.3X
Collect timestamps                                 1425           1479          56          3.5         285.1       0.2X

@SparkQA

SparkQA commented Apr 6, 2019

Test build #104344 has finished for PR 24311 at commit b0ced9f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 6, 2019

Test build #104345 has finished for PR 24311 at commit 6129c9a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 6, 2019

The failed test is an ORC-related one that generates random dates: HiveOrcHadoopFsRelationSuite. It seems ORC uses its own methods for converting Catalyst's days since epoch to something like java.sql.Date, which is based on the Julian calendar for old dates:

!== Correct Answer - 10 ==    == Spark Answer - 10 ==
 struct<index:int,col:date>   struct<index:int,col:date>
 [1,3477-08-12]               [1,3477-08-12]
 [2,7867-09-20]               [2,7867-09-20]
 [3,null]                     [3,null]
 [4,6577-03-26]               [4,6577-03-26]
 [5,8002-07-25]               [5,8002-07-25]
 [6,2154-05-10]               [6,2154-05-10]
 [7,4921-05-03]               [7,4921-05-03]
![8,0647-07-01]               [8,0647-06-28]
 [9,9422-12-25]               [9,9422-12-25]
 [10,2677-06-15]              [10,2677-06-15]

@dongjoon-hyun Am I right? It is strange that only this ORC test failed.
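For context, the 3-day shift in the failing row (`0647-07-01` vs `0647-06-28`) matches the gap between the Julian and proleptic Gregorian calendars in year 647. A standalone sketch of that gap using only JDK classes (not Spark or ORC code):

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class JulianVsGregorian {
    public static void main(String[] args) {
        // Force pure-Julian behavior by pushing the Gregorian cutover
        // to the far future.
        GregorianCalendar julian =
            new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        julian.setGregorianChange(new Date(Long.MAX_VALUE));
        julian.clear();
        julian.set(647, Calendar.JULY, 1);  // Julian 0647-07-01, midnight UTC

        // Re-label the same instant in the ISO (proleptic Gregorian) calendar.
        long epochDay = Math.floorDiv(julian.getTimeInMillis(), 86_400_000L);
        LocalDate gregorian = LocalDate.ofEpochDay(epochDay);

        System.out.println(gregorian);
        System.out.println(Math.abs(
            ChronoUnit.DAYS.between(LocalDate.of(647, 7, 1), gregorian)));
    }
}
```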

@MaxGekk MaxGekk changed the title [SPARK-27401][SQL] Refactoring conversion of Date/Timestamp to/from java.sql.Date/Timestamp [SPARK-27401][SQL] Refactoring conversion of Timestamp to/from java.sql.Timestamp Apr 6, 2019
@dongjoon-hyun
Member

dongjoon-hyun commented Apr 7, 2019

Thank you for pinging me, @MaxGekk . I'll take a look today.

BTW, @srowen . Sorry, but do you mind if I ask why you approved this PR while it is suffering from UT failures? Does it pass in your environment?

@srowen
Member

srowen commented Apr 7, 2019

Oh, really I just mean this looks fine pending tests. If they're transient failures, fine. If they're not, obviously they need to be fixed. Here I assume they aren't significant failures, just tests that need to be adjusted.

@dongjoon-hyun
Member

Oh, @MaxGekk . According to the last reverting commit, you are not targeting the DATE type in this PR. Are you going to file a JIRA issue for the DATE type?

@dongjoon-hyun
Member

Thank you, @srowen . I see~
It seems that @MaxGekk reduced the scope of this PR first so that it passes Jenkins.

@dongjoon-hyun
Member

Before the last revert, I got the following result, @MaxGekk . If there was an issue, it might be an old Hive 1.2.1 issue which is fixed in Apache ORC. If needed, you can file a JIRA issue. For the old Hive 1.2.1 issue, I can dig into it for you, but it might be difficult to fix because it seems to live outside both Apache Spark and Apache ORC.

  • OrcHadoopFsRelationSuite passed.
  • HiveOrcHadoopFsRelationSuite failed.

@SparkQA

SparkQA commented Apr 7, 2019

Test build #104347 has finished for PR 24311 at commit bf4b6b6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

lazy val timestampLiteralGen: Gen[Literal] = {
  for { millis <- Gen.choose(yearToMillis(-9999), yearToMillis(10000)) }
    yield Literal.create(new Timestamp(millis), TimestampType)
}
Member

@dongjoon-hyun dongjoon-hyun Apr 7, 2019

@MaxGekk . In general, we don't change test cases during refactoring, in order to preserve the test coverage. I'm wondering if this is a required change due to a known limitation of the new functions. Although this looks minor, I'd make this a separate PR.

Member Author

I changed the random generator for timestamps and limited the max/min values of milliseconds since the epoch, because the plain long random generator can produce milliseconds that cause Long overflow in the conversion of milliseconds to microseconds. Internally, as you know, we store microseconds since the epoch in TimestampType. For example, the old (current) generator can create an instance like java.sql.Timestamp(-3948373668011580000, 570000000). The new function fromJavaTimestamp calls instantToMicros, which uses multiplyExact and can therefore detect Long overflow on multiplication:

def instantToMicros(instant: Instant): Long = {
  // for the Timestamp above, this multiplication overflows Long,
  // and multiplyExact throws ArithmeticException:
  val us = Math.multiplyExact(-3948373668011580L, 1000000L)
    ...
}


The previous (current) implementation doesn't detect the overflow at all.
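A bounded generator avoids the overflow by construction. A rough standalone sketch of the idea (in Java rather than the PR's Scala; the `yearToMillis` helper here is a hypothetical stand-in for the one in the PR's test code):

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.concurrent.ThreadLocalRandom;

public class BoundedMillisGen {
    // Hypothetical helper mirroring yearToMillis from the PR's test code:
    // milliseconds since the epoch for Jan 1 of the given year, UTC.
    static long yearToMillis(int year) {
        return LocalDate.of(year, 1, 1)
                        .atStartOfDay(ZoneOffset.UTC)
                        .toInstant()
                        .toEpochMilli();
    }

    public static void main(String[] args) {
        long lo = yearToMillis(-9999);
        long hi = yearToMillis(10000);
        long millis = ThreadLocalRandom.current().nextLong(lo, hi);

        // Every value in [lo, hi) survives the millis -> micros
        // multiplication without Long overflow.
        long micros = Math.multiplyExact(millis, 1_000L);
        System.out.println(micros >= lo * 1_000L && micros < hi * 1_000L);
    }
}
```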

Member Author

The long overflow exception caused failures in

Member

@dongjoon-hyun dongjoon-hyun Apr 7, 2019

That's the exact reason you should not put that in this PR, @MaxGekk .
Please do that improvement in another PR first, and do the refactoring PR later.

Member Author

@dongjoon-hyun ok. I will open another PR. Thanks.

@MaxGekk
Member Author

MaxGekk commented Apr 7, 2019

> … it because it seems to be not inside both Apache Spark and Apache ORC.
>
>   • OrcHadoopFsRelationSuite passed.
>   • HiveOrcHadoopFsRelationSuite failed.

Yeah, it seems there could be a calendar incompatibility issue somewhere inside Hive + ORC which could cause the difference ![8,0647-07-01] vs [8,0647-06-28] for old days before 1582 (I guess). I just thought we had switched to the Proleptic Gregorian calendar everywhere in Spark (https://issues.apache.org/jira/browse/SPARK-26651).

In any case, I think my date (not timestamp) related changes are potentially more expensive compared to the current implementation because of the conversion java.sql.Date -> java.time.LocalDate. The conversion extracts components like year and month from java.sql.Date, which is not cheap; on the other hand, the current implementation does time zone shifting, which is not necessary either. For now, I am going to leave the Java <-> Catalyst date conversions as is.
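The conversion path mentioned above can be sketched with the JDK API itself: `java.sql.Date.toLocalDate()` extracts the year/month/day fields through the default time zone's calendar, and `Date.valueOf(LocalDate)` goes the other way. A small illustration (not the PR's code):

```java
import java.sql.Date;
import java.time.LocalDate;

public class DateConversion {
    public static void main(String[] args) {
        // java.sql.Date -> java.time.LocalDate re-derives the calendar
        // fields from the underlying millis via the default time zone.
        Date sqlDate = Date.valueOf("2019-04-07");
        LocalDate local = sqlDate.toLocalDate();
        System.out.println(local);   // 2019-04-07

        // The reverse direction round-trips to the same calendar date.
        Date back = Date.valueOf(local);
        System.out.println(back);    // 2019-04-07
    }
}
```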

@SparkQA

SparkQA commented Apr 7, 2019

Test build #104360 has finished for PR 24311 at commit 20dfacf.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 7, 2019

Test build #104365 has finished for PR 24311 at commit ec15d82.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM (pending Jenkins).
Thank you, @MaxGekk and @srowen .

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104399 has finished for PR 24311 at commit ec15d82.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             309            316           8         16.2          61.9       1.0X
Collect longs                                      1410           2747        1158          3.5         282.0       0.2X
Member

@dongjoon-hyun dongjoon-hyun Apr 9, 2019

Oh, could you rerun this in a more stable environment?
The Stdev looks too high.

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104421 has finished for PR 24311 at commit b4cade3.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 9, 2019

jenkins, retest this, please

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104425 has finished for PR 24311 at commit b4cade3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 9, 2019

jenkins, retest this, please

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104431 has finished for PR 24311 at commit b4cade3.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Apr 9, 2019

I changed only sql/core/benchmarks/DateTimeBenchmark-results.txt since the last successful run #24311 (comment) . @dongjoon-hyun @srowen Does it make sense to re-run this again? It seems there is a persistent problem with the PySpark tests.

@MaxGekk
Member Author

MaxGekk commented Apr 9, 2019

jenkins, retest this, please

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104449 has finished for PR 24311 at commit b4cade3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Finally, it passed! Merged to master. Thank you, @MaxGekk and @srowen !

@MaxGekk MaxGekk deleted the conv-java-sql-date-timestamp branch September 18, 2019 15:57