[SPARK-31076][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time #27807
Conversation
Test build #119382 has finished for PR 27807 at commit
Test build #119411 has finished for PR 27807 at commit
Test build #119418 has finished for PR 27807 at commit
Test build #119419 has finished for PR 27807 at commit
@cloud-fan The rebasing causes a few issues. For example, we cannot use TimestampFormatter/DateFormatter to format java.sql.Timestamp/Date anymore: to format rebased Java Timestamp/Date instances before 1582, we need to use a Julian-based formatter such as SimpleDateFormat.
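For illustration, a minimal sketch (the timestamp value is made up): SimpleDateFormat is backed by the hybrid Julian+Gregorian calendar, so it agrees with rebased java.sql.Timestamp values before 1582, while a Proleptic Gregorian formatter would not.

```scala
import java.text.SimpleDateFormat

// SimpleDateFormat uses java.util.GregorianCalendar under the hood, i.e. the
// hybrid Julian+Gregorian calendar, matching java.sql.Timestamp's semantics.
val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
// Timestamp.valueOf interprets the string as a local date-time in the hybrid calendar.
val ts = java.sql.Timestamp.valueOf("1100-10-10 01:02:03")
println(fmt.format(ts)) // 1100-10-10 01:02:03
```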
Test build #119471 has finished for PR 27807 at commit
@cloud-fan @juliuszsompolski Please take a look at the PR if you have time.
sql/core/src/test/resources/sql-tests/results/postgreSQL/date.sql.out
sql/core/src/test/scala/org/apache/spark/sql/execution/HiveResultSuite.scala
Test build #119484 has finished for PR 27807 at commit
Test build #119486 has finished for PR 27807 at commit
Test build #119511 has finished for PR 27807 at commit
```diff
@@ -136,7 +138,9 @@ public int getInt(int rowId) {
 public long getLong(int rowId) {
   int index = getRowIndex(rowId);
   if (isTimestamp) {
-    return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
+    Timestamp ts = new Timestamp(timestampData.time[index]);
```
This is a slightly orthogonal bug fix. @dongjoon-hyun FYI
does parquet have the same issue?
Hi, @MaxGekk. Could you elaborate a little more on what the existing bug here was?
Does this PR add test coverage for this vectorized code path change?
can we add a round-trip test in OrcQuerySuite to read/write dates before 1582?
> Could you elaborate a little more on what the existing bug here was?

The ORC writer uses DateTimeUtils.toJavaTimestamp, see
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
Lines 144 to 148 in 300ec1a
```scala
case TimestampType => (getter, ordinal) =>
  val ts = DateTimeUtils.toJavaTimestamp(getter.getLong(ordinal))
  val result = new OrcTimestamp(ts.getTime)
  result.setNanos(ts.getNanos)
  result
```
so the reader has to invert it with DateTimeUtils.fromJavaTimestamp. And the replaced hand-written code is not equivalent to fromJavaTimestamp.
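A hedged sketch of the symmetry being relied on here (the micros value is illustrative, chosen to land before 1582):

```scala
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// The writer rebases Catalyst micros to a hybrid-calendar java.sql.Timestamp;
// the reader must apply the exact inverse. Hand-written millis/nanos math
// does not invert the rebasing for timestamps before the 1582-10-15 cutover.
val micros = -27430000000000000L // roughly year 1100, illustrative value
val ts = DateTimeUtils.toJavaTimestamp(micros)
assert(DateTimeUtils.fromJavaTimestamp(ts) == micros)
```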
> Does this PR add test coverage for this vectorized code path change?

I had to fix this place due to test failures, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119486/testReport/
The changes are covered by the round-trip test
spark/sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala
Line 122 in 7782b61
```scala
test(s"test all data types") {
```
which OrcHadoopFsRelationSuite runs for the vectorized code path.

> can we add a round-trip test in OrcQuerySuite to read/write dates before 1582?

@cloud-fan The test I pointed out above generates random dates/timestamps before 1582 with high probability.
Got it. Thanks.
```diff
@@ -36,8 +36,8 @@ class HiveResultSuite extends SharedSparkSession {
   test("timestamp formatting in hive result") {
     val timestamps = Seq(
       "2018-12-28 01:02:03",
-      "1582-10-13 01:02:03",
-      "1582-10-14 01:02:03",
+      "1582-10-03 01:02:03",
```
Conversions of timestamps in the range 1582-10-04 to 1582-10-15 are implementation-specific because of the calendar switch.
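A minimal illustration of the gap (UTC chosen for determinism):

```scala
import java.util.{Calendar, GregorianCalendar, TimeZone}

// java.util.GregorianCalendar switches from Julian to Gregorian rules at
// 1582-10-15, so the local dates 1582-10-05 through 1582-10-14 do not exist.
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1582, Calendar.OCTOBER, 4)
cal.add(Calendar.DAY_OF_MONTH, 1)
println(cal.get(Calendar.DAY_OF_MONTH)) // 15: the day after 1582-10-04
```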
```diff
@@ -618,7 +619,19 @@ private[hive] trait HiveInspectors {
 case x: DateObjectInspector if x.preferWritable() =>
   data: Any => {
     if (data != null) {
-      DateTimeUtils.fromJavaDate(x.getPrimitiveWritableObject(data).get())
+      val millis = Math.multiplyExact(
```
can we add some comments to explain why we need to do it?
Added
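For context, a hedged sketch of the idea behind the new branch (the day count, constant, and a UTC session time zone are assumptions; the truncated diff above only shows the first line): the writable stores a day count in the hybrid calendar, so converting it to milliseconds and routing it through `fromJavaDate` rebases it to Proleptic Gregorian days.

```scala
import java.sql.Date
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Hypothetical input: a DateWritable-style day count in the hybrid calendar.
val hybridDays: Int = -141704 // 1582-01-01 in the Julian calendar
val millisPerDay = 24L * 60 * 60 * 1000
// Overflow-checked multiplication, as in the diff above.
val millis = Math.multiplyExact(hybridDays.toLong, millisPerDay)
// fromJavaDate extracts the local date and rebases it to Proleptic Gregorian
// days (assuming UTC; -141714 per the calendar example later in the thread).
val catalystDays = DateTimeUtils.fromJavaDate(new Date(millis))
```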
```diff
@@ -136,7 +138,9 @@ public int getInt(int rowId) {
 public long getLong(int rowId) {
   int index = getRowIndex(rowId);
   if (isTimestamp) {
-    return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
```
I took a look at the ORC type spec, but it doesn't mention the calendar. The physical timestamp type looks very similar to Timestamp, so this looks correct to me. How about the write side?
The write side uses toJavaTimestamp already:
spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcSerializer.scala
Lines 144 to 148 in 300ec1a
```scala
case TimestampType => (getter, ordinal) =>
  val ts = DateTimeUtils.toJavaTimestamp(getter.getLong(ordinal))
  val result = new OrcTimestamp(ts.getTime)
  result.setNanos(ts.getNanos)
  result
```
Yes, @cloud-fan. Apache ORC inherited Apache Hive's original design.
```scala
// For example:
// Proleptic Gregorian calendar: 1582-01-01 -> -141714
// Julian calendar: 1582-01-01 -> -141704
// The code below converts -141714 to -141704.
```
The Gregorian year is shorter than the Julian year (365.2425 days vs. 365.25 days), so the same local date requires fewer days in the Julian calendar.
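A quick way to check both day counts with standard Java APIs (UTC assumed for determinism):

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

// Proleptic Gregorian day count of the local date 1582-01-01.
val gregorianDays = LocalDate.of(1582, 1, 1).toEpochDay // -141714

// The same local date in the hybrid calendar (Julian before 1582-10-15).
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1582, 0, 1) // month is 0-based: January
val julianDays = Math.floorDiv(cal.getTimeInMillis, 86400000L) // -141704
```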
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
Test build #119609 has finished for PR 27807 at commit
Test build #119616 has finished for PR 27807 at commit
```diff
-    return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
+    Timestamp ts = new Timestamp(timestampData.time[index]);
+    ts.setNanos(timestampData.nanos[index]);
+    return DateTimeUtils.fromJavaTimestamp(ts);
```
We can use a simpler API: `DateTimeUtils.fromJavaTimestamp(timestampData.asScratchTimestamp(index))`
Under the hood, this re-uses the Timestamp object and should be more efficient.
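A sketch of the suggested form (import path shown for Hive's vector classes; Spark's ORC module may use a shaded copy):

```scala
import org.apache.hadoop.hive.ql.exec.vector.TimestampColumnVector
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// asScratchTimestamp(index) fills and returns the vector's reusable scratch
// Timestamp, so reading a row does not allocate a new object per call.
def getMicros(timestampData: TimestampColumnVector, index: Int): Long =
  DateTimeUtils.fromJavaTimestamp(timestampData.asScratchTimestamp(index))
```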
Test build #119652 has finished for PR 27807 at commit
jenkins, retest this, please
retest this please
Test build #119657 has finished for PR 27807 at commit
thanks, merging to master/3.0!
…estamp via local date-time

### What changes were proposed in this pull request?
In the PR, I propose to change conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType before cutover day `1582-10-15` of Gregorian calendar. I propose to construct local date-time from microseconds/days since the epoch. Take each date-time component `year`, `month`, `day`, `hour`, `minute`, `second` and `second fraction`, and construct java.sql.Timestamp/Date using the extracted components.

### Why are the changes needed?
This will rebase the underlying time/date offset in the way that collected java.sql.Timestamp/Date values will have the same local date-time components as the original values in Gregorian calendar. Here is the example which demonstrates the issue:
```scala
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
```

### Does this PR introduce any user-facing change?
Yes, after the changes:
```scala
scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10])
```

### How was this patch tested?
By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #27807 from MaxGekk/rebase-timestamp-before-1582.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3d3e366)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Thank you, @MaxGekk and @cloud-fan.
BTW, @MaxGekk.
It's only a bug in 3.0 as we switch the calendar.
…asource

### What changes were proposed in this pull request?
In the PR, I propose to add a new benchmark `DateTimeRebaseBenchmark` which should measure the performance of rebasing of dates/timestamps from/to the hybrid calendar (Julian+Gregorian) to/from Proleptic Gregorian calendar:
1. In write, it saves separately dates and timestamps before and after the year 1582, w/ and w/o rebasing.
2. In read, it loads previously saved parquet files by the vectorized reader and by the regular reader.

Here is the summary of benchmarking:
- Saving timestamps is **~6 times slower**
- Loading timestamps w/ vectorized **off** is **~4 times slower**
- Loading timestamps w/ vectorized **on** is **~10 times slower**

### Why are the changes needed?
To know the impact of date-time rebasing introduced by #27915, #27953, #27807.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run the `DateTimeRebaseBenchmark` benchmark using Amazon EC2:

| Item | Description |
| ---- | ---- |
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28057 from MaxGekk/rebase-bechmark.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>