
[SPARK-31076][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time #27807

Closed
wants to merge 24 commits into from

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Mar 5, 2020

What changes were proposed in this pull request?

In this PR, I propose to change the conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType for dates before the Gregorian calendar's cutover day, 1582-10-15. The idea is to construct a local date-time from the microseconds/days since the epoch, take each date-time component (year, month, day, hour, minute, second and second fraction), and build the java.sql.Timestamp/Date from the extracted components.
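The proposed conversion can be sketched as follows (a minimal sketch, not Spark's actual implementation; the helper name and the hard-coded UTC offset are assumptions for illustration). `Timestamp.valueOf(LocalDateTime)` performs exactly the component-wise construction described above:

```java
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class RebaseSketch {
    // Illustrative helper (not Spark's actual code): interpret micros since
    // the epoch in the Proleptic Gregorian calendar, extract the local
    // date-time components, and rebuild a java.sql.Timestamp from them, so
    // the hybrid Julian/Gregorian Timestamp shows the same local date-time.
    public static Timestamp toJavaTimestamp(long micros) {
        LocalDateTime ldt = LocalDateTime.ofEpochSecond(
            Math.floorDiv(micros, 1_000_000L),
            (int) (Math.floorMod(micros, 1_000_000L) * 1_000L),
            ZoneOffset.UTC);
        // Component-wise construction in the hybrid Julian/Gregorian calendar.
        return Timestamp.valueOf(ldt);
    }

    public static void main(String[] args) {
        // 1100-10-10 in the Proleptic Gregorian calendar, as micros since the epoch
        long micros = LocalDate.of(1100, 10, 10).toEpochDay() * 86_400L * 1_000_000L;
        System.out.println(toJavaTimestamp(micros));  // 1100-10-10 00:00:00.0
    }
}
```

The real code resolves the components against a configured/default time zone rather than hard-coded UTC; the sketch fixes UTC only to keep the example deterministic.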

Why are the changes needed?

This rebases the underlying time/date offset so that collected java.sql.Timestamp/Date values have the same local date-time components as the original values in the Proleptic Gregorian calendar.

Here is an example that demonstrates the issue:

```scala
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
```

Does this PR introduce any user-facing change?

Yes, after the changes:

```scala
scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10])
```

How was this patch tested?

By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

@SparkQA

SparkQA commented Mar 5, 2020

Test build #119382 has finished for PR 27807 at commit b9532d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 5, 2020

Test build #119411 has finished for PR 27807 at commit 503e3f0.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 5, 2020

Test build #119418 has finished for PR 27807 at commit 4a13df3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 5, 2020

Test build #119419 has finished for PR 27807 at commit 120db04.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Mar 6, 2020

@cloud-fan The rebasing causes a few issues. For example, we can no longer use TimestampFormatter/DateFormatter to format java.sql.Timestamp/Date:

```scala
timestampFormatter.format(DateTimeUtils.fromJavaTimestamp(t))
```

To format rebased Java Timestamp/Date instances before 1582, we need to use a Julian-based formatter such as SimpleDateFormat.
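A minimal illustration of this point (the value is arbitrary): `SimpleDateFormat` is backed by the hybrid Julian/Gregorian `GregorianCalendar`, so it round-trips the local components of a rebased pre-1582 Timestamp, whereas a java.time-based formatter would interpret the instant in the Proleptic Gregorian calendar and print shifted components.

```java
import java.sql.Timestamp;
import java.text.SimpleDateFormat;

public class JulianFormatSketch {
    public static void main(String[] args) {
        // Timestamp.valueOf builds the value from its local components in the
        // hybrid Julian/Gregorian calendar (and the default time zone).
        Timestamp ts = Timestamp.valueOf("1100-10-10 01:02:03");
        // SimpleDateFormat uses the same calendar and time zone, so the
        // components survive formatting even before the 1582-10-15 cutover.
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        System.out.println(fmt.format(ts));  // 1100-10-10 01:02:03
    }
}
```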

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119471 has finished for PR 27807 at commit d2e3a27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [WIP][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time [SPARK-31076][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time Mar 6, 2020
@MaxGekk
Member Author

MaxGekk commented Mar 6, 2020

@cloud-fan @juliuszsompolski Please, take a look at the PR if you have time.

@SparkQA

SparkQA commented Mar 6, 2020

Test build #119484 has finished for PR 27807 at commit 8e83e1f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2020

Test build #119486 has finished for PR 27807 at commit fe8c1fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 7, 2020

Test build #119511 has finished for PR 27807 at commit f4ced32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -136,7 +138,9 @@ public int getInt(int rowId) {
public long getLong(int rowId) {
int index = getRowIndex(rowId);
if (isTimestamp) {
return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
Timestamp ts = new Timestamp(timestampData.time[index]);
Member Author

This is a slightly orthogonal bug fix. @dongjoon-hyun FYI

Contributor

does parquet have the same issue?

Member

Hi, @MaxGekk . Could you elaborate a little more what was the existing bug here?

Member

Does this PR add a test coverage for this vectorized code path change?

Contributor

can we add a round trip test in OrcQuerySuite to read/write date before 1582?

Member Author

Could you elaborate a little more what was the existing bug here?

The ORC writer uses DateTimeUtils.toJavaTimestamp, see

```scala
case TimestampType => (getter, ordinal) =>
  val ts = DateTimeUtils.toJavaTimestamp(getter.getLong(ordinal))
  val result = new OrcTimestamp(ts.getTime)
  result.setNanos(ts.getNanos)
  result
```

but here we don't use the opposite function, DateTimeUtils.fromJavaTimestamp.

And the hand-written code being replaced is not equivalent to fromJavaTimestamp.

Does this PR add a test coverage for this vectorized code path change?

I had to fix this place due to test failures, see https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119486/testReport/

The changes are covered by the round-trip test, which is run by OrcHadoopFsRelationSuite for the vectorized code path.

can we add a round trip test in OrcQuerySuite to read/write date before 1582?

@cloud-fan The test I pointed out above generates random dates/timestamps before 1582 with high probability.

Member

Got it. Thanks.

@@ -36,8 +36,8 @@ class HiveResultSuite extends SharedSparkSession {
test("timestamp formatting in hive result") {
val timestamps = Seq(
"2018-12-28 01:02:03",
"1582-10-13 01:02:03",
"1582-10-14 01:02:03",
"1582-10-03 01:02:03",
Member Author

Conversion of timestamps in the range 1582-10-04 to 1582-10-15 is implementation-specific because of the calendar switch.

@@ -618,7 +619,19 @@ private[hive] trait HiveInspectors {
case x: DateObjectInspector if x.preferWritable() =>
data: Any => {
if (data != null) {
DateTimeUtils.fromJavaDate(x.getPrimitiveWritableObject(data).get())
val millis = Math.multiplyExact(
Contributor

can we add some comments to explain why we need to do it?

Member Author

Added

@@ -136,7 +138,9 @@ public int getInt(int rowId) {
public long getLong(int rowId) {
int index = getRowIndex(rowId);
if (isTimestamp) {
return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
Contributor

I took a look at the ORC type spec, but it doesn't mention the calendar. The physical timestamp type looks very similar to Timestamp, so this looks correct to me.

How about the write side?

Member Author

The write side uses toJavaTimestamp already:

```scala
case TimestampType => (getter, ordinal) =>
  val ts = DateTimeUtils.toJavaTimestamp(getter.getLong(ordinal))
  val result = new OrcTimestamp(ts.getTime)
  result.setNanos(ts.getNanos)
  result
```

So after this PR, it will do the rebasing automatically.

Member

// For example:
// Proleptic Gregorian calendar: 1582-01-01 -> -141714
// Julian calendar: 1582-01-01 -> -141704
// The code below converts -141714 to -141704.
Member Author
@MaxGekk MaxGekk Mar 10, 2020

The Gregorian year is shorter than the Julian year (365.2425 days vs. 365.25 days), so the same local date needs fewer days, counted back from the epoch, in the Julian calendar.
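The day counts in the quoted comment can be reproduced with the JDK's two calendar implementations (a sketch; UTC is assumed only to keep the arithmetic exact):

```java
import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarDiffSketch {
    public static void main(String[] args) {
        // Proleptic Gregorian calendar: java.time
        long proleptic = LocalDate.of(1582, 1, 1).toEpochDay();

        // Hybrid Julian/Gregorian calendar: java.util.GregorianCalendar,
        // which uses the Julian calendar before the 1582-10-15 cutover.
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(1582, Calendar.JANUARY, 1);
        long hybrid = Math.floorDiv(cal.getTimeInMillis(), 86_400_000L);

        System.out.println(proleptic);  // -141714
        System.out.println(hybrid);     // -141704
    }
}
```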

@SparkQA

SparkQA commented Mar 10, 2020

Test build #119609 has finished for PR 27807 at commit 5f5fdd6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 10, 2020

Test build #119616 has finished for PR 27807 at commit da8de7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

return timestampData.time[index] * 1000 + timestampData.nanos[index] / 1000 % 1000;
Timestamp ts = new Timestamp(timestampData.time[index]);
ts.setNanos(timestampData.nanos[index]);
return DateTimeUtils.fromJavaTimestamp(ts);
Contributor

We can use a simpler API: DateTimeUtils.fromJavaTimestamp(timestampData.asScratchTimestamp(index))

Contributor

Under the hood, this re-uses the Timestamp object and should be more efficient.
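The scratch-object pattern can be sketched with a simplified column vector (ORC's real `TimestampColumnVector` lives in the ORC storage-api; this standalone class only mirrors the idea of `asScratchTimestamp`):

```java
import java.sql.Timestamp;

public class ScratchTimestampSketch {
    // Simplified stand-in for ORC's TimestampColumnVector: one mutable
    // Timestamp is reused for every row instead of allocating per call.
    static class TimestampColumn {
        long[] time;   // milliseconds since the epoch, per row
        int[] nanos;   // nanosecond-of-second fraction, per row
        private final Timestamp scratch = new Timestamp(0);

        Timestamp asScratchTimestamp(int row) {
            scratch.setTime(time[row]);
            scratch.setNanos(nanos[row]);
            return scratch;  // same object on every call: copy it if you keep it
        }
    }

    public static void main(String[] args) {
        TimestampColumn col = new TimestampColumn();
        col.time = new long[] {1000L};
        col.nanos = new int[] {123_456_789};
        Timestamp ts = col.asScratchTimestamp(0);
        System.out.println(ts.getTime());   // 1123 (millis include the nano part)
        System.out.println(ts.getNanos());  // 123456789
    }
}
```

The caller must consume the returned Timestamp before the next row is read, which is exactly what `DateTimeUtils.fromJavaTimestamp` does here.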

@SparkQA

SparkQA commented Mar 11, 2020

Test build #119652 has finished for PR 27807 at commit 407cc1f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Mar 11, 2020

jenkins, retest this, please

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Mar 11, 2020

Test build #119657 has finished for PR 27807 at commit 407cc1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master/3.0!

@cloud-fan cloud-fan closed this in 3d3e366 Mar 11, 2020
cloud-fan pushed a commit that referenced this pull request Mar 11, 2020

Closes #27807 from MaxGekk/rebase-timestamp-before-1582.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3d3e366)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Member

Thank you, @MaxGekk and @cloud-fan .
I marked this as a correctness issue according to @MaxGekk's comment, #27807 (comment).

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 11, 2020

BTW, @MaxGekk, is this PR's correctness issue only in 3.0.0?
Since we touched old code, could you make another PR for branch-2.4 for the following, with one small specific UT?

@cloud-fan
Contributor

It's only a bug in 3.0, as we switched the calendar there.

cloud-fan pushed a commit that referenced this pull request Mar 30, 2020
…asource

### What changes were proposed in this pull request?
In the PR, I propose to add a new benchmark, `DateTimeRebaseBenchmark`, which measures the performance of rebasing dates/timestamps between the hybrid calendar (Julian + Gregorian) and the Proleptic Gregorian calendar:
1. On write, it saves dates and timestamps before and after the year 1582 separately, with and without rebasing.
2. On read, it loads the previously saved parquet files using the vectorized reader and the regular reader.

Here is the summary of benchmarking:
- Saving timestamps is **~6 times slower**
- Loading timestamps w/ vectorized **off** is **~4 times slower**
- Loading timestamps w/ vectorized **on** is **~10 times slower**

### Why are the changes needed?
To know the impact of date-time rebasing introduced by #27915, #27953, #27807.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run the `DateTimeRebaseBenchmark` benchmark using Amazon EC2:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28057 from MaxGekk/rebase-bechmark.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan pushed a commit that referenced this pull request Mar 30, 2020
(cherry picked from commit a1dbcd1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
@MaxGekk MaxGekk deleted the rebase-timestamp-before-1582 branch June 5, 2020 19:46