
[SPARK-31672][SQL] Fix loading of timestamps before 1582-10-15 from dictionary encoded Parquet columns #28489

Closed · wants to merge 3 commits

Conversation

@MaxGekk (Member) commented May 10, 2020

What changes were proposed in this pull request?

Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle `TimestampType` specially when the passed parameter `rebaseDateTime` is true. In that case, decoded milliseconds/microseconds are rebased from the hybrid calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianMicros()`.
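
With dictionary encoding, each row stores an index into a dictionary of raw microsecond values, so the rebase has to be applied while those indexes are resolved. A simplified Scala sketch of that logic (illustrative names only; the real implementation is Java code in `VectorizedColumnReader` writing into a column vector):

```scala
import org.apache.spark.sql.catalyst.util.RebaseDateTime

// Resolve dictionary ids to TIMESTAMP_MICROS values, rebasing each decoded
// value from the hybrid Julian calendar to Proleptic Gregorian if requested.
def decodeTimestampIds(
    dictionaryIds: Array[Int],   // per-row indexes into the dictionary
    dictionary: Array[Long],     // dictionary of raw microsecond values
    rebaseDateTime: Boolean): Array[Long] = {
  dictionaryIds.map { id =>
    val micros = dictionary(id)
    if (rebaseDateTime) RebaseDateTime.rebaseJulianToGregorianMicros(micros)
    else micros
  }
}
```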

Why are the changes needed?

This fixes a bug in loading timestamps before the cutover day (1582-10-15) from dictionary encoded columns in Parquet files. The code below forces dictionary encoding:

```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

Seq.tabulate(8)(_ => "1001-01-01 01:02:03.123").toDF("tsS")
  .select($"tsS".cast("timestamp").as("ts")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```

Load the timestamps back:

```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-07 00:32:20.123|
...
|1001-01-07 00:32:20.123|
+-----------------------+
```

The expected value is **1001-01-01 01:02:03.123**, not 1001-01-07 00:32:20.123: the microseconds were written rebased to the hybrid calendar, but read back without rebasing to the Proleptic Gregorian calendar.
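
The six-day shift is the gap between the Julian and Proleptic Gregorian calendars in the 11th century, and it can be reproduced with plain JDK classes: `java.util.GregorianCalendar` uses the hybrid Julian/Gregorian calendar, while `java.time` is proleptic Gregorian. A minimal sketch of the mismatch (an illustration, not Spark's rebase code):

```scala
import java.time.ZoneOffset
import java.util.{Calendar, GregorianCalendar, TimeZone}

// 1001-01-01 in the hybrid calendar (Julian rules before the 1582-10-15 cutover).
val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
hybrid.clear()
hybrid.set(1001, Calendar.JANUARY, 1)

// Relabel the same instant with the proleptic Gregorian calendar of java.time:
// the date moves six days forward, matching the corrupted output above.
println(hybrid.toInstant.atZone(ZoneOffset.UTC).toLocalDate) // 1001-01-07
```

The extra drift in the time-of-day part (01:02:03 vs 00:32:20) likely comes from historical local-mean-time offsets of the session time zone, which also differ between the two calendar code paths.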

Does this PR introduce any user-facing change?

Yes. After the changes:

```scala
scala> spark.read.parquet(path).show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1001-01-01 01:02:03.123|
...
|1001-01-01 01:02:03.123|
+-----------------------+
```

How was this patch tested?

Modified the test `SPARK-31159: rebasing timestamps in write` in `ParquetIOSuite` to check reading dictionary encoded timestamps.
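
For reference, a minimal standalone roundtrip check in the spirit of that test; the real suite uses Spark's test helpers such as `withTempPath`, so the temp-path handling below is a stand-in:

```scala
import java.nio.file.Files

val path = Files.createTempDirectory("ts-rebase-").resolve("ts.parquet").toString
val ts = "1001-01-01 01:02:03.123"

spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
Seq.tabulate(8)(_ => ts).toDF("tsS")
  .select($"tsS".cast("timestamp").as("ts")).repartition(1)
  .write.option("parquet.enable.dictionary", true).parquet(path)

// With the fix, all eight dictionary-encoded values read back unchanged.
assert(spark.read.parquet(path).collect().forall(_.getTimestamp(0).toString == ts))
```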

@MaxGekk (Member, Author) commented May 10, 2020

@cloud-fan @HyukjinKwon Please review this PR.

@SparkQA commented May 10, 2020

Test build #122474 has finished for PR 28489 at commit 8eecc90.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 10, 2020

Test build #122476 has finished for PR 28489 at commit 06dc785.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor) commented

LGTM, merging to master/3.0!

@cloud-fan closed this in 5d5866b May 11, 2020
cloud-fan pushed a commit that referenced this pull request May 11, 2020
[SPARK-31672][SQL] Fix loading of timestamps before 1582-10-15 from dictionary encoded Parquet columns


Closes #28489 from MaxGekk/fix-ts-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 5d5866b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon (Member) left a comment

I thought I approved this. LGTM

@MaxGekk deleted the fix-ts-rebase-parquet-dict-enc branch June 5, 2020 19:49