
[SPARK-31361][SQL] Rebase datetime in parquet/avro according to file metadata #28137

Closed
wants to merge 2 commits

Conversation

cloud-fan
Contributor

cloud-fan commented Apr 6, 2020

What changes were proposed in this pull request?

This PR adds a new parquet/avro file metadata key: org.apache.spark.legacyDatetime. It indicates that the file was written with the "rebaseInWrite" config enabled, so Spark needs to rebase the values when reading it.

This lets Spark decide about rebasing more smartly (see the sketch after the list):

  1. If we don't know which Spark version wrote the file, rebase if the "rebaseInRead" config is true.
  2. If the file was written by Spark 2.4 or earlier, always rebase.
  3. If the file was written by Spark 3.0 or later, rebase only if org.apache.spark.legacyDatetime exists in the file metadata.
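
A minimal sketch of the resulting read-side decision (illustrative only, not the code in this PR; it assumes the writer version is stored under a key like org.apache.spark.version, and rebaseInRead stands for the value of the "rebaseInRead" config):

```scala
// Illustrative sketch: mirrors rules 1-3 above, not the actual Spark source.
object RebaseDecision {
  private val SPARK_VERSION_METADATA_KEY = "org.apache.spark.version"
  private val SPARK_LEGACY_DATETIME = "org.apache.spark.legacyDatetime"

  def needRebase(lookupFileMeta: String => String, rebaseInRead: Boolean): Boolean = {
    Option(lookupFileMeta(SPARK_VERSION_METADATA_KEY)) match {
      // 1. Unknown writer (e.g. the file was not written by Spark): follow the read config.
      case None => rebaseInRead
      // 2. Written by Spark 2.4 or earlier: always rebase.
      case Some(version) if version < "3.0" => true
      // 3. Written by Spark 3.0 or later: rebase only if the legacy marker is present.
      case _ => lookupFileMeta(SPARK_LEGACY_DATETIME) != null
    }
  }
}
```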

Why are the changes needed?

It's very easy to end up with mixed-calendar parquet/avro files. For example, a user upgrades to Spark 3.0 and writes some parquet files to an existing directory, then realizes that the directory contains legacy datetime values before 1582. At that point it's too late: they have to find all the legacy files manually and read them separately.

To support mixed-calendar parquet/avro files, we need to decide to rebase or not based on the file metadata.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Updated test

@cloud-fan
Contributor Author

def needRebaseDateTime(lookupFileMeta: String => String): Option[Boolean] = {
  // If there is no version, we return None and let the caller decide.
  // Files written by Spark 3.0 and later follow the new calendar and don't need to rebase.
  Option(lookupFileMeta(SPARK_VERSION_METADATA_KEY)).map(_ < "3.0")
}
Member

Thank you so much, @cloud-fan. This is a much better solution.

Member

Files written by Spark 3.0 and later follow the new calendar and don't need to rebase.

This is an unexpected solution for me, and from my point of view it doesn't cover some use cases. I do believe the Spark versions stored in parquet/avro are not enough to make a decision about rebasing.

Contributor Author

I think the use case you pointed out (write with 3.0 and read with 2.4) is a bit artificial. If someone develops a data pipeline, it's very tricky if some parts of the pipeline use 3.0 and some parts use 2.4. IMO the more common problem is to deal with legacy data files.

BTW we can support this use case by creating a new metadata entry to record the calendar, but I think it's too complicated and not worth it.

Member

I don't think we guarantee forward compatibility - things from Spark 2.4 should work in Spark 3.0 but I don't think the opposite case is guaranteed.

@dongjoon-hyun
Member

Could you update the PR description? It should be >= 3.0 instead of > 3.0.

Do not rebase datetime values when reading parquet/avro, if we know the file written version and it's > "3.0".

@MaxGekk
Member

MaxGekk commented Apr 6, 2020

I haven't looked at the implementation yet, but the description looks dangerous to me:

always write the datetime values to parquet/avro without rebasing

@cloud-fan I think there are at least 2 cases:

  1. Spark 3.0 saves parquet files that are supposed to be loaded by Spark 3.0 - no need to rebase.
  2. Spark 3.0 saves parquet files that should be loaded by Spark 2.4, 2.3, etc. - need to rebase.

Do you think the second case is not possible?

Do not rebase datetime values when reading parquet/avro, if we know the file written version and it's > "3.0"

but you don't know the purpose of the written files. Maybe a user wants to save parquet files with Spark 3.0 and read them back with Spark 2.4? For example, Spark 3.0 and Spark 2.4 can be used in different clusters that prepare data for each other. And later, if the files are loaded by Spark 3.0, how should users read the already saved parquet files? For example, when they migrated a cluster from 2.4 to 3.0 but the files were written for 2.4.

@@ -26,17 +15,6 @@
before 1582, vec off, rebase on    13836    13872
before 1582, vec on,  rebase off    3664     3672     7    27.3    36.6    3.5X
before 1582, vec on,  rebase on     6049     6078    30    16.5    60.5    2.1X

OpenJDK 64-Bit Server VM 11.0.6+10-post-Ubuntu-1ubuntu118.04.1 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save timestamps to parquet:    Best Time(ms)    Avg Time(ms)    Stdev(ms)    Rate(M/s)    Per Row(ns)    Relative
Member

Why do you delete it? How should I know the end-to-end write performance?

Member

If you decide to merge this PR, I would ask you to do that after the optimizations in #28119, otherwise I will not have a good baseline for end-to-end write performance.

Contributor Author

Yea we can wait for that. IIUC the optimization can also benefit rebasing when writing ORC files, right?

Member

Days rebasing in ORC will improve after the optimization because it uses DaysWritable, which is based on the functions touched in this PR.

Member

Timestamps rebasing in ORC won't get any benefit from the optimization because it uses to/fromJavaTimestamp.

Contributor Author

can't toJavaTimestamp benefit from your optimization?

Member

No, it does not use rebaseGregorianToJulianMicros, see:

def toJavaTimestamp(us: SQLTimestamp): Timestamp = {
  val ldt = microsToInstant(us).atZone(ZoneId.systemDefault()).toLocalDateTime
  Timestamp.valueOf(ldt)
}

We need to migrate it to that function, as I am going to do for days rebasing in #28091.

@SparkQA

SparkQA commented Apr 6, 2020

Test build #120876 has finished for PR 28137 at commit 8295549.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType, rebaseDateTime: Boolean)
  • class ParquetReadSupport(

@cloud-fan
Contributor Author

@MaxGekk It's always an option to ask users to upgrade software if some bugs are not fixed in earlier versions. For example, if Spark 3.0 writes Parquet files and Spark 2.4 may read wrong data, we can ask them to upgrade to Spark 3.0. Forward compatibility is not officially supported.

On the other hand, if Spark 2.4 writes Parquet files and Spark 3.0 may read wrong data, asking users to downgrade is not a good option. This PR tries to improve backward compatibility by rebasing datetime more smartly: don't rely on the config if the Spark version that wrote the file is known. I think that's more important than forward compatibility.

@MaxGekk
Member

MaxGekk commented Apr 7, 2020

For example, if Spark 3.0 writes Parquet files and Spark 2.4 may read wrong data, we can ask them to upgrade to Spark 3.0. Forward compatibility is not officially supported.

but before you removed the SQL config spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled, it was possible. After this PR, we force users to upgrade all their clusters to Spark 3.0, so partial migrations won't be possible.

@cloud-fan
Contributor Author

Partial upgrade is still possible, e.g. write files using 2.4 and read them using 3.0. It's not possible to do the opposite though, as forward compatibility is not guaranteed. And that config doesn't really help much with forward compatibility: it's hard for users to know when to enable it.

@MaxGekk
Member

MaxGekk commented Apr 7, 2020

It's not possible to do the opposite though, as forward compatibility is not guaranteed.

@cloud-fan I would propose to write the value of spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled to the Parquet/Avro metadata. Then Spark 3.0 could check the flag on read and decide whether to rebase or skip it. If the flag is not present in the metadata (files saved by Spark 2.4), just do rebasing.
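
For illustration, here is a minimal sketch of that idea at the plain Avro API level, outside Spark: stamp a marker key into the file metadata on write and check it on read. The key name follows this PR; the path, schema, and empty marker value are assumptions made up for the demo.

```scala
import java.io.File

import org.apache.avro.Schema
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericDatumReader, GenericDatumWriter, GenericRecord}

object LegacyDatetimeMetaDemo {
  def main(args: Array[String]): Unit = {
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"r","fields":
        |[{"name":"d","type":{"type":"int","logicalType":"date"}}]}""".stripMargin)
    val file = new File("/tmp/legacy-datetime-demo.avro") // illustrative path

    // Write side: set the marker only when the datetime values were written with rebasing.
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.setMeta("org.apache.spark.legacyDatetime", "") // presence, not the value, is what matters
    writer.create(schema, file)
    writer.close()

    // Read side: the presence of the key tells the reader to rebase.
    val reader = new DataFileReader[GenericRecord](file, new GenericDatumReader[GenericRecord]())
    val needRebase = reader.getMetaString("org.apache.spark.legacyDatetime") != null
    reader.close()
    println(s"need rebase on read: $needRebase")
  }
}
```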

@cloud-fan
Contributor Author

Please see #28137 (comment)

My point is: it's more of an artificial use case and I'm not sure it's worth introducing a new metadata key for it.

@@ -2522,19 +2522,6 @@ object SQLConf {
.booleanConf
.createWithDefault(false)

val LEGACY_PARQUET_REBASE_DATETIME_IN_WRITE =
buildConf("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled")
Member

One thing I am concerned about, though: I suspect it won't be compatible with other external projects if we don't rebase? We have spark.sql.parquet.writeLegacyFormat in particular for compatibility with Apache Hive and Apache Impala, for example.

Contributor Author

People would assume storage systems use the ISO standard calendar (proleptic Gregorian calendar) by default. The Parquet spec doesn't define the calendar explicitly, but it refers to the Java 8 time API, which implicitly requires the proleptic Gregorian calendar.

Spark has this issue because we used the Java 7 time API before 3.0.

Hive switched to the proleptic Gregorian calendar in 3.1, and AFAIK this caused some trouble in other systems, which kind of proves that other systems are already using the proleptic Gregorian calendar.

Impala has added a config convert_legacy_hive_parquet_utc_timestamps to work around it: https://issues.apache.org/jira/browse/IMPALA-3933

Presto also hit issues when reading Hive tables with timestamp: prestodb/presto#12180

That's why I think it's better to write datetime values without rebasing, as that's the correct data.
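
To make the calendar gap concrete, here is a tiny standalone JVM sketch (not Spark code) comparing how the hybrid Julian+Gregorian calendar (java.util, what Spark 2.4 effectively relied on) and the proleptic Gregorian calendar (java.time, what Spark 3.0 uses) place the same date label:

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

object CalendarGapDemo {
  def main(args: Array[String]): Unit = {
    val utc = TimeZone.getTimeZone("UTC")

    // Hybrid calendar (Julian rules before the 1582 cutover): "1001-01-01" at midnight UTC.
    val cal = new GregorianCalendar(utc)
    cal.clear()
    cal.set(1001, Calendar.JANUARY, 1)
    val hybridEpochDay = Math.floorDiv(cal.getTimeInMillis, 86400000L)

    // Proleptic Gregorian calendar: "1001-01-01" as java.time sees it.
    val prolepticEpochDay = LocalDate.of(1001, 1, 1).toEpochDay

    // Expected to print 6: the same label points at days 6 apart, which is the
    // shift that rebasing corrects for dates before 1582.
    println(hybridEpochDay - prolepticEpochDay)
  }
}
```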

Member

Thanks for the investigation. SGTM.

cloud-fan changed the title [SPARK-31361][SQL] Rebase datetime in parquet/avro according to file written Spark version to [SPARK-31361][SQL] Rebase datetime in parquet/avro according to file metadata Apr 15, 2020
@SparkQA

SparkQA commented Apr 15, 2020

Test build #121326 has finished for PR 28137 at commit 2e908c5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AvroDeserializer(rootAvroType: Schema, rootCatalystType: DataType, rebaseDateTime: Boolean)
  • class AvroSerializer(
  • class ParquetReadSupport(

Member

@MaxGekk left a comment

LGTM, left minor comments

val path2_4 = getResourceAvroFilePath("before_1582_date_v2_4.avro")
val path3_0 = path.getCanonicalPath
val dateStr = "1001-01-01"
Seq(dateStr).toDF("str").select($"str".cast("date").as("date"))
Member

If you test these things together, I would also write with Spark 3.0 with rebase on.

Row(java.sql.Timestamp.valueOf(tsStr))))
}

withTempPath { path =>
Member

It seems possible to write a generic function which takes these as parameters: before_1582_ts_millis_v2_4.avro, timestamp-millis, 1001-01-01 01:02:03.124.
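
A hypothetical shape for such a helper, sketched here as a suggestion only (not the test code that was eventually committed); it assumes it lives inside the existing AvroSuite, so the suite's spark session, getResourceAvroFilePath, checkAnswer, and Row are already in scope:

```scala
// Sketch of one helper covering the per-file checks; parameter names mirror the
// suggestion above, and the logical-type argument is only used for labeling here.
private def checkAvroFileWrittenBy2_4(
    fileName2_4: String,          // e.g. "before_1582_ts_millis_v2_4.avro"
    logicalType: String,          // e.g. "timestamp-millis"
    expectedTs: String): Unit = { // e.g. "1001-01-01 01:02:03.124"
  withClue(s"logical type: $logicalType") {
    val df = spark.read.format("avro").load(getResourceAvroFilePath(fileName2_4))
    // The 2.4 file carries no Spark 3.0 metadata, so the values should come back rebased.
    checkAnswer(df, Row(java.sql.Timestamp.valueOf(expectedTs)))
  }
}
```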

@@ -31,16 +31,20 @@ import org.apache.spark.sql.types.StructType
* @param parquetSchema Parquet schema of the records to be read
* @param catalystSchema Catalyst schema of the rows to be constructed
* @param schemaConverter A Parquet-Catalyst schema converter that helps initializing row converters
* @param convertTz the optional time zone to convert to for int96 data
Member

"to for": could some of them be removed?

* Parquet/Avro file metadata key to indicate that the file was written with legacy datetime
* values.
*/
private[sql] val SPARK_LEGACY_DATETIME = "org.apache.spark.legacyDatetime"
Member

Not important, but we use DateTime in other places. How about org.apache.spark.legacyDateTime?

}
}

withTempPath { path =>
Member

The same comment as for Avro. I think we could fold the tests

Member

@HyukjinKwon left a comment

Okay, looks good.

@SparkQA

SparkQA commented Apr 21, 2020

Test build #121576 has finished for PR 28137 at commit 6905247.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master and branch-3.0.

HyukjinKwon pushed a commit that referenced this pull request Apr 21, 2020
…metadata


Closes #28137 from cloud-fan/datetime.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
(cherry picked from commit a5ebbac)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Oct 16, 2020
…gacyDateTime' in Avro/Parquet files

### What changes were proposed in this pull request?
Added a couple tests to `AvroSuite` and to `ParquetIOSuite` to check that the metadata key 'org.apache.spark.legacyDateTime' is written correctly depending on the SQL configs:
- spark.sql.legacy.avro.datetimeRebaseModeInWrite
- spark.sql.legacy.parquet.datetimeRebaseModeInWrite

This is a follow-up of #28137.

### Why are the changes needed?
1. To improve test coverage
2. To make sure that the metadata key is actually saved to Avro/Parquet files

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the added tests:
```
$ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV1Suite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV2Suite"
```

Closes #30061 from MaxGekk/parquet-test-metakey.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
HyukjinKwon pushed a commit that referenced this pull request Oct 16, 2020
…gacyDateTime' in Avro/Parquet files

(cherry picked from commit 38c05af)
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
…gacyDateTime' in Avro/Parquet files