[VL] Fallback Parquet scan if legacy timezone is found in file metadata#11117

Merged
zhztheplayer merged 7 commits into apache:main from zhztheplayer:wip-fix-legacy-timezone
Nov 26, 2025

Conversation

@zhztheplayer
Member

This is to fix result mismatch issues when Gluten reads Parquet files written by old Spark versions (< 3.0.0) or with the legacy datetime rebase mode (spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY / spark.sql.parquet.int96RebaseModeInRead=LEGACY).

The test case "Column DEFAULT value support with Delta Lake, positive tests", which is being added in #11107, covers this fix.

Before:

"000[2-12-30]" did not equal "000[1-01-01]"
ScalaTestFailureLocation: org.apache.spark.sql.delta.DeltaColumnDefaultsInsertSuite at (DeltaInsertIntoTableSuite.scala:829)
Expected :"000[1-01-01]"
Actual   :"000[2-12-30]"

After:

15:22:23.853 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Scan parquet spark_catalog.default.t4[QueryId=43], due to:
        - Detected unsupported metadata in parquet files: Legacy timezone found.
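The fallback decision can be illustrated with a minimal sketch. Spark marks files written with legacy rebasing via well-known keys in the Parquet footer's key-value metadata (org.apache.spark.legacyDateTime and org.apache.spark.legacyINT96, defined in Spark's DataSourceUtils). The class and method names below are hypothetical, not Gluten's actual code:

```java
import java.util.Map;

public class LegacyTimezoneCheck {
    // Footer keys that Spark writes when a file was produced with legacy
    // datetime/INT96 rebasing (see Spark's DataSourceUtils).
    static final String LEGACY_DATETIME = "org.apache.spark.legacyDateTime";
    static final String LEGACY_INT96 = "org.apache.spark.legacyINT96";

    // Returns true if the native scan should fall back to vanilla Spark.
    static boolean shouldFallback(Map<String, String> footerKeyValueMetadata) {
        return footerKeyValueMetadata.containsKey(LEGACY_DATETIME)
            || footerKeyValueMetadata.containsKey(LEGACY_INT96);
    }

    public static void main(String[] args) {
        // A footer written with legacy rebase mode carries the marker.
        System.out.println(shouldFallback(Map.of(LEGACY_DATETIME, "")));      // true
        // A modern footer without the marker needs no fallback.
        System.out.println(shouldFallback(Map.of("writer", "spark-3.5.0")));  // false
    }
}
```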

@github-actions github-actions Bot added the CORE (works for Gluten Core) and VELOX labels Nov 18, 2025
@zhztheplayer zhztheplayer marked this pull request as draft November 18, 2025 14:29
@github-actions

Run Gluten Clickhouse CI on x86

@FelixYBW
Contributor

Is it because Velox doesn't support it?

@zhztheplayer
Member Author

Is it because Velox doesn't support it?

Yes, Velox doesn't support the settings.

@zhztheplayer zhztheplayer force-pushed the wip-fix-legacy-timezone branch from 8f7e6b3 to aa03651 on November 21, 2025 10:32
@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions github-actions Bot added the DOCS label Nov 21, 2025
@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86


@zhztheplayer zhztheplayer force-pushed the wip-fix-legacy-timezone branch from 0bd2fc2 to 13ec449 on November 24, 2025 15:44
@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer zhztheplayer marked this pull request as ready for review November 25, 2025 10:57
@zhztheplayer
Member Author

cc @YannByron @FelixYBW

@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer zhztheplayer merged commit d5da832 into apache:main Nov 26, 2025
60 checks passed
@Yohahaha
Contributor

We should add a switch to control this and respect fileLimit; this check is very heavy.

Did you test it in a TPC benchmark with remote object storage? In my case, TPC-DS Q1 slowed down by 10x.

-    if (SparkShimLoader.getSparkShims.isParquetFileEncrypted(fileStatus, conf)) {
-      return true
+    if (
+      isEncryptionValidationEnabled && SparkShimLoader.getSparkShims.isParquetFileEncrypted(
Contributor


isEncryptionValidationEnabled is false by default, but this change still fetches all the files even when isEncryptionValidationEnabled is false, and the Hadoop listStatus call is heavy.
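The concern above can be sketched as a guard-ordering issue: evaluate the cheap config flag before the expensive listing, so the listing (e.g. Hadoop's listStatus) never runs when validation is off. The names below are hypothetical stand-ins, not the actual Gluten code:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class CheapCheckFirst {
    // listFiles stands in for the expensive file listing; it is only invoked
    // when the validation flag is enabled.
    static boolean anyEncrypted(boolean validationEnabled,
                                Supplier<List<String>> listFiles,
                                Predicate<String> isEncrypted) {
        if (!validationEnabled) {
            return false;  // short-circuit: listFiles.get() is never called
        }
        return listFiles.get().stream().anyMatch(isEncrypted);
    }
}
```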

Contributor

@jinchengchenghh Nov 28, 2025


Could we read the metadata of only one Parquet file to detect this? If you would like to do the optimization, could you please wait for this PR https://github.com/apache/incubator-gluten/pull/11225/files#diff-169ab07f3b1741a5742eecd22865d93fb00b32028ea4e6dfbeaae2be056a1103R147 to avoid too many conflicts?

Contributor


@zhztheplayer Is one Parquet file enough, or should we sample several files? I remember another PR added a sample-file count. We may use the same one.

Contributor


Now fileLimit is a new config defaulting to 10, but rootPaths is a Seq, so the total is 10 * rootPaths.length.
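The multiplication effect described above can be sketched as the difference between a per-root limit and a global cap. This is an illustrative sketch, not Gluten's actual sampling code:

```java
import java.util.ArrayList;
import java.util.List;

public class FooterSampling {
    // Per-root sampling: up to fileLimit files from each root path, so the
    // total grows with the number of roots (fileLimit * filesPerRoot.size()).
    static <T> List<T> samplePerRoot(List<List<T>> filesPerRoot, int fileLimit) {
        List<T> sampled = new ArrayList<>();
        for (List<T> files : filesPerRoot) {
            sampled.addAll(files.subList(0, Math.min(fileLimit, files.size())));
        }
        return sampled;
    }

    // Global cap: at most fileLimit files in total, regardless of root count.
    static <T> List<T> sampleGlobal(List<List<T>> filesPerRoot, int fileLimit) {
        List<T> sampled = new ArrayList<>();
        for (List<T> files : filesPerRoot) {
            for (T f : files) {
                if (sampled.size() >= fileLimit) {
                    return sampled;
                }
                sampled.add(f);
            }
        }
        return sampled;
    }
}
```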

Member Author


The PR still respects the previous sample number. There should be another bug. I'll get back to this.

Member Author


Could we read the metadata of only one Parquet file to detect this? If you would like to do the optimization, could you please wait for this PR https://github.com/apache/incubator-gluten/pull/11225/files#diff-169ab07f3b1741a5742eecd22865d93fb00b32028ea4e6dfbeaae2be056a1103R147 to avoid too many conflicts?

@jinchengchenghh Let's first address the performance regression, as it affects all mainstream users. I opened PR #11233.

zhztheplayer added a commit to zhztheplayer/gluten that referenced this pull request Dec 1, 2025
…e regression

A performance regression was [found](apache#11117 (comment)) in PR apache#11117. We should disable the validation by default temporarily.
@zhztheplayer
Member Author

Did you test it in a TPC benchmark with remote object storage? In my case, TPC-DS Q1 slowed down by 10x.

No. I missed the fact that the encryption validation was disabled by default. I think PR #11233 should temporarily fix the problem.

If Q1 is slowed down that much by footer reading, then given that Spark also needs to read the footers to infer the Parquet schema, it might be beneficial to follow Spark's practice and parallelize the footer reading in the future:

https://github.com/apache/spark/blob/54cde812c2657717651156636041a84556619a5b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L539-L561
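The parallelization idea is similar in spirit to Spark's readParquetFootersInParallel. A minimal sketch, where readFooter is a stand-in for the real per-file footer read (a network round trip on object storage):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFooters {
    // Reads footers on a fixed-size pool; invokeAll runs the tasks
    // concurrently and the futures preserve input order.
    static List<String> readFootersInParallel(List<String> paths,
                                              int parallelism,
                                              Function<String, String> readFooter)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String p : paths) {
                tasks.add(() -> readFooter.apply(p));
            }
            List<String> footers = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                footers.add(f.get());
            }
            return footers;
        } finally {
            pool.shutdown();
        }
    }
}
```

With remote object storage the footer reads are latency-bound, so overlapping them this way can cut the wall-clock cost roughly by the pool size.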

FelixYBW pushed a commit that referenced this pull request Dec 17, 2025