[VL] Fallback Parquet scan if legacy timezone is found in file metadata#11117

Merged
zhztheplayer merged 7 commits into apache:main from zhztheplayer:wip-fix-legacy-timezone
Nov 26, 2025

Conversation

@zhztheplayer
Member

This is to fix result mismatch issues when Gluten reads Parquet files written by old Spark versions (< 3.0.0) or with the legacy datetime rebase mode (spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY / spark.sql.parquet.int96RebaseModeInRead=LEGACY).

The test case "Column DEFAULT value support with Delta Lake, positive tests", which is being added in #11107, covers this fix.

Before:

"000[2-12-30]" did not equal "000[1-01-01]"
ScalaTestFailureLocation: org.apache.spark.sql.delta.DeltaColumnDefaultsInsertSuite at (DeltaInsertIntoTableSuite.scala:829)
Expected :"000[1-01-01]"
Actual   :"000[2-12-30]"

After:

15:22:23.853 WARN org.apache.spark.sql.execution.GlutenFallbackReporter: Validation failed for plan: Scan parquet spark_catalog.default.t4[QueryId=43], due to:
        - Detected unsupported metadata in parquet files: Legacy timezone found.
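The fallback decision can be illustrated with a minimal sketch. Spark marks files written with legacy rebasing via well-known keys in the Parquet footer's key-value metadata (org.apache.spark.legacyDateTime and org.apache.spark.legacyINT96, defined in Spark's DataSourceUtils). The class and method names below are hypothetical, not Gluten's actual code:

```java
import java.util.Map;

public class LegacyTimezoneCheck {
    // Footer keys that Spark writes when a file was produced with legacy
    // datetime/INT96 rebasing (see Spark's DataSourceUtils).
    static final String LEGACY_DATETIME = "org.apache.spark.legacyDateTime";
    static final String LEGACY_INT96 = "org.apache.spark.legacyINT96";

    // Returns true if the native scan should fall back to vanilla Spark.
    static boolean shouldFallback(Map<String, String> footerKeyValueMetadata) {
        return footerKeyValueMetadata.containsKey(LEGACY_DATETIME)
            || footerKeyValueMetadata.containsKey(LEGACY_INT96);
    }

    public static void main(String[] args) {
        // A footer written with legacy rebase mode carries the marker.
        System.out.println(shouldFallback(Map.of(LEGACY_DATETIME, "")));      // true
        // A modern footer without the marker needs no fallback.
        System.out.println(shouldFallback(Map.of("writer", "spark-3.5.0")));  // false
    }
}
```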

@github-actions github-actions Bot added the CORE (works for Gluten Core) and VELOX labels Nov 18, 2025
@zhztheplayer zhztheplayer marked this pull request as draft November 18, 2025 14:29
@github-actions

Run Gluten Clickhouse CI on x86

@FelixYBW
Contributor

Is it because Velox doesn't support it?

@zhztheplayer
Member Author

Is it because Velox doesn't support it?

Yes, Velox doesn't support the settings.

@zhztheplayer zhztheplayer force-pushed the wip-fix-legacy-timezone branch from 8f7e6b3 to aa03651 on November 21, 2025 10:32
@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86

@github-actions github-actions Bot added the DOCS label Nov 21, 2025
@github-actions

Run Gluten Clickhouse CI on x86

@github-actions

Run Gluten Clickhouse CI on x86


@zhztheplayer zhztheplayer force-pushed the wip-fix-legacy-timezone branch from 0bd2fc2 to 13ec449 on November 24, 2025 15:44
@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer zhztheplayer marked this pull request as ready for review November 25, 2025 10:57
@zhztheplayer
Member Author

cc @YannByron @FelixYBW

@github-actions

Run Gluten Clickhouse CI on x86

@zhztheplayer zhztheplayer merged commit d5da832 into apache:main Nov 26, 2025
60 checks passed
@Yohahaha
Contributor

We should add a switch to control this and respect fileLimit; this check is very heavy.

Did you test it in a TPC benchmark with remote object storage? In my case, TPC-DS Q1 slowed down by 10x.

-    if (SparkShimLoader.getSparkShims.isParquetFileEncrypted(fileStatus, conf)) {
-      return true
+    if (
+      isEncryptionValidationEnabled && SparkShimLoader.getSparkShims.isParquetFileEncrypted(
Contributor


isEncryptionValidationEnabled is false by default, but this change still fetches all the files even when isEncryptionValidationEnabled is false, and the Hadoop listStatus call is heavy.
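The concern above can be sketched as a guard-ordering issue: evaluate the cheap config flag before the expensive listing, so the listing (e.g. Hadoop's listStatus) never runs when validation is off. The names below are hypothetical stand-ins, not the actual Gluten code:

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class CheapCheckFirst {
    // listFiles stands in for the expensive file listing; it is only invoked
    // when the validation flag is enabled.
    static boolean anyEncrypted(boolean validationEnabled,
                                Supplier<List<String>> listFiles,
                                Predicate<String> isEncrypted) {
        if (!validationEnabled) {
            return false;  // short-circuit: listFiles.get() is never called
        }
        return listFiles.get().stream().anyMatch(isEncrypted);
    }
}
```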

Contributor

@jinchengchenghh Nov 28, 2025


Could we read the metadata of only one Parquet file to detect this? If you would like to do the optimization, could you please wait for this PR https://github.com/apache/incubator-gluten/pull/11225/files#diff-169ab07f3b1741a5742eecd22865d93fb00b32028ea4e6dfbeaae2be056a1103R147 to avoid too many conflicts?

Contributor


@zhztheplayer Is one Parquet file enough, or should we sample several files? I remember another PR added a sample-file count. We may use the same one.

Contributor


Now fileLimit is a new config defaulting to 10, but rootPaths is a Seq, so the total is 10 * rootPaths.length.
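The multiplication effect described above can be sketched as the difference between a per-root limit and a global cap. This is an illustrative sketch, not Gluten's actual sampling code:

```java
import java.util.ArrayList;
import java.util.List;

public class FooterSampling {
    // Per-root sampling: up to fileLimit files from each root path, so the
    // total grows with the number of roots (fileLimit * filesPerRoot.size()).
    static <T> List<T> samplePerRoot(List<List<T>> filesPerRoot, int fileLimit) {
        List<T> sampled = new ArrayList<>();
        for (List<T> files : filesPerRoot) {
            sampled.addAll(files.subList(0, Math.min(fileLimit, files.size())));
        }
        return sampled;
    }

    // Global cap: at most fileLimit files in total, regardless of root count.
    static <T> List<T> sampleGlobal(List<List<T>> filesPerRoot, int fileLimit) {
        List<T> sampled = new ArrayList<>();
        for (List<T> files : filesPerRoot) {
            for (T f : files) {
                if (sampled.size() >= fileLimit) {
                    return sampled;
                }
                sampled.add(f);
            }
        }
        return sampled;
    }
}
```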

Member Author


The PR still respects the previous sample number. There should be another bug. I'll get back to this.

Member Author


Could we read the metadata of only one Parquet file to detect this? If you would like to do the optimization, could you please wait for this PR https://github.com/apache/incubator-gluten/pull/11225/files#diff-169ab07f3b1741a5742eecd22865d93fb00b32028ea4e6dfbeaae2be056a1103R147 to avoid too many conflicts?

@jinchengchenghh Let's first address the performance regression, as it affects all mainstream users. I opened PR #11233.

zhztheplayer added a commit to zhztheplayer/gluten that referenced this pull request Dec 1, 2025
…e regression

A performance regression was [found](apache#11117 (comment)) in PR apache#11117. We should disable the validation by default temporarily.
@zhztheplayer
Member Author

Did you test it in a TPC benchmark with remote object storage? In my case, TPC-DS Q1 slowed down by 10x.

No. I missed the fact that the encryption validation was disabled by default. I think PR #11233 should temporarily fix the problem.

If Q1 is slowed down that much by footer reading, then given that Spark also needs to read the footers to infer the Parquet schema, it might be beneficial to follow Spark's practice and parallelize the footer reading in the future:

https://github.com/apache/spark/blob/54cde812c2657717651156636041a84556619a5b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L539-L561
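The parallelization idea is similar in spirit to Spark's readParquetFootersInParallel. A minimal sketch, where readFooter is a stand-in for the real per-file footer read (a network round trip on object storage):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFooters {
    // Reads footers on a fixed-size pool; invokeAll runs the tasks
    // concurrently and the futures preserve input order.
    static List<String> readFootersInParallel(List<String> paths,
                                              int parallelism,
                                              Function<String, String> readFooter)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String p : paths) {
                tasks.add(() -> readFooter.apply(p));
            }
            List<String> footers = new ArrayList<>();
            for (Future<String> f : pool.invokeAll(tasks)) {
                footers.add(f.get());
            }
            return footers;
        } finally {
            pool.shutdown();
        }
    }
}
```

With remote object storage the footer reads are latency-bound, so overlapping them this way can cut the wall-clock cost roughly by the pool size.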

FelixYBW pushed a commit that referenced this pull request Dec 17, 2025