[VL] Fallback Parquet scan if legacy timezone is found in file metadata #11117
zhztheplayer merged 7 commits into apache:main
Conversation
Run Gluten Clickhouse CI on x86
Is it because Velox doesn't support it?
Yes, Velox doesn't support the settings. |
We should add a switch to control it and respect the setting. Did you test this in a TPC benchmark with remote object storage? In my case, TPC-DS Q1 slowed down by 10x.
```diff
-    if (SparkShimLoader.getSparkShims.isParquetFileEncrypted(fileStatus, conf)) {
-      return true
+    if (
+      isEncryptionValidationEnabled && SparkShimLoader.getSparkShims.isParquetFileEncrypted(
```
Even when isEncryptionValidationEnabled is false, this change still fetches all the files, and the Hadoop listStatus call is heavy.
Could we read the metadata from only one Parquet file for detection? If you would like to do the optimization, could you please wait for this PR https://github.com/apache/incubator-gluten/pull/11225/files#diff-169ab07f3b1741a5742eecd22865d93fb00b32028ea4e6dfbeaae2be056a1103R147 to avoid too many conflicts?
@zhztheplayer is one parquet file enough, or should we sample several files? I remember another PR added a sampled-file-count config. We may use the same one.
Now fileLimit is a new config that defaults to 10, but rootPaths is a Seq, so the total number of sampled files is 10 * rootPaths.length.
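A minimal sketch of the sampling arithmetic described above (the names fileLimit and rootPaths follow the discussion; the helper itself is hypothetical, not the PR's code):

```scala
// Sketch: with a per-root fileLimit (default 10), sampling is applied to
// each root path independently, so the number of footers actually read is
// bounded by fileLimit * rootPaths.length, not by fileLimit alone.
object SampleCount {
  def totalSampledFiles(fileLimit: Int, rootPaths: Seq[String]): Int =
    fileLimit * rootPaths.length

  def main(args: Array[String]): Unit = {
    val roots = Seq("s3://bucket/t1", "s3://bucket/t2", "s3://bucket/t3")
    println(totalSampledFiles(10, roots)) // prints 30
  }
}
```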
The PR still respects the previous sample number. There should be another bug. I'll get back to this.
@jinchengchenghh Let's first address the performance regressions as it affects all main stream code users. I opened a PR #11233.
…e regression A performance regression was [found](apache#11117 (comment)) in PR apache#11117. We should disable the validation by default temporarily.
No, I missed the part that the encryption validation was disabled by default. I think PR #11233 should temporarily fix the problem. If Q1 is slowed down that much by footer reading, and given that Spark also needs to read the footers to infer the Parquet schema, it might be beneficial to follow Spark's practice and parallelize the footer reading in the future.
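As a rough illustration of the parallelization idea (not the PR's code; readFooter is a hypothetical stand-in for a real footer reader such as Parquet's ParquetFileReader.readFooter):

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Sketch: fan footer reads out over a fixed-size thread pool so that many
// small metadata reads against remote object storage can overlap, instead
// of being issued one by one from the driver.
object ParallelFooters {
  def readFootersInParallel[T](paths: Seq[String], parallelism: Int)(
      readFooter: String => T): Seq[T] = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      val futures = paths.map(p => Future(readFooter(p)))
      Await.result(Future.sequence(futures), 10.minutes)
    } finally {
      pool.shutdown()
    }
  }
}
```

Results come back in the input order because Future.sequence preserves ordering, so callers can zip them back to their paths.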
This is to fix result mismatch issues when Gluten reads Parquet files written by old Spark versions (< 3.0.0) or with the legacy datetime rebase mode (spark.sql.parquet.datetimeRebaseModeInWrite=LEGACY / spark.sql.parquet.int96RebaseModeInRead=LEGACY).

Test case "Column DEFAULT value support with Delta Lake, positive tests", which is being added in #11107, will cover this fix.

Before:

After:
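As a sketch of how such a legacy file can be produced (assuming a live SparkSession named spark; the configs are standard Spark SQL settings, while the path and column name are made up for illustration):

```scala
// Sketch only: requires a running SparkSession `spark`.
import spark.implicits._

// Standard Spark SQL config referenced in the description above.
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")

// Ancient dates differ between the legacy hybrid Julian/Gregorian calendar
// and the proleptic Gregorian calendar, so the written file carries the
// legacy-rebase metadata in its footer that triggers this fallback.
Seq(java.sql.Date.valueOf("1000-01-01"))
  .toDF("d")
  .write
  .mode("overwrite")
  .parquet("/tmp/legacy_dates")
```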