[SPARK-56872][SQL][3.5] Fix NPE in DowncastLongUpdater.decodeSingleDictionaryId #55898
Closed
LuciferYang wants to merge 2 commits into
Conversation
### What changes were proposed in this pull request?

`DowncastLongUpdater.decodeSingleDictionaryId` calls `values.putLong(...)`, but `DowncastLongUpdater` is only chosen when the target is a 32-bit Decimal (precision <= 9), whose column vector stores into `intData`, not `longData`. So `putLong` NPEs whenever this path runs. Switch the call to `putInt` with the same `(int) longValue` narrowing cast already used by `readValue` and `readValues`.

### Why are the changes needed?

The bug has been latent since SPARK-35640 (Jun 2021) because the path is only reachable when all three conditions hold:

1. Parquet stores the column as INT64 + DECIMAL(p<=9). Spark's own writer emits INT32 for this case, so the file must come from another writer (Hive, Impala, ...).
2. Spark reads it as a Decimal with precision <= 9.
3. The vectorized reader has to eagerly drain buffered dictionary IDs — typically when parquet-mr writes the column as a mix of dictionary-encoded and PLAIN pages and a non-dict page follows a dict page in the same batch.

The normal lazy-dictionary path decodes at row read time via `ParquetDictionary` and never touches this updater method. Without the fix, the new regression test fails with:

```
Cause: java.lang.NullPointerException:
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:393)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$DowncastLongUpdater.decodeSingleDictionaryId(ParquetVectorUpdaterFactory.java:713)
at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdater.decodeDictionaryIds(ParquetVectorUpdater.java:75)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:288)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:406)
...
```

### Does this PR introduce _any_ user-facing change?

Yes. Reads that previously NPE'd now return correct values.
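The failure mode can be sketched with hypothetical stand-ins for Spark's classes (`IntBackedVector`, `decodeBuggy`, and `decodeFixed` are illustrative names, not the actual Spark code): a 32-bit decimal column only allocates its `int[]` backing array, so the long-typed store dereferences a null array.

```java
public class DowncastSketch {
    // Stand-in for OnHeapColumnVector: a 32-bit decimal column only
    // allocates intData, so longData stays null.
    static class IntBackedVector {
        int[] intData;
        long[] longData; // never allocated for precision <= 9

        IntBackedVector(int capacity) { intData = new int[capacity]; }

        void putLong(int rowId, long value) { longData[rowId] = value; } // NPE: longData is null
        void putInt(int rowId, int value) { intData[rowId] = value; }
    }

    // Buggy updater: stores a dictionary-decoded INT64 value with putLong.
    static void decodeBuggy(IntBackedVector v, int rowId, long dictValue) {
        v.putLong(rowId, dictValue);
    }

    // Fixed updater: same narrowing cast as readValue/readValues.
    static void decodeFixed(IntBackedVector v, int rowId, long dictValue) {
        v.putInt(rowId, (int) dictValue);
    }

    public static void main(String[] args) {
        IntBackedVector v = new IntBackedVector(4);
        try {
            decodeBuggy(v, 0, 123456789L);
            System.out.println("no NPE");
        } catch (NullPointerException e) {
            System.out.println("NPE from putLong, as in the report");
        }
        decodeFixed(v, 0, 123456789L);
        System.out.println(v.intData[0]);
    }
}
```

The one-line fix mirrors this: route the dictionary-decoded value through the int-typed store with the same narrowing cast the non-dictionary read path already uses.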
### How was this patch tested?

New `ParquetIOSuite` test that writes an INT64 + DECIMAL(9, 2) column via parquet-mr's low-level writer with mixed-cardinality data (4-value pool + unique-per-row) to force the dictionary -> PLAIN fallback. Without the fix it reproduces the NPE above; with the fix it passes. Full `ParquetIOSuite` is green locally.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#55890 from LuciferYang/SPARK-downcast-long-dict-fix.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
(cherry picked from commit 9f17e18)
(cherry picked from commit bc51958)
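A quick sanity check on why the `(int) longValue` narrowing cast in the fix is lossless for this path: a decimal with precision <= 9 has unscaled values bounded by ±999,999,999, which is well within 32-bit int range. A minimal, self-contained illustration (not Spark code):

```java
public class NarrowingSafety {
    public static void main(String[] args) {
        // Largest/smallest unscaled values a precision-9 decimal can hold,
        // e.g. DECIMAL(9, 2) stores 9999999.99 as 999999999.
        long maxUnscaled = 999999999L;
        long minUnscaled = -999999999L;

        // Both survive an int round-trip, since |999999999| < Integer.MAX_VALUE.
        System.out.println((int) maxUnscaled == maxUnscaled);
        System.out.println((int) minUnscaled == minUnscaled);
        System.out.println(999999999L < Integer.MAX_VALUE);
    }
}
```

This is why the INT32-backed column vector is the correct target for precision <= 9, and why `readValue`/`readValues` already used the same cast safely.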
Contributor
Author
cc @peter-toth
peter-toth approved these changes May 15, 2026
Contributor
Looks like a Scala 2.12 issue?
…quetIOSuite Remove numeric underscore separators (Scala 2.13+ syntax) that cause compilation failure on Scala 2.12.
Contributor
Author
fixed
peter-toth pushed a commit that referenced this pull request May 16, 2026
[SPARK-56872][SQL][3.5] Fix NPE in DowncastLongUpdater.decodeSingleDictionaryId

Closes #55898 from LuciferYang/SPARK-56872-branch-3.5.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
Contributor
Thanks @LuciferYang, merged to