
[SPARK-56872][SQL][3.5] Fix NPE in DowncastLongUpdater.decodeSingleDictionaryId #55898

Closed
LuciferYang wants to merge 2 commits into apache:branch-3.5 from LuciferYang:SPARK-56872-branch-3.5

Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented May 15, 2026

### What changes were proposed in this pull request?

`DowncastLongUpdater.decodeSingleDictionaryId` calls `values.putLong(...)`, but `DowncastLongUpdater` is only chosen when the target is a 32-bit Decimal (precision <= 9), whose column vector stores into `intData`, not `longData`. So `putLong` NPEs whenever this path runs.

Switch the call to `putInt` with the same `(int) longValue` narrowing cast already used by `readValue` and `readValues`.
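The failure mode can be reproduced outside Spark with a minimal mock (hypothetical class and field names; the real vector is `OnHeapColumnVector`): a 32-bit decimal vector allocates only its int storage, so the old `putLong` call dereferences a null long array, while the fixed `putInt` path works.

```java
// Hypothetical mock of an int-backed column vector, mirroring how a
// DECIMAL(precision <= 9) column is stored: intData is allocated,
// longData never is.
class IntBackedVector {
    int[] intData = new int[16]; // allocated for 32-bit decimals
    long[] longData = null;      // not allocated on this path

    void putInt(int rowId, int v) { intData[rowId] = v; }
    void putLong(int rowId, long v) { longData[rowId] = v; } // NPE: longData is null
}

public class DowncastDemo {
    public static void main(String[] args) {
        IntBackedVector vec = new IntBackedVector();
        long longValue = 123456789L; // an INT64-stored DECIMAL(9, 2) unscaled value

        vec.putInt(0, (int) longValue); // the fixed call: narrow, then store as int
        System.out.println(vec.intData[0]);

        try {
            vec.putLong(0, longValue);  // the buggy call
        } catch (NullPointerException e) {
            System.out.println("NullPointerException");
        }
    }
}
```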

### Why are the changes needed?

The bug has been latent since SPARK-35640 (Jun 2021) because the path is only reachable when all three conditions hold:

  1. Parquet stores the column as INT64 + DECIMAL(p<=9). Spark's own writer emits INT32 for this case, so the file must come from another writer (Hive, Impala, ...).
  2. Spark reads it as a Decimal with precision <= 9.
  3. The vectorized reader has to eagerly drain buffered dictionary IDs — typically when parquet-mr writes the column as a mix of dictionary-encoded and PLAIN pages and a non-dict page follows a dict page in the same batch. The normal lazy-dictionary path decodes at row read time via `ParquetDictionary` and never touches this updater method.
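The narrowing cast mentioned above is safe on this path: a DECIMAL(9, s) column's unscaled values are bounded by ±999,999,999, which fits in 32 bits, so `(int) longValue` loses nothing even though the file stores the values as INT64. A quick check:

```java
public class NarrowingCheck {
    public static void main(String[] args) {
        // Bounds of the unscaled value of any DECIMAL(9, s) column.
        long max = 999999999L;
        long min = -999999999L;

        // Both bounds fit in an int, so the (int) narrowing cast round-trips.
        System.out.println(max <= Integer.MAX_VALUE && min >= Integer.MIN_VALUE);
        System.out.println((long) (int) max == max && (long) (int) min == min);
    }
}
```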

Without the fix, the new regression test fails with:

```
Cause: java.lang.NullPointerException:
  at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:393)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$DowncastLongUpdater.decodeSingleDictionaryId(ParquetVectorUpdaterFactory.java:713)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdater.decodeDictionaryIds(ParquetVectorUpdater.java:75)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:288)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:406)
  ...
```

### Does this PR introduce _any_ user-facing change?

Yes. Reads that previously NPE'd now return correct values.

### How was this patch tested?

New `ParquetIOSuite` test that writes an INT64 + DECIMAL(9, 2) column via parquet-mr's low-level writer with mixed-cardinality data (4-value pool + unique-per-row) to force the dictionary -> PLAIN fallback. Without the fix it reproduces the NPE above; with the fix it passes. Full `ParquetIOSuite` is green locally.
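The data shape that triggers the fallback can be sketched as follows (hypothetical helper, not the actual suite code): a small repeating pool makes parquet-mr dictionary-encode the first pages, and the unique-per-row tail grows the dictionary past its limit so later pages fall back to PLAIN, leaving both page kinds in one column chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class MixedCardinalityData {
    // Hypothetical generator for the unscaled DECIMAL(9, 2) values the test
    // writes: a 4-value pool followed by unique-per-row values.
    static List<Long> generate(int rows) {
        long[] pool = {100L, 2050L, -999999999L, 999999999L};
        List<Long> values = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            values.add(pool[i % pool.length]); // dictionary-friendly prefix
        }
        for (int i = 0; i < rows; i++) {
            values.add(1000000L + i);          // unique tail -> PLAIN fallback
        }
        return values;
    }

    public static void main(String[] args) {
        List<Long> data = generate(1000);
        System.out.println(data.size());
        System.out.println(data.get(0));
    }
}
```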

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#55890 from LuciferYang/SPARK-downcast-long-dict-fix.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
(cherry picked from commit 9f17e18)
(cherry picked from commit bc51958)
@LuciferYang
Contributor Author

cc @peter-toth

@peter-toth
Contributor

```
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:1609:18: Invalid literal number
[error]       case 0 => -999_999_999L
```

Looks like a Scala 2.12 issue?

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

…quetIOSuite

Remove numeric underscore separators (Scala 2.13+ syntax) that cause compilation failure on Scala 2.12.
@LuciferYang
Contributor Author

fixed

peter-toth pushed a commit that referenced this pull request May 16, 2026

Closes #55898 from LuciferYang/SPARK-56872-branch-3.5.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
@peter-toth
Contributor

Thanks @LuciferYang , merged to branch-3.5 (3.5.9).

@peter-toth peter-toth closed this May 16, 2026
