
[SPARK-56872][SQL][3.5] Fix NPE in DowncastLongUpdater.decodeSingleDictionaryId #55898

Closed
LuciferYang wants to merge 2 commits into apache:branch-3.5 from LuciferYang:SPARK-56872-branch-3.5

Conversation

@LuciferYang
Contributor

@LuciferYang LuciferYang commented May 15, 2026

### What changes were proposed in this pull request?

`DowncastLongUpdater.decodeSingleDictionaryId` calls `values.putLong(...)`, but `DowncastLongUpdater` is only chosen when the target is a 32-bit Decimal (precision <= 9), whose column vector stores into `intData`, not `longData`. So `putLong` NPEs whenever this path runs.

Switch the call to `putInt` with the same `(int) longValue` narrowing cast already used by `readValue` and `readValues`.
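The failure mode can be reproduced outside Spark with a minimal mock (hypothetical class and field names; the real vector is `OnHeapColumnVector`): a 32-bit decimal vector allocates only its int storage, so the old `putLong` call dereferences a null long array, while the fixed `putInt` path works.

```java
// Hypothetical mock of an int-backed column vector, mirroring how a
// DECIMAL(precision <= 9) column is stored: intData is allocated,
// longData never is.
class IntBackedVector {
    int[] intData = new int[16]; // allocated for 32-bit decimals
    long[] longData = null;      // not allocated on this path

    void putInt(int rowId, int v) { intData[rowId] = v; }
    void putLong(int rowId, long v) { longData[rowId] = v; } // NPE: longData is null
}

public class DowncastDemo {
    public static void main(String[] args) {
        IntBackedVector vec = new IntBackedVector();
        long longValue = 123456789L; // an INT64-stored DECIMAL(9, 2) unscaled value

        vec.putInt(0, (int) longValue); // the fixed call: narrow, then store as int
        System.out.println(vec.intData[0]);

        try {
            vec.putLong(0, longValue);  // the buggy call
        } catch (NullPointerException e) {
            System.out.println("NullPointerException");
        }
    }
}
```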

### Why are the changes needed?

The bug has been latent since SPARK-35640 (Jun 2021) because the path is only reachable when all three conditions hold:

  1. Parquet stores the column as INT64 + DECIMAL(p<=9). Spark's own writer emits INT32 for this case, so the file must come from another writer (Hive, Impala, ...).
  2. Spark reads it as a Decimal with precision <= 9.
  3. The vectorized reader has to eagerly drain buffered dictionary IDs — typically when parquet-mr writes the column as a mix of dictionary-encoded and PLAIN pages and a non-dict page follows a dict page in the same batch. The normal lazy-dictionary path decodes at row read time via `ParquetDictionary` and never touches this updater method.
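The narrowing cast mentioned above is safe on this path: a DECIMAL(9, s) column's unscaled values are bounded by ±999,999,999, which fits in 32 bits, so `(int) longValue` loses nothing even though the file stores the values as INT64. A quick check:

```java
public class NarrowingCheck {
    public static void main(String[] args) {
        // Bounds of the unscaled value of any DECIMAL(9, s) column.
        long max = 999999999L;
        long min = -999999999L;

        // Both bounds fit in an int, so the (int) narrowing cast round-trips.
        System.out.println(max <= Integer.MAX_VALUE && min >= Integer.MIN_VALUE);
        System.out.println((long) (int) max == max && (long) (int) min == min);
    }
}
```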

Without the fix, the new regression test fails with:

```
Cause: java.lang.NullPointerException:
  at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:393)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdaterFactory$DowncastLongUpdater.decodeSingleDictionaryId(ParquetVectorUpdaterFactory.java:713)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetVectorUpdater.decodeDictionaryIds(ParquetVectorUpdater.java:75)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:288)
  at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:406)
  ...
```

### Does this PR introduce _any_ user-facing change?

Yes. Reads that previously NPE'd now return correct values.

### How was this patch tested?

New `ParquetIOSuite` test that writes an INT64 + DECIMAL(9, 2) column via parquet-mr's low-level writer with mixed-cardinality data (4-value pool + unique-per-row) to force the dictionary -> PLAIN fallback. Without the fix it reproduces the NPE above; with the fix it passes. Full `ParquetIOSuite` is green locally.
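The data shape that triggers the fallback can be sketched as follows (hypothetical helper, not the actual suite code): a small repeating pool makes parquet-mr dictionary-encode the first pages, and the unique-per-row tail grows the dictionary past its limit so later pages fall back to PLAIN, leaving both page kinds in one column chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class MixedCardinalityData {
    // Hypothetical generator for the unscaled DECIMAL(9, 2) values the test
    // writes: a 4-value pool followed by unique-per-row values.
    static List<Long> generate(int rows) {
        long[] pool = {100L, 2050L, -999999999L, 999999999L};
        List<Long> values = new ArrayList<>();
        for (int i = 0; i < rows; i++) {
            values.add(pool[i % pool.length]); // dictionary-friendly prefix
        }
        for (int i = 0; i < rows; i++) {
            values.add(1000000L + i);          // unique tail -> PLAIN fallback
        }
        return values;
    }

    public static void main(String[] args) {
        List<Long> data = generate(1000);
        System.out.println(data.size());
        System.out.println(data.get(0));
    }
}
```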

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#55890 from LuciferYang/SPARK-downcast-long-dict-fix.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
(cherry picked from commit 9f17e18)
(cherry picked from commit bc51958)
@LuciferYang
Contributor Author

cc @peter-toth

@peter-toth
Contributor

```
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala:1609:18: Invalid literal number
[error]       case 0 => -999_999_999L
```

Looks like a Scala 2.12 issue?

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM.

…quetIOSuite

Remove numeric underscore separators (Scala 2.13+ syntax) that cause compilation failure on Scala 2.12.
@LuciferYang
Contributor Author

fixed

peter-toth pushed a commit that referenced this pull request May 16, 2026

Closes #55898 from LuciferYang/SPARK-56872-branch-3.5.

Authored-by: YangJie <yangjie01@baidu.com>
Signed-off-by: Peter Toth <peter.toth@gmail.com>
@peter-toth
Contributor

Thanks @LuciferYang , merged to branch-3.5 (3.5.9).

@peter-toth peter-toth closed this May 16, 2026
