[SPARK-34863][SQL][FOLLOW-UP] Handle IsAllNull in OffHeapColumnVector #36366

Closed

sadikovi wants to merge 2 commits into apache:master from sadikovi:fix-off-heap-cv
Conversation

sadikovi (Contributor) commented Apr 26, 2022

What changes were proposed in this pull request?

This PR fixes an issue of reading null columns with the vectorised Parquet reader when the entire column is null or does not exist. This is especially noticeable when performing a merge or schema evolution in Parquet.

The issue is only exposed with `OffHeapColumnVector`, which does not handle the `isAllNull` flag; `OnHeapColumnVector` already handles `isAllNull`, so everything works fine there.
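The shape of the fix can be sketched with a simplified, self-contained stand-in for the column vector (class and field names here are hypothetical, not Spark's actual API): every read path must consult the all-null flag before touching the backing storage, which is what the off-heap vector was missing.

```java
// Illustrative sketch only, not Spark code: an off-heap-style vector where the
// null check short-circuits on an "all null" flag before the backing storage
// (here a plain array standing in for a native buffer) is ever dereferenced.
final class SimpleOffHeapLikeVector {
    private final boolean isAllNull; // true when the column is missing or fully null
    private final long[] data;       // stands in for the native buffer; null when isAllNull

    SimpleOffHeapLikeVector(boolean isAllNull, long[] data) {
        this.isAllNull = isAllNull;
        this.data = data;
    }

    boolean isNullAt(int rowId) {
        // The missing short-circuit: without it, a caller that trusts
        // isNullAt() goes on to read memory that was never allocated.
        return isAllNull;
    }

    long getLong(int rowId) {
        if (isNullAt(rowId)) {
            throw new IllegalStateException("row " + rowId + " is null");
        }
        return data[rowId];
    }

    public static void main(String[] args) {
        SimpleOffHeapLikeVector allNull = new SimpleOffHeapLikeVector(true, null);
        System.out.println(allNull.isNullAt(0)); // true: caller must not read the value
        SimpleOffHeapLikeVector normal = new SimpleOffHeapLikeVector(false, new long[]{42L});
        System.out.println(normal.getLong(0));   // 42
    }
}
```

The point of the sketch is only the ordering: the flag check has to come first, because for an all-null or missing column there is no backing memory to fall back on.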

Why are the changes needed?

The change is needed to correctly read null columns using the vectorised reader in the off-heap mode.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I updated the existing unit tests to ensure we cover off-heap mode. I confirmed that the tests pass with the fix and fail without.

github-actions bot added the SQL label Apr 26, 2022
sadikovi (Contributor, Author) commented:

cc @sunchao @cloud-fan. I am going to update the description and add a test but would like you to take a look.

sadikovi (Contributor, Author) commented:

The implications of the bug are the following:

  • All missing columns are treated as all-null and either return wrong results or crash the JVM.
  • All fully null columns behave the same way.

Examples of errors:

```
[info]
[info]   == Results ==
[info]   !== Correct Answer - 3 ==   == Spark Answer - 3 ==
[info]   !struct<>                   struct<_1:struct<_3:int,_4:bigint>>
[info]   ![null]                     [[336920597,0]]
[info]   ![null]                     [[827474256,0]]
[info]   ![null]                     [[872417200,1447062946291269968]] (QueryTest.scala:243)
```
```
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff3dd03ec6e, pid=19511, tid=0x00007ff2b39db700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 7824 C1 org.apache.spark.sql.execution.vectorized.OffHeapColumnVector.getLong(I)J (41 bytes) @ 0x00007ff3dd03ec6e [0x00007ff3dd03eb60+0x10e]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/ivan.sadikov/spark-oss/sql/core/hs_err_pid19511.log
Compiled method (c1)    9692 7823       3       org.apache.spark.sql.vectorized.ColumnarRow::getLong (16 bytes)
 total in heap  [0x00007ff3dd03cd90,0x00007ff3dd03d2f8] = 1384
 relocation     [0x00007ff3dd03ceb8,0x00007ff3dd03cf08] = 80
 main code      [0x00007ff3dd03cf20,0x00007ff3dd03d160] = 576
 stub code      [0x00007ff3dd03d160,0x00007ff3dd03d218] = 184
 oops           [0x00007ff3dd03d218,0x00007ff3dd03d220] = 8
 metadata       [0x00007ff3dd03d220,0x00007ff3dd03d228] = 8
 scopes data    [0x00007ff3dd03d228,0x00007ff3dd03d268] = 64
 scopes pcs     [0x00007ff3dd03d268,0x00007ff3dd03d2d8] = 112
 dependencies   [0x00007ff3dd03d2d8,0x00007ff3dd03d2e0] = 8
 nul chk table  [0x00007ff3dd03d2e0,0x00007ff3dd03d2f8] = 24
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
```
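Both failure modes (wrong results and the SIGSEGV in `OffHeapColumnVector.getLong`) come from the same mistake: reading the backing buffer for a row the reader should have reported as null. A hypothetical, self-contained illustration (not Spark code; here a zero-initialised direct buffer stands in for native memory, so the buggy path returns a bogus 0 rather than crashing):

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the failure mode: a reader that ignores the
// all-null flag and reads the backing buffer anyway returns whatever bytes
// happen to be there instead of null. ByteBuffer.allocateDirect zero-fills,
// so the buggy path yields 0 here; with a raw native address that was never
// allocated, the same mistake is a segfault.
final class AllNullReadDemo {
    static final boolean IS_ALL_NULL = true;                     // whole column is null
    static final ByteBuffer DATA = ByteBuffer.allocateDirect(8); // never populated

    // Buggy path: skips the null check entirely.
    static long buggyGetLong(int rowId) {
        return DATA.getLong(rowId * 8);
    }

    // Fixed path: consult the flag before touching the buffer.
    static Long fixedGetLong(int rowId) {
        return IS_ALL_NULL ? null : DATA.getLong(rowId * 8);
    }

    public static void main(String[] args) {
        System.out.println(buggyGetLong(0)); // 0 -- a wrong "value" for a null row
        System.out.println(fixedGetLong(0)); // null
    }
}
```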

sunchao (Member) left a comment:

Nice catch @sadikovi ! Sorry I missed this.

sunchao closed this in 0d8ace5 on Apr 27, 2022
sunchao pushed a commit that referenced this pull request Apr 27, 2022
Closes #36366 from sadikovi/fix-off-heap-cv.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
sunchao (Member) commented Apr 27, 2022

Merged to master/3.3

sadikovi (Contributor, Author) commented:

Thank you!

