[SPARK-34863][SQL][FOLLOW-UP] Handle IsAllNull in OffHeapColumnVector #36366

Closed

sadikovi wants to merge 2 commits into apache:master from sadikovi:fix-off-heap-cv
Conversation

sadikovi (Contributor) commented Apr 26, 2022

What changes were proposed in this pull request?

This PR fixes an issue of reading null columns with the vectorised Parquet reader when the entire column is null or does not exist. This is especially noticeable when performing a merge or schema evolution in Parquet.

The issue is only exposed with `OffHeapColumnVector`, which does not handle the `isAllNull` flag; `OnHeapColumnVector` already handles `isAllNull`, so everything works fine there.
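The shape of the fix can be sketched with a simplified, self-contained stand-in for the column vector (class and field names here are hypothetical, not Spark's actual API): every read path must consult the all-null flag before touching the backing storage, which is what the off-heap vector was missing.

```java
// Illustrative sketch only, not Spark code: an off-heap-style vector where the
// null check short-circuits on an "all null" flag before the backing storage
// (here a plain array standing in for a native buffer) is ever dereferenced.
final class SimpleOffHeapLikeVector {
    private final boolean isAllNull; // true when the column is missing or fully null
    private final long[] data;       // stands in for the native buffer; null when isAllNull

    SimpleOffHeapLikeVector(boolean isAllNull, long[] data) {
        this.isAllNull = isAllNull;
        this.data = data;
    }

    boolean isNullAt(int rowId) {
        // The missing short-circuit: without it, a caller that trusts
        // isNullAt() goes on to read memory that was never allocated.
        return isAllNull;
    }

    long getLong(int rowId) {
        if (isNullAt(rowId)) {
            throw new IllegalStateException("row " + rowId + " is null");
        }
        return data[rowId];
    }

    public static void main(String[] args) {
        SimpleOffHeapLikeVector allNull = new SimpleOffHeapLikeVector(true, null);
        System.out.println(allNull.isNullAt(0)); // true: caller must not read the value
        SimpleOffHeapLikeVector normal = new SimpleOffHeapLikeVector(false, new long[]{42L});
        System.out.println(normal.getLong(0));   // 42
    }
}
```

The point of the sketch is only the ordering: the flag check has to come first, because for an all-null or missing column there is no backing memory to fall back on.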

Why are the changes needed?

The change is needed to correctly read null columns using the vectorised reader in the off-heap mode.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I updated the existing unit tests to ensure we cover off-heap mode. I confirmed that the tests pass with the fix and fail without.

github-actions bot added the SQL label Apr 26, 2022
sadikovi (Contributor, Author) commented:

cc @sunchao @cloud-fan. I am going to update the description and add a test but would like you to take a look.

sadikovi (Contributor, Author) commented:

The implications of the bug are the following:

  • All missing columns are treated as all-null and either return wrong results or crash the JVM.
  • All fully null columns behave the same way.

Examples of errors:

```
[info]
[info]   == Results ==
[info]   !== Correct Answer - 3 ==   == Spark Answer - 3 ==
[info]   !struct<>                   struct<_1:struct<_3:int,_4:bigint>>
[info]   ![null]                     [[336920597,0]]
[info]   ![null]                     [[827474256,0]]
[info]   ![null]                     [[872417200,1447062946291269968]] (QueryTest.scala:243)
```
```
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007ff3dd03ec6e, pid=19511, tid=0x00007ff2b39db700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# J 7824 C1 org.apache.spark.sql.execution.vectorized.OffHeapColumnVector.getLong(I)J (41 bytes) @ 0x00007ff3dd03ec6e [0x00007ff3dd03eb60+0x10e]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/ivan.sadikov/spark-oss/sql/core/hs_err_pid19511.log
Compiled method (c1)    9692 7823       3       org.apache.spark.sql.vectorized.ColumnarRow::getLong (16 bytes)
 total in heap  [0x00007ff3dd03cd90,0x00007ff3dd03d2f8] = 1384
 relocation     [0x00007ff3dd03ceb8,0x00007ff3dd03cf08] = 80
 main code      [0x00007ff3dd03cf20,0x00007ff3dd03d160] = 576
 stub code      [0x00007ff3dd03d160,0x00007ff3dd03d218] = 184
 oops           [0x00007ff3dd03d218,0x00007ff3dd03d220] = 8
 metadata       [0x00007ff3dd03d220,0x00007ff3dd03d228] = 8
 scopes data    [0x00007ff3dd03d228,0x00007ff3dd03d268] = 64
 scopes pcs     [0x00007ff3dd03d268,0x00007ff3dd03d2d8] = 112
 dependencies   [0x00007ff3dd03d2d8,0x00007ff3dd03d2e0] = 8
 nul chk table  [0x00007ff3dd03d2e0,0x00007ff3dd03d2f8] = 24
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
```
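Both failure modes (wrong results and the SIGSEGV in `OffHeapColumnVector.getLong`) come from the same mistake: reading the backing buffer for a row the reader should have reported as null. A hypothetical, self-contained illustration (not Spark code; here a zero-initialised direct buffer stands in for native memory, so the buggy path returns a bogus 0 rather than crashing):

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of the failure mode: a reader that ignores the
// all-null flag and reads the backing buffer anyway returns whatever bytes
// happen to be there instead of null. ByteBuffer.allocateDirect zero-fills,
// so the buggy path yields 0 here; with a raw native address that was never
// allocated, the same mistake is a segfault.
final class AllNullReadDemo {
    static final boolean IS_ALL_NULL = true;                     // whole column is null
    static final ByteBuffer DATA = ByteBuffer.allocateDirect(8); // never populated

    // Buggy path: skips the null check entirely.
    static long buggyGetLong(int rowId) {
        return DATA.getLong(rowId * 8);
    }

    // Fixed path: consult the flag before touching the buffer.
    static Long fixedGetLong(int rowId) {
        return IS_ALL_NULL ? null : DATA.getLong(rowId * 8);
    }

    public static void main(String[] args) {
        System.out.println(buggyGetLong(0)); // 0 -- a wrong "value" for a null row
        System.out.println(fixedGetLong(0)); // null
    }
}
```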

sunchao (Member) left a comment:

Nice catch @sadikovi ! Sorry I missed this.

sunchao closed this in 0d8ace5 on Apr 27, 2022
sunchao pushed a commit that referenced this pull request Apr 27, 2022
Closes #36366 from sadikovi/fix-off-heap-cv.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
sunchao (Member) commented Apr 27, 2022

Merged to master/3.3

sadikovi (Contributor, Author) commented:

Thank you!

