[SPARK-45604] Prevent SEGFAULT on OffHeapColumnVector by providing explicit memory boundary check #43452

majdyz · 2023-10-19T08:31:23Z

What changes were proposed in this pull request?

This is a follow-up to #43451.
The scope of this PR:

Add an explicit boundary check to OffHeapColumnVector to avoid segmentation faults.
Fix the semantics of how the isAllNull and numNulls fields are used on ColumnVector.
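
The first item can be illustrated with a minimal sketch. This is not the code from this PR: `CheckedIntColumn`, `checkIndex`, and the use of a direct `ByteBuffer` as the off-heap storage are all assumptions made for illustration. A direct `ByteBuffer` already performs its own range checks, so the sketch only shows the pattern of validating the row index against the logical capacity before any memory is touched. Code that reads raw off-heap addresses has no such safety net, which is where an explicit check matters.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for an off-heap int column; NOT Spark's OffHeapColumnVector.
final class CheckedIntColumn {
    private final ByteBuffer data;  // direct (off-heap) storage
    private final int capacity;     // number of int slots the caller may touch

    CheckedIntColumn(int capacity) {
        if (capacity < 0) {
            throw new IllegalArgumentException("capacity must be non-negative: " + capacity);
        }
        this.capacity = capacity;
        // multiplyExact guards against overflow when computing the byte size.
        this.data = ByteBuffer.allocateDirect(Math.multiplyExact(capacity, Integer.BYTES))
            .order(ByteOrder.nativeOrder());
    }

    // Explicit boundary check: fail with a Java exception before any memory access.
    private void checkIndex(int rowId) {
        if (rowId < 0 || rowId >= capacity) {
            throw new IndexOutOfBoundsException(
                "rowId " + rowId + " is outside [0, " + capacity + ")");
        }
    }

    void putInt(int rowId, int value) {
        checkIndex(rowId);
        data.putInt(rowId * Integer.BYTES, value);
    }

    int getInt(int rowId) {
        checkIndex(rowId);
        return data.getInt(rowId * Integer.BYTES);
    }
}
```

With a check like this, an out-of-range `rowId` surfaces as an `IndexOutOfBoundsException` that the caller can catch and report, rather than as a crash of the whole process.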

Why are the changes needed?

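To prevent executors from dying with a segmentation fault. An out-of-bounds access on off-heap memory can crash the executor JVM instead of raising a Java exception.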

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.


…ersion on Parquet Vectorized Reader

### What changes were proposed in this pull request?

Currently, the read logical type is not checked while converting physical types INT64 into DateTime. One valid scenario where this can break is where the physical type is `timestamp_ntz`, and the logical type is `array<timestamp_ntz>`. Since the logical type check does not happen, this conversion is allowed. However, the vectorized reader does not support this and will produce an NPE in on-heap memory mode and a SEGFAULT in off-heap memory mode. The segmentation fault in off-heap memory mode can be prevented by having an explicit boundary check on OffHeapColumnVector, but this is outside the scope of this PR, and will be done here: #43452.

### Why are the changes needed?

Prevent NPE or Segfault from happening.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test is added in `ParquetSchemaSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43451 from majdyz/SPARK-45604.

Lead-authored-by: Zamil Majdy <zamil.majdy@databricks.com>
Co-authored-by: Zamil Majdy <zamil.majdy@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

…ersion on Parquet Vectorized Reader (commit message as above) (cherry picked from commit 13b67ee) Signed-off-by: Max Gekk <max.gekk@gmail.com>

github-actions · 2024-01-28T00:20:02Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!


[SPARK-45604] Prevent SEGFAULT on OffHeapColumnVector by providing ex…

308abc4

…plicit memory boundary check

github-actions bot added the SQL label Oct 19, 2023

majdyz mentioned this pull request Oct 19, 2023

[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader #43451

Closed

majdyz added 2 commits October 19, 2023 10:36

Merge branch 'master' into SPARK-45604-colvec

c7333b5

Add private

228a476

majdyz changed the title ~~[WIP] [SPARK-45604] Prevent SEGFAULT on OffHeapColumnVector by providing explicit memory boundary check~~ [SPARK-45604] Prevent SEGFAULT on OffHeapColumnVector by providing explicit memory boundary check Oct 19, 2023

majdyz added 4 commits October 19, 2023 13:18

Add test

231fa51

Change test name

055723f

Change test name

6b554ca

Lint fix

a0258c3

github-actions bot added the Stale label Jan 28, 2024

github-actions bot closed this Jan 29, 2024
