Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-45604] Prevent SEGFAULT on OffHeapColumnVector by providing explicit memory boundary check #43452

Closed
wants to merge 7 commits into from

Conversation

majdyz
Copy link
Contributor

@majdyz majdyz commented Oct 19, 2023

What changes were proposed in this pull request?

This is a follow-up of #43451.
The scope of the PR:

  • Provided an explicit boundary check to avoid segmentation faults on OffHeapColumnVector.
  • Semantically fixed the usage of the isAllNull & numNulls field on ColumnVector.

Why are the changes needed?

Avoid executor dying due to segmentation fault.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@majdyz majdyz changed the title [WIP] [SPARK-45604] Prevent SEGFAULT on OffHeapColumnVector by providing explicit memory boundary check [SPARK-45604] Prevent SEGFAULT on OffHeapColumnVector by providing explicit memory boundary check Oct 19, 2023
MaxGekk pushed a commit that referenced this pull request Oct 22, 2023
…ersion on Parquet Vectorized Reader

### What changes were proposed in this pull request?

Currently, the read logical type is not checked while converting physical types INT64 into DateTime. One valid scenario where this can break is where the physical type is `timestamp_ntz`, and the logical type is `array<timestamp_ntz>`, since the logical type check does not happen, this conversion is allowed. However, the vectorized reader does not support this and will produce NPE on on-heap memory mode and SEGFAULT on off-heap memory mode. Segmentation fault on off-heap memory mode can be prevented by having an explicit boundary check on OffHeapColumnVector, but this is outside of the scope of this PR, and will be done here: #43452.

### Why are the changes needed?
Prevent NPE or Segfault from happening.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test is added in `ParquetSchemaSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43451 from majdyz/SPARK-45604.

Lead-authored-by: Zamil Majdy <zamil.majdy@databricks.com>
Co-authored-by: Zamil Majdy <zamil.majdy@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Oct 22, 2023
…ersion on Parquet Vectorized Reader

### What changes were proposed in this pull request?

Currently, the read logical type is not checked while converting physical types INT64 into DateTime. One valid scenario where this can break is where the physical type is `timestamp_ntz`, and the logical type is `array<timestamp_ntz>`, since the logical type check does not happen, this conversion is allowed. However, the vectorized reader does not support this and will produce NPE on on-heap memory mode and SEGFAULT on off-heap memory mode. Segmentation fault on off-heap memory mode can be prevented by having an explicit boundary check on OffHeapColumnVector, but this is outside of the scope of this PR, and will be done here: #43452.

### Why are the changes needed?
Prevent NPE or Segfault from happening.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test is added in `ParquetSchemaSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43451 from majdyz/SPARK-45604.

Lead-authored-by: Zamil Majdy <zamil.majdy@databricks.com>
Co-authored-by: Zamil Majdy <zamil.majdy@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 13b67ee)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Oct 22, 2023
…ersion on Parquet Vectorized Reader

### What changes were proposed in this pull request?

Currently, the read logical type is not checked while converting physical types INT64 into DateTime. One valid scenario where this can break is where the physical type is `timestamp_ntz`, and the logical type is `array<timestamp_ntz>`, since the logical type check does not happen, this conversion is allowed. However, the vectorized reader does not support this and will produce NPE on on-heap memory mode and SEGFAULT on off-heap memory mode. Segmentation fault on off-heap memory mode can be prevented by having an explicit boundary check on OffHeapColumnVector, but this is outside of the scope of this PR, and will be done here: #43452.

### Why are the changes needed?
Prevent NPE or Segfault from happening.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test is added in `ParquetSchemaSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43451 from majdyz/SPARK-45604.

Lead-authored-by: Zamil Majdy <zamil.majdy@databricks.com>
Co-authored-by: Zamil Majdy <zamil.majdy@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 13b67ee)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 28, 2024
@github-actions github-actions bot closed this Jan 29, 2024
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…ersion on Parquet Vectorized Reader

### What changes were proposed in this pull request?

Currently, the read logical type is not checked while converting physical types INT64 into DateTime. One valid scenario where this can break is where the physical type is `timestamp_ntz`, and the logical type is `array<timestamp_ntz>`, since the logical type check does not happen, this conversion is allowed. However, the vectorized reader does not support this and will produce NPE on on-heap memory mode and SEGFAULT on off-heap memory mode. Segmentation fault on off-heap memory mode can be prevented by having an explicit boundary check on OffHeapColumnVector, but this is outside of the scope of this PR, and will be done here: apache#43452.

### Why are the changes needed?
Prevent NPE or Segfault from happening.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test is added in `ParquetSchemaSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#43451 from majdyz/SPARK-45604.

Lead-authored-by: Zamil Majdy <zamil.majdy@databricks.com>
Co-authored-by: Zamil Majdy <zamil.majdy@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 13b67ee)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
1 participant