
[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader #43451

Closed
wants to merge 4 commits

Conversation

@majdyz (Contributor) commented Oct 19, 2023

What changes were proposed in this pull request?

Currently, the read logical type is not checked when converting the physical type INT64 into DateTime. One scenario where this breaks is when the physical type is `timestamp_ntz` but the read logical type is `array<timestamp_ntz>`: because the logical type check does not happen, the conversion is allowed, yet the vectorized reader does not support it and produces an NPE in on-heap memory mode and a SEGFAULT in off-heap memory mode. The segmentation fault in off-heap memory mode can be prevented by an explicit boundary check in OffHeapColumnVector, but that is outside the scope of this PR and will be done in #43452.
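
A minimal reproduction sketch (the path, session setup, and column name are illustrative assumptions, not taken from the PR's test):

```scala
// Write a flat TIMESTAMP_NTZ column, then read it back with a mismatched
// array<timestamp_ntz> schema through the vectorized Parquet reader.
val path = "/tmp/spark-45604" // hypothetical location
spark.sql("SELECT timestamp_ntz'2023-10-19 00:00:00' AS col")
  .write.mode("overwrite").parquet(path)
// Before this fix: NPE (on-heap) or SEGFAULT (off-heap).
// After this fix: fails cleanly with SchemaColumnConvertNotSupportedException.
spark.read.schema("col array<timestamp_ntz>").parquet(path).collect()
```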

Why are the changes needed?

Prevent an NPE or segfault from happening.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A new test is added in ParquetSchemaSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

```scala
// From the new test in ParquetSchemaSuite; the enclosing intercept is
// truncated in this snippet and reconstructed here as an assumption.
val e = intercept[SparkException] {
  spark.read.schema(df2.schema).parquet(s"$path/parquet").collect()
}
assert(e.getCause.isInstanceOf[SparkException])
assert(e.getCause.getCause.isInstanceOf[SchemaColumnConvertNotSupportedException])
```
Member:

This exception should be migrated to SparkThrowable, and we should throw an exception with a proper error class. Please add a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-37935
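
A hedged sketch of what that follow-up might look like (the class name, constructor, and error class name below are assumptions; the real migration would also register the error class in Spark's error-classes JSON):

```scala
import org.apache.spark.SparkThrowable

// Illustrative only: an exception that carries a proper error class.
class SchemaColumnConvertNotSupportedSketch(column: String, from: String, to: String)
  extends RuntimeException(s"column $column: cannot convert Parquet $from to Spark $to")
  with SparkThrowable {
  // Error class name is hypothetical, for illustration only.
  override def getErrorClass: String = "PARQUET_CONVERSION_NOT_SUPPORTED"
}
```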


```diff
@@ -109,15 +109,17 @@ public ParquetVectorUpdater getUpdater(ColumnDescriptor descriptor, DataType spa
         // For unsigned int64, it stores as plain signed int64 in Parquet when dictionary
         // fallbacks. We read them as decimal values.
         return new UnsignedLongUpdater();
-      } else if (isTimestampTypeMatched(LogicalTypeAnnotation.TimeUnit.MICROS)) {
+      } else if (sparkType instanceof DatetimeType &&
```
Member:
DatetimeType also includes the DATE type. Does this `if` handle the DATE type too?

Contributor Author:

Good point. I've limited this check only to Timestamp & TimestampNtz now.
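
For illustration, a minimal Scala sketch of the narrowed condition (the actual change lives in the Java ParquetVectorUpdaterFactory; this is not the PR's verbatim code):

```scala
import org.apache.spark.sql.types.{DataType, TimestampType, TimestampNTZType}

// Only the two timestamp types may take the INT64 -> timestamp path; DATE
// (also a DatetimeType) and nested types such as array<timestamp_ntz> fall
// through to the type-mismatch error instead.
def isSparkTimestamp(sparkType: DataType): Boolean =
  sparkType == TimestampType || sparkType == TimestampNTZType
```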

@majdyz majdyz requested a review from MaxGekk October 19, 2023 16:28
@MaxGekk MaxGekk changed the title [SPARK-45604] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader [SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader Oct 20, 2023
@MaxGekk (Member) commented Oct 20, 2023

@majdyz Does Spark 3.4.x have the same issue?

@majdyz (Contributor Author) commented Oct 20, 2023

@MaxGekk yes, and I believe the previous branches too.

@MaxGekk (Member) commented Oct 22, 2023

+1, LGTM. Merging to master/3.5/3.4.
Thank you, @majdyz.

@MaxGekk MaxGekk closed this in 13b67ee Oct 22, 2023
MaxGekk pushed a commit that referenced this pull request Oct 22, 2023
[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader

### What changes were proposed in this pull request?

Currently, the read logical type is not checked when converting the physical type INT64 into DateTime. One scenario where this breaks is when the physical type is `timestamp_ntz` but the read logical type is `array<timestamp_ntz>`: because the logical type check does not happen, the conversion is allowed, yet the vectorized reader does not support it and produces an NPE in on-heap memory mode and a SEGFAULT in off-heap memory mode. The segmentation fault in off-heap memory mode can be prevented by an explicit boundary check in OffHeapColumnVector, but that is outside the scope of this PR and will be done in #43452.

### Why are the changes needed?
Prevent an NPE or segfault from happening.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test is added in `ParquetSchemaSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43451 from majdyz/SPARK-45604.

Lead-authored-by: Zamil Majdy <zamil.majdy@databricks.com>
Co-authored-by: Zamil Majdy <zamil.majdy@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 13b67ee)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Oct 22, 2023
[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader
(cherry picked from commit 13b67ee)
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader
(cherry picked from commit 13b67ee)