[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader #43451
Conversation
```scala
spark.read.schema(df2.schema).parquet(s"$path/parquet").collect()
}
assert(e.getCause.isInstanceOf[SparkException])
assert(e.getCause.getCause.isInstanceOf[SchemaColumnConvertNotSupportedException])
```
This exception should be migrated to `SparkThrowable`, and we should throw an exception with a proper error class. Please add a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-37935
Added a subtask here: https://issues.apache.org/jira/browse/SPARK-45608
```java
@@ -109,15 +109,17 @@ public ParquetVectorUpdater getUpdater(ColumnDescriptor descriptor, DataType spa
          // For unsigned int64, it stores as plain signed int64 in Parquet when dictionary
          // fallbacks. We read them as decimal values.
          return new UnsignedLongUpdater();
-       } else if (isTimestampTypeMatched(LogicalTypeAnnotation.TimeUnit.MICROS)) {
+       } else if (sparkType instanceof DatetimeType &&
```
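The guard being discussed can be sketched in isolation. Below is a minimal, self-contained Java model of the idea: check both the Parquet logical-type time unit and the requested Spark type before choosing an updater, and reject mismatches up front. The enum names, updater names, and the `getUpdater` signature here are simplified stand-ins for illustration, not Spark's actual classes.

```java
// Toy model of the INT64 -> timestamp updater dispatch.
// All names below are illustrative, not Spark's real API.
enum ParquetTimeUnit { MILLIS, MICROS }
enum CatalystType { LONG, TIMESTAMP, TIMESTAMP_NTZ, DATE }

final class UpdaterFactoryModel {
    /**
     * An INT64 column annotated with a timestamp logical type may only be
     * read as a Spark timestamp type. Any other requested type (e.g. DATE,
     * or a nested array element) is rejected eagerly, instead of letting a
     * bad conversion corrupt the column vector later.
     */
    static String getUpdater(ParquetTimeUnit unit, CatalystType sparkType) {
        boolean isTimestamp = sparkType == CatalystType.TIMESTAMP
            || sparkType == CatalystType.TIMESTAMP_NTZ;
        if (unit == ParquetTimeUnit.MICROS && isTimestamp) {
            return "LongUpdater"; // micros map directly onto a long column
        } else if (unit == ParquetTimeUnit.MILLIS && isTimestamp) {
            return "MillisToMicrosUpdater"; // hypothetical name: millis need scaling
        }
        throw new UnsupportedOperationException(
            "Schema column convert not supported: INT64(" + unit + ") -> " + sparkType);
    }
}
```

The key point mirrored from the fix: matching on the Parquet time unit alone is not enough; the Spark-side type must also be a timestamp type.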
`DatetimeType` also includes the `DATE` type. Does this `if` handle the date types too?
Good point. I've limited this check only to Timestamp & TimestampNtz now.
@majdyz Does Spark 3.4.x have the same issue?
@MaxGekk yes, and I believe the previous branches too.
+1, LGTM. Merging to master/3.5/3.4.
What changes were proposed in this pull request?
Currently, the read logical type is not checked while converting the physical type INT64 into DateTime. One scenario where this breaks: the physical type is `timestamp_ntz` and the logical type is `array<timestamp_ntz>`. Since the logical-type check does not happen, the conversion is allowed; however, the vectorized reader does not support it and will produce an NPE in on-heap memory mode and a SEGFAULT in off-heap memory mode. The segmentation fault in off-heap memory mode can be prevented by an explicit boundary check in OffHeapColumnVector, but that is outside the scope of this PR and will be done here: #43452.

Why are the changes needed?
Prevent the NPE or segfault from happening.

Does this PR introduce any user-facing change?
No.

How was this patch tested?
A new test is added in `ParquetSchemaSuite`.

Was this patch authored or co-authored using generative AI tooling?
No.
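Why does the missing check end in an NPE or segfault rather than a clean error? A toy illustration (plain Java, not Spark code): if the reader believes a column holds N flat INT64 values but the vector was sized for a different, nested layout, writes run past the allocation. On-heap this surfaces as a Java exception; with off-heap memory, raw writes past the buffer can crash the JVM instead. The class and method names below are invented for this sketch.

```java
// Toy model: a reader writing more rows than the vector was allocated for.
final class VectorOverrunDemo {
    // Writes `claimedRows` longs into `vector`; overruns when
    // claimedRows > vector.length, which is the on-heap analogue of the
    // off-heap out-of-bounds write that can segfault.
    static void fillFlat(long[] vector, int claimedRows) {
        for (int i = 0; i < claimedRows; i++) {
            vector[i] = i;
        }
    }

    // Returns true if filling `claimedRows` rows into a vector of
    // `allocatedRows` rows runs out of bounds.
    static boolean overruns(int allocatedRows, int claimedRows) {
        try {
            fillFlat(new long[allocatedRows], claimedRows);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }
}
```

The fix in this PR avoids ever reaching that state by rejecting the INT64-to-non-timestamp conversion when the updater is selected; the separate boundary check in #43452 is defense in depth for the off-heap path.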