
[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader #43451

Closed
wants to merge 4 commits

Conversation

@majdyz (Contributor) commented Oct 19, 2023

What changes were proposed in this pull request?

Currently, the read logical type is not checked when converting the physical type INT64 into DateTime. One scenario where this breaks is when the physical type is `timestamp_ntz` but the read logical type is `array<timestamp_ntz>`: because the logical type check does not happen, the conversion is allowed, yet the vectorized reader does not support it and produces an NPE in on-heap memory mode and a SEGFAULT in off-heap memory mode. The segmentation fault in off-heap memory mode can be prevented by an explicit boundary check in OffHeapColumnVector, but that is outside the scope of this PR and will be done in #43452.
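
A minimal reproduction sketch (the path, session setup, and column name are illustrative assumptions, not taken from the PR's test):

```scala
// Write a flat TIMESTAMP_NTZ column, then read it back with a mismatched
// array<timestamp_ntz> schema through the vectorized Parquet reader.
val path = "/tmp/spark-45604" // hypothetical location
spark.sql("SELECT timestamp_ntz'2023-10-19 00:00:00' AS col")
  .write.mode("overwrite").parquet(path)
// Before this fix: NPE (on-heap) or SEGFAULT (off-heap).
// After this fix: fails cleanly with SchemaColumnConvertNotSupportedException.
spark.read.schema("col array<timestamp_ntz>").parquet(path).collect()
```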

Why are the changes needed?

Prevent an NPE or segfault from happening.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

A new test is added in ParquetSchemaSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

```scala
// From the new test in ParquetSchemaSuite; the enclosing intercept is
// truncated in this snippet and reconstructed here as an assumption.
val e = intercept[SparkException] {
  spark.read.schema(df2.schema).parquet(s"$path/parquet").collect()
}
assert(e.getCause.isInstanceOf[SparkException])
assert(e.getCause.getCause.isInstanceOf[SchemaColumnConvertNotSupportedException])
```
Member:

This exception should be migrated to SparkThrowable, and we should throw an exception with a proper error class. Please add a follow-up ticket to https://issues.apache.org/jira/browse/SPARK-37935
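
A hedged sketch of what that follow-up might look like (the class name, constructor, and error class name below are assumptions; the real migration would also register the error class in Spark's error-classes JSON):

```scala
import org.apache.spark.SparkThrowable

// Illustrative only: an exception that carries a proper error class.
class SchemaColumnConvertNotSupportedSketch(column: String, from: String, to: String)
  extends RuntimeException(s"column $column: cannot convert Parquet $from to Spark $to")
  with SparkThrowable {
  // Error class name is hypothetical, for illustration only.
  override def getErrorClass: String = "PARQUET_CONVERSION_NOT_SUPPORTED"
}
```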


```diff
@@ -109,15 +109,17 @@ public ParquetVectorUpdater getUpdater(ColumnDescriptor descriptor, DataType spa
         // For unsigned int64, it stores as plain signed int64 in Parquet when dictionary
         // fallbacks. We read them as decimal values.
         return new UnsignedLongUpdater();
-      } else if (isTimestampTypeMatched(LogicalTypeAnnotation.TimeUnit.MICROS)) {
+      } else if (sparkType instanceof DatetimeType &&
```
Member:
DatetimeType also includes the DATE type. Does this `if` handle the DATE type too?

Contributor Author:

Good point. I've limited this check only to Timestamp & TimestampNtz now.
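
For illustration, a minimal Scala sketch of the narrowed condition (the actual change lives in the Java ParquetVectorUpdaterFactory; this is not the PR's verbatim code):

```scala
import org.apache.spark.sql.types.{DataType, TimestampType, TimestampNTZType}

// Only the two timestamp types may take the INT64 -> timestamp path; DATE
// (also a DatetimeType) and nested types such as array<timestamp_ntz> fall
// through to the type-mismatch error instead.
def isSparkTimestamp(sparkType: DataType): Boolean =
  sparkType == TimestampType || sparkType == TimestampNTZType
```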

@majdyz majdyz requested a review from MaxGekk October 19, 2023 16:28
@MaxGekk MaxGekk changed the title [SPARK-45604] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader [SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader Oct 20, 2023
@MaxGekk (Member) commented Oct 20, 2023

@majdyz Does Spark 3.4.x have the same issue?

@majdyz (Contributor Author) commented Oct 20, 2023

@MaxGekk yes, and I believe the previous branches too.

@MaxGekk (Member) commented Oct 22, 2023

+1, LGTM. Merging to master/3.5/3.4.
Thank you, @majdyz.

@MaxGekk MaxGekk closed this in 13b67ee Oct 22, 2023
MaxGekk pushed a commit that referenced this pull request Oct 22, 2023
[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader

### What changes were proposed in this pull request?

Currently, the read logical type is not checked when converting the physical type INT64 into DateTime. One scenario where this breaks is when the physical type is `timestamp_ntz` but the read logical type is `array<timestamp_ntz>`: because the logical type check does not happen, the conversion is allowed, yet the vectorized reader does not support it and produces an NPE in on-heap memory mode and a SEGFAULT in off-heap memory mode. The segmentation fault in off-heap memory mode can be prevented by an explicit boundary check in OffHeapColumnVector, but that is outside the scope of this PR and will be done in #43452.

### Why are the changes needed?
Prevent an NPE or segfault from happening.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test is added in `ParquetSchemaSuite`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43451 from majdyz/SPARK-45604.

Lead-authored-by: Zamil Majdy <zamil.majdy@databricks.com>
Co-authored-by: Zamil Majdy <zamil.majdy@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 13b67ee)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Oct 22, 2023
[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader
(cherry picked from commit 13b67ee)
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
[SPARK-45604][SQL] Add LogicalType checking on INT64 -> DateTime conversion on Parquet Vectorized Reader
(cherry picked from commit 13b67ee)