
[SPARK-40610][SQL] Support unwrap date type to string type #40294

Closed
wangyum wants to merge 1 commit into apache:master from wangyum:SPARK-40610

Conversation

wangyum (Member) commented Mar 6, 2023

What changes were proposed in this pull request?

This PR enhances UnwrapCastInBinaryComparison to support unwrapping casts of string columns to date type: a binary comparison between cast(stringCol AS DATE) and a date literal is rewritten to compare the string column directly against the literal cast back to string.
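In predicate terms, the intended rewrite looks like this (illustrative SQL only, using the t1 table and string partition column dt from the example below):

-- Before: the string partition column is cast to date, which blocks
-- metastore-level partition pruning.
SELECT * FROM t1 WHERE CAST(dt AS DATE) > DATE'2023-02-27';

-- After unwrapping: the comparison is against the string column itself,
-- so the filter can be pushed to the Hive metastore.
SELECT * FROM t1 WHERE dt > '2023-02-27';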

Why are the changes needed?

This avoids always fetching all partitions when the partition filter cannot be pushed down to the Hive metastore. For example:

CREATE TABLE t1(id int, dt string) using parquet PARTITIONED BY (dt);
EXPLAIN SELECT * FROM t1 WHERE dt > date_add(current_date(), -7);

Before Spark 3.0, the partition filter was pushed to the Hive metastore:

== Physical Plan ==
*(1) FileScan parquet default.t1[id#2,dt#3] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dt#3), (dt#3 > 2023-02-27)], PushedFilters: [], ReadSchema: struct<id:int>

After SPARK-27638, the string partition column is cast to date in the comparison, so the filter cannot be converted to a Hive metastore filter and is not pushed down. As a result, all partitions are always fetched:

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet default.t1[id#5,dt#6] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#6), (cast(dt#6 as date) > 2023-02-27)], PushedFilters: [], ReadSchema: struct<id:int>

After this PR, the date type is unwrapped to string type and the partition filter is pushed to the Hive metastore again:

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet spark_catalog.default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#1), (dt#1 > 2023-02-27)], PushedFilters: [], ReadSchema: struct<id:int>

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

github-actions bot added the SQL label on Mar 6, 2023
wangyum (Member, Author) commented Mar 6, 2023

@cloud-fan @sunchao

// Rewrite `cast(stringCol as date) <cmp> dateLiteral` into
// `stringCol <cmp> cast(dateLiteral as string)`, so the comparison no longer
// casts the (possibly partition) column and can be pushed down.
case be @ BinaryComparison(
    Cast(fromExp, _, timeZoneId, evalMode), date @ Literal(value, DateType))
    if fromExp.dataType == StringType && value != null =>
  be.withNewChildren(Seq(fromExp, Cast(date, StringType, timeZoneId, evalMode)))
A contributor commented:
This is tricky, as we need to be very careful about the reverse of date string parsing. Can we reference the details of date string parsing in Cast and prove it's safe to do this optimization?
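To make the concern concrete, here is a hedged sketch (not from this PR) of how the rewrite can change results: Spark's string-to-date cast accepts non-canonical forms such as single-digit months and days, and such strings order differently as strings than as dates.

-- Original predicate shape: the date value 2023-02-08 is NOT after 2023-02-27.
SELECT CAST('2023-2-8' AS DATE) > DATE'2023-02-27';  -- false

-- Unwrapped predicate shape: lexicographically, '2023-2-8' > '2023-02-27'
-- because '2' sorts after '0' at the sixth character, so the row now matches.
SELECT '2023-2-8' > '2023-02-27';                    -- true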

Another reviewer commented:

The UnwrapCastInBinaryComparison rule is only responsible for processing casts related to numeric types, and unwrapping casts of date types is dangerous. Can we add the related conversion of partition columns to the PruneHiveTablePartitions rule instead?

wangyum (Member, Author) commented Mar 30, 2023

Closing this, because the change may cause data correctness issues. Users can set spark.sql.legacy.typeCoercion.datetimeToString.enabled to true to restore the old behavior.
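As a sketch of the mentioned workaround (assuming the conf can be set per session; verify against your Spark version):

-- Restore the pre-SPARK-27638 coercion, which compares a date and a string
-- as strings and keeps the partition filter pushable.
SET spark.sql.legacy.typeCoercion.datetimeToString.enabled=true;
EXPLAIN SELECT * FROM t1 WHERE dt > date_add(current_date(), -7);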
