
[SPARK-40610][SQL] Support unwrap date type to string type #40294

Closed
wangyum wants to merge 1 commit into apache:master from wangyum:SPARK-40610

Conversation

wangyum (Member) commented Mar 6, 2023

What changes were proposed in this pull request?

This PR enhances UnwrapCastInBinaryComparison to support unwrapping casts of string columns to date type: a binary comparison between cast(stringCol AS DATE) and a date literal is rewritten to compare the string column directly against the literal cast back to string.
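In predicate terms, the intended rewrite looks like this (illustrative SQL only, using the t1 table and string partition column dt from the example below):

-- Before: the string partition column is cast to date, which blocks
-- metastore-level partition pruning.
SELECT * FROM t1 WHERE CAST(dt AS DATE) > DATE'2023-02-27';

-- After unwrapping: the comparison is against the string column itself,
-- so the filter can be pushed to the Hive metastore.
SELECT * FROM t1 WHERE dt > '2023-02-27';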

Why are the changes needed?

This avoids always fetching all partitions when the partition filter cannot be pushed down to the Hive metastore. For example:

CREATE TABLE t1(id int, dt string) using parquet PARTITIONED BY (dt);
EXPLAIN SELECT * FROM t1 WHERE dt > date_add(current_date(), -7);

Before Spark 3.0, the partition filter was pushed to the Hive metastore:

== Physical Plan ==
*(1) FileScan parquet default.t1[id#2,dt#3] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dt#3), (dt#3 > 2023-02-27)], PushedFilters: [], ReadSchema: struct<id:int>

After SPARK-27638, the string partition column is cast to date in the comparison, so the filter cannot be converted to a Hive metastore filter and is not pushed down. As a result, all partitions are always fetched:

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet default.t1[id#5,dt#6] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#6), (cast(dt#6 as date) > 2023-02-27)], PushedFilters: [], ReadSchema: struct<id:int>

After this PR, the date type is unwrapped to string type and the partition filter is pushed to the Hive metastore again:

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet spark_catalog.default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#1), (dt#1 > 2023-02-27)], PushedFilters: [], ReadSchema: struct<id:int>

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test.

github-actions bot added the SQL label on Mar 6, 2023
wangyum (Member, Author) commented Mar 6, 2023

@cloud-fan @sunchao

// Rewrite `cast(stringCol as date) <cmp> dateLiteral` into
// `stringCol <cmp> cast(dateLiteral as string)`, so the comparison no longer
// casts the (possibly partition) column and can be pushed down.
case be @ BinaryComparison(
    Cast(fromExp, _, timeZoneId, evalMode), date @ Literal(value, DateType))
    if fromExp.dataType == StringType && value != null =>
  be.withNewChildren(Seq(fromExp, Cast(date, StringType, timeZoneId, evalMode)))
A contributor commented:
This is tricky, as we need to be very careful about the reverse of date string parsing. Can we reference the details of date string parsing in Cast and prove it's safe to do this optimization?
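To make the concern concrete, here is a hedged sketch (not from this PR) of how the rewrite can change results: Spark's string-to-date cast accepts non-canonical forms such as single-digit months and days, and such strings order differently as strings than as dates.

-- Original predicate shape: the date value 2023-02-08 is NOT after 2023-02-27.
SELECT CAST('2023-2-8' AS DATE) > DATE'2023-02-27';  -- false

-- Unwrapped predicate shape: lexicographically, '2023-2-8' > '2023-02-27'
-- because '2' sorts after '0' at the sixth character, so the row now matches.
SELECT '2023-2-8' > '2023-02-27';                    -- true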

Another reviewer commented:

The UnwrapCastInBinaryComparison rule is only responsible for processing casts related to numeric types, and unwrapping casts of date types is dangerous. Can we add the related conversion of partition columns to the PruneHiveTablePartitions rule instead?

wangyum (Member, Author) commented Mar 30, 2023

Closing this, because the change may cause data correctness issues. Users can set spark.sql.legacy.typeCoercion.datetimeToString.enabled to true to restore the old behavior.
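As a sketch of the mentioned workaround (assuming the conf can be set per session; verify against your Spark version):

-- Restore the pre-SPARK-27638 coercion, which compares a date and a string
-- as strings and keeps the partition filter pushable.
SET spark.sql.legacy.typeCoercion.datetimeToString.enabled=true;
EXPLAIN SELECT * FROM t1 WHERE dt > date_add(current_date(), -7);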
