@IvanK-db IvanK-db commented Aug 8, 2024

What changes were proposed in this pull request?

This PR:

  • Adds support for DATE_TRUNC in V2 optimization pushdown
  • Consumes this pushdown for Postgres & H2 Connectors

Why are the changes needed?

Performance improvements.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New integration test in PostgresIntegrationSuite.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Aug 8, 2024
"VAR_POP", "VAR_SAMP", "STDDEV_POP", "STDDEV_SAMP", "COVAR_POP", "COVAR_SAMP", "CORR",
"REGR_INTERCEPT", "REGR_R2", "REGR_SLOPE", "REGR_SXY")
private val supportedFunctions = supportedAggregateFunctions
private val supportedDatetimeFunctions = Set("DATE_TRUNC")

private val supportedAggregateFunctions: Set[String] = Set(
  "MAX", "MIN", "SUM", "COUNT", "AVG", "VAR_POP", "VAR_SAMP", "STDDEV_POP",
  "STDDEV_SAMP", "COVAR_POP", "COVAR_SAMP", "CORR", "REGR_INTERCEPT",
  "REGR_R2", "REGR_SLOPE", "REGR_SXY", "DATE_TRUNC")

This approach simplifies the code and keeps everything in one place. Let me know what you think!

Author

IMO, DATE_TRUNC should not be in supportedAggregateFunctions, as it is not an aggregate function.

@HyukjinKwon
Member

cc @beliefer

s"""
   |SELECT DATE_TRUNC('$format', time), COUNT(*)
   |FROM $catalogName.datetime_table
   |GROUP BY 1
Contributor

Thanks for adding new tests! Can we also add tests where DATE_TRUNC is part of a predicate (WHERE DATE_TRUNC(...) = ...) to verify it works when we push down predicates? I am not sure whether we do pushdowns in projections.
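For context, the semantics a pushed-down DATE_TRUNC predicate relies on can be sketched in plain Python. This is a simplified illustration covering only a few units, not Spark's implementation:

```python
from datetime import datetime

def date_trunc(unit: str, ts: datetime) -> datetime:
    """Truncate a timestamp to the given unit (simplified DATE_TRUNC sketch)."""
    unit = unit.upper()
    if unit == "YEAR":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "MONTH":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "DAY":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "HOUR":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

# A predicate such as DATE_TRUNC('DAY', time) = DATE '2022-05-19' matches every
# timestamp on that day, regardless of the time-of-day component.
```

Whichever engine evaluates it (Spark or the remote database after pushdown), the predicate must select the same rows, which is what the requested tests would verify.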

Author

Added tests for that as well

Contributor

Why introduce an aggregate here? We just need to test predicates.

Author

The original query used to point out that the function wasn't being pushed down was similar to this one, so I modeled the test case after it. Should we remove the aggregate test cases?

Contributor

Yes. Let's remove them.

Author

Done

private val supportedDatetimeFunctions = Set("DATE_TRUNC")
private val supportedFunctions =
  supportedAggregateFunctions ++
    supportedDatetimeFunctions
Contributor

Shall we remove supportedDatetimeFunctions by
private val supportedFunctions = supportedAggregateFunctions ++ Set("DATE_TRUNC") ?

Author

@IvanK-db IvanK-db Sep 5, 2024

My thinking here is that we might push down other datetime functions in the future, and this provides a convenient grouping for them, similar to the existing supportedAggregateFunctions.

Contributor

Aggregate functions are very different from the other kinds (string functions, datetime functions, and so on), and we have already defined a lot of aggregate functions.
We can define supportedDatetimeFunctions in the future if needed.

Author

Done

def testAggregatePushdown(format: String, expectedResult: Set[Row]): Unit = {
  val df = sql(
    s"""
       |SELECT DATE_TRUNC('$format', time), COUNT(*)
Contributor

Is DATE_TRUNC supported by the H2 database?

Author

https://www.h2database.com/html/functions.html#date_trunc

Looks like yes; however, the original task was to add support only for Postgres for now.

Contributor

I think we should support the H2 dialect first, then other database dialects.

Author

Added pushdown for H2 as well

@IvanK-db IvanK-db requested a review from beliefer September 5, 2024 15:20
checkAnswer(df9, Seq(Row("alex")))

val df10 = sql("SELECT name FROM h2.test.datetime WHERE " +
"DATE_TRUNC('DAY', date1) = date'2022-05-19'")
Contributor

Shall we add more test cases for the other formats supported by Spark, such as YEAR, MM, and so on?

Contributor

Let's create a separate test case.

Author

Done

@IvanK-db IvanK-db requested a review from beliefer September 12, 2024 09:08
val filters = df.queryExecution.optimizedPlan.collect {
case f: Filter => f
}
assert(filters.isEmpty)
Contributor

Please reuse checkFilterPushed here.
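For readers outside the test suite: a checkFilterPushed-style helper walks the optimized plan and asserts whether any Filter nodes remain. A toy Python sketch of that idea (the node class and helper names here are hypothetical, not Spark's API):

```python
class Node:
    """Minimal stand-in for a query-plan tree node."""
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def collect(plan, predicate):
    """Collect all nodes matching predicate, like TreeNode.collect in Spark."""
    matches = [plan] if predicate(plan) else []
    for child in plan.children:
        matches += collect(child, predicate)
    return matches

def check_filter_pushed(plan, pushed=True):
    """If pushed, no Filter node may remain in the plan; otherwise one must."""
    filters = collect(plan, lambda n: n.name == "Filter")
    assert (not filters) if pushed else filters

scan = Node("Scan")
check_filter_pushed(scan, pushed=True)                     # pushed: the source handles the filter
check_filter_pushed(Node("Filter", [scan]), pushed=False)  # not pushed: Filter stays in the plan
```

Reusing the shared helper keeps the assertion logic consistent across the pushdown tests instead of re-collecting Filter nodes by hand.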

Author

Done

@IvanK-db IvanK-db requested a review from beliefer September 12, 2024 15:07
@IvanK-db
Author

@beliefer Kind reminder, all comments have been addressed.

Contributor

@beliefer beliefer left a comment

LGTM. cc @cloud-fan Do you have any other comments?

}

- private def checkFilterPushed(df: DataFrame, pushed: Boolean = true): Unit = {
+ protected def checkFilterPushed(df: DataFrame, pushed: Boolean = true): Unit = {
Contributor

Could you rebase the branch? The modifier is already protected upstream.

Contributor

@milastdbx milastdbx left a comment

Let's add more tests for date_trunc precisions.

Also, please update the description a bit. In this PR you are:

  • Adding support for DATE_TRUNC in V2 optimization pushdown
  • Consuming this pushdown for the Postgres & H2 connectors

}
}

test("SPARK-49162: Push down filter date_trunc function") {
Contributor

Reading the documentation for pgsql:
https://www.postgresql.org/docs/current/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC
and Databricks:
https://docs.databricks.com/en/sql/language-manual/functions/date_trunc.html

I see that they support microseconds while we support microsecond.

Can we test for all our supported precisions? Also, can we try different casings?

Contributor

+1, let's test all the supported units.

Contributor

Yes. Could we add other precisions for H2 and Postgres? @IvanK-db

Author

Added

case _: DateAdd => generateExpressionWithName("DATE_ADD", expr, isPredicate)
case _: DateDiff => generateExpressionWithName("DATE_DIFF", expr, isPredicate)
case _: TruncDate => generateExpressionWithName("TRUNC", expr, isPredicate)
case _: TruncTimestamp => generateExpressionWithName("DATE_TRUNC", expr, isPredicate)
Contributor

Is DATE_TRUNC a standard SQL function, or widely supported in the industry?

Author

Not a standard function, but it's supported by several databases: Postgres, Redshift, Snowflake, BigQuery.

val expectedPlanFragment9 =
"PushedFilters: [(DATE_TRUNC('MicroseconD', TIME1)) = 1725560625000000]"
checkPushedInfo(df9, expectedPlanFragment9)
checkAnswer(df9, Seq(Row("adam")))
Contributor

Please add test cases for all supported precisions.
Please also add negative test cases, such as MicroSecondS.

Author

Added all precisions. MicroSecondS turns out to be a positive test case; it looks to be an alias of microsecond, so I added a different negative test case instead.
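The reason MicroSecondS is positive: unit names are matched case-insensitively, and per the discussion above the plural form is an alias of the singular. A small illustrative sketch of that normalization (the alias table is a subset for illustration; the real set also includes abbreviations such as YYYY, YY, MON, MM, DD):

```python
# Illustrative alias map, not Spark's actual table.
UNIT_ALIASES = {
    "year": "YEAR", "years": "YEAR",
    "month": "MONTH", "months": "MONTH",
    "day": "DAY", "days": "DAY",
    "hour": "HOUR", "hours": "HOUR",
    "minute": "MINUTE", "minutes": "MINUTE",
    "second": "SECOND", "seconds": "SECOND",
    "millisecond": "MILLISECOND", "milliseconds": "MILLISECOND",
    "microsecond": "MICROSECOND", "microseconds": "MICROSECOND",
}

def normalize_unit(unit: str):
    """Case-insensitive lookup; returns None for an unrecognized unit."""
    return UNIT_ALIASES.get(unit.lower())

# 'MicroSecondS' lowercases to 'microseconds', a valid alias, so it is a
# positive case; a genuine misspelling like 'microsecnd' is the negative case.
```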

Contributor

@beliefer beliefer left a comment

LGTM.

@cloud-fan
Contributor

cloud-fan commented Oct 8, 2024

My biggest concern is that we translate the Spark TruncTimestamp expression to the v2 DATE_TRUNC function. DATE_TRUNC is not in the SQL standard, and personally I think it's more reasonable to let the v2 TRUNC function accept timestamp inputs. Redshift's TRUNC function also supports timestamps: https://docs.aws.amazon.com/redshift/latest/dg/r_TRUNC.html

also cc @srielau
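The alternative being proposed, a single v2 TRUNC function whose behavior depends on the input type, could look roughly like this. This is purely an illustrative sketch of the dispatch idea, not Spark's API:

```python
from datetime import date, datetime

def trunc(unit: str, value):
    """One TRUNC entry point handling both dates and timestamps (illustrative)."""
    unit = unit.upper()
    # datetime is a subclass of date, so the timestamp branch must come first.
    if isinstance(value, datetime):
        if unit == "DAY":
            return value.replace(hour=0, minute=0, second=0, microsecond=0)
        if unit == "MONTH":
            return value.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    elif isinstance(value, date):
        if unit == "MONTH":
            return value.replace(day=1)
        if unit == "YEAR":
            return value.replace(month=1, day=1)
    raise ValueError(f"unsupported unit/type: {unit}/{type(value).__name__}")
```

Under this design, TRUNC would subsume both the date and timestamp cases, avoiding a non-standard DATE_TRUNC in the v2 function catalog.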

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 17, 2025
@github-actions github-actions bot closed this Jan 18, 2025
