[SPARK-39259][SQL] Evaluate timestamps consistently in subqueries #36654
Conversation
Can one of the admins verify this patch?
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala
LGTM except for a few minor comments.
+1, LGTM. Merging to master.
@olaky Could you open separate PRs with backports to branch-3.3 and branch-3.2 (according to SPARK-39259, 3.2 has this issue)? Congratulations on your first contribution to Apache Spark, and welcome to the Spark community!
### What changes were proposed in this pull request?
Apply the optimizer rule `ComputeCurrentTime` consistently across subqueries. This is a backport of #36654.

### Why are the changes needed?
At the moment, timestamp functions like `now()` can return different values within a query if subqueries are involved.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
A new unit test was added.

Closes #36752 from olaky/SPARK-39259-spark_3_3.

Authored-by: Ole Sasse <ole.sasse@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
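The rule's core idea can be sketched on a toy expression tree (hypothetical types, not the real Catalyst API): the clock is read exactly once per query, and that single literal is substituted for every current-time expression, including those nested inside subqueries.

```scala
// Toy plan nodes standing in for Catalyst expressions (hypothetical).
sealed trait Expr
case object CurrentTime extends Expr
case class TimeLiteral(micros: Long) extends Expr
case class Subquery(child: Expr) extends Expr
case class Pair(left: Expr, right: Expr) extends Expr

def computeCurrentTime(plan: Expr): Expr = {
  // Evaluate the clock exactly once, then substitute everywhere.
  val now = TimeLiteral(System.currentTimeMillis() * 1000L)
  def replace(e: Expr): Expr = e match {
    case CurrentTime     => now
    case Subquery(child) => Subquery(replace(child)) // descend into subqueries too
    case Pair(l, r)      => Pair(replace(l), replace(r))
    case lit             => lit
  }
  replace(plan)
}

// One CurrentTime in the outer query, one inside a subquery:
val rewritten = computeCurrentTime(Pair(CurrentTime, Subquery(CurrentTime)))
val Pair(outer, Subquery(inner)) = rewritten
```

Before this fix, the subquery's occurrence was rewritten in a separate pass, so `outer` and `inner` could hold different timestamps; after it, both positions receive the same literal.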
…` in `ComputeCurrentTimeSuite`

### What changes were proposed in this pull request?
Unfortunately, #36654 causes seven Scala 2.13 test failures in master/3.3 and Apache Spark 3.3 RC4. This PR aims to fix the Scala 2.13 `ClassCastException` in the test code.

### Why are the changes needed?
```
$ dev/change-scala-version.sh 2.13
$ build/sbt "catalyst/testOnly *.ComputeCurrentTimeSuite" -Pscala-2.13
...
[info] ComputeCurrentTimeSuite:
[info] - analyzer should replace current_timestamp with literals *** FAILED *** (1 second, 189 milliseconds)
[info]   java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to scala.collection.immutable.Seq
[info]   at org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.literals(ComputeCurrentTimeSuite.scala:146)
[info]   at org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.$anonfun$new$1(ComputeCurrentTimeSuite.scala:47)
...
[info] *** 7 TESTS FAILED ***
[error] Failed tests:
[error]   org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite
[error] (catalyst / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 189 s (03:09), completed Jun 3, 2022 10:29:39 AM
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the CIs and manual tests with Scala 2.13.
```
$ dev/change-scala-version.sh 2.13
$ build/sbt "catalyst/testOnly *.ComputeCurrentTimeSuite" -Pscala-2.13
...
[info] ComputeCurrentTimeSuite:
[info] - analyzer should replace current_timestamp with literals (545 milliseconds)
[info] - analyzer should replace current_date with literals (11 milliseconds)
[info] - SPARK-33469: Add current_timezone function (3 milliseconds)
[info] - analyzer should replace localtimestamp with literals (4 milliseconds)
[info] - analyzer should use equal timestamps across subqueries (182 milliseconds)
[info] - analyzer should use consistent timestamps for different timezones (13 milliseconds)
[info] - analyzer should use consistent timestamps for different timestamp functions (2 milliseconds)
[info] Run completed in 1 second, 579 milliseconds.
[info] Total number of tests run: 7
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 12 s, completed Jun 3, 2022, 10:54:03 AM
```

Closes #36762 from dongjoon-hyun/SPARK-39259.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
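The failing cast can be reduced to a few lines (a hypothetical reduction; the actual fix lives in `ComputeCurrentTimeSuite`). On Scala 2.13 the default `scala.Seq` aliases `scala.collection.immutable.Seq`, so a mutable `ArrayBuffer` can no longer be cast to `Seq`; it has to be converted.

```scala
import scala.collection.mutable.ArrayBuffer

// On 2.12, Seq is collection.Seq and the cast below happens to succeed;
// on 2.13, Seq is immutable.Seq, so it throws ClassCastException:
//   buffer.asInstanceOf[Seq[Long]]   // compiles, fails at runtime on 2.13
val buffer = ArrayBuffer(1L, 2L, 3L)

// Portable fix for this kind of failure: an explicit conversion,
// which behaves the same on both 2.12 and 2.13.
val literals: Seq[Long] = buffer.toSeq
```

Preferring `.toSeq` over `asInstanceOf` also makes the intent explicit: the test wants an immutable snapshot of the collected literals, not a runtime reinterpretation of the buffer's type.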
This looks like it might be a fix for a correctness issue? If so, we should probably backport this change to maintenance branches for the other currently-supported Spark versions.
```diff
@@ -479,21 +479,24 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
    * first to this node, then this node's subqueries and finally this node's children.
    * When the partial function does not apply to a given node, it is left unchanged.
    */
-  def transformDownWithSubqueries(f: PartialFunction[PlanType, PlanType]): PlanType = {
+  def transformDownWithSubqueries(
```
Instead of modifying this method's signature, I slightly prefer adding an overload named `transformDownWithSubqueriesAndPruning`, in order to remain consistent with the naming conventions for other transform methods.

Adding a new method also avoids introducing source or binary compatibility issues for third-party code that calls Catalyst APIs. Technically speaking, Catalyst APIs are considered internal to Spark and are subject to change between minor releases (see source), but I think it's nice to try to avoid API breakage when feasible.

I also ran into problems when trying to call `transformDownWithSubqueries` without supplying any arguments in the first argument group: calling `transformDownWithSubqueries() { f }` resulted in confusing compilation errors.

As a result, I'm going to submit a followup to this PR to split this into two methods as I've suggested above.
Here's my PR: #36765
…in transformDownWithSubqueries

### What changes were proposed in this pull request?
This is a followup to #36654. That PR modified the existing `QueryPlan.transformDownWithSubqueries` to add additional arguments for tree pattern pruning. In this PR, I roll back the change to that method's signature and instead add a new `transformDownWithSubqueriesAndPruning` method.

### Why are the changes needed?
The original change breaks binary and source compatibility in Catalyst. Technically speaking, Catalyst APIs are considered internal to Spark and are subject to change between minor releases (see [source](https://github.com/apache/spark/blob/bb51add5c79558df863d37965603387d40cc4387/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L20-L24)), but I think it's nice to try to avoid API breakage when possible.

While trying to compile some custom Catalyst code, I ran into issues when trying to call the `transformDownWithSubqueries` method without supplying a tree pattern filter condition. If I do `transformDownWithSubqueries() { f }`, then I get a compilation error. I think this is due to the first parameter group containing all default parameters.

My PR's solution of adding a new `transformDownWithSubqueriesAndPruning` method solves this problem. It's also more consistent with the naming convention used for other pruning-enabled tree transformation methods.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #36765 from JoshRosen/SPARK-39259-binary-compatibility-followup.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
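The compatibility pattern described above can be sketched on a toy tree type (hypothetical names and signatures; the real methods live on `QueryPlan`): the original single-parameter-group method keeps its exact signature, and pruning is offered through a separate overload rather than through defaulted parameters grafted onto the old method.

```scala
// A toy tree standing in for a query plan (hypothetical).
case class Node(value: Int, children: Seq[Node] = Nil) {

  // Original API: signature untouched, so existing third-party callers
  // still compile and link against it.
  def transformDownWithSubqueries(f: Node => Node): Node =
    transformDownWithSubqueriesAndPruning(_ => true)(f)

  // New API: the pruning condition lives in its own, non-defaulted
  // parameter group, so call sites stay unambiguous.
  def transformDownWithSubqueriesAndPruning(cond: Node => Boolean)(f: Node => Node): Node =
    if (!cond(this)) this // pruned subtrees are returned unchanged
    else {
      val applied = f(this) // apply top-down: this node first...
      applied.copy(children = // ...then recurse into the children
        applied.children.map(_.transformDownWithSubqueriesAndPruning(cond)(f)))
    }
}

val tree = Node(1, Seq(Node(2)))
val result = tree.transformDownWithSubqueries(n => n.copy(value = n.value * 10))
```

Because the old method now simply delegates with an always-true condition, behavior is unchanged for existing callers while pruning-aware callers opt in explicitly.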
### What changes were proposed in this pull request?
Apply the optimizer rule `ComputeCurrentTime` consistently across subqueries. This is a backport of #36654.

### Why are the changes needed?
At the moment, timestamp functions like `now()` can return different values within a query if subqueries are involved.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
A new unit test was added.

Closes #36753 from olaky/SPARK-39259-spark_3_2.

Lead-authored-by: Ole Sasse <ole.sasse@databricks.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Apply the optimizer rule `ComputeCurrentTime` consistently across subqueries. This is a backport of #36654 with adjustments:
* The rule does not use pruning
* The `transformWithSubqueries` function was also backported

### Why are the changes needed?
At the moment, timestamp functions like `now()` can return different values within a query if subqueries are involved.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
A new unit test was added.

Closes #36818 from olaky/SPARK-39259-spark_3_1.

Authored-by: Ole Sasse <ole.sasse@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Apply the optimizer rule `ComputeCurrentTime` consistently across subqueries.

### Why are the changes needed?
At the moment, timestamp functions like `now()` can return different values within a query if subqueries are involved.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
A new unit test was added.