
[SPARK-39259][SQL] Evaluate timestamps consistently in subqueries #36654

Closed
wants to merge 11 commits

Conversation

@olaky (Contributor) commented May 24, 2022

What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

Why are the changes needed?

At the moment, timestamp functions like now() can return different values within a single query if subqueries are involved, because the rule is not applied inside subquery plans.
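
For illustration, a minimal repro sketch (hypothetical session code, assuming a `SparkSession` named `spark`; not part of the PR). Before this change, the outer `now()` and the subquery's `now()` could be evaluated at different instants; afterwards both are replaced by the same literal:

```scala
// Could return false before this change; always true afterwards.
val df = spark.sql("SELECT now() = (SELECT now()) AS same_instant")
df.show()
```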

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new unit test was added.
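
At a high level, the fix computes the clock once and rewrites every occurrence while also descending into subquery plans. A simplified sketch of the idea (not the exact Spark source; the real rule also rewrites current_date() and localtimestamp()):

```scala
import java.time.Instant

import org.apache.spark.sql.catalyst.expressions.{CurrentTimestamp, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.util.DateTimeUtils.instantToMicros
import org.apache.spark.sql.types.TimestampType

object ComputeCurrentTimeSketch extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    // Evaluate the clock exactly once per query.
    val timestamp = Literal.create(instantToMicros(Instant.now()), TimestampType)
    // Walk the main plan and all subquery plans, rewriting each node's
    // expressions so that every occurrence collapses to the same literal.
    plan.transformDownWithSubqueries {
      case node => node.transformExpressions {
        case _: CurrentTimestamp => timestamp
      }
    }
  }
}
```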

@github-actions github-actions bot added the SQL label May 24, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@olaky olaky changed the title [WIP][SPARK-39259] Evaluate timestamps consistently in subqueries [SPARK-39259] Evaluate timestamps consistently in subqueries May 27, 2022
@olaky olaky requested a review from MaxGekk May 31, 2022 09:19
@MaxGekk MaxGekk changed the title [SPARK-39259] Evaluate timestamps consistently in subqueries [SPARK-39259][SQL] Evaluate timestamps consistently in subqueries Jun 2, 2022
@MaxGekk (Member) left a comment


LGTM except for a few minor comments.

@MaxGekk (Member) commented Jun 2, 2022

+1, LGTM. Merging to master.
Thank you, @olaky.

@MaxGekk MaxGekk closed this in 52e2717 Jun 2, 2022
@MaxGekk (Member) commented Jun 2, 2022

@olaky Could you open separate PRs with backports to branch-3.3 and branch-3.2? (According to SPARK-39259, 3.2 has this issue as well.)

Congratulations on your first contribution to Apache Spark, and welcome to the Spark community!

MaxGekk pushed a commit that referenced this pull request Jun 3, 2022
### What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of #36654.

### Why are the changes needed?

At the moment, timestamp functions like now() can return different values within a single query if subqueries are involved.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test was added

Closes #36752 from olaky/SPARK-39259-spark_3_3.

Authored-by: Ole Sasse <ole.sasse@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@dongjoon-hyun (Member) left a comment


Hi, @olaky and @MaxGekk . Unfortunately, this broke Scala 2.13 in master/branch-3.3 and RC4. I made a PR to fix it.

dongjoon-hyun added a commit that referenced this pull request Jun 3, 2022
…` in `ComputeCurrentTimeSuite`

### What changes were proposed in this pull request?

Unfortunately, #36654 causes seven Scala 2.13 test failures in master/3.3 and Apache Spark 3.3 RC4.
This PR aims to fix Scala 2.13 ClassCastException in the test code.

### Why are the changes needed?

```
$ dev/change-scala-version.sh 2.13
$ build/sbt "catalyst/testOnly *.ComputeCurrentTimeSuite" -Pscala-2.13
...
[info] ComputeCurrentTimeSuite:
[info] - analyzer should replace current_timestamp with literals *** FAILED *** (1 second, 189 milliseconds)
[info]   java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to scala.collection.immutable.Seq
[info]   at org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.literals(ComputeCurrentTimeSuite.scala:146)
[info]   at org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.$anonfun$new$1(ComputeCurrentTimeSuite.scala:47)
...
[info] *** 7 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite
[error] (catalyst / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 189 s (03:09), completed Jun 3, 2022 10:29:39 AM
```
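
The root cause is a Scala version difference rather than the optimizer change itself: in 2.13, plain `Seq` is an alias for `scala.collection.immutable.Seq`, so a mutable `ArrayBuffer` can no longer be cast to it. A standalone illustration (hypothetical demo, not the suite's actual code):

```scala
import scala.collection.mutable.ArrayBuffer

object SeqAliasDemo extends App {
  val buffer = ArrayBuffer(1, 2, 3)

  // Scala 2.12: scala.Seq aliases scala.collection.Seq, so this cast succeeds.
  // Scala 2.13: scala.Seq aliases scala.collection.immutable.Seq, and the same
  // cast throws the ClassCastException seen above.
  // val bad = buffer.asInstanceOf[Seq[Int]]

  // Portable across both versions: make an explicit immutable copy.
  val fixed: Seq[Int] = buffer.toSeq
  println(fixed)
}
```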

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manual tests with Scala 2.13.

```
$ dev/change-scala-version.sh 2.13
$ build/sbt "catalyst/testOnly *.ComputeCurrentTimeSuite" -Pscala-2.13
...
[info] ComputeCurrentTimeSuite:
[info] - analyzer should replace current_timestamp with literals (545 milliseconds)
[info] - analyzer should replace current_date with literals (11 milliseconds)
[info] - SPARK-33469: Add current_timezone function (3 milliseconds)
[info] - analyzer should replace localtimestamp with literals (4 milliseconds)
[info] - analyzer should use equal timestamps across subqueries (182 milliseconds)
[info] - analyzer should use consistent timestamps for different timezones (13 milliseconds)
[info] - analyzer should use consistent timestamps for different timestamp functions (2 milliseconds)
[info] Run completed in 1 second, 579 milliseconds.
[info] Total number of tests run: 7
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 12 s, completed Jun 3, 2022, 10:54:03 AM
```

Closes #36762 from dongjoon-hyun/SPARK-39259.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
dongjoon-hyun added a commit that referenced this pull request Jun 3, 2022
…` in `ComputeCurrentTimeSuite`

(cherry picked from commit d79aa36)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@JoshRosen (Contributor)

This looks like it might be a fix for a correctness issue? If so, we should probably backport this change to maintenance branches for the other currently-supported Spark versions.

```diff
@@ -479,21 +479,24 @@ abstract class QueryPlan[PlanType <: QueryPlan[PlanType]]
    * first to this node, then this node's subqueries and finally this node's children.
    * When the partial function does not apply to a given node, it is left unchanged.
    */
-  def transformDownWithSubqueries(f: PartialFunction[PlanType, PlanType]): PlanType = {
+  def transformDownWithSubqueries(
```
@JoshRosen (Contributor)

Instead of modifying this method's signature, I slightly prefer adding an overload named transformDownWithSubqueriesAndPruning in order to remain consistent with the naming conventions for other transform methods.

Adding a new method also avoids introducing source or binary compatibility issues for third-party code that calls Catalyst APIs. Technically speaking, Catalyst APIs are considered internal to Spark and are subject to change between minor releases (see source), but I think it's nice to try to avoid API breakage when feasible.

I also ran into problems when trying to call transformDownWithSubqueries without supplying any arguments in the first argument group: calling transformDownWithSubqueries() { f } resulted in confusing compilation errors.

As a result, I'm going to submit a followup to this PR to split this into two methods as I've suggested above.
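
A toy reproduction of the call-site problem (hypothetical method and names, not the Catalyst signature): when the first parameter group consists entirely of default parameters, the curried function block cannot be passed without spelling out the empty parentheses.

```scala
object CurriedDefaultsDemo extends App {
  def transform(cond: Int => Boolean = _ => true)(f: PartialFunction[Int, Int]): Int =
    if (cond(1)) f.applyOrElse(1, identity[Int]) else 0

  // Does not compile: the block is taken as the argument to the *first*
  // parameter list, where an Int => Boolean is expected.
  // transform { case n => n * 2 }

  // Compiles in this toy case, but only with the awkward empty parentheses;
  // the real Catalyst method reportedly produced confusing errors even here.
  println(transform() { case n => n * 2 })  // prints 2
}
```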

@JoshRosen (Contributor)

Here's my PR: #36765

MaxGekk pushed a commit that referenced this pull request Jun 4, 2022
…in transformDownWithSubqueries

### What changes were proposed in this pull request?

This is a followup to #36654. That PR modified the existing `QueryPlan.transformDownWithSubqueries` to add additional arguments for tree pattern pruning.

In this PR, I roll back the change to that method's signature and instead add a new `transformDownWithSubqueriesAndPruning` method.
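
A sketch of how a rule might call the new method (assumed usage, not quoted from the PR): the caller passes a tree-pattern predicate so subtrees that cannot contain matching expressions are skipped entirely.

```scala
import org.apache.spark.sql.catalyst.expressions.{CurrentTimestamp, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.trees.TreePattern.CURRENT_LIKE

// Only subtrees whose pattern bits contain CURRENT_LIKE are visited; the
// rest of the plan (and its subqueries) is pruned from the traversal.
def replaceTimestamps(plan: LogicalPlan, timestamp: Literal): LogicalPlan =
  plan.transformDownWithSubqueriesAndPruning(_.containsPattern(CURRENT_LIKE)) {
    case node =>
      node.transformExpressionsWithPruning(_.containsPattern(CURRENT_LIKE)) {
        case _: CurrentTimestamp => timestamp
      }
  }
```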

### Why are the changes needed?

The original change breaks binary and source compatibility in Catalyst. Technically speaking, Catalyst APIs are considered internal to Spark and are subject to change between minor releases (see [source](https://github.com/apache/spark/blob/bb51add5c79558df863d37965603387d40cc4387/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/package.scala#L20-L24)), but I think it's nice to try to avoid API breakage when possible.

While trying to compile some custom Catalyst code, I ran into issues when trying to call the `transformDownWithSubqueries` method without supplying a tree pattern filter condition. If I do `transformDownWithSubqueries() { f }`, I get a compilation error. I think this is due to the first parameter group containing all default parameters.

My PR's solution of adding a new `transformDownWithSubqueriesAndPruning` method solves this problem. It's also more consistent with the naming convention used for other pruning-enabled tree transformation methods.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #36765 from JoshRosen/SPARK-39259-binary-compatibility-followup.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Jun 4, 2022
…in transformDownWithSubqueries

(cherry picked from commit eda6c4b)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
olaky pushed a commit to olaky/spark that referenced this pull request Jun 7, 2022
…` in `ComputeCurrentTimeSuite`

MaxGekk pushed a commit that referenced this pull request Jun 7, 2022
### What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of #36654.

### Why are the changes needed?

At the moment, timestamp functions like now() can return different values within a single query if subqueries are involved.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test was added

Closes #36753 from olaky/SPARK-39259-spark_3_2.

Lead-authored-by: Ole Sasse <ole.sasse@databricks.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Jun 12, 2022
### What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of #36654 with adjustments:
* The rule does not use pruning
* The transformWithSubqueries function was also backported

### Why are the changes needed?

At the moment, timestamp functions like now() can return different values within a single query if subqueries are involved.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test was added

Closes #36818 from olaky/SPARK-39259-spark_3_1.

Authored-by: Ole Sasse <ole.sasse@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023