
[SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries #36753

Closed
Wanted to merge 3 commits.

Conversation

olaky
Contributor

@olaky olaky commented Jun 2, 2022

What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of #36654.

Why are the changes needed?

At the moment, timestamp functions like now() can return different values within a single query if subqueries are involved.
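The idea behind the rule can be sketched as follows. This is a minimal Python model, not Spark's actual Scala implementation; the names (Node, compute_current_time) and the plan structure are hypothetical. The point is that the current time is captured exactly once per query and substituted into every now() node, including those inside subquery plans:

```python
import time
from dataclasses import dataclass, field

# Minimal model of a query plan tree. "subquery" holds a nested plan,
# which is where the inconsistency arose: a pass that does not recurse
# into subqueries would evaluate now() there at a different instant.

@dataclass
class Node:
    name: str                          # e.g. "now()" or "filter"
    children: list = field(default_factory=list)
    subquery: "Node | None" = None     # nested subquery plan, if any
    value: "float | None" = None       # literal substituted by the rule

def compute_current_time(plan: Node) -> Node:
    """Replace every now() in the plan, subqueries included, with ONE literal."""
    captured = time.time()  # evaluated exactly once per query

    def substitute(node: Node) -> None:
        if node.name == "now()":
            node.value = captured
        for child in node.children:
            substitute(child)
        if node.subquery is not None:
            substitute(node.subquery)  # the key fix: recurse into subqueries

    substitute(plan)
    return plan
```

With this traversal, a now() in the outer plan and a now() inside a subquery receive the same literal, which is the behavior the PR makes consistent.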

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new unit test was added

@github-actions github-actions bot added the SQL label Jun 2, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@olaky
Contributor Author

olaky commented Jun 3, 2022

There are some Python errors in the build, which seem to me like there is a problem with the pandas or Python versions in the build pipeline. In any case, I really doubt that I broke those tests.

@MaxGekk MaxGekk changed the title [SPARK-39259] Evaluate timestamps consistently in subqueries [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries Jun 3, 2022
@MaxGekk
Member

MaxGekk commented Jun 3, 2022

There are some Python errors in the build, which seem to me like there is a problem with pandas or python versions in the build pipeline.

Looking at the GitHub Actions runs for commits in https://github.com/apache/spark/commits/branch-3.2, they have been broken for some time already. cc @HyukjinKwon @ueshin @itholic

@HyukjinKwon
Member

Let me fix it

@HyukjinKwon
Member

#36759

@MaxGekk
Member

MaxGekk commented Jun 3, 2022

Let's wait for #36759 and then re-trigger the build via an empty commit:

$ git commit --allow-empty -m "Trigger build"

@MaxGekk
Member

MaxGekk commented Jun 3, 2022

@olaky Could you re-trigger the builds, please?

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, folks. Please hold off on this.
This patch seems to break Scala 2.13 in master/branch-3.3 (including RC4).

@dongjoon-hyun
Member

@JoshRosen
Contributor

When updating this PR, let's also pull in my changes from #36765. When merging this, we should probably pick it all the way back to 3.0 (since it looks like a correctness issue that would also impact those versions).

@olaky
Contributor Author

olaky commented Jun 7, 2022

I cherry-picked 583a9c7 and d79aa36.

…` in `ComputeCurrentTimeSuite`

### What changes were proposed in this pull request?

Unfortunately, apache#36654 causes seven Scala 2.13 test failures in master/3.3 and Apache Spark 3.3 RC4.
This PR aims to fix the Scala 2.13 ClassCastException in the test code.

### Why are the changes needed?

```
$ dev/change-scala-version.sh 2.13
$ build/sbt "catalyst/testOnly *.ComputeCurrentTimeSuite" -Pscala-2.13
...
[info] ComputeCurrentTimeSuite:
[info] - analyzer should replace current_timestamp with literals *** FAILED *** (1 second, 189 milliseconds)
[info]   java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to scala.collection.immutable.Seq
[info]   at org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.literals(ComputeCurrentTimeSuite.scala:146)
[info]   at org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.$anonfun$new$1(ComputeCurrentTimeSuite.scala:47)
...
[info] *** 7 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite
[error] (catalyst / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 189 s (03:09), completed Jun 3, 2022 10:29:39 AM
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manually test with Scala 2.13.

```
$ dev/change-scala-version.sh 2.13
$ build/sbt "catalyst/testOnly *.ComputeCurrentTimeSuite" -Pscala-2.13
...
[info] ComputeCurrentTimeSuite:
[info] - analyzer should replace current_timestamp with literals (545 milliseconds)
[info] - analyzer should replace current_date with literals (11 milliseconds)
[info] - SPARK-33469: Add current_timezone function (3 milliseconds)
[info] - analyzer should replace localtimestamp with literals (4 milliseconds)
[info] - analyzer should use equal timestamps across subqueries (182 milliseconds)
[info] - analyzer should use consistent timestamps for different timezones (13 milliseconds)
[info] - analyzer should use consistent timestamps for different timestamp functions (2 milliseconds)
[info] Run completed in 1 second, 579 milliseconds.
[info] Total number of tests run: 7
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 12 s, completed Jun 3, 2022, 10:54:03 AM
```

Closes apache#36762 from dongjoon-hyun/SPARK-39259.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Contributor

@JoshRosen JoshRosen left a comment


LGTM for backport pending tests.

For committers: whoever merges this should also try to merge to 3.1 and 3.0.

@MaxGekk
Member

MaxGekk commented Jun 7, 2022

I guess the failure is not related to the PR's changes:

[info] - check simplified (tpcds-v1.4/q4) *** FAILED *** (945 milliseconds)
[info]   Plans did not match:

The last commits in https://github.com/apache/spark/commits/branch-3.2 fail with the same error.

@MaxGekk
Member

MaxGekk commented Jun 7, 2022

+1, LGTM. Merging to 3.2 and trying to merge to 3.1/3.0.
Thank you, @olaky, and thanks to @JoshRosen and @dongjoon-hyun for the review.

MaxGekk pushed a commit that referenced this pull request Jun 7, 2022
### What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of #36654.

### Why are the changes needed?

At the moment timestamp functions like now() can return different values within a query if subqueries are involved

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test was added

Closes #36753 from olaky/SPARK-39259-spark_3_2.

Lead-authored-by: Ole Sasse <ole.sasse@databricks.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Member

MaxGekk commented Jun 7, 2022

@olaky The changes cause conflicts in branch-3.1. Could you open PRs with backports to 3.1 and 3.0, please?

@olaky
Contributor Author

olaky commented Jun 7, 2022

Merging is blocked because of a test failure that also surfaces in #36386

@MaxGekk MaxGekk closed this Jun 7, 2022
@olaky
Contributor Author

olaky commented Jun 8, 2022

@MaxGekk since you closed this, should I still work on propagating this to 3.1 and 3.0? And how should we deal with the test failures happening on branch-3.2?

@MaxGekk
Member

MaxGekk commented Jun 8, 2022

@olaky This PR has been merged to branch-3.2 already, see d611d1f

Please, open separate PRs against branch-3.1 and branch-3.0

@olaky
Contributor Author

olaky commented Jun 8, 2022

@MaxGekk branch-3.1 does not yet have a transform-with-subqueries function, so I would have to backport that as well. I personally feel this could be a risky endeavour not worth doing just to get this fix; what do you think?

@MaxGekk
Member

MaxGekk commented Jun 8, 2022

3.1 already does not have any transform with subqueries function ...

Could you list required PRs, please. Is it possible to extract only needed functions from them?

@olaky
Contributor Author

olaky commented Jun 9, 2022

Transforming with subqueries comes from 1a35685, and pruning comes from 3db8ec2.
Both of these are fairly big, and I would not want to just cherry-pick them.
That said, I did create #36818.

@olaky
Contributor Author

olaky commented Jun 9, 2022

And for 3.0 we have #36822 (it looks to me like there is an infrastructure problem building it).

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 11, 2022

And for 3.0 we have: #36822 (looks to me like there is an infrastructure problem with building it)

Thank you for making a backporting PR to branch-3.0 as well. However, branch-3.0 has reached EOL, because Apache Spark 3.0.0 was released two years ago (2020-06-18). Please see the Apache Spark Versioning Policy.

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
### What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of apache#36654.

### Why are the changes needed?

At the moment timestamp functions like now() can return different values within a query if subqueries are involved

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test was added

Closes apache#36753 from olaky/SPARK-39259-spark_3_2.

Lead-authored-by: Ole Sasse <ole.sasse@databricks.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>