
[SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries #36753

Closed
Wanted to merge 3 commits.

Conversation

olaky
Contributor

@olaky olaky commented Jun 2, 2022

What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of #36654.

Why are the changes needed?

At the moment, timestamp functions like now() can return different values within a single query if subqueries are involved.
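The idea behind the rule can be sketched as follows. This is a minimal Python model, not Spark's actual Scala implementation; the names (Node, compute_current_time) and the plan structure are hypothetical. The point is that the current time is captured exactly once per query and substituted into every now() node, including those inside subquery plans:

```python
import time
from dataclasses import dataclass, field

# Minimal model of a query plan tree. "subquery" holds a nested plan,
# which is where the inconsistency arose: a pass that does not recurse
# into subqueries would evaluate now() there at a different instant.

@dataclass
class Node:
    name: str                          # e.g. "now()" or "filter"
    children: list = field(default_factory=list)
    subquery: "Node | None" = None     # nested subquery plan, if any
    value: "float | None" = None       # literal substituted by the rule

def compute_current_time(plan: Node) -> Node:
    """Replace every now() in the plan, subqueries included, with ONE literal."""
    captured = time.time()  # evaluated exactly once per query

    def substitute(node: Node) -> None:
        if node.name == "now()":
            node.value = captured
        for child in node.children:
            substitute(child)
        if node.subquery is not None:
            substitute(node.subquery)  # the key fix: recurse into subqueries

    substitute(plan)
    return plan
```

With this traversal, a now() in the outer plan and a now() inside a subquery receive the same literal, which is the behavior the PR makes consistent.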

Does this PR introduce any user-facing change?

No

How was this patch tested?

A new unit test was added

@github-actions github-actions bot added the SQL label Jun 2, 2022
@AmplabJenkins

Can one of the admins verify this patch?

@olaky
Contributor Author

olaky commented Jun 3, 2022

There are some Python errors in the build, which seem to me like there is a problem with the pandas or Python versions in the build pipeline. In any case, I really doubt that I broke those tests.

@MaxGekk MaxGekk changed the title [SPARK-39259] Evaluate timestamps consistently in subqueries [SPARK-39259][SQL][3.2] Evaluate timestamps consistently in subqueries Jun 3, 2022
@MaxGekk
Member

MaxGekk commented Jun 3, 2022

There are some Python errors in the build, which seem to me like there is a problem with pandas or python versions in the build pipeline.

Looking at the GitHub Actions runs for commits in https://github.com/apache/spark/commits/branch-3.2, they have been broken for some time already. cc @HyukjinKwon @ueshin @itholic

@HyukjinKwon
Member

Let me fix it

@HyukjinKwon
Member

#36759

@MaxGekk
Member

MaxGekk commented Jun 3, 2022

Let's wait for #36759 and then re-trigger the build via an empty commit:

$ git commit --allow-empty -m "Trigger build"

@MaxGekk
Member

MaxGekk commented Jun 3, 2022

@olaky Could you re-trigger the builds, please?

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, folks. Please hold off on this.
This patch seems to break Scala 2.13 in master/branch-3.3 (including RC4).

@dongjoon-hyun
Member

@JoshRosen
Contributor

When updating this PR, let's also pull in my changes from #36765. When merging this, we should probably pick it all the way back to 3.0 (since it looks like a correctness issue that would also impact those versions).

@olaky
Contributor Author

olaky commented Jun 7, 2022

I cherry-picked 583a9c7 and d79aa36.

…` in `ComputeCurrentTimeSuite`

### What changes were proposed in this pull request?

Unfortunately, apache#36654 causes seven Scala 2.13 test failures in master/3.3 and Apache Spark 3.3 RC4.
This PR aims to fix the Scala 2.13 ClassCastException in the test code.

### Why are the changes needed?

```
$ dev/change-scala-version.sh 2.13
$ build/sbt "catalyst/testOnly *.ComputeCurrentTimeSuite" -Pscala-2.13
...
[info] ComputeCurrentTimeSuite:
[info] - analyzer should replace current_timestamp with literals *** FAILED *** (1 second, 189 milliseconds)
[info]   java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to scala.collection.immutable.Seq
[info]   at org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.literals(ComputeCurrentTimeSuite.scala:146)
[info]   at org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite.$anonfun$new$1(ComputeCurrentTimeSuite.scala:47)
...
[info] *** 7 TESTS FAILED ***
[error] Failed tests:
[error] 	org.apache.spark.sql.catalyst.optimizer.ComputeCurrentTimeSuite
[error] (catalyst / Test / testOnly) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 189 s (03:09), completed Jun 3, 2022 10:29:39 AM
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and manually test with Scala 2.13.

```
$ dev/change-scala-version.sh 2.13
$ build/sbt "catalyst/testOnly *.ComputeCurrentTimeSuite" -Pscala-2.13
...
[info] ComputeCurrentTimeSuite:
[info] - analyzer should replace current_timestamp with literals (545 milliseconds)
[info] - analyzer should replace current_date with literals (11 milliseconds)
[info] - SPARK-33469: Add current_timezone function (3 milliseconds)
[info] - analyzer should replace localtimestamp with literals (4 milliseconds)
[info] - analyzer should use equal timestamps across subqueries (182 milliseconds)
[info] - analyzer should use consistent timestamps for different timezones (13 milliseconds)
[info] - analyzer should use consistent timestamps for different timestamp functions (2 milliseconds)
[info] Run completed in 1 second, 579 milliseconds.
[info] Total number of tests run: 7
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 7, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 12 s, completed Jun 3, 2022, 10:54:03 AM
```

Closes apache#36762 from dongjoon-hyun/SPARK-39259.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Contributor

@JoshRosen JoshRosen left a comment


LGTM for backport pending tests.

For committers: whoever merges this should also try to merge to 3.1 and 3.0.

@MaxGekk
Member

MaxGekk commented Jun 7, 2022

I guess the failure is not related to the PR's changes:

[info] - check simplified (tpcds-v1.4/q4) *** FAILED *** (945 milliseconds)
[info]   Plans did not match:

The last commits in https://github.com/apache/spark/commits/branch-3.2 fail with the same error.

@MaxGekk
Member

MaxGekk commented Jun 7, 2022

+1, LGTM. Merging to 3.2 and trying to merge to 3.1/3.0.
Thank you, @olaky, and thanks to @JoshRosen and @dongjoon-hyun for the review.

MaxGekk pushed a commit that referenced this pull request Jun 7, 2022
### What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of #36654.

### Why are the changes needed?

At the moment timestamp functions like now() can return different values within a query if subqueries are involved

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test was added

Closes #36753 from olaky/SPARK-39259-spark_3_2.

Lead-authored-by: Ole Sasse <ole.sasse@databricks.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk
Member

MaxGekk commented Jun 7, 2022

@olaky The changes cause conflicts in branch-3.1. Could you open PRs with backports to 3.1 and 3.0, please?

@olaky
Contributor Author

olaky commented Jun 7, 2022

Merging is blocked because of a test failure that also surfaces in #36386

@MaxGekk MaxGekk closed this Jun 7, 2022
@olaky
Contributor Author

olaky commented Jun 8, 2022

@MaxGekk since you closed this, should I still work on propagating this to 3.1 and 3.0? And how should we deal with the test failures happening on branch-3.2?

@MaxGekk
Member

MaxGekk commented Jun 8, 2022

@olaky This PR has been merged to branch-3.2 already, see d611d1f

Please, open separate PRs against branch-3.1 and branch-3.0

@olaky
Contributor Author

olaky commented Jun 8, 2022

@MaxGekk branch-3.1 does not yet have a transform-with-subqueries function, so I would have to backport that as well. I personally feel this could be a risky endeavour not worth doing just to get this fix; what do you think?

@MaxGekk
Member

MaxGekk commented Jun 8, 2022

3.1 already does not have any transform with subqueries function ...

Could you list required PRs, please. Is it possible to extract only needed functions from them?

@olaky
Contributor Author

olaky commented Jun 9, 2022

Transforming with subqueries comes from 1a35685, and pruning comes from 3db8ec2.
Both of these are fairly big, and I would not want to just cherry-pick them.
That said, I did create #36818.

@olaky
Contributor Author

olaky commented Jun 9, 2022

And for 3.0 we have #36822 (it looks to me like there is an infrastructure problem building it).

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 11, 2022

And for 3.0 we have: #36822 (looks to me like there is an infrastructure problem with building it)

Thank you for making a backporting PR to branch-3.0 as well. However, branch-3.0 has reached EOL, because Apache Spark 3.0.0 was released two years ago (2020-06-18). Please see the Apache Spark Versioning Policy.

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
### What changes were proposed in this pull request?

Apply the optimizer rule ComputeCurrentTime consistently across subqueries.

This is a backport of apache#36654.

### Why are the changes needed?

At the moment timestamp functions like now() can return different values within a query if subqueries are involved

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test was added

Closes apache#36753 from olaky/SPARK-39259-spark_3_2.

Lead-authored-by: Ole Sasse <ole.sasse@databricks.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>