[SPARK-36424][SQL] Support eliminate limits in AQE Optimizer #33651

Closed
wants to merge 5 commits into from

Contributor

ulysses-you commented Aug 5, 2021

What changes were proposed in this pull request?

  • override the maxRows method in LogicalQueryStage
  • add rule EliminateLimits in AQEOptimizer

Why are the changes needed?

In ad-hoc scenarios, we always add a limit to the query if the user has not specified a limit value, but not every limit is necessary.

With the power of AQE, we can eliminate limits using runtime statistics.
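As a rough illustration of the idea, here is a minimal, self-contained sketch. It uses hypothetical toy types (`Plan`, `Stage`, `Limit`, `LimitSketch`) rather than Spark's actual `LogicalPlan`, `LogicalQueryStage`, or `GlobalLimit`, and it ignores details such as the limit expression needing to be foldable. A limit is dropped only when the child reports a maximum row count that does not exceed the limit, and a stage reports a maximum only once its row count is known.

```scala
// Toy model only: Plan, Stage, and Limit are hypothetical stand-ins,
// not Spark's LogicalPlan, LogicalQueryStage, or GlobalLimit.
sealed trait Plan { def maxRows: Option[Long] }

// A stage knows its row count only after it has run (None = not yet known).
final case class Stage(runtimeRowCount: Option[Long]) extends Plan {
  def maxRows: Option[Long] = runtimeRowCount
}

final case class Limit(n: Long, child: Plan) extends Plan {
  // A limit never produces more than n rows, or fewer if the child is smaller.
  def maxRows: Option[Long] = Some(child.maxRows.fold(n)(m => math.min(m, n)))
}

object LimitSketch {
  // Drop a limit when the child is known to produce at most n rows.
  def eliminate(plan: Plan): Plan = plan match {
    case Limit(n, child) if child.maxRows.exists(_ <= n) => eliminate(child)
    case Limit(n, child) => Limit(n, eliminate(child))
    case other => other
  }
}
```

In this sketch, `LimitSketch.eliminate(Limit(100, Stage(Some(10))))` returns the bare stage, while a stage with an unknown row count keeps its limit.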

Does this PR introduce any user-facing change?

no

How was this patch tested?

add test

SparkQA commented Aug 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46592/

SparkQA commented Aug 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46595/

SparkQA commented Aug 5, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46592/

* if we can eliminate limits. And we check if [[LogicalQueryStage]] is materialized at stats,
* if it is not materialized the maxRows is none.
*/
object AQEEliminateLimits extends EliminateLimitsBase {

Contributor

what's the difference between this and EliminateLimits?

Contributor Author

They are the same. The only consideration is that we don't need transformDownWithPruning in the AQE Optimizer, since its batch runs only once.

Contributor

that's too minor to justify creating a new class

Member

Yeah, it seems we can reuse EliminateLimits?

Contributor Author

OK, I'm fine to reuse the EliminateLimits
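
For readers unfamiliar with the pruning variant mentioned in this thread, the following is a minimal sketch of the general idea on a hypothetical toy tree (`Node`, `Leaf`, `Lim`, `Branch`, and `PruningSketch` are invented for illustration and are not Spark's `TreeNode` API). It models only the pattern check, that is, skipping subtrees that contain no limit node; it does not model Spark's rule-id bookkeeping.

```scala
// Hypothetical toy tree, not Spark's TreeNode API.
sealed trait Node {
  def children: Seq[Node]
  // True if this subtree contains at least one Lim node.
  lazy val containsLimit: Boolean =
    this.isInstanceOf[Lim] || children.exists(_.containsLimit)
}
final case class Leaf(name: String) extends Node { def children: Seq[Node] = Nil }
final case class Lim(n: Int, child: Node) extends Node { def children: Seq[Node] = Seq(child) }
final case class Branch(left: Node, right: Node) extends Node {
  def children: Seq[Node] = Seq(left, right)
}

object PruningSketch {
  // Apply `rule` top-down, but return early for any subtree without a Lim node.
  def transformDown(node: Node)(rule: PartialFunction[Node, Node]): Node =
    if (!node.containsLimit) node
    else {
      val applied = rule.applyOrElse(node, (n: Node) => n)
      applied match {
        case Lim(n, c) => Lim(n, transformDown(c)(rule))
        case Branch(l, r) => Branch(transformDown(l)(rule), transformDown(r)(rule))
        case leaf => leaf
      }
    }
}
```

The pruned traversal produces the same result as an unpruned one; it only avoids visiting subtrees that cannot match, which matters less when a batch runs a single time.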

SparkQA commented Aug 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46595/

SparkQA commented Aug 5, 2021

Test build #142082 has finished for PR 33651 at commit 4299af5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait EliminateLimitsBase extends Rule[LogicalPlan]

SparkQA commented Aug 5, 2021

Test build #142084 has finished for PR 33651 at commit c076d76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait EliminateLimitsBase extends Rule[LogicalPlan]

ulysses-you force-pushed the SPARK-36424 branch 2 times, most recently from 8aae958 to 4510816, August 5, 2021 15:09
    case Limit(l, child) if canEliminate(l, child) =>
      child
    case GlobalLimit(l, child) if canEliminate(l, child) =>
      child

    case GlobalLimit(le, GlobalLimit(ne, grandChild)) =>
-     GlobalLimit(Least(Seq(ne, le)), grandChild)
+     GlobalLimit(Literal(Least(Seq(ne, le)).eval().asInstanceOf[Int]), grandChild)

Contributor Author

This is flaky because the expression depends heavily on ConstantFolding.
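
To make the concern concrete, here is a small sketch with hypothetical toy expressions (`Expr`, `Lit`, `LeastOf`, and `MergeLimitsSketch` are invented for illustration and are not Spark's `Expression`, `Literal`, or `Least`). If the merged limit is left as an unevaluated "least of" expression, later code that expects a literal limit works only after a separate constant-folding pass has run; evaluating eagerly removes that dependency. The sketch assumes both limits can be evaluated without any input row.

```scala
// Hypothetical toy expressions, not Spark's Expression, Literal, or Least.
sealed trait Expr { def eval(): Int }
final case class Lit(value: Int) extends Expr { def eval(): Int = value }
final case class LeastOf(children: Seq[Expr]) extends Expr {
  def eval(): Int = children.map(_.eval()).min
}

object MergeLimitsSketch {
  // Lazy merge: the result is a LeastOf node until some later pass folds it.
  def mergeLazily(outer: Expr, inner: Expr): Expr = LeastOf(Seq(inner, outer))

  // Eager merge: evaluate immediately so the result is already a literal.
  def mergeEagerly(outer: Expr, inner: Expr): Expr =
    Lit(LeastOf(Seq(inner, outer)).eval())

  // Stand-in for downstream code that only understands literal limits.
  def literalValue(e: Expr): Option[Int] = e match {
    case Lit(v) => Some(v)
    case _ => None
  }
}
```

With these definitions, `literalValue(mergeEagerly(Lit(10), Lit(3)))` yields `Some(3)`, while the lazy variant yields `None` until it is folded.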

SparkQA commented Aug 5, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46613/

SparkQA commented Aug 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46614/

SparkQA commented Aug 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46614/

SparkQA commented Aug 5, 2021

Test build #142101 has finished for PR 33651 at commit 8aae958.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 5, 2021

Test build #142102 has finished for PR 33651 at commit 4510816.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46629/

SparkQA commented Aug 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46629/

@@ -54,4 +54,6 @@ case class LogicalQueryStage(
      }
      physicalStats.getOrElse(logicalPlan.stats)
    }
+
+   override def maxRows: Option[Long] = stats.rowCount.map(_.min(Long.MaxValue).toLong)

Contributor Author

If there are other physical nodes above QueryStageExec, the stats are not accurate. It seems the only node that can exist here is an aggregate, so maxRows is still accurate.

Contributor

We should trust the existing framework. If the maxRows stats can be wrong, then EliminateLimits is also wrong even without AQE.

I don't think we need to highlight LogicalQueryStage in the doc of EliminateLimits. We just need to follow the existing framework and make sure the maxRows is not under-estimated.

Contributor Author

OK, cleaned the comment of EliminateLimits
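
The `maxRows` override discussed in this thread converts an optional arbitrary-precision row count into an optional `Long`. A standalone sketch of just that conversion, using a hypothetical helper object `MaxRowsSketch` and plain `Option[BigInt]` in place of Spark's statistics classes, might look like this:

```scala
object MaxRowsSketch {
  // None means no row count is known, so no upper bound is reported.
  // Otherwise the count is clamped to Long.MaxValue before narrowing to Long,
  // so the conversion cannot overflow.
  def maxRows(rowCount: Option[BigInt]): Option[Long] =
    rowCount.map(_.min(BigInt(Long.MaxValue)).toLong)
}
```

For example, `MaxRowsSketch.maxRows(Some(BigInt(42)))` gives `Some(42L)`, and an unknown count stays `None`, which keeps a limit in place in any rule that requires a known bound.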

SparkQA commented Aug 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46632/

SparkQA commented Aug 6, 2021

Test build #142130 has started for PR 33651 at commit e8f17a4.

SparkQA commented Aug 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46632/

SparkQA commented Aug 6, 2021

Test build #142117 has finished for PR 33651 at commit ba7c42a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46642/

SparkQA commented Aug 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46642/

@ulysses-you
Contributor Author

retest this please

SparkQA commented Aug 6, 2021

Test build #142152 has finished for PR 33651 at commit e8f17a4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 6, 2021

Test build #142119 has finished for PR 33651 at commit 180ef97.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46664/

SparkQA commented Aug 6, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46664/

SparkQA commented Aug 6, 2021

Test build #142162 has finished for PR 33651 at commit 6f28a37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46675/

SparkQA commented Aug 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46675/

Member

viirya commented Aug 6, 2021

retest this please

@ulysses-you
Contributor Author

retest this please

SparkQA commented Aug 7, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46699/

SparkQA commented Aug 7, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46699/

SparkQA commented Aug 7, 2021

Test build #142187 has finished for PR 33651 at commit 6f28a37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

cloud-fan closed this in bb6f65a Aug 9, 2021
ulysses-you deleted the SPARK-36424 branch August 9, 2021 11:17