[SPARK-35853][SQL] Remark the shuffle origin to ENSURE_REQUIREMENTS as far as possible #33015

Closed
ulysses-you wants to merge 3 commits

ulysses-you
Contributor

What changes were proposed in this pull request?

Add a new rule, RemarkShuffleOrigin, to the AQE queryStagePreparationRules, placed after EnsureRequirements.

Why are the changes needed?

In some queries, we might repartition by some columns with a large partition number manually to make parallelism big enough. However, if the output partitioning of that repartition also satisfies the distribution required by another node (e.g. a join or aggregate), AQE cannot optimize the shuffle, because its shuffle origin marks it as user-specified.

So this new rule re-marks the shuffle origin as ENSURE_REQUIREMENTS wherever it is safe to do so.
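
To make the idea concrete, here is a minimal toy model of such a rule in plain Python. It is not Spark code and not the code in this PR: the plan classes, the USER_REPARTITION origin name, and the safety condition (re-mark only when the shuffle sits directly under a node whose required keys it already satisfies) are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from enum import Enum, auto
from typing import Tuple


class ShuffleOrigin(Enum):
    ENSURE_REQUIREMENTS = auto()  # shuffle the planner needs; AQE may optimize it
    USER_REPARTITION = auto()     # hypothetical name: shuffle requested by the user


@dataclass(frozen=True)
class Shuffle:
    """Toy hash shuffle, modeled as a leaf node for brevity."""
    keys: Tuple[str, ...]
    num_partitions: int
    origin: ShuffleOrigin


@dataclass(frozen=True)
class Aggregate:
    """Toy aggregate that requires its input to be clustered by group_keys."""
    group_keys: Tuple[str, ...]
    child: object


@dataclass(frozen=True)
class Write:
    """Toy sink with no distribution requirement on its input."""
    child: object


def satisfies(shuffle: Shuffle, required_keys: Tuple[str, ...]) -> bool:
    # Assumption: hash partitioning on a non-empty subset of the required
    # keys already clusters the data by those keys.
    return bool(shuffle.keys) and set(shuffle.keys) <= set(required_keys)


def remark_shuffle_origin(plan):
    """Re-mark a user shuffle as ENSURE_REQUIREMENTS when a parent needs it."""
    if isinstance(plan, Aggregate):
        child = remark_shuffle_origin(plan.child)
        if (isinstance(child, Shuffle)
                and child.origin is ShuffleOrigin.USER_REPARTITION
                and satisfies(child, plan.group_keys)):
            child = replace(child, origin=ShuffleOrigin.ENSURE_REQUIREMENTS)
        return replace(plan, child=child)
    if isinstance(plan, Write):
        # No parent requirement here, so the user's shuffle is left untouched.
        return replace(plan, child=remark_shuffle_origin(plan.child))
    return plan


user_shuffle = Shuffle(("k",), 1000, ShuffleOrigin.USER_REPARTITION)
under_aggregate = remark_shuffle_origin(Aggregate(("k",), user_shuffle))
under_write = remark_shuffle_origin(Write(user_shuffle))
mismatched = remark_shuffle_origin(Aggregate(("x",), user_shuffle))
```

In this sketch only the shuffle under the matching aggregate is re-marked; the shuffle feeding the write, and the one whose keys do not match the aggregate, keep their user-specified origin.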

Does this PR introduce any user-facing change?

Yes, the physical plan may change.

How was this patch tested?

Added a test.

@github-actions bot added the SQL label Jun 22, 2021
@SparkQA

SparkQA commented Jun 22, 2021

Test build #140126 has finished for PR 33015 at commit f1beaf0.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44653/

@SparkQA

SparkQA commented Jun 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44653/

@SparkQA

SparkQA commented Jun 22, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44654/

@SparkQA

SparkQA commented Jun 22, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44654/

@ulysses-you
Contributor Author

@cloud-fan
Contributor

> In some queries, we might repartition by some columns with a large partition number manually to make parallelism big enough.

I don't think it's safe to "guess" the user intention and allow AQE to break partitioning. Let's create a new hint for this optimize-write repartition, to make the user intention clearer.

@ulysses-you
Contributor Author

How about PR #32932?

@SparkQA

SparkQA commented Jun 22, 2021

Test build #140128 has finished for PR 33015 at commit 7ce90ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ulysses-you deleted the shuffle-origin branch June 30, 2021 06:52
3 participants