Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-36414][SQL] Disable timeout for BroadcastQueryStageExec in AQE #33636

Closed
wants to merge 2 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Aug 4, 2021

What changes were proposed in this pull request?

This reverts SPARK-31475, as there are always more concurrent jobs running in AQE mode, especially when running multiple queries at the same time. Currently, the broadcast timeout does not record accurately for the BroadcastQueryStageExec only, but also including the time waiting for being scheduled. If all the resources are currently being occupied for materializing other stages, it timeouts without a chance to run actually.

 

image

 

The default value is 300s, and it's hard to adjust the timeout for AQE mode. Usually, you need an extremely large number for real-world cases. As you can see in the example, above, the timeout we used for it was 1800s, and obviously, it needed 3x more or something

 

Why are the changes needed?

AQE is default now, we can make it more stable with this PR

Does this PR introduce any user-facing change?

yes, broadcast timeout now is not used for AQE

How was this patch tested?

modified test

@github-actions github-actions bot added the SQL label Aug 4, 2021
@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46547/

@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46547/

@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Test build #142037 has started for PR 33636 at commit b5870e6.

@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Test build #142035 has finished for PR 33636 at commit 88be9e9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46549/

@SparkQA
Copy link

SparkQA commented Aug 4, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46549/

@HyukjinKwon
Copy link
Member

retest this please

@yaooqinn yaooqinn requested a review from cloud-fan August 5, 2021 02:01
@SparkQA
Copy link

SparkQA commented Aug 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46568/

@SparkQA
Copy link

SparkQA commented Aug 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46568/

@cloud-fan
Copy link
Contributor

makes sense to me, cc @maryannxue

@SparkQA
Copy link

SparkQA commented Aug 5, 2021

Test build #142057 has finished for PR 33636 at commit b5870e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/3.2!

@cloud-fan cloud-fan closed this in 0c94e47 Aug 5, 2021
cloud-fan pushed a commit that referenced this pull request Aug 5, 2021
### What changes were proposed in this pull request?

This reverts SPARK-31475, as there are always more concurrent jobs running in AQE mode, especially when running multiple queries at the same time. Currently, the broadcast timeout does not record accurately for the BroadcastQueryStageExec only, but also including the time waiting for being scheduled. If all the resources are currently being occupied for materializing other stages, it timeouts without a chance to run actually.

 

![image](https://user-images.githubusercontent.com/8326978/128169612-4c96c8f6-6f8e-48ed-8eaf-450f87982c3b.png)

 

The default value is 300s, and it's hard to adjust the timeout for AQE mode. Usually, you need an extremely large number for real-world cases. As you can see in the example, above, the timeout we used for it was 1800s, and obviously, it needed 3x more or something

 

### Why are the changes needed?

AQE is default now, we can make it more stable with this PR

### Does this PR introduce _any_ user-facing change?

yes, broadcast timeout now is not used for AQE

### How was this patch tested?

modified test

Closes #33636 from yaooqinn/SPARK-36414.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0c94e47)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@dongjoon-hyun
Copy link
Member

Could you make a backport to branch-3.1 please, @yaooqinn ?

@yaooqinn
Copy link
Member Author

yaooqinn commented Aug 6, 2021

Could you make a backport to branch-3.1 please, @yaooqinn ?

OK @dongjoon-hyun

@AngersZhuuuu
Copy link
Contributor

Could you make a backport to branch-3.1 please, @yaooqinn ?

OK @dongjoon-hyun

Seems missed?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Nov 11, 2022

@AngersZhuuuu . It's too late if this is not backported already.
branch-3.1 is EOL as of today.

@AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu . It's too late if this is not backported already. branch-3.1 is EOL as of today.

Yea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants