[SPARK-36414][SQL] Disable timeout for BroadcastQueryStageExec in AQE #33636
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #142037 has started for PR 33636 at commit
Test build #142035 has finished for PR 33636 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
retest this please
Kubernetes integration test starting
Kubernetes integration test status success
makes sense to me, cc @maryannxue
Test build #142057 has finished for PR 33636 at commit
thanks, merging to master/3.2!
### What changes were proposed in this pull request?

This reverts SPARK-31475, because AQE mode always runs more concurrent jobs, especially when multiple queries run at the same time. Currently, the broadcast timeout does not measure only the execution time of the BroadcastQueryStageExec; it also includes the time spent waiting to be scheduled. If all resources are occupied materializing other stages, the broadcast times out without ever getting a chance to run.

![image](https://user-images.githubusercontent.com/8326978/128169612-4c96c8f6-6f8e-48ed-8eaf-450f87982c3b.png)

The default value is 300s, and it is hard to tune the timeout for AQE mode. Real-world cases usually need an extremely large value. In the example above, the timeout was set to 1800s, and it evidently would have needed to be roughly 3x larger.

### Why are the changes needed?

AQE is now enabled by default, and this PR makes it more stable.

### Does this PR introduce _any_ user-facing change?

Yes. The broadcast timeout is no longer used in AQE mode.

### How was this patch tested?

modified test

Closes #33636 from yaooqinn/SPARK-36414.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0c94e47)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Could you make a backport to
Seems missed?
@AngersZhuuuu . It's too late if this is not backported already.
Yea
What changes were proposed in this pull request?
This reverts SPARK-31475, because AQE mode always runs more concurrent jobs, especially when multiple queries run at the same time. Currently, the broadcast timeout does not measure only the execution time of the BroadcastQueryStageExec; it also includes the time spent waiting to be scheduled. If all resources are occupied materializing other stages, the broadcast times out without ever getting a chance to run.
The default value is 300s, and it is hard to tune the timeout for AQE mode. Real-world cases usually need an extremely large value. In the example above, the timeout was set to 1800s, and it evidently would have needed to be roughly 3x larger.
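To illustrate the scheduling problem described above, here is a minimal sketch in plain Python. It is not Spark code: the thread pool stands in for cluster resources, the sleeping tasks stand in for other query stages and for the broadcast job, and the function name and all durations are made up for illustration. It only shows that a timeout whose clock starts at submission also counts queue time, so a short job can time out before it ever runs.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_with_submission_timeout(pool_size, busy_seconds, job_seconds, timeout_seconds):
    """Submit a short job behind tasks that occupy every worker.

    Hypothetical helper for illustration only. The timeout clock starts when
    the job is submitted, so time spent waiting in the queue counts against it.
    Returns "ok" if the job finished in time, "timeout" otherwise.
    """
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Stand-in for other stages that currently hold all the resources.
        for _ in range(pool_size):
            pool.submit(time.sleep, busy_seconds)

        # Stand-in for the broadcast job: cheap to run, but queued behind the rest.
        job = pool.submit(time.sleep, job_seconds)
        try:
            job.result(timeout=timeout_seconds)
            return "ok"
        except FutureTimeout:
            return "timeout"


if __name__ == "__main__":
    # The job needs 0.05s but waits about 0.5s for a free worker.
    print(run_with_submission_timeout(1, 0.5, 0.05, 0.2))
    # Same job and same timeout, but the workers are free almost immediately.
    print(run_with_submission_timeout(1, 0.0, 0.05, 0.2))
```

In the first call the job itself is far shorter than the timeout, yet it still fails because the single worker is busy. Raising the timeout only helps if it exceeds the unpredictable queue time, which is the tuning difficulty this PR describes.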
Why are the changes needed?
AQE is now enabled by default, and this PR makes it more stable.
Does this PR introduce any user-facing change?
Yes. The broadcast timeout is no longer used in AQE mode.
How was this patch tested?
modified test