[SPARK-37043][SQL] Cancel all running job after AQE plan finished #34316
ulysses-you wants to merge 1 commit into apache:master
|
cc @yaooqinn @cloud-fan @maryannxue @viirya @HyukjinKwon if you have time to take a look |
|
Kubernetes integration test starting |
|
Kubernetes integration test status failure |
|
Test build #144368 has finished for PR 34316 at commit
|
      Some((planChangeLogger, "AQE Post Stage Creation")))
    isFinalPlan = true
    executionId.foreach(onUpdatePlan(_, Seq(currentPhysicalPlan)))
    cancelRunningStages()
If a stage is still running, why does execution escape from the loop? Isn't allChildStagesMaterialized false?
The currentPhysicalPlan is converted to LocalTableScanExec during re-optimization. Since LocalTableScanExec is a leaf node, the allChildStagesMaterialized flag is always true.
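The reply above can be illustrated with a toy model. This is only a sketch: Plan, QueryStage, Join, and LocalTableScan are hypothetical stand-ins, not Spark's real classes, and the check is assumed to simply walk the children of the current plan. Once the join is replaced by a leaf node, no unfinished stage is reachable any more, so the check passes even though the other side is still running.

```scala
object AqeLoopSketch {
  // Hypothetical, simplified plan nodes (assumption: not Spark's real API).
  sealed trait Plan { def children: Seq[Plan] }

  final case class QueryStage(name: String, materialized: Boolean) extends Plan {
    val children: Seq[Plan] = Nil
  }

  final case class Join(left: Plan, right: Plan) extends Plan {
    val children: Seq[Plan] = Seq(left, right)
  }

  // Leaf node standing in for LocalTableScanExec.
  case object LocalTableScan extends Plan {
    val children: Seq[Plan] = Nil
  }

  // True when every query stage reachable from `plan` is materialized.
  // For a leaf that is not a stage, forall over no children is vacuously true.
  def allChildStagesMaterialized(plan: Plan): Boolean = plan match {
    case s: QueryStage => s.materialized
    case other         => other.children.forall(allChildStagesMaterialized)
  }

  // Before re-optimization: one side is empty and done, the other still runs.
  val before: Plan = Join(
    QueryStage("empty side", materialized = true),
    QueryStage("running side", materialized = false))

  // After re-optimization: the whole join collapses into a leaf node.
  val after: Plan = LocalTableScan
}
```

In this model the check is false for `before` and true for `after`, which matches the described behaviour: the running stage is no longer part of the plan, so nothing keeps the loop alive.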
|
I see that some tests failed in GA and the failures are related to this PR, so let me convert it to a draft for now. |
|
Any update on this issue? |
|
I realize that we cannot cancel the running stage easily. Many places in the code determine the SQL execution status based on whether the execution has a failed stage or job, so if we cancel the running stage, the SQL execution will be reported as failed, e.g. in the UI. Given this, I don't have a good idea right now. |
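A minimal sketch of the problem described above, under the assumption that the status is derived purely from job outcomes and that a cancelled job is recorded the same way as a failed one. JobResult and executionStatus are hypothetical names used for illustration, not Spark code.

```scala
object StatusSketch {
  // Hypothetical job outcomes (assumption: cancellation surfaces as a failure).
  sealed trait JobResult
  case object Succeeded extends JobResult
  case object Failed extends JobResult

  // Hypothetical status rule: any failed job marks the whole execution as failed.
  def executionStatus(jobs: Seq[JobResult]): String =
    if (jobs.contains(Failed)) "FAILED" else "COMPLETED"

  // The query result is already available, but the leftover job was cancelled.
  val withCancelledJob: Seq[JobResult] = Seq(Succeeded, Failed)
}
```

Under these assumptions, a query whose result was computed correctly would still show as failed once the leftover stage is cancelled, which is the obstacle raised in the comment.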
|
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Cancel running jobs after the AQE plan has finished. This PR adds runningStages to AdaptiveExecutionContext to record the running stages.
Why are the changes needed?
We have seen a stage still running after the AQE plan finished. This happens because a plan containing a join with one empty side is converted to LocalTableScanExec by AQEOptimizer while the other side of the join (a shuffle map stage) is still running. There is no point in letting that stage keep running, so it is better to cancel it once the AQE plan has finished to avoid wasting task resources.
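The idea can be sketched roughly as follows. This is not the code from the patch: RunningStage, Context, onStageSubmitted, and onStageFinished are hypothetical names, and only runningStages and cancelRunningStages echo identifiers mentioned in this PR. The sketch assumes a shared context that records each stage when it is submitted, forgets it when it finishes, and cancels whatever is left once the final plan is reached.

```scala
import scala.collection.concurrent.TrieMap

object CancelSketch {
  // Hypothetical stand-in for a running shuffle map stage.
  final class RunningStage(val id: Int) {
    @volatile var cancelled: Boolean = false
    def cancel(): Unit = { cancelled = true }
  }

  // Hypothetical stand-in for the shared adaptive execution context.
  final class Context {
    // Thread-safe registry of stages that were submitted and have not finished.
    val runningStages: TrieMap[Int, RunningStage] = TrieMap.empty[Int, RunningStage]

    def onStageSubmitted(stage: RunningStage): Unit = {
      runningStages.update(stage.id, stage)
    }

    def onStageFinished(id: Int): Unit = {
      runningStages.remove(id)
    }

    // Cancel every stage still registered; returns the ids that were cancelled.
    def cancelRunningStages(): Seq[Int] =
      runningStages.keys.toSeq.sorted.flatMap { id =>
        runningStages.remove(id).map { stage => stage.cancel(); stage.id }
      }
  }
}
```

A caller would invoke `cancelRunningStages()` right after marking the plan as final. Removing each entry before cancelling it keeps a second call from cancelling the same stage twice.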
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a test.