[SPARK-40082] Schedule mergeFinalize when push merge shuffleMapStage retry but no running tasks #40393
Conversation
CC @otterc
@Stove-hust Thank you for reporting and the patch. Would you be able to share the driver logs?
Sure (adding some comments): stage 10 and its parent stage 9 are resubmitted. Stage 10's first-attempt tasks complete one after another and call notifyDriverAboutPushCompletion to end stage 10 and mark the finalize task, but because the stage is no longer in runningStages, the stage cannot be marked shuffleMergeFinalized. Timeline:
- Removed TaskSet 10.0, whose tasks have all completed
- notifyDriverAboutPushCompletion for stage 10
- stage 9 finished
- resubmit stage 10
- stage 10 cannot finish
@otterc Hello, is there anything else I should add?
@Stove-hust Haven't had a chance to look at it yet. I'll take a look at it this week.
Thanks
So this is an interesting coincidence: I literally encountered a production job which seems to be hitting this exact same issue :-) Can you create a test case to validate this behavior @Stove-hust? Thanks for working on this fix.
No problem
Force-pushed from c133877 to e7061bf
Added UT
@Stove-hust The main change in
Instead of only testing specifically for the flag - which is subject to change as the implementation evolves - we should also test for behavior here. This is the reproducible test I was using (with some changes) to test approaches for this bug, and it mimics the case I saw in our production reasonably well.
It would be good to adapt/clean it up for your PR, in addition to the existing test, so that the observed bug does not recur. (Good news is, this PR works against it :-) )
Thank you for your advice on the UT I wrote; it was very important to me. I will delete my UT. Thanks again very much.
Force-pushed from e7061bf to 60a5d07
@Stove-hust To clarify - I meant add this as well (after you had a chance to look at it and clean it up if required - this was from my test setup).
Sorry, I misunderstood what you meant. 😂
Technically, 3 :-) You don't need to mark it as written by me! We can include it in your PR, with any changes you make as part of adding it.
Thanks for your answer, I have added all three UTs (including the one you wrote).
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
Mostly looks good - just a few minor nits.
The test failure is unrelated to this PR - once the changes above are made, the re-execution should pass.
Looks good to me!
Merged to master.
I could not cherry-pick this into 3.4 and 3.3 - we should fix those branches as well, IMO.
No problem
Is your Apache JIRA id
@mridulm
What changes were proposed in this pull request?
Copy the logic of handleTaskCompletion in DAGScheduler for processing the last ShuffleMapTask into submitMissingTasks.
Why are the changes needed?
With push-based shuffle enabled and speculative tasks present, a ShuffleMapStage is resubmitted once a FetchFailed occurs; its parent stages are resubmitted first, which takes some time to compute. Before the ShuffleMapStage is resubmitted, all of its speculative tasks succeed and register their map output, but those speculative-task success events cannot trigger shuffleMergeFinalized (shuffleBlockPusher.notifyDriverAboutPushCompletion) because the stage has already been removed from runningStages.
When the stage is then resubmitted, the speculative tasks have already registered their map output, so there are no missing tasks to compute, and resubmitting the stage does not trigger shuffleMergeFinalized either. Eventually the stage's _shuffleMergedFinalized stays false.
AQE then submits the next stages, which depend on this ShuffleMapStage and hit FetchFailed. In getMissingParentStages this stage is marked as missing and is resubmitted, but the downstream stages are only added to waitingStages after this stage finishes, so they are never submitted even after the stage's resubmission has completed.
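The race described above can be illustrated with a small, self-contained toy model (illustrative only — this is Python, not Spark code, and all class and method names here are invented stand-ins for the DAGScheduler behavior described):

```python
# Toy model of the race: merge finalization is only triggered for stages
# still in runningStages, so a resubmitted stage with no missing tasks
# is never finalized unless the resubmit path also checks for it.

class Stage:
    def __init__(self, stage_id, num_tasks):
        self.stage_id = stage_id
        self.num_tasks = num_tasks
        self.completed = set()        # partitions with registered map output
        self.merge_finalized = False

    def missing_tasks(self):
        return self.num_tasks - len(self.completed)

class Scheduler:
    def __init__(self, fix_enabled):
        self.running_stages = set()
        self.fix_enabled = fix_enabled

    def submit(self, stage):
        self.running_stages.add(stage.stage_id)
        # The fix: a resubmitted stage with no missing tasks must still
        # schedule merge finalization, mirroring handleTaskCompletion.
        if self.fix_enabled and stage.missing_tasks() == 0:
            stage.merge_finalized = True
            self.running_stages.discard(stage.stage_id)

    def on_task_success(self, stage, partition):
        stage.completed.add(partition)
        # Finalization only fires for stages still in runningStages.
        if stage.missing_tasks() == 0 and stage.stage_id in self.running_stages:
            stage.merge_finalized = True
            self.running_stages.discard(stage.stage_id)

def run(fix_enabled):
    sched = Scheduler(fix_enabled)
    stage10 = Stage(10, num_tasks=2)
    sched.submit(stage10)
    sched.on_task_success(stage10, 0)
    # FetchFailed: the stage is removed from runningStages for resubmission.
    sched.running_stages.discard(stage10.stage_id)
    # A speculative task succeeds *after* removal: its map output is
    # registered, but finalization is skipped (stage not running).
    sched.on_task_success(stage10, 1)
    # Resubmission finds no missing tasks to compute.
    sched.submit(stage10)
    return stage10.merge_finalized

print(run(fix_enabled=False))  # False: stage stuck, never merge-finalized
print(run(fix_enabled=True))   # True: resubmit path finalizes the merge
```

Without the fix the stage ends with merge_finalized still false, matching the stuck stage 10 in the timeline above; with the fix, the no-missing-tasks check in the submit path finalizes it.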
Does this PR introduce any user-facing change?
No
How was this patch tested?
This extreme case is very difficult to construct, so we added logs to our production environment to capture occurrences of the problem and to verify the stability of the job. I am happy to provide a timeline of the various events in which the problem arose.