[SPARK-40455][CORE] Abort result stage directly when it fails due to FetchFailedException #37899
Conversation
gently ping @cloud-fan
Can one of the admins verify this patch?
Can you update the description with the behavior we are actually observing? The details in the jira and the PR description do not explain what the issue is, just the proposal for a fix. +CC @Ngone51
@mridulm Hi, I have updated the description. Could you verify the patch again?
gently ping @Ngone51
If a result stage does not have pending partitions, it does not need to be aborted - since there are no partitions to be computed. If a result stage has pending partitions with an indeterminate parent failing, it would have been aborted the first time it failed - so the assumption that
Please let me know if there are queries.
Thanks for the ping. I agree with @mridulm. The original condition doesn't seem to leave a chance for the result stage to retry. Is there anything missed?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Abort the result stage directly when it fails due to a FetchFailedException.
Why are the changes needed?
Here's a very serious bug:
A result stage with an indeterminate parent map stage can be resubmitted, which leads to data inconsistency.
The data inconsistency arises as follows:
When a result stage fails with a `FetchFailedException`, Spark determines whether it can be retried. The original condition is `numMissingPartitions < resultStage.numTasks`, which is not an exact condition. When this check allows a retry, some still-running tasks of the failed result stage may not have been killed yet; if the result stage is resubmitted at that point, it gets the wrong set of partitions to recalculate.
The number of partitions to be recalculated can therefore be smaller than the actual number of partitions of the result stage, and data inconsistency can occur.
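The race described above can be sketched with a toy model. This is a hedged illustration, not Spark's DAGScheduler code: the class `ResultStageState` and the `shouldAbortOriginal`/`shouldAbortProposed` helpers are hypothetical stand-ins that only mirror the shape of the `numMissingPartitions < resultStage.numTasks` check.

```scala
// Toy model of the retry decision for a failed result stage whose
// indeterminate parent map stage lost shuffle output. Hypothetical names,
// not Spark internals.
object RetryDecisionSketch {
  final case class ResultStageState(numTasks: Int, missingPartitions: Set[Int])

  // Original check: abort only when some partitions already completed
  // (numMissingPartitions < numTasks); otherwise a full retry is allowed.
  def shouldAbortOriginal(stage: ResultStageState): Boolean =
    stage.missingPartitions.size < stage.numTasks

  // Proposed change in this PR: abort the result stage directly.
  def shouldAbortProposed(stage: ResultStageState): Boolean = true

  def main(args: Array[String]): Unit = {
    // At failure time all 4 partitions look missing, so the original
    // check allows a retry of the whole stage.
    val atFailure = ResultStageState(numTasks = 4, missingPartitions = Set(0, 1, 2, 3))
    assert(!shouldAbortOriginal(atFailure))

    // But a still-running, not-yet-killed task for partition 3 finishes
    // before resubmission: the retry would now recompute only partitions
    // 0-2 against the regenerated (indeterminate) parent output, while
    // partition 3 keeps its result from the old parent output.
    val atResubmit = atFailure.copy(missingPartitions = Set(0, 1, 2))
    assert(atResubmit.missingPartitions.size < atResubmit.numTasks)
    println("retry would mix results from different parent stage attempts")
  }
}
```

The sketch only shows why the count-based check is racy: the set of missing partitions observed at resubmission time can differ from the one observed at failure time.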
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests and a new test.