Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-40455][CORE]Abort result stage directly when it failed caused by FetchFailedException #37899

Closed
wants to merge 2 commits into from

Conversation

caican00
Copy link
Contributor

@caican00 caican00 commented Sep 15, 2022

What changes were proposed in this pull request?

Abort result stage directly when it failed caused by FetchFailedException.

Why are the changes needed?

Here's a very serious bug:
The resultStage with indeterminate parent mapStage resubmit and it led to data inconsistency problems.

And The reasons for data inconsistency are as follows:
When result stage failed caused by FetchFailedException,  spark will determine whether it can be retried.
And the original condition is numMissingPartitions < resultStage.numTasks. It is not an exact condition.

If this condition holds on retry, at this time some other running tasks at the current failed result stage might not have been killed yet, when result stage was resubmit, it would got wrong partitions to recalculation.

// DAGScheduler#submitMissingTasks
 
// Figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() 

It is possible that the number of partitions to be recalculated is smaller than the actual number of partitions at result stage and data inconsistency might occur.

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing tests and new test

@github-actions github-actions bot added the CORE label Sep 15, 2022
@caican00
Copy link
Contributor Author

gently ping @cloud-fan
Could you help to verify this patch?

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@mridulm
Copy link
Contributor

mridulm commented Sep 16, 2022

Can you update the description with what is the behavior we are actually observing ? The details in jira and PR description does not detail what the issue is, just the proposal for a fix.

+CC @Ngone51.
I will take a look at this PR hopefully early next week.

@caican00
Copy link
Contributor Author

Can you update the description with what is the behavior we are actually observing ? The details in jira and PR description does not detail what the issue is, just the proposal for a fix.

+CC @Ngone51. I will take a look at this PR hopefully early next week.

@mridulm Hi, i have updated the description. Could you verify the patch again?

@caican00
Copy link
Contributor Author

caican00 commented Sep 20, 2022

Can you update the description with what is the behavior we are actually observing ? The details in jira and PR description does not detail what the issue is, just the proposal for a fix.
+CC @Ngone51. I will take a look at this PR hopefully early next week.

@mridulm Hi, i have updated the description. Could you verify the patch again?

gently ping @Ngone51

@mridulm
Copy link
Contributor

mridulm commented Sep 23, 2022

If a result stage does not have pending partitions, it does not need to be aborted - since there are no partitions to be computed.

If a result stage has pending partitions with an indeterminate parent failing, it would have been aborted the first time it failed - so the assumption that If this condition holds on retry, from description does not apply - the first failure would have been aborted already.

Please let me know if there are queries.
+CC @Ngone51

@Ngone51
Copy link
Member

Ngone51 commented Sep 26, 2022

Thanks for the ping. I agree with @mridulm . The original condition doesn't seem to have a chance for the result stage to retry. Is there anything missed?

@github-actions
Copy link

github-actions bot commented Jan 5, 2023

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 5, 2023
@github-actions github-actions bot closed this Jan 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants