[SPARK-40455][CORE] Abort result stage directly when it fails due to FetchFailedException #37899
Conversation
gently ping @cloud-fan
Can one of the admins verify this patch?
Can you update the description with the behavior we are actually observing? The details in the jira and the PR description do not explain what the issue is, just the proposal for a fix. +CC @Ngone51
@mridulm Hi, I have updated the description. Could you verify the patch again?
gently ping @Ngone51
If a result stage does not have pending partitions, it does not need to be aborted - since there are no partitions to be computed. If a result stage has pending partitions with an indeterminate parent failing, it would have been aborted the first time it failed - so the assumption that
Please let me know if there are queries.
Thanks for the ping. I agree with @mridulm. The original condition doesn't seem to leave a chance for the result stage to retry. Is there anything missed?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
What changes were proposed in this pull request?
Abort the result stage directly when it fails due to a FetchFailedException.
Why are the changes needed?
Here's a very serious bug:
A result stage with an indeterminate parent map stage can be resubmitted, which leads to data inconsistency.
The data inconsistency arises as follows:
When a result stage fails with a `FetchFailedException`, Spark determines whether it can be retried. The original condition is `numMissingPartitions < resultStage.numTasks`, which is not an exact condition. When this check allows a retry, some still-running tasks of the failed result stage may not have been killed yet; if the result stage is resubmitted at that point, it gets the wrong set of partitions to recalculate.
The number of partitions to be recalculated can therefore be smaller than the actual number of partitions of the result stage, and data inconsistency can occur.
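The race described above can be sketched with a toy model. This is a hedged illustration, not Spark's DAGScheduler code: the class `ResultStageState` and the `shouldAbortOriginal`/`shouldAbortProposed` helpers are hypothetical stand-ins that only mirror the shape of the `numMissingPartitions < resultStage.numTasks` check.

```scala
// Toy model of the retry decision for a failed result stage whose
// indeterminate parent map stage lost shuffle output. Hypothetical names,
// not Spark internals.
object RetryDecisionSketch {
  final case class ResultStageState(numTasks: Int, missingPartitions: Set[Int])

  // Original check: abort only when some partitions already completed
  // (numMissingPartitions < numTasks); otherwise a full retry is allowed.
  def shouldAbortOriginal(stage: ResultStageState): Boolean =
    stage.missingPartitions.size < stage.numTasks

  // Proposed change in this PR: abort the result stage directly.
  def shouldAbortProposed(stage: ResultStageState): Boolean = true

  def main(args: Array[String]): Unit = {
    // At failure time all 4 partitions look missing, so the original
    // check allows a retry of the whole stage.
    val atFailure = ResultStageState(numTasks = 4, missingPartitions = Set(0, 1, 2, 3))
    assert(!shouldAbortOriginal(atFailure))

    // But a still-running, not-yet-killed task for partition 3 finishes
    // before resubmission: the retry would now recompute only partitions
    // 0-2 against the regenerated (indeterminate) parent output, while
    // partition 3 keeps its result from the old parent output.
    val atResubmit = atFailure.copy(missingPartitions = Set(0, 1, 2))
    assert(atResubmit.missingPartitions.size < atResubmit.numTasks)
    println("retry would mix results from different parent stage attempts")
  }
}
```

The sketch only shows why the count-based check is racy: the set of missing partitions observed at resubmission time can differ from the one observed at failure time.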
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests and a new test.