[SPARK-25161][Core] Fix several bugs in failure handling of barrier execution mode #22158

Closed


jiangxb1987
Contributor

What changes were proposed in this pull request?

Fix several bugs in failure handling of barrier execution mode:

  • Mark TaskSet for a barrier stage as zombie when a task attempt fails;
  • Multiple barrier task failures from a single barrier stage should not trigger multiple stage retries;
  • Barrier task failure from a previous failed stage attempt should not trigger stage retry;
  • Fail the job when a task from a barrier ResultStage fails;
  • RDD.isBarrier() should not rely on ShuffleDependency instances.
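
The first four points above can be illustrated with a small toy model. This is only a sketch of the intended behavior, not Spark's scheduler code: `ToyTaskSet`, `ToyScheduler`, `submit`, `on_task_failed`, and the returned strings are all hypothetical names invented for this example, and the real logic lives in Scala inside DAGScheduler and the task scheduler. The `isBarrier()` change is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTaskSet:
    """Hypothetical stand-in for the task set of one stage attempt."""
    stage_id: int
    stage_attempt: int
    is_barrier: bool
    is_zombie: bool = False  # a zombie task set launches no further tasks


@dataclass
class ToyScheduler:
    """Hypothetical scheduler that tracks only what the bullets describe."""
    latest_attempt: dict = field(default_factory=dict)  # stage id -> newest attempt
    failed_jobs: set = field(default_factory=set)       # stage ids whose job failed
    retries: list = field(default_factory=list)         # (stage id, next attempt)

    def submit(self, stage_id, is_barrier=True):
        # Each submission of a stage creates a new attempt number.
        attempt = self.latest_attempt.get(stage_id, -1) + 1
        self.latest_attempt[stage_id] = attempt
        return ToyTaskSet(stage_id, attempt, is_barrier)

    def on_task_failed(self, task_set, is_result_stage=False):
        if not task_set.is_barrier:
            return "retry-task"  # non-barrier stages are out of scope here

        # Bullet 1: any failure marks the barrier task set as zombie.
        already_handled = task_set.is_zombie
        task_set.is_zombie = True

        # Bullet 3: a failure from an older stage attempt triggers nothing.
        if task_set.stage_attempt != self.latest_attempt[task_set.stage_id]:
            return "ignored-stale-attempt"

        # Bullet 2: further failures from the same attempt trigger nothing.
        if already_handled:
            return "ignored-duplicate"

        # Bullet 4: a failed barrier ResultStage fails the whole job.
        if is_result_stage:
            self.failed_jobs.add(task_set.stage_id)
            return "job-failed"

        self.retries.append((task_set.stage_id, task_set.stage_attempt + 1))
        return "stage-retry"
```

In this model, two failures reported against the same attempt record a single retry, and a failure that arrives after the stage has been resubmitted is ignored because its attempt number is no longer the newest one.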

How was this patch tested?

Added corresponding test cases in DAGSchedulerSuite and TaskSchedulerImplSuite.

@SparkQA

SparkQA commented Aug 20, 2018

Test build #94968 has finished for PR 22158 at commit 32ea946.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

retest this please

@mengxr
Contributor

mengxr commented Aug 21, 2018

LGTM pending Jenkins. Thanks for finding those corner cases!

@SparkQA

SparkQA commented Aug 21, 2018

Test build #94998 has finished for PR 22158 at commit 32ea946.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Contributor Author

retest this please

@SparkQA

SparkQA commented Aug 21, 2018

Test build #95000 has finished for PR 22158 at commit 32ea946.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Aug 21, 2018

Merged into master. Thanks!

@asfgit asfgit closed this in 5059255 Aug 21, 2018
dongjoon-hyun pushed a commit that referenced this pull request Jun 26, 2023
…tiveJob`

### What changes were proposed in this pull request?
Remove the unused `resetAllPartitions` method in `ActiveJob`. It has been unused since #22158.

### Why are the changes needed?
Code cleanup.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unnecessary

Closes #41737 from Hisoka-X/SPARK-44188_remove_activejob_method.

Authored-by: Jia Fan <fanjiaeminem@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>