[SPARK-5945] Spark should not retry a stage infinitely on a FetchFailedException #5636
Conversation
…ting function to check whether to abort a stage when it fails for a single reason more than N times.
Test build #30773 has finished for PR 5636 at commit
@@ -96,6 +96,30 @@ class DAGScheduler(
  // Stages that must be resubmitted due to fetch failures
  private[scheduler] val failedStages = new HashSet[Stage]

  // The maximum number of times to retry a stage before aborting
  val maxStageFailures = 5
can you make this a conf? There is already spark.task.maxFailures, so how about spark.stage.maxFailures? Also it should get added to the docs.
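As a rough illustration of that suggestion (not the merged change), the cap could be read from SparkConf with a default, the same way spark.task.maxFailures is handled. The key name spark.stage.maxFailures comes from the comment above; everything else here is a sketch:

```scala
import org.apache.spark.SparkConf

// Sketch only: read the retry cap from a configuration key instead of hard-coding it.
// "spark.stage.maxFailures" is the name suggested in the review; 5 matches the
// hard-coded default in the diff above.
object StageFailureConfSketch {
  def maxStageFailures(conf: SparkConf): Int =
    conf.getInt("spark.stage.maxFailures", 5)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    println(s"maxStageFailures = ${maxStageFailures(conf)}")
  }
}
```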
Test build #30775 has finished for PR 5636 at commit
Test build #30778 has finished for PR 5636 at commit
Test build #30779 has finished for PR 5636 at commit
Thanks @ilganeli I took a quick look and have some high-level comments:
thanks for working on this! This will be a great addition; I've seen this come up in a number of cases and it's really hard for the average user to figure out what is going on, so this will be a big help.
@squito Given that we can't check against the failure message (as I expected), any ideas on what we can do instead? Is this information exposed in any way at the level of the
Test build #30893 has finished for PR 5636 at commit
Test build #30896 has finished for PR 5636 at commit
retest this please
Test build #30947 has started for PR 5636 at commit
retest this please
Hi @ilganeli thanks for updating this. Not sure if you are still working on this or not, but we definitely need tests for the new behavior as well. There are tests around fetch failures in DagSchedulerSuite already, so you can probably add something which follows those examples.
btw I have no idea what is going on in those test failures ... do the tests pass when you run them locally?
No Imran - they don't. However I see the same on the master branch. I don't think they have anything to do with my changes.
Roger - I'll add tests to the suite.
retest this please
Test build #30990 has finished for PR 5636 at commit
Test build #31083 has finished for PR 5636 at commit
retest this please
Test build #31107 has finished for PR 5636 at commit
private[scheduler] object Stage {
  // The number of consecutive failures allowed before a stage is aborted
  val MAX_CONSECUTIVE_FAILURES = 4
this is really MAX_CONSECUTIVE_FETCH_FAILURES; we don't use this cap for other kinds of failures.
to clarify a little bit -- fetch failures are the only way we currently fail stages. Separately there are task failures and job failures. In any case it's good to make that clear here.
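To make the idea being reviewed concrete, here is a minimal, self-contained sketch of consecutive-fetch-failure tracking using the renamed constant. The class, object, and method names are illustrative; this is not a claim about what the merged Stage code looks like:

```scala
import scala.collection.mutable

// Sketch of per-stage tracking of consecutive fetch failures. Each failed stage
// attempt id is recorded; once the number of distinct failed attempts reaches the
// cap, the caller should abort the job instead of resubmitting the stage again.
// A successful attempt clears the set, so only consecutive failures count.
class FetchFailureTracker {
  import FetchFailureTracker._

  private val fetchFailedAttemptIds = mutable.HashSet[Int]()

  def clearFailures(): Unit = fetchFailedAttemptIds.clear()

  def failedOnFetchAndShouldAbort(stageAttemptId: Int): Boolean = {
    fetchFailedAttemptIds += stageAttemptId
    fetchFailedAttemptIds.size >= MAX_CONSECUTIVE_FETCH_FAILURES
  }
}

object FetchFailureTracker {
  // The rename suggested above: this cap only applies to fetch failures.
  val MAX_CONSECUTIVE_FETCH_FAILURES = 4
}
```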
@ilganeli LGTM Thanks for fixing this tricky issue; I'm sure many in the community will find this helpful. All my comments are concerned with style / renaming / test code. Once you address these I'll merge this.
completeShuffleMapStageSuccessfully(0, 0, numShufflePartitions = parts)

completeNextStageWithFetchFailure(1, 0, shuffleDep)
wait I think things got a little confused between all the comments from Kay, Andrew, and me ...
As this stands now, it's not a single fetch failure -- there is a fetch failure from every task. I think the options were either (a) move this test to be first (as you've already done), but keep the name "multiple tasks w/ fetch failures", or (b) change the other tests to only have a single fetch failure by refactoring to completeStageWithFetchFailure, and keep this one w/ multiple tasks w/ fetch failures.
Maybe the name should actually be "multiple tasks with fetch failures in a single stage attempt should not abort the stage"?
What? Why does it matter if there are one vs multiple tasks that failed with the fetch failure? Your suggestion is very verbose...
Was your concern that "Single fetch failure" could refer to a task? If so we can call this "Single stage fetch failure"
thanks for reviews @kayousterhout and @andrewor14, and the quick updates @ilganeli!
Test build #41951 has finished for PR 5636 at commit
Alright, merging into master. I fixed the test name on merge. Thanks everyone.
just trying to increase test coverage in the scheduler, this already works. It includes a regression test for SPARK-9809. Copied some test utils from #5636, we can wait till that is merged first.

Author: Imran Rashid <irashid@cloudera.com>

Closes #8402 from squito/test_retry_in_shared_shuffle_dep.

(cherry picked from commit 33112f9)
Signed-off-by: Andrew Or <andrew@databricks.com>
[SPARK-5945] Spark should not retry a stage infinitely on a FetchFailedException

The `Stage` class now tracks whether there were a sufficient number of consecutive failures of that stage to trigger an abort. To avoid an infinite loop of stage retries, we abort the job completely after 4 consecutive stage failures for one stage. We still allow more than 4 consecutive stage failures if there is an intervening successful attempt for the stage, so that in very long-lived applications, where a stage may get reused many times, we don't abort the job after failures that have been recovered from successfully.

I've added test cases to exercise the most obvious scenarios.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes apache#5636 from ilganeli/SPARK-5945.

(cherry picked from commit 4bd85d0)
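To illustrate the "consecutive failures, reset on success" semantics described in that commit message, here is a small self-contained toy simulation. It is not Spark code, just the stated policy applied to a made-up sequence of stage-attempt outcomes:

```scala
// Toy simulation of the retry policy described above: a stage is allowed at most
// 4 consecutive failed attempts; any successful attempt in between resets the count,
// so a long-lived job that reuses a stage many times is not aborted for failures
// it has already recovered from.
object StageRetryPolicyDemo {
  val MaxConsecutiveStageFailures = 4

  def main(args: Array[String]): Unit = {
    // Hypothetical attempt outcomes for one stage: 'F' = fetch failure, 'S' = success.
    val outcomes = Seq('F', 'F', 'S', 'F', 'F', 'F', 'F')

    var consecutiveFailures = 0
    var aborted = false

    for (o <- outcomes if !aborted) {
      if (o == 'S') {
        consecutiveFailures = 0 // a successful attempt resets the counter
      } else {
        consecutiveFailures += 1
        if (consecutiveFailures >= MaxConsecutiveStageFailures) {
          aborted = true // abort the job instead of resubmitting the stage again
        }
      }
      println(s"attempt outcome=$o consecutiveFailures=$consecutiveFailures aborted=$aborted")
    }
  }
}
```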