
[SPARK-5945] Spark should not retry a stage infinitely on a FetchFailedException #5636

Closed (50 commits)

Conversation

ilganeli

The Stage class now tracks whether there have been enough consecutive failures of that stage to trigger an abort.

To avoid an infinite loop of stage retries, we abort the job completely after 4 consecutive failures of one stage. We still allow more than 4 total failures of a stage if there is an intervening successful attempt, so that in very long-lived applications, where a stage may get reused many times, we don't abort the job after failures that have already been recovered from successfully.

I've added test cases to exercise the most obvious scenarios.
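The behavior described above can be sketched as follows. This is an illustrative model only, not the actual DAGScheduler code; the class name StageFailureTracker is hypothetical:

```scala
// Illustrative sketch only, not the actual Spark patch: a per-stage counter
// of consecutive failures that resets on success and triggers an abort at 4.
val maxConsecutiveFailures = 4 // mirrors the limit described in the PR

class StageFailureTracker {
  private var consecutiveFailures = 0
  def recordFailure(): Unit = consecutiveFailures += 1
  // An intervening successful attempt resets the streak, so long-lived jobs
  // that recover from occasional failures are not aborted.
  def recordSuccess(): Unit = consecutiveFailures = 0
  def shouldAbort: Boolean = consecutiveFailures >= maxConsecutiveFailures
}

val stage = new StageFailureTracker
(1 to 3).foreach(_ => stage.recordFailure())
assert(!stage.shouldAbort)  // three in a row: still retried
stage.recordSuccess()       // a success resets the count
stage.recordFailure()
assert(!stage.shouldAbort)  // more than 4 total failures are fine if not consecutive
(1 to 3).foreach(_ => stage.recordFailure())
assert(stage.shouldAbort)   // four consecutive failures: abort the job
```

Note that in this sketch the stage has failed seven times in total, yet only the final run of four consecutive failures triggers the abort.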

…ting function to check whether to abort a stage when it fails for a single reason more than N times.
@SparkQA

SparkQA commented Apr 22, 2015

Test build #30773 has finished for PR 5636 at commit 40aefbe.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StageFailure(failureReason : String)
  • This patch does not change any dependencies.

@@ -96,6 +96,30 @@ class DAGScheduler(
// Stages that must be resubmitted due to fetch failures
private[scheduler] val failedStages = new HashSet[Stage]

// The maximum number of times to retry a stage before aborting
val maxStageFailures = 5
Contributor

can you make this a conf? There is already spark.task.maxFailures, so how about spark.stage.maxFailures? It should also get added to the docs.
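The suggestion could look roughly like this. It is a sketch only: a plain Map stands in for SparkConf so the snippet is self-contained, and the default of 4 follows the PR description (the diff above shows an early commit using 5):

```scala
// Sketch of the reviewer's suggestion: make the limit configurable via
// spark.stage.maxFailures, mirroring the existing spark.task.maxFailures.
// A Map stands in for SparkConf here; the real code would use something
// like conf.getInt("spark.stage.maxFailures", 4).
def maxStageFailures(conf: Map[String, String]): Int =
  conf.get("spark.stage.maxFailures").map(_.toInt).getOrElse(4)

assert(maxStageFailures(Map.empty) == 4)                               // default
assert(maxStageFailures(Map("spark.stage.maxFailures" -> "10")) == 10) // user override
```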

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30775 has finished for PR 5636 at commit f8744be.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StageFailure(failureReason : String)
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30778 has finished for PR 5636 at commit 8fe31e0.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 22, 2015

Test build #30779 has finished for PR 5636 at commit 729b7ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@squito
Contributor

squito commented Apr 22, 2015

Thanks @ilganeli, I took a quick look and have some high-level comments:

  • Checking for the exact same string is too restrictive, IMO. E.g., will the failure message include the host names of the failed fetch? Even without that detail, I bet there are a lot of cases where the same real error can result in different messages.
  • I think the count for stage failures should be reset every time the job completes. If you have a really long-running job, I could imagine some stage that many downstream stages depend on (e.g., imagine a streaming job where some common RDD is joined against lots of incoming batches). On a big cluster, nodes will eventually go down and stages will fail, but as long as the subsequent retry works, everything is fine. Over time, that same stage might fail a number of times, but as long as there is no more than one failure between each success, that would be completely normal (even expected to some extent).
  • Maybe we should still allow the old behavior of infinite retry, e.g., if spark.stage.maxFailures is set to -1? Though to be honest, I can't really think of any reason you'd want infinite retry; I just wonder if we should leave the door open, since it is a behavior change.

Thanks for working on this! This will be a great addition; I've seen this come up in a number of cases, and it's really hard for the average user to figure out what is going on, so this will be a big help.
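On the first point, a tiny illustration of why exact-string matching undercounts. The message format below is hypothetical, loosely modeled on a FetchFailed report; it is not Spark's actual string:

```scala
// Hypothetical message format: the same root cause (a lost shuffle output)
// produces different strings when the host differs, so string equality
// would treat them as unrelated failures.
def fetchFailedMessage(host: String, shuffleId: Int): String =
  s"FetchFailed(BlockManagerId($host, 7337), shuffleId=$shuffleId)"

val a = fetchFailedMessage("host1", 0)
val b = fetchFailedMessage("host2", 0)
assert(a != b)  // same failure class, different strings
assert(a.contains("shuffleId=0") && b.contains("shuffleId=0"))
```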

@ilganeli
Author

@squito Given that we can't check against the failure message (as I expected), any ideas on what we can do instead? Is this information exposed in any way at the level of the DAGScheduler, or do I need to figure out a mechanism to propagate the error info up in a more detailed way? I can add the config change to allow infinite retries, and I'll add the clear at the end of the job; that seems reasonable.

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30893 has finished for PR 5636 at commit d5fa622.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30896 has finished for PR 5636 at commit 0335b96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@ilganeli
Author

retest this please

@SparkQA

SparkQA commented Apr 24, 2015

Test build #30947 has started for PR 5636 at commit 0335b96.

@ilganeli
Author

retest this please

@squito
Contributor

squito commented Apr 25, 2015

Hi @ilganeli, thanks for updating this. Not sure if you are still working on this or not, but we definitely need tests for the new behavior as well. There are already tests around fetch failures in DAGSchedulerSuite, so you can probably add something that follows those examples.

@squito
Contributor

squito commented Apr 25, 2015

btw I have no idea what is going on in those test failures ... do the tests pass when you run them locally?

@ilganeli
Author

No, Imran, they don't. However, I see the same failures on the master branch; I don't think they have anything to do with my changes.


@ilganeli
Author

Roger - I'll add tests to the suite.


@ilganeli
Author

retest this please

@SparkQA

SparkQA commented Apr 27, 2015

Test build #30990 has finished for PR 5636 at commit 0335b96.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch adds the following new dependencies:
    • tachyon-0.6.4.jar
    • tachyon-client-0.6.4.jar
  • This patch removes the following dependencies:
    • tachyon-0.5.0.jar
    • tachyon-client-0.5.0.jar

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31083 has finished for PR 5636 at commit 914b2cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.

@ilganeli
Author

retest this please

@SparkQA

SparkQA commented Apr 28, 2015

Test build #31107 has finished for PR 5636 at commit 914b2cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
  • This patch does not change any dependencies.


private[scheduler] object Stage {
// The number of consecutive failures allowed before a stage is aborted
val MAX_CONSECUTIVE_FAILURES = 4
Contributor

this is really MAX_CONSECUTIVE_FETCH_FAILURES; we don't use this cap for other kinds of failures.

Contributor

to clarify a little bit: fetch failures are the only way we currently fail stages. Separately, there are task failures and job failures. In any case, it's good to make that clear here.

@andrewor14
Contributor

@ilganeli LGTM. Thanks for fixing this tricky issue; I'm sure many in the community will find it helpful. All my comments concern style, renaming, and test code. Once you address them I'll merge this.


completeShuffleMapStageSuccessfully(0, 0, numShufflePartitions = parts)

completeNextStageWithFetchFailure(1, 0, shuffleDep)
Contributor

Wait, I think things got a little confused between all the comments from Kay, Andrew, and me...
As this stands now, it's not a single fetch failure; there is a fetch failure from every task. I think the options were either (a) move this test to be first (as you've already done) but keep the name "multiple tasks w/ fetch failures", or (b) change the other tests to only have a single fetch failure by refactoring to completeStageWithFetchFailure, and keep this one w/ multiple tasks w/ fetch failures.

Maybe the name should actually be "multiple tasks with fetch failures in a single stage attempt should not abort the stage"?

Contributor

What? Why does it matter if there are one vs multiple tasks that failed with the fetch failure? Your suggestion is very verbose...

Contributor

Was your concern that "Single fetch failure" could refer to a task? If so we can call this "Single stage fetch failure"
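The point under discussion, sketched abstractly (assumed semantics for illustration, not the suite's actual helpers): however many tasks fail with a FetchFailed within one stage attempt, that attempt counts as a single stage failure toward the abort threshold:

```scala
// Hypothetical model: stage failures are counted per attempt, not per task.
val maxConsecutiveFailures = 4

// Each element is the number of failed tasks in one stage attempt; an
// attempt with any failed task counts as exactly one stage failure.
def stageFailuresFor(failedTasksPerAttempt: Seq[Int]): Int =
  failedTasksPerAttempt.count(_ > 0)

// One attempt in which all 10 tasks hit a FetchFailed:
assert(stageFailuresFor(Seq(10)) == 1)                     // ONE stage failure
assert(stageFailuresFor(Seq(10)) < maxConsecutiveFailures) // stage is retried, not aborted

// Four failed attempts in a row reach the threshold, regardless of how
// many tasks failed within each attempt:
assert(stageFailuresFor(Seq(10, 1, 3, 2)) >= maxConsecutiveFailures)
```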

@squito
Contributor

squito commented Sep 3, 2015

thanks for the reviews @kayousterhout and @andrewor14, and the quick updates @ilganeli!

@SparkQA

SparkQA commented Sep 3, 2015

Test build #41951 has finished for PR 5636 at commit 5bb1ae6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@andrewor14
Contributor

Alright, merging into master. I fixed the test name on merge. Thanks everyone.

asfgit closed this in 4bd85d0 on Sep 3, 2015
asfgit pushed a commit that referenced this pull request Nov 11, 2015
just trying to increase test coverage in the scheduler, this already works.  It includes a regression test for SPARK-9809

copied some test utils from #5636, we can wait till that is merged first

Author: Imran Rashid <irashid@cloudera.com>

Closes #8402 from squito/test_retry_in_shared_shuffle_dep.

(cherry picked from commit 33112f9)
Signed-off-by: Andrew Or <andrew@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 11, 2015 (same commit message as the previous entry, without the cherry-pick note)
kiszk pushed a commit to kiszk/spark-gpu that referenced this pull request Dec 26, 2015 (same commit message as the Nov 11 entries)
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016 (same commit message as the PR description above)

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes apache#5636 from ilganeli/SPARK-5945.

(cherry picked from commit 4bd85d0)