
[SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable #14544

Closed


@JoshRosen (Contributor) commented Aug 8, 2016

What changes were proposed in this pull request?

This patch introduces a new configuration, spark.deploy.maxExecutorRetries, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.

Background: This application-killing was added in 6b5980d (from September 2012), and I believe it was designed to prevent a faulty application whose executors could never launch from DoS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.

Motivation for making this configurable: Previously, if a Spark Standalone application experienced more than ApplicationState.MAX_NUM_RETRY back-to-back executor failures and was left with no running executors, the Spark master would kill that application. This behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all die simultaneously. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances, it is possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. To accommodate this use case, this patch modifies the Master to never kill faulty applications if spark.deploy.maxExecutorRetries is negative.
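
As an illustration, the property could be set on the standalone master via SPARK_MASTER_OPTS in conf/spark-env.sh (a hedged sketch: the property name comes from this patch, while the mechanism shown and the value 50 are illustrative assumptions, not part of the patch itself):

```sh
# conf/spark-env.sh on the standalone master (illustrative values)
# Disable automatic removal of faulty applications entirely:
SPARK_MASTER_OPTS="-Dspark.deploy.maxExecutorRetries=-1"
# ...or keep the behavior but raise the limit above the old hardcoded 10:
# SPARK_MASTER_OPTS="-Dspark.deploy.maxExecutorRetries=50"
```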

I'd like to merge this patch into master, branch-2.0, and branch-1.6.

How was this patch tested?

I tested this manually using spark-shell and local-cluster mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.
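
For context, a manual test along these lines might look like the following (a sketch, not the exact commands used for this PR; the local-cluster sizing is arbitrary and failures are provoked by killing executor JVMs by hand):

```sh
# Start a shell against an in-process standalone-style cluster:
# local-cluster[numWorkers, coresPerWorker, memoryPerWorkerMB]
./bin/spark-shell --master local-cluster[2,1,1024]

# In another terminal, repeatedly kill the executor JVMs
# (CoarseGrainedExecutorBackend processes) and observe whether the
# master removes the application or lets it recover.
```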


SparkQA commented Aug 8, 2016

Test build #63374 has finished for PR 14544 at commit d9bab26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zsxwing (Member) commented Aug 8, 2016

LGTM

The review thread below concerns this documentation excerpt from the diff:

Limit on the maximum number of back-to-back executor failures that can occur before the
standalone cluster manager removes a faulty application. An application will never be removed
if it has any running executors. If an application experiences more than
<code>spark.deploy.maxExecutorRetries</code> failures in a row, no executors [...]
A reviewer (Contributor) commented:

Does "in a row" mean anything here?

JoshRosen (Author) replied:

Yes: if you have a sequence of executor events like FAIL RUNNING FAIL RUNNING ... then this resets the retry count, whereas FAIL FAIL FAIL FAIL... increments it.
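
To make those semantics concrete, here is a minimal Scala sketch (not the actual Master code; the names and the removal check are illustrative, following the behavior described in this PR):

```scala
object RetrySemanticsSketch {
  sealed trait ExecutorEvent
  case object ExecutorFailed extends ExecutorEvent   // executor exited abnormally
  case object ExecutorRunning extends ExecutorEvent  // executor reached the RUNNING state

  // Count *consecutive* failures: a RUNNING executor resets the counter.
  def consecutiveFailures(events: Seq[ExecutorEvent]): Int =
    events.foldLeft(0) {
      case (n, ExecutorFailed)  => n + 1
      case (_, ExecutorRunning) => 0
    }

  // An app is removed only when the limit is non-negative (a negative value
  // disables removal), the limit is exceeded, and no executors are running.
  def wouldRemoveApp(events: Seq[ExecutorEvent],
                     maxExecutorRetries: Int,
                     hasRunningExecutors: Boolean): Boolean =
    maxExecutorRetries >= 0 &&
      consecutiveFailures(events) > maxExecutorRetries &&
      !hasRunningExecutors
}
```

For example, a FAIL RUNNING FAIL RUNNING sequence never pushes the counter above 1, no matter how long it runs, while an unbroken run of failures eventually trips the limit.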

@rxin (Contributor) commented Aug 9, 2016

LGTM too.

@JoshRosen (Author) commented:
Merging to master, branch-2.0, and branch-1.6.

asfgit pushed a commit that referenced this pull request Aug 9, 2016

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.

(cherry picked from commit b89b3a5)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
asfgit closed this in b89b3a5 Aug 9, 2016
asfgit pushed a commit that referenced this pull request Aug 9, 2016

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.

(cherry picked from commit b89b3a5)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
@JoshRosen deleted the add-setting-for-max-executor-failures branch August 9, 2016 19:00

SparkQA commented Aug 9, 2016

Test build #63446 has finished for PR 14544 at commit ef3297d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

zzcclp pushed a commit to zzcclp/spark that referenced this pull request Aug 10, 2016

Author: Josh Rosen <joshrosen@databricks.com>

Closes apache#14544 from JoshRosen/add-setting-for-max-executor-failures.

(cherry picked from commit b89b3a5)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
(cherry picked from commit ace458f)