[SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable #14544

Closed
@@ -22,6 +22,4 @@ private[master] object ApplicationState extends Enumeration {
type ApplicationState = Value

val WAITING, RUNNING, FINISHED, FAILED, KILLED, UNKNOWN = Value

val MAX_NUM_RETRY = 10
}
@@ -58,6 +58,7 @@ private[deploy] class Master(
private val RETAINED_DRIVERS = conf.getInt("spark.deploy.retainedDrivers", 200)
private val REAPER_ITERATIONS = conf.getInt("spark.dead.worker.persistence", 15)
private val RECOVERY_MODE = conf.get("spark.deploy.recoveryMode", "NONE")
private val MAX_EXECUTOR_RETRIES = conf.getInt("spark.deploy.maxExecutorRetries", 10)

val workers = new HashSet[WorkerInfo]
val idToApp = new HashMap[String, ApplicationInfo]
@@ -265,7 +266,11 @@ private[deploy] class Master(

val normalExit = exitStatus == Some(0)
// Only retry certain number of times so we don't go into an infinite loop.
if (!normalExit && appInfo.incrementRetryCount() >= ApplicationState.MAX_NUM_RETRY) {
// Important note: this code path is not exercised by tests, so be very careful when
// changing this `if` condition.
if (!normalExit
&& appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
&& MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
val execs = appInfo.executors.values
if (!execs.exists(_.state == ExecutorState.RUNNING)) {
logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
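The new gating condition can be sketched in isolation. The snippet below is a simplified, hypothetical model of the Master's per-application bookkeeping (the `RetryTracker` name and its methods are illustrative, not Spark's actual API); it captures the behavior discussed in the review thread: an executor reaching RUNNING resets the failure streak, and a negative limit disables the removal path entirely.

```scala
// Hypothetical, simplified model of the retry bookkeeping in Master.scala.
class RetryTracker(maxExecutorRetries: Int) {
  private var retryCount = 0

  // An abnormal executor exit increments the streak (cf. incrementRetryCount()).
  def onExecutorFailed(): Unit = retryCount += 1

  // An executor reaching RUNNING resets the streak of consecutive failures,
  // so a sequence like FAIL RUNNING FAIL RUNNING never accumulates to the limit.
  def onExecutorRunning(): Unit = retryCount = 0

  // Mirrors the patched `if` condition: remove the application only on an
  // abnormal exit, once the streak reaches a non-negative limit (< 0 disables
  // this path), and only when no executor is still running.
  def shouldRemoveApp(normalExit: Boolean, anyExecutorRunning: Boolean): Boolean =
    !normalExit &&
      retryCount >= maxExecutorRetries &&
      maxExecutorRetries >= 0 &&
      !anyExecutorRunning
}
```

Note that in the actual patch the increment happens inline inside the `if` condition via `appInfo.incrementRetryCount()`; the model above separates it only for readability.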
15 changes: 15 additions & 0 deletions docs/spark-standalone.md
@@ -195,6 +195,21 @@ SPARK_MASTER_OPTS supports the following system properties:
the whole cluster by default. <br/>
</td>
</tr>
<tr>
<td><code>spark.deploy.maxExecutorRetries</code></td>
<td>10</td>
<td>
Limit on the maximum number of back-to-back executor failures that can occur before the
standalone cluster manager removes a faulty application. An application will never be removed
if it has any running executors. If an application experiences more than
<code>spark.deploy.maxExecutorRetries</code> failures in a row, no executors
[Review comment, Contributor] Does "in a row" mean anything here?

[Contributor Author] Yes: if you have a sequence of executor events like FAIL RUNNING FAIL RUNNING ... then this resets the retry count, whereas FAIL FAIL FAIL FAIL... increments it.
successfully start running in between those failures, and the application has no running
executors, then the standalone cluster manager will remove the application and mark it as failed.
To disable this automatic removal, set <code>spark.deploy.maxExecutorRetries</code> to
<code>-1</code>.
<br/>
</td>
</tr>
<tr>
<td><code>spark.worker.timeout</code></td>
<td>60</td>
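Since the documentation above lists `spark.deploy.maxExecutorRetries` under SPARK_MASTER_OPTS, the property is set on the master process. A usage sketch (paths depend on your installation; shown here disabling the removal path entirely):

```
# conf/spark-env.sh on the master host: -1 disables application removal
# after repeated executor failures.
SPARK_MASTER_OPTS="-Dspark.deploy.maxExecutorRetries=-1"
```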