[SPARK-20904][core] Don't report task failures to driver during shutdown. #18594
Conversation
Executors run a thread pool with daemon threads to run tasks. This means that those threads remain active when the JVM is shutting down, so those tasks are affected by code that runs in shutdown hooks.

So if a shutdown hook messes with something the task is using (e.g. an HDFS connection), the task will fail and will report that failure to the driver. That makes the driver mark the task as failed regardless of what caused the executor to shut down. For example, if YARN preempted the executor, the driver would count the task as failed when it should instead ignore the failure.

This change avoids reporting failures to the driver while shutdown hooks are executing. That fixes the YARN preemption accounting and doesn't change much in other scenarios, other than reporting a more generic error ("Executor lost") when the executor shuts down unexpectedly - which is arguably more correct.

Tested with a hacky app running on spark-shell that tried to cause failures only when shutdown hooks were running; verified that preemption didn't cause the app to fail because of task failures exceeding the threshold.
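The mechanism described above can be sketched in miniature. The names below (`SimpleShutdownTracker`, `TaskReporter`) are hypothetical stand-ins invented for this sketch, not Spark's actual `ShutdownHookManager` or `Executor` code:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a shutdown tracker: a JVM shutdown hook
// flips a flag that daemon task threads can consult while the JVM
// is going down.
class SimpleShutdownTracker {
    private static final AtomicBoolean SHUTTING_DOWN = new AtomicBoolean(false);

    static {
        Runtime.getRuntime().addShutdownHook(
            new Thread(() -> SHUTTING_DOWN.set(true)));
    }

    static boolean inShutdown() {
        return SHUTTING_DOWN.get();
    }
}

class TaskReporter {
    // The reporting decision described above: send the failure to the
    // driver only when the executor is not already shutting down;
    // otherwise the driver learns about the lost executor through other
    // channels ("Executor lost").
    static boolean shouldReportFailure(boolean inShutdown) {
        return !inShutdown;
    }
}
```

Because only shutdown hooks run to completion during JVM teardown, a daemon task thread observing `inShutdown() == true` cannot assume any of its resources are still valid.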
Here's the hacky test code for the interested:
Ran that in two shells, one in a low-priority queue and one in a high-priority one, restarting the high-priority one to force several rounds of executors being killed by preemption in the low-priority queue. I also checked that the exception did show up in the executor logs of the low-priority shell (the driver did not see those errors because they were caused by preemption, which is the goal of the change).
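The actual test snippet isn't included above. Purely as an illustration of the failure mode it exercised - a task that only fails once shutdown hooks start running - a simulated version might look like this (`FlakyOnShutdownTask` is invented for this sketch and is not a real Spark class):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Simulates a task whose work only fails once shutdown has begun,
// mimicking e.g. an HDFS connection closed by a shutdown hook.
class FlakyOnShutdownTask {
    private final AtomicBoolean shuttingDown;

    FlakyOnShutdownTask(AtomicBoolean shuttingDown) {
        this.shuttingDown = shuttingDown;
    }

    void runStep() {
        if (shuttingDown.get()) {
            throw new IllegalStateException("resource closed by shutdown hook");
        }
        // normal task work would happen here
    }
}
```

With the fix, a failure like this one that surfaces only after shutdown has begun never reaches the driver's failure count.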
Test build #79485 has finished for PR 18594 at commit
// spurious failures may occur and can result in improper accounting in the driver (e.g.
// the task failure would not be ignored if the shutdown happened because of preemption,
// instead of an app issue).
if (!ShutdownHookManager.inShutdown()) {
If the shutdown is caused by an app issue, do we want to report the task failure to the driver?
The task will still fail in that case, just with a different error ("Executor lost").
Because the executor shutdown in that case won't be caused by the cluster manager (e.g. preemption), the task failure will still count. So aside from a different error message, everything else behaves the same in that case.
At this point I don't think we have any information on why we're in shutdown, whether it is an app issue, the Spark executor process being killed from the command line, etc.
Yes, a log message would be nice. Maybe, in the else clause of this if, something like logInfo(s"Not reporting failure as we are in the middle of a shutdown").
Sure, I can add a log, but it's not guaranteed to be printed. During shutdown the JVM can die at any moment (only shutdown hooks run to completion, and this is not one of them)...
Yeah, it isn't guaranteed. I'm thinking that if this happens often enough, maybe one executor will print the message, giving the user a clue. It also serves as a de-facto code comment. Yes, any daemon thread may terminate at any time during shutdown - even finishing this block isn't guaranteed. Thanks!
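The shape being discussed, sketched with placeholder callbacks (`reportToDriver` and `log` here are invented parameters, not Spark's real signatures):

```java
import java.util.function.Consumer;

// Sketch of the guarded report with the suggested else-branch log.
class FailureHandler {
    static void handleTaskFailure(boolean inShutdown,
                                  Consumer<String> reportToDriver,
                                  Consumer<String> log) {
        if (!inShutdown) {
            reportToDriver.accept("task failed");
        } else {
            // Best effort only: during shutdown the JVM may exit
            // before this line is ever written.
            log.accept("Not reporting failure as we are in the middle of a shutdown");
        }
    }
}
```

The else-branch message is purely diagnostic; as noted above, nothing guarantees it gets flushed before the JVM dies.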
I'm hesitant to support the change. If we don't report the failure to the driver, the status of the failed task would not be updated and it would not be rescheduled; perhaps that's not the behavior we expect to see?
I don't think you understand what the change is doing. The task will still fail, because the executor is dying. The only thing that changes is the failure reason, which will now be "Executor lost" - which is actually more correct (any failures caused by races during shutdown happen basically because the executor is dying). That allows the driver to ignore the failure in certain cases where it already does so (e.g. YARN preempting executors).
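The driver-side effect being described, reduced to a sketch (the reason types below are invented for illustration; Spark's actual failure-reason hierarchy is richer than this):

```java
// Invented failure reasons, illustration only.
enum FailureReason {
    APP_ERROR,
    EXECUTOR_LOST_PREEMPTED
}

class DriverAccounting {
    // A failure counts toward the max-failures threshold unless the
    // cluster manager took the executor away (e.g. YARN preemption).
    static boolean countsTowardThreshold(FailureReason reason) {
        return reason != FailureReason.EXECUTOR_LOST_PREEMPTED;
    }
}
```

This is why attributing shutdown-time failures to the lost executor, rather than to the task, is what lets preempted work be rescheduled without eating into the failure budget.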
This looks good to me. It fixes a race and will improve error counting. Thanks for looking into this.
Test build #79852 has finished for PR 18594 at commit
LGTM, merging to master/2.2!
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #18594 from vanzin/SPARK-20904.
(cherry picked from commit cecd285)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>