[SPARK-21219][Core] Task retry occurs on same executor due to race condition with blacklisting #18427
Conversation
…nding list and updating black list state.
Jenkins this is ok to test
Test build #78659 has finished for PR 18427 at commit
Please update the title to
Test build #78661 has finished for PR 18427 at commit
retest this please
Test build #79224 has finished for PR 18427 at commit
I think the fix is correct, left a few comments.
// the same executor that it was intended to be black listed from.
val conf = new SparkConf().
  set(config.BLACKLIST_ENABLED, true).
  set(config.MAX_TASK_ATTEMPTS_PER_EXECUTOR, 1)
The default value of config.MAX_TASK_ATTEMPTS_PER_EXECUTOR is 1, so we don't have to set it here.
Yes, I added it to make the test's configuration inputs more explicit, but I can remove it if it's a default that's unlikely to change.
val clock = new ManualClock
val mockListenerBus = mock(classOf[LiveListenerBus])
val blacklistTracker = new BlacklistTracker(mockListenerBus, conf, None, clock)
val taskSetManager = new TaskSetManager(sched, taskSet, 1, Some(blacklistTracker))
Why are we using SystemClock for taskSetManager?
It seems all the tests in this file are using ManualClock, so I was following convention here. This test doesn't validate anything specifically dependent on the clock/time.
// Simulate an out of memory error
val e = new OutOfMemoryError
taskSetManagerSpy.handleFailedTask(
  taskDesc.get.taskId, TaskState.FAILED, new ExceptionFailure(e, Seq()))
nit: ExceptionFailure is a case class, so you may use:
val e = ExceptionFailure("a", "b", Array(), "c", None)
taskSetManagerSpy.handleFailedTask(taskDesc.get.taskId, TaskState.FAILED, endReason)
Okay
Test build #79347 has finished for PR 18427 at commit
LGTM, cc @cloud-fan
LGTM, merging to master!
…ndition with blacklisting

There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor *prior* to the blacklist state update (updateBlacklistForFailedTask); the result is that the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. There are sample logs showing the issue at https://issues.apache.org/jira/browse/SPARK-21219

The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.

Implemented a unit test that verifies the task is blacklisted before it is added to the pending task list. Ran the unit test without the fix and it fails; ran it with the fix and it passes.

Author: Eric Vandenberg <ericvandenberg@fb.com>

Closes apache#18427 from ericvandenbergfb/blacklistFix.
…ndition with blacklisting

What changes were proposed in this pull request?
This is a backport of the fix for SPARK-21219, already checked in as 96d58f2.

How was this patch tested?
Ran TaskSetManagerSuite tests locally.

Author: Eric Vandenberg <ericvandenberg@fb.com>

Closes #18604 from jsoltren/branch-2.2.
What changes were proposed in this pull request?
There's a race condition in the current TaskSetManager where a failed task is added for retry (addPendingTask) and can asynchronously be assigned to an executor prior to the blacklist state update (updateBlacklistForFailedTask); the result is that the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. There are sample logs showing the issue at https://issues.apache.org/jira/browse/SPARK-21219
The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.
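To illustrate the reordering, here is a minimal self-contained sketch — not Spark's actual implementation; the class and method bodies below are simplified stand-ins that only borrow the Spark method names. The point is that the failing executor must be blacklisted before the task becomes schedulable again, so a concurrent scheduling pass cannot offer the retry back to the same executor in the window between the two calls.

```scala
import scala.collection.mutable

// Simplified stand-in for TaskSetManager, modeling only the call ordering
// that the fix changes; not the real Spark class.
class SketchTaskSetManager {
  val pendingTasks = mutable.Queue[Int]()
  val blacklistedExecutors = mutable.Set[String]()

  def updateBlacklistForFailedTask(exec: String): Unit =
    blacklistedExecutors += exec

  def addPendingTask(taskId: Int): Unit =
    pendingTasks += taskId

  // Fixed ordering: blacklist the failing executor FIRST, then re-add
  // the task as pending. With the old (reversed) ordering, a scheduler
  // thread could assign the pending retry before the blacklist updated.
  def handleFailedTask(taskId: Int, exec: String): Unit = {
    updateBlacklistForFailedTask(exec)
    addPendingTask(taskId)
  }
}
```

With this ordering, by the time the task is visible in pendingTasks, the blacklist already excludes the executor it failed on.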
How was this patch tested?
Implemented a unit test that verifies the task is blacklisted before it is added to the pending task list. Ran the unit test without the fix and it fails; ran it with the fix and it passes.
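A rough sketch of the verification idea follows. This is self-contained and hypothetical — the real test in TaskSetManagerSuite spies on TaskSetManager itself, whereas this stand-in simply records the order of the two calls so an assertion can check that the blacklist update happens first.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in handler that records call order, so a test can
// assert the blacklist update precedes re-adding the task as pending.
class RecordingHandler {
  val calls = ArrayBuffer[String]()

  def updateBlacklistForFailedTask(): Unit = calls += "updateBlacklist"

  def addPendingTask(): Unit = calls += "addPendingTask"

  def handleFailedTask(): Unit = {
    updateBlacklistForFailedTask() // must come first (the fix)
    addPendingTask()
  }
}
```

A test would invoke handleFailedTask() and assert that "updateBlacklist" appears in calls before "addPendingTask"; with the pre-fix ordering the assertion fails.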