[SPARK-21219][CORE] Task retry occurs on same executor due to race co… #18604

jsoltren · 2017-07-11T22:57:26Z

…ndition with blacklisting

There's a race condition in the current TaskSetManager: a failed task is added for retry (addPendingTask) and can be asynchronously assigned to an executor before the blacklist state is updated (updateBlacklistForFailedTask), so the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. Sample logs showing the issue are in https://issues.apache.org/jira/browse/SPARK-21219

The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask, so that the blacklist is updated before the task is added back to the pending tasks.
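
To make the ordering problem concrete, here is a minimal sketch in plain Scala. It is not Spark's code: `MiniTaskSet`, `offer`, `buggyOrder`, `fixedOrder`, and the executor ID `exec-1` are hypothetical names invented for illustration, and the "resource offer arriving in between" is simulated with a callback rather than a real thread. Only the method names `addPendingTask` and `updateBlacklistForFailedTask` are borrowed from the description above. Whether and how such an interleaving happens in Spark depends on its own locking, which this model does not reproduce.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-in for a task set. NOT Spark's TaskSetManager.
final class MiniTaskSet {
  private val pending = mutable.Queue.empty[Int]
  // (executorId, taskIndex) pairs that must not be scheduled together.
  private val blacklisted = mutable.Set.empty[(String, Int)]

  def addPendingTask(index: Int): Unit = synchronized {
    pending.enqueue(index)
    ()
  }

  def updateBlacklistForFailedTask(executorId: String, index: Int): Unit = synchronized {
    blacklisted += ((executorId, index))
    ()
  }

  // Models a resource offer: hand out the first pending task that is not
  // blacklisted on the offering executor.
  def offer(executorId: String): Option[Int] = synchronized {
    pending.dequeueFirst(index => !blacklisted.contains((executorId, index)))
  }
}

object RaceSketch {
  // `interleave` stands for an offer that arrives between the two steps.
  def buggyOrder(ts: MiniTaskSet, executorId: String, index: Int, interleave: () => Unit): Unit = {
    ts.addPendingTask(index) // task becomes schedulable first
    interleave()
    ts.updateBlacklistForFailedTask(executorId, index) // too late
  }

  def fixedOrder(ts: MiniTaskSet, executorId: String, index: Int, interleave: () => Unit): Unit = {
    ts.updateBlacklistForFailedTask(executorId, index) // blacklist first
    interleave()
    ts.addPendingTask(index)
  }

  // Returns true if the failed executor was handed the retry.
  def retriedOnSameExecutor(fixed: Boolean): Boolean = {
    val ts = new MiniTaskSet
    var assigned: Option[Int] = None
    val offerFromFailedExecutor = () => { assigned = ts.offer("exec-1") }
    if (fixed) fixedOrder(ts, "exec-1", 0, offerFromFailedExecutor)
    else buggyOrder(ts, "exec-1", 0, offerFromFailedExecutor)
    assigned.isDefined
  }

  def main(args: Array[String]): Unit = {
    println(s"buggy order, retried on same executor: ${retriedOnSameExecutor(fixed = false)}")
    println(s"fixed order, retried on same executor: ${retriedOnSameExecutor(fixed = true)}")
  }
}
```

In this model, the offer that lands between the two steps sees the task as pending but not yet blacklisted when the old order is used, so the failed executor gets the retry. With the blacklist updated first, the same offer finds nothing it is allowed to run.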

Implemented a unit test that verifies the task is blacklisted before it is added to the pending tasks. The unit test fails without the fix and passes with it.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Eric Vandenberg ericvandenberg@fb.com

Closes #18427 from ericvandenbergfb/blacklistFix.

What changes were proposed in this pull request?

This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.

How was this patch tested?

Ran TaskSetManagerSuite tests locally.


jiangxb1987 · 2017-07-12T00:12:45Z

Do we want to backport this to 2.2? @cloud-fan

vanzin · 2017-07-12T00:37:24Z

ok to test

vanzin · 2017-07-12T00:37:33Z

I don't see why we shouldn't backport a fix to 2.2.

SparkQA · 2017-07-12T03:29:40Z

Test build #79545 has finished for PR 18604 at commit 2ea00a5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jsoltren · 2017-07-12T03:42:47Z

My preference is to backport this and other blacklisting-related fixes as far back as possible on Spark 2, meaning 2.1 and 2.0 as well, unless convinced otherwise. So, yes, @cloud-fan, I hope we do backport this!

…ndition with blacklisting

There's a race condition in the current TaskSetManager: a failed task is added for retry (addPendingTask) and can be asynchronously assigned to an executor *prior* to the blacklist state being updated (updateBlacklistForFailedTask), so the task might re-execute on the same executor. This is particularly problematic if the executor is shutting down, since the retry task immediately becomes a lost task (ExecutorLostFailure). Another side effect is that the actual failure reason gets obscured by the retry task, which never actually executed. Sample logs showing the issue are in https://issues.apache.org/jira/browse/SPARK-21219

The fix is to change the ordering of the addPendingTask and updateBlacklistForFailedTask calls in TaskSetManager.handleFailedTask.

Implemented a unit test that verifies the task is blacklisted before it is added to the pending tasks. The unit test fails without the fix and passes with it.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Eric Vandenberg <ericvandenberg@fb.com>

Closes #18427 from ericvandenbergfb/blacklistFix.

## What changes were proposed in this pull request?

This is a backport of the fix to SPARK-21219, already checked in as 96d58f2.

## How was this patch tested?

Ran TaskSetManagerSuite tests locally.

Author: Eric Vandenberg <ericvandenberg@fb.com>

Closes #18604 from jsoltren/branch-2.2.

cloud-fan · 2017-07-12T06:49:52Z

merging to 2.2!

vanzin · 2017-07-12T16:57:20Z

@jsoltren please close, this was merged but PRs to branches are not closed automatically.

jsoltren · 2017-07-12T17:00:08Z

👍

…hema

## What changes were proposed in this pull request?

In apache#18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made `InsertIntoXXX` commands take the input `query` as a child; when we do the actual writing, we just pass the physical plan to the writer (`FileFormatWriter.write`).

However, this is problematic. In Spark SQL, the optimizer and planner are allowed to change the schema names a little bit. For example, the `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from `<A: int>` to `<a: int>`. When it comes to writing, especially for self-describing data formats like Parquet, we may write the wrong schema to the file and cause null values on the read path.

Fortunately, in apache#18450, we decided to allow nested execution, so one query can map to multiple executions in the UI. This releases the major restriction in apache#18604, and now we don't have to take the input `query` as a child of `InsertIntoXXX` commands.

So the fix is simple: this PR partially reverts apache#18064 and makes `InsertIntoXXX` commands leaf nodes again.

## How was this patch tested?

New regression test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes apache#19474 from cloud-fan/bug.


jsoltren closed this Jul 12, 2017

cloud-fan mentioned this pull request Oct 11, 2017

[SPARK-22252][SQL] FileFormatWriter should respect the input query schema #19474

Closed
