
[SPARK-27272][CORE] Enable blacklisting of node/executor on fetch failures by default #24208

Closed
wants to merge 5 commits

Conversation

@ankuriitg (Contributor) commented Mar 25, 2019

What changes were proposed in this pull request?

SPARK-20898 added a new configuration to blacklist a node/executor on fetch
failures. This config was deemed risky at the time and was disabled by default
until more data was collected.

This commit enables that feature by default, as we have seen a couple of
instances where it proved useful. Additionally, the commit changes the
blacklisting criteria slightly.

If the external shuffle service is not enabled, the commit blacklists the
executor immediately (on the first fetch failure). This is consistent with the
fact that we delete all the shuffle outputs on that executor, so I think it is
useful to also blacklist that executor temporarily.

Otherwise, if the external shuffle service is enabled, instead of blacklisting
the node immediately, it keeps track of all such fetch failures on that node.
If the number of unique, active fetch failures on that node exceeds the
configured threshold (it re-uses MAX_FAILED_EXEC_PER_NODE, though this can be
changed), the node is also blacklisted. This ensures that persistent issues
with a node do not lead to job failures.
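To make the settings concrete, here is a minimal sketch of how a user opts in to this behavior today. The keys are, to my understanding, the standard ones behind the flags discussed here (the fetch-failure flag from SPARK-20898 and MAX_FAILED_EXEC_PER_NODE); the values are illustrative, not the defaults this PR proposes.

```scala
// Sketch only: current opt-in configuration, assuming the standard config keys.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // General blacklisting must be enabled for any of this to take effect.
  .set("spark.blacklist.enabled", "true")
  // The flag this PR turns on by default (added by SPARK-20898).
  .set("spark.blacklist.application.fetchFailure.enabled", "true")
  // Corresponds to MAX_FAILED_EXEC_PER_NODE, re-used here as the node threshold.
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
```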

How was this patch tested?

  1. Added a unit test case to ensure that blacklisting works as configured

@ankuriitg (Contributor, Author)

cc @squito @tgravescs

@tgravescs (Contributor)

I'm definitely good with turning it on by default. Did you add the extra config and tracking because you saw instances where the current version is too aggressive? If so, can you give details? Right now it blacklists immediately, because you get a stage failure immediately. I can see reasons to make this configurable, but I'm not seeing how that ties into the config MAX_FAILED_EXEC_PER_NODE, which is at the application level, not the stage level.

@ankuriitg (Contributor, Author)

I do not have any instances where the current version is too aggressive, but I think that could be the case in some situations. So, I think it may be useful to track the failures before blacklisting a node.

I realize that MAX_FAILED_EXEC_PER_NODE is at the application level, and I want to keep it that way: there is a possibility that we may not observe another fetch failure from the same stage but may get one from another stage or another stage attempt (this was observed in my situation).

@SparkQA commented Mar 25, 2019

Test build #103923 has finished for PR 24208 at commit 209c2cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 26, 2019

Test build #103931 has finished for PR 24208 at commit 9751318.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

I guess I'm not sure I agree with tying the fetch failure count to the same config as MAX_FAILED_EXEC_PER_NODE. I can see someone wanting that to stay the same but be more aggressive on the fetch failures.

I used to run with the config on all the time at Yahoo, where it blacklisted immediately. I never saw any issues that people reported, but that doesn't mean it wasn't too aggressive, since we generally had very large clusters. @abellina have you seen any issues with this being on?

If we don't have any data showing we need it, perhaps we leave it as is for now; the customers' out is that they just turn it back off, and if people report it being too aggressive we add a new config at that point. What do you think?

This config defines the threshold after which a node should be blacklisted.
@ankuriitg (Contributor, Author)

Thanks for your review, @tgravescs.

I just added another config, MAX_FETCH_FAILURES_PER_NODE. This way, users can choose to be more aggressive on fetch failures.

I think that without this configuration it can be too aggressive on smaller clusters, especially ones with beefy nodes, where each node runs many executors but there are very few nodes overall.

If you still think we do not need this configuration, then I can remove it altogether.
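For illustration only, here is a minimal, self-contained sketch of the node-level tracking described in the PR description. This is not Spark's actual BlacklistTracker code; FetchFailureTracker and maxFetchFailuresPerNode are hypothetical names, and "unique and active fetch failures" is interpreted here as distinct, still-active executors on the host. Note that this extra logic and config were later dropped from the PR, per the comments below.

```scala
// Illustrative sketch of the proposed node-level threshold, not Spark internals.
import scala.collection.mutable

class FetchFailureTracker(maxFetchFailuresPerNode: Int) {
  // host -> executors on that host (still active) that produced fetch failures
  private val failedExecsByHost = mutable.Map.empty[String, mutable.Set[String]]

  /** Record a fetch failure and return true if the host should be blacklisted. */
  def onFetchFailure(host: String, execId: String): Boolean = {
    val execs = failedExecsByHost.getOrElseUpdate(host, mutable.Set.empty[String])
    execs += execId
    execs.size > maxFetchFailuresPerNode
  }

  /** Forget executors that are no longer active, so only "active" failures count. */
  def executorRemoved(host: String, execId: String): Unit = {
    failedExecsByHost.get(host).foreach(_ -= execId)
  }
}
```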

@squito (Contributor) commented Mar 26, 2019

I feel the same as Tom: I'm in favor of turning it on by default, but not sure about the extra config. Given that the scheduler still treats all map output as missing on the node after one fetch failure, it seems inconsistent to have this part configurable, unless we have a case supporting it. (If it's going to be true by default, I think we should also remove "experimental" from the docs in configuration.md.)

Separately, do you have any opinions on turning blacklisting on by default? Maybe we should start a discussion on dev@ to get more input on that. It's been working for us for a while now, and seems critical for large clusters, so I think it would be good.

…s blacklisting on fetch failure by default.

2. Removed the additional config
3. Updated the documentation
@ankuriitg (Contributor, Author)

Thanks for the review, @squito!

I have removed the additional logic and configuration and changed the PR to just enable blacklisting on fetch failure by default. As @tgravescs suggested, we can add that logic later on, if needed.

I also wanted to check whether we should enable spark.files.fetchFailure.unRegisterOutputOnHost by default (added in SPARK-19753). This ensures that not only is the host blacklisted, but all the shuffle output on that host is also unregistered.

@squito, I also agree with turning blacklisting on by default.
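For reference, enabling the setting mentioned above is just another application config. A minimal sketch (the key is quoted verbatim from the comment; the value is the opt-in, not a proposed default):

```scala
// Sketch: opting in to unregistering all map outputs on a host after a fetch
// failure (SPARK-19753); existing behavior, independent of this PR.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.files.fetchFailure.unRegisterOutputOnHost", "true")
```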

@SparkQA commented Mar 26, 2019

Test build #103983 has finished for PR 24208 at commit a4c77c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 26, 2019

Test build #103987 has finished for PR 24208 at commit 4e77e5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment

Any reason we don't un-Experimental other configs like spark.blacklist.killBlacklistedExecutors?

@ankuriitg (Contributor, Author)

Let me start an email thread where we can discuss which blacklisting configurations seem stable now and should be enabled by default.

@rezasafi (Contributor)

Thanks for the work. The description says a unit test was added, but it isn't included in the changed files. Did you decide to remove it?

@ankuriitg (Contributor, Author)

Yes, the PR body is not up-to-date. The PR just updates the config now, as mentioned in this comment above: #24208 (comment)

@rxin (Contributor) commented Mar 29, 2019

Sorry, -1 on this given the current state. See my comment on the dev list thread.

I think in general this is the right direction but the current implementation is unfortunately not great.

@ankuriitg (Contributor, Author)

Thanks for the review, @rxin!

Actually, this is a slightly different change than what I proposed in the dev thread. It enables blacklisting on fetch failure only if the general blacklisting feature is enabled. The general blacklisting feature (controlled by spark.blacklist.enabled) is still disabled by default.

@rxin (Contributor) commented Mar 29, 2019

Ah got it. Thanks for explaining. What's the behavior if this is false? Do we just treat fetch failure as a normal failure?

@ankuriitg (Contributor, Author)

If this is false, there is a possibility that a map task is retried on the same executor/node multiple times, and the reduce stage then retries the maximum number of times before the job fails.

Overall, this seems to be an essential part of blacklisting: it ensures executors/nodes are blacklisted if fetch failures originate from them.
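For context on the retry limits mentioned above, these are the knobs that bound how long a bad node can keep failing the job when fetch-failure blacklisting is off (a sketch; the keys and defaults shown are, to my knowledge, the usual ones and are not touched by this PR):

```scala
// Sketch: existing retry limits, shown with their usual defaults.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "4")             // per-task retry limit
  .set("spark.stage.maxConsecutiveAttempts", "4") // stage retries on fetch failures before aborting
```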

@rxin (Contributor) commented Mar 30, 2019

Makes sense. Thanks for explaining!

@SparkQA commented Apr 1, 2019

Test build #104167 has finished for PR 24208 at commit 65f2eca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 8, 2019

Test build #108790 has finished for PR 24208 at commit 65f2eca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented Aug 8, 2019

This looks good to me. I think the discussion on dev@ didn't want blacklisting in general on by default, but this is now just turning on the fetch-failure handling by default, and you still have to enable blacklisting to get it. @rxin, just to confirm, are you OK with this?

@dongjoon-hyun (Member)

Retest this please.

@dongjoon-hyun (Member)

Hi, all. Can we make a final decision on this PR from @ankuriitg?

@SparkQA commented Sep 9, 2019

Test build #110326 has finished for PR 24208 at commit 65f2eca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Can one of the admins verify this patch?

@HeartSaVioR (Contributor)

Looks like this got two +1s but is stuck with a -1. The original concern seems to have been resolved, but the veto hasn't been lifted.
@rxin, shall we cancel the veto if you're OK with the explanation?

@rxin (Contributor) commented Oct 15, 2019

Sorry, I didn't get to this earlier. Do we ever release a node from the blacklist? I'm asking because it would be bad if a transient network error immediately blacklisted all the nodes. It would make sense if it blacklisted only after a few fetch failures, or if the blacklist were automatically released after a while.
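For reference (existing behavior, not something this PR changes): blacklist entries in Spark do expire after a configurable interval, which, if I recall the key correctly, is controlled as sketched below, with a default of one hour.

```scala
// Sketch only: the timeout after which blacklisted executors/nodes are released.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.timeout", "1h")
```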

@github-actions (bot)

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 24, 2020
@github-actions github-actions bot closed this Jan 26, 2020