[SPARK-27272][CORE] Enable blacklisting of node/executor on fetch failures by default #24208
Conversation
SPARK-20898 added a new configuration to blacklist a node/executor on fetch failures. That config was deemed risky at the time and was disabled by default until more data could be collected. This commit aims to enable the feature by default, as we have seen a couple of instances where it proved useful. Additionally, the commit changes the blacklist criteria slightly.

If the external shuffle service is not enabled, the commit blacklists the executor immediately, on the first fetch failure. This is consistent with the fact that we already delete all the shuffle outputs on that executor, so it is useful to also blacklist the executor temporarily. If the external shuffle service is enabled, instead of blacklisting the node immediately, it keeps track of all such fetch failures on that node. If the unique and active fetch failures on that node exceed the configured threshold (it re-uses MAX_FAILED_EXEC_PER_NODE, but this can be changed), then the node is also blacklisted. This ensures that persistent issues with a node do not lead to job failures.

Testing done:
1. Added a unit test case to ensure that blacklisting works as configured
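For reference, these are the settings the description refers to, as they could be set explicitly in spark-defaults.conf (config names as of Spark 2.x; the blacklist settings were later renamed under spark.excludeOnFailure.* in Spark 3.1 — the value for the per-node cap below is illustrative):

```
# Enable the general blacklisting feature (still off by default)
spark.blacklist.enabled                                true
# The fetch-failure blacklisting this PR turns on by default (SPARK-20898)
spark.blacklist.application.fetchFailure.enabled       true
# Application-level cap that an earlier revision of this PR proposed to re-use
# as the per-node fetch-failure threshold (MAX_FAILED_EXEC_PER_NODE)
spark.blacklist.application.maxFailedExecutorsPerNode  2
```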
I'm definitely good with turning it on by default. You added the extra config and tracking because you saw instances where the current version is too aggressive? If so, can you give details? Right now it blacklists immediately because you get a stage failure immediately. I can see reasons to make this configurable, but I'm not seeing how that ties into the config MAX_FAILED_EXEC_PER_NODE, which is at the application level, not the stage level.
I do not have any instances where the current version is too aggressive, but I think that could be the case in some setups. So, I think it may be useful to track the failures before blacklisting a node. I realize that MAX_FAILED_EXEC_PER_NODE is at the application level and I want to keep it that way, as there is a possibility that we may not observe another fetch failure from the same stage but may get one from another stage or another stage attempt (this was observed in my situation).
Test build #103923 has finished for PR 24208 at commit
(force-pushed from 209c2cf to 9751318)
Test build #103931 has finished for PR 24208 at commit
I guess I'm not sure I agree with tying the fetch-failure count to the same config as MAX_FAILED_EXEC_PER_NODE. I can see someone wanting that to stay the same but wanting to be more aggressive on the fetch failures. I used to run with this config on all the time at Yahoo, where it blacklisted immediately, and never saw any issues that people reported, but that doesn't mean it wasn't too aggressive, since we generally had very large clusters. @abellina have you seen any issues with this being on? If we don't have any data showing we need it, perhaps we leave it as is for now; the customers' out is that they can just turn it back off, and if people report it as too aggressive, then at that point we add a new config. What do you think?
This config defines the threshold after which a node should be blacklisted.
Thanks for your review @tgravescs. I just added another config, MAX_FETCH_FAILURES_PER_NODE, so users can choose to be more aggressive on fetch failures. I think that without this configuration it can be too aggressive on smaller clusters, especially ones with beefy nodes, where each node runs many executors but there are very few nodes overall. If you still think we do not need this configuration, I can remove it altogether.
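The per-node threshold being discussed here can be sketched as a small, self-contained tracker. This is a hypothetical simplification, not the PR's actual code: the real logic lives in BlacklistTracker and tracks "unique and active" fetch failures, while the class and method names below are illustrative only.

```scala
import scala.collection.mutable

// Hypothetical, simplified sketch of per-node fetch-failure tracking.
class FetchFailureTracker(maxFetchFailuresPerNode: Int) {
  // For each host, the set of distinct executors that had fetch failures.
  private val failedExecsByNode = mutable.Map.empty[String, Set[String]]

  // Record a fetch failure originating from `execId` on `host`.
  // The failing executor itself would be blacklisted immediately (its
  // shuffle output is treated as lost anyway); this method returns true
  // once the whole node should also be blacklisted.
  def onFetchFailure(host: String, execId: String): Boolean = {
    val execs = failedExecsByNode.getOrElse(host, Set.empty) + execId
    failedExecsByNode(host) = execs
    // Blacklist the node only once enough *distinct* executors have failed,
    // so one bad executor on a beefy multi-executor node does not take the
    // whole node out.
    execs.size >= maxFetchFailuresPerNode
  }
}
```

Because the count is of distinct executors per host, repeated failures from the same executor do not push the node over the threshold, which is what makes this less aggressive on small clusters with few, large nodes.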
I feel the same as Tom: I'm in favor of turning it on by default, but not sure about the extra config. Given that the scheduler still treats all map output on the node as missing after one fetch failure, it seems inconsistent to have this part configurable, unless we have a case supporting it. (If it's going to be true by default, I think we should remove "experimental" from the docs.) Separately, do you have any opinions on turning blacklisting on by default? Maybe we should start a discussion on dev@ to get more input on that. It's been working for us for a while now, and seems critical for large clusters, so I think it would be good.
1. Enables blacklisting on fetch failure by default. 2. Removed the additional config. 3. Updated the documentation.
Thanks for the review @squito! I have removed the additional logic and configuration and changed the PR to just enable blacklisting on fetch failure by default. As @tgravescs suggested, we can add that logic later on, if needed. I also wanted to check if we should enable ... @squito, I also agree with turning blacklisting on by default.
Test build #103983 has finished for PR 24208 at commit
Test build #103987 has finished for PR 24208 at commit
Any reason we don't un-Experimental other configs, like spark.blacklist.killBlacklistedExecutors?
Let me start an email thread where we can discuss which blacklisting configurations seem stable now and should be enabled by default.
Thanks for the work. The description says there was an added unit test, but it isn't included in the changed files. Did you decide to remove it?
Yes, the PR body is not up-to-date. The PR just updates the config now, as mentioned in this comment above: #24208 (comment)
Resolved review comments on:
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala
core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala
Sorry, -1 on this given the current state. See my comment on the dev list thread. I think in general this is the right direction, but the current implementation is unfortunately not great.
Thanks for the review @rxin! Actually, this is a slightly different change than what I proposed in the dev thread. It enables blacklisting on fetch failure only if the general blacklisting feature is enabled. The general blacklisting feature (controlled by spark.blacklist.enabled) is still disabled by default.
Ah, got it. Thanks for explaining. What's the behavior if this is false? Do we just treat a fetch failure as a normal failure?
If this is false, there is a possibility that a map task is retried on the same executor/node multiple times, and then the reduce stage retries the maximum number of times before failing the job. Overall, this seems to be an essential part of blacklisting: ensuring that executors/nodes are blacklisted when fetch failures originate from them.
Makes sense. Thanks for explaining!
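The failure mode described in this exchange can be sketched in a few lines. This is a hypothetical helper, not Spark's actual scheduler code: without blacklisting, nothing stops the scheduler from re-offering the retried map task to the very executor whose output just failed to fetch.

```scala
// Hypothetical sketch: executor selection with and without a blacklist.
// `executors` is the list of candidate executors, and `blacklisted` is the
// set produced by fetch-failure blacklisting (empty when the feature is off).
def pickExecutor(executors: Seq[String], blacklisted: Set[String]): Option[String] =
  executors.find(e => !blacklisted.contains(e))
```

With the feature off, blacklisted is empty and the bad executor can be chosen again and again until the stage exhausts its retries and the job fails; with it on, the failing executor (or node) is skipped for the duration of the blacklist.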
Test build #104167 has finished for PR 24208 at commit
Test build #108790 has finished for PR 24208 at commit
This looks good to me. The discussion on dev didn't want blacklisting in general on by default, but this is now just turning on the fetch-failure handling by default, and you still have to enable blacklisting to get it. @rxin, just to confirm, are you OK with this?
Retest this please. |
Hi, all. Can we make a final decision on this PR from @ankuriitg?
Test build #110326 has finished for PR 24208 at commit
Can one of the admins verify this patch? |
Looks like this got two +1s but is stuck with a -1. The original concern seems to have been resolved, but the veto was never withdrawn.
Sorry, I didn't get to this earlier. Do we ever release a node from the blacklist? I'm asking because it would be bad if a transient network error immediately blacklisted all the nodes. It would make sense if it blacklisted only after a few fetch failures, or if the blacklist were automatically released after a while.
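For context on the question above: Spark's blacklist mechanism is time-based, and entries expire after spark.blacklist.timeout (default 1h in Spark 2.x), so a node hit by a transient error is released automatically. As it could appear in spark-defaults.conf:

```
# Blacklisted executors/nodes are released after this interval (default 1h)
spark.blacklist.timeout   1h
```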
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |