
[SPARK-27272][CORE] Enable blacklisting of node/executor on fetch failures by default #24208

Closed
wants to merge 5 commits

Conversation

@ankuriitg (Contributor) commented Mar 25, 2019

What changes were proposed in this pull request?

SPARK-20898 added a new configuration to blacklist a node/executor on fetch
failures. This config was deemed risky at the time and was disabled by default
until more data was collected.

This commit enables that feature by default, as we have seen a couple of
instances where it proved useful. Additionally, the commit changes the
blacklisting criteria slightly.

If the external shuffle service is not enabled, the commit blacklists the
executor immediately (on the first fetch failure). This is consistent with the
fact that we delete all the shuffle outputs on that executor, so I think it is
useful to also blacklist that executor temporarily.

Otherwise, if the external shuffle service is enabled, instead of blacklisting
the node immediately, it keeps track of all such fetch failures on that node.
If the number of unique, active fetch failures on that node exceeds the
configured threshold (it re-uses MAX_FAILED_EXEC_PER_NODE, though this can be
changed), the node is also blacklisted. This ensures that persistent issues
with a node do not lead to job failures.
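To make the settings concrete, here is a minimal sketch of how a user opts in to this behavior today. The keys are, to my understanding, the standard ones behind the flags discussed here (the fetch-failure flag from SPARK-20898 and MAX_FAILED_EXEC_PER_NODE); the values are illustrative, not the defaults this PR proposes.

```scala
// Sketch only: current opt-in configuration, assuming the standard config keys.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // General blacklisting must be enabled for any of this to take effect.
  .set("spark.blacklist.enabled", "true")
  // The flag this PR turns on by default (added by SPARK-20898).
  .set("spark.blacklist.application.fetchFailure.enabled", "true")
  // Corresponds to MAX_FAILED_EXEC_PER_NODE, re-used here as the node threshold.
  .set("spark.blacklist.application.maxFailedExecutorsPerNode", "2")
```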

How was this patch tested?

  1. Added a unit test case to ensure that blacklisting works as configured

@ankuriitg (Contributor, Author)

cc @squito @tgravescs

@tgravescs (Contributor)

I'm definitely good with turning it on by default. Did you add the extra config and tracking because you saw instances where the current version is too aggressive? If so, can you give details? Right now it blacklists immediately, because you get a stage failure immediately. I can see reasons to make this configurable, but I'm not seeing how that ties into the config MAX_FAILED_EXEC_PER_NODE, which is at the application level, not the stage level.

@ankuriitg (Contributor, Author)

I do not have any instances where the current version is too aggressive, but I think that could be the case in some situations. So, I think it may be useful to track the failures before blacklisting a node.

I realize that MAX_FAILED_EXEC_PER_NODE is at the application level, and I want to keep it that way: there is a possibility that we may not observe another fetch failure from the same stage but may get one from another stage or another stage attempt (this was observed in my situation).

@SparkQA commented Mar 25, 2019

Test build #103923 has finished for PR 24208 at commit 209c2cf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 26, 2019

Test build #103931 has finished for PR 24208 at commit 9751318.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor)

I guess I'm not sure I agree with tying the fetch failure count to the same config as MAX_FAILED_EXEC_PER_NODE. I can see someone wanting that to stay the same but be more aggressive on the fetch failures.

I used to run with the config on all the time at Yahoo, where it blacklisted immediately. I never saw any issues that people reported, but that doesn't mean it wasn't too aggressive, since we generally had very large clusters. @abellina have you seen any issues with this being on?

If we don't have any data showing we need it, perhaps we leave it as is for now; the customers' out is that they just turn it back off, and if people report it being too aggressive we add a new config at that point. What do you think?

This config defines the threshold after which a node should be blacklisted.
@ankuriitg (Contributor, Author)

Thanks for your review, @tgravescs.

I just added another config, MAX_FETCH_FAILURES_PER_NODE. This way, users can choose to be more aggressive on fetch failures.

I think that without this configuration it can be too aggressive on smaller clusters, especially ones with beefy nodes, where each node runs many executors but there are very few nodes overall.

If you still think we do not need this configuration, then I can remove it altogether.
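For illustration only, here is a minimal, self-contained sketch of the node-level tracking described in the PR description. This is not Spark's actual BlacklistTracker code; FetchFailureTracker and maxFetchFailuresPerNode are hypothetical names, and "unique and active fetch failures" is interpreted here as distinct, still-active executors on the host. Note that this extra logic and config were later dropped from the PR, per the comments below.

```scala
// Illustrative sketch of the proposed node-level threshold, not Spark internals.
import scala.collection.mutable

class FetchFailureTracker(maxFetchFailuresPerNode: Int) {
  // host -> executors on that host (still active) that produced fetch failures
  private val failedExecsByHost = mutable.Map.empty[String, mutable.Set[String]]

  /** Record a fetch failure and return true if the host should be blacklisted. */
  def onFetchFailure(host: String, execId: String): Boolean = {
    val execs = failedExecsByHost.getOrElseUpdate(host, mutable.Set.empty[String])
    execs += execId
    execs.size > maxFetchFailuresPerNode
  }

  /** Forget executors that are no longer active, so only "active" failures count. */
  def executorRemoved(host: String, execId: String): Unit = {
    failedExecsByHost.get(host).foreach(_ -= execId)
  }
}
```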

@squito (Contributor) commented Mar 26, 2019

I feel the same as Tom: I'm in favor of turning it on by default, but not sure about the extra config. Given that the scheduler still treats all map output as missing on the node after one fetch failure, it seems inconsistent to have this part configurable, unless we have a case supporting it. (If it's going to be true by default, I think we should also remove "experimental" from the docs in configuration.md.)

Separately, do you have any opinions on turning blacklisting on by default? Maybe we should start a discussion on dev@ to get more input on that. It's been working for us for a while now, and seems critical for large clusters, so I think it would be good.

…s blacklisting on fetch failure by default.

2. Removed the additional config
3. Updated the documentation
@ankuriitg (Contributor, Author)

Thanks for the review, @squito!

I have removed the additional logic and configuration and changed the PR to just enable blacklisting on fetch failure by default. As @tgravescs suggested, we can add that logic later on, if needed.

I also wanted to check whether we should enable spark.files.fetchFailure.unRegisterOutputOnHost by default (added in SPARK-19753). This ensures that not only is the host blacklisted, but all the shuffle output on that host is also unregistered.

@squito, I also agree with turning blacklisting on by default.
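For reference, enabling the setting mentioned above is just another application config. A minimal sketch (the key is quoted verbatim from the comment; the value is the opt-in, not a proposed default):

```scala
// Sketch: opting in to unregistering all map outputs on a host after a fetch
// failure (SPARK-19753); existing behavior, independent of this PR.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.files.fetchFailure.unRegisterOutputOnHost", "true")
```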

@SparkQA commented Mar 26, 2019

Test build #103983 has finished for PR 24208 at commit a4c77c7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Mar 26, 2019

Test build #103987 has finished for PR 24208 at commit 4e77e5d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member) left a comment

Any reason we don't un-Experimental other configs like spark.blacklist.killBlacklistedExecutors?

@ankuriitg (Contributor, Author)

Let me start an email thread where we can discuss which blacklisting configurations seem stable now and should be enabled by default.

@rezasafi (Contributor)

Thanks for the work. The description says a unit test was added, but it isn't included in the changed files. Did you decide to remove it?

@ankuriitg (Contributor, Author)

Yes, the PR body is not up-to-date. The PR just updates the config now, as mentioned in this comment above: #24208 (comment)

@rxin (Contributor) commented Mar 29, 2019

Sorry, -1 on this given the current state. See my comment on the dev list thread.

I think in general this is the right direction but the current implementation is unfortunately not great.

@ankuriitg (Contributor, Author)

Thanks for the review, @rxin!

Actually, this is a slightly different change than what I proposed in the dev thread. It enables blacklisting on fetch failure only if the general blacklisting feature is enabled. The general blacklisting feature (controlled by spark.blacklist.enabled) is still disabled by default.

@rxin (Contributor) commented Mar 29, 2019

Ah got it. Thanks for explaining. What's the behavior if this is false? Do we just treat fetch failure as a normal failure?

@ankuriitg (Contributor, Author)

If this is false, there is a possibility that a map task is retried on the same executor/node multiple times, and the reduce stage then retries the maximum number of times before the job fails.

Overall, this seems to be an essential part of blacklisting: it ensures executors/nodes are blacklisted if fetch failures originate from them.
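For context on the retry limits mentioned above, these are the knobs that bound how long a bad node can keep failing the job when fetch-failure blacklisting is off (a sketch; the keys and defaults shown are, to my knowledge, the usual ones and are not touched by this PR):

```scala
// Sketch: existing retry limits, shown with their usual defaults.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "4")             // per-task retry limit
  .set("spark.stage.maxConsecutiveAttempts", "4") // stage retries on fetch failures before aborting
```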

@rxin (Contributor) commented Mar 30, 2019

Makes sense. Thanks for explaining!

@SparkQA commented Apr 1, 2019

Test build #104167 has finished for PR 24208 at commit 65f2eca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Aug 8, 2019

Test build #108790 has finished for PR 24208 at commit 65f2eca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tgravescs (Contributor) commented Aug 8, 2019

This looks good to me. I think the discussion on dev@ didn't want blacklisting in general on by default, but this is now just turning on the fetch-failure handling by default, and you still have to enable blacklisting to get it. @rxin, just to confirm, are you OK with this?

@dongjoon-hyun (Member)

Retest this please.

@dongjoon-hyun (Member)

Hi, all. Can we make a final decision on this PR from @ankuriitg?

@SparkQA commented Sep 9, 2019

Test build #110326 has finished for PR 24208 at commit 65f2eca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Can one of the admins verify this patch?

@HeartSaVioR (Contributor)

Looks like this got two +1s but is stuck with a -1. The original concern seems to have been resolved, but the veto hasn't been lifted.
@rxin, shall we cancel the veto if you're OK with the explanation?

@rxin (Contributor) commented Oct 15, 2019

Sorry, I didn't get to this earlier. Do we ever release a node from the blacklist? I'm asking because it would be bad if a transient network error immediately blacklisted all the nodes. It would make sense if it blacklisted only after a few fetch failures, or if the blacklist were automatically released after a while.
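For reference (existing behavior, not something this PR changes): blacklist entries in Spark do expire after a configurable interval, which, if I recall the key correctly, is controlled as sketched below, with a default of one hour.

```scala
// Sketch only: the timeout after which blacklisted executors/nodes are released.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.blacklist.timeout", "1h")
```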

@github-actions (bot)

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 24, 2020
@github-actions github-actions bot closed this Jan 26, 2020