[SPARK-36784][SHUFFLE][WIP] Handle DNS issues on executor to prevent shuffle nodes from getting added to exclude list#34024

Closed
thejdeep wants to merge 2 commits into apache:master from thejdeep:SPARK-36784

@thejdeep
Contributor

What changes were proposed in this pull request?

When DNS resolution fails on an executor node, remote shuffle nodes get added to the exclude list because the failure is reported as a FetchFailed exception. This change adds a configurable host name against which DNS resolution is tested before the failure is marked as a FetchFailed exception.

Why are the changes needed?

This would prevent shuffle nodes from being added to the exclude list because of DNS issues that are local to the executor.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added Unit Tests

@AmplabJenkins

Can one of the admins verify this patch?

@mridulm
Contributor

mridulm commented Sep 16, 2021

I am not sure if we can generalize this expectation - UnknownHostException at a node does not imply a local DNS resolution issue alone.
For example, if a node is decommissioned/removed - fetch failures could result due to UnknownHostException as the remote host is no longer resolvable. If we do not trigger a re-execution of the parent stage, tasks in the current stage will keep failing and result in application failure.

@thejdeep thejdeep changed the title [SPARK-36784][SHUFFLE] Handle DNS issues on executor to prevent shuffle nodes from getting added to exclude list [SPARK-36784][SHUFFLE][WIP] Handle DNS issues on executor to prevent shuffle nodes from getting added to exclude list Sep 17, 2021
@thejdeep
Contributor Author

@mridulm Thanks for reviewing it, Mridul. As you mentioned, this may not be the most reliable solution. I will keep this as a WIP and look into ways to resolve DNS without using the JVM or OS cache.

@xkrogen
Contributor

xkrogen commented Sep 27, 2021

> if a node is decommissioned/removed - fetch failures could result due to UnknownHostException as the remote host is no longer resolvable

Agreed, but the proposed changes accommodate this situation. If an UnknownHostException is seen, we don't immediately assume that there are DNS issues. We first check whether some other "known good" hostname is resolvable, as defined by spark.network.dnsHealthCheck.host, and only if that host isn't resolvable do we assume there is a local DNS issue. If no such host is configured, we fall back to the current behavior, which is to assume that UnknownHostException indicates an issue with the remote host. See Executor#isDNSResolvableIfConfigured for more details.

This isn't perfect, but I think it acts as a very practical heuristic that should cover the majority of cases.
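
The decision flow described above can be sketched roughly as follows. This is an illustrative sketch, not the code from the PR: the class `DnsHealthCheckSketch`, the `Cause` enum, the `Resolver` interface, and the `classify` method are hypothetical names. The only assumption taken from the discussion is that an optional "known good" host (the value of spark.network.dnsHealthCheck.host) is consulted after an UnknownHostException has been seen.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Optional;

public class DnsHealthCheckSketch {
    /** Hypothetical outcome names; these are not Spark's. */
    enum Cause { LOCAL_DNS_ISSUE, REMOTE_HOST_ISSUE }

    /** Abstraction over name resolution so the check can be exercised without a network. */
    interface Resolver {
        void resolve(String host) throws UnknownHostException;
    }

    /** Resolver backed by the JVM; answers may come from the JVM or OS cache. */
    static final Resolver JVM_RESOLVER = host -> InetAddress.getByName(host);

    /**
     * Called after a fetch has already failed with UnknownHostException.
     * healthCheckHost stands in for the configured "known good" host.
     */
    static Cause classify(Optional<String> healthCheckHost, Resolver resolver) {
        if (!healthCheckHost.isPresent()) {
            // No host configured: keep the current behavior and blame the remote host.
            return Cause.REMOTE_HOST_ISSUE;
        }
        try {
            resolver.resolve(healthCheckHost.get());
            // The known-good host resolves, so local DNS looks healthy.
            return Cause.REMOTE_HOST_ISSUE;
        } catch (UnknownHostException e) {
            // Even the known-good host does not resolve: suspect local DNS.
            return Cause.LOCAL_DNS_ISSUE;
        }
    }

    public static void main(String[] args) {
        Resolver broken = host -> { throw new UnknownHostException(host); };
        Resolver healthy = host -> { };
        System.out.println(classify(Optional.empty(), broken));
        System.out.println(classify(Optional.of("known-good.example"), healthy));
        System.out.println(classify(Optional.of("known-good.example"), broken));
    }
}
```

Passing the resolver as a parameter is only a convenience for trying the logic without real DNS; with `JVM_RESOLVER` the result is subject to the caching concerns raised later in this thread.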

cc @venkata91 as well

@thejdeep
Contributor Author

Resolving DNS as a way of estimating reachability is not reliable, since DNS resolutions are cached at both the OS and the JVM layer. We should revisit this after exploring options that would let us bypass the DNS cache when performing resolution.

@venkata91
Contributor

venkata91 commented Sep 29, 2021

> We first check if some other "known good" hostname is resolvable, as defined by spark.network.dnsHealthCheck.host, and only if that host isn't resolvable, we assume there is a local DNS issue.

@xkrogen The problem with this solution is the heavy DNS caching that happens at the JVM layer, the network layer, and so on, which makes it very brittle. The decommissioning case also needs to be thought through further. But the overall idea still seems relevant to me with a modified approach (mainly resolving the DNS name of the remote host without hitting the cache).
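
For the JVM layer only, one known knob is the pair of security properties `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl`. The sketch below shows how they could be set to zero so that `InetAddress` lookups are not served from the JVM's own cache. It is an illustration of the caching concern, not something proposed in this PR: the class and method names are hypothetical, the OS or network resolver cache is not affected, and in typical OpenJDK implementations the values are read when the caching policy is first initialized, so they would need to be set before the first lookup.

```java
import java.security.Security;

public class JvmDnsCacheSketch {
    /**
     * Asks the JVM not to cache successful or failed lookups.
     * Only the JVM-level cache is affected; OS-level caching remains.
     */
    static void disableJvmDnsCache() {
        // TTL in seconds for successful lookups; 0 means do not cache.
        Security.setProperty("networkaddress.cache.ttl", "0");
        // TTL in seconds for failed lookups; 0 means do not cache.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    public static void main(String[] args) {
        disableJvmDnsCache();
        System.out.println(Security.getProperty("networkaddress.cache.ttl"));
        System.out.println(Security.getProperty("networkaddress.cache.negative.ttl"));
    }
}
```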

@github-actions

github-actions bot commented Jan 8, 2022

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Jan 8, 2022
@github-actions github-actions bot closed this Jan 9, 2022

5 participants