[SPARK-36784][SHUFFLE][WIP] Handle DNS issues on executor to prevent shuffle nodes from getting added to exclude list #34024
thejdeep wants to merge 2 commits into apache:master
Can one of the admins verify this patch?
I am not sure if we can generalize this expectation -
@mridulm Thanks for reviewing it, Mridul. As you mentioned, this may not be the most reliable solution. I will keep this as a WIP and look into ways to resolve DNS without using the JVM or OS cache.
Agreed, but the proposed changes will accommodate this situation. If an … This isn't perfect, but I think it acts as a very practical heuristic that should cover the majority of cases. cc @venkata91 as well
Resolving DNS as a way of estimating reachability is not reliable, since DNS resolutions are cached at both the OS and the JVM layer. We should revisit this after exploring options that would let us bypass the DNS cache when performing resolution.
@xkrogen The problem with this solution is the heavy DNS caching at the JVM and network layers, which makes it very brittle. The decommissioning case also needs to be thought through further. But the overall idea still seems relevant to me with a modified approach (mainly, resolving the remote host's DNS name without hitting the cache).
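For readers unfamiliar with the JVM-level caching mentioned in the comments above, here is a minimal Java sketch of one known knob: the `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties, which control how long `InetAddress` caches successful and failed lookups. This is not something the commenters proposed, and the class name `JvmDnsCacheSettings` is hypothetical. It is also only a partial answer, since it assumes the properties are set before the JVM performs its first lookup and it does nothing about caching at the OS or network layer.

```java
import java.security.Security;

// Hypothetical helper, for illustration only; not part of Spark or this PR.
class JvmDnsCacheSettings {

    // Ask the JVM not to cache DNS lookups. A TTL of "0" means "never cache".
    // Assumption: this runs before the first InetAddress lookup, because the
    // JVM typically reads these properties once when its cache policy loads.
    static void disableJvmDnsCache() {
        // Positive cache: successful name-to-address resolutions.
        Security.setProperty("networkaddress.cache.ttl", "0");
        // Negative cache: failed resolutions.
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }
}
```

Even with both properties set to zero, a local resolver daemon or upstream DNS server may still answer from its own cache, which is the brittleness the comments point out.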
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
When a DNS issue happens on the executor node, shuffle nodes get added to the exclude list because the resulting fetch failures surface as FetchFailed exceptions. The change here adds a configurable host against which DNS resolution is tested before a failure is marked as a FetchFailed exception.
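A minimal Java sketch of the check described above, under stated assumptions: the class `DnsAwareFetchFailure`, the `Outcome` enum, and the probe host value are hypothetical names chosen for illustration, not the PR's actual code or a Spark API. The idea is to try resolving a configured, known-good host when a fetch fails. If that probe also fails to resolve, the problem is likely local DNS on the executor rather than the shuffle node. The resolver is passed in as a parameter so the decision logic can be exercised without touching the network.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.function.Predicate;

// Hypothetical illustration; names are not taken from the PR.
class DnsAwareFetchFailure {

    enum Outcome {
        REPORT_FETCH_FAILED, // DNS works locally, so blame the remote shuffle node
        LOCAL_DNS_ISSUE      // the probe host did not resolve, so suspect this executor
    }

    // Returns true if the JVM can resolve the given host name.
    // Note: the answer may come from the JVM or OS DNS cache.
    static boolean canResolve(String host) {
        try {
            InetAddress.getByName(host);
            return true;
        } catch (UnknownHostException e) {
            return false;
        }
    }

    // Decide how to treat a fetch failure, given a probe host from configuration
    // and a resolver (for example DnsAwareFetchFailure::canResolve).
    static Outcome classify(String probeHost, Predicate<String> resolver) {
        return resolver.test(probeHost)
                ? Outcome.REPORT_FETCH_FAILED
                : Outcome.LOCAL_DNS_ISSUE;
    }
}
```

A caller might invoke `DnsAwareFetchFailure.classify(probeHost, DnsAwareFetchFailure::canResolve)` in the fetch-failure path, where `probeHost` comes from whatever configuration key the implementation defines.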
Why are the changes needed?
This would prevent shuffle nodes from being added to the exclude list because of DNS issues on the executor.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added unit tests.