New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable #6305
Fix for KAFKA-7974: Avoid zombie AdminClient when node host isn't resolvable #6305
Conversation
Cc @cmccabe |
Good find, @nickbp . The fix seems a bit incomplete in the sense that there are more exceptions that we could get, besides
Could catch the possible exceptions and then create a Or perhaps there is another way to make this more robust-- that was just the first that came to mind. |
Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed.
Good point. I was able to change the strategy of the fix to instead clean up when the host resolution actually occurs within Previously the resolve was happening automatically with the initial construction of the |
* @param id the id of the connection | ||
* @param now the current time | ||
* @throws UnknownHostException | ||
* @param now the current time in ms | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add JavaDoc for host
and clientDnsLookup
, while we're changing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
LGTM |
retest this please |
FWIW I locally ran the failing CI tests against the branch and they were successful:
$ gradle :core:test --tests=*ConsumerBounce*
> Configure project :
Building project 'core' with Scala version 2.12.8
Building project 'streams-scala' with Scala version 2.12.8
[...]
> Task :core:test
kafka.api.ConsumerBounceTest > testCloseDuringRebalance PASSED
kafka.api.ConsumerBounceTest > testClose PASSED
kafka.api.ConsumerBounceTest > testSeekAndCommitWithBrokerFailures PASSED
kafka.api.ConsumerBounceTest > testConsumerReceivesFatalExceptionWhenGroupPassesMaxSize PASSED
kafka.api.ConsumerBounceTest > testSubscribeWhenTopicUnavailable PASSED
kafka.api.ConsumerBounceTest > testRollingBrokerRestartsWithSmallerMaxGroupSizeConfigDisruptsBigGroup PASSED
kafka.api.ConsumerBounceTest > testConsumptionWithBrokerFailures SKIPPED
BUILD SUCCESSFUL in 4m 41s
14 actionable tasks: 9 executed, 5 up-to-date
$ gradle :core:test --tests=*CustomQuotaCallbackTest*
> Configure project :
Building project 'core' with Scala version 2.12.8
Building project 'streams-scala' with Scala version 2.12.8
[...]
> Task :core:test
kafka.api.CustomQuotaCallbackTest > testCustomQuotaCallback PASSED
BUILD SUCCESSFUL in 1m 24s
14 actionable tasks: 8 executed, 6 up-to-date
* @param id the id of the connection | ||
* @param now the current time | ||
* @throws UnknownHostException | ||
* @param now the current time in ms | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Looks like the build is green now, btw. Anything else needed before a merge? |
@cmccabe looks like this is ready to be merged since you approved it and the build is green. |
Thanks, @nickbp ! |
…olvable (#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()
@cmccabe I cherry-picked this to the 2.1 and 2.2 branches as it seems like an important fix. Also cc @rajinisivaram who is familiar with this code. |
@ijuma is this likely to go in to the 2.2 branch soon? |
…olvable (#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()
@manderson23 it's done now. I was running the tests locally before doing the push and had ran out of time. |
I stumbled onto this issue a couple of days ago when trying to run an application in K8s. Any estimates for when a build will be ready with this fix? |
I likewise originally encountered this in a K8s environment. The core issue leading to this bug arising was the result of DNS requests round-robining between CoreDNS instances which were individually eventually consistent but as a group not consistent at all, with entries popping in and out of existence as pods were brought up and some but not all CoreDNS instances knew about them at any given moment. To avoid the inconsistency issues, I was able to use the following workarounds in my K8s 1.13.x environment. With these workarounds a patch to the Kafka clients was not required to avoid the issue. YMMV:
|
@nickbp interesting... I'll give that a go too. My workaround was to implement a simple, blocking, utility function that will attempt to resolve the host(s) for a finite duration |
Yeah the issue I was seeing was that after a utility check found the entry to be resolvable, it could still then become "unresolvable" due to dns lookup round-robining to a dns instance that didn't know about the host yet |
…olvable (apache#6305) * Fix for KAFKA-7974: Avoid calling disconnect() when not connecting * Resolve host only when currentAddress() is called Moves away from automatically resolving the host when the connection entry is constructed, which can leave ClusterConnectionStates in a confused state. Instead, resolution is done on demand, ensuring that the entry in the connection list is present even if the resolution failed. * Add Javadoc to ClusterConnectionStates.connecting()
Is this available in any release? I can't seem to find it... |
When attempting to get topic list via KafkaAdminClient against a server that isn't resolvable, the worker thread can get killed as follows, leading to a zombie KafkaAdminClient:
It looks like cause is a bug in state handling between
NetworkClient
andClusterConnectionStates
:NetworkClient.ready()
invokesthis.initiateConnect()
as seen in the above stacktraceNetworkClient.initiateConnect()
invokesClusterConnectionStates.connecting()
, which internally invokesClientUtils.resolve()
to resolve the host when creating an entry for the connection.UnknownHostException
can be thrown back toNetworkClient.initiateConnect()
and the connection entry is not created inClusterConnectionStates
. This exception doesn't currently get logged so this is a guess on my part.NetworkClient.initiateConnect()
catches the exception and attempts to callClusterConnectionStates.disconnected()
, which throws anIllegalStateException
because no entry had yet been created due to the lookup failure.IllegalStateException
ends up killing the worker thread andKafkaAdminClient
gets stuck, never returning fromlistTopics()
.This PR includes a unit test which reproduces the original issue (matching stacktrace) and verifies the fix.
Committer Checklist (excluded from commit message)