KAFKA-6916: Refresh metadata in admin client if broker connection fails #5050

Merged: 3 commits into apache:trunk on May 29, 2018

Conversation

rajinisivaram (Contributor)

Refresh metadata if a broker connection fails, so that new calls are sent only to nodes that are alive and requests to the controller are sent to the new controller if the controller changes due to a broker failure. Also reassign calls that could not be sent.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ijuma requested a review from @hachikuji on May 21, 2018 13:58
```diff
@@ -123,7 +123,11 @@ public void handleCompletedMetadataResponse(RequestHeader requestHeader, long no

     @Override
     public void requestUpdate() {
-        // Do nothing
+        AdminMetadataManager.this.requestUpdate();
```
lindong28 (Member)

It seems that AdminMetadataManager.metadataFetchDelayMs() always returns 0 if the state is UPDATE_REQUESTED. This patch can make it more likely for AdminClient to DDoS the broker with a lot of MetadataRequests. Would it be safer to respect refreshBackoffMs when the metadata update is explicitly requested?
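
For illustration, a minimal sketch of what respecting the backoff could look like (field, state, and method names follow AdminMetadataManager, but the body here is an assumption for illustration, not the actual patch):

```java
// Hypothetical variant of AdminMetadataManager.metadataFetchDelayMs: even when
// an update has been explicitly requested, wait out refreshBackoffMs since the
// last fetch attempt instead of returning 0, so that repeated requestUpdate()
// calls cannot flood brokers with MetadataRequests.
public long metadataFetchDelayMs(long now) {
    long sinceLastAttempt = now - lastMetadataFetchAttemptMs;
    long backoffRemaining = Math.max(refreshBackoffMs - sinceLastAttempt, 0);
    if (state == State.UPDATE_REQUESTED)
        return backoffRemaining;
    // Otherwise refresh only once the current metadata is close to expiry.
    long untilExpiry = Math.max(metadataExpireMs - (now - lastMetadataUpdateMs), 0);
    return Math.max(untilExpiry, backoffRemaining);
}
```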

rajinisivaram (Contributor, Author)

Looks like this has already been fixed in trunk.

```java
// For calls which have not yet been sent, update the node to send to if
// metadata has been refreshed. This is to handle controller change and
// node disconnections.
if (metadataManager.updater().lastMetadataUpdateMs() > lastIterationTimeMs) {
```
lindong28 (Member)

Would it be more intuitive to put this logic in AdminClientRunnable.makeMetadataCall.handleResponse()? That would allow us to reorder calls in callsToSend only when the MetadataResponse has no error, and we wouldn't have to check the timestamp to know whether the metadata has been updated.

rajinisivaram (Contributor, Author)

@lindong28 Thanks for the review. Yes, makes sense. Updated.

```diff
@@ -1138,7 +1138,18 @@ private Call makeMetadataCall(long now) {
             @Override
             public void handleResponse(AbstractResponse abstractResponse) {
                 MetadataResponse response = (MetadataResponse) abstractResponse;
-                metadataManager.update(response.cluster(), time.milliseconds());
+                long now = time.milliseconds();
+                metadataManager.update(response.cluster(), now);
```
@hachikuji commented on May 24, 2018

This is a good find. What happens if the metadata request itself is queued up to be sent to a node which is no longer online? Will it be stuck in callsToSend until it times out? I am wondering if we should check NetworkClient.connectionFailed after every poll() for all requests in callsToSend and re-enqueue them, as we are doing here. This is what we do in ConsumerNetworkClient.
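
A minimal sketch of that check (assuming callsToSend maps each target node to its queued calls, as in KafkaAdminClient; this loop is hypothetical, not the merged code):

```java
// Hypothetical post-poll pass, in the spirit of ConsumerNetworkClient: any call
// still queued for a node whose connection has failed is moved back into
// pendingCalls, so that a new node is chosen on the next iteration.
for (Node node : new ArrayList<>(callsToSend.keySet())) {
    if (client.connectionFailed(node))
        pendingCalls.addAll(callsToSend.remove(node));
}
```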

rajinisivaram (Contributor, Author)

@hachikuji Thanks for the review. I think metadata requests are handled differently because they use LeastLoadedNodeProvider which assigns a node only if a ready node is available. If no ready node is available, then the call stays in pendingRequests and gets moved to callsToSend only when a node becomes ready after a subsequent poll. If a ready node is available, then it gets moved to callsToSend and is removed from callsToSend in the same iteration when the request is queued for send. If disconnection is processed in a subsequent poll, then the call is failed and retried using the retry path. We would never expect to see a metadata request in callsToSend at this point. Does that make sense?

hachikuji (Member)

Hmm, I'm not sure about that. The metadata request uses MetadataUpdateNodeIdProvider, which just calls leastLoadedNode directly. But the node chosen may not have an established connection, or am I missing something?
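
For reference, the provider in question is roughly this shape (a sketch based on the discussion above; the actual class is a private inner class of KafkaAdminClient and may differ in detail):

```java
// Sketch of MetadataUpdateNodeIdProvider as described above: it delegates
// directly to leastLoadedNode, which picks the node with the fewest in-flight
// requests but does not guarantee an already-established connection.
private class MetadataUpdateNodeIdProvider implements NodeProvider {
    @Override
    public Node provide() {
        return client.leastLoadedNode(time.milliseconds());
    }
}
```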

rajinisivaram (Contributor, Author)

@hachikuji Sorry, that was my mistake. I thought that leastLoadedNode returned ready nodes, but that is not the case. I have updated the code.

```java
 * @param disconnectedOnly Reassign only calls to nodes that were disconnected
 *                         in the last poll
 */
private void reassignUnsentCalls(long now, boolean disconnectedOnly) {
```

hachikuji (Member)

Now that we can use Java 8 lambdas, I wonder if we can do this with a Predicate?
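
Presumably something along these lines (a sketch of the Predicate-based refactor; the method body is an assumption, not the code that was merged):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical Predicate-based variant: the caller supplies the condition under
// which a node's unsent calls should be taken back, replacing the boolean flag.
private void reassignUnsentCalls(long now, Predicate<Node> shouldReassign) {
    Iterator<Map.Entry<Node, List<Call>>> iter = callsToSend.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<Node, List<Call>> entry = iter.next();
        if (shouldReassign.test(entry.getKey())) {
            // Re-enqueue the calls so a fresh node is chosen on the next pass.
            pendingCalls.addAll(entry.getValue());
            iter.remove();
        }
    }
}

// Call sites then read naturally:
//   reassignUnsentCalls(now, node -> true);              // metadata refreshed
//   reassignUnsentCalls(now, client::connectionFailed);  // after a disconnect
```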

@hachikuji left a comment

LGTM, thanks for the patch.

@hachikuji merged commit 3a8d3a7 into apache:trunk on May 29, 2018
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request on Jul 6, 2018

KAFKA-6916: Refresh metadata in admin client if broker connection fails (apache#5050)


Reviewers: Dong Lin <lindong28@gmail.com>, Jason Gustafson <jason@confluent.io>