KAFKA-6916: Refresh metadata in admin client if broker connection fails #5050

Merged: 3 commits into apache:trunk on May 29, 2018

Conversation

rajinisivaram (Contributor)

Refresh metadata if a broker connection fails, so that new calls are sent only to nodes that are alive and requests to the controller are sent to the new controller if the controller changes due to a broker failure. Also reassign calls that could not be sent.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@ijuma requested a review from @hachikuji on May 21, 2018 13:58
```diff
@@ -123,7 +123,11 @@ public void handleCompletedMetadataResponse(RequestHeader requestHeader, long no

     @Override
     public void requestUpdate() {
-        // Do nothing
+        AdminMetadataManager.this.requestUpdate();
```
lindong28 (Member)

It seems that AdminMetadataManager.metadataFetchDelayMs() always returns 0 if the state is UPDATE_REQUESTED. This patch can make it more likely for AdminClient to DDoS the broker with a lot of MetadataRequests. Would it be safer to respect refreshBackoffMs when the metadata update is explicitly requested?
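
For illustration, a minimal sketch of what respecting the backoff could look like (field, state, and method names follow AdminMetadataManager, but the body here is an assumption for illustration, not the actual patch):

```java
// Hypothetical variant of AdminMetadataManager.metadataFetchDelayMs: even when
// an update has been explicitly requested, wait out refreshBackoffMs since the
// last fetch attempt instead of returning 0, so that repeated requestUpdate()
// calls cannot flood brokers with MetadataRequests.
public long metadataFetchDelayMs(long now) {
    long sinceLastAttempt = now - lastMetadataFetchAttemptMs;
    long backoffRemaining = Math.max(refreshBackoffMs - sinceLastAttempt, 0);
    if (state == State.UPDATE_REQUESTED)
        return backoffRemaining;
    // Otherwise refresh only once the current metadata is close to expiry.
    long untilExpiry = Math.max(metadataExpireMs - (now - lastMetadataUpdateMs), 0);
    return Math.max(untilExpiry, backoffRemaining);
}
```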

rajinisivaram (Contributor, Author)

Looks like this has already been fixed in trunk.

```java
// For calls which have not yet been sent, update the node to send to if
// metadata has been refreshed. This is to handle controller change and
// node disconnections.
if (metadataManager.updater().lastMetadataUpdateMs() > lastIterationTimeMs) {
```
lindong28 (Member)

Would it be more intuitive to put this logic in AdminClientRunnable.makeMetadataCall.handleResponse()? That would allow us to reorder calls in callsToSend only when the MetadataResponse has no error, and we wouldn't have to check the timestamp to know whether the metadata has been updated.

rajinisivaram (Contributor, Author)

@lindong28 Thanks for the review. Yes, makes sense. Updated.

```diff
@@ -1138,7 +1138,18 @@ private Call makeMetadataCall(long now) {
             @Override
             public void handleResponse(AbstractResponse abstractResponse) {
                 MetadataResponse response = (MetadataResponse) abstractResponse;
-                metadataManager.update(response.cluster(), time.milliseconds());
+                long now = time.milliseconds();
+                metadataManager.update(response.cluster(), now);
```
@hachikuji commented on May 24, 2018

This is a good find. What happens if the metadata request itself is queued up to be sent to a node which is no longer online? Will it be stuck in callsToSend until it times out? I am wondering if we should check NetworkClient.connectionFailed after every poll() for all requests in callsToSend and re-enqueue them, as we are doing here. This is what we do in ConsumerNetworkClient.
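
A minimal sketch of that check (assuming callsToSend maps each target node to its queued calls, as in KafkaAdminClient; this loop is hypothetical, not the merged code):

```java
// Hypothetical post-poll pass, in the spirit of ConsumerNetworkClient: any call
// still queued for a node whose connection has failed is moved back into
// pendingCalls, so that a new node is chosen on the next iteration.
for (Node node : new ArrayList<>(callsToSend.keySet())) {
    if (client.connectionFailed(node))
        pendingCalls.addAll(callsToSend.remove(node));
}
```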

rajinisivaram (Contributor, Author)

@hachikuji Thanks for the review. I think metadata requests are handled differently because they use LeastLoadedNodeProvider which assigns a node only if a ready node is available. If no ready node is available, then the call stays in pendingRequests and gets moved to callsToSend only when a node becomes ready after a subsequent poll. If a ready node is available, then it gets moved to callsToSend and is removed from callsToSend in the same iteration when the request is queued for send. If disconnection is processed in a subsequent poll, then the call is failed and retried using the retry path. We would never expect to see a metadata request in callsToSend at this point. Does that make sense?

hachikuji (Member)

Hmm, I'm not sure about that. The metadata request uses MetadataUpdateNodeIdProvider, which just calls leastLoadedNode directly. But the node chosen may not have an established connection, or am I missing something?
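
For reference, the provider in question is roughly this shape (a sketch based on the discussion above; the actual class is a private inner class of KafkaAdminClient and may differ in detail):

```java
// Sketch of MetadataUpdateNodeIdProvider as described above: it delegates
// directly to leastLoadedNode, which picks the node with the fewest in-flight
// requests but does not guarantee an already-established connection.
private class MetadataUpdateNodeIdProvider implements NodeProvider {
    @Override
    public Node provide() {
        return client.leastLoadedNode(time.milliseconds());
    }
}
```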

rajinisivaram (Contributor, Author)

@hachikuji Sorry, that was my mistake. I thought that leastLoadedNode returned ready nodes, but that is not the case. I have updated the code.

```java
 * @param disconnectedOnly Reassign only calls to nodes that were disconnected
 *                         in the last poll
 */
private void reassignUnsentCalls(long now, boolean disconnectedOnly) {
```

hachikuji (Member)

Now that we can use Java 8 lambdas, I wonder if we can do this with a Predicate?
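
Presumably something along these lines (a sketch of the Predicate-based refactor; the method body is an assumption, not the code that was merged):

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical Predicate-based variant: the caller supplies the condition under
// which a node's unsent calls should be taken back, replacing the boolean flag.
private void reassignUnsentCalls(long now, Predicate<Node> shouldReassign) {
    Iterator<Map.Entry<Node, List<Call>>> iter = callsToSend.entrySet().iterator();
    while (iter.hasNext()) {
        Map.Entry<Node, List<Call>> entry = iter.next();
        if (shouldReassign.test(entry.getKey())) {
            // Re-enqueue the calls so a fresh node is chosen on the next pass.
            pendingCalls.addAll(entry.getValue());
            iter.remove();
        }
    }
}

// Call sites then read naturally:
//   reassignUnsentCalls(now, node -> true);              // metadata refreshed
//   reassignUnsentCalls(now, client::connectionFailed);  // after a disconnect
```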

@hachikuji left a comment

LGTM, thanks for the patch.

@hachikuji merged commit 3a8d3a7 into apache:trunk on May 29, 2018
ying-zheng pushed a commit to ying-zheng/kafka that referenced this pull request on Jul 6, 2018

KAFKA-6916: Refresh metadata in admin client if broker connection fails (apache#5050)


Reviewers: Dong Lin <lindong28@gmail.com>, Jason Gustafson <jason@confluent.io>