KAFKA-2677: ensure consumer sees coordinator disconnects #349
Conversation
Force-pushed from f7888ff to 7177352.
@ijuma @guozhangwang ping for review
One alternative is to only have the listener at the … The code compiles, although I commented out some test code, since I just meant to demonstrate the idea. Naming could probably be improved. I think this approach fits better with the NetworkClient design and, as Jason pointed out, we don't have to worry about semantics for the case where the listener invokes NetworkClient methods. Jason suggested that I should ask for @guozhangwang's input.
+1 on @ijuma's suggestion. I like this idea since it keeps the NetworkClient API "in front of you," so to speak. I wasn't sure listener semantics were really something we want to introduce to NetworkClient, since it is so widely used in the codebase.
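For concreteness, the listener idea being debated might look something like the sketch below. Every name here (`DisconnectListener`, `ListenableClient`, `addDisconnectListener`) is hypothetical and illustrative, not taken from the patch:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the listener idea under discussion; the actual
// patch may wire this up differently.
interface DisconnectListener {
    // Invoked when the connection to the given node id is closed.
    void onDisconnect(String nodeId);
}

class ListenableClient {
    private final List<DisconnectListener> listeners = new ArrayList<>();

    void addDisconnectListener(DisconnectListener listener) {
        listeners.add(listener);
    }

    // Called from the poll loop for each node the selector reported as disconnected.
    void notifyDisconnect(String nodeId) {
        for (DisconnectListener listener : listeners)
            listener.onDisconnect(nodeId);
    }
}
```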
@ijuma @hachikuji Just another idea I had before: we can check "coordinator alive" on each poll by calling connectionFailed() on the coordinator's connection.
@guozhangwang That might work. The tricky thing about using connectionFailed() is that the disconnected state will persist until it is cleared with a call to ready() (after the backoff time has elapsed). But maybe we could solve that by always calling ready() when we discover the coordinator in the group metadata response handler. |
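A rough sketch of that combination, assuming fields (`coordinator`, `client`, `time`) and methods (`connectionFailed`, `coordinatorDead`, `ready`) along the lines named in this discussion, not necessarily the final patch:

```java
// Sketch only: surface selector-level disconnects so the coordinator is
// marked unknown and rediscovered.
private void checkCoordinatorConnection() {
    if (coordinator != null && client.connectionFailed(coordinator))
        coordinatorDead();
}

// In the group metadata response handler, after (re)discovering the
// coordinator: ready() clears the lingering DISCONNECTED state and initiates
// a reconnect, while still honoring the reconnect backoff.
private void onCoordinatorDiscovered(Node coordinator) {
    client.ready(coordinator, time.milliseconds());
}
```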
@guozhangwang I think the tradeoff with this approach is that we'll only detect disconnects after we try to send something, which means detection will typically be limited by the heartbeat duration. That might be a reasonable trade since the approach is simpler. |
@hachikuji I think (a) we need to respect the connection backoff even for the coordinator, which means that if the coordinator has disconnected and we discover it is still the coordinator, we still need to wait until the backoff has elapsed before reconnecting; and (b) since disconnects are only detected inside the selector, all of these approaches are limited by the selector's behavior: we can only detect a disconnection when we select on that channel's key.
@guozhangwang I think that suggestion works. Have a look and let me know what you think.
@@ -63,6 +63,10 @@ public void close(String id) {
         }
     }

+    public void disconnect(String id) {
Is this used?
Good catch.
Force-pushed from 7f4c587 to 863e09a.
@@ -240,7 +240,7 @@
                 FETCH_MAX_WAIT_MS_DOC)
         .define(RECONNECT_BACKOFF_MS_CONFIG,
                 Type.LONG,
-                50L,
+                500L,
For my benefit, why are we increasing the reconnect backoff?
I'll revert this change since I didn't mean to commit it. 50ms seems a little low, but I'm not sure what a reasonable value would be.
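For what it's worth, users can tune this independently of the default via the standard `reconnect.backoff.ms` consumer config; a minimal sketch:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "my-group");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// Back off 500 ms after a connection failure before attempting to reconnect.
props.put("reconnect.backoff.ms", "500");
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
```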
@@ -464,6 +464,8 @@ private void handleGroupMetadataResponse(ClientResponse resp, RequestFuture<Void
                 groupMetadataResponse.node().host(),
                 groupMetadataResponse.node().port());
+
+        client.tryConnect(coordinator);
Is this just for saving one more poll turn-around?
It's to ensure that the connection doesn't stay in the DISCONNECTED state indefinitely. The only way to break out of that state is to call NetworkClient.ready() (tryConnect delegates to NetworkClient.ready()). Otherwise, the coordinatorUnknown() check in ensureCoordinatorKnown() would always return true after the connection had failed.
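In other words, tryConnect is essentially a thin wrapper over ready(); a sketch of the delegation described above, not the literal patch:

```java
// Sketch: initiate a connection to the node if one is needed. ready() is a
// no-op when the node is already connected and it respects the reconnect
// backoff, so calling it here cannot hammer a dead coordinator; it simply
// clears the lingering DISCONNECTED state.
public void tryConnect(Node node) {
    client.ready(node, time.milliseconds());
}
```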
LGTM except one minor comment about adding comments on …
If the fetch response has no data, then the log append currently fails with an `Append failed unexpectedly` error. The problem is that there is no start offset for an empty append. This patch fixes the problem by adding a check in the response handler and skipping the append if the record set is empty. We also formally make empty appends invalid in the API and add some testing for this.
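A minimal sketch of the described guard, with assumed names (`partitionData`, `log`) for the handler-side variables:

```java
// Sketch: skip the append when the fetched record set is empty. An empty
// append has no start offset, which is what previously surfaced as the
// "Append failed unexpectedly" error.
MemoryRecords records = partitionData.records;
if (records.sizeInBytes() > 0)
    log.append(records);
// else: nothing was fetched for this partition, so there is nothing to append
```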