# Fixes outdated topology when no new leader is assigned #5979

## Conversation
One note: I think the fix is actually incomplete. We might need to propagate the terms on follower updates as well, to make sure that we aren't removing a leader with a newer term than the follower event we just received. To check with Deepthi or Miguel.
Thanks. Looks good. Just a small comment. 👍
Just for future reference, as we already discussed: updates from the same node are guaranteed to be delivered in order by our gossip. Hence we don't have to worry about receiving an update from a broker saying it is the follower in a previous term after the update naming it leader for a newer term.
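To make the concern concrete, here is a minimal sketch of what term-aware demotion could look like. All names here are hypothetical illustrations, not Zeebe's actual topology manager API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Zeebe's actual code: demote a recorded leader only
// when the follower event is from the same node and not from an older term.
final class PartitionLeaderTracker {
  record PartitionLeader(int nodeId, long term) {}

  private final Map<Integer, PartitionLeader> leaders = new HashMap<>();

  void onFollowerUpdate(final int partitionId, final int nodeId, final long term) {
    final PartitionLeader current = leaders.get(partitionId);
    // A stale follower event (older term) must not remove a leader that has
    // since been re-elected with a newer term.
    if (current != null && current.nodeId() == nodeId && term >= current.term()) {
      leaders.remove(partitionId);
    }
  }
}
```

With per-node ordering guaranteed by gossip, the term comparison is redundant for events from the same node, which is the argument above.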
I tightened the conditions a little and fixed some things in
```diff
@@ -30,7 +32,7 @@ public final TopologyAssert isComplete(final int clusterSize, final int partitio
   final List<BrokerInfo> brokers = actual.getBrokers();

   if (brokers.size() != clusterSize) {
-    failWithMessage("Expected broker count to be <%s> but was <%s>", clusterSize, brokers.size());
+    throw failure("Expected broker count to be <%s> but was <%s>", clusterSize, brokers.size());
```
The javadoc from `failWithMessage` actually recommends using `throw failure(...)` instead, as the compiler can then realize we're throwing an error (whereas with `failWithMessage` it thinks execution will continue).
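For illustration, here is a minimal sketch of that pattern in a custom AssertJ assert, assuming an AssertJ version where `failure(...)` is available (per the backport note further down, 3.17 does not have it):

```java
import org.assertj.core.api.AbstractAssert;

// Minimal sketch of the throw-failure pattern in a custom assert.
final class BrokerCountAssert extends AbstractAssert<BrokerCountAssert, Integer> {
  BrokerCountAssert(final Integer actual) {
    super(actual, BrokerCountAssert.class);
  }

  BrokerCountAssert hasCount(final int expected) {
    isNotNull();
    if (actual != expected) {
      // failure(...) returns the AssertionError instead of throwing it
      // internally, so the compiler sees this branch cannot fall through.
      throw failure("Expected broker count to be <%s> but was <%s>", expected, actual);
    }
    return this;
  }
}
```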
👍 Thanks.
Force-pushed from 01dddc2 to 11a295a.
bors r+
5979: Fixes outdated topology when no new leader is assigned r=npepinpe a=npepinpe

## Description

This PR fixes a bug in the gateway topology. The topology manager keeps track of who is leader and follower for each partition. This information is gossiped by all nodes in the cluster. Normally, when a node which was leader for partition 1 sends that it is now follower, another node will send that it is leader. There's an edge case, however, when no other node sends that it is the leader. In this case, we end up with a topology where a node is both leader and follower. This means that we report the wrong topology and that the gateway will keep trying to route requests to the node. The case where no new node becomes leader can happen due to network partitioning, for example, and is an expected case we should be able to tolerate.

This PR adds more test coverage and fixes the issue by removing the old partition leader if, when adding a new follower, they have the same ID.

## Related issues

closes #2501

## Definition of Done

_Not all items need to be done depending on the issue and the pull request._

Code changes:
* [x] The changes are backwards compatible with previous versions
* [x] If it fixes a bug then PRs are created to [backport](https://github.com/zeebe-io/zeebe/compare/stable/0.24...develop?expand=1&template=backport_template.md&title=[Backport%200.24]) the fix to the last two minor versions. You can trigger a backport by assigning labels (e.g. `backport stable/0.25`) to the PR; in case that fails you need to create the backports manually.

Testing:
* [x] There are unit/integration tests that verify all acceptance criteria of the issue
* [x] New tests are written to ensure backwards compatibility with further versions
* [ ] The behavior is tested manually
* [ ] The impact of the changes is verified by a benchmark

Documentation:
* [ ] The documentation is updated (e.g. BPMN reference, configuration, examples, get-started guides, etc.)
* [ ] New content is added to the [release announcement](https://drive.google.com/drive/u/0/folders/1DTIeswnEEq-NggJ25rm2BsDjcCQpDape)

Co-authored-by: Nicolas Pépin-Perreault <nicolas.pepin-perreault@camunda.com>
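The fix described above boils down to roughly this shape; the class and member names are illustrative stand-ins, not the actual gateway topology manager code:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of the described fix, not the actual gateway code.
final class PartitionTopologySketch {
  private final Map<Integer, Integer> leaders = new HashMap<>();
  private final Map<Integer, Set<Integer>> followers = new HashMap<>();

  void addFollower(final int partitionId, final int nodeId) {
    followers.computeIfAbsent(partitionId, id -> new HashSet<>()).add(nodeId);

    // If the node we just recorded as follower is still recorded as leader
    // (because no other node has claimed leadership), remove it so the
    // gateway stops routing requests to it.
    final Integer leader = leaders.get(partitionId);
    if (leader != null && leader == nodeId) {
      leaders.remove(partitionId);
    }
  }
}
```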
Build failed:
Looking at the failed container logs, it looks like we broke backwards compatibility. We were not catching this in the rolling update test before because we did not check in between that a leader was elected, just that the node was removed from/added to the topology. In this instance, it fails (sometimes, and I'm not sure why) because node 0 is up (and updated), node 1 is down, and node 2 is up (but outdated). Node 0 then prints a Kryo error saying it cannot deserialize something, and node 2 gets connection timeouts from node 0 while trying to get elected (I can see it switches to candidate, but it can never become leader).

However, I don't get why the test is flaky... I would expect, if we broke backwards compatibility with serialization, that it always fails. Maybe it depends on who was previously leader? If Zeebe 2 was already leader, maybe it doesn't matter.

Logs from node 0:

Logs from node 2:
This was an unintended side effect, and it looks like by adding the condition we may have caught an unexpected break in our rolling update. I would like to keep the condition, but the fix for it might go into another PR, so we'd need to extract the assert logic and the fix for this test into a different PR before merging this. @MiguelPires could this be related to the checksum stuff? I can't think of anything else we did, but of course it's possible we broke something else.
I think I understand the issue: VersionFieldSerializer allows newer versions to read previously written data (i.e. they can receive messages from the older nodes), but it cannot read new fields. So the older nodes cannot read data from the newer nodes, and they don't ignore the unknown fields either (why not? Good question; it seems like an easy thing to do, just skip the field if its version is higher than what you know).

Can this cause issues during updates? When we update one node, it can receive messages from the other two, and will probably not be leader. When we update the second node, the first updated node could become leader (which we see here), which will cause issues with the older node. The two updated nodes should be able to work together, but our fault tolerance guarantees are lowered, I guess, since the older node is now "useless" until it's updated.

I don't see an easy solution here. The only thing I can think of is postponing adding checksums to 0.27, as we will most likely be breaking backwards compatibility with the new workflow engine anyway; at that point we can change how we do serialization and ignore the issue. Let me know what you think.
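For context, a minimal sketch of how Kryo's `VersionFieldSerializer` behaves; the message type and field names below are made up for illustration:

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.serializers.VersionFieldSerializer;
import com.esotericsoftware.kryo.serializers.VersionFieldSerializer.Since;

public class VersionedSerializationSketch {

  // Hypothetical message type, not Zeebe's actual class.
  public static class PartitionStatus {
    int partitionId;
    int role;

    // Field added in a newer release. A newer reader tolerates its absence in
    // old data (fields above the written version are skipped), but an older
    // reader that sees a higher version on the wire fails instead of skipping
    // the unknown field.
    @Since(1)
    long checksum;
  }

  public static void main(final String[] args) {
    final Kryo kryo = new Kryo();
    kryo.register(
        PartitionStatus.class,
        new VersionFieldSerializer<PartitionStatus>(kryo, PartitionStatus.class));
  }
}
```

This matches the failure mode described above: compatibility is one-directional, newer readers of older data only.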
- fixes an issue in the gateway topology when the old leader becomes follower and no new node is elected leader yet, by removing the new follower from the leader position if it's still identified as the leader
- `TopologyAssert#isComplete` now also checks that all partitions have a leader (see the usage sketch below)
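A possible usage sketch for the strengthened assertion, assuming `TopologyAssert.assertThat` as the entry point and Awaitility for polling; the client settings and the assert's package are assumptions, not confirmed details:

```java
import static org.awaitility.Awaitility.await;

import io.zeebe.client.ZeebeClient;
import io.zeebe.test.util.asserts.TopologyAssert; // assumed package location

public class TopologyCompletenessCheck {
  public static void main(final String[] args) {
    try (final ZeebeClient client = ZeebeClient.newClientBuilder().usePlaintext().build()) {
      // Poll until all 3 brokers are in the topology and every partition has
      // a leader, per the strengthened isComplete check.
      await()
          .untilAsserted(
              () ->
                  TopologyAssert.assertThat(client.newTopologyRequest().send().join())
                      .isComplete(3, 3));
    }
  }
}
```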
Force-pushed from 11a295a to 87cddba.
bors r+
Build succeeded:
The process '/home/runner/work/_actions/zeebe-io/backport-action/master/backport.sh' failed with exit code 4
6011: [Backport stable/0.25] Fixes outdated topology when no new leader is assigned r=npepinpe a=npepinpe

## Description

Backport of #5979 to `stable/0.25`. There were some minor conflicts, and I had to bump the AssertJ version as `failure` did not exist in 3.17.

Co-authored-by: Nicolas Pépin-Perreault <nicolas.pepin-perreault@camunda.com>