When the loadmanager leader is not available, fall through regular least loaded selection #3688

Merged
merlimat merged 2 commits into apache:master from merlimat:master
Mar 1, 2019

@merlimat
Contributor

Motivation

Under certain conditions, topic failover can take ~30 seconds even when doing a graceful broker shutdown.

This happens because of a race condition when the load-manager leader is being shut down. Since the ephemeral z-node for the leader election is not explicitly deleted, in some cases it might hang around until the old ZK session expires.

The error that gets printed in brokers is:

00:07:47.874 [pulsar-client-io-41-1] WARN  org.apache.pulsar.client.impl.BinaryProtoLookupService - [persistent://system/functions-prod/assignments] failed to send lookup request : org.apache.pulsar.client.api.PulsarClientException$LookupException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /loadbalance/brokers/prod-broker-1.prod-broker.default.svc.cluster.local:8080
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$LookupException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /loadbalance/brokers/prod-broker-1.prod-broker.default.svc.cluster.local:8080
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:647) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632) ~[?:1.8.0_181]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) [?:1.8.0_181]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) [?:1.8.0_181]
	at org.apache.pulsar.client.impl.ClientCnx.handleLookupResponse(ClientCnx.java:401) [org.apache.pulsar-pulsar-client-original-2.3.0-streamlio-14.jar:2.3.0-streamlio-14]
	at org.apache.pulsar.common.api.PulsarDecoder.channelRead(PulsarDecoder.java:118) [org.apache.pulsar-pulsar-common-2.3.0-streamlio-14.jar:2.3.0-streamlio-14]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:799) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:433) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:330) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

The reason for the error is:

  • The broker that is the load-manager leader is doing a graceful shutdown
  • Clients reconnect almost immediately to a new broker
  • The new broker still thinks the old broker is the leader and redirects lookup requests to it
  • When the leader z-node is finally cleared, the lookups are unblocked and everything goes back to normal

The solution here is twofold:

  1. Proactively clean up the leader z-node on shutdown, to avoid waiting for the session timeout in case the session doesn't get cleaned up properly
  2. Double-check that the load-manager leader is active before trying to forward lookup requests to it.
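
As a rough illustration of the two points above, here is a minimal, self-contained sketch. It is not the code from this PR: the class `LeaderAwareLookup`, its methods, and the in-memory map standing in for the ZooKeeper nodes are all hypothetical, and the real broker redirects the lookup to the leader rather than simply returning its name.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model, not Pulsar code: an in-memory stand-in for the
// broker registration nodes and the leader-election z-node.
public class LeaderAwareLookup {
    // Stand-in for /loadbalance/brokers: broker name -> current load
    private final Map<String, Double> activeBrokers = new ConcurrentHashMap<>();
    // Stand-in for the ephemeral leader z-node (null means "no leader node")
    private volatile String leader;

    public void register(String broker, double load) {
        activeBrokers.put(broker, load);
    }

    public void electLeader(String broker) {
        leader = broker;
    }

    // Graceful shutdown. With deleteLeaderNode=false the stale leader entry
    // survives, which models waiting for the old session to expire.
    public void shutdown(String broker, boolean deleteLeaderNode) {
        activeBrokers.remove(broker);
        if (deleteLeaderNode && broker.equals(leader)) {
            leader = null; // point 1: clean up the leader node proactively
        }
    }

    // Regular least-loaded selection among the brokers that are still active.
    public Optional<String> leastLoaded() {
        return activeBrokers.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }

    // Point 2: only use the leader if it is still registered as active,
    // otherwise fall through to the least-loaded selection.
    public Optional<String> selectBrokerForLookup() {
        String currentLeader = leader;
        if (currentLeader != null && activeBrokers.containsKey(currentLeader)) {
            return Optional.of(currentLeader);
        }
        return leastLoaded();
    }

    public static void main(String[] args) {
        LeaderAwareLookup lookup = new LeaderAwareLookup();
        lookup.register("broker-1", 0.2);
        lookup.register("broker-2", 0.5);
        lookup.register("broker-3", 0.1);
        lookup.electLeader("broker-1");
        System.out.println(lookup.selectBrokerForLookup()); // leader is active

        // The leader goes away but its leader node is left behind.
        lookup.shutdown("broker-1", false);
        System.out.println(lookup.selectBrokerForLookup()); // falls through
    }
}
```

Without the active-broker check in `selectBrokerForLookup`, the second call would keep pointing at the departed `broker-1` until the stale leader entry disappeared, which is the delay described in the motivation.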

@merlimat merlimat added the type/bug The PR fixed a bug or issue reported a bug label Feb 26, 2019
@merlimat merlimat added this to the 2.3.1 milestone Feb 26, 2019
@merlimat merlimat self-assigned this Feb 26, 2019
@merlimat
Contributor Author

run java8 tests
run integration tests

@jiazhai
Member

jiazhai commented Feb 28, 2019

run java8 tests

@merlimat
Contributor Author

run java8 tests

@merlimat merlimat merged commit ccfb949 into apache:master Mar 1, 2019
merlimat added a commit that referenced this pull request Mar 29, 2019
…ast loaded selection (#3688)

* When the loadmanager leader is not available, fall through regular least loaded selection

* Handle exceptions coming from mock zk in tests
@merlimat
Contributor Author

merlimat commented Apr 1, 2019

Merged in 2.3.1 at 5746db9

Labels

type/bug The PR fixed a bug or issue reported a bug

3 participants