[broker]Handle BadVersionException thrown by updateSchemaLocator() #6683

hrsakai · 2020-04-07T08:32:41Z

putShema() still throws KeeperException.BadVersionException even after #5563 was merged when health check api is requested.

After GC, health check api returns ok but PutShema() throws KeeperException.BadVersionException.
( By #6577, the client reconnects when getting the exception and so health check api returns ok . )

curl -v http://127.0.0.1:8080/admin/v2/brokers/health
ok

# Waiting for more than 60 seconds, waiting for the topic pulsar/standalone/127.0.0.1:8080/healthcheck to be deleted

# Api returns ok, but putShema() throws KeeperException.BadVersionException.
curl -v http://127.0.0.1:8080/admin/v2/brokers/health
ok

13:47:37.286 [pulsar-ordered-OrderedExecutor-0-0-EventThread] WARN  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:54360][persistent://pulsar/standalone/localhost:8080/healthcheck][reader-27cd902283] Failed to create consumer: KeeperErrorCode = BadVersion
java.util.concurrent.CompletionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_202]
	at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$updateSchemaLocator$24(BookkeeperSchemaStorage.java:427) ~[pulsar-broker.jar:2.6.0-SNAPSHOT]
	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$22$1.processResult(ZooKeeperClient.java:1091) [bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592) [zookeeper-3.5.7.jar:3.5.7]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) [zookeeper-3.5.7.jar:3.5.7]
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:122) ~[zookeeper-3.5.7.jar:3.5.7]
	... 4 more
13:47:37.288 [pulsar-client-io-92-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x8625b158, L:/127.0.0.1:54360 - R:localhost/127.0.0.1:6650] Received error from server: KeeperErrorCode = BadVersion
13:47:37.288 [pulsar-client-io-92-1] WARN  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/standalone/localhost:8080/healthcheck][reader-27cd902283] Failed to subscribe to topic on localhost/127.0.0.1:6650
13:47:37.288 [pulsar-client-io-92-1] WARN  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/standalone/localhost:8080/healthcheck] [reader-27cd902283] Could not get connection to broker: KeeperErrorCode = BadVersion -- Will try again in 0.1 s
13:47:37.389 [pulsar-timer-94-1] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/standalone/localhost:8080/healthcheck] [reader-27cd902283] Reconnecting after connection was closed
13:47:37.390 [pulsar-client-io-92-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/standalone/localhost:8080/healthcheck][reader-27cd902283] Subscribing to topic on cnx [id: 0x8625b158, L:/127.0.0.1:54360 - R:localhost/127.0.0.1:6650]
13:47:37.391 [pulsar-io-51-3] INFO  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:54360] Subscribing on topic persistent://pulsar/standalone/localhost:8080/healthcheck / reader-27cd902283

…tor()

sijie · 2020-04-07T22:21:51Z

/pulsarbot run-failure-checks

hrsakai · 2020-04-08T04:54:57Z

/pulsarbot run-failure-checks

hrsakai · 2020-04-08T06:35:50Z

/pulsarbot run-failure-checks

cdbartholomew · 2020-04-10T00:50:45Z

I am seeing this exact issue while testing the 2.5.1 candidate load:

curl localhost:8080/admin/v2/brokers/health

 --- An unexpected error occurred in the server ---

Message: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion

Stacktrace:

java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
	at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1423)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
	at org.apache.pulsar.client.impl.ProducerImpl.lambda$connectionOpened$14(ProducerImpl.java:1205)
	at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
	at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
	at org.apache.pulsar.client.impl.ClientCnx.handleError(ClientCnx.java:609)
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:171)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:321)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:295)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at org.apache.pulsar.client.impl.ClientCnx.getPulsarClientException(ClientCnx.java:1005)
	... 21 more

It happens every time the check is run and the topic doesn't already exist. So, the first time it fails like this and then it will work until the topic is deleted, which will happen automatically after a while with default broker settings.

I think it would be a good idea to cherry-pick this fix to 2.5.1 since it makes it look like the broker is broken when the user requests a heartbeat in Pulsar Manager. This is also a regression compared to 2.5.0 which doesn't have this issue.

cdbartholomew · 2020-04-10T09:27:59Z

This problem is worse with Docker images built using the 2.5.1 candidate. The health check fails every time, not just the first time. It's going to look really bad in Pulsar Manager.

@tuteng @sijie I think this fix should be added to the 2.5.1 release or the PR that caused the behavior to change from 2.5.0 should be pulled out.

sijie · 2020-04-10T10:32:08Z

@cdbartholomew thank you for reporting this. since rc2 is canceled to include AVRO related changes, we will some other issues that were labeled for 2.5.2 (including this one) in the new RC.

sijie · 2020-04-10T10:33:45Z

Just to clarify - I don't think there are any regressions between 2.5.0 and 2.5.1. The problem exists in 2.5.0. #5563 only addressed some problems.

cdbartholomew · 2020-04-10T10:35:01Z

The health check problem does not exist in 2.5.0, so there is definitely a regression here.

sijie · 2020-04-10T10:44:23Z

@cdbartholomew I have just reproduced the issue in 2.5.0. thanks

[sijie@sijies-mbp apache-pulsar-2.5.0]$ clear
[sijie@sijies-mbp apache-pulsar-2.5.0]$ curl localhost:8080/admin/v2/brokers/health
ok[sijie@sijies-mbp apache-pulsar-2.5.0]$ curl localhost:8080/admin/v2/brokers/health

 --- An unexpected error occurred in the server ---

Message: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion

Stacktrace:

java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
	at java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.pulsar.client.impl.ProducerImpl.lambda$connectionOpened$13(ProducerImpl.java:1176)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.pulsar.client.impl.ClientCnx.handleError(ClientCnx.java:606)
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:171)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:326)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:300)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at org.apache.pulsar.client.impl.ClientCnx.getPulsarClientException(ClientCnx.java:1002)
	... 23 more

cdbartholomew · 2020-04-10T10:47:53Z

@sijie Let me clarify. The fact that the health check never works when running with Docker images built from the 2.5.1 candidate is a regression. I assumed that it was related to this issue, but it is not.

I just built new Docker images with this fix (v2.5.1-candidate-2 + 6683), and it doesn't resolve that issue. So, there is a different issue here. I will test with master to see if the health check works there.

sijie · 2020-04-10T11:00:34Z

The fact that the health check never works when running with Docker images built from the 2.5.1 candidate is a regression.

I don't understand "never" here. The current buggy behavior is - it works when the topic exists, and it will fail the first time when the topic was deleted due to inactivity.

I have just verified that branch-2.5 + this fix has fixed the buggy behavior described above.

$ curl localhost:8080/admin/v2/brokers/health
ok[sijie@sijies-mbp pulsar ]$ curl localhost:8080/admin/v2/brokers/health
ok[sijie@sijies-mbp pulsar ]$ curl localhost:8080/admin/v2/brokers/health
ok

If you are seeing health check failure using Docker Images, I don't think it is a problem of the code. It sounds more like a problem of your docker image. Because the broker code shouldn't have any difference running locally than running in a docker.

cdbartholomew · 2020-04-10T11:04:33Z

This is the error:

curl localhost:8080/admin/v2/brokers/health

 --- An unexpected error occurred in the server ---

Message: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 5 lookup request timedout after ms 30000

Stacktrace:

java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 5 lookup request timedout after ms 30000
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
	at java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.pulsar.client.impl.ProducerImpl.lambda$connectionOpened$14(ProducerImpl.java:1205)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1026)
	at org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:176)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 5 lookup request timedout after ms 30000
	at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1025)
	... 10 more

This is when calling the endpoint from within the broker container.

sijie · 2020-04-10T11:08:35Z

5 lookup request timed out after ms 30000 is a different error than the problem we discussed here. The lookup request error indicates that the broker is not able to find the topic owner of the health check topic. Sounds like some network issues.

cdbartholomew · 2020-04-10T11:09:13Z

I am building the docker images using:

git checkout v2.5.1-candidate-2
mvn install -DskipTests
cd docker
./build.sh

Has anyone else tested images built from this tag?

tuteng · 2020-04-13T01:26:18Z

Add label release-2.5.1

…tor() (#6683) Co-authored-by: Sijie Guo <sijie@apache.org> (cherry picked from commit cf045e4)

…tor() (apache#6683) Co-authored-by: Sijie Guo <sijie@apache.org>

…tor() (apache#6683) Co-authored-by: Sijie Guo <sijie@apache.org>(cherry picked from commit cf045e4)

…tor() (apache#6683) Co-authored-by: Sijie Guo <sijie@apache.org>

Handle KeeperException.BadVersionException thrown by updateSchemaLoca…

ef674d7

…tor()

hrsakai added the area/broker label Apr 7, 2020

hrsakai added this to the 2.6.0 milestone Apr 7, 2020

hrsakai self-assigned this Apr 7, 2020

hrsakai added the type/bug The PR fixed a bug or issue reported a bug label Apr 7, 2020

sijie approved these changes Apr 7, 2020

View reviewed changes

jiazhai approved these changes Apr 7, 2020

View reviewed changes

Merge remote-tracking branch 'apache/master' into fix-schema-bad-version

e532a5b

merlimat approved these changes Apr 8, 2020

View reviewed changes

merlimat merged commit cf045e4 into apache:master Apr 8, 2020

hrsakai deleted the fix-schema-bad-version branch April 8, 2020 23:59

This was referenced Apr 9, 2020

zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion Stacktrace: java.util.concurrent.CompletionException: apache/pulsar-manager#259

Open

Broker health check throws exception badversion #6229

Closed

tuteng added the release/2.5.1 label Apr 13, 2020

tuteng pushed a commit that referenced this pull request Apr 13, 2020

Handle KeeperException.BadVersionException thrown by updateSchemaLoca…

0319552

…tor() (#6683) Co-authored-by: Sijie Guo <sijie@apache.org> (cherry picked from commit cf045e4)

merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 21, 2020

Handle KeeperException.BadVersionException thrown by updateSchemaLoca…

061e3c7

…tor() (apache#6683) Co-authored-by: Sijie Guo <sijie@apache.org>

jiazhai pushed a commit to jiazhai/pulsar that referenced this pull request May 18, 2020

Handle KeeperException.BadVersionException thrown by updateSchemaLoca…

3c4c2fc

…tor() (apache#6683) Co-authored-by: Sijie Guo <sijie@apache.org>(cherry picked from commit cf045e4)

huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020

Handle KeeperException.BadVersionException thrown by updateSchemaLoca…

7b1e41c

…tor() (apache#6683) Co-authored-by: Sijie Guo <sijie@apache.org>

BewareMyPower mentioned this pull request Oct 14, 2020

[WIP]Record unused schema ledgers to ZK for partitioned topics #8258

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[broker]Handle BadVersionException thrown by updateSchemaLocator() #6683

[broker]Handle BadVersionException thrown by updateSchemaLocator() #6683

hrsakai commented Apr 7, 2020

sijie commented Apr 7, 2020

hrsakai commented Apr 8, 2020

hrsakai commented Apr 8, 2020

cdbartholomew commented Apr 10, 2020

cdbartholomew commented Apr 10, 2020

sijie commented Apr 10, 2020

sijie commented Apr 10, 2020

cdbartholomew commented Apr 10, 2020

sijie commented Apr 10, 2020

cdbartholomew commented Apr 10, 2020

sijie commented Apr 10, 2020 •

edited

Loading

cdbartholomew commented Apr 10, 2020 •

edited

Loading

sijie commented Apr 10, 2020

cdbartholomew commented Apr 10, 2020

tuteng commented Apr 13, 2020

[broker]Handle BadVersionException thrown by updateSchemaLocator() #6683

[broker]Handle BadVersionException thrown by updateSchemaLocator() #6683

Conversation

hrsakai commented Apr 7, 2020

sijie commented Apr 7, 2020

hrsakai commented Apr 8, 2020

hrsakai commented Apr 8, 2020

cdbartholomew commented Apr 10, 2020

cdbartholomew commented Apr 10, 2020

sijie commented Apr 10, 2020

sijie commented Apr 10, 2020

cdbartholomew commented Apr 10, 2020

sijie commented Apr 10, 2020

cdbartholomew commented Apr 10, 2020

sijie commented Apr 10, 2020 • edited Loading

cdbartholomew commented Apr 10, 2020 • edited Loading

sijie commented Apr 10, 2020

cdbartholomew commented Apr 10, 2020

tuteng commented Apr 13, 2020

sijie commented Apr 10, 2020 •

edited

Loading

cdbartholomew commented Apr 10, 2020 •

edited

Loading