Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[broker]Handle BadVersionException thrown by updateSchemaLocator() #6683

Merged
merged 2 commits into from
Apr 8, 2020

Conversation

hrsakai
Copy link
Contributor

@hrsakai hrsakai commented Apr 7, 2020

putShema() still throws KeeperException.BadVersionException even after #5563 was merged when health check api is requested.

After GC, health check api returns ok but PutShema() throws KeeperException.BadVersionException.
( By #6577, the client reconnects when getting the exception and so health check api returns ok . )

curl -v http://127.0.0.1:8080/admin/v2/brokers/health
ok

# Waiting for more than 60 seconds, waiting for the topic pulsar/standalone/127.0.0.1:8080/healthcheck to be deleted

# Api returns ok, but putShema() throws KeeperException.BadVersionException.
curl -v http://127.0.0.1:8080/admin/v2/brokers/health
ok
13:47:37.286 [pulsar-ordered-OrderedExecutor-0-0-EventThread] WARN  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:54360][persistent://pulsar/standalone/localhost:8080/healthcheck][reader-27cd902283] Failed to create consumer: KeeperErrorCode = BadVersion
java.util.concurrent.CompletionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_202]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_202]
	at org.apache.pulsar.broker.service.schema.BookkeeperSchemaStorage.lambda$updateSchemaLocator$24(BookkeeperSchemaStorage.java:427) ~[pulsar-broker.jar:2.6.0-SNAPSHOT]
	at org.apache.bookkeeper.zookeeper.ZooKeeperClient$22$1.processResult(ZooKeeperClient.java:1091) [bookkeeper-server-4.10.0.jar:4.10.0]
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:592) [zookeeper-3.5.7.jar:3.5.7]
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510) [zookeeper-3.5.7.jar:3.5.7]
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:122) ~[zookeeper-3.5.7.jar:3.5.7]
	... 4 more
13:47:37.288 [pulsar-client-io-92-1] WARN  org.apache.pulsar.client.impl.ClientCnx - [id: 0x8625b158, L:/127.0.0.1:54360 - R:localhost/127.0.0.1:6650] Received error from server: KeeperErrorCode = BadVersion
13:47:37.288 [pulsar-client-io-92-1] WARN  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/standalone/localhost:8080/healthcheck][reader-27cd902283] Failed to subscribe to topic on localhost/127.0.0.1:6650
13:47:37.288 [pulsar-client-io-92-1] WARN  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/standalone/localhost:8080/healthcheck] [reader-27cd902283] Could not get connection to broker: KeeperErrorCode = BadVersion -- Will try again in 0.1 s
13:47:37.389 [pulsar-timer-94-1] INFO  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://pulsar/standalone/localhost:8080/healthcheck] [reader-27cd902283] Reconnecting after connection was closed
13:47:37.390 [pulsar-client-io-92-1] INFO  org.apache.pulsar.client.impl.ConsumerImpl - [persistent://pulsar/standalone/localhost:8080/healthcheck][reader-27cd902283] Subscribing to topic on cnx [id: 0x8625b158, L:/127.0.0.1:54360 - R:localhost/127.0.0.1:6650]
13:47:37.391 [pulsar-io-51-3] INFO  org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:54360] Subscribing on topic persistent://pulsar/standalone/localhost:8080/healthcheck / reader-27cd902283

@hrsakai hrsakai added this to the 2.6.0 milestone Apr 7, 2020
@hrsakai hrsakai self-assigned this Apr 7, 2020
@hrsakai hrsakai added the type/bug The PR fixed a bug or issue reported a bug label Apr 7, 2020
@sijie
Copy link
Member

sijie commented Apr 7, 2020

/pulsarbot run-failure-checks

2 similar comments
@hrsakai
Copy link
Contributor Author

hrsakai commented Apr 8, 2020

/pulsarbot run-failure-checks

@hrsakai
Copy link
Contributor Author

hrsakai commented Apr 8, 2020

/pulsarbot run-failure-checks

@cdbartholomew
Copy link
Contributor

I am seeing this exact issue while testing the 2.5.1 candidate load:

curl localhost:8080/admin/v2/brokers/health

 --- An unexpected error occurred in the server ---

Message: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion

Stacktrace:

java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
	at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
	at java.base/java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1423)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
	at org.apache.pulsar.client.impl.ProducerImpl.lambda$connectionOpened$14(ProducerImpl.java:1205)
	at java.base/java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:986)
	at java.base/java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:970)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
	at org.apache.pulsar.client.impl.ClientCnx.handleError(ClientCnx.java:609)
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:171)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:321)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:295)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
	at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
	at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at org.apache.pulsar.client.impl.ClientCnx.getPulsarClientException(ClientCnx.java:1005)
	... 21 more

It happens every time the check is run and the topic doesn't already exist. So, the first time it fails like this and then it will work until the topic is deleted, which will happen automatically after a while with default broker settings.

I think it would be a good idea to cherry-pick this fix to 2.5.1 since it makes it look like the broker is broken when the user requests a heartbeat in Pulsar Manager. This is also a regression compared to 2.5.0 which doesn't have this issue.

@cdbartholomew
Copy link
Contributor

This problem is worse with Docker images built using the 2.5.1 candidate. The health check fails every time, not just the first time. It's going to look really bad in Pulsar Manager.

@tuteng @sijie I think this fix should be added to the 2.5.1 release or the PR that caused the behavior to change from 2.5.0 should be pulled out.

@sijie
Copy link
Member

sijie commented Apr 10, 2020

@cdbartholomew thank you for reporting this. since rc2 is canceled to include AVRO related changes, we will some other issues that were labeled for 2.5.2 (including this one) in the new RC.

@sijie
Copy link
Member

sijie commented Apr 10, 2020

Just to clarify - I don't think there are any regressions between 2.5.0 and 2.5.1. The problem exists in 2.5.0. #5563 only addressed some problems.

@cdbartholomew
Copy link
Contributor

The health check problem does not exist in 2.5.0, so there is definitely a regression here.

@sijie
Copy link
Member

sijie commented Apr 10, 2020

@cdbartholomew I have just reproduced the issue in 2.5.0. thanks

[sijie@sijies-mbp apache-pulsar-2.5.0]$ clear
[sijie@sijies-mbp apache-pulsar-2.5.0]$ curl localhost:8080/admin/v2/brokers/health
ok[sijie@sijies-mbp apache-pulsar-2.5.0]$ curl localhost:8080/admin/v2/brokers/health

 --- An unexpected error occurred in the server ---

Message: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion

Stacktrace:

java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
	at java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.pulsar.client.impl.ProducerImpl.lambda$connectionOpened$13(ProducerImpl.java:1176)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.pulsar.client.impl.ClientCnx.handleError(ClientCnx.java:606)
	at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:171)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:326)
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:300)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
	at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.pulsar.client.api.PulsarClientException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
	at org.apache.pulsar.client.impl.ClientCnx.getPulsarClientException(ClientCnx.java:1002)
	... 23 more

@cdbartholomew
Copy link
Contributor

@sijie Let me clarify. The fact that the health check never works when running with Docker images built from the 2.5.1 candidate is a regression. I assumed that it was related to this issue, but it is not.

I just built new Docker images with this fix (v2.5.1-candidate-2 + 6683), and it doesn't resolve that issue. So, there is a different issue here. I will test with master to see if the health check works there.

@sijie
Copy link
Member

sijie commented Apr 10, 2020

The fact that the health check never works when running with Docker images built from the 2.5.1 candidate is a regression.

I don't understand "never" here. The current buggy behavior is - it works when the topic exists, and it will fail the first time when the topic was deleted due to inactivity.

I have just verified that branch-2.5 + this fix has fixed the buggy behavior described above.

$ curl localhost:8080/admin/v2/brokers/health
ok[sijie@sijies-mbp pulsar ]$ curl localhost:8080/admin/v2/brokers/health
ok[sijie@sijies-mbp pulsar ]$ curl localhost:8080/admin/v2/brokers/health
ok

If you are seeing health check failure using Docker Images, I don't think it is a problem of the code. It sounds more like a problem of your docker image. Because the broker code shouldn't have any difference running locally than running in a docker.

@cdbartholomew
Copy link
Contributor

cdbartholomew commented Apr 10, 2020

This is the error:

curl localhost:8080/admin/v2/brokers/health

 --- An unexpected error occurred in the server ---

Message: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 5 lookup request timedout after ms 30000

Stacktrace:

java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 5 lookup request timedout after ms 30000
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.biRelay(CompletableFuture.java:1284)
	at java.util.concurrent.CompletableFuture$BiRelay.tryFire(CompletableFuture.java:1270)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.pulsar.client.impl.ProducerImpl.lambda$connectionOpened$14(ProducerImpl.java:1205)
	at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
	at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
	at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1026)
	at org.apache.pulsar.client.impl.ClientCnx.lambda$channelActive$0(ClientCnx.java:187)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:176)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.pulsar.client.api.PulsarClientException$TimeoutException: 5 lookup request timedout after ms 30000
	at org.apache.pulsar.client.impl.ClientCnx.checkRequestTimeout(ClientCnx.java:1025)
	... 10 more

This is when calling the endpoint from within the broker container.

@sijie
Copy link
Member

sijie commented Apr 10, 2020

5 lookup request timed out after ms 30000 is a different error than the problem we discussed here. The lookup request error indicates that the broker is not able to find the topic owner of the health check topic. Sounds like some network issues.

@cdbartholomew
Copy link
Contributor

I am building the docker images using:

git checkout v2.5.1-candidate-2
mvn install -DskipTests
cd docker
./build.sh

Has anyone else tested images built from this tag?

@tuteng
Copy link
Member

tuteng commented Apr 13, 2020

Add label release-2.5.1

tuteng pushed a commit that referenced this pull request Apr 13, 2020
…tor() (#6683)

Co-authored-by: Sijie Guo <sijie@apache.org>
(cherry picked from commit cf045e4)
merlimat pushed a commit to merlimat/pulsar that referenced this pull request Apr 21, 2020
jiazhai pushed a commit to jiazhai/pulsar that referenced this pull request May 18, 2020
…tor() (apache#6683)

Co-authored-by: Sijie Guo <sijie@apache.org>(cherry picked from commit cf045e4)
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/broker release/2.5.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants