Bookkeeper failed to connect on manual recovery #118

Closed
sschepens opened this issue Nov 17, 2016 · 7 comments

Comments

@sschepens
Contributor

sschepens commented Nov 17, 2016

I know this probably doesn't belong here.

We're trying to decommission a BookKeeper node that is performing badly, but when we stop the BookKeeper service and try a manual recovery:
bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools zk1.example.com:2181 nodeToDecomission:3181 newNode:3181

The command fails saying it cannot connect to the stopped node, which of course is true, but I don't know why it's trying to connect to that given node if it's shut down.
The stacks are:

java.net.ConnectException: syscall:getsockopt(...): /nodeToDecomission:3181
	at io.netty.channel.unix.Socket.finishConnect(...)(Unknown Source)
2016-11-17 18:16:05,084 - ERROR - [bookkeeper-io-1-4:PerChannelBookieClient$2@284] - Could not connect to bookie: [id: 0x0a698a5c, L:/clientIp:39932]/nodeToDecomission:3181, current state CONNECTING : 

Do you have any idea of what could be going on?

Another thing to note is that when we stopped the Bookkeeper in the first place, Brokers seemed to keep retrying their connections to that Bookie for a really long time even though it was turned off.

Edit: as @estebangarcia said, the same exceptions are thrown if instead of trying manual recovery we start an Autorecovery process.

@estebangarcia

We have that exact same error with bookkeeper's autorecovery.

@sschepens
Contributor Author

@merlimat do you know if this is normal?

@merlimat
Contributor

@sschepens @estebangarcia How did you enable auto-recovery?

There are a couple of ways to run the replication workers:

  1. In the same JVM process as the bookies, by setting autoRecoveryDaemonEnabled=true in conf/bookkeeper.conf
  2. As a separate process running either on every bookie, or just on a few VMs. It can be started with:
    bin/bookkeeper autorecovery

In both cases, the replication workers will watch for bookies that are not available and will
trigger the copy of all data that was stored in the bookie (from another available replica).

You can check the status of under-replicated ledgers with:

bin/bookkeeper shell listunderreplicated 
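
To watch the under-replicated list drain over time, the command above can be wrapped in a small polling loop. This is only a sketch: it assumes listunderreplicated prints one numeric ledger ID per line (other output is ignored), and the names count_underreplicated, wait_for_replication, BK_CMD and POLL_SECONDS are made up for illustration.

```shell
#!/bin/sh
# Count stdin lines that look like ledger IDs (digits only).
# ASSUMPTION: one numeric ledger ID per line; log lines are ignored.
count_underreplicated() {
  # grep -c prints 0 but exits non-zero when nothing matches.
  grep -c -E '^[[:space:]]*[0-9]+[[:space:]]*$' || true
}

# Hypothetical helper: poll until no under-replicated ledgers are listed.
# BK_CMD and POLL_SECONDS are illustrative knobs, not BookKeeper settings.
wait_for_replication() {
  bk_cmd="${BK_CMD:-bin/bookkeeper}"
  while :; do
    # Stop on a failed command rather than treating empty output as "done".
    out=$("$bk_cmd" shell listunderreplicated 2>/dev/null) || {
      echo "listunderreplicated failed" >&2
      return 1
    }
    remaining=$(printf '%s\n' "$out" | count_underreplicated)
    echo "under-replicated ledgers: $remaining"
    [ "$remaining" -eq 0 ] && break
    sleep "${POLL_SECONDS:-60}"
  done
}
```

Run wait_for_replication from the BookKeeper install directory. If the count never goes down, the replication workers are probably not making progress.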

As for the errors in the broker logs, they are somewhat expected: if the ledgers are still under-replicated, the client will try to read them from the failed bookie. Even once the data has been re-replicated, the BK client might not see the metadata change until it hits a read error, so it may keep printing some error logs.

We still have a pending task to backport a patch from the Twitter branch that adds read sequence reordering. That would make the client first try to read from bookies that are marked available in ZK.
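
To illustrate the idea only (this is not the Twitter patch, and not how the BK client is implemented), reordering means moving the bookies currently marked available to the front of the read order while keeping the relative order otherwise. The function name and its inputs below are hypothetical.

```shell
#!/bin/sh
# Usage: reorder_read_sequence "<space-separated available bookies>" bookie1 bookie2 ...
# Prints the ensemble one bookie per line, available bookies first,
# preserving the original relative order within each group.
# ASSUMPTION: the availability list is supplied by the caller; in a real
# client it would come from ZooKeeper.
reorder_read_sequence() {
  available=" $1 "
  shift
  up=""
  down=""
  for bookie in "$@"; do
    case "$available" in
      *" $bookie "*) up="$up $bookie" ;;
      *)             down="$down $bookie" ;;
    esac
  done
  # Unquoted on purpose: split the accumulated names back into words.
  for bookie in $up $down; do
    printf '%s\n' "$bookie"
  done
}
```

With an ensemble of b1:3181 b2:3181 b3:3181 where b1 is down, calling reorder_read_sequence "b2:3181 b3:3181" b1:3181 b2:3181 b3:3181 lists b2 and b3 before b1, so a read would only fall back to the failed bookie last.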

@sschepens
Contributor Author

We did run separate autorecovery processes using bin/bookkeeper autorecovery, but they failed consistently with connection errors to the failed bookie.
We also saw a lot of under-replicated ledgers, but the autorecovery processes never managed to clear the list because they kept failing to connect to the failed bookie, which makes little sense.

@merlimat
Contributor

I don't know if that's the case, but the auto-recovery workers need to be running when you shut down the bookie; otherwise they won't notice the unavailable bookie.

There is also a full check for under-replicated ledgers, which runs once a day.

@ivankelly
Contributor

@sschepens is this still an issue?

@ivankelly
Contributor

Closing due to lack of activity.

dragonls pushed a commit to dragonls/pulsar that referenced this issue Oct 21, 2022
…request !60)

Revert "[fix][broker] filter the virtual NIC with relative path (apache#14829)" (apache#118)

4 participants