Bookkeeper failed to connect on manual recovery #118
Comments
We have that exact same error with bookkeeper's autorecovery.
@merlimat do you know if this is normal?
@sschepens @estebangarcia How did you enable auto-recovery? There are a couple of ways to run the replication workers:
In both cases, the replication workers will watch for bookies that are not available and will re-replicate the ledgers stored on them. You can check the status of under-replicated ledgers with:
As for the errors in the broker logs, that is somewhat expected: if the ledgers are still under-replicated, the client is trying to read them from the failed bookie. Even when the data is already replicated, the BK client might not see the metadata change unless there is a read error, so it can keep printing some error logs. We still have a pending task to backport a patch from the Twitter branch to include read sequence reordering. That would make the client first try to read from bookies that are marked available in ZK.
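To illustrate the read sequence reordering idea mentioned above, here is a minimal sketch in Python. It is not the actual patch from the Twitter branch and not BookKeeper client code; the function name `reorder_read_sequence` and the bookie addresses are hypothetical, and the set of available bookies is assumed to have been read from ZK beforehand.

```python
from typing import Iterable, List, Set


def reorder_read_sequence(ensemble: Iterable[str], available: Set[str]) -> List[str]:
    """Return the bookies in the order a read should try them.

    Bookies currently marked available come first (keeping their original
    relative order); unavailable ones are kept at the end as a fallback.
    Hypothetical helper for illustration only.
    """
    ensemble = list(ensemble)
    preferred = [b for b in ensemble if b in available]
    fallback = [b for b in ensemble if b not in available]
    return preferred + fallback


if __name__ == "__main__":
    # Hypothetical ensemble in which the first bookie has been shut down.
    ensemble = ["bookie1:3181", "bookie2:3181", "bookie3:3181"]
    available = {"bookie2:3181", "bookie3:3181"}
    print(reorder_read_sequence(ensemble, available))
```

With an ordering like this, a read would only reach the stopped bookie after the available ones had been tried, which should cut down on the connection errors in the broker logs.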
We did run separate autorecovery processes using
I don't know if that's the case, though the auto-recovery workers need to be running when you shut down the bookie, otherwise they won't notice the unavailable bookie. There is also a full check for under-replicated ledgers that runs once a day.
@sschepens is this still an issue?
Closing due to lack of activity.
I know this probably doesn't belong here.
We're trying to decommission a Bookkeeper node which is performing badly, but when we stop the Bookkeeper service and try a manual recovery:
bin/bookkeeper org.apache.bookkeeper.tools.BookKeeperTools zk1.example.com:2181 nodeToDecomission:3181 newNode:3181
The command fails saying it cannot connect to the stopped node, which of course is true, but I don't know why it's trying to connect to that given node if it's shut down.
The stacks are:
Do you have any idea of what could be going on?
Another thing to note is that when we stopped the Bookkeeper in the first place, Brokers seemed to keep retrying their connections to that Bookie for a really long time even though it was turned off.
Edit: as @estebangarcia said, the same exceptions are thrown if instead of trying manual recovery we start an Autorecovery process.