[Broker] Timeout opening managed ledger operation … #7506
Conversation
*Motivation*

Currently the broker has a timeout mechanism on loading topics, but the underlying managed ledger library does not provide one. This can lead to the following situation: a TopicLoad operation times out after 30 seconds, but the CompletableFuture for opening the managed ledger is still kept in the cache of the managed ledger factory. That future never completes, so all subsequent topic lookups fail, because attempts to load the topic never re-open the managed ledger.

*Modification*

Introduce a timeout mechanism in the managed ledger factory. If a managed ledger is not opened within a given timeout period, its CompletableFuture is removed from the cache, so that subsequent attempts to load the topic can try to open the managed ledger again.

*Tests*

This problem can be consistently reproduced in a chaos test on Kubernetes by killing k8s worker nodes; it leaves producers stuck until the owner broker pod is restarted. The change has been verified in a chaos testing environment.
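The eviction described above can be sketched roughly as follows. This is an illustrative sketch, not Pulsar's actual code: the class and method names (`LedgerFutureCache`, `open`, `isCached`) are hypothetical, and a `String` stands in for the real `ManagedLedger` result. The idea is that the factory caches the open-ledger future, and a scheduled task evicts the entry if it does not complete within the timeout, so the next lookup retries the open instead of waiting forever:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the timeout-and-evict idea (names are illustrative,
// not Pulsar's actual API).
public class LedgerFutureCache {
    private final Map<String, CompletableFuture<String>> ledgers = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public CompletableFuture<String> open(String name, long timeoutMs) {
        return ledgers.computeIfAbsent(name, n -> {
            CompletableFuture<String> f = new CompletableFuture<>();
            // If the open has not completed in time, evict the cached future
            // and fail it, so subsequent open() calls can retry instead of
            // piling up behind a future that never completes.
            scheduler.schedule(() -> {
                if (!f.isDone()) {
                    ledgers.remove(n, f);
                    f.completeExceptionally(
                            new TimeoutException("open timed out: " + n));
                }
            }, timeoutMs, TimeUnit.MILLISECONDS);
            return f;
        });
    }

    public boolean isCached(String name) {
        return ledgers.containsKey(name);
    }
}
```

After the timeout fires, the cache entry is gone and a retry will create a fresh future rather than rejoining the stuck one.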
```java
// Unable to get the future
log.warn("[{}] Got exception while trying to retrieve ledger", name, e);
} else {
    PendingInitializeManagedLedger pendingLedger = pendingInitializeLedgers.get(name);
```
As opposed to requiring another call to check whether the timeout has elapsed, would it make sense to instead use a CompletableFuture with a built-in timeout? Much like we do in the BrokerService, the future created on line 370 could be given a timeout: if it isn't resolved within so many milliseconds, the error handler removes the future from the cache of futures.
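The reviewer's suggestion would look roughly like this sketch, using `CompletableFuture.orTimeout` (Java 9+). The cache and names here are illustrative, not the BrokerService's actual code; `orTimeout` fails the future if it is not resolved in time, and a `whenComplete` handler then drops the stale entry from the cache:

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch of the suggested alternative: give the cached future a
// built-in timeout and evict it from the cache when the timeout fires.
public class TimeoutFutureSketch {
    static final Map<String, CompletableFuture<String>> cache = new ConcurrentHashMap<>();

    static CompletableFuture<String> openWithTimeout(String name, long timeoutMs) {
        CompletableFuture<String> f = new CompletableFuture<>();
        cache.put(name, f);
        // orTimeout (Java 9+) completes f exceptionally with TimeoutException
        // if it is not resolved in time; the handler then removes the stale
        // entry so a later attempt can start over.
        f.orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
         .whenComplete((value, ex) -> {
             if (ex instanceof TimeoutException) {
                 cache.remove(name, f);
             }
         });
        return f;
    }
}
```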
We could do that. However, I didn't go down that route for the following reasons:

We need to keep the ManagedLedger reference and close it to release resources, because initialization involves a long pipeline that includes opening the managed ledger and its cursors. The behavior we observed is that the operation gets stuck at opening cursors, so if we don't attempt to close the ManagedLedger instance, it can leak resources.

That being said, we need to keep a reference to the ManagedLedgerImpl along with the CompletableFuture. Hence I chose the current implementation.

Another reason is to allow checking the timestamp proactively. I have previously fixed a couple of issues where an NPE was thrown between creating a Future and registering the error-handling logic. I would like to have a mechanism in place that can catch any such potential bugs by proactively checking whether a CompletableFuture has timed out.
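The proactive-check approach described above can be sketched as follows. All names here are hypothetical (the real types are ManagedLedgerImpl and the factory's pending-ledger map): each pending entry keeps both the future and a handle to the half-open ledger, plus its creation time, so a periodic sweep can time out stuck entries and release their resources:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of proactively sweeping pending initializations.
public class PendingLedgerSweeper {

    static class PendingEntry {
        final CompletableFuture<Void> future = new CompletableFuture<>();
        final long createdAt = System.currentTimeMillis();
        volatile boolean resourcesClosed = false;

        // Stands in for closing the half-open ManagedLedger to free resources.
        void closeResources() {
            resourcesClosed = true;
        }
    }

    final Map<String, PendingEntry> pending = new ConcurrentHashMap<>();

    // Called periodically: removes and fails any entry older than timeoutMs,
    // closing its ledger so a stuck initialization does not leak resources.
    // This also catches entries whose error handler was never registered
    // (e.g. an NPE thrown between creating the future and wiring callbacks).
    void sweep(long timeoutMs) {
        long now = System.currentTimeMillis();
        for (Iterator<Map.Entry<String, PendingEntry>> it =
                pending.entrySet().iterator(); it.hasNext(); ) {
            PendingEntry e = it.next().getValue();
            if (!e.future.isDone() && now - e.createdAt > timeoutMs) {
                it.remove();
                e.closeResources();
                e.future.completeExceptionally(
                        new TimeoutException("initialization timed out"));
            }
        }
    }
}
```

Keeping the ledger handle alongside the future is what allows the sweep to call close on it, which the pure `orTimeout` variant would not provide.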
@sijie Do you know why that future is never completing? Is it a DNS error on opening the ledger?
@merlimat it is stuck in LedgerRecoveryOp when recovering cursors. I haven't caught the actual exception; my feeling is that it comes from the ZooKeeper side. The chaos test we ran hard-kills a Kubernetes worker node that hosts one zookeeper pod, one bookkeeper pod, and one broker pod. It sounds like some ZooKeeper call never came back, so the ledger recovery op got stuck without triggering any callback, which in turn causes the problem in the managed ledger library. A side note: I created an issue a while ago to separate loading cursors from loading the managed ledger. The idea is that we should allow producers to produce messages once the managed ledger is ready, which would improve write availability. #7404
/pulsarbot run-failure-checks