[Broker] Timeout opening managed ledger operation … #7506

sijie · 2020-07-10T19:39:54Z

Motivation

Currently, the broker has a timeout mechanism for loading topics. However, the underlying managed ledger library
doesn't provide one. This leads to a situation where a TopicLoad operation times out
after 30 seconds, but the CompletableFuture for opening the managed ledger is still kept in the cache of the managed ledger
factory. That future never completes, so all subsequent topic lookups fail: every
attempt to load the topic finds the cached future and never tries to re-open the managed ledger.

Modification

Introduce a timeout mechanism in the managed ledger factory. If a managed ledger is not opened within a given timeout
period, its CompletableFuture is removed from the cache. This allows subsequent attempts to load the topic to try
opening the managed ledger again.
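The idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `OpenCache`, `Pending`, and the explicit `nowMs` parameter are hypothetical names chosen to keep the example self-contained and testable. A cached open that is still pending after the timeout (or that failed) is replaced, so the next caller triggers a fresh open.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for the factory's cache of in-flight open operations.
class OpenCache<T> {
    // A pending open plus the time it was started.
    private static final class Pending<T> {
        final CompletableFuture<T> future;
        final long startedAtMs;

        Pending(CompletableFuture<T> future, long startedAtMs) {
            this.future = future;
            this.startedAtMs = startedAtMs;
        }
    }

    private final ConcurrentHashMap<String, Pending<T>> pending = new ConcurrentHashMap<>();
    private final long timeoutMs;

    OpenCache(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Returns the cached future, unless it failed or has been pending longer
    // than timeoutMs; in that case the entry is replaced by a fresh open.
    CompletableFuture<T> open(String name, Supplier<CompletableFuture<T>> opener, long nowMs) {
        Pending<T> entry = pending.compute(name, (key, existing) -> {
            boolean timedOut = existing != null
                    && !existing.future.isDone()
                    && nowMs - existing.startedAtMs > timeoutMs;
            boolean failed = existing != null && existing.future.isCompletedExceptionally();
            if (existing == null || timedOut || failed) {
                return new Pending<>(opener.get(), nowMs);
            }
            return existing;
        });
        return entry.future;
    }
}
```

A caller would pass `System.currentTimeMillis()` as `nowMs`; the clock is a parameter here only so the behavior can be exercised without sleeping.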

Tests

This problem can be constantly reproduced in a chaos test in Kubernetes by killing k8s worker nodes. It can cause
producer stuck forever until the owner broker pod is restarted. The change has been verified in a chaos testing environment.


addisonj · 2020-07-10T20:46:14Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerFactoryImpl.java

-                // Unable to get the future
-                log.warn("[{}] Got exception while trying to retrieve ledger", name, e);
+            } else {
+                PendingInitializeManagedLedger pendingLedger = pendingInitializeLedgers.get(name);


Rather than requiring another call to check whether the timeout has elapsed, would it make sense to use a CompletableFuture with a timeout instead? Much as we do in the BrokerService, the future created on line 370 could be given a timeout, so that if it isn't resolved within so many milliseconds, an error handler removes it from the cache of futures.
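For illustration, the alternative suggested here could look roughly like the sketch below. It is not the BrokerService code: `TimedFutures.withTimeout` is a hypothetical helper, and the cache in the usage note is a plain map standing in for the factory's cache. On Java 9 and later, `CompletableFuture.orTimeout` covers the timeout part; the sketch uses a `ScheduledExecutorService` so that it also works on Java 8.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: fail a future if it is still pending after a delay.
class TimedFutures {
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> future, long timeout, TimeUnit unit,
                                                ScheduledExecutorService scheduler, Runnable onTimeout) {
        ScheduledFuture<?> timer = scheduler.schedule(() -> {
            // completeExceptionally returns true only if the future was still pending,
            // so onTimeout runs at most once and never after a normal completion.
            if (future.completeExceptionally(new TimeoutException("open timed out"))) {
                onTimeout.run();
            }
        }, timeout, unit);
        // Cancel the timer as soon as the future completes for any reason.
        future.whenComplete((value, error) -> timer.cancel(false));
        return future;
    }
}
```

A caller would register the future in the cache first and pass `() -> cache.remove(name, future)` as `onTimeout`, so a timed-out entry is evicted and the next attempt can re-open. Note that this only fails the future; it does not release whatever the stuck operation is holding.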

sijie · 2020-07-10

We can do that. However, I didn't go down that route for the following reasons.

We need to keep the ManagedLedger reference so that we can close it and release its resources. Initialization is a long pipeline that includes opening the managed ledger and its cursors, and the behavior we observed is that the operation gets stuck while opening cursors. If we don't attempt to close the ManagedLedger instance, resources can leak.

That means we need to keep a reference to the ManagedLedgerImpl along with the CompletableFuture, which is why I chose the current implementation.

Another reason is to allow checking the timestamp proactively. I previously fixed a couple of issues where an NPE was thrown between creating a Future and registering the error-handling logic. I would like to have a mechanism in place that guards against any such bugs by proactively checking whether a CompletableFuture has timed out.
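A rough sketch of that design, under stated assumptions: `PendingOpens`, `PendingOpen`, and `sweep` are hypothetical names, and `AutoCloseable` stands in for the half-initialized ManagedLedgerImpl. Each pending entry keeps the resource, the future, and a start timestamp together, so a proactive check can both evict the entry and close the resource.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;

// Hypothetical registry of opens that are still initializing.
class PendingOpens {
    // Keeps the half-initialized resource next to its future and start time.
    static final class PendingOpen {
        final AutoCloseable resource;
        final CompletableFuture<AutoCloseable> future = new CompletableFuture<>();
        final long startedAtMs;

        PendingOpen(AutoCloseable resource, long startedAtMs) {
            this.resource = resource;
            this.startedAtMs = startedAtMs;
        }
    }

    final Map<String, PendingOpen> pending = new ConcurrentHashMap<>();

    PendingOpen register(String name, AutoCloseable resource, long nowMs) {
        PendingOpen open = new PendingOpen(resource, nowMs);
        pending.put(name, open);
        return open;
    }

    // Evicts entries still pending after timeoutMs, closes their resources,
    // and fails their futures. Returns the number of evicted entries.
    int sweep(long nowMs, long timeoutMs) {
        int evicted = 0;
        for (Map.Entry<String, PendingOpen> e : pending.entrySet()) {
            PendingOpen open = e.getValue();
            boolean timedOut = !open.future.isDone() && nowMs - open.startedAtMs > timeoutMs;
            if (timedOut && pending.remove(e.getKey(), open)) {
                try {
                    open.resource.close(); // release whatever the stuck open is holding
                } catch (Exception ignored) {
                    // best effort: the entry is evicted either way
                }
                open.future.completeExceptionally(new TimeoutException("open timed out"));
                evicted++;
            }
        }
        return evicted;
    }
}
```

Because the check is driven by timestamps rather than by a handler attached to the future, it still fires if an exception is thrown before any error handling has been registered.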

merlimat · 2020-07-10T21:00:08Z

The completable future will never return. So any sub-sequent topic lookups will fail because any
attempts to load a topic will never attempt to re-open a managed ledger.

@sijie Do you know why it never returns? Is it due to a DNS error when opening the ledger?

sijie · 2020-07-10T21:55:51Z

@merlimat It is stuck in LedgerRecoveryOp when recovering cursors. I haven't caught the real exception. My feeling is that it comes from the ZooKeeper side. The chaos test we ran hard-kills a Kubernetes worker node that hosts one ZooKeeper pod, one BookKeeper pod, and one broker pod. It looks like some ZooKeeper call never came back, so the ledger recovery op got stuck without triggering any callback, which in turn caused the problem in the managed ledger library.

A side note - I created an issue a while ago to separate loading cursors from loading managed ledger. The idea is that we should allow producers to produce messages once the managed ledger is ready. This would improve write availability. #7404

codelipenghui · 2020-07-11T00:48:58Z

/pulsarbot run-failure-checks


devinbost · 2020-07-30T21:09:13Z

@sijie Could this timeout issue cause a topic to seem to freeze, as reported here: #6054 ?
If so, the odd thing about that freezing topic issue is that it occurs randomly in a baremetal docker environment while running functions without any sudden broker instance deaths.


sijie added 2 commits July 10, 2020 12:15

Remove redundant imports

298bee7

sijie added area/broker component/storage release/2.5.3 release/2.6.1 labels Jul 10, 2020

sijie added this to the 2.7.0 milestone Jul 10, 2020

sijie requested review from ivankelly, merlimat, rdhabalia, jiazhai and codelipenghui July 10, 2020 19:39

sijie self-assigned this Jul 10, 2020

addisonj reviewed Jul 10, 2020


jiazhai approved these changes Jul 10, 2020


codelipenghui approved these changes Jul 11, 2020


codelipenghui merged commit 14e3b7a into apache:master Jul 16, 2020

sijie deleted the timeout_open_managed_ledger branch July 17, 2020 08:48
