Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pulsar-broker] close managed-ledgers before giving up bundle ownership to avoid bad zk-version #5599

Merged
merged 4 commits into from Jan 6, 2020

Conversation

@rdhabalia
Copy link
Contributor

rdhabalia commented Nov 9, 2019

Motivation

We have seen multiple below occurrence where unloading topic doesn't complete and gets stuck. and broker gives up ownership after a timeout and closing ml-factory closes unclosed managed-ledger which corrupts metadata zk-version and topic owned by new broker keeps failing with exception: ManagedLedgerException$BadVersionException

right now, while unloading bundle: broker removes ownership of bundle after timeout even if topic's managed-ledger is not closed successfully and ManagedLedgerFactoryImpl closes unclosed ml-ledger on broker shutdown which causes bad zk-version in to the new broker and because of that cursors are not able to update cursor-metadata into zk.

01:01:13.452 [shutdown-thread-57-1] INFO  org.apache.pulsar.broker.namespace.OwnedBundle - Disabling ownership: my-property/my-cluster/my-ns/0xd0000000_0xe0000000
:
01:01:13.653 [shutdown-thread-57-1] INFO  org.apache.pulsar.broker.service.BrokerService - [persistent://my-property/my-cluster/my-ns/topic-partition-53] Unloading topic
:
01:02:13.677 [shutdown-thread-57-1] INFO  org.apache.pulsar.broker.namespace.OwnedBundle - Unloading my-property/my-cluster/my-ns/0xd0000000_0xe0000000 namespace-bundle with 0 topics completed in 60225.0 ms
:
01:02:13.675 [shutdown-thread-57-1] ERROR org.apache.pulsar.broker.namespace.OwnedBundle - Failed to close topics in namespace my-property/my-cluster/my-ns/0xd0000000_0xe0000000 in 1/MINUTES timeout
01:02:13.677 [pulsar-ordered-OrderedExecutor-7-0-EventThread] INFO  org.apache.pulsar.broker.namespace.OwnershipCache - [/namespace/my-property/my-cluster/my-ns/0xd0000000_0xe0000000] Removed zk lock for service unit: OK
:
01:02:14.404 [shutdown-thread-57-1] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [my-property/my-cluster/my-ns/persistent/topic-partition-53] Closing managed ledger

Modification

This fix will make sure that broker closes managed-ledger before giving up bundle ownership to avoid below exception at new broker where bundle moves


01:02:30.995 [bookkeeper-ml-workers-OrderedExecutor-3-0] ERROR org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [my-property/my-cluster/my-ns/persistent/topic-partition-53][my-sub] Metadata ledger creation failed
org.apache.bookkeeper.mledger.ManagedLedgerException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:118) ~[zookeeper-3.4.13.jar:3.4.13-2d71af4dbe22557fda74f9a9b4309b15a7487f03]
        at org.apache.bookkeeper.mledger.impl.MetaStoreImplZookeeper.lambda$null$125(MetaStoreImplZookeeper.java:288) ~[managed-ledger-original-2.4.5-yahoo.jar:2.4.5-yahoo]
        at org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) [managed-ledger-original-2.4.5-yahoo.jar:2.4.5-yahoo]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [bookkeeper-common-4.9.0.jar:4.9.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-all-4.1.32.Final.jar:4.1.32.Final]
        at java.lang.Thread.run(Thread.java:834) [?:?]
…ip to avoid bad zk-version
@rdhabalia rdhabalia added this to the 2.4.2 milestone Nov 9, 2019
@rdhabalia rdhabalia self-assigned this Nov 9, 2019
rdhabalia added 2 commits Nov 10, 2019
@rdhabalia

This comment has been minimized.

Copy link
Contributor Author

rdhabalia commented Nov 11, 2019

rerun integration tests
rerun cpp tests

@rdhabalia

This comment has been minimized.

Copy link
Contributor Author

rdhabalia commented Nov 11, 2019

rerun integration tests
rerun cpp tests

@@ -1860,7 +1860,7 @@ protected void unloadTopic(TopicName topicName, boolean authoritative) {
validateTopicOwnership(topicName, authoritative);
try {
Topic topic = getTopicReference(topicName);
topic.close().get();
topic.close(false).get();

This comment has been minimized.

Copy link
@merlimat

merlimat Nov 11, 2019

Contributor

I'd rather have an enum here because it's not clear what false means in this context.

This comment has been minimized.

Copy link
@rdhabalia

rdhabalia Nov 13, 2019

Author Contributor

umm.. I think closing forcefully can be represent with boolean flag as we do similar thing at multiple places: PersistentTopic::delete(boolean... flags)

Also, I was also trying to think about how to accommodate enum here instead of flag. One thing I can think of is to add below enum under Topic instead of flag.

enum CLOSE_ACTION { CLOSE_ALL, CLOSE_WITHOUT_CLIENT_WAIT }

But I feel enum is not helping much. Instead we can rename the flag to give more meaning closeWithoutWaitingClientDisconnect.

So, for PersistentTopic if flag is enabled then broker skips waiting on client-disconnect and immediately closes managed-ledger before giving up bundle ownership.
And for NonPersistentTopic just completes the close if flag is enabled.

Can you please let me know if I am missing anything while renaming flag instead making enum.? any thoughts?

* @return
*/
public CompletableFuture<Integer> unloadServiceUnit(NamespaceBundle serviceUnit) {
public CompletableFuture<Integer> unloadServiceUnit(NamespaceBundle serviceUnit, boolean force) {

This comment has been minimized.

Copy link
@merlimat

merlimat Nov 11, 2019

Contributor

Same here, the force meaning is not evident when calling this method.

@@ -110,7 +110,7 @@ default long getOriginalSequenceId() {

CompletableFuture<Void> checkReplication();

CompletableFuture<Void> close();
CompletableFuture<Void> close(boolean force);

This comment has been minimized.

Copy link
@merlimat

merlimat Nov 11, 2019

Contributor

Apart from the bool vs enum discussion, why do we need 2 different behaviors? Can't we just consider "force" the only approach?

This comment has been minimized.

Copy link
@rdhabalia

rdhabalia Nov 13, 2019

Author Contributor

Can't we just consider "force" the only approach?

No, because first we want to close topic gracefully by closing all clients first and then close managed-ledger. If things don't get closed gracefully then close topic forcefully by closing managed-ledger before giving up ownership of the bundle. So, we need both behavior.

@wolfstudy

This comment has been minimized.

Copy link
Member

wolfstudy commented Nov 13, 2019

@rdhabalia l will change the Milestone to 2.4.3. So we can cut 2.4.2 and if needed
2.4.3 in a few weeks.

@wolfstudy wolfstudy removed this from the 2.4.2 milestone Nov 13, 2019
@rdhabalia

This comment has been minimized.

Copy link
Contributor Author

rdhabalia commented Nov 13, 2019

@merlimat addressed your comments and renamed the flag with more meaningful name instead making it enum as enum doesn't seem appropriate to define this method behavior. can you please review it again or can you please let me know if you have any thought on it.

@rdhabalia

This comment has been minimized.

Copy link
Contributor Author

rdhabalia commented Nov 15, 2019

@merlimat can you please review it.. somehow we are seeing this issue with 2.4 very often when we restart the broker. we also want to merge #5604 as part of this issue.

@rdhabalia

This comment has been minimized.

Copy link
Contributor Author

rdhabalia commented Dec 13, 2019

we are keep facing this issue and need this fix soon. so, @merlimat @sijie can we please review it.

@jiazhai
jiazhai approved these changes Jan 3, 2020
Copy link
Member

jiazhai left a comment

+1, overall lgtm

@jiazhai jiazhai merged commit 0a259ab into apache:master Jan 6, 2020
3 checks passed
3 checks passed
Jenkins: C++ / Python Tests SUCCESS
Details
Jenkins: Integration Tests SUCCESS
Details
Jenkins: Java 8 - Unit Tests SUCCESS
Details
jiazhai added a commit that referenced this pull request Jan 8, 2020
### Motivation

Since #5599 merged, it introduce some conflict code with master branch, maybe the reason is #5599 not rebase with master

### Verifying this change

This is a test change
@sijie sijie added the release/2.4.3 label Jan 22, 2020
@sijie sijie added this to the 2.6.0 milestone Jan 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.