[Broker] Fix and improve topic ownership assignment #13069

Merged
lhotari merged 14 commits into apache:master from lh-fix-topic-ownership-assignment
Dec 3, 2021

Conversation

lhotari (Member) commented Dec 1, 2021

Motivation

This PR depends on #13066.

When a lookup is made for a topic that isn't currently loaded, the ownership decision is made in a distributed fashion on the follower brokers, because the information about the leader broker is missing: LeaderElectionService.getCurrentLeader() always returned Optional.empty(). This leads to races when assigning topic ownership to a broker, since the decision isn't made centrally on the leader broker.

This PR adds a test that verifies the behavior and switches to cached information for the list of available brokers.
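
To illustrate why a missing leader turns a central decision into a distributed one, here is a minimal sketch. It is not Pulsar's code: the class `OwnershipDecisionSketch`, its methods, and the broker names are hypothetical, and the empty `Optional` only models the `getCurrentLeader()` behavior described above.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative sketch only; these are hypothetical names, not Pulsar classes. */
final class OwnershipDecisionSketch {

    /**
     * Returns the broker that decides ownership for a lookup received by localBroker.
     * An empty Optional models getCurrentLeader() returning Optional.empty().
     */
    static String decidingBroker(Optional<String> currentLeader, String localBroker) {
        // With a known leader the decision is centralized; otherwise each broker decides itself.
        return currentLeader.orElse(localBroker);
    }

    /** Collects the distinct deciders when several brokers receive lookups for the same topic. */
    static Set<String> deciders(List<String> brokersReceivingLookup, Optional<String> currentLeader) {
        Set<String> result = new TreeSet<>();
        for (String broker : brokersReceivingLookup) {
            result.add(decidingBroker(currentLeader, broker));
        }
        return result;
    }
}
```

With three brokers and no leader information, `deciders` returns three brokers, each of which may pick a different owner. With a known leader it returns a single broker, so concurrent lookups cannot disagree.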

Modifications

Additional context

PR to backport this fix to branch-2.8 is #13117. Pulsar 2.8.1 contains another topic ownership bug which is fixed by #12650.

lhotari added the type/bug, area/broker, and doc-not-needed labels Dec 1, 2021
lhotari added this to the 2.10.0 milestone Dec 1, 2021
lhotari self-assigned this Dec 1, 2021
lhotari force-pushed the lh-fix-topic-ownership-assignment branch from 534eec7 to 7c785d0 December 1, 2021 17:41
michaeljmarshall (Member) left a comment

LGTM

eolivelli (Contributor) left a comment

LGTM

lhotari added a commit to lhotari/pulsar that referenced this pull request Dec 3, 2021
lhotari (Member, Author) commented Dec 3, 2021

I fixed the test in https://github.com/lhotari/pulsar/commits/lh-fix-topic-ownership-assignment-branch-2.8 and can now consistently reproduce the issue. It results in error 500:

10:22:44.069 [metadata-store-150-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.069 [metadata-store-220-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.069 [metadata-store-80-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.069 [metadata-store-45-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.069 [metadata-store-325-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.069 [metadata-store-185-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.069 [metadata-store-255-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.069 [metadata-store-290-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.069 [metadata-store-115-1] INFO  org.apache.pulsar.common.naming.NamespaceBundleFactory - Policy updated for namespace public/default, refreshing the bundle cache.
10:22:44.070 [metadata-store-255-1] WARN  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup broker for topic persistent://public/default/lookuptest0e133752-6e62-415f-994b-ef5d9ad57730-0: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/local-policies/public/default
java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/local-policies/public/default
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$storePut$15(ZKMetadataStore.java:226) ~[pulsar-metadata-2.8.2.jar:2.8.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.68.Final.jar:4.1.68.Final]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/local-policies/public/default
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:308) ~[pulsar-metadata-2.8.2.jar:2.8.2]
	... 5 more
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/local-policies/public/default
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:122) ~[zookeeper-3.6.3.jar:3.6.3]
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.6.3.jar:3.6.3]
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:304) ~[pulsar-metadata-2.8.2.jar:2.8.2]
	... 5 more
10:22:44.070 [metadata-store-185-1] WARN  org.apache.pulsar.broker.lookup.TopicLookupBase - Failed to lookup broker for topic persistent://public/default/lookuptest0e133752-6e62-415f-994b-ef5d9ad57730-0: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/local-policies/public/default
java.util.concurrent.CompletionException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/local-policies/public/default
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346) ~[?:?]
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:632) ~[?:?]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$storePut$15(ZKMetadataStore.java:226) ~[pulsar-metadata-2.8.2.jar:2.8.2]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [netty-common-4.1.68.Final.jar:4.1.68.Final]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/local-policies/public/default
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:308) ~[pulsar-metadata-2.8.2.jar:2.8.2]
	... 5 more
Caused by: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /admin/local-policies/public/default
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:122) ~[zookeeper-3.6.3.jar:3.6.3]
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.6.3.jar:3.6.3]
	at org.apache.pulsar.metadata.impl.ZKMetadataStore.getException(ZKMetadataStore.java:304) ~[pulsar-metadata-2.8.2.jar:2.8.2]
	... 5 more
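
The commit list of this PR mentions using java.util.concurrent.Phaser to increase the chances of a race. The following is a minimal sketch of that technique, not the actual test: the class `PhaserRaceSketch` is hypothetical, and an `AtomicInteger` compare-and-set stands in for the versioned metadata store write that fails with BadVersion in the log above.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative sketch only; hypothetical names, not the test added by this PR. */
final class PhaserRaceSketch {

    /** Runs the given number of concurrent versioned writes and returns how many were rejected. */
    static int countRejectedWrites(int threads) {
        AtomicInteger version = new AtomicInteger(0);  // stands in for a znode version
        AtomicInteger rejected = new AtomicInteger(0);
        Phaser phaser = new Phaser(threads);           // one registered party per writer
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                int expected = version.get();          // every writer reads the version first
                phaser.arriveAndAwaitAdvance();        // wait until all writers have read it
                if (!version.compareAndSet(expected, expected + 1)) {
                    rejected.incrementAndGet();        // stale version, analogous to BadVersion
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("writers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return rejected.get();
    }
}
```

Because the barrier releases all writers only after each has read the same version, exactly one write succeeds and the others are rejected. Without the barrier, the writers would often run one after another and the race would rarely show up.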

lhotari (Member, Author) commented Dec 3, 2021

The problem happens here:

// If no local policies defined for namespace, copy from global config
copyToLocalPolicies(namespace)
        .thenAccept(b -> future.complete(b))
        .exceptionally(ex -> {
            future.completeExceptionally(ex);
            return null;
        });
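
The commit list of this PR includes "Add retry with backoff to loading namespace bundles". As a rough sketch of what retrying an asynchronous operation with exponential backoff can look like, consider the helper below. The class `RetryWithBackoffSketch` and its parameters are hypothetical, and it retries every failure; which exceptions the actual fix retries is not shown in this conversation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Illustrative sketch only; hypothetical helper, not the code changed by this PR. */
final class RetryWithBackoffSketch {

    /** Runs op up to maxAttempts times, doubling the delay after each failed attempt. */
    static <T> CompletableFuture<T> retry(Supplier<CompletableFuture<T>> op, int maxAttempts,
                                          long initialDelayMs, ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        attempt(op, maxAttempts, initialDelayMs, scheduler, result);
        return result;
    }

    private static <T> void attempt(Supplier<CompletableFuture<T>> op, int attemptsLeft, long delayMs,
                                    ScheduledExecutorService scheduler, CompletableFuture<T> result) {
        CompletableFuture<T> attemptFuture;
        try {
            attemptFuture = op.get();
        } catch (RuntimeException e) {
            // Treat a synchronous failure like an asynchronous one.
            attemptFuture = CompletableFuture.failedFuture(e);
        }
        attemptFuture.whenComplete((value, ex) -> {
            if (ex == null) {
                result.complete(value);
            } else if (attemptsLeft <= 1) {
                result.completeExceptionally(ex);  // out of attempts, propagate the last failure
            } else {
                // Schedule the next attempt after the current delay, then double the delay.
                scheduler.schedule(
                        () -> attempt(op, attemptsLeft - 1, delayMs * 2, scheduler, result),
                        delayMs, TimeUnit.MILLISECONDS);
            }
        });
    }
}
```

A caller would wrap the failing step in a supplier so that each retry starts a fresh operation and reads the current version again, instead of reusing the stale one that caused the conflict.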

lhotari (Member, Author) commented Dec 3, 2021

The exception gets handled here:

}).exceptionally(ex -> {
    lookupFailures.inc();
    return null;
});

and here:

}).exceptionally(exception -> {
    log.warn("Failed to lookup broker for topic {}: {}", topicName, exception.getMessage(), exception);
    completeLookupResponseExceptionally(asyncResponse, exception);
    return null;
});

lhotari (Member, Author) commented Dec 3, 2021

I have a fix for branch-2.8 in lhotari@1a350804. I'll make similar changes to this PR since they aren't 2.8-specific.

lhotari marked this pull request as ready for review December 3, 2021 09:40

lhotari (Member, Author) commented Dec 3, 2021

@codelipenghui @eolivelli Please review the recent changes.

eolivelli (Contributor) left a comment

+1

lhotari (Member, Author) commented Dec 3, 2021

PR to backport this fix to branch-2.8 is #13117

lhotari added a commit to lhotari/pulsar that referenced this pull request Dec 3, 2021
lhotari merged commit 537dee1 into apache:master Dec 3, 2021
lhotari added a commit to lhotari/pulsar that referenced this pull request Dec 3, 2021
lhotari added a commit that referenced this pull request Dec 3, 2021
* Add warning log message when leader broker isn't available

* Add more logging about load manager decisions

* Use cached information for available brokers

* Reproduce lookup race issue

* Use java.util.concurrent.Phaser to increase the chances of a race

* Address review feedback

* Increase concurrency of test case to reproduce race conditions

* Use real Zookeeper server in MultiBrokerLeaderElectionTest

* Add retry with backoff to loading namespace bundles

* Add more topics to test

* Address review comment

* Fix checkstyle

* Improve logging

* Address review comments

(cherry picked from commit 537dee1)
lhotari added the cherry-picked/branch-2.8 and cherry-picked/branch-2.9 labels Dec 3, 2021
fxbing pushed a commit to fxbing/pulsar that referenced this pull request Dec 19, 2021
aloyszhang pushed a commit to aloyszhang/pulsar that referenced this pull request Aug 5, 2022
…!64)


Squash merge branch 'optimize-ownership-assign' into '2.8.1'
--story=872733891 Load balancing optimization (apache#13069)(apache#13117)

TAPD: --story=872733891
Labels
area/broker, cherry-picked/branch-2.8, cherry-picked/branch-2.9, doc-not-needed, release/2.8.2, release/2.9.1, type/bug

5 participants