Fix Bugs Introduced by New Load Manager #332
Conversation
```java
@@ -310,6 +310,8 @@ public void start() throws PulsarServerException {

    @Override
    public void disableBroker() throws Exception {
        loadReportCacheZk.unregisterListener(this);
```
I think we should add this logic to the `stop()` method, which is currently empty for both load managers. Also, I think we should not delete the znode when we dynamically change the load manager; the new load manager should just publish a new load report on this path. Can we also add a unit test for these changes?
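For illustration, a minimal sketch of what a `stop()` along these lines might look like, reusing the field names from the snippets in this thread (an illustration, not the actual patch):

```java
@Override
public void stop() throws PulsarServerException {
    // Stop receiving watch notifications for load reports.
    loadReportCacheZk.unregisterListener(this);

    // Shut down this load manager's caches so stale watches cannot
    // repopulate them after the load manager is replaced.
    loadReportCacheZk.shutdown();
    availableActiveBrokers.shutdown();

    // Stop any periodic load-report update tasks.
    scheduler.shutdown();
}
```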
```java
@@ -310,6 +310,7 @@ public void start() throws PulsarServerException {

    @Override
    public void disableBroker() throws Exception {
        stop();
```
I think `disableBroker()` should just disable the broker by deleting the znode; it should not stop the load manager.
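In other words, something along these lines (the `brokerZnodePath` field and the use of `pulsar.getZkClient()` are assumptions for illustration, not the actual code):

```java
@Override
public void disableBroker() throws Exception {
    // Delete only this broker's znode so it stops being discoverable,
    // without tearing down the load manager itself.
    // brokerZnodePath is assumed to hold the path registered in start().
    pulsar.getZkClient().delete(brokerZnodePath, -1);
}
```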
```java
        // do nothing
        loadReportCacheZk.shutdown();
        availableActiveBrokers.shutdown();
        scheduler.shutdown();
```
I think we should handle the case where a future scheduling task gets triggered by a zk-watch after we have already shut down the scheduler.
Do you mean to say that it is already handled? There is actually an issue in that both the `LocalBrokerData` and `LoadReport` caches use the same underlying cache, namely `pulsar.getLocalZkCache()`. What this means is that, if we were not to shut down the caches, the integrity of the object in the cache would be determined by which watch fires last. This is the cause of the `ClassCastException`: even though the scheduler is shut down, the watch fires for all previously created load managers, deserializing the JSON to either a `LocalBrokerData` or a `LoadReport`. The last one to do so wins, which could cause us to read a `LoadReport` when we want a `LocalBrokerData`, or vice versa.
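A minimal, self-contained simulation of that race (the class names mirror this discussion, but this is an illustration, not the actual Pulsar code; the real caches are `ZooKeeperDataCache` instances over `pulsar.getLocalZkCache()`):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedCacheRace {
    static class LoadReport {}
    static class LocalBrokerData {}

    // Stand-in for the single cache behind pulsar.getLocalZkCache():
    // both load managers' "data caches" share this map.
    static final Map<String, Object> sharedCache = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        final String path = "/loadbalance/brokers/broker-1";

        // The old load manager's watch fires: deserializes to LoadReport.
        sharedCache.put(path, new LoadReport());

        // The new load manager's watch fires last and overwrites the
        // entry with a LocalBrokerData. Whichever watch fires last wins.
        sharedCache.put(path, new LocalBrokerData());

        // The old load manager still reads through the shared cache and
        // expects a LoadReport -> ClassCastException at runtime.
        LoadReport report = (LoadReport) sharedCache.get(path);
        System.out.println(report);
    }
}
```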
Actually, I mean we should check `scheduler.isShutdown()` before scheduling a task, in case the scheduler is already shut down.
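A sketch of the guard being suggested (`onZkWatchTriggered` and `updateLoadReport` are hypothetical names for illustration):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

class GuardedScheduling {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void onZkWatchTriggered() {
        // A watch can still fire after stop(); skip scheduling in that
        // case instead of hitting a RejectedExecutionException.
        if (scheduler.isShutdown()) {
            return;
        }
        scheduler.submit(this::updateLoadReport);
    }

    private void updateLoadReport() { /* refresh and publish load data */ }
}
```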
Also, one more concern:

- Now we have two `ZooKeeperChildrenCache` objects (one in each load manager), and both use the same zk-session to get data from zk.
- While getting data from zk, we pass these objects as watches. So, we will set multiple watches for the same znode.
- I think per zk-session we may get only 1 watch, so there could be a possibility that the new load manager may not get the watch??

@merlimat
There is no particular problem in having multiple watches on the same znode within the same session.
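A sketch of those semantics: ZooKeeper deduplicates watches per (path, watcher object) within a session, so two *distinct* `Watcher` instances on the same znode each receive the notification. (The connect string and znode path below are assumptions for illustration.)

```java
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class TwoWatchesOneZnode {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});

        Watcher oldLoadManagerWatch =
                event -> System.out.println("old LM notified: " + event.getPath());
        Watcher newLoadManagerWatch =
                event -> System.out.println("new LM notified: " + event.getPath());

        // Both watches are registered on the same znode in the same session.
        zk.getData("/loadbalance/brokers/broker-1", oldLoadManagerWatch, null);
        zk.getData("/loadbalance/brokers/broker-1", newLoadManagerWatch, null);

        // When the znode changes, BOTH lambdas above are invoked.
    }
}
```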
@merlimat @saandrews I think we can include #336 and this #332 PR for the patch, so can you please review this one as well?
```java
private static final Deserializer<ServiceLookupData> serviceLookupDataDeserializer = (key, content) -> {
    final String jsonString = new String(content);
    if (jsonString.contains("\"allocatedCPU\":")) {
```
Should we select the deserialization class based on the active load-balancer?
Yes, I would prefer that. Check which implementation is active and then try both formats, starting from the active one and falling back to the inactive one.
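A minimal sketch of that "active format first, fall back to the other" idea (stub classes stand in for the real `ServiceLookupData` hierarchy, and the `newLoadManagerActive` flag stands in for checking which load manager is configured):

```java
import java.io.IOException;
import com.fasterxml.jackson.databind.ObjectMapper;

class FallbackDeserializer {
    interface ServiceLookupData {}
    static class LoadReport implements ServiceLookupData {}
    static class LocalBrokerData implements ServiceLookupData {}

    private static final ObjectMapper mapper = new ObjectMapper();

    static ServiceLookupData deserialize(byte[] content, boolean newLoadManagerActive)
            throws IOException {
        Class<? extends ServiceLookupData> primary =
                newLoadManagerActive ? LocalBrokerData.class : LoadReport.class;
        Class<? extends ServiceLookupData> fallback =
                newLoadManagerActive ? LoadReport.class : LocalBrokerData.class;
        try {
            return mapper.readValue(content, primary);
        } catch (IOException e) {
            // The znode may still hold the other load manager's format.
            return mapper.readValue(content, fallback);
        }
    }
}
```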
Actually, as we have addressed it in #338, and:

- Both `LoadReport` and `LocalBrokerData` extend `ServiceLookupData`, which has the same getter names for the broker URLs
- `ObjectMapperFactory` doesn't fail on unknown fields

any deserializer will always be able to parse both load-report JSON formats. So, I think "deserialize load report based on active load-manager" (#338) will be enough. Do you think we should have fallback logic on top of #338?

@bobbeyreese can you rebase with master to get the change, and if we decide we need the fallback then we can add the logic on top of it.
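A small, self-contained illustration of that second point (the field names and JSON here are made up; Pulsar's `ObjectMapperFactory` is only described, not used): with `FAIL_ON_UNKNOWN_PROPERTIES` disabled, JSON written by either load manager deserializes into either class.

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class LenientParsing {
    public static class LocalBrokerData {
        public String webServiceUrl;   // shared field name
        public double allocatedCPU;    // only in the new format
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
                .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

        // JSON in the *old* LoadReport shape, with fields LocalBrokerData
        // does not declare, still parses: unknown fields are skipped.
        String oldFormat = "{\"webServiceUrl\":\"http://broker-1:8080\","
                + "\"systemResourceUsage\":{\"cpuUsage\":0.5}}";
        LocalBrokerData data = mapper.readValue(oldFormat, LocalBrokerData.class);
        System.out.println(data.webServiceUrl); // http://broker-1:8080
    }
}
```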
```java
public ZooKeeperDataCache(final ZooKeeperCache cache) {
    this.cache = cache;
    isShutdown = new AtomicBoolean(false);
```
We can use `AtomicFieldUpdater` here.
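A sketch of the field-updater pattern being suggested: a `volatile int` plus one shared static updater replaces a per-instance `AtomicBoolean` allocation. (`AtomicIntegerFieldUpdater` is used since the JDK has no boolean variant; the class and field names are illustrative.)

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

class Cache {
    private static final AtomicIntegerFieldUpdater<Cache> IS_SHUTDOWN_UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(Cache.class, "isShutdown");

    private volatile int isShutdown = 0; // 0 = running, 1 = shut down

    public void shutdown() {
        // compareAndSet ensures the shutdown work runs at most once.
        if (IS_SHUTDOWN_UPDATER.compareAndSet(this, 0, 1)) {
            // release watches, clear entries, etc.
        }
    }

    public boolean isShutdown() {
        return isShutdown == 1;
    }
}
```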
```java
    }
}

public void shutdown() {
```
Can we rename it to `close()` or something else? The name could confuse us, since we call `shutdown()` while stopping the service.
```java
    }
}

public void shutdown() {
```
Same comment as before.
LGTM.
👍
Motivation

The new load management API introduced a few bugs that have been observed, namely:

- A `ClassCastException` can occur when dynamically switching load managers, since the shared caches may deserialize the znode data to the wrong type (`LoadReport` -> `LocalBrokerData` and vice-versa).
- `NamespaceService` will fail to construct `LookupResult`, since JSON cannot instantiate the abstract class `ServiceLookupData`.
- `LeastLongTermMessageRate` always reported that brokers were overloaded due to an integer division bug (see the snippet below).
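A generic illustration of the integer-division pitfall mentioned in the last point (hypothetical numbers; not the actual `LeastLongTermMessageRate` code):

```java
public class IntegerDivisionPitfall {
    public static void main(String[] args) {
        long usedBandwidth = 75;
        long bandwidthLimit = 100;

        double wrong = usedBandwidth / bandwidthLimit;          // truncates to 0.0
        double right = (double) usedBandwidth / bandwidthLimit; // 0.75

        // A usage ratio computed with integer division silently becomes
        // 0 or 1, so an "is overloaded" threshold check can misfire
        // (always or never triggering).
        System.out.println(wrong + " vs " + right); // 0.0 vs 0.75
    }
}
```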
Modifications

Both load managers will now shut down their schedulers and unregister themselves as watchers. Some log statements were added for the new load manager.
Result
Some bugs introduced by the new load manager will be fixed.