
[ZOOKEEPER-1177] Add the memory optimized watch manager for concentrate watches scenario #590

Closed · wants to merge 4 commits

Conversation

@lvfangmin (Contributor)

The current HashSet based WatcherManager will consume more than 40GB of memory when
creating 300M watches.

This patch optimizes the memory usage and time complexity for the concentrated watches scenario; compared to WatchManager, both the memory consumption and the time complexity improve a lot. I'll post more data later with micro benchmark results.

Changes made compared to WatchManager:

  • Only keep the path-to-watches map
  • Use a BitSet to save the memory used to store watches (see the sketch below)
  • Use ConcurrentHashMap and ReadWriteLock instead of synchronized to reduce lock contention
  • Lazily clean up the closed watchers
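To make the data layout concrete, here is a minimal sketch of a path-to-watchers map backed by a BitSet, assuming each watcher is assigned a small integer bit id; the class and method names here are hypothetical, not the patch's actual code:

import java.util.BitSet;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: store watches as bits instead of object references.
// Each watcher (connection) is assigned a small integer id once; every path
// then only needs a BitSet rather than a HashSet of watcher objects.
public class BitSetWatchSketch {
    private final ConcurrentHashMap<String, BitSet> pathWatches =
            new ConcurrentHashMap<>();

    // Record that the watcher with the given bit id watches the path.
    public void addWatch(String path, int watcherBit) {
        BitSet bits = pathWatches.computeIfAbsent(path, p -> new BitSet());
        synchronized (bits) { // BitSet itself is not thread safe
            bits.set(watcherBit);
        }
    }

    public boolean containsWatcher(String path, int watcherBit) {
        BitSet bits = pathWatches.get(path);
        if (bits == null) {
            return false;
        }
        synchronized (bits) {
            return bits.get(watcherBit);
        }
    }
}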

@hanm (Contributor) left a comment

@lvfangmin (Contributor Author)

@hanm The code FindBugs complains about is all correct and expected:

  1. System.exit if creating the watch manager failed in DataTree
    The factory creates the watch manager based on the class name; if we cannot initialize the class due to an invalid class name, we have to exit.

  2. BitHashSet.elementCount not synchronized in iterator()
    Thread safety cannot be guaranteed during iteration inside this function, so it's marked as not thread safe in the comment, and the caller needs to (and does) synchronize on it while iterating.

  3. Synchronize on AtomicInteger
    In the code we synchronize on it for wait/notify, but we don't rely on that synchronization to control updates to the AtomicInteger's value, so it's used correctly (see the sketch below).
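A minimal sketch of that third pattern, with hypothetical names (maxInProcessing is an assumed limit, not the patch's configuration): the AtomicInteger serves both as a counter, updated via its atomic methods, and as a wait/notify monitor.

import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: the synchronized blocks only coordinate waiting;
// they are not what makes the counter updates safe.
public class DeadWatcherBackpressure {
    private final AtomicInteger totalDeadWatchers = new AtomicInteger(0);
    private final int maxInProcessing = 1000; // assumed limit

    public void addDeadWatcher() throws InterruptedException {
        synchronized (totalDeadWatchers) {
            while (totalDeadWatchers.get() >= maxInProcessing) {
                totalDeadWatchers.wait(); // block until the cleaner catches up
            }
        }
        totalDeadWatchers.incrementAndGet(); // atomic update, no lock needed
    }

    public void onWatchersCleaned(int cleaned) {
        totalDeadWatchers.addAndGet(-cleaned);
        synchronized (totalDeadWatchers) {
            totalDeadWatchers.notifyAll(); // wake up blocked producers
        }
    }
}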

@anmolnar (Contributor) commented Aug 8, 2018

@lvfangmin In which case you need to add exceptions to config/findbugsExcludeFile.xml.

@maoling (Member) commented Aug 8, 2018

@lvfangmin Interesting. Where can we find a benchmark which shows how this PR improves memory usage?

@lvfangmin (Contributor Author)

Based on an internal benchmark, this may save more than 95% of the memory usage in the concentrated watches scenario. I'm working on adding a micro benchmark.

In the original Jira, @phunt also added some basic tests, which give an idea of how much memory it's going to save.

The current HashMap based watch manager uses a lot of memory due to the overhead of storing each entry (32 bytes per entry), the object references, and the duplicated path strings.

Using a BitSet without a reverse index reduces that overhead and makes it much more memory efficient, as the rough estimate below illustrates.
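A back-of-envelope comparison, with illustrative (assumed) per-entry sizes rather than measured ones:

// Rough, illustrative estimate (assumed sizes, not measured numbers):
// the HashMap route pays ~32 bytes of entry overhead plus a reference per
// watch, while the BitSet route pays roughly one bit per stored watch.
public class WatchMemoryEstimate {
    public static void main(String[] args) {
        long watches = 1_000_000L;
        long hashMapBytes = watches * (32 + 8); // entry overhead + reference
        long bitSetBytes = watches / 8;         // one bit per watch, densely packed
        System.out.printf("HashMap-based: ~%,d KB%n", hashMapBytes / 1024);
        System.out.printf("BitSet-based:  ~%,d KB%n", bitSetBytes / 1024);
    }
}

These rough numbers line up with the benchmark figures quoted later in this thread (tens of MB versus a fraction of a MB for 1M watches).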

@lvfangmin (Contributor Author)

There was a patch on the Jira a few years ago which used a BitMap as well, but it sacrificed performance when triggering watches; this patch improves on that and uses lazy clean-up for the watchers whose sessions have closed.

@anmolnar (Contributor) commented Aug 9, 2018

This is going to be a huge improvement, for example, for Accumulo, which is a heavy user of watchers. I'm going to allocate some capacity to review these new patches.

@lvfangmin (Contributor Author)

Added a JMH micro benchmark for the watch manager:

  • It shows a big win for the watch-heavy cases: the current implementation uses more than 50MB of memory to store 1M watches, while WatchManagerOptimized only uses around 0.2MB.
  • It also makes adding and triggering watches more efficient, since WatchManagerOptimized doesn't maintain the reverse map.
  • In the sparse watches use case, WatchManagerOptimized is expected to use a bit more memory because of the extra effort needed to maintain the bit sets. In the test it shows around 10% more memory usage.

Here are more results about the throughput/latency of the watch managers:

Benchmark                               (pathCount)    (watchManagerClass)  (watcherCount)  Mode  Cnt   Score    Error  Units
WatchBench.testAddConcentrateWatch            10000           WatchManager             N/A  avgt    9   5.382 ±  0.968  ms/op
WatchBench.testAddConcentrateWatch            10000  WatchManagerOptimized             N/A  avgt    9   0.696 ±  0.133  ms/op
WatchBench.testAddSparseWatch                 10000           WatchManager           10000  avgt    9   4.889 ±  1.585  ms/op
WatchBench.testAddSparseWatch                 10000  WatchManagerOptimized           10000  avgt    9   4.794 ±  1.068  ms/op
WatchBench.testTriggerConcentrateWatch            1           WatchManager               1  avgt    9  ≈ 10⁻⁴           ms/op
WatchBench.testTriggerConcentrateWatch            1           WatchManager            1000  avgt    9   0.037 ±  0.002  ms/op
WatchBench.testTriggerConcentrateWatch            1  WatchManagerOptimized               1  avgt    9  ≈ 10⁻⁴           ms/op
WatchBench.testTriggerConcentrateWatch            1  WatchManagerOptimized            1000  avgt    9   0.025 ±  0.001  ms/op
WatchBench.testTriggerConcentrateWatch         1000           WatchManager               1  avgt    9   0.048 ±  0.003  ms/op
WatchBench.testTriggerConcentrateWatch         1000           WatchManager            1000  avgt    9  71.838 ±  4.043  ms/op
WatchBench.testTriggerConcentrateWatch         1000  WatchManagerOptimized               1  avgt    9   0.079 ±  0.002  ms/op
WatchBench.testTriggerConcentrateWatch         1000  WatchManagerOptimized            1000  avgt    9  26.135 ±  0.223  ms/op
WatchBench.testTriggerSparseWatch             10000           WatchManager           10000  avgt    9   1.207 ±  0.035  ms/op
WatchBench.testTriggerSparseWatch             10000  WatchManagerOptimized           10000  avgt    9   1.321 ±  0.019  ms/op

You can try the following command to run the micro benchmark:

$ ant clean package
$ ant clean package -buildfile zookeeper-contrib/zookeeper-contrib-fatjar/build.xml
$ java -jar build/contrib/fatjar/zookeeper-dev-fatjar.jar jmh

@maoling @anmolnar Hope this gives you a more vivid comparison between the old and new watch manager implementations.

@anmolnar (Contributor) left a comment

@lvfangmin Please take a look at my initial feedback on the patch. Didn't have time to review the testing side, but it generally looks good.

@nkalmar There's a new project being added here: bench. Please take a quick look at it and advise on what would be the best place to put in terms of Maven migration. Thanks.

build.xml Outdated
@@ -119,6 +119,7 @@ xmlns:cs="antlib:com.puppycrawl.tools.checkstyle.ant">
<property name="test.java.classes" value="${test.java.build.dir}/classes"/>
<property name="test.src.dir" value="${src.dir}/java/test"/>
<property name="systest.src.dir" value="${src.dir}/java/systest"/>
<property name="bench.src.dir" value="${src.dir}/java/bench"/>
Contributor:

I think this new dir should be added to the classpath of the eclipse task too.

Contributor Author:

Will do.

@Override
public void removeWatcher(Watcher watcher) {
Integer watcherBit;
addRemovePathRWLock.writeLock().lock();
Contributor:

Do you need to acquire write lock here?

Contributor Author:

I need the exclusive lock to coordinate with addWatch; otherwise addWatch may still add a dead watch which won't be cleaned up by the WatcherCleaner once it has started cleaning.

Contributor:

In that case you need to move the addDeadWatcher() call inside the critical block.

Contributor Author:

Missed this comment last time. What we need here is that, once we have called addDeadWatcher, no more watches will be added for this dead watcher. Code executed after line 136 means the watcher has been marked as stale; after we release this lock, any in-flight addWatch for this dead watcher will be rejected, so it's guaranteed that when we call addDeadWatcher there is no race between removing and adding watches.

And I need to move addDeadWatcher out of the locking block, since the WatcherCleaner might block on it (to avoid an OOM issue) if the cleaner cannot keep up with cleaning the dead watchers. A sketch of this ordering follows.
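A minimal sketch of the ordering being described, with hypothetical field and method names (the real patch differs in its details):

import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: the watcher is unmapped under the write lock (so
// racing addWatch calls are excluded), but the potentially blocking
// addDeadWatcher call happens after the lock is released, so a slow
// WatcherCleaner cannot stall addWatch callers.
public class RemoveWatcherSketch {
    private final ReentrantReadWriteLock addRemovePathRWLock =
            new ReentrantReadWriteLock();

    public void removeWatcher(Object watcher) {
        Integer watcherBit;
        addRemovePathRWLock.writeLock().lock();
        try {
            // Unassign the watcher's bit; after this, in-flight addWatch
            // calls for this (now stale) watcher will be rejected.
            watcherBit = unassignBit(watcher);
            if (watcherBit == null) {
                return;
            }
        } finally {
            addRemovePathRWLock.writeLock().unlock();
        }
        // Outside the lock: may block if the cleaner is behind (backpressure).
        addDeadWatcher(watcherBit);
    }

    private Integer unassignBit(Object watcher) { return 1; } // placeholder
    private void addDeadWatcher(int bit) { /* enqueue for lazy clean-up */ }
}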

Contributor:

Where do you mark the watcher as stale inside the critical block?
It only calls a getter on the BitIdMap, right?

Contributor Author:

In the caller of removeWatcher, which is in NIOServerCnxn.close and NettyServerCnxn.close.

When the cnxn is closed, it sets stale before calling removeCnxn on zkServer, which in turn calls this function; so if we have grabbed this lock, it means the cnxn has been marked as stale.

We could explicitly setStale here as well, but I don't think that's necessary.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class WatchManagerFactory {
Contributor:

A quick javadoc would be awesome here.


import java.util.Set;

public interface DeadWatcherListener {
Contributor:

Please add a few words of javadoc here.

import java.util.BitSet;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BitMap<T> {
Contributor:

I think a short javadoc similar to BitHashSet's would be useful here.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.proto.ReplyHeader;

public class DumbWatcher extends ServerCnxn {
Contributor:

Please consider using mockito.

Contributor Author:

I agree that for unit tests mock objects are easier to maintain than stubs, but I also need this DumbWatcher in the micro benchmark. I'll put this class somewhere in the code where the micro benchmark and unit tests can share it.

@anmolnar (Contributor) commented Sep 5, 2018

@lvfangmin Docs for the new caching parameter zookeeper.bitHashCacheSize?

@nkalmar (Contributor) left a comment

On the new project bench:
I'm not sure we need a top-level module for this bench test (although more could come in the future...).

From my side, this can stay in an org.apache.zookeeper project, but then it should go into org.apache.zookeeper.bench in my opinion, and the directory structure should be src/java/org/apache/zookeeper/bench/**

It will be moved with the maven migration anyway (it can't stay in java/bench/org/..).

@lvfangmin Can you please move the directory, and possibly also create a new package "branch" for it, under zookeeper?

Otherwise, looking at the patch, really great job! Thanks!

@lvfangmin (Contributor Author)

Thanks @nkalmar for the suggestion on the bench project's position. I was following the same directory layout as src/java/systest for now; do you think we can move them together later? I'm not against moving it now.

If we want to move it now, just to confirm: the directory is src/java/org/apache/zookeeper/bench/ without a main folder, right?

@nkalmar (Contributor) commented Sep 6, 2018

systest will go to src/test/java; from my side, you can put the bench in org.apache.zookeeper.test.system. Thinking about it, that's a pretty good place.

No need to create a main directory or anything, I will move all the files anyway. Just move the files amongst the others in systest. (Hopefully there's no package-level dependency, I didn't check.)

But there's a chance you will have to rebase if this PR cannot be merged before the movement of all the remaining files, which is the directory refactor's last step. Sorry about that in advance... I'm not going to be the most popular person with that PR :(

@lvfangmin (Contributor Author)

Updates based on the comments:

  1. add findbugs exclusions
  2. add comments for the newly added classes
  3. add admin docs for the new JVM options introduced in this diff
  4. move the bench project to src/test/java/bench
  5. move DumbWatcher to the src/java/main dir so it can be shared between unit tests and bench
  6. change to use ExitCode

@nkalmar I moved the bench to src/test/java/bench, which seems more reasonable to me; let me know if you think that's not a good position based on your plan.

@lvfangmin (Contributor Author)

Resolved conflicts with the latest code on master.

@lvfangmin (Contributor Author)

Rebased to resolve conflicts. @anmolnar @hanm @maoling @nkalmar please revisit this PR when you have time.

@anmolnar (Contributor)

@lvfangmin Thanks. Sorry for the delay. I'd like to check one more thing before accepting. Bear with me please.

@anmolnar (Contributor)

@lvfangmin Just to wrap up the difference between this and the original 6-year-old patch on the Jira: you've added the deadWatchers collection and a lazy WatcherCleaner to avoid the performance penalty on removeWatches().

Is that correct?

@lvfangmin (Contributor Author)

@anmolnar Thanks for reviewing, take your time.

Here are the main differences between this version and the old one on the Jira:

  1. use a path-to-watchers map instead of watcher-to-paths to improve the performance of triggering watches; we need the lazy dead-watcher clean-up because of this (main change)
  2. better and cleaner implementation, for example the added WatchManagerFactory to easily switch between different watch manager implementations
  3. some perf improvement by using HashSet and BitSet together to find a balance between memory usage and time complexity
  4. fix the watcher leaking issue caused by adding dead watchers (we can separate this out if we want to)
  5. fix NettyServerCnxn not de-registering itself from the watch manager when the cnxn is closed (this was actually fixed recently in [ZOOKEEPER-3131] Remove watcher when session closed in NettyServerCnxn #612, so it's not necessary to do it here now)
  6. added the JMH micro benchmark

@anmolnar (Contributor) left a comment

Cool, thanks for the summary. It's very useful for the records.
+1 approved

@anmolnar (Contributor)

I cannot commit, because we still have a -1 from @hanm. Waiting for him to approve.
Does it make #612 redundant?

@asfgit commented Sep 21, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2223/

@@ -0,0 +1,12 @@
package org.apache.zookeeper;
Contributor:

This file is missing the Apache license header. This triggers a -1 in the last Jenkins build.

Contributor Author:

Will add it.

@hanm (Contributor) commented Sep 21, 2018

@anmolnar I'll sign this off by end of next Monday if no other issues.
@lvfangmin great work and thanks for your patience!

@hanm (Contributor) left a comment

Some nitpicks on spelling, a couple of questions, and some comments on code that looks suspicious to me.

import org.apache.zookeeper.server.ServerCnxn;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
Contributor:

Remove all imports here except these three, since the rest are not used (my guess is this file was copy-pasted?):
import java.io.PrintWriter; import org.apache.zookeeper.Watcher; import org.apache.zookeeper.Watcher.Event.EventType;

public boolean addWatch(String path, Watcher watcher);

/**
* Checks the specified watcher exists for the given path
Contributor:

nit: missing full stop at end of sentence.

public boolean containsWatcher(String path, Watcher watcher);

/**
* Removes the specified watcher for the given path
Contributor:

nit: missing full stop at end of sentence.


/**
* Distribute the watch event for the given path, but ignore those
* supressed ones.
Contributor:

spell check: suppressed instead of supressed

* @return the watchers have been notified
*/
public WatcherOrBitSet triggerWatch(
String path, EventType type, WatcherOrBitSet supress);
Contributor:

similar spelling issue for supress

/**
* Interface used to process the dead watchers related to closed cnxns.
*/
public interface DeadWatcherListener {
Contributor:

Would be good to rename this to IDeadWatcherListener, which makes it obvious this is an interface. We already do this for IWatchManager.

<ulink url="https://issues.apache.org/jira/browse/ZOOKEEPER-1179">ZOOKEEPER-1179</ulink> The new watcher
manager WatchManagerOptimized will clean up the dead watchers lazily, this config is used to decide how
many thread is used in the WatcherCleaner. More thread usually means larger clean up throughput. The
default value is 2, which is good enough even for heavy and continuous session closing/receating cases.</para>
Contributor:

closing/recreating

try {
if (deadWatchers.size() < watcherCleanThreshold) {
int maxWaitMs = (watcherCleanIntervalInSeconds +
r.nextInt(watcherCleanIntervalInSeconds / 2 + 1)) * 1000;
Contributor:

Is there a particular reason for choosing this versus, say, exponential backoff?

Contributor Author:

Cleaning up dead watchers on a large watch ensemble is heavy work which might affect performance, so we add jitter to make sure we don't do the lazy clean-up at the same time on all the servers in the ensemble (see the sketch below).
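A minimal sketch of the jitter computation, mirroring the snippet above (the class and method names here are hypothetical):

import java.util.Random;

// Hypothetical sketch: each server waits the base interval plus a random
// jitter of up to half the interval, so the ensemble's servers don't all
// start the heavy dead-watcher clean-up at the same moment.
public class CleanupJitter {
    private static final Random r = new Random();

    static int maxWaitMs(int cleanIntervalSeconds) {
        return (cleanIntervalSeconds
                + r.nextInt(cleanIntervalSeconds / 2 + 1)) * 1000;
    }

    public static void main(String[] args) {
        // With a 10s interval, servers wait between 10s and 15s.
        System.out.println(maxWaitMs(10) + " ms");
    }
}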

deadWatchers.clear();
int total = snapshot.size();
LOG.info("Processing {} dead watchers", total);
cleaners.schedule(new WorkRequest() {
Contributor:

indentation issue on this line...

synchronized (this) {
// Snapshot of the current dead watchers
final Set<Integer> snapshot = new HashSet<Integer>(deadWatchers);
deadWatchers.clear();
Contributor:

Is there a particular reason to copy deadWatchers, clear it immediately and then do the work, instead of just operating on deadWatchers directly and only clearing it after the work is done? I assume the motivation was to free deadWatchers earlier so we can pipeline the work: adding more dead watchers while the previous round of cleaning is in progress. But it looks like the new dead watchers will block on totalDeadWatchers, which will only be reset after the previous dead watchers were cleaned up.

Contributor Author:

Cleaning the dead watchers needs to go through all the current watches, which is pretty heavy and may take a second if there are millions of watches; that's why we're doing lazy batch clean-up here. We don't want to block addDeadWatcher, which is called from Cnxn.close, while we're doing the clean-up work (see the sketch below).

totalDeadWatchers is used to avoid OOM when the watcher cleaner cannot catch up (we haven't seen this problem even in heavy reconnect scenarios); it is suggested to set it to something like 1000 * watcherCleanThreshold. watcherCleanThreshold controls the batch size when doing the clean-up; there is a trade-off between GC-ing the dead watchers' memory and the time complexity of cleaning up, so we cannot set it too large.
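A minimal sketch of the snapshot-and-clear pattern being described (hypothetical names, simplified from the patch):

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: take a snapshot of the pending dead watchers under
// the lock and clear the shared set immediately, so addDeadWatcher can keep
// appending while the (slow) scan of all watches runs on the cleaner thread.
public class SnapshotAndClean {
    private final Set<Integer> deadWatchers = new HashSet<>();

    public synchronized void addDeadWatcher(int watcherBit) {
        deadWatchers.add(watcherBit);
    }

    public void cleanOnce() {
        final Set<Integer> snapshot;
        synchronized (this) {
            snapshot = new HashSet<>(deadWatchers); // copy under lock
            deadWatchers.clear();                   // unblock producers early
        }
        // Heavy work outside the lock: scan all watches for these bits.
        processDeadWatchers(snapshot);
    }

    private void processDeadWatchers(Set<Integer> bits) {
        // placeholder for scanning the path -> BitHashSet structures
    }
}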

@lvfangmin (Contributor Author) left a comment

Thanks @hanm for the detailed review. I have made comments to address your concerns; let me know if anything is unclear.

Meanwhile, I'll remove the unused imports and correct the typos as you suggested.

if (elementBit == null) {
return false;
}
return elementBits.get(elementBit);
Contributor Author:

BitSet.get is O(1); checking the cache first may actually be more expensive.

The HashSet is used to optimize iteration: for example, if there is a single element in this BitHashSet but its bit index is very large, then without the HashSet we would need to go through all the words before returning that element, which is not efficient. The sketch below shows the combination.
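A minimal sketch of that BitSet-plus-HashSet combination (a hypothetical simplification of BitHashSet; the real class also has a bounded cache and its own synchronization contract):

import java.util.BitSet;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Hypothetical sketch: the BitSet gives O(1) membership checks and compact
// storage; the companion HashSet makes iteration proportional to the number
// of elements rather than to the highest bit index.
public class BitHashSetSketch implements Iterable<Integer> {
    private final BitSet elementBits = new BitSet();
    private final Set<Integer> elements = new HashSet<>();

    public boolean add(int elementBit) {
        if (elementBits.get(elementBit)) {
            return false;
        }
        elementBits.set(elementBit);
        elements.add(elementBit);
        return true;
    }

    public boolean contains(int elementBit) {
        return elementBits.get(elementBit); // O(1), no need to check the set
    }

    @Override
    public Iterator<Integer> iterator() {
        return elements.iterator(); // caller must synchronize externally
    }
}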

* iterate through this set.
*/
@Override
public Iterator<Integer> iterator() {
Contributor Author:

It's used in triggerWatch with a for-each loop over the iterator.


Integer bit = getBit(value);
if (bit != null) {
return bit;
}
Contributor Author:

This BitMap is used by WatchManagerOptimized.watcherBitIdMap, which stores the watcher-to-bit mapping.

add might be called a lot if the same client connection is watching thousands or even millions of nodes, while remove is only called once, when the session is closed; that's why we optimized add to check under the read lock first, but take the write lock directly in remove (see the sketch below).
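A minimal sketch of the read-lock-first pattern in add (a hypothetical simplification of the BitMap class):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: add() first checks under the cheap read lock (the
// common case: the watcher already has a bit); only on a miss does it take
// the write lock, re-checking to handle the race between the two locks.
public class BitMapSketch<T> {
    private final Map<T, Integer> value2Bit = new HashMap<>();
    private final ReentrantReadWriteLock rwLock = new ReentrantReadWriteLock();
    private int nextBit = 0;

    public Integer add(T value) {
        rwLock.readLock().lock();
        try {
            Integer bit = value2Bit.get(value);
            if (bit != null) {
                return bit; // fast path: taken millions of times
            }
        } finally {
            rwLock.readLock().unlock();
        }
        rwLock.writeLock().lock();
        try {
            // Re-check: another thread may have added it in between.
            return value2Bit.computeIfAbsent(value, v -> nextBit++);
        } finally {
            rwLock.writeLock().unlock();
        }
    }

    public void remove(T value) {
        rwLock.writeLock().lock(); // rare: once per session close
        try {
            value2Bit.remove(value);
        } finally {
            rwLock.writeLock().unlock();
        }
    }
}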

* limitations under the License.
*/

package org.apache.zookeeper.server.watch;
Contributor Author:

At the beginning, when we added this class, it was bound to Watcher, but not anymore after refactoring; we can move this to server.util. I'll do that.

*/
public class BitHashSet implements Iterable<Integer> {

static final long serialVersionUID = 6382565447128283568L;
Contributor Author:

Previously this used inheritance from HashSet instead of composition, and we added this serialVersionUID at that time; it wasn't removed after changing to composition. Will remove it.



@lvfangmin (Contributor Author)

@anmolnar #612 was filed because a user reported that issue, and it's better to solve it earlier than to wait for this diff to be reviewed and merged. I'll rebase to get rid of that change in this patch.

// this is will slow down the socket packet processing and
// the adding watches in the ZK pipeline.
while (maxInProcessingDeadWatchers > 0 && !stopped &&
totalDeadWatchers.get() >= maxInProcessingDeadWatchers) {
Contributor:

I think this should be maxInProcessingDeadWatchers != -1 && totalDeadWatchers.get() >= maxInProcessingDeadWatchers. Otherwise we'll always wait on totalDeadWatchers if the user uses the default configuration value of maxInProcessingDeadWatchers.

Contributor Author:

@hanm I'm not sure I understand this correctly; the default value of maxInProcessingDeadWatchers is -1, in which case it will skip checking totalDeadWatchers. Am I missing anything?

Contributor:

Oops, did not see maxInProcessingDeadWatchers > 0. We are good here.

import org.apache.zookeeper.server.ServerCnxn;
import org.apache.zookeeper.server.util.BitMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
Contributor:

couple of unused imports here

* Optimized in memory and time complexity, compared to WatchManager, both the
* memory consumption and time complexity improved a lot, but it cannot
* efficiently remove the watcher when the session or socket is closed, for
* majority usecase this is not a problem.
Contributor:

nit: use case

return null;
}

int triggeredWatches = 0;
Contributor:

we can remove this - it's assigned later but never used.

Contributor Author:

We had metrics for this; I removed those metrics because #580 was still in review. I'll add them back since that patch has been merged.

Contributor:

I knew it was the metrics. It's fine to leave this variable and we can add the metrics in another patch, since this patch is already big enough and almost ready to land.


@Override
public WatcherOrBitSet triggerWatch(
String path, EventType type, WatcherOrBitSet supress) {
Contributor:

nit - suppress

@Override
public boolean addWatch(String path, Watcher watcher) {
boolean result = false;
addRemovePathRWLock.readLock().lock();
Contributor:

Is the purpose of using a read lock here to optimize for addWatch-heavy workloads? Would be good to add a comment about why a read lock is chosen instead of a write lock.

Contributor Author:

Yes, it's used to improve the read throughput; creating a new watcher bit and adding it to the BitHashSet has its own lock to minimize the lock scope. I'll add some comments here.

if (watchers == null) {
watchers = new BitHashSet();
BitHashSet existingWatchers = pathWatches.putIfAbsent(path, watchers);
if (existingWatchers != null) {
Contributor:

Is this check necessary because we are using a read lock here, so it's possible for another thread to modify pathWatches while we are in this method?

Contributor Author:

That's correct; read requests are processed concurrently in the CommitProcessor worker service, so multiple threads might add to pathWatches while we're holding the read lock. That's why we need this check here (see the sketch below).
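A minimal sketch of why the putIfAbsent re-check is needed under the read lock (a hypothetical simplification of addWatch, using a plain Set instead of BitHashSet):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: many reader threads can run addWatch concurrently
// (they only hold the shared read lock), so two threads may both see a
// missing entry; putIfAbsent makes exactly one new set win, and the loser
// must use the winner's set instead of its own.
public class AddWatchSketch {
    private final ConcurrentHashMap<String, Set<Integer>> pathWatches =
            new ConcurrentHashMap<>();

    public boolean addWatch(String path, int watcherBit) {
        Set<Integer> watchers = pathWatches.get(path);
        if (watchers == null) {
            watchers = ConcurrentHashMap.newKeySet();
            Set<Integer> existing = pathWatches.putIfAbsent(path, watchers);
            if (existing != null) {
                watchers = existing; // another thread created it first
            }
        }
        return watchers.add(watcherBit);
    }
}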


@Override
public boolean containsWatcher(String path, Watcher watcher) {
BitHashSet watchers = pathWatches.get(path);
Contributor:

Would be good to add a comment regarding why no synchronization is required here.

@hanm (Contributor) commented Sep 24, 2018

Thanks @lvfangmin for the detailed reply.
Just had another review pass over some files I forgot to look at last time. Overall looks good; will sign this off once all review comments are addressed. Thanks.

@lvfangmin mentioned this pull request Sep 25, 2018
@lvfangmin (Contributor Author)

Added more comments about the locking where @anmolnar and @hanm asked during review; this will make future reference easier as well. Also corrected the typos and removed the unused imports.

@asfgit commented Sep 28, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2278/

@hanm (Contributor) left a comment

LGTM, thanks @lvfangmin. Will commit today after running it through Jenkins a few times.

@hanm (Contributor) commented Sep 28, 2018

got a "green" build (minors the known failed test testReconfigRemoveClientFromStatic discussed at ZOOKEEPER-2847).

@asfgit closed this in fdde8b0 Sep 28, 2018
@hanm (Contributor) commented Sep 28, 2018

Merged to master. Great work @lvfangmin!

@vivekpatani

Is it worth backporting this to 3.4? @hanm @lvfangmin

@eolivelli (Contributor)

@vivekpatani Sorry, but 3.4 is end of life.
Consider moving to 3.6.x.
We are going to release 3.7.0, so even 3.5.x is becoming old.

RokLenarcic pushed a commit to RokLenarcic/zookeeper that referenced this pull request Sep 3, 2022

ZOOKEEPER-1177: Add the memory optimized watch manager for concentrate watches scenario

Author: Fangmin Lyu <allenlyu@fb.com>

Reviewers: Andor Molnár <andor@apache.org>, Norbert Kalmar <nkalmar@yahoo.com>, Michael Han <hanm@apache.org>

Closes apache#590 from lvfangmin/ZOOKEEPER-1177