Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][ml] Fix thread safe issue with RangeCache.put and RangeCache.clear #21302

Merged
merged 1 commit into from
Oct 7, 2023

Conversation

lhotari
Copy link
Member

@lhotari lhotari commented Oct 5, 2023

Fixes #21301
Fixes #10433

Motivation

There are 2 thread safety issues in RangeCache.

RangeCache.put:

public boolean put(Key key, Value value) {
MutableBoolean flag = new MutableBoolean();
entries.computeIfAbsent(key, (k) -> {
size.addAndGet(weighter.getSize(value));
flag.setValue(true);
return value;
});
return flag.booleanValue();
}

computeIfAbsent doesn't lock the key when ConcurrentSkipListMap is used.
You can see this in the source code of ConcurrentMap:

    default V computeIfAbsent(K key,
            Function<? super K, ? extends V> mappingFunction) {
        Objects.requireNonNull(mappingFunction);
        V oldValue, newValue;
        return ((oldValue = get(key)) == null
                && (newValue = mappingFunction.apply(key)) != null
                && (oldValue = putIfAbsent(key, newValue)) == null)
            ? newValue
            : oldValue;
    }

On the other hand, ConcurrentHashMap.computeIfAbsent does lock the key and atomically call the mappingFunction. Please check the references that @michaeljmarshall shared in the issue #21301. Kudos to @michaeljmarshall for pinpointing the issue!

Another problem is in RangeCache.clear. There should be no call to entries.clear() in that method.

Modifications

Fix the problems in RangeCache.put and RangeCache.clear.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@lhotari lhotari added type/bug The PR fixed a bug or issue reported a bug ready-to-test area/ML category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost labels Oct 5, 2023
@lhotari lhotari added this to the 3.2.0 milestone Oct 5, 2023
@lhotari lhotari self-assigned this Oct 5, 2023
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 5, 2023
Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing it!

@codecov-commenter
Copy link

Codecov Report

Merging #21302 (0db0464) into master (643428b) will increase coverage by 36.29%.
The diff coverage is 100.00%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #21302       +/-   ##
=============================================
+ Coverage     36.91%   73.21%   +36.29%     
- Complexity    12284    32488    +20204     
=============================================
  Files          1698     1887      +189     
  Lines        130510   140196     +9686     
  Branches      14260    15437     +1177     
=============================================
+ Hits          48183   102644    +54461     
+ Misses        76016    29454    -46562     
- Partials       6311     8098     +1787     
Flag Coverage Δ
inttests 24.16% <66.66%> (-0.13%) ⬇️
systests 24.76% <66.66%> (-0.09%) ⬇️
unittests 72.52% <100.00%> (+40.41%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...org/apache/bookkeeper/mledger/util/RangeCache.java 96.55% <100.00%> (+24.96%) ⬆️

... and 1452 files with indirect coverage changes

Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

The new solution is also better than #15707 to avoid any issues with the entry release. In #15707, it tries to resolve the problem by getting the size before putting it into the map, because the entry release is triggered by the clear() method. The new solution can also fix the potential issue from any other cases released the entry before getting the entry size.

And the change of this line https://github.com/apache/pulsar/pull/21302/files#diff-cf9a27d8abb87f6c4fe85e6e597230e9fc102c0ca8169eed08a44218c554c5edL245 will fix the issue that the newly added items to the cache be removed without the entry release.

@codelipenghui codelipenghui merged commit 70d086f into apache:master Oct 7, 2023
51 checks passed
lhotari added a commit that referenced this pull request Oct 7, 2023
lhotari added a commit that referenced this pull request Oct 7, 2023
lhotari added a commit that referenced this pull request Oct 7, 2023
lhotari added a commit that referenced this pull request Oct 7, 2023
liangyuanpeng pushed a commit to liangyuanpeng/pulsar that referenced this pull request Oct 11, 2023
vinayakmalik95 pushed a commit to tmdc-io/pulsar that referenced this pull request Oct 12, 2023
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 20, 2023
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ML category/reliability The function does not work properly in certain specific environments or failures. e.g. data lost cherry-picked/branch-2.10 cherry-picked/branch-2.11 cherry-picked/branch-3.0 cherry-picked/branch-3.1 doc-not-needed Your PR changes do not impact docs ready-to-test release/2.10.6 release/2.11.3 release/3.0.2 release/3.1.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Incorrect usage of ConcurrentSkipListMap#computeIfAbsent NPE in broker: EntryImpl.getLength()
7 participants