NPE in broker: EntryImpl.getLength() #10433

Closed
dinghram opened this issue Apr 29, 2021 · 27 comments · Fixed by #21302
Labels: lifecycle/stale, type/bug

Comments

@dinghram

During a CHO (Continuous Hours of Operation) test, one broker crashed with an NPE. We lost the logs due to a cleanup of files, but we have the general stack trace:
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java: 132: return data.readableBytes();
managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryCacheImpl.java: 115: manager.entryAdded(entry.getLength());

@lhotari (Member) commented Apr 29, 2021

@dinghram Thanks for reporting.

Did I understand the problem report correctly that the NPE is happening in EntryImpl.java at line 132?

@Override
public int getLength() {
    return data.readableBytes();
}

Called from EntryCacheImpl.java, line 115:

EntryImpl cacheEntry = EntryImpl.create(position, cachedData);
cachedData.release();
if (entries.put(position, cacheEntry)) {
    manager.entryAdded(entry.getLength());

Please share the Pulsar version and deployment type (plain/docker/k8s helm). The target Java JVM version would also be useful if you aren't deploying with the official k8s Helm chart and Docker images.

One experiment that could be useful in tracking down the problem would be to add -Dio.netty.recycler.maxCapacityPerThread=0 to the Pulsar Broker JVM options by adding it to the PULSAR_EXTRA_OPTS environment variable.

For example, in the k8s Helm chart deployment, an environment variable can be set for the broker by adding it under broker.configData in the values.yaml for the deployment.

broker:
  configData:
    PULSAR_EXTRA_OPTS: -Dio.netty.recycler.maxCapacityPerThread=0

The -Dio.netty.recycler.maxCapacityPerThread=0 setting disables object instance reuse that might be causing the issue that you have reported. Pulsar uses the Netty framework's Recycler class to minimize allocations of some object instances.
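For context, the Recycler pattern that EntryImpl follows looks roughly like the sketch below (a hypothetical PooledThing class for illustration, not Pulsar code). With -Dio.netty.recycler.maxCapacityPerThread=0 the per-thread pool keeps no instances, so get() effectively allocates a fresh object every time instead of handing back a recycled one.

import io.netty.util.Recycler;

// Minimal sketch of a Recycler-backed pooled class (hypothetical; EntryImpl follows the same shape).
public final class PooledThing {

    private static final Recycler<PooledThing> RECYCLER = new Recycler<PooledThing>() {
        @Override
        protected PooledThing newObject(Handle<PooledThing> handle) {
            return new PooledThing(handle);
        }
    };

    private final Recycler.Handle<PooledThing> handle;
    Object payload;

    private PooledThing(Recycler.Handle<PooledThing> handle) {
        this.handle = handle;
    }

    public static PooledThing get(Object payload) {
        PooledThing thing = RECYCLER.get();   // reuses a previously recycled instance when one is available
        thing.payload = payload;
        return thing;
    }

    public void recycle() {
        payload = null;                       // fields are reset to null before the instance is reused
        handle.recycle(this);                 // return the instance to the pool, possibly from another thread
    }
}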

Would it be possible for you to do a test run with PULSAR_EXTRA_OPTS=-Dio.netty.recycler.maxCapacityPerThread=0 to see if the problem reproduces with this setting?

@merlimat (Contributor)

The -Dio.netty.recycler.maxCapacityPerThread=0 setting disables object instance reuse that might be causing the issue that you have reported

Though it won't disable the call to recycle(), which is where we're resetting the fields to null.

@lhotari (Member) commented Apr 29, 2021

Though it won't disable the call to recycle, in where we're resetting the fields to nulls.

@merlimat I guess the data field shouldn't be null in this case at all. The data field is set by the EntryImpl.create(position, cachedData) call on line 112.

EntryImpl cacheEntry = EntryImpl.create(position, cachedData);
cachedData.release();
if (entries.put(position, cacheEntry)) {
    manager.entryAdded(entry.getLength());

EntryImpl.create:

public static EntryImpl create(PositionImpl position, ByteBuf data) {
    EntryImpl entry = RECYCLER.get();
    entry.timestamp = System.nanoTime();
    entry.ledgerId = position.getLedgerId();
    entry.entryId = position.getEntryId();
    entry.data = data;
    entry.data.retain();
    entry.setRefCnt(1);
    return entry;
}

The reported NPE is "impossible". :)

@merlimat (Contributor)

The only explanation would be for the entry passed in to be released while the cache insert is still happening.

@merlimat (Contributor)

But that too doesn’t look to be possible....

@lhotari (Member) commented Apr 29, 2021

I wonder if something in this direction could be happening (see the sketch after the list below):

Let's say we have 2 threads, T1 and T2 and a single object instance O1.

  1. T1 gets O1 from the pool
  2. T1 sets fields on O1
  3. T1 shares O1 to T2
  4. T2 sets O1's fields to null and returns O1 to the pool
  5. T1 gets O1 from the pool
  6. T1 sets fields on O1
  7. T2 reads O1 fields. The values that T1 set in step 6 aren't visible to T2. The visible values are the null values set in step 4.
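
To make the scenario above concrete, here is a schematic sketch (hypothetical VisibilitySketch and PooledObject classes with a deliberately unsynchronized one-slot "pool"; not Netty's Recycler). Nothing in it establishes a happens-before edge between the two threads, so the Java Memory Model allows T2 in step 7 to still observe the null it wrote in step 4, even after T1's write in step 6. It illustrates the hazard only; it is not a deterministic reproducer.

public class VisibilitySketch {

    static final class PooledObject {
        String field;                       // plain field: no volatile, no locks
    }

    static PooledObject poolSlot;           // plain static field acting as the unsynchronized "pool"

    public static void main(String[] args) throws InterruptedException {
        PooledObject o1 = new PooledObject();     // step 1: T1 obtains O1 (allocation stands in for the pool)
        o1.field = "first use";                   // step 2: T1 sets fields on O1

        Thread t2 = new Thread(() -> {            // step 3: T1 shares O1 with T2
            o1.field = null;                      // step 4: T2 sets the fields to null ...
            poolSlot = o1;                        // ... and returns O1 to the "pool"

            sleepQuietly(10);                     // crude scheduling so T1 can run steps 5 and 6

            // step 7: T2 reads O1 again; the JMM does not guarantee that T1's write
            // from step 6 is visible here, so this may still print null
            System.out.println("T2 sees: " + o1.field);
        });
        t2.start();

        sleepQuietly(5);                          // crude scheduling for the sketch
        PooledObject reused = poolSlot;           // step 5: T1 takes O1 back (may be null if T2's write isn't visible yet)
        if (reused != null) {
            reused.field = "second use";          // step 6: T1 repopulates the fields
        }
        t2.join();
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}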

I've recently been watching Aleksey Shipilëv's "Java Memory Model Unlearning Experience" presentation.

In the presentation, at about 21 minutes 50 seconds, there's an interesting example (screenshot of the slide omitted).

@codelipenghui (Contributor)

@lhotari The pool we are using here is a thread-local pool, and we always use the same thread for a managed ledger to insert entries into the cache:

EntryImpl entry = EntryImpl.create(ledger.getId(), entryId, data);

So the entry does not seem to have an opportunity to be shared with another thread.

And if execution reaches

EntryImpl cacheEntry = EntryImpl.create(position, cachedData);

it means the data is not null at this point; otherwise an NPE would already be thrown there. It's very interesting. If the entry is not inserted into the cache, the cache eviction can also get a chance to access the entry.

@lhotari (Member) commented Apr 29, 2021

@codelipenghui Thanks for pointing to the OpAddEntry code location. I initially didn't realize that the code had both entry and cacheEntry, and that it's entry.data that is null, with the source for that field being OpAddEntry.data, as you pointed out.

public void safeRun() {
    // Remove this entry from the head of the pending queue
    OpAddEntry firstInQueue = ml.pendingAddEntries.poll();
    checkArgument(this == firstInQueue);

    ManagedLedgerImpl.NUMBER_OF_ENTRIES_UPDATER.incrementAndGet(ml);
    ManagedLedgerImpl.TOTAL_SIZE_UPDATER.addAndGet(ml, dataLength);

    if (ml.hasActiveCursors()) {
        // Avoid caching entries if no cursor has been created
        EntryImpl entry = EntryImpl.create(ledger.getId(), entryId, data);
        // EntryCache.insert: duplicates entry by allocating new entry and data. so, recycle entry after calling
        // insert
        ml.entryCache.insert(entry);
        entry.release();
    }

Even with this updated information, the problem is "impossible"... :) This seems to be a nice Java Concurrency puzzler.

@lhotari (Member) commented Apr 29, 2021

The pool here we are using is a thread-local pool, and always use the the same thread for a managed ledger to insert entry to the cache

@codelipenghui it seems that Netty's Recycler is thread-local for lookup and creation, but an object can be returned to the pool from any thread. This multi-threaded access seems to lead to visibility problems that are possible under the JMM.

@dinghram (Author)

Pulsar version: 2.7.0 + cherry-picked fixes
Deployment: helm
Java version:
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (IcedTea 3.18.0) (build 1.8.0_282-b07 suse-lp152.2.9.1-x86_64)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
OS: openSUSE Leap 15.2 + patches

@lhotari (Member) commented Apr 30, 2021

There's a Netty bug that has been fixed in 4.1.60.Final by this commit. Pulsar 2.7.2 will be released with Netty 4.1.60.Final. The upgrade to Netty 4.1.60.Final in Pulsar was made as part of #10073.

@dinghram Would you be able to re-test with a build of Pulsar from branch-2.7, since that would include Netty 4.1.60.Final?
The instructions for building custom Docker images are in the README.

@lhotari (Member) commented Apr 30, 2021

Another NPE bug that is related to the Netty Recycler: #5727.

@lhotari (Member) commented Apr 30, 2021

Since the reported problem could be related to Netty Recycler usage, I'd recommend doing the testing with Netty Recycler completely turned off (this was already proposed in the first comment).

You can find positive reports about disabling the Netty Recycler at netty/netty#5968 (comment) and netty/netty#5968 (comment). Elasticsearch also disabled the Netty Recycler completely because of the problems it causes.

@dinghram Would you be able to run your CHO (Continuous Hours of Operation) tests with Pulsar configured with -Dio.netty.recycler.maxCapacityPerThread=0?
It would also be very useful if you could share the possible impact on performance (throughput, latency, resource utilization) after disabling the Netty Recycler by setting -Dio.netty.recycler.maxCapacityPerThread=0 (in PULSAR_EXTRA_OPTS, as explained in the first comment).

@lhotari (Member) commented May 3, 2021

#2993 seems related.

@lhotari (Member) commented May 4, 2021

UPDATE: The following description of a race condition doesn't seem to be possible, since the reference counter in EntryImpl's base class protects against it (AbstractCASReferenceCounted.release0).


My current assumption is that a race condition is possible when cache invalidation happens with a certain timing.

The possible race condition:

Code in EntryCacheImpl, asyncReadEntry0 method (code is for branch-2.7):

Collection<EntryImpl> cachedEntries = entries.getRange(firstPosition, lastPosition);

if (cachedEntries.size() == entriesToRead) {
    long totalCachedSize = 0;
    final List<EntryImpl> entriesToReturn = Lists.newArrayListWithExpectedSize(entriesToRead);

    // All entries found in cache
    for (EntryImpl entry : cachedEntries) {
        entriesToReturn.add(EntryImpl.create(entry));
        totalCachedSize += entry.getLength();
        entry.release();
    }

Code for entries.getRange, referenced on line 272 in the code snippet above:

public Collection<Value> getRange(Key first, Key last) {
    List<Value> values = Lists.newArrayList();

    // Return the values of the entries found in cache
    for (Value value : entries.subMap(first, true, last, true).values()) {
        try {
            value.retain();
            values.add(value);
        } catch (Throwable t) {
            // Value was already destroyed between get() and retain()
        }
    }

    return values;
}

Explanation of a possible race condition (a simplified sketch follows after this list):

  • One thread reads entries from the cache; the entries get marked as in use with the value.retain() call (line 115 in RangeCache).
  • In the meantime, the cache entry gets invalidated and the invalidation calls release() for the entry.
  • The EntryImpl instance gets returned to the Recycler object pool, and another thread picks it up for use.
  • The thread that read the entries from the cache then calls .release() on the instance that is now used by another thread (line 282 in EntryCacheImpl).
  • Since release has been called, it can cause the NPE at managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/EntryImpl.java:132 (return data.readableBytes();) reported in this issue.
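
To make the hazard in these bullets concrete, here is a deterministic, single-threaded sketch using a hypothetical, deliberately unguarded PooledEntry class. It is not Pulsar's EntryImpl (whose CAS-based reference counting prevents exactly this sequence, as the UPDATE above notes); it only shows how one stale, duplicated release on a recycled instance nulls out the data under the new owner and leads to the NPE in getLength().

import java.util.ArrayDeque;

public class StaleReleaseSketch {

    // Hypothetical pooled entry with no CAS guard; for illustration only.
    static final class PooledEntry {
        static final ArrayDeque<PooledEntry> POOL = new ArrayDeque<>();

        byte[] data;    // stands in for the ByteBuf held by the real entry
        int refCnt;

        static PooledEntry create(byte[] data) {
            PooledEntry e = POOL.isEmpty() ? new PooledEntry() : POOL.pop();
            e.data = data;
            e.refCnt = 1;
            return e;
        }

        void retain() {
            refCnt++;
        }

        void release() {
            // no protection against releasing an instance that was already recycled and handed out again
            if (--refCnt == 0) {
                data = null;        // fields reset on recycle, like the deallocate step of a pooled entry
                POOL.push(this);    // instance goes back to the pool
            }
        }

        int getLength() {
            return data.length;     // NPE if data was nulled out by a stale release
        }
    }

    public static void main(String[] args) {
        PooledEntry cached = PooledEntry.create(new byte[10]);  // cache's reference, refCnt = 1
        cached.retain();                                        // a reader takes a reference, refCnt = 2
        cached.release();                                       // cache invalidation, refCnt = 1
        cached.release();                                       // reader finishes, refCnt = 0 -> recycled

        PooledEntry reused = PooledEntry.create(new byte[20]);  // the same instance comes back from the pool
        cached.release();                                       // stale, duplicated release from the old code path
        System.out.println(reused.getLength());                 // NullPointerException: data is null again
    }
}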

@eolivelli (Contributor)

This comment, "// Value was already destroyed between get() and retain()", is very bad.
Catching Throwable may lead to totally unpredictable behaviour... OutOfMemoryError? NPE? IllegalXXXException?

@lhotari (Member) commented May 4, 2021

When looking more at the possible race condition, the previously explained scenario seems to be possible with a slight variation. It seems that invalidation while the entry is in use is possible, but the remaining steps are different. It seems that it would require 2 concurrent invalidations for the problem to manifest itself.

It might sound like a very unlikely situation, but the problem in this issue is also very rare. With enough operations, a very unlikely call path will eventually get executed.

@lhotari (Member) commented May 4, 2021

I created a minimal fix without any new tests in #10480 .

@dlg99 (Contributor) commented May 4, 2021

@lhotari Let me poke into this theory:

Initial refCnt of the entry in cache == 1

one thread reads entries from the cache. the entries get marked for use with the value.retain(); call (line 115 in RangeCache).

A: value.retain() succeeds: refCnt++ == 2
B: value.retain() throws (other thread released it already somehow): entry is not returned.

in the meantime, the cache entry gets invalidated and invalidation calls release() for the entry

A: refCnt-- == 1
B: entry was not returned, could not be used

"the EntryImpl instance gets returned to the Recycler object pool" does not happen in case A.

Did I miss something there?
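
For reference, case B above relies on CAS-guarded reference counting. Here is a schematic of that idea (hypothetical code, not the actual AbstractCASReferenceCounted implementation): retain() refuses to "resurrect" an object whose count has already reached zero, and only the single thread that wins the 1 -> 0 transition gets to deallocate and recycle the instance.

import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Schematic CAS-guarded reference counting; hypothetical, for illustration only.
abstract class CasRefCounted {

    private static final AtomicIntegerFieldUpdater<CasRefCounted> REF_CNT =
            AtomicIntegerFieldUpdater.newUpdater(CasRefCounted.class, "refCnt");

    private volatile int refCnt = 1;

    void retain() {
        while (true) {
            int current = refCnt;
            if (current <= 0) {
                // already deallocated: refuse to resurrect the object (case B above)
                throw new IllegalStateException("refCnt: " + current);
            }
            if (REF_CNT.compareAndSet(this, current, current + 1)) {
                return;
            }
        }
    }

    void release() {
        while (true) {
            int current = refCnt;
            if (current <= 0) {
                throw new IllegalStateException("refCnt: " + current);
            }
            if (REF_CNT.compareAndSet(this, current, current - 1)) {
                if (current == 1) {
                    deallocate();   // only the thread that wins the 1 -> 0 transition recycles the instance
                }
                return;
            }
        }
    }

    protected abstract void deallocate();
}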

@dlg99 (Contributor) commented May 4, 2021

@lhotari there are other players here:
implementations of ReadEntriesCallback and ReadEntryCallback can call an extra .release().

@lhotari (Member) commented May 5, 2021

(Quoting @dlg99's analysis from the previous comment.)

@dlg99 yes, you are right that it's not possible in the way that I first explained it. There's a more recent comment about the races in invalidating the entries which could lead to a somewhat similar situation. The referenced PR contains more discussion about the possible explanations.

@dinghram (Author)

@lhotari Thanks for the feedback. We tested with the -Dio.netty.recycler.maxCapacityPerThread=0 option and saw no negative performance impact. In fact, the CPU usage was lower with this option set. However, this was a single test and not definitive of how Pulsar would behave long term with this setting. As such, it seems we can run with this option set if a customer begins seeing the NPE occur (we have only seen it once).

@lhotari (Member) commented Jan 24, 2022

This problem might be fixed by #13328. The Netty Recycler was rewritten in 4.1.71.Final with PR netty/netty#11858. Pulsar uses the Netty Recycler extensively, and the improvements to it might fix various thread-safety problems.

@codelipenghui (Contributor)

This issue had no activity for 30 days; marking it with the Stale label.

@mattisonchao (Member)

This problem was fixed by #15707.

@tisonkun (Member)

@mattisonchao as a committer you should be able to close this issue as completed :)

@mattisonchao (Member)

@tisonkun Got it. thanks!
