Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NPE in managed ledger on read failed #11521

Closed
bharanic-dev opened this issue Jul 30, 2021 · 7 comments
Closed

NPE in managed ledger on read failed #11521

bharanic-dev opened this issue Jul 30, 2021 · 7 comments
Labels
lifecycle/stale type/bug The PR fixed a bug or issue reported a bug

Comments

@bharanic-dev
Copy link
Contributor

bharanic-dev commented Jul 30, 2021

Describe the bug
Under some scenarios, the following NPE exception is seen in Pulsar broker logs.

{"instant":{"epochSecond":1626050179,"nanoOfSecond":804217000},"thread":"bookkeeper-ml-workers-OrderedExecutor-7-0","level":"ERROR","loggerName":"org.apache.bookkeeper.common.util.SafeRunnable","message":"Unexpected throwable caught ","thrown":{"commonElementCount":0,"name":"java.lang.NullPointerException","extendedStackTrace":"java.lang.NullPointerException: null\n\tat org.apache.bookkeeper.mledger.impl.OpReadEntry.lambda$readEntriesFailed$0(OpReadEntry.java:94) ~[org.apache.pulsar-managed-ledger.jar]\n\tat org.apache.bookkeeper.mledger.util.SafeRun$1.safeRun(SafeRun.java:32) ~[org.apache.pulsar-managed-ledger.jar]\n\tat org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common.jar:?]\n\tat org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common.jar:?]\n\tat org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:203) [org.apache.bookkeeper-bookkeeper-common.jar:?]\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.60.Final.jar:4.1.60.Final]\n\tat java.lang.Thread.run(Thread.java:834) [?:?]\n"},"endOfBatch":false,"loggerFqcn":"org.apache.logging.slf4j.Log4jLogger","contextMap":{},"threadId":99,"threadPriority":5,"time":"2021-07-12T00:02:30Z","service":"pulsar","hostname":"broker-2"}

To Reproduce

Currently, I don't have a reliable way to reproduce the problem.

Expected behavior
No NPE

@lhotari
Copy link
Member

lhotari commented Jul 31, 2021

There is PR #11387 that fixes some thread safety issues in OpReadEntry. I wonder if that fix would help for this case.

@bharanic-dev
Copy link
Contributor Author

@lhotari thank you for pointing to #11387 . I looked through the changes and it appears that changes in #11387 try to serialize accesses to entries. I am not sure if that will help with this issue, though.

For this issue, on a given OpReadEntry object one of readEntriesFailed() and checkReadCompletion() needs to get invoked more than once. From reading through the code, I am not entirely sure how that is possible. What makes it harder is that I don't have a way to reproduce the problem.

@lhotari
Copy link
Member

lhotari commented Aug 10, 2021

For this issue, on a given OpReadEntry object one of readEntriesFailed() and checkReadCompletion() needs to get invoked more than once. From reading through the code, I am not entirely sure how that is possible. What makes it harder is that I don't have a way to reproduce the problem.

@bharanic-dev OpReadEntry isn't thread safe. Breaking thread safety rules can lead to very surprising behaviour which is "impossible" from reading the code. One such example of an "impossible" problem in the past is #10433 .
#11387 will fix some issues since most access to OpReadEntry will happen from a single thread. That's why it could help in this case too.

@bharanic-dev
Copy link
Contributor Author

@lhotari thank you. I agree that some of the issues that can arise if thread safety rules are broken, can be impossible to figure out just reading the code. I will try #11387 and see if it helps.

@lhotari
Copy link
Member

lhotari commented Jan 24, 2022

This problem might be fixed by #13328 . Netty Recycler was rewritten in 4.1.71.Final with PR netty/netty#11858 . Pulsar uses Netty Recycler extensively and the improvements to Netty Recycler might fix various thread-safety problems.

@codelipenghui
Copy link
Contributor

The issue had no activity for 30 days, mark with Stale label.

@tisonkun
Copy link
Member

tisonkun commented Dec 9, 2022

Closed as stale. Please create a new issue if it's still relevant to the maintained versions.

@tisonkun tisonkun closed this as completed Dec 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

4 participants