Issue 1476: LedgerEntry is recycled twice at ReadLastConfirmedAndEntryOp #1509
Thank you @infodog for looking into this. How can you prove the fix addresses the problem? |
@eolivelli
When I start a writer that writes to the BookKeeper server, after several thousand writes the client throws an exception saying that the object was recycled, and after that the client does not receive any new data. In fact I start 2 clients at the same time to read from the server, and one of them always fails, at random. After applying the fix, my client keeps running and always gets the updated data, so I think the fix addresses the problem.
Maybe we can create a test case like this? But I don't know how to do that. I think the key to reproducing the problem is multiple bookies: the problem arises when the response from the bookie with data comes after the response from the bookie without data.
By the way, when I debugged the problem I added some log statements to the code, and the logs show there is still some problem hiding that I can't understand. Maybe I am not looking carefully enough. The following is my log; you can see there is still an error, but I think it is another problem and it will not cause data loss, so I think this problem is addressed.
2018-06-10 09:25:26 [ BookKeeperClientWorker-OrderedExecutor-0-0:49040066 ] - [ WARN ] --- ledgerId=206,entryId=2083, length=-1, handler=395656841,entryImpl.hashCode=609480430 org.apache.bookkeeper.client.impl.LedgerEntryImpl.close(LedgerEntryImpl.java:167)
@infodog I see your description. @sijie I don't know DL codebase very much and I am not using long-poll so I can't be very helpful here. |
@eolivelli I think the problem only happens when speculative reads occur. Also, it is not limited to long-poll reads; I think it will impact normal reads too. The existing tests around speculation might end before real problems show up.
So @sijie you are saying that we have some code path not covered by test cases. It would be good to have minimal coverage of this change, just by using mockito. |
@eolivelli - sure, we need to add test cases for this. However, I don't think the fix addresses the root cause. Still looking ...
@sijie sure, go ahead. |
*Problem* There are two flags for checking whether a request is completed. They are misused across two different branches: one is when the request is completed with an advanced LAC but no entry piggybacked, the other is when the request is completed with an advanced LAC and an entry piggybacked. When this happens, a request is completed twice and the entry buffer is recycled twice. *Solution* Remove direct usage of submitCallback and use completeRequest instead. Add a unit test to reproduce the issue and ensure the fix addresses the problem.
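To make the idea behind the solution concrete, here is a minimal sketch of a single, guarded completion path. It is not the BookKeeper code: `OnceOnlyRequest`, its counters, and the `complete` signature are hypothetical stand-ins, and the only assumption taken from the discussion is that every response branch should go through one method that completes the request at most once.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a read request that may receive several responses,
// for example from speculative reads sent to more than one bookie.
class OnceOnlyRequest {
    private final AtomicBoolean completed = new AtomicBoolean(false);
    private final AtomicInteger callbackCount = new AtomicInteger();
    private final AtomicInteger recycleCount = new AtomicInteger();
    private volatile long lastConfirmed = -1L;

    // Single completion path: both branches (LAC advanced with or without a
    // piggybacked entry) call this method instead of invoking the callback directly.
    boolean complete(long lac) {
        if (!completed.compareAndSet(false, true)) {
            return false; // an earlier response already completed this request
        }
        lastConfirmed = lac;
        callbackCount.incrementAndGet(); // stands in for submitting the user callback
        recycleCount.incrementAndGet();  // stands in for recycling the entry buffer
        return true;
    }

    long lastConfirmed() { return lastConfirmed; }
    int callbackCount() { return callbackCount.get(); }
    int recycleCount() { return recycleCount.get(); }
}
```

With this shape, a second response for the same request is rejected by `compareAndSet`, so the callback and the recycle step each run once no matter how many responses arrive.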
@infodog: your fix is correct. However, it can be simplified with a small change. I created a PR to your branch to improve it, and also added a unit test to reproduce the problem and ensure your changes fix it: infodog#1. If you merge it, the changes will be applied to your branch and show up here. Then we should be ready to merge your fix.
@eolivelli - changes in infodog#1 include the unit test. @jiazhai @yzang please review this PR as well, since it is dlog related. |
@sijie I think the fix in infodog#1 may have a problem, because in request.complete
the entryImpl may already be recycled and owned by another request, so this would ruin another request's data. Maybe we should move
from request.complete to request.close, and when we find the request is closed, not call request.complete?
@infodog - https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedAndEntryOp.java#L127 uses an atomic boolean to guarantee it is executed only once, so the problem you described won't happen.
@sijie but in https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedAndEntryOp.java#L615 use |
@infodog I see your point now. let me try to explain:
Hope this explains why the change in infodog#1 works.
@sijie It's possible that completeRequest() comes before request.complete(), such as at https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/ReadLastConfirmedAndEntryOp.java#L589. This happens when the response without piggybacked data comes before the response with data. In this case, the request is closed in completeRequest() by the submitCallback() inside it, and the entryImpl is recycled. Because completeRequest() does not set
will be executed, although this time request.close will not be called. Since the entryImpl is already recycled, if it is reused by another request, the data will be ruined even if the operation is executed in the same thread.
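The hazard described above can be illustrated with a toy object pool. This is only a sketch under stated assumptions: `PooledEntry`, `EntryPool`, and `StaleReferenceDemo` are hypothetical names and do not mirror `LedgerEntryImpl` or BookKeeper's real recycler. It shows that a late write through a stale reference corrupts whoever now owns the pooled instance, even on a single thread.

```java
import java.util.ArrayDeque;

// Hypothetical pooled object; not the real LedgerEntryImpl.
class PooledEntry {
    long entryId = -1L;
}

// Hypothetical single-threaded pool that hands recycled instances back out.
class EntryPool {
    private final ArrayDeque<PooledEntry> free = new ArrayDeque<>();

    PooledEntry acquire(long entryId) {
        PooledEntry e = free.poll();      // reuse a recycled instance if one exists
        if (e == null) {
            e = new PooledEntry();
        }
        e.entryId = entryId;
        return e;
    }

    void recycle(PooledEntry e) {
        e.entryId = -1L;                  // reset state before returning to the pool
        free.push(e);
    }
}

class StaleReferenceDemo {
    // Returns the entry id that the second request observes after a late callback
    // of the first request writes through its stale reference.
    static long run() {
        EntryPool pool = new EntryPool();
        PooledEntry first = pool.acquire(1L);   // owned by request A
        pool.recycle(first);                    // request A completes and recycles
        PooledEntry second = pool.acquire(2L);  // request B receives the same instance
        first.entryId = 99L;                    // late response for A mutates it
        return second.entryId;                  // B no longer sees its own value 2
    }
}
```

Because `first` and `second` refer to the same instance after the recycle, request B's entry id is overwritten. This is why the unit test in this PR checks that a recycled entry is not mutated by later callbacks.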
- add validations in the unit test to ensure the recycled entry will not be mutated
@sijie I still have questions,
is not executed.
@infodog good questions, comments inline.
This will not cause any problems for the application using DL. Here we complete the request only when the LAC (last add confirmed) is advanced. It is better if there is an entry piggybacked, but if there is none it is still good, because we received a valid advanced LAC. You might think we dropped a valid response. However, the idea is to tell the caller (the DL library in this case) as soon as the LAC is advanced, so that it can read entries up to the new LAC. DL reacts to the callback, knows the LAC is advanced, and issues the corresponding read requests.
You don't need to handle this situation, since DL handles it.
The conversion between normal reads and long-poll reads, and learning that the LAC has advanced, all happen in the DL library. Applications don't have to worry about that.
This will not cause a resource leak. Recycling there means putting the objects back to be reused. If recycle is not called, it does not mean a resource leak; it only means the objects are not put back into the object pool for reuse. Recycled objects are different from a ByteBuf, which must be released once it is no longer used; otherwise its memory cannot be released.
I am not sure what exactly you are asking, but I am guessing you are asking about the boolean flag in the request. That boolean flag is needed; it is used to flag whether the request is completed or not. Hope this clarifies your questions.
@sijie Ok thanks very much for the explanation, I already merged to my branch. |
@infodog thank you. @jiazhai @yzang @eolivelli can you guys review this change? |
+1 with [commit] (b2b4f22)
run bookkeeper-server bookie tests |
rebuild java8 |
// readEntryComplete above will release the entry impl back to the object pools.
// we want to make sure after the entry is recycled, it will not be mutated by any future callbacks.
LedgerEntryImpl entry = LedgerEntryImpl.create(LEDGERID, Long.MAX_VALUE);
@eolivelli
Yes, I am assuming that.
Is there any way to assert this assumption?
If the assumption is broken, the test is not really useful.
I don't have a suggestion on how to capture the LedgerEntryImpl.
@eolivelli It is hard to capture the original entry. It is possible, but it would make things a bit complicated, and I don't think it is worth doing.
overall ok, just a question on the test
run bookkeeper-server bookie tests |
rebuild java8 |
run bookkeeper-server bookie tests |
run bookkeeper-server bookie tests |
retest this please |
(somehow the jenkins precommit checks are not running after rebased. trying to close and reopen) |
run bookkeeper-server bookie tests |
run bookkeeper-server bookie tests |
I think the "bookie tests" precommit check is flaky due to #1516. There was one successful run - https://builds.apache.org/job/bookkeeper_precommit_bookie_tests/61/ - so I am going to ignore CI and merge this bug fix.
IGNORE CI |
…ryOp
Descriptions of the changes in this PR:
The issue #1476 is caused by speculative reads combined with object recycling: the same request object is added to the CompletionObjects multiple times with different txnids. The request-processing logic already takes this into account; only one place inside `ReadLastConfirmedAndEntryOp.requestComplete` forgot to check requestComplete before calling `submitCallback`, which in turn calls request.close.
### Motivation
to fix #1476
### Changes
check `requestComplete` before `submitCallback` in `ReadLastConfirmedAndEntryOp.requestComplete`
Master Issue: #1476
Author: Sijie Guo <sijie@apache.org>
Author: infodog <infodog@hotmail.com>
Author: zhengxiangyang <zxy@xinshi.net>
Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Jia Zhai <None>
This closes #1509 from infodog/issue1476, closes #1476
(cherry picked from commit 6476fc3)
Signed-off-by: Sijie Guo <sijie@apache.org>
merged the changes. thank you @infodog ! |
…AndEntryOp
Descriptions of the changes in this PR:
The issue apache#1476 is caused by speculative reads combined with object recycling: the same request object is added to the CompletionObjects multiple times with different txnids. The request-processing logic already takes this into account; only one place inside `ReadLastConfirmedAndEntryOp.requestComplete` forgot to check requestComplete before calling `submitCallback`, which in turn calls request.close.
### Motivation
to fix apache#1476
### Changes
check `requestComplete` before `submitCallback` in `ReadLastConfirmedAndEntryOp.requestComplete`
Master Issue: apache#1476
Author: Sijie Guo <sijie@apache.org>
Author: infodog <infodog@hotmail.com>
Author: zhengxiangyang <zxy@xinshi.net>
Reviewers: Enrico Olivelli <eolivelli@gmail.com>, Jia Zhai <None>
This closes apache#1509 from infodog/issue1476, closes apache#1476
(cherry picked from commit 6476fc3)
Signed-off-by: Sijie Guo <sijie@apache.org>
(cherry picked from commit c7b1610)
Signed-off-by: JV Jujjuri <vjujjuri@salesforce.com>