Ats fixes #914
Conversation
Current coverage is 75.86% (diff: 100%)

@@           master    #914   diff @@
=====================================
  Files         149     149
  Lines       25991   25999     +8
  Methods         0       0
  Messages        0       0
  Branches        0       0
=====================================
+ Hits        19713   19725    +12
+ Misses       6278    6274     -4
  Partials        0       0
Thanks @trws! and sorry about that lame lru bug! I'll create a test (thought I had already, duh!) and push it here. Meanwhile, I hate to do this to you, but the commit messages here should be fixed up to our commit message style (see RFC 1: Patch Requirements).
No problem, I kinda threw them in last night to get it going through travis. I'll fix them up and rebake the other PR as soon as I get back to my desktop.
@trws, I'm probably missing something simple, but if you have a chance can you give me more details on the reproducer for the move-to-front lru bug? The current test case is pretty simple, but it does create a cache of size 5 and adds 10 items, so the cache size is exceeded. It then continually touches the first item added, so it should be continually moved back to the front of the list. I assume I'm missing some corner case still.
Edit: er, I meant this to be a comment on the connectors/local change.
This change is fine, but I can't see what the problem was in the old code...
Anyway, we should probably make the same change in op_event_unsubscribe() and drop the shortjson.h include.
@grondo unfortunately I don't have a small reproducer for the issue; just using sched to run 256+ jobs did it consistently. The result of the issue was that the count would be full, but the actual list would be empty. I think the pattern was a mix of using the most recent entry and cycling through all of the rest as they were accessed for state updates, but I haven't dug in far enough to create a good reproducer.

@garlick, I don't think there was an actual problem with the old code... that's the especially odd part. It only comes up when using an allocator other than the default linux malloc, and it looks like what happens when the default malloc is used to get a piece of memory that's then passed to the free of tcmalloc or jemalloc or what have you. Why it appeared there, and only there, and not somewhere else that shortjson is used, I can't say. Truth be told, I'm not sure I hit the root cause of the issue, but the failure was always in that Jput, and it completely disappeared after the change.
The issue eventually causes the LRU to count as full while containing no elements; when the NULL head or tail pointers are then dereferenced, it becomes a segfault. It does not appear to be a simple pattern that happens every time something moves, but only when a node that is not first ends up with a NULL prev pointer, or a node that is not last ends up with a NULL next pointer.
Rebased with each commit reworded.

Thanks @trws, that is a good hint, and again I apologize for sloppy code.
I verified that the real error here was that this line was missing from the end of the entry removal code:

```c
/* Re-initialize prev/next pointers to NULL */
l->prev = l->next = NULL;
```

Your changes make the code clearer I think, so they could stay in, but we should add this fix on top as well, otherwise the internal list gets corrupted. I'm still working on a test that fails under the lru_cache unit test -- might have to add a selfcheck function.
@trws if you don't mind, please update the unsubscribe handler in the local connector as well and drop the shortjson.h include. |
Sure, I'll make those changes as soon as I get back to flux.
@trws, hope you don't mind, I pushed some test additions and extra fixes to lru_cache here directly onto your PR. If you'd like them split off, go ahead and rebase without them. In order to write a failing case for this issue, I had to add a lru_cache_selfcheck() function to check for corruption in the LRU list, compare number of items in the list vs lru->count, etc. I then added a very specific reproducer for the corruption you encountered. Even with your fix the selfcheck fails, so the missing reset of a removed item before re-insertion is required and should have always been there. Finally, a small optimization to do nothing when an entry at the front of the list is "requeued" was added.
Not at all. I'm glad you found the root cause, and more so for the tests. Thanks @grondo!
This resolves a free of data allocated by a different allocator, detected by jemalloc and tcmalloc in shortjson, always from this callsite. The issue does not appear to affect other uses of shortjson, or use with the standard system allocator, despite the fact that this is a common pattern and appears to match calls correctly. While removing the last of the shortjson dependence, I noticed that the sub and unsub functions were identical, so they now also use a common helper function to reduce duplication.
When an LRU entry was removed from the LRU list, failure to reset the prev/next pointers to NULL on the removed entry could cause an entry immediately re-inserted (as done in lru_requeue) to corrupt the internal list. For example, a move-to-front operation would leave the front of the list with a non-NULL prev pointer, which should never happen. Setting l->prev = l->next = NULL solves this problem, as any insert can then be handled the same as a new entry.
Add a function to run consistency checks on a lru_cache_t object. This is meant to test for corruption during unit testing.
Use lru_cache_selfcheck() to verify lru_cache_t has no corruption in test_lru_cache.t.
Add a reproducer for the specific case of lru_cache corruption which occurred during move-to-front due to failure to reinitialize lru_entry prev/next pointers during entry removal.
If an item is already at the front of the LRU list, avoid doing the requeue work since it is unnecessary.
Rebased. Also popped in something that's been bugging me for a bit: added the intermediate XML files from doc building to the gitignore for in-tree builds.
Looks great to me now. Thanks @trws!
Everything looks good to me, hitting the big button.
This is an aggregation of small single-file commits with fixes required for the ATS backend to behave properly. The LRU cache issue only comes up when the cache size is exceeded, so it may be worth adding a test for that.