Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ats fixes #914

Merged
merged 9 commits into from Nov 23, 2016
Merged

Ats fixes #914

merged 9 commits into from Nov 23, 2016

Conversation

trws
Copy link
Member

@trws trws commented Nov 22, 2016

This is an aggregation of small single-file commits with fixes required for the ATS backend to behave properly. The LRU cache item only comes up when the cache size is exceeded, so it may be worth adding a test for that.

@garlick garlick added the review label Nov 22, 2016
@coveralls
Copy link

Coverage Status

Coverage increased (+0.002%) to 76.143% when pulling 69b5274 on trws:ats-fixes into 2cca060 on flux-framework:master.

@codecov-io
Copy link

codecov-io commented Nov 22, 2016

Current coverage is 75.86% (diff: 100%)

Merging #914 into master will increase coverage by 0.02%

@@             master       #914   diff @@
==========================================
  Files           149        149          
  Lines         25991      25999     +8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          19713      19725    +12   
+ Misses         6278       6274     -4   
  Partials          0          0          
Diff Coverage File Path
•••••••••• 100% src/common/libutil/lru_cache.c
•••••••••• 100% src/connectors/local/local.c

Powered by Codecov. Last update 2cca060...568ef12

@grondo
Copy link
Contributor

grondo commented Nov 22, 2016

Thanks @trws! and sorry about that lame lru bug! I'll create a test (thought I had already, duh!) and push it here.

Meanwhile, I hate to do this to you but the commit messages here should be fixed up to our commit message style (See RFC1: Patch Requirements)

@trws
Copy link
Member Author

trws commented Nov 22, 2016

No problem, I kinda threw them in last night to get it going through travis. I'll fix them up and rebake the other PR as soon as I get back to my desktop.

On November 22, 2016 at 9:31:44 AM PST, Mark Grondona notifications@github.com wrote:

Thanks @trwshttps://github.com/trws! and sorry about that lame lru bug! I'll create a test (thought I had already, duh!) and push it here.

Meanwhile, I hate to do this to you but the commit messages here should be fixed up to our commit message style (See RFC1: Patch Requirementshttps://github.com/flux-framework/rfc/blob/master/spec_1.adoc#patch-requirements)

You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHubhttps://github.com//pull/914#issuecomment-262308226, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAoStZZxRXUSQfq2Nx4KmLF0Y1o4tmvcks5rAyb1gaJpZM4K48sL.

@grondo
Copy link
Contributor

grondo commented Nov 22, 2016

@trws, I'm probably missing something simple, but if you have a chance can you give me more details on the reproducer for the move-to-front lru bug? The current test case is pretty simple, but it does create a cache of size 5 and adds 10 items, so the cache size is exceeded. It then continually touches the first item added so it should be continually moved back to the front of the list.

I assume I'm missing some corner case still.

Copy link
Member

@garlick garlick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: er, I meant this to be a comment on the connectors/local change.

This change is fine, but I can't see what the problem was in the old code...

Anyway, we should probably make the same change in op_event_unsubscribe() and drop the shortjson.h include.

@trws
Copy link
Member Author

trws commented Nov 22, 2016

@grondo unfortunately I don't have a small reproducer for the issue, just using sched to run 256+ jobs did it consistently. The result of the issue was that the count would be full, but the actual list would be empty. I think the pattern was a mix of using the most recent and cycling through all of the rest as they were accessed for state updates, but haven't dug in far enough to create a good reproducer.

@garlick, I don't think there was an actual problem with the old code... that's the especially odd part. It only comes up when using an allocator other than the default linux malloc, and it looks like what happens when the default malloc is used to get a piece of memory that's then passed to the free of tcmalloc or jemalloc or what have you. Why it appeared there, and only there, and not somewhere else that shortjson is used I can't say.

Truth be told, I'm not sure I hit the root cause of the issue, but the failure was always in that Jput, and it completely disappeared after the change.

The issue eventually causes the LRU to count full but contain no
elements, when the NULL head or tail are used after that it becomes a
segfault.  It does not appear to be a simple pattern that happens every
time something moves, but only when a node with a null prev pointer is
not first, or a null next pointer in a node that is not last.
@trws
Copy link
Member Author

trws commented Nov 22, 2016

Rebased with each commit reworded.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.05%) to 76.191% when pulling d787750 on trws:ats-fixes into 2cca060 on flux-framework:master.

@grondo
Copy link
Contributor

grondo commented Nov 22, 2016

The result of the issue was that the count would be full, but the actual list would be empty.

Thanks @trws that is a good hint, and again I apologize for sloppy code.

@grondo
Copy link
Contributor

grondo commented Nov 22, 2016

The result of the issue was that the count would be full, but the actual list would be empty.

Thanks @trws that is a good hint, and again I apologize for sloppy code.

I verified that test/lru_cache.c is reproducing this issue, however, there is no mechanism in place to ensure the lru queue/list doesn't get out of sync with lru->count, so the corruption goes uncaught.

I did find that the real error here was that this line was missing from the end of lru_entry_remove. Without it there are dangling links back into the list. Oops, total rookie mistake.

    /* Re-initialized prev/next pointers to NULL */
    l->prev = l->next = NULL;

Your changes make the code clearer I think so they could stay in but we should add this fix on top as well, otherwise the internal list gets corrupted.

I'm still working on a test that fails under the lru_cache unit test -- might have to add a lru_cache_verify() or other sanity check function to actually walk the list

@garlick
Copy link
Member

garlick commented Nov 22, 2016

@trws if you don't mind, please update the unsubscribe handler in the local connector as well and drop the shortjson.h include.

@trws
Copy link
Member Author

trws commented Nov 22, 2016

Sure, I'll make those changes as soon as I am get back to flux.

On November 22, 2016 at 3:18:31 PM PST, Jim Garlick notifications@github.com wrote:

@trwshttps://github.com/trws if you don't mind, please update the unsubscribe handler in the local connector as well and drop the shortjson.h include.

You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHubhttps://github.com//pull/914#issuecomment-262395241, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAoStQoKyOyIzgzjbUMbaF7qvFfRGrRrks5rA3gvgaJpZM4K48sL.

@grondo
Copy link
Contributor

grondo commented Nov 23, 2016

@trws, hope you don't mind I pushed some test additions and extra fixes to lru_cache here directly onto your PR. If you'd like them split off, go ahead and rebase without them.

In order to write a failing case for this issue, I had to add a lru_cache_selfcheck() function to check for corruption in the LRU list, compare number of items in the list vs lru->count etc. I then added a very specific reproducer for the corruption you encountered.

Even with your fix the selfcheck fails, so the missing reset of a removed item before re-insertion is required and should have always been there.

Finally, a small optimization to do nothing when an entry at the front of the list is "requeued" was added.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.006%) to 76.147% when pulling 50c5317 on trws:ats-fixes into 2cca060 on flux-framework:master.

@trws
Copy link
Member Author

trws commented Nov 23, 2016

Not at all. I'm glad you found the root cause, and more so for the tests. Thanks @grondo!

On November 22, 2016 at 10:09:14 PM PST, Mark Grondona notifications@github.com wrote:

@trwshttps://github.com/trws, hope you don't mind I pushed some test additions and extra fixes to lru_cache here directly onto your PR. If you'd like them split off, go ahead and rebase without them.

In order to write a failing case for this issue, I had to add a lru_cache_selfcheck() function to check for corruption in the LRU list, compare number of items in the list vs lru->count etc. I then added a very specific reproducer for the corruption you encountered.

Even with your fix the selfcheck fails, so the missing reset of a removed item before re-insertion is required and should have always been there.

Finally, a small optimization to do nothing when an entry at the front of the list is "requeued" was added.

You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHubhttps://github.com//pull/914#issuecomment-262440621, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AAoStYHk7tZVFMt65e9UAprLsOiFBsfXks5rA9iBgaJpZM4K48sL.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great cleanup! This could be squashed into d787750
then I think this PR is ready, though someone other than I should probably merge it.

(should have been a comment on 4b8fa6b)

trws and others added 7 commits November 23, 2016 11:25
This resolves a free of data allocated by a different allocator detected
in jemalloc and tcmalloc from shortjson, always from this callsite.  The
issue does not appear to affect other uses of shortjson, or use with the
standard system allocator, despite the fact that this is a common
pattern and appears to match calls correctly.

In removing the last of the shortjson dependence, I noticed
that the sub and unsub functions were identical, so they now
also use a common helper function to reduce duplication.
When a LRU entry was removed from the LRU list, failure to reset
the prev/next pointers to NULL on the removed entry could cause
an entry immediately re-inserted (as done in lru_requeue) to
corrupt the internal list. For example a move-to-front operation
would cause the front of list to have a prev pointer, which should
never happen.

Setting l->prev = l->next = NULL solves this problem as any insert
can be handled same as a new entry.
Add a function to run consistency checks on a lru_cache_t object.
This is meant to test for corruption during unit testing.
Use lru_cache_selfcheck() to verify lru_cache_t has no corruption
in test_lru_cache.t.
Add a reproducer for specific case of lru_cache corruption which
occured during move-to-front due to failure to reinitialize
lru_entry prev/next pointers during entry removal.
If an item is already on front of LRU list, avoid doing the requeue
work since it is  unnecessary.
@trws
Copy link
Member Author

trws commented Nov 23, 2016

Rebased. Also popped in something that's been bugging me for a bit, added the intermediate xml files from doc building into the gitignore for in-tree builds.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.004%) to 76.145% when pulling 568ef12 on trws:ats-fixes into 2cca060 on flux-framework:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 76.164% when pulling 568ef12 on trws:ats-fixes into 2cca060 on flux-framework:master.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me now. Thanks @trws!

@grondo
Copy link
Contributor

grondo commented Nov 23, 2016

I probably should not be eligible to merge this one, @chu11, @morrone, @lipari, or @dongahn want to do a quick review of this and merge if OK? I think @garlick is not available today.

@chu11
Copy link
Member

chu11 commented Nov 23, 2016

everything looks good to me, hitting the big button

@chu11 chu11 merged commit dbfcf2a into flux-framework:master Nov 23, 2016
@garlick garlick removed the review label Nov 23, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants