
Ats fixes #914

Merged: 9 commits merged into flux-framework:master from trws:ats-fixes on Nov 23, 2016

Conversation

trws commented Nov 22, 2016

This is an aggregation of small single-file commits with fixes required for the ATS backend to behave properly. The LRU cache item only comes up when the cache size is exceeded, so it may be worth adding a test for that.

garlick added the review label Nov 22, 2016

coveralls commented Nov 22, 2016

Coverage increased (+0.002%) to 76.143% when pulling 69b5274 on trws:ats-fixes into 2cca060 on flux-framework:master.

codecov-io commented Nov 22, 2016

Current coverage is 75.86% (diff: 100%)

Merging #914 into master will increase coverage by 0.02%

@@             master       #914   diff @@
==========================================
  Files           149        149          
  Lines         25991      25999     +8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          19713      19725    +12   
+ Misses         6278       6274     -4   
  Partials          0          0          
Diff Coverage  File Path
100%           src/common/libutil/lru_cache.c
100%           src/connectors/local/local.c

Last update 2cca060...568ef12

grondo commented Nov 22, 2016

Thanks @trws, and sorry about that lame lru bug! I'll create a test (thought I had already, duh!) and push it here.

Meanwhile, I hate to do this to you, but the commit messages here should be fixed up to match our commit message style (see RFC 1: Patch Requirements).

trws commented Nov 22, 2016

No problem, I kinda threw them in last night to get it going through travis. I'll fix them up and rebake the other PR as soon as I get back to my desktop.


grondo commented Nov 22, 2016

@trws, I'm probably missing something simple, but if you have a chance can you give me more details on the reproducer for the move-to-front lru bug? The current test case is pretty simple, but it does create a cache of size 5 and add 10 items, so the cache size is exceeded. It then continually touches the first item added, so it should be continually moved back to the front of the list.

I assume I'm missing some corner case still.
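
For reference, the loop being described looks roughly like the sketch below. The lru_cache_create/lru_cache_put/lru_cache_get/lru_cache_destroy names and signatures here are assumptions standing in for the libutil API, so treat this strictly as a sketch of the access pattern, not the actual test code:

    /* Sketch of the described access pattern.  The lru_cache_* names and
     * signatures are assumptions, not taken from the libutil header. */
    void exercise_move_to_front (void)
    {
        lru_cache_t *lru = lru_cache_create (5);   /* cache of size 5 */
        const char *keys[10] = { "0", "1", "2", "3", "4",
                                 "5", "6", "7", "8", "9" };
        int i;

        for (i = 0; i < 10; i++) {                 /* add 10 items    */
            lru_cache_put (lru, keys[i], (void *)keys[i]);
            /* Touch the first item on every iteration so it is
             * repeatedly moved back to the front of the list: */
            lru_cache_get (lru, keys[0]);
        }
        lru_cache_destroy (lru);
    }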

garlick left a comment

Edit: er, I meant this to be a comment on the connectors/local change.

This change is fine, but I can't see what the problem was in the old code...

Anyway, we should probably make the same change in op_event_unsubscribe() and drop the shortjson.h include.

trws commented Nov 22, 2016

@grondo unfortunately I don't have a small reproducer for the issue; just using sched to run 256+ jobs did it consistently. The result of the issue was that the count would be full, but the actual list would be empty. I think the pattern was a mix of using the most recent item and cycling through all of the rest as they were accessed for state updates, but I haven't dug in far enough to create a good reproducer.

@garlick, I don't think there was an actual problem with the old code... that's the especially odd part. It only comes up when using an allocator other than the default Linux malloc, and it looks like what happens when the default malloc is used to get a piece of memory that is then passed to the free() of tcmalloc or jemalloc or what have you. Why it appeared there, and only there, and not somewhere else that shortjson is used, I can't say.

Truth be told, I'm not sure I hit the root cause of the issue, but the failure was always in that Jput, and it completely disappeared after the change.
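
For illustration only, the jansson-side pattern looks roughly like the sketch below; this is not the actual connector code, and the "topic" key is illustrative. The point is that everything jansson hands out is released through matching calls (json_decref() for objects, plain free() for the string returned by json_dumps()), so no memory ever crosses from one allocator's malloc to another's free:

    /* Minimal jansson sketch (not the connector code): build a payload,
     * serialize it, and release everything with the matching calls. */
    #include <jansson.h>
    #include <stdlib.h>

    static char *make_payload (const char *topic)
    {
        json_t *o = json_pack ("{s:s}", "topic", topic);
        if (!o)
            return NULL;
        char *s = json_dumps (o, JSON_COMPACT);  /* freed with free()     */
        json_decref (o);                         /* drop json_t reference */
        return s;                                /* caller calls free(s)  */
    }

A helper of this shape is also easy to share between the subscribe and unsubscribe paths, which is what one of the commits below ends up doing.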

trws added some commits Nov 22, 2016

lru_cache: fix item loss on move-to-front
The issue eventually causes the LRU to count as full while containing no
elements; when the NULL head or tail is then used, it becomes a
segfault.  It does not appear to follow a simple pattern that happens every
time something moves, but only when a node with a NULL prev pointer is
not first, or a node with a NULL next pointer is not last.

trws force-pushed the trws:ats-fixes branch from 69b5274 to d787750 Nov 22, 2016

trws commented Nov 22, 2016

Rebased with each commit reworded.

coveralls commented Nov 22, 2016

Coverage increased (+0.05%) to 76.191% when pulling d787750 on trws:ats-fixes into 2cca060 on flux-framework:master.

grondo commented Nov 22, 2016

The result of the issue was that the count would be full, but the actual list would be empty.

Thanks @trws that is a good hint, and again I apologize for sloppy code.

grondo commented Nov 22, 2016

I verified that test/lru_cache.c is reproducing this issue; however, there is no mechanism in place to ensure the lru queue/list doesn't get out of sync with lru->count, so the corruption goes uncaught.

I did find that the real error here was that the following line was missing from the end of lru_entry_remove. Without it, there are dangling links back into the list. Oops, total rookie mistake.

    /* Re-initialize prev/next pointers to NULL */
    l->prev = l->next = NULL;

Your changes make the code clearer, I think, so they could stay in, but we should add this fix on top as well; otherwise the internal list gets corrupted.

I'm still working on a test case that fails under the lru_cache unit test -- I might have to add a lru_cache_verify() or other sanity-check function to actually walk the list.
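
To make that failure mode concrete, here is a self-contained sketch of a doubly-linked move-to-front list with the reset from the snippet above, plus a selfcheck walker along the lines being described. The struct and function names are illustrative, not the actual lru_cache internals:

    #include <assert.h>
    #include <stddef.h>

    struct entry {
        struct entry *prev;
        struct entry *next;
    };

    struct list {
        struct entry *first;
        struct entry *last;
        int count;
    };

    /* Unlink e from the list.  Without the final reset, e keeps dangling
     * prev/next links; re-inserting it at the front then corrupts the list. */
    static void entry_remove (struct list *l, struct entry *e)
    {
        if (e->prev)
            e->prev->next = e->next;
        else
            l->first = e->next;
        if (e->next)
            e->next->prev = e->prev;
        else
            l->last = e->prev;
        l->count--;
        e->prev = e->next = NULL;   /* the fix: treat e as a fresh entry */
    }

    static void entry_push_front (struct list *l, struct entry *e)
    {
        e->next = l->first;
        if (l->first)
            l->first->prev = e;
        l->first = e;
        if (!l->last)
            l->last = e;
        l->count++;
    }

    /* Walk the list and confirm it is consistent with count, in the spirit
     * of the lru_cache_selfcheck() being discussed.  Returns 0 on success. */
    static int list_selfcheck (struct list *l)
    {
        int n = 0;
        struct entry *e;
        if (l->first && l->first->prev)
            return -1;              /* front of list must have no prev */
        if (l->last && l->last->next)
            return -1;              /* back of list must have no next  */
        for (e = l->first; e != NULL; e = e->next)
            n++;
        return (n == l->count) ? 0 : -1;
    }

    int main (void)
    {
        struct list l = { NULL, NULL, 0 };
        struct entry a = { NULL, NULL }, b = { NULL, NULL };

        entry_push_front (&l, &a);
        entry_push_front (&l, &b);
        /* "requeue": move a back to the front */
        entry_remove (&l, &a);
        entry_push_front (&l, &a);
        assert (list_selfcheck (&l) == 0);
        return 0;
    }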

garlick commented Nov 22, 2016

@trws if you don't mind, please update the unsubscribe handler in the local connector as well and drop the shortjson.h include.

trws commented Nov 22, 2016

Sure, I'll make those changes as soon as I get back to flux.


grondo commented Nov 23, 2016

@trws, hope you don't mind I pushed some test additions and extra fixes to lru_cache here directly onto your PR. If you'd like them split off, go ahead and rebase without them.

In order to write a failing case for this issue, I had to add a lru_cache_selfcheck() function to check for corruption in the LRU list (comparing the number of items in the list with lru->count, etc.). I then added a very specific reproducer for the corruption you encountered.

Even with your fix, the selfcheck fails, so the missing reset of a removed item before re-insertion is required and should have always been there.

Finally, I added a small optimization to do nothing when an entry already at the front of the list is "requeued".
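
That last change amounts to a short-circuit like the one below, reusing the illustrative list/entry helpers from the earlier sketch (again, not the actual lru_requeue code):

    /* If the entry is already at the head it is already the most
     * recently used, so the remove/re-insert can be skipped. */
    static void list_requeue (struct list *l, struct entry *e)
    {
        if (l->first == e)
            return;
        entry_remove (l, e);
        entry_push_front (l, e);
    }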

coveralls commented Nov 23, 2016

Coverage increased (+0.006%) to 76.147% when pulling 50c5317 on trws:ats-fixes into 2cca060 on flux-framework:master.

trws commented Nov 23, 2016

Not at all. I'm glad you found the root cause, and more so for the tests. Thanks @grondo!


grondo left a comment

Great cleanup! This could be squashed into d787750, and then I think this PR is ready, though someone other than me should probably merge it.

(should have been a comment on 4b8fa6b)

trws and others added some commits Nov 22, 2016

conn/local: switch to jansson interface

This resolves a free of data allocated by a different allocator, detected
under jemalloc and tcmalloc in shortjson, always from this callsite.  The
issue does not appear to affect other uses of shortjson, or use with the
standard system allocator, despite the fact that this is a common
pattern and the calls appear to be matched correctly.

In removing the last of the shortjson dependence, I noticed
that the sub and unsub functions were identical, so they now
also use a common helper function to reduce duplication.

lru_cache: reset lru list prev/next on removal

When an LRU entry is removed from the LRU list, failure to reset
the prev/next pointers to NULL on the removed entry could cause
an entry immediately re-inserted (as done in lru_requeue) to
corrupt the internal list. For example, a move-to-front operation
would cause the front of the list to have a prev pointer, which should
never happen.

Setting l->prev = l->next = NULL solves this problem, as any insert
can then be handled the same as a new entry.

lru_cache: add lru_cache_selfcheck

Add a function to run consistency checks on an lru_cache_t object.
This is meant to test for corruption during unit testing.

test/lru_cache: add selfcheck to lru_cache tests

Use lru_cache_selfcheck() to verify that lru_cache_t has no corruption
in test_lru_cache.t.

test/lru_cache: add reproducer for LRU list corruption

Add a reproducer for the specific case of lru_cache corruption which
occurred during move-to-front due to failure to reinitialize
lru_entry prev/next pointers during entry removal.

lru_cache: optimize requeue of item on front of list

If an item is already at the front of the LRU list, avoid doing the
requeue work, since it is unnecessary.

trws force-pushed the trws:ats-fixes branch from 4b8fa6b to 568ef12 Nov 23, 2016

trws commented Nov 23, 2016

Rebased. Also popped in something that's been bugging me for a bit: added the intermediate XML files from doc building to the gitignore for in-tree builds.

coveralls commented Nov 23, 2016

Coverage increased (+0.004%) to 76.145% when pulling 568ef12 on trws:ats-fixes into 2cca060 on flux-framework:master.

coveralls commented Nov 23, 2016

Coverage increased (+0.02%) to 76.164% when pulling 568ef12 on trws:ats-fixes into 2cca060 on flux-framework:master.

grondo approved these changes Nov 23, 2016

Looks great to me now. Thanks @trws!

grondo commented Nov 23, 2016

I probably should not be the one to merge this. @chu11, @morrone, @lipari, or @dongahn, want to do a quick review of this and merge if it looks OK? I think @garlick is not available today.

chu11 commented Nov 23, 2016

Everything looks good to me, hitting the big button.

chu11 merged commit dbfcf2a into flux-framework:master Nov 23, 2016

4 checks passed

codecov/patch: 100% of diff hit (target 75.84%)
codecov/project: 75.86% (+0.02%) compared to 2cca060
continuous-integration/travis-ci/pr: The Travis CI build passed
coverage/coveralls: Coverage increased (+0.02%) to 76.164%

garlick removed the review label Nov 23, 2016

grondo referenced this pull request Nov 28, 2016

Create 0.6.0 release notes #916 (closed)
