Ats fixes #914

trws · 2016-11-22T02:48:45Z

This is an aggregation of small single-file commits with fixes required for the ATS backend to behave properly. The LRU cache bug only shows up once the cache size is exceeded, so it may be worth adding a test for that.

coveralls · 2016-11-22T03:10:50Z

Coverage increased (+0.002%) to 76.143% when pulling 69b5274 on trws:ats-fixes into 2cca060 on flux-framework:master.

codecov-io · 2016-11-22T03:11:28Z

Current coverage is 75.86% (diff: 100%)

Merging #914 into master will increase coverage by 0.02%

@@             master       #914   diff @@
==========================================
  Files           149        149          
  Lines         25991      25999     +8   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          19713      19725    +12   
+ Misses         6278       6274     -4   
  Partials          0          0

Diff Coverage	File Path
•••••••••• 100%	src/common/libutil/lru_cache.c
•••••••••• 100%	src/connectors/local/local.c

Powered by Codecov. Last update 2cca060...568ef12

grondo · 2016-11-22T17:31:32Z

Thanks @trws! and sorry about that lame lru bug! I'll create a test (thought I had already, duh!) and push it here.

Meanwhile, I hate to do this to you but the commit messages here should be fixed up to our commit message style (See RFC1: Patch Requirements)

trws · 2016-11-22T17:35:06Z

No problem, I kinda threw them in last night to get it going through travis. I'll fix them up and rebake the other PR as soon as I get back to my desktop.

grondo · 2016-11-22T18:16:03Z

@trws, I'm probably missing something simple, but if you have a chance can you give me more details on the reproducer for the move-to-front lru bug? The current test case is pretty simple, but it does create a cache of size 5 and adds 10 items, so the cache size is exceeded. It then continually touches the first item added so it should be continually moved back to the front of the list.

I assume I'm missing some corner case still.

garlick

Edit: er, I meant this to be a comment on the connectors/local change.

This change is fine, but I can't see what the problem was in the old code...

Anyway, we should probably make the same change in op_event_unsubscribe() and drop the shortjson.h include.

trws · 2016-11-22T20:45:50Z

@grondo unfortunately I don't have a small reproducer for the issue, just using sched to run 256+ jobs did it consistently. The result of the issue was that the count would be full, but the actual list would be empty. I think the pattern was a mix of using the most recent and cycling through all of the rest as they were accessed for state updates, but haven't dug in far enough to create a good reproducer.

@garlick, I don't think there was an actual problem with the old code... that's the especially odd part. It only comes up when using an allocator other than the default linux malloc, and it looks like what happens when the default malloc is used to get a piece of memory that's then passed to the free of tcmalloc or jemalloc or what have you. Why it appeared there, and only there, and not somewhere else that shortjson is used I can't say.

Truth be told, I'm not sure I hit the root cause of the issue, but the failure was always in that Jput, and it completely disappeared after the change.

The issue eventually causes the LRU to count full but contain no elements; when the NULL head or tail is used after that, it segfaults. It does not appear to be a simple pattern that happens every time something moves, but only when a node with a null prev pointer is not first, or a node with a null next pointer is not last.

trws · 2016-11-22T20:52:10Z

Rebased with each commit reworded.

coveralls · 2016-11-22T21:06:26Z

Coverage increased (+0.05%) to 76.191% when pulling d787750 on trws:ats-fixes into 2cca060 on flux-framework:master.

grondo · 2016-11-22T21:15:49Z

> The result of the issue was that the count would be full, but the actual list would be empty.

Thanks @trws that is a good hint, and again I apologize for sloppy code.

grondo · 2016-11-22T22:51:07Z

I verified that test/lru_cache.c is reproducing this issue, however, there is no mechanism in place to ensure the lru queue/list doesn't get out of sync with lru->count, so the corruption goes uncaught.

I did find that the real error here was that this line was missing from the end of lru_entry_remove. Without it there are dangling links back into the list. Oops, total rookie mistake.

    /* Reinitialize prev/next pointers to NULL */
    l->prev = l->next = NULL;

Your changes make the code clearer I think so they could stay in but we should add this fix on top as well, otherwise the internal list gets corrupted.
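
As a rough illustration of why that reset matters, here is a minimal sketch of a doubly linked list with a move-to-front operation. It is not the lru_cache.c code: `struct entry`, `struct list`, `list_remove`, `list_push_front`, and `list_move_to_front` are hypothetical names made up for the example, and the insert routine is assumed to rely on the entry being fully detached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real LRU structures. */
struct entry {
    struct entry *prev;
    struct entry *next;
    int id;
};

struct list {
    struct entry *first;
    struct entry *last;
    int count;
};

/* Unlink e from the list and leave it fully detached. */
static void list_remove (struct list *l, struct entry *e)
{
    if (e->prev)
        e->prev->next = e->next;
    else
        l->first = e->next;
    if (e->next)
        e->next->prev = e->prev;
    else
        l->last = e->prev;
    l->count--;
    /* The step under discussion: drop stale links back into the list. */
    e->prev = e->next = NULL;
}

/* Insert a detached entry at the front (assumes prev/next are NULL). */
static void list_push_front (struct list *l, struct entry *e)
{
    assert (e->prev == NULL && e->next == NULL);
    e->next = l->first;
    if (l->first)
        l->first->prev = e;
    else
        l->last = e;
    l->first = e;
    l->count++;
}

/* Move-to-front is a remove followed by a fresh insert. */
static void list_move_to_front (struct list *l, struct entry *e)
{
    list_remove (l, e);
    list_push_front (l, e);
}
```

In this sketch, dropping the reset line from `list_remove` would leave the moved entry with a stale `prev`, so the new front of the list would point back into the middle of it.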

I'm still working on a test that fails under the lru_cache unit test -- might have to add a lru_cache_verify() or other sanity check function to actually walk the list

garlick · 2016-11-22T23:18:07Z

@trws if you don't mind, please update the unsubscribe handler in the local connector as well and drop the shortjson.h include.

trws · 2016-11-22T23:19:48Z

Sure, I'll make those changes as soon as I get back to flux.

grondo · 2016-11-23T06:09:05Z

@trws, hope you don't mind I pushed some test additions and extra fixes to lru_cache here directly onto your PR. If you'd like them split off, go ahead and rebase without them.

In order to write a failing case for this issue, I had to add a lru_cache_selfcheck() function to check for corruption in the LRU list, compare number of items in the list vs lru->count etc. I then added a very specific reproducer for the corruption you encountered.

Even with your fix the selfcheck fails, so the missing reset of a removed item before re-insertion is required and should have always been there.
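
For readers who want a picture of what such a check might do, the following is a hedged sketch of a list consistency walk. It is not the actual lru_cache_selfcheck() implementation; `struct entry`, `struct list`, and `list_selfcheck` are hypothetical names, and the checks shown (back links agree with the forward walk, walked length equals the stored count) are simply the ones the comment above mentions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real LRU structures. */
struct entry {
    struct entry *prev;
    struct entry *next;
};

struct list {
    struct entry *first;
    struct entry *last;
    int count;
};

/* Return true if the links are mutually consistent and the number of
 * entries reachable from the front equals the stored count. */
static bool list_selfcheck (const struct list *l)
{
    const struct entry *e;
    const struct entry *prev = NULL;
    int n = 0;

    if (l->count < 0)
        return false;
    if (l->first && l->first->prev != NULL)
        return false;           /* front must have no predecessor */
    if (l->last && l->last->next != NULL)
        return false;           /* back must have no successor */
    for (e = l->first; e != NULL; e = e->next) {
        if (e->prev != prev)
            return false;       /* back link disagrees with forward walk */
        if (++n > l->count)
            return false;       /* more entries than count, or a cycle */
        prev = e;
    }
    if (prev != l->last)
        return false;           /* walk did not end at the recorded tail */
    return n == l->count;
}
```

A check like this catches the "count is full but the list is empty" state reported earlier in the thread, because the walk finds zero entries while the count is nonzero.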

Finally, a small optimization to do nothing when an entry at the front of the list is "requeued" was added.
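
The idea behind that optimization can be sketched as an early return before any relinking. This is only an illustration under assumed names (`struct entry`, `struct list`, and `list_requeue` are hypothetical), not the code that was pushed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real LRU structures. */
struct entry {
    struct entry *prev;
    struct entry *next;
};

struct list {
    struct entry *first;
    struct entry *last;
};

/* Move e, which must already be a member of l, to the front.
 * Returns 1 if the entry was relinked, 0 if nothing had to be done. */
static int list_requeue (struct list *l, struct entry *e)
{
    if (l->first == e)
        return 0;               /* already most recently used */

    /* e is a member but not first, so it has a predecessor. */
    assert (e->prev != NULL);
    e->prev->next = e->next;
    if (e->next)
        e->next->prev = e->prev;
    else
        l->last = e->prev;

    /* Relink at the front with clean pointers. */
    e->prev = NULL;
    e->next = l->first;
    l->first->prev = e;
    l->first = e;
    return 1;
}
```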

coveralls · 2016-11-23T06:15:01Z

Coverage increased (+0.006%) to 76.147% when pulling 50c5317 on trws:ats-fixes into 2cca060 on flux-framework:master.

trws · 2016-11-23T06:16:43Z

Not at all. I'm glad you found the root cause, and more so for the tests. Thanks @grondo!

grondo

Great cleanup! This could be squashed into d787750, then I think this PR is ready, though someone other than me should probably merge it.

(should have been a comment on 4b8fa6b)

This resolves a free of data allocated by a different allocator detected in jemalloc and tcmalloc from shortjson, always from this callsite. The issue does not appear to affect other uses of shortjson, or use with the standard system allocator, despite the fact that this is a common pattern and appears to match calls correctly. In removing the last of the shortjson dependence, I noticed that the sub and unsub functions were identical, so they now also use a common helper function to reduce duplication.
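
The "common helper" part of that commit message can be pictured roughly as follows. This is a toy sketch only: it does not use the flux API, and `struct subs`, `subs_add`, `subs_del`, `handle_sub_request`, `demo_subscribe`, and `demo_unsubscribe` are all hypothetical names. The point is just that two handlers that differ only in the final operation can share validation and dispatch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOPICS 8
#define MAX_TOPIC_LEN 32

/* Toy subscription table standing in for the real connector state. */
struct subs {
    char topics[MAX_TOPICS][MAX_TOPIC_LEN];
    int count;
};

static int subs_add (struct subs *s, const char *topic)
{
    if (s->count == MAX_TOPICS || strlen (topic) >= MAX_TOPIC_LEN)
        return -1;
    strcpy (s->topics[s->count++], topic);
    return 0;
}

static int subs_del (struct subs *s, const char *topic)
{
    for (int i = 0; i < s->count; i++) {
        if (strcmp (s->topics[i], topic) == 0) {
            /* Fill the hole with the last topic, if there is one. */
            if (i != s->count - 1)
                strcpy (s->topics[i], s->topics[s->count - 1]);
            s->count--;
            return 0;
        }
    }
    return -1;
}

typedef int (*subs_op_f)(struct subs *s, const char *topic);

/* Shared path: validate the request once, then apply the operation. */
static int handle_sub_request (struct subs *s, const char *topic,
                               subs_op_f op)
{
    if (s == NULL || topic == NULL || op == NULL)
        return -1;
    return op (s, topic);
}

static int demo_subscribe (struct subs *s, const char *topic)
{
    return handle_sub_request (s, topic, subs_add);
}

static int demo_unsubscribe (struct subs *s, const char *topic)
{
    return handle_sub_request (s, topic, subs_del);
}
```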

When an LRU entry was removed from the LRU list, failure to reset the prev/next pointers to NULL on the removed entry could cause an entry immediately re-inserted (as done in lru_requeue) to corrupt the internal list. For example, a move-to-front operation would cause the front of the list to have a prev pointer, which should never happen. Setting l->prev = l->next = NULL solves this problem, as any insert can then be handled the same as a new entry.

Add a function to run consistency checks on a lru_cache_t object. This is meant to test for corruption during unit testing.

Use lru_cache_selfcheck() to verify lru_cache_t has no corruption in test_lru_cache.t.

Add a reproducer for a specific case of lru_cache corruption which occurred during move-to-front due to failure to reinitialize lru_entry prev/next pointers during entry removal.

If an item is already on front of LRU list, avoid doing the requeue work since it is unnecessary.

trws · 2016-11-23T19:27:08Z

Rebased. Also popped in something that's been bugging me for a bit, added the intermediate xml files from doc building into the gitignore for in-tree builds.

coveralls · 2016-11-23T19:34:19Z

Coverage increased (+0.004%) to 76.145% when pulling 568ef12 on trws:ats-fixes into 2cca060 on flux-framework:master.

coveralls · 2016-11-23T19:45:22Z

Coverage increased (+0.02%) to 76.164% when pulling 568ef12 on trws:ats-fixes into 2cca060 on flux-framework:master.

grondo

Looks great to me now. Thanks @trws!

grondo · 2016-11-23T22:48:13Z

I probably should not be eligible to merge this one, @chu11, @morrone, @lipari, or @dongahn want to do a quick review of this and merge if OK? I think @garlick is not available today.

chu11 · 2016-11-23T23:32:24Z

everything looks good to me, hitting the big button

garlick added the review label Nov 22, 2016

garlick reviewed Nov 22, 2016

trws added 2 commits November 22, 2016 12:47

python: add unicode -> str conversion for cffi

0b9a160

trws force-pushed the ats-fixes branch from 69b5274 to d787750 Compare November 22, 2016 20:51

grondo reviewed Nov 23, 2016

trws and others added 7 commits November 23, 2016 11:25

lru_cache: add lru_cache_selfcheck

5e75cb7

test/lru_cache: add selfcheck to lru_cache tests

62df82a

test/lru_cache: add reproducer for LRU list corruption

8e962bb

lru_cache: optimize requeue of item on front of list

89e1c83

gitignore: ignore intermediate xml

568ef12

trws force-pushed the ats-fixes branch from 4b8fa6b to 568ef12 Compare November 23, 2016 19:26

grondo approved these changes Nov 23, 2016

chu11 merged commit dbfcf2a into flux-framework:master Nov 23, 2016

garlick removed the review label Nov 23, 2016

grondo mentioned this pull request Nov 28, 2016

Create 0.6.0 release notes #916

Closed
