Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KVS oom() and xzmalloc() cleanup #1124

Merged
merged 22 commits into from Jul 25, 2017

Conversation

Projects
None yet
5 participants
@chu11
Copy link
Contributor

chu11 commented Jul 25, 2017

Like other refactorings that have been going on in flux-core, this one removes a large number of oom() calls and functions that call oom() (notably xzmalloc() and xstrdup()). Instead
ENOMEM is detected, properly return as an error, check for error, log error, etc.

A few oom()s remain (wait_runqueue() and the merging ops functions). Those will require a tad more re-architecture and logic updates so I will put those in another PR. The changes in this PR are relatively more grunted out fixes without much thought except for cleanup path handling.

I suspect the coverage on this diff will be bad. I'd be surprised if it was even greater than 50% coverage, as almost the entirety of it is handling ENOMEM errors and the cascading checks for functions that can now return an ENOMEM error.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 25, 2017

Coverage Status

Coverage decreased (-0.5%) to 77.991% when pulling 2341204 on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Jul 25, 2017

Codecov Report

Merging #1124 into master will decrease coverage by 0.49%.
The diff coverage is 41.27%.

@@            Coverage Diff            @@
##           master    #1124     +/-   ##
=========================================
- Coverage   78.07%   77.57%   -0.5%     
=========================================
  Files         184      184             
  Lines       31430    31668    +238     
=========================================
+ Hits        24538    24567     +29     
- Misses       6892     7101    +209
Impacted Files Coverage Δ
src/common/libkvs/jansson_dirent.c 64.44% <25%> (-16.12%) ⬇️
src/modules/kvs/fence.c 77.77% <28.57%> (-4.58%) ⬇️
src/modules/kvs/lookup.c 83.68% <34.88%> (-7.69%) ⬇️
src/modules/kvs/kvs.c 69.12% <36.66%> (-6.24%) ⬇️
src/modules/kvs/commit.c 73.08% <41.41%> (-8.74%) ⬇️
src/modules/kvs/waitqueue.c 81.13% <45%> (-5.33%) ⬇️
src/modules/kvs/cache.c 91.3% <62.85%> (-1.94%) ⬇️
src/modules/kvs/kvs_util.c 73.91% <64.7%> (-4.35%) ⬇️
t/mpi/hello.c 93.54% <0%> (-6.46%) ⬇️
src/common/libkvs/kvs.c 70.8% <0%> (-6.1%) ⬇️
... and 16 more
@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

hmmm. the 40.27% dff target isn't too surprising. But the project coverage going down 0.53% seems rather high. But then again, that's only 156 lines of code, which doesn't seem too alarming given all of the ENOMEM paths. I'll see if I can tweak anything and get the coverage better.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 25, 2017

Yeah, the unchecked paths mostly seem like error paths, and it is very difficult to get coverage there. However, it would seem like some of these error paths would be really good to check, and for example cleanup_dirty_cache_entry is not called during testing. Could you think of any way to trigger an error in any these paths for testing purposes, and ensure the cleanup works as expected?

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 25, 2017

I just peeked at this real quick to check on the coverage report, but I really appreciate some of the nice code comments you added in tricky sections. Really helpful for someone browsing the code. Thanks!

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 25, 2017

Nice work @chu11 as usual!

I haven't looked in detail at how it might apply here but we do have the debug hooks for modules that are set with flux module debug. E.g. see some of the tests in t0017-security.t that tweak the connector-local and userdb modules. Possibly some test bits for the kvs module could be used to trigger some of the hairier error paths.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

@grondo Yeah, the cleanup_dirty_cache_entry in kvs.c did stick out a bit. I think if I somehow make a common function along with cleanup_dirty_cache_entry in commit.c then I'll atleast have confidence that the cleanup would work (it is covered via the commit.c specific unit tests).

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

pushed a refactor creating a common function commit_cleanup_dirty_cache_entry, and added a unit test for a general missed case. May not matter much, but hopefully the net change is a bit less.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 25, 2017

Coverage Status

Coverage decreased (-0.4%) to 78.014% when pulling 146b62e on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

@garlick
Copy link
Member

garlick left a comment

I added a few minor comments but on the whole this is a really positive set of changes.

goto done;
}
if (!blobref || blobref[blobref_size - 1] != '\0') {
errno = EPROTO;
flux_log_error (ctx->h, "%s", __FUNCTION__);
flux_log_error (ctx->h, "%s: EPROTO", __FUNCTION__);

This comment has been minimized.

@garlick

garlick Jul 25, 2017

Member

This is going to print "content_store_get: EPROTO: Protocol Error". Maybe change "EPROTO" to "invalid blobref"?

oom ();
if (!(s = json_dumps (tmp, flags)))
oom ();
if (!(tmp = json_null ())) {

This comment has been minimized.

@garlick

garlick Jul 25, 2017

Member

Or maybe just replace json_null() and json_dumps() calls with strdup ("null")?

bool rc = strlen (s) > BLOBREF_MAX_STRING_SIZE;
free (s);
return rc;
return rc ? 1 : 0;

This comment has been minimized.

@garlick

garlick Jul 25, 2017

Member

Maybe just make rc an int and set it before the free, then return it directly?

bool rc = strlen (s) > BLOBREF_MAX_STRING_SIZE;
free (s);
return rc;
return rc ? 1 : 0;

This comment has been minimized.

@garlick

garlick Jul 25, 2017

Member

Suggestion: make new utility function that calls kvs_util_json_dumps() internally:

// returns 0 on success, -1 on failure
int kvs_util_json_encoded_size (json_t *o, size_t *size);

then below, use it to set a new variable size_t value_encoded_size, and test it for > BLOBREF_MAX_STRING_SIZE right there in commit_unroll().

Not a big deal, but might be a little clearer.

This comment has been minimized.

@chu11

chu11 Jul 25, 2017

Author Contributor

Ahhh, I see what you mean. Yeah, that'll probably clarify things a bit. But more importantly it'll use kvs_util_json_dumps(), which I had missed earlier.

/* Return true if load successful, false if stalling */
static bool load (kvs_ctx_t *ctx, const href_t ref, wait_t *wait, json_t **op)
/* Return 1 if load successful, 0 if stalling, -1 on error */
static int load (kvs_ctx_t *ctx, const href_t ref, wait_t *wait, json_t **op)

This comment has been minimized.

@garlick

garlick Jul 25, 2017

Member

Would it be clearer to add a stalled parameter to the function and just return 0 on success, -1 on failure?

@@ -253,10 +254,13 @@ static int commit_unroll (commit_t *c, int current_epoch, json_t *dir)
goto done;

This comment has been minimized.

@garlick

garlick Jul 25, 2017

Member

comment block describing return value for commit_unroll() needs an update (still refers to boolean return).

This comment has been minimized.

@chu11

chu11 Jul 25, 2017

Author Contributor

oops, this is all the way back from my commit API refactoring. Will add a cleanup.

@@ -325,9 +332,11 @@ static int commit_link_dirent (commit_t *c, int current_epoch,
errno = ENOMEM;
goto done;
}
if (json_object_set_new (dir,

This comment has been minimized.

@garlick

garlick Jul 25, 2017

Member

Not sure if this is workable, but would it be easier to just dispense with j_dirent_create() and call json_pack() in all the places where it's used? It doesn't really do very much, and it might be easier to keep track of refcounts and error behavior if we were more direct. Ignore if that's out of scope here.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

Just pushed fixes for all the tiny things @garlick found. If looks good, can squash and then rebase on master.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 25, 2017

Coverage Status

Coverage decreased (-0.4%) to 78.042% when pulling 3df439d on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 25, 2017

Those changes look good!

@chu11 chu11 force-pushed the chu11:kvs_oom_xzmalloc_cleanup-try4 branch from 3df439d to 2df537e Jul 25, 2017

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

just pushed a rebased + squashed tree. Didn't squash every cleanup patch, as some of them seemed more like fixes/cleanup from the prior PRs, so they should stay separate.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 25, 2017

Coverage Status

Coverage decreased (-0.4%) to 78.031% when pulling 2df537e on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

hmmm, getting valgrind errors. But I only squashed and rebased compared to the last push. Wonder what I did.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Jul 25, 2017

Hm, that is a bit scary. I hope it is not a strange false positive.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

I think I may have just read the travis results wrong. Normally "coverage/coveralls" is the first one listed and "continuous-integration/travis-ci" is second. I saw the red x and green checkmark and assumed travis had passed and coverage was bad. Think I found the mem-leak I introduced, and perhaps another minor cleanup I can fix.

@chu11 chu11 force-pushed the chu11:kvs_oom_xzmalloc_cleanup-try4 branch from 2df537e to cb64774 Jul 25, 2017

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

re-pushed, i forgot to free memory in the new kvs_util_json_encoded_size() function. That's the only change.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 25, 2017

Coverage Status

Coverage decreased (-0.4%) to 78.036% when pulling cb64774 on chu11:kvs_oom_xzmalloc_cleanup-try4 into ccf25dc on flux-framework:master.

chu11 added some commits Jul 24, 2017

modules/kvs: Add/cleanup flux log error messages
Add flux log error messages in places there should be log messages.

Fix up flux log error message formats for consistency.
modules/kvs: Minor code cleanup
Move log function above asserts, so log message will definitely
occur.
modules/kvs: Remove basic oom() calls
Begin refactoring removal of oom () calls.  Begin by removing oom ()
calls where it is easy to simply replace with error responses.  i.e.
functions that already return errors and callers check for errors, so
slipping in a new return of ENOMEM is relatively simple.

Remove include of oom.h where appropriate.
modules/kvs: Replace basic alloc calls that exit
Begin refactoring away functions that will oom() and exit on
ENOMEM errors.  Replace calls of xstrdup(), xzmalloc(), and
xasprintf() with strdup(), calloc(), and asprintf() respectively
and return ENOMEM to callers appropriately.  Begin by replacing
where it is easy to simply replace with error responses.  i.e
functions that already return errors and callers check for errors,
so slipping in a new return of ENOMEM is relatively simple.

Remove header include of xzmalloc.h where appropriate.
modules/kvs: Remove oom() from kvs_util
kvs_util_json_dumps() and kvs_util_json_hash() can now return ENOMEM
as an error.

Adjust callers of these functions appropriately to handle error.
modules/kvs: Remove oom() in cache functions.
Have cache_get_stats() and cache_expire_entries() return -1 on error.

Adjust callers appropriately to handle potential error.

chu11 added some commits Jul 21, 2017

modules/kvs: Fix oom in cache entry list push
When putting a dirty cache entry in the item callback list, do
not oom(), instead cleanup the dirty cache entry and return
ENOMEM appropriately.
modules/kvs: Remove oom() in fileval_big()
Have fileval_big() return -1 on error, 0 on false, 1 on true.

Adjust callers appropriately to handle potential error.
modules/kvs: Refactor load() function
Refactor load() function so it can return an error instead of only
returning true/false on load/stall.
modules/kvs: Remove oom() from wait_queue_create()
Have wait_queue_create() return NULL on ENOMEM.

The potential for error in wait_queue_create() cascades to
cache_entry_wait_notdirty() and cache_entry_wait_valid(), which now
can return errors.  Update all callers of cache_entry_wait_notdirty()
and cache_entry_wait_valid() to handle potential errors.

Refactor cleanup_dirty_cache_entry() in kvs.c and commit.c into
commit_cleanup_dirty_cache_entry() common function.

Update unit tests for cache_entry_wait_notdirty() and cache_entry_wait_valid().
Add unit test for commit_cleanup_dirty_cache_entry().
common/libkvs: Return ENOMEM from j_dirent_create
Do not oom() on errors in j_dirent_create(), return ENOMEM appropriately.

Fix all callers of j_dirent_create() to check for error and handle
error appropriately.
modules/kvs: Cleanup oom responses in commit API
With refactor / change of j_dirent_create(), now remove oom() call
from json_object_set_new() calls and return error appropriately.
modules/kvs: Do not oom() in cache_entry_create()
Call calloc() over xzmalloc() in cache_entry_create() and return
ENOMEM appropriately.  Update all callers to handle error appropriately.
modules/kvs: Remove kvs_util_json_copydir()
Replace kvs_util_json_copydir() with json_copy() and check for
ENOMEM errors appropriately.

Remove unnecessary kvs_util.h includes.

Update unit tests appropriately.
modules/kvs: Remove oom() from wait_addqueue()
Return -1 on error from wait_addqueue() and 0 on success.

Update all callers to check for error appropriately.

Update unit tests appropriately.
modules/kvs: Remove oom() from wait_destroy_msg()
Handle errors appropriately and update callers to handle errors.
modules/kvs: Add cache_entry_set_json unit test
Add missing unit test for removing object from cache entry
modules/kvs: Simplify kvs_util_json_dumps()
Instead of using two jansson calls, just strdup the string "null".
modules/kvs: Add kvs_util_json_encoded_size()
Add new kvs_util_json_encoded_size() function determine the
size of an object consistent to the format used by kvs_util_json_dumps().

Use this function instead of fileval_big() in commit API.

Add unit tests appropriately.

@chu11 chu11 force-pushed the chu11:kvs_oom_xzmalloc_cleanup-try4 branch from cb64774 to e464830 Jul 25, 2017

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

rebased on master and re-pushed

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jul 25, 2017

Coverage Status

Coverage decreased (-0.4%) to 78.018% when pulling e464830 on chu11:kvs_oom_xzmalloc_cleanup-try4 into a132370 on flux-framework:master.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Jul 25, 2017

Finally through travis! Ready @chu11?

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Jul 25, 2017

yup!

@garlick garlick merged commit 5f48548 into flux-framework:master Jul 25, 2017

2 of 4 checks passed

codecov/patch 41.27% of diff hit (target 78.07%)
Details
codecov/project 77.57% (-0.5%) compared to a132370
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.4%) to 78.018%
Details

@grondo grondo referenced this pull request Aug 23, 2017

Closed

0.8.0 Release #1160

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.