Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libflux: add flux_msg_incref() and _decref() #2334

Merged
merged 16 commits into from Aug 26, 2019

Conversation

@garlick
Copy link
Member

commented Aug 25, 2019

This PR was peeled off from work on #2302. It adds flux_msg_incref() and flux_msg_decref() as discussed in issue #2039. It then visits a bunch of places where request messages are copied and saved for asynchronous handling and turns those into refcounts to avoid the copy.

One tradeoff is allowing a const flux_msg_t * to be increfed and decrefed, when the decref could destroy it. I thought this was ok. The presence of const reminds users that they are not to modify the message content, and the fact that destruction now occurs on last use rather than at an explicit point in the code seems OK? It really was not practical to introduce this without const for the deferred request case anyway, without changing the message handler prototype (too much change!).

We had discussed providing some real support for the recurring "deferrred request" thing. I was more focused the potential copy avoidance in plumbing for #2302 and didn't have any great ideas on how to generalize that here, even having walked through several use cases, sorry.

Something people might complain of: in several places I added a little container for deferred messages, when they were stored as barevoid * items in a zlist or aux hash, to maintain the const tag on the message. This helped me be more assured that the code I was modifying wasn't going to scribble on the message, but isn't strictly necessary - one could just cast away the const on insertion and put it back on access. I thought it was slightly clearer this way but as it replaces one malloc with another, the overall change might not be accomplishing much for small messages.

Speaking of, I'm not optimistic that this is a big performance win. The reduced copying in this PR probably doesn't amount to a lot. However, it's not a bad thing to have in our API IMHO.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 25, 2019

Nice cleanup!

I thought it was slightly clearer this way but as it replaces one malloc with another, the overall change might not be accomplishing much for small messages.

Haven't given this a full look yet, but it does seem a bit heavy handed to add a whole new type with constructor/destructor in order to preserve const here. Could the flux_msg_t just be copied as before in this case? I guess what you alluded to above is that the extra code and malloc() could be a net win if the messages are large?

In the end, since the new container isn't used in every instance of flux_msg_incref(), I don't find your approach distasteful at all, certainly not enough to change it in this PR.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2019

I have a series of fixups queued that get rid of all the struct request containers, except one in libsubprocess where there were two aux items that always went together and it seemed smart to just put them both in a single struct. On second examination, putting const messages directly in void * containers is no big deal. The code winds up being pretty clear, actually (IMHO).

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2019

Restarted a stuck builder (no output in 10m variety...)

@garlick

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2019

Shall I rebase the fixups? Might be easier to see the changes that way.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2019

Sure. Works for me. I'll check again in the morning, but I don't see why this shouldn't go in.

@garlick garlick force-pushed the garlick:msg_incref branch from 6f95347 to 320332f Aug 26, 2019
static void job_clean (struct job *job)
{
if (job) {
free (job->jobspec);
job->jobspec = NULL;
(void)flux_msg_set_payload (job->msg, NULL, 0);
job->J = NULL;

This comment has been minimized.

Copy link
@chu11

chu11 Aug 26, 2019

Contributor

removal of these two lines seems unrelated to this commit? Possible rebase conflict? Or if it was just general cleanup (seems like it is), probably just needs a comment in the commit message.

This comment has been minimized.

Copy link
@garlick

garlick Aug 26, 2019

Author Member

The purpose of that code was to drop the payload from the deferred request since when this is called, the data has been copied to the kvs transaction. Now that the deferred request is a const refcounted message instead of a copy, we can't do that.

I'll update the commit message - should have called that out!

@@ -27,7 +27,7 @@ struct kvssync {
flux_msg_handler_f cb;

This comment has been minimized.

Copy link
@chu11

chu11 Aug 26, 2019

Contributor

typo in commit message , "seemd" -> "seemed"

@@ -552,6 +552,12 @@ void check_sendzsock (void)
int type;
const char *uri = "inproc://test";

zsys_init ();

This comment has been minimized.

Copy link
@chu11

chu11 Aug 26, 2019

Contributor

perhaps add comment in commit message here, so that we know why the zys boiler plate is around this chunk of code.

This comment has been minimized.

Copy link
@garlick

garlick Aug 26, 2019

Author Member

Already have this;

Also, since two tests are lives_ok() tests that call fork(),
add zsys boiler plate around zeromq socket tests, as not doing
so seems to result in assertions in czmq's atexit handler.

This comment has been minimized.

Copy link
@chu11

chu11 Aug 26, 2019

Contributor

sorry, what I meant was put the comments from the commit message into the code :-)

This comment has been minimized.

Copy link
@garlick

garlick Aug 26, 2019

Author Member

I misread your comment, sorry! Yes good idea.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2019

I'll go ahead and make those changes, squash, update to current master, and force a push if no objections.

garlick added 16 commits Aug 25, 2019
Problem: it's sometimes necessary to make an effecively "const"
copy of a message just to defer the message's destruction.
For example, to retain a request message for asynchronous processing
after a message request handler has returned.

Introduce flux_msg_incref() and flux_msg_decref().  These functions
take a (const flux_msg_t *) argument to allow the messsage's destruction
to be deferred to the last user, without compromising its read-only
protection, which promotes sharing.

Fixes #2039

Also, the following cleanup:
- drop inconsistently used and untested internal "magic cookie" assertions
- factor out flux_msg_create_common() so we don't need to set the initial
  refcount in three different places.
Problem: I'm tired of doing the errno save/restore around
destructor code when the only function that cannot be trusted
not to clobber errno is free(3).

Create ERRNO_SAFE_FREE() macro that saves and restores
errno in self-contained fashion.

Name the header errno_safe.h, leaving the door open for
other such wrappers.
Problem: flux_msg_decode() sets errno to EINVAL in
out of memory situations.

Change the errno in those cases to ENOMEM.

Clean up errno handling in function, now that flux_msg_destroy()
takes pains not to clobber errno.
Problem: struct map_struct violates RFC 7.

Rename to "struct typemap".

Also rename the table from "msgtype_map" to "typemap" to
make it use less horizontal space for readability.
Instead of copying a deferred request message, use message
refcounts.
Instead of copying a deferred request message, simply
increment its refcount.

Unrelated cleanup: use ERRNO_SAFE_FREE() macro in jobreq_destroy(),
since errno is used afterwards by flux_log_error() and free(3)
cannot be trusted not to clobber it.
Instead of copying a deferred request message, use message
refcounts.

Rename some functions that operate on lists of requests
to have a common request_list_* prefix, for clarity.
Instead of copying a deferred job ingest request,
simply increment its refcount.

In job_clean(), remove code that dropped the payload in
a copy of the deferred request.  The message is const now
so we can't do that.
Instead of copying a deferred job drain request, simply
increment its refcount.
Instead of copying a deferred request, simply increment
its refcount.
Instead of copying a deferred request, simply increment
its refcount.
Instead of copying request messages, use message refcounts
in waitqueue, kvssync, and treq modules.

Note: there are a couple more cases in kvs.c where request
copies might be replaced with refcounted originals, but
the code seemed fairly complex, so save for later.
Instead of copying a deferred request, simply increment
its refcount.
Instead of copying a deferred request, simply increment
its refcount.
Instead of copying deferred rexec request into future
aux hash, create a struct containing refcounted message
and server context, and put that in the aux hash.
Also, since two tests are lives_ok() tests that call fork(),
add zsys boiler plate around zeromq socket tests, as not doing
so seems to result in assertions in czmq's atexit handler.
@garlick garlick force-pushed the garlick:msg_incref branch from 320332f to 93cda92 Aug 26, 2019
@codecov-io

This comment has been minimized.

Copy link

commented Aug 26, 2019

Codecov Report

Merging #2334 into master will increase coverage by 0.01%.
The diff coverage is 77.08%.

@@            Coverage Diff             @@
##           master    #2334      +/-   ##
==========================================
+ Coverage   80.82%   80.84%   +0.01%     
==========================================
  Files         215      215              
  Lines       34256    34238      -18     
==========================================
- Hits        27687    27679       -8     
+ Misses       6569     6559      -10
Impacted Files Coverage Δ
src/modules/job-info/lookup.c 67.74% <100%> (+1.86%) ⬆️
src/modules/kvs-watch/kvs-watch.c 79.41% <100%> (ø) ⬆️
src/modules/kvs/kvssync.c 93.33% <100%> (+2.85%) ⬆️
src/modules/job-ingest/job-ingest.c 73.43% <100%> (-0.41%) ⬇️
src/modules/job-exec/job-exec.c 75.49% <100%> (+0.81%) ⬆️
src/modules/sched-simple/sched.c 73.26% <100%> (ø) ⬆️
src/modules/barrier/barrier.c 79.78% <100%> (-0.22%) ⬇️
src/common/libschedutil/alloc.c 69.64% <100%> (ø) ⬆️
src/modules/job-info/guest_watch.c 72.01% <100%> (+0.62%) ⬆️
src/modules/kvs/waitqueue.c 90.47% <100%> (+2.19%) ⬆️
... and 12 more
@garlick

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2019

This is ready to go IMHO. The coverage drop is (I think) due to the removed containers, which would have increased the ratio of new (covered) code to old (uncovered) error code that this PR touches. Anyway, I don't see an easy way to improve the coverage.

@chu11
chu11 approved these changes Aug 26, 2019
@chu11

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2019

LGTM. Don't know if @grondo has completed this "I'll check again in the morning" yet, so just want to make sure he's approved as well.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2019

Yeah, looks fine to me. The actual project coverage went up so a win all around!

@grondo grondo merged commit fa1fbdf into flux-framework:master Aug 26, 2019
3 of 4 checks passed
3 of 4 checks passed
codecov/patch 77.08% of diff hit (target 80.82%)
Details
Summary 1 potential rule
Details
codecov/project 80.84% (+0.01%) compared to 5378e4c
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member Author

commented Aug 26, 2019

Thanks!

@garlick garlick deleted the garlick:msg_incref branch Aug 27, 2019
@garlick garlick referenced this pull request Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.