Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add smart pointer support and misc. cleanup #537

Merged
merged 14 commits into from Nov 19, 2019
Merged

Conversation

@dongahn
Copy link
Contributor

dongahn commented Oct 15, 2019

This PR adds smart pointer support using std::shared_ptr to:

  • Improve memory management by disallowing memory leaks in general;
  • Allow the traverser code to use shared pointers on a few key member variables: filtered graph, metadata and match policy callback members. The traverser does not solely "own" these objects (in fact they are being passed from outside), and this means using raw pointers requires a careful manual coordination between the owners of these objects. Sharing the ownership of these objects using std::shared_ptr should be safer and far more elegant.

This PR also adds a few other cleanups:

  • Use fully qualified std namespace paths for all of the symbols in this namespace;
  • Change log messages within the resource module to include the function name in the message, which contains that log message statement;
  • Other misc. style changes including enforcing 80 character per line limit and improving indentations for readability.
@dongahn dongahn requested review from milroy and SteVwonder Oct 15, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 15, 2019

@SteVwonder and @milroy: I added you as the reviewers for this.

Although this PR spans multiple files and many lines, a review should be relatively painlessly. This essentially replaces raw pointers to shared_ptr objects: mostly drop-in replacements. But note that there are a few places like "destroy" logic for resource contexts within resource_match.cpp and resource-query.cpp, which would need some good review, though.

Others are style changes, log message changes and applying a bit better practice of using explicit namespace paths within std.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 15, 2019

Two Travis test instances failed. Let me look at them before your reviews.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 15, 2019

qmanager: loading resource and qmanager modules works

expecting success:
    jobid=$(flux job submit basic.json) &&
    flux job wait-event -t 2 ${jobid} start &&
    flux job wait-event -t 2 ${jobid} finish &&
    flux job wait-event -t 2 ${jobid} release &&
    flux job wait-event -t 2 ${jobid} clean

flux-job: flux_open: Connection reset by peer
not ok 4 - qmanager: basic job runs in simulated mode

expecting success:
    jobid=$(flux jobspec srun -t 1 hostname | \
        exec_test | flux job submit) &&
    flux job wait-event -vt 2.5 ${jobid} start &&
    flux job cancel ${jobid} &&
    flux job wait-event -t 2.5 ${jobid} exception &&
    flux job wait-event -t 2.5 ${jobid} finish | grep status=15 &&
    flux job wait-event -t 2.5 ${jobid} release &&
    flux job wait-event -t 2.5 ${jobid} clean &&
    exec_eventlog $jobid | grep "complete" | grep "\"status\":15"

flux-job: flux_open: No such file or directory
not ok 5 - qmanager: canceling job during execution works

flux-job: flux_open: No such file or directory

This probably means the flux instance crashed...?

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 15, 2019

$ src/test/backtrace-all.sh

This doesn't report anything though...

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Oct 15, 2019

Yeah, unfortunately it has been a long standing problem that Travis can't generate corefiles in docker images.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 15, 2019

@grondo: ah...

Thankfully I just reproduced this on my docker using clang-6.0!

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Oct 15, 2019

Whew, otherwise the approach is typically a series of printf debugging via pushes to get a Travis build. Not fun!

@dongahn dongahn force-pushed the dongahn:shared_ptr branch from 393d221 to 42192f8 Oct 15, 2019
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Oct 15, 2019

Codecov Report

Merging #537 into master will decrease coverage by 0.44%.
The diff coverage is 68.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #537      +/-   ##
==========================================
- Coverage   76.03%   75.59%   -0.45%     
==========================================
  Files          69       69              
  Lines        6481     6515      +34     
==========================================
- Hits         4928     4925       -3     
- Misses       1553     1590      +37
Impacted Files Coverage Δ
resource/policies/dfu_match_policy_factory.hpp 100% <ø> (ø) ⬆️
resource/readers/resource_reader_hwloc.hpp 100% <ø> (ø) ⬆️
resource/readers/resource_reader_grug.hpp 100% <ø> (ø) ⬆️
resource/writers/match_writers.hpp 100% <ø> (ø) ⬆️
resource/utilities/command.hpp 100% <ø> (ø) ⬆️
resource/libjobspec/jobspec.hpp 100% <ø> (ø) ⬆️
resource/store/resource_graph_store.hpp 100% <ø> (ø) ⬆️
resource/readers/resource_reader_jgf.hpp 100% <ø> (ø) ⬆️
resource/readers/resource_reader_base.hpp 100% <ø> (ø) ⬆️
resource/traversers/dfu_impl.hpp 100% <ø> (ø) ⬆️
... and 17 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8342b7d...fd9207a. Read the comment docs.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 15, 2019

Ugg... Docker hub service availability issue?

Building image bionic for user travis 2000 group=20000.50s
482Sending build context to Docker daemon   2.56kB
483Step 1/9 : FROM fluxrm/flux-core:bionic
484received unexpected HTTP status: 503 Service Unavailable
485docker-run-checks.sh: docker build failed
@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Oct 15, 2019

Ugg... Docker hub service availability issue?

Looks that way. Just try restarting the builder...

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 15, 2019

@grondo: OK. will restart.

FYI -- Seems the issue was that shared_ptr C++ smart pointers don't directly translate to C-style void *-based callbacks.

You can't directly pass a shared_ptr object to void * as if it is a raw pointer.

You can take the address of a shared_ptr object and pass this raw pointer into void * so that you can later retrieve and cast it back to share_ptr on each callback invocation. But the caveat is, you have to make sure that this shared_ptr object must be valid all the time. I was passing a temporary shared_ptr object into flux_aux_set which then led to an undefined behavior.

I moved flux_aux_set from getctx to mod_main and passed its resource context shared_ptr into this API. This should be ok as each object in mod_main's stack frame should be valid as far as the resource module is active.

We'll see if Travis likes it this time.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 15, 2019

This should be good to review now. Code coverage failures should be ignored. I glanced at them and the coverage slightly decreased because I add more lines in the exceptional conditions as part of code cleanup and use of C++ exceptions...

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 17, 2019

FYI -- Seems the issue was that shared_ptr C++ smart pointers don't directly translate to C-style void *-based callbacks.

@trws and I discussed this a bit after our meeting. This is a generic problem for C++ code interfacing with C callbacks in flux-core. I will create at a ticket so that @trws can propose something to solve this generally. The current code should serve its purpose in the meantime.

Copy link
Member

milroy left a comment

I've reviewed this PR and included some comments that I hope to be helpful. As a disclaimer some of this code is beyond my C++ expertise, so another reviewer should submit feedback before the PR is merged.

std::shared_ptr<qmanager_ctx_t> ctx = nullptr;
ctx = *(static_cast<std::shared_ptr<qmanager_ctx_t> *>(arg));
Comment on lines +128 to +129

This comment has been minimized.

Copy link
@milroy

milroy Oct 21, 2019

Member

I assume there aren't complications created by using newer C++ features in a C linkage as long as the input parameters and return are C compatible?

qmanager/modules/qmanager.cpp Show resolved Hide resolved
const char *t, int s, void *a)
{
std::shared_ptr<job_t> job;
qmanager_ctx_t *ctx = (qmanager_ctx_t *)a;
std::shared_ptr<qmanager_ctx_t> ctx = nullptr;
ctx = *(static_cast<std::shared_ptr<qmanager_ctx_t> *>(a));
Comment on lines 182 to 186

This comment has been minimized.

Copy link
@milroy

milroy Oct 21, 2019

Member

General comment: the arguments' names (e.g. h, t, s, a) aren't very descriptive.

This comment has been minimized.

Copy link
@dongahn

dongahn Nov 13, 2019

Author Contributor

Agreed. I think this was probably a space saver. I will make these arguments more descriptive.

@@ -675,12 +679,13 @@ static int run_match (resource_ctx_t *ctx, int64_t jobid, const char *cmd,
return rc;
}

static inline bool is_existent_jobid (const resource_ctx_t *ctx, uint64_t jobid)
static inline bool is_existent_jobid (std::shared_ptr<resource_ctx_t> &ctx,

This comment has been minimized.

Copy link
@milroy

milroy Oct 21, 2019

Member

Shouldn't this be is_existent_jobid (const std::shared_ptr<resource_ctx_t> &ctx,...?

This comment has been minimized.

Copy link
@dongahn

dongahn Nov 13, 2019

Author Contributor

Good catch. I think so. Will change.

static void freectx (void *arg)
{
resource_ctx_t *ctx = (resource_ctx_t *)arg;
if (ctx) {
flux_msg_handler_delvec (ctx->handlers);
delete ctx->matcher;
delete ctx->traverser;
delete ctx->fgraph;
for (auto &kv : ctx->jobs) {
delete kv.second; /* job_info_t* type */
ctx->jobs.erase (kv.first);
}
delete ctx->writers;
ctx->jobs.clear ();
ctx->allocations.clear ();
ctx->reservations.clear ();
delete ctx;
}
}

Comment on lines -143 to -160

This comment has been minimized.

Copy link
@milroy

milroy Oct 21, 2019

Member

I've visually confirmed that much of this functionality is replicated by the new shared_ptr structure (including the resource_ctx_t destructor). However, I don't see that something equivalent to ctx->allocations.clear (); ctx->reservations.clear (); occurs. There are two places of concern- I've noted them below.

This comment has been minimized.

Copy link
@dongahn

dongahn Nov 13, 2019

Author Contributor

I don't think ctx->allocations.clear () and ctx->reservations.clear() are required. When the shared pointer is going out of scope, it will automatically call the destructors of allocation and reservation each of std::map type. The destructors will clean up these maps properly.

}
ctx->fgraph = NULL; /* Cannot be allocated at this point */
ctx->writers = NULL; /* Cannot be allocated at this point */
flux_aux_set (h, "resource", ctx, freectx);

This comment has been minimized.

Copy link
@milroy

milroy Oct 21, 2019

Member

freectx also executes ctx->allocations.clear (); ctx->reservations.clear ();. Is there a replacement in the updated code?

This comment has been minimized.

Copy link
@dongahn

dongahn Nov 13, 2019

Author Contributor

I think flux_aux_set shouldn't call freectx. Otherwise both freectx and smart pointer will try to free memory associated with context - double free. Right now, I send the raw pointer of context into (void *) argument of flux_aux_set, but this is just a kludge to deal with C and C++ compatibility. flux_aux_set shouldn't own this pointer but rely on the smart pointer to deallocate memory properly. Once @trws's PR #542 completes, this should be handled in a much more robust fashion.

@@ -300,8 +287,7 @@ static resource_ctx_t *init_module (flux_t *h, int argc, char **argv)
return ctx;

error:
freectx (ctx);
return NULL;
return nullptr;

This comment has been minimized.

Copy link
@milroy

milroy Oct 21, 2019

Member

How is ctx->allocations.clear (); ctx->reservations.clear (); performed in the updated code?

This comment has been minimized.

Copy link
@dongahn

dongahn Nov 13, 2019

Author Contributor

Again you shouldn't have to call clear(). The smart pointer should handle this automatically.

else if (policy == VAR_AWARE)
matcher = (dfu_match_cb_t *)new var_aware_t ();
matcher = std::make_shared<var_aware_t> ();

This comment has been minimized.

Copy link
@milroy

milroy Oct 21, 2019

Member

Should there be a guard else if policy is something new or unsupported?

This comment has been minimized.

Copy link
@dongahn

dongahn Nov 13, 2019

Author Contributor

Same comment as above. Let me know if you disagree.

{
if (ctx->params.r_fname != "")
ctx->params.r_out.close ();
if (ctx->params.o_fname != "")
write_to_graph (ctx);
destory_resource_ctx (ctx);

This comment has been minimized.

Copy link
@milroy

milroy Oct 21, 2019

Member

Does the updated code also (need to?) do something equivalent to ctx->allocations.clear (); ctx->reservations.clear ();?

This comment has been minimized.

Copy link
@dongahn

dongahn Nov 13, 2019

Author Contributor

Same comment as above.

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Oct 21, 2019

@trws and I discussed this a bit after our meeting. This is a generic problem for C++ code interfacing with C callbacks in flux-core. I will create at a ticket so that @trws can propose something to solve this generally. The current code should serve its purpose in the meantime.

@dongahn : at the end of #538, you mentioned that you would consider @trws' suggestion with a fresh set of eyes. Do you plan to add that suggestion to this PR? Or in a separate PR? Just wondering if I should review this as-is or wait for more updates.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Oct 21, 2019

@SteVwonder: Thank. I thought about this and it would be best to not to add shared pointer for the context piece. Because the changes will be substantial, it would be best to close this and do another PR.

@milroy: I will look at your review and address that in a new PR as well.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Nov 13, 2019

@SteVwonder: Thank. I thought about this and it would be best to not to add shared pointer for the context piece. Because the changes will be substantial, it would be best to close this and do another PR.

Unfortunately, backing this up will be too disruptive for this PR and my other PR (#543). My proposal is to update this PR with minor changes requested by @milroy and merge it in so that we can quickly get to #541

C/C++ compatibility will be handled in a more robust way with @trws' PR #542.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Nov 14, 2019

@milroy: I add two additional commits that address your review points. If you are okay with it and @SteVwonder has no other comments, this can go in after squashing these new commits. Let me know.

@milroy

This comment has been minimized.

Copy link
Member

milroy commented Nov 16, 2019

@milroy: I add two additional commits that address your review points. If you are okay with it and @SteVwonder has no other comments, this can go in after squashing these new commits. Let me know.

@dongahn I've looked over your responses and this PR is ready to merge as far as I'm concerned.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Nov 16, 2019

@dongahn I've looked over your responses and this PR is ready to merge as far as I'm concerned.

Sounds good. I will squash those two latest commits and I think this can go in.

dongahn added 3 commits Oct 14, 2019
Add better memory management into qmanager
using a smart pointer (std::shared_ptr).

Queue policy factory class:
Return a shared_ptr object from the queue
factory method.

qmanager:
Integrate the factory class change into
the qmanager module code.

Use shared_ptr to manage the main context
for qmanager (qmanager_ctx_t) as well.
Ctx will be freed right before module_main
exists.

Note that this didn't go all the way
to convert the qmanager_new and qmanager_destroy
into a constructor and destructor of
qmanager_ctx_t keeping the code style consistent
with other C-based modules.
Change it because the constructor shouldn't modify
the input jobspec string.
Add virtual destructors as any class that derives
from an abstract class should have them for
a correct resource release sequence.

While there is no harm done here as these reader
classes currently have no resources to free,
it is a good practice in case they are extended
to include resource acquisition in the future.
dongahn added 11 commits Oct 14, 2019
Enforce 80 character per line limit.

Add deeper indentations for conditional statements
for readability.
Add smart pointer support for all of the components
within the resource infrastructure which are used
by both resource matching module and resource-query
code.

Provide better memory management.
In particular,  modify the traverser code
to use shared pointers on a few key member
variables: filtered graph, metadata and match
policy callback members. The traverser
does not solely "own" these objects (in fact they
are being passed from outside), and this means
using raw pointers requires a careful
manual coordination between the owners
of these objects.
Sharing the ownership of these objects using
std::shared_ptr should be safer and far more elegant.

Include the following specific changes. (Note that
the change span many files because of dependence
of the requisite changes.)

In the converted factor classes (DFU match policy
factory class and match writer policy factory
class), modify their create methods to return
a shared_ptr object and add catch clauses
to catch bad_alloc exceptions.

Add a catch-all catch clause at the end of
mod_main of the resource match module as
std::make_shared essentially doesn't
support nothrow, and this means a bad_alloc
exception can bubble up to the broker code
level leading to a broker crash.

Change the roots member of the resource metadata
class to be a std::shared_ptr type so that this can
also be the "shared" with the traverser more easily
as well.

Modify resource match module and resource-query
to use shared_ptr objects for their resource contexts
instead of raw pointers. Adjust their initialization
and finalize routines as well as how a context is
pushed and popped with the broker through
an interface like flux_aux_get ().
For better namespace safety,
remove "using namespace std" and instead explicitly
use std:: prefix for all of the std symbols
as being used in the traverser code.
For better namespace safety,
remove "using namespace std" and instead explicitly
use std:: prefix for all of the std symbols
as being used in resource_match.cpp
Remove "using namespace std" and explicitly use
the "std::" prefix for all of the std symbols being
used in reader codes.
Explicitly use std:: prefix for all of the std symbols
as being used in match_writers.cpp.
Remove "using namespace std" and instead explicitly
use std:: prefix for std symbols in resource-query.cpp
and command.cpp.
@dongahn dongahn force-pushed the dongahn:shared_ptr branch from 73093a0 to fd9207a Nov 16, 2019
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Nov 16, 2019

I will squash those two latest commits and I think this can go in.

@milroy: those commits are squashed. Once Travis CI is happy, this can be merged.

Copy link
Member

SteVwonder left a comment

Other than my one inline comment, this LGTM! Thanks @dongahn.

qmanager/modules/qmanager.cpp Show resolved Hide resolved
@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Nov 17, 2019

Merging since @milroy and I have both reviewed and approved (annoyingly I cannot click the “approve” button on github mobile).

Thanks @dongahn!

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Nov 17, 2019

Thank you!

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Nov 17, 2019

I lied. Apparently I cannot merge a PR on mobile when there is a failing test (coveralls, which we said was ok).

If no one else beats me to it, I’ll click the button tomorrow.

@trws trws merged commit b54ff48 into flux-framework:master Nov 19, 2019
1 of 3 checks passed
1 of 3 checks passed
codecov/patch 68.93% of diff hit (target 76.03%)
Details
codecov/project 75.59% (-0.45%) compared to 8342b7d
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Nov 19, 2019

Thanks @trws and @dongahn!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.