Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: go through main cleanup path, do not call log_err_exit() #2181

Merged
merged 17 commits into from Jun 13, 2019

Conversation

@chu11
Copy link
Contributor

commented Jun 11, 2019

In #1036, it was discovered that an assert can occur inside zmq b/c the broker calls exit() without doing some proper cleanup.

Generally speaking, I re-worked part of the overlay and broker to not call log_err_exit() and instead go through the cleanup path of the broker.

There are still a lot of log_err_exit() calls I got to remove, but wanted to throw up this PR for now to see if there are any comments.

@garlick

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

This looks like good cleanup. Just for fun, have you tried forcing the broker out through some of those goto cleanup paths to see if the zsys assertions are avoided?

As I review the changes, I keep stifling comments about necessary refactoring. I see you captured some thoughts in #2182 which seems like the right approach, since we may not want to open that can of worms right away, but there are definitely some obvious improvements that should be made at some point.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

This looks like good cleanup. Just for fun, have you tried forcing the broker out through some of those goto cleanup paths to see if the zsys assertions are avoided?

I tested the specific case in #1036 but no other ones. Good idea to hardcode something in to make sure a lot of the other paths work.

@chu11 chu11 force-pushed the chu11:issue1036 branch from 0345616 to 4470240 Jun 11, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

re-pushed with a few more tiny fixes and a few more log_err_exit() removals. This is probably as far as I'm willing to go with this particular PR in terms of cleanup, as many other log_err_exit() calls are a bit deeper in the bowels of the broker.

There is one corner case I'm trying to solve still. During cleanup unload_module_byname() can hang on the call of pthread_join() in module_destroy(). My assumption is that the module thread has not yet started, thus the pthread_join() hangs? Or perhaps the reactor in the thread has not yet started, so the connector-local.shutdown doesn't get to where it should go. Not sure best way to solve this at the moment. Dunno if a pthread_cancel() would be fine.

Edit: Wait a second, by the time unload_module_byname() is called, we've already exited the reactor. So when we call module_stop(), the module.shutdown rpc should never have a way to be sent. So how do the modules exit normally? Now I'm confused.

@garlick

This comment has been minimized.

Copy link
Member

commented Jun 11, 2019

I would think the more likely reason for pthread_join() to hang is the module is blocked on a synchronous RPC or something like that? At the point where you're already out of the broker's reactor loop, it probably is OK to pthread_kill(). What module is hanging like that? I presume rc3 is stuck in a module_unload() or are you past that and hanging while unloading the connector-local module?

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

It's with the connector-local module, although I suppose it occur with any other module as well.
In my test all I did was "goto cleanup" immediately before entering the reactor, so basically all setup/initialization is done, but I just skip the reactor loop to see if the cleanup path worked.

I suppose we could do a pthread_tryjoin_np() before doing a kill, but that's non-portable of course.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

Now that I think about it, calling pthread_kill() sounds like a good idea. If we're already in module_destroy() is actually seems like a bug if it can hang. It should just do whatever it can do cleanup.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

Added.

diff --git a/src/broker/module.c b/src/broker/module.c
index 01b83cb..56d369c 100644
--- a/src/broker/module.c
+++ b/src/broker/module.c
@@ -315,6 +315,13 @@ static void module_destroy (module_t *p)
     void *res;
 
     if (p->t) {
+        if ((e = pthread_kill (p->t, SIGTERM)) != 0) {
+            if (e != ESRCH)
+                log_errn (e, "pthread_kill");
+        }
+        else
+            log_msg ("Killed module %s via SIGTERM", p->name);
         if ((e = pthread_join (p->t, &res)) != 0)
             log_errn_exit (e, "pthread_cancel");
         if (res == PTHREAD_CANCELED)

And it works in the general sense, but I've seen

lt-flux-broker: Killed module job-ingest via SIGTERM
lt-flux-broker: Killed module userdb via SIGTERM
lt-flux-broker: Killed module aggregator via SIGTERM
lt-flux-broker: Killed module kvs via SIGTERM
lt-flux-broker: Killed module content-sqlite via SIGTERM
lt-flux-broker: Killed module connector-local via SIGTERM

it's pretty racy/random. Which is probably not want we want.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

I had the bright idea to only do pthread_kill() when the module state is FLUX_MODSTATE_INIT, i.e. only send the signal when the reactor began. But pthread_join() still hung. hmmm

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2019

ok, just re-pushed, I think if we pthread_cancel() the thread if the module is in the INIT state, I think this is safe and works out. Basically, only canceling when we have confidence the module basically never "got going".

@chu11 chu11 changed the title [WIP] broker: go through main cleanup path, do not call log_err_exit() broker: go through main cleanup path, do not call log_err_exit() Jun 12, 2019
@garlick

This comment has been minimized.

Copy link
Member

commented Jun 12, 2019

This seems like it closes a loose end. Nice!

Are you done here @chu11?

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2019

yup

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2019

wait, actually, i could combine some other cleanups into this PR, one sec

@chu11 chu11 force-pushed the chu11:issue1036 branch from 873721f to 77a7534 Jun 12, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2019

well, this PR is getting a tad big, could maybe be split up if you think it best.

Basically, removed more log_err_exit() situations to force the broker to go through the cleanup path.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 12, 2019

diff patch is unsurprisingly bad, as many log_err_exit()s were replaced with two lines which can't be covered.

@garlick

This comment has been minimized.

Copy link
Member

commented Jun 13, 2019

I'm fine with this all being in one PR. I quickly skimmed through the new changes and it all seems fine. LMK when you're ready (rebase needed).

chu11 added 11 commits Jun 10, 2019
Rename to heartbeat_register_attrs() for consistency to rest of
broker (i.e. hello_register_attrs(), content_cache_register_attrs(), etc.).
In module_thread() some errors would call log_err_exit() while
other errors would "goto done".  Consistently "goto done" on errors
throughout the function.
Previously bind_child() would call exit() on any any errors, instead
return -1 on error.  Have callers check for errors as well.
Error message did not indicate the function that had the error.
Generally speaking, instead of calling log_err_exit() in broker.c,
go through the full cleanup path of the broker.  This allows the
broker to properly clean itself up on error situations.

By doing so, it avoids a race with a zactor thread and its atexit()
call which can lead to an assert (#1036).

Alter logic of exit_rc setting accordingly.

Move "cleaning up" log message accordingly.

Fixes #1036
Check errno before calling error message output to reduce unnecessary
output.
chu11 added 6 commits Jun 12, 2019
In broker_handle_signals(), do not call oom() or log_err_exit(),
return to caller and go through cleanup path.
Do not call log_err_exit() in broker_add_services().  Return error
to caller and go through broker cleanup path.
Have overlay_set_flux() return error instead of calling log_err_exit()
on error.  Adjust callers accordingly.
Have shutdown_set_flux() return error instead of calling log_err_exit()
on error.  Adjust callers accordingly.
Do not call log_err_exit() or oom() in module_add(), cleanup
and return error appropriately.  Adjust module_destroy() to be
able to take a NULL pointer.
If a module never leaves the INIT state, assume that we've
never entered the reactor and that a pthread_join() has the
potential to hang.  In order to safely call pthread_join(),
call pthread_cancel() on the thread beforehand.
@chu11 chu11 force-pushed the chu11:issue1036 branch from 77a7534 to fb9e12f Jun 13, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Jun 13, 2019

There's a few cleanups that I'll leave for a different PR since they are orthogonal to this one. Re-push and rebased.

@codecov-io

This comment has been minimized.

Copy link

commented Jun 13, 2019

Codecov Report

Merging #2181 into master will decrease coverage by 0.22%.
The diff coverage is 42.6%.

@@            Coverage Diff             @@
##           master    #2181      +/-   ##
==========================================
- Coverage    81.1%   80.88%   -0.23%     
==========================================
  Files         199      199              
  Lines       31688    31774      +86     
==========================================
- Hits        25702    25700       -2     
- Misses       5986     6074      +88
Impacted Files Coverage Δ
src/broker/heartbeat.c 83.09% <100%> (ø) ⬆️
src/broker/exec.c 66.66% <100%> (+1.28%) ⬆️
src/broker/modservice.c 79.8% <100%> (-0.97%) ⬇️
src/broker/module.c 78.42% <30.55%> (-3.23%) ⬇️
src/broker/shutdown.c 82.05% <33.33%> (-3.29%) ⬇️
src/broker/broker.c 74.07% <44.02%> (-3.5%) ⬇️
src/broker/overlay.c 82.35% <47.82%> (-1.19%) ⬇️
src/modules/job-ingest/worker.c 65.03% <0%> (-7.7%) ⬇️
src/common/libutil/veb.c 95.42% <0%> (-3.43%) ⬇️
src/modules/barrier/barrier.c 78.52% <0%> (-2.02%) ⬇️
... and 6 more
@garlick

This comment has been minimized.

Copy link
Member

commented Jun 13, 2019

Thanks! Good cleanup.

@garlick garlick merged commit 1d2094b into flux-framework:master Jun 13, 2019
2 of 4 checks passed
2 of 4 checks passed
codecov/patch 42.6% of diff hit (target 81.1%)
Details
codecov/project 80.88% (-0.23%) compared to 5ddfd23
Details
Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.