broker: send <service>.disconnect requests on module unload #2913
f5df992 to 31e4c74
Hmm got what looks like a new valgrind error on just one builder, that was not associated with an abnormal broker exit as far as I can tell. I'll restart to see what happens...
Add a definition for FLUX_MSGFLAG_NORESPONSE, matching what was proposed in flux-framework/flux-core#2913, and add some verbiage describing how it may be used to suppress responses.
Hm, this is strange. I'm getting a failure on my test system (with flux-core v0.16 installed)
This seems to be a problem with a few of our tests, at least on my system. We inadvertently linked against
diff --git a/src/common/librouter/Makefile.am b/src/common/librouter/Makefile.am
index fff0b633e..3a50d3e97 100644
--- a/src/common/librouter/Makefile.am
+++ b/src/common/librouter/Makefile.am
@@ -55,12 +55,12 @@ T_LOG_DRIVER = env AM_TAP_AWK='$(AWK)' $(SHELL) \
$(top_srcdir)/config/tap-driver.sh
test_ldadd = \
- $(builddir)/libtestutil.la \
+ $(top_builddir)/src/common/libflux/libflux.la \
+ $(top_builddir)/src/common/libflux-internal.la \
$(top_builddir)/src/common/librouter/librouter.la \
$(top_builddir)/src/common/libtestutil/libtestutil.la \
- $(top_builddir)/src/common/libflux-internal.la \
- $(top_builddir)/src/common/libflux-core.la \
- $(top_builddir)/src/common/libtap/libtap.la
+ $(top_builddir)/src/common/libtap/libtap.la \
+ $(LIBUUID_LIBS)
test_cppflags = \
$(AM_CPPFLAGS) \
Not sure if this is just on my system though -- @garlick, are you able to reproduce the issue? Either way, I'll open an issue and work on a quick fix (at least
I also have an older version of flux installed on my test system ( On the pasted ldd output, would that be expected given that the libtool wrapper script is bypassed there?
Not sure, I would expect there wouldn't need to be a libtool wrapper script for these tests since they are never being installed. I get the same result with
@garlick, are you building with
Still not exactly sure what is causing the difference here.
No.
This is on Ubuntu 18.04 with (as you can see) Flux installed to a
Something about construction of link arguments when using
test_expect_success 'module watcher gets disconnected on module unload' '
	before_watchers=`flux module stats --parse "watchers" kvs-watch` &&
	echo "waiters before loading module: $before_watchers" &&
	flux module load ${TEST_WATCHER} &&
shouldn't there be a check_kvs_watchers call after loading the module, to reduce race possibilities?
Oh I thought I had that covered because the RPC was sent before the reactor was entered, but you're right of course! We don't know if the request has made it to the KVS by the time we check.
Codecov Report
@@            Coverage Diff             @@
##           master    #2913      +/-   ##
==========================================
+ Coverage   81.08%   81.10%   +0.01%
==========================================
  Files         257      257
  Lines       40856    40884      +28
==========================================
+ Hits        33130    33158      +28
  Misses       7726     7726
If that fixup looks good to you @chu11, I'll squash it.
it LGTM
0deb9f9 to 2be8737
Thanks - squashed.
Ok, after getting distracted on that unit test issue, this LGTM now.
Great, I'll set MWP
Problem: RPC test sets FLUX_RPC_NORESPONSE but then expects a response. Don't set this flag in the test.
Problem: when responding to a request, there is no way for the sender to indicate that no response is required. A client can use FLUX_MATCHTAG_NONE to indicate this in an ad-hoc way, but in some cases we skip matchtag allocation even when a response is expected, e.g. when some other key like the jobid is available to distinguish responses without tagpool overhead.
Add a new message flag FLUX_MSGFLAG_NORESPONSE and:
  flux_msg_is_noresponse()
  flux_msg_set_noresponse()
Add coverage to message unit test.
Fixes flux-framework#2912
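To make the flag and its two accessors concrete, here is a minimal self-contained sketch. It does not use flux-core: `struct sketch_msg`, the `sketch_` function names, and the bit value `0x04` are all assumptions made up for illustration, and the real `flux_msg_set_noresponse()` / `flux_msg_is_noresponse()` may differ in signature and error handling.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not the flux-core definitions.
 * The bit value is chosen for illustration only. */
enum { SKETCH_MSGFLAG_NORESPONSE = 0x04 };

struct sketch_msg {
    uint8_t flags;   /* per-message flag bits */
};

/* Mark a request so the receiver knows that no response is wanted. */
int sketch_msg_set_noresponse(struct sketch_msg *msg)
{
    assert(msg != NULL);
    msg->flags |= SKETCH_MSGFLAG_NORESPONSE;
    return 0;
}

/* Test whether the sender asked for the response to be skipped. */
bool sketch_msg_is_noresponse(const struct sketch_msg *msg)
{
    assert(msg != NULL);
    return (msg->flags & SKETCH_MSGFLAG_NORESPONSE) != 0;
}
```

Because the flag travels in the message itself, the receiver can check it without relying on the matchtag, which is the ad-hoc approach the commit message wants to get away from.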
Remove tabs from message.h.
If the caller sets FLUX_RPC_NORESPONSE flag, then set FLUX_MSGFLAG_NORESPONSE in request message.
If flux_respond_*() is called on a request that has FLUX_MSGFLAG_NORESPONSE set, quietly suppress the response and return success. This allows the sender of a request to be in control of whether a response is sent.
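The interaction between the RPC flag and the respond path might look roughly like this. It is a toy model, not flux-core code: `sketch_rpc_request()`, `sketch_respond()`, both structs, and the flag values are hypothetical, and "sending" a response is reduced to bumping a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values, for illustration only. */
enum { SKETCH_RPC_NORESPONSE = 1 };        /* flag passed by the RPC caller */
enum { SKETCH_MSGFLAG_NORESPONSE = 0x04 }; /* flag carried in the request */

struct sketch_msg {
    int flags;
};

struct sketch_handle {
    int responses_sent;   /* stands in for actually sending a message */
};

/* Sender side: translate the caller's RPC flag into a message flag. */
struct sketch_msg sketch_rpc_request(int rpc_flags)
{
    struct sketch_msg req = { 0 };

    if (rpc_flags & SKETCH_RPC_NORESPONSE)
        req.flags |= SKETCH_MSGFLAG_NORESPONSE;
    return req;
}

/* Service side: respond unless the sender opted out.
 * Suppression is not an error, so both paths return 0. */
int sketch_respond(struct sketch_handle *h, const struct sketch_msg *req)
{
    assert(h != NULL && req != NULL);
    if (req->flags & SKETCH_MSGFLAG_NORESPONSE)
        return 0;
    h->responses_sent++;
    return 0;
}
```

Returning success on the suppressed path means a service can call its respond function unconditionally and does not need its own check for the flag.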
Problem: <service>.disconnect messages are generated for all used services upon disconnect of a client, but if there is no message handler installed for that method, an ENOSYS response may be generated.
Set FLUX_MSGFLAG_NORESPONSE in <service>.disconnect requests so that the automatically generated ENOSYS response is suppressed.
Also: don't arm the disconnect notifier for requests that have the FLUX_MSGFLAG_NORESPONSE flag set, since generally such messages would not create state that would require cleanup on the service end.
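As a rough illustration of the ENOSYS case, the sketch below models a dispatcher that falls back to an automatic error when no handler matches a request topic. Everything here is hypothetical (`sketch_dispatch()`, the handler table, the `noresponse` field); the actual dispatch code in flux-core is organized differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct sketch_request {
    const char *topic;
    bool noresponse;   /* models FLUX_MSGFLAG_NORESPONSE */
};

typedef void (*sketch_handler_f)(const struct sketch_request *req);

struct sketch_handler {
    const char *topic;
    sketch_handler_f cb;
};

/* Example handler that does nothing. */
void sketch_noop_handler(const struct sketch_request *req)
{
    (void)req;
}

/* Returns the error number of the automatically generated error
 * response, or 0 if no automatic response was generated. */
int sketch_dispatch(const struct sketch_handler *tab, size_t n,
                    const struct sketch_request *req)
{
    assert(req != NULL && req->topic != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(tab[i].topic, req->topic) == 0) {
            tab[i].cb(req);
            return 0;
        }
    }
    if (req->noresponse)
        return 0;      /* sender opted out: drop the request quietly */
    return ENOSYS;     /* no handler: automatic error response */
}
```

With a table that only registers `kvs-watch.lookup`, an ordinary request for `kvs-watch.disconnect` yields ENOSYS, while the same topic with `noresponse` set yields nothing.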
Problem: when broker modules are unloaded, they do not generate disconnect messages.
When a connection is dropped to the connector-local module (e.g. client dies or otherwise closes the connection), the connector-local module sends disconnect requests to all services used by that handle. For example, streaming RPCs can be canceled.
Add the same functionality to the broker for module unload. This may come in handy once modules are using streaming RPC services.
Leverage librouter/disconnect.[ch] which creates a hash of disconnect messages, added to each time a request message sent by the module uses a new service. When the hash is destroyed, messages are sent using a callback. Tie hash destruction to destruction of the module_t by the broker.
Since modules can be "routers", only perform this service for requests that have a route hop count of 1, which indicates that they originated from a module, not some client connected to a module.
Fixes flux-framework#2911
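The bookkeeping described here can be sketched as a small set of disconnect topics that is flushed through a callback on teardown. This is not the librouter/disconnect.[ch] API: the `sketch_` names, the fixed-size array used in place of a hash, and the rule that the service name is the topic up to its last `.` are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAX_SERVICES 16
#define SKETCH_MAX_TOPIC    64

typedef void (*sketch_send_f)(const char *topic, void *arg);

struct sketch_disconnect {
    char topics[SKETCH_MAX_SERVICES][SKETCH_MAX_TOPIC]; /* unique entries */
    int count;
    sketch_send_f cb;   /* called once per topic at teardown */
    void *arg;
};

/* Example callback: count the disconnect requests "sent". */
void sketch_count_cb(const char *topic, void *arg)
{
    (void)topic;
    (*(int *)arg)++;
}

void sketch_disconnect_init(struct sketch_disconnect *d,
                            sketch_send_f cb, void *arg)
{
    assert(d != NULL && cb != NULL);
    memset(d, 0, sizeof(*d));
    d->cb = cb;
    d->arg = arg;
}

/* Track the service behind a request topic. Only requests with exactly
 * one route hop are tracked, i.e. those that originated in the module.
 * Returns 0 on success (or if ignored), -1 if the set or buffer is full. */
int sketch_disconnect_arm(struct sketch_disconnect *d,
                          const char *topic, int route_hops)
{
    char buf[SKETCH_MAX_TOPIC];
    const char *dot;
    size_t len;
    int n;

    assert(d != NULL && topic != NULL);
    if (route_hops != 1)
        return 0;   /* relayed for a client of the module: skip */
    dot = strrchr(topic, '.');
    len = dot ? (size_t)(dot - topic) : strlen(topic);
    n = snprintf(buf, sizeof(buf), "%.*s.disconnect", (int)len, topic);
    if (n < 0 || (size_t)n >= sizeof(buf))
        return -1;
    for (int i = 0; i < d->count; i++) {
        if (strcmp(d->topics[i], buf) == 0)
            return 0;   /* service already tracked */
    }
    if (d->count == SKETCH_MAX_SERVICES)
        return -1;
    strcpy(d->topics[d->count++], buf);
    return 0;
}

/* Module unload: send one disconnect per tracked service. */
void sketch_disconnect_fini(struct sketch_disconnect *d)
{
    assert(d != NULL);
    for (int i = 0; i < d->count; i++)
        d->cb(d->topics[i], d->arg);
    d->count = 0;
}
```

In this model a module that sends `kvs-watch.lookup` and `kvs-watch.getroot` with one route hop ends up with a single `kvs-watch.disconnect` entry, and a request with two hops is not tracked at all.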
2be8737 to 256ff9b
This PR solves two long-standing problems:
- Unlike clients that disconnect from connector-local, modules that are unloaded do not cause <service>.disconnect messages to be sent to services used by the module, so modules using some services (for example KVS watch) would cause services to leak state when the modules are unloaded.
- Services that do not implement disconnect methods automatically respond to them with ENOSYS.
A new message flag, FLUX_MSGFLAG_NORESPONSE, is added, and flux_rpc*() now sets this in requests when FLUX_RPC_NORESPONSE is specified. If flux_respond*() is called on a message with this flag, the response is suppressed without error.
Disconnect messages now set this flag, so if a service doesn't implement the disconnect method, an ENOSYS response is not generated.
The librouter/disconnect "class" is leveraged to add disconnect messages at module unload time with minimal new code in the broker.