
broker: send <service>.disconnect requests on module unload #2913

Merged
8 commits merged into flux-framework:master from module_disconnect on Apr 28, 2020

Conversation


@garlick garlick commented Apr 25, 2020

This PR solves two long standing problems:

  1. Unlike clients of connector-local, a module that is unloaded does not cause <service>.disconnect messages to be sent to the services it used, so services such as KVS watch leak state when the module is unloaded.
  2. There is no way for the sender of a request to tell the recipient that a response should not be sent. As a consequence, services that don't implement disconnect methods automatically respond to them with ENOSYS.

A new message flag, FLUX_MSGFLAG_NORESPONSE, is added, and flux_rpc*() now sets it in requests when FLUX_RPC_NORESPONSE is specified. If flux_respond*() is called on a message with this flag, the response is suppressed without error.

Disconnect messages now set this flag, so if a service doesn't implement the disconnect method, no ENOSYS response is generated.

The librouter/disconnect "class" is leveraged to add disconnect messages at module unload time with minimal new code in the broker.
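
To make the client side concrete, here is a minimal sketch (not code from this PR) of a fire-and-forget request, assuming the public flux_rpc() API from libflux-core; the "example.notify" topic and notify() wrapper are hypothetical:

#include <flux/core.h>

/* Send a request for which no response is wanted.  With this PR,
 * FLUX_RPC_NORESPONSE also sets FLUX_MSGFLAG_NORESPONSE in the
 * outgoing request, so the service's flux_respond() becomes a no-op
 * rather than generating a response that the client would ignore.
 */
int notify (flux_t *h)
{
    flux_future_t *f;

    if (!(f = flux_rpc (h, "example.notify", NULL, FLUX_NODEID_ANY,
                        FLUX_RPC_NORESPONSE)))
        return -1;
    flux_future_destroy (f);    /* no response will ever arrive */
    return 0;
}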

@garlick garlick force-pushed the module_disconnect branch 2 times, most recently from f5df992 to 31e4c74 on April 25, 2020 at 14:21

garlick commented Apr 25, 2020

Hmm, got what looks like a new valgrind error on just one builder that was not associated with an abnormal broker exit, as far as I can tell. I'll restart to see what happens...

==25765== 
==25765== HEAP SUMMARY:
==25765==     in use at exit: 21,053 bytes in 50 blocks
==25765==   total heap usage: 325,622 allocs, 325,572 frees, 194,713,629 bytes allocated
==25765== 
==25765== 336 bytes in 1 blocks are possibly lost in loss record 5 of 11
==25765==    at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25765==    by 0x40134A6: allocate_dtv (dl-tls.c:286)
==25765==    by 0x40134A6: _dl_allocate_tls (dl-tls.c:530)
==25765==    by 0x59D2227: allocate_stack (allocatestack.c:627)
==25765==    by 0x59D2227: pthread_create@@GLIBC_2.2.5 (pthread_create.c:644)
==25765==    by 0x52CECD4: zactor_new (in /usr/lib/x86_64-linux-gnu/libczmq.so.4.1.0)
==25765==    by 0x1275E6: zsecurity_comms_init (zsecurity.c:166)
==25765==    by 0x11778C: overlay_sec_init (overlay.c:461)
==25765==    by 0x11778C: overlay_connect (overlay.c:478)
==25765==    by 0x1113A0: main (broker.c:618)
==25765== 
[snip]
not ok 1 - valgrind reports no new errors on 2 broker run

garlick added a commit to garlick/flux-rfc that referenced this pull request Apr 26, 2020
Add a definition for FLUX_MSGFLAG_NORESPONSE, matching what
was proposed in flux-framework/flux-core#2913, and add some
verbiage describing how it may be used to suppress responses.

grondo commented Apr 27, 2020

Hm, this is strange. I'm getting a failure on my test system (with flux-core v0.16 installed)

$ ./test_disconnect.t
/home/grondo/git/flux-core.git/src/common/librouter/.libs/test_disconnect.t: symbol lookup error: /home/grondo/git/flux-core.git/src/common/librouter/.libs/test_disconnect.t: undefined symbol: flux_msg_is_noresponse

flux_msg_is_noresponse() is definitely in libflux in my build directory, however (and this is disturbing):

grondo@asp:~/git/flux-core.git/src/common/librouter$ ldd .libs/test_disconnect.t  | grep libflux
	libflux-core.so.2 => /usr/lib/libflux-core.so.2 (0x00007fb6850ed000)
	libflux-security.so.1 => /usr/lib/libflux-security.so.1 (0x00007fb6841bf000)

This seems to be a problem with a few of our tests, at least on my system. We inadvertently linked against libflux-core.la instead of libflux/libflux.la and libflux-internal.la. In all honesty, I've forgotten how this was all supposed to work, but this fixes the test builds for me:

diff --git a/src/common/librouter/Makefile.am b/src/common/librouter/Makefile.am
index fff0b633e..3a50d3e97 100644
--- a/src/common/librouter/Makefile.am
+++ b/src/common/librouter/Makefile.am
@@ -55,12 +55,12 @@ T_LOG_DRIVER = env AM_TAP_AWK='$(AWK)' $(SHELL) \
         $(top_srcdir)/config/tap-driver.sh

 test_ldadd = \
-       $(builddir)/libtestutil.la \
+       $(top_builddir)/src/common/libflux/libflux.la \
+        $(top_builddir)/src/common/libflux-internal.la \
         $(top_builddir)/src/common/librouter/librouter.la \
         $(top_builddir)/src/common/libtestutil/libtestutil.la \
-        $(top_builddir)/src/common/libflux-internal.la \
-        $(top_builddir)/src/common/libflux-core.la \
-        $(top_builddir)/src/common/libtap/libtap.la
+        $(top_builddir)/src/common/libtap/libtap.la \
+       $(LIBUUID_LIBS)

 test_cppflags = \
         $(AM_CPPFLAGS) \

Not sure if this is just on my system though -- @garlick, are you able to reproduce the issue?

Either way, I'll open an issue and work on a quick fix (at least librouter and libterminus are affected).


garlick commented Apr 27, 2020

I also have an older version of flux installed on my test system (0.16.0-222-g434614446) but I'm not seeing this.

On the pasted ldd output, would that be expected given that the libtool wrapper script is bypassed there?


grondo commented Apr 27, 2020

Not sure. I would expect there to be no need for a libtool wrapper script for these tests, since they are never installed. I get the same result with libtool e ldd test_disconnect.t


grondo commented Apr 27, 2020

@garlick, are you building with --with-flux-security? When I build without this option, the issue goes away:

$ libtool e ldd ./test_disconnect.t | grep libflux
	libflux-core.so.2 => /home/grondo/git/flux-core.git/src/common/.libs/libflux-core.so.2 (0x00007f157585d000)

Still not exactly sure what is causing the difference here.


garlick commented Apr 27, 2020

@garlick, are you building with --with-flux-security?

No.

$ libtool e ldd ./test_disconnect.t | grep libflux
        libflux-core.so.2 => /home/garlick/proj/flux-core/src/common/.libs/libflux-core.so.2 (0x00007fa6add62000)
$ ldd .libs/test_disconnect.t | grep libflux
        libflux-core.so.2 => /usr/local/lib/libflux-core.so.2 (0x00007ff3491b1000)

This is on Ubuntu 18.04 with (as you can see) Flux installed to a /usr/local prefix...


grondo commented Apr 27, 2020

Something about the construction of link arguments when using --with-flux-security is causing this problem. I'm not sure I want to fully investigate exactly why, but we should be building our unit tests in such a way that they don't end up linked against the system libflux-core. I'll continue this in an issue.

test_expect_success 'module watcher gets disconnected on module unload' '
before_watchers=`flux module stats --parse "watchers" kvs-watch` &&
echo "waiters before loading module: $before_watchers" &&
flux module load ${TEST_WATCHER} &&

Member

shouldn't there be a check_kvs_watchers call after loading the module, to reduce race possibilities?

Member Author

Oh, I thought I had that covered because the RPC was sent before the reactor was entered, but you're right of course! We don't know if the request has made it to the KVS by the time we check.

@codecov-io

Codecov Report

Merging #2913 into master will increase coverage by 0.01%.
The diff coverage is 88.57%.

@@            Coverage Diff             @@
##           master    #2913      +/-   ##
==========================================
+ Coverage   81.08%   81.10%   +0.01%     
==========================================
  Files         257      257              
  Lines       40856    40884      +28     
==========================================
+ Hits        33130    33158      +28     
  Misses       7726     7726              
Impacted Files Coverage Δ
src/common/librouter/disconnect.c 85.91% <50.00%> (-0.85%) ⬇️
src/broker/broker.c 74.58% <83.33%> (+0.13%) ⬆️
src/broker/module.c 74.94% <100.00%> (+0.34%) ⬆️
src/common/libflux/message.c 83.10% <100.00%> (+0.43%) ⬆️
src/common/libflux/response.c 88.33% <100.00%> (+0.26%) ⬆️
src/common/libflux/rpc.c 93.71% <100.00%> (+0.03%) ⬆️
src/modules/job-info/watch.c 70.98% <0.00%> (-1.56%) ⬇️
... and 2 more


garlick commented Apr 27, 2020

If that fixup looks good to you @chu11, I'll squash it.


chu11 commented Apr 27, 2020

it LGTM


garlick commented Apr 28, 2020

Thanks - squashed.

@grondo grondo left a comment

Ok, after getting distracted on that unit test issue, this LGTM now.


garlick commented Apr 28, 2020

Great, I'll set MWP

Problem: RPC test sets FLUX_RPC_NORESPONSE but then expects a
response.

Don't set this flag in the test.

Problem: when responding to a request, there is no
way for the sender to indicate that no response is
required.

A client can use FLUX_MATCHTAG_NONE to indicate this
in an ad hoc way, but in some cases we skip matchtag
allocation even when a response is expected, e.g. when
some other key like the jobid is available to distinguish
responses without tagpool overhead.

Add a new message flag FLUX_MSGFLAG_NORESPONSE and:
  flux_msg_is_noresponse()
  flux_msg_set_noresponse()

Add coverage to message unit test.

Fixes flux-framework#2912
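
As a rough illustration of the added unit test coverage (a sketch in flux-core's libtap style, not the actual test code, assuming the usual 0/-1 and bool return conventions for the two new accessors named above):

#include <stdbool.h>
#include <flux/core.h>
#include "src/common/libtap/tap.h"

int main (int argc, char *argv[])
{
    flux_msg_t *msg;

    plan (NO_PLAN);

    if (!(msg = flux_msg_create (FLUX_MSGTYPE_REQUEST)))
        BAIL_OUT ("flux_msg_create failed");

    ok (flux_msg_is_noresponse (msg) == false,
        "new request does not have NORESPONSE flag");
    ok (flux_msg_set_noresponse (msg) == 0,
        "flux_msg_set_noresponse works");
    ok (flux_msg_is_noresponse (msg) == true,
        "flag is now set");

    flux_msg_destroy (msg);
    done_testing ();
    return 0;
}
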
Remove tabs from message.h.

If the caller sets the FLUX_RPC_NORESPONSE flag, then set
FLUX_MSGFLAG_NORESPONSE in the request message.

If flux_respond_*() is called on a request that has
FLUX_MSGFLAG_NORESPONSE set, quietly suppress the response
and return success.

This allows the sender of a request to be in control
of whether a response is sent.
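
On the service side, the practical effect is that a handler can respond unconditionally and let the message layer decide. A hedged sketch (the handler and topic names are made up):

/* Generic request handler.  If the request was sent with
 * FLUX_RPC_NORESPONSE, it carries FLUX_MSGFLAG_NORESPONSE and
 * flux_respond() silently drops the response, returning success.
 */
static void notify_cb (flux_t *h, flux_msg_handler_t *mh,
                       const flux_msg_t *msg, void *arg)
{
    /* ... do the work the request asked for ... */

    if (flux_respond (h, msg, NULL) < 0)
        flux_log_error (h, "example.notify: flux_respond");
}
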
Problem: <service>.disconnect messages are generated for
all used services upon disconnect of a client, but if
there is no message handler installed for that method,
an ENOSYS response may be generated.

Set FLUX_MSGFLAG_NORESPONSE in <service>.disconnect requests
so that the automatically generated ENOSYS response is suppressed.

Also: don't arm the disconnect notifier for requests that
have the FLUX_MSGFLAG_NORESPONSE flag set, since generally
such messages would not create state that would require
cleanup on the service end.
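
A service that does keep per-client state still registers its own disconnect method as before; only the ENOSYS behavior for services without one changes. A hedged sketch of such a registration, with hypothetical topic and callback names, assuming the flux_msg_handler_addvec() message handler API:

static void disconnect_cb (flux_t *h, flux_msg_handler_t *mh,
                           const flux_msg_t *msg, void *arg)
{
    /* Drop any state associated with the sender of 'msg'.
     * The request arrives with FLUX_MSGFLAG_NORESPONSE set,
     * so no response should be (or will be) sent.
     */
}

static const struct flux_msg_handler_spec htab[] = {
    { FLUX_MSGTYPE_REQUEST, "example.disconnect", disconnect_cb, 0 },
    FLUX_MSGHANDLER_TABLE_END,
};

/* registered with: flux_msg_handler_addvec (h, htab, NULL, &handlers) */
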
Problem: when broker modules are unloaded, they do not
generate disconnect messages.

When a connection is dropped to the connector-local module
(e.g. client dies or otherwise closes the connection),
the connector-local module sends disconnect requests
to all services used by that handle. For example, streaming RPCs
can be canceled.

Add the same functionality to the broker for module unload.
This may come in handy once modules are using streaming RPC
services.  Leverage librouter/disconnect.[ch], which maintains a
hash of disconnect messages, adding an entry each time a request
message sent by the module uses a new service.  When the hash is
destroyed, the messages are sent via a callback.  Tie hash
destruction to destruction of the module_t by the broker.

Since modules can themselves act as routers, only generate these
disconnects for requests that have a route hop count of 1, which
indicates that they originated from the module itself, not from
some client connected to the module.

Fixes flux-framework#2911
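
A rough sketch of the hop-count check described above (an illustration, not the merged broker code; it assumes flux_msg_get_route_count() returns the number of route hops, and the disconnect_arm() helper and module_t fields are assumptions about the librouter/disconnect interface):

/* Called for each request message sent by a module.  Requests with
 * exactly one route hop originated from the module itself; more hops
 * mean the module is routing on behalf of some other client, which
 * handles its own disconnects.
 */
static void module_track_service (module_t *p, const flux_msg_t *msg)
{
    if (flux_msg_get_route_count (msg) != 1)
        return;
    if (disconnect_arm (p->dcon, msg) < 0)      /* hypothetical helper */
        flux_log_error (p->h, "disconnect_arm");
}
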
@mergify mergify bot merged commit 4d06a43 into flux-framework:master Apr 28, 2020
@garlick garlick deleted the module_disconnect branch April 28, 2020 17:27