
broker: send <service>.disconnect requests on module unload #2913

Merged
8 commits merged into flux-framework:master from module_disconnect on Apr 28, 2020

Conversation


@garlick garlick commented Apr 25, 2020

This PR solves two long standing problems:

  1. Unlike clients of connector-local, a module that is unloaded does not cause <service>.disconnect messages to be sent to the services it used, so services such as KVS watch leak state when the module is unloaded.
  2. There is no way for the sender of a request to tell the recipient that a response should not be sent. As a consequence, services that don't implement disconnect methods automatically respond to them with ENOSYS.

A new message flag, FLUX_MSGFLAG_NORESPONSE, is added, and flux_rpc*() now sets it in requests when FLUX_RPC_NORESPONSE is specified. If flux_respond*() is called on a message with this flag, the response is suppressed without error.

Disconnect messages now set this flag, so if a service doesn't implement the disconnect method, no ENOSYS response is generated.

The librouter/disconnect "class" is leveraged to add disconnect messages at module unload time with minimal new code in the broker.
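
To make the client side concrete, here is a minimal sketch (not code from this PR) of a fire-and-forget request, assuming the public flux_rpc() API from libflux-core; the "example.notify" topic and notify() wrapper are hypothetical:

#include <flux/core.h>

/* Send a request for which no response is wanted.  With this PR,
 * FLUX_RPC_NORESPONSE also sets FLUX_MSGFLAG_NORESPONSE in the
 * outgoing request, so the service's flux_respond() becomes a no-op
 * rather than generating a response that the client would ignore.
 */
int notify (flux_t *h)
{
    flux_future_t *f;

    if (!(f = flux_rpc (h, "example.notify", NULL, FLUX_NODEID_ANY,
                        FLUX_RPC_NORESPONSE)))
        return -1;
    flux_future_destroy (f);    /* no response will ever arrive */
    return 0;
}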

@garlick garlick force-pushed the module_disconnect branch 2 times, most recently from f5df992 to 31e4c74 on April 25, 2020 at 14:21

garlick commented Apr 25, 2020

Hmm, got what looks like a new valgrind error on just one builder that was not associated with an abnormal broker exit, as far as I can tell. I'll restart to see what happens...

==25765== 
==25765== HEAP SUMMARY:
==25765==     in use at exit: 21,053 bytes in 50 blocks
==25765==   total heap usage: 325,622 allocs, 325,572 frees, 194,713,629 bytes allocated
==25765== 
==25765== 336 bytes in 1 blocks are possibly lost in loss record 5 of 11
==25765==    at 0x4C31B25: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==25765==    by 0x40134A6: allocate_dtv (dl-tls.c:286)
==25765==    by 0x40134A6: _dl_allocate_tls (dl-tls.c:530)
==25765==    by 0x59D2227: allocate_stack (allocatestack.c:627)
==25765==    by 0x59D2227: pthread_create@@GLIBC_2.2.5 (pthread_create.c:644)
==25765==    by 0x52CECD4: zactor_new (in /usr/lib/x86_64-linux-gnu/libczmq.so.4.1.0)
==25765==    by 0x1275E6: zsecurity_comms_init (zsecurity.c:166)
==25765==    by 0x11778C: overlay_sec_init (overlay.c:461)
==25765==    by 0x11778C: overlay_connect (overlay.c:478)
==25765==    by 0x1113A0: main (broker.c:618)
==25765== 
[snip]
not ok 1 - valgrind reports no new errors on 2 broker run

garlick added a commit to garlick/flux-rfc that referenced this pull request Apr 26, 2020
Add a definition for FLUX_MSGFLAG_NORESPONSE, matching what
was proposed in flux-framework/flux-core#2913, and add some
verbiage describing how it may be used to suppress responses.

grondo commented Apr 27, 2020

Hm, this is strange. I'm getting a failure on my test system (with flux-core v0.16 installed)

$ ./test_disconnect.t
/home/grondo/git/flux-core.git/src/common/librouter/.libs/test_disconnect.t: symbol lookup error: /home/grondo/git/flux-core.git/src/common/librouter/.libs/test_disconnect.t: undefined symbol: flux_msg_is_noresponse

flux_msg_is_noresponse() is definitely in libflux in my build directory, however (and this is disturbing):

grondo@asp:~/git/flux-core.git/src/common/librouter$ ldd .libs/test_disconnect.t  | grep libflux
	libflux-core.so.2 => /usr/lib/libflux-core.so.2 (0x00007fb6850ed000)
	libflux-security.so.1 => /usr/lib/libflux-security.so.1 (0x00007fb6841bf000)

This seems to be a problem with a few of our tests, at least on my system. We inadvertently linked against libflux-core.la instead of libflux/libflux.la and libflux-internal.la. In all honesty, I've forgotten how this was all supposed to work, but this fixes the test builds for me:

diff --git a/src/common/librouter/Makefile.am b/src/common/librouter/Makefile.am
index fff0b633e..3a50d3e97 100644
--- a/src/common/librouter/Makefile.am
+++ b/src/common/librouter/Makefile.am
@@ -55,12 +55,12 @@ T_LOG_DRIVER = env AM_TAP_AWK='$(AWK)' $(SHELL) \
         $(top_srcdir)/config/tap-driver.sh

 test_ldadd = \
-       $(builddir)/libtestutil.la \
+       $(top_builddir)/src/common/libflux/libflux.la \
+        $(top_builddir)/src/common/libflux-internal.la \
         $(top_builddir)/src/common/librouter/librouter.la \
         $(top_builddir)/src/common/libtestutil/libtestutil.la \
-        $(top_builddir)/src/common/libflux-internal.la \
-        $(top_builddir)/src/common/libflux-core.la \
-        $(top_builddir)/src/common/libtap/libtap.la
+        $(top_builddir)/src/common/libtap/libtap.la \
+       $(LIBUUID_LIBS)

 test_cppflags = \
         $(AM_CPPFLAGS) \

Not sure if this is just on my system though -- @garlick, are you able to reproduce the issue?

Either way, I'll open an issue and work on a quick fix (at least librouter and libterminus are affected).


garlick commented Apr 27, 2020

I also have an older version of flux installed on my test system (0.16.0-222-g434614446) but I'm not seeing this.

On the pasted ldd output, would that be expected given that the libtool wrapper script is bypassed there?


grondo commented Apr 27, 2020

Not sure. I would expect there to be no need for a libtool wrapper script for these tests, since they are never installed. I get the same result with libtool e ldd test_disconnect.t


grondo commented Apr 27, 2020

@garlick, are you building with --with-flux-security? When I build without this option, the issue goes away:

$ libtool e ldd ./test_disconnect.t | grep libflux
	libflux-core.so.2 => /home/grondo/git/flux-core.git/src/common/.libs/libflux-core.so.2 (0x00007f157585d000)

Still not exactly sure what is causing the difference here.


garlick commented Apr 27, 2020

@garlick, are you building with --with-flux-security?

No.

$ libtool e ldd ./test_disconnect.t | grep libflux
        libflux-core.so.2 => /home/garlick/proj/flux-core/src/common/.libs/libflux-core.so.2 (0x00007fa6add62000)
$ ldd .libs/test_disconnect.t | grep libflux
        libflux-core.so.2 => /usr/local/lib/libflux-core.so.2 (0x00007ff3491b1000)

This is on Ubuntu 18.04 with (as you can see) Flux installed to a /usr/local prefix...


grondo commented Apr 27, 2020

Something about the construction of link arguments when using --with-flux-security is causing this problem. I'm not sure I want to fully investigate exactly why, but we should be building our unit tests in such a way that they don't end up linked against the system libflux-core. I'll continue this in an issue.

test_expect_success 'module watcher gets disconnected on module unload' '
before_watchers=`flux module stats --parse "watchers" kvs-watch` &&
echo "waiters before loading module: $before_watchers" &&
flux module load ${TEST_WATCHER} &&

Member

shouldn't there be a check_kvs_watchers call after loading the module, to reduce race possibilities?

Member Author

Oh, I thought I had that covered because the RPC was sent before the reactor was entered, but you're right of course! We don't know if the request has made it to the KVS by the time we check.

@codecov-io

Codecov Report

Merging #2913 into master will increase coverage by 0.01%.
The diff coverage is 88.57%.

@@            Coverage Diff             @@
##           master    #2913      +/-   ##
==========================================
+ Coverage   81.08%   81.10%   +0.01%     
==========================================
  Files         257      257              
  Lines       40856    40884      +28     
==========================================
+ Hits        33130    33158      +28     
  Misses       7726     7726              
Impacted Files Coverage Δ
src/common/librouter/disconnect.c 85.91% <50.00%> (-0.85%) ⬇️
src/broker/broker.c 74.58% <83.33%> (+0.13%) ⬆️
src/broker/module.c 74.94% <100.00%> (+0.34%) ⬆️
src/common/libflux/message.c 83.10% <100.00%> (+0.43%) ⬆️
src/common/libflux/response.c 88.33% <100.00%> (+0.26%) ⬆️
src/common/libflux/rpc.c 93.71% <100.00%> (+0.03%) ⬆️
src/modules/job-info/watch.c 70.98% <0.00%> (-1.56%) ⬇️
... and 2 more


garlick commented Apr 27, 2020

If that fixup looks good to you @chu11, I'll squash it.


chu11 commented Apr 27, 2020

it LGTM


garlick commented Apr 28, 2020

Thanks - squashed.

@grondo grondo left a comment

Ok, after getting distracted on that unit test issue, this LGTM now.


garlick commented Apr 28, 2020

Great, I'll set MWP

Problem: RPC test sets FLUX_RPC_NORESPONSE but then expects a
response.

Don't set this flag in the test.

Problem: when responding to a request, there is no
way for the sender to indicate that no response is
required.

A client can use FLUX_MATCHTAG_NONE to indicate this
in an ad hoc way, but in some cases we skip matchtag
allocation even when a response is expected, e.g. when
some other key like the jobid is available to distinguish
responses without tagpool overhead.

Add a new message flag FLUX_MSGFLAG_NORESPONSE and:
  flux_msg_is_noresponse()
  flux_msg_set_noresponse()

Add coverage to message unit test.

Fixes flux-framework#2912
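
As a rough illustration of the added unit test coverage (a sketch in flux-core's libtap style, not the actual test code, assuming the usual 0/-1 and bool return conventions for the two new accessors named above):

#include <stdbool.h>
#include <flux/core.h>
#include "src/common/libtap/tap.h"

int main (int argc, char *argv[])
{
    flux_msg_t *msg;

    plan (NO_PLAN);

    if (!(msg = flux_msg_create (FLUX_MSGTYPE_REQUEST)))
        BAIL_OUT ("flux_msg_create failed");

    ok (flux_msg_is_noresponse (msg) == false,
        "new request does not have NORESPONSE flag");
    ok (flux_msg_set_noresponse (msg) == 0,
        "flux_msg_set_noresponse works");
    ok (flux_msg_is_noresponse (msg) == true,
        "flag is now set");

    flux_msg_destroy (msg);
    done_testing ();
    return 0;
}
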
Remove tabs from message.h.

If the caller sets the FLUX_RPC_NORESPONSE flag, then set
FLUX_MSGFLAG_NORESPONSE in the request message.

If flux_respond_*() is called on a request that has
FLUX_MSGFLAG_NORESPONSE set, quietly suppress the response
and return success.

This allows the sender of a request to be in control
of whether a response is sent.
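
On the service side, the practical effect is that a handler can respond unconditionally and let the message layer decide. A hedged sketch (the handler and topic names are made up):

/* Generic request handler.  If the request was sent with
 * FLUX_RPC_NORESPONSE, it carries FLUX_MSGFLAG_NORESPONSE and
 * flux_respond() silently drops the response, returning success.
 */
static void notify_cb (flux_t *h, flux_msg_handler_t *mh,
                       const flux_msg_t *msg, void *arg)
{
    /* ... do the work the request asked for ... */

    if (flux_respond (h, msg, NULL) < 0)
        flux_log_error (h, "example.notify: flux_respond");
}
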
Problem: <service>.disconnect messages are generated for
all used services upon disconnect of a client, but if
there is no message handler installed for that method,
an ENOSYS response may be generated.

Set FLUX_MSGFLAG_NORESPONSE in <service>.disconnect requests
so that the automatically generated ENOSYS response is suppressed.

Also: don't arm the disconnect notifier for requests that
have the FLUX_MSGFLAG_NORESPONSE flag set, since generally
such messages would not create state that would require
cleanup on the service end.
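
A service that does keep per-client state still registers its own disconnect method as before; only the ENOSYS behavior for services without one changes. A hedged sketch of such a registration, with hypothetical topic and callback names, assuming the flux_msg_handler_addvec() message handler API:

static void disconnect_cb (flux_t *h, flux_msg_handler_t *mh,
                           const flux_msg_t *msg, void *arg)
{
    /* Drop any state associated with the sender of 'msg'.
     * The request arrives with FLUX_MSGFLAG_NORESPONSE set,
     * so no response should be (or will be) sent.
     */
}

static const struct flux_msg_handler_spec htab[] = {
    { FLUX_MSGTYPE_REQUEST, "example.disconnect", disconnect_cb, 0 },
    FLUX_MSGHANDLER_TABLE_END,
};

/* registered with: flux_msg_handler_addvec (h, htab, NULL, &handlers) */
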
Problem: when broker modules are unloaded, they do not
generate disconnect messages.

When a connection is dropped to the connector-local module
(e.g. client dies or otherwise closes the connection),
the connector-local module sends disconnect requests
to all services used by that handle. For example, streaming RPCs
can be canceled.

Add the same functionality to the broker for module unload.
This may come in handy once modules are using streaming RPC
services.  Leverage librouter/disconnect.[ch], which maintains a
hash of disconnect messages, adding an entry each time a request
message sent by the module uses a new service.  When the hash is
destroyed, the messages are sent via a callback.  Tie hash
destruction to destruction of the module_t by the broker.

Since modules can themselves act as routers, only generate these
disconnects for requests that have a route hop count of 1, which
indicates that they originated from the module itself, not from
some client connected to the module.

Fixes flux-framework#2911
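
A rough sketch of the hop-count check described above (an illustration, not the merged broker code; it assumes flux_msg_get_route_count() returns the number of route hops, and the disconnect_arm() helper and module_t fields are assumptions about the librouter/disconnect interface):

/* Called for each request message sent by a module.  Requests with
 * exactly one route hop originated from the module itself; more hops
 * mean the module is routing on behalf of some other client, which
 * handles its own disconnects.
 */
static void module_track_service (module_t *p, const flux_msg_t *msg)
{
    if (flux_msg_get_route_count (msg) != 1)
        return;
    if (disconnect_arm (p->dcon, msg) < 0)      /* hypothetical helper */
        flux_log_error (p->h, "disconnect_arm");
}
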
@mergify mergify bot merged commit 4d06a43 into flux-framework:master Apr 28, 2020
@garlick garlick deleted the module_disconnect branch April 28, 2020 17:27