rework local connector to be fully nonblocking + mod_main() prototype fix #289

garlick · 2015-07-23T16:27:51Z

In this PR the "api" module is renamed to "local-connector", is updated to use new reactor/message interfaces, and is restructured to implement non-blocking management of client (unix domain socket) file descriptors. In addition, the connector itself now honors FLUX_O_NONBLOCK if specified for its methods backing flux_send() and flux_recv(). This fixes #83.

flux_msg_t now has some new functions with unit tests and man pages:

flux_msg_encode() to be used instead of zmsg_encode()
flux_msg_decode() to be used instead of zmsg_decode()
flux_msg_sendfd() to be used instead of zfd_send()
flux_msg_recvfd() to be used instead of zfd_recv()

The zfd functions, now removed, were a hold out user of zmsg_t, and were designed for blocking operation and hacked to be "a little bit non-blocking". Gone!

A sharness test t0007-ping.t was added along with some enhancements to the flux-ping program to move messages through the connector in both directions to exercise the non-blocking client management.

Finally, an unrelated change is mod_main() in all in-tree comms modules was updated to use argc, argv style arguments rather than a zhash_t as specified in RFC 5. (This will require a change to out of tree modules like those in flux-sched; I will submit a PR to flux-sched)

grondo · 2015-07-23T17:05:26Z

Oh, great improvement! Do the flood ping tests actually end up being good verification that the local-connector module doesn't ever block? Is there a test where you try doing "bad stuff" in the client and ensure you don't block the broker plugin?

I feel a bit bad about this next comment, but...

This probably isn't worth holding up this PR, but at the beginning of a new naming convention (as in *-connector here), it might be good to have a bit of group discussion. In this case we'll probably want all future modules of a certain class to follow a similar convention, so we should decide now what that might look like.

My input would be that in a naming schema, the most general name should come first. It isn't that important, but it does have some nice side effects (like in lsmod | sort all connectors-* would show up together) It also follows naturally that "connector" prefix is the container for all connectors -- or "connector" is the class name, etc. This same convention could apply across other classes of modules/plugins.

This most general name first convention seems to be pretty standard across various projects (think Lua/Python package names, Go packages, etc. With Go you can even have qualifiers including a domain name, e.g.

import {
   "golang.org/x/net/context"
}

It would be neat if you could (eventually) load broker modules in this style, e.g. if both LLNL and CEA had a scheduler you could refer to them as cea.fr/sched and llnl.gov/sched.

Sorry to co-opt this PR for my diatribe, but the new convention for naming connectors got me started...

garlick · 2015-07-23T17:41:06Z

The flood ping tests are sort of a first order verification that we aren't losing messages while exercising the non-blocking reactor logic. The flux ping --batch mode where responses are not processed until all the requests have been sent is bad behavior of a sort that would have deadlocked before. I.e. a blocked write(2) in the api module would have prevented the api module from read(2)ing all of the requests. I did verify (by adding some debug logging) that the flood ping "workload" does cause a lot of EWOULDBLOCKs in both directions. More tests could be added. I would lean towards covering flux_send() and flux_recv() in nonblocking mode next; although that might be best added in conjunction with a reactor "handle watcher".

Regarding naming - the other hierarchy we might consider would reflect modules loading modules. For example, if "sched" loads a "backfill" module, a (future) recursive flux module list might show that as "sched.backfill". RFC 5 was headed in this direction, though I think there may be a few details to iron out. RFC 5 is probably the place to put down any naming conventions we agree on. I am fine with a general-to-specific hyphenated topical convention in parallel to the module loading one. Should I tentatively rename "local-connector" to "connector-local"?

grondo · 2015-07-23T17:46:17Z

Should I tentatively rename "local-connector" to "connector-local"?

Well I'd prefer that slightly, however it seems like a lot of work to make that change.
If however this convention will propagate to other modules it might be nice to make the change now.
If you think I'm being unreasonable I'm fine keeping it as is and delaying any name change until such time that it makes sense.

garlick · 2015-07-23T17:48:28Z

No it seems reasonable and it won't be a lot of work. I'll go ahead and do that.

RFC 5 defines the mod_main() function prototype to be int mod_main (void *context, int argc, char **argv); We were still mod-transition from zhash_t arguments. Complete the transition by: - broker: call mod_main() with RFC 5 arguments - kvs module: parse key=val arguments - pymod module: convert internally to zhash_t - connector-local module: sockpath=path converted to fixed argv[0] argument. There are currently no users of this. - other in-tree modules: update mod_main() prototype, but no arguments are handled

These are simply flux_msg_t alternatives to zmsg_encode() and zmsg_decode().

Add replacements for zfd_send(), zfd_recv() that use flux_msg_t and handle non-blocking file descriptors properly: int flux_msg_sendfd (int fd, const flux_msg_t *msg, struct flux_msg_iobuf *iobuf); flux_msg_t *flux_msg_recvfd (int fd, struct flux_msg_iobuf *iobuf); On EOF, recvfd returns NULL with errno set to EPROTO. If iobuf is non-NULL, and EWOULDBLOCK/EAGAIN is returned, you may call the function again with identical arguments to continue reading/writing. The 'iobuf' may be NULL if the file descriptor is in blocking mode. If iobuf is NULL and an EWOULDBLOCK/EAGAIN is encountered, this is internally converted to an EPROTO error, after which the channel is in an undefined state. iobuf should be initialized with flux_msg_iobuf_init() before use. While EWOULDBLOCK/EAGAIN handling is in progress, iobuf contains internally allocated storage. If you need to dispose of this, call flux_msg_iobuf_clean(), which is a no-op if no internal storage is allocated.

Implement non-blocking reactor handling for client send/recv path. Add a message queue for each client, where messages for the client can be stored while the client is not ready for writing. Also: address possible starvation issue in the client read path by handling at most one message before letting the reactor run. Fixes flux-framework#83

Now flux_send() on this connector no longer ignores FLUX_O_NONBLOCK.

zfd_send() and zfd_recv() were never designed for nonblocking I/O. They also lived in libutil and used zmsg_t, and since libutil doesn't depend on anything in libflux, couldn't be converted to flux_msg_t. Better functions are now provided in libflux, and users converted, so these are euthanized.

This test wasn't driven by anything in the check target, and it was using the zfd class that was just removed. Vanquished!

Add basic tests for encode/decode and sendfd/recvfd. Update message test to flux_msg_t (mostly).

And update the dictionary.

In preparation for a sharness test that pumps some large messages through the local connector using flux-ping, add a bit more verification that data has not been mangled.

With --batch --count, responses are not processed until count requests have been sent. This may be useful for testing the non-blocking client write path in the local-connector module.

Run flood pings of various sizes and counts under timeout. If any message is mangled or lost, the test should fail beacuse flux-ping will hang and the sharness timer will catch it. This is a stress test of sorts on the local connector and the corresponding 'local-connector' comms module.

garlick · 2015-07-23T18:17:16Z

Just forced a push with local-connector renamed to connector-local.

grondo · 2015-07-23T23:27:09Z

Looks good! Also checked out on my c9.io image. Merging...

rework local connector to be fully nonblocking + mod_main() prototype fix

This change goes along with flux-framework/flux-core#289 where the prototype of mon_main() changed to use argc, argv style argument passing rather than a zhash_t.

garlick added the review label Jul 23, 2015

garlick added 15 commits July 23, 2015 11:06

modules/connector-local: rename from "api"

d0214c6

modules/connector-local: modernize reactor usage

e139cf3

libflux: add flux_msg_encode/decode

89b4e44

These are simply flux_msg_t alternatives to zmsg_encode() and zmsg_decode().

connectors/local: use flux_msg_sendfd/recvfd

73c2867

Now flux_send() on this connector no longer ignores FLUX_O_NONBLOCK.

test: drop src/test/tenc.c

f378af4

This test wasn't driven by anything in the check target, and it was using the zfd class that was just removed. Vanquished!

test/message: add tests for new functions

ecd9dc8

Add basic tests for encode/decode and sendfd/recvfd. Update message test to flux_msg_t (mostly).

doc: add flux_msg_encode(3), flux_msg_sendfd(3), etc

9034d1e

And update the dictionary.

cmd/flux-ping: verify pad content

90ab380

In preparation for a sharness test that pumps some large messages through the local connector using flux-ping, add a bit more verification that data has not been mangled.

cmd/flux-ping: add --batch option

57c9a1a

With --batch --count, responses are not processed until count requests have been sent. This may be useful for testing the non-blocking client write path in the local-connector module.

doc/man1: add flux-ping --batch description

6380040

garlick force-pushed the local_nonblock branch from 445c651 to 49a1258 Compare July 23, 2015 18:16

grondo added a commit that referenced this pull request Jul 23, 2015

Merge pull request #289 from garlick/local_nonblock

77602e1

rework local connector to be fully nonblocking + mod_main() prototype fix

grondo merged commit 77602e1 into flux-framework:master Jul 23, 2015

grondo removed the review label Jul 23, 2015

This was referenced Jul 23, 2015

modules: update mod_main() prototype per RFC 5 flux-framework/flux-sched#42

Merged

[pymod] convert to argc, argv style argument parsing #291

Closed

[schedsrv] convert to argc, argv style argument parsing flux-framework/flux-sched#43

Closed

garlick mentioned this pull request Jul 31, 2015

Leaks and minor build fixes #306

Merged

garlick deleted the local_nonblock branch August 12, 2015 17:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rework local connector to be fully nonblocking + mod_main() prototype fix #289

rework local connector to be fully nonblocking + mod_main() prototype fix #289

garlick commented Jul 23, 2015

grondo commented Jul 23, 2015

garlick commented Jul 23, 2015

grondo commented Jul 23, 2015

garlick commented Jul 23, 2015

garlick commented Jul 23, 2015

grondo commented Jul 23, 2015

rework local connector to be fully nonblocking + mod_main() prototype fix #289

rework local connector to be fully nonblocking + mod_main() prototype fix #289

Conversation

garlick commented Jul 23, 2015

grondo commented Jul 23, 2015

garlick commented Jul 23, 2015

grondo commented Jul 23, 2015

garlick commented Jul 23, 2015

garlick commented Jul 23, 2015

grondo commented Jul 23, 2015