flux-start: add embedded server #3650

garlick · 2021-05-12T22:54:33Z

This adds a convenience function for setting up an embedded server that listens on a local:// socket and routes messages between connected clients and a server side flux_t handle. It's a bit of an abomination if you think about it too hard, but it does let arbitrary programs offer "services" in the usual way by registering message handlers, and clients to connect to it and use RPCs the way they would do with flux, so pretty convenient.

Maybe this will come in handy if the shell needs to offer a standalone service to applications, especially if that application is flux?

A server is then embedded in flux-start for the purpose of enabling system tests that restart brokers. In this PR, the only service method just lists broker pids. I thought maybe getting this much in with tests was maybe a good place to cut this PR and start the next one?

lgtm-com · 2021-05-12T23:27:18Z

This pull request introduces 1 alert when merging 5d3b389 into 166bff3 - view on LGTM.com

new alerts:

1 for Unused import

grondo · 2021-05-13T02:08:56Z

This is pretty neat. My mind is starting to explore all the neat things for which this could be used.

The flux-shell application is a good example. Does this open up the ability for programs to route messages as well?

How do you derive the URI for one of these connections? Could there be a special URI to attach to a named process with flux_open() for convenience (e..g flux_open ("local:flux-start", 0))? (eh, that is probably getting a little off-topic here)

garlick · 2021-05-13T04:13:34Z

Does this open up the ability for programs to route messages as well?

librouter has other abstractions that make that relatively straightforward (e.g. see flux-proxy or connector-local), although nothing prevents a program that embeds one of these servers from routing messages too, at least that I can think of right now. There isn't actually that much that is new here - for the most part it is just repackaging a few things that were already implemented.

How do you derive the URI for one of these connections? Could there be a special URI to attach to a named process with flux_open() for convenience (e..g flux_open ("local:flux-start", 0))? (eh, that is probably getting a little off-topic here)

Heh, do you mean like have a special URI path that instructs flux_open to internally substitute the final component of FLUX_URI? That's kind of a neat idea that would save a bit of code. Let me think about that one.

Edit: flux-start locates the socket in rundir, so if FLUX_URI is set to local:///tmp/flux-gmnMYf/local-0 then flux-start can be reached at local:///tmp/flux-gmnMYf/start. But that's just the convention I implemented for this use case - kind of arbitrary.

grondo · 2021-05-13T14:05:17Z

Edit: flux-start locates the socket in rundir, so if FLUX_URI is set to local:///tmp/flux-gmnMYf/local-0 then flux-start can be reached at local:///tmp/flux-gmnMYf/start. But that's just the convention I implemented for this use case - kind of arbitrary.

Well there is some precedent for this (I'm thinking of something like tmux, which places all sockets under a common path for discovery). What I was thinking we could allow is that the server-side could provide a name (e.g. "flux-start" or just "start", or for the shell "flux-shell") when opening the socket. Then a client could discover available sockets and connect by name (again thinking of the tmux example). It is difficult to imagine this would be very useful, since I can't think of other non-broker processes that would want to open one of these sockets, it was more of a spur of the moment idea. At the very least each client would not be required to substitute paths in FLUX_URI by hand.

Sorry if the idea got you a little off track.

garlick · 2021-05-13T14:40:56Z

I'm not sure that would work for the shell, since if there are multiple jobs running under the same broker, the rundir directory is not unique. Also the shell may not have write permission there to create a socket (as a guest). There might be other approaches for "socket discovery" though that could make a flux ecosystem easier to navigate, but maybe that's a topic for another day?

garlick · 2021-05-13T14:45:29Z

Just realized I had not documented another caveat - that the server flux_t handle requires a reactor to make progress. Added header comments and updated commit message.

grondo · 2021-05-13T15:17:38Z

Also the shell may not have write permission there to create a socket (as a guest). There might be other approaches for "socket discovery" though that could make a flux ecosystem easier to navigate, but maybe that's a topic for another day?

Yeah, duh that is a problem. 🤦 Yeah, we'll just kick the can down the road awhile longer. It is not likely we'll add this capability to the shell anytime soon anyway, and no use discussing this for test-only flux-start use

garlick · 2021-05-13T19:49:27Z

At the very least each client would not be required to substitute paths in FLUX_URI by hand.

I forced a push a change to set a new environment variable FLUX_START_URI to relieve clients of the burden of substituting strings. I'm not a great fan of proliferating Flux environment variables, but for testing only maybe this is OK.

grondo · 2021-05-13T20:24:40Z

but for testing only maybe this is OK.

flux start isn't just for testing anymore (works well for a single node, single user instance). Though I think the impact starting the internal flux-start server and resulting FLUX_START_URI is minimal, perhaps the server should be behind a --test-* option?

Also, flux-start should probably clear FLUX_START_URI from the environment at startup so it doesn't get passed through to subinstances. Again probably low risk, but it might be confusing if your subinstance connects to a flux-start from a different instance.

Edit: Realized I didn't make it clear - I'm fine with the implementation as it is now. The suggestion to put the server behind a --test option was more of a question than a requirement.

I do think the FLUX_START_URI should be sanitized in new instances. Maybe there's a better place to do that than flux-start though.

garlick · 2021-05-13T22:22:40Z

As discussed at coffee, I don't think it's necessary to provide an option to enable/disable this as the service is low overhead and seems unlikely to get in the way. We could easily add it later if needed.

We can't sanitize FLUX_START_URI in the broker because we need it to be passed through when a test program is run as the initial program.

Should we sanitize it in the job shell, e.g.

diff --git a/src/shell/task.c b/src/shell/task.c
index 5d1912db7..a90123d01 100644
--- a/src/shell/task.c
+++ b/src/shell/task.c
@@ -150,6 +150,9 @@ struct shell_task *shell_task_create (struct shell_info *info,
                               getenv ("FLUX_KVS_NAMESPACE")) < 0)
             goto error;
     }
+    /* Sanitize FLUX variables that should not propagate to jobs.
+     */
+    flux_cmd_unsetenv (task->cmd, "FLUX_START_URI");
     return task;
 error:
     shell_task_destroy (task);

It feels slightly wrong, like we're on a slippery slope to complex environment propagation rules.

Maybe it's a better idea to revert the change that added the env var?

grondo · 2021-05-14T02:12:59Z

It feels slightly wrong, like we're on a slippery slope to complex environment propagation rules.
Maybe it's a better idea to revert the change that added the env var?

I think you had convinced me that propagation of FLUX_START_URI to all jobs and child instances was not a big deal at this point. I say your current approach is fine for now. Improvements, if necessary, could always be done in the future.

garlick · 2021-05-14T02:32:10Z

OK, then let's leave it in and press forward. Thanks for the discussion!

I restarted CI - one builder got stuck on that shell input sharness test, which I think is a known one.

grondo

As discussed, LGTM!

chu11 · 2021-05-14T04:02:11Z

src/common/libflux/message.h

@@ -191,7 +191,7 @@ int flux_msg_set_flags (flux_msg_t *msg, uint8_t flags);
 /* Get/set string payload.
 * flux_msg_set_string() accepts a NULL 's' (no payload).
 * flux_msg_get_string() will set 's' to NULL if there is no payload
- * N.B. the raw paylaod includes C string \0 terminator.
+ * N.B. the raw payload includes C string \0 terminator.


commit message typo? I see only one comment misspelling, not two :-)

Huh, weird. I could have sworn there were two. Oh, found the same misspelled word in event.h. I'll tack that on and fix the commit message.

Problem: spelling errors in header comments Correct spelling.

Problem: some function protos spill over 80 columns, and others don't line up. Consistently indent parameters that span lines. Break long lines where needed. Drop duplicate inline comments that happened to run over 80 cols.

Problem: need to add "server" functionality to flux-start. Add a generic "usock_service_create()" function that establishes a usock (PF_LOCAL) socket listener, and returns a server-side flux_t handle. Clients may use the flux API to flux_open() local://${path} and access services registered on the server side flux_t handle in the usual way. Limitations: - guest users are rejected (message cred.userid != server userid) - event messages may not be published or subscribed to - clients may not register services - rank routing information in requests is ignored - a request with topic 'disconnect' is sent upon client disconnect, not "<service>.disconnect" as specified in RFC 6. - server flux_t handle requires async reactor operation (one cannot call flux_recv() in a loop and expect it to make progress) Add simple unit test.

Problem: tests that involve starting and stopping brokers are difficult to orchestrate using flux-start, but we will need support for running such tests in CI. Use the new usock_service to embed a server in flux-start. The server creates a socket named 'start' in the rundir, and sets FLUX_START_URI in the environment for clients. Currently the server has support for the following methods: start.status Return an array of procs that includes broker PIDs in rank order disconnect Log receipt of disconnect message. This is a placeholder for future streaming socket management. Add a test front end command that provides the client side for the start.status RPC, and is available to add sub-commands for simple, shell script driven testing. More sophisticated, event driven test programs would be written in python and combine broker and flux-start communication. Add a few tests to t0001-basic.t to exercise basic function.

codecov · 2021-05-14T04:32:20Z

Codecov Report

Merging #3650 (67085dd) into master (c896ace) will decrease coverage by 0.01%.
The diff coverage is 74.34%.

❗ Current head 67085dd differs from pull request most recent head f9954f0. Consider uploading reports for the commit f9954f0 to get more accurate results

@@            Coverage Diff             @@
##           master    #3650      +/-   ##
==========================================
- Coverage   82.76%   82.74%   -0.02%     
==========================================
  Files         325      326       +1     
  Lines       49025    49166     +141     
==========================================
+ Hits        40576    40684     +108     
- Misses       8449     8482      +33

Impacted Files	Coverage Δ
src/cmd/flux-start.c	`78.01% <73.17%> (-0.69%)`	⬇️
src/common/librouter/usock_service.c	`74.77% <74.77%> (ø)`
src/cmd/builtin/proxy.c	`73.72% <0.00%> (-1.47%)`	⬇️
src/modules/job-archive/job-archive.c	`59.03% <0.00%> (-0.81%)`	⬇️
src/common/libflux/message.c	`83.92% <0.00%> (-0.13%)`	⬇️
src/broker/broker.c	`75.22% <0.00%> (+0.25%)`	⬆️
src/common/libsubprocess/local.c	`79.65% <0.00%> (+0.34%)`	⬆️
src/broker/module.c	`77.55% <0.00%> (+1.24%)`	⬆️

garlick · 2021-05-14T04:32:48Z

OK, forced a push with a slightly expanded set of spelling corrections and a more vague header :-) Thanks for that. I'll set MWP.

garlick force-pushed the embedded_server branch from 7242a06 to 5d3b389 Compare May 12, 2021 22:58

garlick force-pushed the embedded_server branch 2 times, most recently from fad351f to d94e8cc Compare May 13, 2021 01:37

garlick force-pushed the embedded_server branch from d94e8cc to e2b8d2f Compare May 13, 2021 04:31

garlick force-pushed the embedded_server branch from e2b8d2f to 8323da3 Compare May 13, 2021 14:44

garlick force-pushed the embedded_server branch from 8323da3 to 5dbc161 Compare May 13, 2021 19:46

grondo approved these changes May 14, 2021

View reviewed changes

chu11 reviewed May 14, 2021

View reviewed changes

garlick added 4 commits May 13, 2021 21:31

libflux: fix spelling errors in headers

e98c4c8

Problem: spelling errors in header comments Correct spelling.

libflux: fix whitespace in message.h

a8a17c5

Problem: some function protos spill over 80 columns, and others don't line up. Consistently indent parameters that span lines. Break long lines where needed. Drop duplicate inline comments that happened to run over 80 cols.

garlick force-pushed the embedded_server branch from 5dbc161 to f9954f0 Compare May 14, 2021 04:32

garlick added the merge-when-passing label May 14, 2021

mergify bot merged commit e1acd4a into flux-framework:master May 14, 2021

garlick deleted the embedded_server branch May 14, 2021 04:56

garlick mentioned this pull request Jun 6, 2021

flux-start: build system instance test features #3700

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flux-start: add embedded server #3650

flux-start: add embedded server #3650

garlick commented May 12, 2021

lgtm-com bot commented May 12, 2021

grondo commented May 13, 2021

garlick commented May 13, 2021 •

edited

grondo commented May 13, 2021

garlick commented May 13, 2021

garlick commented May 13, 2021

grondo commented May 13, 2021

garlick commented May 13, 2021

grondo commented May 13, 2021 •

edited

garlick commented May 13, 2021

grondo commented May 14, 2021

garlick commented May 14, 2021

grondo left a comment

chu11 May 14, 2021

garlick May 14, 2021

codecov bot commented May 14, 2021

garlick commented May 14, 2021

flux-start: add embedded server #3650

flux-start: add embedded server #3650

Conversation

garlick commented May 12, 2021

lgtm-com bot commented May 12, 2021

grondo commented May 13, 2021

garlick commented May 13, 2021 • edited

grondo commented May 13, 2021

garlick commented May 13, 2021

garlick commented May 13, 2021

grondo commented May 13, 2021

garlick commented May 13, 2021

grondo commented May 13, 2021 • edited

garlick commented May 13, 2021

grondo commented May 14, 2021

garlick commented May 14, 2021

grondo left a comment

Choose a reason for hiding this comment

chu11 May 14, 2021

Choose a reason for hiding this comment

garlick May 14, 2021

Choose a reason for hiding this comment

codecov bot commented May 14, 2021

Codecov Report

garlick commented May 14, 2021

garlick commented May 13, 2021 •

edited

grondo commented May 13, 2021 •

edited