Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-start: add embedded server #3650

Merged
merged 4 commits into from May 14, 2021
Merged

Conversation

garlick
Copy link
Member

@garlick garlick commented May 12, 2021

This adds a convenience function for setting up an embedded server that listens on a local:// socket and routes messages between connected clients and a server side flux_t handle. It's a bit of an abomination if you think about it too hard, but it does let arbitrary programs offer "services" in the usual way by registering message handlers, and clients to connect to it and use RPCs the way they would do with flux, so pretty convenient.

Maybe this will come in handy if the shell needs to offer a standalone service to applications, especially if that application is flux?

A server is then embedded in flux-start for the purpose of enabling system tests that restart brokers. In this PR, the only service method just lists broker pids. I thought maybe getting this much in with tests was maybe a good place to cut this PR and start the next one?

@lgtm-com
Copy link

lgtm-com bot commented May 12, 2021

This pull request introduces 1 alert when merging 5d3b389 into 166bff3 - view on LGTM.com

new alerts:

  • 1 for Unused import

@garlick garlick force-pushed the embedded_server branch 2 times, most recently from fad351f to d94e8cc Compare May 13, 2021 01:37
@grondo
Copy link
Contributor

grondo commented May 13, 2021

This is pretty neat. My mind is starting to explore all the neat things for which this could be used.

The flux-shell application is a good example. Does this open up the ability for programs to route messages as well?

How do you derive the URI for one of these connections? Could there be a special URI to attach to a named process with flux_open() for convenience (e..g flux_open ("local:flux-start", 0))? (eh, that is probably getting a little off-topic here)

@garlick
Copy link
Member Author

garlick commented May 13, 2021

Does this open up the ability for programs to route messages as well?

librouter has other abstractions that make that relatively straightforward (e.g. see flux-proxy or connector-local), although nothing prevents a program that embeds one of these servers from routing messages too, at least that I can think of right now. There isn't actually that much that is new here - for the most part it is just repackaging a few things that were already implemented.

How do you derive the URI for one of these connections? Could there be a special URI to attach to a named process with flux_open() for convenience (e..g flux_open ("local:flux-start", 0))? (eh, that is probably getting a little off-topic here)

Heh, do you mean like have a special URI path that instructs flux_open to internally substitute the final component of FLUX_URI? That's kind of a neat idea that would save a bit of code. Let me think about that one.

Edit: flux-start locates the socket in rundir, so if FLUX_URI is set to local:///tmp/flux-gmnMYf/local-0 then flux-start can be reached at local:///tmp/flux-gmnMYf/start. But that's just the convention I implemented for this use case - kind of arbitrary.

@grondo
Copy link
Contributor

grondo commented May 13, 2021

Edit: flux-start locates the socket in rundir, so if FLUX_URI is set to local:///tmp/flux-gmnMYf/local-0 then flux-start can be reached at local:///tmp/flux-gmnMYf/start. But that's just the convention I implemented for this use case - kind of arbitrary.

Well there is some precedent for this (I'm thinking of something like tmux, which places all sockets under a common path for discovery). What I was thinking we could allow is that the server-side could provide a name (e.g. "flux-start" or just "start", or for the shell "flux-shell") when opening the socket. Then a client could discover available sockets and connect by name (again thinking of the tmux example). It is difficult to imagine this would be very useful, since I can't think of other non-broker processes that would want to open one of these sockets, it was more of a spur of the moment idea. At the very least each client would not be required to substitute paths in FLUX_URI by hand.

Sorry if the idea got you a little off track.

@garlick
Copy link
Member Author

garlick commented May 13, 2021

I'm not sure that would work for the shell, since if there are multiple jobs running under the same broker, the rundir directory is not unique. Also the shell may not have write permission there to create a socket (as a guest). There might be other approaches for "socket discovery" though that could make a flux ecosystem easier to navigate, but maybe that's a topic for another day?

@garlick
Copy link
Member Author

garlick commented May 13, 2021

Just realized I had not documented another caveat - that the server flux_t handle requires a reactor to make progress. Added header comments and updated commit message.

@grondo
Copy link
Contributor

grondo commented May 13, 2021

Also the shell may not have write permission there to create a socket (as a guest). There might be other approaches for "socket discovery" though that could make a flux ecosystem easier to navigate, but maybe that's a topic for another day?

Yeah, duh that is a problem. 🤦 Yeah, we'll just kick the can down the road awhile longer. It is not likely we'll add this capability to the shell anytime soon anyway, and no use discussing this for test-only flux-start use

@garlick
Copy link
Member Author

garlick commented May 13, 2021

At the very least each client would not be required to substitute paths in FLUX_URI by hand.

I forced a push a change to set a new environment variable FLUX_START_URI to relieve clients of the burden of substituting strings. I'm not a great fan of proliferating Flux environment variables, but for testing only maybe this is OK.

@grondo
Copy link
Contributor

grondo commented May 13, 2021

but for testing only maybe this is OK.

flux start isn't just for testing anymore (works well for a single node, single user instance). Though I think the impact starting the internal flux-start server and resulting FLUX_START_URI is minimal, perhaps the server should be behind a --test-* option?

Also, flux-start should probably clear FLUX_START_URI from the environment at startup so it doesn't get passed through to subinstances. Again probably low risk, but it might be confusing if your subinstance connects to a flux-start from a different instance.

Edit: Realized I didn't make it clear - I'm fine with the implementation as it is now. The suggestion to put the server behind a --test option was more of a question than a requirement.

I do think the FLUX_START_URI should be sanitized in new instances. Maybe there's a better place to do that than flux-start though.

@garlick
Copy link
Member Author

garlick commented May 13, 2021

As discussed at coffee, I don't think it's necessary to provide an option to enable/disable this as the service is low overhead and seems unlikely to get in the way. We could easily add it later if needed.

We can't sanitize FLUX_START_URI in the broker because we need it to be passed through when a test program is run as the initial program.

Should we sanitize it in the job shell, e.g.

diff --git a/src/shell/task.c b/src/shell/task.c
index 5d1912db7..a90123d01 100644
--- a/src/shell/task.c
+++ b/src/shell/task.c
@@ -150,6 +150,9 @@ struct shell_task *shell_task_create (struct shell_info *info,
                               getenv ("FLUX_KVS_NAMESPACE")) < 0)
             goto error;
     }
+    /* Sanitize FLUX variables that should not propagate to jobs.
+     */
+    flux_cmd_unsetenv (task->cmd, "FLUX_START_URI");
     return task;
 error:
     shell_task_destroy (task);

It feels slightly wrong, like we're on a slippery slope to complex environment propagation rules.

Maybe it's a better idea to revert the change that added the env var?

@grondo
Copy link
Contributor

grondo commented May 14, 2021

It feels slightly wrong, like we're on a slippery slope to complex environment propagation rules.
Maybe it's a better idea to revert the change that added the env var?

I think you had convinced me that propagation of FLUX_START_URI to all jobs and child instances was not a big deal at this point. I say your current approach is fine for now. Improvements, if necessary, could always be done in the future.

@garlick
Copy link
Member Author

garlick commented May 14, 2021

OK, then let's leave it in and press forward. Thanks for the discussion!

I restarted CI - one builder got stuck on that shell input sharness test, which I think is a known one.

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, LGTM!

@@ -191,7 +191,7 @@ int flux_msg_set_flags (flux_msg_t *msg, uint8_t flags);
/* Get/set string payload.
* flux_msg_set_string() accepts a NULL 's' (no payload).
* flux_msg_get_string() will set 's' to NULL if there is no payload
* N.B. the raw paylaod includes C string \0 terminator.
* N.B. the raw payload includes C string \0 terminator.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commit message typo? I see only one comment misspelling, not two :-)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Huh, weird. I could have sworn there were two. Oh, found the same misspelled word in event.h. I'll tack that on and fix the commit message.

Problem: spelling errors in header comments

Correct spelling.
Problem: some function protos spill over 80 columns, and
others don't line up.

Consistently indent parameters that span lines.
Break long lines where needed.
Drop duplicate inline comments that happened to run over 80 cols.
Problem: need to add "server" functionality to flux-start.

Add a generic "usock_service_create()" function that establishes
a usock (PF_LOCAL) socket listener, and returns a server-side
flux_t handle.  Clients may use the flux API to flux_open()
local://${path} and access services registered on the server side
flux_t handle in the usual way.

Limitations:
- guest users are rejected (message cred.userid != server userid)
- event messages may not be published or subscribed to
- clients may not register services
- rank routing information in requests is ignored
- a request with topic 'disconnect' is sent upon client disconnect,
  not "<service>.disconnect" as specified in RFC 6.
- server flux_t handle requires async reactor operation
  (one cannot call flux_recv() in a loop and expect it to make progress)

Add simple unit test.
Problem: tests that involve starting and stopping brokers are
difficult to orchestrate using flux-start, but we will need
support for running such tests in CI.

Use the new usock_service to embed a server in flux-start.
The server creates a socket named 'start' in the rundir, and
sets FLUX_START_URI in the environment for clients.

Currently the server has support for the following methods:

start.status
  Return an array of procs that includes broker PIDs in rank order

disconnect
  Log receipt of disconnect message.  This is a placeholder for
  future streaming socket management.

Add a test front end command that provides the client side for the
start.status RPC, and is available to add sub-commands for simple,
shell script driven testing.  More sophisticated, event driven test
programs would be written in python and combine broker and flux-start
communication.

Add a few tests to t0001-basic.t to exercise basic function.
@codecov
Copy link

codecov bot commented May 14, 2021

Codecov Report

Merging #3650 (67085dd) into master (c896ace) will decrease coverage by 0.01%.
The diff coverage is 74.34%.

❗ Current head 67085dd differs from pull request most recent head f9954f0. Consider uploading reports for the commit f9954f0 to get more accurate results

@@            Coverage Diff             @@
##           master    #3650      +/-   ##
==========================================
- Coverage   82.76%   82.74%   -0.02%     
==========================================
  Files         325      326       +1     
  Lines       49025    49166     +141     
==========================================
+ Hits        40576    40684     +108     
- Misses       8449     8482      +33     
Impacted Files Coverage Δ
src/cmd/flux-start.c 78.01% <73.17%> (-0.69%) ⬇️
src/common/librouter/usock_service.c 74.77% <74.77%> (ø)
src/cmd/builtin/proxy.c 73.72% <0.00%> (-1.47%) ⬇️
src/modules/job-archive/job-archive.c 59.03% <0.00%> (-0.81%) ⬇️
src/common/libflux/message.c 83.92% <0.00%> (-0.13%) ⬇️
src/broker/broker.c 75.22% <0.00%> (+0.25%) ⬆️
src/common/libsubprocess/local.c 79.65% <0.00%> (+0.34%) ⬆️
src/broker/module.c 77.55% <0.00%> (+1.24%) ⬆️

@garlick
Copy link
Member Author

garlick commented May 14, 2021

OK, forced a push with a slightly expanded set of spelling corrections and a more vague header :-) Thanks for that. I'll set MWP.

@mergify mergify bot merged commit e1acd4a into flux-framework:master May 14, 2021
@garlick garlick deleted the embedded_server branch May 14, 2021 04:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants