Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

broker: drop broker.rundir and select ipc vs tcp using broker.mapping #3554

Merged
merged 12 commits into from Mar 17, 2021

Conversation

garlick
Copy link
Member

@garlick garlick commented Mar 13, 2021

This is based on top of #3553.

With PMI_process_mapping available in the broker, use it rather than an explicit broker command line option from flux start to determine when ipc:// can be used in place of tcp:// for the overlay network.

In addition, perform the following cleanup that falls out from these changes:

  • Don't allow tbon.endpoint to be set on the command line. We don't need this anymore now that we can bootstrap from config files, and flux-start doesn't need to set it tell the broker to use ipc://
  • Eliminate broker.rundir, the ranked subdirectory of rundir. Instead append the rank to ipc:// paths when used, and to the local URI, in case rundir happens to be shared by multiple brokers (e.g. standalone flux start).
  • Eliminate the callback triggered by calling overlay_init(). This used to create broker.rundir from within the PMI bootstrap code, after the rank was known, so that the ipc:// paths could use it later on in the PMI bootstrap code.

@garlick garlick force-pushed the overlay_callback branch 2 times, most recently from ba5d922 to 1cd3424 Compare March 16, 2021 15:03
Problem: overlay unit test calls ok() with missing printf
arguments.

Supply the missing arguments.
Problem: timer watchers are leaked in overlay
unit test function recvmsg_timeout().

Destroy the watcher on exit from the function

Thanks to grondo for pointing this out.
Problem: the creation of 'broker.rundir', a directory that
is guaranteed unique for each broker, as opposed to 'rundir',
which is not always unique, is no longer necessary.

Change the local URI from local://<rundir>/<rank>/local to
local://<rundir>/local-<rank>.

Drop the init callback from overlay.c that was needed to create
the broker.rundir between discovering the rank and creating an
ipc:// tbon endoint in it when bootstrapping with PMI.

Update flux-broker-attributes(7).
Update overlay unit test.
Problem: create_rundir() has a local variable that is
set to NULL then passed to free().

Drop unused variable.
Problem: if create_rundir() fails, in some cases a created directory
may not be cleaned up.

Change logic slightly so that cleanup registration occurs in the
error path, not just the success path.
Problem: create_rundir() uses asprintf() to build path
when a static buffer would be less complicated.

Switch to static path buffer.
Problem: overlay_init() has a different, lesser purpose now
that it no longer triggers the init callback, so its is named
a bit too generically.

Rename to overlay_set_geometry().

Update broker and unit test.
Problem: rundir generated by flux-start includes flux-start pid,
while rundir generated by the broker does not.

For consistency, use the same procedure to generate the rundir in
flux-start as is used in the broker.  Directories will be
$TMPDIR/flux-XXXXXX, where XXXXXX is substituted with a unique
string by mkdtemp(3).
Problem: if local-uri is malformed, module exits without
setting errno.

Set errno to EINVAL in this case.
Problem: a malformed local-uri is not caught early, allowing
the instance to enter rc1 and time out.

The connector-local module initialization happens in parallel
with other broker startup, so is not easily caught before rc1
executes.

Improve error handling by checking for obvious problems with the
local-uri value during early broker startup.
Problem: errors in broker command line specification of
rundir and local-uri are not covered well in test.

Add some tests to t0001-basic.t.
@garlick
Copy link
Member Author

garlick commented Mar 17, 2021

Rebased on current master and added some more testing.

@codecov
Copy link

codecov bot commented Mar 17, 2021

Codecov Report

Merging #3554 (c4ae955) into master (ce879ba) will increase coverage by 0.04%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #3554      +/-   ##
==========================================
+ Coverage   82.38%   82.42%   +0.04%     
==========================================
  Files         322      322              
  Lines       48829    48814      -15     
==========================================
+ Hits        40226    40237      +11     
+ Misses       8603     8577      -26     
Impacted Files Coverage Δ
src/modules/connector-local/local.c 76.10% <0.00%> (-0.68%) ⬇️
src/broker/broker.c 74.81% <85.71%> (+2.10%) ⬆️
src/broker/boot_config.c 80.09% <100.00%> (ø)
src/broker/boot_pmi.c 65.00% <100.00%> (+0.50%) ⬆️
src/broker/overlay.c 87.76% <100.00%> (-0.15%) ⬇️
src/cmd/flux-start.c 78.71% <100.00%> (-0.08%) ⬇️
src/common/libflux/message.c 83.68% <0.00%> (-0.13%) ⬇️
src/common/libpmi/pmi.c 93.39% <0.00%> (+1.88%) ⬆️
src/common/libpmi/simple_client.c 89.61% <0.00%> (+3.27%) ⬆️

Copy link
Contributor

@grondo grondo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! One minor comment/question inline.

if (attr_add (attrs, "local-uri", uri, FLUX_ATTRFLAG_IMMUTABLE) < 0) {
log_err ("create_broker_rundir: attr_add (local-uri)");
goto cleanup;
if (snprintf (path, sizeof (path), "%s", uri + 8) >= sizeof (path)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're writing to uri + 8, should the bounds check be for sizeof (path) - 8?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading from uri + 8 so we check for overflow when writing to path.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh duh! Oops, sorry for the think-o.

@garlick
Copy link
Member Author

garlick commented Mar 17, 2021

I'll go ahead and set MWP. Thanks!

@mergify mergify bot merged commit 4690785 into flux-framework:master Mar 17, 2021
@garlick garlick deleted the overlay_callback branch March 17, 2021 19:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants