Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experimental: add global rundir from which broker.rundir is derived #2121

Merged
merged 9 commits into from Apr 16, 2019

Conversation

@grondo
Copy link
Contributor

commented Apr 14, 2019

This should be considered an experimental PR for now, because some of the design choices made here need some thought and agreement.

This PR changes the current broker.rundir attribute into a "global" rundir attribute which, as before, is designed to be overridden on the command line. If not set, a temporary path is chosen with mkdtemp(3) as before, and in either case the directory is created (and subsequently cleaned up at exit) if it doesn't exist.

The broker.rundir attribute is now always derived from rundir as ${rundir}/<rank> to seamlessly allow multiple brokers per node in either of the rundir situations (random directory vs preset directory).

After this change, flux-start no longer needs to set a different broker.rundir per broker it is launching (for the --size case). It now sets rundir to the created scratchdir and lets each broker handle creation of broker.rundir. Due to this change, the flux-start scratchdir needs to use cleanup_directory_recursive to clean scratchdir.

A questionable approach for initializing broker.rundir is used here, because the broker rank is required, but is not determined until early in the "boot" phase, during which the broker.rundir attribute is used to create ipc endpoints. Instead of duplicating code in both the boot methods (pmi and config), an "initialization" callback was added to the overlay_t structure, which is called at the end of overlay_init() where broker rank and session size are determined. This felt a bit icky since two new entries were added to the structure which are only used during initialization, but it resulted in the cleanest code. Also, presuming all future "boot" methods need to call overlay_init(), this method should seamlessly work for future extensions.

The other perhaps controversial decision here is the choice of rundir for the global rundir attribute itself, but I couldn't think of another reasonable name, and liked the simplicity of rundir for the "global" path and broker.rundir for the per-broker path. I'd be fine choosing something else, and this is why I've delayed updating docs, etc.

The end result of this PR is meant to give the ability to predetermine FLUX_URI by setting a known global rundir for the session. Of course, each broker rank can't ever have the same FLUX_URI when brokers may be run on the same node, so this PR does the next best thing, the FLUX_URI can be derived from the known rundir as local://${rundir}/${rank}. e.g.

$ srun -N4 --mpi=none src/cmd/flux start -o,-S,rundir=/tmp/grondo/foo flux exec -r all sh -c 'echo $FLUX_URI'
local:///tmp/grondo/foo/0
local:///tmp/grondo/foo/1
local:///tmp/grondo/foo/2
local:///tmp/grondo/foo/3

Resolves #2118

@grondo grondo requested a review from garlick Apr 14, 2019
@garlick

This comment has been minimized.

Copy link
Member

commented Apr 14, 2019

This all seems reasonable to me, and fairly tidy considering the ill formed abstractions you had to deal with.

I was a little worried about breaking the content cache save/restore mode, but it looks like that is not affected since the persist-directory appears to be independent of the rundir stuff (and tests are passing) so good!

I'll poke at this a bit as soon as I get a chance.

Thanks for doing this so carefully.

Copy link
Member

left a comment

I poked at this quite a bit and everything seems to work as advertised!

As discussed offline, I don't really have a great suggestion for names, so as long as we document the behavior of each attribute, these are probably fine. (In addition to flux-broker-attributes(7) it looks like the description of the --scratchdir option in flux-start(1) will probably need an update).

Nice work.

grondo added 5 commits Apr 12, 2019
Attempt to create non-existent broker.rundir when the attribute is
set on the commandline. If the directory was pre-existing, then do
not schedule it for auto-cleanup on broker exit.
Add a test that the broker creates broker.rundir set on commandline,
and cleans up the result.
Add an initialization callback to the overlay structure which is
called at the end of overlay_init(). This allows the broker to register
code that is called *after* the overlay is initialized, once the broker
rank and instance size have been determined, from any boot method.
Change the user settable `broker.rundir` attribute to a global
`rundir`, presumed to be shared per instance if set. Add a new
`broker.rundir` which is now set to `${rundir}/rank` to allow sharing
of the global rundir.

The broker.rundir is still used as before, as the basis for ipc socket
paths as well as the connector-local socket and basis for FLUX_URI.

This change allows a global `rundir` to be set for a flux instance,
from which the FLUX_URI for any rank may be derived.

In flux-start, there is now no need to set a per-broker broker.rundir,
since this is now handled internally. Clean up the set `rundir`
recursively since there will be one subdirectory per launched broker.
Update the t0001-basic.t test that used broker.rundir to use the
global `rundir` attribute instead. Simplify tests that assumed
flux-start created a broker.rundir under its scratchdir, and instead
just ensure flux-start creates and cleans up the global rundir.

Since flux-start now recursively cleans up rundir, there is no need
for the "cleanup after panic" test.
@grondo grondo force-pushed the grondo:global-rundir branch from 66e7a26 to 4f155cc Apr 15, 2019
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Apr 15, 2019

Forced a push with a tiny cleanup in the first commit and added the doc changes.

@garlick

This comment has been minimized.

Copy link
Member

commented Apr 15, 2019

For selfpmi bootstrap mode, set a pre-existing directory that will be
 used as the rundir directory for the instance. If a DIR is not set
 with this option, a unique temporary directory will be created for the
 lifetime of the instance and removed when the instance is destroyed.

I might be confused but if the --scratchdir doesn't exist, it is still passed in as the broker rundir where it is created and later cleaned up, correct? If so then pre-existing should be dropped in this description.

@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Apr 15, 2019

I might be confused but if the --scratchdir doesn't exist, it is still passed in as the broker rundir where it is created and later cleaned up, correct?

Good catch. I failed to think about that when updating the flux-start(1) documentation.

In fact, I wonder if now the --scratchdir option could be dropped, since it is duplicating the functionality of setting the broker's rundir attribute?

@garlick

This comment has been minimized.

Copy link
Member

commented Apr 15, 2019

Works for me! I don't see any tests that use it, offhand.

@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Apr 15, 2019

Works for me! I don't see any tests that use it, offhand.

Unfortunately, pushing off the creation of rundir from flux-start to the brokers results in a change that does break some tests. By default there is a different rundir per rank for selfpmi mode (e.g. flux start -s N). For one thing this breaks an assumption in flux-proxy, but is also not quite as clean as allowing flux start to create a common rundir (it could choose a rundir but there would be a race there anyway.)

So I'll just remove the pre-existing from flux-start(1)

grondo added 3 commits Apr 15, 2019
Update documentaiton of --scratchdir option in flux-start(1) to indicate
it creates and sets `rundir` attribute, not `broker.rundir`.
Add documentation for the global rundir attribute in flux-broker,
and update the documentation for `broker.rundir` and `local-uri` to
further delineate how they are related to `rundir`.
Set the `rundir` attribute in the systemd flux.service file
instead of `broker.rundir` since this is now the suggested attribute
to override on the commandline.
@grondo grondo force-pushed the grondo:global-rundir branch from 4f155cc to 2dee7bc Apr 15, 2019
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Apr 15, 2019

Ok forced a push with update --scratchdir documentation

-For selfpmi bootstrap mode, set a pre-existing directory that will be
-used as the rundir directory for the instance.  If a DIR is not set
-with this option, a unique temporary directory will be created for the
-lifetime of the instance and removed when the instance is destroyed.
+For selfpmi bootstrap mode, set the directory that will be
+used as the rundir directory for the instance. If the directory
+does not exist then it will be created during instance startup.
+If a DIR is not set with this option, a unique temporary directory
+will be created. Unless DIR was pre-existing, it will be removed
+when the instance is destroyed.
@garlick

This comment has been minimized.

Copy link
Member

commented Apr 15, 2019

Great! I'll merge once travis is green.

Add 'startup' to spelling dictionary, which seems to be missing
on CentOS 7 only.
@codecov-io

This comment has been minimized.

Copy link

commented Apr 16, 2019

Codecov Report

Merging #2121 into master will decrease coverage by <.01%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master    #2121      +/-   ##
==========================================
- Coverage   80.22%   80.21%   -0.01%     
==========================================
  Files         201      201              
  Lines       31712    31731      +19     
==========================================
+ Hits        25441    25454      +13     
- Misses       6271     6277       +6
Impacted Files Coverage Δ
src/cmd/flux-start.c 78.18% <100%> (-0.03%) ⬇️
src/broker/overlay.c 84.89% <100%> (+0.31%) ⬆️
src/broker/broker.c 77.85% <72.97%> (-0.09%) ⬇️
src/modules/barrier/barrier.c 76.55% <0%> (-2.07%) ⬇️
src/cmd/flux-module.c 84.19% <0%> (+0.23%) ⬆️
@garlick

This comment has been minimized.

Copy link
Member

commented Apr 16, 2019

Thanks!

@garlick garlick merged commit 2c126f3 into flux-framework:master Apr 16, 2019
3 of 4 checks passed
3 of 4 checks passed
codecov/patch 77.77% of diff hit (target 80.22%)
Details
Mergify — Summary 1 potential rule
Details
codecov/project 80.21% (-0.01%) compared to cc3c67f
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@grondo

This comment has been minimized.

Copy link
Contributor Author

commented Apr 16, 2019

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.