job shell: add flux_shell_t and completion references #2240

grondo · 2019-07-17T18:01:25Z

This PR addresses #2237.

I apologize in advance because this will be a difficult PR to review. I attempt to make a minor advance by adding a struct flux_shell structure to contain the independing io info and pmi structures used by the various components of the shell, but ended up making more changes than I wanted. Perhaps with the luxury of more time, this PR could be made cleaner, but hopefully after some rounds of feedback this approach could be acceptable.

As noted above, this branch adds a new struct flux_shell structure as container for the most of the other structures used by the shell. This structure can then be passed to callbacks that may require access to multiple shell components and data, without requiring those callers to use custom structures or push multiple pointers into the reactor or handle aux hashes.

The struct flux_shell may also serve as an opaque handle for plugins in the future (along with a set of accessors as yet unwritten), and thus a flux_shell_t typedef is used as the start of a public interface. Later when we need this we can install flux/shell.h and move the struct flux_shell definition internal.

Once the flux_shell_t structure was defined, many of the init functions were updated to take this function as an argument. This resulted in cascading changes that, unfortunately, come in one big commit. To help review, here's a summary of the changes:

move cmdline parsing and basic option checking into a function shell_parse_cmdline()
move shell "options" (verbose, standalone) set on cmdline to bools in flux_shell_t
set jobid in shell->jobid
move destruction of components of flux_shell_t into shell_finalize()
move fetch of jobspec and R fully into shell_info_create() (whether set on cmdline or fetched via kvs) (this is probably a majority of the code change, since some functions were moved from shell.c to info.c)

Finally, to address #2237, a concept from wrexecd was borrowed to add a "completion reference" interface for the shell's reactor. Once all completion refs that are taken are dropped, the shell's reactor is manually stopped with flux_reactor_stop(). The references are also named as a future debugging aid. Initially, a reference is taken for each task, and released in the task completion handler.

Fixes #2237.

Handle NULL passed to rcalc_destroy () for consistency with most other flux-core code.

Problem: As the job shell grows in complexity, functions and callbacks may need to access more than one of the shell internal datatatypes at a time, potentially including flux handle, reactor, shell job information etc. Currently there is no easy or common way for internal code to accomplish this without use of custom data types, adding globals, et.c To reduce this future effort and present a more common methodology, add a `struct flux_shell` data type as a container for other shell data types, and refactor the shell to pass this object to submodules, simplifying their implementation a bit. As part of this effort, fetching of jobspec and R is moved wholly into the info.c module, whether fetched from the command line or KVS. Command line options, such as jobid and the standalone and verbose flags, is moved up into the `struct flux_shell`.

Problem: Once the flux-shell starts adding reactor watchers other than those of libsubprocess, e.g. message handlers, signalfd watchers, etc., then exiting from the internal reactor when all tasks have completed will not be as straightforward. Instead of adding a single `is_complete()` function that needs to check for all tasks exited, all IO complete, etc, set up a system where shell components can take a "completion reference" on the shell. Once all existing completion refs are released, then the shell's reactor will be stopped manually with flux_reactor_stop(). The completion references are named to aid in future debugging. If a shell appears to be stuck, keys from the internal hash could be dumped to determine which components haven't completed.

Use flux_shell_add_completion_ref() to take a reference on the shell for each task executed. Drop the reference in the completion callback so that the shell's reactor is manually stopped after the final task exits. This doesn't change any functionality of the shell, but ensures the shell will still exit the reactor when new reactor references are attached to the shell in future commits (e.g. message watchers).

codecov-io · 2019-07-17T18:22:09Z

Codecov Report

Merging #2240 into master will increase coverage by 0.02%.
The diff coverage is 89.75%.

@@            Coverage Diff             @@
##           master    #2240      +/-   ##
==========================================
+ Coverage   80.74%   80.77%   +0.02%     
==========================================
  Files         209      209              
  Lines       32988    33037      +49     
==========================================
+ Hits        26637    26685      +48     
- Misses       6351     6352       +1

Impacted Files	Coverage Δ
src/shell/io.c	`100% <ø> (ø)`	⬆️
src/shell/rcalc.c	`85.32% <100%> (+0.08%)`	⬆️
src/shell/pmi.c	`74.45% <88.23%> (+0.18%)`	⬆️
src/shell/shell.c	`84.28% <89.62%> (-1.7%)`	⬇️
src/shell/info.c	`80.85% <90.47%> (+9.6%)`	⬆️
src/common/libutil/aux.c	`90.74% <0%> (-3.71%)`	⬇️
src/modules/connector-local/local.c	`73.67% <0%> (-0.15%)`	⬇️
src/cmd/flux-event.c	`77.97% <0%> (ø)`	⬆️
src/common/libflux/message.c	`80.75% <0%> (+0.12%)`	⬆️
src/common/libflux/mrpc.c	`88.97% <0%> (+1.18%)`	⬆️

garlick · 2019-07-17T20:27:54Z

Excellent!

grondo added 4 commits July 17, 2019 09:53

shell: rcalc: prevent segfault from rcalc_destroy (NULL)

cbbd6e7

Handle NULL passed to rcalc_destroy () for consistency with most other flux-core code.

garlick merged commit 5e7635b into flux-framework:master Jul 17, 2019

grondo deleted the shell-refactor branch July 18, 2019 01:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

job shell: add flux_shell_t and completion references #2240

job shell: add flux_shell_t and completion references #2240

grondo commented Jul 17, 2019

codecov-io commented Jul 17, 2019

garlick commented Jul 17, 2019

job shell: add flux_shell_t and completion references #2240

job shell: add flux_shell_t and completion references #2240

Conversation

grondo commented Jul 17, 2019

codecov-io commented Jul 17, 2019

Codecov Report

garlick commented Jul 17, 2019