Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job shell: add flux_shell_t and completion references #2240

Merged
merged 4 commits into from
Jul 17, 2019

Conversation

grondo
Copy link
Contributor

@grondo grondo commented Jul 17, 2019

This PR addresses #2237.

I apologize in advance because this will be a difficult PR to review. I attempt to make a minor advance by adding a struct flux_shell structure to contain the independing io info and pmi structures used by the various components of the shell, but ended up making more changes than I wanted. Perhaps with the luxury of more time, this PR could be made cleaner, but hopefully after some rounds of feedback this approach could be acceptable.

As noted above, this branch adds a new struct flux_shell structure as container for the most of the other structures used by the shell. This structure can then be passed to callbacks that may require access to multiple shell components and data, without requiring those callers to use custom structures or push multiple pointers into the reactor or handle aux hashes.

The struct flux_shell may also serve as an opaque handle for plugins in the future (along with a set of accessors as yet unwritten), and thus a flux_shell_t typedef is used as the start of a public interface. Later when we need this we can install flux/shell.h and move the struct flux_shell definition internal.

Once the flux_shell_t structure was defined, many of the init functions were updated to take this function as an argument. This resulted in cascading changes that, unfortunately, come in one big commit. To help review, here's a summary of the changes:

  • move cmdline parsing and basic option checking into a function shell_parse_cmdline()
  • move shell "options" (verbose, standalone) set on cmdline to bools in flux_shell_t
  • set jobid in shell->jobid
  • move destruction of components of flux_shell_t into shell_finalize()
  • move fetch of jobspec and R fully into shell_info_create() (whether set on cmdline or fetched via kvs) (this is probably a majority of the code change, since some functions were moved from shell.c to info.c)

Finally, to address #2237, a concept from wrexecd was borrowed to add a "completion reference" interface for the shell's reactor. Once all completion refs that are taken are dropped, the shell's reactor is manually stopped with flux_reactor_stop(). The references are also named as a future debugging aid. Initially, a reference is taken for each task, and released in the task completion handler.

Fixes #2237.

Handle NULL passed to rcalc_destroy () for consistency with most other
flux-core code.
Problem: As the job shell grows in complexity, functions and callbacks
may need to access more than one of the shell internal datatatypes
at a time, potentially including flux handle, reactor, shell job
information etc. Currently there is no easy or common way for internal
code to accomplish this without use of custom data types, adding globals,
et.c

To reduce this future effort and present a more common methodology,
add a `struct flux_shell` data type as a container for other shell
data types, and refactor the shell to pass this object to submodules,
simplifying their implementation a bit.

As part of this effort, fetching of jobspec and R is moved wholly into
the info.c module, whether fetched from the command line or KVS.

Command line options, such as jobid and the standalone and verbose flags,
is moved up into the `struct flux_shell`.
Problem: Once the flux-shell starts adding reactor watchers other
than those of libsubprocess, e.g. message handlers, signalfd watchers,
etc., then exiting from the internal reactor when all tasks have
completed will not be as straightforward.

Instead of adding a single `is_complete()` function that needs
to check for all tasks exited, all IO complete, etc, set up a system
where shell components can take a "completion reference" on the shell.
Once all existing completion refs are released, then the shell's reactor
will be stopped manually with flux_reactor_stop().

The completion references are named to aid in future debugging. If a
shell appears to be stuck, keys from the internal hash could be dumped
to determine which components haven't completed.
Use flux_shell_add_completion_ref() to take a reference on the shell
for each task executed. Drop the reference in the completion callback
so that the shell's reactor is manually stopped after the final task
exits.

This doesn't change any functionality of the shell, but ensures the
shell will still exit the reactor when new reactor references are
attached to the shell in future commits (e.g. message watchers).
@codecov-io
Copy link

Codecov Report

Merging #2240 into master will increase coverage by 0.02%.
The diff coverage is 89.75%.

@@            Coverage Diff             @@
##           master    #2240      +/-   ##
==========================================
+ Coverage   80.74%   80.77%   +0.02%     
==========================================
  Files         209      209              
  Lines       32988    33037      +49     
==========================================
+ Hits        26637    26685      +48     
- Misses       6351     6352       +1
Impacted Files Coverage Δ
src/shell/io.c 100% <ø> (ø) ⬆️
src/shell/rcalc.c 85.32% <100%> (+0.08%) ⬆️
src/shell/pmi.c 74.45% <88.23%> (+0.18%) ⬆️
src/shell/shell.c 84.28% <89.62%> (-1.7%) ⬇️
src/shell/info.c 80.85% <90.47%> (+9.6%) ⬆️
src/common/libutil/aux.c 90.74% <0%> (-3.71%) ⬇️
src/modules/connector-local/local.c 73.67% <0%> (-0.15%) ⬇️
src/cmd/flux-event.c 77.97% <0%> (ø) ⬆️
src/common/libflux/message.c 80.75% <0%> (+0.12%) ⬆️
src/common/libflux/mrpc.c 88.97% <0%> (+1.18%) ⬆️

@garlick
Copy link
Member

garlick commented Jul 17, 2019

Excellent!

@garlick garlick merged commit 5e7635b into flux-framework:master Jul 17, 2019
@grondo grondo deleted the shell-refactor branch July 18, 2019 01:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

job shell needs a way to stop reactor when all tasks are complete
3 participants