Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue1331: libsubprocess redesign #1564

Merged
merged 12 commits into from
Sep 12, 2018
Merged

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Jul 4, 2018

A checkpoint of where I'm at on #1331, didn't want to continue too far ahead without atleast some comment.

  • only have done the "local" support so far, except the channel fd stuff, mostly wanted to concentrate on the subprocess i/o, state changes, and unit tests

  • I decided to require the user to specify a stdout/stderr callback all the time. Most of the time users will probably just use the provided flux_subprocess_output. If they don't specify anything, there will be no stdout/stderr.

  • I sort of dislike that the stdio callbacks have a int stream argument. But I couldn't think of anything better.

  • A couple of commits to allow a flag so all flux_buffer_t reads & peeks to be NUL terminated. Just sort of convenient.

  • you'll notice a test_echo utility test tool. I got tired of trying to figure out the best way to do some unit tests w/ just tools in /bin and just caved and wrote something simple.

  • @grondo no changes to your initial command work.

@coveralls
Copy link

coveralls commented Jul 4, 2018

Coverage Status

Coverage decreased (-0.2%) to 79.367% when pulling af2d277 on chu11:issue1331 into 1209a02 on flux-framework:master.

@chu11
Copy link
Member Author

chu11 commented Jul 6, 2018

pushed a commit with local channel fd support, not the cleanest I would have liked the code to be, but not bad. I'll bring up a design issue in issue #1331 for discussion.

@chu11 chu11 mentioned this pull request Jul 6, 2018
@chu11
Copy link
Member Author

chu11 commented Jul 11, 2018

this push completes (I think) all "local" libsubprocess work. Now all stdin, stdout, stderr, and "channel" support is basically the same. All use the subprocess_read/write functions. stdin/out/err are simply treated as special channels that are write/read only. And there are now options for channels for buffer size and if all reads should be NUL terminated.

@grondo
Copy link
Contributor

grondo commented Jul 11, 2018

stdin/out/err are simply treated as special channels that are write/read only.

Nice! I kind of like that result.

@chu11
Copy link
Member Author

chu11 commented Aug 10, 2018

Just pushed a tree with initial implementation of subprocess with a flux_rexec() implementation. Wanted to push this current point and get some initial comments.

A few notes:

  • I know it's one giant commit, will divide it up later.

  • For the time being, I wrote a exec2 broker module. Obviously would be converted over to exec at a later time. Tests are named exec2 as well.

  • There are two FAILED state changes: EXEC_FAILED and RUNNING_FAILED. The former is the failure of exec(), while the latter is sort of a catch all "fail" if something goes wrong after the job has entered the running state (buffer overflow, ENOMEM, etc.).

  • I decided that on_completion does not get called if a process ever reaches a EXEC_FAILED or RUNNING_FAILED state.

  • @grondo, I noticed you had an uuid field in your original subprocess prototype. Did you have a purpose for this in mind? Perhaps some type of subprocess manager that manages both local & remote processes? I didn't use it for the moment (and likewise don't have a subprocess manager yet).

  • At the moment no way to get the process list on a remote rank, not sure if that's important.

  • Possible next steps, not counting things listed above? Convert old subprocess code to the new library? Write a new flux-exec tool?

@chu11
Copy link
Member Author

chu11 commented Aug 13, 2018

just re-pushed with the commits split up a bit

@chu11
Copy link
Member Author

chu11 commented Aug 17, 2018

Pushed a new branch with more tests and a few bug fixes. Also elected to change RUNNING_FAILED to just FAILED, as a catch all for all internal errors that aren't really in a user's control (ENOMEM, EPROTO, etc.). Next step is to replace cmb.exec and flux-exec.

@chu11
Copy link
Member Author

chu11 commented Aug 28, 2018

This is the mostly completed subprocess work, with code converted to use the new flux subprocess library globally, and all old usage removed.

Some minor highlights:

  • introduced a flag FLUX_SUBPROCESS_FLAGS_STDIO_FALLTHROUGH, which will inform locally generated subprocesses to not create STDIN, STDOUT, or STDERR channels, instead letting the parent's fallthrough to the child.

  • added a flux_cmd_arg() function

  • the completion callback is now called via a prep/check handler as well.

  • per discussion with @grondo, flux ps has been removed

  • various minor corner case fixes along the way.

In some later commits, or perhaps a cleanup in another PR, I'd like to try and cleanup/refactor some code surrounding my conversions from the old libsubprocess to the new one. There are many tiny things (such as using argz, or json, etc. etc.) that was designed into code given the old libsubprocess API or the old cmb.exec service. A bunch of that code could use an update.

I have not yet removed the old libsubprocess library, b/c some code still uses zio. Should we move zio somewhere else? Or should we convert lingering zio code to buffer reactors?

@chu11 chu11 force-pushed the issue1331 branch 15 times, most recently from 1bf646d to 306d93e Compare August 29, 2018 06:08
@chu11
Copy link
Member Author

chu11 commented Aug 29, 2018

Made a change where flux_rexec() can take < 0 to launch flux_exec(). This is convenient b/c it allows the user to pass a flux_t *h to flux_exec(). And if flux_t *h is available, it allows more logging to go to the flux log. Most notably anything in side the broker that uses flux_exec() (runlevels, cmb.rexec) can have all subprocess library errors goto the log.

I mainly did this b/c I'm trying to figure out one lonely error in travis.

t0015-cron.t 6 - cron delete leaves running task - --kill works

consistently fails on travis but not on catalyst.

For some reason, when the subprocess (here sleep 100) is killed, stdout/stderr EOFs are never seen by the caller. (It ends up there was a race in the test originally, so I have a patch to fix that too in here, but it ends up that wasn't the main reason the test was failing.)

@grondo
Copy link
Contributor

grondo commented Sep 11, 2018

@chu11, any other updates pending here, or is this ready to be merged?

I think any other minor cleanups or changes could go in as future PRs. Any major changes to API or anything like that should probably go in this PR before the first version is merged (I just couldn't remember where we were on this one)

@garlick
Copy link
Member

garlick commented Sep 11, 2018

I meant to say earlier that this is a nice big chunk of solid work @chu11. Nice.

I'm fine with merging this if @grondo is happy. We can always keep going in future PRs if there are still few loose ends.

@chu11
Copy link
Member Author

chu11 commented Sep 11, 2018

Oh, I was intending to fixup some of the bigger items mentioned above. But if you're ok doing it in future PRs, how about I fixup the smaller nits you mention above. Then I'll create tickets for the bigger items you mention and re-push.

Add support for a new flux subprocess API that merges both
local subprocess support (in the old "libsubprocess") and
remote subprocess support and remote subprocess server
support.

Support a new cmd API to encapsulate the commands that callers
wish to launch in subprocces.

By Albert Chu and Mark Grondona
Add support of the new "exec2" module, which is similar to the old
"exec" module but uses the new flux subprocess library.  This module
will be used for transition and eventually be renamed to "exec".
Add unit tests to test the flux subprocess library's remote execution.
Convert to use cmb.exec2 service via the new flux subprocess
library, instead of using the cmb.exec service.
Convert to use cmb.exec2 service via the new flux subprocess
library, instead of using the cmb.exec service.
Original tool was developed primarily for testing.  For new
cmb.exec2 broker, a test specific ps tool was developed.
Replace flux-exec command with a new flux-exec command based on the
new subprocess library.
Adjust all instances of "exec2" and replace with "exec" appropriately.
Convert runlevel to launch local subprocesses via the new
subprocess library.

Remove lingering uses of libsubprocess in main broker as well.
@chu11
Copy link
Member Author

chu11 commented Sep 12, 2018

Just pushed a mergeable tree, fixing a few of the nits that @grondo noticed above. They were so minor I just went ahead and squashed all the changes. The five remaining mini-big future fixes are:

  • collapse on_channel_out, on_stdout, and on_stderr into one function (presumably on_out).

  • Default stdout/stderr to go to parent stdout/stderr, resulting in flux_subprocess_output() no longer needing to be a public function.

  • flux_subprocess_destroy() should go back to its original, a #define pointing to flux_subprocess_unref(). Any code that requires a destroy function taking a void * should create a wrapper destroy function.

  • cron module should cache current working directory on module load. This will be included in cron module cleanup after this PR.

  • channel environment variables should not be made by appending _FD to them. Users should simply define the channel name to be the environment variable they want it to be.

@chu11
Copy link
Member Author

chu11 commented Sep 12, 2018

hmmm, missed the diff target this time around, just a bit though. Maybe hit an unlucky set of racy bits.

@grondo
Copy link
Contributor

grondo commented Sep 12, 2018

Yeah, uncovered parts of the diff are 99% error handling. The only thing that stuck out to me is that FLUX_SUBPROCESS_FAILED case didn't seem to be covered in rexec_state_change_cb, but I'm not sure how easy it would be to trigger that in testing. Not a big deal at this point.

@grondo
Copy link
Contributor

grondo commented Sep 12, 2018

One thing I noticed when reading through the diff again is that the return code of -1 for flux_subprocess_signaled() when !WIFSIGNALED() might be surprising at first, so at least a comment should be added in the header (tendency might be to use if flux_subprocess_signaled(p))

Also, there are 4 places in the code where 128 is added to the exit code (with a reference to why in the comments, nice!) when a process is signaled, which makes me wonder if the old subprocess_exit_code() took the right approach by doing that internally and saving some repetition? I'm not sure...

Otherwise, this looks good to merge, I'll press the button in the morning.

@chu11
Copy link
Member Author

chu11 commented Sep 12, 2018

Also, there are 4 places in the code where 128 is added to the exit code (with a reference to why in the comments, nice!) when a process is signaled, which makes me wonder if the old subprocess_exit_code() took the right approach by doing that internally and saving some repetition? I'm not sure...

Good point. I wanted to remove the +128 b/c that is a "style" choice on returning a status, but you are correct I'm repeating its use. I'll put that on the ponder list.

@grondo
Copy link
Contributor

grondo commented Sep 12, 2018

Ok, merging. Thanks for this huge effort!

@grondo grondo merged commit bdf296f into flux-framework:master Sep 12, 2018
@SteVwonder
Copy link
Member

WOO! Nice work @chu11! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants