Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux-job: watch guest output instead of doing a lookup #2332

Merged
merged 2 commits into from Aug 29, 2019

Conversation

@chu11
Copy link
Contributor

commented Aug 22, 2019

As discussed yesterday, this PR changes flux job attach to use a job-info eventlog watch instead of a job-info lookup of of the guest.output log. This is a step change for when job attach will attach to a running job and output stdout/err as it happens.

One corner case that came up in this change is when a job never runs, (e.g. resources are impossible), this would hang flux job attach b/c nothing ever gets output to guest.output. So we have to check to make sure the job actually ran (event "start" occurs).

One other issue that had to be solved in this change is how to determine that all output has completed. It's easy to count EOF occurrences, but flux job attach needs to know how man EOF occurrences to expect. For the time being, I just stuffed a task_count into the IO eventlog header.

If we think that this is a good idea, I can send a PR to update RFC24 to allow a task_count to be included in the header.

@chu11 chu11 changed the title [RFC] flux-job: watch guest output instead of doing a lookup [WIP] flux-job: watch guest output instead of doing a lookup Aug 22, 2019
@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 22, 2019

If we think that this is a good idea, I can send a PR to update RFC24 to allow a task_count to be included in the header.

Maybe instead of task_count something more generic like stream_count. This could cover the case where some or most output is directed elsewhere.

Ideally it would be nice to support generic open events, but I'm guessing that would clutter the log and also be a bit racy.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 22, 2019

Maybe instead of task_count something more generic like stream_count. This could cover the case where some or most output is directed elsewhere.

Good idea. Perhaps it should be separate counts, like stdout_count and stderr_count?

@chu11 chu11 force-pushed the chu11:guest_output_attach branch from 9226a8e to 41d8371 Aug 23, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 23, 2019

re-pushed, updated with RFC suggestion I made in flux-framework/rfc#199

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2019

This seems to be working fine. Since it modifies shell/io.c, it would be nice to get it in so these changes can be folded into #2335.

Ready for a merge?

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 26, 2019

it's ready, although should be agree to merge on flux-framework/rfc#199 first?

@chu11 chu11 changed the title [WIP] flux-job: watch guest output instead of doing a lookup flux-job: watch guest output instead of doing a lookup Aug 26, 2019
@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 26, 2019

Ok, the RFC 24 change was merged.

After looking this over again, I do have one comment. Would it be better if the output watch was installed before entering the reactor the first time? This way, live output should "just work" when the shell starts generating it, instead of requiring another modification to flux-job attach down the raod to start the watch before the job is complete.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 26, 2019

After looking this over again, I do have one comment. Would it be better if the output watch was installed before entering the reactor the first time? This way, live output should "just work" when the shell starts generating it, instead of requiring another modification to flux-job attach down the raod to start the watch before the job is complete.

That's a good idea. My initial thought was that b/c we don't have live output in a job yet, to just watch after the job is complete. Let me see if that works out.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 27, 2019

That's a good idea. My initial thought was that b/c we don't have live output in a job yet, to just watch after the job is complete. Let me see if that works out.

Oh wait, there's a reason. B/c the shell currently doesn't create guest.output until after the job is done. So the watch would get a ENOENT error.

I assume as we get further along in I/O support, guest.output will eventually be created far earlier in the process, alongside guest.exec.eventlog perhaps? Or is job-info "wait-create" support something that's going to be needed?

@chu11 chu11 force-pushed the chu11:guest_output_attach branch from 41d8371 to 7603f2f Aug 27, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 27, 2019

rebased & re-pushed. For the moment, no modifications based on last 3 comments.

@chu11 chu11 force-pushed the chu11:guest_output_attach branch 3 times, most recently from b479955 to 9cee98d Aug 27, 2019
@codecov-io

This comment has been minimized.

Copy link

commented Aug 27, 2019

Codecov Report

Merging #2332 into master will increase coverage by 0.03%.
The diff coverage is 81.48%.

@@            Coverage Diff             @@
##           master    #2332      +/-   ##
==========================================
+ Coverage   80.82%   80.85%   +0.03%     
==========================================
  Files         215      215              
  Lines       34254    34278      +24     
==========================================
+ Hits        27686    27716      +30     
+ Misses       6568     6562       -6
Impacted Files Coverage Δ
src/shell/io.c 76.22% <100%> (+0.69%) ⬆️
src/cmd/flux-job.c 85.6% <80.39%> (+0.06%) ⬆️
src/broker/modservice.c 70.67% <0%> (-0.76%) ⬇️
src/common/libflux/message.c 80.57% <0%> (-0.38%) ⬇️
src/cmd/flux-module.c 84.19% <0%> (+0.23%) ⬆️
src/modules/connector-local/local.c 74.42% <0%> (+1.15%) ⬆️
src/common/libflux/mrpc.c 88.97% <0%> (+1.18%) ⬆️
@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 28, 2019

Oh wait, there's a reason. B/c the shell currently doesn't create guest.output until after the job is done. So the watch would get a ENOENT error.

Ehh, good point.

Could the header be generated during io initialization in the shell in this PR?
Or is your preference just to wait for now?

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 28, 2019

Or is job-info "wait-create" support something that's going to be needed?

Not sure. I guess it might depend on how we want to support attach to a job that hasn't started yet. If we want to support it.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2019

Or is your preference just to wait for now?

This was supposed to be a PR for "demonstration" purposes, and to be that tiny step forward for flux job attach. Maybe it's too early for these changes b/c we don't know enough yet? OR do we just accept it for the simple "minor step forward" that it is? I can go either way on this.

@garlick

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

Since we already support waiting on the guest eventlog (or will soon), should we just have the shell append an event there when output is created? It could be done in the same KVS commit.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 28, 2019

Since we already support waiting on the guest eventlog (or will soon), should we just have the shell append an event there when output is created? It could be done in the same KVS commit.

I think that would work. Basically, don't watch the guest.output until the eventlog says its created? And obviously if the guest.exec.eventlog reaches an error, the guest output watch will never occur.

Let me see how easy it would be to add that.

@garlick

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

I think this is consistent with the intent of the guest eventlog. All RFC 16 has to say about it is:

exec.eventlog
An eventlog for the use of job shells, TBD.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 28, 2019

That seems fine, but it feels like the APIs should make it easier than this. As it is, what would a user tool outside of job attach have to do to, say, watch output from all a user's jobs (e.g. looking for a certain string)? I could be wrong but it seems like a lot of machinery currently.

I guess this can all be wrapped up in another API that hides the multiple wait-for-event, open next eventlog and wait-for-next event chains, but since we have the job-info service could we hide the complexity behind one method exported by that module?

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 28, 2019

I realized on the way to pick up lunch that my comment above was dumb. Sorry.

I guess my general comment that dealing with some of the job data is perhaps unnecessarily a bit tricky still stands, but was inappropriate here.

I'm thinking we should just merge this PR as is, since the more modification we do to the shell here, the more trouble it will make for me when rebasing #2335.

It also reminds me that the current "events" in the exec.eventlog are mostly dummy events. We should push some of these (after "starting") into the shell as suggested, so that they are meaningful. (Will try to open issue)

@garlick

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

@garlick

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

Does this work reliably then? E.g. there is no race where the watch on the output eventlog fails b/c the key doesn't exist yet? If it works, then I'm good with merging.

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 29, 2019

Does this work reliably then? E.g. there is no race where the watch on the output eventlog fails b/c the key doesn't exist yet? If it works, then I'm good with merging.

No race at the moment, b/c it the watch against the guest output is AFTER the job has finished.

Edit: The only "gotcha" that has to be done is to ensure the job actually started. if the job didn't start, guest.output is never created.

chu11 added 2 commits Aug 22, 2019
Per RFC24 update, add stdout and stderr stream counts to io header.
In preparation for future live output with job attach, refactor
flux-job to use flux_job_event_watch() for printing output from
a job instead of using a simple lookup of all output.

A small corner case that results in this change is that if the
job never ran (e.g. resource request impossible), this change could
hang as no events will ever occur in the output log.  To resolve this,
we must check to ensure that the job actually started.
@chu11 chu11 force-pushed the chu11:guest_output_attach branch from 9cee98d to d6352ba Aug 29, 2019
@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 29, 2019

rebased

@chu11

This comment has been minimized.

Copy link
Contributor Author

commented Aug 29, 2019

restarting a builder, hit what I assume was #2195

@grondo

This comment has been minimized.

Copy link
Contributor

commented Aug 29, 2019

Ok, merging since @garlick said it was ok. ;-)

@grondo grondo merged commit 7c0e403 into flux-framework:master Aug 29, 2019
2 checks passed
2 checks passed
Summary 1 potential rule
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick garlick referenced this pull request Sep 30, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.