Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: add job submit debug flag #2047

Merged
merged 7 commits into from Feb 28, 2019

Conversation

@garlick
Copy link
Member

commented Feb 27, 2019

This PR adds a debug flag to the submission API, and a --debug option to flux job submit as proposed in #2033.

This causes the job manager to emit some "extra" events that are not justified for every job but which might be useful to provide context when debugging. For now, it just adds alloc-request and free-request debug events indicating that the sched.alloc or sched.free request has been sent. Normally events are only logged when the response is received.

The sharness test with the dummy scheduler is modified to verify that these events are added as expected for sched mode=single.

This scheme may be too simple, but perhaps its a reasonable first step?

@garlick garlick force-pushed the garlick:job_debug_flag branch 2 times, most recently from de49f62 to f7a253c Feb 27, 2019
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 27, 2019

This scheme may be too simple, but perhaps its a reasonable first step?

In general, I really like the idea of debug events in eventlogs. I think this will make easier to debug issues as well as compute timing of different stages in the job lifecycle when looking for problems.

Some general thoughts while poking at this PR -- a lot of these are more thinking out loud rather than directed comments

  • Do you plan for the debug flag to be passed to the scheduler and eventually exec system as well?
  • I'd suggest flux job submit --submit-flags=debug or similar instead, since it would more easily extended when support for other flags is added
  • The per-job debug flag seems like a good choice now, but for a full-featured Flux, I wonder if having the flag be per-job would be used? Seems like an administrator or developer would want to flip debug mode on and off for all jobs when there is a perceived issue?
  • Readability of the submit event might be improved if the flags were sent as an array of strings instead of an integer mask. This approach is also much more extensible (at the expense of string compares or having a flags hash instead of bitmask)
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 27, 2019

One other thought -- what if debug events were named debug.<event> instead of just debug?
E.g. debug.alloc-request -- the only benefit here is that it would make it easier to work with these entries, as I'm not sure there is a use case for waiting on these events (though I don't see why we should try to prevent it)

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 27, 2019

Thanks for the feedback.

Do you plan for the debug flag to be passed to the scheduler and eventually exec system as well?

I did propagate the "submit flags" to the eventlog, and added them to the struct job in the job-manager. That means they will persist across the job manager being reloaded, and be available at the point where RPCs are sent to the scheduler, dependency system, or exec system, if we decide we want them. At this point I was trying to keep this change simple so I didn't go further than that.

I'd suggest flux job submit --submit-flags=debug or similar instead, since it would more easily extended when support for other flags is added

Good idea, I'll make that change.

Seems like an administrator or developer would want to flip debug mode on and off for all jobs when there is a perceived issue?

Good thought. There are some open issues on how to manage configuration that we may need to work through first though. This is a good use case to consider in that discussion...

Readability of the submit event might be improved if the flags were sent as an array of strings instead of an integer mask. This approach is also much more extensible (at the expense of string compares or having a flags hash instead of bitmask)

It's a fair point, though I'm concerned about keeping the size of the job manager's struct job small enough so millions of them can be in the queue in memory. Maybe all the flags don't need to be represented there? Hmm more thought required.

One other thought -- what if debug events were named debug. instead of just debug?
E.g. debug.alloc-request -- the only benefit here is that it would make it easier to work with these entries, as I'm not sure there is a use case for waiting on these events (though I don't see why we should try to prevent it)

It would be valuable to be able to synchronize on these events in a sharness test, so maybe that's a good idea!

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 27, 2019

So were you thinking that the current

flux_future_t *flux_job_submit (flux_t *h, const char *jobspec,
                                int priority, int flags);

should be changed to something like:

flux_future_t *flux_job_submit (flux_t *h, const char *jobspec,
                                int priority, const char *flags_list);

and that the list of flags would go right into the submit event as is?

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 27, 2019

though I'm concerned about keeping the size of the job manager's struct job small enough so millions of them can be in the queue in memory. Maybe all the flags don't need to be represented there?

Ah, good point! I wonder if internally the job manager could still store the flags it cares about as a mask, but set the flags from a json array? At this point sending the integer flags is probably fine though!

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 28, 2019

Hm, I wasn't thinking in terms of API but rather the message format. I'm not sure my idea translates well to that style of api sorry.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 28, 2019

Hm, I wasn't thinking in terms of API but rather the message format. I'm not sure my idea translates well to that style of api sorry.

My thinking got a bit off track here.. For any subsystem that is reading jobspec anyway, extra options and flags selected by a user can be placed in the system dictionary of the jobspec -- so I don't think there is nearly as much need for extending these submission flags as I was thinking.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 28, 2019

Restarted ubuntu builder that failed here:

PASS: t1000-kvs.t 138 - kvs: get --label works
# passed all 138 test(s)
1..138
ERROR: t1000-kvs.t - exited with status 137 (terminated by signal 9?)

I'm not sure what is segfaulting, unless maybe it's the broker as the instance is shut down?

garlick added 6 commits Feb 21, 2019
Add a new flag to job.h for flux_job_submit(),
FLUX_JOB_DEBUG, to enable eventlog debugging.

Then add flux job submit [--flags=[debug,...]]
option to allow it to be set on the command
line at submit time.

Fixes #2033.
Add 'flags' to job's submit eventlog entry,
and include flags with job-manager.submit RPC.

Flags are checked for validity at ingest,
and rejected immediately if invalid.
Allow FLUX_JOB_DEBUG through.
Now that job-ingest is sending over submit flags
and logging them with the submit event, the job-manager
can capture these flags in 'struct job'.

Update unit tests.
If job was submitted with debug flag, log debug.alloc-request
and debug.free-request  events to the job's eventlog when
alloc and free requests are sent.  These are in addition to
the alloc and free events that are logged when the responses
are received.
Enhance t2203-job-manager-dummysched.t sharness test:
* submit all jobs with --flags=debug option
* make existing event pattern matching more precise
* with two state=SCHED jobs in queue, check that only
  one logged an 'debug.alloc-request' (mode=single)
@garlick garlick force-pushed the garlick:job_debug_flag branch from 0cdd897 to c8abbaf Feb 28, 2019
@grondo

This comment has been minimized.

Copy link
Contributor

commented Feb 28, 2019

Want to add a --flags option to t/ingest/submitbench as well? (Useful command for benchmarking, the debug events will make it even more useful)

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 28, 2019

I squashed down the incremental development and forced a push.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 28, 2019

Want to add a --flags option to t/ingest/submitbench as well? (Useful command for benchmarking, the debug events will make it even more useful)

Yes good idea!

Add the same flag to submitbench that was just added
to flux-job submit.
@grondo
grondo approved these changes Feb 28, 2019
Copy link
Contributor

left a comment

Thanks! Looks good to merge

@garlick

This comment has been minimized.

Copy link
Member Author

commented Feb 28, 2019

Sounds good, thanks!

@codecov-io

This comment has been minimized.

Copy link

commented Feb 28, 2019

Codecov Report

Merging #2047 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2047      +/-   ##
==========================================
+ Coverage    80.4%   80.42%   +0.02%     
==========================================
  Files         191      191              
  Lines       30240    30252      +12     
==========================================
+ Hits        24313    24329      +16     
+ Misses       5927     5923       -4
Impacted Files Coverage Δ
src/modules/job-manager/job-manager.c 59.13% <100%> (ø) ⬆️
src/modules/job-ingest/job-ingest.c 72.56% <100%> (ø) ⬆️
src/modules/job-manager/job.c 98.46% <100%> (+0.07%) ⬆️
src/modules/job-manager/alloc.c 73.36% <100%> (+0.5%) ⬆️
src/cmd/flux-job.c 87.98% <100%> (+0.26%) ⬆️
src/common/libflux/message.c 81.27% <0%> (+0.12%) ⬆️
src/modules/barrier/barrier.c 78.62% <0%> (+2.06%) ⬆️
@grondo grondo merged commit 52cb275 into flux-framework:master Feb 28, 2019
4 checks passed
4 checks passed
Mergify — Summary 1 potential rule
Details
codecov/patch 100% of diff hit (target 80.4%)
Details
codecov/project 80.42% (+0.02%) compared to 7bf0343
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick garlick deleted the garlick:job_debug_flag branch Mar 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.