job-manager: centralize job state machine #2067

garlick · 2019-03-07T23:48:07Z

This PR is a refactoring of the job manager to put all of the job state machine logic in event.c rather than sprinkling it all around. Hopefully this will make it easier to understand and modify. Parts of the system like the request handlers, and handlers for alloc/free responses now simply log events, and the event now is not only committed to the KVS, but also drives a state machine update, which in turn initiates new actions.

One other change that wasn't strictly required: The priority and "raise" RPC's now respond to the user before the event has been committed to the KVS. This was to simplify some code that seemed gratuitously complex to me as I was refactoring. However it caused a race in the t2202-job-manager.t test, which then had to be addressed by borrowing a clever shell function from @grondo's scheduler test. Hopefully these tests can be simplified after we get a user facing tool for waiting on events.

I struggled with structuring this PR so that commits were reviewable and didn't entirely get there. I'm happy to try again if the result is hard to look at.

garlick · 2019-03-08T15:47:42Z

Just pushed a couple of inline doc fixes that will need to be squashed. I was thinking of squashing the "use event_job_action(), event_job_update()" commits into one. Any opinion on that one @grondo?

grondo · 2019-03-08T16:13:56Z

Sorry, I only just started looking at PR. Overall it is nicely structured!

I think it is fine to squash the commits you speak of into one.

My only early question is whether it now makes sense to rename event_log() since "logging" the event is only one of the actions the function now triggers? (not sure I have a better name but something like event_job_trigger or event_job_post?) I don't know, keep in mind that is an early question and maybe without merit.

Otherwise, so far the change looks great and the PR is organized nicely.

garlick · 2019-03-08T17:09:08Z

Hmm event_job_post() sounds like an improvement. Thanks! I always struggle with the names.

garlick · 2019-03-08T18:24:32Z

forced a push with some general tidying up and the suggested function name change.

Move the submit request handler out of job-manager.c (the mainline) to its own source, submit.c. This aligns its structure with other request handlers in the job manager. The code itself didn't change substantively.

Move the remaining portion of the restart logic out of job-manager.c, into restart.c. This aligns its structure with other "sub modules" of the job manager.

Change job_create() so that it creates the struct job with default values that can be updated afterwards. This allows some reduction in complexity at call sites. Update submit request handler and tests.

Move the code used to replay the job eventlog during job manager restart into event.c with the goal of centralizing the logic that drives state transitions with events. In this commit it is still only used from job_create_from_eventlog() during restart.

Modify event_log() to accept a 'struct job' rather than just the job id, to prepare for event_log() centrally handling state transitions/actions.

Expose new idempotent functions for enqueuing jobs for scheduler alloc, or sending scheduler free requests, to prepare for being driven by a centralized state machine.

Add a new function to take any needed actions following job state change, for example allocating or freeing resources. It makes use of the idempotent alloc/free interfaces just added.

Replace code sprinkled around job manager for taking action after an event occurs with a call to event_job_action(). This is one step along the way towards centralizing the job state machine.

Problem: event_log() callback is needed even if all it does is log the error. Change the event_log() KVS commit continuation to log an error and stop the reactor on failure, so that a callback is only needed if the user needs to be notified when the event has been committed. Drop event_log callbacks in alloc, priority, and raise source modules. In priority and raise, this changes the control flow so that a response may be received before the eventlog update is finalized, which affects the timing of some tests.

Modify event_log() so that it calls event_job_update() and event_job_action() before appending event to the current KVS commit batch. Rename to event_job_post() to reflect its changed role. Modify callers so that they no longer need to call event_job_update() or event_job_action() directly if they are calling event_job_post(). This accomplishes the goal of centralizing the "state machine" logic for jobs in event.c. Update inline docs. Cleanup: drop the event_completion_f typedef that is identical to flux_completion_f.

Now that flux_job_cancel() doesn't guarantee that the exception has landed in the eventlog before returning, borrow a synchronization function from t2300-sched-simple.t and insert it at points where tests have now become racy.

Break up some large functions in submit.c and export them with a submit_ prefix for unit testing and improved clarity.

garlick · 2019-03-08T22:08:16Z

Rebased on current master and added a unit test for submit-related functions.

codecov-io · 2019-03-08T22:28:13Z

Codecov Report

Merging #2067 into master will increase coverage by 0.05%.
The diff coverage is 82.63%.

@@            Coverage Diff             @@
##           master    #2067      +/-   ##
==========================================
+ Coverage   80.42%   80.47%   +0.05%     
==========================================
  Files         191      192       +1     
  Lines       30343    30246      -97     
==========================================
- Hits        24403    24341      -62     
+ Misses       5940     5905      -35

Impacted Files	Coverage Δ
src/modules/job-manager/job-manager.c	`70.73% <100%> (+11.96%)`	⬆️
src/modules/job-manager/submit.c	`66.66% <66.66%> (ø)`
src/modules/job-manager/job.c	`93.93% <71.42%> (-4.53%)`	⬇️
src/modules/job-manager/priority.c	`89.65% <75%> (+5.44%)`	⬆️
src/modules/job-manager/raise.c	`79.54% <80%> (+10.42%)`	⬆️
src/modules/job-manager/restart.c	`86.36% <86.66%> (-0.43%)`	⬇️
src/modules/job-manager/event.c	`73.54% <94.11%> (+3.17%)`	⬆️
src/modules/job-manager/alloc.c	`74.48% <96%> (+0.72%)`	⬆️
src/common/libutil/aux.c	`90.74% <0%> (-3.71%)`	⬇️
... and 6 more

garlick · 2019-03-08T22:43:19Z

Ready for a review I think.

grondo

LGTM! I poked and prodded and didn't find any issues.

grondo · 2019-03-09T02:20:54Z

May I press the button or is there more work here?

garlick · 2019-03-09T04:24:10Z

Please do :-)

…

On Fri, Mar 8, 2019, 6:20 PM Mark Grondona ***@***.***> wrote: May I press the button or is there more work here? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#2067 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAKX240whqNQW7oVqXcYziM3ZcMpOI-dks5vUxqGgaJpZM4bkWfM> .

garlick · 2019-03-09T15:49:28Z

Thanks!

…

On Sat, Mar 9, 2019, 7:33 AM Mark Grondona ***@***.***> wrote: Merged #2067 <#2067> into master. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#2067 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAKX295AAz-83PJqPJqY7wQoY9KwCz2Kks5vU9RfgaJpZM4bkWfM> .

garlick force-pushed the job_fsm branch from 34e5a18 to 8f1653f Compare March 8, 2019 18:24

garlick added 13 commits March 8, 2019 12:56

job-manager: relocate submit handler

e047cf2

Move the submit request handler out of job-manager.c (the mainline) to its own source, submit.c. This aligns its structure with other request handlers in the job manager. The code itself didn't change substantively.

job-manager: refactor restart logic

8c3c368

Move the remaining portion of the restart logic out of job-manager.c, into restart.c. This aligns its structure with other "sub modules" of the job manager.

job-manager: simplify job_create()

cdc982f

Change job_create() so that it creates the struct job with default values that can be updated afterwards. This allows some reduction in complexity at call sites. Update submit request handler and tests.

job-manager: change event_log() id param to job

d4cf3be

Modify event_log() to accept a 'struct job' rather than just the job id, to prepare for event_log() centrally handling state transitions/actions.

job-manager: expose idempotent alloc/free functions

25c1714

Expose new idempotent functions for enqueuing jobs for scheduler alloc, or sending scheduler free requests, to prepare for being driven by a centralized state machine.

job-manager: add event_job_action()

130bf37

Add a new function to take any needed actions following job state change, for example allocating or freeing resources. It makes use of the idempotent alloc/free interfaces just added.

job-manager: use event_job_action()

71a5e80

Replace code sprinkled around job manager for taking action after an event occurs with a call to event_job_action(). This is one step along the way towards centralizing the job state machine.

testsuite: add synchronization to job-manager test

581fbee

Now that flux_job_cancel() doesn't guarantee that the exception has landed in the eventlog before returning, borrow a synchronization function from t2300-sched-simple.t and insert it at points where tests have now become racy.

job-manager: refactor submit internals for testability

74e8c4e

Break up some large functions in submit.c and export them with a submit_ prefix for unit testing and improved clarity.

testsuite: add job-manager submit unit test

f2be15f

garlick force-pushed the job_fsm branch from 8f1653f to f2be15f Compare March 8, 2019 22:07

garlick requested a review from grondo March 8, 2019 22:43

grondo approved these changes Mar 9, 2019

View reviewed changes

grondo merged commit c59f78a into flux-framework:master Mar 9, 2019

garlick deleted the job_fsm branch March 9, 2019 16:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

job-manager: centralize job state machine #2067

job-manager: centralize job state machine #2067

garlick commented Mar 7, 2019

garlick commented Mar 8, 2019

grondo commented Mar 8, 2019

garlick commented Mar 8, 2019

garlick commented Mar 8, 2019

garlick commented Mar 8, 2019

codecov-io commented Mar 8, 2019

garlick commented Mar 8, 2019

grondo left a comment

grondo commented Mar 9, 2019

garlick commented Mar 9, 2019 via email

garlick commented Mar 9, 2019 via email

job-manager: centralize job state machine #2067

job-manager: centralize job state machine #2067

Conversation

garlick commented Mar 7, 2019

garlick commented Mar 8, 2019

grondo commented Mar 8, 2019

garlick commented Mar 8, 2019

garlick commented Mar 8, 2019

garlick commented Mar 8, 2019

codecov-io commented Mar 8, 2019

Codecov Report

garlick commented Mar 8, 2019

grondo left a comment

Choose a reason for hiding this comment

grondo commented Mar 9, 2019

garlick commented Mar 9, 2019 via email

garlick commented Mar 9, 2019 via email