Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-manager: centralize job state machine #2067

Merged
merged 13 commits into from Mar 9, 2019

Conversation

@garlick
Copy link
Member

commented Mar 7, 2019

This PR is a refactoring of the job manager to put all of the job state machine logic in event.c rather than sprinkling it all around. Hopefully this will make it easier to understand and modify. Parts of the system like the request handlers, and handlers for alloc/free responses now simply log events, and the event now is not only committed to the KVS, but also drives a state machine update, which in turn initiates new actions.

One other change that wasn't strictly required: The priority and "raise" RPC's now respond to the user before the event has been committed to the KVS. This was to simplify some code that seemed gratuitously complex to me as I was refactoring. However it caused a race in the t2202-job-manager.t test, which then had to be addressed by borrowing a clever shell function from @grondo's scheduler test. Hopefully these tests can be simplified after we get a user facing tool for waiting on events.

I struggled with structuring this PR so that commits were reviewable and didn't entirely get there. I'm happy to try again if the result is hard to look at.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

Just pushed a couple of inline doc fixes that will need to be squashed. I was thinking of squashing the "use event_job_action(), event_job_update()" commits into one. Any opinion on that one @grondo?

@grondo

This comment has been minimized.

Copy link
Contributor

commented Mar 8, 2019

Sorry, I only just started looking at PR. Overall it is nicely structured!

I think it is fine to squash the commits you speak of into one.

My only early question is whether it now makes sense to rename event_log() since "logging" the event is only one of the actions the function now triggers? (not sure I have a better name but something like event_job_trigger or event_job_post?) I don't know, keep in mind that is an early question and maybe without merit.

Otherwise, so far the change looks great and the PR is organized nicely.

@garlick

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

Hmm event_job_post() sounds like an improvement. Thanks! I always struggle with the names.

@garlick garlick force-pushed the garlick:job_fsm branch from 34e5a18 to 8f1653f Mar 8, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

forced a push with some general tidying up and the suggested function name change.

garlick added 13 commits Mar 5, 2019
Move the submit request handler out of job-manager.c
(the mainline) to its own source, submit.c.  This
aligns its structure with other request handlers
in the job manager.

The code itself didn't change substantively.
Move the remaining portion of the restart logic out
of job-manager.c, into restart.c.  This aligns its
structure with other "sub modules" of the job manager.
Change job_create() so that it creates the
struct job with default values that can be updated
afterwards.  This allows some reduction in complexity
at call sites.

Update submit request handler and tests.
Move the code used to replay the job eventlog
during job manager restart into event.c with the
goal of centralizing the logic that drives state
transitions with events.

In this commit it is still only used from
job_create_from_eventlog() during restart.
Modify event_log() to accept a 'struct job' rather
than just the job id, to prepare for event_log()
centrally handling state transitions/actions.
Expose new idempotent functions for enqueuing jobs
for scheduler alloc, or sending scheduler free
requests, to prepare for being driven by a centralized
state machine.
Add a new function to take any needed actions
following job state change, for example allocating
or freeing resources.  It makes use of the
idempotent alloc/free interfaces just added.
Replace code sprinkled around job manager
for taking action after an event occurs
with a call to event_job_action().

This is one step along the way towards centralizing
the job state machine.
Problem: event_log() callback is needed even if all
it does is log the error.

Change the event_log() KVS commit continuation to
log an error and stop the reactor on failure, so
that a callback is only needed if the user needs
to be notified when the event has been committed.

Drop event_log callbacks in alloc, priority, and
raise source modules.  In priority and raise,
this changes the control flow so that a response
may be received before the eventlog update is
finalized, which affects the timing of some tests.
Modify event_log() so that it calls event_job_update()
and event_job_action() before appending event to the
current KVS commit batch.  Rename to event_job_post()
to reflect its changed role.

Modify callers so that they no longer need to call
event_job_update() or event_job_action() directly
if they are calling event_job_post().

This accomplishes the goal of centralizing the "state
machine" logic for jobs in event.c.

Update inline docs.

Cleanup:  drop the event_completion_f typedef that is
identical to flux_completion_f.
Now that flux_job_cancel() doesn't guarantee that the
exception has landed in the eventlog before returning,
borrow a synchronization function from t2300-sched-simple.t
and insert it at points where tests have now become racy.
Break up some large functions in submit.c and export them
with a submit_ prefix for unit testing and improved clarity.
@garlick garlick force-pushed the garlick:job_fsm branch from 8f1653f to f2be15f Mar 8, 2019
@garlick

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

Rebased on current master and added a unit test for submit-related functions.

@codecov-io

This comment has been minimized.

Copy link

commented Mar 8, 2019

Codecov Report

Merging #2067 into master will increase coverage by 0.05%.
The diff coverage is 82.63%.

@@            Coverage Diff             @@
##           master    #2067      +/-   ##
==========================================
+ Coverage   80.42%   80.47%   +0.05%     
==========================================
  Files         191      192       +1     
  Lines       30343    30246      -97     
==========================================
- Hits        24403    24341      -62     
+ Misses       5940     5905      -35
Impacted Files Coverage Δ
src/modules/job-manager/job-manager.c 70.73% <100%> (+11.96%) ⬆️
src/modules/job-manager/submit.c 66.66% <66.66%> (ø)
src/modules/job-manager/job.c 93.93% <71.42%> (-4.53%) ⬇️
src/modules/job-manager/priority.c 89.65% <75%> (+5.44%) ⬆️
src/modules/job-manager/raise.c 79.54% <80%> (+10.42%) ⬆️
src/modules/job-manager/restart.c 86.36% <86.66%> (-0.43%) ⬇️
src/modules/job-manager/event.c 73.54% <94.11%> (+3.17%) ⬆️
src/modules/job-manager/alloc.c 74.48% <96%> (+0.72%) ⬆️
src/common/libutil/aux.c 90.74% <0%> (-3.71%) ⬇️
... and 6 more
@garlick

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

Ready for a review I think.

@garlick garlick requested a review from grondo Mar 8, 2019
@grondo
grondo approved these changes Mar 9, 2019
Copy link
Contributor

left a comment

LGTM! I poked and prodded and didn't find any issues.

@grondo

This comment has been minimized.

Copy link
Contributor

commented Mar 9, 2019

May I press the button or is there more work here?

@garlick

This comment has been minimized.

Copy link
Member Author

commented Mar 9, 2019

@grondo grondo merged commit c59f78a into flux-framework:master Mar 9, 2019
4 checks passed
4 checks passed
Mergify — Summary 1 potential rule
Details
codecov/patch 82.63% of diff hit (target 80.42%)
Details
codecov/project 80.47% (+0.05%) compared to 1a6e9dd
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member Author

commented Mar 9, 2019

@garlick garlick deleted the garlick:job_fsm branch Mar 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.