New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
job-manager: centralize job state machine #2067
Conversation
Just pushed a couple of inline doc fixes that will need to be squashed. I was thinking of squashing the "use event_job_action(), event_job_update()" commits into one. Any opinion on that one @grondo? |
Sorry, I only just started looking at PR. Overall it is nicely structured! I think it is fine to squash the commits you speak of into one. My only early question is whether it now makes sense to rename Otherwise, so far the change looks great and the PR is organized nicely. |
Hmm |
forced a push with some general tidying up and the suggested function name change. |
Move the submit request handler out of job-manager.c (the mainline) to its own source, submit.c. This aligns its structure with other request handlers in the job manager. The code itself didn't change substantively.
Move the remaining portion of the restart logic out of job-manager.c, into restart.c. This aligns its structure with other "sub modules" of the job manager.
Change job_create() so that it creates the struct job with default values that can be updated afterwards. This allows some reduction in complexity at call sites. Update submit request handler and tests.
Move the code used to replay the job eventlog during job manager restart into event.c with the goal of centralizing the logic that drives state transitions with events. In this commit it is still only used from job_create_from_eventlog() during restart.
Modify event_log() to accept a 'struct job' rather than just the job id, to prepare for event_log() centrally handling state transitions/actions.
Expose new idempotent functions for enqueuing jobs for scheduler alloc, or sending scheduler free requests, to prepare for being driven by a centralized state machine.
Add a new function to take any needed actions following job state change, for example allocating or freeing resources. It makes use of the idempotent alloc/free interfaces just added.
Replace code sprinkled around job manager for taking action after an event occurs with a call to event_job_action(). This is one step along the way towards centralizing the job state machine.
Problem: event_log() callback is needed even if all it does is log the error. Change the event_log() KVS commit continuation to log an error and stop the reactor on failure, so that a callback is only needed if the user needs to be notified when the event has been committed. Drop event_log callbacks in alloc, priority, and raise source modules. In priority and raise, this changes the control flow so that a response may be received before the eventlog update is finalized, which affects the timing of some tests.
Modify event_log() so that it calls event_job_update() and event_job_action() before appending event to the current KVS commit batch. Rename to event_job_post() to reflect its changed role. Modify callers so that they no longer need to call event_job_update() or event_job_action() directly if they are calling event_job_post(). This accomplishes the goal of centralizing the "state machine" logic for jobs in event.c. Update inline docs. Cleanup: drop the event_completion_f typedef that is identical to flux_completion_f.
Now that flux_job_cancel() doesn't guarantee that the exception has landed in the eventlog before returning, borrow a synchronization function from t2300-sched-simple.t and insert it at points where tests have now become racy.
Break up some large functions in submit.c and export them with a submit_ prefix for unit testing and improved clarity.
Rebased on current master and added a unit test for submit-related functions. |
Codecov Report
@@ Coverage Diff @@
## master #2067 +/- ##
==========================================
+ Coverage 80.42% 80.47% +0.05%
==========================================
Files 191 192 +1
Lines 30343 30246 -97
==========================================
- Hits 24403 24341 -62
+ Misses 5940 5905 -35
|
Ready for a review I think. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I poked and prodded and didn't find any issues.
May I press the button or is there more work here? |
Please do :-)
…On Fri, Mar 8, 2019, 6:20 PM Mark Grondona ***@***.***> wrote:
May I press the button or is there more work here?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#2067 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKX240whqNQW7oVqXcYziM3ZcMpOI-dks5vUxqGgaJpZM4bkWfM>
.
|
Thanks!
…On Sat, Mar 9, 2019, 7:33 AM Mark Grondona ***@***.***> wrote:
Merged #2067 <#2067> into
master.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#2067 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKX295AAz-83PJqPJqY7wQoY9KwCz2Kks5vU9RfgaJpZM4bkWfM>
.
|
This PR is a refactoring of the job manager to put all of the job state machine logic in
event.c
rather than sprinkling it all around. Hopefully this will make it easier to understand and modify. Parts of the system like the request handlers, and handlers for alloc/free responses now simply log events, and the event now is not only committed to the KVS, but also drives a state machine update, which in turn initiates new actions.One other change that wasn't strictly required: The priority and "raise" RPC's now respond to the user before the event has been committed to the KVS. This was to simplify some code that seemed gratuitously complex to me as I was refactoring. However it caused a race in the
t2202-job-manager.t
test, which then had to be addressed by borrowing a clever shell function from @grondo's scheduler test. Hopefully these tests can be simplified after we get a user facing tool for waiting on events.I struggled with structuring this PR so that commits were reviewable and didn't entirely get there. I'm happy to try again if the result is hard to look at.