job-manager: centralize job state machine #2067
This PR is a refactoring of the job manager to put all of the job state machine logic in
One other change that wasn't strictly required: the priority and "raise" RPCs now respond to the user before the event has been committed to the KVS. This simplified some code that seemed gratuitously complex to me as I was refactoring. However, it caused a race in the
I struggled with structuring this PR so that commits were reviewable and didn't entirely get there. I'm happy to try again if the result is hard to look at.
Sorry, I only just started looking at this PR. Overall it is nicely structured!
I think it is fine to squash the commits you speak of into one.
My only early question is whether it now makes sense to rename
Otherwise, so far the change looks great and the PR is organized nicely.
Move the submit request handler out of job-manager.c (the mainline) to its own source, submit.c. This aligns its structure with other request handlers in the job manager. The code itself didn't change substantively.
Move the remaining portion of the restart logic out of job-manager.c, into restart.c. This aligns its structure with other "sub modules" of the job manager.
Change job_create() so that it creates the struct job with default values that can be updated afterwards. This allows some reduction in complexity at call sites. Update submit request handler and tests.
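As a rough illustration of the "create with defaults, update afterwards" pattern described in this commit, here is a minimal sketch. The `toy_job` struct, its fields, and the default priority value are assumptions made up for this example; they are not the actual `struct job` or `job_create()` from the job manager.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down job record (NOT the real struct job). */
enum toy_state { TOY_STATE_NEW = 0, TOY_STATE_SCHED, TOY_STATE_INACTIVE };

#define TOY_PRIORITY_DEFAULT 16 /* assumed default, for illustration only */

struct toy_job {
    uint64_t id;
    uint32_t userid;
    int priority;
    enum toy_state state;
    int refcount;
};

/* Create a job populated with defaults.  Call sites no longer pass every
 * field as an argument; they set only the fields they know about. */
struct toy_job *toy_job_create (void)
{
    struct toy_job *job = calloc (1, sizeof (*job));
    if (!job)
        return NULL;
    job->priority = TOY_PRIORITY_DEFAULT;
    job->state = TOY_STATE_NEW;
    job->refcount = 1;
    return job;
}

/* Drop one reference and free the job when the last one goes away. */
void toy_job_decref (struct toy_job *job)
{
    if (!job)
        return;
    assert (job->refcount > 0);
    if (--job->refcount == 0)
        free (job);
}
```

A submit handler written against this sketch would call `toy_job_create ()` and then assign `id`, `userid`, and `priority` from the request, while a restart path would fill the same fields in from the replayed eventlog.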
Move the code used to replay the job eventlog during job manager restart into event.c with the goal of centralizing the logic that drives state transitions with events. In this commit it is still only used from job_create_from_eventlog() during restart.
Modify event_log() to accept a 'struct job' rather than just the job id, to prepare for event_log() centrally handling state transitions/actions.
Expose new idempotent functions for enqueuing jobs for scheduler alloc, or sending scheduler free requests, to prepare for being driven by a centralized state machine.
Add a new function to take any needed actions following job state change, for example allocating or freeing resources. It makes use of the idempotent alloc/free interfaces just added.
Replace code sprinkled around job manager for taking action after an event occurs with a call to event_job_action(). This is one step along the way towards centralizing the job state machine.
Problem: event_log() callback is needed even if all it does is log the error. Change the event_log() KVS commit continuation to log an error and stop the reactor on failure, so that a callback is only needed if the user needs to be notified when the event has been committed. Drop event_log callbacks in alloc, priority, and raise source modules. In priority and raise, this changes the control flow so that a response may be received before the eventlog update is finalized, which affects the timing of some tests.
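The control flow described here, where the commit continuation handles failure itself and the caller's callback becomes optional, might look roughly like the sketch below. The `toy_reactor` struct and the function names are invented for illustration; the real code uses the Flux reactor and KVS commit futures, which are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for an event loop handle (hypothetical). */
struct toy_reactor {
    bool stopped;
};

/* Optional notification that the event has been committed. */
typedef void (*toy_completion_f)(void *arg);

/* Continuation run when a (simulated) commit finishes.  On failure it logs
 * the error and stops the reactor, so callers that only wanted error
 * logging can pass cb == NULL. */
void toy_commit_continuation (struct toy_reactor *r,
                              int commit_rc,
                              toy_completion_f cb,
                              void *arg)
{
    assert (r != NULL);
    if (commit_rc < 0) {
        fprintf (stderr, "eventlog commit failed, stopping reactor\n");
        r->stopped = true;
        return;
    }
    if (cb)
        cb (arg);
}

/* Example callback: count how many times the caller was notified. */
void toy_count_cb (void *arg)
{
    (*(int *)arg)++;
}
```

With this shape, a handler that responds to the user right away passes `NULL` for the callback, which is why the response can now arrive before the eventlog update is finalized.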
Modify event_log() so that it calls event_job_update() and event_job_action() before appending event to the current KVS commit batch. Rename to event_job_post() to reflect its changed role. Modify callers so that they no longer need to call event_job_update() or event_job_action() directly if they are calling event_job_post(). This accomplishes the goal of centralizing the "state machine" logic for jobs in event.c. Update inline docs. Cleanup: drop the event_completion_f typedef that is identical to flux_completion_f.
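To make the ordering in this commit concrete (update state, take action, then append the event to the commit batch), here is a self-contained sketch. The state names, the transition table, and the string buffer standing in for the KVS commit batch are all simplifying assumptions; only the update/action/append ordering is taken from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum toy_state { TOY_NEW, TOY_SCHED, TOY_RUN, TOY_CLEANUP, TOY_INACTIVE };

struct toy_job {
    enum toy_state state;
    int actions; /* how many times the action hook ran */
};

/* A string buffer stands in for the pending KVS commit batch. */
struct toy_batch {
    char log[256];
};

/* Apply an event to the job state (toy transition table, assumed). */
static int toy_job_update (struct toy_job *job, const char *name)
{
    if (!strcmp (name, "submit") && job->state == TOY_NEW)
        job->state = TOY_SCHED;
    else if (!strcmp (name, "alloc") && job->state == TOY_SCHED)
        job->state = TOY_RUN;
    else if (!strcmp (name, "exception") && job->state == TOY_RUN)
        job->state = TOY_CLEANUP;
    else if (!strcmp (name, "free") && job->state == TOY_CLEANUP)
        job->state = TOY_INACTIVE;
    else
        return -1; /* event is not valid in the current state */
    return 0;
}

/* Placeholder for state-driven side effects such as alloc or free. */
static int toy_job_action (struct toy_job *job)
{
    job->actions++;
    return 0;
}

/* Append one event name per line; fail without modifying the log if full. */
static int toy_batch_append (struct toy_batch *batch, const char *name)
{
    size_t used = strlen (batch->log);
    size_t avail = sizeof (batch->log) - used;
    int n = snprintf (batch->log + used, avail, "%s\n", name);
    if (n < 0 || (size_t)n >= avail) {
        batch->log[used] = '\0'; /* undo the truncated write */
        return -1;
    }
    return 0;
}

/* Single entry point: callers post an event and nothing else. */
int toy_job_post (struct toy_batch *batch,
                  struct toy_job *job,
                  const char *name)
{
    assert (batch != NULL && job != NULL && name != NULL);
    if (toy_job_update (job, name) < 0)
        return -1;
    if (toy_job_action (job) < 0)
        return -1;
    return toy_batch_append (batch, name);
}
```

Because callers go through the single post function, there is one place to audit when asking which events are legal in which state and what side effects they trigger.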
Now that flux_job_cancel() doesn't guarantee that the exception has landed in the eventlog before returning, borrow a synchronization function from t2300-sched-simple.t and insert it at points where tests have now become racy.
Break up some large functions in submit.c and export them with a submit_ prefix for unit testing and improved clarity.
@@            Coverage Diff             @@
##           master    #2067      +/-   ##
==========================================
+ Coverage   80.42%   80.47%   +0.05%
==========================================
  Files         191      192       +1
  Lines       30343    30246      -97
==========================================
- Hits        24403    24341      -62
+ Misses       5940     5905      -35
Please do :-)…
On Fri, Mar 8, 2019, 6:20 PM Mark Grondona wrote: May I press the button or is there more work here?
On Sat, Mar 9, 2019, 7:33 AM Mark Grondona wrote: Merged #2067 into master.