Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add job-manager and corresponding API and flux-job subcommands #1774

Merged
merged 16 commits into from Nov 10, 2018

Conversation

Projects
None yet
4 participants
@garlick
Copy link
Member

garlick commented Oct 26, 2018

This PR adds a job-manager module that handles submit requests from ingest module(s) and builds a queue. It orders the queue by 1) job priority, 2) jobid.

It supports the following operations

  • list - list queue (entire thing or some number from head of queue), with configurable attributes
  • purge - remove a job (including all KVS vestiges) if it hasn't yet been seen by the scheduler
  • priority - set priority of a job that has already been submitting (changing its queue position)

The module can be unloaded and reloaded, and it rebuilds its state.

Each operation is going to need more work once we have scheduler allocations and exec requests involved. I tried to keep the module functionality divided up in multiple source files to make it easier to follow as we add to it.

This is intended for early comments. It needs more detail in the commit messages. It needs more extensive testing.

There are some comments about the operations in list.c, purge.c, and priority.c, including caveats for the initial implementation.

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from c3fa38a to 4aca759 Oct 27, 2018

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Oct 27, 2018

Codecov Report

Merging #1774 into master will increase coverage by <.01%.
The diff coverage is 80.94%.

@@            Coverage Diff             @@
##           master    #1774      +/-   ##
==========================================
+ Coverage    79.7%   79.71%   +<.01%     
==========================================
  Files         189      196       +7     
  Lines       34672    35223     +551     
==========================================
+ Hits        27637    28079     +442     
- Misses       7035     7144     +109
Impacted Files Coverage Δ
src/modules/job-manager/job.c 100% <100%> (ø)
src/modules/job-manager/job-manager.c 68.85% <68.85%> (ø)
src/modules/job-manager/active.c 73.33% <73.33%> (ø)
src/modules/job-manager/purge.c 76.36% <76.36%> (ø)
src/modules/job-manager/list.c 80.95% <80.95%> (ø)
src/modules/job-manager/priority.c 81.96% <81.96%> (ø)
src/common/libjob/job.c 68.18% <82.6%> (+7.71%) ⬆️
src/cmd/flux-job.c 90.41% <90.36%> (+2.51%) ⬆️
src/modules/job-manager/queue.c 91.04% <91.04%> (ø)
src/modules/barrier/barrier.c 76.55% <0%> (-2.07%) ⬇️
... and 11 more

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from 8a338d7 to 7297e1c Oct 29, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Oct 29, 2018

Coverage is a bit better now. Any objections if I squash the incremental devel and rebase on current master?

@garlick garlick force-pushed the garlick:flux_job_manager7 branch 2 times, most recently from 62a3c91 to 8f79a17 Oct 29, 2018

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Oct 29, 2018

Go ahead and rebase away!

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from 8f79a17 to de8bfc2 Oct 30, 2018

@@ -336,7 +336,7 @@ static void state_cb (flux_subprocess_t *p, flux_subprocess_state_t state)
log_msg ("%d (pid %d) %s", cli->rank, pid, strsignal (SIGCONT));
else if (WIFSIGNALED (status))
log_msg ("%d (pid %d) %s", cli->rank, pid, strsignal (WTERMSIG (status)));
else if (WIFEXITED (status))
else if (WIFEXITED (status) && WEXITSTATUS (status) != 0)

This comment has been minimized.

@grondo

grondo Oct 30, 2018

Contributor

See the update in #1749. This function also needs the first erroneous WIFSIGNALED conditional removed, and there is no point in testing for WIFCONTINUED and WIFSTOPPED

This comment has been minimized.

@garlick

garlick Oct 31, 2018

Author Member

Good catch! Fixed and squashed.

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from de8bfc2 to 7da883e Oct 31, 2018

@garlick garlick requested review from grondo and SteVwonder Oct 31, 2018

@SteVwonder
Copy link
Member

SteVwonder left a comment

LGTM other than the job.c license header change.

Show resolved Hide resolved src/common/libjob/job.c

@garlick garlick referenced this pull request Nov 1, 2018

Closed

license: make libraries LGPL #83

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from 7da883e to 90735c0 Nov 1, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 1, 2018

Rebased on current master.

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from 90735c0 to 84e45d1 Nov 1, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 1, 2018

Rebased on current master again.

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 2, 2018

Restarted builder that failed on a write error.

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from 84e45d1 to 19b6621 Nov 5, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 5, 2018

Rebased on current master.

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from 19b6621 to f72962f Nov 7, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 7, 2018

Rebased on current master.

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from f72962f to 62cbf8b Nov 9, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 9, 2018

Rebased on current master.

@grondo

grondo approved these changes Nov 9, 2018

Copy link
Contributor

grondo left a comment

I finally had time to poke at this and it seems really solid.

Job manger even restarts flawlessly in most of my "let's see if this works" tests!

Nice job. I'll be out for the next couple days so anyone feel free to merge this if it is time.

garlick added some commits Oct 8, 2018

libjob: add flux_job_purge()
Add API function in front of job-manager.purge RPC.
libjob: add flux_job_set_priority()
Add API function in front of job-manager.priority RPC.
testsuite: job-ingest use test_under_flux kvs
Problem: as we build out the new exec system,
"test_under_flux job" loads more modules than
we want for standalone job-ingest testing.

Use "test_under_flux_kvs" and load ingest module
manually.
testsuite: load job-manager in test_under_flux job
Problem: t2201-job-cmd.t cannot test list, priority,
and purge subcommands unless both job-ingest and
job-manager modules are loaded.

Add job-manager (rank 0 only) to rc1-job, rc3-job.
testsuite: exercise new flux-job subcommands
Also fix a problem where a bad FLUX_URI setting
that was supposed to be local to a test was set
for the entire script.

@garlick garlick force-pushed the garlick:flux_job_manager7 branch from 62cbf8b to 5ba3054 Nov 10, 2018

@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 10, 2018

Thanks! I went ahead and rebased on current master. It would be great if someone would press the button on this one.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 10, 2018

Maybe I can do something useful this week. Merging!

@grondo grondo merged commit 0666d61 into flux-framework:master Nov 10, 2018

3 checks passed

codecov/patch 80.94% of diff hit (target 79.7%)
Details
codecov/project 79.71% (+<.01%) compared to 49c3b6a
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@garlick

This comment has been minimized.

Copy link
Member Author

garlick commented Nov 10, 2018

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.