job-manager: add inactive job purge capability #4286
Conversation
This is great! This does seem to have an impact on job throughput, though I'm not sure if it is something to worry about at this point.

```
$ flux jobs --stats-only
0 running, 32768 completed, 0 failed, 0 pending
$ src/test/throughput.py -sn 4096
4096 jobs: 4096 submitted, 0 running, 4096 completed
number of jobs: 4096
submit time: 3.690 s (1110.1 job/s)
script runtime: 29.725s
job runtime: 29.180s
throughput: 140.4 job/s (script: 137.8 job/s)
```

vs this branch:

```
$ flux jobs --stats-only
0 running, 32768 completed, 0 failed, 0 pending
$ src/test/throughput.py -sn 4096
4096 jobs: 4096 submitted, 0 running, 4096 completed
number of jobs: 4096
submit time: 3.444 s (1189.3 job/s)
script runtime: 35.197s
job runtime: 34.764s
throughput: 117.8 job/s (script: 116.4 job/s)
```

Purging jobs (down to 1024) in this case restores the throughput:

```
$ flux job purge --num-limit=1024 --batch=100
purged 35840 inactive jobs, 1024 remaining
$ flux jobs --stats-only
0 running, 36864 completed, 0 failed, 0 pending
$ src/test/throughput.py -sn 4096
4096 jobs: 4096 submitted, 0 running, 4096 completed
number of jobs: 4096
submit time: 3.719 s (1101.5 job/s)
script runtime: 27.975s
job runtime: 27.574s
throughput: 148.5 job/s (script: 146.4 job/s)
```

Since the purge is really only meant for the system instance, which is by design the least common instance type, I wonder if we should consider enabling the job-manager inactive cache only by configuration? Having said that, the throughput numbers here may not even be relevant, since the throughput of real jobs is limited in other ways by the reality of executing actual processes. However, since we've worked so hard to keep these numbers high in the past, I wanted to bring this up so we stay aware of the issue.
Thank you for testing that! That result is a little surprising to me. I wonder if the insertion into the ordered inactive list is the cause.
Some other random thoughts:
Hm, you may also want to make sure the result is reproducible for you!
Good observations!
I do have a development branch in which the journal is modified to include inactive jobs (optionally) so that job-list doesn't have to scan the KVS. In that branch, the entire eventlog for each job is retained in memory and so a new journal consumer can completely "catch up" with the job manager and track ongoing job events in one stream. In that scenario, it would make sense to add a message to the journal stream when a job gets purged even though it's not an event per se. That would let a journal consumer like job-list keep the size of its caches in check without the need for something separate like #3688. This is an "open loop" or eventually consistent design, so the situation you describe could still arise although it would be far less likely. I should probably rebase that branch on top of this and post it as a WIP. If nothing else it's something concrete to contribute to the discussion about #4273 (job-db). |
Thanks for that update! I apologize, I now realize my comment was a bit off-topic for this specific PR. |
Not at all - that was a good observation and a loose end I failed to articulate.
I'm not really seeing it, or at least the variability in results is too high to really conclude anything. Here are my throughput results:

I might be missing something, or maybe I should have tried with more jobs?
Yeah, I was not getting that much variability on my system. You could try more jobs, since your system is clearly double the speed of mine. But if it is not reproducible for you, then this might not be an actual issue.
High level comment: is this something we want to hold off on until we have job archiving (such as #4273) more in place? Or is the idea that since this is configurable, the default behavior stays the same and we don't have to worry about archiving? If users want to lose inactive data in the KVS permanently, that's OK.
Yeah, this is what I was thinking. I might go a little further and recommend that our system instance early adopters purge at 1 week or so, just to ensure that we don't have an unsustainable situation. Then we'll add in some form of offline garbage collection and we will be on a healthy trajectory. job-archive will grab all the data pretty aggressively, so as long as the configured purge age is on the order of days, we shouldn't be losing data as far as flux-accounting is concerned.

However, it's up for debate. Purging would limit the ability to ask what or who ran on a "bad node" beyond the purge period, for example. Maybe we can swoop in with a good job-db solution in the near term that would allow longer term preservation and querying of job data?
I did reproduce the slowdown by testing with more jobs. Here's some data that shows it on my system:
I was using the list backwards, so I had to scan the whole thing to insert the next job! With that fixed (last column), the numbers at scale are better:

I'll post the fix shortly. I wanted to add a test to ensure the oldest jobs are purged first, since I was purging the youngest and no tests were catching it.
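For illustration, here is a small self-contained Python sketch of why insertion direction matters for a list kept ordered by purge time (this is not the actual zlistx code; it just models the idea): scanning from the newest end makes each insert effectively O(1) when jobs complete in roughly chronological order, while scanning from the oldest end walks the entire list every time.

```python
def insert_ordered(times, t_clean):
    """Insert t_clean into times, which is kept sorted oldest-first.

    Scanning from the tail (newest end) terminates almost immediately
    when entries arrive in roughly increasing time order; scanning from
    the head would walk the whole list on every insert.
    """
    i = len(times)
    while i > 0 and times[i - 1] > t_clean:
        i -= 1
    times.insert(i, t_clean)

times = []
for t in (1.0, 2.0, 3.5, 3.0, 4.0):  # mostly in order, one straggler
    insert_ordered(times, t)
print(times)  # [1.0, 2.0, 3.0, 3.5, 4.0]
```

With mostly-chronological arrivals, the while loop usually exits after zero or one comparisons, which is the behavior the fix restores.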
Nice catch!
Hmm, the bionic - py3.7,clang-6.0 builder seems to have gotten stuck in the broker unit tests.
It failed again, but in the purge sharness test. I realized I have a test that depends on a previous test that has the NO_CHAIN_LINT conditional, so I'm adding that to the other dependent test.
I had purging set to 8h on my test system and just spent a minute confused. Hmm, I wonder if it would be OK to do something simple for now, like publish the list of purged jobs so caches in job-list can be pruned?
It was pretty easy to implement, so I went ahead and pushed a change to have the job manager publish an event listing job IDs being purged, and then have job-list subscribe to that message and remove jobs. This invites comparison to PR #3688 (pending forever), where a lot of work was done to implement job-list purging. I think the job manager is the correct place to perform the purge, since it also solves the problem of the KVS growing without bound and is the "source of truth" for a lot of other job information. But some good discussion happened over in that PR, e.g.
I'm not sure either of those points makes as much sense when the job manager is doing the purging, but I thought I'd raise them in case there should be some discussion, or if other aspects of that PR come to mind that would be applicable here.

Edit: just realized that the job-list stats are not updated, so purged jobs will still be included in `flux jobs --stats-only`.
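To make the pruning idea concrete, here is a hypothetical Python sketch (illustrative names only, not the actual flux-core API) of how a consumer like job-list might drop entries from its index hash and inactive list when a `job-purge-inactive` event arrives:

```python
class JobCache:
    """Toy model of a job-list-style cache of inactive jobs."""

    def __init__(self):
        self.index = {}      # jobid -> cached job info
        self.inactive = []   # jobids ordered by completion time

    def add_inactive(self, jobid, info):
        self.index[jobid] = info
        self.inactive.append(jobid)

    def on_purge_event(self, event):
        """Handle a purge notification carrying a list of purged IDs."""
        purged = set(event["jobs"])
        for jobid in purged:
            self.index.pop(jobid, None)  # tolerate already-missing IDs
        self.inactive = [j for j in self.inactive if j not in purged]

cache = JobCache()
for jid in (1, 2, 3):
    cache.add_inactive(jid, {"id": jid})
cache.on_purge_event({"jobs": [1, 2]})
print(sorted(cache.index))  # [3]
```

The event-driven design is "open loop": the cache converges on the job manager's view without a request/response cycle, which is why a consumer must tolerate IDs it has never seen.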
```c
 * It doesn't seem worth the effort to structure the code to avoid this
 * due to high complexity of solution, low probability of error, and
 * minor consequences.
 */
```
I didn't have a chance to fully review this PR yet, but `FLUX_TEST_VALGRIND=t ./t2809-job-purge.t` caught a leak of `jobs` on successful return here.
```
==89443== 1,168 (320 direct, 848 indirect) bytes in 8 blocks are definitely lost in loss record 21 of 29
==89443==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==89443==    by 0x4A25070: json_array (in /usr/lib/x86_64-linux-gnu/libjansson.so.4.13.0)
==89443==    by 0xA39A93C: purge_inactive_jobs (purge.c:110)
==89443==    by 0xA39AC50: purge_request_cb (purge.c:283)
==89443==    by 0x487B8B7: call_handler (msg_handler.c:344)
==89443==    by 0x487BC49: dispatch_message (msg_handler.c:380)
==89443==    by 0x487BC49: handle_cb (msg_handler.c:481)
==89443==    by 0x48AACE2: ev_invoke_pending (ev.c:3770)
==89443==    by 0x48AE02F: ev_run (ev.c:4190)
==89443==    by 0x48AE02F: ev_run (ev.c:4021)
==89443==    by 0x487A9FE: flux_reactor_run (reactor.c:128)
==89443==    by 0xA392C11: mod_main (job-manager.c:220)
==89443==    by 0x119B20: module_thread (module.c:183)
==89443==    by 0x4AC4B42: start_thread (pthread_create.c:442)
```
Ah, most likely that was in the code I just added for updating job-list. Thank you for catching that!
Fixed (squashed into 3966478)
This LGTM! I ran through some random manual testing and didn't find any issues. Even tried running a set of `flux job purge` commands in parallel with a small batch count and everything worked great!

Since `flux job purge` is an unrecoverable operation, I wonder if we want to require a `--force` option, but otherwise report how many jobs would be purged. I'm not sure how difficult that would be to add. Another option would be an `are you sure? (yes/no)` type prompt, unless `--force` is given. I don't think this has to be done before merging this PR, it was just something that occurred to me.
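The suggested safety check could be sketched roughly like this in Python (`confirm_purge` is a hypothetical helper invented for illustration, not part of flux-core):

```python
def confirm_purge(count, force=False, ask=input):
    """Report how many jobs would be purged and require either a
    force flag or an interactive yes/no answer before proceeding."""
    if force:
        return True
    answer = ask(f"purge {count} inactive jobs? (yes/no) ")
    return answer.strip().lower() in ("y", "yes")

# A force flag skips the prompt entirely:
print(confirm_purge(42, force=True))  # True
```

Injecting the `ask` callable keeps the prompt testable without a terminal, which is one way a CLI like this could stay covered by sharness-style tests.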
src/modules/job-manager/purge.c (outdated)

```c
                           "batch", &batch) < 0)
        goto error;
    if (age_limit != INACTIVE_AGE_UNLIMITED && age_limit < 0) {
        errmsg = "age limit must be >= 0 (0=unlimited)";
```
Is `errmsg` correct? (`INACTIVE_AGE_UNLIMITED` is `-1`, not `0`, but perhaps there's some subtlety I missed)
src/modules/job-manager/purge.c
Outdated
goto error; | ||
} | ||
if (num_limit != INACTIVE_NUM_UNLIMITED && num_limit < 0) { | ||
errmsg = "num limit must be >= 0 (0=unlimited)"; |
Same question as above, isn't `-1` unlimited?
Oops, yeah, I changed the values but didn't update the message!
```c
    return errprintf (error,
                      "job-manager.inactive-num-limit: must be >= 0");
```
Should the error message be `job-manager.inactive-num-limit: must be >= -1`?
Actually, for the TOML config my thought was that users wouldn't specify a limit if they don't want one. Does that sound OK? Otherwise we should probably define a special value for unlimited age too, like "unlimited".
That makes sense to me.
I like the idea of reporting the number of jobs purged only, unless
Problem: various job manager services cannot distinguish incorrect job IDs from inactive jobs. Instead of removing jobs when they become inactive, store them in an inactive hash. For now, jobs remain there until the job manager is unloaded.
Problem: various job manager requests return "unknown job id" on an inactive but valid jobid. Now that inactive jobs are available in their own hash, try to look up the jobid in the inactive hash and if found, return "job is inactive" instead.
Problem: there is no way to externally probe how many active and inactive jobs the job manager is tracking. Add active_jobs, inactive_jobs keys to the stats object so that 'flux module stats job-manager' can show the number of jobs of each type that are tracked.
Problem: inactive job purging should be based on the time the job actually enters INACTIVE state but that time is not captured. Add job->t_clean which captures the timestamp on the clean event, which is the only event that transitions a job to inactive.
Problem: a zlistx comparator function used to order jobs by priority is named job_comparator(), which makes naming another comparator difficult. Rename internal function to job_priority_comparator(). Update users.
Problem: the overhead of inactive jobs grows without bound in a system instance.

Add a mechanism to trim the oldest inactive jobs from the job manager internal hash and KVS. There are two ways to do it. One way is to make a job-manager.purge RPC specifying either a maximum number of retained inactive jobs, or a maximum age (time since transition to inactive) of retained inactive jobs, or both. The other way is to configure those criteria via TOML, e.g.

[job-manager]
inactive-age-limit = "7d"
inactive-num-limit = 10000

The KVS transactions used to unlink batches of jobs are limited to 100 entries to avoid giant messages and head of line blocking in the broker. When purging via configuration, this limits the number of jobs purged per heartbeat interval. When purging via RPC, the caller specifies the batch size (up to the maximum) and must check the returned job count, repeating the RPC until a count less than the batch size is returned.

Note: KVS job.* directories that become empty after all the jobs in them have been purged are NOT removed by the purge. This can be accomplished later by an offline dump/restore of the content backing store, since empty directories are not included in dumps.
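The repeat-until-done RPC pattern described above can be sketched in Python; the `purge_rpc` callable here is a hypothetical stand-in for the `job-manager.purge` RPC (not the real flux bindings), so only the loop structure is the point:

```python
def purge_all(purge_rpc, batch_size=100):
    """Repeatedly issue purge requests until a batch smaller than
    batch_size comes back, meaning no more jobs match the limits."""
    total = 0
    while True:
        count = purge_rpc(batch_size)  # number of jobs purged this round
        total += count
        if count < batch_size:
            return total

def make_fake_purge(total_jobs):
    """Stand-in for the purge RPC: pretends total_jobs inactive jobs
    currently exceed the configured limits."""
    state = {"remaining": total_jobs}
    def fake_purge(batch):
        n = min(batch, state["remaining"])
        state["remaining"] -= n
        return n
    return fake_purge

print(purge_all(make_fake_purge(250)))  # 250, purged in batches of 100, 100, 50
```

Note the termination condition: a full batch does not prove the job manager is done, so the caller must keep going until it receives a short batch, which keeps each individual KVS transaction small.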
Problem: there is no convenient way to clean unwanted inactive jobs if purge limits were not configured.

Add 'flux job purge', which can interactively remove inactive jobs using limits set on the command line. For example, 'flux job purge --num-limit=100' purges all inactive jobs except the 100 newest, where "newest" is gauged by the time the job became inactive, and 'flux job purge --age-limit=3d' purges all jobs that have been inactive longer than 3 days.
Problem: t2201-job-cmd.t includes some tests that could pass if 'flux job' fails in unusual ways. Use "FLUX_URI=value test_must_fail cmd..." not "! FLUX_URI=value cmd...". Also, enclose those in a subshell because, in some environments but not others, the environment variable setting leaks to other tests, as noted in 9a77e05.
Problem: there is no test coverage for attempting to perform inappropriate job operations on inactive jobs. Add a few tests to t2201-job-cmd.t and t2280-job-memo.t to cover these cases.
Problem: job-manager may purge jobs from the KVS but other parts of the system like job-list may have cached inactive job IDs and are not informed that they are no longer valid. Publish a 'job-purge-inactive' event message containing a list of job IDs each time a batch of jobs are purged by the job manager.
Problem: flux job list -a still shows inactive jobs that have been purged. Subscribe to the 'job-purge-inactive' event message and remove purged jobs from the 'index' hash and the 'inactive' list when notified.
Problem: flux jobs --stats-only includes purged jobs in stats. Update the stats just before purging a job.
Problem: the job manager purge function has no test coverage. Add a new sharness test script.
Problem: job purge functionality is undocumented. Add an entry for the purge subcommand to flux-job(1), and add configuration details to flux-config-job-manager(5).
Codecov Report

```
@@            Coverage Diff             @@
##           master    #4286      +/-   ##
==========================================
+ Coverage   83.56%   83.58%   +0.01%
==========================================
  Files         388      389       +1
  Lines       64848    65139     +291
==========================================
+ Hits        54191    54445     +254
- Misses      10657    10694      +37
```
Just force pushed with the fixes discussed.
Nice! I tested this out and it works as described! Feel free to set MWP when you're ready.
BTW, this feature satisfies one of my minor use cases when running the Flux testsuite as a set of jobs within an instance.
This adds two methods of purging inactive jobs: one can purge interactively with `flux job purge`, or one can configure automatic purging via TOML.

The job manager retains inactive jobs in a hash and places them in a list ordered by the time they became inactive. When jobs are purged, the entry in the hash is removed along with the corresponding KVS directory.
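As a rough illustration of the selection logic, here is a Python sketch under the assumption that the list is ordered oldest-first by the time each job became inactive (this models the behavior described above, not the actual C implementation):

```python
import time

def select_purgeable(inactive, num_limit=None, age_limit=None, now=None):
    """Given (jobid, t_clean) pairs ordered oldest-first, return the job
    IDs to purge so that at most num_limit jobs remain and none has been
    inactive longer than age_limit seconds.  Oldest jobs go first."""
    now = time.time() if now is None else now
    purge = []
    keep = len(inactive)
    for jobid, t_clean in inactive:
        too_many = num_limit is not None and keep > num_limit
        too_old = age_limit is not None and now - t_clean > age_limit
        if not (too_many or too_old):
            break  # list is ordered, so no later job can qualify either
        purge.append(jobid)
        keep -= 1
    return purge

jobs = [(101, 100.0), (102, 200.0), (103, 300.0), (104, 400.0)]
print(select_purgeable(jobs, num_limit=2, now=500.0))      # [101, 102]
print(select_purgeable(jobs, age_limit=250.0, now=500.0))  # [101, 102]
```

Because the list is maintained in purge order, the scan can stop at the first job that satisfies both limits, so each purge pass touches only the jobs actually being removed.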