
job-manager: add inactive job purge capability #4286

Merged
merged 14 commits into master on Apr 18, 2022

Conversation

@garlick (Member) commented Apr 15, 2022

This adds two methods of purging inactive jobs:

One can purge interactively, e.g.

$ flux job purge --num-limit=10
purged 90 inactive jobs, 10 remaining

or one can configure automatic purging via TOML

[job-manager]

inactive-age-limit = "7d"
inactive-num-limit = 10000

The job manager retains inactive jobs in a hash and places them in a list ordered by the time they became inactive. When jobs are purged, the entry in the hash is removed along with the corresponding KVS directory.
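
For concreteness, here is a minimal sketch of what purging one inactive job involves under that design. The struct fields, function names, and the use of flux_job_kvs_key() with a NULL key to obtain the job's KVS directory are illustrative assumptions, not the actual job-manager code:

#include <flux/core.h>
#include <czmq.h>

struct job {                    /* illustrative stand-in for the real job struct */
    flux_jobid_t id;
    double t_clean;             /* time the job became inactive */
    void *list_handle;          /* handle returned by zlistx_insert() */
};

/* Unlink the job's KVS directory in the pending transaction, then drop
 * the job from the inactive hash and the time-ordered queue. */
static int purge_one (flux_kvs_txn_t *txn,
                      zhashx_t *inactive_jobs,
                      zlistx_t *inactive_queue,
                      struct job *job)
{
    char key[64];

    if (flux_job_kvs_key (key, sizeof (key), job->id, NULL) < 0
        || flux_kvs_txn_unlink (txn, 0, key) < 0)
        return -1;
    zlistx_delete (inactive_queue, job->list_handle);
    zhashx_delete (inactive_jobs, &job->id);    /* hash destructor frees job */
    return 0;
}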

@grondo (Contributor) commented Apr 15, 2022

This is great!

This does seem to have an impact on job throughput, though I'm not sure if it is something to worry about at this point.
For example, on master:

$ flux jobs --stats-only
0 running, 32768 completed, 0 failed, 0 pending
$ src/test/throughput.py -sn 4096
  4096 jobs:   4096 submitted,      0 running, 4096 completed
number of jobs: 4096
submit time:    3.690 s (1110.1 job/s)
script runtime: 29.725s
job runtime:    29.180s
throughput:     140.4 job/s (script: 137.8 job/s)

vs this branch:

$ flux jobs --stats-only
0 running, 32768 completed, 0 failed, 0 pending
$ src/test/throughput.py -sn 4096
  4096 jobs:   4096 submitted,      0 running, 4096 completed
number of jobs: 4096
submit time:    3.444 s (1189.3 job/s)
script runtime: 35.197s
job runtime:    34.764s
throughput:     117.8 job/s (script: 116.4 job/s)

Purging jobs (down to 1024) in this case restores the throughput:
(though of course job-list still shows all 36K jobs)

$ flux job purge --num-limit=1024 --batch=100
purged 35840 inactive jobs, 1024 remaining
$ flux jobs --stats-only
0 running, 36864 completed, 0 failed, 0 pending
$ src/test/throughput.py -sn 4096
  4096 jobs:   4096 submitted,      0 running, 4096 completed
number of jobs: 4096
submit time:    3.719 s (1101.5 job/s)
script runtime: 27.975s
job runtime:    27.574s
throughput:     148.5 job/s (script: 146.4 job/s)

Since the purge is really only meant for the system instance, which is by design the least common instance type, I wonder if we should consider only enabling the job-manager inactive cache by configuration?

Having said that, the throughput numbers here may not even be relevant, since the throughput of real jobs is limited in other ways by the reality of executing actual processes. However, since we've worked so hard to keep these numbers high in the past, I wanted to bring this up so we stay aware of the issue.

@garlick (author) commented Apr 15, 2022

Thank you for testing that! That result is a little surprising to me. I wonder if the list insertion into purge->queue is the problem. Let me poke at that a bit.

@grondo (Contributor) commented Apr 15, 2022

Some other random thoughts:

  • The dependency plugins in the job-manager could also make great use of the inactive job cache to allow jobs to have dependencies on other, completed jobs (right now jobs can only have dependencies on active jobs). This might only work if the job "result" for inactive jobs is available.
  • Apropos of the previous bullet, it might be nice to have a function to look up a jobid that checks the active hash and falls back to the inactive hash (see the sketch after this list). Callers can then check the job state to determine whether the job was found in the active or inactive hash.
  • The job-list module of course doesn't purge its cache of inactive jobs. So, after a purge (currently) a job can be listed, but flux job info JOBID would fail (since the jobid is no longer in the kvs). That might be confusing, but probably isn't a big issue at this point.
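
A minimal sketch of such a lookup helper, assuming both tables are czmq zhashx hashes keyed by jobid; the function and table names are hypothetical:

/* Hypothetical helper: check the active hash first, then fall back
 * to the inactive hash.  Returns NULL if the id is unknown. */
struct job *job_lookup_any (zhashx_t *active_jobs,
                            zhashx_t *inactive_jobs,
                            flux_jobid_t id)
{
    struct job *job;

    if ((job = zhashx_lookup (active_jobs, &id)))
        return job;                             /* state will be active */
    return zhashx_lookup (inactive_jobs, &id);  /* inactive, or NULL */
}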

@grondo (Contributor) commented Apr 15, 2022

That result is a little surprising to me.

Hm, you may also want to make sure the result is reproducible for you!

@garlick (author) commented Apr 15, 2022

Good observations!

The job-list module of course doesn't purge its cache of inactive jobs. So, after a purge (currently) a job can be listed, but flux job info JOBID would fail (since the jobid is no longer in the kvs). That might be confusing, but probably isn't a big issue at this point.

I do have a development branch in which the journal is modified to include inactive jobs (optionally) so that job-list doesn't have to scan the KVS. In that branch, the entire eventlog for each job is retained in memory and so a new journal consumer can completely "catch up" with the job manager and track ongoing job events in one stream.

In that scenario, it would make sense to add a message to the journal stream when a job gets purged even though it's not an event per se. That would let a journal consumer like job-list keep the size of its caches in check without the need for something separate like #3688. This is an "open loop" or eventually consistent design, so the situation you describe could still arise although it would be far less likely.

I should probably rebase that branch on top of this and post it as a WIP. If nothing else it's something concrete to contribute to the discussion about #4273 (job-db).

@grondo (Contributor) commented Apr 15, 2022

Thanks for that update! I apologize, I now realize my comment was a bit off-topic for this specific PR.

@garlick (author) commented Apr 15, 2022

I now realize my comment was a bit off-topic for this specific PR.

Not at all - that was a good observation and a loose end I failed to articulate.

Hm, you may also want to make sure the result is reproducible for you!

I'm not really seeing it. Or at least the variability in results is too high to really conclude anything. Here are throughput results (src/test/throughput.py -sn 4096) for the master and purge branches:

jobs added      total           master TP       purge TP

+4k jobs        4k              288.4           300.6
+4k jobs        8k              235.7           280.3
+4k jobs        12k             233.3           224.7
+4k jobs        16k             269.8           244.7
+4k jobs        20k             225.7           229.9
+4k jobs        24k             224.6           247.7
+4k jobs        28k             252.5           226.2
+4k jobs        32k             235.7           246.1

I might be missing something or maybe I should have tried with more jobs?

@grondo (Contributor) commented Apr 15, 2022

Yeah, I was not getting that much variability on my system. You could try more jobs since your system is clearly double the speed of mine. But if it is not reproducible for you, then this might not be an actual issue.

@chu11 (Member) commented Apr 15, 2022

High level comment: is this something we want to hold off on until we have job archiving (such as #4273) more in place?

Or is the idea that, since this is configurable, the default behavior stays the same and we don't have to worry about archiving? If users want to lose inactive data in the kvs permanently, that's OK.

@garlick (author) commented Apr 15, 2022

Or is the idea that, since this is configurable, the default behavior stays the same and we don't have to worry about archiving? If users want to lose inactive data in the kvs permanently, that's OK.

Yeah this is what I was thinking. I might go a little further and recommend that our system instance early adopters purge at 1 week or so just to ensure that we don't have an unsustainable situation. Then we'll add in some form of offline garbage collection and we will be on a healthy trajectory. job-archive will grab all the data pretty aggressively so as long as the configured purge age is on the order of days, we shouldn't be losing data as far as flux-accounting is concerned.

However it's up for debate. Purging would limit the ability to ask what or who ran on a "bad node" beyond the purge period for example. Maybe we can swoop in with a good job-db solution in the near term that would allow longer term preservation and querying of job data?

@garlick (author) commented Apr 15, 2022

I did reproduce the slowdown by testing with more jobs. Here's some data that shows it on my system:

jobs added      total           master TP       purge TP
+32k jobs       32k             203.7           207.7
+8k jobs        40k             210.4           223.1
+8k jobs        48k             228.9           197.0
+8k jobs        56k             210.8           178.8
+8k jobs        64k             206.0           150.9
+8k jobs        72k             203.0           134.4
+8k jobs        80k             205.4           113.3
+8k jobs        88k             216.9           100.5

@garlick (author) commented Apr 15, 2022

I was using the list backwards so I had to scan the whole thing to insert the next job! With that fixed (last column) the numbers at scale are better:

jobs added      total   master  purge   nolist  revlist
+32k jobs       32k     203.7   207.7   198.7   204.5
+8k jobs        40k     210.4   223.1   237.9   225.0
+8k jobs        48k     228.9   197.0   225.5   208.5
+8k jobs        56k     210.8   178.8   217.6   216.9
+8k jobs        64k     206.0   150.9   217.1   211.0
+8k jobs        72k     203.0   134.4   215.5   211.5
+8k jobs        80k     205.4   113.3   197.5   207.4
+8k jobs        88k     216.9   100.5   191.9   209.5

I'll post the fix shortly. I wanted to add a test to ensure the oldest jobs are purged first since I was purging the youngest and no tests were catching it.
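
For reference, a sketch of the general shape of the fix (not the actual patch): czmq's zlistx_insert() searches from the head of a sorted list when its low_value argument is true and from the tail when it is false. Since jobs become inactive in roughly increasing t_clean order, sorting the queue oldest-first and searching from the tail makes each insertion effectively O(1), while searching from the wrong end degrades to a full scan as the list grows:

/* order inactive jobs oldest-first by the time they became inactive */
static int inactive_cmp (const void *a1, const void *a2)
{
    const struct job *j1 = a1;
    const struct job *j2 = a2;

    if (j1->t_clean < j2->t_clean)
        return -1;
    return j1->t_clean > j2->t_clean ? 1 : 0;
}

/* setup, once */
zlistx_set_comparator (inactive_queue, inactive_cmp);

/* on each transition to INACTIVE: the new job belongs near the tail,
 * so search from the tail (low_value = false) */
job->list_handle = zlistx_insert (inactive_queue, job, false);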

@grondo (Contributor) commented Apr 15, 2022

Nice catch!

@garlick (author) commented Apr 15, 2022

Hmm, the bionic - py3.7,clang-6.0 builder seems to have gotten stuck in the broker unit tests, on test_runat.t specifically. Restarting, since this doesn't seem likely to be related to this PR.

2022-04-15T22:46:21.3560385Z make  check-TESTS
2022-04-15T22:46:21.5422684Z                test_attr.t:  PASS: N=53  PASS=53  FAIL=0 SKIP=0 XPASS=0 XFAIL=0
2022-04-15T22:46:21.5439307Z             test_pmiutil.t:  PASS: N=9   PASS=9   FAIL=0 SKIP=0 XPASS=0 XFAIL=0
2022-04-15T22:46:21.5592137Z             test_service.t:  PASS: N=18  PASS=18  FAIL=0 SKIP=0 XPASS=0 XFAIL=0
2022-04-15T22:46:21.5598217Z             test_liblist.t:  PASS: N=6   PASS=6   FAIL=0 SKIP=0 XPASS=0 XFAIL=0
2022-04-15T22:46:21.6522470Z         test_boot_config.t:  PASS: N=54  PASS=54  FAIL=0 SKIP=0 XPASS=0 XFAIL=0
2022-04-15T22:46:22.7412790Z             test_overlay.t:  PASS: N=83  PASS=83  FAIL=0 SKIP=0 XPASS=0 XFAIL=0
2022-04-15T23:42:25.5427971Z ##[error]The operation was canceled.

@garlick (author) commented Apr 16, 2022

It failed again, but in the purge sharness test. I realize I have a test that depends on a previous test that has the NO_CHAIN_LINT conditional, so I'm adding that conditional to the other dependent test.

@garlick (author) commented Apr 16, 2022

The job-list module of course doesn't purge its cache of inactive jobs. So, after a purge (currently) a job can be listed, but flux job info JOBID would fail (since the jobid is no longer in the kvs). That might be confusing, but probably isn't a big issue at this point.

I had purging set to 8h on my test system and just spent a minute confused by the fact that flux jobs was still showing jobs that should have been purged. Well I guess that brought home this point for me!

Hmm, I wonder if it would be OK to do something simple for now like publish the list of purged jobs so caches in job-list can be pruned?

@garlick (author) commented Apr 16, 2022

It was pretty easy to implement so I went ahead and pushed a change to have the job manager publish an event listing job IDs being purged, and then have job-list subscribe to that message and remove jobs.

This invites comparison to PR #3688 (pending forever), where a lot of work was done to implement job-list purging. I think the job manager is the correct place to perform the purge since it also solves the problem of the KVS growing without bound and is the "source of truth" for a lot of other job information. But some good discussion happened over in that PR, e.g.

  • flux job list-purge lists job info that is being purged (like dmesg -c)
  • purged jobids are added by job-list to a hash so they can still be distinguished from invalid jobids

I'm not sure either of those points makes as much sense when the job manager is doing the purging, but I thought I'd raise them in case there should be some discussion, or in case other aspects of that PR come to mind that would be applicable here.

Edit: just realized that the job-list stats are not updated, so purged jobs will still be included in flux jobs --stats-only counts. I should probably fix that.
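
A sketch of what that plumbing looks like with the flux-core event API. The topic string "job-purge-inactive" matches the change described here; the payload key "jobs" and the function names are illustrative assumptions rather than the exact code:

#include <flux/core.h>
#include <jansson.h>

/* job-manager side: publish the ids just purged (ids is a json array) */
static void announce_purged (flux_t *h, json_t *ids)
{
    flux_future_t *f;

    if (!(f = flux_event_publish_pack (h, "job-purge-inactive", 0,
                                       "{s:O}", "jobs", ids)))
        flux_log_error (h, "error publishing job-purge-inactive");
    flux_future_destroy (f);
}

/* job-list side: event callback, registered after
 * flux_event_subscribe (h, "job-purge-inactive") */
static void purge_event_cb (flux_t *h, flux_msg_handler_t *mh,
                            const flux_msg_t *msg, void *arg)
{
    json_t *ids;
    size_t index;
    json_t *entry;

    if (flux_event_unpack (msg, NULL, "{s:o}", "jobs", &ids) < 0) {
        flux_log_error (h, "error decoding job-purge-inactive event");
        return;
    }
    json_array_foreach (ids, index, entry) {
        flux_jobid_t id = json_integer_value (entry);
        /* remove id from job-list's index hash and inactive list here */
        (void)id;
    }
}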

* It doesn't seem worth the effort to structure the code to avoid this
* due to high complexity of solution, low probability of error, and
* minor consequences.
*/
Contributor commented:

I didn't have a chance to fully review this PR yet, but FLUX_TEST_VALGRIND=t ./t2809-job-purge.t caught a leak of jobs on successful return here.

==89443== 1,168 (320 direct, 848 indirect) bytes in 8 blocks are definitely lost in loss record 21 of 29
==89443==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==89443==    by 0x4A25070: json_array (in /usr/lib/x86_64-linux-gnu/libjansson.so.4.13.0)
==89443==    by 0xA39A93C: purge_inactive_jobs (purge.c:110)
==89443==    by 0xA39AC50: purge_request_cb (purge.c:283)
==89443==    by 0x487B8B7: call_handler (msg_handler.c:344)
==89443==    by 0x487BC49: dispatch_message (msg_handler.c:380)
==89443==    by 0x487BC49: handle_cb (msg_handler.c:481)
==89443==    by 0x48AACE2: ev_invoke_pending (ev.c:3770)
==89443==    by 0x48AE02F: ev_run (ev.c:4190)
==89443==    by 0x48AE02F: ev_run (ev.c:4021)
==89443==    by 0x487A9FE: flux_reactor_run (reactor.c:128)
==89443==    by 0xA392C11: mod_main (job-manager.c:220)
==89443==    by 0x119B20: module_thread (module.c:183)
==89443==    by 0x4AC4B42: start_thread (pthread_create.c:442)

Author replied:

Ah, most likely that was in the code I just added for updating job-list. Thank you for catching that!

Author replied:

Fixed (squashed into 3966478)

@grondo (Contributor) left a review:

This LGTM! I ran through some random manual testing and didn't find any issues. Even tried running a set of flux job purge commands in parallel with a small batch count and everything worked great!

Since flux job purge is an unrecoverable operation, I wonder if we want to require a --force option, but otherwise report how many jobs would be purged. I'm not sure how difficult that would be to add. Another option would be an are you sure? (yes/no) type prompt, unless --force is given. I don't think this has to be done before merging this PR, it was just something that occurred to me.

"batch", &batch) < 0)
goto error;
if (age_limit != INACTIVE_AGE_UNLIMITED && age_limit < 0) {
errmsg = "age limit must be >= 0 (0=unlimited)";
@grondo (Contributor) commented Apr 18, 2022:

Is errmsg correct? (INACTIVE_AGE_UNLIMITED is -1. not 0, but perhaps there's some subtlety I missed)

goto error;
}
if (num_limit != INACTIVE_NUM_UNLIMITED && num_limit < 0) {
errmsg = "num limit must be >= 0 (0=unlimited)";
Contributor commented:

Same question as above, isn't -1 unlimited?

Author replied:

oops yeah, I changed the values but didn't update the message!

Comment on lines +310 to +392
return errprintf (error,
"job-manager.inactive-num-limit: must be >= 0");
Contributor commented:

Should the error message be job-manager.inactive-num-limit: must be >= -1?

Author replied:

Actually, for the TOML config my thought was that users wouldn't specify a limit if they don't want one. Does that sound OK?

O/w we should probably define a special value for unlimited age too, like "unlimited".

Contributor replied:

That makes sense to me.

@garlick (author) commented Apr 18, 2022

I like the idea of reporting number of jobs purged only, unless --force is given!

Commit messages from the 14 commits in this PR:

Problem: various job manager services cannot distinguish
incorrect job IDs from inactive jobs.

Instead of removing jobs when they become inactive, store them
in an inactive hash.  For now, jobs remain there until the job
manager is unloaded.

Problem: various job manager requests return "unknown job id"
on an inactive but valid jobid.

Now that inactive jobs are available in their own hash, try to
look up the jobid in the inactive hash and if found, return
"job is inactive" instead.
Problem: there is no way to externally probe how many active
and inactive jobs the job manager is tracking.

Add active_jobs, inactive_jobs keys to the stats object so that
'flux module stats job-manager' can show the number of jobs
of each type that are tracked.

Problem: inactive job purging should be based on the time
the job actually enters INACTIVE state but that time is not
captured.

Add job->t_clean which captures the timestamp on the clean event,
which is the only event that transitions a job to inactive.

Problem: a zlistx comparator function used to order jobs by priority
is named job_comparator(), which makes naming another comparator
difficult.

Rename internal function to job_priority_comparator().

Update users.

Problem: the overhead of inactive jobs grows without bound in a
system instance.

Add a mechanism to trim the oldest inactive jobs from the job manager
internal hash and KVS.  There are two ways to do it.  One way is to
make a job-manager.purge RPC specifying either a maximum number of
retained inactive jobs, or a maximum age (time since transition to
inactive) of retained inactive jobs, or both.

The other way is to configure those criteria via TOML, e.g.

  [job-manager]
  inactive-age-limit = "7d"
  inactive-num-limit = 10000

The KVS transactions used to unlink batches of jobs are limited to 100
entries to avoid giant messages and head of line blocking in the broker.
When purging via configuration, this limits the number of jobs purged per
heartbeat interval.  When purging via RPC, the caller specifies the batch
size (up to the maximum) and must check the returned job count, and repeat
the RPC until a count less than the batch size is returned.

Note: KVS job.* directories that become empty after all the jobs in them
have been purged are NOT removed by the purge.  This can be accomplished
later by an offline dump/restore of the content backing store, since empty
directories are not included in dumps.
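
For illustration, a sketch of the batch-and-repeat client loop this protocol implies. The topic "job-manager.purge" and the payload keys (num_limit, age_limit, batch, force, count) are assumptions based on this discussion, not necessarily the exact wire format:

#include <flux/core.h>

/* Repeatedly request purges of up to 'batch' jobs until the job manager
 * reports a batch smaller than requested; returns total purged or -1. */
static int purge_all (flux_t *h, int num_limit, double age_limit, int batch)
{
    int total = 0;
    int count;

    do {
        flux_future_t *f;

        if (!(f = flux_rpc_pack (h, "job-manager.purge", FLUX_NODEID_ANY, 0,
                                 "{s:i s:f s:i s:b}",
                                 "num_limit", num_limit,
                                 "age_limit", age_limit,
                                 "batch", batch,
                                 "force", 1))
            || flux_rpc_get_unpack (f, "{s:i}", "count", &count) < 0) {
            flux_future_destroy (f);
            return -1;
        }
        flux_future_destroy (f);
        total += count;
    } while (count == batch);   /* a short batch means nothing is left to purge */

    return total;
}
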
Problem: there is no convenient way to clean unwanted inactive
jobs if purge limits were not configured.

Add 'flux job purge', which can interactively remove inactive
jobs using limits set on the command line.  For example:

flux job purge --num-limit=100
  purges all inactive jobs except the 100 newest, where "newest"
  is gauged by the time the job became inactive.

flux job purge --age-limit=3d
  purges all jobs that have been inactive longer than 3 days.

Problem: t2201-job-cmd.t includes some tests that could pass even if
the 'flux job' command fails in unusual ways.

Use "FLUX_URI=value test_must_fail cmd..."
not "! FLUX_URI=value cmd..."

Also, enclose those in a subshell because apparently in some environments
but not others, this causes the environment variable setting to leak
to other tests, as noted in 9a77e05.

Problem: there is no test coverage for attempting to perform
inappropriate job operations on inactive jobs.

Add a few tests to t2201-job-cmd.t and t2280-job-memo.t to cover
these cases.

Problem: job-manager may purge jobs from the KVS but other parts
of the system like job-list may have cached inactive job IDs and
are not informed that they are no longer valid.

Publish a 'job-purge-inactive' event message containing a list of
job IDs each time a batch of jobs is purged by the job manager.

Problem: flux job list -a still shows inactive jobs that have been
purged.

Subscribe to the 'job-purge-inactive' event message and remove purged
jobs from the 'index' hash and the 'inactive' list when notified.

Problem: flux jobs --stats-only includes purged jobs in stats.

Update the stats just before purging a job.

Problem: the job manager purge function has no test coverage

Add a new sharness test script.

Problem: job purge functionality is undocumented.

Add an entry for the purge subcommand to flux-job(1),
and add configuration details to flux-config-job-manager(5).
@codecov bot commented Apr 18, 2022

Codecov Report

Merging #4286 (a854fd4) into master (77f1708) will increase coverage by 0.01%.
The diff coverage is 84.48%.

❗ Current head a854fd4 differs from pull request most recent head 3cb8819. Consider uploading reports for the commit 3cb8819 to get more accurate results

@@            Coverage Diff             @@
##           master    #4286      +/-   ##
==========================================
+ Coverage   83.56%   83.58%   +0.01%     
==========================================
  Files         388      389       +1     
  Lines       64848    65139     +291     
==========================================
+ Hits        54191    54445     +254     
- Misses      10657    10694      +37     
Impacted Files Coverage Δ
src/modules/job-list/stats.c 75.00% <45.45%> (-7.23%) ⬇️
src/modules/job-manager/getattr.c 58.18% <50.00%> (-1.08%) ⬇️
src/modules/job-manager/purge.c 82.69% <82.69%> (ø)
src/modules/job-manager/job-manager.c 60.37% <90.90%> (+3.23%) ⬆️
src/cmd/flux-job.c 87.36% <93.33%> (+0.11%) ⬆️
src/modules/job-list/job-list.c 79.22% <95.00%> (+5.53%) ⬆️
src/modules/job-manager/alloc.c 82.80% <100.00%> (ø)
src/modules/job-manager/annotate.c 84.09% <100.00%> (+0.18%) ⬆️
src/modules/job-manager/event.c 76.92% <100.00%> (+0.27%) ⬆️
src/modules/job-manager/job.c 90.71% <100.00%> (+0.20%) ⬆️
... and 14 more

@garlick (author) commented Apr 18, 2022

Just force pushed with the fixes discussed, adding flux job purge --force, and squashing some incremental development (hope that's OK). I left the publication of the purged job ids and subscription in job-list broken out as separate commits since that might be a temporary solution, and it should be made clear what to back out later if we come up with something better.

@grondo (Contributor) commented Apr 18, 2022

Nice! I tested this out and it works as described! Feel free to set MWP when you're ready.

@grondo (Contributor) commented Apr 18, 2022

BTW, this feature satisfies one of my minor use cases: when running the Flux testsuite as a set of jobs within an instance, it is convenient to use flux jobs --stats-only to get a report of whether any jobs failed. The only way to do this multiple times, though, is to restart the instance so that the stats and previously completed jobs are cleared.

Now flux job purge may be used as an alternative (e.g. when you do not want to launch another large instance) 👍
