job-manager: add inactive job purge capability #4286
Commits on Apr 18, 2022
-
job-manager: preserve inactive jobs
Problem: various job manager services cannot distinguish incorrect job IDs from inactive jobs. Instead of removing jobs when they become inactive, store them in an inactive hash. For now, jobs remain there until the job manager is unloaded.
-
job-manager: improve errors on inactive jobid
Problem: various job manager requests return "unknown job id" on an inactive but valid jobid. Now that inactive jobs are available in their own hash, try to look up the jobid in the inactive hash and if found, return "job is inactive" instead.
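
The two-stage lookup described in this commit could be sketched roughly as follows. This is a Python model of the idea, not the job manager's C code: the `JobLookupError` class, the `lookup_active_job` name, and the use of plain dicts for the active and inactive hashes are all hypothetical. Only the two error strings come from the commit message.

```python
class JobLookupError(Exception):
    """Hypothetical error type carrying the message returned to the requester."""


def lookup_active_job(jobid, active, inactive):
    """Return the active job for jobid, or raise with a specific message.

    active and inactive are dicts keyed by job ID (stand-ins for the
    job manager's two hashes).
    """
    job = active.get(jobid)
    if job is not None:
        return job
    # The ID is valid but the job has already finished.
    if jobid in inactive:
        raise JobLookupError("job is inactive")
    # The ID was never seen, or the job has since been removed.
    raise JobLookupError("unknown job id")
```

Checking the active hash first keeps the common path to a single lookup; the inactive hash is consulted only to choose the error message.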
-
job-manager: include job counts in module stats
Problem: there is no way to externally probe how many active and inactive jobs the job manager is tracking. Add active_jobs, inactive_jobs keys to the stats object so that 'flux module stats job-manager' can show the number of jobs of each type that are tracked.
-
job-manager: track time job enters INACTIVE state
Problem: inactive job purging should be based on the time the job actually enters INACTIVE state but that time is not captured. Add job->t_clean which captures the timestamp on the clean event, which is the only event that transitions a job to inactive.
-
job-manager: rename internal job comparator
Problem: a zlistx comparator function used to order jobs by priority is named job_comparator(), which makes naming another comparator difficult. Rename internal function to job_priority_comparator(). Update users.
-
job-manager: purge inactive jobs
Problem: the overhead of inactive jobs grows without bound in a system instance.

Add a mechanism to trim the oldest inactive jobs from the job manager internal hash and KVS. There are two ways to do it. One way is to make a job-manager.purge RPC specifying either a maximum number of retained inactive jobs, or a maximum age (time since transition to inactive) of retained inactive jobs, or both. The other way is to configure those criteria via TOML, e.g.

    [job-manager]
    inactive-age-limit = "7d"
    inactive-num-limit = 10000

The KVS transactions used to unlink batches of jobs are limited to 100 entries to avoid giant messages and head of line blocking in the broker. When purging via configuration, this limits the number of jobs purged per heartbeat interval. When purging via RPC, the caller specifies the batch size (up to the maximum) and must check the returned job count, and repeat the RPC until a count less than the batch size is returned.

Note: KVS job.* directories that become empty after all the jobs in them have been purged are NOT removed by the purge. This can be accomplished later by an offline dump/restore of the content backing store, since empty directories are not included in dumps.
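
As an illustration of the batching rules in this commit, here is a minimal Python model of selecting a purge batch and of the caller-side loop that repeats until a short batch comes back. It assumes the inactive jobs are held in a list of `(jobid, t_clean)` pairs sorted oldest first. The function names, the list representation, and the use of seconds for the age limit are hypothetical; only the two limits, the 100-entry cap, and the repeat-until-short-count rule come from the commit message.

```python
MAX_BATCH = 100  # cap on entries per KVS transaction, per the commit message


def select_purge_batch(inactive, now, age_limit=None, num_limit=None,
                       batch=MAX_BATCH):
    """Pick up to `batch` of the oldest jobs that violate either limit.

    inactive: list of (jobid, t_clean) sorted oldest first.
    age_limit: maximum seconds since the job became inactive, or None.
    num_limit: maximum number of inactive jobs to retain, or None.
    """
    batch = min(batch, MAX_BATCH)
    victims = []
    remaining = len(inactive)
    for jobid, t_clean in inactive:
        if len(victims) >= batch:
            break
        too_many = num_limit is not None and remaining > num_limit
        too_old = age_limit is not None and now - t_clean > age_limit
        if not (too_many or too_old):
            break  # list is sorted, so no later job can qualify
        victims.append(jobid)
        remaining -= 1
    return victims


def purge_all(inactive, now, age_limit=None, num_limit=None, batch=MAX_BATCH):
    """Repeat batches until one comes back short, as an RPC caller would."""
    batch = min(batch, MAX_BATCH)
    total = 0
    while True:
        victims = select_purge_batch(inactive, now, age_limit, num_limit, batch)
        del inactive[:len(victims)]  # victims are always the oldest entries
        total += len(victims)
        if len(victims) < batch:
            return total
```

Because the list is ordered by the time each job became inactive, both criteria can be evaluated in one pass from the oldest end, and a job that satisfies both limits ends the scan.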
-
flux-job: add purge subcommand
Problem: there is no convenient way to clean unwanted inactive jobs if purge limits were not configured. Add 'flux job purge', which can interactively remove inactive jobs using limits set on the command line. For example:

    flux job purge --num-limit=100

purges all inactive jobs except the 100 newest, where "newest" is gauged by the time the job became inactive.

    flux job purge --age-limit=3d

purges all jobs that have been inactive longer than 3 days.
-
testsuite: use test_must_fail not !
Problem: t2201-job-cmd.t includes some tests that could pass if 'flux job' fails in unusual ways. Use "FLUX_URI=value test_must_fail cmd..." not "! FLUX_URI=value cmd..." Also, enclose those in a subshell because, in some environments but not others, the environment variable setting otherwise leaks to other tests, as noted in 9a77e05.
-
testsuite: cover operations on inactive jobs
Problem: there is no test coverage for attempting to perform inappropriate job operations on inactive jobs. Add a few tests to t2201-job-cmd.t and t2280-job-memo.t to cover these cases.
-
job-manager: publish list of purged jobids
Problem: job-manager may purge jobs from the KVS but other parts of the system like job-list may have cached inactive job IDs and are not informed that they are no longer valid. Publish a 'job-purge-inactive' event message containing a list of job IDs each time a batch of jobs is purged by the job manager.
-
Problem: flux job list -a still shows inactive jobs that have been purged. Subscribe to the 'job-purge-inactive' event message and remove purged jobs from the 'index' hash and the 'inactive' list when notified.
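
The handling described here might look roughly like the following Python model. It is a sketch only: the `handle_purge_event` name, the `"jobs"` payload key, and the dict/list stand-ins for the 'index' hash and 'inactive' list are assumptions for illustration, not the actual job-list data structures or event payload format.

```python
def handle_purge_event(payload, index, inactive):
    """Drop purged jobs from job-list's caches.

    payload: decoded event payload; the "jobs" key (a list of job IDs)
             is an assumed name.
    index: dict mapping job ID to job object.
    inactive: list of inactive job objects.
    Returns the number of jobs actually removed.
    """
    removed = 0
    for jobid in payload.get("jobs", []):
        job = index.pop(jobid, None)
        if job is None:
            continue  # never cached, or already removed
        if job in inactive:
            inactive.remove(job)
        removed += 1
    return removed
```

Unknown job IDs are skipped silently, since an event may name jobs that this consumer never cached.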
-
job-list: update statistics when job is purged
Problem: flux jobs --stats-only includes purged jobs in stats. Update the stats just before purging a job.
-
Problem: the job manager purge function has no test coverage. Add a new sharness test script.
-
flux-job(1), flux-config-job-manager(5): add purge
Problem: job purge functionality is undocumented. Add an entry for the purge subcommand to flux-job(1), and add configuration details to flux-config-job-manager(5).