job-manager: add inactive job purge capability #4286

Merged: 14 commits on Apr 18, 2022

Commits on Apr 18, 2022

  1. job-manager: preserve inactive jobs

    Problem: various job manager services cannot distinguish
    incorrect job IDs from inactive jobs.
    
    Instead of removing jobs when they become inactive, store them
    in an inactive hash.  For now, jobs remain there until the job
    manager is unloaded.
    garlick committed Apr 18, 2022
    215a94a
  2. job-manager: improve errors on inactive jobid

    Problem: various job manager requests return "unknown job id"
    on an inactive but valid jobid.
    
    Now that inactive jobs are available in their own hash, try to
    look up the jobid in the inactive hash and if found, return
    "job is inactive" instead.
    garlick committed Apr 18, 2022
    81f706e
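
The lookup order described in commits 1 and 2 can be sketched as follows. This is a minimal illustration, not the job manager's C code: `lookup_job`, `JobLookupError`, and the dict-based hashes are hypothetical names, and only the two error strings come from the commit message.

```python
class JobLookupError(Exception):
    """Hypothetical error type; the real job manager returns errors over RPC."""


def lookup_job(jobid, active, inactive):
    """Return the active job, or fail with an error that tells the two cases apart."""
    job = active.get(jobid)
    if job is not None:
        return job
    # The job is not active: consult the inactive hash before giving up.
    if jobid in inactive:
        raise JobLookupError("job is inactive")
    raise JobLookupError("unknown job id")
```

A request that only makes sense for an active job would call `lookup_job()` and pass the error text back to the caller unchanged.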
  3. job-manager: include job counts in module stats

    Problem: there is no way to externally probe how many active
    and inactive jobs the job manager is tracking.
    
    Add active_jobs, inactive_jobs keys to the stats object so that
    'flux module stats job-manager' can show the number of jobs
    of each type that are tracked.
    garlick committed Apr 18, 2022
    f613d41
  4. job-manager: track time job enters INACTIVE state

    Problem: inactive job purging should be based on the time
    the job actually enters INACTIVE state but that time is not
    captured.
    
    Add job->t_clean which captures the timestamp on the clean event,
    which is the only event that transitions a job to inactive.
    garlick committed Apr 18, 2022
    ba081e4
  5. job-manager: rename internal job comparator

    Problem: a zlistx comparator function used to order jobs by priority
    is named job_comparator(), which makes naming another comparator
    difficult.
    
    Rename internal function to job_priority_comparator().
    
    Update users.
    garlick committed Apr 18, 2022
    cf592cd
  6. job-manager: purge inactive jobs

    Problem: the overhead of inactive jobs grows without bound in a
    system instance.
    
    Add a mechanism to trim the oldest inactive jobs from the job manager
    internal hash and KVS.  There are two ways to do it.  One way is to
    make a job-manager.purge RPC specifying either a maximum number of
    retained inactive jobs, or a maximum age (time since transition to
    inactive) of retained inactive jobs, or both.
    
    The other way is to configure those criteria via TOML, e.g.
    
      [job-manager]
      inactive-age-limit = "7d"
      inactive-num-limit = 10000
    
    Each KVS transaction used to unlink a batch of jobs is limited to 100
    entries to avoid giant messages and head of line blocking in the broker.
    When purging via configuration, this limits the number of jobs purged per
    heartbeat interval.  When purging via RPC, the caller specifies the batch
    size (up to the maximum) and must check the returned job count, and repeat
    the RPC until a count less than the batch size is returned.
    
    Note: KVS job.* directories that become empty after all the jobs in them
    have been purged are NOT removed by the purge.  This can be accomplished
    later by an offline dump/restore of the content backing store, since empty
    directories are not included in dumps.
    garlick committed Apr 18, 2022
    c864aaa
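
One way to picture the selection of jobs for a single purge batch in commit 6 is the sketch below. It assumes inactive jobs are ordered by the time they became inactive and that a job qualifies if it violates either limit; `InactiveJob`, `select_purge_batch`, and `MAX_BATCH` are hypothetical names, and limits are taken as plain seconds and counts rather than parsed from TOML.

```python
from dataclasses import dataclass

MAX_BATCH = 100  # per-transaction cap mentioned in the commit message


@dataclass
class InactiveJob:
    jobid: int
    t_clean: float  # timestamp of the clean event


def select_purge_batch(inactive, now, age_limit=None, num_limit=None,
                       batch=MAX_BATCH):
    """Pick up to `batch` of the oldest inactive jobs that violate a limit."""
    batch = min(batch, MAX_BATCH)
    oldest_first = sorted(inactive, key=lambda job: job.t_clean)
    # Number of jobs beyond the retained-count limit (0 if no limit is set).
    excess = 0 if num_limit is None else max(0, len(oldest_first) - num_limit)
    victims = []
    for i, job in enumerate(oldest_first):
        if len(victims) >= batch:
            break
        over_count = i < excess
        over_age = age_limit is not None and now - job.t_clean > age_limit
        if not (over_count or over_age):
            break  # sorted oldest first, so every later job is kept too
        victims.append(job)
    return victims
```

Because the list is sorted oldest first, the jobs to purge always form a prefix, so the loop can stop at the first job that satisfies both limits.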
  7. flux-job: add purge subcommand

    Problem: there is no convenient way to clean unwanted inactive
    jobs if purge limits were not configured.
    
    Add 'flux job purge', which can interactively remove inactive
    jobs using limits set on the command line.  For example:
    
    flux job purge --num-limit=100
      purges all inactive jobs except the 100 newest, where "newest"
      is gauged by the time the job became inactive.
    
    flux job purge --age-limit=3d
      purges all jobs that have been inactive longer than 3 days.
    garlick committed Apr 18, 2022
    73d0bbf
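
Commit 6 says an RPC caller must repeat the purge request until the returned count is less than the batch size. A client loop along those lines might look like this sketch; `purge_rpc` is a hypothetical callable standing in for the job-manager.purge RPC, not a real Flux binding.

```python
def purge_all(purge_rpc, batch=100):
    """Repeat the purge request until a short batch signals nothing is left.

    purge_rpc(batch) is assumed to purge up to `batch` jobs and return
    the number actually purged.
    """
    total = 0
    while True:
        count = purge_rpc(batch)
        total += count
        if count < batch:
            return total
```

If the number of eligible jobs is an exact multiple of the batch size, the loop makes one extra call that returns 0 before stopping.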
  8. testsuite: use test_must_fail not !

    Problem: t2201-job-cmd.t includes some tests that could pass if
    'flux job' fails in unusual ways.
    
    Use "FLUX_URI=value test_must_fail cmd..."
    not "! FLUX_URI=value cmd..."
    
    Also, enclose those in a subshell, because in some environments but
    not others, this form causes the environment variable setting to leak
    to other tests, as noted in 9a77e05.
    garlick committed Apr 18, 2022
    5555e12
  9. testsuite: cover operations on inactive jobs

    Problem: there is no test coverage for attempting to perform
    inappropriate job operations on inactive jobs.
    
    Add a few tests to t2201-job-cmd.t and t2280-job-memo.t to cover
    these cases.
    garlick committed Apr 18, 2022
    4ed3749
  10. job-manager: publish list of purged jobids

    Problem: job-manager may purge jobs from the KVS but other parts
    of the system like job-list may have cached inactive job IDs and
    are not informed that they are no longer valid.
    
    Publish a 'job-purge-inactive' event message containing a list of
    job IDs each time a batch of jobs is purged by the job manager.
    garlick committed Apr 18, 2022
    176ce2c
  11. job-list: handle purge event

    Problem: flux job list -a still shows inactive jobs that have been
    purged.
    
    Subscribe to the 'job-purge-inactive' event message and remove purged
    jobs from the 'index' hash and the 'inactive' list when notified.
    garlick committed Apr 18, 2022
    028b71f
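
The event handling in commit 11 amounts to dropping each purged ID from both structures. A rough sketch, assuming the 'index' hash is a dict keyed by job ID and the 'inactive' list holds job IDs; `handle_purge_event` is a hypothetical name, and the real job-list module is written in C.

```python
def handle_purge_event(jobids, index, inactive):
    """Remove purged job IDs from the index hash and the inactive list."""
    purged = set(jobids)
    for jobid in purged:
        index.pop(jobid, None)  # ignore IDs that were never cached
    # Rebuild the list in place so other references to it see the change.
    inactive[:] = [jobid for jobid in inactive if jobid not in purged]
```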
  12. job-list: update statistics when job is purged

    Problem: flux jobs --stats-only includes purged jobs in stats.
    
    Update the stats just before purging a job.
    garlick committed Apr 18, 2022
    9b8989c
  13. testsuite: cover job purge

    Problem: the job manager purge function has no test coverage.
    
    Add a new sharness test script.
    garlick committed Apr 18, 2022
    035541d
  14. flux-job(1), flux-config-job-manager(5): add purge

    Problem: job purge functionality is undocumented.
    
    Add an entry for the purge subcommand to flux-job(1),
    and add configuration details to flux-config-job-manager(5).
    garlick committed Apr 18, 2022
    3cb8819