job-manager: make some replay errors non-fatal #5150

Merged 3 commits on May 5, 2023

Commits on May 5, 2023

  1. job-list: fix memory leak on error path

    Problem: if depthfirst_map_one() fails to look up R,
    futures are leaked.
    
    Ensure that particular failure unwinds allocations like
    the others.
    garlick authored and chu11 committed May 5, 2023
    2e23a37
  2. job-list: make replay KVS errors non-fatal

    Problem: the job manager now treats jobs that cannot be loaded
    from the KVS as a non-fatal error, but job-list still treats
    them as fatal.
    
    Relax error handling so that replay continues if a job cannot be
    loaded.  Most likely the job will have already been logged by
    the job manager, so introduce no new logging here.
    garlick authored and chu11 committed May 5, 2023
    d8f2f10
  3. job-manager: make some replay errors non-fatal

    Problem: if a few jobs get messed up in the KVS due to an
    improper shutdown, recovery is a tedious process involving
    starting flux in --recovery mode, fixing one job, and starting
    again.
    
    When a job cannot be replayed from the KVS and the reason is
    that the directory is incomplete, log the failure at LOG_ERR
    level but let replay continue so that the flux restart
    ultimately succeeds.
    
    If a job has more serious problems like incorrect content in
    the eventlog, treat that as a fatal error as before.  This
    avoids breaking the 'valid' tests that check backwards
    compatibility with older kvs dumps, which might use an older
    eventlog format.
    
    Update t2219-job-manager-restart.t to expect warnings rather
    than failure when such jobs are encountered during replay.
    
    Fixes flux-framework#5147
    garlick authored and chu11 committed May 5, 2023
    fc5b7ab
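
The replay policy described in commit 3 can be sketched in a few lines of C. This is a minimal illustration, not the job-manager code: `struct kvs_job`, `replay_one()`, `replay_all()`, and `struct replay_stats` are hypothetical names, a NULL eventlog stands in for an incomplete KVS directory, the "starts with submit" check is a toy substitute for real eventlog validation, and mapping the two failure kinds to ENOENT and EINVAL is an assumption. Writing to stderr stands in for logging at LOG_ERR.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for one job directory in the KVS. */
struct kvs_job {
    unsigned long id;
    const char *eventlog;   /* NULL models a missing key (incomplete directory) */
};

/* Counters filled in by replay_all(). */
struct replay_stats {
    int loaded;
    int skipped;
};

/* Load one job.  Returns 0 on success, -1 with errno set on failure.
 * Assumed mapping: ENOENT = directory incomplete, EINVAL = bad content. */
static int replay_one(const struct kvs_job *job)
{
    if (!job->eventlog) {
        errno = ENOENT;
        return -1;
    }
    /* Toy validity check: the eventlog must begin with a "submit" event. */
    if (strncmp(job->eventlog, "submit", 6) != 0) {
        errno = EINVAL;
        return -1;
    }
    return 0;
}

/* Replay every job.  Incomplete jobs are logged and skipped; any other
 * failure aborts the replay and is reported to the caller via errno. */
int replay_all(const struct kvs_job *jobs, size_t njobs,
               struct replay_stats *stats)
{
    assert(stats != NULL);
    assert(jobs != NULL || njobs == 0);

    stats->loaded = 0;
    stats->skipped = 0;

    for (size_t i = 0; i < njobs; i++) {
        if (replay_one(&jobs[i]) == 0) {
            stats->loaded++;
            continue;
        }
        int saved_errno = errno;    /* fprintf() may clobber errno */
        if (saved_errno == ENOENT) {
            fprintf(stderr, "replay: job %lu is incomplete, skipping\n",
                    jobs[i].id);
            stats->skipped++;
            continue;               /* non-fatal: keep replaying */
        }
        errno = saved_errno;
        return -1;                  /* fatal: e.g. malformed eventlog */
    }
    return 0;
}
```

With three jobs where the second has no eventlog, `replay_all()` returns 0 and reports two loaded and one skipped. If the second job instead has malformed content, it returns -1 with errno set to EINVAL and stops after the first job, which matches the commit's intent of keeping content errors fatal.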