Handle I/O failures on log files gracefully instead of crashing #404

@asmacdo

Summary

When duct's log files live on an NFS mount with Kerberos authentication (sec=krb5), a long-running job can outlive the Kerberos ticket, after which I/O on the log files raises PermissionError. Duct then crashes entirely, which can also kill the monitored command. In this case the command was a batch orchestration script that had already submitted all SLURM jobs but was still running.

Root Cause

Investigated on Dartmouth's HPC cluster. NFS mount uses sec=krb5 with 10-hour ticket lifetime. A ~14-hour duct job outlived the Kerberos ticket, causing all open NFS file handles to return EACCES on I/O.

Timeline from a real failure:

  • 21:46 — last SSH login refreshed KCM ticket cache (ticket expires ~07:46)
  • 07:42 — duct crashes with PermissionError (4 min before expected expiry)

The inner SLURM jobs were unaffected because compute nodes use SLURM-managed credentials, not the user's SSH Kerberos ticket.
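
To show how the EACCES from the NFS layer ends up as the PermissionError seen in the tracebacks, here is a minimal Python illustration. It relies only on the standard library's errno-to-exception mapping; the error message string is made up for the example.

```python
import errno

# Constructing OSError with an errno picks the matching subclass,
# so EACCES surfaces as PermissionError.
err = OSError(errno.EACCES, "Permission denied")

print(type(err).__name__)        # PermissionError
print(isinstance(err, OSError))  # True
```

Because PermissionError is a subclass of OSError, a handler written as `except OSError` in the I/O paths would also cover this failure.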

Crash Sites

Three code paths (four threads in total) crash independently with PermissionError:

  1. monitor_process thread → Report.write_subreport() writing usage.jsonl
  2. Two TailPipe._tail threads → _catch_up() reading stdout/stderr log files
  3. Main thread → final write_subreport() call after process.wait() (fatal crash)

Impact

  • Duct crashes and takes the monitored command with it (child process loses parent / gets SIGPIPE)
  • info.json left empty (0 bytes), partial data in usage.jsonl
  • In this case the monitored script had already submitted all SLURM jobs, so compute work was fine — but if the crash had happened earlier, it could have interrupted the orchestration

Proposed Fix

  • On OSError in I/O paths: attempt to reopen the file handle and retry (stale handles can be replaced with a fresh open())
  • Track file position internally (self._pos += len(data)) since tell() on a stale handle may also fail
  • Cap retries (2–3 attempts); if exhausted, log a warning and skip the write — don't crash
  • The monitored command must always keep running; duct should produce whatever summary it can
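
The bullets above could be sketched roughly as follows. This is an illustration under stated assumptions, not duct's actual code: the class names `ResilientWriter` and `ResilientTail`, the logger name, and the `max_attempts` parameter are all hypothetical. The writer reopens its handle after an OSError and skips the write once attempts are exhausted; the reader tracks its own offset so it never depends on tell().

```python
import logging

logger = logging.getLogger("duct_sketch")  # hypothetical logger name


def _close_quietly(fh):
    """Close a handle, ignoring errors raised by an already-broken handle."""
    if fh is not None:
        try:
            fh.close()
        except OSError:
            pass


class ResilientWriter:
    """Hypothetical append-only writer; not duct's actual API."""

    def __init__(self, path, max_attempts=3):
        self.path = path
        self.max_attempts = max_attempts
        self._fh = None

    def write(self, text):
        """Return True if the text was written, False if it was skipped."""
        for attempt in range(1, self.max_attempts + 1):
            try:
                if self._fh is None:
                    self._fh = open(self.path, "a", encoding="utf-8")
                self._fh.write(text)
                self._fh.flush()
                return True
            except OSError as exc:
                logger.debug("write attempt %d failed: %s", attempt, exc)
                _close_quietly(self._fh)
                self._fh = None  # force a fresh open() on the next attempt
        logger.warning("skipping write to %s after %d attempts",
                       self.path, self.max_attempts)
        return False


class ResilientTail:
    """Hypothetical reader that tracks its own offset instead of calling tell()."""

    def __init__(self, path, max_attempts=3):
        self.path = path
        self.max_attempts = max_attempts
        self._fh = None
        self._pos = 0  # bytes consumed so far, maintained by this class

    def catch_up(self):
        """Return bytes appended since the last call, or b'' if reads keep failing."""
        for attempt in range(1, self.max_attempts + 1):
            try:
                if self._fh is None:
                    self._fh = open(self.path, "rb")
                    self._fh.seek(self._pos)  # resume where the old handle stopped
                data = self._fh.read()
                self._pos += len(data)
                return data
            except OSError as exc:
                logger.debug("read attempt %d failed: %s", attempt, exc)
                _close_quietly(self._fh)
                self._fh = None
        logger.warning("skipping read of %s after %d attempts",
                       self.path, self.max_attempts)
        return b""
```

One caveat of this sketch: if a write fails after part of the data has reached the file, the retry may duplicate that part. Whether that matters for usage.jsonl would need to be decided in the real fix.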

Workaround

Keep Kerberos tickets alive during long jobs (e.g., krenew or k5start in tmux).
