Summary
When duct's log files are on NFS with Kerberos authentication (`sec=krb5`), long-running jobs can crash with `PermissionError` when the Kerberos ticket expires. Duct crashes entirely, which can also kill the monitored command — in this case a batch orchestration script that had already submitted all SLURM jobs but was still running.
Root Cause
Investigated on Dartmouth's HPC cluster. The NFS mount uses `sec=krb5` with a 10-hour ticket lifetime. A ~14-hour duct job outlived the Kerberos ticket, causing all open NFS file handles to return `EACCES` on I/O.
Timeline from a real failure:
- 21:46 — last SSH login refreshed KCM ticket cache (ticket expires ~07:46)
- 07:42 — duct crashes with `PermissionError` (4 min before expected expiry)
The inner SLURM jobs were unaffected because compute nodes use SLURM-managed credentials, not the user's SSH Kerberos ticket.
Crash Sites
Three threads crash independently with `PermissionError`:
- `monitor_process` thread → `Report.write_subreport()` writing `usage.jsonl`
- Two `TailPipe._tail` threads → `_catch_up()` reading stdout/stderr log files
- Main thread → final `write_subreport()` call after `process.wait()` (fatal crash)
Impact
- Duct crashes and takes the monitored command with it (child process loses its parent / gets `SIGPIPE`)
- `info.json` left empty (0 bytes), partial data in `usage.jsonl`
- In this case the monitored script had already submitted all SLURM jobs, so compute work was fine — but if the crash had happened earlier, it could have interrupted the orchestration
Proposed Fix
- On `OSError` in I/O paths: attempt to reopen the file handle and retry (stale handles can be replaced with a fresh `open()`)
- Track file position internally (`self._pos += len(data)`) since `tell()` on a stale handle may also fail
- Cap retries (2–3 attempts); if exhausted, log a warning and skip the write — don't crash
- The monitored command must always keep running; duct should produce whatever summary it can
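The reopen-and-retry idea above could look roughly like the sketch below. This is an illustration, not duct's actual code: the class name, `MAX_RETRIES`, and `_pos` are made up for the example, and it assumes a fresh `open()` on the same path can succeed where the stale handle fails (as proposed above).

```python
class ResilientWriter:
    """Append-only writer that tolerates stale NFS file handles.

    Sketch only: on OSError (e.g. EACCES after Kerberos ticket expiry)
    it reopens the file and retries a bounded number of times, then
    gives up on that write instead of crashing the process.
    """

    MAX_RETRIES = 3  # cap retries per the proposed fix

    def __init__(self, path: str) -> None:
        self.path = path
        self._fh = open(path, "ab")
        # Track position ourselves: tell() on a stale handle may also fail.
        self._pos = self._fh.tell()

    def write(self, data: bytes) -> bool:
        """Return True on success, False if all retries were exhausted."""
        for _attempt in range(self.MAX_RETRIES):
            try:
                self._fh.write(data)
                self._fh.flush()
                self._pos += len(data)
                return True
            except OSError:
                # Stale handle: discard it and try a fresh open().
                try:
                    self._fh.close()
                except OSError:
                    pass
                try:
                    self._fh = open(self.path, "ab")
                except OSError:
                    continue  # reopen failed too; count as a spent retry
        # Retries exhausted: caller should log a warning and move on.
        return False
```

On success the writer behaves like a plain append; on persistent failure it returns `False` so the caller can warn and skip rather than propagate the exception into the monitoring threads.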
Workaround
Keep Kerberos tickets alive during long jobs (e.g., `krenew` or `k5start` in tmux).
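For example, wrapping the job in `krenew` from the kstart package keeps renewing the ticket while the command runs (flags shown are from the krenew manual; adjust the interval to your site's ticket lifetime):

```shell
# Inside tmux, renew the Kerberos ticket every 60 minutes for as long
# as the wrapped command is running (krenew is part of kstart).
tmux new-session -d -s duct-job \
    'krenew -K 60 duct <your-long-running-command>'
```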