Handle I/O failures on log files gracefully instead of crashing

## Summary

When duct's log files live on an NFS mount with Kerberos authentication (`sec=krb5`), duct can crash with `PermissionError` during a long-running job once the Kerberos ticket expires. The crash can also **kill the monitored command**. In this case that was a batch orchestration script that had already submitted all of its SLURM jobs but was still running.

## Root Cause

Investigated on Dartmouth's HPC cluster. The NFS mount uses `sec=krb5` with a 10-hour ticket lifetime. A ~14-hour duct job outlived the Kerberos ticket, after which I/O on every open NFS file handle failed with `EACCES`.
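
For context on why `EACCES` shows up as `PermissionError`: Python chooses the `OSError` subclass from the errno value, so a handler written as `except OSError` also catches this case. The snippet below is only an illustration of that mapping; `describe` is a hypothetical helper, not part of duct.

```python
import errno

# Python selects the OSError subclass from errno, so EACCES yields PermissionError.
err = OSError(errno.EACCES, "Permission denied")
print(type(err).__name__)        # PermissionError
print(isinstance(err, OSError))  # True


def describe(exc: OSError) -> str:
    """Return the symbolic errno name for an OSError (hypothetical helper)."""
    return errno.errorcode.get(exc.errno, "UNKNOWN")


print(describe(err))             # EACCES
```

Catching the broader `OSError` rather than `PermissionError` alone would presumably also cover other NFS failure modes that surface with different errno values.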

Timeline from a real failure:
- 21:46 — last SSH login refreshed KCM ticket cache (ticket expires ~07:46)
- 07:42 — duct crashes with `PermissionError` (4 min before expected expiry)

The inner SLURM jobs were unaffected because compute nodes use SLURM-managed credentials, not the user's SSH Kerberos ticket.

## Crash Sites

Three crash sites, spread across four threads, fail independently with `PermissionError`:

1. **`monitor_process` thread** → `Report.write_subreport()` writing `usage.jsonl`
2. **Two `TailPipe._tail` threads** → `_catch_up()` reading stdout/stderr log files
3. **Main thread** → final `write_subreport()` call after `process.wait()` (fatal crash)

## Impact

- Duct crashes and takes the monitored command with it (child process loses parent / gets SIGPIPE)
- `info.json` left empty (0 bytes), partial data in `usage.jsonl`
- In this case the monitored script had already submitted all SLURM jobs, so compute work was fine — but if the crash had happened earlier, it could have interrupted the orchestration

## Proposed Fix

- On `OSError` in I/O paths: attempt to reopen the file handle and retry (stale handles can be replaced with a fresh `open()`)
- Track file position internally (`self._pos += len(data)`) since `tell()` on a stale handle may also fail
- Cap retries (2–3 attempts); if exhausted, log a warning and skip the write — don't crash
- The monitored command must always keep running; duct should produce whatever summary it can
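
The bullets above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not duct's actual code: `ResilientReader`, `append_with_retry`, `MAX_ATTEMPTS`, the `opener` parameter, and the logger name are all hypothetical, and the writer opens a fresh handle per attempt rather than keeping one open.

```python
import logging

logger = logging.getLogger("duct_sketch")  # hypothetical logger name
MAX_ATTEMPTS = 3  # assumed cap, following the "2-3 attempts" suggestion


class ResilientReader:
    """Read newly appended bytes, reopening the handle when I/O fails."""

    def __init__(self, path, opener=open):
        self._path = path
        self._opener = opener  # injectable so failures can be simulated
        self._pos = 0          # tracked here because tell() may fail on a stale handle
        self._fh = None

    def read_new(self):
        """Return bytes appended since the last call, or b"" if reads keep failing."""
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                if self._fh is None:
                    self._fh = self._opener(self._path, "rb")
                self._fh.seek(self._pos)
                data = self._fh.read()
                self._pos += len(data)
                return data
            except OSError as exc:
                logger.warning("read of %s failed (attempt %d/%d): %s",
                               self._path, attempt, MAX_ATTEMPTS, exc)
                self._drop_handle()
        return b""  # give up for this cycle; the caller keeps running

    def _drop_handle(self):
        """Discard the current handle so the next attempt reopens the file."""
        fh, self._fh = self._fh, None
        if fh is not None:
            try:
                fh.close()
            except OSError:
                pass  # closing a stale handle may fail as well


def append_with_retry(path, text, opener=open):
    """Append text to path; return False instead of raising if every attempt fails.

    If a write partially succeeds before failing, a retry may duplicate data.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with opener(path, "a") as fh:
                fh.write(text)
            return True
        except OSError as exc:
            logger.warning("write to %s failed (attempt %d/%d): %s",
                           path, attempt, MAX_ATTEMPTS, exc)
    return False
```

In this sketch a failed read or write costs at most one sample or one chunk of tailed output, and the calling thread never sees the exception. Whether skipped data should be retried on a later cycle is a design choice the sketch leaves open.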

## Workaround

Keep Kerberos tickets alive during long jobs (e.g., `krenew` or `k5start` in tmux).

Handle I/O failures on log files gracefully instead of crashing #404
