Summary
This appears to be a different failure mode than #1343.
For `codesess_3541ed2048cc`, the sandbox resumed far enough that the platform now reports it as `idle`, but the resumed instance is broken:
- Hub could reconnect to the sandbox instance
- Hub attempted the post-resume `USR1` reconnect nudge twice
- both nudge executions stayed stuck `queued`
- direct `execute()` against the sandbox returns `500 Internal Server Error`
- direct filesystem listing returns `404` with `cow-merged: no such file or directory`
- the long-running driver job still reports `running`
So the sandbox looks alive in metadata/status, but control and filesystem operations are not actually usable.
Affected IDs
- Hub session: `codesess_3541ed2048cc`
- Sandbox: `sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959`
- Driver job: `job_30e61ba73aad740180b7e179`
- Queued reconnect-nudge executions:
  - `exe_46f293072d7c38298388331985865e578cb06cf3aae422619f802f9b0e94`
  - `exe_932ef65f40e53bf3011886e4bc318ffa4d85f0391f87f8f6305c6896d6fa`
Hub-side behavior
Server logs during the failed wake path included:
```
[WARN] [hub] Controller rejected for unknown session codesess_3541ed2048cc
[INFO] Resuming sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959 for session codesess_3541ed2048cc
[WARN] Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: failed to send USR1 reconnect nudge to the CRIU-restored driver (immediate post-resume): Internal Server Error
[INFO] Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: waiting for the CRIU-restored driver to reconnect via WebSocket after the USR1 reconnect nudge
[WARN] Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: CRIU-restored driver still has not reconnected after 5000ms; sending a follow-up reconnect nudge
[WARN] Sandbox sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959: failed to send USR1 reconnect nudge to the CRIU-restored driver (follow-up): Internal Server Error
```
Hub state then moved from `creating` to `error` after the 120s driver connect timeout:
- final Hub error: `Sandbox driver did not connect within 120000ms. Check sandbox background jobs for details.`
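To make the retry cadence concrete, here is a minimal TypeScript sketch of the wake path implied by the log lines above (immediate nudge, a 5s wait-and-renudge loop, hard 120s driver-connect deadline). The helper names and stub behaviors are assumptions reconstructed from the logs, not actual Hub source:

```ts
// Hypothetical reconstruction of the Hub wake path; NOT actual Hub code.
const NUDGE_RETRY_MS = 5_000; // matches "still has not reconnected after 5000ms"
const DRIVER_CONNECT_TIMEOUT_MS = 120_000; // matches the final Hub error

// Stand-in: in this incident both nudge attempts got 500 Internal Server Error.
async function sendReconnectNudge(sandboxId: string, kind: string): Promise<void> {
  throw new Error(`Internal Server Error (${kind} nudge to ${sandboxId})`);
}

// Stand-in: in this incident the CRIU-restored driver never reconnected.
async function waitForDriverSocket(sandboxId: string, timeoutMs: number): Promise<boolean> {
  await new Promise((resolve) => setTimeout(resolve, timeoutMs));
  return false;
}

async function wakeSandbox(sandboxId: string): Promise<void> {
  await sendReconnectNudge(sandboxId, "immediate post-resume").catch(console.warn);
  const deadline = Date.now() + DRIVER_CONNECT_TIMEOUT_MS;
  while (Date.now() < deadline) {
    if (await waitForDriverSocket(sandboxId, NUDGE_RETRY_MS)) return; // driver is back
    await sendReconnectNudge(sandboxId, "follow-up").catch(console.warn);
  }
  // The error the Hub surfaced for codesess_3541ed2048cc:
  throw new Error(
    "Sandbox driver did not connect within 120000ms. Check sandbox background jobs for details.",
  );
}
```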
Platform-side state
Sandbox metadata
Current platform sandbox state is: `idle`
The original tracked driver job still shows:
- status: `running`
- no completion
- no replacement job
Reconnect nudge executions
Hub sends the reconnect nudge by calling `execute()` inside the resumed sandbox with a command that sends `USR1` to the `rpc-driver` process.
Those two executions exist on the platform and are still `queued`, not failed, not completed:
- `exe_46f293072d7c38298388331985865e578cb06cf3aae422619f802f9b0e94`
- `exe_932ef65f40e53bf3011886e4bc318ffa4d85f0391f87f8f6305c6896d6fa`
Their command is the expected Hub nudge command:
```sh
set -eu
if command -v pkill >/dev/null 2>&1; then
  if pkill -USR1 -f '[b]un /home/agentuity/client/driver/rpc-driver.ts'; then
    exit 0
  fi
  if pkill -USR1 -f '[b]un run /home/agentuity/client/driver/rpc-driver.ts'; then
    exit 0
  fi
fi
PID="$(ps -eo pid=,args= | awk '/[b]un (run )?\/home\/agentuity\/client\/driver\/rpc-driver\.ts/ { print $1; exit }')"
if [ -z "$PID" ]; then
  echo "rpc-driver process not found" >&2
  exit 1
fi
kill -USR1 "$PID"
```
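For context, this nudge only works if the driver process itself installs a `SIGUSR1` handler. A sketch of that assumed driver-side contract follows; this is an assumption about `rpc-driver.ts`, not its actual source, and `reconnectToHub`/`HUB_WS_URL` are hypothetical names:

```ts
// Assumed driver-side contract for the USR1 nudge; NOT the real rpc-driver.ts.
const HUB_WS_URL = process.env.HUB_WS_URL ?? "wss://hub.invalid/driver"; // hypothetical

function reconnectToHub(): void {
  // Hypothetical reconnect: open a fresh WebSocket so the CRIU-restored
  // process can re-attach to the Hub session.
  const ws = new WebSocket(HUB_WS_URL);
  ws.addEventListener("open", () => console.log("driver reconnected to hub"));
  ws.addEventListener("error", (err) => console.error("reconnect failed", err));
}

// Bun, like Node, dispatches SIGUSR1 to handlers registered this way, which
// is why `kill -USR1 <pid>` can serve as a reconnect nudge at all.
process.on("SIGUSR1", reconnectToHub);
```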
Direct low-level probes against the resumed sandbox
- Direct filesystem list via SDK `listFiles("/home/agentuity")` failed with:
  `not found: error listing files in /home/agentuity: readdirent /working/sandbox/sbx_76fca18ef2247dfbed4c5267de254677643322a1156c24eb5c69453d7959/cow-merged: no such file or directory`
- Direct SDK `execute()` failed with:
  `500 Internal Server Error`
This is the same underlying symptom the Hub hit when trying to send the `USR1` reconnect nudge.
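For reproducibility, the two probes were issued roughly as sketched below. `listFiles()` and `execute()` are the SDK calls referenced in this report; the sandbox handle type is a structural stand-in, not the SDK's real signature:

```ts
// Minimal probe sketch; the handle type is an assumption, only the two
// method names come from this report.
type SandboxHandle = {
  listFiles(path: string): Promise<unknown>;
  execute(command: string): Promise<unknown>;
};

async function probeSandbox(sandbox: SandboxHandle): Promise<void> {
  // Probe 1: filesystem listing -> 404, cow-merged: no such file or directory
  try {
    await sandbox.listFiles("/home/agentuity");
  } catch (err) {
    console.error("listFiles failed:", err);
  }
  // Probe 2: trivial exec -> 500 Internal Server Error
  try {
    await sandbox.execute("true");
  } catch (err) {
    console.error("execute failed:", err);
  }
}
```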
Lifecycle timeline
Relevant recent platform events for the affected sandbox:
- `2026-04-03T15:21:46Z` suspend with checkpoint id `ckpt_20c38cee01645cde`
- `2026-04-03T15:21:55Z` `lifecycle:resumed` (deferred: true)
- `2026-04-03T15:24:03Z` another evacuation/suspend sequence
- `2026-04-04T01:01:44Z` `lifecycle:reconcile` (`previous_status=suspended`)

After the latest Hub wake attempt, the sandbox now reports `idle`, but control/filesystem behavior is still broken as described above.
Control experiment
I also created a fresh empty control sandbox in the same org/runtime and tested:
- create sandbox
- copy a file in after creation
- verify file via `exec` and fs listing
- pause sandbox
- resume sandbox manually via CLI
- verify file persistence and fresh `exec`/fs access after resume
That control path succeeded end-to-end. So this does not look like a general pause/resume outage; it looks specific to this resumed sandbox entering a bad state.
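For reference, the control sequence in sketch form. Only `execute()`/`listFiles()` are method names taken from this report; `createSandbox`, `pause()`, and `resume()` are hypothetical stand-ins (the resume step was actually done via the CLI), and the real SDK surface may differ:

```ts
// Hypothetical control-path sketch; method names other than execute/listFiles
// are invented stand-ins for whatever the SDK/CLI actually exposes.
interface ControlSandbox {
  id: string;
  execute(command: string): Promise<unknown>;
  listFiles(path: string): Promise<unknown>;
  pause(): Promise<void>; // hypothetical
  resume(): Promise<void>; // hypothetical; done via CLI in the actual test
}

async function controlExperiment(createSandbox: () => Promise<ControlSandbox>) {
  const sb = await createSandbox(); // 1. create sandbox
  await sb.execute("echo control-marker > /home/agentuity/probe.txt"); // 2. put a file in
  await sb.execute("cat /home/agentuity/probe.txt"); // 3a. verify via exec
  await sb.listFiles("/home/agentuity"); // 3b. verify via fs listing
  await sb.pause(); // 4. pause
  await sb.resume(); // 5. resume
  await sb.execute("cat /home/agentuity/probe.txt"); // 6. persistence + exec after resume
  await sb.listFiles("/home/agentuity"); //    fresh fs access after resume
}
```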
Expected behavior
If the sandbox reports `idle`, the following should also work consistently:
- `execute()`
- filesystem list/read operations
- queued reconnect-nudge executions should start and complete or fail with a concrete process-level error

The platform should not report an `idle` sandbox whose filesystem mount path is missing and whose exec/control operations are stuck or returning `500`.