fix(runner,operator): improve session backup and resume reliability #1057
Gkrumbach07 merged 1 commit into main
Conversation
Three issues caused data loss and broken session resume on runner OOM:

1. **Repo data not backed up periodically**: `backup_git_repos()` was only called on SIGTERM (graceful shutdown). When the runner OOMed, state-sync never received SIGTERM and repo changes were lost. Now backs up every 5th sync cycle (~5 min) via the configurable `REPO_BACKUP_INTERVAL`.
2. **Session resume broken after OOM**: The CLI session ID was only persisted to disk after a turn completed. If the runner OOMed during the first turn, the session ID file never existed in S3. On pod restart, `--resume` was never passed and the agent lost all context. Now persists immediately when the CLI returns the session ID in its init message, via an `on_session_id` callback.
3. **Operator not detecting runner OOM**: The pod watch predicate only triggered on pod phase changes. When the runner container OOMed, the pod phase stayed "Running" (the state-sync sidecar was still alive), so the operator never reconciled and pods sat in the OOMKilled state indefinitely. Now also triggers on individual container termination transitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
**Walkthrough**

The pull request updates pod watch reconciliation triggers in an Agentic Session controller, refactors session persistence timing in an ambient runner's Claude bridge, and introduces periodic git repository backups in the state sync process.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes. Pre-merge checks: ✅ 3 of 3 passed.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/operator/internal/controller/agenticsession_controller.go`:
- Around line 264-283: The reconcileRunning function currently ignores
pod.Status.ContainerStatuses, so when the runner container terminates (e.g.,
OOM) but the sidecar keeps the pod phase at Running the session never
transitions; update reconcileRunning to iterate pod.Status.ContainerStatuses,
detect when the runner container (identify by name used for the runner
container) has State.Terminated or lastTerminationState.Terminated with a
non-zero exit code or OOM reason, and then set the AgenticSession status to the
appropriate terminal state (e.g., Failed) or trigger your recovery path (update
session.Status, emit events, and stop requeue), ensuring you reference
reconcileRunning, the pod object passed in, and Status.ContainerStatuses when
locating where to add this logic.
In `@components/runners/state-sync/sync.sh`:
- Around line 288-298: The loop can divide by zero when REPO_BACKUP_INTERVAL is
0; add a startup/config guard that validates REPO_BACKUP_INTERVAL and treats
non-positive or unset values as "disabled" (or set a safe default >0). Update
the periodic backup condition in the sync loop (which references sync_count and
calls backup_git_repos) to first check that REPO_BACKUP_INTERVAL is a positive
integer before performing the modulo, and skip backup_git_repos when disabled;
ensure any parsing/validation happens once at init so the sync_to_s3 and
check_size logic remain unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 09493fb0-e39e-49a7-a62e-1ab0dc4346f4
📒 Files selected for processing (4)

- components/operator/internal/controller/agenticsession_controller.go
- components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
- components/runners/ambient-runner/ambient_runner/bridges/claude/session.py
- components/runners/state-sync/sync.sh
```go
// Trigger if pod phase changed
if e.ObjectOld.Status.Phase != e.ObjectNew.Status.Phase {
	return true
}
// Trigger if a container newly terminated (e.g. runner OOM).
// Pod phase stays Running when one container dies but
// the sidecar is still alive, so we must also check
// individual container statuses.
oldTerminated := make(map[string]bool, len(e.ObjectOld.Status.ContainerStatuses))
for _, cs := range e.ObjectOld.Status.ContainerStatuses {
	if cs.State.Terminated != nil {
		oldTerminated[cs.Name] = true
	}
}
for _, cs := range e.ObjectNew.Status.ContainerStatuses {
	if cs.State.Terminated != nil && !oldTerminated[cs.Name] {
		return true
	}
}
return false
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify whether reconcileRunning handles container termination
echo "=== Checking reconcileRunning for ContainerStatuses handling ==="
rg -n -A5 -B2 'ContainerStatus' components/operator/internal/controller/reconcile_phases.go
echo ""
echo "=== Checking if any reconcile function handles OOM/Terminated containers ==="
rg -n 'Terminated|OOM' components/operator/internal/controller/
echo ""
echo "=== Full reconcileRunning function to confirm no container status checks ==="
ast-grep --pattern $'func ($_ *AgenticSessionReconciler) reconcileRunning($_, $_) ($_, $_) {
  $$$
}'
```

Repository: ambient-code/platform
Length of output: 9137
**Predicate fires on container OOM, but reconcileRunning ignores container status.**

The predicate correctly detects when a container newly terminates. However, `reconcileRunning` (lines 281-356) never inspects `Status.ContainerStatuses`. When the runner container OOMs:

- Pod phase stays `Running` (sidecar still alive)
- Predicate fires and enqueues reconciliation
- `reconcileRunning` checks pod existence (found), annotations (unchanged), generation (unchanged) → returns requeue after 30 seconds
- AgenticSession remains in `Running` state indefinitely despite the dead runner

Add logic to `reconcileRunning` to detect a terminated runner container and transition the session appropriately (e.g., to `Failed` or initiate recovery).
```shell
sync_count=0
while true; do
  check_size || echo "Size check warning (continuing anyway)"
  # Periodically backup git repos (every Nth cycle) so repo state
  # is preserved even if the runner container OOMs without SIGTERM
  sync_count=$((sync_count + 1))
  if [ $((sync_count % REPO_BACKUP_INTERVAL)) -eq 0 ]; then
    backup_git_repos || echo "Repo backup had errors (continuing)"
  fi
  sync_to_s3 || echo "Sync failed, will retry in ${SYNC_INTERVAL}s..."
  sleep ${SYNC_INTERVAL}
```
**Guard against REPO_BACKUP_INTERVAL=0 to prevent division by zero.**

If a user sets `REPO_BACKUP_INTERVAL=0` (perhaps expecting "disabled"), the modulo operation on line 294 will produce a division-by-zero error in bash. Depending on shell behavior with `set -e`, this could crash the sync loop.

Consider adding a guard near the configuration section to default invalid values:

Suggested fix

```diff
 REPO_BACKUP_INTERVAL="${REPO_BACKUP_INTERVAL:-5}"  # Backup repos every Nth sync cycle
+# Ensure REPO_BACKUP_INTERVAL is at least 1 to avoid division by zero
+if [ "${REPO_BACKUP_INTERVAL}" -lt 1 ] 2>/dev/null; then
+  REPO_BACKUP_INTERVAL=5
+fi
```

Otherwise, the periodic backup logic correctly addresses the OOM scenario by ensuring repo state is captured even without SIGTERM.
🧰 Tools
🪛 Shellcheck (0.11.0)
[info] 298-298: Double quote to prevent globbing and word splitting.
(SC2086)
Summary
Three issues caused data loss and broken session resume when the runner container OOMed:
- `backup_git_repos()` only ran on SIGTERM. Runner OOM meant no SIGTERM → no repo backup. Now runs periodically every ~5 min.
- The session ID was only persisted after a completed turn, so `--resume` was never passed on restart → agent lost all context. Now persists immediately on CLI init.
- Pod phase stays `Running` when one container dies (sidecar still alive). OOMKilled pods sat indefinitely. Now triggers on container termination too.

Changes
- `state-sync/sync.sh`: run `backup_git_repos` every Nth sync cycle (configurable `REPO_BACKUP_INTERVAL`, default 5)
- `session.py`: new `on_session_id` callback persists the session ID to disk immediately on the CLI init message
- `bridge.py`: invokes the callback when the CLI reports its session ID
- `agenticsession_controller.go`: pod watch predicate also fires on individual container termination transitions

Root cause investigation
Diagnosed from live OOMKilled pods in the `bug-bash-mturley` namespace on UAT:

- `ambient-code-runner` container OOMKilled (exit 137, 8Gi limit)
- `--resume` works fine on sessions with dangling tool calls (verified via SDK)
- `claude_session_ids.json` never existed in S3 because the runner OOMed during its first turn, before `_persist_session_ids()` was called
- Pod phase stayed `Running`, so the operator never detected the failure

Test plan
- Operator build verified (`go vet`, `go build`)

🤖 Generated with Claude Code