
fix(runner,operator): improve session backup and resume reliability #1057

Merged
Gkrumbach07 merged 1 commit into main from worktree-better-backup on Mar 26, 2026

Conversation

@Gkrumbach07
Contributor

Summary

Three issues caused data loss and broken session resume when the runner container OOMed:

  • Repo data lost on OOM: backup_git_repos() only ran on SIGTERM. Runner OOM meant no SIGTERM → no repo backup. Now runs periodically every ~5 min.
  • Session resume broken: CLI session ID was only persisted after turn completion. OOM during first turn → no session ID in S3 → --resume never passed on restart → agent lost all context. Now persists immediately on CLI init.
  • Operator didn't detect OOM: Pod watch predicate only checked pod phase, which stays Running when one container dies (sidecar still alive). OOMKilled pods sat indefinitely. Now triggers on container termination too.

Changes

| File | Change |
| --- | --- |
| `state-sync/sync.sh` | Periodic `backup_git_repos` every Nth sync cycle (configurable `REPO_BACKUP_INTERVAL`, default 5) |
| `session.py` | `on_session_id` callback persists session ID to disk immediately on CLI init message |
| `bridge.py` | Resume failure detection + removed redundant post-turn persist |
| `agenticsession_controller.go` | Pod watch predicate triggers on container termination, not just pod phase |
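The session-persistence change can be sketched in Python. This is a simplified stand-in for the real `SessionWorker`/`SessionManager` classes, keeping only the callback wiring the PR describes; the message shape and the on-disk format are assumptions, not the actual code:

```python
import json
from pathlib import Path
from typing import Callable, Optional


class SessionWorker:
    """Simplified stand-in: fires on_session_id as soon as the CLI's
    init message arrives, instead of waiting for the turn to finish."""

    def __init__(self, on_session_id: Optional[Callable[[str], None]] = None):
        self.session_id: Optional[str] = None
        self._on_session_id = on_session_id

    def handle_message(self, message: dict) -> None:
        # The CLI reports its session ID in its first (init) message.
        if message.get("type") == "init" and self.session_id is None:
            self.session_id = message["session_id"]
            if self._on_session_id is not None:
                # Persist immediately: an OOM later in the turn can no
                # longer lose the ID needed for --resume on restart.
                self._on_session_id(self.session_id)


def persist_session_id(path: Path, session_id: str) -> None:
    """Write the session ID to disk so state-sync can upload it to S3."""
    path.write_text(json.dumps({"session_id": session_id}))
```

With the old ordering, an OOM between init and turn completion left no session-ID file for state-sync to upload; with the callback, the file exists on disk as soon as the CLI starts.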

Root cause investigation

Diagnosed from live OOMKilled pods in bug-bash-mturley namespace on UAT:

  • Confirmed ambient-code-runner container OOMKilled (exit 137, 8Gi limit)
  • State-sync kept running but never backed up repos (only on SIGTERM)
  • Session JSONL was valid (237 lines, all valid JSON) and --resume works fine on sessions with dangling tool calls (verified via SDK)
  • The actual cause: claude_session_ids.json never existed in S3 because the runner OOMed during its first turn, before _persist_session_ids() was called
  • Operator pod watch predicate filtered out the event because pod phase stayed Running

Test plan

  • Runner tests pass (488 passed, 11 skipped)
  • Operator builds clean (go vet, go build)
  • Deploy to kind cluster and verify periodic repo backup runs
  • Simulate runner OOM and verify operator detects container termination
  • Verify session resume works after OOM recovery

🤖 Generated with Claude Code

Three issues caused data loss and broken session resume on runner OOM:

1. **Repo data not backed up periodically**: `backup_git_repos()` was only
   called on SIGTERM (graceful shutdown). When the runner OOMed, state-sync
   never received SIGTERM and repo changes were lost. Now backs up every
   5th sync cycle (~5 min) via configurable REPO_BACKUP_INTERVAL.

2. **Session resume broken after OOM**: The CLI session ID was only
   persisted to disk after a turn completed. If the runner OOMed during
   the first turn, the session ID file never existed in S3. On pod
   restart, `--resume` was never passed and the agent lost all context.
   Now persists immediately when the CLI returns the session ID in its
   init message via an `on_session_id` callback.

3. **Operator not detecting runner OOM**: The pod watch predicate only
   triggered on pod phase changes. When the runner container OOMed, pod
   phase stayed "Running" (state-sync sidecar still alive), so the
   operator never reconciled — pods sat in OOMKilled state indefinitely.
   Now also triggers on individual container termination transitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 26, 2026

Walkthrough

The pull request updates pod watch reconciliation triggers in an Agentic Session controller, refactors session persistence timing in an ambient runner's Claude bridge, and introduces periodic git repository backups in the state sync process.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Pod Watch Predicate**<br>`components/operator/internal/controller/agenticsession_controller.go` | Modified the pod watch predicate to trigger reconciliations when either the pod's `Status.Phase` changes or any container within `Status.ContainerStatuses` transitions to a Terminated state (compared against previous status). |
| **Session Persistence**<br>`components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`, `components/runners/ambient-runner/ambient_runner/bridges/claude/session.py` | Refactored session ID persistence: moved from end-of-turn in `SessionManager.destroy()` to an immediate callback upon receiving the SDK init message. Added optional `on_session_id` callback to `SessionWorker` and new `SessionManager._on_session_id()` method for earlier persistence. Bridge now compares `worker.session_id` against the pre-run resume lookup and logs a warning on mismatch. |
| **Git Repo Backup**<br>`components/runners/state-sync/sync.sh` | Introduced `REPO_BACKUP_INTERVAL` environment variable (default 5) and added periodic backup logic to the main sync loop: calls `backup_git_repos` every Nth iteration (when `sync_count % REPO_BACKUP_INTERVAL == 0`) with non-fatal error logging. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes across the PR: improving session backup and resume reliability by fixing three root causes of data loss when the runner OOMs. |
| Description check | ✅ Passed | The description is well-structured and directly related to the changeset, explaining the root causes, specific changes per file, and investigation findings that led to the fixes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/controller/agenticsession_controller.go`:
- Around line 264-283: The reconcileRunning function currently ignores
pod.Status.ContainerStatuses, so when the runner container terminates (e.g.,
OOM) but the sidecar keeps the pod phase at Running the session never
transitions; update reconcileRunning to iterate pod.Status.ContainerStatuses,
detect when the runner container (identify by name used for the runner
container) has State.Terminated or lastTerminationState.Terminated with a
non-zero exit code or OOM reason, and then set the AgenticSession status to the
appropriate terminal state (e.g., Failed) or trigger your recovery path (update
session.Status, emit events, and stop requeue), ensuring you reference
reconcileRunning, the pod object passed in, and Status.ContainerStatuses when
locating where to add this logic.

In `@components/runners/state-sync/sync.sh`:
- Around line 288-298: The loop can divide by zero when REPO_BACKUP_INTERVAL is
0; add a startup/config guard that validates REPO_BACKUP_INTERVAL and treats
non-positive or unset values as "disabled" (or set a safe default >0). Update
the periodic backup condition in the sync loop (which references sync_count and
calls backup_git_repos) to first check that REPO_BACKUP_INTERVAL is a positive
integer before performing the modulo, and skip backup_git_repos when disabled;
ensure any parsing/validation happens once at init so the sync_to_s3 and
check_size logic remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 09493fb0-e39e-49a7-a62e-1ab0dc4346f4

📥 Commits

Reviewing files that changed from the base of the PR and between 9ca8168 and 84c5cfe.

📒 Files selected for processing (4)
  • components/operator/internal/controller/agenticsession_controller.go
  • components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
  • components/runners/ambient-runner/ambient_runner/bridges/claude/session.py
  • components/runners/state-sync/sync.sh

Comment on lines +264 to +283
```go
// Trigger if pod phase changed
if e.ObjectOld.Status.Phase != e.ObjectNew.Status.Phase {
	return true
}
// Trigger if a container newly terminated (e.g. runner OOM).
// Pod phase stays Running when one container dies but
// the sidecar is still alive, so we must also check
// individual container statuses.
oldTerminated := make(map[string]bool, len(e.ObjectOld.Status.ContainerStatuses))
for _, cs := range e.ObjectOld.Status.ContainerStatuses {
	if cs.State.Terminated != nil {
		oldTerminated[cs.Name] = true
	}
}
for _, cs := range e.ObjectNew.Status.ContainerStatuses {
	if cs.State.Terminated != nil && !oldTerminated[cs.Name] {
		return true
	}
}
return false
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Verify whether reconcileRunning handles container termination

echo "=== Checking reconcileRunning for ContainerStatuses handling ==="
rg -n -A5 -B2 'ContainerStatus' components/operator/internal/controller/reconcile_phases.go

echo ""
echo "=== Checking if any reconcile function handles OOM/Terminated containers ==="
rg -n 'Terminated|OOM' components/operator/internal/controller/

echo ""
echo "=== Full reconcileRunning function to confirm no container status checks ==="
ast-grep --pattern $'func ($_ *AgenticSessionReconciler) reconcileRunning($_, $_) ($_, $_) {
  $$$
}'
```

Repository: ambient-code/platform

Length of output: 9137


Predicate fires on container OOM, but reconcileRunning ignores container status.

The predicate correctly detects when a container newly terminates. However, reconcileRunning (lines 281-356) never inspects Status.ContainerStatuses. When the runner container OOMs:

  1. Pod phase stays Running (sidecar still alive)
  2. Predicate fires and enqueues reconciliation
  3. reconcileRunning checks pod existence (found), annotations (unchanged), generation (unchanged) → returns requeue after 30 seconds
  4. AgenticSession remains in Running state indefinitely despite the dead runner

Add logic to reconcileRunning to detect a terminated runner container and transition the session appropriately (e.g., to Failed or initiate recovery).


Comment on lines +288 to 298
```bash
sync_count=0
while true; do
    check_size || echo "Size check warning (continuing anyway)"
    # Periodically backup git repos (every Nth cycle) so repo state
    # is preserved even if the runner container OOMs without SIGTERM
    sync_count=$((sync_count + 1))
    if [ $((sync_count % REPO_BACKUP_INTERVAL)) -eq 0 ]; then
        backup_git_repos || echo "Repo backup had errors (continuing)"
    fi
    sync_to_s3 || echo "Sync failed, will retry in ${SYNC_INTERVAL}s..."
    sleep ${SYNC_INTERVAL}
```

⚠️ Potential issue | 🟡 Minor

Guard against REPO_BACKUP_INTERVAL=0 to prevent division by zero.

If a user sets REPO_BACKUP_INTERVAL=0 (perhaps expecting "disabled"), the modulo operation on line 294 will produce a division-by-zero error in bash. Depending on shell behavior with set -e, this could crash the sync loop.

Consider adding a guard near the configuration section to default invalid values:

Suggested fix:

```diff
 REPO_BACKUP_INTERVAL="${REPO_BACKUP_INTERVAL:-5}"  # Backup repos every Nth sync cycle
+# Ensure REPO_BACKUP_INTERVAL is at least 1 to avoid division by zero
+if [ "${REPO_BACKUP_INTERVAL}" -lt 1 ] 2>/dev/null; then
+    REPO_BACKUP_INTERVAL=5
+fi
```

Otherwise, the periodic backup logic correctly addresses the OOM scenario by ensuring repo state is captured even without SIGTERM.

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 298-298: Double quote to prevent globbing and word splitting.

(SC2086)


@Gkrumbach07 Gkrumbach07 merged commit 896aaa1 into main Mar 26, 2026
38 checks passed
@Gkrumbach07 Gkrumbach07 deleted the worktree-better-backup branch March 26, 2026 18:51
