
fix(runner,operator): improve session backup and resume reliability #1057

Merged
Gkrumbach07 merged 1 commit into main from worktree-better-backup on Mar 26, 2026

Conversation

@Gkrumbach07
Contributor

Summary

Three issues caused data loss and broken session resume when the runner container OOMed:

  • Repo data lost on OOM: backup_git_repos() only ran on SIGTERM. Runner OOM meant no SIGTERM → no repo backup. Now runs periodically every ~5 min.
  • Session resume broken: CLI session ID was only persisted after turn completion. OOM during first turn → no session ID in S3 → --resume never passed on restart → agent lost all context. Now persists immediately on CLI init.
  • Operator didn't detect OOM: Pod watch predicate only checked pod phase, which stays Running when one container dies (sidecar still alive). OOMKilled pods sat indefinitely. Now triggers on container termination too.

Changes

| File | Change |
| --- | --- |
| `state-sync/sync.sh` | Periodic `backup_git_repos` every Nth sync cycle (configurable `REPO_BACKUP_INTERVAL`, default 5) |
| `session.py` | `on_session_id` callback persists session ID to disk immediately on CLI init message |
| `bridge.py` | Resume failure detection + removed redundant post-turn persist |
| `agenticsession_controller.go` | Pod watch predicate triggers on container termination, not just pod phase |
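The session-persistence change can be sketched in Python. This is a simplified stand-in for the real `SessionWorker`/`SessionManager` classes, keeping only the callback wiring the PR describes; the message shape and the on-disk format are assumptions, not the actual code:

```python
import json
from pathlib import Path
from typing import Callable, Optional


class SessionWorker:
    """Simplified stand-in: fires on_session_id as soon as the CLI's
    init message arrives, instead of waiting for the turn to finish."""

    def __init__(self, on_session_id: Optional[Callable[[str], None]] = None):
        self.session_id: Optional[str] = None
        self._on_session_id = on_session_id

    def handle_message(self, message: dict) -> None:
        # The CLI reports its session ID in its first (init) message.
        if message.get("type") == "init" and self.session_id is None:
            self.session_id = message["session_id"]
            if self._on_session_id is not None:
                # Persist immediately: an OOM later in the turn can no
                # longer lose the ID needed for --resume on restart.
                self._on_session_id(self.session_id)


def persist_session_id(path: Path, session_id: str) -> None:
    """Write the session ID to disk so state-sync can upload it to S3."""
    path.write_text(json.dumps({"session_id": session_id}))
```

With the old ordering, an OOM between init and turn completion left no session-ID file for state-sync to upload; with the callback, the file exists on disk as soon as the CLI starts.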

Root cause investigation

Diagnosed from live OOMKilled pods in bug-bash-mturley namespace on UAT:

  • Confirmed ambient-code-runner container OOMKilled (exit 137, 8Gi limit)
  • State-sync kept running but never backed up repos (only on SIGTERM)
  • Session JSONL was valid (237 lines, all valid JSON) and --resume works fine on sessions with dangling tool calls (verified via SDK)
  • The actual cause: claude_session_ids.json never existed in S3 because the runner OOMed during its first turn, before _persist_session_ids() was called
  • Operator pod watch predicate filtered out the event because pod phase stayed Running

Test plan

  • Runner tests pass (488 passed, 11 skipped)
  • Operator builds clean (go vet, go build)
  • Deploy to kind cluster and verify periodic repo backup runs
  • Simulate runner OOM and verify operator detects container termination
  • Verify session resume works after OOM recovery

🤖 Generated with Claude Code

Three issues caused data loss and broken session resume on runner OOM:

1. **Repo data not backed up periodically**: `backup_git_repos()` was only
   called on SIGTERM (graceful shutdown). When the runner OOMed, state-sync
   never received SIGTERM and repo changes were lost. Now backs up every
   5th sync cycle (~5 min) via configurable REPO_BACKUP_INTERVAL.

2. **Session resume broken after OOM**: The CLI session ID was only
   persisted to disk after a turn completed. If the runner OOMed during
   the first turn, the session ID file never existed in S3. On pod
   restart, `--resume` was never passed and the agent lost all context.
   Now persists immediately when the CLI returns the session ID in its
   init message via an `on_session_id` callback.

3. **Operator not detecting runner OOM**: The pod watch predicate only
   triggered on pod phase changes. When the runner container OOMed, pod
   phase stayed "Running" (state-sync sidecar still alive), so the
   operator never reconciled — pods sat in OOMKilled state indefinitely.
   Now also triggers on individual container termination transitions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Mar 26, 2026

Walkthrough

The pull request updates pod watch reconciliation triggers in an Agentic Session controller, refactors session persistence timing in an ambient runner's Claude bridge, and introduces periodic git repository backups in the state sync process.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Pod Watch Predicate**<br>`components/operator/internal/controller/agenticsession_controller.go` | Modified the pod watch predicate to trigger reconciliations when either the pod's `Status.Phase` changes or any container within `Status.ContainerStatuses` transitions to a Terminated state (compared against previous status). |
| **Session Persistence**<br>`components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py`, `components/runners/ambient-runner/ambient_runner/bridges/claude/session.py` | Refactored session ID persistence: moved from end-of-turn in `SessionManager.destroy()` to an immediate callback upon receiving the SDK init message. Added optional `on_session_id` callback to `SessionWorker` and new `SessionManager._on_session_id()` method for earlier persistence. Bridge now compares `worker.session_id` against the pre-run resume lookup and logs a warning on mismatch. |
| **Git Repo Backup**<br>`components/runners/state-sync/sync.sh` | Introduced `REPO_BACKUP_INTERVAL` environment variable (default 5) and added periodic backup logic to the main sync loop: calls `backup_git_repos` every Nth iteration (when `sync_count % REPO_BACKUP_INTERVAL == 0`) with non-fatal error logging. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main changes across the PR: improving session backup and resume reliability by fixing three root causes of data loss when the runner OOMs. |
| Description check | ✅ Passed | The description is well-structured and directly related to the changeset, explaining the root causes, specific changes per file, and investigation findings that led to the fixes. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/operator/internal/controller/agenticsession_controller.go`:
- Around line 264-283: The reconcileRunning function currently ignores
pod.Status.ContainerStatuses, so when the runner container terminates (e.g.,
OOM) but the sidecar keeps the pod phase at Running the session never
transitions; update reconcileRunning to iterate pod.Status.ContainerStatuses,
detect when the runner container (identify by name used for the runner
container) has State.Terminated or lastTerminationState.Terminated with a
non-zero exit code or OOM reason, and then set the AgenticSession status to the
appropriate terminal state (e.g., Failed) or trigger your recovery path (update
session.Status, emit events, and stop requeue), ensuring you reference
reconcileRunning, the pod object passed in, and Status.ContainerStatuses when
locating where to add this logic.

In `@components/runners/state-sync/sync.sh`:
- Around line 288-298: The loop can divide by zero when REPO_BACKUP_INTERVAL is
0; add a startup/config guard that validates REPO_BACKUP_INTERVAL and treats
non-positive or unset values as "disabled" (or set a safe default >0). Update
the periodic backup condition in the sync loop (which references sync_count and
calls backup_git_repos) to first check that REPO_BACKUP_INTERVAL is a positive
integer before performing the modulo, and skip backup_git_repos when disabled;
ensure any parsing/validation happens once at init so the sync_to_s3 and
check_size logic remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 09493fb0-e39e-49a7-a62e-1ab0dc4346f4

📥 Commits

Reviewing files that changed from the base of the PR and between 9ca8168 and 84c5cfe.

📒 Files selected for processing (4)
  • components/operator/internal/controller/agenticsession_controller.go
  • components/runners/ambient-runner/ambient_runner/bridges/claude/bridge.py
  • components/runners/ambient-runner/ambient_runner/bridges/claude/session.py
  • components/runners/state-sync/sync.sh

Comment on lines +264 to +283
```go
// Trigger if pod phase changed
if e.ObjectOld.Status.Phase != e.ObjectNew.Status.Phase {
	return true
}
// Trigger if a container newly terminated (e.g. runner OOM).
// Pod phase stays Running when one container dies but
// the sidecar is still alive, so we must also check
// individual container statuses.
oldTerminated := make(map[string]bool, len(e.ObjectOld.Status.ContainerStatuses))
for _, cs := range e.ObjectOld.Status.ContainerStatuses {
	if cs.State.Terminated != nil {
		oldTerminated[cs.Name] = true
	}
}
for _, cs := range e.ObjectNew.Status.ContainerStatuses {
	if cs.State.Terminated != nil && !oldTerminated[cs.Name] {
		return true
	}
}
return false
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Verify whether reconcileRunning handles container termination

echo "=== Checking reconcileRunning for ContainerStatuses handling ==="
rg -n -A5 -B2 'ContainerStatus' components/operator/internal/controller/reconcile_phases.go

echo ""
echo "=== Checking if any reconcile function handles OOM/Terminated containers ==="
rg -n 'Terminated|OOM' components/operator/internal/controller/

echo ""
echo "=== Full reconcileRunning function to confirm no container status checks ==="
ast-grep --pattern $'func ($_ *AgenticSessionReconciler) reconcileRunning($_, $_) ($_, $_) {
  $$$
}'
```

Repository: ambient-code/platform

Length of output: 9137


Predicate fires on container OOM, but reconcileRunning ignores container status.

The predicate correctly detects when a container newly terminates. However, reconcileRunning (lines 281-356) never inspects Status.ContainerStatuses. When the runner container OOMs:

  1. Pod phase stays Running (sidecar still alive)
  2. Predicate fires and enqueues reconciliation
  3. reconcileRunning checks pod existence (found), annotations (unchanged), generation (unchanged) → returns requeue after 30 seconds
  4. AgenticSession remains in Running state indefinitely despite the dead runner

Add logic to reconcileRunning to detect a terminated runner container and transition the session appropriately (e.g., to Failed or initiate recovery).


Comment on lines +288 to 298
```bash
sync_count=0
while true; do
    check_size || echo "Size check warning (continuing anyway)"
    # Periodically backup git repos (every Nth cycle) so repo state
    # is preserved even if the runner container OOMs without SIGTERM
    sync_count=$((sync_count + 1))
    if [ $((sync_count % REPO_BACKUP_INTERVAL)) -eq 0 ]; then
        backup_git_repos || echo "Repo backup had errors (continuing)"
    fi
    sync_to_s3 || echo "Sync failed, will retry in ${SYNC_INTERVAL}s..."
    sleep ${SYNC_INTERVAL}
```

⚠️ Potential issue | 🟡 Minor

Guard against REPO_BACKUP_INTERVAL=0 to prevent division by zero.

If a user sets REPO_BACKUP_INTERVAL=0 (perhaps expecting "disabled"), the modulo operation on line 294 will produce a division-by-zero error in bash. Depending on shell behavior with set -e, this could crash the sync loop.

Consider adding a guard near the configuration section to default invalid values:

Suggested fix:

```diff
 REPO_BACKUP_INTERVAL="${REPO_BACKUP_INTERVAL:-5}"  # Backup repos every Nth sync cycle
+# Ensure REPO_BACKUP_INTERVAL is at least 1 to avoid division by zero
+if [ "${REPO_BACKUP_INTERVAL}" -lt 1 ] 2>/dev/null; then
+    REPO_BACKUP_INTERVAL=5
+fi
```

Otherwise, the periodic backup logic correctly addresses the OOM scenario by ensuring repo state is captured even without SIGTERM.

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 298-298: Double quote to prevent globbing and word splitting.

(SC2086)


@Gkrumbach07 Gkrumbach07 merged commit 896aaa1 into main Mar 26, 2026
38 checks passed
@Gkrumbach07 Gkrumbach07 deleted the worktree-better-backup branch March 26, 2026 18:51
