-
Notifications
You must be signed in to change notification settings - Fork 73
fix(runner,operator): improve session backup and resume reliability #1057
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -10,6 +10,7 @@ NAMESPACE="${NAMESPACE:-default}" | |
| SESSION_NAME="${SESSION_NAME:-unknown}" | ||
| SYNC_INTERVAL="${SYNC_INTERVAL:-60}" | ||
| MAX_SYNC_SIZE="${MAX_SYNC_SIZE:-1073741824}" # 1GB default | ||
| REPO_BACKUP_INTERVAL="${REPO_BACKUP_INTERVAL:-5}" # Backup repos every Nth sync cycle | ||
|
|
||
| # Sanitize inputs to prevent path traversal | ||
| NAMESPACE="${NAMESPACE//[^a-zA-Z0-9-]/}" | ||
|
|
@@ -261,6 +262,7 @@ echo "Session: ${NAMESPACE}/${SESSION_NAME}" | |
| echo "S3 Endpoint: ${S3_ENDPOINT}" | ||
| echo "S3 Bucket: ${S3_BUCKET}" | ||
| echo "Sync interval: ${SYNC_INTERVAL}s" | ||
| echo "Repo backup every: ${REPO_BACKUP_INTERVAL} sync cycles" | ||
| echo "Max sync size: ${MAX_SYNC_SIZE} bytes" | ||
| echo "=========================================" | ||
|
|
||
|
|
@@ -283,8 +285,15 @@ echo "Waiting 30s for workspace to populate..." | |
| sleep 30 | ||
|
|
||
| # Main sync loop | ||
| sync_count=0 | ||
| while true; do | ||
| check_size || echo "Size check warning (continuing anyway)" | ||
| # Periodically backup git repos (every Nth cycle) so repo state | ||
| # is preserved even if the runner container OOMs without SIGTERM | ||
| sync_count=$((sync_count + 1)) | ||
| if [ $((sync_count % REPO_BACKUP_INTERVAL)) -eq 0 ]; then | ||
| backup_git_repos || echo "Repo backup had errors (continuing)" | ||
| fi | ||
| sync_to_s3 || echo "Sync failed, will retry in ${SYNC_INTERVAL}s..." | ||
| sleep ${SYNC_INTERVAL} | ||
|
Comment on lines
+288
to
298
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Guard against If a user sets Consider adding a guard near the configuration section to default invalid values: Suggested fix REPO_BACKUP_INTERVAL="${REPO_BACKUP_INTERVAL:-5}" # Backup repos every Nth sync cycle
+# Ensure REPO_BACKUP_INTERVAL is at least 1 to avoid division by zero
+if [ "${REPO_BACKUP_INTERVAL}" -lt 1 ] 2>/dev/null; then
+ REPO_BACKUP_INTERVAL=5
+fiOtherwise, the periodic backup logic correctly addresses the OOM scenario by ensuring repo state is captured even without SIGTERM. 🧰 Tools🪛 Shellcheck (0.11.0)[info] 298-298: Double quote to prevent globbing and word splitting. (SC2086) 🤖 Prompt for AI Agents |
||
| done | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
Repository: ambient-code/platform
Length of output: 9137
Predicate fires on container OOM, but
reconcileRunningignores container status.The predicate correctly detects when a container newly terminates. However,
reconcileRunning(lines 281-356) never inspectsStatus.ContainerStatuses. When the runner container OOMs:Running(sidecar still alive)reconcileRunningchecks pod existence (found), annotations (unchanged), generation (unchanged) → returns requeue after 30 secondsRunningstate indefinitely despite the dead runnerAdd logic to
reconcileRunningto detect a terminated runner container and transition the session appropriately (e.g., toFailedor initiate recovery).🤖 Prompt for AI Agents