Skip to content

fix(runner): add health probes and improve INITIAL_PROMPT error logging#1529

Merged
mergify[bot] merged 1 commit into
mainfrom
runner-probes
May 8, 2026
Merged

fix(runner): add health probes and improve INITIAL_PROMPT error logging#1529
mergify[bot] merged 1 commit into
mainfrom
runner-probes

Conversation

@jeremyeder
Copy link
Copy Markdown
Contributor

@jeremyeder jeremyeder commented May 8, 2026

Runner pods had no K8s health probes, so the Service routed traffic before FastAPI finished starting. The backend's proxy requests (e.g., GET /git/status) hit a connection error and returned 503 "runner unavailable". Separately, INITIAL_PROMPT retry errors were undebuggable because exceptions like asyncio.TimeoutError have empty str() representations, producing log lines like error: , retrying in 2s.

Readiness probe (/health, 3s initial delay, 5s period) gates the Service so traffic only reaches the runner once it's actually serving. Liveness probe (/health, 20s initial delay, 30s period) restarts the pod if the runner becomes completely unresponsive. The 20s liveness delay accounts for gRPC listener setup (up to 10s) and MCP server connections during startup.

Error logging now includes type(e).__name__ in retry warnings, so TimeoutError() or ClientConnectorError(...) appears instead of a blank string. The final failure log also reports the last exception.

Test plan

  • go build ./... and go vet ./... pass for operator
  • 22/22 test_app_initial_prompt.py tests pass
  • Deploy to cluster: kubectl describe pod <runner> should show both probes; pod should not become Ready until /health returns 200

Compound Engineering
HARNESS

Summary by CodeRabbit

  • Chores

    • Added health check monitoring to improve system resilience and reliability.
  • Bug Fixes

    • Enhanced authentication validation security with improved token verification.
    • Improved diagnostic logging for connection failures and retry attempts.

Runner pods had no readiness or liveness probes, causing the K8s Service
to route traffic before FastAPI was ready (503 "runner unavailable").
INITIAL_PROMPT retry errors logged empty exception messages because
asyncio.TimeoutError and similar exceptions have empty str() representations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@netlify
Copy link
Copy Markdown

netlify Bot commented May 8, 2026

Deploy Preview for cheerful-kitten-f556a0 canceled.

Name Link
🔨 Latest commit 2dc2186
🔍 Latest deploy log https://app.netlify.com/projects/cheerful-kitten-f556a0/deploys/69fd491de9bca90008d2900d

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 8, 2026

📝 Walkthrough

Walkthrough

This PR adds Kubernetes health probes to the operator's Pod spec for the ambient-code-runner container and improves the runner app's token authentication formatting and failure logging robustness.

Changes

Runner Health & Robustness

Layer / File(s) Summary
Health Probe Configuration
components/operator/internal/handlers/sessions.go
Kubernetes readinessProbe and livenessProbe added to the runner container spec, both probing /health endpoint on runnerPort with configurable initial delays, periods, timeouts, and failure thresholds.
App Resilience
components/runners/ambient-runner/ambient_runner/app.py
Token authentication middleware refactored to use constant-time comparison with secrets.compare_digest; HTTP retry logic enhanced to track and log the last exception with type/message and backoff details instead of just attempt count.
🚥 Pre-merge checks | ✅ 6 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Kubernetes Resource Safety ⚠️ Warning Init container 'init-hydrate' missing resource limits/requests. Other resources properly scoped, have OwnerReferences, and RBAC is not overly permissive. Add Resources field to init-hydrate container with appropriate CPU/memory requests and limits (e.g., 100m/128Mi requests, 1000m/1Gi limits) matching state-sync pattern.
✅ Passed checks (6 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed Title follows Conventional Commits format (fix scope) and accurately describes both main changes: adding health probes and improving error logging.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Performance And Algorithmic Complexity ✅ Passed No performance regressions detected. Probes are static K8s config; retry loop has bounded iterations (8) with one HTTP POST per attempt; token reading is cached and acceptable for retries.
Security And Secret Handling ✅ Passed Token auth uses constant-time comparison. Bot token not logged. Health endpoint sanitized. No plaintext credentials, hardcoded secrets, injection vulnerabilities, or sensitive data leakage detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch runner-probes
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch runner-probes

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1047-1070: The pod spec currently sets RestartPolicyNever which
prevents containers from being restarted when the liveness probe kills an
unresponsive runner; change the pod's RestartPolicy from RestartPolicyNever to
corev1.RestartPolicyAlways (or corev1.RestartPolicyOnFailure if you prefer
restarts only on non-zero exits) where the PodSpec is constructed (replace
RestartPolicyNever with corev1.RestartPolicyAlways in the code that sets the
pod's RestartPolicy).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0c663774-a65b-45bc-9e7d-545dc0ee9c01

📥 Commits

Reviewing files that changed from the base of the PR and between a32da5f and 2dc2186.

📒 Files selected for processing (2)
  • components/operator/internal/handlers/sessions.go
  • components/runners/ambient-runner/ambient_runner/app.py

Comment on lines +1047 to +1070
ReadinessProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/health",
Port: intstr.FromInt32(runnerPort),
},
},
InitialDelaySeconds: 3,
PeriodSeconds: 5,
TimeoutSeconds: 2,
FailureThreshold: 3,
},
LivenessProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/health",
Port: intstr.FromInt32(runnerPort),
},
},
InitialDelaySeconds: 20,
PeriodSeconds: 30,
TimeoutSeconds: 5,
FailureThreshold: 3,
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify probe config and restart policy coexist in the runner pod spec.
rg -n -C2 'RestartPolicy:\s+corev1\.RestartPolicy|ReadinessProbe:|LivenessProbe:' components/operator/internal/handlers/sessions.go

Repository: ambient-code/platform

Length of output: 595


Liveness probe will not restart containers due to RestartPolicyNever

The liveness probe will kill unresponsive runners, but the pod spec's RestartPolicyNever (line 919) prevents container restart. This defeats the self-healing goal.

Suggested fix
-    RestartPolicy:                 corev1.RestartPolicyNever,
+    RestartPolicy:                 corev1.RestartPolicyOnFailure,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@components/operator/internal/handlers/sessions.go` around lines 1047 - 1070,
The pod spec currently sets RestartPolicyNever which prevents containers from
being restarted when the liveness probe kills an unresponsive runner; change the
pod's RestartPolicy from RestartPolicyNever to corev1.RestartPolicyAlways (or
corev1.RestartPolicyOnFailure if you prefer restarts only on non-zero exits)
where the PodSpec is constructed (replace RestartPolicyNever with
corev1.RestartPolicyAlways in the code that sets the pod's RestartPolicy).

@jeremyeder
Copy link
Copy Markdown
Contributor Author

@wcmitchell here we go. take a look?

@mergify mergify Bot added the queued label May 8, 2026
@mergify
Copy link
Copy Markdown
Contributor

mergify Bot commented May 8, 2026

Merge Queue Status

  • Entered queue2026-05-08 02:41 UTC · Rule: default
  • Checks skipped · PR is already up-to-date
  • Merged2026-05-08 02:41 UTC · at 2dc218666f6bbce8ce4c3402c2fa18e863efba72 · squash

This pull request spent 13 seconds in the queue, including 2 seconds running CI.

Required conditions to merge

@mergify mergify Bot merged commit 6821b77 into main May 8, 2026
68 checks passed
@mergify mergify Bot deleted the runner-probes branch May 8, 2026 02:41
@mergify mergify Bot removed the queued label May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant