fix(runner): add health probes and improve INITIAL_PROMPT error logging by jeremyeder · Pull Request #1529 · ambient-code/platform

jeremyeder · 2026-05-08T02:23:23Z

Runner pods had no K8s health probes, so the Service routed traffic before FastAPI finished starting. The backend's proxy requests (e.g., GET /git/status) hit a connection error and returned 503 "runner unavailable". Separately, INITIAL_PROMPT retry errors were undebuggable because exceptions like asyncio.TimeoutError have empty str() representations, producing log lines like error: , retrying in 2s.

Readiness probe (/health, 3s initial delay, 5s period) gates the Service so traffic only reaches the runner once it's actually serving. Liveness probe (/health, 20s initial delay, 30s period) restarts the pod if the runner becomes completely unresponsive. The 20s liveness delay accounts for gRPC listener setup (up to 10s) and MCP server connections during startup.

Error logging now includes type(e).__name__ in retry warnings, so TimeoutError() or ClientConnectorError(...) appears instead of a blank string. The final failure log also reports the last exception.

Test plan

go build ./... and go vet ./... pass for operator
22/22 test_app_initial_prompt.py tests pass
Deploy to cluster: kubectl describe pod <runner> should show both probes; pod should not become Ready until /health returns 200

Summary by CodeRabbit

Chores
- Added health check monitoring to improve system resilience and reliability.
Bug Fixes
- Enhanced authentication validation security with improved token verification.
- Improved diagnostic logging for connection failures and retry attempts.

Runner pods had no readiness or liveness probes, causing the K8s Service to route traffic before FastAPI was ready (503 "runner unavailable"). INITIAL_PROMPT retry errors logged empty exception messages because asyncio.TimeoutError and similar exceptions have empty str() representations. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

netlify · 2026-05-08T02:23:27Z

✅ Deploy Preview for cheerful-kitten-f556a0 canceled.

Name	Link
🔨 Latest commit	`2dc2186`
🔍 Latest deploy log	https://app.netlify.com/projects/cheerful-kitten-f556a0/deploys/69fd491de9bca90008d2900d

coderabbitai · 2026-05-08T02:23:38Z

📝 Walkthrough

Walkthrough

This PR adds Kubernetes health probes to the operator's Pod spec for the ambient-code-runner container and improves the runner app's token authentication formatting and failure logging robustness.

Changes

Runner Health & Robustness

Layer / File(s)	Summary
Health Probe Configuration `components/operator/internal/handlers/sessions.go`	Kubernetes `readinessProbe` and `livenessProbe` added to the runner container spec, both probing `/health` endpoint on `runnerPort` with configurable initial delays, periods, timeouts, and failure thresholds.
App Resilience `components/runners/ambient-runner/ambient_runner/app.py`	Token authentication middleware refactored to use constant-time comparison with `secrets.compare_digest`; HTTP retry logic enhanced to track and log the last exception with type/message and backoff details instead of just attempt count.

🚥 Pre-merge checks | ✅ 6 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Kubernetes Resource Safety	⚠️ Warning	Init container 'init-hydrate' missing resource limits/requests. Other resources properly scoped, have OwnerReferences, and RBAC is not overly permissive.	Add Resources field to init-hydrate container with appropriate CPU/memory requests and limits (e.g., 100m/128Mi requests, 1000m/1Gi limits) matching state-sync pattern.

✅ Passed checks (6 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	Title follows Conventional Commits format (fix scope) and accurately describes both main changes: adding health probes and improving error logging.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Performance And Algorithmic Complexity	✅ Passed	No performance regressions detected. Probes are static K8s config; retry loop has bounded iterations (8) with one HTTP POST per attempt; token reading is cached and acceptable for retries.
Security And Secret Handling	✅ Passed	Token auth uses constant-time comparison. Bot token not logged. Health endpoint sanitized. No plaintext credentials, hardcoded secrets, injection vulnerabilities, or sensitive data leakage detected.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch runner-probes

✨ Simplify code

Create PR with simplified code
Commit simplified code in branch runner-probes

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1047-1070: The pod spec currently sets RestartPolicyNever which
prevents containers from being restarted when the liveness probe kills an
unresponsive runner; change the pod's RestartPolicy from RestartPolicyNever to
corev1.RestartPolicyAlways (or corev1.RestartPolicyOnFailure if you prefer
restarts only on non-zero exits) where the PodSpec is constructed (replace
RestartPolicyNever with corev1.RestartPolicyAlways in the code that sets the
pod's RestartPolicy).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 0c663774-a65b-45bc-9e7d-545dc0ee9c01

📥 Commits

Reviewing files that changed from the base of the PR and between a32da5f and 2dc2186.

📒 Files selected for processing (2)

components/operator/internal/handlers/sessions.go
components/runners/ambient-runner/ambient_runner/app.py

coderabbitai · 2026-05-08T02:27:26Z

+			ReadinessProbe: &corev1.Probe{
+				ProbeHandler: corev1.ProbeHandler{
+					HTTPGet: &corev1.HTTPGetAction{
+						Path: "/health",
+						Port: intstr.FromInt32(runnerPort),
+					},
+				},
+				InitialDelaySeconds: 3,
+				PeriodSeconds:       5,
+				TimeoutSeconds:      2,
+				FailureThreshold:    3,
+			},
+			LivenessProbe: &corev1.Probe{
+				ProbeHandler: corev1.ProbeHandler{
+					HTTPGet: &corev1.HTTPGetAction{
+						Path: "/health",
+						Port: intstr.FromInt32(runnerPort),
+					},
+				},
+				InitialDelaySeconds: 20,
+				PeriodSeconds:       30,
+				TimeoutSeconds:      5,
+				FailureThreshold:    3,
+			},


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify probe config and restart policy coexist in the runner pod spec. rg -n -C2 'RestartPolicy:\s+corev1\.RestartPolicy|ReadinessProbe:|LivenessProbe:' components/operator/internal/handlers/sessions.go

Repository: ambient-code/platform

Length of output: 595

Liveness probe will not restart containers due to RestartPolicyNever

The liveness probe will kill unresponsive runners, but the pod spec's RestartPolicyNever (line 919) prevents container restart. This defeats the self-healing goal.

Suggested fix

- RestartPolicy: corev1.RestartPolicyNever, + RestartPolicy: corev1.RestartPolicyOnFailure,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@components/operator/internal/handlers/sessions.go` around lines 1047 - 1070, The pod spec currently sets RestartPolicyNever which prevents containers from being restarted when the liveness probe kills an unresponsive runner; change the pod's RestartPolicy from RestartPolicyNever to corev1.RestartPolicyAlways (or corev1.RestartPolicyOnFailure if you prefer restarts only on non-zero exits) where the PodSpec is constructed (replace RestartPolicyNever with corev1.RestartPolicyAlways in the code that sets the pod's RestartPolicy).

jeremyeder · 2026-05-08T02:32:46Z

@wcmitchell here we go. take a look?

mergify · 2026-05-08T02:41:22Z

Merge Queue Status

✅ Entered queue — 2026-05-08 02:41 UTC · Rule: default
✅ Checks skipped · PR is already up-to-date
✅ Merged — 2026-05-08 02:41 UTC · at 2dc218666f6bbce8ce4c3402c2fa18e863efba72 · squash

This pull request spent 13 seconds in the queue, including 2 seconds running CI.

Required conditions to merge

coderabbitai Bot reviewed May 8, 2026

View reviewed changes

mergify Bot added the queued label May 8, 2026

mergify Bot merged commit 6821b77 into main May 8, 2026
68 checks passed

mergify Bot deleted the runner-probes branch May 8, 2026 02:41

mergify Bot removed the queued label May 8, 2026

This was referenced May 8, 2026

fix(runner): add health probes and improve INITIAL_PROMPT error logging #1534

Closed

fix(runner): add health probes and improve INITIAL_PROMPT error logging RedHatInsights/ambient-code-platform#68

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(runner): add health probes and improve INITIAL_PROMPT error logging#1529

fix(runner): add health probes and improve INITIAL_PROMPT error logging#1529
mergify[bot] merged 1 commit into
mainfrom
runner-probes

jeremyeder commented May 8, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

netlify Bot commented May 8, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 8, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 8, 2026

Uh oh!

jeremyeder commented May 8, 2026

Uh oh!

mergify Bot commented May 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeremyeder commented May 8, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Summary by CodeRabbit

Uh oh!

netlify Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for cheerful-kitten-f556a0 canceled.

Uh oh!

coderabbitai Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

jeremyeder commented May 8, 2026

Uh oh!

mergify Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Merge Queue Status

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jeremyeder commented May 8, 2026 •

edited by coderabbitai Bot

Loading

netlify Bot commented May 8, 2026 •

edited

Loading

coderabbitai Bot commented May 8, 2026 •

edited

Loading

mergify Bot commented May 8, 2026 •

edited

Loading