fix(runner): add health probes and improve INITIAL_PROMPT error logging #1529
Conversation
Runner pods had no readiness or liveness probes, causing the K8s Service to route traffic before FastAPI was ready (503 "runner unavailable"). INITIAL_PROMPT retry errors logged empty exception messages because `asyncio.TimeoutError` and similar exceptions have empty `str()` representations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
📝 Walkthrough
This PR adds Kubernetes health probes to the operator's Pod spec for the ambient-code-runner container and improves the runner app's token authentication formatting and failure-logging robustness.

Changes: Runner Health & Robustness
🚥 Pre-merge checks: 6 passed, 2 warnings.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@components/operator/internal/handlers/sessions.go`:
- Around line 1047-1070: The pod spec currently sets RestartPolicyNever which
prevents containers from being restarted when the liveness probe kills an
unresponsive runner; change the pod's RestartPolicy from RestartPolicyNever to
corev1.RestartPolicyAlways (or corev1.RestartPolicyOnFailure if you prefer
restarts only on non-zero exits) where the PodSpec is constructed (replace
RestartPolicyNever with corev1.RestartPolicyAlways in the code that sets the
pod's RestartPolicy).
📒 Files selected for processing (2)
- components/operator/internal/handlers/sessions.go
- components/runners/ambient-runner/ambient_runner/app.py
```go
ReadinessProbe: &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/health",
			Port: intstr.FromInt32(runnerPort),
		},
	},
	InitialDelaySeconds: 3,
	PeriodSeconds:       5,
	TimeoutSeconds:      2,
	FailureThreshold:    3,
},
LivenessProbe: &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/health",
			Port: intstr.FromInt32(runnerPort),
		},
	},
	InitialDelaySeconds: 20,
	PeriodSeconds:       30,
	TimeoutSeconds:      5,
	FailureThreshold:    3,
},
```
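For reference, the same probe settings expressed as a plain pod YAML fragment (illustrative only — the real spec is constructed in Go by the operator, and the port `8080` here is a placeholder for `runnerPort`):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080        # placeholder for runnerPort
  initialDelaySeconds: 3
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health
    port: 8080        # placeholder for runnerPort
  initialDelaySeconds: 20
  periodSeconds: 30
  timeoutSeconds: 5
  failureThreshold: 3
```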
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify probe config and restart policy coexist in the runner pod spec.
rg -n -C2 'RestartPolicy:\s+corev1\.RestartPolicy|ReadinessProbe:|LivenessProbe:' components/operator/internal/handlers/sessions.go
```

Repository: ambient-code/platform
Length of output: 595
Liveness probe will not restart containers due to RestartPolicyNever
The liveness probe will kill unresponsive runners, but the pod spec's RestartPolicyNever (line 919) prevents container restart. This defeats the self-healing goal.
Suggested fix

```diff
- RestartPolicy: corev1.RestartPolicyNever,
+ RestartPolicy: corev1.RestartPolicyOnFailure,
```
@wcmitchell here we go. take a look?
Merge Queue Status
This pull request spent 13 seconds in the queue, including 2 seconds running CI.
Runner pods had no K8s health probes, so the Service routed traffic before FastAPI finished starting. The backend's proxy requests (e.g., `GET /git/status`) hit a connection error and returned 503 "runner unavailable". Separately, INITIAL_PROMPT retry errors were undebuggable because exceptions like `asyncio.TimeoutError` have empty `str()` representations, producing log lines like `error: , retrying in 2s`.

Readiness probe (`/health`, 3s initial delay, 5s period) gates the Service so traffic only reaches the runner once it's actually serving. Liveness probe (`/health`, 20s initial delay, 30s period) restarts the pod if the runner becomes completely unresponsive. The 20s liveness delay accounts for gRPC listener setup (up to 10s) and MCP server connections during startup.

Error logging now includes `type(e).__name__` in retry warnings, so `TimeoutError()` or `ClientConnectorError(...)` appears instead of a blank string. The final failure log also reports the last exception.

Test plan
- `go build ./...` and `go vet ./...` pass for operator
- `test_app_initial_prompt.py` tests pass
- `kubectl describe pod <runner>` should show both probes; pod should not become Ready until `/health` returns 200
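The logging change can be sketched as follows. This is a minimal illustration of the `type(e).__name__` idea, not the runner's actual code — the function names, attempt count, and backoff schedule are all assumptions:

```python
import asyncio
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("runner")

def describe(exc: BaseException) -> str:
    # asyncio.TimeoutError has an empty str(); include the type name
    # so logs show "TimeoutError" instead of a blank string.
    msg = str(exc)
    return f"{type(exc).__name__}({msg})" if msg else type(exc).__name__

async def send_with_retry(send, attempts: int = 3, base_delay: float = 2.0):
    """Call an async send() with retries, logging useful error details."""
    last_exc = None
    for attempt in range(1, attempts + 1):
        try:
            return await send()
        except Exception as e:
            last_exc = e
            delay = base_delay * attempt
            log.warning("INITIAL_PROMPT attempt %d failed: %s, retrying in %ss",
                        attempt, describe(e), delay)
            await asyncio.sleep(delay)
    # The final failure log also reports the last exception.
    log.error("INITIAL_PROMPT failed after %d attempts: %s",
              attempts, describe(last_exc))
    raise last_exc
```

With this shape, a timeout logs as `TimeoutError` and an aiohttp connection failure as `ClientConnectorError(...)`, rather than an empty message.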