Workflow reconciliation fails on session creation and is never retried
Summary
Sessions created with activeWorkflow configuration never receive the workflow. The operator attempts to POST the workflow config to the runner's HTTP endpoint before the pod is created, the POST fails with a DNS resolution error, and observedGeneration is then set to the current generation, which prevents any retry.
Observed Behavior
Every session created with activeWorkflow (via acp_create_session with workflow_git_url, workflow_branch, workflow_path) shows:
WorkflowReconciled: False
Reason: UpdateFailed
Message: "Failed to notify runner: Post http://session-{name}.{namespace}.svc.cluster.local:8001/workflow:
dial tcp: lookup session-{name}.{namespace}.svc.cluster.local: no such host"
The session starts and enters Running phase, but the workflow is never loaded. The runner receives the initialPrompt without any workflow context (no system prompt, no skills, no rubric).
This has reproduced on every attempt so far: two consecutive session creations both failed identically.
Expected Behavior
The workflow should be applied to the runner after it starts. The WorkflowReconciled condition should eventually become True.
Root Cause
Four-part failure chain, starting in components/operator/internal/handlers/sessions.go:
1. Workflow POST attempted before pod exists (line 679)
During the Pending phase handler, reconcileActiveWorkflowWithPatch() is called before the pod is created. It tries to POST to http://session-{name}.{namespace}.svc.cluster.local:8001/workflow, but the pod and its Service don't exist yet — DNS resolution fails.
// Line 676-679 — called BEFORE pod creation at ~line 1520
spec, _, _ := unstructured.NestedMap(currentObj.Object, "spec")
_ = reconcileSpecReposWithPatch(sessionNamespace, name, spec, currentObj, statusPatch)
_ = reconcileActiveWorkflowWithPatch(sessionNamespace, name, spec, currentObj, statusPatch) // ← fails here
Note: reconcileSpecReposWithPatch works fine because it only sets status conditions — actual repo cloning is done by init containers in the pod spec. But reconcileActiveWorkflowWithPatch requires the runner to be listening on :8001.
2. Error silently discarded (line 679)
The error is assigned to _, so the flow continues to pod creation as if nothing went wrong. The WorkflowReconciled=False condition is batched into the statusPatch but doesn't block progress.
3. observedGeneration set despite failure (line 1543)
After pod creation succeeds, observedGeneration is set to currentObj.GetGeneration():
// Line 1542-1543
statusPatch.SetField("phase", "Creating")
statusPatch.SetField("observedGeneration", currentObj.GetGeneration()) // ← marks spec as "reconciled"
This tells the Running phase reconciler that the spec has been fully applied — even though the workflow POST failed.
4. Running phase never retries (reconcile_phases.go line 325)
When the session reaches Running phase, reconcileRunning() checks for generation drift:
if currentGen != observedGen && observedGen != 0 {
// reconcile spec changes...
}
Since observedGen (1) equals currentGen (1), this is false — workflow reconciliation is never retried. No other mechanism checks for WorkflowReconciled=False.
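The effect of that guard can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the operator; only the boolean expression is copied from the snippet above. It shows that when generation drift is the only retry trigger, a failed workflow POST on a freshly created session (generation 1, observedGeneration 1) is never picked up again.

```go
package main

// needsSpecReconcile mirrors the guard quoted above: drift is reported only
// when the generations differ AND a non-zero observedGeneration was recorded.
func needsSpecReconcile(currentGen, observedGen int64) bool {
	return currentGen != observedGen && observedGen != 0
}

// workflowRetried (hypothetical helper) reports whether a failed workflow
// POST would be retried if generation drift is the only retry trigger.
func workflowRetried(currentGen, observedGen int64, workflowReconciled bool) bool {
	if workflowReconciled {
		return false // nothing to retry
	}
	return needsSpecReconcile(currentGen, observedGen)
}
```

With currentGen and observedGen both at 1 and the workflow unreconciled, workflowRetried returns false, which matches the behavior described in this issue. It only returns true after a later spec edit bumps the generation.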
Suggested Fix
Several options (not mutually exclusive):
Option A: Move workflow reconciliation to after runner is ready
Don't call reconcileActiveWorkflowWithPatch() at line 679. Instead, add a check in reconcileRunning() (or reconcileCreating() after RunnerStarted=True) that detects WorkflowReconciled != True and retries.
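A rough sketch of what Option A's gating could look like, under stated assumptions: sessionState, errRunnerNotReady, and reconcileWorkflowWhenReady are hypothetical names invented for illustration, and the real POST is abstracted as a callback so the example stays self-contained. The idea is simply that the POST is never attempted until the runner is reported as started, and that it is attempted again on every pass while WorkflowReconciled is not True.

```go
package main

import "errors"

// sessionState is a hypothetical, simplified view of the session status.
type sessionState struct {
	HasActiveWorkflow  bool   // spec.activeWorkflow is set
	RunnerStarted      bool   // RunnerStarted=True condition
	WorkflowReconciled string // "True", "False", or "" when unset
}

// errRunnerNotReady signals that the caller should requeue and try later.
var errRunnerNotReady = errors.New("runner not ready")

// reconcileWorkflowWhenReady calls post only when a workflow is configured,
// it is not yet reconciled, and the runner has started. It returns the
// WorkflowReconciled status to record after this pass.
func reconcileWorkflowWhenReady(s sessionState, post func() error) (string, error) {
	if !s.HasActiveWorkflow || s.WorkflowReconciled == "True" {
		return s.WorkflowReconciled, nil // nothing to do
	}
	if !s.RunnerStarted {
		return s.WorkflowReconciled, errRunnerNotReady // do not POST yet
	}
	if err := post(); err != nil {
		return "False", err // surface the error instead of discarding it
	}
	return "True", nil
}
```

Returning the error rather than assigning it to _ lets the caller decide whether to requeue, which also addresses the silent discard described in step 2.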
Option B: Don't set observedGeneration if workflow reconciliation failed
Only set observedGeneration at line 1543 if both repo and workflow reconciliation succeeded. This would cause the Running phase to detect currentGen != observedGen and retry.
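A minimal sketch of Option B, with hypothetical helper names (nextObservedGeneration, quotedDriftGuard, relaxedDriftGuard) that do not exist in the operator. One caveat worth checking against the real code: if an unset observedGeneration reads as 0 (an assumption here), the guard quoted in step 4 still evaluates to false because of its observedGen != 0 clause, so Option B may only work if that clause is relaxed as well.

```go
package main

import "errors"

// errPostFailed stands in for the DNS failure returned by the workflow POST.
var errPostFailed = errors.New("workflow POST failed")

// nextObservedGeneration returns the value to record in observedGeneration:
// it advances to current only when both reconciliations succeeded.
func nextObservedGeneration(previous, current int64, repoErr, workflowErr error) int64 {
	if repoErr != nil || workflowErr != nil {
		return previous // leave the spec marked as not yet applied
	}
	return current
}

// quotedDriftGuard is the guard as quoted in step 4 of this issue.
func quotedDriftGuard(currentGen, observedGen int64) bool {
	return currentGen != observedGen && observedGen != 0
}

// relaxedDriftGuard is a hypothetical variant without the non-zero clause.
func relaxedDriftGuard(currentGen, observedGen int64) bool {
	return currentGen != observedGen
}
```

For a new session whose workflow POST failed, nextObservedGeneration(0, 1, nil, errPostFailed) keeps the value at 0. quotedDriftGuard(1, 0) is then still false, while relaxedDriftGuard(1, 0) is true.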
Option C: Check WorkflowReconciled condition in reconcileRunning
Add a condition check in reconcileRunning() independent of generation drift:
if workflowCondition == "False" && workflowReason == "UpdateFailed" {
// Retry workflow reconciliation
}
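The check above leaves open where workflowCondition and workflowReason come from. One way to derive them is sketched below, assuming the conditions live under status.conditions as a list of objects with type, status, and reason keys (the usual Kubernetes convention, but an assumption about this CRD). conditionStatus and workflowNeedsRetry are hypothetical helpers, and plain maps stand in for the decoded unstructured object.

```go
package main

// conditionStatus looks up a condition by type in status["conditions"],
// shaped the way an unstructured object decodes it (a []interface{} of maps).
func conditionStatus(status map[string]interface{}, condType string) (st, reason string, found bool) {
	conds, ok := status["conditions"].([]interface{})
	if !ok {
		return "", "", false
	}
	for _, c := range conds {
		m, ok := c.(map[string]interface{})
		if !ok {
			continue // skip malformed entries
		}
		if t, _ := m["type"].(string); t != condType {
			continue
		}
		st, _ = m["status"].(string)
		reason, _ = m["reason"].(string)
		return st, reason, true
	}
	return "", "", false
}

// workflowNeedsRetry is true when WorkflowReconciled is False/UpdateFailed,
// independent of any generation drift.
func workflowNeedsRetry(status map[string]interface{}) bool {
	st, reason, found := conditionStatus(status, "WorkflowReconciled")
	return found && st == "False" && reason == "UpdateFailed"
}
```

Because the decision depends only on the condition, a retry would fire even when observedGeneration already equals the current generation.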
Option A is the cleanest — the workflow POST can only succeed when the runner is listening, so it should never be attempted before the pod exists.
Reproduction Steps
- Create a session with workflow configuration:
acp_create_session(
session_name="test-workflow",
workflow_git_url="https://github.com/org/repo.git",
workflow_branch="main",
workflow_path=".ambient/workflows/my-workflow"
)
- Check session conditions: WorkflowReconciled will be False/UpdateFailed
- Wait for session to reach Running phase: workflow is never applied
- Runner receives initialPrompt without workflow context
Environment
- Platform: Ambient Code (OpenShift)
- Operator version: current main
- Confirmed on two independent session creations in the quay project namespace
Key Files
| File | Lines | What |
| --- | --- | --- |
| handlers/sessions.go | 676-679 | Workflow POST called before pod creation |
| handlers/sessions.go | 1542-1543 | observedGeneration set despite failure |
| handlers/sessions.go | 1907-2003 | reconcileActiveWorkflowWithPatch() implementation |
| controller/reconcile_phases.go | 320-335 | reconcileRunning() generation drift check |
| controller/agenticsession_controller.go | 215-231 | Watch predicates (no retry trigger for condition changes) |