diff --git a/CHANGELOG.md b/CHANGELOG.md
index 10c99be..dd957bf 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,12 @@ Please choose versions by [Semantic Versioning](http://semver.org/).
## Unreleased
+- feat(task/executor): add ZombieReason enum with stable reason strings for all zombie failure modes (image_pull_backoff, pod_evicted, pod_crash_no_stdout, deadline_exceeded)
+- feat(task/executor): add zombieSweeperIntervalSeconds and zombieJobTimeoutSeconds CRD fields with admission validation floors (10s and 30s respectively)
+- feat(task/executor): propagate zombieJobTimeoutSeconds through AgentConfiguration and stamp Job.Spec.ActiveDeadlineSeconds on every spawned Job
+- feat(task/executor): add Pods informer to JobWatcher for ImagePullBackOff, evicted, and crash-no-stdout failure detection
+- feat(task/executor): narrow Job-condition path reason into ZombieReason enum (DeadlineExceeded/BackoffLimitExceeded → deadline_exceeded, other → pod_crash_no_stdout)
+- feat(task/executor): add deadline sweeper goroutine that classifies zombie tasks and publishes failure events via result publisher; wired into service.Run lifecycle
- feat(task/controller): add `agent_controller_vault_scanner_skipped_files_total{reason}` counter and promote operator-actionable skip logs to `glog.Errorf`, restoring Prometheus observability for files silently skipped by the vault scanner; references the 2026-05-31 / 2026-06-01 incident and advances [[Make Parked Agent Tasks Visible to Operator]]
## v0.63.35
diff --git a/lib/mocks/mocks.go b/lib/mocks/mocks.go
index ef72dfd..f726b26 100644
--- a/lib/mocks/mocks.go
+++ b/lib/mocks/mocks.go
@@ -1,5 +1 @@
-// Copyright (c) 2026 Benjamin Borbe All rights reserved.
-// Use of this source code is governed by a BSD-style
-// license that can be found in the LICENSE file.
-
package mocks
diff --git a/prompts/completed/194-spec-043-doctrine-publishers-and-dedupe.md b/prompts/completed/194-spec-043-doctrine-publishers-and-dedupe.md
new file mode 100644
index 0000000..e696030
--- /dev/null
+++ b/prompts/completed/194-spec-043-doctrine-publishers-and-dedupe.md
@@ -0,0 +1,174 @@
+---
+status: completed
+spec: [043-executor-zombie-job-detection]
+summary: Rewrote PublishFailure to emit update+increment paired commands with TTL LRU dedupe; rewrote PublishTypeMismatchFailure to emit assignee/previous_assignee/current_job only; updated all affected tests
+container: agent-zombie-detect-exec-194-spec-043-doctrine-publishers-and-dedupe
+dark-factory-version: v0.173.0
+created: "2026-06-01T20:30:00Z"
+queued: "2026-06-01T20:11:58Z"
+started: "2026-06-01T20:12:03Z"
+completed: "2026-06-01T20:16:40Z"
+---
+
+
+- Fix `PublishFailure` to follow the retry-aware doctrine: leave `phase`, `status`, and `assignee` untouched; clear `current_job`; bump `trigger_count` atomically alongside the body append.
+- Fix `PublishTypeMismatchFailure` to escalate immediately via the doctrine-correct shape: leave `phase` and `status` untouched; clear `assignee`; set `previous_assignee`; clear `current_job`.
+- Add a publish-layer dedupe so two classifications for the same job emit one Kafka event.
+- Update the existing publisher unit tests so they assert the new doctrine shape rather than the old `phase: human_review` / `phase: ai_review` shape.
+- After this prompt, transient zombies will participate in the existing `trigger_count` retry cap and type-mismatch failures will surface in the operator inbox via the `assignee == ""` filter directly.
+
+
+
+Make `task/executor/pkg/result_publisher.go` emit the doctrine-correct frontmatter shapes for the two existing failure publishers and add LRU dedupe keyed by `current_job` so a second emission for the same job within a bounded TTL is a no-op.
+
+
+
+Read `CLAUDE.md` for project conventions.
+
+Spec: `specs/in-progress/043-executor-zombie-job-detection.md` (sections: Desired Behavior 1, 2, 7; Acceptance Criteria 1, 2, 3, 4, 9).
+
+Doctrine reference: `specs/completed/039-controller-stop-setting-human-review-on-failure.md` and `docs/task-flow-and-failure-semantics.md` (Status Taxonomy & Inbox Signal). `phase: human_review` is reserved for agent-emitted successful verdicts that need human confirmation — never for failure escalation.
+
+Files to read before changing:
+- `task/executor/pkg/result_publisher.go` — current implementation; note `PublishFailure` at line 84 currently writes `phase: human_review`, `PublishTypeMismatchFailure` at line 121 currently writes `phase: ai_review`. Both must be rewritten.
+- `task/executor/pkg/result_publisher_test.go` — existing tests at lines 166 (`PublishFailure`) and 216 (`PublishTypeMismatchFailure`); these assertions will be flipped by this change.
+- `lib/command/task/update-frontmatter-command.go` — `UpdateFrontmatterCommand{TaskIdentifier, Updates, Body}` and `BodySection{Heading, Section}` shapes.
+- `lib/command/task/increment-frontmatter-command.go` — `IncrementFrontmatterCommand{TaskIdentifier, Field, Delta}` and `IncrementFrontmatterCommandOperation = "increment-frontmatter"`.
+- `task/controller/pkg/result/result_writer.go` — `applyTriggerCap` at line 234 (this is the chokepoint the new shape allows to fire) and `clearAssignee` at line 268 (sets `previous_assignee` controller-side; the executor's type-mismatch path must set it directly because no cap-mediated controller-side clear fires for type mismatch).
+
+Coding plugin docs to consult:
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-error-wrapping-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-testing-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-time-injection.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-concurrency-patterns.md`
+
+
+
+### 1. Rewrite `PublishFailure` for the retry-aware zombie doctrine
+
+In `task/executor/pkg/result_publisher.go`, replace the body of `PublishFailure(ctx context.Context, task lib.Task, jobName string, reason string) error` so it:
+
+1. Builds the `## Failure` body section exactly as today (timestamp from `p.currentDateTime.Now().UTC().Format(time.RFC3339)`, job name, reason — preserve the existing format string).
+2. Publishes EXACTLY two CQRS commands in this order, both via the existing `p.publishRaw(...)` helper:
+ - First: an `UpdateFrontmatterCommand` whose `Updates` map contains ONLY `"current_job": ""` and whose `Body` is the `## Failure` section. NO `status`, NO `phase`, NO `assignee`, NO `previous_assignee` key in `Updates`.
+ - Second: an `IncrementFrontmatterCommand{TaskIdentifier: task.TaskIdentifier, Field: "trigger_count", Delta: 1}`, published with operation `taskcmd.IncrementFrontmatterCommandOperation`. This is the same shape the existing `PublishIncrementTriggerCount` method already uses — model the call after it.
+3. If the first `publishRaw` returns an error, return immediately wrapped via `errors.Wrapf(ctx, err, "publish zombie failure update for task %s", task.TaskIdentifier)` and do NOT publish the increment. Because the dedupe entry is recorded ONLY after both publishes succeed (see requirement 3), the caller's next-cycle retry is NOT suppressed and can attempt the publish again — matching the spec's Failure Modes row "Kafka publish of failure command fails: sweeper retries next cycle".
+4. If the second `publishRaw` returns an error, return it wrapped via `errors.Wrapf(ctx, err, "publish zombie failure trigger_count increment for task %s", task.TaskIdentifier)`. The `## Failure` body section has already been written at this point; that is acceptable — because dedupe has NOT yet been recorded, the caller's next retry will re-attempt both publishes. The controller-side write path is idempotent (`applyTriggerCap` re-reads frontmatter on every result write), so a re-applied `## Failure` body append is tolerable. Record the dedupe entry ONLY after BOTH commands succeed; this preserves Kafka-failure recovery while still blocking concurrent in-process duplicates (the dedupe is recorded synchronously before the function returns success, so a racing caller that arrives after a successful publish sees the entry).
+
+The two-command sequence is the atomicity contract called out in spec DB #1 ("a single atomic write — either `update-frontmatter` with a paired increment, or a composite command — agent decides at impl time"). Two sequential publishes are the chosen path because (a) `IncrementFrontmatterCommand` is the existing primitive for atomic counter bumps and the only safe way to increment under concurrent writes, and (b) the controller's `applyTriggerCap` re-reads frontmatter on every result write, so eventual consistency suffices.
+
+### 2. Rewrite `PublishTypeMismatchFailure` for immediate escalation
+
+In `task/executor/pkg/result_publisher.go`, replace the body of `PublishTypeMismatchFailure(ctx context.Context, task lib.Task, reason string) error` so it publishes ONE `UpdateFrontmatterCommand` whose `Updates` map contains EXACTLY 3 keys (`assignee`, `previous_assignee`, `current_job`) when the prior assignee is non-empty, and EXACTLY 2 keys (`assignee`, `current_job`) when it is empty:
+
+- `"assignee": ""`
+- `"previous_assignee": <prior assignee>`, where the value is read via `string(task.Frontmatter.Assignee())` (`Assignee()` returns the `TaskAssignee` string alias defined in `lib/agent_task-frontmatter.go`; cast directly to `string`). If the prior assignee is empty, do NOT emit this key (degenerate state, but defensive).
+- `"current_job": ""`
+
+`Body` is the existing `## Failure` section (preserve the existing format that includes the assignee bullet and reason). NO `status`, NO `phase` keys in `Updates`.
+
+Wrap publish errors via `errors.Wrapf(ctx, err, "publish type mismatch failure for task %s", task.TaskIdentifier)`.
+
+Type mismatch does NOT participate in dedupe (it is called once per Kafka task event in `task_event_handler.go:199`; the dedupe layer added in requirement 3 keys by `current_job` and is purpose-built for zombie classifications that can race between the informer and the sweeper).
+
+### 3. Add publish-layer dedupe for zombie failures
+
+Add an internal LRU keyed by `current_job` (the `jobName` parameter of `PublishFailure`) to the `resultPublisher` struct. The LRU prevents two concurrent classifications from publishing twice for the same job.
+
+3a. **Storage.** Use `github.com/hashicorp/golang-lru/v2` if it is already a dependency; otherwise implement a minimal map + RWMutex + insertion-order list with manual eviction. Verify dependency: run `grep -r 'hashicorp/golang-lru' ~/Documents/workspaces/agent-zombie-detect/go.mod ~/Documents/workspaces/agent-zombie-detect/go.sum` — if absent, use the inline map + mutex variant; do NOT add new module dependencies in this prompt.
+
+3b. **Capacity and TTL.** Capacity pinned at 1024 entries (constant). TTL pinned at `2 * zombieJobTimeoutSeconds` = `3600 * time.Second` (constant — the CRD-derived value is wired in prompt 4; this prompt uses the hardcoded default so the publisher does not yet need configuration plumbing). Define as package-level constants:
+
+```go
+const dedupeCapacity = 1024
+const dedupeTTL = 3600 * time.Second
+```
+
+3c. **Behavior.** Add two unexported methods on `*resultPublisher` (see 3d for the split rationale): `checkDedupe(jobName string) bool` returns `true` if a non-expired entry exists for `jobName` (caller should no-op), `false` otherwise (caller should proceed). `recordDedupe(jobName string)` inserts/refreshes the entry with the current timestamp; called only AFTER both publishes succeed. Entries past TTL are treated as absent (re-publish allowed; controller idempotency via `applyTriggerCap` handles the result-write side).
+
+3d. **Wiring.** At the top of `PublishFailure`, before any publish, perform a dedupe CHECK (read-only — does NOT record yet) via an unexported method `checkDedupe(jobName string) bool` that returns `true` if a non-expired entry exists. If `checkDedupe` returns `true`, emit `glog.V(2).Infof("event=zombie_dedupe job=%s task=%s", jobName, task.TaskIdentifier)` and `return nil`. Then perform both publishes. ONLY after BOTH publishes succeed, call `p.recordDedupe(jobName)` to insert the entry. This ordering preserves Kafka-failure recovery: if either publish fails, the dedupe entry is NOT set, so the caller's next-cycle retry can attempt publish again — matching the spec's Failure Modes row "Kafka publish of failure command fails: sweeper retries next cycle; dedupe LRU prevents double-publish once Kafka recovers." (The "prevents double-publish once Kafka recovers" clause refers to subsequent successful cycles: once one publish succeeds, dedupe is set and any racing caller is blocked.)
+
+Keep the check and the record as two separate methods rather than a single combined `recordAndCheckDedupe`: `checkDedupe(jobName string) bool` (returns true if a non-expired entry exists, no mutation) and `recordDedupe(jobName string)` (inserts the entry with the current timestamp, evicts the oldest if at capacity). Both must be safe for concurrent callers via the same RWMutex.
+
+3e. **Logging.** Suppressed duplicate emits one log line: `glog.V(2).Infof("event=zombie_dedupe job=%s task=%s", jobName, task.TaskIdentifier)`. Use `V(2)` per the spec's logging gating constraint.
+
+3f. **Time source.** Use `p.currentDateTime.Now().Time()` for TTL math — NEVER `time.Now()` directly. The existing `currentDateTime` field already exists on `resultPublisher`.
+
+### 4. Update existing publisher unit tests
+
+In `task/executor/pkg/result_publisher_test.go`:
+
+4a. Replace the `Describe("PublishFailure", ...)` block (around line 166) with assertions that match the new shape. The test must:
+- Send `PublishFailure(ctx, task, "claude-20260418120000", "pod OOM killed")` once.
+- Assert `len(producer.messages) == 2` (the update + the increment).
+- Decode `producer.messages[0]` as `UpdateFrontmatterCommand` via the existing `decodeUpdateFrontmatterCommand` helper. Assert:
+ - `cmd.Updates` has EXACTLY ONE key, `"current_job"`, equal to `""`.
+ - `cmd.Updates` does NOT contain `"status"`, `"phase"`, `"assignee"`, `"previous_assignee"`, `"trigger_count"`.
+ - `cmd.Body` is non-nil, `cmd.Body.Heading == "## Failure"`, and `cmd.Body.Section` contains the timestamp, job name, and reason.
+- Decode `producer.messages[1]` as `IncrementFrontmatterCommand` via the existing `decodeIncrementFrontmatterCommand` helper at `task/executor/pkg/result_publisher_test.go:94` — do NOT add a duplicate. Assert `Field == "trigger_count"`, `Delta == 1`, `TaskIdentifier == "test-task-2"`.
+
+4b. Replace the `Describe("PublishTypeMismatchFailure", ...)` block (around line 216) with assertions that match the new shape:
+- After `PublishTypeMismatchFailure(ctx, task, "task_type ...")`, assert `len(producer.messages) == 1`.
+- Decode the single message as `UpdateFrontmatterCommand`. Assert `cmd.Updates` has EXACTLY THREE keys: `"assignee" == ""`, `"previous_assignee" == "agent-pr-reviewer"`, `"current_job" == ""`.
+- Assert `cmd.Updates` does NOT contain `"status"`, `"phase"`, `"trigger_count"`.
+- Assert `cmd.Body.Heading == "## Failure"` and `cmd.Body.Section` contains the reason verbatim and the prior assignee value.
+
+4c. Add a new `Describe("PublishFailure dedupe", ...)` block that:
+- Calls `PublishFailure` twice in succession with the same `jobName` (e.g. `"claude-20260418120000"`).
+- Asserts `len(producer.messages) == 2` after the FIRST call (one update + one increment).
+- Asserts `len(producer.messages) == 2` after the SECOND call (still 2 — second call was deduped, NO new messages).
+
+4d. Add a new test that confirms the doctrine-correct trigger_count emission for a single zombie call. Given a task with `max_triggers: 3` and `trigger_count: 0`, call `PublishFailure` once, then assert:
+- Exactly 2 messages sent (1 update + 1 increment).
+- The increment's `Delta == 1`.
+- No `assignee` key written.
+
+Note: this test does NOT need to simulate the controller's `applyTriggerCap` re-read; it asserts only the executor's emission shape. The controller-side cap behavior is covered by existing controller tests.
+
+### 5. Constraints on the rewrite
+
+- Do NOT change the `ResultPublisher` interface method signatures — only their behavior.
+- Do NOT remove or rename `PublishSpawnNotification`, `PublishIncrementTriggerCount`, or `PublishRaw` — they remain unchanged.
+- Update the GoDoc on `PublishFailure` and `PublishTypeMismatchFailure` to describe the new shapes. The current GoDoc (`PublishFailure publishes a partial frontmatter update setting status, phase, and current_job`) is now wrong; replace it with text matching the new behavior. For example:
+ ```go
+ // PublishFailure publishes a zombie failure: clears current_job and atomically
+ // bumps trigger_count by 1 via a paired IncrementFrontmatterCommand. Leaves
+ // phase, status, and assignee untouched so the existing trigger_count retry
+ // cap (applyTriggerCap in task/controller/pkg/result/result_writer.go) handles
+ // eventual operator-inbox escalation. Idempotent per current_job via a TTL'd
+ // LRU; concurrent classifications for the same job emit one event.
+ ```
+- All wraps use `github.com/bborbe/errors.Wrapf(ctx, err, ...)` — NEVER `fmt.Errorf` and NEVER bare `return err`.
+- Time math uses `p.currentDateTime.Now()` — NEVER `time.Now()`.
+- glog non-error lines use `V(2)` per the spec's logging gating constraint.
+
+### 6. Verify
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. The build will fail until the existing tests are updated (requirement 4) — that is expected and is part of this prompt.
+
+
+
+- `github.com/bborbe/errors.Wrapf(ctx, err, ...)` for wrapping; no `fmt.Errorf`; no bare `return err`.
+- `libtime.CurrentDateTimeGetter` for all time math; never `time.Now()` directly.
+- Ginkgo/Gomega + counterfeiter mocks for tests.
+- glog non-error logs gated with `V(n)`.
+- Do NOT add new go.mod dependencies; if `hashicorp/golang-lru/v2` is not already present, implement the small LRU inline.
+- Do NOT commit — dark-factory handles git.
+- Only touch files under `task/executor/pkg/`.
+- Verification command is `cd task/executor && make precommit` — never `make precommit` at repo root.
+
+
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. In particular:
+- `PublishFailure` produces 2 Kafka messages (update + increment) with the new shape, validated by the rewritten test.
+- `PublishTypeMismatchFailure` produces 1 Kafka message with the new shape, validated by the rewritten test.
+- A second call to `PublishFailure` with the same job name produces 0 additional Kafka messages and one `event=zombie_dedupe` log line.
+
diff --git a/prompts/completed/195-spec-043-pod-state-classifier.md b/prompts/completed/195-spec-043-pod-state-classifier.md
new file mode 100644
index 0000000..7a483e3
--- /dev/null
+++ b/prompts/completed/195-spec-043-pod-state-classifier.md
@@ -0,0 +1,312 @@
+---
+status: completed
+spec: [043-executor-zombie-job-detection]
+summary: Added ZombieReason enum, Pods informer to JobWatcher, and updated job failure classification to emit stable reason strings
+container: agent-zombie-detect-exec-195-spec-043-pod-state-classifier
+dark-factory-version: v0.173.0
+created: "2026-06-01T20:30:00Z"
+queued: "2026-06-01T20:11:58Z"
+started: "2026-06-01T20:16:41Z"
+completed: "2026-06-01T20:20:30Z"
+---
+
+
+- Introduces a closed reason enum used by every zombie / type-mismatch failure publish (image_pull_backoff, pod_evicted, pod_not_scheduled, pod_crash_no_stdout, deadline_exceeded, executor_watch_lost, type_mismatch).
+- Extends the existing job watcher to also watch Pods so the executor classifies failures the Job-condition path misses: ImagePullBackOff, evicted, crashed-before-stdout.
+- Each Pod-state failure emits exactly one event via the (already-doctrine-correct) `PublishFailure` from prompt 1.
+- Maps existing Job-condition reasons to the typed enum so the on-disk `## Failure` body section contains a stable grep-able string instead of arbitrary k8s messages.
+- `pod_not_scheduled` is deferred to the sweeper (prompt 4) because it needs a grace window the informer cannot evaluate.
+
+
+
+Make `task/executor/pkg/job_watcher.go` detect Pod-level failure conditions (ImagePullBackOff, evicted, crash-no-stdout) and emit them through `PublishFailure` with a stable reason string from a fixed enum. Also narrow the existing Job-condition path's reason string into the same enum.
+
+
+
+Read `CLAUDE.md` for project conventions.
+
+Spec: `specs/in-progress/043-executor-zombie-job-detection.md` (Desired Behavior 3, 8, 9; Acceptance Criterion 5; Failure Modes rows for ImagePullBackOff, evicted, crash-no-stdout).
+
+Files to read before changing:
+- `task/executor/pkg/job_watcher.go` — current implementation. `HandleJob` (line 98), `publishSyntheticFailure` (line 143), `isJobFailed` (line 178), `jobFailureReason` (line 196).
+- `task/executor/pkg/result_publisher.go` — `PublishFailure` is now retry-aware (per prompt 1). This prompt only calls it; do not modify the publisher.
+- `task/executor/pkg/task_store.go` — `TaskStore.Load`. The Pods informer looks up the owning task via the same `agent.benjamin-borbe.de/task-id` label that Jobs carry.
+- `task/executor/pkg/spawner/job_spawner.go:275` — `applyTaskIDLabel` sets `job.Spec.Template.Labels[taskIDLabelKey]`, which means Pods spawned by the Job inherit the label. The Pods informer uses the same label selector.
+- `task/executor/mocks/result_publisher.go` — `FakeResultPublisher` for tests.
+
+Coding plugin docs:
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-error-wrapping-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-glog-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-testing-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-enum-type-pattern.md`
+
+
+
+### 1. Add the reason enum
+
+Create `task/executor/pkg/zombie_reason.go`:
+
+```go
+// Copyright (c) 2026 Benjamin Borbe All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package pkg
+
+// ZombieReason is the closed set of machine-readable reason strings emitted in
+// the ## Failure body section. Operators grep on these values to triage.
+// Adding a new value requires updating this list and the documentation; renaming
+// or removing a value is a breaking change to the on-disk task body contract.
+type ZombieReason string
+
+const (
+ ZombieReasonImagePullBackOff ZombieReason = "image_pull_backoff"
+ ZombieReasonPodEvicted ZombieReason = "pod_evicted"
+ ZombieReasonDeadlineExceeded ZombieReason = "deadline_exceeded"
+ ZombieReasonPodNotScheduled ZombieReason = "pod_not_scheduled"
+ ZombieReasonPodCrashNoStdout ZombieReason = "pod_crash_no_stdout"
+ ZombieReasonExecutorWatchLost ZombieReason = "executor_watch_lost"
+ ZombieReasonTypeMismatch ZombieReason = "type_mismatch"
+)
+
+// String returns the reason as a string (for use with PublishFailure).
+func (r ZombieReason) String() string { return string(r) }
+```
+
+Add `task/executor/pkg/zombie_reason_test.go` (external test package `pkg_test`) that asserts each constant's string value is the spec's verbatim lower_snake string. This is the level-1 boundary test confirming the reason strings are stable wire contract.
+
+### 2. Narrow the Job-condition reason into the enum
+
+Rationale for bundling `BackoffLimitExceeded` under `ZombieReasonDeadlineExceeded`: both are k8s killing the pod for resource-policy reasons (activeDeadlineSeconds expiry vs. backoffLimit exhaustion); operators triaging see the same "killed by controller, not by app" semantics. If operators later need to distinguish, file a follow-up to split into a separate enum value.
+
+In `task/executor/pkg/job_watcher.go`, replace the existing `jobFailureReason(job *batchv1.Job) string` helper with:
+
+```go
+// jobFailureReason maps a failed Job's conditions to a ZombieReason. Returns
+// ZombieReasonDeadlineExceeded when any Failed condition has Reason
+// "DeadlineExceeded" or "BackoffLimitExceeded" (the Job controller failed the
+// Job for running past activeDeadlineSeconds or exhausting backoffLimit). Returns
+// ZombieReasonPodCrashNoStdout for any other Failed condition (the pod
+// terminated non-zero and no AgentResult was observed; the Job-condition
+// informer only fires AFTER terminal state, so absence of an AgentResult is
+// implicit at this point).
+func jobFailureReason(job *batchv1.Job) ZombieReason {
+ for _, c := range job.Status.Conditions {
+ if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
+ switch c.Reason {
+ case "DeadlineExceeded", "BackoffLimitExceeded":
+ return ZombieReasonDeadlineExceeded
+ }
+ }
+ }
+ return ZombieReasonPodCrashNoStdout
+}
+```
+
+Update `HandleJob` (line 98) to consume the new return type. No source change at L106 — the local variable's static type changes from `string` to `ZombieReason` because the helper return type changed. Update the `glog.V(2).Infof` log line at line 107 to use `reason` (a `ZombieReason` prints fine via `%s`).
+
+Update `handleTerminal` (line 128) and `publishSyntheticFailure` (line 143) signatures so the `reason` parameter is `ZombieReason` instead of `string`. Inside `publishSyntheticFailure`, the call `w.publisher.PublishFailure(ctx, task, job.Name, reason)` requires a string — change it to `w.publisher.PublishFailure(ctx, task, job.Name, reason.String())`.
+
+Note: `publishSyntheticFailure` retains its existing `taskStore.Delete(taskID)` call (current `job_watcher.go:155`). Only the Job-condition path (`HandleJob` → `publishSyntheticFailure`) owns the final TaskStore delete; `HandlePod` does NOT delete — see the inline comment in requirement 4 below.
+
+### 3. No changes to `result_publisher.go`
+
+The body format from prompt 1 (`1-spec-043-doctrine-publishers-and-dedupe.md`) already satisfies spec AC #4 — that prompt rewrites `PublishTypeMismatchFailure` end-to-end including adding `reason=type_mismatch` to the `## Failure` body. Drop any `result_publisher_test.go` assertions for type-mismatch body content from this prompt's scope; prompt 1 owns those tests.
+
+### 3. Add the Pods informer
+
+In `task/executor/pkg/job_watcher.go`, extend the `JobWatcher` interface so unit tests can drive the new Pod path directly without an informer:
+
+```go
+type JobWatcher interface {
+ Run(ctx context.Context) error
+ HandleJob(ctx context.Context, job *batchv1.Job)
+ HandlePod(ctx context.Context, pod *corev1.Pod)
+}
+```
+
+Regenerate the counterfeiter mock for `JobWatcher`: this is auto-handled by `make generate` invoked from `make precommit`. The `//counterfeiter:generate` directive at line 24 already targets the interface — the mock will pick up the new method.
+
+In `jobWatcher.Run`, after registering the existing Jobs informer event handler, add a Pods informer using the SAME `k8sinformers.SharedInformerFactoryWithOptions` factory already created at line 59 (same namespace, same label selector `agent.benjamin-borbe.de/task-id`, same 5-minute resync period). Pods inherit the task-id label from the Job's pod template (verified in `task/executor/pkg/spawner/job_spawner.go:applyTaskIDLabel`).
+
+```go
+podInformer := factory.Core().V1().Pods().Informer()
+_, err = podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
+ AddFunc: func(obj interface{}) {
+ pod, ok := obj.(*corev1.Pod)
+ if !ok {
+ return
+ }
+ w.HandlePod(ctx, pod)
+ },
+ UpdateFunc: func(_, newObj interface{}) {
+ pod, ok := newObj.(*corev1.Pod)
+ if !ok {
+ return
+ }
+ w.HandlePod(ctx, pod)
+ },
+})
+if err != nil {
+ return errors.Wrapf(ctx, err, "add pod informer event handler")
+}
+```
+
+The existing single `factory.Start(ctx.Done())` call covers both informers (the shared factory starts all informers registered against it). Extend the existing `cache.WaitForCacheSync` call so it waits for BOTH `informer.HasSynced` and `podInformer.HasSynced`:
+
+```go
+if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced, podInformer.HasSynced) {
+ return errors.Errorf(ctx, "timed out waiting for job/pod informer cache sync")
+}
+```
+
+### 4. Implement `HandlePod`
+
+Add to `task/executor/pkg/job_watcher.go`:
+
+```go
+func (w *jobWatcher) HandlePod(ctx context.Context, pod *corev1.Pod) {
+ taskIDStr, ok := pod.Labels["agent.benjamin-borbe.de/task-id"]
+ if !ok || taskIDStr == "" {
+ return
+ }
+ taskID := lib.TaskIdentifier(taskIDStr)
+
+ reason := classifyPodFailure(pod)
+ if reason == "" {
+ return
+ }
+
+ task, ok := w.taskStore.Load(taskID)
+ if !ok {
+ glog.V(3).Infof(
+ "pod %s/%s (task %s) in %s state but task not in store; sweeper will handle if still in flight",
+ pod.Namespace, pod.Name, taskID, reason,
+ )
+ return
+ }
+
+ jobName := ownerJobName(pod)
+ if jobName == "" {
+ glog.V(2).Infof(
+ "pod %s/%s (task %s) in %s state but has no Job ownerRef; ignoring",
+ pod.Namespace, pod.Name, taskID, reason,
+ )
+ return
+ }
+
+ if err := w.publisher.PublishFailure(ctx, task, jobName, reason.String()); err != nil {
+ glog.Errorf(
+ "publish pod-state failure for task %s (pod %s reason %s): %v",
+ taskID, pod.Name, reason, err,
+ )
+ return
+ }
+ glog.V(2).Infof(
+ "published pod-state failure for task %s (pod %s reason %s)",
+ taskID, pod.Name, reason,
+ )
+ // Do NOT call w.taskStore.Delete here. The pod may transition again (e.g. evicted then
+ // rescheduled). The Job-condition path or the deadline sweeper performs the final delete
+ // when terminal state is observed. Dedupe in PublishFailure (prompt 1) prevents
+ // double-publish for the same job name.
+}
+
+// classifyPodFailure returns a non-empty ZombieReason when the Pod is in a
+// terminal failure state we recognize. Returns "" for healthy, pending-without-
+// excessive-delay, and any state we should not act on from the informer path.
+// pod_not_scheduled is deliberately NOT returned here — it requires a grace
+// window the informer cannot evaluate (a freshly created Pod is always briefly
+// Pending before scheduling). The deadline sweeper (separate prompt) owns that
+// classification.
+func classifyPodFailure(pod *corev1.Pod) ZombieReason {
+ for _, cs := range pod.Status.ContainerStatuses {
+ if cs.State.Waiting != nil {
+ switch cs.State.Waiting.Reason {
+ case "ImagePullBackOff", "ErrImagePull":
+ return ZombieReasonImagePullBackOff
+ }
+ }
+ }
+ if pod.Status.Reason == "Evicted" {
+ return ZombieReasonPodEvicted
+ }
+ if pod.Status.Phase == corev1.PodFailed {
+ for _, cs := range pod.Status.ContainerStatuses {
+ if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
+ return ZombieReasonPodCrashNoStdout
+ }
+ }
+ }
+ return ""
+}
+
+// ownerJobName returns the name of the Job that owns the Pod, or "" when no
+// Job ownerRef is present.
+func ownerJobName(pod *corev1.Pod) string {
+ for _, ref := range pod.OwnerReferences {
+ if ref.Kind == "Job" {
+ return ref.Name
+ }
+ }
+ return ""
+}
+```
+
+### 5. Unit tests
+
+In `task/executor/pkg/job_watcher_test.go` (extend the existing file; add new `Describe` blocks):
+
+5a. **`Describe("HandlePod")` — table-driven Pod-state classifier.** Each entry constructs a `corev1.Pod` with the right state plus the `agent.benjamin-borbe.de/task-id` label and a Job ownerRef, seeds the `TaskStore` with a matching task, calls `HandlePod`, and asserts:
+- `FakeResultPublisher.PublishFailureCallCount() == 1`
+- The fourth argument (`reason string`) equals the expected `ZombieReason.String()`
+
+Rows:
+- ImagePullBackOff: `pod.Status.ContainerStatuses[0].State.Waiting = &corev1.ContainerStateWaiting{Reason: "ImagePullBackOff"}` → expect `"image_pull_backoff"`.
+- ErrImagePull: same with `Reason: "ErrImagePull"` → expect `"image_pull_backoff"` (same reason, different k8s string).
+- Evicted: `pod.Status.Reason = "Evicted"` → expect `"pod_evicted"`.
+- Crash: `pod.Status.Phase = corev1.PodFailed`, `pod.Status.ContainerStatuses[0].State.Terminated = &corev1.ContainerStateTerminated{ExitCode: 137}` → expect `"pod_crash_no_stdout"`.
+- Healthy Running pod: assert `PublishFailureCallCount() == 0`.
+
+5b. **`Describe("HandlePod no task in store")`** — Pod with label set but `TaskStore` empty; assert `PublishFailureCallCount() == 0` and no panic.
+
+5c. **`Describe("HandlePod no ownerRef")`** — Pod in ImagePullBackOff but with empty `OwnerReferences`; assert `PublishFailureCallCount() == 0`.
+
+5d. **`Describe("jobFailureReason mapping")`** — three rows: Failed condition `Reason: "DeadlineExceeded"` → `ZombieReasonDeadlineExceeded`; Failed condition `Reason: "BackoffLimitExceeded"` → `ZombieReasonDeadlineExceeded`; Failed condition `Reason: ""` → `ZombieReasonPodCrashNoStdout`.
+
+5e. **`Describe("HandleJob with DeadlineExceeded")`** — regression test: Job with condition `Reason: "DeadlineExceeded"` triggers `PublishFailure` with `reason == "deadline_exceeded"`. Use `FakeResultPublisher.PublishFailureArgsForCall(0)` to read back the third argument.
+
+All tests use Ginkgo/Gomega + the regenerated `FakeJobWatcher` and existing `FakeResultPublisher`. Construct the `jobWatcher` directly with a fake `kubernetes.Interface` (`fake.NewSimpleClientset()` from `k8s.io/client-go/kubernetes/fake`) so `HandlePod` and `HandleJob` can be driven without a real informer (the existing tests already use this pattern for `HandleJob`).
+
+### 6. Verify
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. The build will regenerate the counterfeiter mock for `JobWatcher` automatically (`go generate` triggered by precommit). If `make precommit` fails because the mock is out of date, run `make generate` once explicitly.
+
+
+
+- `github.com/bborbe/errors.Wrapf(ctx, err, ...)` for wrapping; no `fmt.Errorf`; no bare `return err`.
+- Ginkgo/Gomega + counterfeiter mocks for tests.
+- glog non-error logs gated with `V(n)` (use `V(2)` for the standard success path, `V(3)` for defensive skip-noise paths).
+- Do NOT modify `result_publisher.go` — the doctrine work (including the type-mismatch body format) landed in prompt 1.
+- Do NOT introduce the deadline sweeper or CRD knobs here — prompts 3 and 4.
+- Do NOT commit — dark-factory handles git.
+- Verification command is `cd task/executor && make precommit`.
+
+
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. Specifically:
+- `ZombieReason` constants exist with the seven values from spec DB #8.
+- Pod with `Status.ContainerStatuses[].State.Waiting.Reason == "ImagePullBackOff"` triggers one `PublishFailure` call with reason `"image_pull_backoff"`.
+- Pod with `Status.Reason == "Evicted"` triggers one `PublishFailure` call with reason `"pod_evicted"`.
+- Pod with `Status.Phase == PodFailed` and a non-zero terminated container triggers one `PublishFailure` call with reason `"pod_crash_no_stdout"`.
+- Job condition `Reason == "DeadlineExceeded"` triggers `PublishFailure` with reason `"deadline_exceeded"`.
+- Job condition `Reason == "BackoffLimitExceeded"` triggers `PublishFailure` with reason `"deadline_exceeded"`.
+
diff --git a/prompts/completed/196-spec-043-crd-knobs-and-active-deadline.md b/prompts/completed/196-spec-043-crd-knobs-and-active-deadline.md
new file mode 100644
index 0000000..617f5a6
--- /dev/null
+++ b/prompts/completed/196-spec-043-crd-knobs-and-active-deadline.md
@@ -0,0 +1,264 @@
+---
+status: completed
+spec: [043-executor-zombie-job-detection]
+summary: Added zombieSweeperIntervalSeconds and zombieJobTimeoutSeconds CRD fields with admission validation floors, wired zombieJobTimeoutSeconds through AgentConfiguration to Job.Spec.ActiveDeadlineSeconds on every spawned Job
+container: agent-zombie-detect-exec-196-spec-043-crd-knobs-and-active-deadline
+dark-factory-version: v0.173.0
+created: "2026-06-01T20:30:00Z"
+queued: "2026-06-01T20:11:58Z"
+started: "2026-06-01T20:20:31Z"
+completed: "2026-06-01T20:30:56Z"
+---
+
+
+- Adds two optional fields to the AgentConfig CRD: `zombieSweeperIntervalSeconds` (default 60, floor 10) and `zombieJobTimeoutSeconds` (default 1800, floor 30).
+- Admission rejects values below the floor with the spec's verbatim error messages so malicious or careless manifests cannot weaponize the sweeper or the per-Job deadline.
+- Propagates `zombieJobTimeoutSeconds` through the executor's `AgentConfiguration` mirror struct.
+- Stamps `Spec.ActiveDeadlineSeconds` on every Job spawned by `JobSpawner.SpawnJob`, sourced from the resolved `zombieJobTimeoutSeconds`.
+- After this prompt, k8s itself will kill a Pod that runs past its deadline and the existing Job-condition path (with the typed `deadline_exceeded` reason from prompt 2) will fire.
+
+
+
+Add the two CRD knobs with admission validation, wire `zombieJobTimeoutSeconds` end-to-end into the spawner so every Job carries `Spec.ActiveDeadlineSeconds`, and ship the defaults so unset manifests still work.
+
+
+
+Read `CLAUDE.md` for project conventions.
+
+Spec: `specs/in-progress/043-executor-zombie-job-detection.md` (Desired Behavior 5, 6; Acceptance Criteria 7, 8; Security / Abuse Cases for the floor rationale; Assumptions for the idea-note housekeeping).
+
+Files to read before changing:
+- `task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types.go` — `ConfigSpec` struct (line 44), `ConfigSpec.Validate` (line 146), `validateTrigger`, `validateTaskTypeValue`, `validateTaskTypesList`. New fields and their validation go here. The two new fields are pointer-typed (`*int32`) so unset (nil) is distinguishable from explicit zero.
+- `task/executor/k8s/apis/agent.benjamin-borbe.de/v1/zz_generated.deepcopy.go` — auto-generated; regenerated by `make generate` invoked from `make precommit`. Do not hand-edit.
+- `task/executor/pkg/agent_configuration.go` — `AgentConfiguration` (line 12) is the executor's per-agent mirror of `ConfigSpec`. Add a `ZombieJobTimeoutSeconds *int32` field here so the spawner can read it without taking a direct dependency on the CRD types package.
+- `task/executor/pkg/spawner/job_spawner.go` — `SpawnJob` (line 74), `jobBuilder` build sequence (lines 115-126). After `jobBuilder.SetTTLSecondsAfterFinished(jobTTLSecondsAfterFinished)` the Job is built and then post-edited (lines 128-133) for label/secret/ephemeral fields. `Spec.ActiveDeadlineSeconds` is a `*int64` field on `batchv1.JobSpec`; the simplest path is to set it on the built `job` struct AFTER `jobBuilder.Build(ctx)` returns but BEFORE `kubeClient.Create`.
+- `task/executor/pkg/agent_configuration_test.go` — pattern for `AgentConfiguration` tests.
+- The site that constructs `AgentConfiguration` from `ConfigSpec` — search with `grep -rn "AgentConfiguration{" task/executor/pkg/ | grep -v _test.go | grep -v mocks`. Each construction site must copy the new field across.
+
+Coding plugin docs:
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-kubernetes-crd-controller-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-error-wrapping-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-testing-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-validation-framework-guide.md`
+
+
+
+### 1. Add CRD constants and fields
+
+In `task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types.go`, declare package-level constants ABOVE the existing `taskTypePattern` var:
+
+```go
+// Defaults and validation floors for the zombie-detection knobs.
+// Floors prevent thrash (sweeper) and pathological short-deadline kills (timeout).
+const (
+ DefaultZombieSweeperIntervalSeconds int32 = 60
+ MinZombieSweeperIntervalSeconds int32 = 10
+ DefaultZombieJobTimeoutSeconds int32 = 1800
+ MinZombieJobTimeoutSeconds int32 = 30
+)
+```
+
+Add two optional fields to `ConfigSpec` (after the existing `Trigger *Trigger` field, before the closing brace at line 71):
+
+```go
+ // ZombieSweeperIntervalSeconds is how often the executor's deadline sweeper
+ // walks the TaskStore looking for zombie jobs. Optional; when nil, the executor
+ // uses DefaultZombieSweeperIntervalSeconds (60). Values below
+ // MinZombieSweeperIntervalSeconds (10) are rejected at admission to prevent
+ // sweeper thrash. Pointer-typed so "unset" is distinguishable from "0".
+ ZombieSweeperIntervalSeconds *int32 `json:"zombieSweeperIntervalSeconds,omitempty"`
+
+ // ZombieJobTimeoutSeconds is the deadline applied to every spawned Job (via
+ // Job.Spec.ActiveDeadlineSeconds) AND the elapsed-time threshold the sweeper
+ // uses when classifying zombies. Optional; when nil, the executor uses
+ // DefaultZombieJobTimeoutSeconds (1800 — 30 minutes). Values below
+ // MinZombieJobTimeoutSeconds (30) are rejected at admission to prevent
+ // pathological short-deadline kills. Pointer-typed so "unset" is
+ // distinguishable from "0".
+ ZombieJobTimeoutSeconds *int32 `json:"zombieJobTimeoutSeconds,omitempty"`
+```
+
+### 2. Extend `ConfigSpec.Equal` and admission validation
+
+In `ConfigSpec.Equal` (`task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types.go:130-143`), replace the current last conjunct `reflect.DeepEqual(s.Trigger, o.Trigger)` with `reflect.DeepEqual(s.Trigger, o.Trigger) &&` and append two new conjuncts (use `reflect.DeepEqual` for pointer-int32 equality, consistent with the existing `s.Trigger` comparison):
+
+```go
+ reflect.DeepEqual(s.ZombieSweeperIntervalSeconds, o.ZombieSweeperIntervalSeconds) &&
+ reflect.DeepEqual(s.ZombieJobTimeoutSeconds, o.ZombieJobTimeoutSeconds)
+```
+
+In `ConfigSpec.Validate` (line 146), AFTER the existing `validateTaskTypesList(ctx, s.TaskTypes)` call site but BEFORE the function returns, add validation for the two new fields. Because `Validate` currently returns `validateTaskTypesList(...)` directly, restructure to:
+
+```go
+ if err := validateTaskTypesList(ctx, s.TaskTypes); err != nil {
+ return err
+ }
+ if err := validateZombieSweeperInterval(ctx, s.ZombieSweeperIntervalSeconds); err != nil {
+ return err
+ }
+ return validateZombieJobTimeout(ctx, s.ZombieJobTimeoutSeconds)
+```
+
+Add the two helper functions in the same file:
+
+```go
+func validateZombieSweeperInterval(ctx context.Context, v *int32) error {
+ if v == nil {
+ return nil
+ }
+ if *v < MinZombieSweeperIntervalSeconds {
+ return errors.Wrapf(
+ ctx,
+ validation.Error,
+ "zombieSweeperIntervalSeconds invalid: must be >= %d",
+ MinZombieSweeperIntervalSeconds,
+ )
+ }
+ return nil
+}
+
+func validateZombieJobTimeout(ctx context.Context, v *int32) error {
+ if v == nil {
+ return nil
+ }
+ if *v < MinZombieJobTimeoutSeconds {
+ return errors.Wrapf(
+ ctx,
+ validation.Error,
+ "zombieJobTimeoutSeconds invalid: must be >= %d",
+ MinZombieJobTimeoutSeconds,
+ )
+ }
+ return nil
+}
+```
+
+The error-message contract from the spec is exactly `invalid: must be >= 30` / `invalid: must be >= 10`. The formatted strings above produce `zombieJobTimeoutSeconds invalid: must be >= 30` and `zombieSweeperIntervalSeconds invalid: must be >= 10`, each of which contains the spec's substring. The acceptance tests assert on the substring.
+
+### 3. Update CRD type tests
+
+In `task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types_test.go`, add Ginkgo `It` blocks for:
+
+3a. **Validate accepts nil fields** — a `Config` with all other required fields set and both zombie fields nil → `Validate(ctx)` returns nil.
+
+3b. **Validate accepts valid values** — `ZombieSweeperIntervalSeconds: ptrInt32(10)`, `ZombieJobTimeoutSeconds: ptrInt32(30)` → `Validate(ctx)` returns nil. Helper `ptrInt32(v int32) *int32 { return &v }` defined locally in the test.
+
+3c. **Validate rejects sweeper interval below floor** — `ZombieSweeperIntervalSeconds: ptrInt32(9)` → `Validate(ctx)` returns an error whose `.Error()` contains the substring `invalid: must be >= 10`.
+
+3d. **Validate rejects job timeout below floor** — `ZombieJobTimeoutSeconds: ptrInt32(29)` → `Validate(ctx)` returns an error whose `.Error()` contains the substring `invalid: must be >= 30`.
+
+3e. **Equal handles pointer fields** — two `ConfigSpec` instances with `ZombieJobTimeoutSeconds: ptrInt32(1800)` compare equal; `ptrInt32(1800)` vs `ptrInt32(900)` compare unequal; nil vs `ptrInt32(1800)` compare unequal.
+
+### 4. Extend executor's `AgentConfiguration` mirror
+
+In `task/executor/pkg/agent_configuration.go`, add a new field to `AgentConfiguration` (line 12):
+
+```go
+ // ZombieJobTimeoutSeconds mirrors ConfigSpec.ZombieJobTimeoutSeconds. The
+ // spawner stamps this value onto Job.Spec.ActiveDeadlineSeconds; the sweeper
+ // uses it as the elapsed-time threshold. nil means "use the default
+ // DefaultZombieJobTimeoutSeconds from the CRD types package".
+ ZombieJobTimeoutSeconds *int32
+```
+
+(The sweeper interval is per-executor, not per-agent, so it stays out of this struct; prompt 4 reads it from a different path. Only `ZombieJobTimeoutSeconds` belongs in this per-agent mirror.)
+
+In `AgentConfigurations.TaggedConfigurations` (line 64), include the new field in the per-element copy:
+
+```go
+ZombieJobTimeoutSeconds: c.ZombieJobTimeoutSeconds,
+```
+
+Find all other construction sites of `AgentConfiguration{}` in non-test code and add the field. Run:
+
+```
+grep -rn "AgentConfiguration{" task/executor/pkg/ | grep -v _test.go | grep -v mocks/
+```
+
+For each result, confirm the new field is copied across (if it should be — some sites may legitimately leave it zero-valued, but the conversion from CRD `ConfigSpec` MUST copy it). Edit `task/executor/pkg/config_resolver.go` `convert()` (around line 66) to append `ZombieJobTimeoutSeconds: obj.Spec.ZombieJobTimeoutSeconds,` to the returned struct literal.
+
+### 5. Add `EffectiveZombieJobTimeoutSeconds` helper
+
+In `task/executor/pkg/agent_configuration.go`, add a method that resolves the effective value (returns the default when unset):
+
+```go
+import (
+ agentv1 "github.com/bborbe/agent/task/executor/k8s/apis/agent.benjamin-borbe.de/v1"
+)
+
+// EffectiveZombieJobTimeoutSeconds returns the effective deadline in seconds:
+// the configured value when non-nil, else agentv1.DefaultZombieJobTimeoutSeconds.
+func (a AgentConfiguration) EffectiveZombieJobTimeoutSeconds() int32 {
+ if a.ZombieJobTimeoutSeconds != nil {
+ return *a.ZombieJobTimeoutSeconds
+ }
+ return agentv1.DefaultZombieJobTimeoutSeconds
+}
+```
+
+Unit test in `task/executor/pkg/agent_configuration_test.go`:
+- `AgentConfiguration{}` → `EffectiveZombieJobTimeoutSeconds() == 1800`
+- `AgentConfiguration{ZombieJobTimeoutSeconds: ptrInt32(900)}` → returns `900`
+- Extend the existing `TaggedConfigurations` test to assert `ZombieJobTimeoutSeconds` survives the tag operation.
+
+### 6. Stamp `ActiveDeadlineSeconds` on the Job
+
+In `task/executor/pkg/spawner/job_spawner.go`, inside `SpawnJob` (line 74), AFTER the `jobBuilder.Build(ctx)` call (line 123) and AFTER the existing `applyTaskIDLabel`/`applySecretEnvFrom`/`applyEphemeralStorage`/`PriorityClassName` patches (lines 128-133), stamp the deadline onto the built Job:
+
+```go
+deadline := int64(config.EffectiveZombieJobTimeoutSeconds())
+job.Spec.ActiveDeadlineSeconds = &deadline
+```
+
+`Spec.ActiveDeadlineSeconds` is a `*int64` field on `batchv1.JobSpec` (verify in the vendored `k8s.io/api/batch/v1/types.go` if unsure — it is the standard kubernetes API). The two-line pattern (intermediate `deadline` variable, then take its address) mirrors how `pointer.Int64Ptr` would do it without adding a new import.
+
+Add a `glog.V(2).Infof("set activeDeadlineSeconds=%d on job %s for task %s", deadline, jobName, task.TaskIdentifier)` line after the assignment.
+
+### 7. Test the stamp
+
+In `task/executor/pkg/spawner/job_spawner_test.go`, add a Ginkgo `It` block under the existing `SpawnJob` describe:
+
+7a. **`It("stamps ActiveDeadlineSeconds from config")`** — construct a minimal `AgentConfiguration` with `ZombieJobTimeoutSeconds: ptrInt32(900)`, call `SpawnJob`, then read the created Job back from the fake clientset via `kubeClient.BatchV1().Jobs(ns).Get(ctx, jobName, metav1.GetOptions{})` and assert `job.Spec.ActiveDeadlineSeconds != nil && *job.Spec.ActiveDeadlineSeconds == 900`.
+
+7b. **`It("uses the default ActiveDeadlineSeconds when config is unset")`** — `AgentConfiguration` with `ZombieJobTimeoutSeconds: nil`; assert `*job.Spec.ActiveDeadlineSeconds == 1800`.
+
+Use the existing test fixture pattern (`fake.NewSimpleClientset()` and the existing `currentDateTimeGetter` helper). The existing `SpawnJob` test in the same file is the template.
+
+### 8. Verify
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. The build will:
+- Regenerate `zz_generated.deepcopy.go` for the CRD types (the two new pointer fields need deepcopy handling). Verify the regenerated deepcopy contains the pointer-deepcopy pattern: `if in.ZombieSweeperIntervalSeconds != nil { in, out := &in.ZombieSweeperIntervalSeconds, &out.ZombieSweeperIntervalSeconds; *out = new(int32); **out = **in }` (or the equivalent generator-emitted shape).
+- Pass the existing `ConfigSpec.Validate` tests (no regressions on the existing fields).
+- Pass the new tests added above.
+
+If `make generate` is a separate target and `make precommit` does not invoke it automatically, run `cd task/executor && make generate && make precommit`.
+
+
+
+- `github.com/bborbe/errors.Wrapf(ctx, err, ...)` for wrapping.
+- `libtime.CurrentDateTimeGetter` for any time math (this prompt's spawner edit does not need it, but the existing `currentDateTimeGetter` injection on `jobSpawner` remains untouched).
+- Ginkgo/Gomega + counterfeiter mocks for tests.
+- glog non-error logs gated with `V(n)`.
+- Do NOT add a new RBAC verb — the executor already has `get/list/watch` on Pods and Jobs per spec 009. ActiveDeadlineSeconds is a Job spec field, not a new permission.
+- Do NOT introduce the deadline sweeper goroutine here — prompt 4. This prompt only adds the CRD fields, validation, the per-agent mirror field, and the Job spec stamp.
+- Do NOT commit — dark-factory handles git.
+- Verification command is `cd task/executor && make precommit`.
+- The error-message contract is exact: validation rejection messages must contain `invalid: must be >= 30` (for `zombieJobTimeoutSeconds`) and `invalid: must be >= 10` (for `zombieSweeperIntervalSeconds`). Tests assert on these substrings.
+
+
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. Specifically:
+- `ConfigSpec{ZombieJobTimeoutSeconds: ptrInt32(29)}.Validate(ctx)` returns an error containing `invalid: must be >= 30`.
+- `ConfigSpec{ZombieSweeperIntervalSeconds: ptrInt32(9)}.Validate(ctx)` returns an error containing `invalid: must be >= 10`.
+- `AgentConfiguration{}.EffectiveZombieJobTimeoutSeconds() == 1800`.
+- A `SpawnJob` call with `ZombieJobTimeoutSeconds: ptrInt32(900)` creates a Job whose `Spec.ActiveDeadlineSeconds` is non-nil and points to `900`.
+- A `SpawnJob` call with `ZombieJobTimeoutSeconds: nil` creates a Job whose `Spec.ActiveDeadlineSeconds` is non-nil and points to `1800`.
+
diff --git a/prompts/completed/197-spec-043-deadline-sweeper.md b/prompts/completed/197-spec-043-deadline-sweeper.md
new file mode 100644
index 0000000..584f8c7
--- /dev/null
+++ b/prompts/completed/197-spec-043-deadline-sweeper.md
@@ -0,0 +1,450 @@
+---
+status: completed
+spec: [043-executor-zombie-job-detection]
+summary: Add deadline sweeper goroutine (ZombieSweeper) that classifies zombie tasks and publishes failures, wired into executor service.Run lifecycle
+container: agent-zombie-detect-exec-197-spec-043-deadline-sweeper
+dark-factory-version: v0.173.0
+created: "2026-06-01T20:30:00Z"
+queued: "2026-06-01T20:11:58Z"
+started: "2026-06-01T20:30:57Z"
+completed: "2026-06-01T20:40:50Z"
+---
+
+
+- Adds a background goroutine that periodically walks the in-memory TaskStore looking for zombie jobs the k8s-native and Pods-informer paths missed.
+- Classifies a task as zombie iff `elapsed > deadline AND pod not Running AND no recent heartbeat`, emitting `deadline_exceeded`, `executor_watch_lost`, or `pod_not_scheduled` accordingly.
+- Wires the sweeper interval and timeout from the AgentConfig CRD knobs added in prompt 3.
+- Plugs the sweeper into the executor's existing `service.Run` lifecycle alongside the consumer, deferred-respawn loop, and Job/Pod informer.
+- After this prompt, persistent zombies surface within bounded time (`max_triggers × zombieJobTimeoutSeconds`) via the existing `applyTriggerCap` chokepoint.
+
+
+
+Add a deadline sweeper goroutine that classifies and publishes zombie failures for tasks whose Jobs are past their deadline with no recent heartbeat and no healthy Pod, and wire it into the executor's `service.Run` lifecycle.
+
+
+
+Read `CLAUDE.md` for project conventions.
+
+Spec: `specs/in-progress/043-executor-zombie-job-detection.md` (Desired Behavior 4, 7; Acceptance Criteria 5, 6, 7; Failure Modes rows for `Pod unschedulable beyond grace`, `executor_watch_lost`, `Sweeper fires after Job-condition informer already fired`).
+
+Files to read before changing:
+- `task/executor/pkg/task_store.go` — `TaskStore.Snapshot` (line 50) returns a shallow copy safe for read-only iteration; this is the sweeper's iteration source.
+- `task/executor/pkg/job_watcher.go` (as updated by prompt 2) — the existing informer-driven path. The sweeper is a safety net, NOT a replacement.
+- `task/executor/pkg/result_publisher.go` (as updated by prompt 1) — `PublishFailure` is dedupe-protected; the sweeper can safely fire even when the informer already fired.
+- `task/executor/pkg/agent_configuration.go` (as updated by prompt 3): `EffectiveZombieJobTimeoutSeconds`. Note: the SWEEPER INTERVAL is not on `AgentConfiguration`; it is a single executor-wide value sourced from the CRD. Read it from `ConfigSpec` via a "first non-nil wins" rule across all watched configs (see `resolveSweeperInterval` in requirement 3 below), with the default applied when nothing is set.
+- `task/executor/pkg/handler/task_event_handler.go` — `RunDeferredRespawnLoop` (line 558) is the existing template for "periodic goroutine plugged into `service.Run`". Mirror its shape (ticker + select on `ctx.Done()`).
+- `task/executor/main.go` — `application.Run` (line 54). The sweeper's `Run(ctx)` method gets added to the existing `service.Run(...)` argument list at line 121.
+- `task/executor/pkg/factory/factory.go` — add a `CreateZombieSweeper` constructor here, matching the style of `CreateJobWatcher` (line 31).
+- `task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types.go` (as updated by prompt 3) — `DefaultZombieSweeperIntervalSeconds`, `DefaultZombieJobTimeoutSeconds`, `ConfigSpec.ZombieSweeperIntervalSeconds`.
+- `task/executor/pkg/event_handler_config.go` — `EventHandlerConfig` is the in-memory store of all watched Config CRs (type alias `k8s.EventHandler[agentv1.Config]` from `github.com/bborbe/k8s`). The sweeper reads the sweeper interval from here via the existing `Provider[T].Get(ctx) ([]T, error)` method — there is no `Configs()` accessor and one CANNOT be added (it is a third-party generic alias). See `task/executor/pkg/probe/probe.go:110` for the existing usage pattern: `configs, err := r.configProvider.Get(ctx)`.
+- `task/executor/pkg/job_watcher.go` (as updated by prompt 2) — prompt 2 introduces a Pod informer and exposes a Pod lister (e.g. `corev1listers.PodLister` from the shared informer factory). The sweeper REUSES that lister instead of issuing per-tick LIST calls to the API server.
+
+Coding plugin docs:
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-concurrency-patterns.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-time-injection.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-error-wrapping-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-glog-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-testing-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-factory-pattern.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-mocking-guide.md`
+
+
+
+### 1. Define the `ZombieSweeper` interface
+
+Create `task/executor/pkg/zombie_sweeper.go`:
+
+```go
+// Copyright (c) 2026 Benjamin Borbe All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package pkg
+
+import (
+ "context"
+ "time"
+
+ "github.com/bborbe/errors"
+ libk8s "github.com/bborbe/k8s"
+ libtime "github.com/bborbe/time"
+ "github.com/golang/glog"
+ corev1 "k8s.io/api/core/v1"
+ "k8s.io/apimachinery/pkg/labels"
+ corev1listers "k8s.io/client-go/listers/core/v1"
+
+ lib "github.com/bborbe/agent/lib"
+ agentv1 "github.com/bborbe/agent/task/executor/k8s/apis/agent.benjamin-borbe.de/v1"
+)
+
+//counterfeiter:generate -o ../mocks/zombie_sweeper.go --fake-name FakeZombieSweeper . ZombieSweeper
+
+// ZombieSweeper is a background goroutine that periodically classifies stuck
+// tasks as zombies and emits failure events. It is the safety net for the
+// informer-driven paths in JobWatcher (which handle the cases k8s notifies us
+// about). The sweeper handles: pods unschedulable beyond a grace window,
+// executor restart losing watch on a Job, and any deadline path the informer
+// misses (Job-condition deferred indefinitely, informer cache drift).
+type ZombieSweeper interface {
+ // Run blocks until ctx is cancelled. Each tick (interval sourced from the
+ // first non-nil ConfigSpec.ZombieSweeperIntervalSeconds across the resolver's
+ // configs, else DefaultZombieSweeperIntervalSeconds) it calls SweepOnce.
+ Run(ctx context.Context) error
+ // SweepOnce performs a single sweep pass. Exposed for unit tests so they
+ // do not have to manage tickers. Returns an error only when the config
+ // provider lookup fails; per-task classification and publish errors are logged.
+ SweepOnce(ctx context.Context) error
+}
+```
+
+Note the `counterfeiter:generate` directive — `make precommit` will regenerate the fake.
+
+### 2. Define the constructor and impl
+
+In the same file:
+
+```go
+// NewZombieSweeper creates a ZombieSweeper.
+func NewZombieSweeper(
+ podLister corev1listers.PodLister,
+ namespace libk8s.Namespace,
+ taskStore *TaskStore,
+ publisher ResultPublisher,
+ configProvider EventHandlerConfig,
+ currentDateTime libtime.CurrentDateTimeGetter,
+) ZombieSweeper {
+ return &zombieSweeper{
+ podLister: podLister,
+ namespace: namespace,
+ taskStore: taskStore,
+ publisher: publisher,
+ configProvider: configProvider,
+ currentDateTime: currentDateTime,
+ }
+}
+
+type zombieSweeper struct {
+ podLister corev1listers.PodLister
+ namespace libk8s.Namespace
+ taskStore *TaskStore
+ publisher ResultPublisher
+ configProvider EventHandlerConfig
+ currentDateTime libtime.CurrentDateTimeGetter
+}
+```
+
+`EventHandlerConfig` is the type alias `k8s.EventHandler[agentv1.Config]` defined in `task/executor/pkg/event_handler_config.go`. It exposes `Get(ctx context.Context) ([]agentv1.Config, error)` via the embedded `Provider[T]` interface — use that. Do NOT invent or add a `Configs()` method: `EventHandlerConfig` is a generic alias from `github.com/bborbe/k8s` and methods cannot be added to it from this package.
+
+### 3. Implement the sweep predicate
+
+In the same file, add the classification logic:
+
+```go
+const (
+ // podNotScheduledGraceWindow is the age threshold past which a Pending Pod
+ // with PodScheduled=False is classified pod_not_scheduled. Must exceed
+ // typical scheduler latency comfortably; 2 minutes is empirically generous.
+ podNotScheduledGraceWindow = 2 * time.Minute
+)
+
+// NOTE on "no recent heartbeat" from spec DB #9 / AC #6:
+// The spec predicate is `elapsed > deadline AND pod not Running AND no recent
+// heartbeat`. This codebase has NO separate heartbeat channel today — the only
+// liveness signal for a running job is "is a Pod currently Running?". Therefore
+// "no recent heartbeat" is implemented as "no Pod in PodRunning phase observed
+// for this task". If a per-job heartbeat is added later (a follow-up spec),
+// this predicate gets a real check; for now `classify` treats `pod not Running`
+// as covering both halves of the conjunction.
+
+func (s *zombieSweeper) Run(ctx context.Context) error {
+ // Resolve interval once per Run by fetching configs; reusing the same
+ // interval across ticks is acceptable — the executor pod is short-lived
+ // relative to CRD reconfiguration cycles.
+ interval, err := s.resolveSweeperInterval(ctx)
+ if err != nil {
+ return errors.Wrapf(ctx, err, "resolve sweeper interval")
+ }
+ ticker := time.NewTicker(interval)
+ defer ticker.Stop()
+ glog.V(2).Infof("zombie sweeper started interval=%s", interval)
+ for {
+ select {
+ case <-ctx.Done():
+ return nil
+ case <-ticker.C:
+ if err := s.SweepOnce(ctx); err != nil {
+ // Per-tick failures (transient lister errors, ctx-scoped
+ // failures from publisher) must NOT kill the sweeper goroutine
+ // — that would tear down the executor via service.Run. Log and
+ // continue.
+ glog.Errorf("zombie sweeper tick: %v", err)
+ }
+ }
+ }
+}
+
+func (s *zombieSweeper) SweepOnce(ctx context.Context) error {
+ snapshot := s.taskStore.Snapshot()
+ now := s.currentDateTime.Now().Time()
+ // Fetch configs ONCE per tick — used by taskDeadline() for every task in
+ // the snapshot. Avoids N calls into the provider per sweep.
+ cfgs, err := s.configProvider.Get(ctx)
+ if err != nil {
+ return errors.Wrapf(ctx, err, "list configs")
+ }
+ for taskID, task := range snapshot {
+ jobName := task.Frontmatter.CurrentJob()
+ if jobName == "" {
+ // No active job recorded; nothing to sweep.
+ continue
+ }
+ jobStartedAt, err := task.Frontmatter.JobStartedAt()
+ if err != nil || jobStartedAt.IsZero() {
+ glog.V(3).Infof(
+ "zombie sweeper: task %s job_started_at unparseable or zero; skipping",
+ taskID,
+ )
+ continue
+ }
+ deadline := s.taskDeadline(task, cfgs)
+ elapsed := now.Sub(jobStartedAt)
+ if elapsed <= deadline {
+ continue
+ }
+ reason := s.classify(taskID, task, jobName, jobStartedAt, now)
+ if reason == "" {
+ continue
+ }
+ if err := s.publisher.PublishFailure(ctx, task, jobName, reason.String()); err != nil {
+ glog.Errorf(
+ "zombie sweeper: publish failure for task %s (job %s reason %s): %v",
+ taskID, jobName, reason, err,
+ )
+ continue
+ }
+ glog.V(2).Infof(
+ "zombie sweeper: published failure for task %s (job %s reason %s elapsed=%s)",
+ taskID, jobName, reason, elapsed,
+ )
+ }
+ return nil
+}
+```
+
+`taskDeadline` resolves the per-task deadline against the configs fetched once per tick by `SweepOnce`. `resolveSweeperInterval` does the same for the sweeper interval at `Run` startup:
+
+```go
+func (s *zombieSweeper) taskDeadline(task lib.Task, cfgs []agentv1.Config) time.Duration {
+ assignee := task.Frontmatter.Assignee().String()
+ for _, cfg := range cfgs {
+ if cfg.Spec.Assignee == assignee && cfg.Spec.ZombieJobTimeoutSeconds != nil {
+ return time.Duration(*cfg.Spec.ZombieJobTimeoutSeconds) * time.Second
+ }
+ }
+ return time.Duration(agentv1.DefaultZombieJobTimeoutSeconds) * time.Second
+}
+
+func (s *zombieSweeper) resolveSweeperInterval(ctx context.Context) (time.Duration, error) {
+ cfgs, err := s.configProvider.Get(ctx)
+ if err != nil {
+ return 0, errors.Wrapf(ctx, err, "list configs")
+ }
+ for _, cfg := range cfgs {
+ if cfg.Spec.ZombieSweeperIntervalSeconds != nil {
+ return time.Duration(*cfg.Spec.ZombieSweeperIntervalSeconds) * time.Second, nil
+ }
+ }
+ return time.Duration(agentv1.DefaultZombieSweeperIntervalSeconds) * time.Second, nil
+}
+```
+
+The "first non-nil wins" semantics is acceptable because the sweeper is a single executor-wide goroutine — there is no per-agent interval. The default is the documented behavior when no Config sets the field.
+
+### 4. Implement `classify`
+
+```go
+// classify determines whether a past-deadline task is a zombie and which
+// reason applies. Returns "" when the task is NOT a zombie (Pod still Running
+// — implicit heartbeat). Inspects Pod state via the shared Pod informer's
+// lister (introduced by prompt 2). Spec Failure-Mode row "k8s API rate-limit
+// (429)" mandates: "Sweeper relies on informer cache (no per-cycle list)" —
+// we MUST NOT issue API LIST calls here.
+func (s *zombieSweeper) classify(
+ taskID lib.TaskIdentifier,
+ task lib.Task,
+ jobName string,
+ jobStartedAt time.Time,
+ now time.Time,
+) ZombieReason {
+ selector := labels.SelectorFromSet(labels.Set{
+ "agent.benjamin-borbe.de/task-id": string(taskID),
+ })
+ pods, err := s.podLister.Pods(s.namespace.String()).List(selector)
+ if err != nil {
+ glog.Errorf("zombie sweeper: list pods for task %s: %v", taskID, err)
+ return ""
+ }
+ // Zero pods AND past-deadline AND a Job was supposed to be running →
+ // executor lost the watch (Job exists in k8s but Pod GC happened, or the
+ // Job never created a Pod and was restarted across executor lifetimes).
+ // "No recent heartbeat" reduces to "no Pod observed" since this codebase
+ // has no separate heartbeat channel.
+ if len(pods) == 0 {
+ return ZombieReasonExecutorWatchLost
+ }
+ for _, pod := range pods {
+ // Healthy Running — NOT a zombie. A Running pod is the implicit
+ // heartbeat in the current system (no separate heartbeat channel).
+ if pod.Status.Phase == corev1.PodRunning {
+ return ""
+ }
+ // Pending past the unschedulable grace window with PodScheduled=False.
+ if pod.Status.Phase == corev1.PodPending {
+ age := now.Sub(pod.CreationTimestamp.Time)
+ if age > podNotScheduledGraceWindow && hasPodScheduledFalse(pod) {
+ return ZombieReasonPodNotScheduled
+ }
+ }
+ }
+ // Past deadline, no Running pod, no specific Pod-state reason — fall
+ // back to deadline_exceeded.
+ return ZombieReasonDeadlineExceeded
+}
+
+// hasPodScheduledFalse returns true when the Pod has a PodScheduled=False
+// condition (kube-scheduler could not place the pod).
+func hasPodScheduledFalse(pod *corev1.Pod) bool {
+ for _, c := range pod.Status.Conditions {
+ if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse {
+ return true
+ }
+ }
+ return false
+}
+```
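+
+The decision order of `classify` can be exercised without Kubernetes types. The sketch below assumes a hypothetical `podView` struct in place of `corev1.Pod` and an illustrative two-minute grace window (the real `podNotScheduledGraceWindow` value is defined elsewhere); the reason strings are the ones used in this prompt.
+
+```go
+package main
+
+import (
+	"fmt"
+	"time"
+)
+
+// podView is a hypothetical, simplified view of the Pod fields that the
+// classifier inspects; the real code reads them from corev1.Pod.
+type podView struct {
+	Phase         string        // "Running", "Pending", "Failed", ...
+	Age           time.Duration // now minus creation timestamp
+	Unschedulable bool          // PodScheduled condition is False
+}
+
+// graceWindow is an assumed value for illustration only.
+const graceWindow = 2 * time.Minute
+
+// classifyPods mirrors the decision order of classify for a task that is
+// already past its deadline. An empty result means "not a zombie".
+func classifyPods(pods []podView) string {
+	if len(pods) == 0 {
+		return "executor_watch_lost"
+	}
+	for _, p := range pods {
+		if p.Phase == "Running" {
+			return ""
+		}
+		if p.Phase == "Pending" && p.Age > graceWindow && p.Unschedulable {
+			return "pod_not_scheduled"
+		}
+	}
+	return "deadline_exceeded"
+}
+
+func main() {
+	fmt.Printf("%q\n", classifyPods(nil))
+	fmt.Printf("%q\n", classifyPods([]podView{{Phase: "Running"}}))
+	fmt.Printf("%q\n", classifyPods([]podView{{Phase: "Pending", Age: 5 * time.Minute, Unschedulable: true}}))
+	fmt.Printf("%q\n", classifyPods([]podView{{Phase: "Failed"}}))
+}
+```
+
+As in `classify`, the loop returns at the first Pod that matches a rule, so with several Pods per task the outcome depends on iteration order.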
+
+### 5. Wire into the factory and main
+
+In `task/executor/pkg/factory/factory.go`, add:
+
+```go
+// CreateZombieSweeper creates a deadline sweeper that classifies stuck tasks as
+// zombies and emits failure events via the publisher. Interval and per-task
+// deadline are sourced from the AgentConfig CRD knobs (see ConfigSpec). The
+// podLister parameter is the shared Pod informer's lister introduced by
+// prompt 2 — the sweeper reuses it (no per-cycle API LIST).
+func CreateZombieSweeper(
+ podLister corev1listers.PodLister,
+ namespace libk8s.Namespace,
+ taskStore *pkg.TaskStore,
+ publisher pkg.ResultPublisher,
+ configProvider pkg.EventHandlerConfig,
+ currentDateTime libtime.CurrentDateTimeGetter,
+) pkg.ZombieSweeper {
+ return pkg.NewZombieSweeper(
+ podLister,
+ namespace,
+ taskStore,
+ publisher,
+ configProvider,
+ currentDateTime,
+ )
+}
+```
+
+In `task/executor/main.go`, inside `application.Run` (line 54), AFTER the existing `jobWatcher := factory.CreateJobWatcher(...)` line (line 96), add (where `podLister` is the lister exposed by the Pod informer from prompt 2 — fetch it from `jobWatcher` or the shared informer factory wired in prompt 2; the exact accessor depends on prompt 2's API):
+
+```go
+zombieSweeper := factory.CreateZombieSweeper(
+ jobWatcher.PodLister(), // or whatever accessor prompt 2 exposes
+ a.Namespace,
+ taskStore,
+ resultPublisher,
+ eventHandlerConfig,
+ currentDateTimeGetter,
+)
+```
+
+In the `service.Run(...)` call (line 121), add `zombieSweeper.Run,` as a new argument alongside `jobWatcher.Run` and `taskEventHandler.RunDeferredRespawnLoop`. The argument list becomes:
+
+```go
+return service.Run(
+ ctx,
+ func(ctx context.Context) error {
+ return connector.Listen(ctx, a.Namespace, resourceEventHandler)
+ },
+ consumer.Consume,
+ taskEventHandler.RunDeferredRespawnLoop,
+ jobWatcher.Run,
+ zombieSweeper.Run,
+ a.createHTTPServer(eventHandlerConfig, healthcheckRunner),
+ healthcheckCron.Run,
+)
+```
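+
+For orientation, a function passed to `service.Run` has the shape `func(ctx context.Context) error`. The sketch below shows one plausible shape for the sweeper's `Run`: a ticker loop that calls a sweep function until the context is cancelled. It is an assumption for illustration (including the choice to log and continue after a failed sweep); `runSweeper` is a hypothetical name, and the actual `Run` is specified earlier in this prompt.
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"time"
+)
+
+// runSweeper calls sweepOnce on every tick until ctx is cancelled.
+// A failed sweep is reported and the loop continues with the next tick
+// (an assumed policy, chosen so one bad tick does not stop the sweeper).
+func runSweeper(ctx context.Context, interval time.Duration, sweepOnce func(context.Context) error) error {
+	ticker := time.NewTicker(interval)
+	defer ticker.Stop()
+	for {
+		select {
+		case <-ctx.Done():
+			return nil
+		case <-ticker.C:
+			if err := sweepOnce(ctx); err != nil {
+				fmt.Println("sweep failed:", err)
+			}
+		}
+	}
+}
+
+func main() {
+	ctx, cancel := context.WithTimeout(context.Background(), 120*time.Millisecond)
+	defer cancel()
+	sweeps := 0
+	_ = runSweeper(ctx, 20*time.Millisecond, func(context.Context) error {
+		sweeps++
+		return nil
+	})
+	fmt.Println("sweeps completed:", sweeps)
+}
+```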
+
+Sibling entry-point check: this codebase has a single `task/executor/main.go` — verified by `find ~/Documents/workspaces/agent-zombie-detect/task/executor -name "main.go"`. No `cmd/run-once/`, no other binaries. The factory's `CreateJobWatcher` and the new `CreateZombieSweeper` are called from one site each.
+
+### 6. Unit tests for `SweepOnce`
+
+Create `task/executor/pkg/zombie_sweeper_test.go`. Use Ginkgo/Gomega + `FakeResultPublisher`. For the Pod lister, build a fresh shared informer factory off `fake.NewSimpleClientset()` and use `factory.Core().V1().Pods().Lister()` — seed the informer's indexer with test pods via `factory.Core().V1().Pods().Informer().GetIndexer().Add(pod)` so the lister returns them deterministically without running the informer goroutine.
+
+Time helper: use `libtime.NewCurrentDateTime()` + `currentDateTime.SetNow(libtimetest.ParseDateTime(...))` — that is the established pattern in this codebase (see `task/executor/pkg/result_publisher_test.go:122-123` and `task/executor/pkg/handler/task_event_handler_test.go:58, 1211`). Import `libtimetest "github.com/bborbe/time/test"`.
+
+For `EventHandlerConfig`: use the real `pkg.NewEventHandlerConfig()` impl. It is a `k8s.EventHandler[agentv1.Config]` whose `Get(ctx)` reads the in-memory state — seed it via its existing add/upsert event handler interface (read `event_handler_config.go` and the `github.com/bborbe/k8s` `EventHandler` upstream contract to see how `OnAdd`/`Upsert` is invoked; the probe and resource_event_handler tests already do this — copy the shortest pattern you find).
+
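+Test table: cover at least cells 6a through 6d, which Acceptance Criterion 6 requires; cells 6e through 6h cover the remaining classifier branch and the default/override resolution: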
+
+6a. **deadline-exceeded-and-not-running → zombie** — TaskStore has one task with `current_job: "j1"`, `job_started_at: now-30min`, `assignee: "a"`. Config has `zombieJobTimeoutSeconds: 60`. Pod lister has one Pod in `Status.Phase: PodFailed`. → `PublishFailureCallCount() == 1` with reason `"deadline_exceeded"`.
+
+6b. **deadline-exceeded-but-running → NOT zombie** — same as 6a but the Pod is in `Status.Phase: PodRunning`. → `PublishFailureCallCount() == 0`.
+
+6c. **under-deadline → NOT zombie** — TaskStore has a task with `job_started_at: now-30sec`, deadline 60s (elapsed strictly less than deadline). → `PublishFailureCallCount() == 0`.
+
+6d. **watch-lost → `executor_watch_lost`** — TaskStore has a task with `job_started_at: now-30min`, but the Pod lister's indexer is empty (zero pods matching the task-id label selector). → `PublishFailureCallCount() == 1` with reason `"executor_watch_lost"`.
+
+6e. **pod_not_scheduled** — TaskStore has a task past deadline; Pod is `Status.Phase: PodPending` with `CreationTimestamp: now-5min` and condition `PodScheduled=False`. → `PublishFailureCallCount() == 1` with reason `"pod_not_scheduled"`.
+
+6f. **interval default** — `resolveSweeperInterval(ctx)` returns `60*time.Second` when no Config sets `ZombieSweeperIntervalSeconds`.
+
+6g. **interval override** — `resolveSweeperInterval(ctx)` returns `15*time.Second` when a Config sets `ZombieSweeperIntervalSeconds: ptrInt32(15)`.
+
+6h. **deadline default** — `taskDeadline(task, cfgs)` returns `1800*time.Second` when `cfgs` is empty or no Config matches the assignee with `ZombieJobTimeoutSeconds` set.
+
+Construct Pods directly via `&corev1.Pod{ObjectMeta: ..., Status: corev1.PodStatus{...}}` and set the label `agent.benjamin-borbe.de/task-id` to the task's identifier in `ObjectMeta.Labels` so the lister's selector matches.
+
+### 7. Verify
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. The counterfeiter mock for `ZombieSweeper` is regenerated by `make generate` (transitively invoked by precommit).
+
+
+
+- `github.com/bborbe/errors.Wrapf(ctx, err, ...)` for wrapping; no `fmt.Errorf`; no bare `return err`.
+- `libtime.CurrentDateTimeGetter` for all time math; NEVER `time.Now()` directly anywhere in the sweeper or its tests. Use `s.currentDateTime.Now().Time()` for `time.Time` values.
+- Ginkgo/Gomega + counterfeiter mocks for tests.
+- glog non-error logs gated with `V(2)` (success) or `V(3)` (noisy skips).
+- `service.Run` for goroutine lifecycle — the sweeper's `Run(ctx)` is added as one more argument to the existing `service.Run` call in `main.go`.
+- Do NOT change the controller-side `applyTriggerCap` chokepoint — the sweeper's role is to make sure `trigger_count` increments fire so the chokepoint can eventually run.
+- Do NOT introduce a Prometheus metric in this prompt. The spec Non-goals list "expanded observability metrics" as out-of-scope; the spec specifies log lines (e.g. `event=zombie_dedupe`) for observability, not new counters. The existing `metrics.TaskEventsTotal` counter family is NOT extended.
+- Do NOT commit — dark-factory handles git.
+- Verification command is `cd task/executor && make precommit`.
+- The four-cell test table in requirement 6 (6a-6d) is a strict acceptance requirement: fewer than four cells fails AC #6. AC #5 (the Pod-state classifier) belongs to prompt 2 and is not affected.
+
+
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. Specifically:
+- `SweepOnce` classifies a past-deadline task with a Failed Pod as zombie and publishes one failure with reason `"deadline_exceeded"`.
+- `SweepOnce` classifies a past-deadline task with NO Pods (empty lister indexer) as `executor_watch_lost` and publishes one failure.
+- `SweepOnce` skips a past-deadline task whose Pod is `PodRunning`.
+- `SweepOnce` skips an under-deadline task.
+- `SweepOnce` classifies a past-deadline Pending Pod older than the grace window with PodScheduled=False as `pod_not_scheduled`.
+- `resolveSweeperInterval` returns `60s` by default and `15s` when configured.
+- `main.go` constructs and passes `zombieSweeper.Run` to `service.Run`.
+
diff --git a/prompts/completed/198-spec-043-envtest-imagepullbackoff.md b/prompts/completed/198-spec-043-envtest-imagepullbackoff.md
new file mode 100644
index 0000000..fd8c6b1
--- /dev/null
+++ b/prompts/completed/198-spec-043-envtest-imagepullbackoff.md
@@ -0,0 +1,347 @@
+---
+status: completed
+spec: [043-executor-zombie-job-detection]
+container: agent-zombie-detect-exec-198-spec-043-envtest-imagepullbackoff
+dark-factory-version: v0.173.0
+created: "2026-06-01T20:30:00Z"
+queued: "2026-06-01T20:11:58Z"
+started: "2026-06-01T20:40:51Z"
+completed: "2026-06-01T21:08:31Z"
+---
+
+
+- Adds a single envtest that exercises the Pods informer wiring against a real in-process kube-apiserver (sigs.k8s.io/controller-runtime/pkg/envtest).
+- Test creates a Pod whose container has an obviously-bogus image reference, then forces the Pod into an ImagePullBackOff-equivalent status via a Status subresource update (envtest does not run a kubelet, so we simulate the status the informer would observe in a real cluster).
+- Verifies the executor's `JobWatcher.HandlePod` (driven by the real informer) emits exactly one `PublishFailure` with reason `"image_pull_backoff"` within `2 × zombieSweeperIntervalSeconds`.
+- Introduces `sigs.k8s.io/controller-runtime/pkg/envtest` as a test-only dependency on the executor module.
+- After this prompt, the round-trip from "Pod transitions to ImagePullBackOff in the API server" to "executor emits PublishFailure with the right reason" is covered against real informer machinery, not a hand-rolled mock.
+
+
+
+Prove against a real (in-process) Kubernetes API server that the executor's Pods informer correctly classifies an ImagePullBackOff Pod and emits one `PublishFailure` with reason `"image_pull_backoff"` within `2 × zombieSweeperIntervalSeconds` of the Pod entering that state.
+
+
+
+Read `CLAUDE.md` for project conventions.
+
+Spec: `specs/in-progress/043-executor-zombie-job-detection.md` (Acceptance Criterion 9; Scenario coverage note that explicitly limits scenarios; the rest of the manual verification is for `agent-dev` post-deploy and is NOT in scope for this prompt).
+
+Files to read before changing:
+- `task/executor/pkg/job_watcher.go` (as updated by prompt 2) — `JobWatcher.Run` starts both Jobs and Pods informers via a shared factory; the envtest drives `Run` against a real apiserver.
+- `task/executor/pkg/result_publisher.go` — but the test does NOT use the real publisher; it injects a `FakeResultPublisher` from `task/executor/mocks/result_publisher.go` so the test asserts on `PublishFailureCallCount()` / `PublishFailureArgsForCall(0)`.
+- `task/executor/pkg/task_store.go` — the envtest seeds one task in the store so the watcher's lookup succeeds.
+- `task/executor/go.mod` — the executor module currently depends on `k8s.io/client-go v0.36.1`. `sigs.k8s.io/controller-runtime/pkg/envtest` is compatible with that line of client-go; pick a controller-runtime version aligned with the client-go major (e.g. `v0.21.x` for client-go 0.36 — check the controller-runtime release notes for the exact pairing).
+- `task/executor/Makefile` — `make precommit` is the verification entrypoint. Envtest binaries (etcd, kube-apiserver) come from `setup-envtest`; the Makefile may need a new target `envtest-setup` that downloads them, OR the test can use `setup-envtest` programmatically. Read the Makefile first; if `setup-envtest` is not yet wired, add the minimal lines (see requirement 5).
+
+Coding plugin docs:
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-kubernetes-crd-controller-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-testing-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-error-wrapping-guide.md`
+- `/home/node/.claude/plugins/marketplaces/coding/docs/go-mod-dependency-fix-guide.md`
+
+
+
+### 1. Add the envtest dependency
+
+In `task/executor/go.mod`, add `sigs.k8s.io/controller-runtime` (test-only: it is imported solely from `_test.go` files in the executor module). Pin the version to the line compatible with `k8s.io/client-go v0.36.1`. Controller-runtime `v0.21.x` is the conventional pairing; if `v0.21.x` does not exist when this prompt is executed, use the latest `v0.MINOR.x` whose go.mod requires `k8s.io/client-go v0.36.x`. To verify the pairing without guessing, run:
+
+```bash
+cd task/executor
+go get -t sigs.k8s.io/controller-runtime@latest
+go mod tidy
+```
+
+If the resulting `go.mod` ends up with a client-go bump that conflicts with the rest of the workspace's pinned `v0.36.1`, instead pin the controller-runtime version explicitly:
+
+```bash
+go get -t sigs.k8s.io/controller-runtime@v0.21.0
+```
+
+Adjust until `go mod tidy && go build ./...` succeeds. Document the chosen version in a comment in the new test file's import block.
+
+**After `go mod tidy`, verify `grep 'k8s.io/client-go' go.mod` still shows `v0.36.1`** — if `go mod tidy` bumped it (controller-runtime's go.mod typically requires a recent client-go), downgrade `controller-runtime` until client-go is stable at `v0.36.1`. Iterate: `go get -t sigs.k8s.io/controller-runtime@v0.21.` and re-tidy until the pin holds.
+
+**Verify `make precommit` exits 0 end-to-end after the dep add** — including `vulncheck` / `osv-scanner` / `trivy` against the expanded transitive closure (controller-tools, klog v2, etc.). If any of these scanners report new findings, fix them in this prompt; do NOT defer.
+
+### 2. Set up envtest binaries
+
+The envtest framework needs `etcd` and `kube-apiserver` binaries on disk. The conventional source is the `setup-envtest` tool. In `task/executor/Makefile`, add (if not already present) a target:
+
+```makefile
+ENVTEST_K8S_VERSION ?= 1.31.0
+ENVTEST_DIR := $(shell go env GOPATH)/pkg/envtest
+
+.PHONY: envtest-setup
+envtest-setup:
+	@command -v setup-envtest >/dev/null 2>&1 || go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
+	@setup-envtest use $(ENVTEST_K8S_VERSION) -p path > /dev/null
+```
+
+Add `envtest-setup` as a dependency of the existing `test` (or `precommit`) target so `make precommit` automatically downloads the binaries on first run. The exact wiring depends on the existing Makefile shape — preserve existing targets and add the new dependency in a backward-compatible way (e.g. `test: envtest-setup` if the existing `test` target is `test:`).
+
+**Do NOT export `KUBEBUILDER_ASSETS` at file scope.** A file-scope `export KUBEBUILDER_ASSETS := $(shell setup-envtest use ...)` is evaluated on every Make invocation before any target runs — including `envtest-setup` itself — so on a clean machine (where `setup-envtest` is not yet installed) the `$(shell ...)` resolves to an empty string and is captured for the lifetime of the Make run.
+
+Instead, set `KUBEBUILDER_ASSETS` as a **recipe-line prefix inside the test target** so it evaluates after `envtest-setup` has run:
+
+```makefile
+.PHONY: test-envtest
+test-envtest: envtest-setup
+	ENVTEST_REQUIRED=1 KUBEBUILDER_ASSETS=$$(setup-envtest use $(ENVTEST_K8S_VERSION) -p path) \
+		go test -tags=envtest ./pkg/...
+```
+
+The `$$` escapes for Make so the shell expands `setup-envtest` at recipe execution time.
+
+### 3. Add the envtest
+
+Create `task/executor/pkg/job_watcher_envtest_test.go`:
+
+```go
+// Copyright (c) 2026 Benjamin Borbe All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+//go:build envtest
+
+package pkg_test
+
+import (
+ "context"
+ "testing"
+ "time"
+
+ libk8s "github.com/bborbe/k8s"
+ . "github.com/onsi/ginkgo/v2"
+ . "github.com/onsi/gomega"
+ corev1 "k8s.io/api/core/v1"
+ metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+ "k8s.io/client-go/kubernetes"
+ "k8s.io/client-go/rest"
+ "sigs.k8s.io/controller-runtime/pkg/envtest"
+
+ lib "github.com/bborbe/agent/lib"
+ pkg "github.com/bborbe/agent/task/executor/pkg"
+ mocks "github.com/bborbe/agent/task/executor/mocks"
+)
+```
+
+The `//go:build envtest` build tag isolates the heavy test from the default `go test ./...` run; `make precommit` invokes `go test -tags=envtest ./pkg/...` to include it. This is the conventional pattern for envtest in projects that also want fast unit tests.
+
+Test body:
+
+```go
+func TestEnvtest(t *testing.T) {
+ RegisterFailHandler(Fail)
+ RunSpecs(t, "executor envtest suite")
+}
+
+var _ = Describe("JobWatcher (envtest)", func() {
+ var (
+ testEnv *envtest.Environment
+ cfg *rest.Config
+ kubeClient kubernetes.Interface
+ ctx context.Context
+ cancel context.CancelFunc
+ )
+
+ BeforeEach(func() {
+ testEnv = &envtest.Environment{}
+ var err error
+ cfg, err = testEnv.Start()
+ Expect(err).NotTo(HaveOccurred())
+ kubeClient, err = kubernetes.NewForConfig(cfg)
+ Expect(err).NotTo(HaveOccurred())
+ ctx, cancel = context.WithCancel(context.Background())
+ })
+
+ AfterEach(func() {
+ cancel()
+ Expect(testEnv.Stop()).To(Succeed())
+ })
+
+ It("classifies ImagePullBackOff and publishes one failure within the bound", func() {
+ ns := "default"
+ taskID := lib.TaskIdentifier("envtest-task-1")
+ jobName := "envtest-job-1"
+ publisher := &mocks.FakeResultPublisher{}
+ store := pkg.NewTaskStore()
+ store.Store(taskID, lib.Task{
+ TaskIdentifier: taskID,
+ Frontmatter: lib.TaskFrontmatter{
+ "current_job": jobName,
+ "assignee": "envtest-agent",
+ },
+ })
+ watcher := pkg.NewJobWatcher(kubeClient, libk8s.Namespace(ns), store, publisher)
+
+ // Start the watcher in a goroutine; cancel via ctx in AfterEach.
+ runErrCh := make(chan error, 1)
+ go func() { runErrCh <- watcher.Run(ctx) }()
+
+ // Create a Pod with the task-id label and a bogus image. envtest does
+ // not run a kubelet, so we will inject the ImagePullBackOff status
+ // ourselves via the Status subresource below; the informer sees the
+ // status update the same way it would in a real cluster.
+ pod := &corev1.Pod{
+ ObjectMeta: metav1.ObjectMeta{
+ Name: "envtest-pod-1",
+ Namespace: ns,
+ Labels: map[string]string{
+ "agent.benjamin-borbe.de/task-id": string(taskID),
+ },
+ OwnerReferences: []metav1.OwnerReference{
+ {APIVersion: "batch/v1", Kind: "Job", Name: jobName, UID: "fake-job-uid"},
+ },
+ },
+ Spec: corev1.PodSpec{
+ RestartPolicy: corev1.RestartPolicyNever,
+ Containers: []corev1.Container{
+ {Name: "agent", Image: "docker.example.com/does-not-exist:envtest"},
+ },
+ },
+ }
+ _, err := kubeClient.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
+ Expect(err).NotTo(HaveOccurred(), "if Create returns 422, add any other missing required-by-validator defaults; do not silently catch")
+
+ // Status subresource update flow — 4 steps to avoid ResourceVersion races
+ // and default-mutator overwrites:
+ // 1. Get the canonical Pod (fresh ResourceVersion).
+ // 2. Mutate Status on the fetched object.
+ // 3. UpdateStatus with the fetched object.
+ // 4. Get again to confirm the status survived (no mutator clobbered it).
+ // Step 1: Get
+ fetched, err := kubeClient.CoreV1().Pods(ns).Get(ctx, "envtest-pod-1", metav1.GetOptions{})
+ Expect(err).NotTo(HaveOccurred())
+ // Step 2: mutate Status on the freshly-fetched object
+ fetched.Status.Phase = corev1.PodPending
+ fetched.Status.ContainerStatuses = []corev1.ContainerStatus{
+ {
+ Name: "agent",
+ State: corev1.ContainerState{
+ Waiting: &corev1.ContainerStateWaiting{
+ Reason: "ImagePullBackOff",
+ Message: "Back-off pulling image",
+ },
+ },
+ },
+ }
+ // Step 3: UpdateStatus
+ _, err = kubeClient.CoreV1().Pods(ns).UpdateStatus(ctx, fetched, metav1.UpdateOptions{})
+ Expect(err).NotTo(HaveOccurred())
+ // Step 4: Get to confirm the status survived (no default mutator reset the
+ // container statuses or dropped the Waiting state mid-update).
+ confirmed, err := kubeClient.CoreV1().Pods(ns).Get(ctx, "envtest-pod-1", metav1.GetOptions{})
+ Expect(err).NotTo(HaveOccurred())
+ Expect(confirmed.Status.ContainerStatuses).To(HaveLen(1))
+ Expect(confirmed.Status.ContainerStatuses[0].State.Waiting).NotTo(BeNil())
+ Expect(confirmed.Status.ContainerStatuses[0].State.Waiting.Reason).To(Equal("ImagePullBackOff"))
+
+ // Acceptance bound: 2 * zombieSweeperIntervalSeconds = 2 * 60s = 120s.
+ // In practice the informer reacts in well under a second once the
+ // status update lands; we use a generous wait with polling to stay
+ // well inside the bound while keeping the test fast.
+ Eventually(publisher.PublishFailureCallCount, 30*time.Second, 100*time.Millisecond).
+ Should(Equal(1), "expected one PublishFailure call within bound")
+
+ // Confirm "exactly one" — Eventually passes at the FIRST observation of 1;
+ // Consistently verifies no second call lands over a short follow-up window.
+ Consistently(publisher.PublishFailureCallCount, 2*time.Second, 200*time.Millisecond).
+ Should(Equal(1), "expected exactly one PublishFailure call (no duplicates)")
+
+ _, _, gotJobName, gotReason := publisher.PublishFailureArgsForCall(0)
+ Expect(gotJobName).To(Equal(jobName))
+ Expect(gotReason).To(Equal(string(pkg.ZombieReasonImagePullBackOff)))
+ })
+})
+```
+
+Verify the signature of `FakeResultPublisher.PublishFailureArgsForCall` against the regenerated mock. The counterfeiter mock returns positional arguments matching the interface — for `PublishFailure(ctx context.Context, task lib.Task, jobName string, reason string) error` the returns are `(context.Context, lib.Task, string, string)`. The test reads positional element 2 (jobName) and element 3 (reason). Adjust if the regenerated mock differs.
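+
+The positional-return convention can be illustrated with a hand-written fake. This is a sketch of the recording pattern only, not counterfeiter's generated output: `task` is a hypothetical stand-in for `lib.Task`, and the real mock in `task/executor/mocks` remains the source of truth for the signature.
+
+```go
+package main
+
+import (
+	"context"
+	"fmt"
+	"sync"
+)
+
+// task is a hypothetical stand-in for lib.Task.
+type task struct{ ID string }
+
+// publishFailureArgs records the arguments of one PublishFailure call.
+type publishFailureArgs struct {
+	ctx     context.Context
+	task    task
+	jobName string
+	reason  string
+}
+
+// fakePublisher is a hand-written sketch of the recording pattern that
+// generated fakes follow; it is not generated code.
+type fakePublisher struct {
+	mu    sync.Mutex
+	calls []publishFailureArgs
+}
+
+// PublishFailure records its arguments and always succeeds.
+func (f *fakePublisher) PublishFailure(ctx context.Context, t task, jobName string, reason string) error {
+	f.mu.Lock()
+	defer f.mu.Unlock()
+	f.calls = append(f.calls, publishFailureArgs{ctx: ctx, task: t, jobName: jobName, reason: reason})
+	return nil
+}
+
+// PublishFailureCallCount returns how many calls were recorded.
+func (f *fakePublisher) PublishFailureCallCount() int {
+	f.mu.Lock()
+	defer f.mu.Unlock()
+	return len(f.calls)
+}
+
+// PublishFailureArgsForCall returns the arguments of call i in the same
+// order as the method declares them.
+func (f *fakePublisher) PublishFailureArgsForCall(i int) (context.Context, task, string, string) {
+	f.mu.Lock()
+	defer f.mu.Unlock()
+	c := f.calls[i]
+	return c.ctx, c.task, c.jobName, c.reason
+}
+
+func main() {
+	f := &fakePublisher{}
+	_ = f.PublishFailure(context.Background(), task{ID: "envtest-task-1"}, "envtest-job-1", "image_pull_backoff")
+	_, _, jobName, reason := f.PublishFailureArgsForCall(0)
+	fmt.Println(f.PublishFailureCallCount(), jobName, reason)
+}
+```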
+
+### 4. Wire envtest into precommit
+
+Confirm the executor's existing `make test` (or wherever `go test` runs from inside `make precommit`) uses the `envtest` build tag when envtest binaries are present. Two acceptable shapes:
+
+4a. Add a new line in the Makefile's test target:
+```makefile
+test: envtest-setup
+	go test ./...
+	ENVTEST_REQUIRED=1 KUBEBUILDER_ASSETS=$$(setup-envtest use $(ENVTEST_K8S_VERSION) -p path) \
+		go test -tags=envtest ./pkg/...
+```
+
+4b. Or a dedicated `make test-envtest` target that `precommit` invokes (preferred — matches section 2's recipe):
+```makefile
+.PHONY: test-envtest
+test-envtest: envtest-setup
+	ENVTEST_REQUIRED=1 KUBEBUILDER_ASSETS=$$(setup-envtest use $(ENVTEST_K8S_VERSION) -p path) \
+		go test -tags=envtest ./pkg/...
+
+precommit: ... test-envtest
+```
+
+Pick the shape closest to the existing Makefile's pattern. Both forms satisfy AC #9 ("envtest passes (exit code 0)") and both set `ENVTEST_REQUIRED=1` so a missing `KUBEBUILDER_ASSETS` becomes a `Fail` instead of a silent skip (see section 5).
+
+### 5. Test environment skip-when-unavailable (with required-mode gate)
+
+The envtest binaries are not present on every developer workstation (only in CI / `make precommit`). Add a `BeforeSuite` that skips for interactive use but **fails when invoked from `make precommit`** (so a misconfigured precommit cannot silently exit 0 with zero envtest coverage):
+
+```go
+var _ = BeforeSuite(func() {
+ if os.Getenv("KUBEBUILDER_ASSETS") == "" {
+ if os.Getenv("ENVTEST_REQUIRED") == "1" {
+ Fail("KUBEBUILDER_ASSETS not set but ENVTEST_REQUIRED=1; envtest binaries must be available under precommit")
+ }
+ Skip("KUBEBUILDER_ASSETS not set; run via `make test-envtest` or `make precommit`")
+ }
+})
+```
+
+Add `os` to the imports.
+
+The Makefile's `test-envtest` recipe (shown in section 2 above) sets `ENVTEST_REQUIRED=1` before invoking `go test`, so the skip becomes a `Fail` whenever the suite runs under Make. Interactive `go test -tags=envtest ./...` without `KUBEBUILDER_ASSETS` set still skips cleanly (no `ENVTEST_REQUIRED` in the env).
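+
+The gate reduces to a three-way decision over two environment variables. The sketch below restates it as a pure function so each combination can be checked directly; `decide` and the `suiteAction` names are hypothetical and exist only for this illustration.
+
+```go
+package main
+
+import "fmt"
+
+// suiteAction names what the suite does for a given environment.
+type suiteAction string
+
+const (
+	actionRun  suiteAction = "run"
+	actionSkip suiteAction = "skip"
+	actionFail suiteAction = "fail"
+)
+
+// decide maps the values of KUBEBUILDER_ASSETS and ENVTEST_REQUIRED to the
+// suite's behavior: run when assets are present, fail when they are
+// required but missing, skip otherwise.
+func decide(kubebuilderAssets, envtestRequired string) suiteAction {
+	if kubebuilderAssets != "" {
+		return actionRun
+	}
+	if envtestRequired == "1" {
+		return actionFail
+	}
+	return actionSkip
+}
+
+func main() {
+	fmt.Println(decide("/path/to/assets", "1")) // under make, assets present
+	fmt.Println(decide("", "1"))                // under make, assets missing
+	fmt.Println(decide("", ""))                 // interactive run without assets
+}
+```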
+
+### 6. Verify
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. On a clean machine the first run downloads the envtest binaries (`setup-envtest` cache lives in `~/.local/share/kubebuilder-envtest/`); subsequent runs reuse the cache. If the download fails (network unavailable), `make precommit` MUST still exit non-zero — the test is not optional once wired into precommit.
+
+### 7. Non-requirements (explicit out-of-scope)
+
+- This prompt does NOT add envtests for the deadline sweeper, the doctrine publisher, or the type-mismatch path. Those are adequately covered by the unit tests from prompts 1, 2, and 4; the spec's Scenario coverage section pins envtest scope to exactly the ImagePullBackOff path.
+- This prompt does NOT introduce a `setup-envtest` binary check that opportunistically skips when assets are absent under `make precommit` — that would defeat AC #9.
+- This prompt does NOT modify the JobWatcher implementation; it only verifies the behavior from prompt 2 against real informer wiring.
+
+### 8. Edge case to confirm
+
+`JobWatcher.Run` starts informers via a `SharedInformerFactoryWithOptions`. envtest provides a real apiserver, so the informer connects normally. The single subtlety: `WaitForCacheSync` must complete before the test creates the Pod (otherwise the AddFunc event handler may not fire for objects created during the initial list+watch). The test relies on the watcher's `Run` blocking on `<-ctx.Done()` AFTER `WaitForCacheSync` returns; ensure the `Eventually` poll begins AFTER the Pod's status update lands (the Create+UpdateStatus sequence is sequential, so by the time `UpdateStatus` returns the watch is already established). If flakes appear, add a `time.Sleep(500*time.Millisecond)` between starting the watcher goroutine and creating the Pod, but only if necessary; the synchronous `UpdateStatus` return should suffice.
+
+
+
+- **Depends on prompt 2 (Pod-state classifier) being completed first.** If `JobWatcher.HandlePod` or `pkg.ZombieReasonImagePullBackOff` do not exist in `task/executor/pkg/`, stop and report blocker — do NOT attempt to implement them in this prompt.
+- `github.com/bborbe/errors.Wrapf(ctx, err, ...)` for wrapping; no `fmt.Errorf`; no bare `return err`.
+- Ginkgo/Gomega tests; counterfeiter `FakeResultPublisher` for capturing publish calls.
+- `libtime.CurrentDateTimeGetter` injection is NOT required in this prompt — the envtest exercises the Pods informer path which does not perform deadline math; the test asserts the publisher is called within a wall-clock bound.
+- envtest is gated by the `envtest` build tag — `go test ./...` (default) MUST remain fast (~seconds). Only `go test -tags=envtest ./...` invokes the heavy path.
+- Do NOT add envtests beyond the single ImagePullBackOff path — Scenario coverage in the spec explicitly limits scope.
+- Do NOT commit — dark-factory handles git.
+- Verification command is `cd task/executor && make precommit`.
+- The Acceptance bound is `2 * zombieSweeperIntervalSeconds` = 120s; the test polls with `Eventually(publisher.PublishFailureCallCount, 30*time.Second, 100*time.Millisecond)`, which is well inside the bound and keeps test wall time low.
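+
+The "reach the value, then confirm it stays" pattern behind the test's `Eventually` and `Consistently` assertions can be sketched with the standard library alone. `waitUntil` and `holds` are hypothetical helpers written for this illustration; they approximate the semantics and are not Gomega's implementation.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sync/atomic"
+	"time"
+)
+
+// waitUntil polls cond every interval until it returns true or the
+// timeout elapses, and reports whether cond became true.
+func waitUntil(cond func() bool, timeout, interval time.Duration) bool {
+	deadline := time.Now().Add(timeout)
+	for {
+		if cond() {
+			return true
+		}
+		if time.Now().After(deadline) {
+			return false
+		}
+		time.Sleep(interval)
+	}
+}
+
+// holds polls cond every interval for the whole window and reports
+// whether it stayed true throughout.
+func holds(cond func() bool, window, interval time.Duration) bool {
+	deadline := time.Now().Add(window)
+	for time.Now().Before(deadline) {
+		if !cond() {
+			return false
+		}
+		time.Sleep(interval)
+	}
+	return cond()
+}
+
+func main() {
+	var calls atomic.Int32
+	go func() {
+		time.Sleep(50 * time.Millisecond)
+		calls.Add(1) // simulates the single PublishFailure call
+	}()
+	exactlyOne := func() bool { return calls.Load() == 1 }
+	fmt.Println("reached:", waitUntil(exactlyOne, time.Second, 10*time.Millisecond))
+	fmt.Println("stable:", holds(exactlyOne, 200*time.Millisecond, 20*time.Millisecond))
+}
+```
+
+The first helper alone cannot prove "exactly one" because it stops at the first matching observation; the second catches a duplicate call that lands shortly afterwards.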
+
+
+
+```
+cd task/executor && make precommit
+```
+
+Must exit 0. Specifically:
+- The envtest spins up an in-process kube-apiserver.
+- Creates a Pod labelled `agent.benjamin-borbe.de/task-id=envtest-task-1`.
+- Sets the Pod status to `ImagePullBackOff` via the Status subresource.
+- The executor's `JobWatcher.HandlePod` (driven by the real informer) fires and calls `publisher.PublishFailure` exactly once.
+- The captured reason argument equals `"image_pull_backoff"`.
+- All happens within 30s wall-clock (well under the spec's 120s bound).
+
diff --git a/specs/ideas/job-active-deadline-seconds.md b/specs/ideas/job-active-deadline-seconds.md
deleted file mode 100644
index b0e378f..0000000
--- a/specs/ideas/job-active-deadline-seconds.md
+++ /dev/null
@@ -1,120 +0,0 @@
----
-status: idea
-tags:
- - dark-factory
- - spec
- - controller
- - executor
----
-
-## Summary
-
-- Agent Jobs get a hard runtime ceiling (`activeDeadlineSeconds` in the K8s Job spec)
-- The executor sets this when spawning each Job; once exceeded, K8s kills the Pod and the Job moves to Failed
-- Default value applies to every agent (sane upper bound, e.g. 30 min)
-- Per-agent override via a new `AgentConfig` field for slow agents (e.g. trade-analysis on a 100-trade sweep) or fast agents (e.g. heartbeat ping)
-- Closes the "agent runs forever" failure mode where a hung Claude session, infinite loop, or overshooting plan accumulates compute and blocks downstream queue progress
-
-## Problem
-
-Today an agent Job runs until it completes or the cluster reschedules it — there is no upper bound. We've seen single trade-analysis Phase 2 runs go past 35 minutes (live observation 2026-04-28, task `81f0affd`) when the prompt asked for too many trades. If Claude hangs on a tool call, gets stuck in a retry loop, or hits a degenerate plan, the Job consumes a Pod slot, an OAuth-PVC mount, and Claude API quota indefinitely until a human notices.
-
-Symptoms today:
-- No surface signal that an agent is "running too long" — operators learn from `kubectl get jobs` runtime
-- Trigger-cap escalation logic (3 retries → human_review) only fires when the Job *exits* with a non-success status; a perpetually-running Job never exits and never escalates
-- Resource leaks under sustained load: imagine a controller bug that re-spawns the same task every minute — without a deadline, every spawn lingers
-
-## Goal
-
-After this work, **every agent Job has a deadline**, and operators can tune that deadline per agent:
-
-1. Executor stamps `spec.activeDeadlineSeconds` on every Job it creates from an `AgentConfig`
-2. Default deadline is the cluster-wide default (e.g. `1800` = 30 min) shipped in code
-3. `AgentConfig.spec.jobActiveDeadlineSeconds` (new optional field) overrides the default per agent
-4. When the deadline fires, the Pod is killed with `DeadlineExceeded`. The existing executor-side completion handler (which already routes Job failures into the trigger-cap escalation path) sees this as a failure and increments `trigger_count` on the next pickup
-5. After 3 deadline-exceeded retries, the trigger-cap escalation routes to `human_review` — same path as crash-loops today
-
-## Non-goals
-
-- Soft graceful-shutdown with a SIGTERM grace period (K8s already handles this via `terminationGracePeriodSeconds`; we're not changing that)
-- Cluster-wide policy enforcement (e.g. via OPA/Gatekeeper) — the executor is the single point that creates Jobs, that's where the policy lives
-- Different deadlines per task (a fast smoke vs. slow nightly sweep) — same agent always gets the same deadline; if a specific task needs longer, split into a separate `AgentConfig` (sibling to today's per-stage Configs)
-- Mid-run deadline extension — the Job is killed and retried; no "snooze"
-- Backporting an `activeDeadlineSeconds` field to old running Jobs — the field is immutable on the K8s Job spec; the change applies to NEW Jobs only
-- Per-step or per-phase deadlines (the deadline is the whole Job, which today is one phase but in framework agents wraps a single phase too)
-
-## Desired Behavior
-
-### Default
-
-When `AgentConfig.spec.jobActiveDeadlineSeconds` is unset, the executor uses a hardcoded default — pick something operators can live with. Suggested: **`1800` (30 min)**. This is generous for current agents (claude/code/gemini/hypothesis ~5min, trade-analysis ~15min, backtest variable) and tight enough that a hung Job is detected within an hour.
-
-### Per-agent override
-
-```yaml
-apiVersion: agent.example.com/v1
-kind: AgentConfig
-metadata:
- name: trade-analysis-agent
- namespace: dev
-spec:
- assignee: trade-analysis-agent
- image: docker.example.com/agent-trade-analysis:dev
- # NEW: override the cluster default. Optional; omit to inherit default.
- jobActiveDeadlineSeconds: 3600 # 1 hour for slow sweeps
-```
-
-```yaml
-spec:
- assignee: heartbeat-agent
- jobActiveDeadlineSeconds: 60 # tight bound; should never exceed
-```
-
-### Stamping on Job creation
-
-In `task/executor`, where the Job is built from the AgentConfig (search `corev1.PodSpec` / `batchv1.JobSpec`), add:
-
-```go
-deadline := defaultJobActiveDeadlineSeconds // const, e.g. 1800
-if cfg.Spec.JobActiveDeadlineSeconds != nil && *cfg.Spec.JobActiveDeadlineSeconds > 0 {
- deadline = *cfg.Spec.JobActiveDeadlineSeconds
-}
-job.Spec.ActiveDeadlineSeconds = &deadline
-```
-
-`*int64` (pointer) on the CRD field so we distinguish "unset → use default" from "explicitly 0 → no deadline" (we MUST reject `0` at validation; K8s treats 0 as "kill immediately"). Either reject in CRD validation or coerce to default.
-
-### Failure semantics
-
-K8s emits a `Failed` Job condition with reason `DeadlineExceeded` when the deadline fires. The executor's existing failure handler (which today catches Pod crashes) sees this as a failure and:
-1. Increments `trigger_count` on the task file
-2. If `trigger_count >= max_triggers` (3), escalates to `human_review`
-3. Otherwise the controller reschedules and the next Job runs with the same deadline
-
-No new code path for the deadline-exceeded case — it reuses the existing failure handling.
-
-### Observability
-
-- Log line at Job creation time: `creating job assignee=X deadline=Ys`
-- Log line at Job failure with reason: `job failed assignee=X reason=DeadlineExceeded`
-- Existing Prometheus metric `agent_task_failed_total` covers it; consider adding a `reason` label so deadline-exceeded is distinguishable from crash
-
-## Open Questions
-
-- **Default value**: 30 min suggested. Survey current agents' p99 runtime first; adjust before shipping.
-- **Where to define the default**: code constant, env var on the executor, or cluster ConfigMap? Code constant is simplest; env var lets ops tune without redeploy. Recommend env var with a code default.
-- **Trade-analysis specifically**: today's task `81f0affd` ran 35+ min on 38 trades (~50s/trade). With the 30-min default it would have been killed mid-run. Either bump trade-analysis to `3600` in its AgentConfig, or scope tasks tighter (per [[Trade Analysis Agent Guide]] § E2E Testing tier-2 = 1 day = ~2 min). Both fixes apply, the spec covers the mechanism.
-- **Interaction with manual `kubectl exec` debug**: deadline applies to the Job's main container too. If we attach to a long-lived debug session, the Job dies. Acceptable — debug Jobs are a separate concern; if needed, set `jobActiveDeadlineSeconds: 86400` per-agent for debug variants.
-
-## Out of Scope
-
-- Soft "warning" thresholds at 50% / 75% of deadline (could add a metric later)
-- Per-task overrides via task frontmatter — keep deadlines at the AgentConfig level
-- Replacing `activeDeadlineSeconds` with a custom watchdog inside the agent binary
-
-## Related
-
-- [[Agent Task Controller Architecture]] — full pipeline architecture
-- `task/executor/pkg/job/builder.go` (or similar) — where Jobs are constructed today
-- AgentConfig CRD definition in the controller repo
-- Existing trigger-cap escalation logic — we lean on it; no new escalation path needed
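Reviewer sketch, not part of the patch: the removed idea note stamped a default deadline and required that `0` be rejected, while spec 043 (added below) replaces that rule with an admission floor of 30 seconds on `zombieJobTimeoutSeconds`. A minimal Go sketch of how the two rules might combine follows. The helper name `resolveJobDeadline`, the constant names, and the sentinel error are hypothetical; the values 1800 and 30 and the message text come from the spec. It uses the standard library for self-containment, whereas the repository's conventions call for `github.com/bborbe/errors.Wrapf`.

```go
package main

import (
	"errors"
	"fmt"
)

// Values taken from spec 043; the constant names are assumptions.
const (
	defaultZombieJobTimeoutSeconds int64 = 1800
	minZombieJobTimeoutSeconds     int64 = 30
)

// errBelowFloor mirrors the admission message quoted in the spec.
var errBelowFloor = errors.New("invalid: must be >= 30")

// resolveJobDeadline is a hypothetical helper: nil means "unset, use the
// default"; any explicit value below the floor is rejected instead of being
// silently coerced.
func resolveJobDeadline(configured *int64) (int64, error) {
	if configured == nil {
		return defaultZombieJobTimeoutSeconds, nil
	}
	if *configured < minZombieJobTimeoutSeconds {
		return 0, fmt.Errorf("zombieJobTimeoutSeconds=%d: %w", *configured, errBelowFloor)
	}
	return *configured, nil
}
```

The resolved value would then be assigned by pointer to `job.Spec.ActiveDeadlineSeconds`, as in the snippet from the removed note.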
diff --git a/specs/in-progress/043-executor-zombie-job-detection.md b/specs/in-progress/043-executor-zombie-job-detection.md
new file mode 100644
index 0000000..79a30e5
--- /dev/null
+++ b/specs/in-progress/043-executor-zombie-job-detection.md
@@ -0,0 +1,140 @@
+---
+status: verifying
+tags:
+ - dark-factory
+ - spec
+ - executor
+approved: "2026-06-01T18:52:44Z"
+generating: "2026-06-01T18:57:25Z"
+prompted: "2026-06-01T19:55:19Z"
+verifying: "2026-06-01T21:08:46Z"
+branch: dark-factory/executor-zombie-job-detection
+---
+
+## Summary
+
+- Executor must classify a dispatched agent Job that never produces an `AgentResult` as failed, so the task does not park silently with `current_job` set forever.
+- Two paths today fail this contract: silently-dying Pods (ImagePullBackOff, unschedulable, evicted, crash-before-stdout) and Jobs lost to executor restarts; neither emits a failure event.
+- Existing `PublishFailure` / `PublishTypeMismatchFailure` violate doctrine: they write `phase: human_review` / `phase: ai_review` AND clear assignee immediately, short-circuiting the existing `trigger_count` retry cap. Result: every failure escalates to the operator inbox on the first occurrence, including transient causes (image-pull race, eviction) that would self-heal on the very next dispatch.
+- Adds (a) retry-aware zombie failure publisher (transient causes bump `trigger_count` + clear `current_job`; assignee untouched; the existing `applyTriggerCap` handles escalation when `trigger_count >= max_triggers`), (b) Pod-state classifier for non-Job-terminal failures, (c) deadline sweeper as safety net, (d) k8s-native `activeDeadlineSeconds` as primary detector, (e) publish-layer dedupe, (f) corrected `PublishTypeMismatchFailure` that escalates immediately (semantic failure — retry won't help) but via the doctrine-correct shape.
+- Transient zombies (image-pull race, eviction, unschedulable) self-heal via auto-retry; persistent zombies escalate via the existing `applyTriggerCap` chokepoint which clears assignee and appends `## Trigger Cap Escalation`. Semantic failures (type mismatch) escalate immediately. All operator-inbox surfaces land via the `assignee == ""` signal.
+
+## Problem
+
+A dispatched agent Job that produces no `AgentResult` within its deadline is never classified as failed by `task/executor`. Causes observed in production: ImagePullBackOff during deploy rollover, OOM-kill before stdout flush, pod eviction, unschedulable Pod, executor restart between dispatch and Job observation. The controller's spec 039 unassign + `## Failure` doctrine therefore never fires. The task's `current_job` frontmatter stays set; `task_event_handler.checkActiveCurrentJob` reads non-empty `current_job` as "in flight, don't dispatch" so the task is parked indefinitely with no operator-inbox surface (assignee is non-empty, so the empty-assignee inbox filter misses it).
+
+Concrete incident (`bborbe-maintainer` release `837deb0`, 2026-05-31): controller advanced `phase: ai_review`, set `current_job: github-releaser-agent-...`, then 17 hours of silence. Operator discovered the parked task manually; manual `current_job` clear let the watcher re-dispatch successfully in under a minute. Root cause was a Job that died before stdout reached Kafka during a BUCA-induced image-pull race — exactly the silent path this spec closes.
+
+Compounding the operational bug: `PublishFailure` in `task/executor/pkg/result_publisher.go` writes `phase: human_review` (which routes through spec 042's `ClearAssigneeIfHumanReview` chokepoint, clearing assignee immediately) and `PublishTypeMismatchFailure` writes `phase: ai_review`. Both violate the doctrine codified in spec 039 (`docs/task-flow-and-failure-semantics.md` § Status Taxonomy & Inbox Signal): `human_review` is reserved for agent-emitted successful verdicts that need human confirmation, never for failure escalation.
+
+A second, more consequential bug compounds the first: clearing assignee on the first failure short-circuits the existing `trigger_count` retry cap (controller's `applyTriggerCap`, `result_writer.go:234-248`). The cap mechanism (incrementally bumping `trigger_count` at each spawn, allowing automatic re-dispatch until `trigger_count >= max_triggers`, then clearing assignee + appending `## Trigger Cap Escalation`) was built to handle exactly the transient-zombie case the 2026-05-31 incident represents. With assignee cleared on the first failure, the cap never fires; every transient failure burns operator attention. The Agent Phase Dispatch Guide's stated retry doctrine ("per-phase retry up to cap, then unassign") matches the cap-machinery's design intent but not the current code path.
+
+The correct failure shape for transient zombies is "leave phase+status+assignee untouched, clear current_job, bump trigger_count via the existing `PublishIncrementTriggerCount` semantics, append `## Failure` body — and let the next dispatch cycle either retry (`trigger_count < max_triggers`) or land at the existing `applyTriggerCap` escalation." Type-mismatch failures are semantically different (retrying with the same agent cannot help — the agent's config simply doesn't accept this task type), so they escalate immediately: leave phase+status untouched, clear assignee, set previous_assignee, clear current_job, append `## Failure` body. Same chokepoint, two semantics.
+
+## Goal
+
+After this work, every dispatched agent Job that does not produce an `AgentResult` within its deadline is observed by the executor and either (a) auto-retries via the existing `trigger_count` mechanism until `applyTriggerCap` escalates to the operator inbox, or (b) escalates immediately for semantic failures where retry cannot help. Transient zombies (image-pull race, eviction, unschedulable) self-heal on the next dispatch cycle; persistent zombies surface within a bounded time (`max_triggers × zombieJobTimeoutSeconds` worst case). All failure emissions leave `phase` and `status` unchanged, clear `current_job`, append a `## Failure` body with a distinct machine-readable reason string. Retry-eligible failures (zombies) leave `assignee` set so the next dispatch cycle re-spawns; immediate-escalation failures (type mismatch) clear `assignee` and set `previous_assignee` directly. The existing `applyTriggerCap` chokepoint handles the eventual operator-inbox surface for zombies at cap.
+
+## Non-goals
+
+- Do NOT introduce a separate retry counter for zombies — reuse the existing `trigger_count` mechanism. The cap (`max_triggers` on AgentConfig) already governs how many spawn attempts the executor makes; zombie classification simply participates in the same counter.
+- Do NOT change `current_job` semantics — "non-empty means in flight" is the existing contract; the bug is that nothing cleared it on Job death, not the semantics.
+- Do NOT modify controller-side cap logic — `applyTriggerCap` already does the right thing when `trigger_count >= max_triggers`. This spec corrects the executor's emission to let the existing chokepoint fire.
+- Do NOT introduce per-phase deadline tuning, Config CRD restructuring, or expanded observability metrics — separate concerns.
+- Do NOT add an opt-out flag that disables zombie classification — the classifier is invariant; if a future consumer demands variation, that's a separate spec.
+- Do NOT add per-task deadline override via task frontmatter — deadlines live at the AgentConfig level (consistent with the `job-active-deadline-seconds` idea note).
+
+## Desired Behavior
+
+1. **Retry-aware `PublishFailure` (for zombies — transient infrastructure failures).** A call to publish a zombie failure leaves `phase`, `status`, AND `assignee` unchanged on the task, clears `current_job` to `""`, and appends a `## Failure` body section containing timestamp, job name, and reason. It also atomically bumps `trigger_count` by 1 (via the existing increment semantics — same field the dispatch loop checks at line 422 of `task_event_handler.go`). The published command is a single atomic write (either `update-frontmatter` with a paired increment, or a composite command — agent decides at impl time based on existing patterns; the constraint is atomicity, not transport). The next dispatch cycle then either re-spawns (`trigger_count < max_triggers`) or skips and falls through to the existing `applyTriggerCap` escalation (`trigger_count >= max_triggers`), which clears assignee and appends `## Trigger Cap Escalation` on the controller side. NO `phase: human_review` or `phase: ai_review` is written.
+2. **Immediate-escalation `PublishTypeMismatchFailure` (for semantic failures — retry cannot help).** A call to publish a type-mismatch failure leaves `phase` and `status` unchanged, clears `assignee` to `""`, sets `previous_assignee` to the prior assignee value, clears `current_job` to `""`, and appends a `## Failure` body section with reason `type_mismatch`. NO `phase: ai_review` is written. The semantic distinction from (1) is that retrying with the same agent cannot resolve the mismatch — the agent's config simply doesn't accept this task type — so cap-aware retry has no value. Operator must reassign to a different agent.
+3. **Pod-state classifier extension.** In addition to the existing Job-condition classifier, the executor classifies a Pod-state failure when any of these hold and emits a failure via (1) with the listed reason: ImagePullBackOff / ErrImagePull → `image_pull_backoff`; unschedulable beyond a grace window → `pod_not_scheduled`; Pod evicted → `pod_evicted`; Pod terminated non-zero exit with no AgentResult ever published → `pod_crash_no_stdout`.
+4. **Deadline sweeper.** A goroutine runs alongside the existing job watcher loop. Each cycle (interval default 60 seconds), it walks the TaskStore. A task is classified as a zombie iff its Job's elapsed time exceeds the deadline AND its Pod is not Running AND no recent heartbeat (AgentResult or Pod transition) has been observed. Zombie classification emits a failure via (1) with reason `deadline_exceeded`, or `executor_watch_lost` if the Job exists in k8s but was absent from TaskStore on a prior cycle (i.e. executor restart after dispatch).
+5. **k8s-native primary deadline.** The Job spawner stamps `activeDeadlineSeconds` on every Job spec at dispatch time, sourced from the same Config knob the sweeper uses as fallback. When the deadline fires, the Job controller terminates the Job's running Pods, the Job acquires a `Failed` condition with reason `DeadlineExceeded`, and the existing Job-condition classifier path fires. The sweeper is the safety net for cases the k8s-native path misses (Pod never reaches Active, executor restart loses watch).
+6. **Configurable knobs on AgentConfig.** Two new optional fields: `spec.zombieSweeperIntervalSeconds` (default 60, validation floor 10 — admission rejects values < 10 with `invalid: must be >= 10`) and `spec.zombieJobTimeoutSeconds` (default 1800, validation floor 30 — admission rejects values < 30 with `invalid: must be >= 30`). The latter is also the source for the Job's `activeDeadlineSeconds`. If unset, defaults apply. Floors prevent thrash and pathological short-deadline kills (see Security / Abuse Cases for the threat model).
+7. **Idempotency / dedupe.** The publish layer deduplicates failure emissions keyed by `current_job` ID. Two concurrent classifications for the same job (e.g. sweeper fires after Job-condition informer already fired) result in one published failure event; the second emission is a no-op and logs `event=zombie_dedupe job=` at V(2). Dedupe TTL is pinned to `2 × zombieJobTimeoutSeconds` (default 3600s) so the window survives any normal sweeper-vs-informer race; LRU capacity is pinned at 1024 entries (the executor handles ≪1024 concurrent tasks in practice).
+8. **Reason enum.** A fixed set of reason strings is exposed in code: `image_pull_backoff`, `pod_evicted`, `deadline_exceeded`, `pod_not_scheduled`, `pod_crash_no_stdout`, `executor_watch_lost`, `type_mismatch`. The reason appears verbatim in the `## Failure` body so operators can grep and triage. New reasons require a new code constant (agent decides at impl time which Go construct — string const, typed enum, or table — best fits existing conventions).
+9. **Healthy-Job path preserved.** A Job that completes within its deadline and whose Pod is Running or Succeeded with normal stdout is never classified as a zombie. The combined predicate (`elapsed > deadline AND pod not Running AND no recent heartbeat`) is what protects long-but-completing Jobs from false-positive classification.
+10. **Eventual escalation bound for persistent zombies.** A task that experiences `max_triggers` consecutive zombie classifications without an intervening successful spawn surfaces in the operator inbox via the existing `applyTriggerCap` chokepoint — assignee cleared, `## Trigger Cap Escalation` appended. Worst-case latency to inbox is `max_triggers × zombieJobTimeoutSeconds` (default 3 × 30 min = 90 min). This is acceptable because the alternative (escalating on the first transient failure) burns operator attention on causes that self-heal in seconds.
+
+## Constraints
+
+- **Must not change.** The controller's spec 039 result-routing logic; the `lib.Task` schema or Kafka topic schemas; the AgentResult / agent-emitted-success path; the existing `current_job != ""` "in flight" gate semantics; `allowedPhases` filtering; the Pattern B Job-spawn invariant (one Job per task at a time).
+- **Frozen conventions.** `github.com/bborbe/errors.Wrapf(ctx, err, ...)` for wrapping (no bare `return err`, no `fmt.Errorf`); `libtime.CurrentDateTimeGetter` injection for all time math (never `time.Now()`); Ginkgo/Gomega + counterfeiter mocks for tests; glog with `V(n)` gating for non-error logs; `service.Run` for goroutine lifecycle.
+- **Build / deploy.** Verification is `cd task/executor && make precommit`; never `make precommit` at repo root. Deploy via `agent-dev` worktree, never this worktree.
+- **Doctrine reference.** Failure shape is defined by spec 039 (`specs/completed/039-controller-stop-setting-human-review-on-failure.md`) and the Inbox Signal section of `docs/task-flow-and-failure-semantics.md`. The previous_assignee field is defined by spec 027.
+- **Worktree.** This spec is implemented in the `agent-zombie-detect` worktree on branch `feature/zombie-detect` off `origin/master`.
+
+## Assumptions
+
+- The existing job_watcher informer wiring is sound for Job-condition events; the gap is purely Pod-level classification + sweeper + emission doctrine.
+- The existing `trigger_count` / `max_triggers` / `applyTriggerCap` machinery is functional and correctly handles cap-hit escalation today; the spec's "let the cap fire" behavior depends on this. If the cap path is also broken, that's a separate spec, but the worktree-local audit at spec-creation time observed: handler line 422 checks `trigger_count >= max_triggers` and skips; result_writer `applyTriggerCap` line 234 clears assignee and appends `## Trigger Cap Escalation`.
+- AgentConfig CRD updates are backward-compatible — adding optional `zombieSweeperIntervalSeconds` and `zombieJobTimeoutSeconds` does not break existing manifests.
+- TaskStore is the authoritative in-memory map of "tasks the executor believes are in flight" and is repopulated on restart from the Job informer's initial list (existing behavior per spec 009).
+- The `bborbe-maintainer` 2026-05-31 incident generalizes: any silent-death cause that prevents stdout from reaching Kafka leaves the task parked. The four Pod-state categories named in Desired Behavior (3) and the sweeper in (4) together cover all observed and projected modes.
+- An idea-stage spec already exists for `activeDeadlineSeconds` (`specs/ideas/job-active-deadline-seconds.md`). This spec subsumes the deadline-stamping behavior. Approval-time housekeeping: when `dark-factory spec approve` numbers and moves this spec to `in-progress/`, also delete the idea note (its full content is captured in Desired Behavior 5 and the failure-mode "Job exceeds activeDeadlineSeconds" row).
+
+## Failure Modes
+
+| Trigger | Detection | Expected behavior | Recovery | Reversibility | Concurrency |
+|---------|-----------|-------------------|----------|---------------|-------------|
+| Pod ImagePullBackOff after dispatch (transient — e.g. BUCA race) | Pod-state classifier (Pod.Status.ContainerStatuses[].State.Waiting.Reason) | Publish zombie failure, reason=`image_pull_backoff`; trigger_count++; current_job cleared; assignee preserved | Next dispatch cycle re-spawns automatically; transient causes self-heal in seconds | Reversible — auto-retry | Dedupe by current_job; one failure event per job-instance |
+| Pod unschedulable (no node) beyond grace | Pod.Status.Phase=Pending + PodScheduled=False + age > grace | Publish zombie failure, reason=`pod_not_scheduled`; trigger_count++ | Auto-retry next cycle; persistent unschedulability hits cap → operator inbox | Reversible | Same dedupe |
+| Pod evicted (node pressure, preemption) | Pod.Status.Reason=`Evicted` | Publish zombie failure, reason=`pod_evicted`; trigger_count++ | Auto-retry next cycle; persistent eviction hits cap → operator inbox | Reversible | Same dedupe |
+| Pod crashes before stdout flush (panic, SIGKILL) | Pod terminated non-zero + no AgentResult on Kafka | Publish zombie failure, reason=`pod_crash_no_stdout`; trigger_count++ | Auto-retry next cycle; persistent crashes hit cap → operator inbox | Reversible | Same dedupe |
+| Job exceeds activeDeadlineSeconds | Job controller terminates Pod → Job `Failed` condition with reason `DeadlineExceeded` → existing Job-condition path | Publish zombie failure, reason=`deadline_exceeded`; trigger_count++ | Auto-retry next cycle; persistent deadline overruns hit cap → operator inbox (often signals task too big; operator splits task or raises timeout) | Reversible | Same dedupe |
+| Executor restarts between dispatch and observation, Job still exists in k8s | Sweeper finds TaskStore entry was absent on prior cycle | Publish zombie failure, reason=`executor_watch_lost`; trigger_count++ | Auto-retry next cycle; informer reseeds TaskStore on restart | Reversible | Sweeper runs single-threaded; dedupe by current_job |
+| Persistent zombie cause (`trigger_count >= max_triggers`) | Controller's `applyTriggerCap` on next result write | Clear assignee; set previous_assignee; append `## Trigger Cap Escalation`; phase+status preserved | Operator inspects `## Failure` history, decides reassign / split task / abort | Reversible — operator action | Cap chokepoint is the single escalation path; subsequent failures of the same task hit the `containsEscalationSection` early-return |
+| Type mismatch (task_type not in agent's accepted set) | task_event_handler.go:199 | Publish immediate-escalation failure, reason=`type_mismatch`; clear assignee, set previous_assignee, clear current_job; phase+status preserved | Operator reassigns to a different agent (retrying same agent cannot help) | Reversible — operator action | Single emission; dedupe not needed (called once per task-trigger) |
+| Sweeper fires after Job-condition informer already fired | Dedupe LRU keyed by current_job | Second emission is no-op; logs `event=zombie_dedupe` at V(2) | None needed | N/A | Bounded by LRU size + TTL |
+| Long-but-completing Job (slow strategy run, not zombie) | Pod is Running with recent heartbeat | NOT classified as zombie; sweeper skips | None needed | N/A | Predicate combines elapsed + pod-state + heartbeat |
+| Kafka publish of failure command fails (broker outage) | publish call returns error | Error logged; sweeper retries next cycle; Job-condition path retries on next informer event | Automatic on next cycle | Reversible | Idempotent: dedupe LRU prevents double-publish once Kafka recovers |
+| Clock skew between executor pod and k8s node | `libtime.CurrentDateTimeGetter` is the single time source | Sweeper uses executor's clock for elapsed math; k8s `activeDeadlineSeconds` uses node clock; small skew tolerated (seconds) | None needed | N/A | Worst case: sweeper fires slightly before/after k8s; dedupe handles it |
+| TaskStore loses entry but Job still running healthily | Sweeper sees k8s Job without TaskStore record | Emit `executor_watch_lost` only if Job is also not-Running / past-deadline; healthy Running Jobs are repopulated by informer initial list | Automatic | Reversible if false positive (operator re-delegates) | Sweeper waits one full cycle after restart to allow informer initial list to seed TaskStore |
+| Two agent runs spawn concurrently for the same task (spec 009 idempotency gate) | Existing gate (`current_job` set) | This spec changes nothing about that gate | N/A | N/A | Existing dedupe holds |
+| k8s API rate-limit (429) on Job/Pod list during high-fan-out incident | client-go rate-limiter logs / 429 from API server | Sweeper relies on informer cache (no per-cycle list); high task count uses cached state, no extra API pressure | Automatic — informer resyncs on its own schedule | Reversible | Sweeper backs off naturally; dedupe prevents thrash if classifications retry |
+
+## Security / Abuse Cases
+
+- The executor already has RBAC for `get/list/watch` on Pods and Jobs in its namespace (per spec 009); this spec requires no new permissions. Stamping `activeDeadlineSeconds` is a Job spec field, not a new RBAC verb.
+- Reason strings appear verbatim in the public `## Failure` body in the vault. Reason values are a closed enum defined in code, not user input. No injection surface.
+- A malicious or careless AgentConfig with `zombieJobTimeoutSeconds: 1` would cause every Job to be killed almost immediately. Mitigation: validate the field at AgentConfig admission — `zombieJobTimeoutSeconds` must be `>= 30`; admission rejects with `invalid: must be >= 30`.
+- A malicious or careless AgentConfig with `zombieSweeperIntervalSeconds: 1` would cause sweeper thrash. Mitigation: same validation pattern — `zombieSweeperIntervalSeconds` must be `>= 10`; admission rejects with `invalid: must be >= 10`.
+- No new network endpoints, no new secrets, no new task content paths.
+
+## Acceptance Criteria
+
+- [ ] `PublishFailure` after this change writes (atomically) `current_job: ""` and bumps `trigger_count` by 1, AND does NOT write any `phase`, `status`, `assignee`, or `previous_assignee` keys; the `## Failure` body section is appended with timestamp + job + reason — evidence: unit test asserts the published command(s) produce exactly the expected frontmatter delta (current_job cleared, trigger_count +1, no other keys) and the body section is present; test exit code 0
+- [ ] After `PublishFailure` is called `max_triggers - 1` times in succession on a task whose `max_triggers` is 3, the resulting frontmatter has `assignee` non-empty, `trigger_count == max_triggers - 1`, `current_job == ""`, and `phase` + `status` unchanged — evidence: unit test asserts the resulting task file's frontmatter delta after each call; test exit code 0
+- [ ] After the `max_triggers`-th call (reaching `trigger_count == max_triggers`), the existing `applyTriggerCap` chokepoint fires on the next result write: assignee cleared to `""`, `previous_assignee` set to prior agent name, body contains `## Trigger Cap Escalation` section, `phase` + `status` still unchanged — evidence: unit test asserts the resulting task file matches all four conditions; test exit code 0
+- [ ] `PublishTypeMismatchFailure` after this change writes `assignee: ""`, `previous_assignee: `, `current_job: ""`, AND does NOT write any `phase` or `status` keys; the `## Failure` body contains `reason=type_mismatch` verbatim — evidence: unit test asserts published command's Updates map equals the expected set exactly (no extras, no missing keys) and body contains the reason; test exit code 0
+- [ ] Pod-state classifier emits exactly one failure event per Pod-state transition for each of the four Pod-state reasons (`image_pull_backoff`, `pod_not_scheduled`, `pod_evicted`, `pod_crash_no_stdout`) — evidence: table-driven unit test, one row per reason, asserts published reason string equals expected; test exit code 0
+- [ ] Deadline sweeper, given a mocked `CurrentDateTimeGetter`, classifies a task as zombie iff `elapsed > deadline AND pod not Running AND no recent heartbeat` — evidence: unit test with mocked time and TaskStore covering at least four cells (deadline-exceeded-and-not-running → zombie; deadline-exceeded-but-running → not zombie; under-deadline → not zombie; watch-lost → `executor_watch_lost`); test exit code 0
+- [ ] Sweeper interval is sourced from `AgentConfig.spec.zombieSweeperIntervalSeconds` (default 60, validation floor 10); deadline is sourced from `AgentConfig.spec.zombieJobTimeoutSeconds` (default 1800, validation floor 30); both defaults applied when unset; values below floor are rejected at admission — evidence: unit test asserts default values are used when CRD fields are nil, override values are used when set, and below-floor values produce a validation error; test exit code 0
+- [ ] `SpawnJob` stamps `Spec.ActiveDeadlineSeconds` equal to the resolved `zombieJobTimeoutSeconds` on every Job it creates — evidence: unit test inspects the constructed `batchv1.Job` and asserts the field; test exit code 0
+- [ ] Dedupe layer suppresses a second publish for the same `current_job` within a bounded TTL window — evidence: unit test calls `PublishFailure` twice with the same job name, asserts only one Kafka send is observed via the mocked sender; test exit code 0
+- [ ] envtest reproduces ImagePullBackOff against a real (in-process) k8s informer wiring: a Pod with a bogus image reference is observed and the corresponding task receives one failure event with reason=`image_pull_backoff` within `2 × zombieSweeperIntervalSeconds` of the Pod entering ImagePullBackOff — evidence: envtest passes (exit code 0); test inspects the published-failures mock and asserts exactly one entry with the expected reason; test fails (non-zero exit) if observation time exceeds the bound
+- [ ] `cd task/executor && make precommit` exits 0 — evidence: exit code
+
+**Scenario coverage.** NO new scenario. Rationale: unit tests cover doctrine, classifier table, sweeper math, dedupe. The single envtest covers the one path (real k8s informer reaction to ImagePullBackOff) that a unit mock cannot faithfully reproduce. The end-to-end "deploy a Job referencing a non-existent image and watch the operator inbox surface" check is documented under Verification as a post-deploy manual step in `agent-dev`, not as a scenario in CI — running it requires real cluster state that scenarios cannot stand up.
+
+## Verification
+
+Local build and unit/envtest gate:
+
+```
+cd task/executor && make precommit
+```
+
+Post-deploy manual verification (operator runs in `agent-dev` after merge + deploy):
+
+1. Apply an AgentConfig pointing at a non-existent image tag (e.g. `docker.example.com/agent-claude:does-not-exist`) for a throwaway agent name; set `max_triggers: 3` on the test task to keep total wait bounded.
+2. Create a task assigned to that agent name.
+3. Within a few minutes (Pod-state classifier fires on ImagePullBackOff well before `zombieJobTimeoutSeconds`), observe successive `## Failure` sections appearing in the task body with reason=`image_pull_backoff`, `current_job` cleared between each, and `trigger_count` incrementing toward `max_triggers`. Assignee stays set across the first `max_triggers - 1` failures.
+4. After `trigger_count` reaches `max_triggers`, observe `## Trigger Cap Escalation` section appended, `assignee: ""`, `previous_assignee: ` — i.e. the existing chokepoint fired.
+5. Confirm the task surfaces in the operator inbox via the existing `assignee == ""` filter at that moment (not earlier).
+6. Confirm no further re-spawns occur (executor respects the cleared assignee gate after cap).
+
+## Do-Nothing Option
+
+Keep the current `result_publisher.go` and `job_watcher.go`. Outcome: tasks whose Jobs die silently continue to park indefinitely with `current_job` set and `assignee` non-empty, invisible to the inbox filter. Operators discover stuck tasks by accident — the 2026-05-31 incident is the existence proof. Doctrine drift compounds: every new caller of `PublishFailure` learns the wrong failure shape from existing code, propagating the `phase: human_review` bug. As the agent fleet grows (more concurrent dispatches, more deploy rollovers, more chances for image-pull races), the incident rate scales with dispatch volume. Not acceptable.
diff --git a/task/executor/Makefile b/task/executor/Makefile
index 30f80a8..85b7791 100644
--- a/task/executor/Makefile
+++ b/task/executor/Makefile
@@ -10,7 +10,7 @@ export ROOTDIR ?= $(shell git rev-parse --show-toplevel)
default: precommit
-precommit: ensure format generate test check addlicense
+precommit: ensure format generate test test-envtest check addlicense
@echo "ready to commit"
ensure:
@@ -30,6 +30,18 @@ generate:
echo "package mocks" > mocks/mocks.go
go generate -mod=mod ./...
+ENVTEST_K8S_VERSION ?= 1.31.0
+
+.PHONY: envtest-setup
+envtest-setup:
+ @command -v setup-envtest >/dev/null 2>&1 || go install sigs.k8s.io/controller-runtime/tools/setup-envtest@latest
+ @setup-envtest use $(ENVTEST_K8S_VERSION) -p path > /dev/null
+
+.PHONY: test-envtest
+test-envtest: envtest-setup
+ ENVTEST_REQUIRED=1 KUBEBUILDER_ASSETS=$$(setup-envtest use $(ENVTEST_K8S_VERSION) -p path) \
+ go test -mod=mod -p=$${GO_TEST_PARALLEL:-1} -tags=envtest ./pkg/envtest
+
.PHONY: test
test:
go test -mod=mod -p=$${GO_TEST_PARALLEL:-1} -cover -race $(shell go list -mod=mod ./... | grep -v /vendor/)
diff --git a/task/executor/go.mod b/task/executor/go.mod
index 564cb27..662ed53 100644
--- a/task/executor/go.mod
+++ b/task/executor/go.mod
@@ -35,6 +35,7 @@ require (
k8s.io/apiextensions-apiserver v0.36.1
k8s.io/apimachinery v0.36.1
k8s.io/client-go v0.36.1
+ sigs.k8s.io/controller-runtime v0.21.0
sigs.k8s.io/structured-merge-diff/v6 v6.4.0
)
@@ -47,11 +48,13 @@ require (
github.com/bborbe/parse v1.10.12 // indirect
github.com/bborbe/strimzi v1.8.3 // indirect
github.com/beorn7/perks v1.0.1 // indirect
+ github.com/blang/semver/v4 v4.0.0 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc // indirect
github.com/eapache/go-resiliency v1.7.0 // indirect
github.com/eapache/queue v1.1.0 // indirect
github.com/emicklei/go-restful/v3 v3.13.0 // indirect
+ github.com/evanphx/json-patch/v5 v5.9.11 // indirect
github.com/fxamacker/cbor/v2 v2.9.0 // indirect
github.com/getsentry/sentry-go v0.46.2 // indirect
github.com/go-errors/errors v1.5.1 // indirect
diff --git a/task/executor/go.sum b/task/executor/go.sum
index bd01777..449a133 100644
--- a/task/executor/go.sum
+++ b/task/executor/go.sum
@@ -48,6 +48,8 @@ github.com/bborbe/vault-cli v0.64.3 h1:HHN2N6GhBCxdQB8Me++wSPS81fKMLHmJjsyxWf06K
github.com/bborbe/vault-cli v0.64.3/go.mod h1:lrrbavFV9kLuszwnmRmoJusuIbo5brBsPOF+eKZCUVE=
github.com/beorn7/perks v1.0.1 h1:VlbKKnNfV8bJzeqoa4cOKqO6bYr3WgKZxO8Z16+hsOM=
github.com/beorn7/perks v1.0.1/go.mod h1:G2ZrVWU2WbWT9wwq4/hrbKbnv/1ERSJQ0ibhJ6rlkpw=
+github.com/blang/semver/v4 v4.0.0 h1:1PFHFE6yCCTv8C1TeyNNarDzntLi7wMI5i/pzqYIsAM=
+github.com/blang/semver/v4 v4.0.0/go.mod h1:IbckMUScFkM3pff0VJDNKRiT6TG/YpiHIM2yvyW5YoQ=
github.com/cespare/xxhash/v2 v2.3.0 h1:UL815xU9SqsFlibzuggzjXhog7bL6oX9BbNZnL2UFvs=
github.com/cespare/xxhash/v2 v2.3.0/go.mod h1:VGX0DQ3Q6kWi7AoAeZDth3/j3BFtOZR5XLFGgcrjCOs=
github.com/davecgh/go-spew v1.1.0/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
@@ -60,6 +62,10 @@ github.com/eapache/queue v1.1.0 h1:YOEu7KNc61ntiQlcEeUIoDTJ2o8mQznoNvUhiigpIqc=
github.com/eapache/queue v1.1.0/go.mod h1:6eCeP0CKFpHLu8blIFXhExK/dRa7WDZfr6jVFPTqq+I=
github.com/emicklei/go-restful/v3 v3.13.0 h1:C4Bl2xDndpU6nJ4bc1jXd+uTmYPVUwkD6bFY/oTyCes=
github.com/emicklei/go-restful/v3 v3.13.0/go.mod h1:6n3XBCmQQb25CM2LCACGz8ukIrRry+4bhvbpWn3mrbc=
+github.com/evanphx/json-patch/v5 v5.9.11 h1:/8HVnzMq13/3x9TPvjG08wUGqBTmZBsCWzjTM0wiaDU=
+github.com/evanphx/json-patch/v5 v5.9.11/go.mod h1:3j+LviiESTElxA4p3EMKAB9HXj3/XEtnUf6OZxqIQTM=
+github.com/fsnotify/fsnotify v1.10.1 h1:b0/UzAf9yR5rhf3RPm9gf3ehBPpf0oZKIjtpKrx59Ho=
+github.com/fsnotify/fsnotify v1.10.1/go.mod h1:TLheqan6HD6GBK6PrDWyDPBaEV8LspOxvPSjC+bVfgo=
github.com/fxamacker/cbor/v2 v2.9.0 h1:NpKPmjDBgUfBms6tr6JZkTHtfFGcMKsw3eGcmD/sapM=
github.com/fxamacker/cbor/v2 v2.9.0/go.mod h1:vM4b+DJCtHn+zz7h3FFp/hDAI9WNWCsZj23V5ytsSxQ=
github.com/getsentry/sentry-go v0.46.2 h1:1jhYwrKGa3sIpo/y5iDNXS5wDoT7I1KNzMHrnK6ojns=
@@ -72,6 +78,8 @@ github.com/go-errors/errors v1.5.1 h1:ZwEMSLRCapFLflTpT7NKaAc7ukJ8ZPEjzlxt8rPN8b
github.com/go-errors/errors v1.5.1/go.mod h1:sIVyrIiJhuEF+Pj9Ebtd6P/rEYROXFi3BopGUQ5a5Og=
github.com/go-logr/logr v1.4.3 h1:CjnDlHq8ikf6E492q6eKboGOC0T8CDaOvkHCIg8idEI=
github.com/go-logr/logr v1.4.3/go.mod h1:9T104GzyrTigFIr8wt5mBrctHMim0Nb2HLGrmQ40KvY=
+github.com/go-logr/zapr v1.3.0 h1:XGdV8XW8zdwFiwOA2Dryh1gj2KRQyOOoNmBy4EplIcQ=
+github.com/go-logr/zapr v1.3.0/go.mod h1:YKepepNBd1u/oyhd/yQmtjVXmm9uML4IXUgMOwR8/Gg=
github.com/go-openapi/jsonpointer v0.22.5 h1:8on/0Yp4uTb9f4XvTrM2+1CPrV05QPZXu+rvu2o9jcA=
github.com/go-openapi/jsonpointer v0.22.5/go.mod h1:gyUR3sCvGSWchA2sUBJGluYMbe1zazrYWIkWPjjMUY0=
github.com/go-openapi/jsonreference v0.21.5 h1:6uCGVXU/aNF13AQNggxfysJ+5ZcU4nEAe+pJyVWRdiE=
@@ -222,6 +230,10 @@ go.etcd.io/bbolt v1.4.3 h1:dEadXpI6G79deX5prL3QRNP6JB8UxVkqo4UPnHaNXJo=
go.etcd.io/bbolt v1.4.3/go.mod h1:tKQlpPaYCVFctUIgFKFnAlvbmB3tpy1vkTnDWohtc0E=
go.uber.org/goleak v1.3.0 h1:2K3zAYmnTNqV73imy9J1T3WC+gmCePx2hEGkimedGto=
go.uber.org/goleak v1.3.0/go.mod h1:CoHD4mav9JJNrW/WLlf7HGZPjdw8EucARQHekz1X6bE=
+go.uber.org/multierr v1.11.0 h1:blXXJkSxSSfBVBlC76pxqeO+LN3aDfLQo+309xJstO0=
+go.uber.org/multierr v1.11.0/go.mod h1:20+QtiLqy0Nd6FdQB9TLXag12DsQkrbs3htMFfDN80Y=
+go.uber.org/zap v1.27.1 h1:08RqriUEv8+ArZRYSTXy1LeBScaMpVSTBhCeaZYfMYc=
+go.uber.org/zap v1.27.1/go.mod h1:GB2qFLM7cTU87MWRP2mPIjqfIDnGu+VIO4V/SdhGo2E=
go.yaml.in/yaml/v2 v2.4.4 h1:tuyd0P+2Ont/d6e2rl3be67goVK4R6deVxCUX5vyPaQ=
go.yaml.in/yaml/v2 v2.4.4/go.mod h1:gMZqIpDtDqOfM0uNfy0SkpRhvUryYH0Z6wdMYcacYXQ=
go.yaml.in/yaml/v3 v3.0.4 h1:tfq32ie2Jv2UxXFdLJdh3jXuOzWiL1fo0bu/FbuKpbc=
@@ -277,6 +289,8 @@ golang.org/x/tools v0.1.12/go.mod h1:hNGJHUnrk76NpqgfD5Aqm5Crs+Hm0VOH/i9J2+nxYbc
golang.org/x/tools v0.44.0 h1:UP4ajHPIcuMjT1GqzDWRlalUEoY+uzoZKnhOjbIPD2c=
golang.org/x/tools v0.44.0/go.mod h1:KA0AfVErSdxRZIsOVipbv3rQhVXTnlU6UhKxHd1seDI=
golang.org/x/xerrors v0.0.0-20190717185122-a985d3407aa7/go.mod h1:I/5z698sn9Ka8TeJc9MKroUUfqBBauWjQqLJ2OPfmY0=
+gomodules.xyz/jsonpatch/v2 v2.4.0 h1:Ci3iUJyx9UeRx7CeFN8ARgGbkESwJK+KB9lLcWxY/Zw=
+gomodules.xyz/jsonpatch/v2 v2.4.0/go.mod h1:AH3dM2RI6uoBZxn3LVrfvJ3E0/9dG4cSrbuBJT4moAY=
google.golang.org/protobuf v1.36.12-0.20260120151049-f2248ac996af h1:+5/Sw3GsDNlEmu7TfklWKPdQ0Ykja5VEmq2i817+jbI=
google.golang.org/protobuf v1.36.12-0.20260120151049-f2248ac996af/go.mod h1:HTf+CrKn2C3g5S8VImy6tdcUvCska2kB7j23XfzDpco=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
@@ -304,6 +318,8 @@ k8s.io/kube-openapi v0.0.0-20260317180543-43fb72c5454a h1:xCeOEAOoGYl2jnJoHkC3hk
k8s.io/kube-openapi v0.0.0-20260317180543-43fb72c5454a/go.mod h1:uGBT7iTA6c6MvqUvSXIaYZo9ukscABYi2btjhvgKGZ0=
k8s.io/utils v0.0.0-20260210185600-b8788abfbbc2 h1:AZYQSJemyQB5eRxqcPky+/7EdBj0xi3g0ZcxxJ7vbWU=
k8s.io/utils v0.0.0-20260210185600-b8788abfbbc2/go.mod h1:xDxuJ0whA3d0I4mf/C4ppKHxXynQ+fxnkmQH0vTHnuk=
+sigs.k8s.io/controller-runtime v0.21.0 h1:CYfjpEuicjUecRk+KAeyYh+ouUBn4llGyDYytIGcJS8=
+sigs.k8s.io/controller-runtime v0.21.0/go.mod h1:OSg14+F65eWqIu4DceX7k/+QRAbTTvxeQSNSOQpukWM=
sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730 h1:IpInykpT6ceI+QxKBbEflcR5EXP7sU1kvOlxwZh5txg=
sigs.k8s.io/json v0.0.0-20250730193827-2d320260d730/go.mod h1:mdzfpAEoE6DHQEN0uh9ZbOCuHbLK5wOm7dK4ctXE9Tg=
sigs.k8s.io/randfill v1.0.0 h1:JfjMILfT8A6RbawdsK2JXGBR5AQVfd+9TbzrlneTyrU=
diff --git a/task/executor/k8s/agent-task-executor-role.yaml b/task/executor/k8s/agent-task-executor-role.yaml
index c2bd793..bd8dccc 100644
--- a/task/executor/k8s/agent-task-executor-role.yaml
+++ b/task/executor/k8s/agent-task-executor-role.yaml
@@ -14,6 +14,14 @@ rules:
- list
- watch
- delete
+ - apiGroups:
+ - ""
+ resources:
+ - pods
+ verbs:
+ - get
+ - list
+ - watch
- apiGroups:
- agent.benjamin-borbe.de
resources:
diff --git a/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types.go b/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types.go
index d34890b..a7bcf83 100644
--- a/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types.go
+++ b/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types.go
@@ -19,6 +19,15 @@ import (
// var _ k8s.Type = Config{} ensures Config implements k8s.Type at compile time.
var _ libk8s.Type = Config{}
+// Defaults and validation floors for the zombie-detection knobs.
+// Floors prevent thrash (sweeper) and pathological short-deadline kills (timeout).
+const (
+ DefaultZombieSweeperIntervalSeconds int32 = 60
+ MinZombieSweeperIntervalSeconds int32 = 10
+ DefaultZombieJobTimeoutSeconds int32 = 1800
+ MinZombieJobTimeoutSeconds int32 = 30
+)
+
var taskTypePattern = regexp.MustCompile(`^[a-z0-9-]+$`)
// +genclient
@@ -68,6 +77,20 @@ type ConfigSpec struct {
PriorityClassName string `json:"priorityClassName,omitempty"`
// Trigger declares the per-agent phase and status conditions under which the executor spawns a Job.
Trigger *Trigger `json:"trigger,omitempty"`
+ // ZombieSweeperIntervalSeconds is how often the executor's deadline sweeper
+ // walks the TaskStore looking for zombie jobs. Optional; when nil, the executor
+ // uses DefaultZombieSweeperIntervalSeconds (60). Values below
+ // MinZombieSweeperIntervalSeconds (10) are rejected at admission to prevent
+ // sweeper thrash. Pointer-typed so "unset" is distinguishable from "0".
+ ZombieSweeperIntervalSeconds *int32 `json:"zombieSweeperIntervalSeconds,omitempty"`
+ // ZombieJobTimeoutSeconds is the deadline applied to every spawned Job (via
+ // Job.Spec.ActiveDeadlineSeconds) AND the elapsed-time threshold the sweeper
+ // uses when classifying zombies. Optional; when nil, the executor uses
+ // DefaultZombieJobTimeoutSeconds (1800 — 30 minutes). Values below
+ // MinZombieJobTimeoutSeconds (30) are rejected at admission to prevent
+ // pathological short-deadline kills. Pointer-typed so "unset" is
+ // distinguishable from "0".
+ ZombieJobTimeoutSeconds *int32 `json:"zombieJobTimeoutSeconds,omitempty"`
}
// AgentResources holds optional resource requests and limits for the agent container.
@@ -139,7 +162,9 @@ func (s ConfigSpec) Equal(o ConfigSpec) bool {
s.PriorityClassName == o.PriorityClassName &&
reflect.DeepEqual(s.Env, o.Env) &&
reflect.DeepEqual(s.Resources, o.Resources) &&
- reflect.DeepEqual(s.Trigger, o.Trigger)
+ reflect.DeepEqual(s.Trigger, o.Trigger) &&
+ reflect.DeepEqual(s.ZombieSweeperIntervalSeconds, o.ZombieSweeperIntervalSeconds) &&
+ reflect.DeepEqual(s.ZombieJobTimeoutSeconds, o.ZombieJobTimeoutSeconds)
}
// Validate validates the ConfigSpec fields.
@@ -169,7 +194,43 @@ func (s ConfigSpec) Validate(ctx context.Context) error {
if err := validateTaskTypeValue(ctx, s.TaskType); err != nil {
return err
}
- return validateTaskTypesList(ctx, s.TaskTypes)
+ if err := validateTaskTypesList(ctx, s.TaskTypes); err != nil {
+ return err
+ }
+ if err := validateZombieSweeperInterval(ctx, s.ZombieSweeperIntervalSeconds); err != nil {
+ return err
+ }
+ return validateZombieJobTimeout(ctx, s.ZombieJobTimeoutSeconds)
+}
+
+func validateZombieSweeperInterval(ctx context.Context, v *int32) error {
+ if v == nil {
+ return nil
+ }
+ if *v < MinZombieSweeperIntervalSeconds {
+ return errors.Wrapf(
+ ctx,
+ validation.Error,
+ "zombieSweeperIntervalSeconds invalid: must be >= %d",
+ MinZombieSweeperIntervalSeconds,
+ )
+ }
+ return nil
+}
+
+func validateZombieJobTimeout(ctx context.Context, v *int32) error {
+ if v == nil {
+ return nil
+ }
+ if *v < MinZombieJobTimeoutSeconds {
+ return errors.Wrapf(
+ ctx,
+ validation.Error,
+ "zombieJobTimeoutSeconds invalid: must be >= %d",
+ MinZombieJobTimeoutSeconds,
+ )
+ }
+ return nil
}
func validateTrigger(ctx context.Context, trigger *Trigger) error {
diff --git a/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types_test.go b/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types_test.go
index 3500d7a..a8a4b02 100644
--- a/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types_test.go
+++ b/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/types_test.go
@@ -595,6 +595,80 @@ var _ = Describe("ConfigSpec", func() {
Expect(a.Equal(b)).To(BeTrue())
})
})
+
+ Describe("Validate - zombie knobs", func() {
+ ptrInt32 := func(v int32) *int32 { return &v }
+
+ baseSpec := func() agentv1.ConfigSpec {
+ return agentv1.ConfigSpec{
+ Assignee: "agent",
+ Image: "img:latest",
+ Heartbeat: "1m",
+ TaskType: "claude",
+ }
+ }
+
+ It("accepts nil zombie fields", func() {
+ spec := baseSpec()
+ Expect(spec.Validate(ctx)).To(Succeed())
+ })
+
+ It("accepts valid zombie values at the floor", func() {
+ spec := baseSpec()
+ spec.ZombieSweeperIntervalSeconds = ptrInt32(10)
+ spec.ZombieJobTimeoutSeconds = ptrInt32(30)
+ Expect(spec.Validate(ctx)).To(Succeed())
+ })
+
+ It("rejects zombieSweeperIntervalSeconds below floor", func() {
+ spec := baseSpec()
+ spec.ZombieSweeperIntervalSeconds = ptrInt32(9)
+ err := spec.Validate(ctx)
+ Expect(err).To(HaveOccurred())
+ Expect(err.Error()).To(ContainSubstring("invalid: must be >= 10"))
+ })
+
+ It("rejects zombieJobTimeoutSeconds below floor", func() {
+ spec := baseSpec()
+ spec.ZombieJobTimeoutSeconds = ptrInt32(29)
+ err := spec.Validate(ctx)
+ Expect(err).To(HaveOccurred())
+ Expect(err.Error()).To(ContainSubstring("invalid: must be >= 30"))
+ })
+ })
+
+ Describe("Equal - zombie fields", func() {
+ ptrInt32 := func(v int32) *int32 { return &v }
+
+ It("equal when both zombie fields nil", func() {
+ a := agentv1.ConfigSpec{Assignee: "x", Image: "y", Heartbeat: "1m", TaskType: "t"}
+ b := agentv1.ConfigSpec{Assignee: "x", Image: "y", Heartbeat: "1m", TaskType: "t"}
+ Expect(a.Equal(b)).To(BeTrue())
+ })
+
+ It("equal when both have same non-nil zombie values", func() {
+ a := agentv1.ConfigSpec{Assignee: "x", Image: "y", Heartbeat: "1m", TaskType: "t",
+ ZombieJobTimeoutSeconds: ptrInt32(1800)}
+ b := agentv1.ConfigSpec{Assignee: "x", Image: "y", Heartbeat: "1m", TaskType: "t",
+ ZombieJobTimeoutSeconds: ptrInt32(1800)}
+ Expect(a.Equal(b)).To(BeTrue())
+ })
+
+ It("not equal when one zombie field nil and other non-nil", func() {
+ a := agentv1.ConfigSpec{Assignee: "x", Image: "y", Heartbeat: "1m", TaskType: "t"}
+ b := agentv1.ConfigSpec{Assignee: "x", Image: "y", Heartbeat: "1m", TaskType: "t",
+ ZombieJobTimeoutSeconds: ptrInt32(1800)}
+ Expect(a.Equal(b)).To(BeFalse())
+ })
+
+ It("not equal when zombie values differ", func() {
+ a := agentv1.ConfigSpec{Assignee: "x", Image: "y", Heartbeat: "1m", TaskType: "t",
+ ZombieJobTimeoutSeconds: ptrInt32(1800)}
+ b := agentv1.ConfigSpec{Assignee: "x", Image: "y", Heartbeat: "1m", TaskType: "t",
+ ZombieJobTimeoutSeconds: ptrInt32(900)}
+ Expect(a.Equal(b)).To(BeFalse())
+ })
+ })
})
var _ = Describe("JSON round-trip for taskType", func() {
diff --git a/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/zz_generated.deepcopy.go b/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/zz_generated.deepcopy.go
index 2e81859..6a005b0 100644
--- a/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/zz_generated.deepcopy.go
+++ b/task/executor/k8s/apis/agent.benjamin-borbe.de/v1/zz_generated.deepcopy.go
@@ -132,6 +132,16 @@ func (in *ConfigSpec) DeepCopyInto(out *ConfigSpec) {
*out = new(Trigger)
(*in).DeepCopyInto(*out)
}
+ if in.ZombieSweeperIntervalSeconds != nil {
+ in, out := &in.ZombieSweeperIntervalSeconds, &out.ZombieSweeperIntervalSeconds
+ *out = new(int32)
+ **out = **in
+ }
+ if in.ZombieJobTimeoutSeconds != nil {
+ in, out := &in.ZombieJobTimeoutSeconds, &out.ZombieJobTimeoutSeconds
+ *out = new(int32)
+ **out = **in
+ }
return
}
diff --git a/task/executor/main.go b/task/executor/main.go
index 29efe28..0eebd0a 100644
--- a/task/executor/main.go
+++ b/task/executor/main.go
@@ -51,6 +51,7 @@ type application struct {
HealthcheckCronExpression string ` arg:"healthcheck-cron-expression" env:"HEALTHCHECK_CRON_EXPRESSION" usage:"Cron expression for agent liveness health checks" default:"0 0 8 * * 1"`
}
+//nolint:funlen // Initialization sequence; wiring is linear with no branching.
func (a *application) Run(ctx context.Context, sentryClient libsentry.Client) error {
libmetrics.NewBuildInfoMetrics().SetBuildInfo(a.BuildGitVersion, a.BuildGitCommit, a.BuildDate)
glog.V(1).
@@ -95,6 +96,15 @@ func (a *application) Run(ctx context.Context, sentryClient libsentry.Client) er
taskStore := pkg.NewTaskStore()
jobWatcher := factory.CreateJobWatcher(kubeClient, a.Namespace, taskStore, resultPublisher)
+ zombieSweeper := factory.CreateZombieSweeper(
+ jobWatcher,
+ a.Namespace,
+ taskStore,
+ resultPublisher,
+ eventHandlerConfig,
+ currentDateTimeGetter,
+ )
+
healthcheckRunner := factory.CreateHealthcheckRunner(
eventHandlerConfig,
syncProducer,
@@ -126,6 +136,7 @@ func (a *application) Run(ctx context.Context, sentryClient libsentry.Client) er
consumer.Consume,
taskEventHandler.RunDeferredRespawnLoop,
jobWatcher.Run,
+ zombieSweeper.Run,
a.createHTTPServer(eventHandlerConfig, healthcheckRunner),
healthcheckCron.Run,
)
diff --git a/task/executor/mocks/job_watcher.go b/task/executor/mocks/job_watcher.go
index ee5ec28..5bbb8e2 100644
--- a/task/executor/mocks/job_watcher.go
+++ b/task/executor/mocks/job_watcher.go
@@ -7,6 +7,8 @@ import (
"github.com/bborbe/agent/task/executor/pkg"
v1 "k8s.io/api/batch/v1"
+ v1a "k8s.io/api/core/v1"
+ v1b "k8s.io/client-go/listers/core/v1"
)
type FakeJobWatcher struct {
@@ -16,6 +18,22 @@ type FakeJobWatcher struct {
arg1 context.Context
arg2 *v1.Job
}
+ HandlePodStub func(context.Context, *v1a.Pod)
+ handlePodMutex sync.RWMutex
+ handlePodArgsForCall []struct {
+ arg1 context.Context
+ arg2 *v1a.Pod
+ }
+ PodListerStub func() v1b.PodLister
+ podListerMutex sync.RWMutex
+ podListerArgsForCall []struct {
+ }
+ podListerReturns struct {
+ result1 v1b.PodLister
+ }
+ podListerReturnsOnCall map[int]struct {
+ result1 v1b.PodLister
+ }
RunStub func(context.Context) error
runMutex sync.RWMutex
runArgsForCall []struct {
@@ -64,6 +82,92 @@ func (fake *FakeJobWatcher) HandleJobArgsForCall(i int) (context.Context, *v1.Jo
return argsForCall.arg1, argsForCall.arg2
}
+func (fake *FakeJobWatcher) HandlePod(arg1 context.Context, arg2 *v1a.Pod) {
+ fake.handlePodMutex.Lock()
+ fake.handlePodArgsForCall = append(fake.handlePodArgsForCall, struct {
+ arg1 context.Context
+ arg2 *v1a.Pod
+ }{arg1, arg2})
+ stub := fake.HandlePodStub
+ fake.recordInvocation("HandlePod", []interface{}{arg1, arg2})
+ fake.handlePodMutex.Unlock()
+ if stub != nil {
+ fake.HandlePodStub(arg1, arg2)
+ }
+}
+
+func (fake *FakeJobWatcher) HandlePodCallCount() int {
+ fake.handlePodMutex.RLock()
+ defer fake.handlePodMutex.RUnlock()
+ return len(fake.handlePodArgsForCall)
+}
+
+func (fake *FakeJobWatcher) HandlePodCalls(stub func(context.Context, *v1a.Pod)) {
+ fake.handlePodMutex.Lock()
+ defer fake.handlePodMutex.Unlock()
+ fake.HandlePodStub = stub
+}
+
+func (fake *FakeJobWatcher) HandlePodArgsForCall(i int) (context.Context, *v1a.Pod) {
+ fake.handlePodMutex.RLock()
+ defer fake.handlePodMutex.RUnlock()
+ argsForCall := fake.handlePodArgsForCall[i]
+ return argsForCall.arg1, argsForCall.arg2
+}
+
+func (fake *FakeJobWatcher) PodLister() v1b.PodLister {
+ fake.podListerMutex.Lock()
+ ret, specificReturn := fake.podListerReturnsOnCall[len(fake.podListerArgsForCall)]
+ fake.podListerArgsForCall = append(fake.podListerArgsForCall, struct {
+ }{})
+ stub := fake.PodListerStub
+ fakeReturns := fake.podListerReturns
+ fake.recordInvocation("PodLister", []interface{}{})
+ fake.podListerMutex.Unlock()
+ if stub != nil {
+ return stub()
+ }
+ if specificReturn {
+ return ret.result1
+ }
+ return fakeReturns.result1
+}
+
+func (fake *FakeJobWatcher) PodListerCallCount() int {
+ fake.podListerMutex.RLock()
+ defer fake.podListerMutex.RUnlock()
+ return len(fake.podListerArgsForCall)
+}
+
+func (fake *FakeJobWatcher) PodListerCalls(stub func() v1b.PodLister) {
+ fake.podListerMutex.Lock()
+ defer fake.podListerMutex.Unlock()
+ fake.PodListerStub = stub
+}
+
+func (fake *FakeJobWatcher) PodListerReturns(result1 v1b.PodLister) {
+ fake.podListerMutex.Lock()
+ defer fake.podListerMutex.Unlock()
+ fake.PodListerStub = nil
+ fake.podListerReturns = struct {
+ result1 v1b.PodLister
+ }{result1}
+}
+
+func (fake *FakeJobWatcher) PodListerReturnsOnCall(i int, result1 v1b.PodLister) {
+ fake.podListerMutex.Lock()
+ defer fake.podListerMutex.Unlock()
+ fake.PodListerStub = nil
+ if fake.podListerReturnsOnCall == nil {
+ fake.podListerReturnsOnCall = make(map[int]struct {
+ result1 v1b.PodLister
+ })
+ }
+ fake.podListerReturnsOnCall[i] = struct {
+ result1 v1b.PodLister
+ }{result1}
+}
+
func (fake *FakeJobWatcher) Run(arg1 context.Context) error {
fake.runMutex.Lock()
ret, specificReturn := fake.runReturnsOnCall[len(fake.runArgsForCall)]
diff --git a/task/executor/mocks/zombie_sweeper.go b/task/executor/mocks/zombie_sweeper.go
new file mode 100644
index 0000000..5e7060e
--- /dev/null
+++ b/task/executor/mocks/zombie_sweeper.go
@@ -0,0 +1,182 @@
+// Code generated by counterfeiter. DO NOT EDIT.
+package mocks
+
+import (
+ "context"
+ "sync"
+
+ "github.com/bborbe/agent/task/executor/pkg"
+)
+
+type FakeZombieSweeper struct {
+ RunStub func(context.Context) error
+ runMutex sync.RWMutex
+ runArgsForCall []struct {
+ arg1 context.Context
+ }
+ runReturns struct {
+ result1 error
+ }
+ runReturnsOnCall map[int]struct {
+ result1 error
+ }
+ SweepOnceStub func(context.Context) error
+ sweepOnceMutex sync.RWMutex
+ sweepOnceArgsForCall []struct {
+ arg1 context.Context
+ }
+ sweepOnceReturns struct {
+ result1 error
+ }
+ sweepOnceReturnsOnCall map[int]struct {
+ result1 error
+ }
+ invocations map[string][][]interface{}
+ invocationsMutex sync.RWMutex
+}
+
+func (fake *FakeZombieSweeper) Run(arg1 context.Context) error {
+ fake.runMutex.Lock()
+ ret, specificReturn := fake.runReturnsOnCall[len(fake.runArgsForCall)]
+ fake.runArgsForCall = append(fake.runArgsForCall, struct {
+ arg1 context.Context
+ }{arg1})
+ stub := fake.RunStub
+ fakeReturns := fake.runReturns
+ fake.recordInvocation("Run", []interface{}{arg1})
+ fake.runMutex.Unlock()
+ if stub != nil {
+ return stub(arg1)
+ }
+ if specificReturn {
+ return ret.result1
+ }
+ return fakeReturns.result1
+}
+
+func (fake *FakeZombieSweeper) RunCallCount() int {
+ fake.runMutex.RLock()
+ defer fake.runMutex.RUnlock()
+ return len(fake.runArgsForCall)
+}
+
+func (fake *FakeZombieSweeper) RunCalls(stub func(context.Context) error) {
+ fake.runMutex.Lock()
+ defer fake.runMutex.Unlock()
+ fake.RunStub = stub
+}
+
+func (fake *FakeZombieSweeper) RunArgsForCall(i int) context.Context {
+ fake.runMutex.RLock()
+ defer fake.runMutex.RUnlock()
+ argsForCall := fake.runArgsForCall[i]
+ return argsForCall.arg1
+}
+
+func (fake *FakeZombieSweeper) RunReturns(result1 error) {
+ fake.runMutex.Lock()
+ defer fake.runMutex.Unlock()
+ fake.RunStub = nil
+ fake.runReturns = struct {
+ result1 error
+ }{result1}
+}
+
+func (fake *FakeZombieSweeper) RunReturnsOnCall(i int, result1 error) {
+ fake.runMutex.Lock()
+ defer fake.runMutex.Unlock()
+ fake.RunStub = nil
+ if fake.runReturnsOnCall == nil {
+ fake.runReturnsOnCall = make(map[int]struct {
+ result1 error
+ })
+ }
+ fake.runReturnsOnCall[i] = struct {
+ result1 error
+ }{result1}
+}
+
+func (fake *FakeZombieSweeper) SweepOnce(arg1 context.Context) error {
+ fake.sweepOnceMutex.Lock()
+ ret, specificReturn := fake.sweepOnceReturnsOnCall[len(fake.sweepOnceArgsForCall)]
+ fake.sweepOnceArgsForCall = append(fake.sweepOnceArgsForCall, struct {
+ arg1 context.Context
+ }{arg1})
+ stub := fake.SweepOnceStub
+ fakeReturns := fake.sweepOnceReturns
+ fake.recordInvocation("SweepOnce", []interface{}{arg1})
+ fake.sweepOnceMutex.Unlock()
+ if stub != nil {
+ return stub(arg1)
+ }
+ if specificReturn {
+ return ret.result1
+ }
+ return fakeReturns.result1
+}
+
+func (fake *FakeZombieSweeper) SweepOnceCallCount() int {
+ fake.sweepOnceMutex.RLock()
+ defer fake.sweepOnceMutex.RUnlock()
+ return len(fake.sweepOnceArgsForCall)
+}
+
+func (fake *FakeZombieSweeper) SweepOnceCalls(stub func(context.Context) error) {
+ fake.sweepOnceMutex.Lock()
+ defer fake.sweepOnceMutex.Unlock()
+ fake.SweepOnceStub = stub
+}
+
+func (fake *FakeZombieSweeper) SweepOnceArgsForCall(i int) context.Context {
+ fake.sweepOnceMutex.RLock()
+ defer fake.sweepOnceMutex.RUnlock()
+ argsForCall := fake.sweepOnceArgsForCall[i]
+ return argsForCall.arg1
+}
+
+func (fake *FakeZombieSweeper) SweepOnceReturns(result1 error) {
+ fake.sweepOnceMutex.Lock()
+ defer fake.sweepOnceMutex.Unlock()
+ fake.SweepOnceStub = nil
+ fake.sweepOnceReturns = struct {
+ result1 error
+ }{result1}
+}
+
+func (fake *FakeZombieSweeper) SweepOnceReturnsOnCall(i int, result1 error) {
+ fake.sweepOnceMutex.Lock()
+ defer fake.sweepOnceMutex.Unlock()
+ fake.SweepOnceStub = nil
+ if fake.sweepOnceReturnsOnCall == nil {
+ fake.sweepOnceReturnsOnCall = make(map[int]struct {
+ result1 error
+ })
+ }
+ fake.sweepOnceReturnsOnCall[i] = struct {
+ result1 error
+ }{result1}
+}
+
+func (fake *FakeZombieSweeper) Invocations() map[string][][]interface{} {
+ fake.invocationsMutex.RLock()
+ defer fake.invocationsMutex.RUnlock()
+ copiedInvocations := map[string][][]interface{}{}
+ for key, value := range fake.invocations {
+ copiedInvocations[key] = value
+ }
+ return copiedInvocations
+}
+
+func (fake *FakeZombieSweeper) recordInvocation(key string, args []interface{}) {
+ fake.invocationsMutex.Lock()
+ defer fake.invocationsMutex.Unlock()
+ if fake.invocations == nil {
+ fake.invocations = map[string][][]interface{}{}
+ }
+ if fake.invocations[key] == nil {
+ fake.invocations[key] = [][]interface{}{}
+ }
+ fake.invocations[key] = append(fake.invocations[key], args)
+}
+
+var _ pkg.ZombieSweeper = new(FakeZombieSweeper)
diff --git a/task/executor/pkg/agent_configuration.go b/task/executor/pkg/agent_configuration.go
index 5c603be..0b4bc93 100644
--- a/task/executor/pkg/agent_configuration.go
+++ b/task/executor/pkg/agent_configuration.go
@@ -43,6 +43,20 @@ type AgentConfiguration struct {
ImagePullSecret string
// Trigger declares the per-agent phase and status conditions under which the executor spawns a Job.
Trigger *agentv1.Trigger
+ // ZombieJobTimeoutSeconds mirrors ConfigSpec.ZombieJobTimeoutSeconds. The
+ // spawner stamps this value onto Job.Spec.ActiveDeadlineSeconds; the sweeper
+ // uses it as the elapsed-time threshold. nil means "use the default
+ // DefaultZombieJobTimeoutSeconds from the CRD types package".
+ ZombieJobTimeoutSeconds *int32
+}
+
+// EffectiveZombieJobTimeoutSeconds returns the effective deadline in seconds:
+// the configured value when non-nil, else agentv1.DefaultZombieJobTimeoutSeconds.
+func (a AgentConfiguration) EffectiveZombieJobTimeoutSeconds() int32 {
+ if a.ZombieJobTimeoutSeconds != nil {
+ return *a.ZombieJobTimeoutSeconds
+ }
+ return agentv1.DefaultZombieJobTimeoutSeconds
}
// AgentConfigurations is a list of agent configurations.
@@ -65,18 +79,19 @@ func (a AgentConfigurations) TaggedConfigurations(branch string) AgentConfigurat
result := make(AgentConfigurations, len(a))
for i, c := range a {
result[i] = AgentConfiguration{
- Assignee: c.Assignee,
- TaskType: c.TaskType,
- TaskTypes: append([]string(nil), c.TaskTypes...),
- Image: c.Image + ":" + branch,
- Env: c.Env,
- VolumeClaim: c.VolumeClaim,
- VolumeMountPath: c.VolumeMountPath,
- SecretName: c.SecretName,
- Resources: c.Resources.DeepCopy(),
- PriorityClassName: c.PriorityClassName,
- ImagePullSecret: c.ImagePullSecret,
- Trigger: c.Trigger,
+ Assignee: c.Assignee,
+ TaskType: c.TaskType,
+ TaskTypes: append([]string(nil), c.TaskTypes...),
+ Image: c.Image + ":" + branch,
+ Env: c.Env,
+ VolumeClaim: c.VolumeClaim,
+ VolumeMountPath: c.VolumeMountPath,
+ SecretName: c.SecretName,
+ Resources: c.Resources.DeepCopy(),
+ PriorityClassName: c.PriorityClassName,
+ ImagePullSecret: c.ImagePullSecret,
+ Trigger: c.Trigger,
+ ZombieJobTimeoutSeconds: c.ZombieJobTimeoutSeconds,
}
}
return result
diff --git a/task/executor/pkg/agent_configuration_test.go b/task/executor/pkg/agent_configuration_test.go
index a151e5d..c42b761 100644
--- a/task/executor/pkg/agent_configuration_test.go
+++ b/task/executor/pkg/agent_configuration_test.go
@@ -19,6 +19,22 @@ func TestPkg(t *testing.T) {
RunSpecs(t, "Pkg Suite")
}
+var _ = Describe("AgentConfiguration", func() {
+ Describe("EffectiveZombieJobTimeoutSeconds", func() {
+ ptrInt32 := func(v int32) *int32 { return &v }
+
+ It("returns default when ZombieJobTimeoutSeconds is nil", func() {
+ cfg := pkg.AgentConfiguration{}
+ Expect(cfg.EffectiveZombieJobTimeoutSeconds()).To(Equal(int32(1800)))
+ })
+
+ It("returns configured value when set", func() {
+ cfg := pkg.AgentConfiguration{ZombieJobTimeoutSeconds: ptrInt32(900)}
+ Expect(cfg.EffectiveZombieJobTimeoutSeconds()).To(Equal(int32(900)))
+ })
+ })
+})
+
var _ = Describe("AgentConfigurations", func() {
var configs pkg.AgentConfigurations
@@ -123,5 +139,14 @@ var _ = Describe("AgentConfigurations", func() {
result := configs.TaggedConfigurations("prod")
Expect(result[1].Resources).To(BeNil())
})
+
+ It("preserves ZombieJobTimeoutSeconds", func() {
+ ptr := int32(900)
+ configs[0].ZombieJobTimeoutSeconds = &ptr
+ result := configs.TaggedConfigurations("prod")
+ Expect(result[0].ZombieJobTimeoutSeconds).NotTo(BeNil())
+ Expect(*result[0].ZombieJobTimeoutSeconds).To(Equal(int32(900)))
+ Expect(result[1].ZombieJobTimeoutSeconds).To(BeNil())
+ })
})
})
diff --git a/task/executor/pkg/config_resolver.go b/task/executor/pkg/config_resolver.go
index 70985d8..50ecb9f 100644
--- a/task/executor/pkg/config_resolver.go
+++ b/task/executor/pkg/config_resolver.go
@@ -65,17 +65,18 @@ func (r *configResolver) Resolve(
func convert(obj agentv1.Config, branch string) AgentConfiguration {
return AgentConfiguration{
- Assignee: obj.Spec.Assignee,
- TaskType: obj.Spec.TaskType,
- TaskTypes: append([]string(nil), obj.Spec.TaskTypes...),
- Image: obj.Spec.Image + ":" + branch,
- Env: copyEnv(obj.Spec.Env),
- SecretName: obj.Spec.SecretName,
- VolumeClaim: obj.Spec.VolumeClaim,
- VolumeMountPath: obj.Spec.VolumeMountPath,
- Resources: obj.Spec.Resources.DeepCopy(),
- PriorityClassName: obj.Spec.PriorityClassName,
- Trigger: obj.Spec.Trigger,
+ Assignee: obj.Spec.Assignee,
+ TaskType: obj.Spec.TaskType,
+ TaskTypes: append([]string(nil), obj.Spec.TaskTypes...),
+ Image: obj.Spec.Image + ":" + branch,
+ Env: copyEnv(obj.Spec.Env),
+ SecretName: obj.Spec.SecretName,
+ VolumeClaim: obj.Spec.VolumeClaim,
+ VolumeMountPath: obj.Spec.VolumeMountPath,
+ Resources: obj.Spec.Resources.DeepCopy(),
+ PriorityClassName: obj.Spec.PriorityClassName,
+ Trigger: obj.Spec.Trigger,
+ ZombieJobTimeoutSeconds: obj.Spec.ZombieJobTimeoutSeconds,
}
}
diff --git a/task/executor/pkg/envtest/job_watcher_envtest_test.go b/task/executor/pkg/envtest/job_watcher_envtest_test.go
new file mode 100644
index 0000000..795e8db
--- /dev/null
+++ b/task/executor/pkg/envtest/job_watcher_envtest_test.go
@@ -0,0 +1,170 @@
+// Copyright (c) 2026 Benjamin Borbe All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+//go:build envtest
+
+package envtest_test
+
+import (
+ "context"
+ "os"
+ "testing"
+ "time"
+
+ libk8s "github.com/bborbe/k8s"
+ . "github.com/onsi/ginkgo/v2"
+ . "github.com/onsi/gomega"
+ corev1 "k8s.io/api/core/v1"
+ metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+ "k8s.io/client-go/kubernetes"
+ "k8s.io/client-go/rest"
+ "sigs.k8s.io/controller-runtime/pkg/envtest"
+
+ lib "github.com/bborbe/agent/lib"
+ mocks "github.com/bborbe/agent/task/executor/mocks"
+ pkg "github.com/bborbe/agent/task/executor/pkg"
+)
+
+// This suite uses controller-runtime (v0.21.0 in go.mod) only for its envtest
+// package, alongside k8s.io/client-go v0.36.1. The control-plane binaries are
+// pinned to Kubernetes 1.31.0 via ENVTEST_K8S_VERSION in the Makefile.
+
+func TestEnvtest(t *testing.T) {
+ RegisterFailHandler(Fail)
+ RunSpecs(t, "executor envtest suite")
+}
+
+var _ = BeforeSuite(func() {
+ if os.Getenv("KUBEBUILDER_ASSETS") == "" {
+ if os.Getenv("ENVTEST_REQUIRED") == "1" {
+ Fail(
+ "KUBEBUILDER_ASSETS not set but ENVTEST_REQUIRED=1; envtest binaries must be available under precommit",
+ )
+ }
+ Skip("KUBEBUILDER_ASSETS not set; run via `make test-envtest` or `make precommit`")
+ }
+})
+
+var _ = Describe("JobWatcher (envtest)", func() {
+ var (
+ testEnv *envtest.Environment
+ cfg *rest.Config
+ kubeClient kubernetes.Interface
+ ctx context.Context
+ cancel context.CancelFunc
+ )
+
+ BeforeEach(func() {
+ testEnv = &envtest.Environment{}
+ var err error
+ cfg, err = testEnv.Start()
+ Expect(err).NotTo(HaveOccurred())
+ kubeClient, err = kubernetes.NewForConfig(cfg)
+ Expect(err).NotTo(HaveOccurred())
+ ctx, cancel = context.WithCancel(context.Background())
+ })
+
+ AfterEach(func() {
+ cancel()
+ Expect(testEnv.Stop()).To(Succeed())
+ })
+
+ It("classifies ImagePullBackOff and publishes one failure within the bound", func() {
+ ns := "default"
+ taskID := lib.TaskIdentifier("envtest-task-1")
+ jobName := "envtest-job-1"
+ publisher := &mocks.FakeResultPublisher{}
+ store := pkg.NewTaskStore()
+ store.Store(taskID, lib.Task{
+ TaskIdentifier: taskID,
+ Frontmatter: lib.TaskFrontmatter{
+ "current_job": jobName,
+ "assignee": "envtest-agent",
+ },
+ })
+ watcher := pkg.NewJobWatcher(kubeClient, libk8s.Namespace(ns), store, publisher)
+
+ runErrCh := make(chan error, 1)
+ go func() { runErrCh <- watcher.Run(ctx) }()
+
+ // Create a Pod with the task-id label and a bogus image. envtest does
+ // not run a kubelet, so we inject the ImagePullBackOff status ourselves
+ // via the Status subresource; the informer sees the update the same way
+ // it would in a real cluster.
+ pod := &corev1.Pod{
+ ObjectMeta: metav1.ObjectMeta{
+ Name: "envtest-pod-1",
+ Namespace: ns,
+ Labels: map[string]string{
+ "agent.benjamin-borbe.de/task-id": string(taskID),
+ },
+ OwnerReferences: []metav1.OwnerReference{
+ {APIVersion: "batch/v1", Kind: "Job", Name: jobName, UID: "fake-job-uid"},
+ },
+ },
+ Spec: corev1.PodSpec{
+ RestartPolicy: corev1.RestartPolicyNever,
+ Containers: []corev1.Container{
+ {Name: "agent", Image: "docker.example.com/does-not-exist:envtest"},
+ },
+ },
+ }
+ _, err := kubeClient.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{})
+ Expect(
+ err,
+ ).NotTo(HaveOccurred(), "if Create returns 422, add required defaults; do not silently catch")
+
+ // Status subresource update flow — 4 steps to avoid ResourceVersion races
+ // and default-mutator overwrites:
+ // 1. Get the canonical Pod (fresh ResourceVersion).
+ // 2. Mutate Status on the fetched object.
+ // 3. UpdateStatus with the fetched object.
+ // 4. Get again to confirm the status survived.
+ // Step 1: Get
+ fetched, err := kubeClient.CoreV1().Pods(ns).Get(ctx, "envtest-pod-1", metav1.GetOptions{})
+ Expect(err).NotTo(HaveOccurred())
+ // Step 2: mutate Status on the freshly-fetched object
+ fetched.Status.Phase = corev1.PodPending
+ fetched.Status.ContainerStatuses = []corev1.ContainerStatus{
+ {
+ Name: "agent",
+ State: corev1.ContainerState{
+ Waiting: &corev1.ContainerStateWaiting{
+ Reason: "ImagePullBackOff",
+ Message: "Back-off pulling image",
+ },
+ },
+ },
+ }
+ // Step 3: UpdateStatus
+ _, err = kubeClient.CoreV1().Pods(ns).UpdateStatus(ctx, fetched, metav1.UpdateOptions{})
+ Expect(err).NotTo(HaveOccurred())
+ // Step 4: Get to confirm the status survived
+ confirmed, err := kubeClient.CoreV1().
+ Pods(ns).
+ Get(ctx, "envtest-pod-1", metav1.GetOptions{})
+ Expect(err).NotTo(HaveOccurred())
+ Expect(confirmed.Status.ContainerStatuses).To(HaveLen(1))
+ Expect(confirmed.Status.ContainerStatuses[0].State.Waiting).NotTo(BeNil())
+ Expect(
+ confirmed.Status.ContainerStatuses[0].State.Waiting.Reason,
+ ).To(Equal("ImagePullBackOff"))
+
+ // Acceptance bound: 2 * zombieSweeperIntervalSeconds = 2 * 60s = 120s.
+ // In practice the informer reacts in well under a second once the
+ // status update lands; we use a generous wait with polling to stay
+ // well inside the bound while keeping the test fast.
+ Eventually(publisher.PublishFailureCallCount, 30*time.Second, 100*time.Millisecond).
+ Should(Equal(1), "expected one PublishFailure call within bound")
+
+ // Confirm "exactly one" — Eventually passes at the FIRST observation of 1;
+ // Consistently verifies no second call lands over a short follow-up window.
+ Consistently(publisher.PublishFailureCallCount, 2*time.Second, 200*time.Millisecond).
+ Should(Equal(1), "expected exactly one PublishFailure call (no duplicates)")
+
+ _, _, gotJobName, gotReason := publisher.PublishFailureArgsForCall(0)
+ Expect(gotJobName).To(Equal(jobName))
+ Expect(gotReason).To(Equal(string(pkg.ZombieReasonImagePullBackOff)))
+ })
+})
diff --git a/task/executor/pkg/factory/factory.go b/task/executor/pkg/factory/factory.go
index d4e3a05..2aca7d2 100644
--- a/task/executor/pkg/factory/factory.go
+++ b/task/executor/pkg/factory/factory.go
@@ -37,6 +37,32 @@ func CreateJobWatcher(
return pkg.NewJobWatcher(kubeClient, namespace, taskStore, publisher)
}
+// CreateZombieSweeper creates a deadline sweeper that classifies stuck tasks as
+// zombies and emits failure events via the publisher. Interval and per-task
+// deadline are sourced from the AgentConfig CRD knobs (see ConfigSpec). The
+// sweeper receives the JobWatcher (not its lister) because the lister is
+// populated only after JobWatcher.Run completes its informer cache sync; passing
+// the watcher lets the sweeper resolve the lister lazily on each tick and skip
+// the tick if cache sync has not yet happened (avoids a nil-deref panic at the
+// first tick when service.Run starts all components concurrently).
+func CreateZombieSweeper(
+ jobWatcher pkg.JobWatcher,
+ namespace libk8s.Namespace,
+ taskStore *pkg.TaskStore,
+ publisher pkg.ResultPublisher,
+ configProvider pkg.EventHandlerConfig,
+ currentDateTime libtime.CurrentDateTimeGetter,
+) pkg.ZombieSweeper {
+ return pkg.NewZombieSweeper(
+ jobWatcher,
+ namespace,
+ taskStore,
+ publisher,
+ configProvider,
+ currentDateTime,
+ )
+}
+
// CreateK8sConnector returns a K8sConnector wired to the given rest.Config.
func CreateK8sConnector(config *rest.Config) pkg.K8sConnector {
return pkg.NewK8sConnector(config, pkg.DefaultCRDClientBuilder)
diff --git a/task/executor/pkg/job_watcher.go b/task/executor/pkg/job_watcher.go
index 849511e..b944e19 100644
--- a/task/executor/pkg/job_watcher.go
+++ b/task/executor/pkg/job_watcher.go
@@ -6,6 +6,7 @@ package pkg
import (
"context"
+ "sync/atomic"
"time"
"github.com/bborbe/errors"
@@ -16,6 +17,7 @@ import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
k8sinformers "k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
+ corev1listers "k8s.io/client-go/listers/core/v1"
"k8s.io/client-go/tools/cache"
lib "github.com/bborbe/agent/lib"
@@ -23,14 +25,22 @@ import (
//counterfeiter:generate -o ../mocks/job_watcher.go --fake-name FakeJobWatcher . JobWatcher
-// JobWatcher watches batch/v1 Jobs in the executor's namespace and publishes
-// synthetic failure results for terminal-state Jobs that belong to spawned tasks.
+// JobWatcher watches batch/v1 Jobs and their Pods in the executor's namespace and
+// publishes synthetic failure results for terminal-state objects that belong to
+// spawned tasks.
type JobWatcher interface {
- // Run starts the Job informer and blocks until ctx is cancelled.
+ // Run starts the Job and Pod informers and blocks until ctx is cancelled.
Run(ctx context.Context) error
// HandleJob processes a single Job (invoked by the informer event handlers
// and by unit tests directly, avoiding the need for a fake informer).
HandleJob(ctx context.Context, job *batchv1.Job)
+ // HandlePod processes a single Pod (invoked by the Pod informer event handler
+ // and by unit tests directly).
+ HandlePod(ctx context.Context, pod *corev1.Pod)
+ // PodLister returns the Pod lister backed by the shared informer cache, for
+ // use by the deadline sweeper. It returns nil until Run has completed the
+ // initial cache sync; a non-nil lister is safe for concurrent read access.
+ PodLister() corev1listers.PodLister
}
// NewJobWatcher creates a JobWatcher.
@@ -53,6 +63,7 @@ type jobWatcher struct {
namespace libk8s.Namespace
taskStore *TaskStore
publisher ResultPublisher
+ podLister atomic.Pointer[corev1listers.PodLister]
}
func (w *jobWatcher) Run(ctx context.Context) error {
@@ -86,15 +97,46 @@ func (w *jobWatcher) Run(ctx context.Context) error {
return errors.Wrapf(ctx, err, "add job informer event handler")
}
+ podInformer := factory.Core().V1().Pods().Informer()
+ _, err = podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
+ AddFunc: func(obj interface{}) {
+ pod, ok := obj.(*corev1.Pod)
+ if !ok {
+ return
+ }
+ w.HandlePod(ctx, pod)
+ },
+ UpdateFunc: func(_, newObj interface{}) {
+ pod, ok := newObj.(*corev1.Pod)
+ if !ok {
+ return
+ }
+ w.HandlePod(ctx, pod)
+ },
+ })
+ if err != nil {
+ return errors.Wrapf(ctx, err, "add pod informer event handler")
+ }
+
factory.Start(ctx.Done())
- if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
- return errors.Errorf(ctx, "timed out waiting for job informer cache sync")
+ if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced, podInformer.HasSynced) {
+ return errors.Errorf(ctx, "timed out waiting for job/pod informer cache sync")
}
- glog.V(2).Infof("job informer started in namespace %s", w.namespace)
+ lister := factory.Core().V1().Pods().Lister()
+ w.podLister.Store(&lister)
+ glog.V(2).Infof("job and pod informer started in namespace %s", w.namespace)
<-ctx.Done()
return nil
}
+func (w *jobWatcher) PodLister() corev1listers.PodLister {
+ lister := w.podLister.Load()
+ if lister == nil {
+ return nil
+ }
+ return *lister
+}
+
func (w *jobWatcher) HandleJob(ctx context.Context, job *batchv1.Job) {
taskIDStr, ok := job.Labels["agent.benjamin-borbe.de/task-id"]
if !ok || taskIDStr == "" {
@@ -103,7 +145,7 @@ func (w *jobWatcher) HandleJob(ctx context.Context, job *batchv1.Job) {
taskID := lib.TaskIdentifier(taskIDStr)
if isJobFailed(job) {
- reason := jobFailureReason(job)
+ reason := JobFailureReason(job)
glog.V(2).Infof("job %s/%s failed (task %s): %s", job.Namespace, job.Name, taskID, reason)
w.handleTerminal(ctx, taskID, job, reason, true)
return
@@ -129,7 +171,7 @@ func (w *jobWatcher) handleTerminal(
ctx context.Context,
taskID lib.TaskIdentifier,
job *batchv1.Job,
- reason string,
+ reason ZombieReason,
alwaysPublish bool,
) {
task, ok := w.taskStore.Load(taskID)
@@ -145,9 +187,9 @@ func (w *jobWatcher) publishSyntheticFailure(
taskID lib.TaskIdentifier,
task lib.Task,
job *batchv1.Job,
- reason string,
+ reason ZombieReason,
) {
- if err := w.publisher.PublishFailure(ctx, task, job.Name, reason); err != nil {
+ if err := w.publisher.PublishFailure(ctx, task, job.Name, reason.String()); err != nil {
glog.Errorf("publish synthetic failure for task %s (job %s): %v", taskID, job.Name, err)
} else {
glog.V(2).Infof("published synthetic failure for task %s (job %s)", taskID, job.Name)
@@ -193,11 +235,118 @@ func isJobSucceeded(job *batchv1.Job) bool {
return false
}
-func jobFailureReason(job *batchv1.Job) string {
+// JobFailureReason maps a failed Job's conditions to a ZombieReason. Returns
+// ZombieReasonDeadlineExceeded when any Failed condition has Reason
+// "DeadlineExceeded" or "BackoffLimitExceeded" (the Job controller failed the
+// Job for running past activeDeadlineSeconds or exhausting backoffLimit).
+// Returns ZombieReasonPodCrashNoStdout for any other Failed condition (the pod
+// terminated non-zero and no AgentResult was observed; HandleJob only calls
+// this after the Job is marked Failed, so absence of an AgentResult is
+// implicit at this point).
+func JobFailureReason(job *batchv1.Job) ZombieReason {
for _, c := range job.Status.Conditions {
- if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue && c.Message != "" {
- return c.Message
+ if c.Type == batchv1.JobFailed && c.Status == corev1.ConditionTrue {
+ switch c.Reason {
+ case "DeadlineExceeded", "BackoffLimitExceeded":
+ return ZombieReasonDeadlineExceeded
+ }
+ }
+ }
+ return ZombieReasonPodCrashNoStdout
+}
+
+// HandlePod classifies a Pod and, when it is in a recognized failure state,
+// publishes a zombie failure event for the owning Job. It does not delete the
+// task from the TaskStore; the Job-condition path or the deadline sweeper
+// performs the final delete when terminal state is observed.
+func (w *jobWatcher) HandlePod(ctx context.Context, pod *corev1.Pod) {
+ taskIDStr, ok := pod.Labels["agent.benjamin-borbe.de/task-id"]
+ if !ok || taskIDStr == "" {
+ return
+ }
+ taskID := lib.TaskIdentifier(taskIDStr)
+
+ reason := classifyPodFailure(pod)
+ if reason == "" {
+ return
+ }
+
+ task, ok := w.taskStore.Load(taskID)
+ if !ok {
+ glog.V(3).Infof(
+ "pod %s/%s (task %s) in %s state but task not in store; sweeper will handle if still in flight",
+ pod.Namespace, pod.Name, taskID, reason,
+ )
+ return
+ }
+
+ jobName := ownerJobName(pod)
+ if jobName == "" {
+ glog.V(2).Infof(
+ "pod %s/%s (task %s) in %s state but has no Job ownerRef; ignoring",
+ pod.Namespace, pod.Name, taskID, reason,
+ )
+ return
+ }
+
+ if err := w.publisher.PublishFailure(ctx, task, jobName, reason.String()); err != nil {
+ glog.Errorf(
+ "publish pod-state failure for task %s (pod %s reason %s): %v",
+ taskID, pod.Name, reason, err,
+ )
+ return
+ }
+ glog.V(2).Infof(
+ "published pod-state failure for task %s (pod %s reason %s)",
+ taskID, pod.Name, reason,
+ )
+ // Do NOT call w.taskStore.Delete here. The pod may transition again (e.g. evicted then
+ // rescheduled). The Job-condition path or the deadline sweeper performs the final delete
+ // when terminal state is observed. Dedupe in PublishFailure (prompt 1) prevents
+ // double-publish for the same job name.
+}
+
+// classifyPodFailure returns a non-empty ZombieReason when the Pod is in a
+// failure state we recognize (image-pull back-off, crash loop, eviction, or a
+// failed phase with a non-zero exit code). Returns "" for healthy Pods, Pods
+// that are merely Pending, and any state the informer path should not act on.
+// pod_not_scheduled is deliberately NOT returned here: it requires a grace
+// window the informer cannot evaluate (a freshly created Pod is always briefly
+// Pending before scheduling). The deadline sweeper owns that classification.
+func classifyPodFailure(pod *corev1.Pod) ZombieReason {
+ for _, cs := range pod.Status.ContainerStatuses {
+ if cs.State.Waiting != nil {
+ switch cs.State.Waiting.Reason {
+ case "ImagePullBackOff", "ErrImagePull":
+ return ZombieReasonImagePullBackOff
+ case "CrashLoopBackOff":
+ // With BackoffLimit=0 in the spawner, crash-looping pods never
+ // reach PodFailed phase, so this branch is the only signal that
+ // classifies them before activeDeadlineSeconds fires.
+ return ZombieReasonPodCrashNoStdout
+ }
+ }
+ }
+ if pod.Status.Reason == "Evicted" {
+ return ZombieReasonPodEvicted
+ }
+ if pod.Status.Phase == corev1.PodFailed {
+ for _, cs := range pod.Status.ContainerStatuses {
+ if cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
+ return ZombieReasonPodCrashNoStdout
+ }
+ }
+ }
+ return ""
+}
+
+// ownerJobName returns the name of the Job that owns the Pod, or "" when no
+// Job ownerRef is present.
+func ownerJobName(pod *corev1.Pod) string {
+ for _, ref := range pod.OwnerReferences {
+ if ref.Kind == "Job" {
+ return ref.Name
}
}
- return "unknown failure reason"
+ return ""
}
diff --git a/task/executor/pkg/job_watcher_test.go b/task/executor/pkg/job_watcher_test.go
index 83d57db..503b154 100644
--- a/task/executor/pkg/job_watcher_test.go
+++ b/task/executor/pkg/job_watcher_test.go
@@ -95,7 +95,7 @@ var _ = Describe("JobWatcher", func() {
_, calledTask, calledJobName, calledReason := fakePublisher.PublishFailureArgsForCall(0)
Expect(string(calledTask.TaskIdentifier)).To(Equal(string(testTaskID)))
Expect(calledJobName).To(Equal("job-1"))
- Expect(calledReason).To(ContainSubstring("OOMKilled"))
+ Expect(calledReason).To(Equal("pod_crash_no_stdout"))
_, err = fakeKubeClient.BatchV1().Jobs("test-ns").Get(ctx, "job-1", metav1.GetOptions{})
Expect(err).To(BeNil())
@@ -197,7 +197,7 @@ var _ = Describe("JobWatcher", func() {
Expect(err).To(BeNil())
})
- It("uses unknown failure reason when condition has no message", func() {
+ It("uses pod_crash_no_stdout when condition has no message", func() {
job := makeJob("job-7", string(testTaskID), batchv1.JobCondition{
Type: batchv1.JobFailed,
Status: corev1.ConditionTrue,
@@ -212,7 +212,301 @@ var _ = Describe("JobWatcher", func() {
Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
_, _, _, calledReason := fakePublisher.PublishFailureArgsForCall(0)
- Expect(calledReason).To(ContainSubstring("unknown failure reason"))
+ Expect(calledReason).To(Equal("pod_crash_no_stdout"))
+ })
+ })
+
+ Describe("HandleJob with DeadlineExceeded", func() {
+ It("maps DeadlineExceeded job condition to deadline_exceeded", func() {
+ job := makeJob("job-deadline", string(testTaskID), batchv1.JobCondition{
+ Type: batchv1.JobFailed,
+ Status: corev1.ConditionTrue,
+ Reason: "DeadlineExceeded",
+ Message: "Job was active longer than specified deadline",
+ })
+ _, err := fakeKubeClient.BatchV1().
+ Jobs("test-ns").
+ Create(ctx, job, metav1.CreateOptions{})
+ Expect(err).To(BeNil())
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandleJob(ctx, job)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, _, _, calledReason := fakePublisher.PublishFailureArgsForCall(0)
+ Expect(calledReason).To(Equal("deadline_exceeded"))
+ })
+
+ It("maps BackoffLimitExceeded job condition to deadline_exceeded", func() {
+ job := makeJob("job-backoff", string(testTaskID), batchv1.JobCondition{
+ Type: batchv1.JobFailed,
+ Status: corev1.ConditionTrue,
+ Reason: "BackoffLimitExceeded",
+ Message: "Job failed due to backoff limit",
+ })
+ _, err := fakeKubeClient.BatchV1().
+ Jobs("test-ns").
+ Create(ctx, job, metav1.CreateOptions{})
+ Expect(err).To(BeNil())
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandleJob(ctx, job)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, _, _, calledReason := fakePublisher.PublishFailureArgsForCall(0)
+ Expect(calledReason).To(Equal("deadline_exceeded"))
+ })
+ })
+
+ Describe("JobFailureReason mapping", func() {
+ It("returns deadline_exceeded for DeadlineExceeded", func() {
+ job := makeJob("j", string(testTaskID), batchv1.JobCondition{
+ Type: batchv1.JobFailed,
+ Status: corev1.ConditionTrue,
+ Reason: "DeadlineExceeded",
+ })
+ Expect(pkg.JobFailureReason(job)).To(Equal(pkg.ZombieReasonDeadlineExceeded))
+ })
+
+ It("returns deadline_exceeded for BackoffLimitExceeded", func() {
+ job := makeJob("j", string(testTaskID), batchv1.JobCondition{
+ Type: batchv1.JobFailed,
+ Status: corev1.ConditionTrue,
+ Reason: "BackoffLimitExceeded",
+ })
+ Expect(pkg.JobFailureReason(job)).To(Equal(pkg.ZombieReasonDeadlineExceeded))
+ })
+
+ It("returns pod_crash_no_stdout for other Failed condition reasons", func() {
+ job := makeJob("j", string(testTaskID), batchv1.JobCondition{
+ Type: batchv1.JobFailed,
+ Status: corev1.ConditionTrue,
+ Reason: "",
+ })
+ Expect(pkg.JobFailureReason(job)).To(Equal(pkg.ZombieReasonPodCrashNoStdout))
+ })
+ })
+
+ Describe("HandlePod", func() {
+ makePod := func(name string, taskID string, phase corev1.PodPhase, containerStatuses []corev1.ContainerStatus, ownerRefs []metav1.OwnerReference, podStatusReason string) *corev1.Pod {
+ labels := map[string]string{}
+ if taskID != "" {
+ labels["agent.benjamin-borbe.de/task-id"] = taskID
+ }
+ pod := &corev1.Pod{
+ ObjectMeta: metav1.ObjectMeta{
+ Name: name,
+ Namespace: "test-ns",
+ Labels: labels,
+ OwnerReferences: ownerRefs,
+ },
+ Status: corev1.PodStatus{
+ Phase: phase,
+ },
+ }
+ if len(containerStatuses) > 0 {
+ pod.Status.ContainerStatuses = containerStatuses
+ }
+ if podStatusReason != "" {
+ pod.Status.Reason = podStatusReason
+ }
+ return pod
+ }
+
+ makeJobOwnerRef := func(name string) []metav1.OwnerReference {
+ return []metav1.OwnerReference{
+ {
+ Kind: "Job",
+ Name: name,
+ },
+ }
+ }
+
+ It("publishes failure for ImagePullBackOff container", func() {
+ pod := makePod(
+ "pod-imgpull",
+ string(testTaskID),
+ corev1.PodPending,
+ []corev1.ContainerStatus{
+ {
+ State: corev1.ContainerState{
+ Waiting: &corev1.ContainerStateWaiting{
+ Reason: "ImagePullBackOff",
+ },
+ },
+ },
+ },
+ makeJobOwnerRef("my-job"),
+ "",
+ )
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandlePod(ctx, pod)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, _, calledJobName, calledReason := fakePublisher.PublishFailureArgsForCall(0)
+ Expect(calledJobName).To(Equal("my-job"))
+ Expect(calledReason).To(Equal("image_pull_backoff"))
+ })
+
+ It("publishes failure for ErrImagePull container", func() {
+ pod := makePod(
+ "pod-errimg",
+ string(testTaskID),
+ corev1.PodPending,
+ []corev1.ContainerStatus{
+ {
+ State: corev1.ContainerState{
+ Waiting: &corev1.ContainerStateWaiting{
+ Reason: "ErrImagePull",
+ },
+ },
+ },
+ },
+ makeJobOwnerRef("my-job"),
+ "",
+ )
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandlePod(ctx, pod)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, _, _, calledReason := fakePublisher.PublishFailureArgsForCall(0)
+ Expect(calledReason).To(Equal("image_pull_backoff"))
+ })
+
+ It("publishes failure for CrashLoopBackOff container", func() {
+ pod := makePod(
+ "pod-crashloop",
+ string(testTaskID),
+ corev1.PodPending,
+ []corev1.ContainerStatus{
+ {
+ State: corev1.ContainerState{
+ Waiting: &corev1.ContainerStateWaiting{
+ Reason: "CrashLoopBackOff",
+ },
+ },
+ },
+ },
+ makeJobOwnerRef("my-job"),
+ "",
+ )
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandlePod(ctx, pod)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, _, calledJobName, calledReason := fakePublisher.PublishFailureArgsForCall(0)
+ Expect(calledJobName).To(Equal("my-job"))
+ Expect(calledReason).To(Equal("pod_crash_no_stdout"))
+ })
+
+ It("publishes failure for Evicted pod", func() {
+ pod := makePod(
+ "pod-evicted",
+ string(testTaskID),
+ corev1.PodPending,
+ nil,
+ makeJobOwnerRef("my-job"),
+ "Evicted",
+ )
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandlePod(ctx, pod)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, _, _, calledReason := fakePublisher.PublishFailureArgsForCall(0)
+ Expect(calledReason).To(Equal("pod_evicted"))
+ })
+
+ It("publishes failure for PodFailed with non-zero exit code", func() {
+ pod := makePod(
+ "pod-crash",
+ string(testTaskID),
+ corev1.PodFailed,
+ []corev1.ContainerStatus{
+ {
+ State: corev1.ContainerState{
+ Terminated: &corev1.ContainerStateTerminated{
+ ExitCode: 137,
+ },
+ },
+ },
+ },
+ makeJobOwnerRef("my-job"),
+ "",
+ )
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandlePod(ctx, pod)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, _, _, calledReason := fakePublisher.PublishFailureArgsForCall(0)
+ Expect(calledReason).To(Equal("pod_crash_no_stdout"))
+ })
+
+ It("does NOT publish failure for healthy Running pod", func() {
+ pod := makePod(
+ "pod-running",
+ string(testTaskID),
+ corev1.PodRunning,
+ nil,
+ makeJobOwnerRef("my-job"),
+ "",
+ )
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandlePod(ctx, pod)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
+ })
+
+ It("does NOT publish failure when task is not in store", func() {
+ pod := makePod(
+ "pod-imgpull",
+ string(testTaskID),
+ corev1.PodPending,
+ []corev1.ContainerStatus{
+ {
+ State: corev1.ContainerState{
+ Waiting: &corev1.ContainerStateWaiting{
+ Reason: "ImagePullBackOff",
+ },
+ },
+ },
+ },
+ makeJobOwnerRef("my-job"),
+ "",
+ )
+
+ watcher.HandlePod(ctx, pod)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
+ })
+
+ It("does NOT publish failure when pod has no Job ownerRef", func() {
+ pod := makePod(
+ "pod-noowner",
+ string(testTaskID),
+ corev1.PodPending,
+ []corev1.ContainerStatus{
+ {
+ State: corev1.ContainerState{
+ Waiting: &corev1.ContainerStateWaiting{
+ Reason: "ImagePullBackOff",
+ },
+ },
+ },
+ },
+ nil,
+ "",
+ )
+ taskStore.Store(testTaskID, testTask)
+
+ watcher.HandlePod(ctx, pod)
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
})
})
})
diff --git a/task/executor/pkg/result_publisher.go b/task/executor/pkg/result_publisher.go
index 5d820a2..8022ca1 100644
--- a/task/executor/pkg/result_publisher.go
+++ b/task/executor/pkg/result_publisher.go
@@ -7,6 +7,7 @@ package pkg
import (
"context"
"fmt"
+ "sync"
"time"
"github.com/bborbe/cqrs/base"
@@ -16,11 +17,17 @@ import (
libkafka "github.com/bborbe/kafka"
"github.com/bborbe/log"
libtime "github.com/bborbe/time"
+ "github.com/golang/glog"
lib "github.com/bborbe/agent/lib"
taskcmd "github.com/bborbe/agent/lib/command/task"
)
+const (
+ dedupeCapacity = 1024
+ dedupeTTL = 3600 * time.Second
+)
+
//counterfeiter:generate -o ../mocks/result_publisher.go --fake-name FakeResultPublisher . ResultPublisher
// ResultPublisher publishes agent-task-v1-request commands to Kafka so the
@@ -29,15 +36,20 @@ type ResultPublisher interface {
// PublishSpawnNotification publishes current_job, job_started_at, and
// spawn_notification without touching any other frontmatter keys.
PublishSpawnNotification(ctx context.Context, task lib.Task, jobName string) error
- // PublishFailure publishes a partial frontmatter update setting status, phase,
- // and current_job. Body content is not mutated by this publisher.
+ // PublishFailure publishes a zombie failure: clears current_job, then bumps
+ // trigger_count by 1 via a separate IncrementFrontmatterCommand. Leaves
+ // phase, status, and assignee untouched so the existing trigger_count retry
+ // cap (applyTriggerCap in task/controller/pkg/result/result_writer.go) handles
+ // eventual operator-inbox escalation. Deduplicated per job name via a TTL'd,
+ // capacity-bounded cache; repeat calls for a recorded job are no-ops.
PublishFailure(ctx context.Context, task lib.Task, jobName string, reason string) error
// PublishIncrementTriggerCount sends an IncrementFrontmatterCommand that atomically
// increments trigger_count by 1. Must complete before SpawnJob is called.
PublishIncrementTriggerCount(ctx context.Context, task lib.Task) error
// PublishTypeMismatchFailure publishes a synthetic failure when the task's task_type
- // is not in the agent's effective type set. Sets phase=ai_review and clears assignee
- // so the task surfaces in the operator inbox. Does not bump trigger_count or retry_count.
+ // is not in the agent's effective type set. Clears assignee and current_job so the
+ // task surfaces in the operator inbox via assignee=="" filter. Does not bump
+ // trigger_count or retry_count.
PublishTypeMismatchFailure(ctx context.Context, task lib.Task, reason string) error
// PublishRaw publishes a raw payload for testing error paths.
PublishRaw(ctx context.Context, operation base.CommandOperation, payload interface{}) error
@@ -56,13 +68,68 @@ func NewResultPublisher(
log.DefaultSamplerFactory,
),
currentDateTime: currentDateTime,
+ dedupe: newDedupe(dedupeCapacity, currentDateTime),
+ }
+}
+
+// ttlDedupe implements a minimal TTL'd, capacity-bounded cache (FIFO eviction
+// by first insertion) guarded by an RWMutex for publish-layer dedupe. Eviction
+// order is tracked in a separate []string so the map never stores slice indices.
+type ttlDedupe struct {
+ mu sync.RWMutex
+ capacity int
+ ttl time.Duration
+ order []string // insertion order; index 0 is oldest
+ seen map[string]time.Time // jobName -> insertion ts; existence = "in dedupe window"
+ now libtime.CurrentDateTimeGetter
+}
+
+func newDedupe(capacity int, now libtime.CurrentDateTimeGetter) *ttlDedupe {
+ return &ttlDedupe{
+ capacity: capacity,
+ ttl: dedupeTTL,
+ order: make([]string, 0, capacity),
+ seen: make(map[string]time.Time, capacity),
+ now: now,
+ }
+}
+
+// checkDedupe returns true if a non-expired entry exists for jobName.
+// No mutation occurs.
+func (d *ttlDedupe) checkDedupe(jobName string) bool {
+ d.mu.RLock()
+ defer d.mu.RUnlock()
+ ts, ok := d.seen[jobName]
+ if !ok {
+ return false
}
+ return d.now.Now().Time().Sub(ts) < d.ttl
+}
+
+// recordDedupe inserts jobName with the current timestamp, evicting the oldest
+// entry if at capacity; an existing entry only has its timestamp refreshed.
+func (d *ttlDedupe) recordDedupe(jobName string) {
+ d.mu.Lock()
+ defer d.mu.Unlock()
+ now := d.now.Now().Time()
+ if _, ok := d.seen[jobName]; ok {
+ d.seen[jobName] = now // refresh ts
+ return
+ }
+ if len(d.order) >= d.capacity {
+ oldest := d.order[0]
+ d.order = d.order[1:]
+ delete(d.seen, oldest)
+ }
+ d.order = append(d.order, jobName)
+ d.seen[jobName] = now
}
// resultPublisher implements ResultPublisher by sending CQRS command objects to Kafka.
type resultPublisher struct {
commandObjectSender cdb.CommandObjectSender
currentDateTime libtime.CurrentDateTimeGetter
+ dedupe *ttlDedupe
}
func (p *resultPublisher) PublishSpawnNotification(
@@ -87,6 +154,11 @@ func (p *resultPublisher) PublishFailure(
jobName string,
reason string,
) error {
+ if p.dedupe.checkDedupe(jobName) {
+ glog.V(2).Infof("event=zombie_dedupe job=%s task=%s", jobName, task.TaskIdentifier)
+ return nil
+ }
+
now := p.currentDateTime.Now().UTC().Format(time.RFC3339)
section := fmt.Sprintf(
"## Failure\n\n- **Timestamp:** %s\n- **Job:** %s\n- **Reason:** %s\n",
@@ -94,11 +166,9 @@ func (p *resultPublisher) PublishFailure(
jobName,
reason,
)
- cmd := taskcmd.UpdateFrontmatterCommand{
+ updateCmd := taskcmd.UpdateFrontmatterCommand{
TaskIdentifier: task.TaskIdentifier,
Updates: lib.TaskFrontmatter{
- "status": "in_progress",
- "phase": "human_review",
"current_job": "",
},
Body: &taskcmd.BodySection{
@@ -106,7 +176,37 @@ func (p *resultPublisher) PublishFailure(
Section: section,
},
}
- return p.publishRaw(ctx, taskcmd.UpdateFrontmatterCommandOperation, cmd)
+ if err := p.publishRaw(ctx, taskcmd.UpdateFrontmatterCommandOperation, updateCmd); err != nil {
+ return errors.Wrapf(
+ ctx,
+ err,
+ "publish zombie failure update for task %s",
+ task.TaskIdentifier,
+ )
+ }
+
+ incrementCmd := taskcmd.IncrementFrontmatterCommand{
+ TaskIdentifier: task.TaskIdentifier,
+ Field: "trigger_count",
+ Delta: 1,
+ }
+ if err := p.publishRaw(ctx, taskcmd.IncrementFrontmatterCommandOperation, incrementCmd); err != nil {
+ return errors.Wrapf(
+ ctx,
+ err,
+ "publish zombie failure trigger_count increment for task %s",
+ task.TaskIdentifier,
+ )
+ }
+
+ // Record dedupe only after BOTH publishes succeed. If the increment fails,
+ // the next cycle re-sends both messages. The duplicate current_job=""
+ // write is harmless (writing empty to an already-empty field changes nothing),
+ // and the retry lets trigger_count eventually bump so the retry cap
+ // (applyTriggerCap in result_writer.go) fires.
+ p.dedupe.recordDedupe(jobName)
+
+ return nil
}
func (p *resultPublisher) PublishIncrementTriggerCount(ctx context.Context, task lib.Task) error {
@@ -124,26 +224,39 @@ func (p *resultPublisher) PublishTypeMismatchFailure(
reason string,
) error {
now := p.currentDateTime.Now().UTC().Format(time.RFC3339)
+ priorAssignee := string(task.Frontmatter.Assignee())
section := fmt.Sprintf(
"## Failure\n\n- **Timestamp:** %s\n- **Assignee:** %s\n- **Reason:** %s\n",
now,
- task.Frontmatter.Assignee(),
+ priorAssignee,
reason,
)
+
+ updates := lib.TaskFrontmatter{
+ "assignee": "",
+ "current_job": "",
+ }
+ if priorAssignee != "" {
+ updates["previous_assignee"] = priorAssignee
+ }
+
cmd := taskcmd.UpdateFrontmatterCommand{
TaskIdentifier: task.TaskIdentifier,
- Updates: lib.TaskFrontmatter{
- "status": "in_progress",
- "phase": "ai_review",
- "assignee": "",
- "current_job": "",
- },
+ Updates: updates,
Body: &taskcmd.BodySection{
Heading: "## Failure",
Section: section,
},
}
- return p.publishRaw(ctx, taskcmd.UpdateFrontmatterCommandOperation, cmd)
+ if err := p.publishRaw(ctx, taskcmd.UpdateFrontmatterCommandOperation, cmd); err != nil {
+ return errors.Wrapf(
+ ctx,
+ err,
+ "publish type mismatch failure for task %s",
+ task.TaskIdentifier,
+ )
+ }
+ return nil
}
func (p *resultPublisher) publishRaw(
diff --git a/task/executor/pkg/result_publisher_test.go b/task/executor/pkg/result_publisher_test.go
index 265b827..70f1122 100644
--- a/task/executor/pkg/result_publisher_test.go
+++ b/task/executor/pkg/result_publisher_test.go
@@ -70,6 +70,44 @@ func (f *failingSyncProducer) Close() error { return nil }
var _ libkafka.SyncProducer = &failingSyncProducer{}
+// partialFailingSyncProducer succeeds for the first `successCount` sends, then
+// returns `err` on every subsequent send. Captures all attempted messages
+// (including the ones that failed) so tests can assert exact send counts.
+type partialFailingSyncProducer struct {
+ successCount int
+ calls int
+ err error
+ messages []*sarama.ProducerMessage
+}
+
+func (p *partialFailingSyncProducer) SendMessage(
+ _ context.Context,
+ msg *sarama.ProducerMessage,
+) (int32, int64, error) {
+ p.calls++
+ p.messages = append(p.messages, msg)
+ if p.calls > p.successCount {
+ return 0, 0, p.err
+ }
+ return 0, 0, nil
+}
+
+func (p *partialFailingSyncProducer) SendMessages(
+ _ context.Context,
+ msgs []*sarama.ProducerMessage,
+) error {
+ p.calls++
+ p.messages = append(p.messages, msgs...)
+ if p.calls > p.successCount {
+ return p.err
+ }
+ return nil
+}
+
+func (p *partialFailingSyncProducer) Close() error { return nil }
+
+var _ libkafka.SyncProducer = &partialFailingSyncProducer{}
+
// decodeUpdateFrontmatterCommand extracts the operation and UpdateFrontmatterCommand from a captured message.
func decodeUpdateFrontmatterCommand(
msg *sarama.ProducerMessage,
@@ -165,7 +203,7 @@ var _ = Describe("ResultPublisher", func() {
Describe("PublishFailure", func() {
It(
- "publishes a failure command with phase human_review and a ## Failure body section",
+ "publishes two commands: UpdateFrontmatterCommand clearing current_job with ## Failure body, then IncrementFrontmatterCommand bumping trigger_count",
func() {
task := lib.Task{
TaskIdentifier: lib.TaskIdentifier("test-task-2"),
@@ -185,37 +223,206 @@ var _ = Describe("ResultPublisher", func() {
)
Expect(err).NotTo(HaveOccurred())
- Expect(producer.messages).To(HaveLen(1))
- operation, cmd := decodeUpdateFrontmatterCommand(producer.messages[0])
+ Expect(producer.messages).To(HaveLen(2))
+ // First message: UpdateFrontmatterCommand
+ operation, updateCmd := decodeUpdateFrontmatterCommand(producer.messages[0])
Expect(
string(operation),
).To(Equal(string(taskcmd.UpdateFrontmatterCommandOperation)))
- Expect(cmd.Updates).To(HaveLen(3))
+ Expect(updateCmd.Updates).To(HaveLen(1))
+ Expect(updateCmd.Updates["current_job"]).To(Equal(""))
+
+ _, hasStatus := updateCmd.Updates["status"]
+ Expect(hasStatus).To(BeFalse(), "status must not be in failure update")
+ _, hasPhase := updateCmd.Updates["phase"]
+ Expect(hasPhase).To(BeFalse(), "phase must not be in failure update")
+ _, hasAssignee := updateCmd.Updates["assignee"]
+ Expect(hasAssignee).To(BeFalse(), "assignee must not be in failure update")
+ _, hasPreviousAssignee := updateCmd.Updates["previous_assignee"]
+ Expect(
+ hasPreviousAssignee,
+ ).To(BeFalse(), "previous_assignee must not be in failure update")
+ _, hasTriggerCount := updateCmd.Updates["trigger_count"]
+ Expect(hasTriggerCount).To(BeFalse(), "trigger_count must not be in failure update")
- Expect(cmd.Updates["status"]).To(Equal("in_progress"))
- Expect(cmd.Updates["phase"]).To(Equal("human_review"))
- Expect(cmd.Updates["current_job"]).To(Equal(""))
+ Expect(updateCmd.Body).NotTo(BeNil())
+ Expect(updateCmd.Body.Heading).To(Equal("## Failure"))
+ Expect(updateCmd.Body.Section).To(ContainSubstring("2026-04-18T12:00:00Z"))
+ Expect(updateCmd.Body.Section).To(ContainSubstring("claude-20260418120000"))
+ Expect(updateCmd.Body.Section).To(ContainSubstring("pod OOM killed"))
- _, hasTriggerCount := cmd.Updates["trigger_count"]
- Expect(hasTriggerCount).To(BeFalse(), "trigger_count must not be in failure update")
- _, hasSpawnNotification := cmd.Updates["spawn_notification"]
+ // Second message: IncrementFrontmatterCommand
+ incOperation, incCmd := decodeIncrementFrontmatterCommand(producer.messages[1])
Expect(
- hasSpawnNotification,
- ).To(BeFalse(), "spawn_notification must not be in failure update")
+ string(incOperation),
+ ).To(Equal(string(taskcmd.IncrementFrontmatterCommandOperation)))
+ Expect(string(incCmd.TaskIdentifier)).To(Equal("test-task-2"))
+ Expect(incCmd.Field).To(Equal("trigger_count"))
+ Expect(incCmd.Delta).To(Equal(1))
+ },
+ )
+ })
- Expect(cmd.Body).NotTo(BeNil())
- Expect(cmd.Body.Heading).To(Equal("## Failure"))
- Expect(cmd.Body.Section).To(ContainSubstring("2026-04-18T12:00:00Z"))
- Expect(cmd.Body.Section).To(ContainSubstring("claude-20260418120000"))
- Expect(cmd.Body.Section).To(ContainSubstring("pod OOM killed"))
+ Describe("PublishFailure dedupe", func() {
+ It("suppresses a second call with the same job name", func() {
+ task := lib.Task{
+ TaskIdentifier: lib.TaskIdentifier("test-task-dedupe"),
+ Frontmatter: lib.TaskFrontmatter{
+ "status": "in_progress",
+ },
+ Content: lib.TaskContent("do the work"),
+ }
+
+ err := publisher.PublishFailure(ctx, task, "claude-20260418120000", "pod OOM killed")
+ Expect(err).NotTo(HaveOccurred())
+ Expect(producer.messages).To(HaveLen(2))
+
+ err = publisher.PublishFailure(ctx, task, "claude-20260418120000", "pod OOM killed")
+ Expect(err).NotTo(HaveOccurred())
+ Expect(producer.messages).To(HaveLen(2), "second call should be deduped")
+ })
+
+ It(
+ "does NOT record dedupe when increment publish fails, so next cycle retries both messages",
+ func() {
+ partialProducer := &partialFailingSyncProducer{
+ successCount: 1, // first send (update) succeeds, second (increment) fails
+ err: errors.New(context.Background(), "kafka: leader not available"),
+ }
+ partialPublisher := pkg.NewResultPublisher(
+ partialProducer,
+ base.Branch("prod"),
+ currentDateTime,
+ )
+
+ task := lib.Task{
+ TaskIdentifier: lib.TaskIdentifier("test-task-partial"),
+ Frontmatter: lib.TaskFrontmatter{
+ "status": "in_progress",
+ },
+ Content: lib.TaskContent("do the work"),
+ }
+
+ // First call: update commits, increment fails — caller sees the increment error.
+ err := partialPublisher.PublishFailure(
+ ctx,
+ task,
+ "claude-20260418120000",
+ "pod OOM killed",
+ )
+ Expect(err).To(HaveOccurred())
+ Expect(err.Error()).To(ContainSubstring("trigger_count increment"))
+ Expect(
+ partialProducer.messages,
+ ).To(HaveLen(2), "both update and increment were attempted")
+
+ // Second call with the same jobName: dedupe must NOT suppress it,
+ // because the increment failed last time. The publish is attempted
+ // again — verified by the producer recording at least one more
+ // message (this producer's state means the update also fails on
+ // the retry, but the key invariant is: not deduped to zero sends).
+ err = partialPublisher.PublishFailure(
+ ctx,
+ task,
+ "claude-20260418120000",
+ "pod OOM killed",
+ )
+ Expect(err).To(HaveOccurred())
+ Expect(
+ len(partialProducer.messages),
+ ).To(BeNumerically(">", 2), "second call must re-attempt publishing (not deduped)")
+ },
+ )
+
+ It(
+ "allows re-send after dedupeTTL expires",
+ func() {
+ task := lib.Task{
+ TaskIdentifier: lib.TaskIdentifier("test-task-ttl"),
+ Frontmatter: lib.TaskFrontmatter{
+ "status": "in_progress",
+ },
+ Content: lib.TaskContent("do the work"),
+ }
+
+ err := publisher.PublishFailure(
+ ctx,
+ task,
+ "claude-20260418120000",
+ "pod OOM killed",
+ )
+ Expect(err).NotTo(HaveOccurred())
+ Expect(producer.messages).To(HaveLen(2))
+
+ // Advance past dedupeTTL (3600s).
+ currentDateTime.SetNow(libtimetest.ParseDateTime("2026-04-18T13:00:01Z"))
+
+ err = publisher.PublishFailure(
+ ctx,
+ task,
+ "claude-20260418120000",
+ "pod OOM killed",
+ )
+ Expect(err).NotTo(HaveOccurred())
+ Expect(
+ producer.messages,
+ ).To(HaveLen(4), "second call after TTL expiry must publish both messages again")
+ },
+ )
+
+ It(
+ "does NOT record dedupe when the first (update) publish fails and does not attempt the increment",
+ func() {
+ // successCount: 0 — first send (update) fails immediately.
+ partialProducer := &partialFailingSyncProducer{
+ successCount: 0,
+ err: errors.New(context.Background(), "kafka: leader not available"),
+ }
+ partialPublisher := pkg.NewResultPublisher(
+ partialProducer,
+ base.Branch("prod"),
+ currentDateTime,
+ )
+
+ task := lib.Task{
+ TaskIdentifier: lib.TaskIdentifier("test-task-first-fail"),
+ Frontmatter: lib.TaskFrontmatter{
+ "status": "in_progress",
+ },
+ Content: lib.TaskContent("do the work"),
+ }
+
+ err := partialPublisher.PublishFailure(
+ ctx,
+ task,
+ "claude-20260418120000",
+ "pod OOM killed",
+ )
+ Expect(err).To(HaveOccurred())
+ Expect(err.Error()).To(ContainSubstring("zombie failure update"))
+ Expect(
+ partialProducer.messages,
+ ).To(HaveLen(1), "only the update was attempted; increment must not run after update fails")
+
+ // Verify dedupe was NOT recorded: second call attempts publishing again.
+ err = partialPublisher.PublishFailure(
+ ctx,
+ task,
+ "claude-20260418120000",
+ "pod OOM killed",
+ )
+ Expect(err).To(HaveOccurred())
+ Expect(
+ partialProducer.messages,
+ ).To(HaveLen(2), "second call re-attempts the update (dedupe was not recorded)")
},
)
})
Describe("PublishTypeMismatchFailure", func() {
It(
- "publishes phase=ai_review, assignee='', current_job='' and Assignee bullet in body",
+ "publishes assignee='', previous_assignee=, current_job='' and Assignee bullet in body",
func() {
task := lib.Task{
TaskIdentifier: lib.TaskIdentifier("test-task-3"),
@@ -238,23 +445,60 @@ var _ = Describe("ResultPublisher", func() {
Expect(
string(operation),
).To(Equal(string(taskcmd.UpdateFrontmatterCommandOperation)))
- Expect(cmd.Updates).To(HaveLen(4))
-
- Expect(cmd.Updates["status"]).To(Equal("in_progress"))
- Expect(cmd.Updates["phase"]).To(Equal("ai_review"))
+ Expect(cmd.Updates).To(HaveLen(3))
Expect(cmd.Updates["assignee"]).To(Equal(""))
+ Expect(cmd.Updates["previous_assignee"]).To(Equal("agent-pr-reviewer"))
Expect(cmd.Updates["current_job"]).To(Equal(""))
+ _, hasStatus := cmd.Updates["status"]
+ Expect(hasStatus).To(BeFalse(), "status must not be in type mismatch update")
+ _, hasPhase := cmd.Updates["phase"]
+ Expect(hasPhase).To(BeFalse(), "phase must not be in type mismatch update")
+ _, hasTriggerCount := cmd.Updates["trigger_count"]
+ Expect(
+ hasTriggerCount,
+ ).To(BeFalse(), "trigger_count must not be in type mismatch update")
+
Expect(cmd.Body).NotTo(BeNil())
Expect(cmd.Body.Heading).To(Equal("## Failure"))
Expect(cmd.Body.Section).To(ContainSubstring("2026-04-18T12:00:00Z"))
Expect(cmd.Body.Section).To(ContainSubstring("agent-pr-reviewer"))
Expect(cmd.Body.Section).To(ContainSubstring("healthcheck"))
+ },
+ )
- _, hasTriggerCount := cmd.Updates["trigger_count"]
+ It(
+ "omits previous_assignee when prior assignee is empty",
+ func() {
+ task := lib.Task{
+ TaskIdentifier: lib.TaskIdentifier("test-task-empty-assignee"),
+ Frontmatter: lib.TaskFrontmatter{
+ "status": "in_progress",
+ "phase": "planning",
+ "assignee": "",
+ },
+ }
+ err := publisher.PublishTypeMismatchFailure(
+ ctx,
+ task,
+ "reason=type_mismatch",
+ )
+ Expect(err).NotTo(HaveOccurred())
+
+ Expect(producer.messages).To(HaveLen(1))
+ _, cmd := decodeUpdateFrontmatterCommand(producer.messages[0])
+
+ Expect(cmd.Updates).To(HaveLen(2))
+ Expect(cmd.Updates["assignee"]).To(Equal(""))
+ Expect(cmd.Updates["current_job"]).To(Equal(""))
+
+ _, hasPreviousAssignee := cmd.Updates["previous_assignee"]
Expect(
- hasTriggerCount,
- ).To(BeFalse(), "trigger_count must not be in type mismatch update")
+ hasPreviousAssignee,
+ ).To(BeFalse(), "previous_assignee must be omitted when prior assignee is empty")
+
+ Expect(cmd.Body).NotTo(BeNil())
+ Expect(cmd.Body.Section).To(ContainSubstring("reason=type_mismatch"))
},
)
})
diff --git a/task/executor/pkg/spawner/job_spawner.go b/task/executor/pkg/spawner/job_spawner.go
index 569889b..18f3e68 100644
--- a/task/executor/pkg/spawner/job_spawner.go
+++ b/task/executor/pkg/spawner/job_spawner.go
@@ -102,11 +102,7 @@ func (s *jobSpawner) SpawnJob(
podSpecBuilder.SetContainersBuilder(containersBuilder)
podSpecBuilder.SetRestartPolicy(corev1.RestartPolicyNever)
- secretName := "docker"
- if config.ImagePullSecret != "" {
- secretName = config.ImagePullSecret
- }
- podSpecBuilder.SetImagePullSecrets([]string{secretName})
+ podSpecBuilder.SetImagePullSecrets([]string{imagePullSecretName(config)})
objectMetaBuilder := k8s.NewObjectMetaBuilder()
objectMetaBuilder.SetName(k8s.Name(jobName))
@@ -131,6 +127,7 @@ func (s *jobSpawner) SpawnJob(
if config.PriorityClassName != "" {
job.Spec.Template.Spec.PriorityClassName = config.PriorityClassName
}
+ applyActiveDeadlineSeconds(config, jobName, task.TaskIdentifier, job)
_, err = s.kubeClient.BatchV1().
Jobs(s.namespace.String()).
@@ -208,6 +205,15 @@ func applyVolumeMount(
return nil
}
+// imagePullSecretName returns the image pull secret name from the config,
+// falling back to "docker" when not set.
+func imagePullSecretName(config pkg.AgentConfiguration) string {
+ if config.ImagePullSecret != "" {
+ return config.ImagePullSecret
+ }
+ return "docker"
+}
+
// applyCPUMemoryResources sets CPU and memory requests/limits on the container builder
// when the corresponding config values are non-empty. Empty values leave builder defaults untouched.
func applyCPUMemoryResources(config pkg.AgentConfiguration, containerBuilder k8s.ContainerBuilder) {
@@ -283,6 +289,19 @@ func applyTaskIDLabel(taskID lib.TaskIdentifier, job *batchv1.Job) {
job.Spec.Template.Labels[taskIDLabelKey] = string(taskID)
}
+// applyActiveDeadlineSeconds stamps Job.Spec.ActiveDeadlineSeconds from the config's
+// effective zombie job timeout so Kubernetes enforces a hard deadline on every spawned Job.
+func applyActiveDeadlineSeconds(
+ config pkg.AgentConfiguration,
+ jobName string,
+ taskID lib.TaskIdentifier,
+ job *batchv1.Job,
+) {
+ deadline := int64(config.EffectiveZombieJobTimeoutSeconds())
+ job.Spec.ActiveDeadlineSeconds = &deadline
+ glog.V(2).Infof("set activeDeadlineSeconds=%d on job %s for task %s", deadline, jobName, taskID)
+}
+
// taskPhaseString returns the string value of the task's phase, or "" when absent.
func taskPhaseString(f lib.TaskFrontmatter) string {
if p := f.Phase(); p != nil {
diff --git a/task/executor/pkg/spawner/job_spawner_test.go b/task/executor/pkg/spawner/job_spawner_test.go
index 6a0cba0..e502db7 100644
--- a/task/executor/pkg/spawner/job_spawner_test.go
+++ b/task/executor/pkg/spawner/job_spawner_test.go
@@ -942,6 +942,51 @@ var _ = Describe("JobSpawner", func() {
})
})
+ Describe("ActiveDeadlineSeconds", func() {
+ ptrInt32 := func(v int32) *int32 { return &v }
+
+ It("stamps ActiveDeadlineSeconds from config", func() {
+ task := lib.Task{
+ TaskIdentifier: lib.TaskIdentifier("deadline-task"),
+ Frontmatter: lib.TaskFrontmatter{"assignee": "claude"},
+ Content: lib.TaskContent("do the work"),
+ }
+ config := pkg.AgentConfiguration{
+ Assignee: "claude",
+ Image: "my-image:latest",
+ ZombieJobTimeoutSeconds: ptrInt32(900),
+ }
+ jobName, err := jobSpawner.SpawnJob(ctx, task, config)
+ Expect(err).To(BeNil())
+ Expect(jobName).NotTo(BeEmpty())
+
+ job, err := fakeClient.BatchV1().Jobs("test-ns").Get(ctx, jobName, metav1.GetOptions{})
+ Expect(err).To(BeNil())
+ Expect(job.Spec.ActiveDeadlineSeconds).NotTo(BeNil())
+ Expect(*job.Spec.ActiveDeadlineSeconds).To(Equal(int64(900)))
+ })
+
+ It("uses default ActiveDeadlineSeconds when config is unset", func() {
+ task := lib.Task{
+ TaskIdentifier: lib.TaskIdentifier("default-deadline-task"),
+ Frontmatter: lib.TaskFrontmatter{"assignee": "claude"},
+ Content: lib.TaskContent("do the work"),
+ }
+ config := pkg.AgentConfiguration{
+ Assignee: "claude",
+ Image: "my-image:latest",
+ }
+ jobName, err := jobSpawner.SpawnJob(ctx, task, config)
+ Expect(err).To(BeNil())
+ Expect(jobName).NotTo(BeEmpty())
+
+ job, err := fakeClient.BatchV1().Jobs("test-ns").Get(ctx, jobName, metav1.GetOptions{})
+ Expect(err).To(BeNil())
+ Expect(job.Spec.ActiveDeadlineSeconds).NotTo(BeNil())
+ Expect(*job.Spec.ActiveDeadlineSeconds).To(Equal(int64(1800)))
+ })
+ })
+
// Regression guard: SpawnJob and IsJobActive must agree on the label key used
// to identify a Job. A mismatch causes the executor to treat a freshly-spawned
// Job as "no active job" and respawn another every poll cycle.
diff --git a/task/executor/pkg/zombie_reason.go b/task/executor/pkg/zombie_reason.go
new file mode 100644
index 0000000..4e91a00
--- /dev/null
+++ b/task/executor/pkg/zombie_reason.go
@@ -0,0 +1,24 @@
+// Copyright (c) 2026 Benjamin Borbe All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package pkg
+
+// ZombieReason is the closed set of machine-readable reason strings emitted in
+// the ## Failure body section. Operators grep on these values to triage.
+// Adding a new value requires updating this list and the documentation; renaming
+// or removing a value is a breaking change to the on-disk task body contract.
+type ZombieReason string
+
+const (
+ ZombieReasonImagePullBackOff ZombieReason = "image_pull_backoff"
+ ZombieReasonPodEvicted ZombieReason = "pod_evicted"
+ ZombieReasonDeadlineExceeded ZombieReason = "deadline_exceeded"
+ ZombieReasonPodNotScheduled ZombieReason = "pod_not_scheduled"
+ ZombieReasonPodCrashNoStdout ZombieReason = "pod_crash_no_stdout"
+ ZombieReasonExecutorWatchLost ZombieReason = "executor_watch_lost"
+ ZombieReasonTypeMismatch ZombieReason = "type_mismatch"
+)
+
+// String returns the reason as a string (for use with PublishFailure).
+func (r ZombieReason) String() string { return string(r) }
diff --git a/task/executor/pkg/zombie_reason_test.go b/task/executor/pkg/zombie_reason_test.go
new file mode 100644
index 0000000..e6cfce8
--- /dev/null
+++ b/task/executor/pkg/zombie_reason_test.go
@@ -0,0 +1,38 @@
+// Copyright (c) 2026 Benjamin Borbe All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package pkg_test
+
+import (
+ . "github.com/onsi/ginkgo/v2"
+ . "github.com/onsi/gomega"
+
+ "github.com/bborbe/agent/task/executor/pkg"
+)
+
+var _ = Describe("ZombieReason", func() {
+ Describe("String()", func() {
+ It("returns image_pull_backoff for ZombieReasonImagePullBackOff", func() {
+ Expect(pkg.ZombieReasonImagePullBackOff.String()).To(Equal("image_pull_backoff"))
+ })
+ It("returns pod_evicted for ZombieReasonPodEvicted", func() {
+ Expect(pkg.ZombieReasonPodEvicted.String()).To(Equal("pod_evicted"))
+ })
+ It("returns deadline_exceeded for ZombieReasonDeadlineExceeded", func() {
+ Expect(pkg.ZombieReasonDeadlineExceeded.String()).To(Equal("deadline_exceeded"))
+ })
+ It("returns pod_not_scheduled for ZombieReasonPodNotScheduled", func() {
+ Expect(pkg.ZombieReasonPodNotScheduled.String()).To(Equal("pod_not_scheduled"))
+ })
+ It("returns pod_crash_no_stdout for ZombieReasonPodCrashNoStdout", func() {
+ Expect(pkg.ZombieReasonPodCrashNoStdout.String()).To(Equal("pod_crash_no_stdout"))
+ })
+ It("returns executor_watch_lost for ZombieReasonExecutorWatchLost", func() {
+ Expect(pkg.ZombieReasonExecutorWatchLost.String()).To(Equal("executor_watch_lost"))
+ })
+ It("returns type_mismatch for ZombieReasonTypeMismatch", func() {
+ Expect(pkg.ZombieReasonTypeMismatch.String()).To(Equal("type_mismatch"))
+ })
+ })
+})
diff --git a/task/executor/pkg/zombie_sweeper.go b/task/executor/pkg/zombie_sweeper.go
new file mode 100644
index 0000000..26f3f0d
--- /dev/null
+++ b/task/executor/pkg/zombie_sweeper.go
@@ -0,0 +1,251 @@
+// Copyright (c) 2026 Benjamin Borbe All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package pkg
+
+import (
+ "context"
+ "time"
+
+ "github.com/bborbe/errors"
+ libk8s "github.com/bborbe/k8s"
+ libtime "github.com/bborbe/time"
+ "github.com/golang/glog"
+ corev1 "k8s.io/api/core/v1"
+ "k8s.io/apimachinery/pkg/labels"
+ corev1listers "k8s.io/client-go/listers/core/v1"
+
+ lib "github.com/bborbe/agent/lib"
+ agentv1 "github.com/bborbe/agent/task/executor/k8s/apis/agent.benjamin-borbe.de/v1"
+)
+
+//counterfeiter:generate -o ../mocks/zombie_sweeper.go --fake-name FakeZombieSweeper . ZombieSweeper
+
+// ZombieSweeper is a background goroutine that periodically classifies stuck
+// tasks as zombies and emits failure events. It is the safety net for the
+// informer-driven paths in JobWatcher (which handle the cases k8s notifies us
+// about). The sweeper handles: pods unschedulable beyond a grace window,
+// executor restart losing watch on a Job, and any deadline path the informer
+// misses (Job-condition deferred indefinitely, informer cache drift).
+type ZombieSweeper interface {
+ // Run blocks until ctx is cancelled. Each tick (interval sourced from the
+ // first non-nil ConfigSpec.ZombieSweeperIntervalSeconds across the resolver's
+ // configs, else DefaultZombieSweeperIntervalSeconds) it calls SweepOnce.
+ Run(ctx context.Context) error
+ // SweepOnce performs a single sweep pass. Exposed for unit tests so they
+ // do not have to manage tickers. Returns an error only when the config
+ // lookup fails; per-task classification and publish errors are logged.
+ SweepOnce(ctx context.Context) error
+}
+
+// NewZombieSweeper creates a ZombieSweeper. The JobWatcher is held by reference
+// (not its lister) because the lister is only populated after JobWatcher.Run
+// completes its cache sync; service.Run starts all components concurrently, so
+// extracting the lister at wiring time would capture nil. SweepOnce resolves
+// the lister lazily on every tick and skips the tick if it is not yet ready.
+func NewZombieSweeper(
+ jobWatcher JobWatcher,
+ namespace libk8s.Namespace,
+ taskStore *TaskStore,
+ publisher ResultPublisher,
+ configProvider EventHandlerConfig,
+ currentDateTime libtime.CurrentDateTimeGetter,
+) ZombieSweeper {
+ return &zombieSweeper{
+ jobWatcher: jobWatcher,
+ namespace: namespace,
+ taskStore: taskStore,
+ publisher: publisher,
+ configProvider: configProvider,
+ currentDateTime: currentDateTime,
+ }
+}
+
+type zombieSweeper struct {
+ jobWatcher JobWatcher
+ namespace libk8s.Namespace
+ taskStore *TaskStore
+ publisher ResultPublisher
+ configProvider EventHandlerConfig
+ currentDateTime libtime.CurrentDateTimeGetter
+}
+
+const (
+ // podNotScheduledGraceWindow is the age threshold past which a Pending Pod
+ // with PodScheduled=False is classified pod_not_scheduled. Must exceed
+ // typical scheduler latency comfortably; 2 minutes is empirically generous.
+ podNotScheduledGraceWindow = 2 * time.Minute
+)
+
+// NOTE on "no recent heartbeat" from spec DB #9 / AC #6:
+// The spec predicate is `elapsed > deadline AND pod not Running AND no recent
+// heartbeat`. This codebase has NO separate heartbeat channel today — the only
+// liveness signal for a running job is "is a Pod currently Running?". Therefore
+// "no recent heartbeat" is implemented as "no Pod in PodRunning phase observed
+// for this task". If a per-job heartbeat is added later (a follow-up spec),
+// this predicate gets a real check; for now `classify` treats `pod not Running`
+// as covering both halves of the conjunction.
+
+func (s *zombieSweeper) Run(ctx context.Context) error {
+ // Interval is resolved once at startup. CRD changes to ZombieSweeperIntervalSeconds
+ // take effect only after pod restart. Acceptable because executor pods are short-lived
+ // relative to CRD reconciliation cycles.
+ interval, err := s.resolveSweeperInterval(ctx)
+ if err != nil {
+ return errors.Wrapf(ctx, err, "resolve sweeper interval")
+ }
+ ticker := time.NewTicker(interval)
+ defer ticker.Stop()
+ glog.V(2).Infof("zombie sweeper started interval=%s", interval)
+ for {
+ select {
+ case <-ctx.Done():
+ return nil
+ case <-ticker.C:
+ if err := s.SweepOnce(ctx); err != nil {
+ // Per-tick failures (transient lister errors, ctx-scoped
+ // failures from publisher) must NOT kill the sweeper goroutine
+ // — that would tear down the executor via service.Run. Log and
+ // continue.
+ glog.Errorf("zombie sweeper tick: %v", err)
+ }
+ }
+ }
+}
+
+func (s *zombieSweeper) SweepOnce(ctx context.Context) error {
+ // Resolve the Pod lister lazily — JobWatcher.Run populates it only after
+ // the informer cache has synced, and service.Run starts all components
+ // concurrently. If the lister is not yet available, skip this tick rather
+ // than publishing spurious failures.
+ lister := s.jobWatcher.PodLister()
+ if lister == nil {
+ glog.V(2).Infof("sweep skipped: pod lister not yet synced")
+ return nil
+ }
+ snapshot := s.taskStore.Snapshot()
+ now := s.currentDateTime.Now().Time()
+ // Fetch configs ONCE per tick — used by taskDeadline() for every task in
+ // the snapshot. Avoids N calls into the provider per sweep.
+ cfgs, err := s.configProvider.Get(ctx)
+ if err != nil {
+ return errors.Wrapf(ctx, err, "list configs")
+ }
+ for taskID, task := range snapshot {
+ jobName := task.Frontmatter.CurrentJob()
+ if jobName == "" {
+ // No active job recorded; nothing to sweep.
+ continue
+ }
+ jobStartedAt, err := task.Frontmatter.JobStartedAt()
+ if err != nil || jobStartedAt.IsZero() {
+ glog.V(3).Infof(
+ "zombie sweeper: task %s job_started_at unparseable or zero; skipping",
+ taskID,
+ )
+ continue
+ }
+ deadline := s.taskDeadline(task, cfgs)
+ elapsed := now.Sub(jobStartedAt)
+ if elapsed < deadline {
+ continue
+ }
+ reason := s.classify(lister, taskID, now)
+ if reason == "" {
+ continue
+ }
+ if err := s.publisher.PublishFailure(ctx, task, jobName, reason.String()); err != nil {
+ glog.Errorf(
+ "zombie sweeper: publish failure for task %s (job %s reason %s): %v",
+ taskID, jobName, reason, err,
+ )
+ continue
+ }
+ glog.V(2).Infof(
+ "zombie sweeper: published failure for task %s (job %s reason %s elapsed=%s)",
+ taskID, jobName, reason, elapsed,
+ )
+ }
+ return nil
+}
+
+func (s *zombieSweeper) taskDeadline(task lib.Task, cfgs []agentv1.Config) time.Duration {
+ assignee := task.Frontmatter.Assignee().String()
+ for _, cfg := range cfgs {
+ if cfg.Spec.Assignee == assignee && cfg.Spec.ZombieJobTimeoutSeconds != nil {
+ return time.Duration(*cfg.Spec.ZombieJobTimeoutSeconds) * time.Second
+ }
+ }
+ return time.Duration(agentv1.DefaultZombieJobTimeoutSeconds) * time.Second
+}
+
+func (s *zombieSweeper) resolveSweeperInterval(ctx context.Context) (time.Duration, error) {
+ cfgs, err := s.configProvider.Get(ctx)
+ if err != nil {
+ return 0, errors.Wrapf(ctx, err, "list configs")
+ }
+ for _, cfg := range cfgs {
+ if cfg.Spec.ZombieSweeperIntervalSeconds != nil {
+ return time.Duration(*cfg.Spec.ZombieSweeperIntervalSeconds) * time.Second, nil
+ }
+ }
+ return time.Duration(agentv1.DefaultZombieSweeperIntervalSeconds) * time.Second, nil
+}
+
+// classify determines whether a past-deadline task is a zombie and which
+// reason applies. Returns "" when the task is NOT a zombie (Pod still Running,
+// the implicit heartbeat) or when the lister call fails. Inspects Pod state via
+// the shared Pod informer's lister (introduced by prompt 2). Spec Failure-Mode
+// row "k8s API rate-limit (429)" mandates: "Sweeper relies on informer cache
+// (no per-cycle list)", so we MUST NOT issue API LIST calls here.
+func (s *zombieSweeper) classify(
+ lister corev1listers.PodLister,
+ taskID lib.TaskIdentifier,
+ now time.Time,
+) ZombieReason {
+ selector := labels.SelectorFromSet(labels.Set{
+ "agent.benjamin-borbe.de/task-id": string(taskID),
+ })
+ pods, err := lister.Pods(s.namespace.String()).List(selector)
+ if err != nil {
+ glog.Errorf("zombie sweeper: lister pods for task %s: %v", taskID, err)
+ return ""
+ }
+ // Zero pods AND past-deadline AND a Job was supposed to be running →
+ // executor lost the watch (the Job exists in k8s but its Pod was garbage
+ // collected, or the Job never created a Pod and the executor restarted in
+ // the meantime). "No recent heartbeat" reduces to "no Pod observed" since
+ // this codebase has no separate heartbeat channel.
+ if len(pods) == 0 {
+ return ZombieReasonExecutorWatchLost
+ }
+ for _, pod := range pods {
+ // Healthy Running — NOT a zombie. A Running pod is the implicit
+ // heartbeat in the current system (no separate heartbeat channel).
+ if pod.Status.Phase == corev1.PodRunning {
+ return ""
+ }
+ // Pending past the unschedulable grace window with PodScheduled=False.
+ if pod.Status.Phase == corev1.PodPending {
+ age := now.Sub(pod.CreationTimestamp.Time)
+ if age > podNotScheduledGraceWindow && hasPodScheduledFalse(pod) {
+ return ZombieReasonPodNotScheduled
+ }
+ }
+ }
+ // Past deadline, no Running pod, no specific Pod-state reason — fall
+ // back to deadline_exceeded.
+ return ZombieReasonDeadlineExceeded
+}
+
+// hasPodScheduledFalse returns true when the Pod has a PodScheduled=False
+// condition (kube-scheduler could not place the pod).
+func hasPodScheduledFalse(pod *corev1.Pod) bool {
+ for _, c := range pod.Status.Conditions {
+ if c.Type == corev1.PodScheduled && c.Status == corev1.ConditionFalse {
+ return true
+ }
+ }
+ return false
+}
diff --git a/task/executor/pkg/zombie_sweeper_test.go b/task/executor/pkg/zombie_sweeper_test.go
new file mode 100644
index 0000000..502c9c9
--- /dev/null
+++ b/task/executor/pkg/zombie_sweeper_test.go
@@ -0,0 +1,581 @@
+// Copyright (c) 2026 Benjamin Borbe All rights reserved.
+// Use of this source code is governed by a BSD-style
+// license that can be found in the LICENSE file.
+
+package pkg_test
+
+import (
+ "context"
+ "time"
+
+ libtime "github.com/bborbe/time"
+ libtimetest "github.com/bborbe/time/test"
+ . "github.com/onsi/ginkgo/v2"
+ . "github.com/onsi/gomega"
+ corev1 "k8s.io/api/core/v1"
+ metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
+ k8sinformers "k8s.io/client-go/informers"
+ "k8s.io/client-go/kubernetes/fake"
+
+ lib "github.com/bborbe/agent/lib"
+ agentv1 "github.com/bborbe/agent/task/executor/k8s/apis/agent.benjamin-borbe.de/v1"
+ "github.com/bborbe/agent/task/executor/mocks"
+ "github.com/bborbe/agent/task/executor/pkg"
+)
+
+var _ = Describe("ZombieSweeper", func() {
+ var (
+ ctx context.Context
+ fakePublisher *mocks.FakeResultPublisher
+ taskStore *pkg.TaskStore
+ eventHandlerConfig pkg.EventHandlerConfig
+ currentDateTime libtime.CurrentDateTime
+ )
+
+ BeforeEach(func() {
+ ctx = context.Background()
+ fakePublisher = &mocks.FakeResultPublisher{}
+ taskStore = pkg.NewTaskStore()
+ eventHandlerConfig = pkg.NewEventHandlerConfig()
+ currentDateTime = libtime.NewCurrentDateTime()
+ // Default "now" for tests: 2026-06-01T12:00:00Z
+ currentDateTime.SetNow(libtimetest.ParseDateTime("2026-06-01T12:00:00Z"))
+ })
+
+ makeTask := func(id string, assignee string, currentJob string, jobStartedAt string) lib.Task {
+ fm := lib.TaskFrontmatter{
+ "status": "in_progress",
+ "assignee": assignee,
+ }
+ if currentJob != "" {
+ fm["current_job"] = currentJob
+ }
+ if jobStartedAt != "" {
+ fm["job_started_at"] = jobStartedAt
+ }
+ return lib.Task{
+ TaskIdentifier: lib.TaskIdentifier(id),
+ Frontmatter: fm,
+ Content: lib.TaskContent("do the work"),
+ }
+ }
+
+ // makePod creates a pod whose CreationTimestamp lies creationAge before "now"
+ // (which is set by currentDateTime to 2026-06-01T12:00:00Z).
+ makePod := func(name, namespace, taskID string, phase corev1.PodPhase, creationAge time.Duration, conditions ...corev1.PodCondition) *corev1.Pod {
+ return &corev1.Pod{
+ ObjectMeta: metav1.ObjectMeta{
+ Name: name,
+ Namespace: namespace,
+ Labels: map[string]string{
+ "agent.benjamin-borbe.de/task-id": taskID,
+ },
+ CreationTimestamp: metav1.Time{
+ // currentDateTime is 2026-06-01T12:00:00Z, subtract age to get creation time
+ Time: time.Date(2026, 6, 1, 12, 0, 0, 0, time.UTC).Add(-creationAge),
+ },
+ },
+ Status: corev1.PodStatus{
+ Phase: phase,
+ Conditions: conditions,
+ },
+ }
+ }
+
+ podScheduledFalseCondition := func() corev1.PodCondition {
+ return corev1.PodCondition{
+ Type: corev1.PodScheduled,
+ Status: corev1.ConditionFalse,
+ Reason: "Unschedulable",
+ }
+ }
+
+ ptrInt32 := func(v int32) *int32 { return &v }
+
+ Describe("SweepOnce", func() {
+ // 6a: deadline-exceeded-and-not-running → zombie (deadline_exceeded)
+ Context("deadline exceeded with failed pod", func() {
+ It("publishes failure with deadline_exceeded", func() {
+ taskID := lib.TaskIdentifier("task-6a-deadline-exceeded")
+ // job_started_at = 11:30:00Z, elapsed = 30min at 12:00:00Z
+ ts := libtimetest.ParseDateTime("2026-06-01T11:30:00Z")
+ task := makeTask(string(taskID), "agent-a", "job-1", ts.Format(time.RFC3339))
+ taskStore.Store(taskID, task)
+
+ // Config: deadline = 60s
+ cfg := agentv1.Config{
+ ObjectMeta: metav1.ObjectMeta{Name: "cfg-a"},
+ Spec: agentv1.ConfigSpec{
+ Assignee: "agent-a",
+ ZombieJobTimeoutSeconds: ptrInt32(60),
+ },
+ }
+ _ = eventHandlerConfig.OnAdd(ctx, cfg)
+
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podInformer := informerFactory.Core().V1().Pods().Informer()
+ _ = podInformer.GetIndexer().
+ Add(makePod("pod-failed", "test-ns", string(taskID), corev1.PodFailed, 0))
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, calledTask, calledJobName, calledReason := fakePublisher.PublishFailureArgsForCall(
+ 0,
+ )
+ Expect(string(calledTask.TaskIdentifier)).To(Equal(string(taskID)))
+ Expect(calledJobName).To(Equal("job-1"))
+ Expect(calledReason).To(Equal("deadline_exceeded"))
+ })
+ })
+
+ // 6b: deadline-exceeded-but-running → NOT zombie
+ Context("deadline exceeded but pod is Running", func() {
+ It("skips publish when pod is Running", func() {
+ taskID := lib.TaskIdentifier("task-6b-running")
+ ts := libtimetest.ParseDateTime("2026-06-01T11:30:00Z")
+ task := makeTask(string(taskID), "agent-a", "job-1", ts.Format(time.RFC3339))
+ taskStore.Store(taskID, task)
+
+ cfg := agentv1.Config{
+ ObjectMeta: metav1.ObjectMeta{Name: "cfg-a"},
+ Spec: agentv1.ConfigSpec{
+ Assignee: "agent-a",
+ ZombieJobTimeoutSeconds: ptrInt32(60),
+ },
+ }
+ _ = eventHandlerConfig.OnAdd(ctx, cfg)
+
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podInformer := informerFactory.Core().V1().Pods().Informer()
+ _ = podInformer.GetIndexer().
+ Add(makePod("pod-running", "test-ns", string(taskID), corev1.PodRunning, 0))
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
+ })
+ })
+
+ // 6c: under-deadline → NOT zombie
+ Context("elapsed time is under deadline", func() {
+ It("skips publish when elapsed < deadline", func() {
+ taskID := lib.TaskIdentifier("task-6c-under-deadline")
+ // job_started_at = 11:59:30Z, elapsed = 30s at 12:00:00Z < deadline of 60s
+ ts := libtimetest.ParseDateTime("2026-06-01T11:59:30Z")
+ task := makeTask(string(taskID), "agent-a", "job-1", ts.Format(time.RFC3339))
+ taskStore.Store(taskID, task)
+
+ cfg := agentv1.Config{
+ ObjectMeta: metav1.ObjectMeta{Name: "cfg-a"},
+ Spec: agentv1.ConfigSpec{
+ Assignee: "agent-a",
+ ZombieJobTimeoutSeconds: ptrInt32(60),
+ },
+ }
+ _ = eventHandlerConfig.OnAdd(ctx, cfg)
+
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podInformer := informerFactory.Core().V1().Pods().Informer()
+ _ = podInformer.GetIndexer().
+ Add(makePod("pod-failed", "test-ns", string(taskID), corev1.PodFailed, 0))
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
+ })
+ })
+
+ // 6d: watch-lost → executor_watch_lost
+ Context("no pods found for task past deadline", func() {
+ It("publishes failure with executor_watch_lost", func() {
+ taskID := lib.TaskIdentifier("task-6d-watch-lost")
+ ts := libtimetest.ParseDateTime("2026-06-01T11:30:00Z")
+ task := makeTask(string(taskID), "agent-a", "job-1", ts.Format(time.RFC3339))
+ taskStore.Store(taskID, task)
+
+ cfg := agentv1.Config{
+ ObjectMeta: metav1.ObjectMeta{Name: "cfg-a"},
+ Spec: agentv1.ConfigSpec{
+ Assignee: "agent-a",
+ ZombieJobTimeoutSeconds: ptrInt32(60),
+ },
+ }
+ _ = eventHandlerConfig.OnAdd(ctx, cfg)
+
+ // Empty fake client — no pods at all
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, calledTask, calledJobName, calledReason := fakePublisher.PublishFailureArgsForCall(
+ 0,
+ )
+ Expect(string(calledTask.TaskIdentifier)).To(Equal(string(taskID)))
+ Expect(calledJobName).To(Equal("job-1"))
+ Expect(calledReason).To(Equal("executor_watch_lost"))
+ })
+ })
+
+ // 6e: pod_not_scheduled
+ Context("pod is Pending past grace window with PodScheduled=False", func() {
+ It("publishes failure with pod_not_scheduled", func() {
+ taskID := lib.TaskIdentifier("task-6e-pod-not-scheduled")
+ ts := libtimetest.ParseDateTime("2026-06-01T11:30:00Z")
+ task := makeTask(string(taskID), "agent-a", "job-1", ts.Format(time.RFC3339))
+ taskStore.Store(taskID, task)
+
+ cfg := agentv1.Config{
+ ObjectMeta: metav1.ObjectMeta{Name: "cfg-a"},
+ Spec: agentv1.ConfigSpec{
+ Assignee: "agent-a",
+ ZombieJobTimeoutSeconds: ptrInt32(60),
+ },
+ }
+ _ = eventHandlerConfig.OnAdd(ctx, cfg)
+
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podInformer := informerFactory.Core().V1().Pods().Informer()
+ // Pod created 5 minutes ago (exceeds 2min grace window), Pending, PodScheduled=False
+ _ = podInformer.GetIndexer().
+ Add(makePod("pod-unschedulable", "test-ns", string(taskID), corev1.PodPending, 5*time.Minute, podScheduledFalseCondition()))
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(1))
+ _, _, _, calledReason := fakePublisher.PublishFailureArgsForCall(0)
+ Expect(calledReason).To(Equal("pod_not_scheduled"))
+ })
+ })
+
+ // 6f: interval default. SweepOnce runs a single tick, so the sweep interval is
+ // only verified indirectly. This test exercises the default path explicitly:
+ // a task whose assignee has no matching config falls back to the default
+ // deadline of 1800s.
+ Context(
+ "interval default: task with no matching config uses 1800s default deadline",
+ func() {
+ It(
+ "does not publish failure when elapsed (29min) < default deadline (1800s)",
+ func() {
+ taskID := lib.TaskIdentifier("task-6f-default-deadline")
+ // job_started_at = 11:31:00Z, elapsed = 29min, default deadline = 1800s (30min)
+ // elapsed < deadline → no publish
+ ts := libtimetest.ParseDateTime("2026-06-01T11:31:00Z")
+ task := makeTask(
+ string(taskID),
+ "agent-no-config",
+ "job-1",
+ ts.Format(time.RFC3339),
+ )
+ taskStore.Store(taskID, task)
+
+ // No config added — uses default deadline of 1800s
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podInformer := informerFactory.Core().V1().Pods().Informer()
+ _ = podInformer.GetIndexer().
+ Add(makePod("pod-failed", "test-ns", string(taskID), corev1.PodFailed, 0))
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ // elapsed (29min) < default deadline (30min) → no zombie
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
+ },
+ )
+ },
+ )
+
+ // 6g: interval override: covered by 6a-6e, whose configs set ZombieJobTimeoutSeconds to 60s
+ // 6h: deadline default — verified by 6f (uses default when no config matches)
+
+ Context("task with no current_job is skipped", func() {
+ It("skips tasks without a current_job set", func() {
+ taskID := lib.TaskIdentifier("task-no-job")
+ task := makeTask(string(taskID), "agent-a", "", "")
+ taskStore.Store(taskID, task)
+
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
+ })
+ })
+
+ Context("task with unparseable job_started_at is skipped", func() {
+ It("skips tasks with malformed job_started_at", func() {
+ taskID := lib.TaskIdentifier("task-bad-time")
+ fm := lib.TaskFrontmatter{
+ "status": "in_progress",
+ "assignee": "agent-a",
+ "current_job": "job-1",
+ "job_started_at": "not-a-valid-timestamp",
+ }
+ task := lib.Task{
+ TaskIdentifier: taskID,
+ Frontmatter: fm,
+ Content: lib.TaskContent("do the work"),
+ }
+ taskStore.Store(taskID, task)
+
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
+ })
+ })
+
+ // Per-task publish errors must NOT abort the sweep — every past-deadline
+ // task must get a PublishFailure attempt even when an earlier one fails.
+ // Guards the regression where a single broken publisher.PublishFailure
+ // (e.g. transient kafka error for one task's UUID) would skip the rest.
+ Context("multiple tasks with one publish failure mid-loop", func() {
+ It(
+ "calls PublishFailure for every task and continues past per-task errors",
+ func() {
+ ts := libtimetest.ParseDateTime("2026-06-01T11:30:00Z")
+ tasks := []lib.TaskIdentifier{
+ "task-multi-1",
+ "task-multi-2",
+ "task-multi-3",
+ }
+ for _, id := range tasks {
+ task := makeTask(
+ string(id),
+ "agent-a",
+ "job-"+string(id),
+ ts.Format(time.RFC3339),
+ )
+ taskStore.Store(id, task)
+ }
+
+ cfg := agentv1.Config{
+ ObjectMeta: metav1.ObjectMeta{Name: "cfg-a"},
+ Spec: agentv1.ConfigSpec{
+ Assignee: "agent-a",
+ ZombieJobTimeoutSeconds: ptrInt32(60),
+ },
+ }
+ _ = eventHandlerConfig.OnAdd(ctx, cfg)
+
+ // Empty pod lister — all tasks classify as executor_watch_lost.
+ fakeClient := fake.NewSimpleClientset()
+ informerFactory := k8sinformers.NewSharedInformerFactoryWithOptions(
+ fakeClient,
+ 0,
+ k8sinformers.WithNamespace("test-ns"),
+ )
+ podLister := informerFactory.Core().V1().Pods().Lister()
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+ fakeJobWatcher.PodListerReturns(podLister)
+
+ // Snapshot iterates a map, so task order is non-deterministic.
+ // Fail the second call regardless of which task lands there;
+ // the assertion (call count == 3) covers the loop-continue
+ // invariant without depending on order.
+ fakePublisher.PublishFailureStub = func(
+ _ context.Context,
+ _ lib.Task,
+ _ string,
+ _ string,
+ ) error {
+ if fakePublisher.PublishFailureCallCount() == 2 {
+ return context.Canceled
+ }
+ return nil
+ }
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(3))
+ },
+ )
+ })
+
+ // Pod lister not yet synced — sweeper must skip the tick without
+ // publishing failures. Guards the regression where service.Run starts
+ // the sweeper before JobWatcher.Run finishes WaitForCacheSync.
+ Context("pod lister is nil (JobWatcher not yet synced)", func() {
+ It("skips the tick without publishing failures", func() {
+ taskID := lib.TaskIdentifier("task-lister-nil")
+ ts := libtimetest.ParseDateTime("2026-06-01T11:30:00Z")
+ task := makeTask(string(taskID), "agent-a", "job-1", ts.Format(time.RFC3339))
+ taskStore.Store(taskID, task)
+
+ cfg := agentv1.Config{
+ ObjectMeta: metav1.ObjectMeta{Name: "cfg-a"},
+ Spec: agentv1.ConfigSpec{
+ Assignee: "agent-a",
+ ZombieJobTimeoutSeconds: ptrInt32(60),
+ },
+ }
+ _ = eventHandlerConfig.OnAdd(ctx, cfg)
+
+ // FakeJobWatcher returns a nil PodLister by default — simulates
+ // the pre-cache-sync state.
+ fakeJobWatcher := &mocks.FakeJobWatcher{}
+
+ sweeper := pkg.NewZombieSweeper(
+ fakeJobWatcher,
+ "test-ns",
+ taskStore,
+ fakePublisher,
+ eventHandlerConfig,
+ currentDateTime,
+ )
+
+ err := sweeper.SweepOnce(ctx)
+ Expect(err).NotTo(HaveOccurred())
+ Expect(fakePublisher.PublishFailureCallCount()).To(Equal(0))
+ })
+ })
+ })
+})