feat: capability-aware scheduler + AgenticTaskWatcher + stub executor (Foreman v0.1 M2)#504

Open: Defilan wants to merge 1 commit into defilantech:main from Defilan:feat/foreman-scheduler

Defilan commented May 20, 2026

What

Closes the Pending -> Scheduled -> Running -> Succeeded dispatch loop for the Foreman v0.1 add-on. Adds the capability-aware scheduler to the foreman-operator; the AgenticTaskWatcher, Executor abstraction, and StubExecutor to foreman-agent; and the foreman.v1 Result envelope they share. End-to-end demoed against kind-llmkube-local.

Why

Refs #500.

M2 is the architectural validation milestone for Foreman: it proves the CRDs from M0 and the FleetNode heartbeat from M1 actually compose into a working dispatch loop, without yet depending on the M3 native agent loop. The StubExecutor stands in for M3's real native agent loop so the plumbing is observable end to end today.

M2 also locked down the function-calling substrate decision the v0.1 plan rests on. A smoke test against the live Carnice-Qwen3.6-MoE-35B-A3B-APEX-MTP endpoint serving on Apple Silicon Metal passed 13 of 15 multi-turn flows cleanly; the 2 failures trace to a known upstream llama.cpp issue (ggml-org/llama.cpp#22072, comment with our repro added) in which the strict tool-call argument parser rejects truncated JSON from the model. M3's native loop will treat this as a recoverable transient with bounded retry, which should bring the effective success rate to ~99.78%. No change to M2.

How

Scheduler (internal/foreman/controller/agentictask_controller.go) evolved from the M0 logging stub into the real reconciler:

  • Normalizes empty phase to Pending so subsequent logic only branches on enum values it owns.
  • Cascade-fails (phase=Failed, reason=UpstreamFailed) when any dependsOn target is in phase=Failed.
  • Waits with requeue while any dependsOn target is pre-terminal.
  • First-fit FleetNode picker: alphabetical-by-name over Ready nodes whose capability satisfies the task's RequiredCapability (accelerator family, MinRAMGB <= AvailableRAMGB, MinContextTokens <= MaxContextTokens, NodeSelector subset of node labels). Skips nodes whose status.currentTask is non-empty (v0.1 worker-concurrency=1).
  • Watches FleetNode in addition to AgenticTask; re-enqueues all Pending tasks on each FleetNode event so a node going Ready dispatches immediately rather than waiting for the requeue-after timer.
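
The picker rules above can be sketched roughly as follows. This is a minimal illustration, not the code in agentictask_controller.go: the `Node` and `Requirement` structs, their field names, and the `"metal"` accelerator value are simplified assumptions standing in for the real FleetNode and RequiredCapability types.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a hypothetical, flattened view of a FleetNode (assumed fields).
type Node struct {
	Name             string
	Ready            bool
	Accelerator      string
	AvailableRAMGB   int
	MaxContextTokens int
	Labels           map[string]string
	CurrentTask      string // non-empty means the node is busy
}

// Requirement is a hypothetical view of a task's RequiredCapability.
type Requirement struct {
	Accelerator      string // empty is treated as "any" in this sketch
	MinRAMGB         int
	MinContextTokens int
	NodeSelector     map[string]string
}

// satisfies reports whether node n may take a task with requirement r.
func satisfies(n Node, r Requirement) bool {
	if !n.Ready || n.CurrentTask != "" {
		return false
	}
	if r.Accelerator != "" && r.Accelerator != n.Accelerator {
		return false
	}
	if r.MinRAMGB > n.AvailableRAMGB || r.MinContextTokens > n.MaxContextTokens {
		return false
	}
	// NodeSelector must be a subset of the node's labels.
	for k, want := range r.NodeSelector {
		if got, ok := n.Labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// pickNode returns the first satisfying node in alphabetical order by name.
func pickNode(nodes []Node, r Requirement) (string, bool) {
	sorted := append([]Node(nil), nodes...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })
	for _, n := range sorted {
		if satisfies(n, r) {
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	nodes := []Node{
		{Name: "m5-max", Ready: true, Accelerator: "metal", AvailableRAMGB: 96, MaxContextTokens: 131072},
		{Name: "busy-node", Ready: true, Accelerator: "metal", AvailableRAMGB: 96, MaxContextTokens: 131072, CurrentTask: "default/other"},
	}
	name, ok := pickNode(nodes, Requirement{Accelerator: "metal", MinRAMGB: 32})
	fmt.Println(name, ok)
}
```

Sorting a copy keeps the choice deterministic across reconciles, which is what makes first-fit easy to reason about in tests.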

Node-side watcher (pkg/foreman/agent/watcher.go):

  • Polls every --task-poll-interval (default 5s) for AgenticTasks where status.assignedNode == myNodeName && phase == Scheduled.
  • Claims via status merge-patch with optimistic concurrency; race losers see the new phase on the next poll.
  • Runs Executor.Execute in a goroutine; v0.1 keeps one task per node in flight via a mutex'd inflight slot.
  • Re-fetches the task before the terminal patch to avoid clobbering concurrent edits. On executor error, patches phase=Failed with verdict=INCOMPLETE and records the error in the Completed condition. A defensive path handles the contract violation where an executor returns a nil error together with a nil Result.
  • Returns ErrWatcherStalled after MaxConsecutiveFailures (default 3) List() failures in a row, mirroring pkg/agent/watcher.go's supervisor-restart pattern.
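
As an illustration of the one-task-per-node rule, a mutex-guarded inflight slot could look roughly like this. The `inflightSlot` type and its method names are hypothetical and are not taken from watcher.go; the blocked goroutine stands in for Executor.Execute.

```go
package main

import (
	"fmt"
	"sync"
)

// inflightSlot holds at most one task key at a time (hypothetical type).
type inflightSlot struct {
	mu   sync.Mutex
	task string
}

// tryAcquire claims the slot for task and reports whether it succeeded.
func (s *inflightSlot) tryAcquire(task string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.task != "" {
		return false // another task is already in flight
	}
	s.task = task
	return true
}

// release frees the slot so the next poll can claim a task.
func (s *inflightSlot) release() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.task = ""
}

func main() {
	var slot inflightSlot
	finish := make(chan struct{})
	done := make(chan struct{})

	if slot.tryAcquire("default/task-a") {
		go func() {
			defer close(done)    // runs second: signals that the slot is free
			defer slot.release() // runs first: frees the slot
			<-finish             // stands in for Executor.Execute
		}()
	}

	// While task-a is in flight, a second claim is refused.
	fmt.Println("second claim accepted:", slot.tryAcquire("default/task-b"))

	close(finish)
	<-done
	fmt.Println("claim after release accepted:", slot.tryAcquire("default/task-b"))
}
```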

Executor abstraction (pkg/foreman/agent/{executor,executor_stub,result}.go):

  • Executor interface: Kind() string + Execute(ctx, task) (*Result, error). Small interface so M3 (native agent loop), M4 (gate-job executor), and any future kind plug in behind the same shape.
  • Result is the foreman.v1 envelope shared between executors and any downstream consumer (planner evaluator, future DecisionLog, the human reviewer): SchemaVersion, Kind, Verdict, Summary, Extra, ElapsedSec.
  • AgenticTaskVerdict constants added to api/foreman/v1alpha1 (GO, NO-GO, INCOMPLETE, GATE-PASS, GATE-FAIL, GATE-ERROR). No CRD schema change; the enum validator was already on the type.
  • StubExecutor: sleeps for --stub-sleep (default 10s) and returns a synthetic GO-verdict Result. Used to validate dispatch end-to-end today; M3 swaps in the native agent loop behind the same interface.

Binary wiring (cmd/foreman-agent/main.go):

  • New flags: --task-poll-interval, --task-namespace, --stub-sleep.
  • Runs the registrar (M1) and the watcher (M2) concurrently via errgroup. On clean shutdown via SIGTERM both return nil; binary exits cleanly.

Tests: the M0 stub-smoke test in agentictask_controller_test.go is rewritten to exercise the M2 contract (Pending → Scheduled with capability matching, cascade-fail, dependency waits, busy-node skip). Existing M0+M1 tests untouched. Envtest harness from #501 carries forward.

Verification

Live demo on kind-llmkube-local with foreman-operator + foreman-agent (--heartbeat-interval=3s --task-poll-interval=2s --stub-sleep=8s) + apply of examples/foreman/m2-stub-task.yaml. Observed the full conditions trail in order:

Scheduled  True  FleetNodeAssigned   scheduled to FleetNode "m5-max"
Running    True  Claimed              claimed by m5-max
Completed  True  ExecutorSucceeded    stub executor slept 8.001s ...

Final status.result:

{
  "schemaVersion": "foreman.v1",
  "kind":          "stub",
  "verdict":       "GO",
  "elapsedSec":    8.001,
  "extra":         { "taskKind": "freeform", "agentName": "stub", "modelRef": "" },
  "summary":       "stub executor slept 8.001s on task default/m2-stub-demo"
}
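
For a downstream consumer, decoding this envelope might look like the sketch below. The JSON keys come from the output above; the Go struct names, the use of json.RawMessage for the kind-specific `extra` payload, and the `decodeStubResult` helper are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Result models the envelope; Extra stays raw until Kind is known.
type Result struct {
	SchemaVersion string          `json:"schemaVersion"`
	Kind          string          `json:"kind"`
	Verdict       string          `json:"verdict"`
	ElapsedSec    float64         `json:"elapsedSec"`
	Extra         json.RawMessage `json:"extra,omitempty"`
	Summary       string          `json:"summary"`
}

// StubExtra is a hypothetical type for the stub executor's extra payload.
type StubExtra struct {
	TaskKind  string `json:"taskKind"`
	AgentName string `json:"agentName"`
	ModelRef  string `json:"modelRef"`
}

// decodeStubResult parses the envelope, checks version and kind, then
// decodes the kind-specific extra payload.
func decodeStubResult(raw []byte) (Result, StubExtra, error) {
	var res Result
	var extra StubExtra
	if err := json.Unmarshal(raw, &res); err != nil {
		return res, extra, err
	}
	if res.SchemaVersion != "foreman.v1" {
		return res, extra, fmt.Errorf("unsupported schemaVersion %q", res.SchemaVersion)
	}
	if res.Kind != "stub" {
		return res, extra, fmt.Errorf("unexpected kind %q", res.Kind)
	}
	if len(res.Extra) > 0 {
		if err := json.Unmarshal(res.Extra, &extra); err != nil {
			return res, extra, err
		}
	}
	return res, extra, nil
}

// sample reproduces the status.result shown above.
const sample = `{
  "schemaVersion": "foreman.v1",
  "kind": "stub",
  "verdict": "GO",
  "elapsedSec": 8.001,
  "extra": { "taskKind": "freeform", "agentName": "stub", "modelRef": "" },
  "summary": "stub executor slept 8.001s on task default/m2-stub-demo"
}`

func main() {
	res, extra, err := decodeStubResult([]byte(sample))
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Println(res.Verdict, res.ElapsedSec, extra.TaskKind)
}
```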

make test passes on both arches. make lint reports 0 issues on darwin and on GOOS=linux (addresses the cross-arch gap captured in #503). No regressions in the LLMKube core inference flow.

Checklist

  • Tests added/updated (M0 stub test rewritten as M2 scheduler test)
  • make test passes locally (both arches)
  • make lint passes locally (both arches)
  • Commit messages follow conventional commits
  • All commits are signed off (git commit -s) per DCO
  • Documentation updated — README + chart README land alongside M6

What's next

M3 (#500): build the native agent loop in Go using OpenAI function calling, land the Agent CRD, ship the minimax-coder Agent. The retry policy for the upstream llama.cpp tools-strict-parse 500 (#22072) is the only architectural decision M3 takes on top of what M2 already proves works.

… (v0.1 M2)

M2 proves the Foreman dispatch loop end-to-end without depending on the
native agent loop (M3): a Pending AgenticTask is scheduled to a Ready
FleetNode by capability match, claimed by that node's foreman-agent,
handed to the configured Executor, and patched to terminal status with
the structured foreman.v1 Result envelope serialized into
status.result. Refs defilantech#500.

Scheduler (internal/foreman/controller/agentictask_controller.go):
  - Normalizes empty phase to Pending so the rest of the logic only
    branches on enum values it knows about.
  - Cascade-fails the task (phase=Failed, reason=UpstreamFailed) when
    any dependsOn target is itself Failed.
  - Waits with requeue while any dependsOn target is pre-terminal.
  - First-fit FleetNode picker: alphabetical-by-name over Ready nodes
    whose advertised capability satisfies the task's RequiredCapability
    (accelerator family, MinRAMGB <= AvailableRAMGB, MinContextTokens
    <= MaxContextTokens, NodeSelector subset of node labels). v0.2 may
    add least-loaded or LRU.
  - Watches FleetNode and re-enqueues every Pending AgenticTask on
    each FleetNode event so a node going Ready dispatches immediately
    rather than waiting for the requeue-after timer.
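
The dependency handling in the first three bullets can be sketched as a small pure function. The phase names come from this PR; the `Decision` type and the `gate` and `normalize` helpers are hypothetical and only illustrate the ordering of the checks (a Failed dependency wins over a still-running one).

```go
package main

import "fmt"

type Phase string

const (
	PhasePending   Phase = "Pending"
	PhaseScheduled Phase = "Scheduled"
	PhaseRunning   Phase = "Running"
	PhaseSucceeded Phase = "Succeeded"
	PhaseFailed    Phase = "Failed"
)

// Decision is a hypothetical result type for the dependency check.
type Decision string

const (
	Proceed     Decision = "Proceed"
	Wait        Decision = "Wait"
	CascadeFail Decision = "CascadeFail"
)

// normalize maps the empty phase to Pending so later logic only sees enum values.
func normalize(p Phase) Phase {
	if p == "" {
		return PhasePending
	}
	return p
}

// gate inspects the phases of a task's dependsOn targets.
func gate(deps []Phase) Decision {
	waiting := false
	for _, p := range deps {
		switch normalize(p) {
		case PhaseFailed:
			return CascadeFail // any failed upstream fails this task
		case PhaseSucceeded:
			// terminal and successful: nothing to wait for
		default:
			waiting = true // Pending, Scheduled, or Running
		}
	}
	if waiting {
		return Wait
	}
	return Proceed
}

func main() {
	fmt.Println(gate([]Phase{PhaseSucceeded, ""}))            // Wait
	fmt.Println(gate([]Phase{PhaseRunning, PhaseFailed}))     // CascadeFail
	fmt.Println(gate([]Phase{PhaseSucceeded, PhaseSucceeded})) // Proceed
}
```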

Node-side watcher (pkg/foreman/agent/watcher.go):
  - Polls AgenticTasks every --task-poll-interval (default 5s) for the
    set assigned to this node in phase=Scheduled.
  - Claims via status merge-patch with optimistic concurrency (race
    losers see the new phase on the next poll).
  - Runs Executor.Execute in a goroutine; v0.1 keeps one task per node
    in flight via a mutex'd inflight slot.
  - Re-fetches the task before the terminal patch to avoid clobbering
    concurrent edits. On executor error, patches phase=Failed with
    verdict=INCOMPLETE and the error in the Completed condition.
  - Returns ErrWatcherStalled after three consecutive List() failures,
    mirroring pkg/agent/watcher.go's supervisor-restart pattern.

Executor abstraction (pkg/foreman/agent/{executor,executor_stub,result}.go):
  - Executor interface: Kind() string + Execute(ctx, task) (*Result, error).
  - Result is the foreman.v1 envelope shared between executors and
    consumers downstream: SchemaVersion, Kind, Verdict, Summary, Extra
    (kind-discriminated), ElapsedSec.
  - AgenticTaskVerdict constants added to api/foreman/v1alpha1
    (GO, NO-GO, INCOMPLETE, GATE-PASS, GATE-FAIL, GATE-ERROR). No CRD
    schema change; the enum validator was already on the type.
  - StubExecutor: sleeps for --stub-sleep (default 10s) and returns a
    GO-verdict synthetic Result. Used to validate the dispatch loop
    today; M3 swaps in the native agent loop behind the same interface.

Binary wiring (cmd/foreman-agent/main.go):
  - Adds --task-poll-interval, --task-namespace, --stub-sleep flags.
  - Runs the registrar and the watcher concurrently via errgroup.
  - Updated startup log to reflect the M2 stub executor.

Verification on kind-llmkube-local:
  - Build, vet, lint clean. make test passes.
  - foreman-operator + foreman-agent + apply of
    examples/foreman/m2-stub-task.yaml.
  - Lifecycle observed: empty -> Pending -> Scheduled (assignedNode=
    m5-max, condition Scheduled=True/FleetNodeAssigned) -> Running
    (condition Running=True/Claimed) -> Succeeded after the stub's 8s
    sleep (condition Completed=True/ExecutorSucceeded; verdict=GO;
    status.result populated with foreman.v1 envelope).
  - No regressions in the LLMKube core inference path (envtest suite
    unchanged).

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added the enhancement (New feature or request) and area/foreman (Foreman: the agentic fleet orchestrator add-on) labels on May 20, 2026