feat: scaffold Foreman as an opt-in add-on (M0 + M1) #501

Merged: Defilan merged 5 commits into defilantech:main from Defilan:feat/foreman-crds on May 20, 2026

@Defilan (Member) commented May 20, 2026

What

Scaffolds Foreman, an opt-in add-on layered on LLMKube that schedules
agentic workloads (Workload, AgenticTask) across a fleet of nodes
(FleetNode). This PR covers v0.1 milestones M0 and M1:

  • M0: new API group foreman.llmkube.dev/v1alpha1 (Workload,
    AgenticTask, FleetNode), empty reconciler stubs, foreman-operator
    binary, charts/foreman skeleton.
  • M1: foreman-agent node-side daemon with FleetNode
    self-registration, a 30s heartbeat, and drain on SIGTERM.
    Cross-platform via build tags: darwin uses live sysctl + vm_stat
    memory probing; linux/amd64 builds against a stub until live
    probing lands in M4.

Why

Refs #500

Foreman is the fleet-scale evolution of the single-node autofix pipeline.
It is the fleet-aware control plane LLMKube's North Star always pointed at
("treat intelligence as a workload"): one layer up, from serving models to
running agentic workloads on them.

The LLMKube core stays untouched: a user who only wants Kubernetes-managed
local LLM serving installs LLMKube exactly as today and sees nothing new in
their cluster, RBAC, kubectl api-resources, or values file. Foreman is a
separate API group, a separate operator binary, a separate node-agent, a
separate Helm chart with dependsOn: llmkube. Same pattern as cert-manager
/ trust-manager, Istio base / istiod, kube-prometheus-stack /
kube-state-metrics.

How

Packaging is the design choice: same repo for iteration velocity, fully
separate everything that ships.

| Concern | LLMKube core (unchanged) | Foreman add-on (new) |
|---|---|---|
| API group | inference.llmkube.dev/v1alpha1 | foreman.llmkube.dev/v1alpha1 |
| Operator binary | cmd/main.go (llmkube-operator) | cmd/foreman-operator/main.go |
| Node agent | cmd/metal-agent (untouched) | cmd/foreman-agent (separate) |
| Go packages | api/v1alpha1/, internal/controller/, pkg/agent/ | api/foreman/v1alpha1/, internal/foreman/controller/, pkg/foreman/ |
| Helm chart | charts/llmkube (no new fields) | charts/foreman (dependsOn: llmkube) |

Surgical core changes (the only non-foreman files this PR touches,
all additive):

  • Makefile: new foreman-chart-crds target. Existing manifests /
    generate / chart-crds / test / lint targets are unchanged in
    behavior.
  • scripts/sync-crds.sh: narrow the glob from *.yaml to
    inference.llmkube.dev_*.yaml so the foreman group is not pulled into
    charts/llmkube. Inference CRDs continue to copy identically.
  • config/rbac/role.yaml: auto-regenerated; gains the foreman RBAC
    marker output. The LLMKube chart's ClusterRole is hand-authored in
    charts/llmkube/templates/clusterrole.yaml and lists only
    inference.llmkube.dev, so the LLMKube operator pod gains zero new
    privileges from this PR.
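
To illustrate the effect of the narrowed glob in scripts/sync-crds.sh, here is a small Go sketch that applies both patterns to a directory listing. The script itself is shell; this only models the matching. The file names returned by exampleCRDs are illustrative assumptions that follow the `<group>_<plural>.yaml` naming style, not a listing taken from the repository.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// selectCRDs returns the names that match pattern, mimicking how a shell
// glob in a sync script would pick CRD manifests out of one directory.
func selectCRDs(pattern string, names []string) []string {
	var out []string
	for _, n := range names {
		ok, err := filepath.Match(pattern, n)
		if err != nil {
			// filepath.Match only errors on a malformed pattern; skip here.
			continue
		}
		if ok {
			out = append(out, n)
		}
	}
	return out
}

// exampleCRDs is a hypothetical listing of the generated CRD directory.
func exampleCRDs() []string {
	return []string{
		"inference.llmkube.dev_models.yaml",
		"inference.llmkube.dev_inferenceservices.yaml",
		"foreman.llmkube.dev_workloads.yaml",
		"foreman.llmkube.dev_agentictasks.yaml",
		"foreman.llmkube.dev_fleetnodes.yaml",
	}
}

func main() {
	names := exampleCRDs()
	// The old pattern would copy every CRD, including the foreman group.
	fmt.Println("*.yaml:", selectCRDs("*.yaml", names))
	// The narrowed pattern keeps only the inference group.
	fmt.Println("inference.llmkube.dev_*.yaml:",
		selectCRDs("inference.llmkube.dev_*.yaml", names))
}
```

With the narrowed pattern, any file whose name starts with foreman.llmkube.dev_ is simply never matched, so the llmkube chart needs no exclusion list.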

LLMKube core inference flow is byte-identical: no changes in
api/v1alpha1/, internal/controller/, cmd/main.go,
cmd/metal-agent/, pkg/agent/, charts/llmkube/, go.mod, or
go.sum. pkg/foreman/agent imports pkg/agent.DarwinMemoryProvider
without modifying it.

M1 verification (kind-llmkube-local, --heartbeat-interval=3s,
10-second run):

NAME     PHASE   ACCELERATOR   RAM   CURRENT TASK   HEARTBEAT   AGE
m5-max   Ready   metal         22                   1s          10s

status.capability.totalRAMGB=128 (live sysctl hw.memsize),
availableRAMGB=22 (live vm_stat), installedModels=[minimax-m2-7],
maxContextTokens=131072, tokensPerSecond=47. Three heartbeat patches
over 10 s, all successful. SIGTERM produces phase=Draining; agent exits
cleanly.

M0 verification: kubectl apply of smoke-test Workload, AgenticTask,
and FleetNode against kind-llmkube-local all accepted; printer columns
render. The foreman-operator binary starts cleanly against kind; all
three reconcilers log reconciles on the smoke-test objects and return
ctrl.Result{} as the stub design intends.

What comes next (not in this PR; tracked by the epic): M2 lands the
capability-aware scheduler and a function-calling smoke test against
MiniMax M2.7 on the M5 Max. M3 builds the native agent loop (OAI-style
function-calling, in-process tool execution) and the Agent CRD. Foreman's
agentic loop is owned in Go natively, not by wrapping opencode.

Checklist

  • Tests added/updated
  • make test passes locally
  • make lint passes locally
  • Commit messages follow conventional commits
  • All commits are signed off (git commit -s) per DCO
  • Documentation updated (if user-facing change)

Tests note: M0+M1 are scaffolding; reconcilers are empty stubs and
the Registrar runs against a live apiserver. Unit + envtest coverage for
the Foreman packages lands with M2 (scheduler), when there is concrete
reconcile logic to test. The M1 demo above is the integration check.

Docs note: README + chart README updates land alongside M6 (the v0.1
ship gate), when Foreman is user-installable end-to-end. M0+M1 are not
yet user-installable: there is no operator Deployment or RBAC in
charts/foreman, only CRDs.

@Defilan added the labels enhancement (New feature or request) and area/foreman (Foreman: the agentic fleet orchestrator add-on) on May 20, 2026
Defilan added a commit to Defilan/LLMKube that referenced this pull request May 20, 2026
PR defilantech#501 shipped scaffolding (CRDs, foreman-operator, foreman-agent
Registrar, capability providers) without test coverage, on the plan
that 'unit + envtest coverage lands with M2'. This commit fronts that
work onto defilantech#501 so M0+M1 ship with proper tests instead of a deferred
promise. Refs defilantech#500.

Coverage delta on the new packages:

  pkg/foreman/agent              0.0% -> 87.0%
  internal/foreman/controller    0.0% -> 85.7%

What's covered:

cmd/foreman-agent/main_test.go (stdlib testing.T):
  - clampInt32: negative/zero/MaxInt32-bound/overflow paths.
  - sanitizeName: DNS-1123 cleanup (lowercase, invalid-char collapse,
    leading/trailing hyphen trim, empty-and-all-invalid fallback,
    63-char truncation, macOS '<name>.local' hostname case).
  - splitCSV: empty / single / multi / whitespace / empty-entries /
    separator-only cases. Found a real inconsistency along the way:
    empty input returned nil but separator-only returned []string{};
    splitCSV now collapses both to nil so the FleetNodeSpec.Roles and
    CapabilityOptions.InstalledModels fields see one 'absent'
    representation. No external callers depended on the distinction.
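
As a rough illustration of the behavior those cases pin down, here is a minimal sketch of two of the helpers. The function bodies are reconstructed from the test descriptions above and are not copied from the PR. In particular, the fallbackName value and the rule of stripping a trailing ".local" are assumptions made for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCSV splits a comma-separated flag value, trims whitespace, and drops
// empty entries. It returns nil, never an empty slice, when nothing remains,
// so callers see a single "absent" representation.
func splitCSV(s string) []string {
	var out []string
	for _, part := range strings.Split(s, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out // still nil if no entry was appended
}

// fallbackName is a hypothetical default for inputs with no usable characters.
const fallbackName = "fleet-node"

// sanitizeName turns a hostname into a DNS-1123 label: lowercase, runs of
// invalid characters collapsed to one hyphen, edge hyphens trimmed, and at
// most 63 characters. Stripping ".local" is an assumption of this sketch.
func sanitizeName(host string) string {
	s := strings.TrimSuffix(strings.ToLower(host), ".local")
	var b strings.Builder
	lastHyphen := false
	for _, r := range s {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
			lastHyphen = false
		} else if !lastHyphen {
			b.WriteByte('-')
			lastHyphen = true
		}
	}
	out := strings.Trim(b.String(), "-")
	if len(out) > 63 {
		// Truncation can expose a trailing hyphen, so trim again.
		out = strings.Trim(out[:63], "-")
	}
	if out == "" {
		return fallbackName
	}
	return out
}

func main() {
	fmt.Println(splitCSV("") == nil, splitCSV(" , ,") == nil)
	fmt.Println(splitCSV(" a, b ,,c "))
	fmt.Println(sanitizeName("M5-Max.local"), sanitizeName("My_Host!!Name"))
}
```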

pkg/foreman/agent/capability_darwin_test.go (//go:build darwin):
  - bytesToGB: zero, sub-1GB rounding, 36GB/128GB sanity, MaxInt32
    edge, and uint64-max saturation.
  - NewCapability: default-metal accelerator, explicit override
    honored, flag-supplied InstalledModels/MaxContextTokens/
    TokensPerSecond propagation.
  - Live memory probe sanity: TotalRAMGB > 0 and AvailableRAMGB <=
    TotalRAMGB on a real Darwin host; skip if sysctl unavailable
    (CI sandbox).
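
The bytesToGB edge cases can be pictured with a short sketch like the one below. It is an illustration only: it assumes binary gigabytes (1 << 30 bytes), truncating integer division for sub-1GB values, and an int32 result that saturates at MaxInt32. The PR's actual rounding rule and return type are not shown in this conversation.

```go
package main

import (
	"fmt"
	"math"
)

// bytesPerGB assumes binary gigabytes, which matches a 128 GB machine
// reporting 137438953472 bytes from hw.memsize.
const bytesPerGB = 1 << 30

// bytesToGB converts a byte count to whole gigabytes, truncating any
// remainder and saturating at math.MaxInt32 instead of overflowing.
func bytesToGB(b uint64) int32 {
	gb := b / bytesPerGB
	if gb > math.MaxInt32 {
		return math.MaxInt32
	}
	return int32(gb)
}

func main() {
	fmt.Println(bytesToGB(0))              // 0
	fmt.Println(bytesToGB(512 << 20))      // 0 (truncated in this sketch)
	fmt.Println(bytesToGB(128 << 30))      // 128
	fmt.Println(bytesToGB(math.MaxUint64)) // 2147483647 (saturated)
}
```

Saturating rather than wrapping matters because a wrapped value could turn into a negative RAM figure, which a capability-aware scheduler would likely misread as "no memory".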

pkg/foreman/agent/capability_other_test.go (//go:build !darwin):
  - Stub provider propagates all flag-supplied fields.
  - AvailableRAMGB == StaticTotalRAMGB in v0.1 until M4 wires up live
    Linux probing.
  - Empty Accelerator is preserved (no silent default on non-darwin).

pkg/foreman/agent/fleetnode_test.go (stdlib + fake client):
  - specEqual: 7 table-driven cases including role-ordering sensitivity.
  - Registrar.Upsert: creates if missing; updates if spec changed;
    no-ops (no resourceVersion bump) if spec identical.
  - Registrar.PatchHeartbeat: writes phase, fresh LastHeartbeatTime,
    full Capability snapshot.
  - Registrar.Run: heartbeats while running; drains (phase=Draining)
    on ctx cancel; exits cleanly within 2s.
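
The Upsert contract (create, update, or no-op) can be sketched against an in-memory store as shown below. All type and field names here (nodeSpec, store, the role strings) are hypothetical stand-ins rather than the PR's types, and the integer resourceVersion only imitates the idea that a no-op must not write.

```go
package main

import "fmt"

// nodeSpec is a hypothetical, trimmed-down stand-in for the FleetNode spec.
type nodeSpec struct {
	TailscaleAddr string
	Roles         []string
}

// specEqual compares two specs field by field. Roles are compared
// position by position, so the same roles in a different order differ.
func specEqual(a, b nodeSpec) bool {
	if a.TailscaleAddr != b.TailscaleAddr || len(a.Roles) != len(b.Roles) {
		return false
	}
	for i := range a.Roles {
		if a.Roles[i] != b.Roles[i] {
			return false
		}
	}
	return true
}

// store imitates an API server: it keeps objects by name and bumps a
// per-object resourceVersion on every write.
type store struct {
	objects         map[string]nodeSpec
	resourceVersion map[string]int
}

func newStore() *store {
	return &store{objects: map[string]nodeSpec{}, resourceVersion: map[string]int{}}
}

// upsert creates the object if missing, updates it if the spec changed,
// and performs no write at all if the spec is identical.
func (s *store) upsert(name string, want nodeSpec) string {
	have, ok := s.objects[name]
	switch {
	case !ok:
		s.objects[name] = want
		s.resourceVersion[name] = 1
		return "created"
	case specEqual(have, want):
		return "unchanged"
	default:
		s.objects[name] = want
		s.resourceVersion[name]++
		return "updated"
	}
}

func main() {
	s := newStore()
	spec := nodeSpec{TailscaleAddr: "100.64.0.1", Roles: []string{"executor", "verifier"}}
	fmt.Println(s.upsert("m5-max", spec), s.resourceVersion["m5-max"])
	fmt.Println(s.upsert("m5-max", spec), s.resourceVersion["m5-max"])
	spec.Roles = []string{"verifier", "executor"}
	fmt.Println(s.upsert("m5-max", spec), s.resourceVersion["m5-max"])
}
```

Treating role order as significant keeps the comparison cheap and predictable, at the cost of an extra update when only the flag order changes between runs.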

internal/foreman/controller/suite_test.go (Ginkgo + envtest):
  - Mirrors internal/controller/suite_test.go: BeforeSuite starts
    envtest, loads config/crd/bases/, registers
    foremanv1alpha1 into scheme. AfterSuite tears down.
  - Same getFirstFoundEnvTestBinaryDir helper for IDE-run support.

internal/foreman/controller/{agentictask,workload,fleetnode}_controller_test.go:
  - Stub-smoke contracts: each M0/M1 reconciler is exercised against
    a real apiserver and must (1) return no error for missing
    resources, (2) reconcile an existing resource without erroring,
    (3) leave .status unmutated. M2 deliberately breaks the
    agentictask contract with a corresponding test update.
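
The three-part stub contract can be modeled without any Kubernetes dependency, as in the sketch below. The types result, object, and mapReader are local stand-ins invented for this example; a real reconciler would use controller-runtime's ctrl.Result and would typically handle the missing-object case with client.IgnoreNotFound.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// result is a local stand-in for a reconcile result; a zero RequeueAfter
// means "do not requeue".
type result struct{ RequeueAfter time.Duration }

// errNotFound is a local stand-in for an API "not found" error.
var errNotFound = errors.New("not found")

// object is a hypothetical resource with a status the stub must not touch.
type object struct {
	Name   string
	Status string
}

// mapReader is an in-memory lookup used in place of an API client.
type mapReader map[string]*object

func (m mapReader) Get(name string) (*object, error) {
	if o, ok := m[name]; ok {
		return o, nil
	}
	return nil, errNotFound
}

// reconcileStub follows the stub contract: a missing object is not an
// error, an existing object is only logged, and status is never written.
func reconcileStub(r mapReader, name string) (result, error) {
	obj, err := r.Get(name)
	if errors.Is(err, errNotFound) {
		return result{}, nil // object was deleted; nothing to do
	}
	if err != nil {
		return result{}, err
	}
	fmt.Println("reconciling", obj.Name)
	return result{}, nil
}

func main() {
	r := mapReader{"task-a": &object{Name: "task-a", Status: "Pending"}}
	fmt.Println(reconcileStub(r, "missing"))
	fmt.Println(reconcileStub(r, "task-a"))
	fmt.Println(r["task-a"].Status)
}
```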

CI: no .github/workflows/*.yml changes needed. The existing
test.yml (.github/workflows/test.yml) runs make test, which globs
the foreman packages automatically.

Signed-off-by: Christopher Maher <chris@mahercode.io>
Defilan added 5 commits May 20, 2026 00:47
Foreman is an opt-in add-on layered on LLMKube that schedules agentic
workloads (Workload, AgenticTask) across a fleet of nodes (FleetNode).
Installing LLMKube alone does not install or require it.

M0 is the scaffolding milestone: types, controller stubs, operator
binary, Helm chart skeleton. The reconcilers log and return for now;
real scheduling lands in M2, the planner in M6.

New API group foreman.llmkube.dev/v1alpha1:
- Workload: the v0.1 entrypoint, a natural-language intent the planner
  decomposes into AgenticTasks.
- AgenticTask: a dispatchable unit of work (issue-fix, verify, freeform),
  with RequiredCapability for capability-aware scheduling.
- FleetNode: cluster-scoped registry entry the FleetAgent owns; carries
  the heartbeat and the capability the scheduler matches against.

New paths:
- api/foreman/v1alpha1/        the three CRD types + groupversion_info
- internal/foreman/controller/ empty reconciler stubs (one per kind)
- cmd/foreman-operator/        the new operator binary, separate from
                               cmd/main.go; only registers the foreman
                               group, leader-election ID is its own
- charts/foreman/              new Helm chart, dependsOn llmkube

Core touches are surgical and inference-flow-byte-identical:
- scripts/sync-crds.sh now scopes its glob to inference.llmkube.dev_*
  so foreman CRDs are not pulled into the llmkube chart.
- Makefile gains foreman-chart-crds (mirrors chart-crds for the foreman
  chart). manifests / generate / chart-crds are untouched in behavior;
  they still produce exactly the same inference outputs.
- config/rbac/role.yaml grows by the kubebuilder:rbac markers on the
  three foreman reconcilers (auto-regenerated by make manifests).

Verification:
- make generate produces api/foreman/v1alpha1/zz_generated.deepcopy.go.
- make manifests produces the three foreman CRD YAMLs.
- make foreman-chart-crds copies them into charts/foreman/templates/crds.
- make chart-crds remains inference-only (verified: charts/llmkube/templates/crds
  has only the three inference CRDs).
- make test passes the full envtest suite; no core regressions.
- make lint passes (0 issues).
- go build ./cmd/foreman-operator produces a working binary.
- kubectl apply of each foreman CRD against kind-llmkube-local accepts
  a real object; the operator's three reconcilers log the reconcile and
  return ctrl.Result{} as the stub design intends.

Part of the Foreman v0.1 MVP plan: M0 done; M1 (FleetNode heartbeat) next.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…t (v0.1 M1)

The Foreman node-side daemon. One foreman-agent runs per fleet host. In
M1 it owns a single responsibility: keep this host's FleetNode CR
present and current so the scheduler (lands in M2) can target it.

Lifecycle:
  - on startup: upsert the FleetNode (create if missing, update spec if
    flag-supplied identity changed since last run);
  - every --heartbeat-interval (default 30s): patch FleetNode.status with
    phase=Ready, fresh lastHeartbeatTime, current capability snapshot;
  - on SIGTERM/SIGINT: best-effort drain patch (phase=Draining) so the
    scheduler stops dispatching to this node before the process exits.
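
The heartbeat and drain steps above can be sketched with a ticker and a signal-aware context. This is a minimal illustration under stated assumptions, not the PR's Registrar: phasePatcher and recorder are hypothetical stand-ins for the FleetNode status patch, the 2-second drain timeout is an arbitrary choice, and demo adds an artificial time limit so the example terminates without a real signal.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// phasePatcher is a hypothetical stand-in for the FleetNode status patch.
type phasePatcher interface {
	PatchPhase(ctx context.Context, phase string) error
}

// recorder remembers every phase it is asked to write.
type recorder struct{ phases []string }

func (r *recorder) PatchPhase(_ context.Context, phase string) error {
	r.phases = append(r.phases, phase)
	return nil
}

// run patches phase=Ready on every tick and, once ctx is cancelled,
// sends one best-effort phase=Draining patch before returning.
func run(ctx context.Context, p phasePatcher, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := p.PatchPhase(ctx, "Ready"); err != nil {
				fmt.Println("heartbeat failed, retrying on next tick:", err)
			}
		case <-ctx.Done():
			// ctx is already cancelled, so the drain needs its own context.
			drainCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			defer cancel()
			return p.PatchPhase(drainCtx, "Draining")
		}
	}
}

// demo wires SIGINT/SIGTERM into the context and also stops after
// runForMS milliseconds so that the example ends on its own.
func demo(intervalMS, runForMS int) []string {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	ctx, cancel := context.WithTimeout(ctx, time.Duration(runForMS)*time.Millisecond)
	defer cancel()

	rec := &recorder{}
	if err := run(ctx, rec, time.Duration(intervalMS)*time.Millisecond); err != nil {
		fmt.Println("drain patch failed:", err)
	}
	return rec.phases
}

func main() {
	fmt.Println(demo(50, 220))
}
```

Using a fresh context for the drain patch is the key detail: reusing the cancelled context would make the final patch fail immediately, and the scheduler would only notice the node was gone once its heartbeat went stale.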

Cross-platform:
  - capability_darwin.go uses the metal-agent's existing
    DarwinMemoryProvider (sysctl hw.memsize + vm_stat) so available RAM
    is live, not flag-supplied. Defaults accelerator=metal.
  - capability_other.go is a stub for linux/amd64 so the binary builds
    cross-arch from day one. Live probing on Linux + NVIDIA lands at M4
    when ShadowStack joins the fleet.

Reuse, not modification: pkg/foreman/agent imports
pkg/agent.DarwinMemoryProvider but does not touch it. The LLMKube
metal-agent's behavior is unchanged.

Flags:
  --fleet-node-name, --tailscale-addr, --roles, --accelerator,
  --installed-models, --max-context-tokens, --tokens-per-second,
  --total-ram-gb, --heartbeat-interval, --kube-context,
  --workspace-dir, --opencode-bin (last two are placeholders the M3
  executor will require).
  --kubeconfig is auto-registered by controller-runtime's config init.

New paths:
  - pkg/foreman/agent/fleetnode.go         Registrar (Upsert/Run/PatchHeartbeat)
  - pkg/foreman/agent/capability.go        CapabilityOptions
  - pkg/foreman/agent/capability_darwin.go DarwinMemoryProvider backed
  - pkg/foreman/agent/capability_other.go  !darwin stub
  - cmd/foreman-agent/main.go              the binary

Verification on kind-llmkube-local, --heartbeat-interval=3s, 10s run:
  - kubectl get fleetnodes
      NAME     PHASE   ACCELERATOR   RAM   CURRENT TASK   HEARTBEAT   AGE
      m5-max   Ready   metal         22                   1s          10s
  - status.capability.totalRAMGB=128 (live sysctl), availableRAMGB=22
    (live vm_stat), installedModels=[minimax-m2-7],
    maxContextTokens=131072, tokensPerSecond=47.
  - 3 heartbeat patches over 10s, all successful.
  - SIGTERM produced phase=Draining; agent exited cleanly.
  - make test (full envtest), make lint (0 issues), go vet all clean.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…y lll

CI's golangci-lint v2.4.0 on linux caught a 123-character line in
the //go:build !darwin variant of the capability test that the M5 Max
local lint missed (the file does not compile on darwin, so the
darwin-side lint never sees it). Wrapped the t.Errorf to keep all
lines under the 120-char limit.

Signed-off-by: Christopher Maher <chris@mahercode.io>
…ueAfter

controller-runtime deprecated Result.Requeue (bool) in favor of
expressing 'no requeue' as RequeueAfter == 0. The neighboring
Expect(res.RequeueAfter).To(BeZero()) already covers the assertion,
so dropping the Result.Requeue check resolves SA1019 staticcheck
without changing test semantics.

Caught locally via GOOS=linux golangci-lint after the previous
darwin-only run missed it; same cross-arch gotcha covered in
feedback_cross_arch_lint.md.

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan force-pushed the feat/foreman-crds branch from 09db359 to 662dadc on May 20, 2026 07:48
@Defilan merged commit cd40491 into defilantech:main on May 20, 2026 (20 checks passed)
@Defilan deleted the feat/foreman-crds branch on May 20, 2026 08:02