fleetnode: run with heartbeat + plugin scaffolding (stack 2/3) #323

Merged
ankitgoswami merged 7 commits into main from ankitg/fleetnode-run-heartbeat
May 28, 2026

@ankitgoswami
Contributor

Summary

PR 2 of 3. Stacked on #322. Lights up `fleetnode run` as a heartbeat-only daemon and carries the agent's runtime scaffolding (plugin-dir validator, orphan reaper, plugin manager wrapper) that PR 3's control loop consumes.

  • `fleetnode run`. State-lock acquisition (`WithStateLock`), session-refresh-on-tick (1h leeway), 30s heartbeat loop. SIGINT/SIGTERM/SIGHUP shutdown. Logs to stdout so IDE-attached terminals see them.
  • Plugin-dir validator. `resolvePluginsDir` checks `<exe-dir>/plugins` ownership + permissions; `validatePluginFiles` walks each entry, rejects symlinks and group/world-writable files, and requires regular files owned by root or the agent uid.
  • Orphan reaper sweeps stray plugin children from prior crashes before the new agent spawns its own.
  • Plugin manager wrapper. `pluginDiscoverer` + `newPluginDiscoverer` load subprocess plugins via `hashicorp/go-plugin`. Heartbeat-only mode loads them but never dispatches — the control loop lands in PR 3. `reportFromDiscovered` + `synthesizeIdentifier` ride along for the next layer.
  • `stubGatewayClient` in `run_test.go` already satisfies the full gateway interface (UploadHeartbeat / ReportDiscoveredDevices / ControlStream) so PR 3 doesn't have to retouch the fixture.
  • README grows Plugins, Run, Security-plugins, Troubleshooting sections.
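
The heartbeat behavior described in the summary can be sketched as a ticker-driven loop under a signal-aware context. This is a minimal illustration, not the code from `run.go`: `heartbeatLoop`, `runDaemon`, `simulate`, and the continue-on-error policy are assumptions made for the example, and SIGHUP is left out because it only exists on Unix.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// heartbeatLoop calls beat once per tick until ctx is cancelled. A failed
// beat is handed to onErr and the loop continues; that retry-on-next-tick
// policy is an assumption of this sketch.
func heartbeatLoop(ctx context.Context, ticks <-chan time.Time,
	beat func(context.Context) error, onErr func(error)) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticks:
			if err := beat(ctx); err != nil {
				onErr(err)
			}
		}
	}
}

// runDaemon shows production-style wiring: a signal-aware context and a 30s
// ticker, with errors logged to stdout. It is not called from main because
// it blocks until a signal arrives.
func runDaemon(beat func(context.Context) error) {
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	heartbeatLoop(ctx, ticker.C, beat, func(err error) {
		fmt.Fprintln(os.Stdout, "heartbeat failed:", err)
	})
}

// simulate drives the loop with n synthetic ticks and returns the beat count.
func simulate(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	ticks := make(chan time.Time)
	done := make(chan struct{})
	count := 0
	go func() {
		defer close(done)
		heartbeatLoop(ctx, ticks, func(context.Context) error {
			count++
			return nil
		}, func(error) {})
	}()
	for i := 0; i < n; i++ {
		ticks <- time.Now() // unbuffered: returns once the loop has taken the tick
	}
	cancel()
	<-done // the loop has exited, so reading count is race-free
	return count
}

func main() {
	fmt.Println("beats:", simulate(3))
}
```

Passing the tick channel in, rather than creating the ticker inside the loop, keeps the loop testable without waiting on real time.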

Stack

  1. PR 1: install + enrollment hardening → `main`
  2. PR 2 (this one): `fleetnode run` heartbeat + plugin scaffolding → PR 1
  3. PR 3: discovery (`ControlStream` + nmap + IPList normalizer) → PR 2

Test plan

  • `go build ./server/...`
  • `go vet ./server/...`
  • `golangci-lint run` → 0 issues
  • `go test -race ./server/cmd/fleetnode/... ./server/internal/fleetnodebootstrap/...`
  • `just build-fleetnode` then `./server/.fleetnode/fleetnode run --server-url=http://localhost:4000` after enrolling; verify heartbeat logs every 30s and `acquiring state lock` succeeds

🤖 Generated with Claude Code

@ankitgoswami ankitgoswami requested a review from a team as a code owner May 27, 2026 19:22
@github-actions github-actions Bot added the documentation and server labels May 27, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2226f776c0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/control.go
Comment thread server/cmd/fleetnode/orphan_reaper.go Outdated

github-actions Bot commented May 27, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (fbe3efba53ed76135e0b884780fd6a438b5213d9...e65754cf9e677e9a313439cc008306c1f7dacc51, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] Optional Plugin Bootstrap Can Take Down Heartbeat-Only Daemon

  • Category: Plugin | Reliability
  • Location: server/cmd/fleetnode/run.go:115
  • Description: fleetnode run now loads every executable in the adjacent plugins/ directory before entering runLocked. If any plugin fails to load, run exits before session refresh or UploadHeartbeat. In this diff, the daemon still only sends heartbeats; the control/discovery loop is not implemented yet, so optional discovery plugin health now gates the core liveness path.
  • Impact: A stale, wrong-architecture, corrupted, or otherwise broken plugin can make the fleet node stop heartbeating entirely. With the staged layout including multiple plugins, one bad optional driver can make the server mark the node offline and prevent normal fleet-node operation.
  • Recommendation: Keep heartbeat/session maintenance independent from plugin bootstrap. Treat unsafe plugin directory permissions as hard-fail, but for plugin load/runtime failures either continue heartbeat-only with discovery disabled, or defer plugin loading until the control/discovery loop actually needs it.
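
The recommendation above can be sketched as an error classification at bootstrap time. This illustrates the reviewer's suggestion only; it is not confirmed to be what the PR merged. The sentinel errors and `bootstrapPlugins` are hypothetical names introduced for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinels: one for directory-safety failures (hard fail) and
// one for per-plugin load failures (degrade to heartbeat-only).
var (
	errUnsafePluginsDir = errors.New("plugins dir has unsafe ownership or permissions")
	errPluginLoad       = errors.New("plugin failed to load")
)

// bootstrapPlugins runs load and decides what the daemon should do next.
// It returns whether discovery is enabled and a non-nil fatal error only
// when the daemon must not start at all.
func bootstrapPlugins(load func() error) (discoveryEnabled bool, fatal error) {
	err := load()
	switch {
	case err == nil:
		return true, nil
	case errors.Is(err, errUnsafePluginsDir):
		return false, err // security problem: refuse to run
	default:
		// Load or runtime failure: a real daemon would log err here and
		// keep heartbeating with discovery disabled.
		return false, nil
	}
}

func main() {
	cases := []struct {
		name string
		load func() error
	}{
		{"healthy", func() error { return nil }},
		{"broken plugin", func() error { return fmt.Errorf("load driver: %w", errPluginLoad) }},
		{"unsafe dir", func() error { return fmt.Errorf("resolve: %w", errUnsafePluginsDir) }},
	}
	for _, c := range cases {
		enabled, fatal := bootstrapPlugins(c.load)
		fmt.Printf("%s: discovery=%v fatal=%v\n", c.name, enabled, fatal != nil)
	}
}
```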

Notes

No changed hunk showed SQL interpolation, auth bypass, pool/wallet rewriting, hardcoded payout addresses, protobuf wire-format changes, or new frontend/infrastructure exposure.

I attempted a targeted go test ./cmd/fleetnode, but this environment is read-only and Go could not create its module/cache directories, so tests were not run.


Generated by Codex Security Review | Triggered by: @ankitgoswami | Review workflow run

@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-run-heartbeat branch from 2226f77 to 3694b1f Compare May 27, 2026 19:49

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3694b1f5d0

Comment thread server/cmd/fleetnode/orphan_reaper.go Outdated
Base automatically changed from ankitg/fleetnode-install-enrollment to main May 27, 2026 21:27
@ankitgoswami ankitgoswami changed the base branch from main to ankitg/fleetnode-install-enrollment May 27, 2026 21:29
@ankitgoswami ankitgoswami changed the base branch from ankitg/fleetnode-install-enrollment to main May 27, 2026 21:29
ankitgoswami and others added 2 commits May 27, 2026 14:31
- fleetnode run daemon: state lock acquisition, session-refresh-on-tick,
  30s heartbeat loop. SIGINT/SIGTERM/SIGHUP shutdown. Logs to stdout for
  IDE-terminal visibility.
- Plugin scaffolding lives on disk: resolvePluginsDir validates
  <exe-dir>/plugins ownership + permissions, validatePluginFiles checks
  each entry (rejects symlinks + group/world-writable + non-owner files).
- Orphan reaper sweeps stray plugin children from prior crashes before
  the new agent spawns its own.
- Plugin manager wrapper (pluginDiscoverer / newPluginDiscoverer) loads
  subprocess plugins via hashicorp/go-plugin. No control loop yet, so
  the manager just bootstraps and idles; reportFromDiscovered +
  synthesizeIdentifier ride along for the discovery layer.
- stubGatewayClient in run_test.go satisfies the full gateway interface
  (UploadHeartbeat / ReportDiscoveredDevices / ControlStream) so PR 3
  can wire the control loop without touching the test fixture.
- README grows Plugins, Run, Troubleshooting sections.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
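
The per-file checks listed in the commit message above (reject symlinks, require regular files, reject group/world-writable bits) can be sketched as a pure function over a file mode. This is an illustration under assumptions: `checkPluginMode` is a hypothetical name, the mode is assumed to come from `os.Lstat` so symlinks are not followed, and the ownership check is omitted because it needs the Unix-only `syscall.Stat_t`.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// exampleSymlink is the mode os.Lstat would typically report for a symlink.
const exampleSymlink = fs.ModeSymlink | 0o777

// checkPluginMode applies the mode-based checks to one directory entry.
// The caller is expected to pass the mode from os.Lstat, not os.Stat.
func checkPluginMode(mode fs.FileMode) error {
	if mode&fs.ModeSymlink != 0 {
		return errors.New("symlinks are not allowed")
	}
	if !mode.IsRegular() {
		return errors.New("not a regular file")
	}
	if mode.Perm()&0o022 != 0 {
		return errors.New("file is group- or world-writable")
	}
	return nil
}

func main() {
	modes := []fs.FileMode{0o755, 0o775, 0o757, exampleSymlink, fs.ModeDir | 0o755}
	for _, m := range modes {
		fmt.Printf("%v -> %v\n", m, checkPluginMode(m))
	}
}
```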
- newPluginDiscoverer now calls manager.Shutdown with a bounded timeout
  on LoadPlugins failure. Aggregate-error returns can still leave earlier
  plugins started, and the previous no-op cleanup leaked those
  subprocesses when the agent exited.
- reapOrphanedPlugins now consults ppid. Two agents sharing the binary
  and plugins layout but holding different state locks no longer kill
  each other's live plugin children. Reap fires only when the parent is
  init (ppid == 1) or no longer present in the ps snapshot.
- Add orphan_reaper_test.go covering the live-parent skip, the self-pid
  skip, and the dead-parent reap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
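
The ppid rule from the commit message above (reap only when the parent is init or missing from the `ps` snapshot, and never reap the agent itself) can be sketched as a selection function. `psEntry`, its fields, and `selectOrphans` are hypothetical names for this example; the real reaper's data shapes are not shown in this conversation.

```go
package main

import "fmt"

// psEntry is one row of a process snapshot (hypothetical shape).
type psEntry struct {
	pid, ppid int
	isPlugin  bool // the command matched an installed plugin path
}

// selectOrphans returns the PIDs of plugin processes whose parent is init
// (ppid == 1) or absent from the snapshot. The agent's own PID is skipped.
func selectOrphans(snapshot []psEntry, selfPID int) []int {
	present := make(map[int]bool, len(snapshot))
	for _, e := range snapshot {
		present[e.pid] = true
	}
	var out []int
	for _, e := range snapshot {
		if !e.isPlugin || e.pid == selfPID {
			continue
		}
		if e.ppid == 1 || !present[e.ppid] {
			out = append(out, e.pid)
		}
	}
	return out
}

func main() {
	snapshot := []psEntry{
		{pid: 1, ppid: 0},
		{pid: 100, ppid: 1},                 // another live agent
		{pid: 101, ppid: 100, isPlugin: true}, // child of the live agent: keep
		{pid: 200, ppid: 1, isPlugin: true},   // reparented to init: reap
		{pid: 300, ppid: 999, isPlugin: true}, // parent gone from snapshot: reap
		{pid: 400, ppid: 1, isPlugin: true},   // ourselves: skip
	}
	fmt.Println(selectOrphans(snapshot, 400))
}
```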
@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-run-heartbeat branch from 3694b1f to dc50f50 Compare May 27, 2026 21:34

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dc50f5019b

Comment thread server/cmd/fleetnode/orphan_reaper.go Outdated
- Reject symlinked plugins dir at resolve time so the loader and
  orphan reaper agree on the canonical path.
- Wire signal.NotifyContext above plugin loading so SIGTERM during
  the up-to-60s LoadPlugins window aborts cleanly. Plumb the ctx
  through reapOrphanedPlugins and newPluginDiscoverer.
- Bound the cleanup-path manager.Shutdown to 10s instead of
  context.Background(), matching the partial-load error path.
- Add 5s timeout on the orphan-reaper ps invocation; a hung ps no
  longer blocks startup while the state lock is held.
- Tighten the reaper match to direct children of pluginsAbs so an
  operator process invoked from a subpath isn't reaped.
- Snapshot state under stateMu before SaveState so the marshal
  doesn't race the tokenSource goroutine the control loop adds.
- Rename run() parameter stderr -> logOutput to match the actual
  destination (os.Stdout).
- Cover empty/whitespace ps output, kill-failure-continues,
  subdirectory-process-skip, symlinked plugins dir, and isolated
  group-/world-writable bits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
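
The "snapshot state under stateMu before SaveState" item above can be sketched as copy-under-lock followed by marshalling outside the lock. The `state` fields, the `agent` type, and `marshalState` (standing in for `SaveState`) are assumptions for illustration; only the `stateMu` name comes from the commit message.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// state holds only value-typed fields, so a struct copy is a full snapshot.
// A struct with maps or slices would need a deep copy instead.
type state struct {
	SessionToken string `json:"session_token"`
	ExpiresUnix  int64  `json:"expires_unix"`
}

type agent struct {
	stateMu sync.Mutex
	st      state
}

// setToken is what a concurrent token-refresh goroutine would call.
func (a *agent) setToken(tok string, exp int64) {
	a.stateMu.Lock()
	defer a.stateMu.Unlock()
	a.st.SessionToken = tok
	a.st.ExpiresUnix = exp
}

// snapshot copies the state while holding the lock.
func (a *agent) snapshot() state {
	a.stateMu.Lock()
	defer a.stateMu.Unlock()
	return a.st
}

// marshalState serializes the copy, so encoding never reads fields that
// another goroutine may be writing.
func (a *agent) marshalState() ([]byte, error) {
	return json.Marshal(a.snapshot())
}

func main() {
	a := &agent{}
	a.setToken("t1", 42)
	b, err := a.marshalState()
	fmt.Println(string(b), err)
}
```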
Copilot AI review requested due to automatic review settings May 27, 2026 21:43

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d5c0ff64ca

Comment thread server/cmd/fleetnode/orphan_reaper.go Outdated
Contributor

Copilot AI left a comment

Pull request overview

This PR (stack 2/3) enables fleetnode run to operate as a long-running heartbeat-only daemon and introduces runtime scaffolding for the upcoming discovery control loop: plugin directory validation, orphaned plugin process reaping, and a plugin-backed discoverer wrapper.

Changes:

  • Implement fleetnode run heartbeat loop with session refresh and signal-based shutdown.
  • Add plugin directory resolution + safety validation (ownership/permissions, reject symlinks) and plugin discovery scaffolding.
  • Add orphan plugin subprocess reaper and expand README + tests around the new behaviors.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `server/cmd/fleetnode/run.go` | Heartbeat daemon loop, state-lock usage, plugin bootstrap wiring, signal handling |
| `server/cmd/fleetnode/run_test.go` | Stub gateway extended to satisfy upcoming gateway interface needs |
| `server/cmd/fleetnode/README.md` | Document plugins, run behavior, security constraints, and troubleshooting |
| `server/cmd/fleetnode/plugins_dir.go` | Resolve `<exe-dir>/plugins` and enforce safety checks before use |
| `server/cmd/fleetnode/plugins_dir_unix.go` | Unix ownership/permission validation for plugin dir and plugin binaries |
| `server/cmd/fleetnode/plugins_dir_windows.go` | Windows plugin dir existence checks (ACL validation deferred) |
| `server/cmd/fleetnode/plugins_dir_test.go` | Unit tests for plugin dir resolution and per-file validation (non-Windows) |
| `server/cmd/fleetnode/orphan_reaper.go` | Non-Windows orphan plugin process sweep using `ps` output + SIGKILL |
| `server/cmd/fleetnode/orphan_reaper_windows.go` | Windows no-op orphan reaper placeholder |
| `server/cmd/fleetnode/orphan_reaper_test.go` | Unit tests for orphan reaper selection logic (non-Windows) |
| `server/cmd/fleetnode/control.go` | Plugin discoverer wrapper + report/identifier synthesis utilities |
| `server/cmd/fleetnode/control_test.go` | Unit tests for report synthesis + identifier behavior |

Comment thread server/cmd/fleetnode/orphan_reaper_windows.go Outdated
Comment thread server/cmd/fleetnode/run.go
Comment thread server/cmd/fleetnode/run.go
- orphan_reaper: drop filepath.EvalSymlinks so reaper prefix matches the
  unresolved path the loader passes to exec.Command (resolvePluginsDir
  already rejects symlinked leaf; path components above can still be
  symlinks and were producing a loader-vs-reaper path mismatch).
- orphan_reaper: re-check strings.HasPrefix(argv0, prefix) after
  truncating at the first space so plugin install paths containing
  spaces (e.g. "/opt/Proto Fleet/plugins/x") don't slice into the
  prefix and panic with a slice-bounds error.
- orphan_reaper_windows: match the Unix signature
  (ctx context.Context, string, *slog.Logger) so the shared run.go
  caller compiles on Windows.
- signals_unix / signals_windows: extract defaultSignals() to a
  platform-specific helper so syscall.SIGHUP (Unix-only) no longer
  leaks into the shared run.go file.
- run.go: log "using injected discoverer" when r.discoverer is non-nil
  and resolvedPluginsDir is empty (test-injection path), instead of
  the misleading "heartbeat only" message.
- plugins_dir_windows: refuse plugin loading when a plugins dir is
  present, until proper Windows ACL validation lands. Heartbeat-only
  mode (no plugins dir) remains the default; this only closes the
  silent-no-validation case Codex Security Review flagged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 82c698016b

Comment thread server/cmd/fleetnode/orphan_reaper.go Outdated
The previous fix re-checked HasPrefix(argv0, prefix) after truncating on
the first space, which prevented the slice-bounds panic but also caused
the reaper to silently skip every orphan when the install path itself
contains whitespace. ps space-joins argv without quoting, so any argv0
parsing from the reconstructed command is ambiguous when the prefix
contains spaces.

Switch to matching against the actual installed-plugin paths from
os.ReadDir(pluginsDir): e.command must equal an allowed path, or start
with an allowed path followed by a space (its first argument). This
sidesteps ps's argv-joining ambiguity entirely and reaps correctly on
paths like "/opt/Proto Fleet/plugins/...".

Test TestReapOrphans_PluginsPathWithSpaces now asserts the orphans ARE
reaped (previously asserted they were skipped without panic, which is
the regression Codex flagged in discussion_r3314133920).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
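
The matching rule in the commit message above (the `ps` command must equal an installed plugin path, or start with one followed by a space) can be sketched as follows. `allowedPaths` and `matchesPlugin` are hypothetical names; only the use of `os.ReadDir` and the equal-or-prefix-plus-space rule come from the message.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// allowedPaths lists the regular files directly inside pluginsDir as full
// paths. These are the only commands the reaper is allowed to match.
func allowedPaths(pluginsDir string) ([]string, error) {
	entries, err := os.ReadDir(pluginsDir)
	if err != nil {
		return nil, err
	}
	var paths []string
	for _, e := range entries {
		if e.Type().IsRegular() {
			paths = append(paths, filepath.Join(pluginsDir, e.Name()))
		}
	}
	return paths, nil
}

// matchesPlugin reports whether a space-joined ps command line belongs to
// one of the allowed plugin paths. No argv0 splitting is attempted, so
// spaces inside the install path are harmless.
func matchesPlugin(command string, allowed []string) bool {
	for _, p := range allowed {
		if command == p || strings.HasPrefix(command, p+" ") {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"/opt/Proto Fleet/plugins/x"}
	for _, cmd := range []string{
		"/opt/Proto Fleet/plugins/x",
		"/opt/Proto Fleet/plugins/x --serve",
		"/opt/Proto Fleet/plugins/xyz",
		"/opt/Proto",
	} {
		fmt.Printf("%q -> %v\n", cmd, matchesPlugin(cmd, allowed))
	}
}
```

Requiring the trailing space on the prefix form is what keeps `/opt/Proto Fleet/plugins/xyz` from matching the plugin `x`.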

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 465bddca5c

Comment thread server/cmd/fleetnode/plugins_dir_unix.go Outdated
- plugins_dir_unix: apply ownership and writability checks to every
  regular file in the plugins dir, not just executables. A non-exec
  file owned by another user can be chmod +x'd between validation and
  plugin load, becoming a stealth RCE vector under the agent uid.
- plugins_dir_unix: add validatePathChain to walk every ancestor of
  the plugins dir and require root/agent ownership + non-writable
  perms (with a sticky-bit exception for /tmp-style dirs). Walks both
  the original path and the symlink-resolved form so a swappable
  symlink in the chain is caught via its containing-dir's perms and
  the underlying target tree is caught via the resolved walk.
- plugins_dir_windows: matching no-op stub so resolvePluginsDir's
  shared call sequence type-checks.
- orphan_reaper: hard-code /bin/ps instead of letting exec resolve
  via PATH. A hostile $PATH could inject a shim that runs as the
  daemon during startup; /bin/ps is canonical on Linux and macOS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
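
The ancestor walk and sticky-bit exception described for `validatePathChain` above can be sketched with two small helpers. This is a simplified illustration: `ancestors` and `ancestorModeOK` are hypothetical names, the ownership requirement is omitted, and the second walk over the symlink-resolved path is not shown.

```go
package main

import (
	"fmt"
	"io/fs"
	"path/filepath"
)

// tmpStyleDir is the mode of a /tmp-style directory: world-writable, sticky.
const tmpStyleDir = fs.ModeDir | fs.ModeSticky | 0o777

// ancestors returns dir and every parent up to the filesystem root.
func ancestors(dir string) []string {
	dir = filepath.Clean(dir)
	var chain []string
	for {
		chain = append(chain, dir)
		parent := filepath.Dir(dir)
		if parent == dir { // reached the root (or "." for relative paths)
			return chain
		}
		dir = parent
	}
}

// ancestorModeOK accepts a directory that is not group/world-writable, or
// one that is writable but has the sticky bit set, since the sticky bit
// stops other users from renaming or deleting entries they do not own.
func ancestorModeOK(mode fs.FileMode) bool {
	if mode.Perm()&0o022 == 0 {
		return true
	}
	return mode&fs.ModeSticky != 0
}

func main() {
	fmt.Println(ancestors("/opt/fleet/plugins"))
	for _, m := range []fs.FileMode{fs.ModeDir | 0o755, fs.ModeDir | 0o777, tmpStyleDir} {
		fmt.Printf("%v -> %v\n", m, ancestorModeOK(m))
	}
}
```

A full check would obtain each ancestor's mode with `os.Lstat` and repeat the walk on the `filepath.EvalSymlinks` result, as the commit message describes.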

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e6565a6124

Comment thread server/cmd/fleetnode/plugins_dir_unix.go
Removed verbose docblocks and what-comments; kept the non-obvious why
on ppid filtering, os.ReadDir matching, /bin/ps hard-coding, the
non-exec ownership check, validatePathChain's two-walk strategy, the
sticky-bit exception, and the cleanup ctx separation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread server/cmd/fleetnode/run.go
@ankitgoswami ankitgoswami merged commit 5569170 into main May 28, 2026
70 checks passed
@ankitgoswami ankitgoswami deleted the ankitg/fleetnode-run-heartbeat branch May 28, 2026 15:54
rongxin-liu added a commit that referenced this pull request May 29, 2026
Brings in:
- #325 RBAC swap: List/GetActive/Update curtailment RPCs now gated via
  RequirePermission; new entries in ProcedurePermissions.
- #329 UpdateCurtailmentEvent client wiring + edit-modal UI.
- #321 Curtailment start/restore action UI.
- #322 / #323 fleetnode enrollment + heartbeat stack.

Conflict resolution: server/internal/handlers/middleware/rpc_permissions.go.
Main added CurtailmentService reads + Update + DeviceCollection +
DeviceSet + ErrorQuery + FleetManagement entries; the branch added
IngestCurtailmentSignal. Took the union — folded the Ingest entry
into the CurtailmentService block alongside Update/reads/AdminTerminate.

handler.go auto-merged cleanly; build + targeted tests pass.