Skip to content

fleetnode: server-driven discovery via ControlStream (stack 3/3)#324

Merged
ankitgoswami merged 10 commits into
mainfrom
ankitg/fleetnode-discovery
May 29, 2026
Merged

fleetnode: server-driven discovery via ControlStream (stack 3/3)#324
ankitgoswami merged 10 commits into
mainfrom
ankitg/fleetnode-discovery

Conversation

@ankitgoswami
Copy link
Copy Markdown
Contributor

@ankitgoswami ankitgoswami commented May 27, 2026

Summary

PR 3 of 3. Stacked on #323. Completes the agent surface: `ControlStream` consumer, nmap, IPList normalizer, control-loop wiring in `run.go`.

  • `runControlLoop`. Bidi `ControlStream` with exponential-backoff reconnect (1s→30s), permanent-error short-circuit on `CodeNotFound`/`ErrBeginAuthRejected`. `runControlSession` performs Hello/Accepted; a watcher goroutine closes the stream on `ctx.Done` because `http2.pipe.Read` blocks on a `sync.Cond` that ctx alone can't unblock.
  • `handleCommand` decodes payload as `pairing.v1.DiscoverRequest` and dispatches:
    • `IPListModeRequest` → `netutil.NormalizeIPListEntry` per entry, then bounded-concurrency probes via the plugin manager (32 in-flight, 3s per probe).
    • `NmapModeRequest` → `buildNmapOptions` (target validator + IPv6 detection + `-6` when needed; IPv6 CIDR rejected) → `Ullaakut/nmap/v3`, then plugin probes against open host:port (16 in-flight). 10-min overall scan budget.
    • `MDNSModeRequest` → rejected via ack `error_message`.
  • Caps + commandTimeout. `maxIPsPerCommand = 1024`, `maxPortsPerIP = 10` (matches `pairing.MaxPortsPerIP`), `commandTimeout = 10m`. `resolveAndValidatePorts` rejects range/comma/protocol-prefix bypass and dedupes.
  • `netutil.NormalizeIPListEntry`. Centralizes IP-list normalization: rejects scoped/link-local IPv6, canonicalizes spelling, resolves hostnames preferring IPv4. Pairing service now uses it too (replaces ~45 inline lines in `DiscoverWithIPList`).
  • `run.go` delta. Adds `nmapPath` field + `resolveNmapPath`, control-loop goroutine alongside heartbeat.
  • README adds Nmap, Control stream, Identifier synthesis, discovery security/troubleshooting sections.

Test plan

  • `go build ./server/...`
  • `go vet ./server/...`
  • `golangci-lint run` → 0 issues
  • `go test -race ./server/cmd/fleetnode/... ./server/internal/fleetnodebootstrap/... ./server/internal/domain/netutil/...`
  • Pair the agent with a server that has the matching `ControlStream` + `DiscoverOnFleetNode` handlers (PR feat(fleetnode): server-initiated discovery via ControlStream [PR 2/2] #235); trigger discovery and confirm streamed `DiscoverResponse` + a `discovered_device` row with `discovered_by_fleet_node_id` set

Stack

  1. PR 1: install + enrollment hardening → `main`
  2. PR 2: `fleetnode run` heartbeat + plugin scaffolding → PR 1
  3. PR 3 (this one): discovery via ControlStream → PR 2

🤖 Generated with Claude Code

@ankitgoswami ankitgoswami requested a review from a team as a code owner May 27, 2026 19:23
@github-actions github-actions Bot added documentation Improvements or additions to documentation server labels May 27, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc9502fcf4

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/control.go
Comment thread server/cmd/fleetnode/nmap.go
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 27, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (4abf2e8e1f8674e427bdc95a0564c7bce59ec318...b052028643d7b9461585d3dea7e333e2fe359c16, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Fleet node executes server-supplied discovery targets without a local allowlist

  • Category: Network Discovery
  • Location: server/cmd/fleetnode/control.go:287
  • Description: The new control path normalizes and probes server-supplied IpList entries, and the new nmap path similarly accepts valid IPs/CIDRs/ranges/hostnames, but neither path enforces an agent-side network policy. A compromised or malicious configured fleet server can command the on-prem agent to scan loopback, public IPs, or sensitive internal ranges reachable from the agent.
  • Impact: The fleet node becomes a pivot for internal network reconnaissance or localhost probing from the operator environment.
  • Recommendation: Add an agent-local discovery allowlist, preferably configured CIDRs defaulting to detected local miner subnets. Apply it after DNS resolution and range expansion, reject loopback/link-local/public targets by default, and avoid falling back to unresolved hostnames that cannot be policy-checked.

[MEDIUM] url_scheme now accepts values that downstream storage and UI do not safely model

  • Category: Protobuf
  • Location: proto/fleetnodegateway/v1/fleetnodegateway.proto:116
  • Description: The validator was widened from http|https to any string up to 32 bytes. Existing discovered_device.url_scheme is VARCHAR(10), and fleet management later builds clickable URLs from this scheme. Agent-side validation can now accept reports that the DB rejects, and short non-web schemes can become frontend links.
  • Impact: A bad plugin/fleet node can cause discovery report persistence failures or expose operators to unexpected external-protocol links.
  • Recommendation: Align the proto with storage and UI semantics. Either restrict to known protocol values such as http, https, tcp, virtual, or separate transport scheme from web-view URL and only expose http/https links in the frontend. If arbitrary schemes are required, migrate the DB and sanitize before UI use.

[MEDIUM] New control-loop client path has no server implementation in this commit

  • Category: Reliability
  • Location: server/cmd/fleetnode/run.go:188
  • Description: The agent now starts runControlLoop when plugins are available, but this reviewed commit contains no ControlStream or ReportDiscoveredDevices handler implementation; only the generated unimplemented methods exist.
  • Impact: Against this exact server commit, fleet-node discovery commands remain non-functional and the agent falls back to heartbeat-only behavior.
  • Recommendation: Land the server handlers and integration tests with this client path, or keep the control loop/docs gated until the server side is present.

[MEDIUM] Probe supervisor can accumulate stuck goroutines across commands

  • Category: Concurrency
  • Location: server/cmd/fleetnode/control.go:458
  • Description: waitWithSupervisor returns partial results on timeout, but probes that ignore context continue running, and the waiter goroutine remains blocked on wg.Wait. The worker then accepts later commands, so repeated scans against a stuck plugin can accumulate goroutines.
  • Impact: A buggy or malicious plugin, combined with repeated server-issued commands, can cause unbounded agent-side resource growth.
  • Recommendation: Treat supervisor timeout as a plugin health failure: close the control session, restart/kill plugin subprocesses, or refuse new commands until the outstanding probe set drains.

Notes

No cryptostealing, pool/wallet rewrite, SQL interpolation, Rust, or infrastructure changes were present in the reviewed diff. I did not run tests; this was a static diff review.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

chatgpt-codex-connector[bot]

This comment was marked as outdated.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-run-heartbeat branch from 3694b1f to dc50f50 Compare May 27, 2026 21:34
@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-discovery branch from 2ecc137 to ce5370e Compare May 27, 2026 21:49
chatgpt-codex-connector[bot]

This comment was marked as outdated.

@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-discovery branch from ce5370e to bb5c562 Compare May 27, 2026 22:12
chatgpt-codex-connector[bot]

This comment was marked as outdated.

@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-discovery branch from bb5c562 to 83c42e7 Compare May 27, 2026 22:22
chatgpt-codex-connector[bot]

This comment was marked as outdated.

@github-actions github-actions Bot added javascript Pull requests that update javascript code client shared labels May 27, 2026
@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-discovery branch from e31e9ee to b86dc70 Compare May 27, 2026 22:39
chatgpt-codex-connector[bot]

This comment was marked as outdated.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-discovery branch from fbfb62d to 3e0c139 Compare May 28, 2026 00:45
Base automatically changed from ankitg/fleetnode-run-heartbeat to main May 28, 2026 15:54
chatgpt-codex-connector[bot]

This comment was marked as outdated.

@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-discovery branch from 66ef99d to 38dd55c Compare May 28, 2026 15:57
chatgpt-codex-connector[bot]

This comment was marked as outdated.

Copilot AI review requested due to automatic review settings May 28, 2026 16:34
chatgpt-codex-connector[bot]

This comment was marked as outdated.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

PR 3 of 3 completes the fleet-node agent's server-driven discovery surface by wiring up the bidirectional ControlStream, adding nmap support, and centralizing IP-list normalization for reuse between the agent and the existing pairing service.

Changes:

  • New runControlLoop/runControlSession/handleCommand in control.go that consumes serialized pairing.v1.DiscoverRequest commands, dispatches IPList/IPRange/Nmap discovery, batches reports through ReportDiscoveredDevices, and emits typed AckCode acknowledgements (with proto + generated Go/TS updated for the new enum).
  • New nmap.go (with unix/windows ownership-check shims) implementing strict target validation, adjacent-binary safety checks, hostname → IPv4-preferred resolution, IPv6 CIDR rejection, and bounded probe fan-out from open ports.
  • New shared netutil.NormalizeIPListEntry consumed by both pairing.Service.DiscoverWithIPList (replacing ~45 lines of inline logic) and the agent's IPList command path; run.go adds nmapPath, the control-loop goroutine, and a WaitGroup/errCh shutdown pattern alongside the heartbeat.

Reviewed changes

Copilot reviewed 12 out of 14 changed files in this pull request and generated no comments.

Show a summary per file
File Description
server/cmd/fleetnode/control.go Adds full ControlStream consumer, command dispatch, port/IP validation, IPv4 range expansion, fan-out probing, batched report uploads, and AckCode plumbing.
server/cmd/fleetnode/control_test.go Extensive new coverage: probe/report happy path, mode rejections, port/IP validation, range expansion, partial-timeout reporting, reconnect/EOF, serialization, ctx-cancel, and stuck-plugin supervisor.
server/cmd/fleetnode/nmap.go New nmap target validation, safe binary resolution, hostname resolution, and scan-to-probe pipeline.
server/cmd/fleetnode/nmap_unix.go / nmap_windows.go Per-OS ownership/permission validation (no-op on Windows).
server/cmd/fleetnode/nmap_test.go Target grammar, binary resolution safety, hostname substitution, IPv6 flagging, and CIDR rejection tests.
server/cmd/fleetnode/run.go Adds nmapPath, resolver fields; runs heartbeat + control loop concurrently when a discoverer is available.
server/cmd/fleetnode/README.md Documents control stream, nmap mode, security caveats, and troubleshooting entries.
server/internal/domain/netutil/iplist.go(_test.go) New shared NormalizeIPListEntry helper with full unit-test coverage.
server/internal/domain/pairing/service.go DiscoverWithIPList delegates to the new shared normalizer.
proto/fleetnodegateway/v1/fleetnodegateway.proto Adds AckCode enum and ControlAck.code field.
server/generated/grpc/.../fleetnodegateway.pb.go Regenerated Go bindings for the new enum/field.
client/src/protoFleet/api/generated/.../fleetnodegateway_pb.ts Regenerated TS bindings for the new enum/field.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ab56c775e8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/nmap.go Outdated
chatgpt-codex-connector[bot]

This comment was marked as outdated.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

chatgpt-codex-connector[bot]

This comment was marked as outdated.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f269298d2a

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/control.go Outdated
PR 3/3 of the fleetnode agent stack. Completes the agent surface for
server-driven discovery: bidi ControlStream consumer, nmap- and
IpList-based scan paths, plugin-manager probe orchestration, and
report streaming back through ReportDiscoveredDevices. Original
implementation plus all review feedback.

Control loop.
- runControlLoop opens a bidi ControlStream and reconnects with
  exponential backoff (1s -> 30s) plus half-jitter. CodeNotFound and
  ErrBeginAuthRejected are treated as permanent; CodeUnimplemented
  drops the agent to heartbeat-only against pre-ControlStream
  servers. stableSessionThreshold (30s) resets the backoff once a
  session stabilises so flapping connections keep growing.
- runControlSession does Hello/Accepted and then offloads command
  execution to a single-slot commandWorker goroutine. The dispatch
  channel is buffer=1; if the worker is already busy with a queued
  command, the receive loop ack-busies the new arrival rather than
  parking on a full channel and hiding stream events.
- Worker runs on a session-scoped ctx so a dropped stream cancels
  the in-flight scan immediately. Without this the agent would burn
  up to commandTimeout (10 min) finishing an old scan before opening
  a new session.
- All ControlStreamRequest sends go through a lockedAcker wrapper
  so the worker's completion ack and the receive loop's busy ack
  can't race on connect-go's bidi stream.
- A watcher goroutine closes the bidi stream when the parent ctx
  fires; required because stream.Receive parks in http2.pipe on a
  sync.Cond that ctx alone can't unblock.

Command dispatch.
- discoverForCommand routes by oneof: IpList -> NormalizeIPListEntry
  + bounded plugin probes (probeConcurrency=32, perProbeTimeout=10s
  matching server-side pairing.perDeviceDiscoveryTimeout); IpRange
  -> expandIPv4Range with the same .0/.1 skip the server applies;
  Nmap -> validateNmapTarget + buildNmapOptions + Ullaakut/nmap/v3
  with 16-way probe concurrency; Mdns -> rejected as AGENT_INCAPABLE.
- Untrusted-server input is treated as adversarial: resolveAndValidatePorts
  emits canonical decimals (strconv.Itoa) so non-canonical inputs like
  "+80" or "080" don't poison reports; multi-octet nmap ranges
  (10.0-255.0-255.1-254) are rejected before the hostname fallback so
  they can't slip past the /22 CIDR cap; IPv4-mapped IPv6 collapses to
  v4 in NormalizeIPListEntry so v4-only checks see v4.
- maxIPsPerCommand=1024, maxPortsPerIP=10 mirror server-side caps.
  commandTimeout=10m overall; per-batch report upload bounded at 30s.
- fanOutProbes has a wall-clock supervisor (perProbeTimeout*2) so a
  plugin Probe that ignores ctx can't pin the agent.
- Per-device protovalidate runs before each report is added to the
  batch; one bad device-supplied url_scheme/driver_name no longer
  fails the whole ReportDiscoveredDevices batch.
- Scanned (ip, port) overrides anything the plugin reports for those
  fields so a malicious plugin can't spoof endpoints.
- Truncated scans (cmdCtx hit commandTimeout) still upload their
  partial reports and ack PROBE_PARTIAL via the daemon ctx; the
  intent is operators see partial discovery rather than nothing.

Nmap binary safety.
- resolveNmapPath prefers an adjacent <exe-dir>/nmap and fails closed
  if that candidate is unsafe (no PATH fall-through past a staged
  binary). PATH fallback resolves once via LookPath.
- validateNmapBinary returns the symlink-resolved path so a symlink
  swap after startup can't redirect exec to an unvalidated binary.
- Both candidates pass validatePathChain (no writable ancestor
  without sticky bit), mirroring the plugin loader.
- Empty nmap path fails closed in runNmapDiscovery rather than
  letting the library re-resolve via PATH at scan time.
- POSIX-mode safety (uid + write bits + exec bit) lives in
  checkNmapBinaryOwnership on Unix; Windows .exe doesn't model these
  bits, so the windows build of the helper is a no-op.

Wire contract.
- ControlAck.code is an enum: OK, PARTIAL, BAD_REQUEST,
  AGENT_INCAPABLE, SCAN_FAILED, REPORT_FAILED, INTERNAL. Codes track
  consumer-visible distinctions only; error_message carries the
  human-readable detail and is utf8-safe-truncated at the proto's
  4096-byte max_len.
- discoveryReportTimeout per-batch (30s) bounds an individual
  ReportDiscoveredDevices call; the per-batch context is derived
  from the daemon ctx so partial-success uploads survive a session
  drop.
- streamReports chunks at maxDevicesPerReport=1024 via slices.Chunk.

Shared helpers in netutil.
- NormalizeIPListEntry centralises ip_addresses handling (rejects
  scoped IPv6, link-local fe80::/10, hostname-prefer-IPv4); the
  pairing service uses it too, replacing ~45 lines of inline logic.
- ParseIPv4 / IPv4ToUint32 / Uint32ToIPv4 / AdjustIPv4RangeStart
  consolidate the four bijection helpers previously duplicated
  between control.go and pairing/service.go. pairing's
  DiscoverWithIPRange now uses the shared helpers.
- IsIPv4NetworkOrGateway mirrors the server-side .0/.1 filter so
  agent nmap result rows where a gateway answers for the whole
  subnet don't poison the batch.

Tests.
- control_test.go drives the loop end-to-end against a fake bidi
  gateway (controlFakeGateway) with knobs for permanent-error
  rejection, post-handshake stream close, report-upload failure,
  and a signal-based mid-stream close used to assert in-flight
  cancellation. Table-driven TestControlLoop_AcksAndReports covers
  the ack-code/report invariants across IpList happy path, IPv6
  normalization, all-unusable IPs, over-cap IPs, over-cap ports,
  mdns rejection, nmap port-range bypass, IpRange expansion,
  corrupt payload, and report-upload failure in one pass.
- Discrete tests for partial-results-survive-scan-deadline,
  permanent-error propagation, Unimplemented->heartbeat-only,
  reconnect-after-EOF, long-command-doesn't-block-receive,
  ctx-cancel-during-in-flight, fanOutProbes supervisor budget,
  drop-invalid-report-instead-of-poisoning-batch,
  override-plugin-supplied-endpoint, dropped-stream-cancels-scan,
  pipelined-command-gets-busy-ack, concurrent-acks-serialize,
  sendAck UTF-8-boundary truncation.
- nmap_test.go covers resolveNmapPath happy path, adjacent unsafe
  fails closed, world-writable adjacent rejected, world-writable
  ancestor rejected, validateNmapTarget table (multi-octet ranges
  rejected, shell metacharacters rejected, etc.), resolveNmapTarget
  hostname substitution, IPv6 CIDR rejected, nmapPath="" fails
  closed, IPv6 scanning option assertions on actual scanner.Args().
- iputil_test.go: NormalizeIPListEntry table (scoped/link-local/
  trailing-whitespace/IPv4-mapped IPv6 cases), ParseIPv4 +
  IPv4Uint32 round-trip, AdjustIPv4RangeStart with loopback
  carve-out, IsIPv4NetworkOrGateway.

README updates document the control flow, nmap target rules,
discovery security model, and the operator troubleshooting paths.

Verification.
- go build ./server/...                                        clean
- go vet ./server/...                                          clean
- go test -race -count=1 -timeout 180s ./server/cmd/fleetnode/...
  ./server/internal/domain/netutil/...                         green
- golangci-lint on the touched packages                    0 issues

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-discovery branch from abbbf5f to faf1da1 Compare May 28, 2026 19:47
ankitgoswami and others added 3 commits May 28, 2026 12:56
The gateway proto restricted DiscoveredDeviceReport.url_scheme to
{"http","https"} via buf-validate; pairing.Device (the pre-existing
source schema in the pairing service) has no such restriction. The
packaged virtual plugin reports URLScheme: "virtual", and just
build-fleetnode includes native plugins, so server-driven discovery of
virtual miners had its report dropped by the agent's protovalidate gate
in fanOutProbes -- silently, with no record reaching the server.

Loosen the gateway constraint to match the source schema. Keep the
field bounded via max_len = 32 (long enough for any real URI scheme,
short enough that a buggy plugin can't sneak a multi-KB string into the
batch).

Regenerate the gateway pb.go and pb.ts.

Adjust TestFanOutProbes_DropsInvalidReportInsteadOfPoisoningBatch:
"virtual" is now valid, so use an oversized model string (>255 chars)
as the proto-rejected case. Add
TestFanOutProbes_AcceptsNonHTTPPluginScheme to lock the new behavior in.

Verification on the touched scope:
- go build clean
- go vet clean
- go test -race -count=1 -timeout 180s green
- golangci-lint: 0 issues

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trim comments that restated what well-named identifiers already say or
were verbose where one line suffices. Load-bearing WHY comments
(http2.pipe watcher, sessionCtx rationale, lockedAcker concurrency
note, UTF-8 truncation rationale, partial-results ctx choice, dev-
fixture loopback carve-outs, etc.) are intact.

- control.go: drop commandError struct doc, shrink resolveAndValidatePorts
  doc and inline canonical-form note from 3 lines to 2/1.
- control_test.go: drop discoverIPList doc, shrink runControlLoopOnce
  doc, drop "All fixtures carry url_scheme..." prose, shrink the
  DroppedStream test setup explanation.
- netutil/iputil.go: drop pure-WHAT docs on IPv4ToUint32 / Uint32ToIPv4
  (the WHY-non-obvious panic warning stays as a one-liner).

No behavior change.

Verification on the touched scope:
- go build clean
- go vet clean
- go test -race -count=1 -timeout 180s green
- golangci-lint: 0 issues

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five tests dropped where the compiler, a stronger sibling test, or the
Go language itself already covered the same ground.

- TestNormalizeIPListEntry_DefaultResolverSatisfiesInterface
  Pure `var _ IPListResolver = net.DefaultResolver` compile-time
  assertion. The compiler runs this on every build; the wrapping test
  function added nothing.

- TestIPv4Uint32RoundTrip
  Asserted Uint32ToIPv4(IPv4ToUint32(addr)) == addr.String() for a
  handful of IPs. The encoding is the contract -- TestIPv4ToUint32_
  KnownValues and TestUint32ToIPv4_KnownValues pin specific bytes for
  both directions, which is strictly more informative; a swapped byte
  order would fool the round-trip property but fail the known-values
  tables.

- TestSendAck_ClampsErrorMessageToProtoLimit
  Strict subset of TestSendAck_TruncationPreservesUTF8Boundaries,
  which checks every invariant the ASCII variant did plus utf8.
  ValidString.

- TestBuildNmapOptions_RejectsIPv6CIDR
  TestResolveNmapTarget already has the `"ipv6 cidr rejected"`
  case at the layer where the rejection lives; the wrapping test was
  just verifying Go's `return err` propagation up one call frame.

- TestResolveNmapPath_EmptyExeDir_NoPath
  TestResolveNmapPath_Default_AdjacentNotExecutableFallsBack reaches
  the same final code path (empty PATH lookup returns ""); the
  preceding branch (skip-adjacent-when-exeDir-empty) is just an `if`
  with no logic worth a dedicated test.

The substantive coverage stays intact: every business rule, every
security path, every regression behavior. No code change, no
production behavior change. -74 test lines.

Verification on the touched scope:
- go build clean
- go vet clean
- go test -race -count=1 -timeout 180s green
- golangci-lint: 0 issues

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 245600be43

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/nmap.go Outdated
…indows

Two of the four Codex findings from this round; the other two are
unchanged-from-prior-rounds product decisions surfaced below.

handleCommand now runs protovalidate.Validate(cmd) before using any
field. This catches oversized command_id (>128 bytes), empty
command_id (min_len=1), and oversized payload (>1 MiB max_len) before
they propagate into ReportDiscoveredDevices/ControlAck messages that
would themselves fail validation at the gateway. If command_id is the
field that's invalid, echoing it back in an ack would also fail
validation and close the stream -- in that case the agent drops
silently and lets the server time out. New regression queues an empty-
command_id command followed by a valid one and asserts only the valid
one is acked, proving the receive loop kept running normally past the
dropped command.

resolveNmapPath disables PATH fallback on Windows. The Windows
checkNmapBinaryOwnership is a no-op (Windows ACLs aren't modeled), so
any nmap.exe the process resolves on PATH was previously trusted -- an
elevated service plus a less-privileged-writable PATH entry would let
an attacker swap the binary. New build-tag-split constants
nmapBinaryName and nmapAllowPATHFallback: Windows uses "nmap.exe" with
no PATH fallback (require an installer-staged adjacent binary under an
Administrator-only directory); Unix keeps the existing "nmap" + PATH
fallback. Cross-compile verification: GOOS=windows go build clean.

Two findings deferred:

- [HIGH] CIDR allowlist for probe targets. Fourth review round this has
  come up; the decision (where the allowlist lives, default behavior,
  operator UX) isn't a code change I can land unilaterally. Incremental
  defenses already in this PR: IPv4-mapped IPv6 collapse, /22 nmap
  CIDR cap, multi-octet nmap range rejection, agent-side .0/.1 result
  filter. Broader allowlist remains a product decision for a separate
  PR.

- [MEDIUM] Straggler-goroutine accumulation. fanOutProbes already has
  the perProbeTimeout*2 supervisor that returns the partial batch, but
  the in-flight goroutine itself keeps running until the plugin RPC
  returns. Real fix is plugin-manager territory (signal plugin health,
  kill+restart misbehaving subprocesses); doesn't belong in the
  control loop.

Verification on the touched scope:
- go build clean (unix + GOOS=windows cross-compile)
- go vet clean
- go test -race -count=1 -timeout 180s green
- golangci-lint: 0 issues

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c99010faf0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/control.go
ankitgoswami and others added 4 commits May 28, 2026 13:17
cmdCh capacity 1 had a real race: the receive loop can land a second
command before the worker has drained the first, so the buffer briefly
looks full to the non-blocking send and a perfectly valid command gets
busy-acked. TestControlLoop_LongCommandDoesNotBlockNextReceiveAndSerializes
hit this nondeterministically. Bump to capacity 2 (1 in flight + 1
queued); a 3rd outstanding command still hits the busy-ack path.

TestControlLoop_PipelinedCommandGetsBusyAck adjusted to queue four
commands so cmd-D is the one that overflows.

Stress: 200 iterations of both tests under -race, all green.

Comment cleanup pass on the protovalidate / Windows-nmap / cmdCh
additions: the WHY survives, the prose around it gets trimmed
(11 lines of comment removed).

Verification on the touched scope:
- go build clean (unix + GOOS=windows cross-compile)
- go test -race -count=1 -timeout 180s green
- go test -race -count=200 on the race-sensitive tests green
- golangci-lint: 0 issues

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a plugin Probe ignores its context, the fanOutProbes supervisor
budget (perProbeTimeout*2) fires and returns the reports collected so
far. Before this change, handleCommand only sent ACK_CODE_PARTIAL when
cmdCtx itself hit commandTimeout; supervisor-truncated batches were
acked OK and the server had no signal that some endpoints were silently
dropped.

Thread a truncated bool from waitWithSupervisor -> fanOutProbes ->
probeIPsAndPorts / runNmapDiscovery -> discoverForCommand ->
handleCommand. handleCommand now branches: ack PARTIAL if cmdCtx
deadline exceeded OR truncated, OK otherwise. Two distinct PARTIAL
error_message strings so the server can tell command-deadline vs.
supervisor-budget apart.

New end-to-end TestControlLoop_SupervisorTruncatedScanAcksPartial drives
the full bidi flow with a probe that ignores ctx and asserts the agent
emits ACK_CODE_PARTIAL with the supervisor reason. The pure-unit
TestFanOutProbes_SupervisorReturnsPartialOnStuckPlugin now also asserts
the truncated bool.

Verification on the touched scope:
- go build clean
- go test -race -count=1 -timeout 180s green
- golangci-lint: 0 issues

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four-line PARTIAL ack rationale -> two lines. fanOutProbes signature
doc dropped the bool's "(supervisor fired or ctx cancelled mid-fan-out
so the caller can ack PARTIAL)" tail; the variable name "truncated" and
the body carry it. Test assertions and arrange/setup comments tightened
the same way. ctxIgnoringDiscoverer doc collapsed to one line.

Behavior unchanged. -8 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Asserting cmd-D specifically was timing-dependent: whether cmd-D or
cmd-C (or cmd-E) lands the busy ack depends on whether the worker
goroutine drains cmd-A before the receive loop reads the next command.
CI hit the cmd-C-busy-acked path while my laptop favored cmd-D.

Queue 5 commands and assert ANY trailing one gets the busy ack. With
cmd-A's probe blocking and buffer capacity 2, at least one of cmd-C /
cmd-D / cmd-E must overflow regardless of scheduling, so the test is
deterministic now. Also bumped the eventually budget 2s -> 3s for CI
slack.

200 iterations under -race green for both
TestControlLoop_PipelinedCommandGetsBusyAck and
TestControlLoop_LongCommandDoesNotBlockNextReceiveAndSerializes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread server/cmd/fleetnode/control.go Outdated
Reusing ACK_CODE_INTERNAL for the "agent already has a command in
flight" rejection forced callers to string-match the error_message
("in flight") to distinguish it from real internal errors, and the
test was doing exactly that. Promote it to a first-class code.

ACK_CODE_BUSY = 8 in the proto. The busy-ack site in control.go now
emits the new code; the pipelined-command test asserts on the code
directly and drops the error_message substring check.

Semantics: BUSY is transient (retry after the prior ack); INTERNAL
remains the catch-all for unexpected errors.

Verification on the touched scope:
- go build clean
- go test -race -count=1 -timeout 180s green
- 200 iterations under -race on the pipelined + long-command tests
- golangci-lint: 0 issues

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami merged commit 0abe567 into main May 29, 2026
72 checks passed
@ankitgoswami ankitgoswami deleted the ankitg/fleetnode-discovery branch May 29, 2026 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client documentation Improvements or additions to documentation javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants