Skip to content

feat(fleetnode): discovery + pairing E2E (server)#235

Draft
ankitgoswami wants to merge 5 commits into
mainfrom
ankitg/discovery-pairing
Draft

feat(fleetnode): discovery + pairing E2E (server)#235
ankitgoswami wants to merge 5 commits into
mainfrom
ankitg/discovery-pairing

Conversation

@ankitgoswami
Copy link
Copy Markdown
Contributor

@ankitgoswami ankitgoswami commented May 14, 2026

Summary

Server-side half of the fleet-node discovery + pairing flow under RFC 0001 phase 2. Operator-facing admin RPCs to pair/unpair/list devices by fleet_node, gateway RPC that accepts agent-reported discoveries, and the schema + guards needed to keep agent-reported rows from contaminating the cloud's existing server-local pairing paths.

The agent half (scan loop, ReportDiscoveredDevices caller, ipscanner.GenerateIPsFromCIDR export) ships in #256. The two PRs share identical proto content and merge cleanly in either order.

Protocol

  • proto/fleetnodegateway/v1/fleetnodegateway.proto: ReportDiscoveredDevices unary batch RPC (max_items = 1024), DiscoveredDeviceReport message.
  • proto/fleetnodeadmin/v1/fleetnodeadmin.proto: three admin RPCs — PairDeviceToFleetNode, UnpairDevice, ListFleetNodeDevices. Session-only auth.

Schema

server/migrations/000050_add_fleet_node_attribution_to_discovered_device.up.sql:

  • Adds discovered_device.discovered_by_fleet_node_id BIGINT NULL with FK to fleet_node(id) ON DELETE SET NULL and a partial index on the non-null subset.
  • No backfill — no existing fleet nodes in production.

Domain + handlers

  • server/internal/domain/fleetnodepairing/: new domain service. PairDevice requires EnrollmentStatus == CONFIRMED, locks the fleet_node row inside the TX, and updates fleet_node_device + discovered_device.discovered_by_fleet_node_id atomically. UnpairDevice clears attribution. UpsertDiscoveredDevices validates each report (IP must be RFC1918/RFC4193, port 1-65535, scheme http/https), then calls a single SQL upsert.
  • server/internal/handlers/fleetnodegateway/handler.go: ReportDiscoveredDevices reads Subject from the auth context for fleet_node_id + org_id (never from the body) and forwards to the domain service.
  • server/internal/handlers/fleetnodeadmin/handler.go: three pairing RPCs wired into SessionOnlyProcedures.
  • server/internal/handlers/interceptors/config.go: ReportDiscoveredDevices added to FleetNodeAuthenticatedProcedures.

SQL reconciliation

UpsertDiscoveredDeviceFromFleetNode is a single statement that:

  1. Reconciles via CTE: if the agent's identifier has the auto: prefix, substitutes any existing auto:* identifier at the same (fleet_node, ip, port) endpoint (mirrors combined-mode pairing.service's GetByIPAndPort reuse). mac:* and serial:* pass through unchanged.
  2. Inserts; on conflict, updates the row.
  3. The WHERE on the DO UPDATE branch blocks two attack surfaces: claims from a different fleet node, and retargeting of devices the cloud paired locally (which have a device row but no fleet_node_device binding).

One round-trip per report regardless of identifier prefix.

Guards against agent-reported rows poisoning server flows

  • GetActiveUnpairedDiscoveredDevices / CountActiveUnpairedDiscoveredDevices: filter out rows where discovered_by_fleet_node_id IS NOT NULL. The cloud-side pairing flow dials discovered_device.ip_address; an authenticated fleet node must not be able to plant a private address into the inventory and have the server reach for it. Remote-origin rows are paired through PairDeviceToFleetNode, which never touches the IP.
  • pairing.PairDevices: defense-in-depth guard that rejects any device whose linked discovered_device.discovered_by_fleet_node_id is non-nil, even if the caller passes the identifier directly.
  • RevokeFleetNode: now deletes fleet_node_device rows and clears discovered_device.discovered_by_fleet_node_id for the revoked node in the same transaction (sealed via a new RevocationCleanupStore sub-interface). Soft-delete previously left attribution pointing at an unreachable node, blocking replacement nodes from refreshing the row.

What's not in this PR

  • The agent: server/cmd/fleetnode/discovery.go, run.go discovery loop, ipscanner.GenerateIPsFromCIDR export — see feat(fleetnode): discovery scan + ReportDiscoveredDevices agent #256.
  • ControlStream + per-device commands (next).
  • UploadTelemetry / UploadEvents.
  • Frontend pair page (admin RPCs are curl-usable today).
  • Per-fleet-node CIDR allowlist for discovery reports — current validateReport is a coarse first line (RFC1918/RFC4193 only); follow-up before any server-side code starts dialing discovered_device.ip_address.

Test plan

  • PATH=./bin:$PATH just gen (idempotent)
  • PATH=./bin:$PATH go build ./server/...
  • PATH=./bin:$PATH go vet ./server/...
  • cd server && PATH=../bin:$PATH golangci-lint run -c .golangci.yaml ./... — 0 issues
  • DB_PASSWORD=fleet PATH=../bin:$PATH go test ./server/... -count=1 — full suite passes
  • cd server && DB_PASSWORD=fleet PATH=../bin:$PATH go test ./internal/domain/fleetnodepairing/... ./internal/domain/fleetnodeenrollment/... ./internal/domain/pairing/... ./internal/handlers/fleetnodegateway/... ./internal/handlers/fleetnodeadmin/... -race
  • Migration 000050 round-trips cleanly
  • CI runs the full suite
  • E2E smoke with feat(fleetnode): discovery scan + ReportDiscoveredDevices agent #256 merged (or built locally) — agent reports a discovery, operator pairs, attribution flips, second agent's report rejected if attribution mismatches

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 14, 2026 18:02
@ankitgoswami ankitgoswami requested a review from a team as a code owner May 14, 2026 18:02
@github-actions github-actions Bot added javascript Pull requests that update javascript code client server shared labels May 14, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 14, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (19a03050f8be5f51375a7b83c1e6f93d9e820873...e5c338a7aa370be73bcf669edcae055e4f43de34, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Control registry can panic on send-after-close races

  • Category: Concurrency
  • Location: server/internal/domain/fleetnodecontrol/registry.go:141
  • Description: Send takes a stream pointer, unlocks the registry, then sends on ns.outgoing. A reconnect can concurrently call Register and close that same channel. The same pattern exists for command event channels: PublishAck / publish fetch a channel, unlock, then send while cleanup, Register, or Unregister may close it.
  • Impact: Sending on a closed channel panics. An authenticated fleet node reconnecting or racing an ack/report with admin discovery cleanup can crash fleetd.
  • Recommendation: Do not close channels that other goroutines may still send to. Use per-stream done channels or closed flags, and serialize close/send under the same synchronization. For event channels, prefer deleting state plus a terminal event/context cancellation over receiver-side channel closes.

[MEDIUM] Fleet nodes can persist discovery reports for non-issued commands

  • Category: Network Discovery
  • Location: server/internal/handlers/fleetnodegateway/handler.go:104
  • Description: ReportDiscoveredDevices persists reports before proving command_id belongs to an active server-issued command. A random non-empty command_id passes request validation, and PublishBatch is only a best-effort no-op for unknown IDs.
  • Impact: Any authenticated fleet node can poison discovered_device rows outside an operator-initiated scan, including claiming unpaired local identifiers and moving them into the fleet-node-attributed path that cloud pairing refuses.
  • Recommendation: Validate command_id against the registry before any upsert, and reject unknown/stale command IDs. If unsolicited background reports are intended, make that a separate flow with explicit quotas and ownership rules.

[MEDIUM] Rejected ownership reports are still streamed to operators

  • Category: Network Discovery
  • Location: server/internal/handlers/fleetnodegateway/handler.go:110
  • Description: UpsertDiscoveredDevices can reject individual reports via rows == 0, but the handler ignores which reports were rejected and builds the operator batch from the full original request whenever at least one report was accepted.
  • Impact: A malicious or buggy fleet node can include one valid device plus spoofed ownership-conflicting devices, causing DiscoverOnFleetNode to show devices the database refused to accept.
  • Recommendation: Return per-report status or the accepted report list from the pairing service, and publish only accepted devices to the operator stream.

[MEDIUM] Remote discovery commands can hang indefinitely

  • Category: Reliability
  • Location: server/internal/handlers/fleetnodeadmin/handler.go:211
  • Description: After dispatching a discovery command, the handler waits forever for a batch/ack unless the operator client disconnects or the control stream closes. There is no server-side command deadline.
  • Impact: A fleet node that reads the command but never acks can keep admin streams, goroutines, and registry command state alive indefinitely.
  • Recommendation: Wrap each command in a server-controlled timeout, clean up registry state on deadline, and return DeadlineExceeded or FailedPrecondition.

Notes

I did not find changed-hunk evidence of SQL injection, pool/wallet hijacking, command injection, frontend XSS, or protobuf wire-format breakage. Review was static against .git/codex-review.diff; I did not run tests.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 44d1f14dad

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/cmd/fleetnode/discovery.go Outdated
Comment thread server/cmd/fleetnode/discovery.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR wires up end-to-end fleet-node device discovery reporting and operator-driven device pairing. It introduces new gateway/admin RPCs, persists discovery attribution to the reporting fleet node, and adds an agent-side discovery loop that periodically scans CIDRs and reports discovered devices back to the server.

Changes:

  • Add new RPCs: ReportDiscoveredDevices (gateway) and PairDeviceToFleetNode / UnpairDevice / ListFleetNodeDevices (admin), plus server auth interceptor allowlists.
  • Add DB support for discovery attribution (discovered_device.discovered_by_fleet_node_id) and pairing queries + domain service/store for pairing operations.
  • Add fleetnode agent discovery scanning loop with plugin-backed probing and unit tests.

Reviewed changes

Copilot reviewed 21 out of 32 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
server/sqlc/queries/fleetnodepairing.sql New sqlc queries for discovery upsert, pairing/unpairing, and listing pairings.
server/sdk/v1/python/proto_fleet_sdk/generated/pb/driver_pb2_grpc.py Regenerated Python gRPC stubs (version constant change).
server/migrations/000050_add_fleet_node_attribution_to_discovered_device.up.sql Adds nullable discovered_by_fleet_node_id with FK + index for attribution.
server/migrations/000050_add_fleet_node_attribution_to_discovered_device.down.sql Rollback for attribution column, FK, and index.
server/internal/handlers/interceptors/config.go Adds new admin/gateway procedures to auth allowlists.
server/internal/handlers/fleetnodegateway/handler.go Adds ReportDiscoveredDevices handler and injects pairing service into gateway handler.
server/internal/handlers/fleetnodegateway/handler_heartbeat_test.go Updates handler construction to include pairing service dependency.
server/internal/handlers/fleetnodegateway/handler_discovery_test.go New tests for discovery reporting persistence + missing subject rejection.
server/internal/handlers/fleetnodeadmin/handler.go Adds pairing RPC handlers and injects pairing service dependency.
server/internal/handlers/fleetnodeadmin/handler_pairing_test.go New integration tests covering admin pairing/unpair/list and auth gating.
server/internal/domain/stores/sqlstores/fleetnodepairing.go New SQL store implementing pairing/discovery upsert operations.
server/internal/domain/fleetnodepairing/service.go New domain service for pairing/listing and discovery upsert batching with error mapping.
server/internal/domain/fleetnodepairing/models.go New domain models for discovery reports and pairing summaries.
server/internal/domain/fleetnodepairing/integration_test.go Integration tests for service behavior (pair/unpair/list + discovery attribution).
server/generated/sqlc/models.go Regenerated sqlc models to include DiscoveredByFleetNodeID.
server/generated/sqlc/fleetnodepairing.sql.go Regenerated sqlc code for new pairing/discovery queries.
server/generated/sqlc/discovered_device.sql.go Regenerated sqlc code to select new attribution column.
server/generated/sqlc/db.go Regenerated sqlc statement preparation/closing for new queries.
server/generated/grpc/fleetnodegateway/v1/fleetnodegatewayv1connect/fleetnodegateway.connect.go Regenerated Connect stubs for new gateway RPC.
server/generated/grpc/fleetnodeadmin/v1/fleetnodeadminv1connect/fleetnodeadmin.connect.go Regenerated Connect stubs for new admin RPCs.
server/generated/grpc/fleetnodeadmin/v1/fleetnodeadmin.pb.go Regenerated protobuf Go types for pairing RPCs/messages.
server/cmd/fleetnode/run.go Adds discovery flags and runs discovery tick loop alongside heartbeat.
server/cmd/fleetnode/run_test.go Extends stub gateway client to record discovery RPC calls/batches.
server/cmd/fleetnode/fake_gateway_test.go Extends fake gateway server to handle ReportDiscoveredDevices.
server/cmd/fleetnode/discovery.go Implements CIDR expansion, probing, and reporting logic (plugin-backed discoverer).
server/cmd/fleetnode/discovery_test.go New unit tests for CIDR enumeration and discovery tick behavior/error handling.
server/cmd/fleetd/main.go Wires new pairing service into server handlers.
proto/fleetnodegateway/v1/fleetnodegateway.proto Adds ReportDiscoveredDevices RPC + request/response/messages.
proto/fleetnodeadmin/v1/fleetnodeadmin.proto Adds pairing/admin RPCs + request/response/messages.
client/src/protoFleet/api/generated/fleetnodegateway/v1/fleetnodegateway_pb.ts Regenerated TS API types/service for new gateway RPC.
client/src/protoFleet/api/generated/fleetnodeadmin/v1/fleetnodeadmin_pb.ts Regenerated TS API types/service for new admin RPCs.

Comment thread server/cmd/fleetnode/discovery.go Outdated
Comment thread server/cmd/fleetnode/discovery.go Outdated
@ankitgoswami ankitgoswami marked this pull request as draft May 14, 2026 18:10
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from 9839915 to 0f7b702 Compare May 14, 2026 22:37
@ankitgoswami ankitgoswami changed the title feat(fleetnode): discovery + pairing E2E (server + agent) feat(fleetnode): discovery + pairing E2E (server) May 18, 2026
ankitgoswami and others added 2 commits May 19, 2026 13:42
PR #257 landed the wire contracts on main, so this branch drops the
proto + generated bindings + interceptor config entries that PR A
now owns. What's left is the server-side implementation that
consumes those contracts.

What stays from the previous branch state:
- Migration adding discovered_device.discovered_by_fleet_node_id +
  partial index. Renumbered to 000051 to avoid collision with
  000050_add_curtailment_event_created_by which landed on main.
- fleetnodepairing domain service + sqlstore + integration tests:
  PairDevice (CONFIRMED-only + intra-TX lock), UnpairDevice,
  ListPairs, ListDevicesForFleetNode, UpsertDiscoveredDevices with
  auto:* CTE reconciliation by (fleet_node, ip, port).
- fleetnodegateway handler: ReportDiscoveredDevices (uses
  fleetnodeauth.Subject for org/node).
- fleetnodeadmin handler: PairDeviceToFleetNode, UnpairDevice,
  ListFleetNodeDevices.
- pairing.PairDevices SSRF guard rejecting agent-reported rows.
- RevokeFleetNode cleanup (deletes pairings, clears attribution).
- GetActiveUnpairedDiscoveredDevices filter.
- minerdiscovery model + discovered_device store DiscoveredByFleetNodeID
  plumbing.

What adapted to the new wire:
- DiscoveredDeviceReport.Type renamed to DriverName everywhere
  (model, store, handler, tests) to match the proto rename.
- fleetd/main.go: wire fleetnodepairing.Service and pass it to both
  fleetnodegateway and fleetnodeadmin handler constructors.

The new server-initiated components (fleetnodecontrol.Registry,
ControlStream handler, DiscoverOnFleetNode admin handler) follow
in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement the server-initiated discovery flow that pairs with the
proto contracts merged in #257. Operators trigger
DiscoverOnFleetNode(fleet_node_id, pairing.v1.DiscoverRequest); the
server dispatches a ControlCommand to the agent's open ControlStream,
the agent reports devices via ReportDiscoveredDevices with the
matching command_id, and the server forwards each batch back to the
operator stream until the agent's ControlAck closes the call.

- fleetnodecontrol.Registry: in-memory map[fleet_node_id]stream plus
  per-command_id event channels. Single-instance fleetd only.
- ControlStream handler: Hello -> Accepted -> bidi pump of outgoing
  ControlCommand and incoming ControlAck. PublishAck wires acks to
  the in-flight command channel.
- DiscoverOnFleetNode admin RPC: validates session + fleet node org
  scope, expands IPRange to IPList server-side, rejects MDNS/Nmap,
  encodes payload, streams batches until ack.
- ReportDiscoveredDevices: when command_id is set, publishes each
  persisted batch to the registry so the operator stream wakes up.
- Export ipscanner.GenerateIPsFromCIDR so other packages can reuse
  CIDR enumeration.

Tests cover registry register/send/ack/disconnect, ControlStream
hello/dispatch/ack/duplicate-stream rejection, DiscoverOnFleetNode
happy path + no-stream + MDNS reject + IPRange expansion + viewer
gate, and the ReportDiscoveredDevices command_id correlation path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/discovery-pairing branch from b4d4ad4 to be68841 Compare May 19, 2026 21:01
@github-actions github-actions Bot removed javascript Pull requests that update javascript code client shared labels May 19, 2026
ankitgoswami and others added 3 commits May 19, 2026 14:44
Drop the Nmap rejection in DiscoverOnFleetNode. With the agent shipping
a bundled nmap binary (#256), NmapModeRequest now passes through unchanged
so the agent can execute it. MDNS still rejected; the agent doesn't run
an mDNS listener.

Server-side validation: require nmap.target to be non-empty so an
all-zero NmapModeRequest is caught before round-tripping through the
control stream.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply ce-simplify findings:

- Drop fleetnodegateway.MarshalDiscoverRequest. It was a 3-line proto.Marshal
  wrapper that made the admin handler import the gateway handler package
  for no abstraction benefit. Admin handler now calls proto.Marshal directly.
- Extract toPairingDevice helper in ReportDiscoveredDevices so the
  pb.DiscoveredDeviceReport -> pairingpb.Device conversion isn't repeated
  inline when the registry-publish path forwards a batch.
- Dedup Registry.PublishAck and PublishBatch via a shared private publish
  method; structural twins now share one implementation.
- Strip narrative comments that restated the code (Phase/RFC pointers,
  WHAT comments, multi-paragraph descriptions). Keep only WHY where the
  reasoning isn't obvious from the signature.
- Add expandIPv4Range unit tests (inclusive range, single address, invalid
  start IP, IPv6 rejection, end-before-start rejection, overflow cap).
- Add DiscoverOnFleetNode test for the FailedPrecondition path when the
  fleet node is not CONFIRMED.

Build, vet, lint, and tests green on the affected packages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reconnecting agents repeatedly hit FailedPrecondition because the server's
HTTP/2 layer doesn't always notice a SIGKILL'd peer immediately,
especially through docker-proxy on macOS. During that window the prior
control-stream registry entry was orphaned and blocked all reconnects.

Switch to MQTT-style newest-wins semantics. When Register sees an
existing entry, evict it: close its outgoing channel (wakes the prior
handler's main loop, which returns) and every in-flight command channel
(operator streams get FailedPrecondition with "stream closed before
command completed"). The old handler's deferred Unregister becomes a
safe no-op via pointer-equality on the map entry, so a stale cleanup
can't clobber the fresh registration.

Stream now carries the *nodeStream back-pointer so PublishAck on an
evicted handle is also a no-op rather than a write to the replacement's
command channels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants