feat(fleetnode): discovery + pairing E2E (server)#235
Conversation
🔐 Codex Security Review
Review SummaryOverall Risk: HIGH Findings[HIGH] Control registry can panic on send-after-close races
[MEDIUM] Fleet nodes can persist discovery reports for non-issued commands
[MEDIUM] Rejected ownership reports are still streamed to operators
[MEDIUM] Remote discovery commands can hang indefinitely
NotesI did not find changed-hunk evidence of SQL injection, pool/wallet hijacking, command injection, frontend XSS, or protobuf wire-format breakage. Review was static against Generated by Codex Security Review | |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 44d1f14dad
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR wires up end-to-end fleet-node device discovery reporting and operator-driven device pairing. It introduces new gateway/admin RPCs, persists discovery attribution to the reporting fleet node, and adds an agent-side discovery loop that periodically scans CIDRs and reports discovered devices back to the server.
Changes:
- Add new RPCs:
ReportDiscoveredDevices(gateway) andPairDeviceToFleetNode/UnpairDevice/ListFleetNodeDevices(admin), plus server auth interceptor allowlists. - Add DB support for discovery attribution (
discovered_device.discovered_by_fleet_node_id) and pairing queries + domain service/store for pairing operations. - Add fleetnode agent discovery scanning loop with plugin-backed probing and unit tests.
Reviewed changes
Copilot reviewed 21 out of 32 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| server/sqlc/queries/fleetnodepairing.sql | New sqlc queries for discovery upsert, pairing/unpairing, and listing pairings. |
| server/sdk/v1/python/proto_fleet_sdk/generated/pb/driver_pb2_grpc.py | Regenerated Python gRPC stubs (version constant change). |
| server/migrations/000050_add_fleet_node_attribution_to_discovered_device.up.sql | Adds nullable discovered_by_fleet_node_id with FK + index for attribution. |
| server/migrations/000050_add_fleet_node_attribution_to_discovered_device.down.sql | Rollback for attribution column, FK, and index. |
| server/internal/handlers/interceptors/config.go | Adds new admin/gateway procedures to auth allowlists. |
| server/internal/handlers/fleetnodegateway/handler.go | Adds ReportDiscoveredDevices handler and injects pairing service into gateway handler. |
| server/internal/handlers/fleetnodegateway/handler_heartbeat_test.go | Updates handler construction to include pairing service dependency. |
| server/internal/handlers/fleetnodegateway/handler_discovery_test.go | New tests for discovery reporting persistence + missing subject rejection. |
| server/internal/handlers/fleetnodeadmin/handler.go | Adds pairing RPC handlers and injects pairing service dependency. |
| server/internal/handlers/fleetnodeadmin/handler_pairing_test.go | New integration tests covering admin pairing/unpair/list and auth gating. |
| server/internal/domain/stores/sqlstores/fleetnodepairing.go | New SQL store implementing pairing/discovery upsert operations. |
| server/internal/domain/fleetnodepairing/service.go | New domain service for pairing/listing and discovery upsert batching with error mapping. |
| server/internal/domain/fleetnodepairing/models.go | New domain models for discovery reports and pairing summaries. |
| server/internal/domain/fleetnodepairing/integration_test.go | Integration tests for service behavior (pair/unpair/list + discovery attribution). |
| server/generated/sqlc/models.go | Regenerated sqlc models to include DiscoveredByFleetNodeID. |
| server/generated/sqlc/fleetnodepairing.sql.go | Regenerated sqlc code for new pairing/discovery queries. |
| server/generated/sqlc/discovered_device.sql.go | Regenerated sqlc code to select new attribution column. |
| server/generated/sqlc/db.go | Regenerated sqlc statement preparation/closing for new queries. |
| server/generated/grpc/fleetnodegateway/v1/fleetnodegatewayv1connect/fleetnodegateway.connect.go | Regenerated Connect stubs for new gateway RPC. |
| server/generated/grpc/fleetnodeadmin/v1/fleetnodeadminv1connect/fleetnodeadmin.connect.go | Regenerated Connect stubs for new admin RPCs. |
| server/generated/grpc/fleetnodeadmin/v1/fleetnodeadmin.pb.go | Regenerated protobuf Go types for pairing RPCs/messages. |
| server/cmd/fleetnode/run.go | Adds discovery flags and runs discovery tick loop alongside heartbeat. |
| server/cmd/fleetnode/run_test.go | Extends stub gateway client to record discovery RPC calls/batches. |
| server/cmd/fleetnode/fake_gateway_test.go | Extends fake gateway server to handle ReportDiscoveredDevices. |
| server/cmd/fleetnode/discovery.go | Implements CIDR expansion, probing, and reporting logic (plugin-backed discoverer). |
| server/cmd/fleetnode/discovery_test.go | New unit tests for CIDR enumeration and discovery tick behavior/error handling. |
| server/cmd/fleetd/main.go | Wires new pairing service into server handlers. |
| proto/fleetnodegateway/v1/fleetnodegateway.proto | Adds ReportDiscoveredDevices RPC + request/response/messages. |
| proto/fleetnodeadmin/v1/fleetnodeadmin.proto | Adds pairing/admin RPCs + request/response/messages. |
| client/src/protoFleet/api/generated/fleetnodegateway/v1/fleetnodegateway_pb.ts | Regenerated TS API types/service for new gateway RPC. |
| client/src/protoFleet/api/generated/fleetnodeadmin/v1/fleetnodeadmin_pb.ts | Regenerated TS API types/service for new admin RPCs. |
9839915 to
0f7b702
Compare
PR #257 landed the wire contracts on main, so this branch drops the proto + generated bindings + interceptor config entries that PR A now owns. What's left is the server-side implementation that consumes those contracts. What stays from the previous branch state: - Migration adding discovered_device.discovered_by_fleet_node_id + partial index. Renumbered to 000051 to avoid collision with 000050_add_curtailment_event_created_by which landed on main. - fleetnodepairing domain service + sqlstore + integration tests: PairDevice (CONFIRMED-only + intra-TX lock), UnpairDevice, ListPairs, ListDevicesForFleetNode, UpsertDiscoveredDevices with auto:* CTE reconciliation by (fleet_node, ip, port). - fleetnodegateway handler: ReportDiscoveredDevices (uses fleetnodeauth.Subject for org/node). - fleetnodeadmin handler: PairDeviceToFleetNode, UnpairDevice, ListFleetNodeDevices. - pairing.PairDevices SSRF guard rejecting agent-reported rows. - RevokeFleetNode cleanup (deletes pairings, clears attribution). - GetActiveUnpairedDiscoveredDevices filter. - minerdiscovery model + discovered_device store DiscoveredByFleetNodeID plumbing. What adapted to the new wire: - DiscoveredDeviceReport.Type renamed to DriverName everywhere (model, store, handler, tests) to match the proto rename. - fleetd/main.go: wire fleetnodepairing.Service and pass it to both fleetnodegateway and fleetnodeadmin handler constructors. The new server-initiated components (fleetnodecontrol.Registry, ControlStream handler, DiscoverOnFleetNode admin handler) follow in the next commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implement the server-initiated discovery flow that pairs with the proto contracts merged in #257. Operators trigger DiscoverOnFleetNode(fleet_node_id, pairing.v1.DiscoverRequest); the server dispatches a ControlCommand to the agent's open ControlStream, the agent reports devices via ReportDiscoveredDevices with the matching command_id, and the server forwards each batch back to the operator stream until the agent's ControlAck closes the call. - fleetnodecontrol.Registry: in-memory map[fleet_node_id]stream plus per-command_id event channels. Single-instance fleetd only. - ControlStream handler: Hello -> Accepted -> bidi pump of outgoing ControlCommand and incoming ControlAck. PublishAck wires acks to the in-flight command channel. - DiscoverOnFleetNode admin RPC: validates session + fleet node org scope, expands IPRange to IPList server-side, rejects MDNS/Nmap, encodes payload, streams batches until ack. - ReportDiscoveredDevices: when command_id is set, publishes each persisted batch to the registry so the operator stream wakes up. - Export ipscanner.GenerateIPsFromCIDR so other packages can reuse CIDR enumeration. Tests cover registry register/send/ack/disconnect, ControlStream hello/dispatch/ack/duplicate-stream rejection, DiscoverOnFleetNode happy path + no-stream + MDNS reject + IPRange expansion + viewer gate, and the ReportDiscoveredDevices command_id correlation path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b4d4ad4 to
be68841
Compare
Drop the Nmap rejection in DiscoverOnFleetNode. With the agent shipping a bundled nmap binary (#256), NmapModeRequest now passes through unchanged so the agent can execute it. MDNS still rejected; the agent doesn't run an mDNS listener. Server-side validation: require nmap.target to be non-empty so an all-zero NmapModeRequest is caught before round-tripping through the control stream. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply ce-simplify findings: - Drop fleetnodegateway.MarshalDiscoverRequest. It was a 3-line proto.Marshal wrapper that made the admin handler import the gateway handler package for no abstraction benefit. Admin handler now calls proto.Marshal directly. - Extract toPairingDevice helper in ReportDiscoveredDevices so the pb.DiscoveredDeviceReport -> pairingpb.Device conversion isn't repeated inline when the registry-publish path forwards a batch. - Dedup Registry.PublishAck and PublishBatch via a shared private publish method; structural twins now share one implementation. - Strip narrative comments that restated the code (Phase/RFC pointers, WHAT comments, multi-paragraph descriptions). Keep only WHY where the reasoning isn't obvious from the signature. - Add expandIPv4Range unit tests (inclusive range, single address, invalid start IP, IPv6 rejection, end-before-start rejection, overflow cap). - Add DiscoverOnFleetNode test for the FailedPrecondition path when the fleet node is not CONFIRMED. Build, vet, lint, and tests green on the affected packages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reconnecting agents repeatedly hit FailedPrecondition because the server's HTTP/2 layer doesn't always notice a SIGKILL'd peer immediately, especially through docker-proxy on macOS. During that window the prior control-stream registry entry was orphaned and blocked all reconnects. Switch to MQTT-style newest-wins semantics. When Register sees an existing entry, evict it: close its outgoing channel (wakes the prior handler's main loop, which returns) and every in-flight command channel (operator streams get FailedPrecondition with "stream closed before command completed"). The old handler's deferred Unregister becomes a safe no-op via pointer-equality on the map entry, so a stale cleanup can't clobber the fresh registration. Stream now carries the *nodeStream back-pointer so PublishAck on an evicted handle is also a no-op rather than a write to the replacement's command channels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Server-side half of the fleet-node discovery + pairing flow under RFC 0001 phase 2. Operator-facing admin RPCs to pair/unpair/list devices by fleet_node, gateway RPC that accepts agent-reported discoveries, and the schema + guards needed to keep agent-reported rows from contaminating the cloud's existing server-local pairing paths.
The agent half (scan loop,
ReportDiscoveredDevicescaller,ipscanner.GenerateIPsFromCIDRexport) ships in #256. The two PRs share identical proto content and merge cleanly in either order.Protocol
proto/fleetnodegateway/v1/fleetnodegateway.proto:ReportDiscoveredDevicesunary batch RPC (max_items = 1024),DiscoveredDeviceReportmessage.proto/fleetnodeadmin/v1/fleetnodeadmin.proto: three admin RPCs —PairDeviceToFleetNode,UnpairDevice,ListFleetNodeDevices. Session-only auth.Schema
server/migrations/000050_add_fleet_node_attribution_to_discovered_device.up.sql:discovered_device.discovered_by_fleet_node_id BIGINT NULLwith FK tofleet_node(id) ON DELETE SET NULLand a partial index on the non-null subset.Domain + handlers
server/internal/domain/fleetnodepairing/: new domain service.PairDevicerequiresEnrollmentStatus == CONFIRMED, locks the fleet_node row inside the TX, and updatesfleet_node_device+discovered_device.discovered_by_fleet_node_idatomically.UnpairDeviceclears attribution.UpsertDiscoveredDevicesvalidates each report (IP must be RFC1918/RFC4193, port 1-65535, scheme http/https), then calls a single SQL upsert.server/internal/handlers/fleetnodegateway/handler.go:ReportDiscoveredDevicesreadsSubjectfrom the auth context for fleet_node_id + org_id (never from the body) and forwards to the domain service.server/internal/handlers/fleetnodeadmin/handler.go: three pairing RPCs wired intoSessionOnlyProcedures.server/internal/handlers/interceptors/config.go:ReportDiscoveredDevicesadded toFleetNodeAuthenticatedProcedures.SQL reconciliation
UpsertDiscoveredDeviceFromFleetNodeis a single statement that:auto:prefix, substitutes any existingauto:*identifier at the same(fleet_node, ip, port)endpoint (mirrors combined-modepairing.service'sGetByIPAndPortreuse).mac:*andserial:*pass through unchanged.WHEREon theDO UPDATEbranch blocks two attack surfaces: claims from a different fleet node, and retargeting of devices the cloud paired locally (which have adevicerow but nofleet_node_devicebinding).One round-trip per report regardless of identifier prefix.
Guards against agent-reported rows poisoning server flows
GetActiveUnpairedDiscoveredDevices/CountActiveUnpairedDiscoveredDevices: filter out rows wherediscovered_by_fleet_node_id IS NOT NULL. The cloud-side pairing flow dialsdiscovered_device.ip_address; an authenticated fleet node must not be able to plant a private address into the inventory and have the server reach for it. Remote-origin rows are paired throughPairDeviceToFleetNode, which never touches the IP.pairing.PairDevices: defense-in-depth guard that rejects any device whose linkeddiscovered_device.discovered_by_fleet_node_idis non-nil, even if the caller passes the identifier directly.RevokeFleetNode: now deletesfleet_node_devicerows and clearsdiscovered_device.discovered_by_fleet_node_idfor the revoked node in the same transaction (sealed via a newRevocationCleanupStoresub-interface). Soft-delete previously left attribution pointing at an unreachable node, blocking replacement nodes from refreshing the row.What's not in this PR
server/cmd/fleetnode/discovery.go, run.go discovery loop,ipscanner.GenerateIPsFromCIDRexport — see feat(fleetnode): discovery scan + ReportDiscoveredDevices agent #256.ControlStream+ per-device commands (next).UploadTelemetry/UploadEvents.validateReportis a coarse first line (RFC1918/RFC4193 only); follow-up before any server-side code starts dialingdiscovered_device.ip_address.Test plan
PATH=./bin:$PATH just gen(idempotent)PATH=./bin:$PATH go build ./server/...PATH=./bin:$PATH go vet ./server/...cd server && PATH=../bin:$PATH golangci-lint run -c .golangci.yaml ./...— 0 issuesDB_PASSWORD=fleet PATH=../bin:$PATH go test ./server/... -count=1— full suite passescd server && DB_PASSWORD=fleet PATH=../bin:$PATH go test ./internal/domain/fleetnodepairing/... ./internal/domain/fleetnodeenrollment/... ./internal/domain/pairing/... ./internal/handlers/fleetnodegateway/... ./internal/handlers/fleetnodeadmin/... -race🤖 Generated with Claude Code