Skip to content

feat(fleetnode): device pairing + agent reporting (server) [PR 1/2]#332

Merged
ankitgoswami merged 4 commits into
mainfrom
ankitg/fleetnode-pairing
May 29, 2026
Merged

feat(fleetnode): device pairing + agent reporting (server) [PR 1/2]#332
ankitgoswami merged 4 commits into
mainfrom
ankitg/fleetnode-pairing

Conversation

@ankitgoswami
Copy link
Copy Markdown
Contributor

@ankitgoswami ankitgoswami commented May 28, 2026

PR 1 of a stack. Adds the server-side surfaces operators need to manage fleet-node device pairings and the gateway endpoint agents call to report devices they discovered on their LAN.

Summary

  • fleetnodepairing domainService + Store with PairDevice, UnpairDevice, ListPairs, ListDevicesForFleetNode, UpsertDiscoveredDevices. Agent reports are validated for RFC1918/RFC4193 IP, port range, and http/https scheme.
  • UpsertDiscoveredDeviceFromFleetNodeNOT EXISTS pairing guard prevents fleet node B from overwriting a device already paired with fleet node A.
  • FleetNodeGateway.ReportDiscoveredDevices — agent-authenticated RPC that ingests discovered devices and routes them through the pairing service.
  • FleetNodeAdmin.PairDeviceToFleetNode / UnpairDevice / ListFleetNodeDevices — operator endpoints, gated by fleetnode:manage / fleetnode:read via middleware.RequirePermission.
  • Revoke now cascades — pulled RevocationCleanupStore out of fleetnodeenrollment.Store so RevokeFleetNode deletes fleet_node_device rows for the revoked node in the same transaction.

Deferred to PR 2 (#235)

  • fleetnodecontrol.Registry (in-memory ControlStream + per-command_id event dispatch).
  • FleetNodeGateway.ControlStream bidi handler.
  • FleetNodeAdmin.DiscoverOnFleetNode operator-initiated discovery + proto max_items caps on DiscoverRequest modes.
  • The command_id correlation hook in ReportDiscoveredDevices that fans batches to the operator's waiting stream.

Test plan

  • cd server && go build ./... && go vet ./...
  • just lint clean (buf, eslint, golangci-lint)
  • DB_PASSWORD=fleet go test ./internal/handlers/middleware/... ./internal/handlers/fleetnodeadmin/... ./internal/handlers/fleetnodegateway/... ./internal/domain/fleetnodepairing/... ./internal/domain/fleetnodeenrollment/... — all green
  • CI green

🤖 Generated with Claude Code

@ankitgoswami ankitgoswami requested a review from a team as a code owner May 28, 2026 21:45
Copilot AI review requested due to automatic review settings May 28, 2026 21:45
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 55468dd665

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/handlers/fleetnodegateway/handler.go Outdated
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 28, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (0abe567af8bc6605c0b7bd35c0d1c8f593a17fd9...14f02c51937027cbd9dc517b32e6e3518caeccb3, exact PR three-dot diff)
  • Model: gpt-5.5

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: HIGH

Findings

[HIGH] Pair/unpair RPCs bypass miner pairing permissions

  • Category: Auth
  • Location: server/internal/handlers/fleetnodeadmin/handler.go:101
  • Description: PairDeviceToFleetNode and UnpairDevice require only fleetnode:manage. The existing miner pairing surfaces are gated by miner:pair and miner unpairing by miner:unpair, so a custom role intended only to manage fleet-node enrollment can now attach or detach miners by ID.
  • Impact: This expands fleet-node administration into miner routing control. A user without miner-pair/unpair rights can disrupt assignments and, as fleet-node control routing lands, potentially redirect miner command ownership to a node they manage.
  • Recommendation: Require miner permissions as well: fleetnode:manage plus miner:pair for pairing, and fleetnode:manage plus miner:unpair for unpairing. Mirror this in ProcedurePermissions and add tests for roles with only one of the required permissions.

[MEDIUM] Discovery reports ignore the required server-issued command ID

  • Category: gRPC | Network Discovery
  • Location: server/internal/handlers/fleetnodegateway/handler.go:99
  • Description: ReportDiscoveredDevices persists reports using only the authenticated fleet-node subject. It never checks req.Msg.GetCommandId() against an outstanding server-issued discovery command, despite the proto contract saying reports are only accepted in response to a scheduled command.
  • Impact: Any authenticated or compromised fleet node can submit arbitrary private-IP discovery data outside an operator-initiated scan, poisoning discovered-device inventory and potentially flooding up to the per-request cap repeatedly.
  • Recommendation: Track pending discovery command IDs keyed by fleet node, validate expiry and one-time use before upserting, and reject unknown/replayed command IDs. Until that exists, avoid persisting reports from this RPC or explicitly downgrade the proto contract.

Notes

No cryptostealing/pool-hijack behavior, raw SQL injection, command injection, or protobuf wire-format break was evident in the reviewed diff. The added rejected_count = 2 response field is wire-compatible.


Generated by Codex Security Review |
Triggered by: @ankitgoswami |
Review workflow run

ankitgoswami added a commit that referenced this pull request May 28, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the pairing + agent-reporting surface in PR 1 (#332). Builds on the existing fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory registry correlates server-issued ControlCommand requests with the agent's eventual ReportDiscoveredDevices batches.

What's in this PR:
- fleetnodecontrol.Registry — single-instance in-memory map of fleet_node_id -> active ControlStream + per-command_id event channel (CommandEvent { Batch | Ack }). Newest-wins eviction, dropped-event counter (64-slot buffer), atomic accounting for tests + observability.
- FleetNodeGateway.ControlStream — bidi handler. After Hello, registers the stream and pumps outgoing ControlCommand requests + incoming ControlAck responses through a side goroutine (2-buffer to avoid linger on exit).
- ReportDiscoveredDevices hook — when the agent reports devices with a command_id, the batch is also published to the registry so the operator's waiting stream wakes up.
- FleetNodeAdmin.DiscoverOnFleetNode — operator-facing streaming RPC. Validates target is CONFIRMED, normalizes IPRange to IPList (capped at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap. Uses id.GenerateID() for command_id and proto.Marshal for the payload.
- pairing.proto — buf.validate count caps on DiscoverRequest modes (4096 IPs, 256 ports per mode).
- ipscanner — exports GenerateIPsFromCIDR for cross-package reuse.
- middleware/rpc_permissions — DiscoverOnFleetNode -> fleetnode:manage.
- Tests — registry register/send/ack/eviction, ControlStream hello + dispatch + duplicate-stream rejection, DiscoverOnFleetNode happy/no-stream/MDNS-reject/IPRange-expand/Nmap-passthrough/viewer-gate, ReportDiscoveredDevices command_id correlation, expandIPv4Range overflow + boundary cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds server-side fleet-node device pairing and discovery reporting surfaces, wiring a new pairing domain into the fleet-node admin and gateway handlers plus revocation cleanup.

Changes:

  • Adds fleetnodepairing service/store models, SQL queries, and integration coverage.
  • Implements FleetNodeAdmin pair/unpair/list endpoints and FleetNodeGateway discovery report ingestion.
  • Updates revocation to delete fleet-node pairings and wires services into fleetd.

Reviewed changes

Copilot reviewed 14 out of 16 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
server/sqlc/queries/fleetnodepairing.sql Adds pairing, unpairing, listing, cleanup, and fleet-node discovery upsert queries.
server/internal/domain/fleetnodepairing/service.go Adds pairing business logic and discovery report validation/upsert flow.
server/internal/domain/fleetnodepairing/models.go Defines pairing and discovery report domain models.
server/internal/domain/stores/sqlstores/fleetnodepairing.go Implements SQL store adapter for the new pairing domain.
server/internal/domain/stores/sqlstores/fleetnodeenrollment.go Adds revocation cleanup store method.
server/internal/domain/fleetnodeenrollment/service.go Extends revoke flow to delete fleet-node device pairings.
server/internal/handlers/fleetnodeadmin/handler.go Adds admin RPC handlers for pair, unpair, and list fleet-node devices.
server/internal/handlers/fleetnodeadmin/handler_pairing_test.go Adds handler tests for admin pairing endpoints and permissions.
server/internal/handlers/fleetnodegateway/handler.go Adds gateway RPC handler for discovered device reports.
server/internal/handlers/fleetnodegateway/handler_heartbeat_test.go Updates handler test setup for the new pairing dependency.
server/internal/handlers/fleetnodegateway/handler_discovery_test.go Adds gateway discovery report handler tests.
server/internal/handlers/middleware/rpc_permissions.go Moves implemented fleet-node admin RPCs into permission mapping.
server/cmd/fleetd/main.go Wires pairing service/store into production handlers.
server/generated/sqlc/fleetnodepairing.sql.go, server/generated/sqlc/db.go Regenerated sqlc bindings for new queries.

Comment thread server/sqlc/queries/fleetnodepairing.sql
Comment thread server/internal/domain/fleetnodepairing/service.go Outdated
ankitgoswami added a commit that referenced this pull request May 28, 2026
Address security review on PR #332: the validator allows empty scheme but rejects "virtual", the scheme the virtual plugin emits (plugin/virtual/internal/driver/driver.go:160,186). Every legitimate virtual-plugin discovery report currently fails validation.

- Add "virtual" to the allowlist so virtual-plugin reports round-trip cleanly.
- Drop "" from the allowlist — empty was an undocumented placeholder, not a graceful default. The agent's plugin driver always knows the scheme at probe time.
- Tests: TestUpsertDiscoveredDevices_AcceptsVirtualScheme (new positive case), TestUpsertDiscoveredDevices_RejectsEmptyScheme (new negative case), existing RejectsDisallowedScheme (ftp) untouched.

The other two findings from the same review (command_id binding, attribution-based cloud-pairing quarantine) live in PR #235.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f5a1df0053

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/sqlc/queries/fleetnodepairing.sql
ankitgoswami added a commit that referenced this pull request May 28, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the pairing + agent-reporting surface in PR 1 (#332). Builds on the existing fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory registry correlates server-issued ControlCommand requests with the agent's eventual ReportDiscoveredDevices batches.

What's in this PR:
- fleetnodecontrol.Registry — single-instance in-memory map of fleet_node_id -> active ControlStream + per-command_id event channel (CommandEvent { Batch | Ack }). Newest-wins eviction, dropped-event counter (64-slot buffer), atomic accounting for tests + observability.
- FleetNodeGateway.ControlStream — bidi handler. After Hello, registers the stream and pumps outgoing ControlCommand requests + incoming ControlAck responses through a side goroutine (2-buffer to avoid linger on exit).
- ReportDiscoveredDevices hook — when the agent reports devices with a command_id, the batch is also published to the registry so the operator's waiting stream wakes up.
- FleetNodeAdmin.DiscoverOnFleetNode — operator-facing streaming RPC. Validates target is CONFIRMED, normalizes IPRange to IPList (capped at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap. Uses id.GenerateID() for command_id and proto.Marshal for the payload.
- pairing.proto — buf.validate count caps on DiscoverRequest modes (4096 IPs, 256 ports per mode).
- ipscanner — exports GenerateIPsFromCIDR for cross-package reuse.
- middleware/rpc_permissions — DiscoverOnFleetNode -> fleetnode:manage.
- Tests — registry register/send/ack/eviction, ControlStream hello + dispatch + duplicate-stream rejection, DiscoverOnFleetNode happy/no-stream/MDNS-reject/IPRange-expand/Nmap-passthrough/viewer-gate, ReportDiscoveredDevices command_id correlation, expandIPv4Range overflow + boundary cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread server/internal/domain/fleetnodepairing/service.go Outdated
Comment thread server/internal/domain/fleetnodepairing/service.go
Comment thread server/internal/handlers/fleetnodegateway/handler.go Outdated
Comment thread server/sqlc/queries/fleetnodepairing.sql Outdated
ankitgoswami added a commit that referenced this pull request May 29, 2026
Four findings from PR #332 inline review:

1. URLScheme proto/validator mismatch: proto's url_scheme had buf.validate `in: ["http", "https"]` which would reject "virtual" before the handler ever saw it, making the recent validator allowlist change dead surface. Added "virtual" to the proto `in: [...]` list and regenerated. Now virtual-plugin reports pass both layers.

2. UpsertDiscoveredDevices batch wasn't atomic: each per-report upsert auto-committed, so a mid-batch validation failure left a committed prefix. Now: validate every report up-front (validateReport is pure / O(n)), then wrap the writes in s.transactor.RunInTx so either the whole batch commits or none does. Ownership-rejected rows (0 rows affected) still count toward rejectedOwnership without aborting the tx — that's the store's normal "we refused to overwrite a hijacked row" signal, not an error.

3. Added RejectedCount field to ReportDiscoveredDevicesResponse and slog.Warn when rejectedOwnership > 0. Field is additive (proto3 backward-compatible).

4. UpsertDiscoveredDeviceFromFleetNode NOT EXISTS guard was over-blocking: the predicate `WHERE fnd.fleet_node_id IS NULL` blocked any device row not paired to the reporting node — including unpaired devices. After UnpairDevice the originating node could never refresh that discovered_device (is_active / last_seen / ip would freeze). Rewrote the predicate as `NOT EXISTS (... JOIN fnd ... AND fnd.fleet_node_id <> $10 ...)` — block only when paired to a *different* fleet_node, which is the actual hijack case the comment described. Updated the test that asserted the old over-blocking behavior (renamed and inverted assertions); added TestUpsertDiscoveredDevices_BatchValidationErrorRollsBack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added javascript Pull requests that update javascript code client shared labels May 29, 2026
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 35e7a64a8f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/sqlc/queries/fleetnodepairing.sql
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the pairing + agent-reporting surface in PR 1 (#332). Builds on the existing fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory registry correlates server-issued ControlCommand requests with the agent's eventual ReportDiscoveredDevices batches.

What's in this PR:
- fleetnodecontrol.Registry — single-instance in-memory map of fleet_node_id -> active ControlStream + per-command_id event channel (CommandEvent { Batch | Ack }). Newest-wins eviction, dropped-event counter (64-slot buffer), atomic accounting for tests + observability.
- FleetNodeGateway.ControlStream — bidi handler. After Hello, registers the stream and pumps outgoing ControlCommand requests + incoming ControlAck responses through a side goroutine (2-buffer to avoid linger on exit).
- ReportDiscoveredDevices hook — when the agent reports devices with a command_id, the batch is also published to the registry so the operator's waiting stream wakes up.
- FleetNodeAdmin.DiscoverOnFleetNode — operator-facing streaming RPC. Validates target is CONFIRMED, normalizes IPRange to IPList (capped at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap. Uses id.GenerateID() for command_id and proto.Marshal for the payload.
- pairing.proto — buf.validate count caps on DiscoverRequest modes (4096 IPs, 256 ports per mode).
- ipscanner — exports GenerateIPsFromCIDR for cross-package reuse.
- middleware/rpc_permissions — DiscoverOnFleetNode -> fleetnode:manage.
- Tests — registry register/send/ack/eviction, ControlStream hello + dispatch + duplicate-stream rejection, DiscoverOnFleetNode happy/no-stream/MDNS-reject/IPRange-expand/Nmap-passthrough/viewer-gate, ReportDiscoveredDevices command_id correlation, expandIPv4Range overflow + boundary cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ankitgoswami and others added 3 commits May 29, 2026 10:47
Adds the server-side surfaces operators need to manage fleet-node device pairings and the gateway endpoint agents call to report devices they discovered on their LAN. This is PR 1 of a stack — PR 2 (open as #235) layers server-initiated discovery via ControlStream on top.

What's in this PR:
- fleetnodepairing domain (Service + Store) with PairDevice, UnpairDevice, ListPairs, ListDevicesForFleetNode, UpsertDiscoveredDevices, plus IP/port/scheme validation on agent-reported devices.
- fleet_node_device pairing table queries and UpsertDiscoveredDeviceFromFleetNode with a NOT EXISTS guard that prevents fleet node B from overwriting a device already paired with fleet node A.
- FleetNodeGateway.ReportDiscoveredDevices RPC: agents authenticated via fleetnodeauth submit batches of devices; ip_address must be RFC1918/RFC4193, port 1-65535, url_scheme http or https.
- FleetNodeAdmin.PairDeviceToFleetNode, UnpairDevice, ListFleetNodeDevices RPCs, gated by fleetnode:manage / fleetnode:read via middleware.RequirePermission.
- RevocationCleanupStore extracted from fleetnodeenrollment.Store so RevokeFleetNode now deletes the fleet node's pairings as part of the same TX.
- Integration tests for the pairing CRUD round trip, double-pair rejection, soft-deleted/pending node rejection, cross-org isolation, agent-report validation (invalid IP, port, scheme, non-private ranges, RFC4193 IPv6), the NOT EXISTS pairing guard, and revoke-clears-pairings.

What's deferred to PR 2:
- fleetnodecontrol.Registry (in-memory ControlStream + per-command_id event dispatch).
- FleetNodeGateway.ControlStream bidi handler.
- FleetNodeAdmin.DiscoverOnFleetNode operator-initiated discovery, plus the proto-level max_items caps on DiscoverRequest modes.
- The command_id correlation hook in ReportDiscoveredDevices that fans batches to the operator's waiting stream.

Build, vet, lint, and tests for middleware, fleetnodeadmin, fleetnodegateway, fleetnodepairing, and fleetnodeenrollment are green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address security review on PR #332: the validator allows empty scheme but rejects "virtual", the scheme the virtual plugin emits (plugin/virtual/internal/driver/driver.go:160,186). Every legitimate virtual-plugin discovery report currently fails validation.

- Add "virtual" to the allowlist so virtual-plugin reports round-trip cleanly.
- Drop "" from the allowlist — empty was an undocumented placeholder, not a graceful default. The agent's plugin driver always knows the scheme at probe time.
- Tests: TestUpsertDiscoveredDevices_AcceptsVirtualScheme (new positive case), TestUpsertDiscoveredDevices_RejectsEmptyScheme (new negative case), existing RejectsDisallowedScheme (ftp) untouched.

The other two findings from the same review (command_id binding, attribution-based cloud-pairing quarantine) live in PR #235.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four findings from PR #332 inline review:

1. URLScheme proto/validator mismatch: proto's url_scheme had buf.validate `in: ["http", "https"]` which would reject "virtual" before the handler ever saw it, making the recent validator allowlist change dead surface. Added "virtual" to the proto `in: [...]` list and regenerated. Now virtual-plugin reports pass both layers.

2. UpsertDiscoveredDevices batch wasn't atomic: each per-report upsert auto-committed, so a mid-batch validation failure left a committed prefix. Now: validate every report up-front (validateReport is pure / O(n)), then wrap the writes in s.transactor.RunInTx so either the whole batch commits or none does. Ownership-rejected rows (0 rows affected) still count toward rejectedOwnership without aborting the tx — that's the store's normal "we refused to overwrite a hijacked row" signal, not an error.

3. Added RejectedCount field to ReportDiscoveredDevicesResponse and slog.Warn when rejectedOwnership > 0. Field is additive (proto3 backward-compatible).

4. UpsertDiscoveredDeviceFromFleetNode NOT EXISTS guard was over-blocking: the predicate `WHERE fnd.fleet_node_id IS NULL` blocked any device row not paired to the reporting node — including unpaired devices. After UnpairDevice the originating node could never refresh that discovered_device (is_active / last_seen / ip would freeze). Rewrote the predicate as `NOT EXISTS (... JOIN fnd ... AND fnd.fleet_node_id <> $10 ...)` — block only when paired to a *different* fleet_node, which is the actual hijack case the comment described. Updated the test that asserted the old over-blocking behavior (renamed and inverted assertions); added TestUpsertDiscoveredDevices_BatchValidationErrorRollsBack.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami force-pushed the ankitg/fleetnode-pairing branch from 35e7a64 to ad35e6c Compare May 29, 2026 17:50
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ad35e6cdf9

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/fleetnodepairing/service.go
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The committed driver_pb2_grpc.py carried GRPC_GENERATED_VERSION 1.76.0,
generated against a stale proto-python-gen venv. CI regenerates with the
pinned grpcio-tools 1.80.0, so the generated-code-check and
python-gen-staleness gates failed. Regenerate to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14f02c5193

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/internal/domain/fleetnodepairing/service.go
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankitgoswami ankitgoswami enabled auto-merge (squash) May 29, 2026 18:43
@ankitgoswami ankitgoswami merged commit 9b5bf09 into main May 29, 2026
135 of 137 checks passed
@ankitgoswami ankitgoswami deleted the ankitg/fleetnode-pairing branch May 29, 2026 19:39
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ankitgoswami added a commit that referenced this pull request May 29, 2026
PR 2 of a stack. Layers operator-initiated discovery on top of the
pairing + agent-reporting surface in PR 1 (#332). Builds on the existing
fleetnodepairing.UpsertDiscoveredDevices ingestion path; an in-memory
registry correlates server-issued ControlCommand requests with the
agent's eventual ReportDiscoveredDevices batches.

What's in this PR:

- fleetnodecontrol.Registry: single-instance in-memory map of
  fleet_node_id -> active ControlStream + per-command_id event channel
  (CommandEvent { Batch | Ack }). Newest-wins eviction signaled via a
  done channel (so outgoing channel is never closed under a publisher);
  Send selects on done to bail cleanly. Publishers hold the mutex
  through the bounded non-blocking send to avoid panicking on a
  closed channel when cleanup races. Dropped-event counter on a 64-slot
  buffer, exposed via DroppedEvents().

- FleetNodeGateway.ControlStream: bidi handler. Hello receive is wrapped
  in a 5s timeout (HelloTimeout var) so an authenticated-but-idle agent
  cannot hold a server goroutine + HTTP/2 stream indefinitely. After
  Hello, registers the stream and pumps outgoing ControlCommand
  requests + incoming ControlAck responses through a side goroutine
  (2-buffer to avoid linger on exit).

- ReportDiscoveredDevices: rejects reports without a command_id or
  whose command_id is not in flight for this fleet_node (binds to
  server-issued ControlCommand). UpsertDiscoveredDevices now returns
  acceptedIdx []int instead of an opaque count; only the rows the store
  actually accepted are forwarded to the operator's command stream so
  ownership-rejected rows can't leak.

- FleetNodeAdmin.DiscoverOnFleetNode: operator-facing streaming RPC.
  Validates target is CONFIRMED, normalizes IPRange to IPList (capped
  at 4096 expanded addresses), rejects MDNS, forwards IPList/Nmap.
  Wraps the operator ctx with DiscoverCommandTimeout (5m default, var
  for test override) so a buggy/silent agent cannot pin operator
  streams + registry entries forever. Returns CodeDeadlineExceeded on
  timeout. Uses id.GenerateID() for command_id and proto.Marshal for
  the payload.

- discovered_by_fleet_node_id is immutable origin tracking. Set on
  first agent report; never cleared by PairDevice / UnpairDevice /
  RevokeFleetNode. Cloud-side pairing.PairDevices refuses to dial any
  discovered_device with DiscoveredByFleetNodeID != nil so an
  agent-reported private IP cannot redirect cloud credentialing later.
  Migration 000064 adds the column + FK + partial index.

- UpsertDiscoveredDeviceFromFleetNode reconciles auto:* identifiers
  per (fleet_node, ip, port) endpoint so re-keyed scans collapse onto
  one row; mac:/serial: identifiers pass through unchanged.

- pairing.proto: buf.validate count caps on DiscoverRequest modes
  (4096 IPs, 256 ports per mode).

- middleware: DiscoverOnFleetNode gated on fleetnode:manage.

Review fixes folded in:

- Migration 000065 widens discovered_device.url_scheme from VARCHAR(10)
  to VARCHAR(32) to match the gateway proto's advertised max_len. Schemes
  of 11-32 chars (e.g. "stratum+tcp") passed validation but overflowed the
  column, failing the whole batch as an internal error.

- UpsertDiscoveredDevices tallies accepted/rejected into per-attempt
  locals reset on closure entry, so a RunInTx retry after a retryable
  Postgres/commit failure can no longer double-count a batch. Adds a unit
  test for the retry path and a DB-backed test for the 32-char scheme.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

client javascript Pull requests that update javascript code server shared

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants