Skip to content

feat: basic request tracing with jaeger ui for dev#10

Merged
illegalprime merged 2 commits into
mainfrom
eden/tracing
Apr 22, 2026
Merged

feat: basic request tracing with jaeger ui for dev#10
illegalprime merged 2 commits into
mainfrom
eden/tracing

Conversation

@illegalprime
Copy link
Copy Markdown
Contributor

basic visualization with jaeger for request timings, nothing deeper at the moment:

Screenshot 2026-04-21 at 2 01 17 PM

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 21, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (a2f3e8ee1a521b70e65906d7ed9308096a4a1d1e...cc95c7855f97c9145d38866a1b316d7e25f65a20, exact PR three-dot diff)
  • Model: gpt-5.4

💡 Click "edited" above to see previous reviews for this PR.


Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] HTTP-level tracing will misclassify failed Connect/gRPC streams as successful

  • Category: gRPC
  • Location: server/internal/handlers/middleware/telemetry.go:15
  • Description: The new middleware wraps every RPC with otelhttp and only sees the outer HTTP response code. In this server, that wrapper sits in front of Connect handlers, and Connect servers also speak gRPC, including streaming. For gRPC, the RPC outcome is communicated in trailers when the call closes, not in the outer HTTP status. That means once a stream has started, later connect.CodeInternal, cancellation, or send failures can still be exported as an apparently successful 200 span. (connectrpc.com)
  • Impact: Trace-based debugging and alerting will under-report failures on the platform’s core streaming flows, which makes auth failures, cancellations, and backend send errors materially harder to detect during incident response.
  • Recommendation: Instrument at the Connect/gRPC layer instead of only at the HTTP layer, so span status comes from connect.Error/RPC status. If you keep the HTTP middleware, exclude or separately sample long-lived streaming RPCs.

[LOW] The new telemetry containers are not fully hardened

  • Category: Infrastructure
  • Location: server/docker-compose.yaml:94
  • Description: The new otel-collector and jaeger services are referenced by tag only, and the collector bind-mounts ./otel-collector-config.yaml without :ro. Docker documents that digests are immutable while tags are not, and that bind mounts are read-write by default unless explicitly marked read-only. In practice, that gives any compromised or retagged collector image a host write path back into the checked-out config file. (docs.docker.com)
  • Impact: This is limited to environments that run the compose stack, but on a developer workstation or CI runner it creates a supply-chain/persistence path that can tamper with local telemetry configuration.
  • Recommendation: Pin both images by digest and mount ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml:ro.

Notes

  • The behavior-changing portion of this diff is almost entirely the new telemetry wiring; I did not find new auth, SQL injection, or command-execution issues in the changed hunks.
  • No tests were added for TelemetryMiddleware or fleet_telemetry.Setup.
  • I could not run Go tests in this environment because the filesystem is read-only.

Generated by Codex Security Review |
Triggered by: @illegalprime |
Review workflow run

Comment thread server/cmd/fleetd/main.go
ankitgoswami
ankitgoswami previously approved these changes Apr 21, 2026
@ankitgoswami
Copy link
Copy Markdown
Contributor

maybe rename tracing to observability since we'd also want to use some of these same configs for metrics.

Comment thread server/internal/infrastructure/fleet-telemetry/telemetry.go
@illegalprime illegalprime force-pushed the eden/tracing branch 2 times, most recently from 829349c to b1483d8 Compare April 22, 2026 17:48
@illegalprime illegalprime requested a review from a team as a code owner April 22, 2026 17:48
@github-actions github-actions Bot added the dependencies Pull requests that update a dependency file label Apr 22, 2026
@illegalprime illegalprime force-pushed the eden/tracing branch 2 times, most recently from 0284b1a to 26cec51 Compare April 22, 2026 18:02
@illegalprime illegalprime merged commit d734e7f into main Apr 22, 2026
110 of 112 checks passed
@illegalprime illegalprime deleted the eden/tracing branch April 22, 2026 18:27
flesher added a commit that referenced this pull request May 8, 2026
Triage of the parallel adversarial review run on PR #206. Findings
applied as code:

[P1 #1] activity_log composite-FK MATCH SIMPLE bypass on NULL
organization_id closed via CHECK constraint:
ck_activity_log_site_requires_org enforces site_id IS NULL OR
organization_id IS NOT NULL. Verified: insert with site_id + NULL
org rejected; insert with NULL site_id + NULL org accepted (system
events unchanged).

[P1 #2] command_on_device_log org_id backfill: pre-flight DO block
counts orphan rows (codl rows whose device row is missing) and
RAISES with a clear message before SET NOT NULL. A clean abort beats
SET NOT NULL failing mid-migration with the dirty flag set.

[P2 #5] InsertError CTE rewrite: documented the contract change
(missing device $3 yields sql.ErrNoRows instead of FK violation).
Existing caller wraps generically so surface unchanged.

[P2 #6] Plan doc trimmed: power-contract column list marked DEFERRED
in entity description and Phase 1 migration bullet. Future readers
won't write service code against columns that don't exist.

[P2 #7] ListSites count subqueries: added org_id predicate to
device and building scans so they hit idx_device_org_site /
idx_building_org_deleted instead of full-table scan in multi-tenant
prod.

[P2 #8] InsertDeviceMetrics sub-select dropped AND deleted_at IS NULL
to match InsertError / InsertMinerStateSnapshot. Telemetry from a
soft-deleted device is still legitimate per-site history; three
writers, one behavior.

[P3 #10] building.default_rack_order_index: added
ck_building_default_rack_order_index CHECK (BETWEEN 0 AND 4) to
match sibling CHECKs.

[P3 #11] fk_device_set_rack_device_set_org: added ON DELETE CASCADE
to match the single-column FK on device_set_id and the building FK
on the same row. Composite adds the org-matching invariant without
changing cascade semantics.

Findings deferred:
- #3 (non-CONCURRENTLY indexes inside tx): deploy-time concern,
  document in PR/deploy notes; restructuring migrations to use
  CONCURRENTLY is a separate effort.
- #4 (no integration tests): defer to Phase 1B service layer where
  end-to-end flows exercise the invariants.
- #9 (device.uq_device_id_org_id missing): already exists (verified
  on live DB); finding is incorrect.
- #12 (CREATE TABLE IF NOT EXISTS): IF NOT EXISTS hides genuine
  schema errors; force-clean is the right golang-migrate pattern.

Round-trip clean, build clean, lint clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file server

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants