Centralize telemetry collector error handling via Poller.safe_invoke/3 #4148
telemetry_poller removes a measurement permanently from its polling list after the first failure, so transient collector errors (GenServer restart races, ETS tables not yet created, DB unavailable) silently disable metrics for the lifetime of the poller.
Wrap every MFA built by ElectricTelemetry.Poller.periodic_measurements/2 in safe_invoke/3, which silently absorbs :noproc/:timeout/:shutdown/:normal exits and ArgumentError, and logs any other failure as a warning tagged with the offending MFA. The measurement always returns :ok to telemetry_poller and stays on the polling list.
Strip now-redundant defensive code from count_shapes/2 (with-fallthrough) and report_retained_wal_size/3 (try/catch on :noproc and catch-all).
Refs electric-sql/alco-agent-tasks#32
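The contract described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the module name `SafeInvokeSketch`, the private helpers, and the log wording are assumptions. Only the list of swallowed exit reasons, the `ArgumentError` case, and the always-`:ok` return value come from the description.

```elixir
defmodule SafeInvokeSketch do
  require Logger

  # Exit reasons the PR description lists as transient; swallowed without logging.
  @transient_exits [:noproc, :timeout, :shutdown, :normal]

  # Runs apply(m, f, a) and always returns :ok, so the caller never sees a failure.
  def safe_invoke(m, f, a) do
    apply(m, f, a)
    :ok
  rescue
    # Typically an ETS table that has not been created yet.
    ArgumentError -> :ok
    error -> warn(m, f, a, :error, error)
  catch
    :exit, reason ->
      if transient_exit?(reason), do: :ok, else: warn(m, f, a, :exit, reason)

    kind, reason ->
      warn(m, f, a, kind, reason)
  end

  # GenServer.call/3 exits with {reason, {GenServer, :call, args}}, so match both shapes.
  defp transient_exit?(reason) when reason in @transient_exits, do: true
  defp transient_exit?({reason, _detail}) when reason in @transient_exits, do: true
  defp transient_exit?(_other), do: false

  # Logs the failure tagged with the offending MFA and still reports :ok.
  defp warn(m, f, a, kind, reason) do
    Logger.warning(
      "Periodic measurement #{Exception.format_mfa(m, f, length(a))} failed: " <>
        Exception.format_banner(kind, reason)
    )

    :ok
  end
end
```

Under these assumptions, calling `SafeInvokeSketch.safe_invoke(GenServer, :call, [:not_running, :ping])` returns `:ok` without logging, because the `:noproc` exit is treated as transient.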
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #4148 +/- ##
===========================================
- Coverage 89.20% 66.59% -22.62%
===========================================
Files 25 155 +130
Lines 2520 17676 +15156
Branches 636 4188 +3552
===========================================
+ Hits 2248 11771 +9523
- Misses 270 5901 +5631
- Partials 2 4 +2
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13b3967b18
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Claude Code Review
Summary
The new commit (
What's Working Well
Issues Found
Important (Should Fix)
File:
    :builtin ->
      module.builtin_periodic_measurements(telemetry_opts)
The 2-arity
The fix is a one-liner (same suggestion as previous two iterations):
    :builtin ->
      module.builtin_periodic_measurements(telemetry_opts)
      |> Enum.map(&wrap_mfa/1)
Suggestions (Nice to Have)
Silent
Issue Conformance
No linked public issue (refs
Previous Review Status
Review iteration: 3 | 2026-04-27
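For context, the kind of rewrite that `Enum.map(&wrap_mfa/1)` implies might look like the sketch below. The real `wrap_mfa/1` is not shown in this conversation, so the body here is an assumption. `WrapSketch` and `MyApp.Metrics` are hypothetical names, and passing non-MFA entries through unchanged is a guess about the intended behaviour.

```elixir
defmodule WrapSketch do
  # Hypothetical stand-in for the wrap_mfa/1 helper named in the review.
  # Rewrites {m, f, a} so the poller calls ElectricTelemetry.Poller.safe_invoke(m, f, a).
  def wrap_mfa({m, f, a}) when is_atom(m) and is_atom(f) and is_list(a) do
    {ElectricTelemetry.Poller, :safe_invoke, [m, f, a]}
  end

  # Assumption: entries that are not MFA tuples (e.g. :memory) pass through unchanged.
  def wrap_mfa(other), do: other
end

# MyApp.Metrics is a placeholder; the tuples are plain data and nothing is invoked here.
measurements = [{MyApp.Metrics, :count_shapes, ["stack-1"]}, :memory]
wrapped = Enum.map(measurements, &WrapSketch.wrap_mfa/1)
IO.inspect(wrapped)
```

Because the wrapped entry is still an ordinary `{module, function, args}` tuple, it can be handed to `:telemetry_poller` like any other measurement.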
…tting
- count_shapes/2: restore `with` fallthrough on `:error` from shape_counts and emit `active_shapes` independently so a shape-cache outage doesn't drop both metrics for the tick (Codex feedback).
- report_retained_wal_size/3: restore try/catch around the Postgrex call so direct callers (including the stack-down regression test) don't crash on transient DB/pool failures (Codex feedback).
- Reformat poller.ex and poller_test.exs per `mix format`.
safe_invoke/3 stays as the backstop for unexpected errors through the poller wrapper; these local handlers ensure graceful partial emission for known stack-startup states.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the local error handlers added in 275e0d9. safe_invoke is the wrapper for transient errors, so the collectors should raise and let it log/swallow.
- count_shapes/2: emit active_shapes first so a shape_counts/1 failure (e.g. shape cache not yet started) only drops total_shapes for the tick.
- report_retained_wal_size/3: let Postgrex errors propagate; safe_invoke contains and logs them.
- Drop the regression test that asserted direct calls don't crash on stack-down. That contract no longer holds, and safe_invoke has its own tests in electric-telemetry.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
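The ordering point for count_shapes/2 can be illustrated with a small sketch. The real collector is not shown here, so everything below is assumed for illustration: the module name, the three-function signature, and the `emit` callback, which stands in for whatever telemetry call the collector actually makes.

```elixir
defmodule CountShapesSketch do
  # Hypothetical shape of the collector: emit the metric that cannot fail first,
  # then call the function that may raise during a startup race.
  def count_shapes(emit, active_shapes_fun, shape_counts_fun) do
    emit.(:active_shapes, active_shapes_fun.())

    # If this raises, active_shapes has already been emitted for this tick;
    # a wrapper such as safe_invoke/3 would contain the failure.
    emit.(:total_shapes, shape_counts_fun.())
    :ok
  end
end

# Usage: the failing lookup only costs total_shapes for the tick.
emit = fn name, value -> send(self(), {:metric, name, value}) end

try do
  CountShapesSketch.count_shapes(emit, fn -> 3 end, fn -> raise "shape cache not started" end)
rescue
  RuntimeError -> :ok
end
```

With the calls in the opposite order, the same failure would drop both metrics for the tick.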
✅ Deploy Preview for electric-next ready!
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e3a0fb8bb9
Summary
- Adds ElectricTelemetry.Poller.safe_invoke/3 and wraps every MFA built by periodic_measurements/2 in it, so transient collector failures no longer cause :telemetry_poller to permanently drop a measurement from its polling list.
- Silently absorbs :noproc/:timeout/:shutdown/:normal exits and ArgumentError (typical startup/restart races); logs a warning tagged with the MFA for anything else.
- Removes the now-redundant try/catch and with-fallthrough code from individual collectors in Electric.StackSupervisor.Telemetry, since safe_invoke/3 is the single contract for containing transient errors. The one ordering tweak worth noting: count_shapes/2 emits active_shapes before calling shape_counts/1, so a startup-race failure in the latter only drops total_shapes for the tick.
Background
:telemetry_poller catches MFA exceptions but returns error to its internal loop, which removes the measurement permanently. A single GenServer restart race or ETS-table startup race is enough to silently disable a metric for the lifetime of the poller. See issue electric-sql/alco-agent-tasks#32 for the full audit and design.
Note on semantics
User-supplied periodic measurement functions passed to ElectricTelemetry no longer have their exceptions propagated up to :telemetry_poller's own error logger; they are now caught and logged via ElectricTelemetry.Poller instead. This is called out in the changeset.
Test plan
- ElectricTelemetry.PollerTest covers all catch clauses + the wrapping logic in periodic_measurements/2
- mix format clean
- mix compile clean for @core/sync-service and @core/electric-telemetry
- Electric.StackSupervisorTest passes
Refs electric-sql/alco-agent-tasks#32