Skip to content

Centralize telemetry collector error handling via Poller.safe_invoke/3#4148

Open
alco wants to merge 4 commits intomainfrom
erik/issue-32-safe-invoke
Open

Centralize telemetry collector error handling via Poller.safe_invoke/3#4148
alco wants to merge 4 commits intomainfrom
erik/issue-32-safe-invoke

Conversation

@alco
Copy link
Copy Markdown
Member

@alco alco commented Apr 21, 2026

vetted by human ready for review

Summary

  • Adds ElectricTelemetry.Poller.safe_invoke/3 and wraps every MFA built by periodic_measurements/2 in it, so transient collector failures no longer cause :telemetry_poller to permanently drop a measurement from its polling list.
  • Silently absorbs :noproc/:timeout/:shutdown/:normal exits and ArgumentError (typical startup/restart races); logs a warning tagged with the MFA for anything else.
  • Strips now-redundant defensive try/catch and with-fallthrough code from individual collectors in Electric.StackSupervisor.Telemetry, since safe_invoke/3 is the single contract for containing transient errors. The one ordering tweak worth noting: count_shapes/2 emits active_shapes before calling shape_counts/1, so a startup-race failure in the latter only drops total_shapes for the tick.

Background

:telemetry_poller catches MFA exceptions but returns error to its internal loop, which removes the measurement permanently. A single GenServer restart race or ETS-table startup race is enough to silently disable a metric for the lifetime of the poller. See issue electric-sql/alco-agent-tasks#32 for the full audit and design.

Note on semantics

User-supplied periodic measurement functions passed to ElectricTelemetry no longer have their exceptions propagated up to :telemetry_poller's own error logger — they are now caught and logged via ElectricTelemetry.Poller instead. This is called out in the changeset.

Test plan

  • New ElectricTelemetry.PollerTest covers all catch clauses + the wrapping logic in periodic_measurements/2
  • mix format clean
  • mix compile clean for @core/sync-service and @core/electric-telemetry
  • Electric.StackSupervisorTest passes
  • Existing telemetry / stack-supervisor tests pass in CI

Refs electric-sql/alco-agent-tasks#32

telemetry_poller removes a measurement permanently from its polling list
after the first failure, so transient collector errors (GenServer restart
races, ETS tables not yet created, DB unavailable) silently disable metrics
for the lifetime of the poller.

Wrap every MFA built by ElectricTelemetry.Poller.periodic_measurements/2 in
safe_invoke/3, which absorbs :noproc/:timeout/:shutdown/:normal exits and
ArgumentError silently and logs any other failure as a warning tagged with
the offending MFA. The measurement always returns :ok to telemetry_poller
and stays on the polling list.

Strip now-redundant defensive code from count_shapes/2 (with-fallthrough)
and report_retained_wal_size/3 (try/catch on :noproc and catch-all).

Refs electric-sql/alco-agent-tasks#32
@alco alco added the claude label Apr 21, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 66.59%. Comparing base (388ec63) to head (e3a0fb8).
⚠️ Report is 47 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...lectric-telemetry/lib/electric/telemetry/poller.ex 94.73% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4148       +/-   ##
===========================================
- Coverage   89.20%   66.59%   -22.62%     
===========================================
  Files          25      155      +130     
  Lines        2520    17676    +15156     
  Branches      636     4188     +3552     
===========================================
+ Hits         2248    11771     +9523     
- Misses        270     5901     +5631     
- Partials        2        4        +2     
Flag Coverage Δ
electric-telemetry 71.40% <94.73%> (?)
elixir 71.40% <94.73%> (?)
packages/agents 47.33% <ø> (?)
packages/agents-runtime 81.42% <ø> (?)
packages/agents-server 65.69% <ø> (?)
packages/agents-server-ui 0.00% <ø> (?)
packages/electric-ax 31.19% <ø> (?)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 94.30% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 66.43% <ø> (-22.78%) ⬇️
unit-tests 66.59% <94.73%> (-22.62%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13b3967b18

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/sync-service/lib/electric/stack_supervisor/telemetry.ex
Comment thread packages/sync-service/lib/electric/stack_supervisor/telemetry.ex
@claude
Copy link
Copy Markdown

claude Bot commented Apr 21, 2026

Claude Code Review

Summary

The new commit (e3a0fb8bb) drops the local try/catch from report_retained_wal_size/3 and the defensive with-fallthrough from count_shapes/2, correctly delegating all error handling to safe_invoke. One remaining issue from the previous iterations — the :builtin measurements path — is still not wrapped.

What's Working Well

  • Removing the catch-all try/catch from report_retained_wal_size/3 and delegating to safe_invoke correctly restores warning-level logging for unexpected errors. This is exactly the right fix for the observability regression raised in the last review.
  • Removing the "do not raise if stack down" test is intentional and correct — the functions are now expected to raise when called without safe_invoke, and the poller wrapper is the intended contract.
  • The ordering of count_shapes/2 (emit active_shapes before the potentially-failing shape_counts/1 call) is clean and well-commented.
  • safe_invoke/3 and its test suite remain solid.

Issues Found

Important (Should Fix)

:builtin measurements still not wrapped in safe_invoke

File: packages/electric-telemetry/lib/electric/telemetry/poller.ex:30-31

:builtin ->
  module.builtin_periodic_measurements(telemetry_opts)

The 2-arity periodic_measurements/2 fallback (line 49-50) was correctly fixed to pipe through &wrap_mfa/1, but the :builtin branch inside flat_map still returns MFAs from builtin_periodic_measurements/1 without going through wrap_mfa/1. Any crash in a builtin measurement will still be surfaced directly to :telemetry_poller, which will permanently remove it from the polling list — the exact problem this PR was designed to fix.

The fix is a one-liner (same suggestion as previous two iterations):

:builtin ->
  module.builtin_periodic_measurements(telemetry_opts)
  |> Enum.map(&wrap_mfa/1)

Suggestions (Nice to Have)

Silent ArgumentError swallowing is unexplainedpackages/electric-telemetry/lib/electric/telemetry/poller.ex:73-74. The rescue ArgumentError -> :ok clause silently absorbs the error with no log output. A one-line comment (# ETS tables not yet created during startup/restart) would make this intent clear to the next reader. Carried over from previous two iterations.


Issue Conformance

No linked public issue (refs alco-agent-tasks#32). Remains a non-blocking concern.


Previous Review Status

Issue Status
Critical: Failing test for cold-stack measurements Fixed in iteration 2
Important: :builtin measurements not wrapped Still not addressed
Important: Observability regression in report_retained_wal_size/3 Fixed — try/catch removed, safe_invoke handles at warning level
Suggestion: ArgumentError comment Not addressed

Review iteration: 3 | 2026-04-27

…tting

- count_shapes/2: restore `with` fallthrough on `:error` from shape_counts
  and emit `active_shapes` independently so a shape-cache outage doesn't
  drop both metrics for the tick (Codex feedback).
- report_retained_wal_size/3: restore try/catch around the Postgrex call
  so direct callers (including the stack-down regression test) don't
  crash on transient DB/pool failures (Codex feedback).
- Reformat poller.ex and poller_test.exs per `mix format`.

safe_invoke/3 stays as the backstop for unexpected errors through the
poller wrapper; these local handlers ensure graceful partial emission
for known stack-startup states.
@alco alco marked this pull request as draft April 22, 2026 00:06
alco and others added 2 commits April 28, 2026 01:01
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reverts the local error handlers added in 275e0d9 — safe_invoke is
the wrapper for transient errors, so the collectors should raise and
let it log/swallow.

- count_shapes/2: emit active_shapes first so a shape_counts/1 failure
  (e.g. shape cache not yet started) only drops total_shapes for the
  tick.
- report_retained_wal_size/3: let Postgrex errors propagate; safe_invoke
  contains and logs them.
- Drop the regression test that asserted direct calls don't crash on
  stack-down — that contract no longer holds and safe_invoke has its
  own tests in electric-telemetry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 27, 2026

Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit e3a0fb8
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/69efebf8efa27f0008975899
😎 Deploy Preview https://deploy-preview-4148--electric-next.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@alco alco marked this pull request as ready for review April 27, 2026 23:09
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3a0fb8bb9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/electric-telemetry/lib/electric/telemetry/poller.ex
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants