
feat(sync-service): self-heal stuck replication slot creation #4515

Merged
alco merged 10 commits into main from feat/slot-creation-self-heal on Jun 9, 2026
@erik-the-implementer erik-the-implementer commented Jun 5, 2026

Contributor

Summary

When a source's logical replication slot creation is blocked waiting on pending transactions, Electric now self-heals instead of requiring a manual source restart.

  • Connection.Manager periodically runs SELECT pg_log_standby_snapshot() on the admin pool while slot creation is blocked, so Postgres can emit an XLOG_RUNNING_XACTS record and the logical snapshot builder reaches CONSISTENT as soon as the blocking transaction ends.
  • Piggybacks on the existing :replication_configuration status-check timer (it already detects the blocked state and dispatches :replication_slot_creation_blocked_by_pending_transactions); mirrors the existing :check_lock_not_abandoned admin-pool query pattern.
  • Degrades gracefully when the function is unavailable (PostgreSQL < 14 or missing EXECUTE privilege): falls back to the previous behavior and emits a one-time :replication_slot_unblock_unavailable stack event + warning with remediation.
  • Makes the connection-status-check interval configurable so the behavior can be tested deterministically and fast.
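
The tick-driven unblock and its once-only degraded notice can be sketched roughly as follows. This is a minimal illustration, not the code in Connection.Manager: the module `SlotUnblockSketch`, the function `on_status_check/2`, the `slot_creation_blocked` key, the error atoms, and the message wording are all hypothetical. Only the `slot_unblock_notice_sent` flag and the `{:replication_slot_unblock_unavailable, %{message: ...}}` event shape come from this PR. `query_fun` stands in for running `SELECT pg_log_standby_snapshot()` on the admin pool.

```elixir
defmodule SlotUnblockSketch do
  @moduledoc """
  Hypothetical sketch of the per-tick decision; not Electric's implementation.
  """
  require Logger

  # Nothing to do unless slot creation is currently blocked.
  def on_status_check(%{slot_creation_blocked: false} = state, _query_fun) do
    {[], state}
  end

  # While blocked, try to force a standby snapshot on every status-check tick.
  # `query_fun` is assumed to return :ok or {:error, reason}.
  def on_status_check(state, query_fun) do
    case query_fun.() do
      :ok ->
        {[], state}

      # Assumed error atoms for "function missing" (PG < 14) and
      # "no EXECUTE privilege"; the real error mapping may differ.
      {:error, reason} when reason in [:undefined_function, :insufficient_privilege] ->
        maybe_notify(state, reason)

      # Any other failure: fall back to waiting, retry on the next tick.
      {:error, _other} ->
        {[], state}
    end
  end

  # The notice has already gone out once, so stay quiet.
  defp maybe_notify(%{slot_unblock_notice_sent: true} = state, _reason) do
    {[], state}
  end

  # First failure: log a warning and return one stack event for the caller to dispatch.
  defp maybe_notify(state, reason) do
    message = "Cannot force a standby snapshot (#{reason}); falling back to waiting."
    Logger.warning(message)
    event = {:replication_slot_unblock_unavailable, %{message: message}}
    {[event], Map.put(state, :slot_unblock_notice_sent, true)}
  end
end
```

Called twice with a failing `query_fun`, the sketch returns the event on the first call and an empty list on the second, which is the once-only behavior the degrade-path test asserts.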

Background

From an SRE investigation of a customer ("Ajax") with repeated source-inactivity incidents: a long-running transaction pins the slot's restart_lsn, retained WAL grows past max_slot_wal_keep_size (4 GB), and Postgres invalidates the slot. Recreating the slot then blocks on the same transaction. On an otherwise-idle database, Postgres does not emit a fresh XLOG_RUNNING_XACTS record for a long time after the transaction commits, so the source stays stuck until someone restarts it. pg_log_standby_snapshot() forces that record on demand, making recovery automatic.

🤖 Generated with Claude Code

alco and others added 5 commits June 5, 2026 15:25
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pg_log_standby_snapshot

When CREATE_REPLICATION_SLOT is blocked waiting on pending transactions,
Connection.Manager now periodically runs pg_log_standby_snapshot() on the
admin pool so Postgres can reach a consistent snapshot and the source
recovers without a manual restart. Degrades gracefully (one-time notice)
when the function is unavailable (PG < 14 or missing EXECUTE privilege).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reation

Holds an xid-bearing transaction open to block CREATE_REPLICATION_SLOT, then
commits it and keeps the database idle, asserting that Electric forces a
standby snapshot and resumes replication on its own (no manual restart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 5, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 56.49%. Comparing base (b0030a1) to head (ada95b8).
⚠️ Report is 16 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4515       +/-   ##
===========================================
- Coverage   69.32%   56.49%   -12.83%     
===========================================
  Files          77      358      +281     
  Lines        9277    39081    +29804     
  Branches     2896    10974     +8078     
===========================================
+ Hits         6431    22078    +15647     
- Misses       2828    16931    +14103     
- Partials       18       72       +54     
Flag Coverage Δ
packages/agents 70.75% <ø> (-1.03%) ⬇️
packages/agents-mcp 77.54% <ø> (?)
packages/agents-mobile 66.92% <ø> (?)
packages/agents-runtime 79.98% <ø> (?)
packages/agents-server 74.19% <ø> (+1.38%) ⬆️
packages/agents-server-ui 6.21% <ø> (?)
packages/electric-ax 46.42% <ø> (ø)
packages/experimental 87.73% <ø> (?)
packages/react-hooks 86.48% <ø> (?)
packages/start 82.83% <ø> (?)
packages/typescript-client 91.71% <ø> (?)
packages/y-electric 56.05% <ø> (?)
typescript 56.49% <ø> (-12.83%) ⬇️
unit-tests 56.49% <ø> (-12.83%) ⬇️


@claude

claude Bot commented Jun 5, 2026


Claude Code Review

Summary

Iteration 4 covers the two commits added since iteration 3 — 68edf0ca5 (restore the defensive handle_continue catch-all) and ada95b8d0 (tighten the assert_receive/refute_receive timeouts). Both are small, in-scope cleanups. No critical or important issues; still ready to merge.

Previous Review Status — fully addressed ✅

  • Suggestion #3 (defensive handle_continue/2 catch-all): now done. 68edf0ca5 re-adds the no-op handle_continue(:unblock_slot_creation, state), do: {:noreply, state} (manager.ex:590) after the step-gated clause, with a comment explaining why it exists. This restores crash-safety if a future caller ever dispatches the continue from a step other than :configuring_connection. The @impl true on the first clause of the group (above manager.ex:553) correctly covers the new clause, consistent with the sibling clauses at 553 and 577; no per-clause annotation is needed.
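
For readers unfamiliar with the pattern, the clause ordering can be illustrated with a small standalone GenServer. This is a sketch under assumptions, not the code in manager.ex: the module `ContinueFallbackSketch`, the `:unblock` and `:attempts` calls, and the `unblock_attempts` counter are hypothetical, and a plain map stands in for the `%State{}` struct.

```elixir
defmodule ContinueFallbackSketch do
  use GenServer

  def start_link(step), do: GenServer.start_link(__MODULE__, step)

  @impl true
  def init(step), do: {:ok, %{current_step: step, unblock_attempts: 0}}

  @impl true
  def handle_call(:unblock, _from, state) do
    # Reply first, then run handle_continue/2 before any other message is handled.
    {:reply, :ok, state, {:continue, :unblock_slot_creation}}
  end

  def handle_call(:attempts, _from, state) do
    {:reply, state.unblock_attempts, state}
  end

  # Step-gated clause: the real work happens only in the expected step.
  # A single @impl true on the first clause covers the whole clause group.
  @impl true
  def handle_continue(
        :unblock_slot_creation,
        %{current_step: {:start_replication_client, :configuring_connection}} = state
      ) do
    {:noreply, %{state | unblock_attempts: state.unblock_attempts + 1}}
  end

  # Catch-all: without this clause, dispatching the continue from any other
  # step would raise FunctionClauseError and crash the process.
  def handle_continue(:unblock_slot_creation, state), do: {:noreply, state}
end
```

Started in the `:configuring_connection` step, a process counts each unblock attempt. Started in any other step, it ignores the continue and stays alive.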

What Is Working Well

  • The no-op fallback is placed and ordered correctly — it sits after the specific %State{current_step: {:start_replication_client, :configuring_connection}} clause, so the real logic still wins and the catch-all only fires for unexpected steps. Exactly the let-it-not-crash hygiene this codebase favors for exhaustive handle_* matching.
  • Test timeout tuning is principled, not blind inflation. ada95b8d0 keeps the wait_until_conn_starting/2 poll as the absorber for the slow connection bring-up, then lets the downstream assertions use short timeouts: the blocked-event assert_receive drops to the ExUnit default, the unblock-unavailable notice asserts within 200 ms, and the repeat-refute_receive widens to 400 ms. The assert-first-occurrence / refute-any-repeat structure is the right way to prove the slot_unblock_notice_sent once-only gating, and 400 ms spans ~8 ticks at the 50 ms test interval — enough to catch a regression where the notice re-fires.
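
The assert-first-occurrence / refute-any-repeat structure described above can be sketched as a self-contained ExUnit script. The module `OnceOnlyNoticeTest`, the `:notice` message, and the ticker helpers are hypothetical stand-ins for the real stack-event plumbing. Only the 50 ms interval and the 200 ms and 400 ms timeouts are taken from the review text.

```elixir
ExUnit.start()

defmodule OnceOnlyNoticeTest do
  use ExUnit.Case, async: true

  # Hypothetical ticker: wakes every `interval_ms` and sends :notice on the
  # first tick only, mimicking a once-only gate like slot_unblock_notice_sent.
  defp start_ticker(test_pid, interval_ms, ticks) do
    spawn_link(fn -> tick(test_pid, interval_ms, ticks, false) end)
  end

  defp tick(_test_pid, _interval_ms, 0, _notice_sent), do: :ok

  defp tick(test_pid, interval_ms, remaining, notice_sent) do
    Process.sleep(interval_ms)
    if not notice_sent, do: send(test_pid, :notice)
    tick(test_pid, interval_ms, remaining - 1, true)
  end

  test "the notice arrives once and never repeats" do
    # 12 ticks at 50 ms outlast the whole assertion window below.
    start_ticker(self(), 50, 12)

    # First occurrence must show up quickly.
    assert_receive :notice, 200

    # 400 ms covers about 8 further ticks; any repeat fails the test.
    refute_receive :notice, 400
  end
end
```

If the gate were removed so that every tick sent `:notice`, the `refute_receive` would fail within the first 50 ms of its window, which is the regression this structure is meant to catch.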

Issues Found

Critical (Must Fix): None.

Important (Should Fix): None.

Suggestions (Nice to Have): None outstanding — every item from prior iterations is resolved.

Issue Conformance

No linked issue (unchanged across iterations); the SRE write-up in the PR description remains a thorough substitute. Both new commits are pure review-feedback cleanup with no scope creep.


Review iteration: 4 | 2026-06-08

@alco alco self-assigned this Jun 8, 2026
@alco alco marked this pull request as ready for review June 8, 2026 13:26
@netlify

netlify Bot commented Jun 8, 2026


Deploy Preview for electric-next ready!

Name Link
🔨 Latest commit 59aaa26
🔍 Latest deploy log https://app.netlify.com/projects/electric-next/deploys/6a26c2d48ef9ec000838d49b
😎 Deploy Preview https://deploy-preview-4515--electric-next.netlify.app

alco and others added 3 commits June 8, 2026 17:15
- Log the degraded-path notice with Logger.warning alongside the stack event
  so self-hosted operators (the case where auto-heal is impossible) get a
  log signal, not just external stack-event subscribers.
- Fix typo "transction" -> "transaction" in the user-facing message.
- Carry the notice as a structured map payload
  {:replication_slot_unblock_unavailable, %{message: ...}} for consistency
  with other map-shaped stack events (the cloud dashboard renders the
  `message` key as text).
- Make the degrade-path test robust: gate on StatusMonitor (conn: :starting)
  before asserting the stack events, instead of relying on the default 100ms
  assert_receive timeout for the whole connection bring-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-add the no-op `handle_continue(:unblock_slot_creation, state)` fallback that
was removed during cleanup, preserving crash-safety if a future caller ever
dispatches this continue from a step other than :configuring_connection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alco alco merged commit d3e1a64 into main Jun 9, 2026
76 of 77 checks passed
@alco alco deleted the feat/slot-creation-self-heal branch June 9, 2026 09:47
@github-actions

github-actions Bot commented Jun 9, 2026

Contributor

This PR has been released! 🚀

The following packages include changes from this PR:

  • @core/sync-service@1.6.10

Thanks for contributing to Electric!


3 participants