feat(sync-service): self-heal stuck replication slot creation #4515
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pg_log_standby_snapshot

When CREATE_REPLICATION_SLOT is blocked waiting on pending transactions, Connection.Manager now periodically runs pg_log_standby_snapshot() on the admin pool so Postgres can reach a consistent snapshot and the source recovers without a manual restart. Degrades gracefully (one-time notice) when the function is unavailable (PG < 14 or missing EXECUTE privilege).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…reation

Holds an xid-bearing transaction open to block CREATE_REPLICATION_SLOT, then commits it and keeps the database idle, asserting that Electric forces a standby snapshot and resumes replication on its own (no manual restart).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #4515 +/- ##
===========================================
- Coverage 69.32% 56.49% -12.83%
===========================================
Files 77 358 +281
Lines 9277 39081 +29804
Branches 2896 10974 +8078
===========================================
+ Hits 6431 22078 +15647
- Misses 2828 16931 +14103
- Partials 18 72 +54
Claude Code Review
Summary
Iteration 4 covers the two commits added since iteration 3.
Previous Review Status: fully addressed ✅
What Is Working Well
Issues Found
Critical (Must Fix): None.
Important (Should Fix): None.
Suggestions (Nice to Have): None outstanding; every item from prior iterations is resolved.
Issue Conformance
No linked issue (unchanged across iterations); the SRE write-up in the PR description remains a thorough substitute. Both new commits are pure review-feedback cleanup with no scope creep.
Review iteration: 4 | 2026-06-08
✅ Deploy Preview for electric-next ready!
- Log the degraded-path notice with Logger.warning alongside the stack event
so self-hosted operators (the case where auto-heal is impossible) get a
log signal, not just external stack-event subscribers.
- Fix typo "transction" -> "transaction" in the user-facing message.
- Carry the notice as a structured map payload
{:replication_slot_unblock_unavailable, %{message: ...}} for consistency
with other map-shaped stack events (the cloud dashboard renders the
`message` key as text).
- Make the degrade-path test robust: gate on StatusMonitor (conn: :starting)
before asserting the stack events, instead of relying on the default 100ms
assert_receive timeout for the whole connection bring-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-add the no-op `handle_continue(:unblock_slot_creation, state)` fallback that was removed during cleanup, preserving crash-safety if a future caller ever dispatches this continue from a step other than :configuring_connection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
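Why the fallback clause matters can be illustrated with a minimal sketch. This is not Electric's actual `Connection.Manager`; the module name, the `current_step` state key, and the `unblock_attempts` counter are hypothetical and exist only for illustration. Without the second clause, a `:unblock_slot_creation` continue dispatched from any step other than `:configuring_connection` would raise a `FunctionClauseError` and crash the process.

```elixir
defmodule ContinueFallbackSketch do
  use GenServer

  # Hypothetical state shape: `current_step` and `unblock_attempts` are
  # assumptions made for this sketch, not names taken from Electric.
  @impl true
  def init(step) do
    state = %{current_step: step, unblock_attempts: 0}
    {:ok, state, {:continue, :unblock_slot_creation}}
  end

  # Main clause: only does work while the connection is being configured.
  @impl true
  def handle_continue(:unblock_slot_creation, %{current_step: :configuring_connection} = state) do
    {:noreply, %{state | unblock_attempts: state.unblock_attempts + 1}}
  end

  # No-op fallback: any other step leaves the state untouched instead of
  # crashing with a FunctionClauseError.
  def handle_continue(:unblock_slot_creation, state), do: {:noreply, state}
end

# Dispatching the continue from an unexpected step is harmless.
{:ok, pid} = GenServer.start(ContinueFallbackSketch, :some_other_step)
IO.inspect(:sys.get_state(pid), label: "state after fallback")
```

The cost of keeping the fallback is one extra clause; the benefit is that a future refactor that dispatches the continue from a new step degrades to a no-op instead of taking the connection manager down.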
This PR has been released! 🚀 The following packages include changes from this PR:
Thanks for contributing to Electric!
Summary
When a source's logical replication slot creation is blocked waiting on pending transactions, Electric now self-heals instead of requiring a manual source restart.
- `Connection.Manager` periodically runs `SELECT pg_log_standby_snapshot()` on the admin pool while slot creation is blocked, so Postgres can emit an `XLOG_RUNNING_XACTS` record and the logical snapshot builder reaches `CONSISTENT` as soon as the blocking transaction ends.
- Driven by the `:replication_configuration` status-check timer (it already detects the blocked state and dispatches `:replication_slot_creation_blocked_by_pending_transactions`); mirrors the existing `:check_lock_not_abandoned` admin-pool query pattern.
- Degrades gracefully when the function is unavailable (PG < 14 or missing `EXECUTE` privilege): falls back to the previous behavior and emits a one-time `:replication_slot_unblock_unavailable` stack event + warning with remediation.
Background
From an SRE investigation of a customer ("Ajax") with repeated source-inactivity incidents: a long-running transaction pins the slot's `restart_lsn`, retained WAL grows past `max_slot_wal_keep_size` (4 GB) and Postgres invalidates the slot. Recreating the slot then blocks on the same transaction, and on an otherwise-idle database Postgres does not emit a fresh `XLOG_RUNNING_XACTS` record for a long time after the transaction commits, so the source stays stuck until someone restarts it. `pg_log_standby_snapshot()` forces that record on demand, making recovery automatic.

🤖 Generated with Claude Code
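The behavior described in the summary (periodic snapshot query while blocked, one-time notice when the function is unavailable) can be sketched roughly as below. This is a simplified model under stated assumptions, not the PR's implementation: the module name, the `tick/2` function, the state keys, and the `{:error, atom}` error shape are all hypothetical, and `query_fun` stands in for whatever runs SQL on the admin pool. Only the SQL statement and the `{:replication_slot_unblock_unavailable, %{message: ...}}` event shape come from the PR text.

```elixir
defmodule SlotUnblockSketch do
  require Logger

  @snapshot_sql "SELECT pg_log_standby_snapshot()"

  # Called on every status-check tick. `query_fun` is an assumed 1-arity
  # function returning {:ok, result} | {:error, reason_atom}.
  # Nothing to do when slot creation is not blocked.
  def tick(%{slot_blocked?: false} = state, _query_fun), do: state

  # Degraded path already reported once: stay quiet and keep the old behavior.
  def tick(%{unblock_available?: false} = state, _query_fun), do: state

  def tick(state, query_fun) do
    case query_fun.(@snapshot_sql) do
      {:ok, _result} ->
        # Snapshot record requested; keep ticking until the slot is created.
        state

      {:error, reason} when reason in [:undefined_function, :insufficient_privilege] ->
        # Function missing or not executable: emit the notice exactly once.
        message = "Cannot force a standby snapshot (#{reason}); a manual restart may be needed"
        Logger.warning(message)
        event = {:replication_slot_unblock_unavailable, %{message: message}}
        %{state | unblock_available?: false, events: [event | state.events]}

      {:error, _transient} ->
        # Any other failure is treated as transient and retried on the next tick.
        state
    end
  end
end

# Example: the function is unavailable, so the first tick records one event.
state = %{slot_blocked?: true, unblock_available?: true, events: []}
missing = fn _sql -> {:error, :undefined_function} end
state = SlotUnblockSketch.tick(state, missing)
IO.inspect(state.events, label: "events")
```

Collecting events in the state map is a choice made here purely so the sketch is testable without a running stack; in the PR the notice is dispatched as a stack event and logged with `Logger.warning`.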