feat(sync-service): record per-transaction fragment wall-time #4504
Claude Code Review

Summary
Iteration 9. One new commit since iteration 8.

What's Working Well

Issues Found
Critical (Must Fix): None.
Important (Should Fix): None.
Suggestions (Nice to Have): Carry-overs from prior iterations, all non-blocking and untouched here.

Issue Conformance
Still no linked issue.

Previous Review Status

Monorepo / Cross-Package Notes
sync-service-internal only; no HTTP-contract or TypeScript-client impact.

Review iteration: 9 | 2026-06-08
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #4504 +/- ##
===========================================
+ Coverage 32.48% 56.46% +23.97%
===========================================
Files 216 358 +142
Lines 18368 39081 +20713
Branches 6478 10978 +4500
===========================================
+ Hits 5967 22066 +16099
- Misses 12369 16944 +4575
- Partials 32 71 +39
✅ Deploy Preview for electric-next ready!
Add a pg_txn.fragments_wall_duration_µs attribute to the pg_txn.replication_client.transaction_received span, set on the commit fragment. It measures the wall-clock time from a transaction's begin to its commit as received from Postgres.

Because the replication stream is consumed on demand (e.g. paused while database connections are scaled down), this includes idle gaps between fragments and can be far larger than the per-fragment processing time. It is the signal for transactions whose fragments span a shape consumer's suspend threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ll-time
tx_started_at_mono is set together with txn_fragment on Begin, so a
commit always has it; a begin-less commit would raise on the
`%{fragment | commit}` map-update regardless. Compute the duration
inline at commit and read it directly in the ShapeLogCollector,
removing the defensive nil branches and the misleading "nil after
reconnect" comment.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gment processing

Carry the begin monotonic time on the commit and compute the wall-clock duration in the ShapeLogCollector after the commit fragment is processed, so processing time is included. Mirrors the existing receive_lag pattern (stored mono time + delta computed later, reported in ms).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
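The "stored mono time + delta computed later" pattern described in this commit can be sketched roughly as follows. This is a minimal illustration, not the sync-service code: the module `TxnTimingSketch`, its nested `Commit` struct, and all three function names are hypothetical. Only the `tx_started_at` field name comes from the commit messages.

```elixir
defmodule TxnTimingSketch do
  # Hypothetical stand-in for a commit struct; only the field named in the
  # commit messages is modelled here.
  defmodule Commit do
    defstruct [:tx_started_at]
  end

  # On Begin: capture the monotonic clock (in native time units).
  def on_begin, do: System.monotonic_time()

  # On Commit: carry the begin timestamp forward; nothing is computed yet.
  def on_commit(tx_started_at), do: %Commit{tx_started_at: tx_started_at}

  # After the commit fragment has been processed: take the delta against the
  # stored begin time and convert it from native units to milliseconds.
  def total_processing_time_ms(%Commit{tx_started_at: started}, now \\ System.monotonic_time()) do
    System.convert_time_unit(now - started, :native, :millisecond)
  end
end

# Usage: stamp on begin, carry on commit, compute once processing is done.
commit = TxnTimingSketch.on_begin() |> TxnTimingSketch.on_commit()
IO.inspect(TxnTimingSketch.total_processing_time_ms(commit), label: "total_processing_time (ms)")
```

Deferring the subtraction until after the commit fragment is handled is what lets the measured duration include that fragment's own processing time.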
…xtures

Changes.Commit is only built by MessageConverter, which always sets tx_started_at on Begin (a begin-less commit raises on the fragment map-update before exiting), so in regular execution the field is always present. Drop the is_integer guard and stamp tx_started_at on synthetic test commits instead.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…lti-shape txn

Drives a single complete transaction that the EventRouter reslices to two shapes, and asserts the total_processing_time span attribute lands exactly once (on the original incoming commit), guarding against future EventRouter changes that might set it per-shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The local transaction/3 helper built a bare %Changes.Commit{}, leaving
tx_started_at nil. ShapeLogCollector computes total_processing_time as
now - tx_started_at on every commit fragment, which raised ArithmeticError
on nil and crashed the collector, failing the restore-latest-offset test.
In production a commit always carries tx_started_at (MessageConverter stamps
it on Begin and copies it onto Commit), so the fix is to honor that invariant
in the test helper, matching Support.TestUtils.txn_fragment/4.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
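The failure mode described in this commit is easy to reproduce in isolation. The sketch below is an illustration only: `NilStartSketch.elapsed/2` is a hypothetical function standing in for the `now - tx_started_at` subtraction, not the actual ShapeLogCollector code.

```elixir
defmodule NilStartSketch do
  # Hypothetical stand-in for the duration computation on a commit fragment.
  def elapsed(now, tx_started_at), do: now - tx_started_at
end

# A commit built without tx_started_at leaves the field nil, and subtracting
# nil from an integer raises ArithmeticError at runtime.
result =
  try do
    NilStartSketch.elapsed(System.monotonic_time(), nil)
  rescue
    e in ArithmeticError -> {:raised, e.__struct__}
  end

IO.inspect(result)
```

In a long-lived process, an unrescued error like this terminates the process, which matches the collector crash described above.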
67243d1 to
6f54a60
…ragment/3

Both shape_cache_test and shape_log_collector_test carried their own copy of a local transaction/3 helper that built a complete single-fragment transaction. The two copies had already drifted: one stamped tx_started_at on the commit, the other didn't, which is what produced the ArithmeticError in shape_cache_test once ShapeLogCollector started computing total_processing_time.

Replace both with the existing shared Support.TestUtils.complete_txn_fragment/3 (already used throughout consumer_test). To support the callers that build an empty transaction, txn_fragment/4 now derives last_log_offset from the lsn when there are no changes instead of pattern-matching a non-empty list. This removes the duplication that allowed the helpers to fall out of sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Assert that the pg_txn.replication_client.transaction_received span carries the new total_processing_time attribute end-to-end through the OTLP exporter. This span is the only one gated by ELECTRIC_OTEL_SAMPLING_RATIO (1% by default), so the setup env now forces full sampling (1.0) to make the span deterministically exported; otherwise the assertion would almost never match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This PR has been released! 🚀
Summary
Adds a `total_processing_time` attribute to the `pg_txn.replication_client.transaction_received` span, set on the commit fragment. It records the wall-clock time taken to process all fragments of a single transaction, from when the begin was received to when the commit fragment finishes processing.

Today our spans only measure per-fragment processing time (~ms). They can't tell us how far a transaction's fragments are spread out in wall-clock terms, which is the quantity that determines whether a shape consumer can idle past its suspend threshold mid-transaction (see #4501 / #4503).
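To make the gap between the two measurements concrete, here is a small sketch with made-up numbers. The module `FragmentSmearSketch`, the `{received_at_ms, processing_ms}` tuple layout, and the timings are all hypothetical and chosen only for illustration.

```elixir
defmodule FragmentSmearSketch do
  # fragments: a non-empty list of {received_at_ms, processing_ms} tuples,
  # ordered from the begin fragment to the commit fragment.
  def summarize(fragments) do
    {first_at, _} = hd(fragments)
    {last_at, last_processing} = List.last(fragments)

    %{
      # What per-fragment spans add up to: active processing only.
      per_fragment_total_ms: fragments |> Enum.map(&elem(&1, 1)) |> Enum.sum(),
      # Begin received -> commit fragment finished, idle gaps included.
      total_processing_time_ms: last_at + last_processing - first_at
    }
  end
end

# Three fragments, each processed in a few ms, but delivered over ~90 s.
summary = FragmentSmearSketch.summarize([{0, 2}, {30_000, 3}, {90_000, 1}])
IO.inspect(summary)
# per_fragment_total_ms: 6, total_processing_time_ms: 90001
```

With these numbers the per-fragment spans sum to 6 ms while the transaction as a whole took about 90 s, which is the kind of spread a per-fragment measurement alone would hide.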
Unlike `receive_lag`, which is anchored on the Postgres commit timestamp and measures end-to-end delivery lag (from when Postgres committed the transaction to when Electric finished processing it), `total_processing_time` is anchored entirely within Electric: it spans receipt of the begin fragment to completion of the commit fragment.

🤖 Generated with Claude Code