Fix Materializer startup race condition via offset coordination #3794
Conversation
The Materializer subscribed to the Consumer BEFORE reading from storage, creating a window where the same record could be delivered twice (via storage read AND via `new_changes` message). This caused "Key already exists" crashes in production. The fix: Consumer returns its current offset on subscription, and Materializer reads storage only up to that offset. Changes after that offset are delivered via `new_changes`, ensuring no duplicates.

Fixes #3787

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
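A minimal sketch of the coordination, assuming hypothetical `Consumer.subscribe/2` and `Storage.read_up_to/2` helpers that stand in for the real Electric APIs:

```elixir
# Illustrative only: names and shapes below are assumptions, not the actual
# Materializer implementation.
def init(opts) do
  # Subscribing returns the consumer's current offset, so the Materializer
  # knows exactly where its storage read must stop.
  {:ok, offset} = Consumer.subscribe(opts.consumer, self())

  # Read only records at or before that offset; anything newer arrives as a
  # new_changes message instead, so each change is seen exactly once.
  materialized =
    opts.storage
    |> Storage.read_up_to(offset)
    |> Map.new()

  {:ok, %{data: materialized, offset: offset}}
end
```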
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #3794 +/- ##
===========================================
+ Coverage 75.75% 87.34% +11.58%
===========================================
Files 11 23 +12
Lines 693 2039 +1346
Branches 172 542 +370
===========================================
+ Hits 525 1781 +1256
- Misses 167 256 +89
- Partials 1 2 +1
icehaunter left a comment
Love it
This PR has been released! 🚀 The following packages include changes from this PR:
Thanks for contributing to Electric!
alco left a comment
I'm reviewing this PR after it has already been merged. I wanted to just see the fix for the bug, but noticed issues in the test code. I think it's critical that we make changes to our process or rely on human judgement to prevent occasional agentic sloppiness from taking root in our codebase.
Storage.mark_snapshot_as_started(storage)

# Write a record to storage at LogOffset.first()
alias Electric.Replication.LogOffset
`alias` should never be interspersed with regular code inside execution units such as functions/tests. Putting it at the top of the module is best. In the worst case, it should be at the top of the `describe` block.
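For illustration, a sketch of the preferred placement (the module and test names here are hypothetical, not the actual test file):

```elixir
defmodule Electric.Shapes.MaterializerTest do
  use ExUnit.Case, async: true

  # Aliases live here, next to use/import, so every test below can refer to
  # LogOffset without restating the full module path.
  alias Electric.Replication.LogOffset

  test "writes a record to storage at the first log offset" do
    offset = LogOffset.first()
    # ... exercise the storage write with `offset` ...
    refute is_nil(offset)
  end
end
```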
test_pid = self()

consumer =
  spawn(fn ->
Such ad-hoc mocking is really difficult to parse for a human. This one-off process encodes a specific sequence of messages, relying on the underlying implementation of GenServer calls. It has no relation to the high-level code inside consumer.ex, and there's no way to ensure that it stays in sync with that implementation. I'm not sure why this approach was even considered for this particular test.
To highlight a more general problem:
I've noticed this module has a few instances of such receive expressions. At the same time, there's the respond_to_call() local function that abstracts over some instances of receive.
We should not allow such slip-ups, because they will infect the test code and will be difficult to remove after the fact. A good old test-only module or a mocked behaviour is a better way to write custom functionality for a specific component in the test env.
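A sketch of what that could look like here (`MockConsumer` and the `{:subscribe, pid}` call shape are assumptions, not the real consumer.ex contract):

```elixir
# Test-only stand-in for the Consumer. It speaks the ordinary GenServer call
# interface, so tests no longer hand-roll $gen_call handling in a spawned fn.
defmodule Support.MockConsumer do
  use GenServer

  def start_link(offset), do: GenServer.start_link(__MODULE__, offset)

  @impl true
  def init(offset), do: {:ok, offset}

  # Subscribing replies with whatever offset the test configured, mirroring
  # the offset-coordination contract the Materializer relies on.
  @impl true
  def handle_call({:subscribe, _subscriber}, _from, offset) do
    {:reply, {:ok, offset}, offset}
  end
end
```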
MapSet.new([10])

{:error, reason} ->
  flunk("Materializer failed to start: #{inspect(reason)}")
I've noticed Claude's propensity to litter explicit `flunk` calls in the test code it generates. It's kind of odd to have that in this case; a simple `{:ok, pid} = Materializer.start_link(...)` would have been perfectly fine.
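For comparison, a sketch (`opts` is a placeholder for whatever the test passes):

```elixir
# Generated version: wraps the result and flunks manually on error.
case Materializer.start_link(opts) do
  {:ok, pid} -> pid
  {:error, reason} -> flunk("Materializer failed to start: #{inspect(reason)}")
end

# A bare match fails the test with a MatchError that already carries the
# reason, with no extra ceremony.
{:ok, _pid} = Materializer.start_link(opts)
```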
# We use a mocked Consumer that returns an offset, and the Materializer
# should only read storage up to that offset, preventing duplicates.

Process.flag(:trap_exit, true)
This is a smell when there's no good reason for making the test process trap exits, or when there is a reason but no explanation of it in the code.
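If trapping exits genuinely is needed, a one-line comment stating the reason removes the smell; a sketch under that assumption (the assertion shape is illustrative):

```elixir
# Trap exits so the test can assert on the Materializer's exit reason instead
# of the test process being taken down with it.
Process.flag(:trap_exit, true)

{:ok, pid} = Materializer.start_link(opts)
Process.exit(pid, :kill)
assert_receive {:EXIT, ^pid, :killed}
```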
That's valuable feedback. We want to maintain a high quality bar in our code. You mention specific code quality patterns, and the agent must know we want to follow these patterns. My suggestion is that we turn this into code quality guidelines that the agent/the reviewer agent must check whenever we create a PR. Why not create that file, or edit whatever coding style guidance we already have, with your observations? Every time we find something, we record it.
@balegas 100% agree. But I wanted to flag this up as the responsibility of the author of the PR. We're still a ways off from a workflow where all code can be written and reviewed by agents alone. Until then, we humans must remain in the loop and gatekeep any nonsense that agents try to add to the codebase. I think it's only fair for the human operator who delegates code generation to agents to make the necessary adjustments to the codified rules that we maintain to ensure code quality. Conversely, it wouldn't have been fair if I opened some PRs and expected the people reviewing them to notice code that doesn't pass our quality bar and then fix our rules for agents to try to prevent this from happening again. The code already committed to Git must be fixed too, in addition to updating the rules for agents.
Developer and reviewer both, plus making sure that our tooling does the work for us. Let's start with setting up the tooling.
## Summary

Addresses code quality feedback from @alco's [review on #3794](#3794):

- **Move `alias` to module top**: `alias Electric.Replication.LogOffset` was interspersed in the test body. Moved to module-level aliases and replaced all full-path references.
- **Replace ad-hoc mock consumer**: The spawned process with raw `$gen_call` message handling was difficult to parse and had no relation to the high-level consumer code. Replaced with the existing `respond_to_call()` helper pattern.
- **Remove unnecessary `flunk`**: Replaced `case`/`flunk` with a simple pattern match `{:ok, _pid} = Materializer.start_link(...)`.
- **Remove unexplained `Process.flag(:trap_exit, true)`**: No longer needed after the refactor since the test expects success.

## Test plan

- [x] All 35 materializer tests pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary

Fixes #3787 - addresses the snapshot/replication race condition in the Materializer that caused "Key already exists" crashes.

## The race condition

The Materializer subscribed to the Consumer before reading from storage.
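Roughly, with illustrative names standing in for the actual module code:

```elixir
# Buggy ordering: subscribing first opens a window in which new_changes
# messages can duplicate records that the storage read is about to return.
def init(opts) do
  :ok = Consumer.subscribe(opts.consumer, self())  # 1. subscribe first...
  snapshot = Storage.read_all(opts.storage)        # 2. ...then read storage

  # Any change committed between steps 1 and 2 ends up in `snapshot` AND
  # arrives again as a new_changes message -> "Key already exists".
  {:ok, build_state(snapshot)}
end
```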
During the window between subscribing and reading, any changes that arrive via `Consumer.new_changes()` would be delivered to the Materializer. If those changes included records that were also in the snapshot being read, the Materializer received duplicates and crashed.

### Production example (maxwell instance, 27 Jan 2026)
The transaction that triggered it:

The record was already in the snapshot (matched via `is_template = true` OR the subquery), AND was delivered via replication with `move_tags` as a move-in event.

## The fix
Consumer now returns its current offset when a Materializer subscribes. The Materializer reads storage only up to that offset, not beyond. Changes after that offset will be delivered via `new_changes` messages, ensuring each change is delivered exactly once.

## Test plan
🤖 Generated with Claude Code