Ensure all producers from other workers have registered during recovery #3057
This adds two new phases to the recovery protocol to guarantee that all Producers on other workers have registered with a recovering worker before we proceed to initiate the rollback barrier. Fixes #3018.
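As a rough illustration of the guarantee described above, the following Python sketch models a recovering worker that refuses to initiate the rollback barrier until every other worker has reported that its Producers are registered. The PR text does not name the two new phases or their messages, so the class `ProducerRegistrationGate` and its methods are hypothetical and are not taken from the Wallaroo code base.

```python
class ProducerRegistrationGate:
    """Hypothetical gate: blocks the rollback barrier until all other
    workers have confirmed that their Producers are registered."""

    def __init__(self, other_workers):
        # Workers we are still waiting on (assumed to be known up front).
        self.pending = set(other_workers)
        self.rollback_initiated = False

    def producers_registered(self, worker):
        # Called when `worker` reports that all of its Producers have
        # registered with the recovering worker (assumed message).
        self.pending.discard(worker)

    def try_initiate_rollback(self):
        # Only proceed once nothing is pending.
        if not self.pending:
            self.rollback_initiated = True
        return self.rollback_initiated


gate = ProducerRegistrationGate(["worker1", "worker2"])
gate.producers_registered("worker1")
early = gate.try_initiate_rollback()   # still waiting on worker2
gate.producers_registered("worker2")
late = gate.try_initiate_rollback()    # all workers have reported
```

Here `early` is false and `late` is true: the barrier is held back until the last worker has reported.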
Run via: `./master-crasher.sh 2 run_custom_tcp_crash0`. Note that `initializer` and `worker` give very different answers to the cluster status query.

Output:

```
WARNING: all useful state files are deleted by this script!
Worker initializer: port = 7103
Worker worker1: port = 7113
Success
RUN: run_custom_tcp_crash0
Done, yay ... waiting
,,,c0s0,,,,,
Query initializer
Connected...
Cluster Status:
Cluster not yet initialized
Processing messages: false
Query worker1
Connected...
Cluster Status:
Processing messages: true, Worker count: 2, Workers: |initializer,worker1,|
Query initializer again
Connected...
Cluster Status:
Cluster not yet initialized
Processing messages: false
```
I've added commit 4bc7b87, which shows what happens when the initializer worker crashes and restarts a single time in a 2-worker cluster. (Full output logs are at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1573679554.tar.gz)
Initializer says:
Cluster Status:
Cluster not yet initialized
Processing messages: false
Worker1 says:
Cluster Status:
Processing messages: true, Worker count: 2, Workers: |initializer,worker1,|
Initializer's restart does not progress beyond the point of "Reconnect Phase 2: Wait for Reconnections".
We shouldn't register BarrierSource as a source with RouterRegistry because it is only a source for the purposes of the barrier protocol. As a result, unlike other sources, it does not use its own OutgoingBoundaries, but instead uses the canonical ones. This is because it only rarely sends messages (those related to barriers and to register/unregister itself as a Producer).
Even if we have not written any bytes to disk for a step, we might have acquired keys after the last complete checkpoint. These need to be cleared during rollback.
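A minimal Python sketch of the point above, assuming a simplified step that tracks its keys and the number of bytes it has written to disk. The names `StepState`, `acquire_key`, `checkpoint`, and `rollback` are hypothetical and only illustrate why rollback must not be skipped when nothing has been written.

```python
class StepState:
    """Hypothetical step state: keys held now vs. keys at last checkpoint."""

    def __init__(self):
        self.checkpointed_keys = set()
        self.live_keys = set()
        self.bytes_on_disk = 0

    def acquire_key(self, key):
        # Acquiring a key does not necessarily write anything to disk.
        self.live_keys.add(key)

    def checkpoint(self):
        # Record the key set as of this complete checkpoint.
        self.checkpointed_keys = set(self.live_keys)

    def rollback(self):
        # Do NOT short-circuit on bytes_on_disk == 0: keys acquired after
        # the last complete checkpoint must still be cleared.
        self.live_keys = set(self.checkpointed_keys)


step = StepState()
step.acquire_key("a")
step.checkpoint()
step.acquire_key("b")   # acquired after the checkpoint, nothing on disk
step.rollback()
```

After the rollback, `step.live_keys` contains only `"a"`, even though `bytes_on_disk` stayed at zero throughout.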
We currently support two reasons for rollback: (1) crash recovery and (2) checkpoint abort. When two workers concurrently initiated rollback, we previously used only the rollback id to settle priority. However, crash recovery should always trump checkpoint abort, since the former must guarantee stronger conditions before rollback can commence (e.g. boundary reconnect and producer registration). With this commit, we use the rollback reason to help determine priority.
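One way to picture the priority rule is as a tuple comparison in which the reason is compared first and the rollback id only breaks ties. The Python sketch below is illustrative: the names `RollbackReason`, `RollbackRequest`, and `winner` are hypothetical, and the choice that a higher rollback id wins among requests with the same reason is an assumption, not something the commit message states.

```python
from dataclasses import dataclass
from enum import IntEnum


class RollbackReason(IntEnum):
    # Larger value = higher priority; crash recovery trumps checkpoint abort.
    CHECKPOINT_ABORT = 1
    CRASH_RECOVERY = 2


@dataclass(frozen=True)
class RollbackRequest:
    worker: str
    rollback_id: int
    reason: RollbackReason

    def priority(self):
        # Reason is compared first; rollback id is only a tie-breaker
        # (assumption: the higher id wins a tie).
        return (self.reason, self.rollback_id)


def winner(a, b):
    """Return whichever of two concurrent rollback requests takes priority."""
    return a if a.priority() >= b.priority() else b


abort = RollbackRequest("worker1", rollback_id=9, reason=RollbackReason.CHECKPOINT_ABORT)
recovery = RollbackRequest("initializer", rollback_id=3, reason=RollbackReason.CRASH_RECOVERY)
chosen = winner(abort, recovery)
```

In this example `chosen` is the crash-recovery request even though its rollback id is lower, which is the behavior an id-only comparison would get wrong.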