
Ensure all producers from other workers have registered during recovery #3057

Merged: 9 commits, Nov 18, 2019

Conversation

@jtfmumm (Contributor) commented Nov 13, 2019:

This adds two new phases to the recovery protocol to guarantee that
all Producers on other workers have registered with a recovering
worker before we proceed to initiate the rollback barrier.

Fixes #3018.
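To make the idea concrete, here is a rough sketch in Python (not the project's actual Pony code; the class and method names are hypothetical) of what the new phases amount to: the recovering worker waits for every other worker to acknowledge that its Producers have registered before initiating the rollback barrier.

```
# Hypothetical sketch: track which other workers still owe a
# "producers registered" acknowledgment, and only initiate the
# rollback barrier once every worker has acknowledged.
class ProducerRegistrationPhase:
    def __init__(self, other_workers):
        self._pending = set(other_workers)

    def ack_producers_registered(self, worker):
        # Called when `worker` reports that all of its Producers have
        # registered with the recovering worker.
        self._pending.discard(worker)
        if not self._pending:
            self._initiate_rollback_barrier()

    def _initiate_rollback_barrier(self):
        print("all producers registered; initiating rollback barrier")


# Example with a two-worker cluster where `initializer` is recovering:
phase = ProducerRegistrationPhase({"worker1"})
phase.ack_producers_registered("worker1")
```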

Run via: `./master-crasher.sh 2 run_custom_tcp_crash0`.

Note that `initializer` and `worker1` give very different answers to the cluster status query. Output:

```
WARNING: all useful state files are deleted by this script!
Worker initializer: port = 7103
Worker worker1: port = 7113
Success
RUN: run_custom_tcp_crash0
Done, yay ... waiting
,,,c0s0,,,,,

Query initializer

Connected...
Cluster Status:
Cluster not yet initialized
Processing messages: false

Query worker1

Connected...
Cluster Status:
Processing messages: true, Worker count: 2, Workers: |initializer,worker1,|

Query initializer again

Connected...
Cluster Status:
Cluster not yet initialized
Processing messages: false
```
@slfritchie (Contributor) left a comment:

I've added commit 4bc7b87 that shows what happens when the initializer worker crashes & restarts a single time in a 2-worker cluster. (Full output logs are at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1573679554.tar.gz)

Initializer says:

```
Cluster Status:
Cluster not yet initialized
Processing messages: false
```

Worker1 says:

```
Cluster Status:
Processing messages: true, Worker count: 2, Workers: |initializer,worker1,|
```

Initializer's restart does not progress beyond the point of Reconnect Phase 2: Wait for Reconnections.

We shouldn't register BarrierSource as a source with RouterRegistry
because it is only a source for the purposes of the barrier
protocol. As a result, unlike other sources, it does not use its
own OutgoingBoundaries, but instead uses the canonical ones. This is
because it only rarely sends messages (those related to barriers
and to register/unregister itself as a Producer).
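As a rough illustration (Python with hypothetical names, not Wallaroo's actual API), the distinction is that ordinary sources are registered with the registry and own their boundaries, while the barrier source stays out of the registry and reuses the canonical boundaries:

```
# Hypothetical sketch: ordinary sources are registered; the barrier
# source is not, and it sends through the worker's canonical boundaries.
class RouterRegistry:
    def __init__(self):
        self.sources = {}

    def register_source(self, source_id, source):
        self.sources[source_id] = source


class BarrierSource:
    def __init__(self, canonical_boundaries):
        # Reuse the worker's canonical boundaries rather than creating
        # our own; we only send barrier and (un)registration messages.
        self._boundaries = canonical_boundaries

    def send_barrier(self, token):
        for worker, boundary in self._boundaries.items():
            boundary.append((worker, token))


registry = RouterRegistry()
boundaries = {"worker1": []}
barrier_source = BarrierSource(boundaries)
# Intentionally NOT calling registry.register_source() for barrier_source.
barrier_source.send_barrier("rollback-token-1")
```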
Even if we have not written any bytes to disk for a step, we might
have acquired keys after the last complete checkpoint. These need
to be cleared during rollback.
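A minimal sketch of the idea (Python with hypothetical names, not the actual step implementation): keys acquired after the last complete checkpoint are cleared on rollback even when no state bytes were written for the step.

```
# Hypothetical sketch: rollback drops keys acquired after the last
# complete checkpoint, regardless of whether state bytes were written.
class StepKeys:
    def __init__(self):
        self.checkpointed_keys = set()
        self.acquired_since_checkpoint = set()

    def acquire_key(self, key):
        self.acquired_since_checkpoint.add(key)

    def rollback(self):
        # Even with no bytes on disk for this step, these keys must go.
        self.acquired_since_checkpoint.clear()
        return set(self.checkpointed_keys)


keys = StepKeys()
keys.acquire_key("key-acquired-after-checkpoint")
assert keys.rollback() == set()
```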
We currently support two reasons for rollback: (1) crash recovery and (2) checkpoint abort. When two workers concurrently initiated rollback, we previously used only the rollback id to settle priority. However, crash recovery should always trump checkpoint abort, since the former must guarantee stronger conditions before rollback can commence (e.g. boundary reconnect and producer registration). With this commit, we use the rollback reason to help determine priority.
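Roughly, the priority rule can be thought of as comparing (reason, rollback id) tuples instead of rollback ids alone; the sketch below is illustrative Python with hypothetical names, not the project's code:

```
# Hypothetical sketch: settle priority between concurrent rollback
# requests by comparing (reason, rollback_id), so crash recovery
# always outranks a checkpoint abort.
from enum import IntEnum

class RollbackReason(IntEnum):
    ABORT_CHECKPOINT = 0
    CRASH_RECOVERY = 1   # higher value wins

def higher_priority(a, b):
    """a and b are (reason, rollback_id) tuples; reason is compared first."""
    return a if a >= b else b

crash_recovery = (RollbackReason.CRASH_RECOVERY, 7)
abort_checkpoint = (RollbackReason.ABORT_CHECKPOINT, 9)
assert higher_priority(crash_recovery, abort_checkpoint) == crash_recovery
```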
Successfully merging this pull request may close these issues.

CheckpointRollbackResumeBarrierToken(x) is greater than current barrier CheckpointRollbackBarrierToken(x)
2 participants