
Ensure all producers from other workers have registered during recovery #3057

Merged: 9 commits, Nov 18, 2019

Conversation

@jtfmumm (Contributor) commented Nov 13, 2019:

This adds two new phases to the recovery protocol to guarantee that
all Producers on other workers have registered with a recovering
worker before we proceed to initiate the rollback barrier.

Fixes #3018.
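To make the idea concrete, here is a rough sketch in Python (not the project's actual Pony code; the class and method names are hypothetical) of what the new phases amount to: the recovering worker waits for every other worker to acknowledge that its Producers have registered before initiating the rollback barrier.

```
# Hypothetical sketch: track which other workers still owe a
# "producers registered" acknowledgment, and only initiate the
# rollback barrier once every worker has acknowledged.
class ProducerRegistrationPhase:
    def __init__(self, other_workers):
        self._pending = set(other_workers)

    def ack_producers_registered(self, worker):
        # Called when `worker` reports that all of its Producers have
        # registered with the recovering worker.
        self._pending.discard(worker)
        if not self._pending:
            self._initiate_rollback_barrier()

    def _initiate_rollback_barrier(self):
        print("all producers registered; initiating rollback barrier")


# Example with a two-worker cluster where `initializer` is recovering:
phase = ProducerRegistrationPhase({"worker1"})
phase.ack_producers_registered("worker1")
```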

Run via: `./master-crasher.sh 2 run_custom_tcp_crash0`.

Note that `initializer` and `worker1` give very different answers to the cluster status query. Output:

```
WARNING: all useful state files are deleted by this script!
Worker initializer: port = 7103
Worker worker1: port = 7113
Success
RUN: run_custom_tcp_crash0
Done, yay ... waiting
,,,c0s0,,,,,

Query initializer

Connected...
Cluster Status:
Cluster not yet initialized
Processing messages: false

Query worker1

Connected...
Cluster Status:
Processing messages: true, Worker count: 2, Workers: |initializer,worker1,|

Query initializer again

Connected...
Cluster Status:
Cluster not yet initialized
Processing messages: false
```
@slfritchie (Contributor) left a comment:

I've added commit 4bc7b87 that shows what happens when the initializer worker crashes & restarts a single time in a 2-worker cluster. (Full output logs are at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1573679554.tar.gz)

Initializer says:

```
Cluster Status:
Cluster not yet initialized
Processing messages: false
```

Worker1 says:

```
Cluster Status:
Processing messages: true, Worker count: 2, Workers: |initializer,worker1,|
```

Initializer's restart does not progress beyond the point of Reconnect Phase 2: Wait for Reconnections.

We shouldn't register BarrierSource as a source with RouterRegistry
because it is only a source for the purposes of the barrier
protocol. As a result, unlike other sources, it does not use its
own OutgoingBoundaries, but instead uses the canonical ones. This is
because it only rarely sends messages (those related to barriers
and to register/unregister itself as a Producer).
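As a rough illustration (Python with hypothetical names, not Wallaroo's actual API), the distinction is that ordinary sources are registered with the registry and own their boundaries, while the barrier source stays out of the registry and reuses the canonical boundaries:

```
# Hypothetical sketch: ordinary sources are registered; the barrier
# source is not, and it sends through the worker's canonical boundaries.
class RouterRegistry:
    def __init__(self):
        self.sources = {}

    def register_source(self, source_id, source):
        self.sources[source_id] = source


class BarrierSource:
    def __init__(self, canonical_boundaries):
        # Reuse the worker's canonical boundaries rather than creating
        # our own; we only send barrier and (un)registration messages.
        self._boundaries = canonical_boundaries

    def send_barrier(self, token):
        for worker, boundary in self._boundaries.items():
            boundary.append((worker, token))


registry = RouterRegistry()
boundaries = {"worker1": []}
barrier_source = BarrierSource(boundaries)
# Intentionally NOT calling registry.register_source() for barrier_source.
barrier_source.send_barrier("rollback-token-1")
```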
Even if we have not written any bytes to disk for a step, we might
have acquired keys after the last complete checkpoint. These need
to be cleared during rollback.
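A minimal sketch of the idea (Python with hypothetical names, not the actual step implementation): keys acquired after the last complete checkpoint are cleared on rollback even when no state bytes were written for the step.

```
# Hypothetical sketch: rollback drops keys acquired after the last
# complete checkpoint, regardless of whether state bytes were written.
class StepKeys:
    def __init__(self):
        self.checkpointed_keys = set()
        self.acquired_since_checkpoint = set()

    def acquire_key(self, key):
        self.acquired_since_checkpoint.add(key)

    def rollback(self):
        # Even with no bytes on disk for this step, these keys must go.
        self.acquired_since_checkpoint.clear()
        return set(self.checkpointed_keys)


keys = StepKeys()
keys.acquire_key("key-acquired-after-checkpoint")
assert keys.rollback() == set()
```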
We currently support two reasons for rollback: (1) crash recovery and (2) checkpoint abort. When two workers concurrently initiated rollback, we previously used only the rollback id to settle priority. However, crash recovery should always trump checkpoint abort, since the former must guarantee stronger conditions before rollback can commence (e.g. boundary reconnect and producer registration). With this commit, we use the rollback reason to help determine priority.
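Roughly, the priority rule can be thought of as comparing (reason, rollback id) tuples instead of rollback ids alone; the sketch below is illustrative Python with hypothetical names, not the project's code:

```
# Hypothetical sketch: settle priority between concurrent rollback
# requests by comparing (reason, rollback_id), so crash recovery
# always outranks a checkpoint abort.
from enum import IntEnum

class RollbackReason(IntEnum):
    ABORT_CHECKPOINT = 0
    CRASH_RECOVERY = 1   # higher value wins

def higher_priority(a, b):
    """a and b are (reason, rollback_id) tuples; reason is compared first."""
    return a if a >= b else b

crash_recovery = (RollbackReason.CRASH_RECOVERY, 7)
abort_checkpoint = (RollbackReason.ABORT_CHECKPOINT, 9)
assert higher_priority(crash_recovery, abort_checkpoint) == crash_recovery
```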
Successfully merging this pull request may close these issues.

CheckpointRollbackResumeBarrierToken(x) is greater than current barrier CheckpointRollbackBarrierToken(x)
2 participants