Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Liveness: Replicas crash on receiving Command.reply messages #9

Closed
ThreeFx opened this issue Sep 15, 2021 · 3 comments
Closed

Liveness: Replicas crash on receiving Command.reply messages #9

ThreeFx opened this issue Sep 15, 2021 · 3 comments

Comments

@ThreeFx
Copy link
Contributor

ThreeFx commented Sep 15, 2021

Description and Impact

A replica receiving a Command.reply triggers an unreachable statement and crashes the replica. Replicas must be able to deal with this according to the "messages may be misrouted" specification in the network fault description.

// in: src/vsr/replica.zig

/// Called by the MessageBus to deliver a message to the replica.
pub fn on_message(self: *Self, message: *Message) void {
    // [snip]
    assert(message.header.replica < self.replica_count);
    switch (message.header.command) {
        .ping => self.on_ping(message),
        .pong => self.on_pong(message),
        .request => self.on_request(message),
        .prepare => self.on_prepare(message),
        .prepare_ok => self.on_prepare_ok(message),
        .commit => self.on_commit(message),
        .start_view_change => self.on_start_view_change(message),
        .do_view_change => self.on_do_view_change(message),
        .start_view => self.on_start_view(message),
        .recovery => self.on_recovery(message),
        .request_start_view => self.on_request_start_view(message),
        .request_prepare => self.on_request_prepare(message),
        .request_headers => self.on_request_headers(message),
        .headers => self.on_headers(message),
        .nack_prepare => self.on_nack_prepare(message),
        // command == Command.reply lands here
        else => unreachable,
    }
    // [snip]
}

Steps to Reproduce the Bug

  1. Apply patch
  2. Run ./vopr.sh 2669933095950905174
    • crashes after 18th transition, stack trace:
[debug] (packet_simulator): misdelivering packet from=0 orig_to=2 new_to=0
[debug] (network): deliver_message: Process{ .replica = 0 } > Process{ .replica = 0 }: Command.reply
[debug] (replica): 0: on_message: view=0 status=normal Header{ .checksum = 226925266819502557119179983641806389019, .checksum_body = 298413714882050206950453565564009099268, .parent = 162187561933345702967133189502892135902, .client = 205787606504586803876943260068027083215, .context = 0, .request = 11, .cluster = 0, .epoch = 0, .view = 0, .op = 20, .commit = 20, .offset = 0, .size = 144, .replica = 0, .command = Command.reply, .operation = Operation(3), .version = 0 }
thread 9741 panic: reached unreachable code
/home/bfiedler/projects/tigerbeetle-challenge/src/vsr/replica.zig:459:25: 0x26b7b0 in vsr.replica.Replica(test.state_machine.StateMachine,test.message_bus.MessageBus,test.storage.Storage,test.time.Time).on_message (simulator)
                else => unreachable,
                        ^
/home/bfiedler/projects/tigerbeetle-challenge/src/test/message_bus.zig:59:27: 0x26f9ee in test.message_bus.struct:57:35.wrapper (simulator)
                on_message(@intToPtr(Context, @ptrToInt(_context)), message);
                          ^
/home/bfiedler/projects/tigerbeetle-challenge/src/test/network.zig:170:41: 0x2a6a53 in test.network.Network.deliver_message (simulator)
        target_bus.on_message_callback.?(target_bus.on_message_context, message);
                                        ^
/home/bfiedler/projects/tigerbeetle-challenge/src/test/packet_simulator.zig:185:42: 0x259fa3 in test.packet_simulator.PacketSimulator(test.network.Packet).tick (simulator)
                            data.callback(data.packet, new_path);
                                         ^
/home/bfiedler/projects/tigerbeetle-challenge/src/simulator.zig:168:46: 0x24fe18 in main (simulator)
        cluster.network.packet_simulator.tick();
                                             ^
/usr/lib/zig/std/start.zig:458:37: 0x247b8a in std.start.callMain (simulator)
            const result = root.main() catch |err| {
                                    ^
/usr/lib/zig/std/start.zig:400:12: 0x22b65e in std.start.callMainWithArgs (simulator)
    return @call(.{ .modifier = .always_inline }, callMain, .{});
           ^
/usr/lib/zig/std/start.zig:319:17: 0x22a665 in std.start.posixCallMainAndExit (simulator)
    std.os.exit(@call(.{ .modifier = .always_inline }, callMainWithArgs, .{ argc, argv, envp }));
                ^
/usr/lib/zig/std/start.zig:244:5: 0x22a3c2 in std.start._start (simulator)
    @call(.{ .modifier = .never_inline }, posixCallMainAndExit, .{});
    ^
./vopr.sh: line 23:  9741 Aborted                 (core dumped) zig run src/simulator.zig $BUILD_MODE -- $1

diff --git a/src/simulator.zig b/src/simulator.zig
index 89a7dca..8dde542 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -81,6 +81,7 @@ pub fn main() !void {
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
                 .path_clog_probability = prng.random.uintLessThan(u8, 2),
                 .packet_replay_probability = prng.random.uintLessThan(u8, 50),
+                .packet_misdeliver_probability = prng.random.uintLessThan(u8, 10),
             },
         },
         .storage_options = .{
@@ -118,6 +119,7 @@ pub fn main() !void {
         \\          path_clog_duration_mean={} ticks
         \\          path_clog_probability={}%
         \\          packet_replay_probability={}%
+        \\          packet_misdeliver_probability={}%
         \\          read_latency_min={}
         \\          read_latency_mean={}
         \\          write_latency_min={}
@@ -141,6 +143,7 @@ pub fn main() !void {
         cluster.options.network_options.packet_simulator_options.path_clog_duration_mean,
         cluster.options.network_options.packet_simulator_options.path_clog_probability,
         cluster.options.network_options.packet_simulator_options.packet_replay_probability,
+        cluster.options.network_options.packet_simulator_options.packet_misdeliver_probability,
 
         cluster.options.storage_options.read_latency_min,
         cluster.options.storage_options.read_latency_mean,
diff --git a/src/test/packet_simulator.zig b/src/test/packet_simulator.zig
index 0ec2bc2..a7e2f1c 100644
--- a/src/test/packet_simulator.zig
+++ b/src/test/packet_simulator.zig
@@ -11,6 +11,7 @@ pub const PacketSimulatorOptions = struct {
 
     packet_loss_probability: u8,
     packet_replay_probability: u8,
+    packet_misdeliver_probability: u8,
     seed: u64,
     node_count: u8,
 
@@ -134,6 +135,10 @@ pub fn PacketSimulator(comptime Packet: type) type {
             return self.prng.random.uintAtMost(u8, 100) < self.options.packet_replay_probability;
         }
 
+        fn should_misdeliver(self: *Self) bool {
+            return self.prng.random.uintAtMost(u8, 100) < self.options.packet_misdeliver_probability;
+        }
+
         /// Return a value produced using an exponential distribution with
         /// the minimum and mean specified in self.options
         fn one_way_delay(self: *Self) u64 {
@@ -162,15 +167,25 @@ pub fn PacketSimulator(comptime Packet: type) type {
                             log.alert("dropped packet from={} to={}.", .{ from, to });
                             data.packet.deinit(path);
                             continue;
-                        }
-
-                        if (self.should_replay()) {
+                        } else if (self.should_replay()) {
                             self.submit_packet(data.packet, data.callback, path);
 
                             log.debug("replayed packet from={} to={}", .{ from, to });
                             self.stats[@enumToInt(PacketStatistics.replay)] += 1;
 
                             data.callback(data.packet, path);
+                        } else if (self.options.node_count > 1 and self.should_misdeliver()) {
+                            const orig_to = path.target;
+                            var new_to = self.prng.random.uintLessThan(u8, self.options.node_count - 1);
+                            if (new_to >= orig_to) new_to += 1;
+                            assert(new_to < self.options.node_count and new_to != orig_to);
+
+                            const new_path = Path{ .source = path.source, .target = new_to };
+                            log.debug("misdelivering packet from={} orig_to={} new_to={}", .{ from, orig_to, new_to });
+                            data.callback(data.packet, new_path);
+
+                            // only source matters here
+                            data.packet.deinit(path);
                         } else {
                             log.debug("delivering packet from={} to={}", .{ from, to });
                             data.callback(data.packet, path);

Suggested Fix

Either explicitly ignore .reply commands or ignore all unknown messages in on_message.

// in: src/vsr/replica.zig

/// Called by the MessageBus to deliver a message to the replica.
pub fn on_message(self: *Self, message: *Message) void {
    // [snip]
    assert(message.header.replica < self.replica_count);
    switch (message.header.command) {
        .ping => self.on_ping(message),
        .pong => self.on_pong(message),
        .request => self.on_request(message),
        .prepare => self.on_prepare(message),
        .prepare_ok => self.on_prepare_ok(message),
        .commit => self.on_commit(message),
        .start_view_change => self.on_start_view_change(message),
        .do_view_change => self.on_do_view_change(message),
        .start_view => self.on_start_view(message),
        .recovery => self.on_recovery(message),
        .request_start_view => self.on_request_start_view(message),
        .request_prepare => self.on_request_prepare(message),
        .request_headers => self.on_request_headers(message),
        .headers => self.on_headers(message),
        .nack_prepare => self.on_nack_prepare(message),
        // 1. this
        .reply => // log and ignore
        else => unreachable,
        // or, 2. this
        else => // log and ignore,
    }
    // [snip]
}

The Story Behind the Bug

Misdelivering packets was the next thing on my todo-list :) I was surprised that I triggered a bug this quickly.

Songs Enjoyed During the Production of This Issue

Liquicity Yearmix 2020

Another recommendation: "EYE HAVE YOU" from Maretu. Sounds like it could be from Undertale/an Undertale mashup.

Literature

No response

The Last Word

No response

@ThreeFx
Copy link
Contributor Author

ThreeFx commented Sep 15, 2021

Ignoring .reply messages makes the seed complete successfully:

diff --git a/src/vsr/replica.zig b/src/vsr/replica.zig
index b3d4b42..33fa0c4 100644
--- a/src/vsr/replica.zig
+++ b/src/vsr/replica.zig
@@ -456,6 +456,7 @@ pub fn Replica(
                 .request_headers => self.on_request_headers(message),
                 .headers => self.on_headers(message),
                 .nack_prepare => self.on_nack_prepare(message),
+                .reply => log.debug("got a mis-routed reply message, ignoring", .{}),
                 else => unreachable,
             }

@ThreeFx ThreeFx changed the title Liveness: Replicas crash on reflected Command.reply messages Liveness: Replicas crash on receiving Command.reply messages Sep 15, 2021
jorangreef added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 15, 2021
We now also ignore unexpected commands in the Replica's
on_message() handler as we were doing for the Client.

Reported-by: @ThreeFx
Refs: tigerbeetle/viewstamped-replication-made-famous#9
jorangreef added a commit that referenced this issue Sep 15, 2021
We now also ignore unexpected commands in the Replica's
on_message() handler as we were doing for the Client,
and we use an exhaustive switch, which would also have
been better in the first place.

Reported-by: @ThreeFx
Refs: #9
@jorangreef
Copy link
Member

Congrats @ThreeFx on finding hopefully the last of the misdirected packets!

Loving the quality of your reports, and enjoying the track recommendations. Keep it coming!

Just pushed a fix to use an exhaustive switch which would have caught this at compile time instead of at runtime. Please would you verify that the fix doesn't introduce any further issues?

We have decided to award a $400 liveness bounty.

@ThreeFx
Copy link
Contributor Author

ThreeFx commented Sep 15, 2021

It seems to work fine now, both this seed and a handful of random seeds (using ./vopr.sh) do not crash anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants