
Liveness: race condition when handling successive requests from two clients leads to deadlock #5

Closed
ThreeFx opened this issue Sep 13, 2021 · 6 comments


@ThreeFx (Contributor) commented Sep 13, 2021

Description and Impact

This bug occurs (only?) with one or two replicas. We assume one replica for all of the following. It seems that for n >= 3 replicas the issue is resolved by leader election following a view expiration.


Successive client requests may lead to a deadlock: the write lock acquired by a replica in `journal.write_prepare` may be held until after the next op should be written, causing that write to fail without being retried. In the one-replica case the replica seemingly never recovers from this error and deadlocks.

// file: src/vsr/journal.zig

pub fn write_prepare(
    self: *Self,
    callback: fn (self: *Replica, wrote: ?*Message, trigger: Write.Trigger) void,
    message: *Message,
    trigger: Self.Write.Trigger,
) void {
    // [snip]
    // right here there's a race condition between two messages
    const write = self.writes.acquire() orelse {
        self.write_prepare_debug(message.header, "no IOP available");
        callback(replica, null, trigger);
        return;
    };
    // [snip]
}
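
For intuition, here is a minimal standalone sketch of the failure mode. The `WritePool` below is a made-up stand-in for the write IOP pool (not TigerBeetle's actual type), with a write depth of one: the second of two back-to-back prepares finds no free IOP and is dropped without a retry.

// sketch only, not part of the repository
const std = @import("std");
const assert = std.debug.assert;

const WritePool = struct {
    in_use: bool = false, // pretend io_depth_write == 1

    fn acquire(self: *WritePool) bool {
        if (self.in_use) return false; // "no IOP available"
        self.in_use = true;
        return true;
    }

    fn release(self: *WritePool) void {
        self.in_use = false;
    }
};

pub fn main() void {
    var pool = WritePool{};
    assert(pool.acquire()); // client i's prepare gets the only write IOP
    assert(!pool.acquire()); // client j's prepare: "no IOP available", dropped, never retried
    pool.release(); // the first write eventually completes...
    assert(pool.acquire()); // ...but by then nothing re-issues client j's write
}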

This leads to a dirty operation clogging the journal. The replica believes that the op is still going to be written: `on_prepare_timeout` reports `0: on_prepare_timeout: waiting for journal` and backs off. In `repair_op(self, op)`, the replica cannot `choose_any_other_replica()`, so the uncommitted, dirty op stays uncommitted and dirty.

At some point the replica gets stuck in a preparing state for each client, and no new requests can be answered, so the system deadlocks. In this exact example there are 14 clients, op 12 gets stuck, and the last op number assigned is 25, which makes sense (ops 12 through 25 are exactly one in-flight op per client).

This only happens sometimes during testing: it requires that two requests from clients i and j (assume i < j w.l.o.g.) are both en route to the target replica, and that i's request is delivered immediately before j's. In a distributed setting (i.e. with multiple replicas) this is unlikely to occur, because replicas have to exchange prepare messages before writing to the log (it might occur with the message delay set to 0, but I have not tested this). Still, I have no reason to believe that this is only a testing artifact, as this timing might occur naturally in the one-replica setting (imagine co-located clients issuing requests at the same time).

Steps to Reproduce the Bug

  1. Apply the patch at the bottom
  2. Run ./vopr.sh 14432206721076428670 -OReleaseSafe
    • The system will deadlock after two transitions
  3. Run ./vopr.sh 14432206721076428670 to get a detailed log.
diff --git a/src/config.zig b/src/config.zig
index ed6ac8f..f212a24 100644
--- a/src/config.zig
+++ b/src/config.zig
@@ -6,14 +6,14 @@ pub const log_level = 7;
 
 /// The maximum number of replicas allowed in a cluster.
 /// This has been limited to 5 just to decrease the amount of memory required by the VOPR simulator.
-pub const replicas_max = 5;
+pub const replicas_max = 11;
 
 /// The maximum number of clients allowed per cluster, where each client has a unique 128-bit ID.
 /// This impacts the amount of memory allocated at initialization by the server.
 /// This determines the size of the VR client table used to cache replies to clients by client ID.
 /// Each client has one entry in the VR client table to store the latest `message_size_max` reply.
 /// This has been limited to 3 just to decrease the amount of memory required by the VOPR simulator.
-pub const clients_max = 3;
+pub const clients_max = 30;
 
 /// The minimum number of nodes required to form a quorum for replication:
 /// Majority quorums are only required across view change and replication phases (not within).
diff --git a/src/simulator.zig b/src/simulator.zig
index aab7380..7631fd1 100644
--- a/src/simulator.zig
+++ b/src/simulator.zig
@@ -75,7 +75,7 @@ pub fn main() !void {
                 .one_way_delay_mean = 3 + prng.random.uintLessThan(u16, 10),
                 .one_way_delay_min = prng.random.uintLessThan(u16, 3),
                 .packet_loss_probability = prng.random.uintLessThan(u8, 30),
-                .path_maximum_capacity = 20 + prng.random.uintLessThan(u8, 20),
+                .path_maximum_capacity = 20, // must deterministically set because it screws up the randomness otherwise
                 .path_clog_duration_mean = prng.random.uintLessThan(u16, 500),
                 .path_clog_probability = prng.random.uintLessThan(u8, 2),
                 .packet_replay_probability = prng.random.uintLessThan(u8, 50),

Suggested Fix

A "obviously correct" fix is to always wait for writes, however this is infeasible.

I believe that the responsible path in repair_prepare could be adjusted to account for the one replica case. I have not tested any modifications, but this seems like a promising start. I am available on the TigerBeetle Discord for further debugging if needed.

// file: src/vsr/replica.zig

fn repair_prepare(self: *Self, op: u64) bool {
    // [snip]

    if (self.status == .view_change and op > self.commit_max) {
        // [snip]
    } else {
        const nature = if (op > self.commit_max) "uncommitted" else "committed";
        const reason = if (self.journal.faulty.bit(op)) "faulty" else "dirty";
        log.debug("{}: repair_prepare: op={} checksum={} ({s}, {s})", .{
            self.replica,
            op,
            checksum,
            nature,
            reason,
        });

        // We expect that `repair_prepare()` is called in reverse chronological order:
        // Any uncommitted ops should have already been dealt with.
        // We never roll back committed ops, and thus never regard `nack_prepare` responses.
        // Alternatively, we may not be the leader, in which case we do distinguish anyway.
        assert(self.nack_prepare_op == null);
        assert(request_prepare.context == checksum);

        // TODO: account for the one replica case here?
        if (self.choose_any_other_replica()) |replica| {
            self.send_header_to_replica(replica, request_prepare);
        }
    }

    return true;
}
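
To make the TODO above concrete, this is the rough, untested shape of the adjustment I have in mind for the one-replica case. `prepare_for_op`, `write_prepare_repair_callback`, and the `.repair` trigger are hypothetical placeholders, so treat this as a sketch of the idea rather than a patch:

// file: src/vsr/replica.zig (sketch only, untested)

        // TODO: account for the one replica case here?
        if (self.choose_any_other_replica()) |replica| {
            self.send_header_to_replica(replica, request_prepare);
        } else if (self.replica_count == 1) {
            // There is nobody else to ask: the only copy of this prepare is the one
            // still sitting in our own pipeline, so re-issue the local journal write.
            if (self.prepare_for_op(op)) |message| {
                self.journal.write_prepare(write_prepare_repair_callback, message, .repair);
            }
        }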

The Story Behind the Bug

I had (have?) a feeling that many clients may lead to problems/race conditions, so I increased the client count. Then I ran ./vopr.sh with the increased client numbers and stumbled upon this.

I was not specifically targeting the one-replica case, and was really lucky to get such a small trace to analyze (roughly the first 1.5k relevant log lines contain all the necessary info).

Songs Enjoyed During the Production of This Issue

Liquicity Yearmix 2020

Literature

No response

The Last Word

No response

@ThreeFx ThreeFx changed the title Liveness: race condition when handling successive client requests leads to deadlock Liveness: race condition when handling successive requests from two clients leads to deadlock Sep 13, 2021
jorangreef added a commit to tigerbeetle/tigerbeetle that referenced this issue Sep 14, 2021
Where `config.pipelining_max` exceeds `config.io_depth_write` it's
possible for a client request to be unable to acquire a write IOP if we
have maxed out our IO depth.

This can lead to deadlock for a cluster of one or two, since there is
no other way for the leader to repair the dirty op because no other
replica has it.

The fix is for `on_prepare_timeout()` to retry the prepare.

Reported-by: @ThreeFx

Fixes: tigerbeetle/viewstamped-replication-made-famous#5
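
For reference, a rough sketch of the shape of that retry; the pipeline and journal accessors below (`pipeline.head_ptr()`, `journal.writing()`, the callback and trigger names) are assumptions for illustration, not the code in the actual commit:

// file: src/vsr/replica.zig (sketch of the retry idea, not the actual commit)

fn on_prepare_timeout(self: *Self) void {
    // [snip]
    const prepare = self.pipeline.head_ptr().?;

    if (!self.journal.writing(prepare.message.header.op)) {
        // The earlier write_prepare() was dropped with "no IOP available":
        // re-issue the local write now that a write IOP may have been released,
        // instead of only logging "waiting for journal" and backing off.
        self.journal.write_prepare(write_prepare_on_prepare_timeout, prepare.message, .append);
        return;
    }
    // [snip]
}
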
@jorangreef (Member) commented:

Congrats @ThreeFx on being the first person to earn a TigerBeetle liveness bounty (in less than 72 hours)! You've officially opened the leaderboard.

Your report was crystal clear and easy to reproduce, you understood the issue (concurrent client requests exceed the concurrent I/O write depth, with no retry in on_prepare_timeout for replica_count<=2 when that happens), and you seem to have already spent a huge amount of time getting familiar with the protocol code.

This is a really nice find, because it would slowly shut down clients and then finally shut down the whole prepare pipeline for a cluster of one or two, and this would be triggered only under load.

You had a good instinct to explore the boundaries by setting clients higher, since this surfaced the issue, which depends on clients_max (and thus pipelining_max) exceeding io_depth_write.

The suggested fix was not quite there, because the repair_prepare() function is for requesting prepares from other replicas, whereas on_prepare_timeout() already has access to the deadlocked prepare, which is in the prepare pipeline.

Considering the quality of your report overall, we've decided to award you a $500 liveness bounty.

P.S. Thanks for sharing Liquicity Yearmix 2020, we've got it on rotation at the office now!

@jorangreef (Member) commented:

Congrats also on the bonus $50 that will now be going to the ZSF!

@jorangreef (Member) commented:

I'm pushing the fix now. Please let me know if it fixes your seed without introducing anything else.

@jorangreef (Member) commented:

Please drop us an email at info@tigerbeetle.com so we can arrange payment.

@ThreeFx (Contributor, Author) commented Sep 14, 2021

> Please drop us an email at info@tigerbeetle.com so we can arrange payment.

Done!

> I'm pushing the fix now. Please let me know if it fixes your seed without introducing anything else.

I've confirmed that the seed now works correctly for me as well.

@ThreeFx (Contributor, Author) commented Sep 14, 2021

71a35083fd1955377a699a61a672ce2c7c2366ee

ThreeFx pushed a commit to ThreeFx/viewstamped-replication-made-famous that referenced this issue Sep 14, 2021