
debug_assert causes leader to panic #898

Closed
pepinns opened this issue Jul 17, 2023 · 7 comments
@pepinns

pepinns commented Jul 17, 2023

Describe the bug
The debug_assert in the progress entry module can cause the leader to panic if a node goes offline and then comes back online after its data has been wiped, erasing its log.

To Reproduce
Steps to reproduce the behavior:

  1. Set up a cluster with static membership, using the rocks store from this repo
  2. Every node calls init() with the same membership on startup if it has no current leader defined (see the sketch below)
  3. Take down 1 node
  4. Wipe out its data directory
  5. Bring the node back online; it has no current leader stored, so it calls init()

Note that if we remove the init() on startup at step 5, the leader still panics, but the follower never recovers.
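
For context, the init-on-startup guard in steps 2 and 5 looks roughly like the sketch below. This is a simplified illustration rather than my actual code: raft is assumed to be an openraft Raft handle already in scope, initial_members is the static membership, and the exact metrics()/initialize() call shapes may differ between openraft versions.

// Simplified sketch (pseudo-code) of the startup guard from steps 2 and 5;
// `raft` and `initial_members` are assumed to be in scope, and the exact
// openraft call signatures vary by version.
let has_leader = raft.metrics().borrow().current_leader.is_some();
if !has_leader {
    // No known leader: seed the static membership. On a freshly wiped node
    // (step 5) the store is empty, so this behaves like a first-time init.
    raft.initialize(initial_members.clone()).await?;
}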

Expected behavior
I expect that the node coming back online would be brought up to date by the leader.

Actual behavior
The follower returns a Conflict message, which causes the leader to panic due to a debug_assert call. Once the leader panics, it is unrecoverable.

Env:

  • Openraft version: tested on main branch
  • Is the bug still there in the latest version? yes
  • Rust-toolchain: nightly-2023-05-04
  • OS: ubuntu 20.04
  • CPU: x86_64

Additional files:
Logs from my service right before the panic

RaftCore{id=201 cluster=dev}:replication{id=201 target=3}:main{session=44-Some(201):committed/Some(0-0) target=3 cluster=dev}:send_log_entries: openraft::replication: openraft/src/replication/mod.rs:340: append_entries resp req=(Some(44-477), Some(44-478)] resp=Conflict
RaftCore{id=201 cluster=dev}:replication{id=201 target=3}:main{session=44-Some(201):committed/Some(0-0) target=3 cluster=dev}:send_log_entries: openraft::replication: openraft/src/replication/mod.rs:420: update_conflicting target=3 request_id=3 conflict=44-477
RaftCore{id=201 cluster=dev}:replication{id=201 target=3}:main{session=44-Some(201):committed/Some(0-0) target=3 cluster=dev}: openraft::replication: openraft/openraft/src/replication/mod.rs:187: replication action done res=Ok(None)
RaftCore{id=201 cluster=dev}:replication{id=201 target=3}:main{session=44-Some(201):committed/Some(0-0) target=3 cluster=dev}: openraft::replication: openraft/openraft/src/replication/mod.rs:527: drain_events
RaftCore{id=201 cluster=dev}:main:runtime_loop{id=201}:handle_notify{state=Leader id=201}: openraft::core::raft_core: openraft/src/core/raft_core.rs:1153: recv from rx_notify: Replication command done: UpdateReplicationProgress: target: 3, id: 3, result: Ok(UTime { data: Conflict(LogId { leader_id: CommittedLeaderId { term: 44, p: PhantomData<u64> }, index: 477 }), utime: Some(Instant { tv_sec: 1453442, tv_nsec: 389792525 }) }), session_id: 44-Some(201):committed/Some(0-0)
RaftCore{id=201 cluster=dev}:main:runtime_loop{id=201}:handle_notify{state=Leader id=201}:handle_replication_progress: openraft::core::raft_core: openraft/src/core/raft_core.rs:1414: handle_replication_progress target=3 result=Ok(UTime { data: Conflict(LogId { leader_id: CommittedLeaderId { term: 44, p: PhantomData<u64> }, index: 477 }), utime: Some(Instant { tv_sec: 1453442, tv_nsec: 389792525 }) })
RaftCore{id=201 cluster=dev}:main:runtime_loop{id=201}:handle_notify{state=Leader id=201}:handle_replication_progress:update_progress: openraft::engine::handler::replication_handler: openraft/src/engine/handler/replication_handler/mod.rs:290: openraft::engine::handler::replication_handler::ReplicationHandler<_>::update_progress target=3 request_id=3 result=Ok(UTime { data: Conflict(LogId { leader_id: CommittedLeaderId { term: 44, p: PhantomData<u64> }, index: 477 }), utime: Some(Instant { tv_sec: 1453442, tv_nsec: 389792525 }) }) progress={201: {[Some(44-478), 479), inflight:None}, 5: {[Some(44-478), 479), inflight:None}, 1: {[Some(44-478), 479), inflight:None}, 3: {[Some(44-477), 478), inflight:Logs(id=3):(Some(44-477), Some(44-478)]}, 101: {[Some(44-478), 479), inflight:None}}
RaftCore{id=201 cluster=dev}:main:runtime_loop{id=201}:handle_notify{state=Leader id=201}:handle_replication_progress:update_progress:update_success_progress:update_leader_vote_clock: openraft::engine::handler::replication_handler: openraft/src/engine/handler/replication_handler/mod.rs:168: openraft::engine::handler::replication_handler::ReplicationHandler<_>::update_leader_vote_clock target=3 t=Instant { tv_sec: 1453442, tv_nsec: 389792525 }
RaftCore{id=201 cluster=dev}:main:runtime_loop{id=201}:handle_notify{state=Leader id=201}:handle_replication_progress:update_progress:update_success_progress:update_leader_vote_clock: openraft::engine::handler::replication_handler: openraft/src/engine/handler/replication_handler/mod.rs:176: granted leader vote clock after updating granted=Some(Instant { tv_sec: 1453441, tv_nsec: 733456621 }) clock_progress=VecProgress { quorum_set: Joint { data: [[1, 3, 5, 101, 201]], _p: PhantomData<(u64, alloc::vec::Vec<u64>)> }, granted: Some(Instant { tv_sec: 1453441, tv_nsec: 733456621 }), voter_count: 5, vector: [(3, Some(Instant { tv_sec: 1453442, tv_nsec: 389792525 })), (101, Some(Instant { tv_sec: 1453441, tv_nsec: 733727536 })), (5, Some(Instant { tv_sec: 1453441, tv_nsec: 733456621 })), (1, Some(Instant { tv_sec: 1453441, tv_nsec: 733357209 })), (201, None)], stat: Stat { update_count: 12, move_count: 12, is_quorum_count: 27 } }
RaftCore{id=201 cluster=dev}:main:runtime_loop{id=201}:handle_notify{state=Leader id=201}:handle_replication_progress:update_progress:update_success_progress:update_conflicting: openraft::progress::entry: openraft/src/progress/entry/mod.rs:115: update_conflict self=ProgressEntry { matching: Some(LogId { leader_id: CommittedLeaderId { term: 44, p: PhantomData<u64> }, index: 477 }), curr_inflight_id: 3, inflight: Logs { id: 3, log_id_range: LogIdRange { prev_log_id: Some(LogId { leader_id: CommittedLeaderId { term: 44, p: PhantomData<u64> }, index: 477 }), last_log_id: Some(LogId { leader_id: CommittedLeaderId { term: 44, p: PhantomData<u64> }, index: 478 }) } }, searching_end: 478 } request_id=3 conflict=477
RaftCore{id=201 cluster=dev}:main:runtime_loop{id=201}:handle_notify{state=Leader id=201}:handle_replication_progress:update_progress:try_purge_log: openraft::engine::handler::replication_handler: openraft/src/engine/handler/replication_handler/mod.rs:419: try_purge_log last_purged_log_id=None purge_upto=None
RaftCore{id=201 cluster=dev}:main:runtime_loop{id=201}:handle_notify{state=Leader id=201}:handle_replication_progress:update_progress:try_purge_log: openraft::engine::handler::replication_handler: openraft/src/engine/handler/replication_handler/mod.rs:426: no need to purge, return
thread 'api' panicked at 'assertion failed: matching_next <= end', openraft/src/progress/entry/mod.rs:199:9

Additional context
The following patch seems to resolve the issue, as it allows the leader to stay online, and the follower eventually gets updated.

index 0dba3c25..29a0c393 100644
--- a/openraft/src/progress/entry/mod.rs
+++ b/openraft/src/progress/entry/mod.rs
@@ -196,6 +196,9 @@ impl<NID: NodeId> ProgressEntry<NID> {
     }

     fn calc_mid(matching_next: u64, end: u64) -> u64 {
+        if matching_next > end {
+            return end;
+        }
         debug_assert!(matching_next <= end);
         let d = end - matching_next;
         let offset = d / 16 * 8;

The actual offending code could be in append_entries on the follower, but I do think the leader should, at the very least, be resilient to faults and not panic in this case.
There was no real error path in this code: it returns a Result<T, E>, but E is the same type as T in this case. The above was the simplest way I could think of to avoid this panic, but ideally, if the code has an error condition, there would be an actual error return path so the leader could react and perhaps fix whatever internal state is causing this situation.
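
For illustration, an explicit error path could look like the standalone sketch below. This is not the actual openraft code: the InvalidProgress type is invented here, and the final return value is a guess, since the diff above does not show the rest of calc_mid.

// Hypothetical sketch of calc_mid with an explicit error path instead of a
// debug_assert; not the actual openraft implementation.
#[derive(Debug, PartialEq)]
struct InvalidProgress {
    matching_next: u64,
    end: u64,
}

fn calc_mid(matching_next: u64, end: u64) -> Result<u64, InvalidProgress> {
    if matching_next > end {
        // Report the inconsistent progress to the caller instead of panicking,
        // so the leader can decide how to repair its state.
        return Err(InvalidProgress { matching_next, end });
    }
    let d = end - matching_next;
    let offset = d / 16 * 8;
    // The tail of the original function is not shown in the diff above;
    // returning matching_next + offset is only a guess to complete the sketch.
    Ok(matching_next + offset)
}

fn main() {
    assert_eq!(calc_mid(478, 478), Ok(478));
    assert!(calc_mid(479, 478).is_err());
}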

@github-actions

👋 Thanks for opening this issue!

Get help or engage by:

  • /help : to print help messages.
  • /assignme : to assign this issue to you.

@drmingdrmer drmingdrmer self-assigned this Jul 18, 2023
@drmingdrmer
Member

Thank you for your detailed report. By the current design, this is not
considered a bug.

Because the removed and then restarted node has unexpectedly changed its data,
the leader cannot guarantee that it handles such a condition correctly.

In a distributed system like Openraft, a node must guarantee that its persisted
data won't change. Otherwise it is a severe bug, and the entire system should
be shut down at once to prevent further data damage.

However, we understand that in certain scenarios, like testing or
troubleshooting, you may need to wipe out all data and restart the node.

To address your specific use case, I propose adding a feature flag that relaxes
the replication progress checks. This would allow the leader node to continue data
replication to the follower, even in scenarios where all data has been wiped out
and the node restarted.

Please let me know if you have further questions.

@pepinns
Author

pepinns commented Jul 18, 2023

I guess I'm thinking of this slightly differently: in my case, all the nodes are known in advance and configured as a cluster. These clusters are then deployed as an isolated unit of 5 machines.

I was thinking of production scenarios, where a node has to be replaced. In production the nodes could be VMs, which can die at any time. Being able to bring a new one online to replace a failed one would make this easier to operate, and it does seem to work if we don't panic in the leader.
With the panic (even if it is handled and a shutdown is initiated), the rest of the cluster is fine, but the leader ends up shutting down abruptly and restarting.

Is there something I'm not considering here that makes this operational plan invalid or dangerous?

@pepinns
Author

pepinns commented Jul 18, 2023

The feature flag seems like a good plan, though. I must admit I didn't find a good place in the code path to check for this case.
Do you think the solution belongs on the follower-side code path, or in the leader's handling of the Conflict message?

@drmingdrmer
Member

@pepinns
Yes, it is dangerous: replacing a node with an empty one may cause data loss.
Assume the leader is N1 and the followers are N2, N3, N4, N5:

  • A log entry x that N1 has replicated to N2 and N3 is considered committed.
  • At this point, if N3 is replaced with an empty node and the leader N1 crashes right away, then N5 may be elected as the new leader with votes granted by N3 and N4.
  • The new leader N5 will then not have log entry x.

The standard way is to call change_membership() to remove the crashed node, then start a new empty node, and finally call change_membership() again to add the new node back to the cluster (see the sketch below).
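
Roughly, that flow looks like the pseudo-code below (an openraft-0.8-style sketch: raft is a Raft handle on the leader, node ids are u64, new_node describes the replacement node, and the exact signatures differ between openraft versions).

// Pseudo-code sketch of the standard replacement flow; exact openraft
// signatures differ between versions.
use std::collections::BTreeSet;

// 1. Remove the crashed node (id 3 here) from the voter set and let that
//    membership change commit.
let voters: BTreeSet<u64> = [1, 2, 4, 5].into_iter().collect();
raft.change_membership(voters, false).await?;

// 2. Start the replacement node with an empty store, add it as a learner so
//    it can catch up, then promote it to a voter.
raft.add_learner(6, new_node, true).await?;
let voters: BTreeSet<u64> = [1, 2, 4, 5, 6].into_iter().collect();
raft.change_membership(voters, false).await?;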

Another non-standard, tricky way is to set up a cluster of 6 nodes. The quorum is then 4 nodes, so you can replace one node with an empty one without losing data.

I'll open a PR to show you where to add the patch to address the panicking issue.

@pepinns
Author

pepinns commented Jul 19, 2023

Thanks for clarifying; I was overlooking the fact that a log entry is committed before it's replicated everywhere.

@pepinns pepinns closed this as completed Jul 19, 2023
drmingdrmer added a commit to drmingdrmer/openraft that referenced this issue Jul 20, 2023
…res loosen-follower-log-revert`

Add a new feature flag `loosen-follower-log-revert`, to permit the
follower's log to roll back to an earlier state without causing the
leader to panic.

Although log state reversion is typically seen as a bug, enabling it can
be useful for testing or in some special scenarios.
For instance, in a cluster with an even number of nodes, erasing one node's
data and then rebooting it (its log reverts to empty) will not result in data
loss.

- Related issue: datafuselabs#898
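
Once a release ships this flag, enabling it should be a one-line Cargo change, roughly like the placeholder below (the git source is shown only for illustration; pin whichever openraft release or commit actually includes the feature).

[dependencies]
# Placeholder dependency spec: enable the `loosen-follower-log-revert` feature
# on whichever openraft release or commit ships it.
openraft = { git = "https://github.com/datafuselabs/openraft", features = ["loosen-follower-log-revert"] }
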
@drmingdrmer
Member

@pepinns
The log reversion issue would best be addressed at the point where the conflicting event is reported to the progress:
