Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: prevent follower panic on commit index regression #25

Closed
wants to merge 1 commit into from

Conversation

shaj13
Copy link

@shaj13 shaj13 commented Feb 11, 2023

panic: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?

  • The follower recovers from commit index regression and sends MsgHeartbeatResp with a rejection

  • The leader receives MsgHeartbeatResp rejection, reconciles and decreases follower progress

Logs:
Leader Node:
2023/02/11 16:45:27 INFO: 1 failed to send message to 2 because it is unreachable [StateProbe match=20 next=21]
2023/02/11 16:45:27 INFO: 1 failed to send message to 2 because it is unreachable [StateProbe match=20 next=21]
2023/02/11 16:45:27 INFO: 1 failed to send message to 2 because it is unreachable [StateProbe match=20 next=21]
2023/02/11 16:45:27 INFO: 1 failed to send message to 2 because it is unreachable [StateProbe match=20 next=21]
2023/02/11 16:45:28 INFO: 1 failed to send message to 2 because it is unreachable [StateProbe match=20 next=21]
2023/02/11 16:45:28 INFO: 1 received MsgHeartbeatResp(rejected, hint: (index 2)) from 2 for index 20
2023/02/11 16:45:28 INFO: 1 decreased progress of 2 to [StateProbe match=2 next=3]

Follower Node:
2023/02/11 16:45:28 INFO: 2 became follower at term 5
2023/02/11 16:45:28 PANIC: tocommit(20) is out of range [lastIndex(2)]. Was the raft log corrupted, truncated, or lost?
2023/02/11 16:45:28 INFO: raft.node: 2 elected leader 1 at term 5

It also sends a snapshot
Leader Node:
2023/02/11 18:41:51 INFO: 1 failed to send message to 2 because it is unreachable [StateProbe match=14 next=15]
2023/02/11 18:41:52 INFO: 1 received MsgHeartbeatResp(rejected, hint: (index 2)) from 2 for index 14
2023/02/11 18:41:52 INFO: 1 decreased progress of 2 to [StateProbe match=2 next=3]
2023/02/11 18:41:52 INFO: 1 [firstindex: 8, commit: 14] sent snapshot[index: 12, term: 6] to 2 [StateProbe match=2 next=3]
2023/02/11 18:41:52 INFO: 1 paused sending replication messages to 2 [StateSnapshot match=2 next=3 paused pendingSnap=12]
2023/02/11 18:41:52 INFO: 1 snapshot succeeded, resumed sending replication messages to 2 [StateProbe match=2 next=13]

Follower Node:
2023/02/11 18:41:52 INFO: log [committed=2, applied=2, unstable.offset=3, len(unstable.Entries)=0] starts to restore snapshot [index: 12, term: 6]
2023/02/11 18:41:52 INFO: 2 switched to configuration voters=(1 2)
2023/02/11 18:41:52 INFO: 2 [commit: 12, lastindex: 12, lastterm: 6] restored snapshot [index: 12, term: 6]
2023/02/11 18:41:52 INFO: 2 [commit: 12] restored snapshot [index: 12, term: 6]

panic: tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?

* The follower recovers from commit index regression and sends MsgHeartbeatResp with a rejection

* The leader receives MsgHeartbeatResp rejection, reconciles and decreases follower progress
@shaj13 shaj13 marked this pull request as draft February 11, 2023 15:10
@shaj13
Copy link
Author

shaj13 commented Feb 21, 2023

@tbg with whom i can discuss the PR

@tbg tbg self-requested a review February 27, 2023 07:55
@tbg
Copy link
Collaborator

tbg commented Feb 28, 2023

Hitting this panic usually means that a follower has not upheld its durability guarantees. Is this the context that motivated this change? If not, it should be possible to recreate the situation using TestInteraction1 and I'd kindly ask you to do so.

If this is about making raft more tolerant of assertion violations - I'm sympathetic to that, but I think we'd need to be a bit more careful to introduce this. The way I would structure this is actually a little different, by introducing event-based logging:

Instead of a line like this

raft/log.go

Line 324 in d9907d6

l.logger.Panicf("tocommit(%d) is out of range [lastIndex(%d)]. Was the raft log corrupted, truncated, or lost?", tocommit, l.lastIndex())

We'd have something like this (total strawman just to get the idea across)

l.logger.Event(&CommitOutOfRangeIndex{Commit: tocommit, LastIndex: l.lastIndex()})
// Code here to actually handle the problem gracefully
...

where the default logger would panic but users could define a logger that would just log the event and keep going. We wouldn't have to make all events that are now panics recoverable at first but could allow this only for certain events like the one discussed here.

I'd like to have this, but only have limited bandwidth for a direct contribution or supervising this work right now. I'll file this as an issue too, and close this PR as it is currently a stub that would need more maintainer support than I can currently provide. We can discuss more on the issue: someone will have to "sponsor" the work, i.e. provide a solution that only requires minor adjustments (i.e. 2-3 review cycles) to get merged.

Footnotes

  1. https://github.com/etcd-io/raft/blob/6bf4f7fe3122b064e0a0d76314298dca6f379fc7/interaction_test.go#L25-L35

@tbg
Copy link
Collaborator

tbg commented Feb 28, 2023

#29

@tbg tbg closed this Feb 28, 2023
@shaj13 shaj13 changed the title fix: prevent follower panic on commit index regression Draft: prevent follower panic on commit index regression (re-opened for discussion) Apr 23, 2024
@shaj13 shaj13 changed the title Draft: prevent follower panic on commit index regression (re-opened for discussion) Draft: prevent follower panic on commit index regression Apr 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants