ISE: inconsistent index #29
Here is what looks like a similar exception. This one is logged on a follower node that was killed and brought back up. This particular node was killed earlier when it was the leader.
This is great. I'll try to reproduce them. Based on the stack traces, these are two similar but different exceptions. One is thrown when appending to the log, and one is thrown when reading. The exception thrown when appending to the log indicates that the index of the entry being appended does not match the next index in the log. This is an extra consistency check to catch these types of bugs, so I suppose it did its job. Basically, the Segment's nextIndex() differs from the Entry's index(). The two should be reconciled during AppendRequest handling (by truncating the log or skipping entries as necessary).
The exception thrown when reading the log indicates that the offset stored in the Segment file does not match the expected offset in the index. The index is built from disk by reading offsets for each entry, so these should match. But there are many potential factors that can mess that up, like truncating the log and searching the index. I'll have to go through that again and see if I can break it. I suspect something in OffsetIndex is off. That can actually cause both of these exceptions, since the segment's nextIndex() is dependent on the entries in its OffsetIndex.
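For illustration, here is a minimal, self-contained sketch of the two checks described above. The Segment, Entry, and offset-index names only loosely mirror the Copycat classes; this is not the actual library source, just the shape of the append-time and read-time assertions.

```java
// Illustrative sketch of the "inconsistent index" checks; names are assumptions,
// not the real Copycat Segment/OffsetIndex implementation.
import java.util.ArrayList;
import java.util.List;

final class InconsistentIndexSketch {

  // A log entry identified by its absolute index in the log.
  static final class Entry {
    final long index;
    Entry(long index) { this.index = index; }
    long index() { return index; }
  }

  static final class Segment {
    private final long firstIndex;
    // Simulates offsets persisted in the segment file, one per appended entry.
    private final List<Long> fileOffsets = new ArrayList<>();
    // Simulates the in-memory offset index rebuilt from the file on startup.
    private final List<Long> offsetIndex = new ArrayList<>();

    Segment(long firstIndex) { this.firstIndex = firstIndex; }

    // The next index this segment expects, derived from the offset index size.
    long nextIndex() { return firstIndex + offsetIndex.size(); }

    // Append-time check: the entry's index must equal the segment's next index.
    void append(Entry entry) {
      if (entry.index() != nextIndex()) {
        throw new IllegalStateException("inconsistent index: " + entry.index());
      }
      long offset = entry.index() - firstIndex;
      fileOffsets.add(offset);
      offsetIndex.add(offset);
    }

    // Read-time check: the offset stored in the "file" must match the offset
    // the in-memory index expects for the requested index.
    Entry get(long index) {
      int slot = (int) (index - firstIndex);
      if (slot < 0 || slot >= offsetIndex.size()
          || !fileOffsets.get(slot).equals(offsetIndex.get(slot))) {
        throw new IllegalStateException("inconsistent index: " + index);
      }
      return new Entry(index);
    }
  }

  public static void main(String[] args) {
    Segment segment = new Segment(1);
    segment.append(new Entry(1));
    segment.append(new Entry(2));
    segment.get(2);               // ok
    segment.append(new Entry(4)); // skips index 3 -> "inconsistent index: 4"
  }
}
```

Either check firing means some in-memory bookkeeping has drifted from the entries actually on disk, which is consistent with the truncation/compaction suspects discussed below.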
I set up a test to reproduce this but so far haven't been able to reproduce either one. @madjam do you think you can manage to get a test that reproduces this in some branch? It might be difficult if the bug relies on how the Raft logic is skipping/truncating entries. I'll keep trying to find it. Once I can get it in a debugger it should be an easy fix.
Interesting... The code path that this follows indicates that the previous index/term for the
I wonder if
I pushed a fix for the I'm not sure that is related to this problem, but it certainly could be. Still looking at the other logic.
Unfortunately I don't have the logs from that run saved. But I don't think it would be very hard to reproduce. Let me try and I'll post the request dumps.
Ugh I messed that commit up: 530c3fa
No success so far reproducing this particular exception.
Excellent. Yeah, I'm trying to reproduce this based on your steps above and having no luck so far myself either. I want to at least try to find out if it's related to 530c3fa which needs to be merged regardless.
Something to note is that after I saw this exception, a restart restored the node to proper health. That led me to suspect that this has something to do with one of the in-memory indices getting corrupted. Whether that is nextIndex or the segment offset index is not clear.
Ahh that's interesting and relevant. You may be totally right. The Just have to keep watching out for it I guess.
Check this out: https://travis-ci.org/atomix/copycat#L4628 This seems to happen right when the leader transitions. The I'm going to merge the
Another assertion failure potentially pointing to one of the indices getting out of sync. Note that here I am NOT running with your
This time I have the debug logs saved. I will look through them to see what happened before. Again, the test involved killing the current leader (in a 3-node cluster), bringing it back up, allowing some time for its log to catch up, and then repeating the process. All the while a separate client is appending log entries at the rate of a few hundred per second.
These must all be related to the same issue. I'll spend the weekend hitting it hard and hopefully get a good reproducer. The logs should help; we should be able to deduce the expected state of the log from them.
After parsing the logs, here's what I gathered: I killed s1 (the current leader)
I'm expecting that your change to reset nextIndex will fix this. Also, per the Raft paper, that is the right thing to do.
Oh awesome! I just pushed it. We should watch out for this from here on.
There's actually another related bug too. I don't think
At this point, the new leader still has This could be resolved by always forcing the I'll push another fix for this.
That makes sense. That is what the paper specifies for how matchIndex should be initialized. So that is a good change to make.
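For reference, a minimal sketch of the rule from the Raft paper being discussed: when a node becomes leader, it reinitializes nextIndex to its last log index + 1 and matchIndex to 0 for every follower. The MemberState and onLeadershipStart names here are hypothetical, not the actual Copycat code.

```java
// Hypothetical sketch of re-initializing leader volatile state per the Raft paper.
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LeaderVolatileStateSketch {

  static final class MemberState {
    long nextIndex;
    long matchIndex;
  }

  private final Map<String, MemberState> members = new HashMap<>();

  // Called once when this node transitions to leader.
  void onLeadershipStart(List<String> followerIds, long lastLogIndex) {
    for (String id : followerIds) {
      MemberState state = members.computeIfAbsent(id, k -> new MemberState());
      // Optimistically assume the follower is up to date; back off on rejection.
      state.nextIndex = lastLogIndex + 1;
      // Assume nothing is known to be replicated until an AppendEntries succeeds.
      state.matchIndex = 0;
    }
  }

  public static void main(String[] args) {
    LeaderVolatileStateSketch sketch = new LeaderVolatileStateSketch();
    sketch.onLeadershipStart(Arrays.asList("s2", "s3"), 100L);
    MemberState s2 = sketch.members.get("s2");
    System.out.println("nextIndex=" + s2.nextIndex + ", matchIndex=" + s2.matchIndex);
  }
}
```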
I ran into the inconsistent index assertion failure again, but this time I probably have a good idea of what triggered it. Scenario:
Ahh this is great. That it was after compaction and after a configuration change is interesting. It seems to me like maybe entry What we know is that This may be a huge help.
Hmm... No such luck.
@madjam I was able to reproduce this in a debugger, but the way I reproduced it was via a bad configuration. Basically, if I start several servers each with the same log directory, I can get the stack trace in #29 (comment). I found this by accident. I'm assuming this isn't what happened in your case, but worth sharing.
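For anyone who runs into the same misconfiguration, here is a hedged sketch of giving each server its own log directory. It assumes the Copycat 1.x Storage builder API (Storage.builder(), withDirectory, withStorageLevel); adjust the calls to whatever version you are on.

```java
// Assumed Copycat 1.x storage configuration; the storageFor helper is hypothetical.
import io.atomix.copycat.server.storage.Storage;
import io.atomix.copycat.server.storage.StorageLevel;
import java.io.File;

final class StorageConfigSketch {
  static Storage storageFor(String serverId) {
    return Storage.builder()
        // A distinct directory per server; sharing one directory across servers
        // corrupts the log and can produce the "inconsistent index" exception.
        .withDirectory(new File("logs/" + serverId))
        .withStorageLevel(StorageLevel.DISK)
        .build();
  }
}
```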
@madjam this is still an issue, right? I still haven't been able to reproduce it (aside from by human error). I'm gonna see if Jepsen can reproduce it for me tonight.
I haven't seen it since your change to reset all volatile state at the start of leadership. It is likely that that fixed it or made it much harder to reproduce :)
Hmm... Hopefully it's the former! We'll leave this open and I'll try to see if Jepsen can break it.
@kuujo Was there anything interesting in the Jepsen run?
@madjam nothing yet. But really the most sensitive parts of the algorithm (configuration changes and log compaction) are not well tested in Jepsen yet, so we're working on updating tests to randomly add and remove servers and compact the logs much more frequently. If tests continue to go well in that case, that will make me feel warm and cozy.
I got a little sidetracked going through all the code and really thinking about the algorithms. I managed to find several bugs and submitted PRs for all of them. Once they're cleaned up and merged and I review the rest of the code, I'll take it to Jepsen again. I did find a case where this bug was likely still possible. The original fix to reset
Ran into this issue running a 3-node setup of AtomixReplicas. Looks like this happened immediately after a leader switch. I need to look at the code to see what could be going on.
A repro, this time with debug-level logging, shows a curious sequence of events:
This particular back and forth (right before the exception) is very puzzling.
Excellent! @madjam sorry, I have been seeing your comments/PRs. I've been out of town since Monday but am back now and back to work tomorrow. I'd love to figure this one out. Just too tired ATM :-)
@kuujo No worries :) I will shortly add some more details on the issue from a debug trace I captured. That should help figure this out.
Here is a summary from the debug trace I gathered. I restarted a 3-node cluster of
I see a couple of issues to understand
Wow interesting. So, S2 and S3 never receive and thus never respond to the Append for the On the The only critical (for safety) part of cleaning entries is that tombstones are I'll open another issue to make no-op entries tombstones.
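As a rough illustration of the tombstone-safety rule being referenced, assuming (per the rest of this thread) that globalIndex is the highest index stored on all servers and that tombstones may only be compacted once they fall at or below it, here is a hedged sketch of a compaction predicate. The names and the commitIndex rule for normal entries are simplifications, not Copycat's actual compactor.

```java
// Hypothetical compaction predicate illustrating the tombstone-safety rule.
final class CompactionRuleSketch {

  static final class EntryMeta {
    final long index;
    final boolean tombstone;
    final boolean cleaned; // marked for removal by the state machine
    EntryMeta(long index, boolean tombstone, boolean cleaned) {
      this.index = index;
      this.tombstone = tombstone;
      this.cleaned = cleaned;
    }
  }

  // commitIndex: highest committed entry; globalIndex: highest entry stored on all servers.
  static boolean canCompact(EntryMeta entry, long commitIndex, long globalIndex) {
    if (!entry.cleaned) {
      return false;
    }
    // Tombstones must be stored everywhere before removal; normal entries only committed.
    return entry.tombstone ? entry.index <= globalIndex : entry.index <= commitIndex;
  }

  public static void main(String[] args) {
    EntryMeta tombstone = new EntryMeta(10, true, true);
    System.out.println(canCompact(tombstone, 12, 9));  // false: not yet on all servers
    System.out.println(canCompact(tombstone, 12, 10)); // true: safe to remove
  }
}
```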
[Re-posting comment from the PR thread] The issue I noticed occurs when a node loses leadership with In this state
This is because when this node tries to validate the previous entry's term, Log#get returns null since
Oops, guess I should have posted this here: Ahh right. No, I think you're totally right. This definitely needs to be implemented. I've seen that odd rejection of an AppendRequest before and never realized the cause. Yeah, this commit definitely wasn't a fix for that; it was just a fix for another potential problem (inconsistency in session expiration after leadership changes).
Actually, a similar restriction is already in place on log compaction. A segment can only be compacted if there's a non-empty segment with a committed entry later in the log. This ensures there aren't gaps in the log after compaction, since one committed, uncompacted entry will always remain later in the log. You're right that the same needs to be done to ensure logIndex/logTerm can be checked in AppendRequest handling. I don't see any safety issues with this. Just like in log compaction, where we slow down compaction by one entry, we're just making the globalIndex increase potentially one entry slower, which is fine.
Awesome catch! So glad that was figured out. I'll submit a PR for this unless you want to. Right now globalIndex is calculated as the lowest matchIndex (the lowest index stored on all servers). I think what needs to happen is globalIndex should just be limited to max(lowest matchIndex - 1, 0). That will put the logic for handling this on the leader rather than inside the log/log compaction, since this is more a detail of AppendRequest/Response.
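A minimal sketch of that calculation, assuming the leader tracks a matchIndex per member; the globalIndex method name here is hypothetical rather than the actual LeaderState code:

```java
// Illustrative globalIndex calculation: stay one entry behind the lowest matchIndex
// so at least one committed, uncompacted entry remains for the prev-index/term check.
import java.util.Arrays;
import java.util.Collection;

final class GlobalIndexSketch {

  // matchIndexes: the highest index known to be replicated on each member.
  static long globalIndex(Collection<Long> matchIndexes) {
    long lowest = Long.MAX_VALUE;
    for (long matchIndex : matchIndexes) {
      lowest = Math.min(lowest, matchIndex);
    }
    if (lowest == Long.MAX_VALUE) {
      return 0;
    }
    // max(lowest matchIndex - 1, 0), as described above.
    return Math.max(lowest - 1, 0);
  }

  public static void main(String[] args) {
    System.out.println(globalIndex(Arrays.asList(7L, 5L))); // prints 4
  }
}
```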
Sounds good. I made the same change you suggested. I can submit a PR shortly.
Hmm... @madjam actually I think this last bug may have been handled incorrectly. I don't think the fix broke anything, but it did overlook something. I'm going to open another issue so we can discuss.
I just got another instance of this exception during testing:
The scenario here was that I was simply submitting 3k writes/sec to an Atomix I'll see if it continues to reproduce this.
I think #93 may have been what it took to get rid of these. I haven't been able to reproduce it any more. Going to close this.
I haven't done much debugging on this yet, but thought I would log this first.
Here's what I was doing:
Here's what I did to see this error:
```
12:21:00.424 [copycat-server-Address[localhost/127.0.0.1:5003]] ERROR i.a.c.u.c.SingleThreadContext - An uncaught exception occurred
java.lang.IllegalStateException: inconsistent index: 1605633
    at io.atomix.catalyst.util.Assert.state(Assert.java:69) ~[classes/:na]
    at io.atomix.copycat.server.storage.Segment.get(Segment.java:319) ~[classes/:na]
    at io.atomix.copycat.server.storage.Log.get(Log.java:319) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator.entriesCommit(LeaderState.java:1040) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator.commit(LeaderState.java:980) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator.lambda$commit$126(LeaderState.java:876) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator$$Lambda$111/982607974.apply(Unknown Source) ~[na:na]
    at java.util.Map.computeIfAbsent(Map.java:957) ~[na:1.8.0_25]
    at io.atomix.copycat.server.state.LeaderState$Replicator.commit(LeaderState.java:874) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState$Replicator.access$100(LeaderState.java:818) ~[classes/:na]
    at io.atomix.copycat.server.state.LeaderState.accept(LeaderState.java:652) ~[classes/:na]
    at io.atomix.copycat.server.state.ServerState.lambda$registerHandlers$20(ServerState.java:421) ~[classes/:na]
    at io.atomix.copycat.server.state.ServerState$$Lambda$34/1604755402.handle(Unknown Source) ~[na:na]
    at io.atomix.catalyst.transport.NettyConnection.handleRequest(NettyConnection.java:102) ~[classes/:na]
    at io.atomix.catalyst.transport.NettyConnection.lambda$0(NettyConnection.java:90) ~[classes/:na]
    at io.atomix.catalyst.transport.NettyConnection$$Lambda$50/1312252238.run(Unknown Source) ~[na:na]
    at io.atomix.catalyst.util.concurrent.Runnables.lambda$logFailure$7(Runnables.java:20) ~[classes/:na]
    at io.atomix.catalyst.util.concurrent.Runnables$$Lambda$4/1471868639.run(Unknown Source) [classes/:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_25]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_25]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_25]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_25]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_25]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_25]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_25]
```