Codecov Report
@@            Coverage Diff             @@
##           master     #446      +/-   ##
==========================================
- Coverage   76.73%   76.60%   -0.14%
==========================================
  Files          51       51
  Lines        9597     9616      +19
  Branches     2443     2456      +13
==========================================
+ Hits         7364     7366       +2
- Misses       1159     1166       +7
- Partials     1074     1084      +10
I actually don't mind it! Given the constraints it seems like a nice solution to me.
Jepsen doesn't seem extremely happy: https://github.com/canonical/jepsen.dqlite/actions/runs/5390663924/jobs/9786357309. Taking a look before removing draft.
Investigating the failures, it looks like raft has trouble decoding a configuration found in the raft log when taking a snapshot. All of the occurrences follow a truncation of the log, though. I suspect that removing the barrier before taking the snapshot, together with the truncation (which in turn also requires a barrier), has some unanticipated interaction.
/* The next entry in the first open segment will come after the snapshot
 * entries. */
if (put->trailing == 0 && uv->prepare_next_counter == 0) {
        uv->prepare_next_counter = 1;
It might be unrelated to the Jepsen failures, but after a very quick look at the PR's diff I noticed this change, and I'm not entirely sure it's enough. Could it be that there is a ready but as yet unused open-0 segment that will be the one used after the snapshot is installed?
Just throwing it out there in case it helps; I didn't really look much.
Perhaps it'd be a good idea to attach to this PR the Jepsen artifacts containing the logs of one of the failing tests, for future reference (since they'll get deleted at some point).
Not sure if it's a real bug in the current implementation; maybe (probably) it's just a bug in this PR. Attaching anyway for completeness. What is fishy
So the node decides to truncate to index 4353, but index 4808 is supposed to be committed; that should not be possible. Later it is (probably) trying to decode a non-configuration entry as a configuration entry.
My gut feeling is that, due to the issue in #435 (where a second barrier callback doesn't fire when all active segments are involved in a barrier), the log truncation didn't happen while a snapshot barrier was active, covering up an issue that is being exposed now. But still investigating.
Could well be, but if that's the case, I'm wondering why such a bug was not surfaced by Jepsen before the non-blocking barrier code was put in place. Or maybe our Jepsen suite is now running nastier tests that it was not running before. Or we simply didn't examine some Jepsen failures before? Anyway, logs should indeed shed some light on this.
Yeah, I'm reasoning a bit instinctively here, trying to gather proof now :-)
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
5db80f7 to 6b9ce19
What seems to happen is the following: Jepsen has a snapshot trailing of Node takes snapshot at 2454
Data folder contains segment

After the snapshot completes, the log is truncated, but only the closed segment is removed (it contains entries older than the trailing cutoff, 1430). The open segment, which was not closed because we no longer fire a barrier before the snapshot, is left as-is. Next time we restart, raft sees the open segment, no closed segments, and the snapshot, and erroneously places all entries in the open segment after the snapshot, leading to data corruption.

Will try to investigate what we can do. An explicit start index in an open segment sure sounds nice; it's just too tricky without it. Leaving the barrier before taking a snapshot and living with the extra complexity is an easier, and already tested, option too.

Edit: Or we could say, "Always leave at least the last closed segment", but then what's the point of the trailing parameter? It gets a bit fuzzy then.
I could live with this, by the way; I'd rather have this than the blocking / non-blocking barrier stuff. @cole-miller @freeekanayaka What do you think?
Also, like @cole-miller mentioned, the introduction of
I'm going to revisit this with the insights from this exercise and live with the blocking / non-blocking distinction for now.
This is a draft PR that implements the approach to remove the (non)blocking barrier distinction discussed in #444
I believe @cole-miller is not a huge fan of the approach. Alternatives would be: