Fix callback for second barrier not being attached #435
Would this be a purely in-memory thing with no connection to any disk file?
The test is using the I'd suggest using only public APIs in these tests, so it's easier to understand what real-world scenarios they are exercising, and there's no risk of the test running logic that would not run in the real world.
I'm mainly trying to first understand what real-world situation this test is simulating.
I didn't go through all the details, but an idea could be to add a queue for all barrier requests that have been submitted after all open segments are already attached to a barrier request. The callbacks for those queued barrier requests should be fired once the last attached barrier completes.
Or, to make this a bit less ad hoc and more general, all barrier requests could be put in a queue, and each barrier would hold a pointer to the open segment it is waiting on to be finalized (which is always the "active" open segment at the time the barrier request was placed). Whenever an open segment gets finalized, the queue should be scanned to see if there are barrier requests that can now be removed, and the associated callbacks would be fired.
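A minimal sketch of the queue idea described above, just to make it concrete. All names here (`struct segment`, `struct barrier_req`, `queue_push`, `queue_segment_finalized`) are hypothetical and are not the types or functions used in the raft code base; the only behaviour taken from the comment is that each request remembers the open segment it waits on, and that finalizing a segment scans the queue and fires the matching callbacks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an open segment, not the real uv type. */
struct segment
{
    int id;
    int finalized;
};

typedef void (*barrier_cb)(void *data);

/* One queued barrier request. It waits on the open segment that was
 * active when the request was placed. */
struct barrier_req
{
    struct segment *waits_on;
    barrier_cb cb;
    void *data;
    struct barrier_req *next;
};

struct barrier_queue
{
    struct barrier_req *head;
};

/* Append a request at the end of the queue (FIFO order). */
void queue_push(struct barrier_queue *q,
                struct barrier_req *req,
                struct segment *active,
                barrier_cb cb,
                void *data)
{
    struct barrier_req **link = &q->head;
    req->waits_on = active;
    req->cb = cb;
    req->data = data;
    req->next = NULL;
    while (*link != NULL) {
        link = &(*link)->next;
    }
    *link = req;
}

/* Mark a segment as finalized, then scan the queue: every request that
 * was waiting on this segment is unlinked and its callback is fired. */
void queue_segment_finalized(struct barrier_queue *q, struct segment *s)
{
    struct barrier_req **link = &q->head;
    s->finalized = 1;
    while (*link != NULL) {
        struct barrier_req *req = *link;
        if (req->waits_on == s) {
            *link = req->next; /* unlink before firing */
            req->next = NULL;
            req->cb(req->data);
        } else {
            link = &req->next;
        }
    }
}

static void count_cb(void *data)
{
    *(int *)data += 1;
}

/* Two requests wait on segment 1, one on segment 2. */
int demo_queue(void)
{
    struct segment s1 = {1, 0};
    struct segment s2 = {2, 0};
    struct barrier_queue q = {NULL};
    struct barrier_req a, b, c;
    int fired = 0;

    queue_push(&q, &a, &s1, count_cb, &fired);
    queue_push(&q, &b, &s1, count_cb, &fired);
    queue_push(&q, &c, &s2, count_cb, &fired);

    queue_segment_finalized(&q, &s1);
    assert(fired == 2); /* both requests on segment 1 fired */

    queue_segment_finalized(&q, &s2);
    assert(fired == 3 && q.head == NULL);
    return fired;
}
```

In this sketch a second request placed while the same segment is still open is simply another queue entry, so it cannot be lost. Whether a flat queue like this fits the existing code is a separate question.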
It happened in a Jepsen run, but unfortunately that was a while ago and the test artifacts are no longer available. link
That sounds interesting, will try to explore that.
The comment there seems to provide useful information to possibly come up with a reproducer (perhaps something along the lines of calling How about trying to change the test in this PR to something along those lines (using public APIs only) and see if it ends up reproducing the broken disk data as in #372?
Just to be sure we really understand the issue before putting a solution in place.
54031b7
to
13a23af
Compare
7accfb9
to
de93e52
Compare
Don't continue writing when someone has asked to stop writing new segments. Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Happy jepsen run: https://github.com/canonical/jepsen.dqlite/actions/runs/5455855501 This is almost ready for review. I still have a cleanup function I want to take a look at, because I think it will fail under certain scenarios.
7e8f867
to
2e42a07
Compare
Codecov Report
@@             Coverage Diff              @@
##           master   canonical/raft#435   +/-   ##
==========================================
+ Coverage   76.73%   76.80%   +0.06%
==========================================
  Files          51       51
  Lines        9597     9651     +54
  Branches     2443     2458     +15
==========================================
+ Hits         7364     7412     +48
- Misses       1160     1169      +9
+ Partials     1073     1070      -3
I think this is ready for review. It surely fixes a bug where a barrier fails to attach if all segments are already involved in a barrier. I've used both the internal and the external API to reproduce it. I still need to see if we can get into the state with incompatible segments from #372 and #431, but I've already spent quite some time on this. The idea is that a
jepsen run not entirely happy, couple of core dumps, unfortunately canonical/jepsen.dqlite#105 is getting in the way ... edit: looks happier after a small fix, one occurrence of #386 |
Nice work! I have a few questions, mostly just making sure I understand what's going on since this is a complicated subsystem.
please test downstream |
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
This is for consistency with uvSnapshotPutBarrierCb where this is also done. Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
please test downstream |
/* Fire all barrier cb's, this is safe because the barrier cb exits
 * early when uv->closing is true. */
uvBarrierTriggerAll(barrier);
RaftHeapFree(barrier);
}
/* The segment->barrier field is used:
 *
I'm not sure how to interpret this (pre-existing) comment.
I've just started to look at the diff, before going deep here are few questions to better understand the broader picture.
I guess "I've used external API to reproduce" means that the
Is there anything specific that came out from inspecting the Jepsen logs? Also, what about the other idea of eliminating non-blocking barriers? I presume that it was something "unrelated" that you wanted to do regardless of the bug being fixed in this PR, in order to simplify the current logic and then possibly get a more clear picture of the situation?
This would basically create a "nested" structure? You have a top-level barrier object, and then one or more barrier requests can be associated with the top-level object. We'd ideally have a flat structure of barrier requests, but the nested approach is a way to minimize changes to existing code. Is that accurate?
Yeah, it triggers the bug where the second barrier callback never fires.
The logs are lost, but in a comment on #372 I said that the truncate barrier cb was never fired. I don't have the proof anymore, but I trust the analysis I made back then.
Yeah, that was to simplify the logic a bit, but that got me in other problems and decided it was not worth it.
That's accurate. |
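To illustrate the "nested" shape confirmed above, here is a small sketch, assuming a top-level barrier object that owns a list of requests, each with its own callback. The names (`struct barrier`, `struct barrier_request`, `barrier_add_request`, `barrier_trigger_all`) are made up for this example and only loosely mirror `uvBarrierTriggerAll` from the diff; the real structures in the PR may well differ.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*request_cb)(void *data);

/* One barrier request, hypothetical layout. */
struct barrier_request
{
    request_cb cb;
    void *data;
    struct barrier_request *next;
};

/* Top-level barrier: the object a segment would point to. Several
 * requests can hang off the same barrier. */
struct barrier
{
    struct barrier_request *requests;
};

/* Associate one more request with an existing barrier (FIFO order). */
void barrier_add_request(struct barrier *b,
                         struct barrier_request *r,
                         request_cb cb,
                         void *data)
{
    struct barrier_request **link = &b->requests;
    r->cb = cb;
    r->data = data;
    r->next = NULL;
    while (*link != NULL) {
        link = &(*link)->next;
    }
    *link = r;
}

/* Fire every request callback attached to the barrier, in the order
 * the requests were added. Returns the number of callbacks fired. */
int barrier_trigger_all(struct barrier *b)
{
    int n = 0;
    while (b->requests != NULL) {
        struct barrier_request *r = b->requests;
        b->requests = r->next; /* unlink before firing */
        r->next = NULL;
        r->cb(r->data);
        n++;
    }
    return n;
}

static int order_log[2];
static int order_len = 0;

static void record_cb(void *data)
{
    order_log[order_len++] = *(int *)data;
}

/* A second request arrives while the barrier is still pending: it is
 * added to the same top-level barrier instead of being dropped. */
int demo_nested(void)
{
    struct barrier b = {NULL};
    struct barrier_request first, second;
    int id1 = 1;
    int id2 = 2;

    order_len = 0;
    barrier_add_request(&b, &first, record_cb, &id1);
    barrier_add_request(&b, &second, record_cb, &id2);

    assert(barrier_trigger_all(&b) == 2);
    assert(b.requests == NULL);
    return order_log[0] * 10 + order_log[1];
}
```

The trade-off is the one described in the exchange above: segments keep pointing at a single barrier object, so existing code changes little, at the cost of one extra level of indirection compared to a flat list of requests.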
Ok, so I probably confused the logs you had been looking at. I assume you were recently inspecting Jepsen logs with failures associated with the change to drop non-blocking barriers? If that's true, a write-up of what issues you found might be helpful in case we want to pick up the idea again later (see below). One thing that puzzles me is that, AFAIU, the Jepsen failure in #372 seems not to have manifested again since then. Is that accurate, or did it maybe occur and we simply missed it? In theory such a failure should have some statistical occurrence. Even if it's rare, if you have, say, 100 Jepsen runs, it should manifest at least once (I'm making up numbers, just to illustrate the point). It could also be that the failure is so rare that it did not occur once in the runs we had in months. That sounds a bit weird, but maybe it's the reality.
That'd be fine, although I'm slightly concerned that we might be adding complexity without actually addressing the issue. Lacking better evidence, it's somewhat fine to assume that the comment made in #372 is indeed pointing to precisely the bug being fixed here, but it might also be a red herring. If we merge this change, I believe we'd have a couple of aspects that I would consider something we'd want to eventually improve:
So, from a high-level point of view, if you are convinced that this PR solves both a Jepsen failure and something that happened in the real world, I'd be in favor of it, provided that we take note of 1. and 2. (perhaps opening issues) so we might possibly work on fixing them later down the road. Still, the fact that there is no actual Jepsen failure occurring that we can look at is unfortunate.
    SNAPSHOT_CLEANUP();
    return MUNIT_OK;
}
Still trying to understand exactly the situation: if we managed to drop non-blocking barriers (by having a way to differentiate between open-segment-starting-at-1 and open-segment-past-a-snapshot), then there should be no need to have a barrier when a snapshot is taken right? I'm wondering if that'd mean that this test would pass.
In other words, the "run a non-blocking barrier when a snapshot is taken" change fixed the differentiation problem, but introduced the bug being fixed here. Is that accurate?
Indeed, in the past barriers were only used for truncate operations, which are rarer than snapshots. With multiple truncate requests it's probably possible to hit the bug though (it was always there), but it is a lot easier to hit now.
An issue I encountered when removing the barrier when taking a snapshot was that the truncate operation that runs after the snapshot (to remove a prefix of the log) could remove all closed segments, again leading to confusion about whether the open segments remaining in the data folder were newer or older than the snapshot. In the case of taking a snapshot and all closed segments being removed, the open segments were older; in the case of installing a snapshot, the open segments will be newer, with no meaningful way to discern the situation. I then more or less felt that the barrier before a snapshot made the situation easier to reason about and that it was not that bad in the end (imo :-))
We could have come up with special rules when taking a snapshot, like "Don't remove the last closed segment when truncating a prefix of the log, otherwise we will be confused", but that feels very ad hoc and the meaning of the Trailing parameter loses its value somewhat. Or we could start adding special files in the directory to discern the situations, but that also quickly becomes a mess IMO. The right solution would be to create a new segment format that encodes the start index, but it's a big(ish) change and we break backwards compatibility.
Ok, thanks for the write up, I can better see the whole picture now.
If there's something that simplifies the situation, that'd be good, otherwise ad-hoc rules are just going to flip the problem around.
Just for my understanding: a new segment format would break forward compatibility, right? In the sense that it'd be tricky to support downgrades from a version that supports the new format to a version that doesn't. However, backward compatibility should be fine (upgrading from an old version to a new version). Is that accurate?
Yes that's accurate, got confused by the terminology.
Ok, what about merging this change and at the same time opening a (longer term) issue to investigate the format change or other ideas?
If and when we find a solution that avoids the downsides you described, we could then simplify the code by removing non-blocking barriers and assume that barriers are used only for truncation (that assumption should help too for reducing complexity).
#453
Nice write-up, thanks.
Related to #372 and possibly #431
The test in this PR fails because, when all active open segments are already involved in a barrier, the next barrier fails to be attached to an open segment and its callback never fires. My first (hacky) idea would be to push a new open segment and attach the barrier to that one. Any other inspiration?
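For readers new to the issue, here is a toy model of the failure mode described above, assuming a single open segment with exactly one barrier slot. `struct toy_segment`, `struct toy_barrier`, `toy_attach` and `toy_finalize` are invented for the illustration and are much simpler than the real code; the sketch only shows why a barrier that cannot be attached never has its callback fired.

```c
#include <assert.h>
#include <stddef.h>

/* Toy barrier: "fired" stands in for the barrier callback having run. */
struct toy_barrier
{
    int fired;
};

/* Toy open segment with a single barrier slot (an assumption of this
 * sketch, not a description of the real data structure). */
struct toy_segment
{
    struct toy_barrier *barrier;
};

/* Try to attach a barrier to the open segment. Returns 1 on success,
 * 0 if the segment is already involved in another barrier. */
int toy_attach(struct toy_segment *s, struct toy_barrier *b)
{
    if (s->barrier != NULL) {
        return 0; /* nothing records b, so nothing will ever fire it */
    }
    s->barrier = b;
    return 1;
}

/* Finalizing the segment fires the barrier attached to it, if any. */
void toy_finalize(struct toy_segment *s)
{
    if (s->barrier != NULL) {
        s->barrier->fired = 1;
        s->barrier = NULL;
    }
}

/* Two barriers are requested back to back on the same open segment. */
int demo_lost_barrier(void)
{
    struct toy_segment open = {NULL};
    struct toy_barrier first = {0};
    struct toy_barrier second = {0};

    assert(toy_attach(&open, &first) == 1);
    assert(toy_attach(&open, &second) == 0); /* fails to attach */

    toy_finalize(&open);

    /* The first barrier fired, the second one is lost. */
    return first.fired * 10 + second.fired;
}
```

In this model the second request is silently dropped at attach time, which matches the symptom of the second barrier callback never firing.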