Skip to content
This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Fix callback for second barrier not being attached #435

Merged
merged 6 commits into from
Jul 7, 2023

Conversation

MathieuBordere
Copy link
Contributor

@MathieuBordere MathieuBordere commented Jun 16, 2023

Related to #372 and possibly #431

The test in this PR fails because when all active open segments are already involved in a barrier, the next barrier will fail to be attached to an open segment and the barrier callback will never fire. My first - hacky - idea, would be to push a new open segment and attach the barrier to that one. Any other inspiration?

@cole-miller
Copy link
Contributor

My first - hacky - idea, would be to push a new open segment and attach the barrier to that one.

Would this be a purely in-memory thing with no connection to any disk file?

@freeekanayaka
Copy link
Contributor

Related to #372 and possibly #431

The test in this PR fails because when all active open segments are already involved in a barrier, the next barrier will fail to be attached to an open segment and the barrier callback will never fire. My first - hacky - idea, would be to push a new open segment and attach the barrier to that one. Any other inspiration?

The test is using the UvBarrier() internal function, in real world I assume there would be a call to raft_io->truncate() or raft_io->snapshot_put() instead?

I'd suggest to use only public APIs in these tests, so it's easier to understand what real-world scenarios they are exercising, and there's no risk of the test running logic that would not be run in real-world.

@freeekanayaka
Copy link
Contributor

I'm mainly trying to first understand what real-world situation this test is simulating.

@freeekanayaka
Copy link
Contributor

I didn't go through all the details, but an idea could be to add a queue for all barrier requests that are been submitted after all open segments are already attached to a barrier requests. The callbacks for those queued barrier requests should be fired once the last attached barrier completes.

@freeekanayaka
Copy link
Contributor

I didn't go through all the details, but an idea could be to add a queue for all barrier requests that are been submitted after all open segments are already attached to a barrier requests. The callbacks for those queued barrier requests should be fired once the last attached barrier completes.

Or to make this a bit less ad-hoc and more general, all barrier requests could be put in a queue, each barrier would hold a pointer to the open segment they wait to be finalized (which is always the "active" open segment at the time the barrier request was placed). Whenever an open segment gets finalized, the queue should be scanned to see if there are barrier requests that can now be removed and the associated callback would be fired.

@MathieuBordere
Copy link
Contributor Author

I'm mainly trying to first understand what real-world situation this test is simulating.

It happened in a Jepsen run, but unfortunately, a while ago and the test artifacts are no longer available. link

@MathieuBordere
Copy link
Contributor Author

I didn't go through all the details, but an idea could be to add a queue for all barrier requests that are been submitted after all open segments are already attached to a barrier requests. The callbacks for those queued barrier requests should be fired once the last attached barrier completes.

Or to make this a bit less ad-hoc and more general, all barrier requests could be put in a queue, each barrier would hold a pointer to the open segment they wait to be finalized (which is always the "active" open segment at the time the barrier request was placed). Whenever an open segment gets finalized, the queue should be scanned to see if there are barrier requests that can now be removed and the associated callback would be fired.

That sounds interesting, will try to explore that.

@freeekanayaka
Copy link
Contributor

I'm mainly trying to first understand what real-world situation this test is simulating.

It happened in a Jepsen run, but unfortunately, a while ago and the test artifacts are no longer available. link

The comment there seems to provide useful information to possibly to come up with a reproducer (perhaps something along the lines of calling raft_io->truncate() while raft_io->put_snapshot() is in progress.

How about trying to change the test in this PR to something along those lines (using public APIs only) and see if it ends up reproducing the broken disk data as in #372?

@freeekanayaka
Copy link
Contributor

Just to be sure we really understand the issue before putting in place a solution.

Don't continue writing when someone has asked to stop writing new
segments.

Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
@MathieuBordere
Copy link
Contributor Author

Happy jepsen run: https://github.com/canonical/jepsen.dqlite/actions/runs/5455855501

This is almost ready for review, still have some cleanup function I want to take a look at, because it will fail under certain scenarios I think.

@MathieuBordere MathieuBordere force-pushed the 2barriers branch 3 times, most recently from 7e8f867 to 2e42a07 Compare July 5, 2023 13:03
@codecov
Copy link

codecov bot commented Jul 5, 2023

Codecov Report

Merging #435 (02fb737) into master (ce3e318) will increase coverage by 0.06%.
The diff coverage is 84.41%.

@@            Coverage Diff             @@
##           master     canonical/raft#435      +/-   ##
==========================================
+ Coverage   76.73%   76.80%   +0.06%     
==========================================
  Files          51       51              
  Lines        9597     9651      +54     
  Branches     2443     2458      +15     
==========================================
+ Hits         7364     7412      +48     
- Misses       1160     1169       +9     
+ Partials     1073     1070       -3     
Impacted Files Coverage Δ
src/uv_finalize.c 67.05% <66.66%> (+0.77%) ⬆️
src/uv_snapshot.c 55.57% <66.66%> (ø)
src/uv_append.c 85.80% <85.07%> (+1.04%) ⬆️
src/uv_truncate.c 57.50% <100.00%> (+1.56%) ⬆️

... and 4 files with indirect coverage changes

@MathieuBordere
Copy link
Contributor Author

I think this is ready for review, it surely fixes a bug where a barrier fails to attach if all segments are already involved in a barrier, I've used both internal and external API to reproduce, however still need to see if we can get into the state with incompatible segments from #372 and #431 , but I've already spent quite some time on this.

The idea is that a UvBarrier now has a queue of UvBarrierReq structs, The UvBarrierReq has a callback, when a segment gets finalized, UvBarrierMaybeTrigger is called and calls the first callback in the queue and then relies on UvUnblock to call UvBarrierMaybeTrigger again. When the queu is empty the uv->barrier can be removed. The advantage is that the barrier logic can be kept as is, without fundamentally reworking it, leading to a potential new can of worms.

@MathieuBordere MathieuBordere marked this pull request as ready for review July 5, 2023 13:20
@MathieuBordere
Copy link
Contributor Author

MathieuBordere commented Jul 5, 2023

jepsen run not entirely happy, couple of core dumps, unfortunately canonical/jepsen.dqlite#105 is getting in the way ...

edit: looks happier after a small fix, one occurrence of #386
edit2: clean run https://github.com/canonical/jepsen.dqlite/actions/runs/5455855501

@cole-miller cole-miller changed the title test_uv_append: Failing test 2 barriers. Fix callback for second barrier not being attached Jul 5, 2023
Copy link
Contributor

@cole-miller cole-miller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I have a few questions, mostly just making sure I understand what's going on since this is a complicated subsystem.

src/uv_append.c Show resolved Hide resolved
src/uv_append.c Show resolved Hide resolved
src/uv_truncate.c Show resolved Hide resolved
src/uv_append.c Outdated Show resolved Hide resolved
@cole-miller
Copy link
Contributor

please test downstream

Mathieu Borderé added 5 commits July 6, 2023 15:14
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
This is for consistency with uvSnapshotPutBarrierCb where this is also
done.

Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
@MathieuBordere
Copy link
Contributor Author

please test downstream

/* Fire all barrier cb's, this is safe because the barrier cb exits
* early when uv->closing is true. */
uvBarrierTriggerAll(barrier);
RaftHeapFree(barrier);
}
/* The segment->barrier field is used:
*
Copy link
Contributor

@cole-miller cole-miller Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how to interpret this (pre-existing) comment.

src/uv_append.c Show resolved Hide resolved
src/uv_append.c Show resolved Hide resolved
src/uv_append.c Show resolved Hide resolved
@freeekanayaka
Copy link
Contributor

I've just started to look at the diff, before going deep here are few questions to better understand the broader picture.

I think this is ready for review, it surely fixes a bug where a barrier fails to attach if all segments are already involved in a barrier, I've used both internal and external API to reproduce

I guess "I've used external API to reproduce" means that the snapshot_truncate/snapshotThenTruncate test introduced in this PR is triggering the bug that you are mentioning?

, however still need to see if we can get into the state with incompatible segments from #372 and #431 , but I've already spent quite some time on this.

Is there anything specific that came out from inspecting the Jepsen logs?

Also, what about the other idea of eliminating non-blocking barriers? I presume that it was something "unrelated" that you wanted to do regardless of the bug being fixed in this PR, in order to simplify the current logic and then possibly get a more clear picture of the situation?

The idea is that a UvBarrier now has a queue of UvBarrierReq structs, The UvBarrierReq has a callback, when a segment gets finalized, UvBarrierMaybeTrigger is called and calls the first callback in the queue and then relies on UvUnblock to call UvBarrierMaybeTrigger again. When the queu is empty the uv->barrier can be removed. The advantage is that the barrier logic can be kept as is, without fundamentally reworking it, leading to a potential new can of worms.

This would be basically created a "nested" structure? You have a top level barrier object, and then one or more barrier requests can be associated with the top level object. We'd ideally have a flat structure of barrier requests, but the nested approach is a way to minimize changes to existing code. Is that accurate?

@MathieuBordere
Copy link
Contributor Author

I guess "I've used external API to reproduce" means that the snapshot_truncate/snapshotThenTruncate test introduced in this PR is triggering the bug that you are mentioning?

Yeah, it triggers the bug where the second barrier callback never fires.

Is there anything specific that came out from inspecting the Jepsen logs?

The logs are lost but in a comment #372 I say that the truncate barrier cb was never fired, don't have the proof anymore, but I trust the analysis I made back then.

Also, what about the other idea of eliminating non-blocking barriers? I presume that it was something "unrelated" that you wanted to do regardless of the bug being fixed in this PR, in order to simplify the current logic and then possibly get a more clear picture of the situation?

Yeah, that was to simplify the logic a bit, but that got me in other problems and decided it was not worth it.

This would be basically created a "nested" structure? You have a top level barrier object, and then one or more barrier requests can be associated with the top level object. We'd ideally have a flat structure of barrier requests, but the nested approach is a way to minimize changes to existing code. Is that accurate?

That's accurate.

@freeekanayaka
Copy link
Contributor

freeekanayaka commented Jul 6, 2023

Is there anything specific that came out from inspecting the Jepsen logs?

The logs are lost but in a comment #372 I say that the truncate barrier cb was never fired, don't have the proof anymore, but I trust the analysis I made back then.

Ok, so probably I confused the logs you had been looking at. I assume you were recently inspecting Jepsen logs with failures associated with the change to drop non-blocking barriers? If that's true, a write-up of what issues you found might be helpful in case we want to pick up the idea again later (see below).

One thing that puzzles me is that AFAIU the Jepsen failure in #372 seem to have not manifested again since then. Is that accurate or maybe it did occur and we might simply have missed it?

In theory such a failure should have some statistical occurrence. Even if it's rare, if you have, say, 100 Jepsen runs, it should manifest at least 1 time (I'm making up numbers, just to illustrate the point). It could also be that the failure is so rare that it did not occur once in the runs we had in months, that sounds a bit weird though, but maybe it's the reality.

Also, what about the other idea of eliminating non-blocking barriers? I presume that it was something "unrelated" that you wanted to do regardless of the bug being fixed in this PR, in order to simplify the current logic and then possibly get a more clear picture of the situation?

Yeah, that was to simplify the logic a bit, but that got me in other problems and decided it was not worth it.

This would be basically created a "nested" structure? You have a top level barrier object, and then one or more barrier requests can be associated with the top level object. We'd ideally have a flat structure of barrier requests, but the nested approach is a way to minimize changes to existing code. Is that accurate?

That's accurate.

That'd be fine, although I'm slightly concerned that we might be adding complexity without actually addressing the issue. Lacking better evidence, it's somehow fine to assume that the comment made in #372 is indeed pointing to precisely the bug being fixed here, but it might also be a red herring instead.

If we merge this change, I believe we'd have a couple of aspects that I would consider something we'd want to eventually improve:

  1. non-blocking barriers: as we said, those are needed because we can't differentiate between an open segment containing entry 1, and an open segment containing entries past an installed snapshot, and ideally we'd instead want to introduce some way to differentiate between those two.

  2. nested barrier requests: as we said, we'd ideally want to have a flat structure, which although might require a bit broader change, should result in a bit less complex code.

So, from a high-level point of view, if you are convinced that this PR solves both a Jepsen failure and something that happened in real-world, I'd be in favor of it, provided that we take note of 1. and 2. (perhaps opening issues) so we might possibly work on fixing them later down the road.

Still, the fact that there is no actual Jepsen failure occurring that we can look at is unfortunate.


SNAPSHOT_CLEANUP();
return MUNIT_OK;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still trying to understand exactly the situation: if we managed to drop non-blocking barriers (by having a way to differentiate between open-segment-starting-at-1 and open-segment-past-a-snapshot), then there should be no need to have a barrier when a snapshot is taken right? I'm wondering if that'd mean that this test would pass.

In other words, the "run a non-blocking barrier when a snapshot is taken" change fixed the differentiation problem, but introduced the bug being fixed here. Is that accurate?

Copy link
Contributor Author

@MathieuBordere MathieuBordere Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed in the past Barriers were only used for Truncate operations, that are more rare than snapshots. With multiple truncate requests it's probably possible to hit the bug though (that was always there), but it is a lot easier to hit now.

An issue I encountered when removing the barrier when taking a snapshot, was that the truncate operation that runs after the snapshot (to remove a prefix of the log) could remove all closed segments, again leading to confusion if the open segments remaining in the data folder were newer or older than the snapshot. In the case of taking a snapshot and all closed segments being removed, the open segments were older, in the case of installing a snapshot the open segments will be newer, with no meaningful way to discern the situation. I then more or less felt that the barrier before a snapshot made the situation easier to reason about and that it was not that bad in the end (imo :-))

We could have come up with special rules when taking a snapshot like "Don't remove the last closed segment when truncating a prefix of the log otherwise we will be confused", but that feels very ad hoc and the meaning of the Trailing parameter loses its value somewhat. Or we could start adding special files in the directory to discern the situations but that also quickly becomes a mess IMO. The right solution would be to create a new segment format that encodes the start index, but it's a big(ish) change and we break backwards compatibility.

Copy link
Contributor

@freeekanayaka freeekanayaka Jul 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed in the past Barriers were only used for Truncate operations, that are more rare than snapshots. With multiple truncate requests it's probably possible to hit the bug though (that was always there), but it is a lot easier to hit now.

An issue I encountered when removing the barrier when taking a snapshot, was that the truncate operation that runs after the snapshot (to remove a prefix of the log) could remove all closed segments, again leading to confusion if the open segments remaining in the data folder were newer or older than the snapshot. In the case of taking a snapshot and all closed segments being removed, the open segments were older, in the case of installing a snapshot the open segments will be newer, with no meaningful way to discern the situation. I then more or less felt that the barrier before a snapshot made the situation easier to reason about and that it was not that bad in the end (imo :-))

We could have come up with special rules when taking a snapshot like "Don't remove the last closed segment when truncating a prefix of the log otherwise we will be confused", but that feels very ad hoc and the meaning of the Trailing parameter loses its value somewhat. Or we could start adding special files in the directory to discern the situations but that also quickly becomes a mess IMO. The right solution would be to create a new segment format that encodes the start index, but it's a big(ish) change and we break backwards compatibility.

Ok, thanks for the write up, I can better see the whole picture now.

If there's something that simplifies the situation, that'd be good, otherwise ad-hoc rules are just going to flip the problem around.

Just for my understanding: a new segment format would break forward compatibility, right? In the sense that it'd be trick to support downgrades from a version that supports the new format to a version that doesn't. However backward compatibility should be fine (upgrading from old version to new version). Is that accurate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's accurate, got confused by the terminology.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, what about merging this change and at the same time opening a (longer term) issue to investigate the format change or other ideas?

If and when we find a solution that avoids the downsides you described, we could then simplify the code by removing non-blocking barriers and assume that barriers are used only for truncation (that assumption should help too for reducing complexity).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#453

Nice write-up, thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants