New segment format #444

MathieuBordere · 2023-06-26T13:18:12Z

This is work in progress I started a while ago to encode the start index in the segment header, leading to a new on-disk segment format.

The 24 byte header would look like: the crc is over the format and the first index header fields.

  _________________________________________
 |[0 - 7] |      [8-15] | [16-19] | [20-23]|
 |FORMAT  | FIRST_INDEX |     CRC |  UNUSED|
 |________|_____________|_________|________|

old 8 byte header

  ________
 |[0 - 7] | 
 |FORMAT  |
 |________|

It's quite a big change and would lead to the following consequences:

Breaking change, newer raft versions would have a hard time rolling back to previous versions with the old disk format -> the idea was to include a recovery binary to convert modern segments to legacy segments.
We can do direct calculations with the start index of open segments for example instead of trying to deduce the start index, this leads to simplified logic in a couple of places.
We can remove the UvBarrier when taking a snapshot, that was introduced to close all open-segments when taking a snapshot, because in some cases the start index of the open segments was unclear, when they were all closed following the barrier, this problem didn't pose itself see uv: barrier before every snapshot #264
We can remove the concept of a non-blocking barrier, a barrier that still allows writes to new segments and simplify the code.
get rid of TODO

The goal is not to review this PR, but start discussion if this is a worthwhile change to do. The barrier logic is already hard enough as it is, and the blocking / non-blocking stuff makes it harder, especially when multiple barriers can be interacting with each other. I'm thinking a "blocking" truncate barrier following an already active non-blocking barrier, then I think we should upgrade the non-blocking barrier to blocking because we don't want to append any new writes etc.... but this just becomes a huge mess that could be avoided if we got rid of that distinction altogether.

edit: I could also just remove the non-blocking barrier and only fire a barrier for the very first snapshot like free suggests here #264 (comment) and then live with the fact that writes will be blocked when that snapshot is being taken, but that doesn't sound like a huge problem. I mainly opened this PR and discussion to simplify the barrier logic because I need to touch it to solve #372 and the blocking / non-blocking distinction is complicating matters.

The (open) segments now contain the first index of the encoded log entries and a checksum over the format and first index fields. The 24 byte header looks like: _________________________________________ |[0 - 7] | [8-15] | [16-19] | [20-23]| |FORMAT | FIRST_INDEX | CRC | UNUSED| |________|_____________|_________|________| This allows us to no longer guess the start index and do more direct checks. This also allows removing the barrier when taking a snapshot.

Downgrades all segments in a directory from version 2 to version 1. Use in case one has to downgrade dqlite versions.

freeekanayaka · 2023-06-26T14:30:45Z

I would be in favor of removing the non-blocking variant of the barrier. I didn't think about it through, so not sure if there are other alternatives. If the only options are 1) the new format 2) fire the barrier for the very first sector, then 1) sounds a bit better, although I'm a tad surprised that it requires a large diff (I didn't go through it yet). Option 2) might be more palatable if we want to avoid the need of an additional recovery program, or if it boils down to a simpler change.

I'll try to think of alternatives, but I don't have much time these days. I'm not totally sure a recovery program is needed, but I can see why you'd want it. If we could go without a recovery program and if the change can be simplified, perhaps option 1) is better.

Just first impression / thinking loud :)

MathieuBordere · 2023-06-26T14:45:53Z

I would be in favor of removing the non-blocking variant of the barrier. I didn't think about it through, so not sure if there are other alternatives. If the only options are 1) the new format 2) fire the barrier for the very first sector, then 1) sounds a bit better, although I'm a tad surprised that it requires a large diff (I didn't go through it yet). Option 2) might be more palatable if we want to avoid the need of an additional recovery program, or if it boils down to a simpler change.

I'll try to think of alternatives, but I don't have much time these days. I'm not totally sure a recovery program is needed, but I can see why you'd want it. If we could go without a recovery program and if the change can be simplified, perhaps option 1) is better.

Just first impression / thinking loud :)

yeah, the diff is large because I tried to simplify a bunch of logic and drop some branches, but you don't really have to review this, it's more for the discussion, if I would open a final PR I would split this in smaller commits.

freeekanayaka · 2023-06-26T22:35:49Z

Perhaps there is a third option. Unless I'm missing something, the issue we have is that we can't disambiguate between these two situations:

There is an open segment along with a snapshot (no closed segments). The open segment is the very first open segment ever written (i.e. its first entry is entry 1) and the snapshot was taken before that very first open segment was finalized.
There is an open segment along with a snapshot (no closed segments). The snapshot was received via InstallSnapshot, triggering the truncation of all entries, the open segment contains entries appended after the snapshot installation and log truncation completed (i.e. the first entry in the open segment has index last_snapshot_index + 1).

Situation 1. was the only one the code was handling up to this commit, which made the code correctly handle situation 2. but regressed the handling of situation 1., which was then fixed again by PR #264, where the ambiguity was identified (but not the regression that was introduced by the older commit). Indeed when merging #264 we realized that strictly speaking that fix was only needed for the very first snapshot, which is basically situation 1. (which as said we didn't realize that we had regressed in the first place).

What about if we simply name the very first open segment (the one starting at entry 1) with a special filename (e.g. open-0, since 0 is not currently used as a counter) ? That should allow to disambiguate between situation 1 and 2.

After loading entries in uvLoad() we'd need to set uv->prepare_next_counter to 0 if there are no entries and no snapshots, and to 1 otherwise. In uvLoad() itself, we can assume that the first entry of an open segment named open-0 is entry 1 (regardless of whether a snapshot is present), and we can assume that the first entry of an open-N segment with N > 0 and no closed segments is last_snapshot_index + 1, if there is a snapshot.

MathieuBordere · 2023-06-27T08:00:06Z

Perhaps there is a third option. Unless I'm missing something, the issue we have is that we can't disambiguate between these two situations:
1. There is an open segment along with a snapshot (no closed segments). The open segment is the very first open segment ever written (i.e. its first entry is entry 1) and the snapshot was taken before that very first open segment was finalized.

2. There is an open segment along with a snapshot (no closed segments). The snapshot was received via InstallSnapshot, triggering the truncation of all entries, the open segment contains entries appended after the snapshot installation and log truncation completed (i.e. the first entry in the open segment has index last_snapshot_index + 1).
Situation 1. was the only one the code was handling up to this commit, which made the code correctly handle situation 2. but regressed the handling of situation 1., which was then fixed again by PR #264, where the ambiguity was identified (but not the regression that was introduced by the older commit). Indeed when merging #264 we realized that strictly speaking that fix was only needed for the very first snapshot, which is basically situation 1. (which as said we didn't realize that we had regressed in the first place).

What about if we simply name the very first open segment (the one starting at entry 1) with a special filename (e.g. open-0, since 0 is not currently used as a counter) ? That should allow to disambiguate between situation 1 and 2.

After loading entries in uvLoad() we'd need to set uv->prepare_next_counter to 0 if there are no entries and no snapshots, and to 1 otherwise. In uvLoad() itself, we can assume that the first entry of an open segment named open-0 is entry 1 (regardless of whether a snapshot is present), and we can assume that the first entry of an open-N segment with N > 0 and no closed segments is last_snapshot_index + 1, if there is a snapshot.

I think that could work.

freeekanayaka · 2023-06-27T08:22:06Z

I think that could work.

Yeah, it should indeed, that's logically equivalent to adding the entry index to the header of the segment, but we'd use the filename instead of the header, and we'd do that only for the very first open segment, which is the one we need to disambiguate.

Note also that in case an open segment is found with a name different than open-0, but no closed segments and no snapshots are found, we should return RAFT_CORRUPT, because we don't have any legal situation that could lead to that scenario (this is a case where nothing can be done even if we encode the first index in the header). I'm not totally sure if we currently handle this case, and if we have a test for it.

MathieuBordere · 2023-06-27T08:42:18Z

I think that could work.

Yeah, it should indeed, that's logically equivalent to adding the entry index to the header of the segment, but we'd use the filename instead of the header, and we'd do that only for the very first open segment, which is the one we need to disambiguate.

Note also that in case an open segment is found with a name different than open-0, but no closed segments and no snapshots are found, we should return RAFT_CORRUPT, because we don't have any legal situation that could lead to that scenario (this is a case where nothing can be done even if we encode the first index in the header). I'm not totally sure if we currently handle this case, and if we have a test for it.

That will break existing servers that upgrade I think. If they get powered down before open-1 is closed and restart with upgraded version, it will break.

MathieuBordere · 2023-06-27T09:13:47Z

Another problem is with new joining servers I think, they have pristine state so they will start with uv->prepare_next_counter set to 0 , they receive a snapshot install because the cluster they're joining has been running for a while, and now we mustn't forget to increase prepare_next_counter. I don't know, it feels a bit brittle and I'm not too comfortable with it, I think I'd prefer to do a barrier on the first snapshot.

freeekanayaka · 2023-06-27T09:52:48Z

I think that could work.

Yeah, it should indeed, that's logically equivalent to adding the entry index to the header of the segment, but we'd use the filename instead of the header, and we'd do that only for the very first open segment, which is the one we need to disambiguate.
Note also that in case an open segment is found with a name different than open-0, but no closed segments and no snapshots are found, we should return RAFT_CORRUPT, because we don't have any legal situation that could lead to that scenario (this is a case where nothing can be done even if we encode the first index in the header). I'm not totally sure if we currently handle this case, and if we have a test for it.

That will break existing servers that upgrade I think. If they get powered down before open-1 is closed and restart with upgraded version, it will break.

Good point.

One way to handle that could be to bump the version number contained in the segment header. That would allow to disambiguate between the two situations.

However, in practice I think right now we don't need to put in place neither the check nor the version bump. Introducing the check could be simply deferred further in the future when we'll be reasonably confident that older versions have been upgraded.

freeekanayaka · 2023-06-27T09:56:25Z

We could also introduce a new standalone file (say format) that contains a format version for the whole raft data directory. A missing file would signal format version 0 (which is the current one), and we' write version 1 for the version that supports the open-0 segment.

But all things considered, I believe it's not practically needed right now, and we can introduce it at any time.

freeekanayaka · 2023-06-27T10:30:45Z

Another problem is with new joining servers I think, they have pristine state so they will start with uv->prepare_next_counter set to 0 , they receive a snapshot install because the cluster they're joining has been running for a while, and now we mustn't forget to increase prepare_next_counter.

Good point too.

If this case exhausts all possible cases, it doesn't sound too bad. But perhaps there are better alternatives.

I don't know, it feels a bit brittle and I'm not too comfortable with it, I think I'd prefer to do a barrier on the first snapshot.

Doing a barrier on the first snapshot is fine, we can refine an alternate solution later on.

MathieuBordere · 2023-06-27T11:13:15Z

If this case exhausts all possible cases, it doesn't sound too bad. But perhaps there are better alternatives.

Yeah, it's probably not too bad, I'm going to give it a go ... :-)

Mathieu Borderé added 3 commits June 26, 2023 15:03

segment: Add recovery binary for downgrading segments.

2cf2dcc

Downgrades all segments in a directory from version 2 to version 1. Use in case one has to downgrade dqlite versions.

wiprecoverytests

02ac13a

MathieuBordere requested review from cole-miller and freeekanayaka June 26, 2023 13:18

This was referenced Jun 27, 2023

Remove blocking barrier #445

Closed

Remove blocking barrier distinctions #446

Closed

MathieuBordere closed this Jun 30, 2023

MathieuBordere mentioned this pull request Jul 7, 2023

tackle open segment start index ambiguity canonical/dqlite#578

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New segment format #444

New segment format #444

MathieuBordere commented Jun 26, 2023 •

edited

Loading

freeekanayaka commented Jun 26, 2023

MathieuBordere commented Jun 26, 2023

freeekanayaka commented Jun 26, 2023

MathieuBordere commented Jun 27, 2023

freeekanayaka commented Jun 27, 2023

MathieuBordere commented Jun 27, 2023

MathieuBordere commented Jun 27, 2023

freeekanayaka commented Jun 27, 2023

freeekanayaka commented Jun 27, 2023 •

edited

Loading

freeekanayaka commented Jun 27, 2023

MathieuBordere commented Jun 27, 2023

New segment format #444

New segment format #444

Conversation

MathieuBordere commented Jun 26, 2023 • edited Loading

freeekanayaka commented Jun 26, 2023

MathieuBordere commented Jun 26, 2023

freeekanayaka commented Jun 26, 2023

MathieuBordere commented Jun 27, 2023

freeekanayaka commented Jun 27, 2023

MathieuBordere commented Jun 27, 2023

MathieuBordere commented Jun 27, 2023

freeekanayaka commented Jun 27, 2023

freeekanayaka commented Jun 27, 2023 • edited Loading

freeekanayaka commented Jun 27, 2023

MathieuBordere commented Jun 27, 2023

MathieuBordere commented Jun 26, 2023 •

edited

Loading

freeekanayaka commented Jun 27, 2023 •

edited

Loading