uv: barrier before every snapshot #264

MathieuBordere · 2022-02-11T15:06:58Z

This solves a bug that came up during jepsen tests, it occurs when:

a server is killed ungracefully (leading to a written open segment in
the directory when the system restarts)
a snapshot file exists
no closed segments exist

Because open segments are not closed before writing a snapshot, raft,
when starting up and reading the files in the data directory,
erroneously assumes that all the entries in the open segment are newer
than the entries in the snapshot, while in reality the entries are
already contained in the snapshot, leading to a wrong state.

Closing the current open segments before writing the snapshot ensures
that no old entries can mistakenly be identified as new entries, all
entries in the open segments will be newer than the snapshot.

To close the open segments before making a snapshot, we perform a barrier
request, but we need to make a distinction between a 'blocking' and a 'non-blocking'
barrier in order to save performance.
Both barriers close all current open segments before firing the barrier
callback, but a 'non-blocking' barrier allows writes to go through to newly
created open-segments. This non-blocking barrier is used when raft is creating
a snapshot during regular operation, the blocking barrier is used when installing
snapshots and truncating the log.

MathieuBordere · 2022-02-11T15:10:22Z

edit: there's a flaky test, I'll convert to draft for now and reopen.

src/uv_snapshot.c

freeekanayaka · 2022-02-11T15:27:56Z

Nice catch, the approach looks good to me.

Strictly speaking perhaps requiring to close the open segment is probably necessary on for the very first snapshot that is taken (where typically there are no closed segments yet, leading to the bug you discovered), while in case there are already closed segments the bug shouldn't occur. However closing the open segment doesn't seem to cost much (compared to the overall cost of taking a snapshot), so I guess it's fine.

MathieuBordere · 2022-02-11T15:35:45Z

Nice catch, the approach looks good to me.

Strictly speaking perhaps requiring to close the open segment is probably necessary on for the very first snapshot that is taken (where typically there are no closed segments yet, leading to the bug you discovered), while in case there are already closed segments the bug shouldn't occur. However closing the open segment doesn't seem to cost much (compared to the overall cost of taking a snapshot), so I guess it's fine.

Thanks, an alternative was to add the start index to the open segments, but that would have been trickier I think (dealing with multiple formats, and the segment logic is already quite complex imo).

freeekanayaka · 2022-02-11T15:53:11Z

Nice catch, the approach looks good to me.
Strictly speaking perhaps requiring to close the open segment is probably necessary on for the very first snapshot that is taken (where typically there are no closed segments yet, leading to the bug you discovered), while in case there are already closed segments the bug shouldn't occur. However closing the open segment doesn't seem to cost much (compared to the overall cost of taking a snapshot), so I guess it's fine.

Thanks, an alternative was to add the start index to the open segments, but that would have been trickier I think (dealing with multiple formats, and the segment logic is already quite complex imo).

The reason open segments don't have the start index is that you don't know that upfront (when creating the open segment file on disk) and renaming the open segment file each time you start using it slows down the throughput.

It probably goes without saying, but FWIW I do agree that the segment logic is complex, and that seems the price for its efficiency: not sure you know it, but just for reference that logic was taken virtually unchanged from LogCabin, which is the very first Raft implementation, coded by the Raft author himself. IIRC etcd has a similar approach to a certain extent.

MathieuBordere · 2022-02-11T16:32:08Z

~~I have a bit of a lack of inspiration to show in a test that appends are not blocked by a non-blocking barrier, will return to this tonight or early next week.~~
good to go

This solves a bug that came up during jepsen tests, it occurs when: - a server is killed ungracefully (leading to a written open segment in the directory when the system restarts) - a snapshot file exists - no closed segments exist Because open segments are not closed before writing a snapshot, raft, when starting up and reading the files in the data directory, erroneously assumes that all the entries in the open segment are newer than the entries in the snapshot, while in reality the entries are already contained in the snapshot, leading to a wrong state. Closing the current open segments before writing the snapshot ensures that no old entries can mistakenly be identified as new entries, all entries in the open segments will be newer than the snapshot. To close the open segments before making a snapshot, we perform a `barrier` request, but we need to make a distinction between a 'blocking' and a 'non-blocking' barrier in order to save performance. Both barriers close all current open segments before firing the barrier callback, but a 'non-blocking' barrier allows writes to go through to newly created open-segments. This non-blocking barrier is used when raft is creating a snapshot during regular operation, the blocking barrier is used when installing snapshots and truncating the log.

MathieuBordere requested review from freeekanayaka and stgraber February 11, 2022 15:07

MathieuBordere marked this pull request as draft February 11, 2022 15:10

freeekanayaka reviewed Feb 11, 2022

View reviewed changes

src/uv_snapshot.c Outdated Show resolved Hide resolved

MathieuBordere force-pushed the barrier-before-snapshot branch from 59d6cb0 to 424de42 Compare February 11, 2022 16:25

MathieuBordere force-pushed the barrier-before-snapshot branch from 424de42 to a3de078 Compare February 11, 2022 18:33

MathieuBordere marked this pull request as ready for review February 11, 2022 18:38

stgraber merged commit ca8b466 into canonical:master Feb 11, 2022

MathieuBordere deleted the barrier-before-snapshot branch December 9, 2022 10:15

MathieuBordere mentioned this pull request Jun 26, 2023

New segment format #444

Closed

MathieuBordere mentioned this pull request Jul 7, 2023

tackle open segment start index ambiguity canonical/dqlite#578

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

uv: barrier before every snapshot #264

uv: barrier before every snapshot #264

MathieuBordere commented Feb 11, 2022

MathieuBordere commented Feb 11, 2022 •

edited

Loading

freeekanayaka commented Feb 11, 2022

MathieuBordere commented Feb 11, 2022

freeekanayaka commented Feb 11, 2022

MathieuBordere commented Feb 11, 2022 •

edited

Loading

uv: barrier before every snapshot #264

uv: barrier before every snapshot #264

Conversation

MathieuBordere commented Feb 11, 2022

MathieuBordere commented Feb 11, 2022 • edited Loading

freeekanayaka commented Feb 11, 2022

MathieuBordere commented Feb 11, 2022

freeekanayaka commented Feb 11, 2022

MathieuBordere commented Feb 11, 2022 • edited Loading

MathieuBordere commented Feb 11, 2022 •

edited

Loading

MathieuBordere commented Feb 11, 2022 •

edited

Loading