This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Flaky io log fix #255

Closed

Conversation

MathieuBordere
Contributor

@MathieuBordere MathieuBordere commented Nov 25, 2021

Bug observed in LXD cluster:

  1. Follower's appendFollowerCb is called with a non-0 status after some I/O failure.
  2. Follower doesn't truncate its in-memory log.
  3. Follower answers the leader with a rejection.
  4. Leader sends 3 entries: the one from the failed write and 2 new ones.
  5. Follower only writes the 2 new entries to disk after comparing with the in-memory log; the old entry is still in there.
  6. Follower's appendFollowerCb is called with a 0 status; I/O completed for the 2 new entries.
  7. Follower increases r->last_stored by 2, while it has a gap in its on-disk log entries.
  8. Follower doesn't reply with rejections to the leader's heartbeats because, based on the in-memory log, it's not missing entries.
  9. Follower never catches up.

Now, when disk I/O fails, we truncate the in-memory log, and subsequent writes that see a truncated in-memory log will try to truncate the on-disk log.
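
A minimal sketch of that flow (the names and fields below are illustrative only and don't map one-to-one onto the real raft internals):

```c
/* Simplified sketch of the behaviour described above; field and function
 * names are illustrative and do not match the real raft code exactly. */
#include <stdio.h>

struct follower_log
{
    unsigned long last_stored;   /* highest index known to be on disk */
    unsigned long truncate_from; /* non-zero: the on-disk log must be
                                    truncated from this index before the
                                    next entries are written */
};

/* Append-completion callback: on failure, roll the in-memory view back and
 * remember that the on-disk log has to be truncated as well. */
static void append_cb(struct follower_log *l, unsigned long first_index,
                      unsigned long n, int status)
{
    if (status != 0) {
        l->truncate_from = first_index; /* in-memory log truncated here */
        return;                         /* the leader gets a rejection */
    }
    l->last_stored += n;
}

/* Before persisting newly received entries: if a previous write failed,
 * truncate the on-disk log first so no gap is left behind. */
static void prepare_write(struct follower_log *l)
{
    if (l->truncate_from != 0) {
        /* stand-in for the io->truncate request of the real raft_io API */
        printf("truncate on-disk log from index %lu\n", l->truncate_from);
        l->truncate_from = 0;
    }
}

int main(void)
{
    struct follower_log l = {10, 0};
    append_cb(&l, 11, 1, -1); /* step 1: write of entry 11 fails */
    prepare_write(&l);        /* fix: on-disk log truncated from index 11 */
    append_cb(&l, 11, 3, 0);  /* leader resends 3 entries, all persisted */
    printf("last_stored=%lu\n", l.last_stored); /* 13, no gap on disk */
    return 0;
}
```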

Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
@MathieuBordere MathieuBordere marked this pull request as draft November 25, 2021 18:30
@codecov-commenter

codecov-commenter commented Nov 25, 2021

Codecov Report

Merging #255 (1108413) into master (1d60c27) will increase coverage by 0.16%.
The diff coverage is 97.87%.


@@            Coverage Diff             @@
##           master     #255      +/-   ##
==========================================
+ Coverage   87.71%   87.88%   +0.16%     
==========================================
  Files         107      107              
  Lines       15325    15373      +48     
  Branches     2372     2381       +9     
==========================================
+ Hits        13443    13511      +68     
+ Misses       1701     1681      -20     
  Partials      181      181              
Impacted Files                          Coverage Δ
src/replication.c                       81.57% <91.66%> (+0.52%) ⬆️
src/fixture.c                           95.92% <100.00%> (+1.67%) ⬆️
test/integration/test_replication.c     100.00% <100.00%> (ø)
test/fuzzy/test_liveness.c              98.41% <0.00%> (-1.59%) ⬇️
src/uv_send.c                           94.88% <0.00%> (+1.57%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1d60c27...1108413.

@MathieuBordere MathieuBordere force-pushed the flaky-io-log-fix branch 3 times, most recently from 1108413 to 105983b Compare November 26, 2021 15:36
@MathieuBordere
Contributor Author

@freeekanayaka I'm not entirely sure how to proceed; there's a user hitting the problem described in the PR description. I think the general approach of truncating the in-memory and on-disk logs is OK for the test raft_io implementation.
However, in our raft_io uv implementation, io->truncate is only performed on closed segments. In this case the closed segment will most likely be corrupt, because it's created from an open segment that wasn't written properly (indicated by the non-0 status in appendFollowerCb), so the truncate operation will not be executed.

Should I adapt the implementation to try to truncate the corrupt segment anyway up to raft->last_stored, assuming the write up to raft->last_stored occurred without errors? (A rough sketch of what I mean follows below.)
Should we just shut down upon seeing disk I/O errors and let uvLoad take care of removing corrupted open segments on startup?
Other ideas?
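
To make the first option concrete, a hypothetical sketch (not the real uv segment code) of keeping only the entries up to raft->last_stored when repairing a corrupt segment:

```c
/* Hypothetical sketch of the first option: given the entries recovered from
 * a possibly corrupt segment, keep only those up to last_stored, assuming
 * the write up to last_stored succeeded. Names are made up for the sketch. */
#include <stdio.h>

struct entry
{
    unsigned long index;
};

static size_t keepUpToLastStored(const struct entry *entries, size_t n,
                                 unsigned long last_stored)
{
    size_t kept = 0;
    while (kept < n && entries[kept].index <= last_stored) {
        kept++;
    }
    return kept; /* leading entries to rewrite into a repaired segment */
}

int main(void)
{
    /* Segment holds entries 9..12, but only up to index 10 is trusted. */
    struct entry entries[] = {{9}, {10}, {11}, {12}};
    size_t kept = keepUpToLastStored(entries, 4, 10);
    printf("keep %zu of 4 entries\n", kept); /* keeps entries 9 and 10 */
    return 0;
}
```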

@freeekanayaka
Contributor

I didn't look too closely at the situation, but yeah, I recall that generic handling of I/O errors that are not "disk full" might be tricky. Since we expect non-"disk full" errors to be so rare (and probably severe), I think that the approach of "stopping the line" by shutting everything down immediately, and possibly letting uvLoad see if there's something that can be done (e.g. the error was transient), is a good first approach.

Later down the road, more sophisticated strategies might be applied, but I'd wait to see the details of the I/O errors happening in the real world.

PS: I'm not sure how well "disk full" errors are handled these days, but that's probably still an area for improvement?
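
Roughly the kind of decision I'd expect at the append-completion point; a sketch with made-up names, not the actual raft/uv symbols:

```c
/* Rough illustration of the "stop the line" idea; the names here are made
 * up for the sketch and are not the actual raft/uv symbols. */
#include <stdbool.h>
#include <stdio.h>

enum io_result { IO_OK, IO_DISK_FULL, IO_OTHER_ERROR };

/* Decide what to do when an asynchronous append write completes. */
static void appendDone(enum io_result result, bool *shutdown_requested)
{
    switch (result) {
        case IO_OK:
            break;
        case IO_DISK_FULL:
            /* Potentially recoverable: space may be freed later, so this
             * case deserves dedicated handling (still an open area). */
            break;
        case IO_OTHER_ERROR:
            /* Rare and probably severe: stop the line. The node shuts down
             * and relies on the startup load step, which already discards
             * trailing garbage in open segments. */
            *shutdown_requested = true;
            break;
    }
}

int main(void)
{
    bool shutdown_requested = false;
    appendDone(IO_OTHER_ERROR, &shutdown_requested);
    printf("shutdown requested: %d\n", shutdown_requested);
    return 0;
}
```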

@MathieuBordere
Contributor Author

> I didn't look too closely at the situation, but yeah, I recall that generic handling of I/O errors that are not "disk full" might be tricky. Since we expect non-"disk full" errors to be so rare (and probably severe), I think that the approach of "stopping the line" by shutting everything down immediately, and possibly letting uvLoad see if there's something that can be done (e.g. the error was transient), is a good first approach.
>
> Later down the road, more sophisticated strategies might be applied, but I'd wait to see the details of the I/O errors happening in the real world.
>
> PS: I'm not sure how well "disk full" errors are handled these days, but that's probably still an area for improvement?

Yeah, still need to take a closer look at handling disk-full situations.

@MathieuBordere
Contributor Author

To revisit.
