This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Flaky io log fix #255

Closed

Conversation

MathieuBordere
Contributor

@MathieuBordere MathieuBordere commented Nov 25, 2021

Bug observed in LXD cluster:

  1. Follower's appendFollowerCb is called with a non-0 status after some I/O failure.
  2. Follower doesn't truncate its in-memory log.
  3. Follower answers the leader with a rejection.
  4. Leader sends 3 entries: the one from the failed write and 2 new ones.
  5. Follower only writes the 2 new entries to disk after comparing with the in-memory log; the old entry is still in there.
  6. Follower's appendFollowerCb is called with a 0 status; I/O completed for the 2 new entries.
  7. Follower increases r->last_stored by 2, while it has a gap in its on-disk log entries.
  8. Follower doesn't reply with rejections to the leader's heartbeats because, based on the in-memory log, it's not missing entries.
  9. Follower never catches up.

Now, when disk I/O fails, we truncate the in-memory log, and subsequent writes that see a truncated in-memory log will try to truncate the on-disk log.
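
A minimal sketch of that flow (the names and fields below are illustrative only and don't map one-to-one onto the real raft internals):

```c
/* Simplified sketch of the behaviour described above; field and function
 * names are illustrative and do not match the real raft code exactly. */
#include <stdio.h>

struct follower_log
{
    unsigned long last_stored;   /* highest index known to be on disk */
    unsigned long truncate_from; /* non-zero: the on-disk log must be
                                    truncated from this index before the
                                    next entries are written */
};

/* Append-completion callback: on failure, roll the in-memory view back and
 * remember that the on-disk log has to be truncated as well. */
static void append_cb(struct follower_log *l, unsigned long first_index,
                      unsigned long n, int status)
{
    if (status != 0) {
        l->truncate_from = first_index; /* in-memory log truncated here */
        return;                         /* the leader gets a rejection */
    }
    l->last_stored += n;
}

/* Before persisting newly received entries: if a previous write failed,
 * truncate the on-disk log first so no gap is left behind. */
static void prepare_write(struct follower_log *l)
{
    if (l->truncate_from != 0) {
        /* stand-in for the io->truncate request of the real raft_io API */
        printf("truncate on-disk log from index %lu\n", l->truncate_from);
        l->truncate_from = 0;
    }
}

int main(void)
{
    struct follower_log l = {10, 0};
    append_cb(&l, 11, 1, -1); /* step 1: write of entry 11 fails */
    prepare_write(&l);        /* fix: on-disk log truncated from index 11 */
    append_cb(&l, 11, 3, 0);  /* leader resends 3 entries, all persisted */
    printf("last_stored=%lu\n", l.last_stored); /* 13, no gap on disk */
    return 0;
}
```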

Signed-off-by: Mathieu Borderé <mathieu.bordere@canonical.com>
@MathieuBordere MathieuBordere marked this pull request as draft November 25, 2021 18:30
@codecov-commenter

codecov-commenter commented Nov 25, 2021

Codecov Report

Merging #255 (1108413) into master (1d60c27) will increase coverage by 0.16%.
The diff coverage is 97.87%.


@@            Coverage Diff             @@
##           master     #255      +/-   ##
==========================================
+ Coverage   87.71%   87.88%   +0.16%     
==========================================
  Files         107      107              
  Lines       15325    15373      +48     
  Branches     2372     2381       +9     
==========================================
+ Hits        13443    13511      +68     
+ Misses       1701     1681      -20     
  Partials      181      181              
Impacted Files                          Coverage Δ
src/replication.c                       81.57% <91.66%> (+0.52%) ⬆️
src/fixture.c                           95.92% <100.00%> (+1.67%) ⬆️
test/integration/test_replication.c     100.00% <100.00%> (ø)
test/fuzzy/test_liveness.c              98.41% <0.00%> (-1.59%) ⬇️
src/uv_send.c                           94.88% <0.00%> (+1.57%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1d60c27...1108413.

@MathieuBordere MathieuBordere force-pushed the flaky-io-log-fix branch 3 times, most recently from 1108413 to 105983b Compare November 26, 2021 15:36
@MathieuBordere
Contributor Author

@freeekanayaka I'm not entirely sure how to proceed; there's a user hitting the problem described in the PR description. I think the general approach of truncating the in-memory and on-disk logs is OK for the test raft_io implementation.
However, in our raft_io uv implementation, io->truncate is only performed on closed segments. In this case the closed segment will most likely be corrupt, because it's created from an open segment that wasn't written properly (indicated by the non-0 status in appendFollowerCb), so the truncate operation will not be executed.

Should I adapt the implementation to try to truncate the corrupt segment anyway up to raft->last_stored, assuming the write up to raft->last_stored occurred without errors? (A rough sketch of what I mean follows below.)
Should we just shut down upon seeing disk I/O errors and let uvLoad take care of removing corrupted open segments on startup?
Other ideas?
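
To make the first option concrete, a hypothetical sketch (not the real uv segment code) of keeping only the entries up to raft->last_stored when repairing a corrupt segment:

```c
/* Hypothetical sketch of the first option: given the entries recovered from
 * a possibly corrupt segment, keep only those up to last_stored, assuming
 * the write up to last_stored succeeded. Names are made up for the sketch. */
#include <stdio.h>

struct entry
{
    unsigned long index;
};

static size_t keepUpToLastStored(const struct entry *entries, size_t n,
                                 unsigned long last_stored)
{
    size_t kept = 0;
    while (kept < n && entries[kept].index <= last_stored) {
        kept++;
    }
    return kept; /* leading entries to rewrite into a repaired segment */
}

int main(void)
{
    /* Segment holds entries 9..12, but only up to index 10 is trusted. */
    struct entry entries[] = {{9}, {10}, {11}, {12}};
    size_t kept = keepUpToLastStored(entries, 4, 10);
    printf("keep %zu of 4 entries\n", kept); /* keeps entries 9 and 10 */
    return 0;
}
```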

@freeekanayaka
Contributor

I didn't look too closely at the situation, but yeah, I recall that generic handling of I/O errors that are not "disk full" might be tricky. Since we expect non-"disk full" errors to be so rare (and probably severe), I think that the approach of "stopping the line" by shutting everything down immediately, and possibly letting uvLoad see if there's something that can be done (e.g. the error was transient), is a good first approach.

Later down the road, more sophisticated strategies might be applied, but I'd wait to see the details of the I/O errors happening in the real world.

PS: I'm not sure how well "disk full" errors are handled these days, but that's probably still an area for improvement?
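
Roughly the kind of decision I'd expect at the append-completion point; a sketch with made-up names, not the actual raft/uv symbols:

```c
/* Rough illustration of the "stop the line" idea; the names here are made
 * up for the sketch and are not the actual raft/uv symbols. */
#include <stdbool.h>
#include <stdio.h>

enum io_result { IO_OK, IO_DISK_FULL, IO_OTHER_ERROR };

/* Decide what to do when an asynchronous append write completes. */
static void appendDone(enum io_result result, bool *shutdown_requested)
{
    switch (result) {
        case IO_OK:
            break;
        case IO_DISK_FULL:
            /* Potentially recoverable: space may be freed later, so this
             * case deserves dedicated handling (still an open area). */
            break;
        case IO_OTHER_ERROR:
            /* Rare and probably severe: stop the line. The node shuts down
             * and relies on the startup load step, which already discards
             * trailing garbage in open segments. */
            *shutdown_requested = true;
            break;
    }
}

int main(void)
{
    bool shutdown_requested = false;
    appendDone(IO_OTHER_ERROR, &shutdown_requested);
    printf("shutdown requested: %d\n", shutdown_requested);
    return 0;
}
```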

@MathieuBordere
Contributor Author

> I didn't look too closely at the situation, but yeah, I recall that generic handling of I/O errors that are not "disk full" might be tricky. Since we expect non-"disk full" errors to be so rare (and probably severe), I think that the approach of "stopping the line" by shutting everything down immediately, and possibly letting uvLoad see if there's something that can be done (e.g. the error was transient), is a good first approach.
>
> Later down the road, more sophisticated strategies might be applied, but I'd wait to see the details of the I/O errors happening in the real world.
>
> PS: I'm not sure how well "disk full" errors are handled these days, but that's probably still an area for improvement?

Yeah, still need to take a closer look at handling disk-full situations.

@MathieuBordere
Contributor Author

To revisit.
