forked from vectordotdev/vector
Commit
fix(buffers): deadlock when seeking after entire write fails to be flushed (vectordotdev#17657)

## Context

In vectordotdev#17644, a user reported disk buffers getting stuck in an infinite loop, and thus deadlocking, when restarting after a crash. They provided some very useful debug information, going so far as to evaluate the code, add some logging, and capture values of the reader's internal state.

When a disk buffer is initialized -- either for the first time or after Vector is restarted and the buffer must resume where it left off -- both the reader and writer perform a catch-up phase. The writer checks the current data file and tries to figure out if the last record written matches where it believes it left off. The reader actually has to dynamically seek to where it left off within the given data file, since we can't just open the file and start from the beginning: data files are append-only.

As part of the seek logic, there's a loop where we just call `Reader::next` until we read the record we supposedly left off on, and then we know we're caught up. This loop is governed by only two conditions:

- `self.last_reader_record_id < ledger_last` (the loop's own condition), which implies we haven't yet read the last record we left off on (otherwise it would be equal to `ledger_last`)
- `maybe_record.is_none() && self.last_reader_record_id == 0` (the inner break check), which would tell us that we reached EOF on the data file (no more records) but nothing was in the file (`last_reader_record_id` still being 0)

While the first conditional is correct, the second one is not. The user that originally reported the issue [said as much](vectordotdev#17644 (comment)): dropping the `&& self.last_reader_record_id == 0` fixes the issue.

There can exist a scenario where Vector crashes and writes that the reader had read and acknowledged never actually make it to disk. Both the reader and the writer are able to outpace the data on disk, because the reader can read yet-to-be-flushed records while they exist as dirty pages in the page cache.

When this happens, the reader may have indicated to the ledger that it, for example, has read up to record ID 10 while the last record _on disk_ when Vector starts up is record ID 5. When the seek logic runs, it knows the last read record ID was 10. It will do some number of reads while seeking, eventually reading record ID 5, and updating `self.last_reader_record_id` accordingly. On the next iteration of the loop, it tries to read but hits EOF: the data file indeed has nothing left. However, `self.last_reader_record_id < ledger_last` is still true while `maybe_record.is_none() && self.last_reader_record_id == 0` is not, as `self.last_reader_record_id` is set to `5`. Alas, deadlock.

## Solution

The solution is painfully simple, and the user that originally reported the issue [said as much](vectordotdev#17644 (comment)): drop `&& self.last_reader_record_id == 0`.

Given the loop's own condition, the inner check for `self.last_reader_record_id == 0` was redundant... but obviously also logically incorrect in the case where we had missing writes. I'm still not entirely sure how existing tests didn't already catch this, but it was easy enough to spot the error once I knew where to look. The unit test I added convincingly showed that the logic was broken and, after making the change, that it was indeed fixed.

## Reviewer Note(s)

I added two unit tests: one for the fix as shown and one for what I thought was another bug. Turns out that the "other bug" wasn't a bug, and this unit test isn't _explicitly_ required, but it's a simple variation of other tests with a more straightforward invariant that it tries to demonstrate, so I just left it in.

Fixes vectordotdev#17644.
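For illustration only, the seek loop described under Context and Solution can be sketched roughly as below. This is a toy model, not the actual code touched by this commit: the `Reader` struct, its `on_disk` and `pos` fields, the `seek` signature, the `fixed` flag, and the `max_iters` cap (which stands in for the infinite loop so the example terminates) are all assumptions made for the sketch. Only the two conditions themselves are taken from the description above.

```rust
/// Hypothetical stand-in for the disk buffer reader (not Vector's real type).
struct Reader {
    /// Record IDs that actually made it to disk, in order.
    on_disk: Vec<u64>,
    /// Index into `on_disk` of the next record to read.
    pos: usize,
    /// ID of the last record read so far (0 means nothing read yet).
    last_reader_record_id: u64,
}

impl Reader {
    fn new(on_disk: Vec<u64>) -> Self {
        Reader { on_disk, pos: 0, last_reader_record_id: 0 }
    }

    /// Reads the next record ID, or returns `None` at EOF.
    fn next(&mut self) -> Option<u64> {
        let id = self.on_disk.get(self.pos).copied()?;
        self.pos += 1;
        self.last_reader_record_id = id;
        Some(id)
    }

    /// Seeks toward `ledger_last`. Returns `true` if the loop terminated on
    /// its own, or `false` if it hit `max_iters` (a stand-in for the deadlock).
    fn seek(&mut self, ledger_last: u64, fixed: bool, max_iters: usize) -> bool {
        let mut iters = 0;
        while self.last_reader_record_id < ledger_last {
            if iters == max_iters {
                return false;
            }
            iters += 1;

            let maybe_record = self.next();
            let should_break = if fixed {
                // Fixed check: EOF always ends the seek.
                maybe_record.is_none()
            } else {
                // Original check: EOF only ends the seek if nothing was read.
                maybe_record.is_none() && self.last_reader_record_id == 0
            };
            if should_break {
                break;
            }
        }
        true
    }
}

fn main() {
    // The ledger says record 10 was read, but only records 1..=5 reached disk.
    let mut buggy = Reader::new((1..=5).collect());
    let buggy_done = buggy.seek(10, false, 1_000);

    let mut fixed = Reader::new((1..=5).collect());
    let fixed_done = fixed.seek(10, true, 1_000);

    println!("original check terminated: {buggy_done}");
    println!("fixed check terminated: {fixed_done}");
}
```

With records 1 through 5 on disk and a ledger value of 10, the original check never breaks once `last_reader_record_id` is non-zero, while the fixed check stops at the first EOF.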
Showing 4 changed files with 224 additions and 7 deletions.