Introduce FaultInjectionTestFS to test fault File system instead of Env #6414

zhichao-cao · 2020-02-13T22:32:35Z

In the current code base, we can use FaultInjectionTestEnv to simulate Env issues such as file write/read errors, and it is used in most of the tests. PR #5761 introduced FileSystem as a new Env API. This PR implements FaultInjectionTestFS, which can be used to simulate FileSystem issues such as IO errors. The user can specify any IOStatus error as input, so that the corresponding FS operations return that error to the caller.

A set of ErrorHandlerFSTests is introduced for testing.

Test plan: pass make asan_check, pass error_handler_fs_test.
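To illustrate the idea in the description, here is a minimal, self-contained sketch of a wrapper that returns a caller-supplied error once a fault is injected. It does not use the RocksDB headers: `ToyIOStatus`, `ToyFS`, `ToyFaultInjectionFS`, `InjectError`, and `ClearError` are hypothetical names made up for this example, and the real FaultInjectionTestFS interface may look different.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for an IOStatus-like result; not the RocksDB type.
struct ToyIOStatus {
  bool ok;
  std::string message;
};

inline ToyIOStatus ToyOK() { return {true, ""}; }
inline ToyIOStatus ToyIOError(const std::string& msg) { return {false, msg}; }

// Hypothetical in-memory "file system" that stores file contents by name.
class ToyFS {
 public:
  ToyIOStatus Append(const std::string& fname, const std::string& data) {
    files_[fname] += data;
    return ToyOK();
  }

  std::string Contents(const std::string& fname) const {
    auto it = files_.find(fname);
    return it == files_.end() ? std::string() : it->second;
  }

 private:
  std::map<std::string, std::string> files_;
};

// Wrapper that forwards to the target until an error is injected. While a
// fault is active, Append returns the injected error and leaves the target
// untouched.
class ToyFaultInjectionFS {
 public:
  explicit ToyFaultInjectionFS(ToyFS* target) : target_(target) {
    assert(target_ != nullptr);
  }

  void InjectError(const ToyIOStatus& error) {
    error_ = error;
    active_ = false;
  }

  void ClearError() { active_ = true; }

  ToyIOStatus Append(const std::string& fname, const std::string& data) {
    if (!active_) {
      return error_;
    }
    return target_->Append(fname, data);
  }

 private:
  ToyFS* target_;
  bool active_ = true;
  ToyIOStatus error_{true, ""};
};
```

A test would wrap a `ToyFS`, call `InjectError(ToyIOError("no space"))`, and then check that the code under test reacts to the failed `Append` as expected.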

anand1976

This is great! Do we still need error_handler_test.cc? It seems redundant to me.

ajkr

Is this a migration of FaultInjectionTestEnv to the new FileSystem interface? If so can we delete the old Env-based implementation?

ajkr · 2020-02-21T22:19:52Z

test_util/fault_injection_test_fs.cc

+}
+
+// A basic file truncation function suitable for this test.
+IOStatus TestFSTruncate(FileSystem* fs, const std::string& filename,


I know this is following the existing pattern for dropping unsynced data. But I never understood what's the reason to write unsynced data to a file rather than buffer it in process memory. The latter avoids confusion of persisting data and un-persisting it later.

It would also make the fault injection FileSystem useful for simulating power loss crash-recovery. db_crashtest.py works by repeatedly running and killing db_stress processes and verifying correctness. If db_stress buffers its unsynced data in process memory, killing the process yields the same result (lost unsynced data) as if we had crashed the whole machine.

@ajkr Thanks for the comments. I'm a little confused. Here, each file has a state_, and we use pos_at_last_sync_ to remember the position of the last sync. Appends can happen after the previous sync (pos_). If DropUnsyncedData is called, we need to preserve the synced data. So it reads the data from 0 to pos_at_last_sync_, writes it to a new file, and renames that file to the original name. The file then contains only the synced data.
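The approach described in this comment can be modeled with a small sketch. This is a toy under stated assumptions, not the code in the PR: `ToyFileState` and its `disk` member are hypothetical, with a `std::string` standing in for the on-disk file, and only the `pos_at_last_sync` idea is taken from the discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of one file's state. `disk` stands in for the real file
// and holds everything appended so far, whether synced or not.
struct ToyFileState {
  std::string disk;
  std::size_t pos_at_last_sync = 0;  // number of bytes known to be durable

  // Unsynced data is written straight through to the "file".
  void Append(const std::string& data) { disk += data; }

  // Record that everything written so far is durable.
  void Sync() { pos_at_last_sync = disk.size(); }

  // Keep only the synced prefix: copy [0, pos_at_last_sync) into a new buffer
  // and swap it in, mirroring the read -> rewrite -> rename sequence.
  void DropUnsyncedData() {
    assert(pos_at_last_sync <= disk.size());
    std::string rewritten = disk.substr(0, pos_at_last_sync);
    disk.swap(rewritten);
  }
};
```

Note that in this model the unsynced bytes are visible in `disk` until `DropUnsyncedData()` runs, which is the "persisting data and un-persisting it later" behavior questioned above.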

Yeah, I agree it works for preserving only the synced data after a call to DropUnsyncedData(). I just think buffering Append()ed data in process memory and waiting until Sync() to write it to a file would be more straightforward, and would make this FileSystem useful for simulating power loss failure in our existing process crash-recovery tests.

@ajkr Oh I see. Correct me if my understanding is wrong. You mean that when people call TestFSWritableFile->Append, instead of directly calling target_->append, we can create a buffer (e.g., attached to the file state) to temporarily hold the data. Only when TestFSWritableFile->Sync is called do we actually call target_->append to write it to the file (similar to what WritableFileWriter does).

I think using a write buffer might make the process a little complex. The buffer size is preallocated. If the appended data exceeds the buffer size, we need to either do a real target_->append or create a new buffer, which increases the complexity. In the former case, the file holds more data than we expect (append is called earlier than a real sync). If we keep increasing the buffer size, managing the memory is not so easy.

I tried this once before btw: https://github.com/ajkr/cockroach/blob/1c540424696a73adcafd5a773d220fbafb68bcbf/c-deps/libroach/rocksdbutils/env_sync_fault_injection.cc#L60-L104. IIRC the locking was needed when DBOptions::manual_wal_flush = true as FlushWAL(true /*sync*/) calls may be concurrent with Append(). That has extra logic for simulating failed fsyncs (calling exit() and corrupting the buffered data) that can be ignored for now.

> I think using a write buffer might make the process a little complex. The buffer size is preallocated. If the appended data exceeds the buffer size, we need to either do a real target_->append or create a new buffer, which increases the complexity. In the former case, the file holds more data than we expect (append is called earlier than a real sync). If we keep increasing the buffer size, managing the memory is not so easy.

I'd just use std::string and buffer everything until Sync() is called; I don't think this will be used in cases where one file's unsynced data will cause OOM. Maybe in the future if it makes its way into db_stress and we run with weird parameters like huge files. But IMO that'd be a sign we're getting good use out of this feature.

@ajkr Thanks for the reference! I think I can follow your logic to make the change. In this way, DropUnsyncedData function can be removed.

> > I think using a write buffer might make the process a little complex. The buffer size is preallocated. If the appended data exceeds the buffer size, we need to either do a real target_->append or create a new buffer, which increases the complexity. In the former case, the file holds more data than we expect (append is called earlier than a real sync). If we keep increasing the buffer size, managing the memory is not so easy.
>
> I'd just use std::string and buffer everything until Sync() is called; I don't think this will be used in cases where one file's unsynced data will cause OOM. Maybe in the future if it makes its way into db_stress and we run with weird parameters like huge files. But IMO that'd be a sign we're getting good use out of this feature.

Yeah. I forgot this is just for testing; using std::string is enough. It will handle memory allocation and copying during append.

> Yeah. I forgot this is just for testing; using std::string is enough. It will handle memory allocation and copying during append.

You have a point though that the existing solution is more scalable than my suggestion. I didn't think of that before.

> @ajkr Thanks for the reference! I think I can follow your logic to make the change. In this way, DropUnsyncedData function can be removed.

Oh interesting, I was initially thinking it'd be implemented by clearing the string. But I haven't looked at the test case so am not sure; maybe deleting it entirely is fine.

> Oh interesting, I was initially thinking it'd be implemented by clearing the string. But I haven't looked at the test case so am not sure; maybe deleting it entirely is fine.

I mean, remove the current read->rewrite->rename logic and just clear the write buffer. Also, remove pos_at_last_sync_.
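The design the thread converges on can be sketched as follows. This is an illustration under assumptions, not the merged implementation: `ToyBufferedFile`, `durable_`, and `buffer_` are hypothetical names, and a `std::string` stands in for the underlying target file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical writable file. `durable_` models what the underlying file
// holds; `buffer_` models Append()ed data that has not been synced yet.
class ToyBufferedFile {
 public:
  // Appends only touch process memory, so killing the process loses them.
  void Append(const std::string& data) { buffer_ += data; }

  // Sync is the only point where buffered data reaches the "file".
  void Sync() {
    durable_ += buffer_;
    buffer_.clear();
    assert(buffer_.empty());
  }

  // Dropping unsynced data is just clearing the buffer; no rewrite or rename
  // is needed and no sync position has to be tracked.
  void DropUnsyncedData() { buffer_.clear(); }

  const std::string& durable() const { return durable_; }
  std::size_t unsynced_bytes() const { return buffer_.size(); }

 private:
  std::string durable_;
  std::string buffer_;
};
```

The trade-off raised earlier still applies: all unsynced bytes of a file live in memory until `Sync()`, which is acceptable for tests but scales worse than writing through to the file.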

zhichao-cao · 2020-02-21T23:43:02Z

> This is great! Do we still need error_handler_test.cc? It seems redundant to me.

Since we want to move storage-related functionality from Env to FileSystem, I think we can remove error_handler_test.cc.

anand1976

LGTM

zhichao-cao · 2020-03-04T18:56:50Z

> LGTM

Thanks for the review!

facebook-github-bot

@zhichao-cao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2020-03-04T22:00:16Z

@zhichao-cao merged this pull request in e62fe50.

…nv (#6414)

Summary: In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR facebook/rocksdb#5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller. A set of ErrorHandlerFSTests are introduced for testing

Pull Request resolved: facebook/rocksdb#6414

Test Plan: pass make asan_check, pass error_handler_fs_test.

Differential Revision: D20252421

Pulled By: zhichao-cao

fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21

Signed-off-by: Changlong Chen <levisonchen@live.cn>

zhichao-cao requested review from siying and anand1976 February 13, 2020 22:32

facebook-github-bot added the CLA Signed label Feb 13, 2020

anand1976 reviewed Feb 19, 2020

ajkr reviewed Feb 21, 2020

zhichao-cao force-pushed the fault_fs_test branch from 08831ed to 51ff83c Compare February 22, 2020 07:44

zhichao-cao added 8 commits March 3, 2020 15:09

Add the file fault_injection_test_fs for simulate the IO error of FS

74afcb6

Add the class of fault_injection_test_fs and the error_handler_fs_test based on fs

26ccc12

Replace the name

112a984

Fix the copy operator bug of IOStatus

a8a8331

Change the append to real buffer state before sync

f125d69

Remove error_handler_test.cc, correct name space, make format

6e2d639

Correct the comments

2ce13ae

Corrected the namespace changes

ddb9e78

zhichao-cao force-pushed the fault_fs_test branch from a77f657 to ddb9e78 Compare March 3, 2020 23:41

Added the code for compaction

8e3ca1f

anand1976 approved these changes Mar 4, 2020

facebook-github-bot reviewed Mar 4, 2020

facebook-github-bot closed this in e62fe50 Mar 4, 2020

facebook-github-bot added the Merged label Mar 4, 2020
