db_stress verify with lost unsynced operations #8966
Conversation
@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
I'm not understanding how it's able to sync up the replayed trace and the DB state. Can you explain? I don't see for example a seqno filter or threshold on what we apply to expected state.
db_stress_tool/db_stress_common.cc (Outdated)
@@ -233,6 +233,11 @@ size_t GenerateValue(uint32_t rand, char* v, size_t max_sz) {
  return value_sz;  // the size of the value set.
}

uint32_t GetValueBase(Slice s) {
  assert(s.size() >= sizeof(uint32_t));
  return *((uint32_t*)s.data());
You could take this opportunity to change to something like Get/PutUnaligned instead of a C-style cast.
Done.
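For reference, one portable way to drop the C-style cast is a `memcpy`-based load. This is a minimal sketch with a stand-in `Slice` type, not RocksDB's actual `GetUnaligned` helper:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Minimal stand-in for rocksdb::Slice, just enough for this sketch.
struct Slice {
  const char* data_;
  size_t size_;
  const char* data() const { return data_; }
  size_t size() const { return size_; }
};

// Reads the leading 32-bit value base without assuming s.data() is
// suitably aligned, unlike the original `*((uint32_t*)s.data())` cast.
uint32_t GetValueBase(Slice s) {
  assert(s.size() >= sizeof(uint32_t));
  uint32_t base;
  std::memcpy(&base, s.data(), sizeof(base));  // alignment-safe load
  return base;
}
```

Compilers typically lower this `memcpy` to a single load on platforms that permit unaligned access, so there is no performance penalty over the cast.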
db_stress_tool/expected_state.cc (Outdated)
Status PutCF(uint32_t column_family_id, const Slice& key,
             const Slice& value) override {
  uint64_t key_id;
  if (!GetIntVal(key.ToString(), &key_id)) {
I'm going to try to ignore the inherited craziness of GetIntVal (extern inline taking a string by value, apparently decoding to unsigned where the encoder encodes from signed, and building a temporary vector rather than using values as they are decoded).
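As a purely hypothetical illustration of a tightened signature (not the actual db_stress code or fix), a decoder could take a `std::string_view` and parse in place with no temporary allocations. This sketch assumes the key is a zero-padded decimal number, which may not match db_stress's real key encoding:

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>

// Hypothetical replacement: takes a view instead of a std::string by
// value and parses directly with std::from_chars, building no temporary
// vector. Assumes zero-padded decimal keys.
bool DecodeKeyId(std::string_view key, uint64_t* id) {
  auto [ptr, ec] = std::from_chars(key.data(), key.data() + key.size(), *id);
  // Require that parsing succeeded and consumed the whole key.
  return ec == std::errc() && ptr == key.data() + key.size();
}
```

`std::from_chars` reports failure through `std::errc` rather than exceptions, which keeps the hot verification path allocation- and exception-free.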
db_stress_tool/expected_state.cc (Outdated)
Status FileExpectedStateManager::Restore(DB* db) {
  // An `ExpectedStateTraceRecordHandler` applies a configurable number of
  // write operation trace records to the configured expected state.
  class ExpectedStateTraceRecordHandler : public TraceRecord::Handler,
A little big for a local class IMHO
Done, made it non-local.
db_stress_tool/expected_state.cc (Outdated)
if (s.ok()) {
  s = replayer->Prepare();
}
while (true) {
I prefer for (;;) {
Done.
};

SequenceNumber seqno = db->GetLatestSequenceNumber();
if (seqno < saved_seqno_) {
For clarity, I would assert(HasHistory())
Done.
  s = Env::Default()->DeleteFile(temp_path);
} else if (s.IsNotFound()) {
  s = Status::OK();
  saved_seqno_ = kMaxSequenceNumber;
Maybe say something like "SaveAtAndAfter(...) will be called after Restore(...) to initialize tracing".
It might or might not be called after reaching here, depending on whether the current run uses -sync_fault_injection=1. Reaching this code path only means the previous db_stress run used -sync_fault_injection=1.
Thanks very much for the review! Will get to your other comments soon - just wanted to answer the fundamental question first.
db_stress_tool/expected_state.cc (Outdated)
if (num_write_ops_ == max_write_ops_) {
  return Status::OK();
}
I'm not understanding how it's able to sync up the replayed trace and the DB state. Can you explain? I don't see for example a seqno filter or threshold on what we apply to expected state.
This returns without applying the write op once we've reached max_write_ops_. The client configures max_write_ops_ by subtracting the base sequence number of the trace determined by the filename ("X.trace" has base seqno X) from the post-recovery GetLatestSequenceNumber().
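A minimal sketch of that cutoff (illustrative names, not the actual `ExpectedStateTraceRecordHandler`): the handler counts applied writes and stops once it reaches the bound the client computed from the recovered seqno and the trace's base seqno.

```cpp
#include <cstdint>

// Sketch of the replay cutoff: apply exactly
// (post-recovery latest seqno) - (trace file base seqno) write ops.
class ReplayCutoff {
 public:
  ReplayCutoff(uint64_t recovered_seqno, uint64_t trace_base_seqno)
      : max_write_ops_(recovered_seqno - trace_base_seqno) {}

  // Returns true if the next replayed write op should be applied to the
  // expected state; ops past the recovered prefix are dropped silently.
  bool ShouldApply() {
    if (num_write_ops_ == max_write_ops_) {
      return false;
    }
    ++num_write_ops_;
    return true;
  }

 private:
  uint64_t num_write_ops_ = 0;
  uint64_t max_write_ops_;
};
```

Because each write op in the trace corresponds to one sequence number, stopping after this count means the expected state reflects exactly the prefix of operations the DB actually recovered.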
OK thanks. Of course it looks less hidden now that you've shown it to me, but I looked for it for a while.
The design/implementation doc explains this in a "Verification algorithm" subsection. It was written after you reviewed. Copy/pasted the subsection here since the doc is not shared publicly.
Verification algorithm

1. FileExpectedStateManager::Open() discovers from the filesystem ("*.{trace,state}" files) the sequence number synced by the previous db_stress run. It is stored in saved_seqno_.
2. FileExpectedStateManager::Restore() reconstructs a "LATEST.state" tailored to the DB that just recovered with possibly lost unsynced operations:
   - Copies "saved_seqno_.state" to ".LATEST.state.tmp".
   - Creates an ExpectedStateTraceRecordHandler configured to apply DB::GetLatestSequenceNumber() - saved_seqno_ updates to ".LATEST.state.tmp".
   - Replays "saved_seqno_.trace" using the above custom record handler.
   - Renames ".LATEST.state.tmp" to "LATEST.state".
3. NonBatchedOpsStressTest::VerifyDb() compares the recovered DB against the values now in "LATEST.state".
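The filename convention the algorithm relies on can be sketched with a pair of helpers (names hypothetical; the real manager composes full paths under the configured expected-values directory):

```cpp
#include <cstdint>
#include <string>

// A prior run that synced expected state at sequence number X (its
// saved_seqno_) leaves behind "X.state" and "X.trace".
std::string StateFileName(uint64_t saved_seqno) {
  return std::to_string(saved_seqno) + ".state";
}
std::string TraceFileName(uint64_t saved_seqno) {
  return std::to_string(saved_seqno) + ".trace";
}

// Restore() rebuilds this file from the pair above; VerifyDb() reads it.
const char kLatestStateFile[] = "LATEST.state";
const char kTempLatestStateFile[] = ".LATEST.state.tmp";
```

The temp-file-then-rename pattern in step 2 means a crash during Restore() itself cannot leave a half-written "LATEST.state" behind.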
LGTM modulo the cosmetic feedback
This reverts commit 57625597e6794ff503b8f2b998edd99061f6198a.
Thanks for the review!
@@ -307,7 +307,20 @@ void StressTest::FinishInitDb(SharedState* shared) {
    fprintf(stdout, "Compaction filter factory: %s\n",
            compaction_filter_factory->Name());
  }
  // TODO(ajkr): First restore if there's already a trace.

  if (shared->HasHistory() && IsStateTracked()) {
This IsStateTracked() causes a problem in the following sequence:

1. -sync_fault_injection=1 -test_batches_snapshots=0
2. -test_batches_snapshots=1
3. -test_batches_snapshots=0

The second command can cause the DB seqno to advance beyond what is recoverable in expected values, which causes the third command to fail.
This fixes two bugs in the recently committed DB verification following crash-recovery with unsynced data loss (facebook#8966):

First, we cannot trust `GetLatestSequenceNumber()` post-recovery to return the sequence number of the latest recovered record. Oddly, it can return a larger seqno than that in case WAL data older than the latest `LogAndApply()` is lost. This behavior is inherited from LevelDB, which does not have `FileMetaData::largest_seqno` and so could not get the exact latest record seqno from WAL+MANIFEST, whereas we could if we wanted to. For now I added a workaround.

Second, there was a bug in crash test runs involving mixed values for `-test_batches_snapshots`. The problem was we were neither restoring expected values nor enabling tracing when `-test_batches_snapshots=1`. This caused a future `-test_batches_snapshots=0` run to not find enough trace data to restore expected values. The fix is to restore expected values at the start of `-test_batches_snapshots=1` runs, but still leave tracing disabled as we do not need to track those KVs.

Test Plan: The below command runs for several minutes (still fails eventually) whereas it used to fail every time at the start of the second `db_stress` run.

```
$ TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1
```
…pshots` (#9302)

Summary: This fixes two bugs in the recently committed DB verification following crash-recovery with unsynced data loss (#8966):

The first bug was in crash test runs involving mixed values for `-test_batches_snapshots`. The problem was we were neither restoring expected values nor enabling tracing when `-test_batches_snapshots=1`. This caused a future `-test_batches_snapshots=0` run to not find enough trace data to restore expected values. The fix is to restore expected values at the start of `-test_batches_snapshots=1` runs, but still leave tracing disabled as we do not need to track those KVs.

The second bug was in `db_stress` runs that restore the expected values file and use a compaction filter. The compaction filter was initialized to use the pre-restore expected values, which would be `munmap()`'d during `FileExpectedStateManager::Restore()`. Then the compaction filter would run into a segfault. The fix is just to reorder compaction filter init after expected values restore.

Pull Request resolved: #9302

Test Plan:

- To verify the first problem, the below sequence used to fail; now it passes.

```
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=1
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0
```

- The second problem occurred rarely in the form of a SIGSEGV on a file that was `munmap()`'d. I have not seen it after this PR, though this doesn't prove much.
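The second fix is an ordering constraint that can be modeled in miniature with toy types (not RocksDB's actual classes): a filter that captures the expected-values buffer before `Restore()` replaces it would hold a dangling pointer, so filter initialization must happen after the restore.

```cpp
#include <memory>
#include <vector>

// Toy model of the use-after-munmap bug: ExpectedState owns a buffer that
// Restore() swaps out, invalidating any pointer captured beforehand.
struct ExpectedState {
  std::unique_ptr<std::vector<int>> values =
      std::make_unique<std::vector<int>>(4, 0);
  void Restore() {  // models munmap() + remap of the state file
    values = std::make_unique<std::vector<int>>(4, 1);
  }
  const std::vector<int>* current() const { return values.get(); }
};

struct CompactionFilterModel {
  const std::vector<int>* snapshot;  // pointer captured at init time
};

// Correct order per the fix: restore first, then init the filter so it
// points at the post-restore buffer rather than a freed one.
CompactionFilterModel InitAfterRestore(ExpectedState& es) {
  es.Restore();
  return CompactionFilterModel{es.current()};
}
```

With the pre-fix ordering, `snapshot` would be captured before `Restore()` replaced the buffer, leaving it dangling; the reordering makes the captured pointer match the live state.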
Reviewed By: jay-zhuang
Differential Revision: D33155283
Pulled By: ajkr
fbshipit-source-id: 66fd0f0edf34015a010c30015f14f104734e964e
…9338)

Summary: Recently we added the ability to verify that some prefix of operations is recovered (AKA no "hole" in the recovered data) (#8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL-disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so the crash test always enables atomic flush when WAL is disabled.

To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, is possible but requires further changes. I added TODOs in db_crashtest.py.

Depends on #9305.

Pull Request resolved: #9338

Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in the Phabricator diff test plan.

Reviewed By: riversand963
Differential Revision: D33345333
Pulled By: ajkr
fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
When a previous run left behind historical state/trace files (implying it was run with --sync_fault_injection set), this PR uses them to restore the expected state according to the DB's recovered sequence number. That way, a tail of the latest unsynced operations is permitted to be dropped, as is the case when data in page cache or certain `Env`s is lost. The point of the verification in this scenario is just to ensure there is no hole in the recovered data.

Test Plan: