Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db_stress verify with lost unsynced operations #8966

Closed
wants to merge 13 commits into from

Conversation

ajkr
Copy link
Contributor

@ajkr ajkr commented Sep 27, 2021

When a previous run left behind historical state/trace files (implying it was run with --sync_fault_injection set), this PR uses them to restore the expected state according to the DB's recovered sequence number. That way, a tail of latest unsynced operations are permitted to be dropped, as is the case when data in page cache or certain Envs is lost. The point of the verification in this scenario is just to ensure there is no hole in the recovered data.

Test Plan:

  • ran it a while, made sure it is restoring expected values using the historical state/trace files:
$ rm -rf ./tmp-db/ ./exp/ && mkdir -p ./tmp-db/ ./exp/ && while ./db_stress -compression_type=none -clear_column_family_one_in=0 -expected_values_dir=./exp -sync_fault_injection=1 -destroy_db_initially=0 -db=./tmp-db -max_key=1000000 -ops_per_thread=10000 -reopen=0 -threads=32 ; do : ; done

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr ajkr force-pushed the dbstress-restore-expected-values branch from 5ed2c58 to 7e9f573 Compare September 28, 2021 22:00
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr ajkr force-pushed the dbstress-restore-expected-values branch from 7e9f573 to c8095f5 Compare December 3, 2021 05:43
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr ajkr force-pushed the dbstress-restore-expected-values branch from c8095f5 to aee8f70 Compare December 3, 2021 23:08
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr ajkr force-pushed the dbstress-restore-expected-values branch from aee8f70 to e7a1aac Compare December 4, 2021 06:42
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding how it's able to sync up the replayed trace and the DB state. Can you explain? I don't see for example a seqno filter or threshold on what we apply to expected state.

@@ -233,6 +233,11 @@ size_t GenerateValue(uint32_t rand, char* v, size_t max_sz) {
return value_sz; // the size of the value set.
}

uint32_t GetValueBase(Slice s) {
assert(s.size() >= sizeof(uint32_t));
return *((uint32_t*)s.data());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could take this opportunity to change to something like Get/PutUnaligned instead of C-style cast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Status PutCF(uint32_t column_family_id, const Slice& key,
const Slice& value) override {
uint64_t key_id;
if (!GetIntVal(key.ToString(), &key_id)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to try to ignore the inherited craziness of GetIntVal (extern inline taking a string by value, apparently decoding to unsigned where the encode encodes from signed, and builds a temporary vector rather than using values as they are decoded)

Status FileExpectedStateManager::Restore(DB* db) {
// An `ExpectedStateTraceRecordHandler` applies a configurable number of
// write operation trace records to the configured expected state.
class ExpectedStateTraceRecordHandler : public TraceRecord::Handler,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A little big for a local class IMHO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, made it non-local.

if (s.ok()) {
s = replayer->Prepare();
}
while (true) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer for (;;) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

};

SequenceNumber seqno = db->GetLatestSequenceNumber();
if (seqno < saved_seqno_) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, I would assert(HasHistory())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

s = Env::Default()->DeleteFile(temp_path);
} else if (s.IsNotFound()) {
s = Status::OK();
saved_seqno_ = kMaxSequenceNumber;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe say something like SaveAtAndAfter(...) will be called after Restore(...) to initialize tracing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might or might not be called after reaching here, depending on whether the current run uses -sync_fault_injection=1. Reaching this code path only means the previous db_stress run used -sync_fault_injection=1.

Copy link
Contributor Author

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks very much for the review! Will get to your other comments soon - just wanted to answer the fundamental question first.

Comment on lines 348 to 346
if (num_write_ops_ == max_write_ops_) {
return Status::OK();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not understanding how it's able to sync up the replayed trace and the DB state. Can you explain? I don't see for example a seqno filter or threshold on what we apply to expected state.

This returns without applying the write op once we've reached max_write_ops_. The client configures max_write_ops_ by subtracting the base sequence number of the trace determined by the filename ("X.trace" has base seqno X) from the post-recovery GetLatestSequenceNumber().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK thanks. Of course it looks less hidden now that you've showed it to me, but I looked for it for a while.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design/implementation doc explains this in a "Verification algorithm" subsection. It was written after you reviewed. Copy/pasted the subsection here since the doc is not shared publicly.

Verification algorithm

  • FileExpectedStateManager::Open() discovers from the filesystem (“*.{trace,state}” files) the sequence number synced by the previous db_stress run. It is stored in saved_seqno_.
  • FileExpectedStateManager::Restore() reconstructs a “LATEST.state” tailored to the DB that just recovered with possibly lost unsynced operations:
    • Copies “saved_seqno_.state” to “.LATEST.state.tmp”.
    • Creates an ExpectedStateTraceRecordHandler configured to apply DB::GetLatestSequenceNumber() - saved_seqno_ updates to “.LATEST.state.tmp”.
    • saved_seqno_.trace” is replayed using the above custom record handler.
    • “.LATEST.state.tmp” is renamed to “LATEST.state”.
  • NonBatchedOpsStressTest::VerifyDb() compares the recovered DB against the values now in “LATEST.state”.

Copy link
Contributor

@pdillinger pdillinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo the cosmetic feedback

@ajkr ajkr force-pushed the dbstress-restore-expected-values branch from e7a1aac to 4f7aca7 Compare December 7, 2021 19:35
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@ajkr ajkr force-pushed the dbstress-restore-expected-values branch from 4f7aca7 to 11c4941 Compare December 7, 2021 22:55
@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ajkr ajkr force-pushed the dbstress-restore-expected-values branch from 11c4941 to 07f7e9b Compare December 15, 2021 18:15
Copy link
Contributor Author

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review!

Comment on lines 348 to 346
if (num_write_ops_ == max_write_ops_) {
return Status::OK();
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design/implementation doc explains this in a "Verification algorithm" subsection. It was written after you reviewed. Copy/pasted the subsection here since the doc is not shared publicly.

Verification algorithm

  • FileExpectedStateManager::Open() discovers from the filesystem (“*.{trace,state}” files) the sequence number synced by the previous db_stress run. It is stored in saved_seqno_.
  • FileExpectedStateManager::Restore() reconstructs a “LATEST.state” tailored to the DB that just recovered with possibly lost unsynced operations:
    • Copies “saved_seqno_.state” to “.LATEST.state.tmp”.
    • Creates an ExpectedStateTraceRecordHandler configured to apply DB::GetLatestSequenceNumber() - saved_seqno_ updates to “.LATEST.state.tmp”.
    • saved_seqno_.trace” is replayed using the above custom record handler.
    • “.LATEST.state.tmp” is renamed to “LATEST.state”.
  • NonBatchedOpsStressTest::VerifyDb() compares the recovered DB against the values now in “LATEST.state”.

Status FileExpectedStateManager::Restore(DB* db) {
// An `ExpectedStateTraceRecordHandler` applies a configurable number of
// write operation trace records to the configured expected state.
class ExpectedStateTraceRecordHandler : public TraceRecord::Handler,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, made it non-local.

};

SequenceNumber seqno = db->GetLatestSequenceNumber();
if (seqno < saved_seqno_) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if (s.ok()) {
s = replayer->Prepare();
}
while (true) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

s = Env::Default()->DeleteFile(temp_path);
} else if (s.IsNotFound()) {
s = Status::OK();
saved_seqno_ = kMaxSequenceNumber;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might or might not be called after reaching here, depending on whether the current run uses -sync_fault_injection=1. Reaching this code path only means the previous db_stress run used -sync_fault_injection=1.

@@ -233,6 +233,11 @@ size_t GenerateValue(uint32_t rand, char* v, size_t max_sz) {
return value_sz; // the size of the value set.
}

uint32_t GetValueBase(Slice s) {
assert(s.size() >= sizeof(uint32_t));
return *((uint32_t*)s.data());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@ajkr has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

@@ -307,7 +307,20 @@ void StressTest::FinishInitDb(SharedState* shared) {
fprintf(stdout, "Compaction filter factory: %s\n",
compaction_filter_factory->Name());
}
// TODO(ajkr): First restore if there's already a trace.

if (shared->HasHistory() && IsStateTracked()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This IsStateTracked() causes a problem in the following sequence:

  • -sync_fault_injection=1 -test_batches_snapshot=0
  • -test_batches_snapshot=1
  • -test_batches_snapshot=0

The second command can cause DB seqno to advance beyond what is recoverable in expected values, which causes the third command to fail.

@facebook-github-bot
Copy link
Contributor

@ajkr has updated the pull request. You must reimport the pull request before landing.

ajkr added a commit to ajkr/rocksdb that referenced this pull request Dec 16, 2021
This fixes two bugs in the recently committed DB verification following
crash-recovery with unsynced data loss (facebook#8966):

First, we cannot trust `GetLatestSequenceNumber()` post-recovery to return
the sequence number of the latest recovered record. Oddly, it can return
a larger seqno than that in case WAL data older than the latest
`LogAndApply()` is lost.  This behavior is inherited from
LevelDB. They do not have `FileMetaData::largest_seqno` so could not
get the exact latest record seqno from WAL+MANIFEST, whereas we could
if we wanted to. But for now I added a workaround.

Second, there was a bug in crash test runs involving mixed values for
`-test_batches_snapshots`. The problem was we were neither restoring
expected values nor enabling tracing when `-test_batches_snapshots=1`.
This caused a future `-test_batches_snapshots=0` run to not find enough
trace data to restore expected values. The fix is to restore expected
values at the start of `-test_batches_snapshots=1` runs, but still leave
tracing disabled as we do not need to track those KVs.

Test Plan:

The below command runs for several minutes (still fails eventually) whereas it used to fail every time at the start of the second `db_stress` run.

```
$ TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1
```-
facebook-github-bot pushed a commit that referenced this pull request Dec 18, 2021
…pshots` (#9302)

Summary:
This fixes two bugs in the recently committed DB verification following
crash-recovery with unsynced data loss (#8966):

The first bug was in crash test runs involving mixed values for
`-test_batches_snapshots`. The problem was we were neither restoring
expected values nor enabling tracing when `-test_batches_snapshots=1`.
This caused a future `-test_batches_snapshots=0` run to not find enough
trace data to restore expected values. The fix is to restore expected
values at the start of `-test_batches_snapshots=1` runs, but still leave
tracing disabled as we do not need to track those KVs.

The second bug was in `db_stress` runs that restore the expected values
file and use compaction filter. The compaction filter was initialized to use
the pre-restore expected values, which would be `munmap()`'d during
`FileExpectedStateManager::Restore()`. Then compaction filter would run
into a segfault. The fix is just to reorder compaction filter init after expected
values restore.

Pull Request resolved: #9302

Test Plan:
- To verify the first problem, the below sequence used to fail; now it passes.

```
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=1
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0
```

- The second problem occurred rarely in the form of a SIGSEGV on a file that was `munmap()`d. I have not seen it after this PR though this doesn't prove much.

Reviewed By: jay-zhuang

Differential Revision: D33155283

Pulled By: ajkr

fbshipit-source-id: 66fd0f0edf34015a010c30015f14f104734e964e
facebook-github-bot pushed a commit that referenced this pull request Jan 6, 2022
…9338)

Summary:
Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (#8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled.

To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py.

Depends on #9305.

Pull Request resolved: #9338

Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan.

Reviewed By: riversand963

Differential Revision: D33345333

Pulled By: ajkr

fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants