Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add manual_wal_flush, FlushWAL() to stress/crash test #10698

Closed
wants to merge 3 commits into from

Conversation

hx235
Copy link
Contributor

@hx235 hx235 commented Sep 16, 2022

Context/Summary:
Introduce manual_wal_flush_one_in as titled.

  • When manual_wal_flush_one_in > 0, we also need tracing to correctly verify recovery because WAL data can be lost in this case when FlushWAL() is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss with manual_wal_flush_one_in

Incompatibilities fixed along the way:

db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()` failed.
  • It turns out that impl->CreateWAL->Writer::AddCompressionTypeRecord() is called before this assertion. The function EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size()); but do not trigger flush if manual_wal_flush is set . This leads to the later assertion on `impl->TEST_WALBufferIsEmpty()' fails.
    • As suggested, assertion is removed and violation case is handled by FlushWAL(sync=true) along with refactoring TEST_WALBufferIsEmpty() to be WALBufferIsEmpty() since it is used in prod code now.

Test plan:

  • Locally running python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3
  • Joined Support WriteCommit policy with sync_fault_injection=1 #10624 in auto CI testings with all RocksDB stress/crash test jobs

@hx235 hx235 added WIP Work in progress and removed CLA Signed labels Sep 16, 2022
@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, few more things.

One other thing discussed offline is enabling prefix verification when manual_wal_flush is set.

db/log_writer.cc Outdated Show resolved Hide resolved
tools/db_crashtest.py Outdated Show resolved Hide resolved
tools/db_crashtest.py Outdated Show resolved Hide resolved
db_stress_tool/db_stress_test_base.cc Outdated Show resolved Hide resolved
@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from 603b9ac to bb89382 Compare September 23, 2022 03:07
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from bb89382 to 5a45987 Compare September 23, 2022 03:17
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from 5a45987 to 817988e Compare September 23, 2022 04:26
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from 817988e to 2cab8ad Compare September 23, 2022 17:52
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from 2cab8ad to 2f3fafc Compare September 23, 2022 18:02
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 changed the title [Draft] Add manual_wal_flush, FlushWAL() to stress/crash test Add manual_wal_flush, FlushWAL() to stress/crash test Sep 23, 2022
@hx235 hx235 removed the WIP Work in progress label Sep 23, 2022
dest_params["manual_wal_flush_one_in"] = 0
if dest_params.get("manual_wal_flush_one_in", 0) > 0:
dest_params["sync_fault_injection"] = 1
if (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this existing processing down to avoid splitting

if dest_params.get("sync_fault_injection", 0) == 0:
        dest_params["manual_wal_flush_one_in"] = 0
    if dest_params.get("manual_wal_flush_one_in", 0) > 0:
        dest_params["sync_fault_injection"] = 1

Both of these two are needed for normal stress run where sync_fault_injection and manual_wal_flush_one_in can each be 0/1.

Moving this existing processing down is also proved to be safe as (1) all writes of disable_wal and sync_fault_injection is now above this (2) there are only writes dest_params["ingest_external_file_one_in"] = 0 (3) there are only writes dest_params["enable_compaction_filter"] = 0 and one read if dest_params.get("enable_compaction_filter", 0) == 1:

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

db_stress_tool/db_stress_tool.cc Outdated Show resolved Hide resolved
tools/db_crashtest.py Outdated Show resolved Hide resolved
@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from 2f3fafc to a3a32cc Compare September 24, 2022 02:25
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from a3a32cc to d021068 Compare September 26, 2022 18:46
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@@ -2771,7 +2789,9 @@ void StressTest::Reopen(ThreadState* thread) {
clock_->TimeToString(now / 1000000).c_str(), num_times_reopened_);
Open(thread->shared);

if ((FLAGS_sync_fault_injection || FLAGS_disable_wal) && IsStateTracked()) {
if ((FLAGS_sync_fault_injection || FLAGS_disable_wal ||
FLAGS_manual_wal_flush_one_in > 0) &&
Copy link
Contributor Author

@hx235 hx235 Sep 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we also need to do this in reopen. I did this and turned up CI jobs which founds quite a bit verification errors. So will investigate those before landing. Internal link shared in internal 1:1 doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debugged and fixed by disabling features can't run under unsync data loss like file ingestion.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from 6c21b8f to 1e0c8a9 Compare September 29, 2022 00:46
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from 1e0c8a9 to c5aed9d Compare September 29, 2022 01:45
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes since last accept LGTM

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from c5aed9d to 34b9d0d Compare September 30, 2022 17:41
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@hx235 hx235 force-pushed the manual_wal_flush_stress_test branch from 34b9d0d to 37cf1a7 Compare September 30, 2022 20:09
@facebook-github-bot
Copy link
Contributor

@hx235 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Oct 10, 2022
)

Summary:
**Context:**
#10698 made `Flush(sync=true)` required for` DB::Open()` (to pass the original but now deleted assertion `impl->TEST_WALBufferIsEmpty()` under `manual_wal_flush=true`, see #10698 summary for more ) as well as db_stress to pass.

However RocksDB users may not implement SyncWAL() (used inFlush(sync=true)). Therefore we replace such in DB::Open and db_stress in this PR and align with https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1883-L1887 and https://github.com/facebook/rocksdb/blob/main/db_stress_tool/db_stress_test_base.cc#L847-L849

Pull Request resolved: #10784

Test Plan: make check

Reviewed By: anand1976

Differential Revision: D40193354

Pulled By: anand1976

fbshipit-source-id: e80d53880799ae01bdd717641d07997d3bfe2b54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants