-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add manual_wal_flush, FlushWAL() to stress/crash test #10698
Conversation
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR, few more things.
One other thing discussed offline is enabling prefix verification when manual_wal_flush
is set.
603b9ac
to
bb89382
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
bb89382
to
5a45987
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
5a45987
to
817988e
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
817988e
to
2cab8ad
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
2cab8ad
to
2f3fafc
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
tools/db_crashtest.py
Outdated
dest_params["manual_wal_flush_one_in"] = 0 | ||
if dest_params.get("manual_wal_flush_one_in", 0) > 0: | ||
dest_params["sync_fault_injection"] = 1 | ||
if ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this existing processing down to avoid splitting
if dest_params.get("sync_fault_injection", 0) == 0:
dest_params["manual_wal_flush_one_in"] = 0
if dest_params.get("manual_wal_flush_one_in", 0) > 0:
dest_params["sync_fault_injection"] = 1
Both of these two are needed for normal stress run where sync_fault_injection
and manual_wal_flush_one_in
can each be 0/1.
Moving this existing processing down is also proved to be safe as (1) all writes of disable_wal
and sync_fault_injection
is now above this (2) there are only writes dest_params["ingest_external_file_one_in"] = 0
(3) there are only writes dest_params["enable_compaction_filter"] = 0
and one read if dest_params.get("enable_compaction_filter", 0) == 1:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
2f3fafc
to
a3a32cc
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
a3a32cc
to
d021068
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@@ -2771,7 +2789,9 @@ void StressTest::Reopen(ThreadState* thread) { | |||
clock_->TimeToString(now / 1000000).c_str(), num_times_reopened_); | |||
Open(thread->shared); | |||
|
|||
if ((FLAGS_sync_fault_injection || FLAGS_disable_wal) && IsStateTracked()) { | |||
if ((FLAGS_sync_fault_injection || FLAGS_disable_wal || | |||
FLAGS_manual_wal_flush_one_in > 0) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we also need to do this in reopen. I did this and turned up CI jobs which founds quite a bit verification errors. So will investigate those before landing. Internal link shared in internal 1:1 doc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debugged and fixed by disabling features can't run under unsync data loss like file ingestion.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
6c21b8f
to
1e0c8a9
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
1e0c8a9
to
c5aed9d
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes since last accept LGTM
c5aed9d
to
34b9d0d
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
34b9d0d
to
37cf1a7
Compare
@hx235 has updated the pull request. You must reimport the pull request before landing. |
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
) Summary: **Context:** #10698 made `Flush(sync=true)` required for` DB::Open()` (to pass the original but now deleted assertion `impl->TEST_WALBufferIsEmpty()` under `manual_wal_flush=true`, see #10698 summary for more ) as well as db_stress to pass. However RocksDB users may not implement SyncWAL() (used inFlush(sync=true)). Therefore we replace such in DB::Open and db_stress in this PR and align with https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1883-L1887 and https://github.com/facebook/rocksdb/blob/main/db_stress_tool/db_stress_test_base.cc#L847-L849 Pull Request resolved: #10784 Test Plan: make check Reviewed By: anand1976 Differential Revision: D40193354 Pulled By: anand1976 fbshipit-source-id: e80d53880799ae01bdd717641d07997d3bfe2b54
Context/Summary:
Introduce
manual_wal_flush_one_in
as titled.manual_wal_flush_one_in > 0
, we also need tracing to correctly verify recovery because WAL data can be lost in this case whenFlushWAL()
is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features can't run under unsync data loss withmanual_wal_flush_one_in
Incompatibilities fixed along the way:
impl->CreateWAL->Writer::AddCompressionTypeRecord()
is called before this assertion. The functionEmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());
but do not trigger flush ifmanual_wal_flush
is set . This leads to the later assertion on `impl->TEST_WALBufferIsEmpty()' fails.FlushWAL(sync=true)
along with refactoringTEST_WALBufferIsEmpty()
to beWALBufferIsEmpty()
since it is used in prod code now.Test plan:
python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3