Fix a silent data loss for write-committed txn #9571

riversand963 · 2022-02-15T05:14:33Z

The following sequence of events can cause silent data loss for write-committed
transactions.

Time    thread 1                                       bg flush
 |   db->Put("a")
 |   txn = NewTxn()
 |   txn->Put("b", "v")
 |   txn->Prepare()       // writes only to 5.log
 |   db->SwitchMemtable() // memtable 1 has "a"
 |                        // close 5.log,
 |                        // creates 8.log
 |   trigger flush
 |                                                  pick memtable 1
 |                                                  unlock db mutex
 |                                                  write new sst
 |   txn->ctwb->Put("gtid", "1") // writes 8.log
 |   txn->Commit() // writes to 8.log
 |                 // writes to memtable 2
 |                                               compute min_log_number_to_keep_2pc, this
 |                                               will be 8 (incorrect).
 |
 |                                             Purge obsolete wals, including 5.log
 |
 V

At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in
memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"},
only the commit marker of txn is in 8.log

The reason lies in PrecomputeMinLogNumberToKeep2PC() which calls FindMinPrepLogReferencedByMemTable().
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by PrecomputeMinLogNumberToKeep2PC(). The return value of PrecomputeMinLogNumberToKeep2PC()
depends on three components

PrecomputeMinLogNumberToKeepNon2PC(). This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
prep_tracker->FindMinLogContainingOutstandingPrep(). This represents the WAL with a prepare section but the txn hasn't committed.
FindMinPrepLogReferencedByMemTable(). This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.

The bug lies in FindMinPrepLogReferencedByMemTable(). Originally, this function skips checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via FindMinPrepLogReferencedByMemTable().

Test plan:

./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check

facebook-github-bot · 2022-02-15T05:23:45Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2022-02-15T05:27:07Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-02-15T05:28:25Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2022-02-15T06:12:15Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-02-15T06:19:22Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-02-15T06:25:50Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot · 2022-02-15T16:12:53Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-02-15T17:04:19Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

siying

Thanks for the fix! I don't remember why the CF to flush was skipped in the first place. It feels that memtables_to_flush is a better condition and filtering by CF doesn't feel necessary. I hope we don't miss anything this time.

If there is any risk to this fix, it is that for some cases log files that should have been deleted don't get deleted. It doesn't feel like everything is covered by memtables_to_flush so it is correct. Still if there is a way to validate it with unit test, or even manually, it will be good.

riversand963 · 2022-02-16T17:12:05Z

Thanks @siying a lot for the review!

Still if there is a way to validate it with unit test

Could you elaborate on what we should validate? WALs will be deleted eventually?

To please internal infra

facebook-github-bot · 2022-02-17T05:02:20Z

@riversand963 has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2022-02-17T05:04:12Z

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

riversand963 · 2022-02-17T05:21:53Z

@siying Wrote a unit test which showed that 5.log (the WAL containing the prepare section) will not be deleted until next flush which persists the memtable with the data backed by 5.log. I think this behavior is OK.

One issue: even though recovery may flush to L0 files and obsoletes 5.log, it's not deleted by RocksDB in https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl_open.cc#L1777 because current recovery logic does not update VersionSet::min_log_number_to_keep_ after WAL recovery. This can be fixed, but does not seem too bad. I plan to address this in a separate PR.

Summary: The following sequence of events can cause silent data loss for write-committed transactions. ``` Time thread 1 bg flush | db->Put("a") | txn = NewTxn() | txn->Put("b", "v") | txn->Prepare() // writes only to 5.log | db->SwitchMemtable() // memtable 1 has "a" | // close 5.log, | // creates 8.log | trigger flush | pick memtable 1 | unlock db mutex | write new sst | txn->ctwb->Put("gtid", "1") // writes 8.log | txn->Commit() // writes to 8.log | // writes to memtable 2 | compute min_log_number_to_keep_2pc, this | will be 8 (incorrect). | | Purge obsolete wals, including 5.log | V ``` At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"}, only the commit marker of txn is in 8.log The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`. In the above example, when bg flush thread tries to find obsolete wals, it uses the information computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()` depends on three components - `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit. - `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed. - `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data. The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log. We should return this information via `FindMinPrepLogReferencedByMemTable()`. Pull Request resolved: #9571 Test Plan: ``` ./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/* make check ``` Reviewed By: siying Differential Revision: D34235236 Pulled By: riversand963 fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1

Summary: Test only, no change to functionality. Extremely low risk of library regression. Update test key generation by maintaining existing and non-existing keys. Update db_crashtest.py to drive multiops_txn stress test for both write-committed and write-prepared. Add a make target 'blackbox_crash_test_with_multiops_txn'. Running the following commands caught the bug exposed in #9571. ``` $rm -rf /tmp/rocksdbtest/* $./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \ -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 \ -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 \ -ub_c=100 -destroy_db_initially=0 -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0 $./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 \ -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 -ub_c=100 -destroy_db_initially=0 \ -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0 ``` Running the following command caught a bug which will be fixed in #9648 . ``` $TEST_TMPDIR=/dev/shm make blackbox_crash_test_with_multiops_wc_txn ``` Pull Request resolved: #9568 Reviewed By: jay-zhuang Differential Revision: D34308154 Pulled By: riversand963 fbshipit-source-id: 99ff1b65c19b46c471d2f2d3b47adcd342a1b9e7

riversand963 added the WIP Work in progress label Feb 15, 2022

facebook-github-bot added the CLA Signed label Feb 15, 2022

riversand963 changed the title ~~[WIP] Fix a silent data loss for write-committed txn~~ Fix a silent data loss for write-committed txn Feb 15, 2022

riversand963 removed the WIP Work in progress label Feb 15, 2022

riversand963 marked this pull request as ready for review February 15, 2022 05:24

riversand963 requested a review from siying February 15, 2022 18:15

siying approved these changes Feb 16, 2022

View reviewed changes

riversand963 mentioned this pull request Feb 16, 2022

Improve stress test for transactions #9568

Closed

riversand963 added 8 commits February 16, 2022 21:01

Reproduce a data loss bug in write-committed txn

053907e

Fix silent data loss for write-committed txn

eea79a3

Remove extra logging

6f5b620

Update HISTORY

39986d8

Clang analyze

b98023e

Minor

d6f4a97

Use BYPASS instead of SKIP

f2758b1

To please internal infra

Add some assertion for clarification

30df1d5

riversand963 force-pushed the data-loss-wc branch from 52b8d48 to 30df1d5 Compare February 17, 2022 05:02

facebook-github-bot closed this in 1cda273 Feb 17, 2022

riversand963 deleted the data-loss-wc branch February 17, 2022 15:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix a silent data loss for write-committed txn #9571

Fix a silent data loss for write-committed txn #9571

riversand963 commented Feb 15, 2022 •

edited

Loading

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

siying left a comment

riversand963 commented Feb 16, 2022

facebook-github-bot commented Feb 17, 2022

facebook-github-bot commented Feb 17, 2022

riversand963 commented Feb 17, 2022 •

edited

Loading

Fix a silent data loss for write-committed txn #9571

Fix a silent data loss for write-committed txn #9571

Conversation

riversand963 commented Feb 15, 2022 • edited Loading

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

facebook-github-bot commented Feb 15, 2022

siying left a comment

Choose a reason for hiding this comment

riversand963 commented Feb 16, 2022

facebook-github-bot commented Feb 17, 2022

facebook-github-bot commented Feb 17, 2022

riversand963 commented Feb 17, 2022 • edited Loading

riversand963 commented Feb 15, 2022 •

edited

Loading

riversand963 commented Feb 17, 2022 •

edited

Loading