Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a silent data loss for write-committed txn #9571

Closed
wants to merge 8 commits into from

Conversation

riversand963
Copy link
Contributor

@riversand963 riversand963 commented Feb 15, 2022

The following sequence of events can cause silent data loss for write-committed
transactions.

Time    thread 1                                       bg flush
 |   db->Put("a")
 |   txn = NewTxn()
 |   txn->Put("b", "v")
 |   txn->Prepare()       // writes only to 5.log
 |   db->SwitchMemtable() // memtable 1 has "a"
 |                        // close 5.log,
 |                        // creates 8.log
 |   trigger flush
 |                                                  pick memtable 1
 |                                                  unlock db mutex
 |                                                  write new sst
 |   txn->ctwb->Put("gtid", "1") // writes 8.log
 |   txn->Commit() // writes to 8.log
 |                 // writes to memtable 2
 |                                               compute min_log_number_to_keep_2pc, this
 |                                               will be 8 (incorrect).
 |
 |                                             Purge obsolete wals, including 5.log
 |
 V

At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in
memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"},
only the commit marker of txn is in 8.log

The reason lies in PrecomputeMinLogNumberToKeep2PC() which calls FindMinPrepLogReferencedByMemTable().
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by PrecomputeMinLogNumberToKeep2PC(). The return value of PrecomputeMinLogNumberToKeep2PC()
depends on three components

  • PrecomputeMinLogNumberToKeepNon2PC(). This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
  • prep_tracker->FindMinLogContainingOutstandingPrep(). This represents the WAL with a prepare section but the txn hasn't committed.
  • FindMinPrepLogReferencedByMemTable(). This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.

The bug lies in FindMinPrepLogReferencedByMemTable(). Originally, this function skips checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via FindMinPrepLogReferencedByMemTable().

Test plan:

./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check

@riversand963 riversand963 added the WIP Work in progress label Feb 15, 2022
@riversand963 riversand963 changed the title [WIP] Fix a silent data loss for write-committed txn Fix a silent data loss for write-committed txn Feb 15, 2022
@riversand963 riversand963 removed the WIP Work in progress label Feb 15, 2022
@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963 riversand963 marked this pull request as ready for review February 15, 2022 05:24
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

1 similar comment
@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Copy link
Contributor

@siying siying left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! I don't remember why the CF to flush was skipped in the first place. It feels that memtables_to_flush is a better condition and filtering by CF doesn't feel necessary. I hope we don't miss anything this time.

If there is any risk to this fix, it is that for some cases log files that should have been deleted don't get deleted. It doesn't feel like everything is covered by memtables_to_flush so it is correct. Still if there is a way to validate it with unit test, or even manually, it will be good.

@riversand963
Copy link
Contributor Author

Thanks @siying a lot for the review!

Still if there is a way to validate it with unit test

Could you elaborate on what we should validate? WALs will be deleted eventually?

@facebook-github-bot
Copy link
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963
Copy link
Contributor Author

riversand963 commented Feb 17, 2022

@siying Wrote a unit test which showed that 5.log (the WAL containing the prepare section) will not be deleted until next flush which persists the memtable with the data backed by 5.log. I think this behavior is OK.

One issue: even though recovery may flush to L0 files and obsoletes 5.log, it's not deleted by RocksDB in https://github.com/facebook/rocksdb/blob/6.29.fb/db/db_impl/db_impl_open.cc#L1777 because current recovery logic does not update VersionSet::min_log_number_to_keep_ after WAL recovery. This can be fixed, but does not seem too bad. I plan to address this in a separate PR.

@riversand963 riversand963 deleted the data-loss-wc branch February 17, 2022 15:04
riversand963 added a commit that referenced this pull request Feb 17, 2022
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time    thread 1                                       bg flush
 |   db->Put("a")
 |   txn = NewTxn()
 |   txn->Put("b", "v")
 |   txn->Prepare()       // writes only to 5.log
 |   db->SwitchMemtable() // memtable 1 has "a"
 |                        // close 5.log,
 |                        // creates 8.log
 |   trigger flush
 |                                                  pick memtable 1
 |                                                  unlock db mutex
 |                                                  write new sst
 |   txn->ctwb->Put("gtid", "1") // writes 8.log
 |   txn->Commit() // writes to 8.log
 |                 // writes to memtable 2
 |                                               compute min_log_number_to_keep_2pc, this
 |                                               will be 8 (incorrect).
 |
 |                                             Purge obsolete wals, including 5.log
 |
 V
```

At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in
memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"},
only the commit marker of txn is in 8.log

The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.

The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.

Pull Request resolved: #9571

Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```

Reviewed By: siying

Differential Revision: D34235236

Pulled By: riversand963

fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
facebook-github-bot pushed a commit that referenced this pull request Mar 17, 2022
Summary:
Test only, no change to functionality.
Extremely low risk of library regression.

Update test key generation by maintaining existing and non-existing keys.
Update db_crashtest.py to drive multiops_txn stress test for both write-committed and write-prepared.
Add a make target 'blackbox_crash_test_with_multiops_txn'.

Running the following commands caught the bug exposed in #9571.
```
$rm -rf /tmp/rocksdbtest/*
$./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \
    -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 \
   -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 \
   -ub_c=100 -destroy_db_initially=0 -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
$./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0
   -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 \
   -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 -ub_c=100 -destroy_db_initially=0 \
   -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
```

Running the following command caught a bug which will be fixed in #9648 .
```
$TEST_TMPDIR=/dev/shm make blackbox_crash_test_with_multiops_wc_txn
```

Pull Request resolved: #9568

Reviewed By: jay-zhuang

Differential Revision: D34308154

Pulled By: riversand963

fbshipit-source-id: 99ff1b65c19b46c471d2f2d3b47adcd342a1b9e7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants