
Fix bug causing incorrect data returned by snapshot read #9648

Closed

Conversation

riversand963
Contributor

This bug affects use cases that meet both of the following conditions:

  • the DB has only the default column family, or disables the WAL, and
  • the DB has at least one event listener.

Atomic flush is NOT affected.

When both conditions are met, RocksDB can release the db mutex before picking all the
existing memtables to flush. In that window, a snapshot can be created and the db's sequence
number can still be incremented. The upcoming flush then ignores this snapshot, so a later
read using it can return incorrect results.

To fix this issue, we call the listeners' callbacks after picking the memtables, so that
no snapshot can be created in between.
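
In other words, the fix moves memtable picking ahead of the user callbacks so that picking happens while the db mutex is still held. Below is a minimal sketch of that ordering; all names (`FlushSketch`, `PickMemtables`, `NotifyListenersOnFlushBegin`, `DbState`) are invented stand-ins, not RocksDB's actual internals:

```cpp
// Minimal sketch of the reordering; these types and functions are
// illustrative stand-ins, not RocksDB's actual internals.
#include <mutex>
#include <vector>

struct Memtable {};

struct DbState {
  std::mutex db_mutex;  // stands in for the db mutex
  std::vector<Memtable*> immutable_memtables;
};

std::vector<Memtable*> PickMemtables(DbState& db) {
  // In the real code, the set of memtables to flush (and hence the range of
  // sequence numbers the flush covers) is decided here.
  return db.immutable_memtables;
}

void NotifyListenersOnFlushBegin(DbState&) {
  // User-supplied event listener callbacks; they may be slow, so the db
  // mutex must be released while they run.
}

void FlushSketch(DbState& db) {
  // Buggy ordering: notify listeners first (mutex released), then pick
  // memtables. A snapshot created inside that window is invisible to the
  // flush. Fixed ordering, as in this PR: pick first, then notify.
  db.db_mutex.lock();
  std::vector<Memtable*> mems = PickMemtables(db);
  db.db_mutex.unlock();
  NotifyListenersOnFlushBegin(db);  // snapshots created now are accounted for
  // ... write `mems` out to an SST file ...
  (void)mems;
}
```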

Test plan
make check
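
For illustration, the race can be sketched against public RocksDB APIs (`EventListener::OnFlushBegin`, `GetSnapshot`, snapshot reads). This is a hypothetical repro, not the PR's actual test: the `PausingListener` scaffolding, the scratch path, and the exact interleaving are assumptions.

```cpp
// Hypothetical repro sketch (NOT the PR's actual test). Only public RocksDB
// APIs are used; the listener scaffolding and names are invented.
#include <rocksdb/db.h>
#include <rocksdb/listener.h>

#include <cassert>
#include <condition_variable>
#include <memory>
#include <mutex>
#include <string>
#include <thread>

// Pauses inside OnFlushBegin, which the buggy code invoked with the db mutex
// released but before the flush had picked its memtables.
class PausingListener : public rocksdb::EventListener {
 public:
  void OnFlushBegin(rocksdb::DB*, const rocksdb::FlushJobInfo&) override {
    std::unique_lock<std::mutex> lk(mu_);
    if (!armed_) {
      return;  // let earlier flushes pass through untouched
    }
    paused_ = true;
    cv_.notify_all();
    cv_.wait(lk, [&] { return resumed_; });
  }
  void Arm() {
    std::lock_guard<std::mutex> lk(mu_);
    armed_ = true;
  }
  void WaitUntilPaused() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&] { return paused_; });
  }
  void Resume() {
    std::lock_guard<std::mutex> lk(mu_);
    resumed_ = true;
    cv_.notify_all();
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  bool armed_ = false;
  bool paused_ = false;
  bool resumed_ = false;
};

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  auto listener = std::make_shared<PausingListener>();
  opts.listeners.push_back(listener);

  rocksdb::DB* db = nullptr;
  rocksdb::Status open_s = rocksdb::DB::Open(opts, "/tmp/pr9648_repro", &db);
  assert(open_s.ok());

  db->Put(rocksdb::WriteOptions(), "key", "v0");
  db->Flush(rocksdb::FlushOptions());          // "v0" now lives in an SST file
  db->Delete(rocksdb::WriteOptions(), "key");  // tombstone in the memtable

  listener->Arm();
  std::thread flusher([&] { db->Flush(rocksdb::FlushOptions()); });
  listener->WaitUntilPaused();

  // Race window: the flush has released the db mutex and, in the buggy
  // version, has not yet picked its memtables.
  const rocksdb::Snapshot* snap = db->GetSnapshot();  // should see the delete
  db->Put(rocksdb::WriteOptions(), "key", "v2");      // seqno moves past snap

  listener->Resume();
  flusher.join();

  rocksdb::ReadOptions ropts;
  ropts.snapshot = snap;
  std::string value;
  rocksdb::Status s = db->Get(ropts, "key", &value);
  // Correct behavior: NotFound, because the snapshot observed the delete.
  // With the bug, the flush ignores `snap`, the tombstone can be collapsed
  // into "v2", and the deleted "v0" can resurface from the older SST.
  assert(s.IsNotFound());

  db->ReleaseSnapshot(snap);
  delete db;
  return 0;
}
```

Under the buggy ordering, the flush picks memtables that already contain writes sequenced after `snap` while treating `snap` as nonexistent, which is how a tombstone can be dropped and deleted data resurfaced.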

@facebook-github-bot
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

ajkr (Contributor) left a comment


LGTM! Nice test case

HISTORY.md (outdated)
```diff
@@ -5,7 +5,8 @@
 * Added BlobDB options to `ldb`
 
 ### Bug Fixes
-* * Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496)
+* Fixed a data race on `versions_` between `DBImpl::ResumeImpl()` and threads waiting for recovery to complete (#9496)
+* Fixed a bug due to db mutex release causing incorrect result returned for snapshot read.
```

@ajkr ajkr Mar 3, 2022


"db mutex release" - this isn't user API level or very concrete. Should we say an unlikely race condition and/or list the conditions? Maybe we say the symptom too: queries to snapshots created with this race condition may return old values, including resurfacing deleted data.

riversand963 (Contributor, Author)


Thanks @ajkr for the review. Updated HISTORY

@facebook-github-bot
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@riversand963 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@riversand963 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963 riversand963 deleted the fix-tombstone-snapshot branch March 3, 2022 05:27
ajkr pushed a commit that referenced this pull request Mar 3, 2022
Summary:
This bug affects use cases that meet both of the following conditions:
- the DB has only the default column family, or disables the WAL, and
- the DB has at least one event listener.

Atomic flush is NOT affected.

When both conditions are met, RocksDB can release the db mutex before picking all the
existing memtables to flush. In that window, a snapshot can be created and the db's sequence
number can still be incremented. The upcoming flush then ignores this snapshot, so a later
read using it can return incorrect results.

To fix this issue, we call the listeners' callbacks after picking the memtables, so that
no snapshot can be created in between.

Pull Request resolved: #9648

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D34555456

Pulled By: riversand963

fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee
ajkr pushed a commit that referenced this pull request Mar 7, 2022
Summary:
This bug affects use cases that meet both of the following conditions:
- the DB has only the default column family, or disables the WAL, and
- the DB has at least one event listener.

Atomic flush is NOT affected.

When both conditions are met, RocksDB can release the db mutex before picking all the
existing memtables to flush. In that window, a snapshot can be created and the db's sequence
number can still be incremented. The upcoming flush then ignores this snapshot, so a later
read using it can return incorrect results.

To fix this issue, we call the listeners' callbacks after picking the memtables, so that
no snapshot can be created in between.

Pull Request resolved: #9648

Test Plan: make check

Reviewed By: ajkr

Differential Revision: D34555456

Pulled By: riversand963

fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee
facebook-github-bot pushed a commit that referenced this pull request Mar 17, 2022
Summary:
Test only, no change to functionality.
Extremely low risk of library regression.

Update test key generation by maintaining existing and non-existing keys.
Update db_crashtest.py to drive multiops_txn stress test for both write-committed and write-prepared.
Add a make target 'blackbox_crash_test_with_multiops_txn'.

Running the following commands caught the bug exposed in #9571.
```
$ rm -rf /tmp/rocksdbtest/*
$ ./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \
    -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 \
    -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 \
    -ub_c=100 -destroy_db_initially=0 -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
$ ./db_stress -progress_reports=0 -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 \
    -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 \
    -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=1000 -ub_a=10000 -ub_c=100 -destroy_db_initially=0 \
    -key_spaces_path=/dev/shm/key_spaces_desc -threads=32 -read_fault_one_in=0
```

Running the following command caught a bug that will be fixed in #9648.
```
$ TEST_TMPDIR=/dev/shm make blackbox_crash_test_with_multiops_wc_txn
```

Pull Request resolved: #9568

Reviewed By: jay-zhuang

Differential Revision: D34308154

Pulled By: riversand963

fbshipit-source-id: 99ff1b65c19b46c471d2f2d3b47adcd342a1b9e7