-
Notifications
You must be signed in to change notification settings - Fork 6.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exposing corrupted key ranges encountered during compaction #3537
Conversation
Summary: Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence: 1. User call Flush() with flush_options.wait = true 2. The manual flush started in the background 3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached. 4. The manual flush finish. Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed. Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush. Closes facebook#3378 Differential Revision: D6746789 Pulled By: yiwu-arbug fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
Summary: Most popular versions of GCC can't identify platform on ARM if "-march=native" is specified. Remove it to unblock most people. Closes facebook#3346 Differential Revision: D6690544 Pulled By: siying fbshipit-source-id: bbaba9fe2645b6b37144b36ea75beeff88992b49
Summary: Tested on a clean FreeBSD 11.01 x64. Closes facebook#1423 Closes facebook#3357 Differential Revision: D6705868 Pulled By: sagar0 fbshipit-source-id: cbccbbdafd4f42922512ca03619a5d5583a425fd
Summary: Java build on PPC64le has been broken since a few months, due to facebook#2716. Fixing it with the least amount of changes. (We should cleanup a little around this code when time permits). This should fix the build failures seen in http://140.211.168.68:8080/job/Rocksdb/ . Closes facebook#3359 Differential Revision: D6712938 Pulled By: sagar0 fbshipit-source-id: 3046e8f072180693de2af4762934ec1ace309ca4
Summary: Allow StackableDB optionally takes a shared_ptr on construction and thus hold shared ownership of the underlying DB. Closes facebook#3423 Differential Revision: D6824163 Pulled By: yiwu-arbug fbshipit-source-id: dbdc30c42e007533a987ef413785e192340f03eb
Summary: This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Constraints: - The prefetched data is stored by the OS in page cache. So this currently works only for non direct-reads use-cases i.e applications which use page cache. (Direct-I/O support will be enabled in a later PR). - This gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). Thanks to siying for the original idea and implementation. **Benchmarks:** Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` Page cache was cleared before each experiment with the command: ``` sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" ``` ``` Before: seekrandom : 34020.945 micros/op 29 ops/sec; 32.5 MB/s (1636 of 1999 found) With this change: seekrandom : 8726.912 micros/op 114 ops/sec; 126.8 MB/s (5702 of 6999 found) ``` ~3.9X performance improvement. Also verified with strace and gdb that the readahead size is increasing as expected. ``` strace -e readahead -f -T -t -p <db_bench process pid> ``` Closes facebook#3282 Differential Revision: D6586477 Pulled By: sagar0 fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues. Reviewed By: yiwu-arbug Differential Revision: D6821806 fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
Summary: In the test, there can be a dead lock between background flush thread and foreground main thread as following: * background flush thread: - holding db mutex, while - waiting on "DBImpl::FlushMemTableToOutputFile:BeforeInstallSV" sync point. * foreground thread: - waiting for db mutex to write "key2" Fixing by let background flush thread wait without holding db mutex. Closes facebook#3436 Differential Revision: D6841334 Pulled By: yiwu-arbug fbshipit-source-id: b020768ac94e166e40953c5d09e505515a5f244d
Summary: Workaround a bunch of "implicit-fallthrough" compiler errors, like: ``` util/crc32c.cc:533:7: error: this statement may fall through [-Werror=implicit-fallthrough=] crc = _mm_crc32_u64(crc, *(uint64_t*)(buf + offset)); ^ util/crc32c.cc:1016:9: note: in expansion of macro ‘CRCsinglet’ CRCsinglet(crc0, next, -2 * 8); ^~~~~~~~~~ util/crc32c.cc:1017:7: note: here case 1: ``` Closes facebook#3339 Reviewed By: sagar0 Differential Revision: D6874736 Pulled By: quark-zju fbshipit-source-id: eec9f3bc135e12fca336928d01711006d5c3cb16
Summary: We don't do fsync() after truncate in direct I/O writeable file (in fact we don't do any fsync ever). This can cause metadata not persistent to disk after the file is generated. We call it instead. Closes facebook#3500 Differential Revision: D6981482 Pulled By: siying fbshipit-source-id: 7e2b591b7e5dd1b96fc0775515b8b9e6092980ef
Summary: Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_. This deadlock is hit all the time on our workload. It blocks our release. In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so. So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them. This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup. I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine. Closes facebook#3510 Reviewed By: sagar0 Differential Revision: D7005346 Pulled By: al13n321 fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
Summary: Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped. Closes facebook#3525 Differential Revision: D7033694 Pulled By: sagar0 fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
Summary: Java static builds are again broken, this time due Snappy version upgrade introduced in 90c1d81 (facebook#3331). This is due to two reasons: 1. The new Snappy packages should now be downloaded from https://github.com/google/snappy/archive/<pkg.tar.gz> instead of https://github.com/google/snappy/releases/download/<pkg.tar.gz> which we are using now. 1. In addition to the the above URL change, Snappy changed its build from using autotools to CMake based : https://github.com/google/snappy/blame/e69d9f880677f2aa3488c80b953ec4309f0dfa2e/README.md#L65-L72 So more changes are needed if we are going to upgrade to 1.1.7. Hence reverting the version upgrade until we figure them out. Closes facebook#3363 Differential Revision: D6716983 Pulled By: sagar0 fbshipit-source-id: f451a1bc5eb0bb090f4da07bc3e5ba72cf89aefa
Summary: Use the rsync tempfile naming convention in our `BackupEngine`. The temp file follows the format, `.<filename>.<suffix>`, which is later renamed to `<filename>`. We fix `tmp` as the `<suffix>` as we don't need to use random bytes for now. The benefit is gluster treats this tempfile naming convention specially and applies hashing only to `<filename>`, so the file won't need to be linked or moved when it's renamed. Our gluster team suggested this will make things operationally easier. Closes facebook#3463 Differential Revision: D6893333 Pulled By: ajkr fbshipit-source-id: fd7622978f4b2487fce33cde40dd3124f16bcaa8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! good job figuring out our codebase and getting something working :)
db/compaction_job.cc
Outdated
} | ||
} else { | ||
//Then we are at the end of the index block. | ||
first_level_iter->SeekToLast(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the only difference between this block and the above one is starting with SeekToLast (corruption was in final block) instead of Prev (corruption was in non-final block). you can probably share the rest of the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, cleaned up the copy-pasta.
Summary: `ReadaheadRandomAccessFile` is used by iterators for file reads in several cases, like in compaction when `compaction_readahead_size > 0` or `use_direct_io_for_flush_and_compaction == true`, or in user iterator when `ReadOptions::readahead_size > 0`. `ReadaheadRandomAccessFile` maintains an internal buffer for readahead data. It assumes that, if the buffer's length is less than `ReadaheadRandomAccessFile::readahead_size_`, which is fixed in the constructor, then EOF has been reached so it doesn't try reading further. Recently, d938226 started calling `RandomAccessFile::Prefetch` with various lengths: 8KB, 16KB, etc. When the `RandomAccessFile` is a `ReadaheadRandomAccessFile`, it triggers the above condition and incorrectly determines EOF. If a block is partially in the readahead buffer and EOF is incorrectly decided, the result is a truncated data block. The problem is reproducible: ``` TEST_TMPDIR=/data/compaction_bench ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -block_size=18384 -use_direct_io_for_flush_and_compaction=true ... put error: Corruption: truncated block read from /data/compaction_bench/dbbench/000014.sst offset 20245, expected 10143 bytes, got 8427 ``` Closes facebook#3454 Differential Revision: D6869405 Pulled By: ajkr fbshipit-source-id: 87001c299e7600a37c0dcccbd0368e0954c929cf
Summary: - Change directory name from "db_test" to "checkpoint_test". Previously it used the same directory as `db_test` - Systematically cleanup snapshot and snapshot staging directories before each test. Previously a failed test run caused subsequent runs to fail, particularly when the first failure caused "snapshot.tmp" to not be cleaned up. Closes facebook#3351 Differential Revision: D6691015 Pulled By: ajkr fbshipit-source-id: 4fc2ac2e21ff2617ea0e96297c5132b5f2eefd79
Summary: Add a new bool option index_uncompressed in BlockBasedTableOptions. Closes facebook#3303 Differential Revision: D6686161 Pulled By: anand1976 fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054
Summary: It was using the same directory as `db_options_test` so transiently failed when unit tests were run in parallel. Closes facebook#3352 Differential Revision: D6691649 Pulled By: ajkr fbshipit-source-id: bee433484fec4faedd5cadf2db3c92fdcc99a170
Summary: Re-use metadata for reading Compression Dictionary on BlockBased table open, this saves two reads from disk. This helps to our 999 percentile in 5.6.1 where prefetch buffer is not present. Closes facebook#3354 Differential Revision: D6695753 Pulled By: ajkr fbshipit-source-id: bb8acd9e9e66e65b89c548ab8940570ae360333c
Summary: Closes facebook#3356 Differential Revision: D6706909 Pulled By: sagar0 fbshipit-source-id: 6e4757d9eceab3e8a6c1b83c1be4108e86576cb2
Summary: This replaces a vague warning about the mostly-obsolete ext3 filesystem with a more detailed note about a historical bug in the still-relevant ext4. Fixes facebook#3410 Closes facebook#3421 Differential Revision: D6834881 Pulled By: siying fbshipit-source-id: 7771ef5c89a54c0ac17821680779c48178d0b400
Summary: Closes facebook#3428 Differential Revision: D6834061 Pulled By: maysamyabandeh fbshipit-source-id: edca5b5b8330e0fee646c7434b9631da76670240
Summary: It failed every time. I guess people usually ran with assertions disabled. Closes facebook#3422 Differential Revision: D6822984 Pulled By: ajkr fbshipit-source-id: 2e90db75618b26ac1c46ddfa9e03c095c7bf16e3
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues. Reviewed By: yiwu-arbug Differential Revision: D6821806 fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
Summary: This is to avoid run time error. Fail the db_bench immediately if cuckoo table is used but mmap_read is not specified. Closes facebook#3420 Differential Revision: D6838284 Pulled By: siying fbshipit-source-id: 20893fa28d40fadc31e4ff154bed02f5a1bad341
Summary: ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache. The memory used by the data block is, however, not trackable by users. To make the system more manageable, we can cost the block to block cache while using it, and then release it after using. Closes facebook#3333 Differential Revision: D6670230 Pulled By: miasantreble fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308
Summary: Added a test for three dynamic universal compaction options, in the realm of read amplification: - size_ratio - min_merge_width - max_merge_width Also updated DynamicUniversalCompactionSizeAmplification by adding a check on compaction reason. Found a bug in compaction reason setting while working on this PR, and fixed in facebook#3412 . TODO for later: Still to add tests for these options: compression_size_percent, stop_style and trivial_move. Closes facebook#3419 Differential Revision: D6822217 Pulled By: sagar0 fbshipit-source-id: 074573fca6389053cbac229891a0163f38bb56c4
Summary: SnapshotConcurrentAccessTest sometimes times out when running on the test infra. This patch splits the test into smaller sub-tests to avoid the timeout. It also benefits from lower run-time of each sub-test and increases the coverage of the test. The overall run-time of each final sub-test is at most half of the original test so we should no longer see a timeout. Closes facebook#3435 Differential Revision: D6839427 Pulled By: maysamyabandeh fbshipit-source-id: d53fdb157109e2438ca7fe447d0cf4b71f304bd8
Summary: previously if `checkpoint_dir` contained a trailing slash, we'd attempt to create the `.tmp` directory under `checkpoint_dir` due to simply concatenating `checkpoint_dir + ".tmp"`. This failed because `checkpoint_dir` hadn't been created yet and our directory creation is non-recursive. This PR fixes the issue by always creating the `.tmp` directory in the same parent as `checkpoint_dir` by stripping trailing slashes before concatenating. Closes facebook#3275 Differential Revision: D6574952 Pulled By: ajkr fbshipit-source-id: a6daa6777a901eac2460cd0140c9515f7241aefc
Summary: Fixes this build error on master (macOS): ``` table/table_test.cc:972:27: error: implicit conversion loses integer precision: 'size_t' (aka 'unsigned long') to 'unsigned int' [-Werror,-Wshorten-64-to-32] ``` Closes facebook#3434 Reviewed By: maysamyabandeh Differential Revision: D6840354 Pulled By: gfosco fbshipit-source-id: fffac6aefbbdd134ce1299453c5590aa855a5fc8
Summary: Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because `VersionSet::LogAndApply` is expensive. This PR adds a new `DeleteFilesInRange` function to delete files in multiple ranges at once. Close facebook#2951 Closes facebook#3431 Differential Revision: D6849228 Pulled By: ajkr fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
… injected into the metadata files (change this in future
…uption_injection....h
… when encountered during compaction. To do this, also exposed a new EventListener::OnSpecialBackgroundError function, which will expose the key-range as a parameter. *WILL NOT BE PUSHED to main branchstatus
…ion status. So I changed TwoLevelIterator to store corrupted ranges, because otherwise doing it in the overarching compaction will only get the last index block (which may or may not be corrupted).
@amytai has updated the pull request. |
Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired. Before we can review or merge your code, we need you to email cla@fb.com with your details so we can update your status. |
not intended for landing |
Note: this WILL NOT be pushed into the main rocksdb branch!! Only creating PR for commenting and code reviews.