Exposing corrupted key ranges encountered during compaction #3537

amytai · 2018-02-24T06:47:59Z

Note: this WILL NOT be pushed into the main rocksdb branch!! Only creating PR for commenting and code reviews.

Summary: Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence: 1. User call Flush() with flush_options.wait = true 2. The manual flush started in the background 3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached. 4. The manual flush finish. Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed. Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush. Closes facebook#3378 Differential Revision: D6746789 Pulled By: yiwu-arbug fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa

Summary: Most popular versions of GCC can't identify platform on ARM if "-march=native" is specified. Remove it to unblock most people. Closes facebook#3346 Differential Revision: D6690544 Pulled By: siying fbshipit-source-id: bbaba9fe2645b6b37144b36ea75beeff88992b49

Summary: Tested on a clean FreeBSD 11.01 x64. Closes facebook#1423 Closes facebook#3357 Differential Revision: D6705868 Pulled By: sagar0 fbshipit-source-id: cbccbbdafd4f42922512ca03619a5d5583a425fd

Summary: Java build on PPC64le has been broken since a few months, due to facebook#2716. Fixing it with the least amount of changes. (We should cleanup a little around this code when time permits). This should fix the build failures seen in http://140.211.168.68:8080/job/Rocksdb/ . Closes facebook#3359 Differential Revision: D6712938 Pulled By: sagar0 fbshipit-source-id: 3046e8f072180693de2af4762934ec1ace309ca4

Summary: Allow StackableDB optionally takes a shared_ptr on construction and thus hold shared ownership of the underlying DB. Closes facebook#3423 Differential Revision: D6824163 Pulled By: yiwu-arbug fbshipit-source-id: dbdc30c42e007533a987ef413785e192340f03eb

Summary: This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Constraints: - The prefetched data is stored by the OS in page cache. So this currently works only for non direct-reads use-cases i.e applications which use page cache. (Direct-I/O support will be enabled in a later PR). - This gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). Thanks to siying for the original idea and implementation. **Benchmarks:** Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` Page cache was cleared before each experiment with the command: ``` sudo sh -c "echo 3 > /proc/sys/vm/drop_caches" ``` ``` Before: seekrandom : 34020.945 micros/op 29 ops/sec; 32.5 MB/s (1636 of 1999 found) With this change: seekrandom : 8726.912 micros/op 114 ops/sec; 126.8 MB/s (5702 of 6999 found) ``` ~3.9X performance improvement. Also verified with strace and gdb that the readahead size is increasing as expected. ``` strace -e readahead -f -T -t -p <db_bench process pid> ``` Closes facebook#3282 Differential Revision: D6586477 Pulled By: sagar0 fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c

Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues. Reviewed By: yiwu-arbug Differential Revision: D6821806 fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c

Summary: In the test, there can be a dead lock between background flush thread and foreground main thread as following: * background flush thread: - holding db mutex, while - waiting on "DBImpl::FlushMemTableToOutputFile:BeforeInstallSV" sync point. * foreground thread: - waiting for db mutex to write "key2" Fixing by let background flush thread wait without holding db mutex. Closes facebook#3436 Differential Revision: D6841334 Pulled By: yiwu-arbug fbshipit-source-id: b020768ac94e166e40953c5d09e505515a5f244d

Summary: Workaround a bunch of "implicit-fallthrough" compiler errors, like: ``` util/crc32c.cc:533:7: error: this statement may fall through [-Werror=implicit-fallthrough=] crc = _mm_crc32_u64(crc, *(uint64_t*)(buf + offset)); ^ util/crc32c.cc:1016:9: note: in expansion of macro ‘CRCsinglet’ CRCsinglet(crc0, next, -2 * 8); ^~~~~~~~~~ util/crc32c.cc:1017:7: note: here case 1: ``` Closes facebook#3339 Reviewed By: sagar0 Differential Revision: D6874736 Pulled By: quark-zju fbshipit-source-id: eec9f3bc135e12fca336928d01711006d5c3cb16

Summary: We don't do fsync() after truncate in direct I/O writeable file (in fact we don't do any fsync ever). This can cause metadata not persistent to disk after the file is generated. We call it instead. Closes facebook#3500 Differential Revision: D6981482 Pulled By: siying fbshipit-source-id: 7e2b591b7e5dd1b96fc0775515b8b9e6092980ef

Summary: Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_. This deadlock is hit all the time on our workload. It blocks our release. In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so. So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them. This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup. I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine. Closes facebook#3510 Reviewed By: sagar0 Differential Revision: D7005346 Pulled By: al13n321 fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f

Summary: Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped. Closes facebook#3525 Differential Revision: D7033694 Pulled By: sagar0 fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8

Summary: Java static builds are again broken, this time due Snappy version upgrade introduced in 90c1d81 (facebook#3331). This is due to two reasons: 1. The new Snappy packages should now be downloaded from https://github.com/google/snappy/archive/<pkg.tar.gz> instead of https://github.com/google/snappy/releases/download/<pkg.tar.gz> which we are using now. 1. In addition to the the above URL change, Snappy changed its build from using autotools to CMake based : https://github.com/google/snappy/blame/e69d9f880677f2aa3488c80b953ec4309f0dfa2e/README.md#L65-L72 So more changes are needed if we are going to upgrade to 1.1.7. Hence reverting the version upgrade until we figure them out. Closes facebook#3363 Differential Revision: D6716983 Pulled By: sagar0 fbshipit-source-id: f451a1bc5eb0bb090f4da07bc3e5ba72cf89aefa

Summary: Use the rsync tempfile naming convention in our `BackupEngine`. The temp file follows the format, `.<filename>.<suffix>`, which is later renamed to `<filename>`. We fix `tmp` as the `<suffix>` as we don't need to use random bytes for now. The benefit is gluster treats this tempfile naming convention specially and applies hashing only to `<filename>`, so the file won't need to be linked or moved when it's renamed. Our gluster team suggested this will make things operationally easier. Closes facebook#3463 Differential Revision: D6893333 Pulled By: ajkr fbshipit-source-id: fd7622978f4b2487fce33cde40dd3124f16bcaa8

ajkr

nice! good job figuring out our codebase and getting something working :)

ajkr · 2018-02-24T19:39:34Z

db/compaction_job.cc

+        }
+      } else {
+        //Then we are at the end of the index block.
+        first_level_iter->SeekToLast();


i think the only difference between this block and the above one is starting with SeekToLast (corruption was in final block) instead of Prev (corruption was in non-final block). you can probably share the rest of the logic.

Good point, cleaned up the copy-pasta.

facebook-github-bot · 2018-02-24T23:56:18Z