Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exposing corrupted key ranges encountered during compaction #3537

Closed
wants to merge 84 commits into from

Conversation

amytai
Copy link
Contributor

@amytai amytai commented Feb 24, 2018

Note: this WILL NOT be pushed into the main rocksdb branch!! Only creating PR for commenting and code reviews.

miasantreble and others added 16 commits January 9, 2018 16:48
Summary:
Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence:
1. User call Flush() with flush_options.wait = true
2. The manual flush started in the background
3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached.
4. The manual flush finish.

Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed.

Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush.
Closes facebook#3378

Differential Revision: D6746789

Pulled By: yiwu-arbug

fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
Summary:
Most popular versions of GCC can't identify platform on ARM if "-march=native" is specified. Remove it to unblock most people.
Closes facebook#3346

Differential Revision: D6690544

Pulled By: siying

fbshipit-source-id: bbaba9fe2645b6b37144b36ea75beeff88992b49
Summary:
Tested on a clean FreeBSD 11.01 x64.

Closes facebook#1423
Closes facebook#3357

Differential Revision: D6705868

Pulled By: sagar0

fbshipit-source-id: cbccbbdafd4f42922512ca03619a5d5583a425fd
Summary:
Java build on PPC64le has been broken since a few months, due to facebook#2716. Fixing it with the least amount of changes.
(We should cleanup a little around this code when time permits).

This should fix the build failures seen in http://140.211.168.68:8080/job/Rocksdb/ .
Closes facebook#3359

Differential Revision: D6712938

Pulled By: sagar0

fbshipit-source-id: 3046e8f072180693de2af4762934ec1ace309ca4
Summary:
Allow StackableDB optionally takes a shared_ptr on construction and thus hold shared ownership of the underlying DB.
Closes facebook#3423

Differential Revision: D6824163

Pulled By: yiwu-arbug

fbshipit-source-id: dbdc30c42e007533a987ef413785e192340f03eb
Summary:
This change improves the performance of iterators doing long range scans (e.g. big/full table scans in MyRocks) by using readahead and prefetching additional data on each disk IO. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.

Constraints:
- The prefetched data is stored by the OS in page cache. So this currently works only for non direct-reads use-cases i.e applications which use page cache. (Direct-I/O support will be enabled in a later PR).
- This gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).

Thanks to siying for the original idea and implementation.

**Benchmarks:**
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

Page cache was cleared before each experiment with the command:
```
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
```
```
Before:
seekrandom   :   34020.945 micros/op 29 ops/sec;   32.5 MB/s (1636 of 1999 found)
With this change:
seekrandom   :    8726.912 micros/op 114 ops/sec;  126.8 MB/s (5702 of 6999 found)
```
~3.9X performance improvement.

Also verified with strace and gdb that the readahead size is increasing as expected.
```
strace -e readahead -f -T -t -p <db_bench process pid>
```
Closes facebook#3282

Differential Revision: D6586477

Pulled By: sagar0

fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues.

Reviewed By: yiwu-arbug

Differential Revision: D6821806

fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
Summary:
In the test, there can be a dead lock between background flush thread and foreground main thread as following:
* background flush thread:
  - holding db mutex, while
  - waiting on "DBImpl::FlushMemTableToOutputFile:BeforeInstallSV" sync point.
* foreground thread:
  - waiting for db mutex to write "key2"

Fixing by let background flush thread wait without holding db mutex.
Closes facebook#3436

Differential Revision: D6841334

Pulled By: yiwu-arbug

fbshipit-source-id: b020768ac94e166e40953c5d09e505515a5f244d
Summary:
Workaround a bunch of "implicit-fallthrough" compiler errors, like:

```
util/crc32c.cc:533:7: error: this statement may fall through [-Werror=implicit-fallthrough=]
   crc = _mm_crc32_u64(crc, *(uint64_t*)(buf + offset));
       ^
util/crc32c.cc:1016:9: note: in expansion of macro ‘CRCsinglet’
         CRCsinglet(crc0, next, -2 * 8);
         ^~~~~~~~~~
util/crc32c.cc:1017:7: note: here
       case 1:
```
Closes facebook#3339

Reviewed By: sagar0

Differential Revision: D6874736

Pulled By: quark-zju

fbshipit-source-id: eec9f3bc135e12fca336928d01711006d5c3cb16
Summary:
We don't do fsync() after truncate in direct I/O writeable file (in fact we don't do any fsync ever). This can cause metadata not persistent to disk after the file is generated. We call it instead.
Closes facebook#3500

Differential Revision: D6981482

Pulled By: siying

fbshipit-source-id: 7e2b591b7e5dd1b96fc0775515b8b9e6092980ef
Summary:
Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_.

This deadlock is hit all the time on our workload. It blocks our release.

In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so.

So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them.

This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup.

I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine.
Closes facebook#3510

Reviewed By: sagar0

Differential Revision: D7005346

Pulled By: al13n321

fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
Summary:
Added a new iterator property: `rocksdb.iterator.internal-key` to get the internal-key (converted to user key) at which the iterator stopped.
Closes facebook#3525

Differential Revision: D7033694

Pulled By: sagar0

fbshipit-source-id: d51e6c00f5e9d766c6276ef79774b81c6c5216f8
Summary:
Java static builds are again broken, this time due Snappy version upgrade introduced in 90c1d81 (facebook#3331).

This is due to two reasons:
1. The new Snappy packages should now be downloaded from https://github.com/google/snappy/archive/<pkg.tar.gz> instead of https://github.com/google/snappy/releases/download/<pkg.tar.gz> which we are using now.
1. In addition to the the above URL change, Snappy changed its build from using autotools to CMake based : https://github.com/google/snappy/blame/e69d9f880677f2aa3488c80b953ec4309f0dfa2e/README.md#L65-L72

So more changes are needed if we are going to upgrade to 1.1.7. Hence reverting the version upgrade until we figure them out.
Closes facebook#3363

Differential Revision: D6716983

Pulled By: sagar0

fbshipit-source-id: f451a1bc5eb0bb090f4da07bc3e5ba72cf89aefa
Summary:
Use the rsync tempfile naming convention in our `BackupEngine`. The temp file follows the format, `.<filename>.<suffix>`, which is later renamed to `<filename>`. We fix `tmp` as the `<suffix>` as we don't need to use random bytes for now. The benefit is gluster treats this tempfile naming convention specially and applies hashing only to `<filename>`, so the file won't need to be linked or moved when it's renamed. Our gluster team suggested this will make things operationally easier.
Closes facebook#3463

Differential Revision: D6893333

Pulled By: ajkr

fbshipit-source-id: fd7622978f4b2487fce33cde40dd3124f16bcaa8
Copy link
Contributor

@ajkr ajkr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! good job figuring out our codebase and getting something working :)

}
} else {
//Then we are at the end of the index block.
first_level_iter->SeekToLast();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the only difference between this block and the above one is starting with SeekToLast (corruption was in final block) instead of Prev (corruption was in non-final block). you can probably share the rest of the logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, cleaned up the copy-pasta.

@facebook-github-bot
Copy link
Contributor

@amytai has updated the pull request. View: changes

@facebook-github-bot
Copy link
Contributor

@amytai has updated the pull request. View: changes

Summary:
`ReadaheadRandomAccessFile` is used by iterators for file reads in several cases, like in compaction when `compaction_readahead_size > 0` or `use_direct_io_for_flush_and_compaction == true`, or in user iterator when `ReadOptions::readahead_size > 0`. `ReadaheadRandomAccessFile` maintains an internal buffer for readahead data. It assumes that, if the buffer's length is less than `ReadaheadRandomAccessFile::readahead_size_`, which is fixed in the constructor, then EOF has been reached so it doesn't try reading further.

Recently, d938226 started calling `RandomAccessFile::Prefetch` with various lengths: 8KB, 16KB, etc. When the `RandomAccessFile` is a `ReadaheadRandomAccessFile`, it triggers the above condition and incorrectly determines EOF. If a block is partially in the readahead buffer and EOF is incorrectly decided, the result is a truncated data block.

The problem is reproducible:

```
TEST_TMPDIR=/data/compaction_bench ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -block_size=18384 -use_direct_io_for_flush_and_compaction=true
...
put error: Corruption: truncated block read from /data/compaction_bench/dbbench/000014.sst offset 20245, expected 10143 bytes, got 8427
```
Closes facebook#3454

Differential Revision: D6869405

Pulled By: ajkr

fbshipit-source-id: 87001c299e7600a37c0dcccbd0368e0954c929cf
@facebook-github-bot
Copy link
Contributor

@amytai has updated the pull request. View: changes

@facebook-github-bot
Copy link
Contributor

@amytai has updated the pull request. View: changes

@facebook-github-bot
Copy link
Contributor

@amytai has updated the pull request. View: changes

ajkr and others added 5 commits March 8, 2018 18:38
Summary:
- Change directory name from "db_test" to "checkpoint_test". Previously it used the same directory as `db_test`
- Systematically cleanup snapshot and snapshot staging directories before each test. Previously a failed test run caused subsequent runs to fail, particularly when the first failure caused "snapshot.tmp" to not be cleaned up.
Closes facebook#3351

Differential Revision: D6691015

Pulled By: ajkr

fbshipit-source-id: 4fc2ac2e21ff2617ea0e96297c5132b5f2eefd79
Summary:
Add a new bool option index_uncompressed in BlockBasedTableOptions.
Closes facebook#3303

Differential Revision: D6686161

Pulled By: anand1976

fbshipit-source-id: 748b46993d48a01e5f89b6bd3e41f06a59ec6054
Summary:
It was using the same directory as `db_options_test` so transiently failed when unit tests were run in parallel.
Closes facebook#3352

Differential Revision: D6691649

Pulled By: ajkr

fbshipit-source-id: bee433484fec4faedd5cadf2db3c92fdcc99a170
Summary:
Re-use metadata for reading Compression Dictionary on BlockBased
  table open, this saves two reads from disk.
  This helps to our 999 percentile in 5.6.1 where prefetch buffer is  not present.
Closes facebook#3354

Differential Revision: D6695753

Pulled By: ajkr

fbshipit-source-id: bb8acd9e9e66e65b89c548ab8940570ae360333c
Summary: Closes facebook#3356

Differential Revision: D6706909

Pulled By: sagar0

fbshipit-source-id: 6e4757d9eceab3e8a6c1b83c1be4108e86576cb2
bdarnell and others added 25 commits March 8, 2018 18:44
Summary:
This replaces a vague warning about the mostly-obsolete ext3 filesystem with
a more detailed note about a historical bug in the still-relevant ext4.

Fixes facebook#3410
Closes facebook#3421

Differential Revision: D6834881

Pulled By: siying

fbshipit-source-id: 7771ef5c89a54c0ac17821680779c48178d0b400
Summary: Closes facebook#3428

Differential Revision: D6834061

Pulled By: maysamyabandeh

fbshipit-source-id: edca5b5b8330e0fee646c7434b9631da76670240
Summary:
It failed every time. I guess people usually ran with assertions disabled.
Closes facebook#3422

Differential Revision: D6822984

Pulled By: ajkr

fbshipit-source-id: 2e90db75618b26ac1c46ddfa9e03c095c7bf16e3
Summary: Grandfather in super old lint issues to make a clean slate for moving forward that allows us to have stronger enforcement on new issues.

Reviewed By: yiwu-arbug

Differential Revision: D6821806

fbshipit-source-id: 22797d31ec58e9eb0255d3b66fedfcfcb0dc127c
Summary:
This is to avoid run time error. Fail the db_bench immediately if cuckoo table is used but mmap_read is not specified.
Closes facebook#3420

Differential Revision: D6838284

Pulled By: siying

fbshipit-source-id: 20893fa28d40fadc31e4ff154bed02f5a1bad341
Summary:
ReadOptions.fill_cache is set in compaction inputs and can be set by users in their queries too. It tells RocksDB not to put a data block used to block cache.

The memory used by the data block is, however, not trackable by users.

To make the system more manageable, we can cost the block to block cache while using it, and then release it after using.
Closes facebook#3333

Differential Revision: D6670230

Pulled By: miasantreble

fbshipit-source-id: ab848d3ed286bd081a13ee1903de357b56cbc308
Summary:
Added a test for three dynamic universal compaction options, in the realm of read amplification:
- size_ratio
- min_merge_width
- max_merge_width

Also updated DynamicUniversalCompactionSizeAmplification by adding a check on compaction reason.
Found a bug in compaction reason setting while working on this PR, and fixed in facebook#3412 .

TODO for later: Still to add tests for these options: compression_size_percent, stop_style and trivial_move.
Closes facebook#3419

Differential Revision: D6822217

Pulled By: sagar0

fbshipit-source-id: 074573fca6389053cbac229891a0163f38bb56c4
Summary:
SnapshotConcurrentAccessTest sometimes times out when running on the test infra. This patch splits the test into smaller sub-tests to avoid the timeout. It also benefits from lower run-time of each sub-test and increases the coverage of the test. The overall run-time of each final sub-test is at most half of the original test so we should no longer see a timeout.
Closes facebook#3435

Differential Revision: D6839427

Pulled By: maysamyabandeh

fbshipit-source-id: d53fdb157109e2438ca7fe447d0cf4b71f304bd8
Summary:
previously if `checkpoint_dir` contained a trailing slash, we'd attempt to create the `.tmp` directory under `checkpoint_dir` due to simply concatenating `checkpoint_dir + ".tmp"`. This failed because `checkpoint_dir` hadn't been created yet and our directory creation is non-recursive. This PR fixes the issue by always creating the `.tmp` directory in the same parent as `checkpoint_dir` by stripping trailing slashes before concatenating.
Closes facebook#3275

Differential Revision: D6574952

Pulled By: ajkr

fbshipit-source-id: a6daa6777a901eac2460cd0140c9515f7241aefc
Summary:
Fixes this build error on master (macOS):

```
table/table_test.cc:972:27: error: implicit conversion loses integer precision: 'size_t' (aka 'unsigned long') to
      'unsigned int' [-Werror,-Wshorten-64-to-32]
```
Closes facebook#3434

Reviewed By: maysamyabandeh

Differential Revision: D6840354

Pulled By: gfosco

fbshipit-source-id: fffac6aefbbdd134ce1299453c5590aa855a5fc8
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.

This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.

Close facebook#2951
Closes facebook#3431

Differential Revision: D6849228

Pulled By: ajkr

fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
… injected into the metadata files (change this in future
… when encountered during compaction. To do this, also exposed a new EventListener::OnSpecialBackgroundError function, which will expose the key-range as a parameter. *WILL NOT BE PUSHED to main branchstatus
…ion status. So I changed TwoLevelIterator to store corrupted ranges, because otherwise doing it in the overarching compaction will only get the last index block (which may or may not be corrupted).
@facebook-github-bot
Copy link
Contributor

@amytai has updated the pull request.

@facebook-github-bot
Copy link
Contributor

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.

Before we can review or merge your code, we need you to email cla@fb.com with your details so we can update your status.

@ajkr
Copy link
Contributor

ajkr commented Nov 14, 2018

not intended for landing

@ajkr ajkr closed this Nov 14, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet