storage: excessive RocksDB compaction from reasonable suggestion #26693

Closed
petermattis opened this issue Jun 13, 2018 · 20 comments

@petermattis
Collaborator

This is forked from #24029. One of the recurring symptoms we've seen in that issue is that RocksDB compactions go crazy, consuming excessive CPU and compacting way more data than we expect. We've debugged a few problems in how RocksDB handles range tombstones. This issue occurred with mitigations in place for those earlier issues.

I180613 14:43:08.584878 478 storage/compactor/compactor.go:367 [n8,s8,compactor] processing compaction #1-10/38 (/Table/53/1/56300577-/Table/53/1/56738067) for 288 MiB (reasons: size=true used=false avail=false)
...
I180613 14:43:08.586579 478 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_impl_compaction_flush.cc:971] [default] Manual compaction starting L5 -> L6
I180613 14:43:08.586651 478 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:432] [default] L5 -> L6: compaction 970 overlapping inputs: 3215B + 21GB: 'BD89F9035B2DB188001536E587A4E0547209' seq:4235335, type:0 - 'BD89F903D8A50700' seq:72057594037927935, type:15

Here we can see that the compaction queue executed a compaction for what it believes is 288 MiB of data. RocksDB translated this into a compaction for 21 GiB of data!

Those hex decoded keys indicate that RocksDB expanded our suggested compaction range to: /Table/53/1/56307121/0/1528661494.288372850,0-/Table/53/1/64529671. The start key is quite close to the start of the suggested compaction, but the end key is far beyond the suggested end. The RocksDB log line was added to CompactionPicker::SetupOtherInputs and the range of keys was determined by CompactionPicker::GetRange. I'm going to add some additional debugging info to try and track down what is going on.
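
For context, the range that SetupOtherInputs ends up logging is essentially the union of the bounds of every input file, so a single input sstable with a very wide key range drags the whole compaction range out with it. A minimal sketch of that union step (this paraphrases the idea behind CompactionPicker::GetRange; the types and names here are illustrative, not the upstream signatures):

#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for RocksDB's FileMetaData/InternalKey; the real types
// carry sequence numbers and value types, not just user keys.
struct FileBounds {
  std::string smallest;
  std::string largest;
};

// Union of the key ranges of all compaction inputs. If even one input file
// spans a huge key range (e.g. an L5 sstable ending near /Table/53/1/64529672),
// the resulting [smallest, largest] range -- and with it the set of
// overlapping next-level files -- balloons as well.
void GetRangeUnion(const std::vector<FileBounds>& inputs,
                   std::string* smallest, std::string* largest) {
  assert(!inputs.empty());
  *smallest = inputs[0].smallest;
  *largest = inputs[0].largest;
  for (const auto& f : inputs) {
    if (f.smallest < *smallest) *smallest = f.smallest;
    if (f.largest > *largest) *largest = f.largest;
  }
}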

Cc @benesch, @tschottdorf

@petermattis petermattis added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jun 13, 2018
@petermattis petermattis added this to the 2.1 milestone Jun 13, 2018
@petermattis petermattis self-assigned this Jun 13, 2018
@petermattis
Collaborator Author

Caught another instance of this with additional debug logs:

I180613 17:17:57.561411 348 storage/compactor/compactor.go:382  [n8,s8,compactor] processing compaction #244-252/349 (/Table/53/1/54740752-/Table/53/1/54767543) for 262 MiB (reasons: size=true used=false avail=false)
    5: /Table/53/1/54750568/0-/Table/53/1/54764270/0/NULL
    5: /Table/53/1/54764271/0-/Table/53/1/64529672
    6: /Table/53/1/54750568/0-/Table/53/1/54753839/0
    6: /Table/53/1/54753840/0-/Table/53/1/54757111/0
    6: /Table/53/1/54757112/0-/Table/53/1/54760383/0
    6: /Table/53/1/54760384/0-/Table/53/1/54763655/0
    6: /Table/53/1/54763656/0-/Table/53/1/54764270/0
    6: /Table/53/1/54764271/0-/Table/53/1/54767542/0

This shows that the compaction queue is processing a compaction which it thinks covers 262 MiB of data. The subsequent lines show the sstables which overlap this range. RocksDB in turn thinks the compaction encompasses 24 GiB of data:

I180613 17:17:57.563423 348 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:432] [default] L5 -> L6: compaction 1094 overlapping inputs: 4801B + 24GB: 'BD89F903436D6888001536E587A4E0547209' seq:4206222, type:0 - 'BD89F903D8D86D00' seq:72057594037927935, type:15

The problem is the L5 sstable /Table/53/1/54764271/0-/Table/53/1/64529672 (the second line in the listing above). Since that table is involved in the compaction, every L6 table it covers has to be compacted as well. The "clear range hack" is supposed to avoid sstables which cover an exceptionally large number of lower-level sstables, but perhaps it is still being foiled by something in RocksDB.

@benesch This lends additional weight to not going with the hack and actually fixing the selection of sstable boundaries in CompactionJob.
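
(To spell out why one wide L5 table is so costly: the compaction has to pull in every next-level file whose range intersects the inputs, along the lines of the hedged sketch below. The types are illustrative stand-ins, not RocksDB's.)

#include <string>
#include <vector>

struct FileBounds {
  std::string smallest;
  std::string largest;
};

// All L6 files whose key range intersects [smallest, largest]. An L5 input
// ending at /Table/53/1/64529672 overlaps every L6 file up to that key, which
// is how a suggested 262 MiB compaction turns into a 24 GiB one.
std::vector<FileBounds> OverlappingNextLevelFiles(
    const std::vector<FileBounds>& level6, const std::string& smallest,
    const std::string& largest) {
  std::vector<FileBounds> overlapping;
  for (const auto& f : level6) {
    if (f.largest >= smallest && f.smallest <= largest) {
      overlapping.push_back(f);
    }
  }
  return overlapping;
}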

@petermattis
Collaborator Author

Some other logging from the above event revealed the inputs RocksDB was using for the compaction:

I180613 17:17:57.563434 348 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:440] [default] input 0: 'BD89F903436D6888001536E587A4E0547209' seq:4206222, type:0 - 'BD89F90343A2EE8800001536E587A4E0547209' seq:4205227, type:15
I180613 17:17:57.563443 348 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:440] [default] input 1: 'BD89F90343A2EF88001536E587A4E0547209' seq:4205271, type:0 - 'BD89F903D8A50800' seq:3453979, type:15
I180613 17:17:57.563452 348 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:440] [default] input 2: 'BD89F903D8A50700' seq:3453981, type:0 - 'BD89F903D8D86D00' seq:72057594037927935, type:15

If we decode the start and end keys for these sstables we see:

/Table/53/1/54750568/0 - /Table/53/1/54764270/0/NULL
/Table/53/1/54764271/0 - /Table/53/1/64529672
/Table/53/1/64529671 - /Table/53/1/64542829

Notice that the 2nd and 3rd sstables overlap: the 2nd ends at /Table/53/1/64529672 while the 3rd starts at /Table/53/1/64529671. That isn't supposed to happen. RocksDB has consistency checks to make sure sstables in a level do not overlap, but they are disabled by default in release mode. We would need to set AdvancedColumnFamilyOptions::force_consistency_checks = true to enable them. I'm going to try and reproduce and see where the overlap is coming from.
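
For reference, enabling those checks is a one-line options change; a minimal sketch, assuming we plumb it through wherever we construct our RocksDB options:

#include <rocksdb/options.h>

rocksdb::Options MakeOptionsWithConsistencyChecks() {
  rocksdb::Options options;
  // force_consistency_checks is declared on AdvancedColumnFamilyOptions, which
  // ColumnFamilyOptions (and thus Options) inherits from. With it set,
  // overlapping files within a level are caught even in release builds,
  // instead of only being caught by debug-only asserts.
  options.force_consistency_checks = true;
  return options;
}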

@petermattis
Collaborator Author

Heh, running with a RocksDB debug build, as soon as the clearrange roachtest started up, one of the nodes died with the following assertion:

void VersionStorageInfo::AddFile(int level, FileMetaData* f, Logger* info_log) {
  auto* level_files = &files_[level];
  // Must not overlap
#ifndef NDEBUG
  if (level > 0 && !level_files->empty() &&
      internal_comparator_->Compare(
          (*level_files)[level_files->size() - 1]->largest, f->smallest) >= 0) {
    auto* f2 = (*level_files)[level_files->size() - 1];
    if (info_log != nullptr) {
      Error(info_log, "Adding new file %" PRIu64
                      " range (%s, %s) to level %d but overlapping "
                      "with existing file %" PRIu64 " %s %s",
            f->fd.GetNumber(), f->smallest.DebugString(true).c_str(),
            f->largest.DebugString(true).c_str(), level, f2->fd.GetNumber(),
            f2->smallest.DebugString(true).c_str(),
            f2->largest.DebugString(true).c_str());
      LogFlush(info_log);
    }
    assert(false);
  }
#endif
  f->refs++;
  level_files->push_back(f);
}

@tbg tbg added this to On the horizon in KV Jun 14, 2018
@tbg
Member

tbg commented Jun 14, 2018

Wait, what? That's a whole new world of badness. Is this caused by anything we've done or is this just a way of saying "RocksDB is completely broken"?

@petermattis
Collaborator Author

petermattis commented Jun 14, 2018 via email

@tbg
Member

tbg commented Jun 14, 2018

> Seems like marking every sstable containing at least 1 range tombstone would completely remove the need for the compaction queue.

Yes and no, though I'd be tempted to try it. Yes because for the range tombstones, it mostly does. There is an argument that the "tombstone = compaction" heuristic is too aggressive for tombstones that come from replica GC (though maybe it's ok).

Then there's the bigger problem of running compactions for non-range tombstones. The other day a user managed to get to 60 GB of on-disk usage when their dataset was really more like 3 GB -- a queue-type workload. I think the original idea of the compaction queue also included alleviating that kind of thing. However, I'm not too hopeful that it will, and RocksDB's built-in tombstone-sensitive filters might address this problem better.
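
(If the "tombstone-sensitive filters" above refer to RocksDB's compact-on-deletion table properties collector, wiring it up looks roughly like the sketch below; the window size and trigger values are illustrative, not tuned recommendations.)

#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

rocksdb::Options MakeOptionsWithDeletionTriggeredCompaction() {
  rocksdb::Options options;
  // Flag an sstable for compaction once any sliding window of 128K entries
  // contains at least 16K point deletions (both numbers are illustrative).
  // Flagged files are then picked up by the normal compaction scheduling,
  // which should keep a queue-like workload from accumulating dead data.
  options.table_properties_collector_factories.push_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(
          /*sliding_window_size=*/128 * 1024,
          /*deletion_trigger=*/16 * 1024));
  return options;
}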

@petermattis
Collaborator Author

@tschottdorf Oh, and to answer your earlier question: yes, this assertion firing is a whole new world of brokenness. I'm trying to make sure it isn't due to one of the RocksDB patches that Nikhil and I have been using. I don't think it is, but need to make sure.

@benesch
Contributor

benesch commented Jun 14, 2018

Possibly relevant: facebook/rocksdb#3926

@petermattis
Collaborator Author

That does look relevant.

@petermattis
Collaborator Author

I've reproduced the assertion mentioned above using a cockroach built from master but with rocksdb assertions enabled:

roachtest store-gen -d -c peter-bank-gen --stores=10 bank --payload-bytes=10240 --ranges=0 --rows=65104166 --seed=3

I'm going to try the patch @benesch pointed to above to see if that fixes the issue.

@petermattis
Collaborator Author

facebook/rocksdb#3926 appears to fix the problem. This seems somewhat serious as the bug causes RocksDB to violate a basic invariant that 2 sstables in the same level do not overlap. I think the result could be missing keys during iteration.
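
To make the "missing keys" concern concrete: reads within a level assume the files are sorted and non-overlapping, so a lookup can binary-search for the single file whose range can contain the key, roughly as in the sketch below (this paraphrases the idea behind RocksDB's FindFile; names and types are illustrative). With two overlapping files, the key may also live in a neighboring file that the search never consults.

#include <string>
#include <vector>

struct FileBounds {
  std::string smallest;
  std::string largest;
};

// Binary search for the first file whose largest key is >= key. With sorted,
// non-overlapping files this is the only file in the level that can contain
// the key. If two neighboring files overlap, the key may also live in the
// file after the one the search settles on, and that file is never
// consulted -- i.e. a read can miss a live key.
size_t FindFileInLevel(const std::vector<FileBounds>& files,
                       const std::string& key) {
  size_t lo = 0, hi = files.size();
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (files[mid].largest < key) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;  // == files.size() if no file can contain the key
}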

I need to verify whether the bug that RocksDB PR fixes is also present in 2.0. Hopefully I'll get to that tomorrow. If I don't, @benesch will likely need to pick this up while I'm on vacation.

Cc @bdarnell in case you haven't been following along.

@bdarnell
Contributor

> Heh, running with a RocksDB debug build, as soon as the clearrange roachtest starts up one of the nodes died with the following assertion:

We enable RocksDB assertions in race-enabled builds (grep ENABLE_ROCKSDB_ASSERTIONS in the Makefile). Sounds like we should figure out how to use them more. This could mean adding more compaction tests to the main test suite so they get run by make testrace, or (sometimes) running roachtest with assertion-enabled binaries (maybe even race-enabled, but I doubt that's feasible due to the performance overhead). The RocksDB assertions have a 25-40% performance overhead according to #15604, so we still want them disabled for release builds.

@petermattis
Collaborator Author

Well, the good news is that so far I'm unable to reproduce the assertion failure on 2.0. It was happening right away when doing a roachtest store-gen on master, but has been running for 30m on 2.0 (with assertions enabled) without problem. We upgraded from RocksDB 5.9.0+patches to RocksDB 5.12.2 in #25235. Ton of changes there, but it is possible that one of them introduced the cause of the assertion failure. I'm looking.

@petermattis
Collaborator Author

Well, I looked through the changes in our RocksDB upgrade and nothing jumped out at me, but there were a lot of changes so perhaps I missed something.

@benesch
Contributor

benesch commented Jun 15, 2018 via email

@petermattis
Collaborator Author

petermattis commented Jun 15, 2018 via email

@petermattis
Collaborator Author

~/Development/go/src/github.com/cockroachdb/rocksdb (0edac964...) make -j8 db_compaction_test && ./db_compaction_test --gtest_filter=DBCompactionTest.CompactFilesOutputRangeConflict
Makefile:127: Warning: Compiling in debug mode. Don't use the resulting binary in production
  GEN      util/build_version.cc
make: `db_compaction_test' is up to date.
Note: Google Test filter = DBCompactionTest.CompactFilesOutputRangeConflict
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN      ] DBCompactionTest.CompactFilesOutputRangeConflict
Assertion failed: (!FilesRangeOverlapWithCompaction(input_files, output_level)), function CompactFiles, file db/compaction_picker.cc, line 298.

https://github.com/cockroachdb/rocksdb/tree/0edac964ca804c97c69ef18123b5e82d1b57e19e is the SHA used by the 2.0 branch. I'll work on patching master and backporting to 2.0.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 15, 2018
RocksDB was violating an invariant that no 2 sstables in a level
overlap. It isn't quite clear what the upshot of this violation is. At
the very least it would cause the overlapping tables to be compacted
together. It seems possible that it could lead to missing writes, but I
haven't been able to verify that.

Fixes cockroachdb#26693

Release note: None
petermattis added a commit to petermattis/cockroach that referenced this issue Jun 15, 2018
RocksDB was violating an invariant that no 2 sstables in a level
overlap. It isn't quite clear what the upshot of this violation is. At
the very least it would cause the overlapping tables to be compacted
together. It seems possible that it could lead to missing writes, but I
haven't been able to verify that.

Fixes cockroachdb#26693

Release note: None
craig bot pushed a commit that referenced this issue Jun 15, 2018
26755: release-2.0: Bump RocksDB pointer to grab facebook/rocksdb#3926 r=benesch,a-robinson a=petermattis

RocksDB was violating an invariant that no 2 sstables in a level
overlap. It isn't quite clear what the upshot of this violation is. At
the very least it would cause the overlapping tables to be compacted
together. It seems possible that it could lead to missing writes, but I
haven't been able to verify that.

Fixes #26693

Release note: None

Co-authored-by: Peter Mattis <petermattis@gmail.com>
craig bot pushed a commit that referenced this issue Jun 15, 2018
26753: storage: Add extra event to allocator rebalancing r=a-robinson a=a-robinson

Helps make the output of simulated allocator runs less confusing, since
otherwise it's not clear why we're considering removal from the range
and why the replicas being considered for removal includes one that
isn't even a real member of the range.

Release note: None

Would have made looking at the simulated allocator output from https://forum.cockroachlabs.com/t/how-to-enable-leaseholder-load-balancing/1732/3 a little more pleasant.

26754: Bump RocksDB pointer to grab facebook/rocksdb#3926 r=benesch,a-robinson a=petermattis

RocksDB was violating an invariant that no 2 sstables in a level
overlap. It isn't quite clear what the upshot of this violation is. At
the very least it would cause the overlapping tables to be compacted
together. It seems possible that it could lead to missing writes, but I
haven't been able to verify that.

Fixes #26693

Release note: None

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
@craig craig bot closed this as completed in #26754 Jun 15, 2018
KV automation moved this from On the horizon to Finished (milestone 2, ends 6/25) Jun 15, 2018
@petermattis
Collaborator Author

I'm going to leave this open as a reminder that I should try and reproduce the original badness in this issue (excessive compactions) now that the RocksDB bug with overlapping sstables has been fixed.

@petermattis petermattis reopened this Jun 16, 2018
KV automation moved this from Finished (milestone 2, ends 6/25) to Milestone 2 Jun 16, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jun 21, 2018
The current implementation of range deletion tombstones in RocksDB
suffers from a performance bug that causes excessive CPU usage on every
read operation in a database with many range tombstones. Dropping a
large table can easily result in several thousand range deletion
tombstones in one store, resulting in an unusable cluster as documented
in cockroachdb#24029.

Backport a refactoring of range deletion tombstones that fixes the
performance problem. This refactoring has also been proposed upstream as
facebook/rocksdb#4014.

A more minimal change was also proposed in facebook/rocksdb#3992--and
that patch better highlights the exact nature of the bug than the patch
backported here, for those looking to understand the problem. But this
refactoring, though more invasive, gets us one step closer to solving a
related problem where range deletions can cause excessively large
compactions (cockroachdb#26693). These large compactions do not appear to brick the
cluster but undoubtedly have some impact on performance.

Fix cockroachdb#24029.

Release note: None
@tbg tbg moved this from Milestone 3 to On the horizon in KV Jun 25, 2018
@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 22, 2018
@petermattis
Collaborator Author

I believe the remaining work here is upstream in RocksDB. See facebook/rocksdb#3977. Moving to Later as it isn't clear if that work is necessary.

@petermattis petermattis modified the milestones: 2.1, Later Aug 21, 2018
@tbg tbg moved this from On the horizon to Unit Test Flakes in KV Oct 11, 2018
@tbg tbg moved this from Unit Test Flakes to Cold storage in KV Oct 11, 2018
@petermattis petermattis removed this from Cold storage in KV Sep 25, 2019
@petermattis petermattis added this to Incoming in Storage via automation Oct 1, 2019
@petermattis
Collaborator Author

We have no plans to fix this issue in RocksDB. Pebble is somewhat better in this area as it can cut sstables at grandparent boundaries, even if that means splitting a range tombstone. There are a number of open Pebble issues for further compaction improvements, but this issue has done its service and can be closed.
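
For the record, the behavior referred to here is the general LSM technique of closing a compaction output file once it overlaps too much data in the grandparent level, which bounds how much of the next level any single output file can later drag into a compaction. A hand-wavy sketch of that decision, not Pebble's actual code (names and the threshold are made up):

#include <cstdint>
#include <string>
#include <vector>

struct GrandparentFile {
  std::string largest;  // largest key in the file
  uint64_t size_bytes;  // on-disk size
};

// Decide whether to close the current output sstable before writing `key`.
// `grandparents` are the files two levels below the compaction output that
// overlap the compaction's key range, sorted by key. Closing the output once
// it straddles more than `max_overlap` grandparent bytes bounds the size of
// any future compaction of that output file, even if it means a range
// tombstone gets split across two output files.
class OutputSplitter {
 public:
  OutputSplitter(std::vector<GrandparentFile> grandparents, uint64_t max_overlap)
      : grandparents_(std::move(grandparents)), max_overlap_(max_overlap) {}

  bool ShouldCutBefore(const std::string& key) {
    // Accumulate the sizes of grandparent files we have moved entirely past.
    while (index_ < grandparents_.size() && grandparents_[index_].largest < key) {
      overlapped_bytes_ += grandparents_[index_].size_bytes;
      ++index_;
    }
    if (overlapped_bytes_ > max_overlap_) {
      overlapped_bytes_ = 0;  // reset accounting for the next output file
      return true;
    }
    return false;
  }

 private:
  std::vector<GrandparentFile> grandparents_;
  uint64_t max_overlap_;
  size_t index_ = 0;
  uint64_t overlapped_bytes_ = 0;
};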

Storage automation moved this from Incoming to Done (milestone E) Feb 25, 2020