storage: excessive RocksDB compaction from reasonable suggestion #26693

Closed
petermattis opened this issue Jun 13, 2018 · 20 comments

@petermattis
Collaborator

This is forked from #24029. One of the recurring symptoms we've seen in that issue is that RocksDB compactions go crazy, consuming excessive CPU and compacting way more data than we expect. We've debugged a few problems in how RocksDB handles range tombstones. This issue occurred with mitigations in place for those earlier issues.

I180613 14:43:08.584878 478 storage/compactor/compactor.go:367 [n8,s8,compactor] processing compaction #1-10/38 (/Table/53/1/56300577-/Table/53/1/56738067) for 288 MiB (reasons: size=true used=false avail=false)
...
I180613 14:43:08.586579 478 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/db_impl_compaction_flush.cc:971] [default] Manual compaction starting L5 -> L6
I180613 14:43:08.586651 478 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:432] [default] L5 -> L6: compaction 970 overlapping inputs: 3215B + 21GB: 'BD89F9035B2DB188001536E587A4E0547209' seq:4235335, type:0 - 'BD89F903D8A50700' seq:72057594037927935, type:15

Here we can see that the compaction queue executed a compaction for what it believes is 288 MiB of data. RocksDB translated this into a compaction for 21 GiB of data!

Those hex decoded keys indicate that RocksDB expanded our suggested compaction range to: /Table/53/1/56307121/0/1528661494.288372850,0-/Table/53/1/64529671. The start key is quite close to the start of the suggested compaction, but the end key is far beyond the suggested end. The RocksDB log line was added to CompactionPicker::SetupOtherInputs and the range of keys was determined by CompactionPicker::GetRange. I'm going to add some additional debugging info to try and track down what is going on.
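
For context, the range that SetupOtherInputs ends up logging is essentially the union of the bounds of every input file, so a single input sstable with a very wide key range drags the whole compaction range out with it. A minimal sketch of that union step (this paraphrases the idea behind CompactionPicker::GetRange; the types and names here are illustrative, not the upstream signatures):

#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for RocksDB's FileMetaData/InternalKey; the real types
// carry sequence numbers and value types, not just user keys.
struct FileBounds {
  std::string smallest;
  std::string largest;
};

// Union of the key ranges of all compaction inputs. If even one input file
// spans a huge key range (e.g. an L5 sstable ending near /Table/53/1/64529672),
// the resulting [smallest, largest] range -- and with it the set of
// overlapping next-level files -- balloons as well.
void GetRangeUnion(const std::vector<FileBounds>& inputs,
                   std::string* smallest, std::string* largest) {
  assert(!inputs.empty());
  *smallest = inputs[0].smallest;
  *largest = inputs[0].largest;
  for (const auto& f : inputs) {
    if (f.smallest < *smallest) *smallest = f.smallest;
    if (f.largest > *largest) *largest = f.largest;
  }
}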

Cc @benesch, @tschottdorf

@petermattis petermattis added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jun 13, 2018
@petermattis petermattis added this to the 2.1 milestone Jun 13, 2018
@petermattis petermattis self-assigned this Jun 13, 2018
@petermattis
Collaborator Author

Caught another instance of this with additional debug logs:

I180613 17:17:57.561411 348 storage/compactor/compactor.go:382  [n8,s8,compactor] processing compaction #244-252/349 (/Table/53/1/54740752-/Table/53/1/54767543) for 262 MiB (reasons: size=true used=false avail=false)
    5: /Table/53/1/54750568/0-/Table/53/1/54764270/0/NULL
    5: /Table/53/1/54764271/0-/Table/53/1/64529672
    6: /Table/53/1/54750568/0-/Table/53/1/54753839/0
    6: /Table/53/1/54753840/0-/Table/53/1/54757111/0
    6: /Table/53/1/54757112/0-/Table/53/1/54760383/0
    6: /Table/53/1/54760384/0-/Table/53/1/54763655/0
    6: /Table/53/1/54763656/0-/Table/53/1/54764270/0
    6: /Table/53/1/54764271/0-/Table/53/1/54767542/0

This shows that the compaction queue is processing a compaction which it thinks covers 262 MiB of data. The subsequent lines show the sstables which overlap this range. RocksDB in turn thinks the compaction encompasses 24 GiB of data:

I180613 17:17:57.563423 348 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:432] [default] L5 -> L6: compaction 1094 overlapping inputs: 4801B + 24GB: 'BD89F903436D6888001536E587A4E0547209' seq:4206222, type:0 - 'BD89F903D8D86D00' seq:72057594037927935, type:15

The problem is the L5 sstable /Table/53/1/54764271/0-/Table/53/1/64529672 (the second line in the listing above). Since that table is involved in the compaction, every L6 table it covers has to be compacted as well. The "clear range hack" is supposed to avoid sstables which cover an exceptionally large number of lower-level sstables, but perhaps it is still being foiled by something in RocksDB.

@benesch This lends additional weight to not going with the hack and actually fixing the selection of sstable boundaries in CompactionJob.
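
(To spell out why one wide L5 table is so costly: the compaction has to pull in every next-level file whose range intersects the inputs, along the lines of the hedged sketch below. The types are illustrative stand-ins, not RocksDB's.)

#include <string>
#include <vector>

struct FileBounds {
  std::string smallest;
  std::string largest;
};

// All L6 files whose key range intersects [smallest, largest]. An L5 input
// ending at /Table/53/1/64529672 overlaps every L6 file up to that key, which
// is how a suggested 262 MiB compaction turns into a 24 GiB one.
std::vector<FileBounds> OverlappingNextLevelFiles(
    const std::vector<FileBounds>& level6, const std::string& smallest,
    const std::string& largest) {
  std::vector<FileBounds> overlapping;
  for (const auto& f : level6) {
    if (f.largest >= smallest && f.smallest <= largest) {
      overlapping.push_back(f);
    }
  }
  return overlapping;
}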

@petermattis
Collaborator Author

Some other logging from the above event revealed the inputs RocksDB was using for the compaction:

I180613 17:17:57.563434 348 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:440] [default] input 0: 'BD89F903436D6888001536E587A4E0547209' seq:4206222, type:0 - 'BD89F90343A2EE8800001536E587A4E0547209' seq:4205227, type:15
I180613 17:17:57.563443 348 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:440] [default] input 1: 'BD89F90343A2EF88001536E587A4E0547209' seq:4205271, type:0 - 'BD89F903D8A50800' seq:3453979, type:15
I180613 17:17:57.563452 348 storage/engine/rocksdb.go:93 [rocksdb] [/go/src/github.com/cockroachdb/cockroach/c-deps/rocksdb/db/compaction_picker.cc:440] [default] input 2: 'BD89F903D8A50700' seq:3453981, type:0 - 'BD89F903D8D86D00' seq:72057594037927935, type:15

If we decode the start and end keys for these sstables we see:

/Table/53/1/54750568/0 - /Table/53/1/54764270/0/NULL
/Table/53/1/54764271/0 - /Table/53/1/64529672
/Table/53/1/64529671 - /Table/53/1/64542829

Notice that the 2nd and 3rd sstables overlap: the 2nd ends at /Table/53/1/64529672 while the 3rd starts at /Table/53/1/64529671. That isn't supposed to happen. RocksDB has consistency checks to make sure sstables in a level do not overlap, but they are disabled by default in release mode. We would need to set AdvancedColumnFamilyOptions::force_consistency_checks = true to enable them. I'm going to try and reproduce and see where the overlap is coming from.
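
For reference, enabling those checks is a one-line options change; a minimal sketch, assuming we plumb it through wherever we construct our RocksDB options:

#include <rocksdb/options.h>

rocksdb::Options MakeOptionsWithConsistencyChecks() {
  rocksdb::Options options;
  // force_consistency_checks is declared on AdvancedColumnFamilyOptions, which
  // ColumnFamilyOptions (and thus Options) inherits from. With it set,
  // overlapping files within a level are caught even in release builds,
  // instead of only being caught by debug-only asserts.
  options.force_consistency_checks = true;
  return options;
}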

@petermattis
Collaborator Author

Heh, running with a RocksDB debug build, as soon as the clearrange roachtest started up, one of the nodes died with the following assertion:

void VersionStorageInfo::AddFile(int level, FileMetaData* f, Logger* info_log) {
  auto* level_files = &files_[level];
  // Must not overlap
#ifndef NDEBUG
  if (level > 0 && !level_files->empty() &&
      internal_comparator_->Compare(
          (*level_files)[level_files->size() - 1]->largest, f->smallest) >= 0) {
    auto* f2 = (*level_files)[level_files->size() - 1];
    if (info_log != nullptr) {
      Error(info_log, "Adding new file %" PRIu64
                      " range (%s, %s) to level %d but overlapping "
                      "with existing file %" PRIu64 " %s %s",
            f->fd.GetNumber(), f->smallest.DebugString(true).c_str(),
            f->largest.DebugString(true).c_str(), level, f2->fd.GetNumber(),
            f2->smallest.DebugString(true).c_str(),
            f2->largest.DebugString(true).c_str());
      LogFlush(info_log);
    }
    assert(false);
  }
#endif
  f->refs++;
  level_files->push_back(f);
}

@tbg tbg added this to On the horizon in KV Jun 14, 2018
@tbg
Member

tbg commented Jun 14, 2018

Wait, what? That's a whole new world of badness. Is this caused by anything we've done or is this just a way of saying "RocksDB is completely broken"?

@petermattis
Collaborator Author

petermattis commented Jun 14, 2018 via email

@tbg
Member

tbg commented Jun 14, 2018

> Seems like marking every sstable containing at least 1 range tombstone would completely remove the need for the compaction queue.

Yes and no, though I'd be tempted to try it. Yes because for the range tombstones, it mostly does. There is an argument that the "tombstone = compaction" heuristic is too aggressive for tombstones that come from replica GC (though maybe it's ok).

Then there's the bigger problem of running compactions for non-range tombstones. The other day a user managed to get to 60 GB of on-disk usage when their dataset was really more like 3 GB -- a queue-type workload. I think the original idea of the compaction queue also included alleviating that kind of thing. However, I'm not too hopeful that it will, and RocksDB's built-in tombstone-sensitive filters might address this problem better.
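
(If the "tombstone-sensitive filters" above refer to RocksDB's compact-on-deletion table properties collector, wiring it up looks roughly like the sketch below; the window size and trigger values are illustrative, not tuned recommendations.)

#include <rocksdb/options.h>
#include <rocksdb/utilities/table_properties_collectors.h>

rocksdb::Options MakeOptionsWithDeletionTriggeredCompaction() {
  rocksdb::Options options;
  // Flag an sstable for compaction once any sliding window of 128K entries
  // contains at least 16K point deletions (both numbers are illustrative).
  // Flagged files are then picked up by the normal compaction scheduling,
  // which should keep a queue-like workload from accumulating dead data.
  options.table_properties_collector_factories.push_back(
      rocksdb::NewCompactOnDeletionCollectorFactory(
          /*sliding_window_size=*/128 * 1024,
          /*deletion_trigger=*/16 * 1024));
  return options;
}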

@petermattis
Collaborator Author

@tschottdorf Oh, and to answer your earlier question: yes, this assertion firing is a whole new world of brokenness. I'm trying to make sure it isn't due to one of the RocksDB patches that Nikhil and I have been using. I don't think it is, but need to make sure.

@benesch
Contributor

benesch commented Jun 14, 2018

Possibly relevant: facebook/rocksdb#3926

@petermattis
Collaborator Author

That does look relevant.

@petermattis
Collaborator Author

I've reproduced the assertion mentioned above using a cockroach built from master but with rocksdb assertions enabled:

roachtest store-gen -d -c peter-bank-gen --stores=10 bank --payload-bytes=10240 --ranges=0 --rows=65104166 --seed=3

I'm going to try the patch @benesch pointed to above to see if that fixes the issue.

@petermattis
Collaborator Author

facebook/rocksdb#3926 appears to fix the problem. This seems somewhat serious as the bug causes RocksDB to violate a basic invariant that 2 sstables in the same level do not overlap. I think the result could be missing keys during iteration.
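
To make the "missing keys" concern concrete: reads within a level assume the files are sorted and non-overlapping, so a lookup can binary-search for the single file whose range can contain the key, roughly as in the sketch below (this paraphrases the idea behind RocksDB's FindFile; names and types are illustrative). With two overlapping files, the key may also live in a neighboring file that the search never consults.

#include <string>
#include <vector>

struct FileBounds {
  std::string smallest;
  std::string largest;
};

// Binary search for the first file whose largest key is >= key. With sorted,
// non-overlapping files this is the only file in the level that can contain
// the key. If two neighboring files overlap, the key may also live in the
// file after the one the search settles on, and that file is never
// consulted -- i.e. a read can miss a live key.
size_t FindFileInLevel(const std::vector<FileBounds>& files,
                       const std::string& key) {
  size_t lo = 0, hi = files.size();
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (files[mid].largest < key) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return lo;  // == files.size() if no file can contain the key
}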

I need to verify whether the bug that RocksDB PR fixes is also present in 2.0. Hopefully I'll get to that tomorrow. If I don't, @benesch will likely need to pick this up while I'm on vacation.

Cc @bdarnell in case you haven't been following along.

@bdarnell
Contributor

> Heh, running with a RocksDB debug build, as soon as the clearrange roachtest starts up one of the nodes died with the following assertion:

We enable RocksDB assertions in race-enabled builds (grep ENABLE_ROCKSDB_ASSERTIONS in the Makefile). Sounds like we should figure out how to use them more. This could mean adding more compaction tests to the main test suite so they get run by make testrace, or (sometimes) running roachtest with assertion-enabled binaries (maybe even race-enabled, but I doubt that's feasible due to the performance overhead). The RocksDB assertions have a 25-40% performance overhead according to #15604, so we still want them disabled for release builds.

@petermattis
Collaborator Author

Well, the good news is that so far I'm unable to reproduce the assertion failure on 2.0. It was happening right away when doing a roachtest store-gen on master, but has been running for 30m on 2.0 (with assertions enabled) without problem. We upgraded from RocksDB 5.9.0+patches to RocksDB 5.12.2 in #25235. Ton of changes there, but it is possible that one of them introduced the cause of the assertion failure. I'm looking.

@petermattis
Collaborator Author

Well, I looked through the changes in our RocksDB upgrade and nothing jumped out at me, but there were a lot of changes so perhaps I missed something.

@benesch
Contributor

benesch commented Jun 15, 2018 via email

@petermattis
Collaborator Author

petermattis commented Jun 15, 2018 via email

@petermattis
Collaborator Author

~/Development/go/src/github.com/cockroachdb/rocksdb (0edac964...) make -j8 db_compaction_test && ./db_compaction_test --gtest_filter=DBCompactionTest.CompactFilesOutputRangeConflict
Makefile:127: Warning: Compiling in debug mode. Don't use the resulting binary in production
  GEN      util/build_version.cc
make: `db_compaction_test' is up to date.
Note: Google Test filter = DBCompactionTest.CompactFilesOutputRangeConflict
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBCompactionTest
[ RUN      ] DBCompactionTest.CompactFilesOutputRangeConflict
Assertion failed: (!FilesRangeOverlapWithCompaction(input_files, output_level)), function CompactFiles, file db/compaction_picker.cc, line 298.

https://github.com/cockroachdb/rocksdb/tree/0edac964ca804c97c69ef18123b5e82d1b57e19e is the SHA used by the 2.0 branch. I'll work on patching master and backporting to 2.0.

petermattis added a commit to petermattis/cockroach that referenced this issue Jun 15, 2018
RocksDB was violating an invariant that no 2 sstables in a level
overlap. It isn't quite clear what the upshot of this violation is. At
the very least it would cause the overlapping tables to be compacted
together. It seems possible that it could lead to missing writes, but I
haven't been able to verify that.

Fixes cockroachdb#26693

Release note: None
petermattis added a commit to petermattis/cockroach that referenced this issue Jun 15, 2018
RocksDB was violating an invariant that no 2 sstables in a level
overlap. It isn't quite clear what the upshot of this violation is. At
the very least it would cause the overlapping tables to be compacted
together. It seems possible that it could lead to missing writes, but I
haven't been able to verify that.

Fixes cockroachdb#26693

Release note: None
craig bot pushed a commit that referenced this issue Jun 15, 2018
26755: release-2.0: Bump RocksDB pointer to grab facebook/rocksdb#3926 r=benesch,a-robinson a=petermattis

RocksDB was violating an invariant that no 2 sstables in a level
overlap. It isn't quite clear what the upshot of this violation is. At
the very least it would cause the overlapping tables to be compacted
together. It seems possible that it could lead to missing writes, but I
haven't been able to verify that.

Fixes #26693

Release note: None

Co-authored-by: Peter Mattis <petermattis@gmail.com>
craig bot pushed a commit that referenced this issue Jun 15, 2018
26753: storage: Add extra event to allocator rebalancing r=a-robinson a=a-robinson

Helps make the output of simulated allocator runs less confusing, since
otherwise it's not clear why we're considering removal from the range
and why the replicas being considered for removal includes one that
isn't even a real member of the range.

Release note: None

Would have made looking at the simulated allocator output from https://forum.cockroachlabs.com/t/how-to-enable-leaseholder-load-balancing/1732/3 a little more pleasant.

26754: Bump RocksDB pointer to grab facebook/rocksdb#3926 r=benesch,a-robinson a=petermattis

RocksDB was violating an invariant that no 2 sstables in a level
overlap. It isn't quite clear what the upshot of this violation is. At
the very least it would cause the overlapping tables to be compacted
together. It seems possible that it could lead to missing writes, but I
haven't been able to verify that.

Fixes #26693

Release note: None

Co-authored-by: Alex Robinson <alexdwanerobinson@gmail.com>
Co-authored-by: Peter Mattis <petermattis@gmail.com>
@craig craig bot closed this as completed in #26754 Jun 15, 2018
KV automation moved this from On the horizon to Finished (milestone 2, ends 6/25) Jun 15, 2018
@petermattis
Collaborator Author

I'm going to leave this open as a reminder that I should try and reproduce the original badness in this issue (excessive compactions) now that the RocksDB bug with overlapping sstables has been fixed.

@petermattis petermattis reopened this Jun 16, 2018
KV automation moved this from Finished (milestone 2, ends 6/25) to Milestone 2 Jun 16, 2018
benesch added a commit to benesch/cockroach that referenced this issue Jun 21, 2018
The current implementation of range deletion tombstones in RocksDB
suffers from a performance bug that causes excessive CPU usage on every
read operation in a database with many range tombstones. Dropping a
large table can easily result in several thousand range deletion
tombstones in one store, resulting in an unusable cluster as documented
in cockroachdb#24029.

Backport a refactoring of range deletion tombstones that fixes the
performance problem. This refactoring has also been proposed upstream as
facebook/rocksdb#4014.

A more minimal change was also proposed in facebook/rocksdb#3992--and
that patch better highlights the exact nature of the bug than the patch
backported here, for those looking to understand the problem. But this
refactoring, though more invasive, gets us one step closer to solving a
related problem where range deletions can cause excessively large
compactions (cockroachdb#26693). These large compactions do not appear to brick the
cluster but undoubtedly have some impact on performance.

Fix cockroachdb#24029.

Release note: None
@tbg tbg moved this from Milestone 3 to On the horizon in KV Jun 25, 2018
@tbg tbg added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jul 22, 2018
@petermattis
Collaborator Author

I believe the remaining work here is upstream in RocksDB. See facebook/rocksdb#3977. Moving to Later as it isn't clear if that work is necessary.

@petermattis petermattis modified the milestones: 2.1, Later Aug 21, 2018
@tbg tbg moved this from On the horizon to Unit Test Flakes in KV Oct 11, 2018
@tbg tbg moved this from Unit Test Flakes to Cold storage in KV Oct 11, 2018
@petermattis petermattis removed this from Cold storage in KV Sep 25, 2019
@petermattis petermattis added this to Incoming in Storage via automation Oct 1, 2019
@petermattis
Collaborator Author

We have no plans to fix this issue in RocksDB. Pebble is somewhat better in this area as it can cut sstables at grandparent boundaries, even if that means splitting a range tombstone. There are a number of open Pebble issues for further compaction improvements, but this issue has done its service and can be closed.
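
For the record, the behavior referred to here is the general LSM technique of closing a compaction output file once it overlaps too much data in the grandparent level, which bounds how much of the next level any single output file can later drag into a compaction. A hand-wavy sketch of that decision, not Pebble's actual code (names and the threshold are made up):

#include <cstdint>
#include <string>
#include <vector>

struct GrandparentFile {
  std::string largest;  // largest key in the file
  uint64_t size_bytes;  // on-disk size
};

// Decide whether to close the current output sstable before writing `key`.
// `grandparents` are the files two levels below the compaction output that
// overlap the compaction's key range, sorted by key. Closing the output once
// it straddles more than `max_overlap` grandparent bytes bounds the size of
// any future compaction of that output file, even if it means a range
// tombstone gets split across two output files.
class OutputSplitter {
 public:
  OutputSplitter(std::vector<GrandparentFile> grandparents, uint64_t max_overlap)
      : grandparents_(std::move(grandparents)), max_overlap_(max_overlap) {}

  bool ShouldCutBefore(const std::string& key) {
    // Accumulate the sizes of grandparent files we have moved entirely past.
    while (index_ < grandparents_.size() && grandparents_[index_].largest < key) {
      overlapped_bytes_ += grandparents_[index_].size_bytes;
      ++index_;
    }
    if (overlapped_bytes_ > max_overlap_) {
      overlapped_bytes_ = 0;  // reset accounting for the next output file
      return true;
    }
    return false;
  }

 private:
  std::vector<GrandparentFile> grandparents_;
  uint64_t max_overlap_;
  size_t index_ = 0;
  uint64_t overlapped_bytes_ = 0;
};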

Storage automation moved this from Incoming to Done (milestone E) Feb 25, 2020