Make mempurge a background process (equivalent to in-memory compaction). #8505

bjlemaire · 2021-07-08T18:17:09Z

In #8454, I introduced a new process named MemPurge (memtable garbage collection). This PR builds on that mempurge prototype.
In this PR, I made the mempurge process a background task, which provides better performance since the mempurge process no longer holds the db_mutex, and addresses severe restrictions of the previous iteration (including a scenario where the previous mempurge was failing when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process resembles an in-memory compaction process: the stack of immutable memtables is filtered, and the useful payload is used to populate an output memtable. If the output memtable is filled to more than 60% capacity (arbitrary heuristic), the mempurge process is aborted and a regular flush process takes place; otherwise the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the imm() memtable stack does not trigger another flush process, so the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by setting the experimental_allow_mempurge flag to true. When activated, the MemPurge process always runs when the flush reason is kWriteBufferFull.
The 3 unit tests confirm that this process supports Put, Get, Delete, and DeleteRange operations and is compatible with Iterators and CompactionFilters.
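
For readers new to the idea, the decision described above can be sketched with a toy model. This is not the FlushJob code: `ToyMemTable`, `ToyMemPurge` and `PurgeOutcome` are hypothetical names, sizes are counted as key plus value bytes, and tombstones are dropped outright, which the real process cannot always do because older versions of a key may still live in SST files.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a memtable: key -> latest value, std::nullopt marks a
// deletion (tombstone). This is NOT the RocksDB MemTable class.
using ToyMemTable = std::map<std::string, std::optional<std::string>>;

enum class PurgeOutcome { kKeptInMemory, kFallBackToFlush };

// Hypothetical helper: merges immutable memtables (ordered oldest to newest)
// into `output`, keeping only the newest live entry per key. Reports
// kFallBackToFlush when the surviving payload exceeds 60% of capacity.
PurgeOutcome ToyMemPurge(const std::vector<ToyMemTable>& imms,
                         std::size_t capacity_bytes, ToyMemTable* output) {
  assert(output != nullptr && output->empty());
  ToyMemTable merged;
  for (const ToyMemTable& imm : imms) {
    for (const auto& [key, value] : imm) {
      merged[key] = value;  // newer memtables overwrite older entries
    }
  }
  std::size_t used = 0;
  for (const auto& [key, value] : merged) {
    if (!value.has_value()) continue;  // toy simplification: drop tombstones
    used += key.size() + value->size();
    (*output)[key] = value;
  }
  if (used * 10 > capacity_bytes * 6) {  // above 60% capacity: abort
    output->clear();
    return PurgeOutcome::kFallBackToFlush;
  }
  return PurgeOutcome::kKeptInMemory;
}
```

When the outcome is `kFallBackToFlush`, the caller would proceed with a regular flush of the input memtables; otherwise the output stays in memory in their place.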

facebook-github-bot · 2021-07-08T18:49:21Z

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-07-08T20:07:56Z

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-07-09T01:46:55Z

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-07-09T01:47:59Z

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-07-09T01:53:35Z

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-07-09T01:54:02Z

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-07-09T12:22:57Z

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-07-09T12:32:34Z

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-07-09T14:42:52Z

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-07-09T14:43:27Z

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

db/column_family.cc

db/column_family.h

db/db_flush_test.cc

bjlemaire · 2021-07-09T15:02:05Z

db/db_flush_test.cc

-  const size_t RAND_VALUES_LENGTH = 512;
-  bool atLeastOneFlush = false;
+  const size_t NUM_REPEAT = 1000;
+  const size_t RAND_VALUES_LENGTH = 20480;


Another option to lower these values (and speed up the test) would be to reduce the memtable size (right now the test uses the default 64MB size).

These tests should be doable in well under 1s each. Not required to fix now, but definitely put it on the list if the tests are slow (I haven't checked).

The tests are definitely on the slower end of the spectrum (all 3 tests take about 3 seconds each with a regular "make db_flush_test"). I can look at bringing them under 1sec each.

bjlemaire · 2021-07-09T15:04:01Z

db/db_impl/db_impl.cc

@@ -548,12 +548,39 @@ Status DBImpl::CloseHelper() {
  flush_scheduler_.Clear();
  trim_history_scheduler_.Clear();

+  // For now, simply trigger a manual flush at close time
+  // on all the column families.
+  if (immutable_db_options_.experimental_allow_mempurge) {


At the moment, this automatically flushes all column families. In the future I can do more fine-grained flushing by first checking whether a flush is needed (but that requires something other than imm()->IsFlushPending(), because the output memtables added to imm() don't trigger flushes).

Yes, after offline discussion this is a tricky case that will need further testing and consideration. Although all the if (allow_mempurge) code is known to be in active revision, it might be good to put an explicit TODO(bjlemaire) here.

db/db_impl/db_impl_write.cc

bjlemaire · 2021-07-09T15:06:24Z

db/flush_job.cc

+      // only if it filled at less than 10% capacity (arbitrary heuristic).
+      if (new_mem->ApproximateMemoryUsage() <
+          static_cast<size_t>(
+              ceil(0.1 * mutable_cf_options_.write_buffer_size))) {


Arbitrary metric: add to imm() only if the memtable is at less than 10% capacity. Reason for this: minimize the Get/Read overhead that comes from storing an extra imm memtable, while giving us a chance to perform in-memory compaction.

Btw, could avoid floating point with

(mutable_cf_options_.write_buffer_size + 9) / 10

Nice trick, I'll go ahead and edit that (even though long term there is a chance we need the flexibility a double would bring).
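
To make the equivalence concrete, here is a small sketch of both forms. The function names are made up for illustration; only the two expressions come from the thread. The integer form computes the ceiling of size / 10 exactly, so it also sidesteps any question about floating-point rounding.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Floating-point form from the diff: ceil(0.1 * size).
std::size_t TenthCeilFloat(std::size_t size) {
  return static_cast<std::size_t>(std::ceil(0.1 * static_cast<double>(size)));
}

// Integer form from the review: (size + 9) / 10 is ceil(size / 10).
std::size_t TenthCeilInt(std::size_t size) {
  assert(size <= static_cast<std::size_t>(-1) - 9);  // guard against overflow
  return (size + 9) / 10;
}
```

For the default 64MB write buffer (67108864 bytes), both forms give a threshold of 6710887 bytes.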

db/flush_job.cc

bjlemaire · 2021-07-09T15:07:38Z

db/flush_job.cc

-          m->NewRangeTombstoneIterator(ro, kMaxSequenceNumber);
-      if (range_del_iter != nullptr) {
-        range_del_iters.emplace_back(range_del_iter);
+      if (!(m->GetMempurged())) {


See https://github.com/facebook/rocksdb/pull/8505/files#r667019976.

bjlemaire · 2021-07-09T15:09:38Z

db/flush_job.cc

      ScopedArenaIterator iter(
-          NewMergingIterator(&cfd_->internal_comparator(), &memtables[0],
+          NewMergingIterator(&cfd_->internal_comparator(), memtables.data(),


Use of vector.data() is preferable (allowed in C++11), because &memtables[0] leads to undefined behavior if memtables is empty (I also added an if-statement, pure paranoia).
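
A minimal illustration of the point, assuming a consumer that takes a pointer plus a count as the call in the diff does. `SumChildren` and `SumAll` are hypothetical names, not RocksDB functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical consumer taking a pointer plus a count.
int SumChildren(const int* children, std::size_t n) {
  assert(children != nullptr || n == 0);
  int total = 0;
  for (std::size_t i = 0; i < n; ++i) total += children[i];
  return total;
}

int SumAll(const std::vector<int>& v) {
  // v.data() is well defined even when v is empty (the pointer just must not
  // be dereferenced); &v[0] on an empty vector is undefined behavior.
  return SumChildren(v.data(), v.size());
}
```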

bjlemaire · 2021-07-09T15:11:45Z

db/memtable_list.cc

@@ -527,7 +528,8 @@ void MemTableList::Add(MemTable* m, autovector<MemTable*>* to_delete) {
  current_->Add(m, to_delete);
  m->MarkImmutable();
  num_flush_not_started_++;
-  if (num_flush_not_started_ == 1) {
+
+  if (num_flush_not_started_ > 0 && trigger_flush) {


This is one thing that could potentially lead to issues. Before, "flush_needed" was set as soon as num_flush_not_started reached exactly 1. Why wasn't num_flush_not_started > 0 used, which should have been strictly equivalent? Either way, here we still need to increment num_flush_not_started but we don't want to trigger a flush, otherwise the flush will keep spinning.
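
A toy model of the changed condition may help when reading the hunk. `ToyMemTableList` and its `flush_needed_` field are stand-ins chosen for this sketch, not the real MemTableList members; the only part taken from the diff is the `num_flush_not_started_ > 0 && trigger_flush` test.

```cpp
#include <cassert>

// Toy model of the counters touched in MemTableList::Add; not the real class.
class ToyMemTableList {
 public:
  void Add(bool trigger_flush) {
    ++num_flush_not_started_;
    if (num_flush_not_started_ > 0 && trigger_flush) {
      flush_needed_ = true;
    }
  }
  int num_flush_not_started() const { return num_flush_not_started_; }
  bool flush_needed() const { return flush_needed_; }

 private:
  int num_flush_not_started_ = 0;
  bool flush_needed_ = false;
};

// Usage: a mempurge output memtable is counted but does not request a flush.
bool MempurgeOutputRequestsFlush() {
  ToyMemTableList list;
  list.Add(/*trigger_flush=*/false);
  assert(list.num_flush_not_started() == 1);
  return list.flush_needed();
}
```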

bjlemaire · 2021-07-09T16:50:31Z

db/flush_job.cc

+        // number) needs to be present in the new memtable.
+        new_mem->SetFirstSequenceNumber(new_first_seqno);
+        purged_mems.push_back(new_mem);
+        new_mem =


This strategy is also probably something we want to discuss: should we create new memtables, or simply abort if we end up in this situation?
Pros of creating new memtables: we don't waste the time (possibly) spent in mempurge.
Cons: spikes in memory usage by memtables.

pdillinger · 2021-07-09T16:39:18Z

db/db_impl/db_impl.cc

@@ -548,12 +548,39 @@ Status DBImpl::CloseHelper() {
  flush_scheduler_.Clear();
  trim_history_scheduler_.Clear();

+  // For now, simply trigger a manual flush at close time
+  // on all the column families.
+  if (immutable_db_options_.experimental_allow_mempurge) {


Yes, after offline discussion this is a tricky case that will need further testing and consideration. Although all the if (allow_mempurge) code is known to be in active revision, it might be good to put an explicit TODO(bjlemaire) here.

pdillinger · 2021-07-09T16:43:44Z

db/db_flush_test.cc

+
+  const uint32_t mempurge_count_record = mempurge_count;
+
+  // Insertion of of K-V pairs, multiple times.


Perhaps

// Insertion of K-V pairs, no overwrites

pdillinger · 2021-07-09T16:46:26Z

db/db_flush_test.cc

-  const size_t RAND_VALUES_LENGTH = 512;
-  bool atLeastOneFlush = false;
+  const size_t NUM_REPEAT = 1000;
+  const size_t RAND_VALUES_LENGTH = 20480;


These tests should be doable in well under 1s each. Not required to fix now, but definitely put it on the list if the tests are slow (I haven't checked).

db/db_flush_test.cc

pdillinger · 2021-07-09T16:50:37Z

db/flush_job.cc

@@ -306,6 +317,272 @@ void FlushJob::Cancel() {
  base_->Unref();
 }

+Status FlushJob::MemPurge(autovector<MemTable*>& purged_mems) {


Google code style does not allow non-const reference parameters (so that you can tell at the call site without checking parameters types what might be modified).

Updated - thanks for the note!

pdillinger · 2021-07-09T17:10:21Z

db/flush_job.cc

+  // Store the full output memtables in
+  // autovector "purged_mems".
+  // autovector<MemTable*> purged_mems = {};
+  purged_mems = {};


Nit: for an accumulator output parameter, you have essentially three options:

assert it's already empty

just add to what's already there

blindly replace anything that might be there

The last of these is my least favorite because it seems inherently unsafe from a maintenance perspective.

Very true - I just updated the code with the "assert it's already empty" option.
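
As a sketch of the "assert it's already empty" option, combined with the pointer-style output parameter requested earlier in the review: `CollectEven` is a made-up example function, not code from this PR.

```cpp
#include <cassert>
#include <vector>

// Hypothetical accumulator: collects the even numbers from `input` into
// `*out`. The caller must pass an empty vector; the assert documents and
// checks that contract instead of silently discarding existing contents.
void CollectEven(const std::vector<int>& input, std::vector<int>* out) {
  assert(out != nullptr);
  assert(out->empty());
  for (int v : input) {
    if (v % 2 == 0) out->push_back(v);
  }
}
```

At the call site, `CollectEven(input, &out)` makes it visible that `out` may be modified, which is the readability argument behind the pointer parameter.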

pdillinger · 2021-07-09T17:14:04Z

db/flush_job.cc

@@ -297,6 +303,11 @@ Status FlushJob::Run(LogsWithPrepTracker* prep_tracker,
           << (IOSTATS(cpu_read_nanos) - prev_cpu_read_nanos);
  }

+  // Clean up mempurge output memtables flushed to SST.


flushed to SST? I am confused about what purged_mems holds

My bad, the name purged_mems is visibly confusing. I've replaced it with "full_output_mems". It contains the memtables resulting from the mempurge that are filled up to capacity and need to be flushed to storage.

pdillinger · 2021-07-09T17:17:44Z

db/flush_job.cc

+    for (auto newm : purged_mems) {
+      // Paranoia
+      if (newm != nullptr) {
+        delete newm;


Double delete between here and caller?

Ouch good catch.
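
One generic way to rule out this class of bug is to give the output objects a single owner. The sketch below uses std::unique_ptr as a swapped-in technique for illustration; it is not what the PR does, and real memtables are reference counted rather than owned this way. `ToyTable` and `ReleaseOutputs` are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy object that counts destructions so a double delete would be visible.
struct ToyTable {
  explicit ToyTable(int* destroyed) : destroyed_(destroyed) {}
  ~ToyTable() { ++*destroyed_; }
  int* destroyed_;
};

// Hypothetical cleanup helper: the vector is the single owner, so clearing it
// destroys each table exactly once and leaves nothing for a caller to delete.
void ReleaseOutputs(std::vector<std::unique_ptr<ToyTable>>* outputs) {
  assert(outputs != nullptr);
  outputs->clear();
}
```

Calling `ReleaseOutputs` a second time on the same vector is harmless, since the vector is already empty.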

pdillinger · 2021-07-09T17:23:14Z

db/flush_job.cc

+  // but write any full output table to level0.
+  if (s.ok()) {
+    TEST_SYNC_POINT("DBImpl::FlushJob:MemPurgeSuccessful");
+    for (MemTable* m : mems_) {


How are we sure that mems_ here is the same as above? Does the flush thread effectively take ownership of mems_? Can we make that an assertion, or simply not rely on that?

Short answer: we are guaranteed that mems_ here is the same as above - no other flush thread can edit these memtables at the same time, and even additional operations like DB::Close wait for all the flushes to finish before interfering with any of this.
Long answer: mems_ is a member of the FlushJob object, and the FlushJob object is created inside each flush thread (details at db_impl_compaction_flush.cc:154). mems_ simply contains pointers to the imm() memtables.
mems_ is populated by the column family data object while the db_mutex is held (FlushJob::PickMemTable). Upon populating mems_, the cfd object marks these memtables as being flushed by a flush thread ("flush_in_progress_ = true", memtable_list.cc:357), which guarantees that nothing happens to the memtables contained in mems_ while the flush job is handling them.
Therefore we can be sure that there is no race condition thanks to the db_mutex, and since the memtables are marked with "flush_in_progress_ = true", we know that these tables won't be edited while we are working on them in the flush process.
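
The claim-under-mutex pattern described here can be sketched as follows. This is a simplified model under stated assumptions, not the PickMemTable implementation: `ToyMem` and `PickForFlush` are hypothetical, and a plain std::mutex stands in for the db_mutex.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Toy memtable record; flush_in_progress mirrors the flag named in the thread.
struct ToyMem {
  int id;
  bool flush_in_progress = false;
};

// Hypothetical picker: while holding the mutex, claim every memtable that no
// other flush job has claimed yet and return pointers to the claimed ones.
std::vector<ToyMem*> PickForFlush(std::mutex* mu, std::vector<ToyMem>* imms) {
  assert(mu != nullptr && imms != nullptr);
  std::lock_guard<std::mutex> lock(*mu);
  std::vector<ToyMem*> picked;
  for (ToyMem& m : *imms) {
    if (!m.flush_in_progress) {
      m.flush_in_progress = true;
      picked.push_back(&m);
    }
  }
  return picked;
}
```

A second job calling `PickForFlush` on the same list gets nothing back, which is the property the answer above relies on.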

pdillinger · 2021-07-09T17:33:39Z

db/flush_job.cc

+  // If mempurge successful, don't write input tables to level0,
+  // but write any full output table to level0.


I guess the track you're taking here is keeping some data in memory only to maximize the chances that it becomes garbage and never has to be flushed. My inclination is (per column family) to flush all or nothing, for the benefit of write amplification in compaction. When we create an L0 file, it should be as big as possible. This should also make it easier to purge obsolete WAL files in due course.

Doesn't necessarily have to be fixed now, but something to be noted for reconsideration.

Agreed, provided we're allowed to go beyond the size of a regular SST file (so that we don't accidentally create 2 SST files, one of them full and the other filled to xx% capacity). According to the builder.cc:BuildTable code, it looks like we would indeed create a single large SST file, which would be ideal.

Before this comment, I actually thought that would lead to the creation of more than 1 SST file, because I was convinced the SST file size limit was enforced. But after code inspection it looks like we would just write a single big SST file.

The SST file size limit is for breaking up sorted runs, primarily so that leveled compaction in L1 and above can operate on parts of sorted runs with reasonable granularity. L0 is special because (AFAIK) we assume each SST file is its own sorted run. :)

…apacity. Also, speed up unit test by decreasing memtable size. Add default value of mempurged_ to RollbackMemtablepurge.

facebook-github-bot · 2021-07-09T20:49:23Z

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-07-09T20:52:36Z

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

pdillinger

Some minor suggestions / fixes remain. Otherwise looking good :)

pdillinger · 2021-07-09T21:30:55Z

db/flush_job.cc

+  // autovector<MemTable*> purged_mems = {};
+  purged_mems = {};
+
+  MemTable* new_mem = nullptr;


Can use .release()

pdillinger · 2021-07-09T21:34:13Z

db/flush_job.cc

-    Status mempurge_s = MemPurge(purged_mems);
+    Status mempurge_s = MemPurge();
+    if (!mempurge_s.ok()) {
+      ROCKS_LOG_INFO(db_options_.info_log, "Mempurge process unsuccessful.");


Please include status details, e.g. from ToString(). I'm sure you can find an example to copy.

It seems like an Aborted status should just be INFO but any other status is maybe WARN.
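
The suggested policy could look roughly like the sketch below. `ToyStatus` is a minimal stand-in written for this example, not rocksdb::Status, and `MempurgeLogLevel` and `MempurgeLogLine` are hypothetical helpers; only the INFO-for-Aborted, WARN-otherwise rule and the use of ToString() come from the review.

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for a status object; NOT rocksdb::Status.
struct ToyStatus {
  enum Code { kOk, kAborted, kIOError };
  Code code = kOk;
  std::string message;
  bool ok() const { return code == kOk; }
  bool IsAborted() const { return code == kAborted; }
  std::string ToString() const {
    static const char* const kNames[] = {"OK", "Aborted", "IO error"};
    std::string out = kNames[code];
    if (!message.empty()) out += ": " + message;
    return out;
  }
};

enum class LogLevel { kInfo, kWarn };

// Hypothetical policy from the review: an aborted mempurge is expected, so it
// logs at INFO; any other failure logs at WARN.
LogLevel MempurgeLogLevel(const ToyStatus& s) {
  assert(!s.ok());
  return s.IsAborted() ? LogLevel::kInfo : LogLevel::kWarn;
}

// Builds the log line with the status details included.
std::string MempurgeLogLine(const ToyStatus& s) {
  return "Mempurge process unsuccessful: " + s.ToString();
}
```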

db/flush_job.cc

pdillinger · 2021-07-09T21:43:49Z

db/flush_job.cc

+      (cfd_->GetFlushReason() == FlushReason::kWriteBufferFull) &&
+      (!mems_.empty())) {
+    Status mempurge_s = MemPurge(purged_mems);
+  }
  // This will release and re-acquire the mutex.
  Status s = WriteLevel0Table();


A smart person just suggested this "purged" state might be converted to temporary state here in FlushJob::Run ;)

db/flush_job.cc

facebook-github-bot · 2021-07-09T22:13:53Z

@bjlemaire has updated the pull request. You must reimport the pull request before landing.

facebook-github-bot · 2021-07-09T22:14:29Z

@bjlemaire has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot · 2021-07-10T00:24:12Z

@bjlemaire merged this pull request in 837705a.

bjlemaire added 9 commits July 7, 2021 06:32

Created FlushJob Mempurge function

00fa140

Created new mempurge function as a background task. Need to create ne…

491d6d6

…w memtable as the new_mem becomes full.

Populate MemPurgeV2 with code to handle more than one immutable memta…

346052e

…ble.

Remove first prototype of mempurge that happened during memtable switc…

9d41c33

…h. Make the mempurge happen like an in memory compaction. If potentially interesting add half-filled mempurged memtable back to imm memtable list.

Fix deadlock situation. Pass first 2 tests. Still flushes regular tab…

14cc37b

…le, but add to imm without triggering flush.

Add sync points for tests. Successful test 1, 2, and 3, but more work…

46094f1

… needs to be done for test 2 and range filters and iterators.

Remodeled unit test 1 and 2, mempurge passes these 2 tests. Need for …

ae89742

…remodeling test 3.

Update test 3. Now mempurge passes all 3 tests for get put delete ran…

0819b34

…gedelete iterators and compaction filters.

Fix typo in mempurge comment.

bd16dea

bjlemaire requested review from pdillinger, anand1976 and akankshamahajan15 July 8, 2021 18:17

facebook-github-bot added the CLA Signed label Jul 8, 2021

bjlemaire added 3 commits July 8, 2021 11:22

cast uint64 to uint32 to avoid implicit conversion loss

becf067

Run make format.

40d5749

Fix column ID of newly created memtables in mempurge.

64f8b01

Fix potential memory leak pointed by clang_analyzer.

cefd967

Fix memory leak by correctly cleaning up mempurge output memtables fl…

cc4a8a0

…ushed to storage. Add manual flush to DB close() function for extra safety.

Remove assert statement used for debugging.

b57b7f1

Fix overshadow and increase number overwrites in mempurge cfilter test.

eb9d9b2

This one is called: always test locally before pushing.

b8f034b

Speed up unit tests. Clean flush of all column families upon DB close…

b2d9334

… when mempurge is on.

Fix ubsan error: replace &memtables[0] by memtables.data() to avoid u…

30acdd8

…ndefined behavior.