
feat(tiering): add background offload step #2504

Merged 9 commits on Feb 14, 2024
Conversation

@adiholden (Collaborator) commented Jan 29, 2024

Implements tiering background offloading based on SIEVE (https://junchengyang.com/publication/nsdi24-SIEVE.pdf).
When fetching an item we mark it as touched (hot).
A background task periodically traverses several buckets (continuing from where it last stopped), marking their items as untouched (cold).
The background task checks whether an item is cold (not touched since the last traversal marked it cold). If an item is cold and can be offloaded, it is scheduled for offload.
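The touched/cold cycle described above can be sketched with a minimal toy model (the `Item`, `Fetch`, and `OffloadPass` names are illustrative, not the PR's actual classes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal sketch of the SIEVE-style touched/cold cycle.
struct Item {
  bool touched = false;   // set on fetch ("hot")
  bool offloaded = false;
};

// On every read, mark the item hot.
inline void Fetch(Item& it) { it.touched = true; }

// One background pass over a range of items: cold items (untouched since the
// previous pass) are scheduled for offload; every item is reset to cold.
inline size_t OffloadPass(std::vector<Item>& items) {
  size_t scheduled = 0;
  for (Item& it : items) {
    if (!it.touched && !it.offloaded) {
      it.offloaded = true;  // stands in for ScheduleOffload()
      ++scheduled;
    }
    it.touched = false;  // reset to cold for the next pass
  }
  return scheduled;
}
```

Note how the reset to cold happens unconditionally, outside the offload condition: an item survives in RAM only as long as it keeps being fetched between passes.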

@@ -216,6 +217,18 @@ class CompactObj {
}
}

bool HasTouched() const {
Collaborator
Suggested change
bool HasTouched() const {
bool WasTouched() const {

imho

@theyueli (Contributor) left a comment

Thanks for the work. It would be helpful to add more description to the PR.

@@ -118,6 +118,7 @@ class CompactObj {
ASCII2_ENC_BIT = 0x10,
IO_PENDING = 0x20,
STICKY = 0x40,
SIEVE = 0x80,
Contributor

"SIEVE" is a very confusing name to use.

We just need a simple word to express whether the key has been accessed or not.

@@ -240,6 +240,10 @@ class DashTable : public detail::DashTableBase {
// calling cb(iterator) for every non-empty slot. The iteration goes over a physical bucket.
template <typename Cb> void TraverseBucket(const_iterator it, Cb&& cb);

// Traverses over a single bucket in table and calls cb(iterator). The traverse order will be
// segment by segment over phisical backets.
template <typename Cb> Cursor TraverseBySegmentOrder(Cursor curs, Cb&& cb);
Contributor

or just TraverseSegment?

Collaborator (Author)

TraverseSegment sounds like we go over one segment only
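The cursor-resumable semantics under discussion can be illustrated with a toy model (a plain index over a flat bucket array; not the real DashTable types): each call walks a fixed number of buckets and returns the cursor to resume from.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative sketch only: a cursor-resumable traversal over buckets.
using Cursor = size_t;

// Walks `step` buckets of `table` starting at `curs`, invoking cb on each,
// and returns the cursor the next call should resume from (wrapping around).
inline Cursor TraverseBySegmentOrder(const std::vector<int>& table, Cursor curs,
                                     size_t step,
                                     const std::function<void(int)>& cb) {
  for (size_t i = 0; i < step && !table.empty(); ++i) {
    cb(table[curs]);
    curs = (curs + 1) % table.size();
  }
  return curs;
}
```

The returned cursor is what lets the background task pick up from where the previous run stopped instead of restarting the scan.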

auto cb = [&](PrimeIterator it) {
// TBD check we did not lock it for future transaction

if (increase_goal_bytes > offloaded_bytes && !(it->first.HasTouched()) &&
Contributor

would be great to have some comments explaining these conditions.

VLOG(2) << "ScheduleOffload bytes:" << offloaded_bytes;
}
}
it->first.SetTouched(false);
Contributor

you only need to do so if the slot has been touched and will be offloaded.. so it should go under the if-condition.

Collaborator (Author)

We offload only items that are marked as not touched, meaning they are cold items.
We make them cold (set as not touched) by iterating over the database and setting touched to false. If an item is fetched, it is marked touched again. This way, if an item was not touched since the last time we iterated over its bucket, it is cold and a candidate for offload.
Therefore the SetTouched(false) should be outside the if.


// Traverse a single segment every time this function is called.
for (int i = 0; i < 60; ++i) {
cursor = pt.TraverseBySegmentOrder(cursor, cb);
Contributor

Could this take too long? Keeping the segment locked for too long will affect performance.

Can the granularity be smaller? Say, on every eviction we only traverse a certain number of buckets?

Collaborator (Author)

This loop traverses exactly 60 buckets. I don't think this takes a long time, but we can decrease or increase the number if we want.

@@ -602,7 +605,9 @@ void EngineShard::Heartbeat() {
ttl_delete_target = kTtlDeleteLimit * double(deleted) / (double(traversed) + 10);
}

ssize_t redline = (max_memory_limit * kRedLimitFactor) / shard_set->size();
ssize_t eviction_redline = (max_memory_limit * kRedLimitFactor) / shard_set->size();
Contributor

I think that when eviction and tiering are both enabled, it gets tricky:

To me, eviction_redline should be calculated accordingly:

  1. if tiering is not enabled, it is computed in the current way.
  2. otherwise, it should be calculated as:
ssize_t eviction_redline = (max_capacity_limit * kRedLimitFactor) / shard_set->size();

where max_capacity_limit should be the sum of DRAM and tiering file capacity (e.g. tiered_max_file_size)

In general, I think eviction shall not happen before offloading, and we need to add code to check this condition.
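The reviewer's two cases could be sketched as follows (the value of kRedLimitFactor and the helper function itself are illustrative assumptions, not the PR's code):

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>

// Assumed value, for illustration only.
constexpr double kRedLimitFactor = 0.9;

// Per-shard eviction redline: DRAM alone without tiering, or DRAM plus the
// tiered file capacity when tiering is enabled (storage treated as an
// extension of memory, per the suggestion above).
inline ssize_t EvictionRedline(size_t max_memory_limit,
                               size_t tiered_max_file_size,
                               bool tiering_enabled, unsigned num_shards) {
  size_t capacity = max_memory_limit;
  if (tiering_enabled)
    capacity += tiered_max_file_size;  // max_capacity_limit = DRAM + file
  return static_cast<ssize_t>(capacity * kRedLimitFactor) / num_shards;
}
```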

Collaborator (Author)

I did not touch the eviction logic for tiering; it should be handled, but not in this PR.
I am not sure we will have the same eviction logic for tiering as we have today.

@romange (Collaborator) Jan 30, 2024

I have some reservations about Yue's suggestion. I am afraid that this assumes that all pieces of the system work perfectly well and the offloading has enough bandwidth and cpu resources to offload everything fast enough. And that there are no bugs, we are not out of disk space etc.

In contrast, I imagine both the eviction and offloading processes as two independent forces that work towards the common goal. If the offloading does not achieve its goals, eviction will still make sure that our RAM/RSS usage won't cross the RAM limit.

Collaborator

The practical implication is that eviction policy should delete items if available RAM is low like we do today but it should delete both offloaded and hot items.

Contributor

It depends on how you want to define tiering: if storage is only an extension of memory, then it is considered memory too, so DRAM and storage together form a single cache, and eviction should cover all data in that cache, i.e. DRAM + storage.

But if one day storage becomes truly persistent, it will make sense to use what you suggested.

src/core/dash.h Outdated
@@ -240,6 +240,10 @@ class DashTable : public detail::DashTableBase {
// calling cb(iterator) for every non-empty slot. The iteration goes over a physical bucket.
template <typename Cb> void TraverseBucket(const_iterator it, Cb&& cb);

// Traverses over a single bucket in table and calls cb(iterator). The traverse order will be
// segment by segment over phisical backets.
Collaborator

Suggested change
// segment by segment over phisical backets.
// segment by segment over physical buckets.

Collaborator

Please explicitly state that there are no coverage guarantees if the table grows/shrinks, and that this is useful when formal full coverage is not critically important.

@@ -118,6 +118,7 @@ class CompactObj {
ASCII2_ENC_BIT = 0x10,
IO_PENDING = 0x20,
STICKY = 0x40,
SIEVE = 0x80,
Collaborator

Maybe add comment describing in short what SIEVE does? Also add a link to the paper

};

// Traverse a single segment every time this function is called.
for (int i = 0; i < 60; ++i) {
Collaborator

should be a constant taken from PrimeTable, right?

Collaborator (Author)

Well, in the current implementation I go through one entire segment every time this function is called, but that is a fairly arbitrary decision. I will try to tune it and see whether it makes sense to go over more or fewer buckets to get a better hit rate when running benchmarks. The time between runs of this function and the number of buckets it goes through determine when an item becomes cold.
I.e., if we have 1000 segments and this function runs every millisecond, then an item becomes cold if it was not touched in 1 second.
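The back-of-the-envelope arithmetic above can be written down directly (the helper is illustrative, not the PR's code):

```cpp
#include <cassert>
#include <cstddef>

// Time for one full cycle over all segments, i.e. how long an item must go
// untouched before it is considered cold. With 1000 segments, one segment
// per run, and a 1 ms run interval, the cold window is 1000 ms.
inline size_t ColdWindowMs(size_t num_segments, size_t segments_per_run,
                           size_t run_interval_ms) {
  // runs needed for a full cycle, times the interval between runs
  return (num_segments / segments_per_run) * run_interval_ms;
}
```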

@@ -38,6 +38,9 @@ ABSL_FLAG(dfly::MemoryBytesFlag, tiered_max_file_size, dfly::MemoryBytesFlag{},
"0 - means the program will automatically determine its maximum file size. "
"default: 0");

ABSL_FLAG(float, tiered_offload_threshold, 0.5,
"The ration of used/max memory above which we run offloading values to disk");
Collaborator

Suggested change
"The ration of used/max memory above which we run offloading values to disk");
"The ratio of used/max memory above which we start offloading values to disk");


@romange (Collaborator) commented Jan 30, 2024

Also please resolve tiered_storage conflicts.

VLOG(2) << "Skip WriteSingle for: " << key;
}
return error_code{};
return true;
@romange (Collaborator) Feb 12, 2024

why do you skip writing here?

Collaborator (Author)

This function just does some setup and returns whether we can schedule offloading. When writing big blobs we always schedule an offload, while small blobs are aggregated in pending_entries, and we schedule offloading only when pending_entries reaches its max size.
So below is the preparation for small blobs, where we add to pending_entries and return true (which results in the caller scheduling an offload) if we have aggregated the max number of entries in the bin.
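The big-blob/small-blob split described here could look roughly like this (the names, threshold, and bin capacity are hypothetical, not the real TieredStorage API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr size_t kBigBlobLen = 4096;  // assumed threshold
constexpr size_t kBinMaxEntries = 8;  // assumed bin capacity

struct SmallBin {
  std::vector<std::string> pending_entries;
};

// Returns true if the caller should schedule an offload now: big blobs are
// scheduled immediately, small blobs are batched until the bin is full.
inline bool PrepareOffload(SmallBin& bin, const std::string& blob) {
  if (blob.size() >= kBigBlobLen)
    return true;  // big blobs always go straight to offload
  bin.pending_entries.push_back(blob);
  return bin.pending_entries.size() >= kBinMaxEntries;  // flush a full bin
}
```

Batching small blobs this way amortizes the per-write I/O cost instead of issuing one disk write per tiny value.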

@@ -51,15 +56,23 @@ class TieredStorage {

std::error_code Read(size_t offset, size_t len, char* dest);

bool AllowWrites() const;
Collaborator

maybe: IoDeviceOverloaded() with the opposite meaning? AllowWrites is a too general name

private:
class InflightWriteRequest;

void WriteSingle(DbIndex db_index, PrimeIterator it, size_t blob_len);

// Returns a pair consisting of a bool denoting whether we can write to disk, and an updated
// iterator, as this function can yield. 'it' should not be used after the call to this function.
std::pair<bool, PrimeIterator> CanScheduleOffload(DbIndex db_index, PrimeIterator it,
std::string_view key);
std::pair<bool, PrimeIterator> ThrottleWrites(DbIndex db_index, PrimeIterator it,
Collaborator

worth adding what this function does.

Signed-off-by: adi_holden <adi@dragonflydb.io>
This reverts commit 994afab.
@adiholden adiholden merged commit 32e8d49 into main Feb 14, 2024
10 checks passed
@adiholden adiholden deleted the tiering_background_offload branch February 14, 2024 12:28