
feat(tiering): add background offload step #2504

Merged 9 commits on Feb 14, 2024
Conversation

@adiholden (Collaborator) commented Jan 29, 2024

Implements tiering background offloading based on SIEVE (https://junchengyang.com/publication/nsdi24-SIEVE.pdf).
When fetching an item we mark it as touched (hot).
A background task periodically traverses several buckets (continuing from where it last stopped), marking their items as untouched (cold).
The background task checks whether an item is cold (not touched since the last traversal marked it cold). If an item is cold and can be offloaded, it is scheduled for offload.
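The touched/cold cycle described above can be sketched with a minimal toy model (the `Item`, `Fetch`, and `OffloadPass` names are illustrative, not the PR's actual classes):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal sketch of the SIEVE-style touched/cold cycle.
struct Item {
  bool touched = false;   // set on fetch ("hot")
  bool offloaded = false;
};

// On every read, mark the item hot.
inline void Fetch(Item& it) { it.touched = true; }

// One background pass over a range of items: cold items (untouched since the
// previous pass) are scheduled for offload; every item is reset to cold.
inline size_t OffloadPass(std::vector<Item>& items) {
  size_t scheduled = 0;
  for (Item& it : items) {
    if (!it.touched && !it.offloaded) {
      it.offloaded = true;  // stands in for ScheduleOffload()
      ++scheduled;
    }
    it.touched = false;  // reset to cold for the next pass
  }
  return scheduled;
}
```

Note how the reset to cold happens unconditionally, outside the offload condition: an item survives in RAM only as long as it keeps being fetched between passes.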

@@ -216,6 +217,18 @@ class CompactObj {
}
}

bool HasTouched() const {
Collaborator
Suggested change
bool HasTouched() const {
bool WasTouched() const {

imho

@theyueli (Contributor) left a comment

Thanks for the work. It would be helpful to add more description to the PR.

@@ -118,6 +118,7 @@ class CompactObj {
ASCII2_ENC_BIT = 0x10,
IO_PENDING = 0x20,
STICKY = 0x40,
SIEVE = 0x80,
Contributor

"SIEVE" is a very confusing name to use.

We just need a simple word to express whether the key has been accessed or not.

@@ -240,6 +240,10 @@ class DashTable : public detail::DashTableBase {
// calling cb(iterator) for every non-empty slot. The iteration goes over a physical bucket.
template <typename Cb> void TraverseBucket(const_iterator it, Cb&& cb);

// Traverses over a single bucket in table and calls cb(iterator). The traverse order will be
// segment by segment over phisical backets.
template <typename Cb> Cursor TraverseBySegmentOrder(Cursor curs, Cb&& cb);
Contributor

or just TraverseSegment?

Collaborator (Author)

TraverseSegment sounds like we go over one segment only
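The cursor-resumable semantics under discussion can be illustrated with a toy model (a plain index over a flat bucket array; not the real DashTable types): each call walks a fixed number of buckets and returns the cursor to resume from.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Illustrative sketch only: a cursor-resumable traversal over buckets.
using Cursor = size_t;

// Walks `step` buckets of `table` starting at `curs`, invoking cb on each,
// and returns the cursor the next call should resume from (wrapping around).
inline Cursor TraverseBySegmentOrder(const std::vector<int>& table, Cursor curs,
                                     size_t step,
                                     const std::function<void(int)>& cb) {
  for (size_t i = 0; i < step && !table.empty(); ++i) {
    cb(table[curs]);
    curs = (curs + 1) % table.size();
  }
  return curs;
}
```

The returned cursor is what lets the background task pick up from where the previous run stopped instead of restarting the scan.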

auto cb = [&](PrimeIterator it) {
// TBD check we did not lock it for future transaction

if (increase_goal_bytes > offloaded_bytes && !(it->first.HasTouched()) &&
Contributor

would be great to have some comments explaining these conditions.

VLOG(2) << "ScheduleOffload bytes:" << offloaded_bytes;
}
}
it->first.SetTouched(false);
Contributor

you only need to do so if the slot has been touched and will be offloaded.. so it should go under the if-condition.

Collaborator (Author)

We offload only items that are marked as not touched, meaning they are cold items.
We make them cold (set as not touched) by iterating over the database and setting touched to false. If an item is fetched, it is marked touched again. This way, if an item was not touched since the last time we iterated over its bucket, it is cold and a candidate for offload.
Therefore the SetTouched(false) should be outside the if.


// Traverse a single segment every time this function is called.
for (int i = 0; i < 60; ++i) {
cursor = pt.TraverseBySegmentOrder(cursor, cb);
Contributor

Could this take too long? Keeping the segment locked for too long will affect performance.

Can the granularity be smaller? Say, on every eviction we only traverse a certain number of buckets?

Collaborator (Author)

This loop traverses exactly 60 buckets. I don't think this takes a long time, but we can decrease or increase the number if we want.

@@ -602,7 +605,9 @@ void EngineShard::Heartbeat() {
ttl_delete_target = kTtlDeleteLimit * double(deleted) / (double(traversed) + 10);
}

ssize_t redline = (max_memory_limit * kRedLimitFactor) / shard_set->size();
ssize_t eviction_redline = (max_memory_limit * kRedLimitFactor) / shard_set->size();
Contributor

I think that when eviction and tiering are both enabled, it gets tricky:

To me, eviction_redline should be calculated accordingly:

  1. if tiering is not enabled, it is computed in the current way.
  2. otherwise, it should be calculated as:
ssize_t eviction_redline = (max_capacity_limit * kRedLimitFactor) / shard_set->size();

where max_capacity_limit should be the sum of DRAM and tiering file capacity (e.g. tiered_max_file_size)

In general, I think eviction shall not happen before offloading, and we need to add code to check this condition.
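The reviewer's two cases could be sketched as follows (the value of kRedLimitFactor and the helper function itself are illustrative assumptions, not the PR's code):

```cpp
#include <cassert>
#include <cstddef>
#include <sys/types.h>

// Assumed value, for illustration only.
constexpr double kRedLimitFactor = 0.9;

// Per-shard eviction redline: DRAM alone without tiering, or DRAM plus the
// tiered file capacity when tiering is enabled (storage treated as an
// extension of memory, per the suggestion above).
inline ssize_t EvictionRedline(size_t max_memory_limit,
                               size_t tiered_max_file_size,
                               bool tiering_enabled, unsigned num_shards) {
  size_t capacity = max_memory_limit;
  if (tiering_enabled)
    capacity += tiered_max_file_size;  // max_capacity_limit = DRAM + file
  return static_cast<ssize_t>(capacity * kRedLimitFactor) / num_shards;
}
```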

Collaborator (Author)

I did not touch the eviction logic for tiering; it should be handled, but not in this PR.
I am not sure we will have the same eviction logic for tiering as we have today.

@romange (Collaborator) Jan 30, 2024

I have some reservations about Yue's suggestion. I am afraid that this assumes that all pieces of the system work perfectly well and the offloading has enough bandwidth and cpu resources to offload everything fast enough. And that there are no bugs, we are not out of disk space etc.

In contrast, I imagine both the eviction and offloading processes as two independent forces that work towards the common goal. If the offloading does not achieve its goals, eviction will still make sure that our RAM/RSS usage won't cross the RAM limit.

Collaborator

The practical implication is that eviction policy should delete items if available RAM is low like we do today but it should delete both offloaded and hot items.

Contributor

It depends on how you want to define tiering: if storage is only an extension of memory, then it is considered memory too, so DRAM and storage together form a single cache, and eviction should cover all data in that cache, i.e. DRAM + storage.

But if one day storage becomes truly persistent, it will make sense to use what you suggested.

src/core/dash.h Outdated
@@ -240,6 +240,10 @@ class DashTable : public detail::DashTableBase {
// calling cb(iterator) for every non-empty slot. The iteration goes over a physical bucket.
template <typename Cb> void TraverseBucket(const_iterator it, Cb&& cb);

// Traverses over a single bucket in table and calls cb(iterator). The traverse order will be
// segment by segment over phisical backets.
Collaborator

Suggested change
// segment by segment over phisical backets.
// segment by segment over physical buckets.

Collaborator

Please explicitly state that there are no coverage guarantees if the table grows/shrinks, and that this is useful when formal full coverage is not critically important.

@@ -118,6 +118,7 @@ class CompactObj {
ASCII2_ENC_BIT = 0x10,
IO_PENDING = 0x20,
STICKY = 0x40,
SIEVE = 0x80,
Collaborator

Maybe add comment describing in short what SIEVE does? Also add a link to the paper

};

// Traverse a single segment every time this function is called.
for (int i = 0; i < 60; ++i) {
Collaborator

should be a constant taken from PrimeTable, right?

Collaborator (Author)

Well, in the current implementation I go through one entire segment every time this function is called, but that is a fairly arbitrary decision. I will try to tune it and see whether it makes sense to go over more or fewer buckets to get a better hit rate when running benchmarks. The time between runs of this function and the number of buckets it goes through determine when an item becomes cold.
I.e., if we have 1000 segments and this function runs every millisecond, then an item becomes cold if it was not touched in 1 second.
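The back-of-the-envelope arithmetic above can be written down directly (the helper is illustrative, not the PR's code):

```cpp
#include <cassert>
#include <cstddef>

// Time for one full cycle over all segments, i.e. how long an item must go
// untouched before it is considered cold. With 1000 segments, one segment
// per run, and a 1 ms run interval, the cold window is 1000 ms.
inline size_t ColdWindowMs(size_t num_segments, size_t segments_per_run,
                           size_t run_interval_ms) {
  // runs needed for a full cycle, times the interval between runs
  return (num_segments / segments_per_run) * run_interval_ms;
}
```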

@@ -38,6 +38,9 @@ ABSL_FLAG(dfly::MemoryBytesFlag, tiered_max_file_size, dfly::MemoryBytesFlag{},
"0 - means the program will automatically determine its maximum file size. "
"default: 0");

ABSL_FLAG(float, tiered_offload_threshold, 0.5,
"The ration of used/max memory above which we run offloading values to disk");
Collaborator

Suggested change
"The ration of used/max memory above which we run offloading values to disk");
"The ratio of used/max memory above which we start offloading values to disk");


@romange (Collaborator) commented Jan 30, 2024

Also please resolve tiered_storage conflicts.

VLOG(2) << "Skip WriteSingle for: " << key;
}
return error_code{};
return true;
@romange (Collaborator) Feb 12, 2024

why do you skip writing here?

Collaborator (Author)

This function just does some setup and returns whether we can schedule offloading. When writing big blobs we always schedule an offload, while small blobs are aggregated in pending_entries, and we schedule offloading only when pending_entries reaches its max size.
So below is the preparation for small blobs, where we add to pending_entries and return true (which results in the caller scheduling an offload) if we have aggregated the max number of entries in the bin.
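The big-blob/small-blob split described here could look roughly like this (the names, threshold, and bin capacity are hypothetical, not the real TieredStorage API):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr size_t kBigBlobLen = 4096;  // assumed threshold
constexpr size_t kBinMaxEntries = 8;  // assumed bin capacity

struct SmallBin {
  std::vector<std::string> pending_entries;
};

// Returns true if the caller should schedule an offload now: big blobs are
// scheduled immediately, small blobs are batched until the bin is full.
inline bool PrepareOffload(SmallBin& bin, const std::string& blob) {
  if (blob.size() >= kBigBlobLen)
    return true;  // big blobs always go straight to offload
  bin.pending_entries.push_back(blob);
  return bin.pending_entries.size() >= kBinMaxEntries;  // flush a full bin
}
```

Batching small blobs this way amortizes the per-write I/O cost instead of issuing one disk write per tiny value.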

@@ -51,15 +56,23 @@ class TieredStorage {

std::error_code Read(size_t offset, size_t len, char* dest);

bool AllowWrites() const;
Collaborator

maybe: IoDeviceOverloaded() with the opposite meaning? AllowWrites is a too general name

private:
class InflightWriteRequest;

void WriteSingle(DbIndex db_index, PrimeIterator it, size_t blob_len);

// Returns a pair consisting of a bool denoting whether we can write to disk, and an updated
// iterator, as this function can yield. 'it' should not be used after the call to this function.
std::pair<bool, PrimeIterator> CanScheduleOffload(DbIndex db_index, PrimeIterator it,
std::string_view key);
std::pair<bool, PrimeIterator> ThrottleWrites(DbIndex db_index, PrimeIterator it,
Collaborator

worth adding what this function does.

Signed-off-by: adi_holden <adi@dragonflydb.io>
This reverts commit 994afab.
@adiholden adiholden merged commit 32e8d49 into main Feb 14, 2024
10 checks passed
@adiholden adiholden deleted the tiering_background_offload branch February 14, 2024 12:28