feat(tiering): add background offload step #2504
Conversation
src/core/compact_object.h
Outdated
@@ -216,6 +217,18 @@ class CompactObj {
  }
}

bool HasTouched() const {
Suggested change:
- bool HasTouched() const {
+ bool WasTouched() const {
imho
Thanks for the work. It would be helpful to add more descriptions to the PR.
src/core/compact_object.h
Outdated
@@ -118,6 +118,7 @@ class CompactObj {
  ASCII2_ENC_BIT = 0x10,
  IO_PENDING = 0x20,
  STICKY = 0x40,
  SIEVE = 0x80,
"SIEVE" is a very confusing name to use.
We just need a simple word to express whether the key has been accessed or not.
@@ -240,6 +240,10 @@ class DashTable : public detail::DashTableBase {
  // calling cb(iterator) for every non-empty slot. The iteration goes over a physical bucket.
  template <typename Cb> void TraverseBucket(const_iterator it, Cb&& cb);

  // Traverses over a single bucket in table and calls cb(iterator). The traverse order will be
  // segment by segment over phisical backets.
  template <typename Cb> Cursor TraverseBySegmentOrder(Cursor curs, Cb&& cb);
or just TraverseSegment?
TraverseSegment sounds like we go over one segment only
src/server/db_slice.cc
Outdated
auto cb = [&](PrimeIterator it) {
  // TBD check we did not lock it for future transaction

  if (increase_goal_bytes > offloaded_bytes && !(it->first.HasTouched()) &&
It would be great to have some comments explaining these conditions.
    VLOG(2) << "ScheduleOffload bytes:" << offloaded_bytes;
  }
}
it->first.SetTouched(false);
You only need to do so if the slot has been touched and will be offloaded, so it should go under the if-condition.
We offload only items that are marked as not touched, which means that they are cold items.
How do we make them cold (set them as not touched)? By iterating over the database and setting touched to false. If an item is fetched, it is marked touched again. This way, if the item was not touched since the last time we iterated over this bucket, it is cold and a candidate for offload.
Therefore the SetTouched(false) should be outside the if.
// Traverse a single segment every time this function is called.
for (int i = 0; i < 60; ++i) {
  cursor = pt.TraverseBySegmentOrder(cursor, cb);
Could this take too long? Keeping the segment locked for too long will affect performance.
Can the granularity be smaller? Say, on every eviction we only traverse a certain number of buckets?
This loop traverses exactly 60 buckets. I don't think this takes a long time, but we can decrease or increase the number if we want.
@@ -602,7 +605,9 @@ void EngineShard::Heartbeat() {
  ttl_delete_target = kTtlDeleteLimit * double(deleted) / (double(traversed) + 10);
}

ssize_t redline = (max_memory_limit * kRedLimitFactor) / shard_set->size();
ssize_t eviction_redline = (max_memory_limit * kRedLimitFactor) / shard_set->size();
I think when eviction and tiering are both enabled, it gets tricky. To me, eviction_redline shall be calculated correspondingly:
- if tiering is not enabled, it is computed in the current way.
- otherwise, it should be calculated as:
ssize_t eviction_redline = (max_capacity_limit * kRedLimitFactor) / shard_set->size();
where max_capacity_limit should be the sum of DRAM and tiering file capacity (e.g. tiered_max_file_size).
In general, I think eviction shall not happen before offloading, and we need to add code to check this condition.
I did not touch the eviction logic for tiering; it should be handled, but not in this PR.
I am not sure we will have the same eviction logic for tiering as we have today.
I have some reservations about Yue's suggestion. I am afraid that this assumes that all pieces of the system work perfectly well and the offloading has enough bandwidth and cpu resources to offload everything fast enough. And that there are no bugs, we are not out of disk space etc.
In contrast, I imagine both the eviction and offloading processes as two independent forces that work towards the common goal. If the offloading does not achieve its goals, eviction will still make sure that our RAM/RSS usage won't cross the RAM limit.
The practical implication is that eviction policy should delete items if available RAM is low like we do today but it should delete both offloaded and hot items.
It depends on how you want to define tiering: if storage is only an extension of memory, then it is considered memory too, so DRAM and storage together form a single cache, and eviction shall cover data in the cache, i.e. DRAM + storage.
But if one day storage becomes truly persistent, it will make sense to use what you suggested.
src/core/dash.h
Outdated
@@ -240,6 +240,10 @@ class DashTable : public detail::DashTableBase {
  // calling cb(iterator) for every non-empty slot. The iteration goes over a physical bucket.
  template <typename Cb> void TraverseBucket(const_iterator it, Cb&& cb);

  // Traverses over a single bucket in table and calls cb(iterator). The traverse order will be
  // segment by segment over phisical backets.
Suggested change:
- // segment by segment over phisical backets.
+ // segment by segment over physical buckets.
Please explicitly state that there are no coverage guarantees if the table grows or shrinks, and that this is useful when formal full coverage is not critically important.
src/core/compact_object.h
Outdated
@@ -118,6 +118,7 @@ class CompactObj {
  ASCII2_ENC_BIT = 0x10,
  IO_PENDING = 0x20,
  STICKY = 0x40,
  SIEVE = 0x80,
Maybe add a comment briefly describing what SIEVE does? Also add a link to the paper.
};

// Traverse a single segment every time this function is called.
for (int i = 0; i < 60; ++i) {
should be a constant taken from PrimeTable, right?
Well, in the current implementation I go through one entire segment every time this function is called, but this was a fairly arbitrary decision. I will try to tune it and see if it makes sense to go over more or fewer buckets to get a better hit rate when running benchmarks. The time between runs of this function and the number of buckets it goes through determine when an item becomes cold.
I.e., if we have 1000 segments and this function runs every millisecond, then an item will become cold if it was not touched in 1 second.
src/server/engine_shard_set.cc
Outdated
@@ -38,6 +38,9 @@ ABSL_FLAG(dfly::MemoryBytesFlag, tiered_max_file_size, dfly::MemoryBytesFlag{},
  "0 - means the program will automatically determine its maximum file size. "
  "default: 0");

ABSL_FLAG(float, tiered_offload_threshold, 0.5,
          "The ration of used/max memory above which we run offloading values to disk");
"The ration of used/max memory above which we run offloading values to disk"); | |
"The ratio of used/max memory above which we start offloading values to disk"); |
@@ -602,7 +605,9 @@ void EngineShard::Heartbeat() {
  ttl_delete_target = kTtlDeleteLimit * double(deleted) / (double(traversed) + 10);
}

ssize_t redline = (max_memory_limit * kRedLimitFactor) / shard_set->size();
ssize_t eviction_redline = (max_memory_limit * kRedLimitFactor) / shard_set->size();
Also please resolve tiered_storage conflicts.
  VLOG(2) << "Skip WriteSingle for: " << key;
}
return error_code{};
return true;
why do you skip writing here?
This function just does some setup and returns whether we can schedule offloading. When writing big blobs we always schedule an offload, while small blobs are aggregated in pending_entries, and we schedule offloading only when pending_entries reaches max size.
So below is the preparation for small blobs, where we add to pending_entries and return true (which results in the caller scheduling an offload) once we have aggregated the maximum number of entries in the bin.
src/server/tiered_storage.h
Outdated
@@ -51,15 +56,23 @@ class TieredStorage {

  std::error_code Read(size_t offset, size_t len, char* dest);

  bool AllowWrites() const;
Maybe IoDeviceOverloaded(), with the opposite meaning? AllowWrites is too general a name.
private:
  class InflightWriteRequest;

  void WriteSingle(DbIndex db_index, PrimeIterator it, size_t blob_len);

  // Returns a pair consisting of a bool denoting whether we can write to disk, and an updated
  // iterator as this function can yield. 'it' should not be used after the call to this function.
  std::pair<bool, PrimeIterator> CanScheduleOffload(DbIndex db_index, PrimeIterator it,
                                                    std::string_view key);
  std::pair<bool, PrimeIterator> ThrottleWrites(DbIndex db_index, PrimeIterator it,
worth adding what this function does.
Signed-off-by: adi_holden <adi@dragonflydb.io>
This reverts commit 994afab.
Implement tiering background offloading based on https://junchengyang.com/publication/nsdi24-SIEVE.pdf
When fetching an item we mark it as touched (hot).
A background task runs periodically, traversing several buckets (continuing from the last place it stopped) and marking their items as untouched (cold).
The background task checks whether an item is cold (not touched since the last time the traversal set it cold). If an item is cold and can be offloaded, it is scheduled for offload.