feat(server): add oom guard #1650

Merged: 5 commits merged from oom_guard into main on Aug 8, 2023
Conversation

adiholden
Collaborator

fixes #1634

  1. Add flag maxmemory_ratio.
  2. When current used memory exceeds maxmemory_limit * maxmemory_ratio, commands flagged DENYOOM return an OOM error (see the sketch below).
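
A minimal, standalone sketch of the guard condition (the real check lives in Service::DispatchCommand, shown in the diff further down; the helper name here is hypothetical and the inputs are simplified):

#include <cstdint>

// Standalone illustration of the OOM guard: commands marked DENYOOM are
// rejected once used memory crosses the configured multiple of the limit.
bool ShouldDenyOom(uint64_t used_memory, uint64_t max_memory_limit,
                   double maxmemory_ratio, bool cmd_denies_oom) {
  return cmd_denies_oom && used_memory > max_memory_limit * maxmemory_ratio;
}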

@@ -226,6 +228,8 @@ class ServerState { // public struct - to allow initialization.

absl::flat_hash_map<std::string, base::Histogram> call_latency_histos_;
uint32_t thread_index_ = 0;
uint64_t used_mem_ = 0;
uint64_t last_chached_used_current_ = 0;
Collaborator

Typo (chached -> cached)

Collaborator

Further, I'd rename it to used_mem_last_update_ or something along those lines

@@ -130,6 +130,8 @@ class ServerState { // public struct - to allow initialization.
gstate_ = s;
}

uint64_t GetCachedUsedMemory(uint64_t now_ns);
Collaborator

Cache can mean different things. As a start, I'd add a comment describing what this method does. Reading this first, I thought that this might be the memory of the items in cache (as in cache mode).
Perhaps call it GetLastUsedMemory()? I'd even call it GetUsedMemory() and comment that it might be stale for some short period of time.
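
For illustration, the declaration plus comment could read something like this (wording and name are hypothetical, not the final code):

// Returns this thread's view of the process used memory. The value is
// refreshed from used_mem_current at most once per kCacheEveryNs, so it may
// be stale for a short period of time.
uint64_t GetUsedMemory(uint64_t now_ns);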

@@ -71,6 +71,15 @@ void ServerState::Destroy() {
state_ = nullptr;
}

uint64_t ServerState::GetCachedUsedMemory(uint64_t now_ns) {
uint64_t kCacheEveryNs = 1000;
Collaborator

Make it constexpr to justify the k notation? :)
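
That is, something along these lines (same value, just declared constexpr):

constexpr uint64_t kCacheEveryNs = 1000;  // the k prefix usually marks a compile-time constant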

uint64_t kCacheEveryNs = 1000;
if (now_ns > last_chached_used_current_ + kCacheEveryNs) {
last_chached_used_current_ = now_ns;
used_mem_ = used_mem_current;
Collaborator

Maybe use used_mem_current.load(...) to be clear that you're caching an atomic var? It was an odd method to read before I realized that :)

@@ -71,6 +71,15 @@ void ServerState::Destroy() {
state_ = nullptr;
}

uint64_t ServerState::GetCachedUsedMemory(uint64_t now_ns) {
Collaborator

Why not calculate now_ns inside the method? Is it due to performance reasons?
It looks like now we call it once per DispatchCommand(), so it shouldn't be noticeable (right?)

Collaborator

I actually think it's better to pass time variables for several reasons.

  1. It's usually much harder to test code that internally relies on clock calls.
  2. Once the default is flipped and we declare that calling the clock internally is the better style, the code is immediately flooded with clock calls.
  3. Which brings us to the biggest thing: clock calls can be expensive. It depends on the hardware, virtualization, etc., but ask @dranikpg - I remember that at some point all the benchmarks he ran on his laptop were meaningless because the clock function took something like 30% of the total CPU in Dragonfly. (See the sketch after this list.)
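
A tiny sketch of the pattern (the class and names here are hypothetical; it only illustrates why injecting now_ns keeps both tests and the hot path cheap):

#include <cstdint>

// The caller samples the clock once and passes the timestamp in. Tests can
// feed any fake now_ns, and the hot path pays for a single clock call per
// dispatch instead of one per helper.
class ThrottledSampler {
 public:
  uint64_t RefreshIfStale(uint64_t now_ns, uint64_t fresh_value) {
    constexpr uint64_t kRefreshEveryNs = 1000;
    if (now_ns > last_update_ns_ + kRefreshEveryNs) {
      last_update_ns_ = now_ns;
      value_ = fresh_value;
    }
    return value_;
  }

 private:
  uint64_t last_update_ns_ = 0;
  uint64_t value_ = 0;
};

In a test, calling RefreshIfStale(0, x) and then RefreshIfStale(2000, y) exercises both branches without ever touching the real clock.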

Collaborator

Sure, then feel free to ignore

ABSL_FLAG(double, maxmemory_ratio, 1.1,
"commands with flag denyoom will return OOM when the ratio between maxmemory and used "
"memory is above "
"this value");
Collaborator

Merge with the line above?

@@ -917,6 +921,11 @@ void Service::DispatchCommand(CmdArgList args, facade::ConnectionContext* cntx)
}

uint64_t start_ns = ProactorBase::GetMonotonicTimeNs(), end_ns;
double maxmemory_ratio = GetFlag(FLAGS_maxmemory_ratio);
uint64_t used_memory = ServerState::tlocal()->GetCachedUsedMemory(start_ns);
if (used_memory > (max_memory_limit * maxmemory_ratio) && (cid->opt_mask() & CO::DENYOOM)) {
Collaborator

Also, please go over all write commands (there aren't many of them) and check which of them should have DENYOOM - I think some have it that shouldn't, and the opposite.

Collaborator Author

I did this in another PR

@@ -76,6 +76,10 @@ ABSL_FLAG(MaxMemoryFlag, maxmemory, MaxMemoryFlag{},
"Limit on maximum-memory that is used by the database. "
"0 - means the program will automatically determine its maximum memory usage. "
"default: 0");
ABSL_FLAG(double, maxmemory_ratio, 1.1,
Collaborator

Maybe call it oom_deny_ratio?

uint64_t ServerState::GetCachedUsedMemory(uint64_t now_ns) {
uint64_t kCacheEveryNs = 1000;
if (now_ns > last_chached_used_current_ + kCacheEveryNs) {
last_chached_used_current_ = now_ns;
Collaborator

chached -> cached

uint64_t kCacheEveryNs = 1000;
if (now_ns > last_chached_used_current_ + kCacheEveryNs) {
last_chached_used_current_ = now_ns;
used_mem_ = used_mem_current;
Collaborator

used_mem_current.load(memory_order_relaxed)

@@ -226,6 +228,8 @@ class ServerState { // public struct - to allow initialization.

absl::flat_hash_map<std::string, base::Histogram> call_latency_histos_;
uint32_t thread_index_ = 0;
uint64_t used_mem_ = 0;
Collaborator

Rename it to used_mem_cached_ and add a comment saying it's a thread-local cache of used_mem_current (something like the sketch below).
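
Putting the suggestions from this review together, the method might end up looking roughly like this (a sketch only, using the names proposed above; the final naming is up to the author):

uint64_t ServerState::GetUsedMemory(uint64_t now_ns) {
  constexpr uint64_t kCacheEveryNs = 1000;
  if (now_ns > used_mem_last_update_ + kCacheEveryNs) {
    used_mem_last_update_ = now_ns;
    // used_mem_cached_ is a thread-local cache of the global used_mem_current counter.
    used_mem_cached_ = used_mem_current.load(std::memory_order_relaxed);
  }
  return used_mem_cached_;
}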

fixes #1634
1. Add flag maxmemory_ratio.
2. When current used memory exceeds maxmemory_limit * maxmemory_ratio,
   commands flagged DENYOOM return an OOM error.

Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
Signed-off-by: adi_holden <adi@dragonflydb.io>
@@ -892,6 +895,11 @@ void Service::DispatchCommand(CmdArgList args, facade::ConnectionContext* cntx)
}

uint64_t start_ns = ProactorBase::GetMonotonicTimeNs(), end_ns;
double oom_deny_ratio = GetFlag(FLAGS_oom_deny_ratio);
uint64_t used_memory = ServerState::tlocal()->GetUsedMemory(start_ns);
Collaborator
@romange Aug 7, 2023

  1. We already have etl for ServerState::tlocal().
  2. Let's check first whether the mask has DENYOOM before fetching all the data. We are on the hot path (see the sketch after this list).
  3. @chakaz, you mentioned that GetFlag is heavy. Is that indeed the case? (I don't think we need to address that now.)
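
A sketch of point 2, reordered so non-DENYOOM commands exit the check immediately (the error reply and exact names are illustrative, not the final code):

if (cid->opt_mask() & CO::DENYOOM) {
  double oom_deny_ratio = GetFlag(FLAGS_oom_deny_ratio);
  uint64_t used_memory = ServerState::tlocal()->GetUsedMemory(start_ns);
  if (used_memory > max_memory_limit * oom_deny_ratio) {
    (*cntx)->SendError("Out Of Memory");  // illustrative reply text
    return;
  }
}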

Signed-off-by: adi_holden <adi@dragonflydb.io>
@adiholden requested a review from romange on August 8, 2023, 19:44
@adiholden merged commit 116934b into main on Aug 8, 2023
10 checks passed
@adiholden deleted the oom_guard branch on August 8, 2023, 20:26