
Expensive range request memory usage optimization to reduce OOM #13458

Closed
zyfo2 opened this issue Nov 4, 2021 · 10 comments

Comments


zyfo2 commented Nov 4, 2021

When the etcd server receives highly concurrent range requests, e.g. repeatedly listing all pods, memory usage increases quickly and there is a chance of OOM. It's a similar issue to this and this.
Some proposals have suggested range streaming or rate limiting as a long-term fix, but there has been no progress due to their complexity.

Basically, each range request needs to allocate memory and decode data from bbolt, then transform that in-memory data into the actual gRPC response. So with, say, 100 concurrent "list pods" requests, the server holds 100 copies of the intermediate data and 100 actual gRPC responses. The 100 response copies are necessary, while the 100 intermediate copies in memory could be reduced to a single shared one.

Test profiling below also shows that two parts consume the most memory:
Intermediate data: each backend response needs to allocate memory for each revision's value read from bbolt.
Actual response: each gRPC response needs to encode the backend response into the gRPC stream.
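Roughly, the per-request flow described above looks like the following simplified sketch (hypothetical helper names, not the actual etcd code); step 2 is the intermediate allocation that is repeated for every request:

```go
package sketch

import (
	pb "go.etcd.io/etcd/api/v3/etcdserverpb"
	"go.etcd.io/etcd/api/v3/mvccpb"
)

// readFromBackend stands in for the bbolt read of raw revision values.
func readFromBackend(startKey, endKey []byte, rev int64) [][]byte { return nil }

func handleRange(startKey, endKey []byte, rev int64) (*pb.RangeResponse, error) {
	// 1. read raw values for the requested key range at the given revision
	raw := readFromBackend(startKey, endKey, rev)

	// 2. intermediate data: one decoded copy per request
	kvs := make([]*mvccpb.KeyValue, 0, len(raw))
	for _, v := range raw {
		kv := &mvccpb.KeyValue{}
		if err := kv.Unmarshal(v); err != nil {
			return nil, err
		}
		kvs = append(kvs, kv)
	}

	// 3. actual response: encoded again when written to the gRPC stream
	return &pb.RangeResponse{Kvs: kvs}, nil
}
```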

The response part can't be optimized on the etcd server side unless we use a proxy or an API-server-side cache to reduce the requests reaching the server. This is possible but would need to support quorum reads, which is out of the scope of this issue.
The intermediate part, however, is redundant for repetitive requests. For example, concurrent "list all pods" requests only need one copy of the revision list's response from bbolt.

So one naive idea is to "cache" these frequent range responses for reuse.
An initial test has verified the idea: about 50% of the memory can be saved for these repetitive range requests.
More details, such as the cache eviction policy, still need to be explored, but this should explain the idea.

This may be a naive idea and there are many more details to discuss, but I would really appreciate some feedback while I run my own tests. Thanks!

Simple cache POC Commit.
LFU cache sample commit
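For illustration, here is a minimal sketch of what such a revision-keyed cache could look like (hypothetical names, not the POC code linked above). Since MVCC data at a fixed revision is immutable until compaction, a cache hit can safely hand the same decoded slice to every reader:

```go
package sketch

import (
	"sync"

	"go.etcd.io/etcd/api/v3/mvccpb"
)

// cacheKey identifies a range read; data at a fixed revision never changes,
// so entries only need to be dropped on compaction or for eviction.
type cacheKey struct {
	startKey, endKey string
	rev              int64
}

type rangeCache struct {
	mu      sync.Mutex
	entries map[cacheKey][]mvccpb.KeyValue
}

func newRangeCache() *rangeCache {
	return &rangeCache{entries: make(map[cacheKey][]mvccpb.KeyValue)}
}

// get returns the shared decoded result if present; callers must treat it as read-only.
func (c *rangeCache) get(k cacheKey) ([]mvccpb.KeyValue, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	kvs, ok := c.entries[k]
	return kvs, ok
}

// put stores the decoded result for reuse by later identical requests.
func (c *rangeCache) put(k cacheKey, kvs []mvccpb.KeyValue) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[k] = kvs
}

// invalidate drops everything, e.g. on compaction or under memory pressure;
// a real version would need an eviction policy (LRU/LFU) instead.
func (c *rangeCache) invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = make(map[cacheKey][]mvccpb.KeyValue)
}
```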


zyfo2 commented Nov 23, 2021

@hexfusion @gyuho I would really appreciate any comments on this.

wilsonwang371 (Contributor) commented:

I have done something similar before: https://github.com/wilsonwang371/etcd/tree/range_cache

But I remember the performance improvement was not significant.

You can take a look at this.


zyfo2 commented Nov 23, 2021

@wilsonwang371 Hi Wilson, thanks for the reply, but after taking a look it's not exactly the same thing.
Your commit caches at the kv index level, while I'm trying to cache the bbolt response, which consumes a big part of the memory for range requests.
[profiling screenshot]

wilsonwang371 (Contributor) commented:

From what I understand, caching the result may not decrease the chance of OOM, since the cache is also holding the data anyway.

I think using a cache can potentially only decrease the response time.


zyfo2 commented Nov 23, 2021

@wilsonwang371 Yeah, for a normal cache it works that way.

But if you look at the code and the profiling info above, you can see that two parts consume the majority of the memory: 1. each gRPC response needs its own copy of the response, encoded from the backend range response, and this part can't be reduced on the etcd server side; 2. each backend range response itself needs to be decoded from the bbolt backend, so this step also consumes a big part of the memory, but it can be shared across requests. Simply put, if the result of the above line of code can be shared as a single slice, this redundant decoding allocation can be saved.

You can check my code commit above for more details. It is also verified in the profiling below, where you can see that part is now very minimal (only one copy of the decoded response):
[profiling screenshot]
This is an idealized POC case, so many details such as the caching policy are skipped, but it should explain the idea.
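To make the sharing concrete, here is a minimal sketch (not the POC code; decodeRangeFromBackend and the other names are hypothetical) of collapsing concurrent identical range reads into one decode using golang.org/x/sync/singleflight, so N in-flight identical requests receive the same read-only slice instead of N copies:

```go
package sketch

import (
	"fmt"

	"go.etcd.io/etcd/api/v3/mvccpb"
	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// decodeRangeFromBackend stands in for the existing per-request bbolt read
// plus Unmarshal that allocates the intermediate []mvccpb.KeyValue.
func decodeRangeFromBackend(startKey, endKey []byte, rev int64) ([]mvccpb.KeyValue, error) {
	// ... existing range + unmarshal logic ...
	return nil, nil
}

// sharedRange deduplicates concurrent identical reads: callers with the same
// (range, revision) key block on one decode and get the same slice back,
// which they must treat as read-only.
func sharedRange(startKey, endKey []byte, rev int64) ([]mvccpb.KeyValue, error) {
	// a real key would need an unambiguous encoding of the range bounds
	key := fmt.Sprintf("%s/%s/%d", startKey, endKey, rev)
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return decodeRangeFromBackend(startKey, endKey, rev)
	})
	if err != nil {
		return nil, err
	}
	return v.([]mvccpb.KeyValue), nil
}
```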

@zyfo2 zyfo2 changed the title Caching frequent range response to reduce OOM chance? Reuse range response memory in mvcc to reduce OOM? Nov 26, 2021
@zyfo2 zyfo2 changed the title Reuse range response memory in mvcc to reduce OOM? Reuse range response for effective memory management in mvcc Nov 26, 2021
@zyfo2 zyfo2 changed the title Reuse range response for effective memory management in mvcc Reuse range response for efficient memory management in mvcc Nov 26, 2021

zyfo2 commented Dec 16, 2021

@wilsonwang371 Can you please comment on this idea? Am I making it clear? Thanks

@zyfo2 zyfo2 changed the title Reuse range response for efficient memory management in mvcc Expensive range request memory usage optimization Dec 16, 2021
wilsonwang371 (Contributor) commented:

I took a look at your code change. I think there are a few things we need to pay attention to:

  1. Etcd is not only used by Kubernetes. We cannot make assumptions about usage based on partial data. In a write-mostly scenario, the cache miss rate can be significant.
  2. BoltDB is hash-based and mapped into memory; it is not slow.
  3. Marshal and Unmarshal take time, but that is down to the protobuf design. There are alternative frameworks that can reduce the amount of data copying. However, we are not confident enough in a different serialization/deserialization framework.

All in all, what I mean is that the Marshal/Unmarshal costs come from protobuf. Using a cache might not be the correct way to solve this issue. Without a clearly better protobuf alternative, we should stick with protobuf, since it is deeply embedded in the implementation.


zyfo2 commented Dec 16, 2021

@wilsonwang371 Thanks so much for taking a look.

Etcd is not only used by Kubernetes. We cannot make assumptions about usage based on partial data. In a write-mostly scenario, the cache miss rate can be significant.

"Cache" may not be the most accurate wording. In this implementation there is no extra memory usage. It just keeps the response data in memory a bit longer for potential reuse, and the data can still be released if there is no hit. There can be policies for when to enable or disable it. A high cache miss rate simply means the data is released earlier.

BoltDB is hash-based and mapped into memory; it is not slow.
Marshal and Unmarshal take time, but that is down to the protobuf design. There are alternative frameworks that can reduce the amount of data copying. However, we are not confident enough in a different serialization/deserialization framework.

I'm a bit confused, because this proposal is not trying to make anything faster; it only tries to reuse some memory allocations in the case of many repetitive range requests, such as listing pods concurrently.
Yes, I agree that if Marshal/Unmarshal could compress the data to a much smaller size, that would be ideal.
Just to clarify, the idea is simply to reuse memory: with 100 concurrent "list pods" calls, the server ends up with 100 copies of the actual data in memory, which does not seem necessary. 100 copies of the response data are needed, but 100 copies of the intermediate data are not.

All in all, the idea is to limit the impact to expensive range scenarios and disable it when not necessary.
The idea is to reuse some of the memory allocations:

From:
bboltdb -> intermediate memory data -> actual response
bboltdb -> intermediate memory data -> actual response
bboltdb -> intermediate memory data -> actual response

To:
bboltdb                             -> actual response
bboltdb -> intermediate memory data -> actual response
bboltdb                             -> actual response

so only one copy of the intermediate data is needed instead of three, and no extra memory is allocated; the copy is just kept around for a while, delaying its release.
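A minimal sketch of that "keep it there for a while and delay the releasing" behavior (hypothetical names, not the POC code): the decoded slice is retained for a short fixed window so closely-spaced identical requests can reuse it, then dropped so it becomes garbage-collectable again:

```go
package sketch

import (
	"sync"
	"time"

	"go.etcd.io/etcd/api/v3/mvccpb"
)

type retainer struct {
	mu      sync.Mutex
	ttl     time.Duration
	entries map[string][]mvccpb.KeyValue
}

func newRetainer(ttl time.Duration) *retainer {
	return &retainer{ttl: ttl, entries: make(map[string][]mvccpb.KeyValue)}
}

// hold keeps a decoded result around for ttl, then releases it. A real
// implementation would also drop entries on compaction and under memory pressure.
func (r *retainer) hold(key string, kvs []mvccpb.KeyValue) {
	r.mu.Lock()
	r.entries[key] = kvs
	r.mu.Unlock()

	time.AfterFunc(r.ttl, func() {
		r.mu.Lock()
		delete(r.entries, key)
		r.mu.Unlock()
	})
}

// reuse returns the retained slice (read-only for callers) if it is still
// within its retention window.
func (r *retainer) reuse(key string) ([]mvccpb.KeyValue, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	kvs, ok := r.entries[key]
	return kvs, ok
}
```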

wilsonwang371 (Contributor) commented:

Can you guys give some opinions on this? @gyuho @ptabor

@zyfo2 zyfo2 changed the title Expensive range request memory usage optimization Expensive range request memory usage optimization to reduce OOM Dec 21, 2021

stale bot commented Mar 23, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 23, 2022
@stale stale bot closed this as completed Apr 16, 2022
@ahrtr ahrtr mentioned this issue Oct 31, 2023