
Incremental etcd defrag #694

Open
serathius opened this issue Feb 13, 2024 · 6 comments

serathius commented Feb 13, 2024

Resurrecting old @ptabor idea that we had discussed some time ago but didn't have time to implement.

Problem

  • Bbolt page management still works like Windows 95: to stay efficient, it requires periodic defragmentation by calling the bbolt compact function. This is problematic from both a reliability and a maintenance perspective. Defragmenting etcd blocks writes, which in large clusters can take tens of seconds. In high-churn clusters, users need to manage a separate process that periodically defragments etcd.

Goal

  • Reduce the need for the heavy defragment operation by doing small incremental work during transactions. This should allow us to maintain a good enough fragmentation level without needing a periodic "stop the world" operation.

Proposal

During each transaction we decide whether to do additional work. If the storage overhead (file size / active page size) is more than 20%, we perform additional operations during the transaction.
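As a sketch of the trigger condition only (the `dbStats` type and its fields are hypothetical here, not bbolt's actual stats API; this reads "overhead more than 20%" as file size exceeding active page size by more than a factor of 1.2):

```go
package main

import "fmt"

// dbStats is a hypothetical snapshot of database size information.
// In real bbolt this would be derived from the file size and the
// freelist / in-use page counts.
type dbStats struct {
	fileSizeBytes   int // total size of the database file
	activePageBytes int // bytes occupied by in-use pages
}

// needsIncrementalWork implements the proposed per-transaction check:
// do extra defrag work when file size / active page size exceeds 1.2,
// i.e. more than 20% storage overhead.
func needsIncrementalWork(s dbStats) bool {
	if s.activePageBytes == 0 {
		return false
	}
	ratio := float64(s.fileSizeBytes) / float64(s.activePageBytes)
	return ratio > 1.2
}

func main() {
	fmt.Println(needsIncrementalWork(dbStats{fileSizeBytes: 100, activePageBytes: 90})) // ~11% overhead
	fmt.Println(needsIncrementalWork(dbStats{fileSizeBytes: 100, activePageBytes: 70})) // ~43% overhead
}
```

Because the check is cheap (two integers and a division), it could run at the start of every write transaction without measurable cost.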

The naive concept is to look into the free-pages map and find the last allocated block. I think we currently maintain lists of free blocks, sorted from the beginning to the end of the file, for different sizes of contiguous space. We might need to maintain a link to the last used page... Maybe we maintain a few links, not only to 'the' last page but also to the last not-too-big page.

On each transaction we rewrite the last page into the first 'suitable' position from the free list... and move the reference to the last page down... The challenge is when the last page is too huge... and we don't have contiguous space in the 'lower' part of the file to accept it... In that case a heuristic is needed to ignore it... and keep going toward the beginning of the file (hoping that this will eventually free up more space for the 'bigger' chunk)... And after some time (some number of transactions), start the process from the end again.
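A minimal sketch of that heuristic, modelling the file as runs of pages (all names are hypothetical; bbolt's real freelist structures differ): scan allocated runs from the end of the file, and move each into the first lower-offset free run that can hold it, skipping runs that don't fit anywhere lower.

```go
package main

import (
	"fmt"
	"sort"
)

// run is a contiguous range of pages: [start, start+size).
type run struct{ start, size int }

// relocateOnce performs one step of the naive heuristic: take the
// allocated run closest to the end of the file and move it into the
// first (lowest-offset) free run large enough to hold it. If the last
// run is too big to fit anywhere lower, ignore it and walk toward the
// beginning of the file, trying earlier runs instead.
func relocateOnce(alloc, free []run) (newAlloc, newFree []run, moved bool) {
	sort.Slice(alloc, func(i, j int) bool { return alloc[i].start < alloc[j].start })
	sort.Slice(free, func(i, j int) bool { return free[i].start < free[j].start })

	for i := len(alloc) - 1; i >= 0; i-- { // from the end of the file downward
		a := alloc[i]
		for j, f := range free {
			if f.start >= a.start || f.size < a.size {
				continue // need a lower offset with enough contiguous space
			}
			// Move the run: the free run shrinks from its front, and the
			// run's old location becomes free space near the end.
			alloc[i] = run{start: f.start, size: a.size}
			free[j] = run{start: f.start + a.size, size: f.size - a.size}
			free = append(free, run{start: a.start, size: a.size})
			return alloc, free, true
		}
	}
	return alloc, free, false
}

func main() {
	alloc := []run{{0, 4}, {10, 2}} // last run sits at pages 10-11
	free := []run{{4, 6}}           // hole at pages 4-9
	alloc, free, moved := relocateOnce(alloc, free)
	// moved == true; the run formerly at pages 10-11 now starts at page 4.
	fmt.Println(moved, alloc, free)
}
```

Repeating `relocateOnce` across transactions gradually packs allocated runs toward the start of the file, after which the tail of the file holds only free pages and could be truncated, which is the effect a full defrag achieves in one blocking pass.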

The assumption (though it requires study) is that even the biggest page is still relatively small compared to the whole file for k8s use-cases, so this should work reasonably well.

Additional notes:

This is a note dump from my discussion with @ptabor. The high-level idea should stand; however, I expect there may already be existing continuous-defrag algorithms that are better than the naive one presented here.

Expect that this can be implemented as configurable, optional behavior.

@Elbehery
Contributor

I am interested in implementing this.

cc @serathius @ahrtr

@serathius serathius changed the title Lightweight etcd defrag Incremental etcd defrag Feb 13, 2024
@Elbehery
Contributor

@ahrtr can I take this?

@serathius
Member Author

/assign @Elbehery

@Elbehery
Contributor

Would a draft PR be fine?

Or is a design needed beforehand?

@tjungblu
Contributor

tjungblu commented Jun 3, 2024

Last Friday I had a bit of time to visualize the page usage in bbolt during an OpenShift installation; the compactions and the resulting fragmentation are clearly visible in it.

https://www.youtube.com/watch?v=ZM91_mndjyY

Leaving this here for lack of a better ticket; maybe we can also put it up on the website to explain etcd's fragmentation.

@Elbehery
Contributor

Elbehery commented Jun 3, 2024

neat work @tjungblu 👍🏽
