
Incremental etcd defrag #694

Open
serathius opened this issue Feb 13, 2024 · 6 comments

serathius commented Feb 13, 2024

Resurrecting old @ptabor idea that we had discussed some time ago but didn't have time to implement.

Problem

  • Bbolt page management still works like Windows 95: to stay efficient, it requires periodic defragmentation by calling the bbolt compact function. This is problematic from both a reliability and a maintenance perspective. Defragmenting etcd blocks writes, which in large clusters can take tens of seconds. In high-churn clusters, users need to manage a separate process that periodically defragments etcd.

Goal

  • Reduce the need for the heavy defragment operation by doing small incremental work during transactions. This should allow us to maintain a good enough fragmentation level without needing a periodic "stop the world" operation.

Proposal

During each transaction we decide whether to do additional work. If the storage overhead (file size / active page size) is more than 20%, we perform additional operations during the transaction.
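As a sketch of the trigger condition only (the `dbStats` type and its fields are hypothetical here, not bbolt's actual stats API; this reads "overhead more than 20%" as file size exceeding active page size by more than a factor of 1.2):

```go
package main

import "fmt"

// dbStats is a hypothetical snapshot of database size information.
// In real bbolt this would be derived from the file size and the
// freelist / in-use page counts.
type dbStats struct {
	fileSizeBytes   int // total size of the database file
	activePageBytes int // bytes occupied by in-use pages
}

// needsIncrementalWork implements the proposed per-transaction check:
// do extra defrag work when file size / active page size exceeds 1.2,
// i.e. more than 20% storage overhead.
func needsIncrementalWork(s dbStats) bool {
	if s.activePageBytes == 0 {
		return false
	}
	ratio := float64(s.fileSizeBytes) / float64(s.activePageBytes)
	return ratio > 1.2
}

func main() {
	fmt.Println(needsIncrementalWork(dbStats{fileSizeBytes: 100, activePageBytes: 90})) // ~11% overhead
	fmt.Println(needsIncrementalWork(dbStats{fileSizeBytes: 100, activePageBytes: 70})) // ~43% overhead
}
```

Because the check is cheap (two integers and a division), it could run at the start of every write transaction without measurable cost.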

The naive concept is to look into the free-pages map and find the last allocated block. I think we currently maintain lists of free blocks, sorted from the beginning to the end of the file, for different sizes of contiguous space. We might need to maintain a link to the last used page... Maybe we maintain a few links, not only to 'the' last page but also to the last not-too-big page.

On each transaction we rewrite the last page into the first 'suitable' position from the free list... and move the reference to the last page down... The challenge is when the last page is too huge... and we don't have contiguous space in the 'lower' part of the file to accept it... In that case a heuristic is needed to ignore it... and keep going toward the beginning of the file (hoping that this will eventually free up more space for the 'bigger' chunk)... And after some time (some number of transactions), start the process from the end again.
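A minimal sketch of that heuristic, modelling the file as runs of pages (all names are hypothetical; bbolt's real freelist structures differ): scan allocated runs from the end of the file, and move each into the first lower-offset free run that can hold it, skipping runs that don't fit anywhere lower.

```go
package main

import (
	"fmt"
	"sort"
)

// run is a contiguous range of pages: [start, start+size).
type run struct{ start, size int }

// relocateOnce performs one step of the naive heuristic: take the
// allocated run closest to the end of the file and move it into the
// first (lowest-offset) free run large enough to hold it. If the last
// run is too big to fit anywhere lower, ignore it and walk toward the
// beginning of the file, trying earlier runs instead.
func relocateOnce(alloc, free []run) (newAlloc, newFree []run, moved bool) {
	sort.Slice(alloc, func(i, j int) bool { return alloc[i].start < alloc[j].start })
	sort.Slice(free, func(i, j int) bool { return free[i].start < free[j].start })

	for i := len(alloc) - 1; i >= 0; i-- { // from the end of the file downward
		a := alloc[i]
		for j, f := range free {
			if f.start >= a.start || f.size < a.size {
				continue // need a lower offset with enough contiguous space
			}
			// Move the run: the free run shrinks from its front, and the
			// run's old location becomes free space near the end.
			alloc[i] = run{start: f.start, size: a.size}
			free[j] = run{start: f.start + a.size, size: f.size - a.size}
			free = append(free, run{start: a.start, size: a.size})
			return alloc, free, true
		}
	}
	return alloc, free, false
}

func main() {
	alloc := []run{{0, 4}, {10, 2}} // last run sits at pages 10-11
	free := []run{{4, 6}}           // hole at pages 4-9
	alloc, free, moved := relocateOnce(alloc, free)
	// moved == true; the run formerly at pages 10-11 now starts at page 4.
	fmt.Println(moved, alloc, free)
}
```

Repeating `relocateOnce` across transactions gradually packs allocated runs toward the start of the file, after which the tail of the file holds only free pages and could be truncated, which is the effect a full defrag achieves in one blocking pass.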

The assumption (though it requires study) is that even the biggest page is still relatively small compared to the whole file for k8s use-cases, so this should work reasonably well.

Additional notes:

This is a note dump from my discussion with @ptabor. The high-level idea should stand; however, I expect there may already be existing continuous-defrag algorithms that are better than the naive one presented here.

Expect that this can be implemented as configurable, optional behavior.

@Elbehery
Contributor

I am interested in implementing this.

cc @serathius @ahrtr

@serathius serathius changed the title Lightweight etcd defrag Incremental etcd defrag Feb 13, 2024
@Elbehery
Contributor

@ahrtr can I take this?

@serathius
Member Author

/assign @Elbehery

@Elbehery
Contributor

Would a draft PR be fine?

Or is a design needed beforehand?

@tjungblu
Contributor

tjungblu commented Jun 3, 2024

Last Friday I had a bit of time to visualize the page usage in bbolt during an OpenShift installation; the compactions and the resulting fragmentation are clearly visible in it.

https://www.youtube.com/watch?v=ZM91_mndjyY

Leaving this here for lack of a better ticket; maybe we can also put it up on the website to explain etcd's fragmentation.

@Elbehery
Contributor

Elbehery commented Jun 3, 2024

neat work @tjungblu 👍🏽
