Deleting a large bucket causes infinite writes #401
I just discovered that, more precisely, the write I/O happens when requests are made, not when the DB is idle. So what that basically means is that a request that would normally take a few KB/s of I/O suddenly takes ALL of the available I/O bandwidth. Also, I replicated the bug on different hardware. I'm using v1.3.7. |
Do you mean you were trying to delete a bucket, and trying to update K/V data in other buckets concurrently? |
Thanks! No, it's well after the bucket is deleted. The bucket was deleted properly, no issue on that side. But then any request generates a maximum of I/O for no valid reason. |
Is the request successful? How long does the request take? How do you open the db? Please provide the options when opening the db. Please also run the following two commands,
|
The request is successful but takes multiple seconds instead of 10ms. These are the two lines in my code related to DB opening:

Database, err = bbolt.Open("database.db", 0600, nil)
if err != nil {
	return err
}
Database.MmapFlags = syscall.MAP_POPULATE

I reproduced the bug a third time and ran your two commands on the DB before running
|
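For reference, the same flag can also be passed at open time through bbolt.Options, which is the style used later in this thread. A minimal sketch (the path is a placeholder, and MAP_POPULATE is a Linux-only flag):

package main

import (
	"syscall"

	"go.etcd.io/bbolt"
)

// openDB sets MmapFlags via Options at open time instead of mutating the DB
// struct after Open.
func openDB(path string) (*bbolt.DB, error) {
	opts := *bbolt.DefaultOptions          // copy the defaults rather than mutating the shared pointer
	opts.MmapFlags = syscall.MAP_POPULATE  // ask the kernel to pre-fault the mmap
	return bbolt.Open(path, 0600, &opts)
}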
do you maybe have a reproducer at hand that can replicate this? would love to hook this into a debugger and see what's going on. Or is just simply writing 44M keys and deleting the bucket enough to trigger this? |
I don't have anything to give you publicly to reproduce it sadly, but I guess that yes, do that and you should be seeing the same thing. If not then there is more digging to do.. |
Sorry for the wait, it took some time to add a couple of buckets with 50M 32 byte keys in them. I created a super naive test where you put something first, delete the bucket and then put something into another bucket. The tests indeed show a 10x-20x increase in the time it takes:
The delete takes some time, but I do not really see it having a large chunk of continuous IO on the second put - even when running the fsync. @CorentinB can you maybe share a little how you're using bbolt? So what kind of "requests" you process and how they translate to the respective functions? That the compaction makes the process faster is not super surprising, for the same reason we suggest people to regularly defrag etcd. Fragmentation does take a toll, not sure whether that's the issue here for the latency increase. code for reference here: |
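For illustration only (this is not the reference code linked above), a delete-then-put timing test along those lines could look roughly like this; bucket names are made up and the "big" bucket is assumed to have been filled beforehand:

package main

import (
	"fmt"
	"time"

	"go.etcd.io/bbolt"
)

// timedPut measures how long a single small put takes, including the commit.
func timedPut(db *bbolt.DB, bucket, key, val []byte) time.Duration {
	start := time.Now()
	if err := db.Update(func(tx *bbolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucket)
		if err != nil {
			return err
		}
		return b.Put(key, val)
	}); err != nil {
		panic(err)
	}
	return time.Since(start)
}

func main() {
	db, err := bbolt.Open("bench.db", 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	fmt.Println("put before delete:", timedPut(db, []byte("other"), []byte("k1"), []byte("v")))

	// Delete the large, previously filled bucket.
	if err := db.Update(func(tx *bbolt.Tx) error {
		return tx.DeleteBucket([]byte("big"))
	}); err != nil {
		panic(err)
	}

	fmt.Println("put after delete:", timedPut(db, []byte("other"), []byte("k2"), []byte("v")))
}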
Thanks a lot for your test. I'd say a 10x-20x increase is already quite an issue, but that would be OK to live with if I weren't seeing much, much more than that. Our software uses bbolt as a back-end DB to handle projects, and buckets are "queues". There is one bucket per "project", and inside that bucket we have 4 buckets: 2 queues that we add to and remove from, a third bucket used to just store values long term (it actually stores a hash of the values that have been put into the queue; we read and add, we do not delete from it), and a last bucket for statistics-related data. I have seen the bug when deleting a single bucket inside a "project" bucket, but also when deleting an entire project (meaning the bucket and the 4 nested buckets inside it). Can you try running many PUT operations continuously after the delete, to see if you see the same thing as me (in my case, hundreds of MB/s of I/O, as opposed to a few MB/s maximum)? |
Thanks for the explanation. I was also curious regarding the many PUTs, the latency does recover somewhat after the first delete, but it's definitely slower than before:
You were definitely right about the bandwidth: on a bbolt DB with a deleted bucket it is writing at 500MB/s with the above bad latency, while on the original db the same put at <10ms is only 100MB/s. |
I'm happy you see more or less the same thing, although in my case it's much worse than that: it does not recover after the first request, and it persists over restarts of the software. In our case, the thing is that when running our software in production the flow of requests never stops, so we have constant I/O at 100% of what our SSD can do, which slows everything down. See that spike in I/O just after deleting a large bucket, lasting until we stopped the software and did a compact on the DB before restarting it. |
I sadly have to leave soon for the remainder of the week - maybe Ben can pick it up from here if he has the time. I took some pprofs of both cases; you can load them in with the usual go tool pprof.
From what I see it's actually the freelist write that takes most of the time. Do you maybe want to try disabling the freelist commit? There seems to be a flag for that with NoFreelistSync.
|
I'll try NoFreelistSync. |
Ok so I added NoFreelistSync. I then deleted the 50M bucket and started my benchmark on the 20M bucket; requests take around 20ms on average. So it seems like normal speed. Then I removed it and re-built the software with and without NoFreelistSync multiple times, and indeed it avoids the issue. (I was gonna write "fix" but it's not fixed :)) I am not very smart about what NoFreelistSync implies, so I'm not sure I can use it in production though.. Edit: could this issue with bbolt mentioned in this HN thread be related? https://news.ycombinator.com/item?id=30015703 "It only became an issue when someone wrote a ton of data and then deleted it and never used it again. Roblox reported having 4GB of free pages which translated into a giant array of 4-byte page numbers." |
It's a tradeoff. If the freelist isn't synced to disk, then it may take a while to open the db, see also #392 (comment). But the benefit is that you get better performance when committing each transaction, because bbolt doesn't need to sync the freelist. I am thinking we should probably set NoFreelistSync to true by default. |
The only useful use case for Should we eventually deprecate the option |
That's interesting.
Line 238 in 505fc0f
The list is not differential, so with each transaction it's written entirely.
I think that it's fair in 1.4 to change the defaults to: Lines 1263 to 1266 in 505fc0f
|
Okay thanks for all of your comments. I now use:

bboltOptions := bbolt.DefaultOptions
bboltOptions.FreelistType = bbolt.FreelistMapType
bboltOptions.NoFreelistSync = true
bboltOptions.MmapFlags = syscall.MAP_POPULATE

But there is an issue. I simply can't start my program without doing a compact first. I think it's because compact generates the freelist, so my issue seems to be that I can't get past the needed full sync at startup. It does around 200Mbps of read I/O forever and never completes. (In this case it's a 60GB database, not the biggest on earth I guess.) Also, I have read I/O spikes every ~30 minutes once I get it running after a compact, which hang my DB because they max out the I/O of the server; it's maxed out for like 5 minutes and then goes back to normal. Not sure if it's a related issue though. |
60GB / 200MBit/s = 60GB / 25MB/s = 40min

Please consider:
|
In this spreadsheet I tried to model the cost of a single key edit with NoFreelistSync=False: https://docs.google.com/spreadsheets/d/1O_wexgv1vM9GKZNCRjfQZ8U_bRmX_CLc-YSsBsVxDLg/edit#gid=0 Under the assumptions in the spreadsheet, the optimal page size seems to be 256KB -> with this setting we should be writing "only" 1.8MB per transaction (instead of 60MB per transaction for 4KB pages). An alternative recommendation is to do transaction batching in your application logic: merging multiple writes into a single transaction will significantly amortize the cost of writing the free pages (see the sketch below). |
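As a rough sketch of that batching idea (the bucket name is illustrative, and this assumes writes arrive concurrently from several goroutines): bbolt's DB.Batch coalesces calls that arrive around the same time into a single transaction, so the commit and freelist cost is paid once per batch instead of once per key.

package main

import (
	"go.etcd.io/bbolt"
)

// putBatched is called from each request handler instead of db.Update.
// Note: the function passed to Batch may be invoked more than once if a batch
// partially fails, so it must be safe to re-run.
func putBatched(db *bbolt.DB, key, val []byte) error {
	return db.Batch(func(tx *bbolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("queue"))
		if err != nil {
			return err
		}
		return b.Put(key, val)
	})
}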
I am going to try:

bboltOptions.FreelistType = bbolt.FreelistMapType
bboltOptions.NoFreelistSync = false
bboltOptions.PageSize = 4096*1024

Also, please know I batch as much as I can in my application; sadly most of the requests can't be batched. Thanks! Edit: should I stop using MAP_POPULATE? |
MAP_POPULATE causes a best-effort load of the bbolt data into RAM. (I also think that PageSize = 256*1024 might be closer to the optimum.) |
Ok I'm gonna remove it then because we only have 32GB of RAM and the DB will likely grow 100s of GBs, maybe up to a terabyte. I'm changing to 256*1024 too. Thanks! |
You are the leader in terms of bbolt scale (that I have heard of so far). Please keep us posted on the results of your experiments. |
So far so good since I restarted with the configuration you advised me. It's running well. :) Makes me wonder what would happen if I were to delete a big bucket (then we are back to our original issue from this thread..). |
Talking about "scale" issues: we had to implement our own stats system to keep track of bucket sizes, because from what I understand the Stats() method reads the entire bucket to return the number of keys, so it was way too slow for our usage. We think it could be way faster to keep track of the number of keys in real time, with an atomic int variable, as we add/remove them from the bucket. That's potentially something I'll open an issue for, and maybe try to make a PR, if that's something you all would like to see implemented in bbolt. |
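A rough sketch of that application-side idea (not an existing bbolt feature; the bucket name is made up, and rebuilding or persisting the counter across restarts is omitted):

package main

import (
	"sync/atomic"

	"go.etcd.io/bbolt"
)

// countedPut writes a key and bumps an application-side counter, so the key
// count can be read without scanning the whole bucket via Bucket.Stats().
func countedPut(db *bbolt.DB, count *atomic.Int64, key, val []byte) error {
	var added bool
	err := db.Update(func(tx *bbolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("queue"))
		if err != nil {
			return err
		}
		added = b.Get(key) == nil // only count keys that don't already exist
		return b.Put(key, val)
	})
	if err == nil && added {
		count.Add(1) // the commit succeeded, so the key count really grew
	}
	return err
}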
What's the issue? Could you be a little more specific on this? I may do some experiments myself later when I get free cycles.
I am thinking this may not be the correct behavior. The compact should follow the original db: if the original db has a synced freelist, it should persist the freelist to the db as well, otherwise not. Probably we should fix this. See also #290 (comment) |
So if I don't do a compact first (again, that's with NoFreelistSync = true), then the startup never completes; it does I/O forever. When NoFreelistSync is true it needs to generate the freelist, so I think that's what it is doing, but it never completes. If I do a compact first, considering that compact generates the freelist, then it starts without "warmup time", of course. |
The main reason for the big pages is to make the free-page list smaller. |
Okay! I can't try now but I'll try again deleting a large bucket tomorrow. |
Okay, a little update on our usage of bbolt with the following settings:

bboltOptions.FreelistType = bbolt.FreelistMapType
bboltOptions.NoFreelistSync = false
bboltOptions.PageSize = 256*1024

We noticed a HUGE increase in request latency after a few days of it working fine; requests were taking up to 1 minute or more. Restarting the program or even compacting didn't fix it. Then I switched to I don't really know what to do now 🤷♂️ PS/edit: the DB is getting bigger over time; it was around 50GB when it was performing well at the beginning, now it's around 150GB. |
are you able to gather a pprof from your running system? The 25s/request is definitely strange. |
I can try that indeed, I am not very very smart about pprof sadly, gimme some time.. Thank you |
@tjungblu I assume that's what you want? |
Just to double confirm: the "a minute" is the total time your HTTP server processed the request, not the time boltDB took to commit the transaction? Could you log how long bbolt takes to process each call? You can also print tx.Stats. Is it possible to share your boltDB files (e.g. 50G, 60G, 150G, etc.) so that I can test it locally? |
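A minimal sketch of that kind of instrumentation (the bucket name and log format are made up; the TxStats field names below are those of bbolt v1.3.x): time the whole db.Update call, including the commit, and log the transaction-stats delta.

package main

import (
	"log"
	"time"

	"go.etcd.io/bbolt"
)

// timedUpdate wraps a single put in db.Update and logs how long the whole
// call took, plus how much of it was spent writing pages.
func timedUpdate(db *bbolt.DB, key, val []byte) error {
	before := db.Stats()
	start := time.Now()

	err := db.Update(func(tx *bbolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("queue"))
		if err != nil {
			return err
		}
		return b.Put(key, val)
	})

	after := db.Stats()
	delta := after.TxStats.Sub(&before.TxStats)
	log.Printf("update took %s err=%v pageWrites=%d writeTime=%s",
		time.Since(start), err, delta.Write, delta.WriteTime)
	return err
}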
Thanks @ahrtr , yes it's the total time that I mentioned, but I just did some other tests right now and I can confirm >95% of that time is spent on bbolt. (I have some special code to return timing stats with my requests) Sure thing I can share with you a file if I figure out a way to:
Right now our response time is between 7 and 50 seconds; it fluctuates a lot. Edit: sorry, right now our DB file is 100GB, not 150. |
Is the EDIT: based on your last comment, 95% of the
It's fine. Thank you.
Thank you. I will keep it private. |
@ahrtr yes, although 95% is a guess. I have a little bit of code that records the time spent on an Update call, for example, and it returns 11000ms for a request that takes 12s total. It's not very scientific, and I can add more code to get stats using tx.Stats like you mentioned earlier, but I think 95% is a good guess. Too bad the bbolt CLI does not have functionality to delete a bucket! I'm writing a bit of code to do it now.. Then I'll upload it to Google Drive. |
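The kind of one-off helper being described might look roughly like this (a sketch, not the author's actual code; it simply opens the file and deletes one top-level bucket):

package main

import (
	"log"
	"os"

	"go.etcd.io/bbolt"
)

// Usage: deletebucket <db-path> <bucket-name>
func main() {
	if len(os.Args) != 3 {
		log.Fatalf("usage: %s <db-path> <bucket-name>", os.Args[0])
	}

	db, err := bbolt.Open(os.Args[1], 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// DeleteBucket removes the bucket and everything nested inside it in one
	// write transaction; the freed pages go onto the freelist.
	err = db.Update(func(tx *bbolt.Tx) error {
		return tx.DeleteBucket([]byte(os.Args[2]))
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("deleted bucket %q", os.Args[2])
}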
@ahrtr where/how can I send you the file once it's uploaded on Google Drive? |
Either email (wachao@vmware.com) or slack channel (in k8s workspace) is OK. |
@CorentinB close, I believe this looks like a memory profile? Can you get one for the CPU usage please? |
I generated some sample files (see What I got from the test:
In summary,
|
Thank you @ahrtr for preparing the statistics. They match the expectations:
@tjungblu
Profiling when such slow behavior is happening would be helpful to root-cause the problem. |
@CorentinB In comments above I read that you have 32 GB of memory on the server but the database size was 100 GB. If your workload pattern is random, your disk speed may be the bottleneck here. I would try splitting the database into smaller databases and serving them from separate machines, to be able to fit hot databases into RAM. |
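A very rough sketch of what such a split could look like (purely illustrative: the shard count, hash choice, and file names are assumptions, and serving shards from separate machines would need a network layer on top):

package main

import (
	"fmt"
	"hash/fnv"

	"go.etcd.io/bbolt"
)

// shardedDB spreads keys over several smaller bbolt files so each shard (and
// its hot pages) has a better chance of fitting in RAM.
type shardedDB struct {
	shards []*bbolt.DB
}

func openSharded(n int) (*shardedDB, error) {
	s := &shardedDB{}
	for i := 0; i < n; i++ {
		db, err := bbolt.Open(fmt.Sprintf("shard-%d.db", i), 0600, nil)
		if err != nil {
			return nil, err
		}
		s.shards = append(s.shards, db)
	}
	return s, nil
}

// pick hashes the key to choose the shard that owns it.
func (s *shardedDB) pick(key []byte) *bbolt.DB {
	h := fnv.New32a()
	h.Write(key)
	return s.shards[int(h.Sum32()%uint32(len(s.shards)))]
}

func (s *shardedDB) put(bucket, key, val []byte) error {
	return s.pick(key).Update(func(tx *bbolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucket)
		if err != nil {
			return err
		}
		return b.Put(key, val)
	})
}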
I am curious about the following comment you wrote:
I wonder under what circumstances you recommend we do the above (append to file first, then write to boltDB)? |
There are two prerequisites to follow this async solution:
|
Hi,
I work at the Internet Archive, on the Wayback Machine team. We started using bbolt in a new piece of software that we are developing internally. We think that we discovered a possible bug.
We have a database with many large buckets, and at some point we deleted a bucket that contained a few hundred million keys. Following that, bbolt started writing indefinitely for days (it was not just a short I/O spike; we are talking about a week of writing at 300MB/s). We have no idea what the "write" was, what it was writing, or to where, but it was doing write I/O.
Restarting the software that uses bbolt didn't fix it. I had to run a
bbolt compact
on the database (which reduced it from 211GB to 29GB), and after that, when we restarted the software, it didn't continue to max out write I/O. I then tried to reproduce it by deleting another big bucket (44M keys); it caused the same issue. Then I stopped our software, ran
bbolt compact
again and re-started it, and the issue disappeared.