Deleting a large bucket cause infinite write #401

Closed
CorentinB opened this issue Feb 8, 2023 · 46 comments

@CorentinB

Hi,

I work at the Internet Archive on the Wayback Machine team. We started using bbolt in a new piece of software that we are developing internally, and we think we have discovered a possible bug.

We have a database with many large buckets, and at some point we deleted a bucket that contained a few hundred million keys. After that, bbolt kept writing continuously for days (it was not just a short I/O spike; we are talking about roughly a week of writing at 300 MB/s). We have no idea what it was writing or where, but it was generating write I/O.

Restarting the software that uses bbolt didn't fix it. I had to run bbolt compact on the database (which reduced it from 211GB to 29GB), and after we restarted the software it no longer maxed out write I/O.

I then tried to reproduce it by deleting another big bucket (44M keys), and it caused the same issue. Then I stopped our software, ran bbolt compact again, restarted it, and the issue disappeared.

@CorentinB
Author

CorentinB commented Feb 8, 2023

To be precise, I just discovered that the write I/O happens when requests are made, not when the DB is idle. So basically a request that would normally take a few KB/s of I/O suddenly takes ALL of the available I/O bandwidth.

I also replicated the bug on different hardware.

I'm using v1.3.7.

@ahrtr
Member

ahrtr commented Feb 8, 2023

the write I/O happens when requests are made, not when the DB is idle

Do you mean you were trying to delete a bucket, and trying to update K/V data in other buckets concurrently?

@CorentinB
Author

the write I/O happens when requests are made, not when the DB is idle

Do you mean you were trying to delete a bucket, and trying to update K/V data in other buckets concurrently?

Thanks! No, it's well after the bucket is deleted. The bucket was deleted properly, no issue on that side. But afterwards any request generates maximum I/O for no apparent reason.

@ahrtr
Member

ahrtr commented Feb 8, 2023

But afterwards any request generates maximum I/O for no apparent reason.

Is the request successful? How long does the request take?

How do you open the db? Please provide the options you pass when opening it. Please also run the following two commands,

$ bbolt page path-to-db 0
$ bbolt page path-to-db 1

@CorentinB
Author

The request is successful but takes multiple seconds instead of 10ms.

These are the two lines in my code related to DB opening:

	Database, err = bbolt.Open("database.db", 0600, nil)
	if err != nil {
		return err
	}

	Database.MmapFlags = syscall.MAP_POPULATE
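
For reference, these settings can also be passed via bbolt.Options at Open time; this is just a sketch, not our exact production code:

package db

import (
	"syscall"

	bbolt "go.etcd.io/bbolt"
)

func openDB(path string) (*bbolt.DB, error) {
	opts := *bbolt.DefaultOptions // copy the defaults instead of mutating the shared pointer
	// MAP_POPULATE asks the kernel to pre-fault the mmap'ed file into RAM.
	opts.MmapFlags = syscall.MAP_POPULATE
	return bbolt.Open(path, 0600, &opts)
}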

I reproduced the bug a third time and ran your two commands on the DB before running compact:

$ bbolt page path-to-db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=3033>
Freelist:   <pgid=290340>
HWM:        <pgid=812499>
Txn ID:     63010
Checksum:   2a2490d632f21719
$ bbolt page path-to-db 1
Page ID:    1
Page Type:  meta
Total Size: 4096 bytes
Overflow pages: 0
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=4343>
Freelist:   <pgid=289317>
HWM:        <pgid=812499>
Txn ID:     63011
Checksum:   ec36b443ecc856be

@tjungblu
Contributor

tjungblu commented Feb 9, 2023

Do you maybe have a reproducer at hand that can replicate this? I'd love to hook this into a debugger and see what's going on.

Or is simply writing 44M keys and deleting the bucket enough to trigger this?

@CorentinB
Author

Or is simply writing 44M keys and deleting the bucket enough to trigger this?

Sadly I don't have anything I can share publicly to reproduce it, but I'd guess that yes, doing that should show the same thing. If not, then there is more digging to do.

@tjungblu
Contributor

tjungblu commented Feb 9, 2023

Sorry for the wait, it took some time to add a couple of buckets with 50M 32-byte keys in them.

I created a super naive test where you put something first, delete the bucket, and then put something into another bucket. The test indeed shows a 10x-20x increase in the time it takes:

PUT took 17.335668ms
<delete>
PUT took 166.673496ms

The delete takes some time, but I don't really see a large chunk of continuous I/O on the second put - even when running with fsync. @CorentinB, can you maybe share a little about how you're using bbolt? What kind of "requests" do you process, and how do they translate to the respective functions?

That the compaction makes the process faster is not super surprising, for the same reason we suggest people regularly defrag etcd. Fragmentation does take a toll; I'm not sure whether that's the cause of the latency increase here.


code for reference here:
tjungblu@2effd9c#diff-2c97e5ffc11514491315dcfdfe66968d3c22aae95764f385d23f717717783131R13-R48
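
A self-contained sketch of that kind of test, in case anyone wants to poke at it locally (bucket names and the key count are made up; the commit linked above is the real code):

package main

import (
	"encoding/binary"
	"fmt"
	"time"

	bbolt "go.etcd.io/bbolt"
)

const n = 1_000_000 // the report above used tens of millions of keys

func timedPut(db *bbolt.DB) {
	start := time.Now()
	err := db.Update(func(tx *bbolt.Tx) error {
		return tx.Bucket([]byte("other")).Put([]byte("k"), []byte("v"))
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("PUT took %v\n", time.Since(start))
}

func main() {
	db, err := bbolt.Open("repro.db", 0600, nil)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Fill the "big" bucket and create the "other" bucket used for timing.
	err = db.Update(func(tx *bbolt.Tx) error {
		big, err := tx.CreateBucketIfNotExists([]byte("big"))
		if err != nil {
			return err
		}
		if _, err := tx.CreateBucketIfNotExists([]byte("other")); err != nil {
			return err
		}
		for i := uint64(0); i < n; i++ {
			k := make([]byte, 8) // fresh slice: bbolt keeps references until commit
			binary.BigEndian.PutUint64(k, i)
			if err := big.Put(k, k); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		panic(err)
	}

	timedPut(db) // baseline latency

	// Delete the big bucket in its own transaction.
	if err := db.Update(func(tx *bbolt.Tx) error {
		return tx.DeleteBucket([]byte("big"))
	}); err != nil {
		panic(err)
	}

	// The PUTs after the delete are where the latency/IO increase shows up.
	for i := 0; i < 3; i++ {
		timedPut(db)
	}
}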

@CorentinB
Author

Thanks a lot for your test.

I'd say a 10x-20x increase is already quite an issue, but that would be OK to live with if I weren't seeing much, much more than that.

Our software uses bbolt as a back-end DB to handle projects, and buckets act as "queues". There is one bucket per "project", and inside it we have 4 nested buckets: 2 queues that we add to and remove from, a third bucket used to store values long term (it actually stores a hash of every value that has been put into the queue; we read and add but never delete from it), and a last bucket for statistics-related data.

I have seen that bug when deleting a single bucket inside a "project" bucket, but also when deleting an entire project (meaning the project bucket and the 4 nested buckets inside it).

Can you try running many PUT operations continuously after the delete, to see if you observe the same thing as me (in my case, hundreds of MB/s of I/O, as opposed to a few MB/s at most)?

@tjungblu
Contributor

tjungblu commented Feb 9, 2023

Thanks for the explanation. I was also curious about the many PUTs: the latency does recover somewhat after the first delete, but it's definitely slower than before:

PUT took 19.819753ms
<delete>
PUT took 165.796265ms
PUT took 29.281574ms
PUT took 33.015651ms

You were definitely right about the bandwidth: on a bbolt DB with a deleted bucket it writes at 500 MB/s with the above bad latency, while on the original DB the same put at <10ms only writes about 100 MB/s.

@CorentinB
Author

I'm glad you see more or less the same thing, although in my case it's much worse than that: it does not recover after the first request, and it persists across restarts of the software.

In our case, when running our software in production the flow of requests never stops, so we have constant I/O at 100% of what our SSD can do, which slows everything down.

[image: disk I/O graph]

You can see the spike in I/O that starts just after deleting a large bucket and lasts until we stopped the software and ran a compact on the DB before restarting it.

@tjungblu
Contributor

tjungblu commented Feb 9, 2023

I sadly have to leave soon for the remainder of the week - maybe Ben can pick it up from here if he has the time.

I took some pprofs of both the cases in:
https://github.com/tjungblu/bbolt/tree/issue_401/pprofs

you can load them in with the usual

$ go tool pprof -http localhost:8080 vanilla_pprof

From what I see it's actually runtime.memmove, which would speak for the fragmentation case. I'm not sure how this would impact the disk I/O - I assume it's through the commitFreelist path, as the freelist is written along with every commit.

Do you maybe want to try disabling the freelist commit? There seems to be a flag for that with NoFreelistSync:

db.NoFreelistSync = true

@CorentinB
Author

I'll try db.NoFreelistSync = true and report here :) Thanks a lot @tjungblu !

@CorentinB
Author

CorentinB commented Feb 9, 2023

OK, so I added db.NoFreelistSync = true and fed 2 projects, one with 50M seeds and another with 20M seeds.

I then deleted the 50M bucket and started my benchmark on the 20M bucket; requests take around 20ms on average. So that seems like normal speed.

Then I removed db.NoFreelistSync = true, rebuilt my software, and ran the same benchmark. Requests take 100ms on average.

I rebuilt the software with and without NoFreelistSync multiple times, and it does indeed avoid the issue. (I was going to write "fix" but it's not fixed :))

I don't know much about what NoFreelistSync implies, so I'm not sure I can use it in production though.

Edit: could the bbolt issue mentioned in this HN thread be related? https://news.ycombinator.com/item?id=30015703 "It only became an issue when someone wrote a ton of data and then deleted it and never used it again. Roblox reported having 4GB of free pages which translated into a giant array of 4-byte page numbers."

@ahrtr
Member

ahrtr commented Feb 10, 2023

what NoFreelistSync implies, so I'm not sure I can use it in production though

It's a tradeoff. If the freelist isn't synced to disk, it may take a while to open the db, see also #392 (comment). But the benefit is better performance when committing each transaction, because bbolt doesn't need to sync the freelist.

I'm thinking we should probably set NoFreelistSync to true by default. Usually users open the db once and then do as many transactions as they want, so rebuilding the freelist on startup is a one-time operation and taking a little longer might be acceptable. But taking a long time for each commit is obviously more expensive.

any comments? @ptabor @tjungblu

@ahrtr
Member

ahrtr commented Feb 10, 2023

The only useful use case for setting NoFreelistSync to false is workloads with lots of read-only transactions, because it gives better startup performance and there is no freelist-syncing cost anyway.

Should we eventually deprecate the option NoFreelistSync? cc @benbjohnson for opinion as well

@ptabor
Contributor

ptabor commented Feb 11, 2023

That's interesting.

  1. Out of curiosity - you might try to log the size of the free-pages list that gets written with each transaction:

bbolt/tx.go

Line 238 in 505fc0f

p, err := tx.allocate((tx.db.freelist.size() / tx.db.pageSize) + 1)

The list is not differential, so with each transaction it is written in its entirety. (A rough way to observe its size from the application, without patching bbolt, is sketched right after this list.)


  2. NoFreeListSync seems like a good fix. Another approach for a speedup is to use significantly bigger pages. In that case there would be:
    a) fewer nodes in general
    b) better prefetching on reads and writes
    c) fewer pages in the freelist (so not only less cost of flushing it to disk (which NoFreelistSync turns off), but also less cost of scanning the in-memory list to find an empty page)
    d) please also use the map-based freelist (instead of the list-based one)
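
For point 1, a rough approximation can also be logged from the application side without patching bbolt, using the exported stats (FreePageN / PendingPageN are page counts, FreelistInuse is the number of bytes used by the freelist); a sketch:

package db

import (
	"log"

	bbolt "go.etcd.io/bbolt"
)

// logFreelist prints an approximation of the freelist that gets rewritten
// on every commit when freelist syncing is enabled.
func logFreelist(db *bbolt.DB, label string) {
	s := db.Stats()
	log.Printf("%s: free pages=%d pending pages=%d freelist bytes=%d",
		label, s.FreePageN, s.PendingPageN, s.FreelistInuse)
}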

I think that it's fair in 1.4 to change the defaults to:
a) no-sync for free pages
b) Map-based freepages implementation:

bbolt/db.go

Lines 1263 to 1266 in 505fc0f

var DefaultOptions = &Options{
	Timeout:      0,
	NoGrowSync:   false,
	FreelistType: FreelistArrayType,

@CorentinB
Author

CorentinB commented Feb 11, 2023

Okay thanks for all of your comments. I now use:

bboltOptions := bbolt.DefaultOptions
bboltOptions.FreelistType = bbolt.FreelistMapType
bboltOptions.NoFreelistSync = true
bboltOptions.MmapFlags = syscall.MAP_POPULATE

But there is an issue. I simply can't start my program without doing a compact first. I think it's because compact generates the freelist, so my issue seems to be that I can't get past the full freelist rebuild needed at startup. It does around 200 Mbps of read I/O forever and never completes. (In this case it's a 60GB database, not the biggest on earth I guess.)

Also, I see read I/O spikes every ~30 minutes once I get it running after a compact, which hangs my DB because they max out the server's I/O; it stays maxed out for about 5 minutes and then goes back to normal. Not sure if that's a related issue though.

@ptabor
Contributor

ptabor commented Feb 11, 2023

60GB / 200MBit/s = 60GB / 25MB/s = 40min
In etcd we don't go above 8GB, and we do a full scan on startup either way to compute the hash.

Please consider:

bboltOptions.FreelistType = bbolt.FreelistMapType
bboltOptions.NoFreelistSync = false
bboltOptions.PageSize = 4096*1024

@ptabor
Contributor

ptabor commented Feb 11, 2023

In this spreadsheet I tried to model the cost of a single key edit with NoFreelistSync=false:

https://docs.google.com/spreadsheets/d/1O_wexgv1vM9GKZNCRjfQZ8U_bRmX_CLc-YSsBsVxDLg/edit#gid=0

With the following assumptions:
dbSize=60GB
avgKVsize=1000 bytes
avgKeySize=16 bytes

the optimal page size seems to be 256KB -> with this setting we should be writing "only" 1.8MB per transaction (instead of 60MB per transaction for 4KB pages).


An alternative recommendation is to do transaction batching in your application logic. Merging multiple writes into a single transaction will significantly amortize the cost of writing out the free pages.
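
A sketch of what that batching could look like on the application side (names are made up):

package store

import bbolt "go.etcd.io/bbolt"

// putBatch amortizes the per-commit freelist write by grouping many puts
// into one transaction instead of one transaction per key.
func putBatch(db *bbolt.DB, bucket []byte, kvs map[string][]byte) error {
	return db.Update(func(tx *bbolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucket)
		if err != nil {
			return err
		}
		for k, v := range kvs {
			if err := b.Put([]byte(k), v); err != nil {
				return err
			}
		}
		return nil
	})
}

For concurrent writers, db.Batch(fn) can coalesce calls from multiple goroutines into fewer commits, with the caveat that fn may be retried and therefore should be idempotent.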

@CorentinB
Author

CorentinB commented Feb 12, 2023

60GB / 200MBit/s = 60GB / 25MB/s = 40min
Yeah, that's what I assumed too; it turns out it never finishes. I've let it run for hours and hours, as if something was looping over and over.

I am going to try:

bboltOptions.FreelistType = bbolt.FreelistMapType
bboltOptions.NoFreelistSync = false
bboltOptions.PageSize = 4096*1024

Also, please know that I batch as much as I can in my application; sadly most of the requests can't be batched.

Thanks!

Edit: should I stop using bboltOptions.MmapFlags = syscall.MAP_POPULATE ?

@ptabor
Contributor

ptabor commented Feb 12, 2023

MAP_POPULATE causes a best-effort load of the bbolt data into RAM.
If you have on the order of 60GB of RAM and want to serve reads quickly, it might be worth it.

(I also think that PageSize = 256*1024 might be closer to the optimum. )

@CorentinB
Author

OK, I'm going to remove it then, because we only have 32GB of RAM and the DB will likely grow to hundreds of GBs, maybe up to a terabyte.

I'm changing to 256*1024 too. Thanks!

@ptabor
Contributor

ptabor commented Feb 12, 2023

You are the leader in terms of bbolt scale (that I've heard of so far). Please keep us posted on the results of your experiments.

@CorentinB
Author

So far so good since I restarted with the configuration you advised me. It's running well. :)

Makes me wonder what would happen if I were to delete a big bucket (then we are back to our original issue from this thread..).

@CorentinB
Author

Talking about "scale" issues: we had to implement our own stats system to keep track of bucket sizes, because from what I understand the Stats() method reads the entire bucket to return the number of keys. That was way too slow for our usage.

We think it could be way faster to keep track of the number of keys in real time, as we add/remove them from the bucket, with an atomic int variable. That's potentially something I'll open an issue for, and maybe try to make a PR, if it's something you all would like to see implemented in bbolt.
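
Roughly this idea (a sketch with made-up names; the counter would still need to be persisted or re-seeded across restarts, and it assumes each enqueue uses a new key):

package store

import (
	"sync/atomic"

	bbolt "go.etcd.io/bbolt"
)

// queueKeys tracks the number of keys in the "queue" bucket without walking
// the whole bucket the way Bucket.Stats() does.
var queueKeys atomic.Int64

func enqueue(db *bbolt.DB, key, value []byte) error {
	err := db.Update(func(tx *bbolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("queue"))
		if err != nil {
			return err
		}
		return b.Put(key, value)
	})
	if err == nil {
		queueKeys.Add(1) // only count keys that were actually committed
	}
	return err
}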

@ahrtr
Member

ahrtr commented Feb 13, 2023

But there is an issue. I simply can't start my program without doing a compact first.

What's the issue? Could you be a little more specific on this? I may do some experiments myself later when I get free cycles.

I think it's because compact generates the freelist

I'm thinking this may not be the correct behavior. The compact should follow the original db: if the original db has a synced freelist, it should persist the freelist to the new db as well; otherwise it shouldn't. Probably we should fix this. See also #290 (comment)

@CorentinB
Author

What's the issue? Could you be a little more specific on this? I may do some experiments myself later when I get free cycles.

So if I don't do a compact first (again, that's with NoFreelistSync = true), the startup never completes; it does I/O forever. When NoFreelistSync is true it needs to rebuild the freelist, so I think that's what it's doing, but it never completes. If I do a compact first, since compact generates the freelist, it starts without any "warmup time" of course.

@ptabor
Contributor

ptabor commented Feb 13, 2023

Makes me wonder what would happen if I were to delete a big bucket (then we are back to our original issue from this thread..).

The main reason for the big pages is to make the freelist smaller.
If your pages are 256KB... there are 64x fewer pages... so 64x less I/O on transaction commit to sync the freelist.
Maybe the 64x smaller usage will be good enough for your use case.

@CorentinB
Author

Okay! I can't try it now, but I'll try deleting a large bucket again tomorrow.

@CorentinB
Author

CorentinB commented Feb 22, 2023

Okay a little update on our usage of bbolt with the following settings:

bboltOptions.FreelistType = bbolt.FreelistMapType
bboltOptions.NoFreelistSync = false
bboltOptions.PageSize = 256*1024

We noticed a HUGE increase in request latency after a few days of it working fine; requests were taking up to 1 minute or more. Restarting the program or even compacting didn't fix it.

Then I switched to bboltOptions.NoFreelistSync = true (same DB file) and it worked perfectly for 2 days, performing extremely well. Then request latency started rising again, reaching almost a minute. I restarted the program (which took some time due to NoFreelistSync = true, of course), and latency dropped, but it's still way too high (25 seconds/req) as I write this. Very, very far from the 20-50ms (at worst 1500ms on some requests) we saw during the 2 days it performed extremely well with the same configuration and the same DB.

I don't really know what to do now 🤷‍♂️

PS/edit: the DB is getting bigger over time; it was around 50GB when it was performing well at the beginning, now it's around 150GB.

@tjungblu
Contributor

Are you able to gather a pprof from your running system? The 25s/request is definitely strange.

@CorentinB
Author

I can try that, indeed. I'm not very familiar with pprof sadly, give me some time. Thank you.

@CorentinB
Author

[attachment: heap profile]

@tjungblu I assume that's what you want?

@ahrtr
Member

ahrtr commented Feb 22, 2023

Then request latency started rising again, reaching almost a minute.

Just to double-confirm: is the "a minute" the total time your HTTP server spent processing the request, rather than the time boltDB took to commit the transaction? Could you log how long bbolt takes to process each call? You can also print tx.Stats.
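
Something along these lines (a rough sketch) would separate the bbolt time from the rest of the request handling:

package db

import (
	"log"
	"time"

	bbolt "go.etcd.io/bbolt"
)

// timedUpdate wraps db.Update, logging how long bbolt itself took plus the
// DB-level transaction stats accumulated by that call (assuming no other
// write transaction commits in between).
func timedUpdate(db *bbolt.DB, fn func(*bbolt.Tx) error) error {
	before := db.Stats()
	start := time.Now()
	err := db.Update(fn)
	after := db.Stats()
	diff := after.Sub(&before)
	log.Printf("bbolt Update took %v, tx stats diff: %+v", time.Since(start), diff.TxStats)
	return err
}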

Is it possible to share your boltDB files (e.g. 50GB, 60GB, 150GB, etc.) so that I can test locally?

@CorentinB
Author

CorentinB commented Feb 22, 2023

Thanks @ahrtr, yes it's the total time that I mentioned, but I just ran some more tests and I can confirm >95% of that time is spent in bbolt. (I have some special code to return timing stats with my requests.)

Sure thing I can share with you a file if I figure out a way to:

  1. Delete a bucket (the one with secrets)
  2. Send it to you privately (we don't want this data to be public)

Right now our response time is between 7 and 50 seconds; it fluctuates a lot.

Edit: sorry right now our DB file is 100GB not 150.

@ahrtr
Member

ahrtr commented Feb 22, 2023

Right now our response time is between 7 and 50 seconds; it fluctuates a lot.

Is the 7 to 50 seconds dedicated to bbolt?

EDIT: based on your last comment, 95% of the 7 to 50 seconds is spent on bbolt, correct?

Edit: sorry right now our DB file is 100GB not 150.

It's fine. Thank you.

2. Send it to you privately (we don't want this data to be public)

Thank you. I will keep it private.

@CorentinB
Author

CorentinB commented Feb 22, 2023

@ahrtr yes, although 95% is a guess. I have a little bit of code that records the time spent on an Update call, for example, and it returns 11000ms for a request that takes 12s total. It's not exactly scientific, and I can add more code to get stats using tx.Stats like you mentioned earlier, but I think 95% is a good guess.

It's a shame the bbolt CLI doesn't have a function to delete a bucket! I'm writing a bit of code to do it now. Then I'll upload the file to Google Drive.
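
For anyone else who needs it, the Go side is only a few lines (a sketch; the db path and bucket name are placeholders passed on the command line):

package main

import (
	"log"
	"os"

	bbolt "go.etcd.io/bbolt"
)

// Usage: go run main.go <path-to-db> <bucket-name>
func main() {
	db, err := bbolt.Open(os.Args[1], 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Delete the top-level bucket in a single write transaction.
	if err := db.Update(func(tx *bbolt.Tx) error {
		return tx.DeleteBucket([]byte(os.Args[2]))
	}); err != nil {
		log.Fatal(err)
	}
}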

@CorentinB
Author

@ahrtr where/how can I send you the file once it's uploaded on Google Drive?

@ahrtr
Member

ahrtr commented Feb 22, 2023

@ahrtr where/how can I send you the file once it's uploaded on Google Drive?

Either email (wachao@vmware.com) or slack channel (in k8s workspace) is OK.

@tjungblu
Contributor

I assume that's what you want?

@CorentinB Close, I believe that looks like a memory profile? Can you get one for CPU usage please?

@ahrtr
Member

ahrtr commented Mar 2, 2023

I generated some sample files (see Test_Generate_Large_DB_256K_10M_NoFreelistSync in db_benchmark_test.go), and did some benchmark tests (see benchmark) on my MacBook. I should do the test on a Linux server, but unfortunately I don't have a proper Linux server for now.

What I got from the test:

  1. Setting a bigger pageSize (e.g. 256KB) can greatly improve write performance. See the test results for Test_Large_Delete_Bucket_DB below:
| page size | DB size (GB) | Num of pages | Total time | Open time | Delete time | Commit time |
| --- | --- | --- | --- | --- | --- | --- |
| 4096 | 83.26 | 20322580 | 1475.980s | 19m22.053s | 5m11.42s | 45.056ms |
| 8192 | 110.11 | 13439144 | 983.476s | 12m57.147s | 3m25.018s | 43.244ms |
| 16384 | 94.01 | 5736784 | 1073.067s | 15m46.183s | 2m5.483s | 60.712ms |
| 32768 | 87.57 | 2671896 | 1358.074s | 20m41.051s | 1m55.699s | 40.186ms |
| 65536 | 84.66 | 1291588 | 921.879s | 14m25.124s | 55.398s | 40.6ms |
| 131072 | 83.28 | 635236 | 672.237s | 10m43.786s | 26.903s | 44.72ms |
| 262144 | 83.26 | 317544 | 680.545s | 11m5.849s | 13.309s | 45.546ms |
| 524288 | 82.59 | 157504 | 601.192s | 9m52.866s | 6.894s | 46.158ms |
  2. If NoFreelistSync is true, it may take a long time to open a large db file. In the example above, it takes about 11 minutes to load an 83GB file (with pageSize 256K). That latency seems unacceptable for most applications, so usually it's suggested to set NoFreelistSync to false.

  3. Most of the time was consumed by fastCheck. When I removed it, most of the time was consumed by isLeaf. If I understand it correctly, they trigger the loading of data from disk, and it's a one-time operation; subsequent operations just read the data from memory.

[pprof graph: profile002]

In summary,

  1. Please set a larger pageSize (e.g. 256K). I didn't get time to test even bigger page sizes, e.g. up to 4M.
  2. Set NoFreelistSync to false if the long latency when opening the db file isn't acceptable.
  3. Periodically compacting the db file should improve performance, but I haven't tested it, so I have no data for that. Please share data if anyone has it.
  4. If eventual consistency is acceptable, I would suggest following a pattern similar to what etcd does: write each request to an append-only file first, then read & process (e.g. write to boltDB) the data asynchronously in a separate goroutine or process.

@ptabor
Contributor

ptabor commented Mar 2, 2023

Thank you @ahrtr for preparing the statistics. They match the expectations.
I think the biggest unknown is still the root cause (RCA) of this behavior:

We noticed a HUGE increase in requests latency after few days of it working fine, requests were taking up to 1 minute or more. Restarting the program or even compacting didn't fix it.

@tjungblu
Are you sure it's not caused by environmental problems, like the OS starting to swap due to running out of memory?

  • pprof CPU profile
  • memory usage (% of total available memory), free OS memory
  • major page faults statistics
  • disk IO data

Collecting these while the slow behavior is happening would help root-cause the problem.

@cenkalti
Member

cenkalti commented Mar 3, 2023

@CorentinB In the comments above I read that you have 32 GB of memory on the server but the database size was 100 GB. If your workload pattern is random, your disk speed may be the bottleneck here. I would try splitting the database into smaller databases and serving them from separate machines, so that the hot databases fit into RAM.

@fumin

fumin commented Oct 17, 2023

@ahrtr

I am curious about the following comment you wrote:

If eventual consistency is acceptable, I would suggest following a pattern similar to what etcd does: write each request to an append-only file first, then read & process (e.g. write to boltDB) the data asynchronously in a separate goroutine or process.

I wonder under what circumstances you recommend doing the above (append to a file first, then write to boltDB)?
Is it whenever we have write-heavy loads?
What are the symptoms if we don't follow this recommendation for write-heavy loads - will we see the same behaviour as in this issue?

@ahrtr
Member

ahrtr commented Oct 18, 2023

I wonder under what circumstances you recommend doing the above (append to a file first, then write to boltDB)?
Is it whenever we have write-heavy loads?
What are the symptoms if we don't follow this recommendation for write-heavy loads - will we see the same behaviour as in this issue?

There are two prerequisites for this async solution:

  1. Eventual consistency is acceptable. Once the request data is persisted in the append-only file (WAL: write-ahead log), the data will not be lost. Other users/clients might get stale data right after a successful write, but they will eventually get the new data once the entries in the WAL files are applied to bboltDB.
  2. User experience matters; in other words, you expect clients to get fast responses.
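
A very rough sketch of the shape of that pattern (hypothetical; a real WAL such as etcd's adds checksums, segment files, and replay of unapplied records on startup):

package store

import (
	"encoding/binary"
	"os"

	bbolt "go.etcd.io/bbolt"
)

// asyncStore acknowledges writes once they are appended and fsynced to a log
// file, then applies them to bbolt in the background.
type asyncStore struct {
	wal *os.File
	ch  chan [2][]byte // records waiting to be applied to bbolt
	db  *bbolt.DB
}

// encodeRecord frames a key/value pair with length prefixes so the log could
// be replayed on startup (replay itself is omitted from this sketch).
func encodeRecord(key, value []byte) []byte {
	rec := make([]byte, 8+len(key)+len(value))
	binary.BigEndian.PutUint32(rec[0:4], uint32(len(key)))
	binary.BigEndian.PutUint32(rec[4:8], uint32(len(value)))
	copy(rec[8:], key)
	copy(rec[8+len(key):], value)
	return rec
}

func (s *asyncStore) Put(key, value []byte) error {
	// 1. Persist to the append-only log; this is the durability point.
	if _, err := s.wal.Write(encodeRecord(key, value)); err != nil {
		return err
	}
	if err := s.wal.Sync(); err != nil {
		return err
	}
	// 2. Hand off to the background applier; readers may briefly see stale data.
	s.ch <- [2][]byte{key, value}
	return nil
}

func (s *asyncStore) applyLoop() {
	for kv := range s.ch {
		// A real implementation would drain the channel and batch many records
		// into a single bbolt transaction here.
		_ = s.db.Update(func(tx *bbolt.Tx) error {
			b, err := tx.CreateBucketIfNotExists([]byte("data"))
			if err != nil {
				return err
			}
			return b.Put(kv[0], kv[1])
		})
	}
}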

@github-actions github-actions bot added the stale label Apr 17, 2024
@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) May 9, 2024