
Running out of disk space #1258

Closed
alsakhaev opened this issue Feb 15, 2021 · 57 comments
Labels: bug (Something isn't working), issue

Comments

@alsakhaev

Summary

We are having a lot of trouble running our Bee node, which serves the Swarm downloader (our hackathon project).

  1. We are running a Bee node at https://swarm.dapplets.org
  2. The node takes all available space on the HDD and starts rejecting the files we upload. Waiting for the Swarm hash either fails immediately or takes far too long. We set db-capacity: 2621440 chunks (approx. 10 GB, see the rough math just below), which should leave 5 GB of free space, but the disk still gets fully consumed.
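For reference, the ~10 GB figure corresponds to roughly 4 KiB of chunk data per chunk, an assumption that ignores index and store overhead (which the rest of this thread shows can be substantial):

echo $(( 2621440 * 4096 / 1024 / 1024 / 1024 ))   # => 10 (GiB of raw chunk data)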

Steps to reproduce

  1. Created a VPS server at Hetzner with the following hardware (CX11, 1 vCPU, 2 GB RAM, 20 GB) running Ubuntu 20.04.2 LTS
  2. Installed Bee via wget https://github.com/ethersphere/bee/releases/download/v0.5.0/bee_0.5.0_amd64.deb and sudo dpkg -i bee_0.5.0_amd64.deb
  3. Configured it as in the config below
  4. Installed the nginx web server and configured a reverse proxy from https://swarm.dapplets.org to http://localhost:1633 with a Let's Encrypt SSL certificate
  5. Uploaded files to the node via POST https://swarm.dapplets.org/files/ (see the curl sketch below this list)
  6. After a while, disk space runs out
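For step 5, one way such an upload might look with curl. This is a sketch only: the /files endpoint and host come from the report above, while the file name and content type are placeholders.

curl -X POST \
  -H "Content-Type: application/octet-stream" \
  --data-binary @example.bin \
  "https://swarm.dapplets.org/files?name=example.bin"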

Expected behavior

I expect to see 5 GB of free space :)

Actual behavior

  1. Disk space runs out
  2. The log contains a lot of errors about it
  3. Files cannot be uploaded; the node responds with HTTP 500 Internal Server Error

Config /etc/bee/bee.yaml

Uncommented lines from config file:

api-addr: 127.0.0.1:1633
clef-signer-endpoint: /var/lib/bee-clef/clef.ipc
config: /etc/bee/bee.yaml
data-dir: /var/lib/bee
db-capacity: 2621440
gateway-mode: true
password-file: /var/lib/bee/password
swap-enable: true
swap-endpoint: https://rpc.slock.it/goerli
@alsakhaev added the bug (Something isn't working) label on Feb 15, 2021
@bee-runner bot added the issue label on Feb 15, 2021
@Eknir
Contributor

Eknir commented Feb 20, 2021

Thank you for reporting the bug! We will have a look into it shortly

@jpritikin

jpritikin commented Feb 21, 2021

Tangential suggestion: it should be pretty easy to estimate the maximum possible disk usage from db-capacity and compare that to the disk space actually available. If the two are grossly out of whack, bee should log a suggested new value for db-capacity that will not exceed the available disk space.
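A minimal sketch of that check as a one-off shell script, assuming ~4 KiB of chunk data per chunk (which understates the real per-chunk footprint) and reusing the db-capacity and data-dir from the config in this issue:

DB_CAPACITY=2621440                               # chunks, from /etc/bee/bee.yaml
EST_BYTES=$(( DB_CAPACITY * 4096 ))               # assumed ~4 KiB of chunk data per chunk
AVAIL_BYTES=$(( $(df --output=avail -B1 /var/lib/bee | tail -1) ))
if (( EST_BYTES > AVAIL_BYTES )); then
  echo "db-capacity looks too high; roughly $(( AVAIL_BYTES / 4096 )) chunks would fit"
fi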

@jpritikin

I can reproduce this problem. It looks like disk space accounting does not include uploaded files. When I restart bee, a ton of disk space is immediately freed up as db-capacity is re-applied.

@jpritikin

Hm, bee isn't releasing all of the disk space even after a restart,

root@salvia /o/bee# grep db-cap /etc/bee/bee.yaml
db-capacity: 5000000
root@salvia /o/bee# ls
keys/  localstore/  password  statestore/
root@salvia /o/bee# du -h -s .
111G	.

@acud
Member

acud commented Feb 25, 2021

  • Can you say if at any point your db capacity was set above 5mil?
  • Did you play around with the size?

Bee will not garbage collect your uploaded content before it is fully synced. You can track the progress of your uploads with the tags API.

Please give as much information as you can about what you did before this problem surfaced.
I'm trying to reproduce this, but so far no luck.
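For reference, checking sync progress via the tags API might look like this. A sketch only: the tag uid 42 is a placeholder for the uid returned when you upload, and the address is the api-addr from the config above.

curl -s http://localhost:1633/tags/42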

@jpritikin

jpritikin commented Feb 25, 2021

Can you say if at any point your db capacity was set above 5mil?

Yes, I tried 10mil. Once I realized that disk space management wasn't working, I reduced it back to 5mil.

Did you play around with the size?

On one node, I probably uploaded faster than the node could sync. For example, maybe I uploaded 30 GB of data to the node very quickly and then waited for it to sync.

I'm trying to reproduce this but so far no luck.

If you can provide some guidance about how not to trigger the issue, that would also help. I gather that I shouldn't mess with the db-capacity setting. Also, I should not upload too fast?

I was trying to find where the limits were, to help with testing, but I am content to play within expected user behavior too.

I'm curious to hear from @alsakhaev too

@significance
Member

@Eknir @acud

Message from bee-support:

mfw78: I've found that on the 3 containers I've run, none of them respects the db-capacity limit.

sig: are you uploading any data to them?

mfw78: No

@RealEpikur

+1: started a node on a Raspberry Pi with a 32 GB SD card; it ran out of disk space after 10 hours

@ronald72-gh

+1: I have set up Docker-based nodes, and all of their localstores have easily surpassed the db-capacity limit; they now use between 30 GB and 40 GB each

@mfw78
Collaborator

mfw78 commented Mar 16, 2021

+1: Running multiple bees in Kubernetes containers. Each bee exhausts its disk space allocation (doubling the db capacity has no effect besides chewing through more space and then exceeding that too).

@Eknir
Contributor

Eknir commented Mar 23, 2021

Thanks, all, for the comments and reports. We are releasing soon and have included several improvements that aim to address this issue. We would greatly appreciate it if you could try the new release out and report back here.

@mfw78
Collaborator

mfw78 commented Mar 29, 2021

I can confirm that with 0.5.3 the db-capacity seems to be better respected; the 6 nodes I'm running show the following disk usage: 28G / 21G / 28G / 28G / 29G / 27G

@Eknir
Contributor

Eknir commented Apr 6, 2021

This issue can be reliably reproduced on a Raspberry Pi.

@luowenw

luowenw commented Apr 8, 2021

(screenshot: disk usage, 2021-04-08 13:00)

I am running 0.5.3 with the default db-capacity. I can see that Bee is garbage collecting at the same time as it consumes more space. Once garbage collection falls behind and disk usage reaches 100%, nothing works anymore. The log keeps reporting No Space Left, and garbage collection also stops working.

@Eknir
Contributor

Eknir commented Apr 13, 2021

@zelig @acud are you working on this as part of the postage stamps work?
Shall I assign this issue to the current sprint?


@ethernian

The bug has a severe impact on the entire network, because people are simply purging their nodes' localstore, causing data loss. There is no way to release Bee without killing this bug.

@acud added and then removed the pinned label on Sep 9, 2021
@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@tmm360
Contributor

tmm360 commented Jan 22, 2022

Any news on this? The issue is still there. I'm using the default disk space configuration (BEE_CACHE_CAPACITY=1000000), which should be ~4 GB, but this is my disk usage graph.

(screenshot: disk usage graph, 2022-01-22)

I didn't perform any uploads on the node. This is a VERY important issue to fix.

@acud
Member

acud commented Feb 16, 2022

It should be resolved with the latest release. However, the problem is multi-tiered, so shipping a database migration to fix a problem that is already exacerbated on some nodes was not trivial. If you db nuke your node and allow it to resync, the problem should be resolved.
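For anyone following along, the resync procedure might look roughly like this. A sketch only, assuming a systemd install like the one in this issue and that your Bee version ships the db nuke subcommand; check bee db --help and back up your keys first.

sudo systemctl stop bee
bee db nuke --data-dir /var/lib/bee   # wipes the local chunk store so the node can resync
sudo systemctl start bee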

@ldeffenb
Collaborator

Any plans to publish guidance on this? In particular, how to detect if the issue exists within a node so that we don't just start nuking everything and dropping retrievability on chunks already stored in the swarm.

@ldeffenb
Collaborator

ldeffenb commented Apr 7, 2022

Ok, I'm out of ideas now. Sorry, but the node is going to store what it thinks it needs to store. Nuking your DB periodically will just chew up your bandwidth and strain your neighborhood to push it all back to you, not to mention risking actually dropping the only copy of some chunks that your node was supposed to store. Unless there's still a lurking issue with that somewhere that hasn't been uncovered and isn't visible with the metrics we currently have available.

@ldeffenb
Collaborator

ldeffenb commented Apr 7, 2022

One final thought after having gone back and re-read everything. If you are still uploading through this node, are you asking it to pin the content on upload? Pinned content is not garbage-collected and is also not counted against the db-capacity configuration. But it is all dropped on a db nuke, as far as I can tell.
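A quick way to check whether anything is pinned on the node, assuming the default api-addr from this thread and a Bee version whose pins API lists pinned root references:

curl -s http://localhost:1633/pins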

@jpritikin

Nuking your DB periodically will just chew up your bandwidth and strain your neighborhood

OK, I'm happy to wait until my disk fills up. Maybe devs will figure out a solution by then.

If you are still uploading through this node

I am not. I haven't tried to upload for months.

@acud
Member

acud commented May 2, 2022

@jpritikin I am adding a new db indices command to the binary so we can have a bit more info about the problem. Could you please build #2924, run the bee db indices --data-dir <path-to-data-dir> and paste the output here?

@acud
Member

acud commented May 2, 2022

also, @jpritikin please don't use that built version to run bee normally. use the current stable version to run bee with bee start (there are still some things that we're ironing out before the next release)

@jpritikin

Here is the output:

root@glow:~# ./bee db indices --data-dir /opt/bee
INFO[2022-05-02T17:19:40-04:00] getting db indices with data-dir at /opt/bee 
INFO[2022-05-02T17:19:40-04:00] database capacity: 1000000 chunks (approximately 20.3GB) 
INFO[2022-05-02T17:20:29-04:00] localstore index: gcSize, value: 950674      
INFO[2022-05-02T17:20:29-04:00] localstore index: retrievalAccessIndex, value: 1896105 
INFO[2022-05-02T17:20:29-04:00] localstore index: postageChunksIndex, value: 1896105 
INFO[2022-05-02T17:20:29-04:00] localstore index: retrievalDataIndex, value: 1896105 
INFO[2022-05-02T17:20:29-04:00] localstore index: gcIndex, value: 948228     
INFO[2022-05-02T17:20:29-04:00] localstore index: postageRadiusIndex, value: 703 
INFO[2022-05-02T17:20:29-04:00] localstore index: reserveSize, value: 957416 
INFO[2022-05-02T17:20:29-04:00] localstore index: pullIndex, value: 1895775  
INFO[2022-05-02T17:20:29-04:00] localstore index: pinIndex, value: 947877    
INFO[2022-05-02T17:20:29-04:00] localstore index: postageIndexIndex, value: 5173925 
INFO[2022-05-02T17:20:29-04:00] localstore index: pushIndex, value: 0        
INFO[2022-05-02T17:20:29-04:00] done. took 48.935545587s                     

@acud
Member

acud commented May 2, 2022

w00t. and this takes how many gigs? can we have a du -d 1 -h of the localstore directory?

@jpritikin

Here you go,

root@glow:/opt/bee/localstore# du -d 1 -h
42G	./sharky
43G	.

@acud
Member

acud commented May 3, 2022

can you also provide the output of your /topology endpoint on the debug api?
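For reference, a sketch of fetching that output, assuming the debug API is enabled on its default address of 127.0.0.1:1635:

curl -s http://localhost:1635/topology > topo.txt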

@jpritikin

Here is topology output, topo.txt

@acud
Member

acud commented May 4, 2022

Thanks. Would you be able to post the free_* files from the localstore/sharky directory?
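In case it helps others gather the same data, the files in question can be listed and bundled like this (path as used earlier in this thread):

ls -l /opt/bee/localstore/sharky/free_*
tar czf sharky-free.tgz /opt/bee/localstore/sharky/free_*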

@jpritikin

@acud
Member

acud commented May 4, 2022

Thanks @jpritikin, this was very helpful. I have a possible direction on the problem. Since you can reproduce it, could you try the following, please?

  • have a look at the diff here
  • git checkout v1.5.1 or the last stable release that you're running (if git complains and you can't see the tag do a git fetch --tags)
  • apply the same change as in the changeset
  • build the binary
  • nuke the db
  • run the binary as you normally would and see if the storage keeps on leaking

many thanks in advance!
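A sketch of those steps end to end. The repository URL and the make binary target are assumptions about the standard Bee build; the actual patch to apply is the diff linked above.

git clone https://github.com/ethersphere/bee && cd bee
git fetch --tags
git checkout v1.5.1              # or the last stable tag you are running
# ...apply the change from the linked diff here...
make binary                      # assumed build target; the binary should land in dist/
# then nuke the db and run ./dist/bee as you normally would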

@jpritikin

Okay, I'm running this code.

@jpritikin

Disk usage is up to 21.5GiB.

@acud
Member

acud commented May 12, 2022

@jpritikin so if I understand correctly everything is OK running with this fix?

@jpritikin

so if I understand correctly everything is OK running with this fix?

No, I commented 4 hours ago because that's how long it took for the disk to fill up to that point. The test begins now, not ends.

Case in point, the disk usage is now up to 26GiB. So I would say that the fix has failed to cure the problem. 😿

@acud
Member

acud commented May 13, 2022

The limit on the number of chunks stored is not a hard limit, for reasons that would be difficult to explain here. Let's start by evaluating the size of the sharky subdirectory over time; please update us here on your findings. The LevelDB part (the localstore root) also does not get GC'd very often, and when it does, it runs on its own conditions (not a full GC). Let's start with the sharky dir stats and then see how to progress.
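A simple way to capture that over time (a sketch; the path and interval are just examples based on this thread):

while true; do
  echo "$(date -Is) $(du -sh /opt/bee/localstore/sharky | cut -f1)" >> sharky-size.log
  sleep 3600   # sample once per hour
done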

@jpritikin

jpritikin commented May 13, 2022

Ah, well, maybe there is nothing wrong? Is this the output you need?

root@glow:/opt/bee/localstore# du -d 1 -h
17G	./sharky
18G	.

I'm using btrfs for /opt. Perhaps the output of df -h /opt is not accurate? I can't see where 26 GiB is used, given that bee is only using 20 GiB.
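On btrfs, df can indeed be misleading because of metadata and copy-on-write accounting; one way to cross-check with the filesystem's own tooling (assuming btrfs-progs is installed) would be:

sudo btrfs filesystem usage /opt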

@jpritikin

Looks like I was confused by df output. df shows 35G used, but disk usage by the sharky directory is unchanged vs my previous comment 4 days ago. Time to close this issue?

@acud
Member

acud commented May 17, 2022

Let's leave it open for now and continue tracking the issue

@jpritikin

Should I upgrade to v1.6 or is there still value in running the custom build?

@acud
Member

acud commented May 18, 2022

yes you can, as we ported the fix into the release 👍

@jpritikin

Hm, something is still broken? I upgraded to 1.6.2 today and noticed that the disk was filling up again,

root@glow:/opt/bee/localstore# du -d 1 -h
36G	./sharky
37G	.
