Extensive disk fragmentation makes OS unresponsive #776

Closed
pszturmaj opened this Issue · 32 comments

9 participants

Piotr Szturmaj Gregory Maxwell p2paul William Cryo Coldwell P. Kaufmann Pieter Wuille rkfg Pas Wladimir J. van der Laan
Piotr Szturmaj

Allocation of file blk0001.dat is very inefficient. It produces more than 100 file fragments per minute on Windows XP 32 bit SP3 (NTFS partition). System becomes very slow and sometimes unresponsive. Hard disk activity is very high.

I encountered this problem in early versions of the client and it still remains in 0.5.2.

Here's a screenshot of the fragmentation after 5 minutes of block downloading; see file blk0001.dat. Note that I defragmented this file to one fragment before I ran the client.
http://i.imgur.com/fMhWR.png

Gregory Maxwell
Collaborator

I don't see that 500 fragments in a 500 meg file is much of an issue at all.

The excessive IO (mostly caused by sync writes and write amplification to update the transaction/block index) is a known problem, but I don't believe that fragmentation is playing much of a role in that. (Or at least this isn't evidence of it)

Piotr Szturmaj

That was just 5 minutes. Once, this file had more than 6000 fragments. To scan such a file, assuming that avg. access time is 13 ms we need 6000 x 13 = 78000 ms = 78 seconds.
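
The estimate above can be written out as a small sketch. It assumes one full-cost seek per fragment and ignores transfer time, caching, and read-ahead, so it is only a rough figure; the function name is made up for illustration.

```python
def estimated_scan_seconds(fragments: int, access_time_ms: float) -> float:
    """Rough seek overhead for reading a fragmented file front to back.

    Assumes one seek per fragment at the given average access time and
    ignores transfer time, OS caching, and read-ahead.
    """
    return fragments * access_time_ms / 1000.0


print(estimated_scan_seconds(6000, 13))  # 78.0
print(estimated_scan_seconds(551, 13))   # 7.163
```

The second call uses the 551 fragments from the screenshot with the same assumed 13 ms access time.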

I don't know about IO issues but they may be related to fragmentation IMHO.

Gregory Maxwell
Collaborator

Hm? In your screenshot the file shows 580 mbytes and 551 fragments— 580 mbytes is more than half way synced up in terms of byte size. Also— normal access to that file is already random, the only time it accesses it at all sequentially is when you do a -rescan.

Piotr Szturmaj

Yes, but I defragmented this file before. The screenshot shows the effect of running bitcoin client for 5 minutes after I defragmented this file. So, before I started the client this file had only 1 fragment.

Gregory Maxwell
Collaborator

Ah! Thank you, I hadn't caught that before.

p2paul

Win7x64 bitcoin-qt --- blk0001.dat && blkindex.dat exploded all over my NTFS. When my HD started sounding like a dot-matrix printer and the synchronization effectively froze, I was forced to synchronously defrag.

This issue could seriously damage hardware and will get worse faster than the data grows.

I assume Ext_ and SSD users are unaffected.

Piotr Szturmaj

This can be fixed easily by allocating the block file in larger chunks, for example 16 MB at a time.

EDIT: I think chunk size should be adjustable in the options dialog.
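
A minimal sketch of chunked allocation, not the client's code. `ChunkedAppender` and its fields are hypothetical names. It grows the file by writing zeros in whole chunks, which costs a second write of the same bytes later, and it tracks the logical end of the data in memory only. Whether the filesystem places each chunk contiguously is up to its allocator.

```python
import os

CHUNK_SIZE = 16 * 1024 * 1024  # 16 MB, the example size suggested above


class ChunkedAppender:
    """Append records to a file that is grown in large fixed-size steps.

    Hypothetical helper for illustration. The logical end of the data
    (self.used) lives in memory only; a real client would have to persist
    it somewhere, because the file size no longer says where data ends.
    """

    def __init__(self, path: str, chunk_size: int = CHUNK_SIZE):
        self.chunk_size = chunk_size
        if not os.path.exists(path):
            open(path, "wb").close()  # create an empty file
        # "r+b" allows writes at arbitrary offsets, unlike append mode.
        self.f = open(path, "r+b")
        self.allocated = os.path.getsize(path)
        self.used = self.allocated  # assume an existing file is fully used

    def _grow(self, needed: int) -> None:
        # Extend by whole chunks of zeros so the space is really allocated.
        zeros = bytes(self.chunk_size)
        self.f.seek(self.allocated)
        while self.allocated < needed:
            self.f.write(zeros)
            self.allocated += self.chunk_size
        self.f.flush()

    def append(self, data: bytes) -> int:
        """Write data at the logical end and return its offset."""
        offset = self.used
        if offset + len(data) > self.allocated:
            self._grow(offset + len(data))
        self.f.seek(offset)
        self.f.write(data)
        self.used += len(data)
        return offset

    def close(self) -> None:
        self.f.close()
```

The `chunk_size` argument is the adjustable setting; an options dialog would only need to pass a different value.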

PS. I was trying to fix this myself, but I couldn't even compile the client, either with QtCreator or manually within MinGW. Perhaps updated and detailed information about the build process would help others fix some bugs.

William Cryo Coldwell
cryo commented

Seeing this on OSX 10.7 with Journaled HFS+. It won as the most fragmented file by such a monumental amount. I get a nice machine-gun sound from my poor drive from the seeking.

P. Kaufmann

I had this in mind for client optimisation, too ... my blk0001.dat consists of 64000 fragments. We all know the current size is ~1.2GB, so why not pre-allocate this as a whole chunk and write into that chunk? I'm not in the code, but it sounds interesting ^^.

Piotr Szturmaj

@Diapolo this means that the average fragment size is 19.5KB... this is really harmful to the disk drive

@gmaxwell with 500 fragments I waited a minute to see the received block count increase by one. After defragmenting blk0001.dat the rate jumped to a few tens of blocks/sec. Tested with the latest 0.6 version.

This is a serious issue and it seems that the client doesn't use buffering at all. If you're about to write a block to disk, write it to a 4MB (adjustable) memory buffer instead. Do the same with subsequent blocks until the buffer is filled up, then commit the whole buffer to disk. This simple fix should help greatly.
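
The buffering idea can be sketched as follows. `BufferedBlockWriter` is a hypothetical name and this is not how the client writes blocks; it only illustrates collecting blocks in memory and committing them in one large write once the buffer fills.

```python
import io


class BufferedBlockWriter:
    """Collect blocks in memory and write them out in large batches.

    Hypothetical helper for illustration. The 4 MB default comes from
    the suggestion above.
    """

    def __init__(self, fileobj, buffer_size: int = 4 * 1024 * 1024):
        self.fileobj = fileobj
        self.buffer_size = buffer_size
        self.pending = bytearray()
        self.flush_count = 0

    def write_block(self, block: bytes) -> None:
        self.pending += block
        if len(self.pending) >= self.buffer_size:
            self.flush()

    def flush(self) -> None:
        # One large write instead of many small ones.
        if self.pending:
            self.fileobj.write(bytes(self.pending))
            self.fileobj.flush()
            self.pending.clear()
            self.flush_count += 1


# Usage with an in-memory file and a tiny buffer so the effect is visible.
out = io.BytesIO()
writer = BufferedBlockWriter(out, buffer_size=10)
for block in (b"aaaa", b"bbbb", b"cccc", b"dd"):
    writer.write_block(block)
writer.flush()  # write whatever is left
print(writer.flush_count, out.getvalue())  # 2 b'aaaabbbbccccdd'
```

The trade-off is that anything still in the buffer is lost if the process dies before the next flush.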

P. Kaufmann

I'm currently working on an "experimental" patch that addresses this for the blk0001.dat file on Windows. I'm not sure if the devs are fine with such a thing, but I'm sure this can also be done for Mac and Linux. For such optimizations I would tolerate OS-specific patches as long as they don't break the legacy (un-optimized) behaviour of the client. What do other devs think about this?

@pszturmaj Are you able to compile the client on Windows? I can't compile redistributable files (because of dependencies on non-statically linked libs) and don't want to :). You could check out my work after it's finished as a tester.

Pieter Wuille
Collaborator
sipa commented

@Diapolo how does it work? How do you influence bdb?

Also: I hope it can be integrated in gitian builds.

P. Kaufmann

I can say it only works before the file exists, so for the change to be effective and active the blockchain needs to be re-downloaded. As I'm currently in a trial and error phase, I will start posting further info after that :).

P. Kaufmann

@sipa I need your help (badly)! The WriteToDisk() in CBlock uses streams to write index headers and blocks. I would like to know if I can somehow set the file pointer / position indicator, so that the data is not always appended to the end of blk0001.dat. I need to be independent of the real file length for my idea.

I tried to change the fseek() in AppendBlockFile(), but that didn't make a difference :-(. The very first block is okay, I can see this because I queried via ftell(), but after the "fileout <<" calls the position indicator is not where it should be :D.
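
One possible explanation, assuming the file is opened in append mode (the name AppendBlockFile() suggests this, but the thread does not confirm it): in C, a stream opened with "a" forces every write to the current end of file regardless of fseek(). Python's file modes show the same behaviour on Linux, so this small demonstration uses them as a stand-in for the C calls.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.dat")
with open(path, "wb") as f:
    f.write(b"0123456789")

# Append mode: the seek is accepted, but the write still lands at the end.
with open(path, "ab") as f:
    f.seek(0)
    f.write(b"AB")
with open(path, "rb") as f:
    after_append = f.read()

# Update mode ("r+b"): the write goes where the seek pointed.
with open(path, "r+b") as f:
    f.seek(0)
    f.write(b"XY")
with open(path, "rb") as f:
    after_update = f.read()

print(after_append)  # b'0123456789AB' (on Linux)
print(after_update)  # b'XY23456789AB'
```

If that is the cause, opening the file in an update mode rather than append mode would let the position indicator take effect.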

Piotr Szturmaj

@Diapolo I was able to compile the libs separately, but then I gave up and didn't touch it again. Please let me know when you finish your patches, and I'll try to test them on Windows.

P. Kaufmann

... still experimenting. Seems like via fseek I simply can't set the filepointer to write somewhere into a file :-/.

P. Kaufmann

@pszturmaj With the help of BlueMatt I can perhaps compile a version for you to download and test. I hope to have it ready before I'm off for a few days ;).

rkfg
rkfg commented

Did anyone think about initial chain loading/verification/synchronization while holding the database completely in memory? Memory is cheap today and this feature can be optional. If you have enough free memory (1-2 GB), why not put blk0001.dat and blkindex.dat there and flush them to disk every 30-60 minutes, for example? This would greatly help to avoid fragmentation and also disk thrashing. I tried this using tmpfs and symlinks on GNU/Linux and it worked like a charm.

William Cryo Coldwell
cryo commented

0.6.0.6beta still shows it

[Image: BitCoin fragmentation]

Gregory Maxwell
Collaborator

I've still yet to see a plausible measurement here showing that this makes a material performance difference and/or what kind of speedup could be expected if the problem were removed. "a minute to complete download a block" is just not plausible: at that rate it would take hours to sync just a day's worth of blocks, and there simply aren't users reporting that it takes weeks to sync the chain (at least not anymore).

There are a number of ways of demonstrating this— perhaps the easiest would be sync up to height 140,000 from a local node (important in order to avoid differences due to network luck)... shut down, copy the directory (thus making a defragmented version), start back up (on the fragmented version) and time how long it takes to sync to 160000 (using logtimestamps=1), then shutdown and replace with the backup and time the sync again.
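
The comparison described above comes down to two elapsed times and their ratio. A minimal sketch follows. The timestamp format is an assumption, not the client's documented log format, and the function names are made up for illustration.

```python
from datetime import datetime

# Assumed timestamp layout; adjust it to whatever the log actually prints.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two timestamp strings, e.g. at heights 140000 and 160000."""
    t0 = datetime.strptime(start, TIMESTAMP_FORMAT)
    t1 = datetime.strptime(end, TIMESTAMP_FORMAT)
    return (t1 - t0).total_seconds()


def speedup(fragmented_seconds: float, defragmented_seconds: float) -> float:
    """How many times faster the defragmented run was (1.0 means no difference)."""
    return fragmented_seconds / defragmented_seconds
```

Each run would supply the timestamp logged when sync resumed at 140,000 and the one logged on reaching 160,000. A ratio close to 1.0 would support the view that fragmentation is not the bottleneck.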

Simply pointing to the fact that there are fragments isn't itself especially concerning— or at least it doesn't demonstrate that the double writes needed to reduce fragmentation on Windows via preallocation would be justified by the resulting performance increase from fragmentation reduction.

Eurekafag, sure tmpfs works great— but putting it in memory, and also coping with it taking a minute to write to disk all at once is not easy to do. Patches accepted.

rkfg
rkfg commented

Gmaxwell, uhh, I'm just suggesting. I don't even know if Berkeley DB can work entirely in memory. Just an obvious idea of speeding things up a bit, especially for those OSes lacking built-in ramdisk support. Flushing could block all database operations for a minute (once in 30-60 minutes — not a thing to worry about) but it saves more time not using the disk and thus avoiding fragmentation while downloading hundreds or thousands of blocks.

Pas

> I've still yet to see a plausible measurement here showing ...

I'd propose to use "time to desktop" without bitcoin-qt, with bitcoin-qt and fragmented files, and after defragmentation. Otherwise the occasional reads and writes are not really a performance toll, because the last few blocks are already in the OS file cache.

I'll try to measure it in a few days (because I've just defragmented the bitcoin directory).

Gregory Maxwell
Collaborator

meh. Time to desktop can be made much smaller by simply making checkblocks check less of the end of the chain at startup, or by putting it in the background. This needs to distinguish slow checkblocks from slow startup for other reasons, especially since checkblocks does cause sequential access to the chain (it's the only thing that does during normal operation), so it would exaggerate the impact.

Piotr Szturmaj

I can confirm that synchronization slows down proportionally to the number of blk0001.dat fragments. The test is simple:

  1. defragment the block file
  2. run the client
  3. observe that in the beginning it synchronizes well, but later it keeps slowing down continuously
  4. close the client, check that the block file has a few hundred fragments
  5. goto 1 and observe the same; this is repeatable

During synchronization it slows down to the point where the client gets almost unresponsive. After I close the client, I must wait half a minute to see that its process is actually killed. This happens only after a major slowdown; it doesn't happen when the file is defragmented.

Is this "plausible" enough?

P. Kaufmann
Diapolo commented

@pszturmaj I have a branch ready that fixes your reported problem for Windows (consider it still experimental): https://github.com/Diapolo/bitcoin/tree/InitBlockDL-exp The problem is I can't compile an executable you could try, so I rely on BlueMatt to get a working exe via Jenkins, which produced quite some errors the last time we tried. Or are you able to compile for yourself now?

P. Kaufmann
Diapolo commented

@pszturmaj Sipa was kind enough to build a working executable from my branch (https://github.com/Diapolo/bitcoin/tree/InitBlockDL-exp), which is based on the current bitcoin master branch. You can find the files for Windows (only) here: http://bitcoin.sipa.be/builds/0.6.1-35-g34a0eab/

Would be great if you could check them out and report back. Be aware that you will need to delete your current blk*.dat files for this testing and unofficial build to work properly!

DISCLAIMER:
The linked build is experimental and should only be used in testing environments, as it can contain serious bugs! I tested this on my Windows machine and fixed all bugs I found to date!

Piotr Szturmaj

@Diapolo Hey, it's much better! It allocates a ~2 GB block file and now it doesn't block the system anymore.

Here's a screenshot after 10 minutes of running the client: http://i.imgur.com/GSms5.png. Please note the network graph. The link speed constantly drops and increases, I think this could be optimized.

P. Kaufmann
Diapolo commented

@pszturmaj Sounds good as a starting point. Could you benchmark (measure the time it takes to do a full block-chain download) 0.6.1 official vs. that experimental build?

And perhaps a startup- / shutdown-time comparison :)?

Piotr Szturmaj

@Diapolo I'm still experiencing extensive disk access, shutdown-time was 45 sec after 3 mins of downloading (during shutdown, Hdd LED was ON and system was blocked). Before that, I temporarily closed the client and defragmented blkindex.dat and debug.log. I see that debug.log is written frequently but I think it's too small to cause such slow downs.

I did look at the IO stats and I saw that the bitcoin process reads at a rate of up to 10 MB/s and writes at up to 8 MB/s. Speed dropped to 5-10 blocks/sec. Does it scan the whole index for each block?

It seems that downloading gets slower, proportionally to the number of received blocks.

P. Kaufmann
Diapolo commented

@pszturmaj
blkindex.dat still becomes heavily fragmented over time, but at least the blk000x.dat files should consist of a single fragment on disk. My latest work was truncating unused space in the block files after they reach their max size, to save disk space, and adding reading from block files via std::fstream, too ... which now also works.
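
The truncation step can be sketched like this. It is an illustration only, not the code in the experimental branch. `trim_unused_tail` is a hypothetical name, and the count of used bytes has to come from the caller's own bookkeeping.

```python
import os
import tempfile


def trim_unused_tail(path: str, used_bytes: int) -> int:
    """Cut a pre-allocated file down to the bytes actually used.

    Returns the number of bytes released. The file itself does not
    record how much of it is in use, so the caller must supply that.
    """
    size = os.path.getsize(path)
    if used_bytes > size:
        raise ValueError("used_bytes exceeds the file size")
    with open(path, "r+b") as f:
        f.truncate(used_bytes)
    return size - used_bytes


# Demo: 1000 bytes reserved, only 300 used.
path = os.path.join(tempfile.mkdtemp(), "blk_demo.dat")
with open(path, "wb") as f:
    f.write(b"x" * 300 + bytes(700))
print(trim_unused_tail(path, 300), os.path.getsize(path))  # 700 300
```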

I currently can't do anything in terms of network performance, my focus lies on the filesystem stuff in my experimental build. I'll see if there is a way to do the same pre-alloc for blkindex.dat.

Pieter Wuille
Collaborator
sipa commented

@pszturmaj more recent blocks have more transactions, and to process a transaction, all its inputs have to be looked up in the index. This index is very efficient, but since those transactions are most likely spread out over many megabytes on your disk, it will require more and more seeking as the block database grows.
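
A rough model of this effect, assuming every uncached input lookup costs one disk seek. The function name and the cache parameter are hypothetical; real behaviour depends on the database, the OS file cache, and the disk.

```python
def seek_seconds_per_block(inputs_in_block: int, seek_ms: float,
                           cache_hit_rate: float = 0.0) -> float:
    """Rough seek time needed to look up every input of one block.

    Assumes one disk seek per uncached input; transfer time and
    index traversal are ignored.
    """
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be between 0 and 1")
    uncached = inputs_in_block * (1.0 - cache_hit_rate)
    return uncached * seek_ms / 1000.0


# Placeholder input count, with the 13 ms access time quoted earlier in the thread.
print(seek_seconds_per_block(100, 13))  # 1.3
```

Under this model the cost per block scales with the number of inputs, which fits the observation that later blocks take longer to process.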

Wladimir J. van der Laan
Owner

Closing this as a new storage system is in use for the blocks that avoids having one large file with many fragments.

chronokings referenced this issue in chronokings/huntercoin: Major Fragmentation on windows #64 (Open)
