Blockchain sync failure #6606

bol-van · 2015-08-31T12:23:10Z

I'm on Bitcoin Core v0.11.0, Windows x64.
The OS is Windows Server 2012 R2.

I've been using Bitcoin Core for years without significant problems, but last month the database got corrupted. I deleted everything except wallet.dat and resynced the database. I tried about 5 times, with the datadir on different hard drives. Each time the sync stops with an error at a random position. After relaunching the process, the same error is displayed and the program crashes with an assertion.

bol-van · 2015-08-31T13:57:01Z

The same thing happens with bitcoind.

C:\Program Files\Bitcoin\daemon>bitcoind.exe -datadir=H:\bitcoin
Error: Error reading from database, shutting down.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

debug.log :

2015-08-31 12:39:34 LevelDB read failure: Corruption: block checksum mismatch
2015-08-31 12:39:34 Corruption: block checksum mismatch
2015-08-31 12:39:34 Error: Error reading from database, shutting down.
2015-08-31 12:39:34 Error reading from database: Database corrupted
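
For longer logs, the corruption-related messages can be pulled out and counted with a short script. This is only a sketch in Python: the timestamp pattern is taken from the excerpt above, and the function name `corruption_summary` is made up for illustration, not part of Bitcoin Core.

```python
import re
from collections import Counter

# Timestamp layout as it appears in the debug.log excerpt in this thread.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

def corruption_summary(lines):
    """Count distinct log messages that mention corruption (timestamps ignored)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if not m:
            continue  # skip lines that do not start with a timestamp
        message = m.group(2)
        if "corrupt" in message.lower():
            counts[message] += 1
    return counts

sample = [
    "2015-08-31 12:39:34 LevelDB read failure: Corruption: block checksum mismatch",
    "2015-08-31 12:39:34 Corruption: block checksum mismatch",
    "2015-08-31 12:39:34 Error: Error reading from database, shutting down.",
    "2015-08-31 12:39:34 Error reading from database: Database corrupted",
]

for message, n in corruption_summary(sample).items():
    print(n, message)
```

To run it against a real log, pass an open file instead of `sample`, for example `corruption_summary(open(path, errors="replace"))`, where `path` points at debug.log inside the datadir.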

laanwj · 2015-08-31T14:23:40Z

"Error reading from database: Database corrupted" is a LevelDB corruption error. It is usually caused by disk or memory corruption (while writing to disk).
You could try using -par=1 to restrict syncing to one thread and then -reindex. Sometimes this helps when, for example, the CPU is overheating.
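
One rough way to exercise the write path independently of Bitcoin Core is to write random data with known checksums and read it back. The following is a minimal Python sketch, not a real disk diagnostic: the function name `write_and_verify` and its defaults are assumptions for illustration, and because the read-back may be served from the OS file cache, a clean pass does not prove the disk is healthy.

```python
import hashlib
import os
import tempfile

def write_and_verify(directory, blocks=64, block_size=1 << 20):
    """Write random blocks to a temp file in `directory`, re-read them, and
    return the indices of blocks whose SHA-256 digest no longer matches."""
    digests = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            for _ in range(blocks):
                data = os.urandom(block_size)
                digests.append(hashlib.sha256(data).digest())
                f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the data to the device
        bad = []
        with open(path, "rb") as f:
            for i, expected in enumerate(digests):
                if hashlib.sha256(f.read(block_size)).digest() != expected:
                    bad.append(i)
        return bad
    finally:
        os.remove(path)  # both file handles are closed by this point
```

For example, `write_and_verify(r"H:\bitcoin")` would test the drive holding the datadir from the report above. A non-empty result points at a problem below the application, whether disk, memory, or driver.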

bol-van · 2015-08-31T15:51:01Z

It's unlikely that this is a RAM or disk problem. The OS runs stable for weeks, and memtest reports nothing.
There are no bad block events in the event log. One of the disks I tried to put the db on is only several days old.
RAM problems are mostly random, but here I get a failure 100% of the time.
Are there any ways to further diagnose the source of the problem?

bol-van · 2015-09-01T08:05:47Z

I reproduced exactly the same behavior in a VM with Windows Server 2003 x64.
Could someone please try to resync the whole db? Am I alone with this?

laanwj · 2015-09-01T09:47:17Z

I'd be interested to know if the same happens in that VM with Bitcoin 0.10.2.

wtogami · 2015-09-01T09:50:17Z

Are you able to reproduce the failure on other hardware?
What about bitcoind or bitcoin-qt for Linux in a VM?

bol-van · 2015-09-01T09:54:06Z

Additional note.

Both 0.10.2 and 0.11.0 fail to start the db sync when the empty datadir is on "\\vmware-host\shared folders", but succeed when the datadir is on a Windows network drive.

2015-09-01 09:51:55 init message: Loading block index...
2015-09-01 09:51:55 Opening LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:51:55 Corruption: no meta-nextfile entry in descriptor
2015-09-01 09:52:23 init message: Loading block index...
2015-09-01 09:52:23 Wiping LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:52:23 Opening LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:52:23 Corruption: no meta-nextfile entry in descriptor
2015-09-01 09:52:25 Shutdown: In progress...
2015-09-01 09:52:25 StopNode()
2015-09-01 09:52:25 Shutdown: done

bol-van · 2015-09-01T11:01:01Z

I have one guess: the trouble may be in memory-mapped files. I know Bitcoin Core uses them; they can be seen in the RamMap utility.
I also run the BURST coin pocminer, which uses mapped files extensively. Because of that, the kernel paged pool grows very large, to more than half of physical memory (that is gigabytes). The huge pool tag is "MmSt"; it contains PTEs. A detailed description is here: http://blogs.technet.com/b/askperf/archive/2011/09/23/getting-to-know-the-mmst-pool-tag.aspx
I'm on a 24 GB system and set PoolUsageMaximum to 10 (that is 10 percent of RAM, 2.4 GB in my case). This measure effectively limited MmSt growth and worked great until... what changed in the last weeks?
I replaced a failing hard drive that contained 3 TB of BURST miner plots. This time I formatted the NTFS volume with a 64K cluster size (it was 4K).
The bitcoin db corruptions probably started from that point.
Now I have killed pocminer and am trying to sync bitcoin both on the host and in the VM.
Without pocminer, bitcoin could start the sync on the vmware-host shared folder.
I will report back once my guess is confirmed or not.

PS. Bitcoin 0.11.0 for Linux x86 runs on a different hardware node without a VM. It has already synced up to blocks that are 1 year old, still with no problem.

bol-van · 2015-09-02T04:30:43Z

Yes, the trouble was triggered by the BURST miner. Without it the sync was successful.
Running with an almost exhausted paged pool not only causes errors in Bitcoin Core but also has other negative effects, and having a large-cluster volume seems to make them worse.

laanwj · 2015-09-02T10:28:45Z

Thanks for looking into this so deeply. This issue could be useful for other people who experience problems on Windows.

I still wonder how the combination of hardware and software caused the corruption, but the problem likely lies outside Bitcoin Core if it affects other software negatively as well.

bol-van · 2015-09-02T13:12:52Z

One of the negative effects was the following.
Attempts to start the db sync from Bitcoin Core running in a VMware guest to a VMware host drive were failing right at the start. Then I tried to mount a network drive from the VM guest to the VM host over the virtual network (a regular 'net use \\192.168.1.5') and sync the db to that drive.
The start was successful, but after some time I saw messages in the tray stating that Windows could not flush data to the network drive and that data could be lost. Obviously Bitcoin Core cannot display such messages; explorer.exe displays them. The event source is the guest OS kernel not being able to get read/write success confirmation from the server side. Thus the lanmanserver component (the 'Server' service) on the host was experiencing problems under the paged pool exhaustion condition, probably I/O-related.
It's all very strange, because in Task Manager on the host I see that the pool is trimmed from 2.8G to 800M after its exhaustion and then grows to 2.8G again.
I suppose some paged pool allocations or some map-view-of-file operations fail before the MmSt trim actually happens. Kernel components are written with careful checks so that they do not crash in any possible condition, but a denial of service still exists.
Is this a Windows architecture problem? I know the BURST miner is badly designed. It should not map terabytes of data files into memory. But the OS should not behave badly in this condition either. If the MmSt pool is like a cache, it must be trimmed transparently without allocation failures.

From the Bitcoin Core perspective, maybe some checks are missing, or the db engine lacks enough atomicity to roll back failing changes? At the moment I can state: the BURST miner can kill the bitcoin db in some conditions, possibly when the BURST plots are on a large-cluster volume.
This is not hardware related at all. It's mainly an OS problem: the OS is not resilient enough under some conditions.

bol-van closed this as completed Aug 31, 2015

bol-van reopened this Aug 31, 2015

laanwj added the Windows label Sep 1, 2015

bol-van closed this as completed Sep 2, 2015

laanwj added the Data corruption label Feb 9, 2016

jonasschnelli mentioned this issue Jul 11, 2017

ioctl error when opening database on external hard drive #10787

Closed

bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021
