Blockchain sync failure #6606

Closed
bol-van opened this issue Aug 31, 2015 · 11 comments


bol-van commented Aug 31, 2015

I'm on Bitcoin Core v0.11.0, Windows x64.
The OS is Windows Server 2012 R2.

I've been using Bitcoin Core for years without significant problems, but last month the database got corrupted. I deleted everything except wallet.dat and resynced the database about 5 times, placing the datadir on different hard drives. Each time, the sync stops with an error at a random position. After relaunching the process, the same error is displayed and the program crashes with an assertion.

[Attached images: bitcoin_read_database, 1, 2]

Author

bol-van commented Aug 31, 2015

The same thing happens with bitcoind.

C:\Program Files\Bitcoin\daemon>bitcoind.exe -datadir=H:\bitcoin
Error: Error reading from database, shutting down.

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

debug.log:

2015-08-31 12:39:34 LevelDB read failure: Corruption: block checksum mismatch
2015-08-31 12:39:34 Corruption: block checksum mismatch
2015-08-31 12:39:34 Error: Error reading from database, shutting down.
2015-08-31 12:39:34 Error reading from database: Database corrupted
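
For anyone hitting the same error, a minimal Python sketch for pulling the corruption entries out of debug.log might look like the following. The regular expression and the marker strings are assumptions based only on the excerpt above, not on a documented log format, and `find_corruption_lines` is a hypothetical helper name.

```python
import re
from typing import Iterable, List, Tuple

# Assumed line layout, inferred from the excerpt: "YYYY-MM-DD HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (.*)$")

# Marker strings copied from the log lines quoted in this issue.
CORRUPTION_MARKERS = ("Corruption:", "Database corrupted")


def find_corruption_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (timestamp, message) pairs for log lines that mention corruption."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and any(marker in match.group(2) for marker in CORRUPTION_MARKERS):
            hits.append((match.group(1), match.group(2)))
    return hits


SAMPLE = """\
2015-08-31 12:39:34 LevelDB read failure: Corruption: block checksum mismatch
2015-08-31 12:39:34 Corruption: block checksum mismatch
2015-08-31 12:39:34 Error: Error reading from database, shutting down.
2015-08-31 12:39:34 Error reading from database: Database corrupted
""".splitlines()

for timestamp, message in find_corruption_lines(SAMPLE):
    print(timestamp, "|", message)
```

To scan a real log, the same function can be given an open file object instead of `SAMPLE`.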

bol-van closed this as completed Aug 31, 2015
bol-van reopened this Aug 31, 2015
Member

laanwj commented Aug 31, 2015

"Error reading from database: Database corrupted": LevelDB corruption is usually caused by disk or memory corruption (while writing to disk).
You could try using -par=1 to restrict syncing to one thread and then -reindex. Sometimes this helps when, for example, the CPU is overheating.
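
A minimal sketch of how the suggested flags could be combined into a single invocation. The executable name and data directory are taken from the command quoted earlier in this thread; `build_reindex_command` is a hypothetical helper that only assembles the argument list and does not start the node.

```python
from typing import List


def build_reindex_command(exe: str, datadir: str, par: int = 1) -> List[str]:
    """Assemble a bitcoind argument list using -datadir, -par and -reindex."""
    if par < 1:
        raise ValueError("par must be at least 1 for this sketch")
    return [exe, f"-datadir={datadir}", f"-par={par}", "-reindex"]


# Values below come from the earlier comment in this issue.
cmd = build_reindex_command("bitcoind.exe", r"H:\bitcoin")
print(" ".join(cmd))

# To actually launch it, something like the following could be used:
#   import subprocess
#   subprocess.run(cmd, check=True)
```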

Author

bol-van commented Aug 31, 2015

It is unlikely that this is a RAM or disk problem. The OS runs stably for weeks, and memtest reports nothing.
There are no bad block events in the event log. One of the disks I tried to put the db on is only several days old.
RAM problems are mostly random. Here I get a 100% failure rate.
Are there any ways to further diagnose the source of the problem?

Author

bol-van commented Sep 1, 2015

I reproduced the exact same behavior in a VM with Windows Server 2003 x64.
Could someone please try to resync the whole db? Am I alone with this?

laanwj added the Windows label Sep 1, 2015
Member

laanwj commented Sep 1, 2015

I'd be interested to know if the same happens in that VM with Bitcoin Core 0.10.2.

Contributor

wtogami commented Sep 1, 2015

Are you able to reproduce the failure on other hardware?
What about bitcoind or bitcoin-qt for Linux in a VM?

Author

bol-van commented Sep 1, 2015

An additional note.

Both 0.10.2 and 0.11.0 fail to start db sync when an empty datadir is on "\\vmware-host\shared folders", but succeed when the datadir is on a Windows network drive.

2015-09-01 09:51:55 init message: Loading block index...
2015-09-01 09:51:55 Opening LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:51:55 Corruption: no meta-nextfile entry in descriptor
2015-09-01 09:52:23 init message: Loading block index...
2015-09-01 09:52:23 Wiping LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:52:23 Opening LevelDB in Z:\home-h\Bit2test\blocks\index
2015-09-01 09:52:23 Corruption: no meta-nextfile entry in descriptor
2015-09-01 09:52:25 Shutdown: In progress...
2015-09-01 09:52:25 StopNode()
2015-09-01 09:52:25 Shutdown: done

Author

bol-van commented Sep 1, 2015

I have one guess: the trouble may be in memory-mapped files. I know Bitcoin Core uses them; this can be seen in the RamMap utility.
I also run the BURST coin pocminer, which uses mapped files extensively. Because of that, the kernel paged pool grows very large, up to more than half of physical memory (gigabytes). The huge pool tag is "MmSt", which contains PTEs. A detailed description of the subject is here: http://blogs.technet.com/b/askperf/archive/2011/09/23/getting-to-know-the-mmst-pool-tag.aspx
I'm on a 24 GB system and have set PoolUsageMaximum to 10 (that is 10 percent of RAM, 2.4 GB in my case). This measure effectively limits MmSt growth, and it worked great until... what changed in the last weeks?
I replaced a failing hard drive that contained 3 TB of BURST miner plots. This time I formatted the NTFS volume with a 64K cluster size (it was 4K).
The Bitcoin db corruptions probably started from that point.
Now I have killed pocminer and am trying to sync Bitcoin both on the host and in the VM.
Without pocminer, Bitcoin could start syncing on the vmware-host shared folder.
I will report back once my guess is confirmed or refuted.

PS. Bitcoin 0.11.0 Linux x86 runs on a different hardware node without a VM. It has already synced up to blocks from 1 year ago, still with no problems.
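
The PoolUsageMaximum arithmetic mentioned above can be checked with a short Python sketch. It follows the poster's reading that the value is a percentage of physical RAM, which is an assumption taken from the comment and not verified against Windows documentation; `paged_pool_limit_bytes` is a hypothetical helper name.

```python
GIB = 1024 ** 3


def paged_pool_limit_bytes(physical_ram_bytes: int, pool_usage_maximum_percent: int) -> int:
    """Pool limit in bytes, assuming the setting is a percentage of physical RAM."""
    if not 0 < pool_usage_maximum_percent <= 100:
        raise ValueError("percentage must be in the range 1..100")
    return physical_ram_bytes * pool_usage_maximum_percent // 100


# Figures from the comment: a 24 GB system with PoolUsageMaximum set to 10.
limit = paged_pool_limit_bytes(24 * GIB, 10)
print(f"{limit / GIB:.1f} GiB")
```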

Author

bol-van commented Sep 2, 2015

Yes, the trouble was triggered by the BURST miner. Without it, the sync was successful.
Running with an almost exhausted paged pool causes errors not only in Bitcoin Core but also has other negative effects, and having a large-cluster volume seems to make them worse.

bol-van closed this as completed Sep 2, 2015
Member

laanwj commented Sep 2, 2015

Thanks for looking into this so deeply. This issue could be useful for other people who experience issues on Windows.

I still wonder how the combination of hw and sw caused corruption, but it's likely the problem lies outside Bitcoin Core if it affects other software negatively as well.

Author

bol-van commented Sep 2, 2015

One of the negative effects was the following.
Attempts to start db sync from Bitcoin Core running in a VMware guest to a VMware host drive failed right at the start. Then I tried mounting a network drive from the VM guest to the VM host over the virtual network (a regular 'net use \\192.168.1.5') and syncing the db to that drive.
The start was successful, but after some time I saw messages in the tray stating that Windows could not flush data to the network drive and that data could be lost. Obviously, Bitcoin Core cannot display such messages; explorer.exe displays them. The event source is the guest OS kernel not being able to get read/write success confirmation from the server side. Thus the lanmanserver component (the 'Server' service) on the host was experiencing problems, probably I/O-related, while the paged pool was being exhausted.
It is all very strange, because in Task Manager on the host I see the pool being trimmed from 2.8G to 800M after its exhaustion and then growing back to 2.8G.
I suppose some paged pool allocations or some map-view-of-file operations fail before the MmSt trim actually happens. Kernel components are written and checked carefully so as not to crash under any condition, but a denial of service still exists.
Is this a Windows architecture problem? I know the BURST miner is badly designed. It should not map terabytes of data files into memory. But the OS should also not behave badly in this condition. If the MmSt pool is like a cache, it must be trimmed transparently without allocation failures.

From the Bitcoin Core perspective, maybe some checks are missing, or the db engine lacks enough atomicity to roll back failing changes? At the moment I can state: the BURST miner can kill the Bitcoin db in some conditions, possibly when the BURST plots are on a large-cluster volume.
This is not hardware-related at all. It is mainly an OS problem: the OS is not resistant enough to some conditions.

bitcoin locked as resolved and limited conversation to collaborators Dec 16, 2021