State heal phase is very slow (not finished after 2 weeks) #23191
Comments
I've been eyeing through the logs a bit. I wonder if your IO speed is bad? Also, there's a lot of this:
It indicates that requests time out. And there's a lot of it. No idea why, but @karalabe added a mechanism to scale the request size depending on the performance, not sure if that is in |
I don't think the IO speed is bad. Here is an indication (taken while geth is also still running):
|
What are the machine specs? |
8GB RAM, 4x 2.2GHz vCPUs (AMD EPYC 7601) |
I have the same issue, 2 weeks+ and still not done "State heal in progress"
Log snippet:
Any clue how much longer based on logs? total number of nodes/etc? |
The same issue. IO is more than enough.
machine
eth_syncing
The difference between |
@bFanek Your node is downloading 2 trie nodes per second (1 trie node is 500 bytes). On a different log, it says writing 7MB to the database took 14 seconds. Something's very off there. What kind of machine are you running Geth on? |
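To put the figures from the comment above in perspective, here is a small worked calculation. It is only a sketch: the 500-byte node size, the 2 nodes per second, and the 7 MB in 14 seconds are the approximate numbers quoted in that comment, and the function names are hypothetical helpers, not part of Geth.

```python
# Approximate trie node size quoted in the comment above (an estimate, not a Geth constant).
TRIE_NODE_BYTES = 500


def download_rate_bytes_per_s(nodes_per_s, node_bytes=TRIE_NODE_BYTES):
    """Effective download rate implied by a trie-node-per-second figure."""
    return nodes_per_s * node_bytes


def write_rate_mb_per_s(megabytes, seconds):
    """Effective database write rate implied by a flush size and its duration."""
    return megabytes / seconds


if __name__ == "__main__":
    # 2 trie nodes per second at roughly 500 bytes each
    print(download_rate_bytes_per_s(2), "bytes/s")
    # 7 MB written in 14 seconds
    print(write_rate_mb_per_s(7, 14), "MB/s")
```

Under these assumptions the node is healing at about 1 KB/s and flushing at about 0.5 MB/s, which is why the comment calls the numbers "very off".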
@kraymond37 You'll need to share some logs. How fast are trie nodes being downloaded? |
@xrchz The logs you shared show the initial snap sync, even at that, only the beginning. You say you've been running state heal for weeks, but no such thing appears in the logs you've shared. |
@karalabe Gist automatically truncates large files. You need to click "view the full file" at the top of the gist page to see it all. |
Oh, doh. Apologies.
|
I've got the same issue. Syncing mainnet and rinkeby, both nodes have been running state heal for 2-3 weeks now. |
Is there a resolution to this? |
Similar problem here ... I'm ~8 days in now ... the state sync took ~2 days (including some downtimes while I was setting up the rest of the server). The other 6 days it's been state healing.
RaspberryPi ARM-64 (8 GiB RAM, overclocked @ 4 x 2,100 MHz)
Not sure if relevant, but I had >40,000 of the below messages in the 6 days it's been state healing:
The Geth folder currently utilizes 400 GiB (no archive). |
FYI, I posted more detail in the other ticket: #23228 EDIT:
|
@xrchz Have you solved this problem? I'm facing the same problem as you. Can you tell me what happened in the following days? Thanks! (Or can I switch to fast mode without deleting the data I have downloaded?) |
No I have not solved it. I stopped trying to use geth myself and switched to using full node service providers instead. |
sad news :( |
Try the other sync options; maybe one works better. |
Can I just interrupt the snap sync with Ctrl+C and use fast sync to continue with the same data folder? (I don't want to lose what I have downloaded over the last week.) |
When I did this recently, it re-used the already downloaded chaindata (~470 GiB). Not sure about the rest. |
Thanks! BTW, I noticed the resync didn't work for you, right? 😂 |
I didn't have time to test a lot of things, so I just moved the downloaded chaindata to an x86-64 machine and did a snap sync there ... which worked. Then I transferred it back to the ARM and it's been running smoothly since. |
How can i increase the throttle? |
Hello everyone. When I hit this issue, I had started geth with an inaccessible ancient store because of faulty permissions. Snap sync was used by default. While it was running I made another mistake and my system partition ran out of disk space. After that was solved, the 4 GB of RAM was exhausted, the swap failed, and the system halted. So can we expect a corrupted database at this point? After this was fixed I observed the behaviour described: a lot of "Unexpected trienode heal packet" messages. My port forwarding was also missing, but adding it did not seem to fix the problem. IPv6 was enabled; I disabled it and started with "--syncmode fast". The "Unexpected trienode heal packet" messages disappeared from the logs and the node is syncing. System: AMD Ryzen in a VirtualBox VM attached to a USB 3.0 NVMe drive, running Ubuntu Server 20 (4 vCPU, 6 GB RAM, 6 GB swap) |
Hi. A 8 GB Raspberry Pi 4 is able to complete a snap sync in less than 2 days. Please see this: https://twitter.com/EthereumOnARM/status/1463796036103770114 And check these points: https://twitter.com/EthereumOnARM/status/1463828024080707585 |
Hi I've also been trying to snap sync a geth node in aws and have also been stuck in state healing for a few weeks. I've synced nodes locally before within a few days so this was surprising to me. I'm using nvme storage on the aws node so I would expect it has the iops to sync. A log snippet:
I have seen others have solved this by switching to fast sync mode but this is no longer an option on the geth version i'm using (1.10.16-stable). |
Geth on my Raspberry Pi 4 8 GB has also continuously been reporting "State heal in progress" and frequent "Unexpected trienode heal packet" log messages for a few weeks after I deleted the Geth database and restarted the sync from the beginning. What do these messages mean? Is my Geth node still syncing? Should I bother to continue running my Geth node until this issue is resolved? |
I had the same problem and decided to sync on a different (faster) computer instead ... I then rsync'd the blockchain data to the RPi after the sync was complete. Not an ideal solution, but it worked. I might have mentioned this before somewhere in the tickets here, sorry if that's the case. |
I'm also struggling with this problem, and I'm also running Geth on a Raspberry Pi 4 with an external SSD. As I just found out, my SSD has quite low IOPS (I'm using a WD Blue 1 TB). I'm following these instructions to test it: Not sure what the cause is. Maybe it's outdated firmware or a bad SATA-to-USB adapter. Anyway, it's not a Geth bug in my case. |
I'm also using a WD Blue 1 TB with my Raspberry Pi 4 8 GB and its IO performance (read: IOPS=2284, BW=9139KiB/s, write: IOPS=763, BW=3054KiB/s) is much lower than that which the Rocketpool guide recommends. However, Geth was running without errors or warnings on this hardware until a few point versions prior to 1.10.16. Why would it encounter these problems now? |
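The IOPS and bandwidth figures quoted above can be cross-checked against each other, since bandwidth is roughly IOPS times block size. The sketch below assumes the numbers come from a fixed-block-size random I/O test (a fio header posted later in this thread shows 4096-byte blocks); the helper name is hypothetical.

```python
def implied_block_kib(bandwidth_kib_per_s, iops):
    """Average I/O size in KiB implied by a bandwidth and an IOPS figure."""
    return bandwidth_kib_per_s / iops


if __name__ == "__main__":
    # Figures quoted in the comment above (WD Blue 1 TB on a Raspberry Pi 4).
    read_block = implied_block_kib(9139, 2284)
    write_block = implied_block_kib(3054, 763)
    print(f"read:  {read_block:.2f} KiB per operation")
    print(f"write: {write_block:.2f} KiB per operation")
```

Both values come out at about 4 KiB, so the reported bandwidth is simply the IOPS figure multiplied by a 4 KiB block. For this kind of workload the IOPS number is the one to compare against a recommendation.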
My node has finally finished syncing! It took about 5 days, with the state healing phase taking about half of that. My setup:
I'd say that the first phase (blocks downloading) is limited by network. I didn't have many peers so it took about 2.5 days to finish. Incoming traffic was between 2 and 4 Mb/s. I posted some statistics to my Twitter: These were last "State heal in progress" lines in logs (the node was restarted a couple of times):
Seems that
|
Hi, I'm experiencing the same problem now. I had an unclean shutdown three days ago, node was offline for two hours, and it seems that it has not yet recovered.
I have been running geth on this machine for more than a year. SSD stats:
geth version Is there something I should do to speed it up? Maybe increase bandwidth size? I'm running geth with default values for everything:
|
You need v1.10.19 or later in order to sync.
|
Thanks, I upgraded the geth version but the problem still remains. |
@newearthmartin |
@newearthmartin |
@nuliknol Here are some values from iostat while running geth:
Does this mean r/s is ok but latency is not? Any way that I can improve this? I'm using a 2TB WD Blue on a 2016 8GB Asus Zenbook laptop running Ubuntu 22.04. What I don't understand is that I've been running a full node for a year, why will geth not recover from this? You say weeks... when I downloaded the blockchain, it took less... Also, should I be looking into Erigon for this hardware? :) I really would like to stay with Geth. |
Is there a way to see the progress of the state heal phase and how much further it has to go? |
Hello, I'm experiencing the same issue. I ran the Rocketpool test and the results are awful:
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
Run status group 0 (all jobs):
Disk stats (read/write):
This strikes me as a bit strange, since I had previously synced geth successfully on this drive. These are the specs for the SSD |
Hello @Jeiwan! Thanks for your help. I was happy to see that you could sync using a Raspberry Pi. I have much more powerful equipment and was about to give up.
For me, your message showing the total nodes when you finished syncing was the most useful one here. All I wanted to know was whether my sync was progressing or not. Can you update us with the total number of nodes today? I am still syncing. I hope you're not much further than 40,151,717. I am at about 8 million nodes now. Thanks. :) |
Hi @machinasapiens! I don't have access to my RPi at the moment so I cannot check that. I had multiple full re-syncs on that same RPi and SSD and each of them went fine. As to the number of nodes, I think it gets reset after each restart (but progress is not lost). As far as I understand the synchronization process, there are two stages: during the first stage, it downloads block headers; during the second stage, it reconstructs the account state. The length of the latter cannot be estimated because the account state is a Patricia Trie (which gets updated while your node is syncing). So you just need to wait 🤷‍♂️ If it keeps utilizing CPU and memory, everything's fine. |
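For readers who want at least a coarse check that a node still considers itself syncing, the standard `eth_syncing` JSON-RPC method mentioned earlier in the thread can be queried. The sketch below only builds the request body and interprets a response; it assumes the standard `currentBlock` and `highestBlock` hex fields, and the helper names are hypothetical. In line with the comment above, the block numbers do not measure how far the state heal has to go.

```python
import json


def build_eth_syncing_request(request_id=1):
    """JSON-RPC 2.0 request body for the standard eth_syncing method."""
    return json.dumps(
        {"jsonrpc": "2.0", "method": "eth_syncing", "params": [], "id": request_id}
    )


def summarize_sync(result):
    """Interpret the 'result' field of an eth_syncing response.

    The method returns False when the node is not syncing, otherwise an
    object whose block numbers are hex-encoded strings.
    """
    if result is False:
        return "not syncing"
    current = int(result["currentBlock"], 16)
    highest = int(result["highestBlock"], 16)
    return f"syncing: block {current} of {highest} ({highest - current} behind)"


if __name__ == "__main__":
    print(build_eth_syncing_request())
    # Made-up sample values, for illustration only.
    print(summarize_sync({"currentBlock": "0x10", "highestBlock": "0x20"}))
    print(summarize_sync(False))
```

The request body would be sent as an HTTP POST to the node's RPC endpoint, if one is enabled; that step is left out here so the sketch runs without a node.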
Excellent information. Thanks a lot. 🙌 |
If your healing is taking forever, this is how I solved it after weeks of trying:
|
I have the same issue.
|
I'm seeing this (while geth is healing) on a VPS SSD.
I guess i'm fucked. Is that i/o perf good enough once the healing is done? Or am I waiting for a week just to discover it won't be able to keep up? |
Try erigon, they don't have these SSD requirements |
I found that other users are having this problem too. They are using local SSD disks or other cloud hosting. |
I am also having this issue.
I ran the iostat command to see what the throughput was:
This is on a beast of a Google cloud server with SSD drives |
We did several fixes to snap-sync on For anyone with "eternal state-heal" problems: try Since this ticket is starting to become one of those endless tickets where people pile on long after the original issue is resolved, I'll close this now. If
|
System information
Geth version:
geth version
OS & Version: Linux
Commit hash : (if develop)
Expected behaviour
geth finishes (default=snap) sync in < 1 day
Actual behaviour
geth is still in 'State heal in progress' after more than 2 weeks
Steps to reproduce the behaviour
Run geth
Logs
https://gist.github.com/xrchz/851eca7ecfb0312422875c90b9a86b2b