
State heal phase is very slow (not finished after 2 weeks) #23191

Closed · xrchz opened this issue Jul 10, 2021 · 51 comments


xrchz commented Jul 10, 2021

System information

Geth version: geth version

Geth
Version: 1.10.4-stable
Git Commit: aa637fd38a379db6da98df0d520fb1c5139a18ce
Git Commit Date: 20210617
Architecture: amd64
Go Version: go1.16.5
Operating System: linux
GOPATH=
GOROOT=go

OS & Version: Linux


Expected behaviour

geth finishes (default=snap) sync in < 1 day

Actual behaviour

geth is still in 'State heal in progress' after more than 2 weeks

Steps to reproduce the behaviour

Run geth

Logs

https://gist.github.com/xrchz/851eca7ecfb0312422875c90b9a86b2b

xrchz added the type:bug label Jul 10, 2021
holiman (Contributor) commented Jul 16, 2021

I've been looking through the logs a bit. I wonder if your I/O speed is poor?

Also, there's a lot of this:

WARN [07-03|18:00:40.413] Unexpected trienode heal packet          peer=9f0e8fbf         reqid=6,915,308,639,612,522,441

It indicates that requests are timing out, and there's a lot of it. No idea why, but @karalabe added a mechanism to scale the request size depending on performance; not sure if that is in 1.10.4-stable or not. In a rush, can't check right now.

xrchz (Author) commented Jul 16, 2021

I don't think the IO speed is bad. Here is an indication (taken while geth is also still running):

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --directory=<redacted> --aux-path=<redacted>
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=136MiB/s,w=45.2MiB/s][r=34.9k,w=11.6k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=450154: Fri Jul 16 12:37:52 2021
  read: IOPS=37.9k, BW=148MiB/s (155MB/s)(3070MiB/20719msec)
   bw (  KiB/s): min=90808, max=196112, per=100.00%, avg=151902.44, stdev=24036.61, samples=41
   iops        : min=22702, max=49028, avg=37975.61, stdev=6009.16, samples=41
  write: IOPS=12.7k, BW=49.5MiB/s (51.9MB/s)(1026MiB/20719msec); 0 zone resets
   bw (  KiB/s): min=30840, max=66952, per=100.00%, avg=50762.73, stdev=8052.88, samples=41
   iops        : min= 7710, max=16738, avg=12690.68, stdev=2013.22, samples=41
  cpu          : usr=8.53%, sys=45.38%, ctx=105093, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=148MiB/s (155MB/s), 148MiB/s-148MiB/s (155MB/s-155MB/s), io=3070MiB (3219MB), run=20719-20719msec
  WRITE: bw=49.5MiB/s (51.9MB/s), 49.5MiB/s-49.5MiB/s (51.9MB/s-51.9MB/s), io=1026MiB (1076MB), run=20719-20719msec

Disk stats (read/write):
  sda: ios=783034/261889, merge=32/32, ticks=870354/151608, in_queue=1021965, util=99.63%


begetan commented Jul 23, 2021

What are the machine specs?

xrchz (Author) commented Jul 23, 2021

8GB RAM, 4x 2.2GHz vCPUs (AMD EPYC 7601)


bFanek commented Jul 28, 2021

I have the same issue: 2+ weeks and still not done, stuck in "State heal in progress".
Geth version:

Geth
Version: 1.10.6-stable
Architecture: amd64
Go Version: go1.16.6
Operating System: darwin
GOPATH=
GOROOT=go

Log snippet:

WARN [07-28|10:29:46.515] Pivot became stale, moving               old=12,914,818 new=12,914,882
INFO [07-28|10:29:50.427] Imported new block headers               count=1    elapsed=253.434ms number=12,914,945 hash=ee1a08..9ce38a
INFO [07-28|10:30:00.224] Imported new block receipts              count=64   elapsed=13.703s   number=12,914,881 hash=fef964..d789fc age=18m5s     size=7.69MiB
INFO [07-28|10:30:18.658] Imported new block headers               count=1    elapsed=46.715ms  number=12,914,946 hash=7b24c8..2d8006
INFO [07-28|10:30:21.965] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,196@11.70GiB pending=112,626
INFO [07-28|10:30:21.969] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,196@11.70GiB pending=112,626
WARN [07-28|10:30:22.014] Unexpected trienode heal packet          peer=cef48f42 reqid=6,500,665,912,563,462,364
WARN [07-28|10:30:22.014] Unexpected trienode heal packet          peer=df5f16fe reqid=2,588,354,188,352,030,655
WARN [07-28|10:30:22.081] Unexpected trienode heal packet          peer=07527e24 reqid=2,299,545,781,862,247,020
WARN [07-28|10:30:22.097] Unexpected trienode heal packet          peer=06fbcb3d reqid=4,616,793,971,286,455,283
WARN [07-28|10:30:22.117] Unexpected trienode heal packet          peer=f26bae89 reqid=5,492,659,643,994,326,497
WARN [07-28|10:30:22.150] Unexpected trienode heal packet          peer=8fe0dc37 reqid=4,841,696,059,284,633,551
WARN [07-28|10:30:22.166] Unexpected trienode heal packet          peer=d8c491e0 reqid=5,089,512,569,912,806,296
WARN [07-28|10:30:22.214] Unexpected trienode heal packet          peer=ab94eac9 reqid=289,535,681,122,254,882
INFO [07-28|10:30:29.698] Imported new block headers               count=1    elapsed=50.129ms  number=12,914,947 hash=502883..e4a4cf
INFO [07-28|10:30:40.282] Imported new block headers               count=1    elapsed=68.952ms  number=12,914,948 hash=ca7379..feb0e4
INFO [07-28|10:30:42.053] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=246.41KiB throttle=1064
INFO [07-28|10:30:51.755] Imported new block headers               count=1    elapsed=455.293ms number=12,914,949 hash=69f8b7..c0be19
INFO [07-28|10:31:23.127] Imported new block headers               count=1    elapsed=36.240ms  number=12,914,950 hash=1c41d1..32309f
INFO [07-28|10:31:51.161] Imported new block headers               count=2    elapsed=303.134ms number=12,914,952 hash=05fedd..bd977f age=1m4s
INFO [07-28|10:31:53.863] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=251.06KiB throttle=1045
INFO [07-28|10:31:58.571] Imported new block headers               count=1    elapsed=50.721ms  number=12,914,953 hash=ade2c1..0a5bc0
INFO [07-28|10:32:06.890] Imported new block headers               count=1    elapsed=635.477ms number=12,914,954 hash=804d5c..ab7b25
INFO [07-28|10:32:10.395] Imported new block headers               count=2    elapsed=488.349ms number=12,914,956 hash=1e8a8f..3e8b40
INFO [07-28|10:32:48.525] Imported new block headers               count=1    elapsed=80.333ms  number=12,914,957 hash=e45180..9b6220
INFO [07-28|10:32:59.615] Imported new block headers               count=1    elapsed=47.949ms  number=12,914,958 hash=d709af..d8cf88
INFO [07-28|10:33:00.748] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=236.64KiB throttle=1108
INFO [07-28|10:33:07.879] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,469@11.70GiB pending=4369
INFO [07-28|10:33:14.157] Imported new block headers               count=1    elapsed=66.366ms  number=12,914,959 hash=b48f74..bf44f3 age=1m6s
INFO [07-28|10:33:17.898] Imported new block headers               count=1    elapsed=34.263ms  number=12,914,960 hash=0bd389..15919a
INFO [07-28|10:33:22.825] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,509@11.70GiB pending=5004
INFO [07-28|10:33:28.418] Imported new block headers               count=1    elapsed=36.038ms  number=12,914,961 hash=71baf1..463032
INFO [07-28|10:33:35.198] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,539@11.70GiB pending=5484
INFO [07-28|10:33:36.183] Imported new block headers               count=1    elapsed=53.880ms  number=12,914,962 hash=d3a956..4d2c35
INFO [07-28|10:34:04.801] Imported new block headers               count=1    elapsed=49.827ms  number=12,914,963 hash=3a9efb..280236
INFO [07-28|10:34:05.954] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=249.31KiB throttle=1052
INFO [07-28|10:34:11.899] Imported new block headers               count=1    elapsed=51.650ms  number=12,914,964 hash=b03469..eed621

Any clue how much longer, based on the logs? Total number of nodes, etc.?

@kraymond37

The same issue. I/O is more than enough.
Geth version

Geth
Version: 1.10.6-stable
Git Commit: 576681f29b895dd39e559b7ba17fcd89b42e4833
Git Commit Date: 20210722
Architecture: amd64
Go Version: go1.16.4
Operating System: linux
GOPATH=
GOROOT=go

machine

CPU: AMD EPYC 7R32 @ 8x 3.224GHz
RAM: 3485MiB / 15853MiB

eth_syncing

{
  "jsonrpc": "2.0",
  "id": "curltest",
  "result": {
    "currentBlock": "0xc5a113",
    "highestBlock": "0xc5a15d",
    "knownStates": "0x2b8c155a",
    "pulledStates": "0x2b8c155a",
    "startingBlock": "0xc595dd"
  }
}

The difference between currentBlock and highestBlock is always about 50~150.
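
For reference, the output above is the standard eth_syncing JSON-RPC response; a query along these lines (assuming geth was started with --http and is listening on the default port, which is an assumption here) reproduces it:

curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":"curltest"}' \
  http://127.0.0.1:8545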

karalabe (Member) commented Aug 3, 2021

@bFanek Your node is downloading 2 trie nodes per second (1 trie node is 500 bytes). On a different log, it says writing 7MB to the database took 14 seconds. Something's very off there. What kind of machine are you running Geth on?

karalabe (Member) commented Aug 3, 2021

@kraymond37 You'll need to share some logs. How fast are trie nodes being downloaded?

karalabe (Member) commented Aug 3, 2021

@xrchz The logs you shared show the initial snap sync, and even then only the beginning of it. You say you've been running state heal for weeks, but no such thing appears in the logs you've shared.

xrchz (Author) commented Aug 3, 2021

@xrchz The logs you shared show the initial snap sync, and even then only the beginning of it. You say you've been running state heal for weeks, but no such thing appears in the logs you've shared.

@karalabe Gist automatically truncates large files. You need to click "view the full file" at the top of the gist page to see it all.

karalabe (Member) commented Aug 3, 2021 via email

@kraymond37

@karalabe Finally finished. But I don't know why it was so slow. Here is the log.


j4ys0n commented Aug 9, 2021

I've got the same issue. Syncing mainnet and rinkeby, both nodes have been running state heal for 2-3 weeks now.


j4ys0n commented Aug 12, 2021

Is there a resolution to this?


xanoni commented Aug 24, 2021

Similar problem here ...

I'm ~8 days in now ... the state sync took ~2 days (including some downtimes while I was setting up the rest of the server). The other 6 days it's been state healing.

RaspberryPi ARM-64 (8 GiB RAM, overclocked @ 4 x 2,100 Mhz)
Disk: SSD (200+ MiB / sec)
Net: 2 Gbit/s

Not sure if relevant, but I had >40,000 of the below messages in the 6 days it's been state healing:
WARN [08-22|00:03:07.286] Unexpected trienode heal packet peer=XYZ reqid=XYZ

The Geth folder currently utilizes 400 GiB (no archive).


xanoni commented Aug 24, 2021

FYI, I posted more detail in the other ticket: #23228

EDIT:

I switched to fast sync (maintaining the already downloaded data), but it was still not done after 2-3 days. Thus, I rsync'd the whole geth folder to an aarch64 machine and completed the snap sync there, which took 1-2 days. Then I rsync'd everything back and it's running smoothly on the Pi now.

Not ideal, but worked for me.
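
For anyone wanting to try the same workaround, the copy itself is just a plain rsync of the data directory with geth stopped on both machines; the paths and hostname below are placeholders, not the ones actually used here:

# on the slow machine, with geth stopped
rsync -a --info=progress2 /path/to/geth/datadir/ user@faster-box:/path/to/geth/datadir/
# finish the sync on the faster machine, stop geth there, then copy back
rsync -a --info=progress2 user@faster-box:/path/to/geth/datadir/ /path/to/geth/datadir/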

@lrc123456

@xrchz Have you solved this problem? I'm facing the same problem. Can you tell me what happened in the following days? Thanks! (Or can I switch to fast sync without deleting the data I have already downloaded?)

xrchz (Author) commented Sep 14, 2021

No I have not solved it. I stopped trying to use geth myself and switched to using full node service providers instead.

@lrc123456

No I have not solved it. I stopped trying to use geth myself and switched to using full node service providers instead.

Sad news :(
I don't know much about full node service providers; which one did you choose? My purpose is to get the full node data of Ethereum by syncing with geth, and then use my modified geth client for scientific tests. Do you think a full node service provider would work for me? Can I get the same "datadir" folder using it? Thanks again!!


xanoni commented Sep 14, 2021

Try the other sync options; maybe one works better.

@lrc123456

Can I just interrupt the snap sync with Ctrl+C and use fast sync to continue with the same data folder? (I don't want to lose what I have downloaded over the last week.)


xanoni commented Sep 15, 2021

Can I just interrupt the snap sync with Ctrl+C and use fast sync to continue with the same data folder? (I don't want to lose what I have downloaded over the last week.)

When I did this recently, it re-used the already downloaded chaindata (~470 GiB). Not sure about the rest.

@lrc123456

Thanks! BTW, I noticed the resync didn't work for you, right? 😂


xanoni commented Sep 16, 2021

Thanks! BTW, I noticed the resync didn't work for you, right? 😂

I didn't have time to test a lot of things, so I just moved the downloaded chaindata to an x86-64 machine and did a snap sync there ... which worked. Then I transferred it back to the ARM and it's been running smoothly since.

@Hailiebaby16

How can I increase the throttle?


ChuckNorrison commented Nov 21, 2021

Hello everyone. When I faced this issue, I had started my geth with an inaccessible ancient store because of faulty permissions.

Snap sync was used by default. While it was running I made another mistake and my system partition ran out of disk space. After that was solved, the 4 GB of RAM was exceeded, the swap failed, and the system halted.

So we can expect a corrupted database at this point?

After this got fixed I observed the behaviour described here: a lot of "Unexpected trienode heal packet" messages.

My port forwarding was also missing, but fixing that did not seem to resolve it.

IPv6 was enabled; I disabled it and started with "--syncmode fast".

The "Unexpected trienode heal packet" messages disappeared from the logs and it is syncing.

System: AMD Ryzen in a VirtualBox VM attached to a USB 3.0 NVMe drive, running Ubuntu Server 20 (4 vCPU, 6 GB RAM, 6 GB swap)


diglos commented Nov 25, 2021

Hi.

An 8 GB Raspberry Pi 4 is able to complete a snap sync in less than 2 days. Please see this:

https://twitter.com/EthereumOnARM/status/1463796036103770114

And check these points:

https://twitter.com/EthereumOnARM/status/1463828024080707585

@0xadnauseam

Hi, I've also been trying to snap sync a geth node in AWS and have also been stuck in state healing for a few weeks. I've synced nodes locally before within a few days, so this was surprising to me. I'm using NVMe storage on the AWS node, so I would expect it to have the IOPS to sync.

A log snippet:

INFO [03-08|21:10:17.620] State heal in progress                   accounts=1,549,981@76.36MiB slots=4,125,659@300.97MiB codes=12627@99.72MiB nodes=163,743,126@49.96GiB pending=85071
WARN [03-08|21:10:17.622] Unexpected trienode heal packet          peer=73755973 reqid=4,134,898,869,596,324,111
WARN [03-08|21:10:17.633] Unexpected trienode heal packet          peer=25f05f19 reqid=7,011,098,481,708,786,349
WARN [03-08|21:10:17.671] Unexpected trienode heal packet          peer=9c8ce4ee reqid=7,218,003,286,183,975,893
WARN [03-08|21:10:17.674] Unexpected trienode heal packet          peer=bcddcc5d reqid=1,911,971,051,676,565,964
INFO [03-08|21:10:22.508] Imported new block headers               count=1    elapsed=8.076ms   number=14,348,482 hash=de4c70..134896
INFO [03-08|21:10:31.658] Imported new block headers               count=1    elapsed=7.743ms   number=14,348,483 hash=809a3e..684cf6
INFO [03-08|21:10:33.379] State heal in progress                   accounts=1,549,993@76.36MiB slots=4,125,659@300.97MiB codes=12627@99.72MiB nodes=163,743,522@49.96GiB pending=85505
INFO [03-08|21:10:47.765] State heal in progress                   accounts=1,550,012@76.36MiB slots=4,125,660@300.97MiB codes=12627@99.72MiB nodes=163,743,921@49.96GiB pending=85906

I have seen that others solved this by switching to fast sync mode, but that is no longer an option on the geth version I'm using (1.10.16-stable).

@derekmahar

Geth on my Raspberry Pi 4 8 GB has also continuously been reporting "State heal in progress" and frequent "Unexpected trienode heal packet" log messages for a few weeks after I deleted the Geth database and restarted the sync from the beginning. What do these messages mean? Is my Geth node still syncing? Should I bother to continue running my Geth node until this issue is resolved?


xanoni commented Mar 16, 2022

Had the same issue and decided to sync on a different (faster) computer instead ... I then rsync'd the blockchain data to the RPi after the sync was complete. Not an ideal solution, but it worked. I might have mentioned this before somewhere in the tickets here; sorry if that's the case.

Jeiwan (Contributor) commented Mar 25, 2022

I'm also struggling with this problem, and I'm also running Geth on a Raspberry PI 4 and an external SSD.

As I just found out, my SSD has quite low IOPS (I'm using WD Blue 1Tb). I'm following these instructions to test it:
https://docs.rocketpool.net/guides/node/local/prepare-pi.html#testing-the-ssd-s-performance

Not sure what the cause is. Maybe it's outdated firmware or a bad SATA-to-USB adapter. Anyway, it's not a Geth bug in my case.

@derekmahar

I'm also struggling with this problem, and I'm also running Geth on a Raspberry PI 4 and an external SSD.

As I just found out, my SSD has quite low IOPS (I'm using WD Blue 1Tb). I'm following these instructions to test it: https://docs.rocketpool.net/guides/node/local/prepare-pi.html#testing-the-ssd-s-performance

I'm also using a WD Blue 1 TB with my Raspberry Pi 4 8 GB, and its I/O performance (read: IOPS=2284, BW=9139KiB/s; write: IOPS=763, BW=3054KiB/s) is much lower than what the Rocketpool guide recommends. However, Geth was running without errors or warnings on this hardware until a few point versions prior to 1.10.16. Why would it encounter these problems now?

Jeiwan (Contributor) commented Mar 26, 2022

My node has finally finished syncing! It took about 5 days, with the state healing phase taking about half of that.

My setup:

  1. Raspberry Pi 4 Model B Rev 1.5, 8 Gb.
  2. WD Blue 1Tb connected via a cheap SATA-to-USB adapter. IOPS and bandwidth seem to be low (geth was not running during the test):
test: (groupid=0, jobs=1): err= 0: pid=1820: Fri Mar 25 13:19:16 2022
  read: IOPS=3043, BW=11.9MiB/s (12.5MB/s)(3070MiB/258259msec)
   bw (  KiB/s): min= 4728, max=14672, per=99.99%, avg=12170.61, stdev=1868.41, samples=516
   iops        : min= 1182, max= 3668, avg=3042.62, stdev=467.11, samples=516
  write: IOPS=1017, BW=4068KiB/s (4166kB/s)(1026MiB/258259msec); 0 zone resets
   bw (  KiB/s): min= 1544, max= 5104, per=99.99%, avg=4067.59, stdev=651.52, samples=516
   iops        : min=  386, max= 1276, avg=1016.85, stdev=162.86, samples=516
  cpu          : usr=3.86%, sys=24.14%, ctx=1048480, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=11.9MiB/s (12.5MB/s), 11.9MiB/s-11.9MiB/s (12.5MB/s-12.5MB/s), io=3070MiB (3219MB), run=258259-258259msec
  WRITE: bw=4068KiB/s (4166kB/s), 4068KiB/s-4068KiB/s (4166kB/s-4166kB/s), io=1026MiB (1076MB), run=258259-258259msec

I'd say that the first phase (block downloading) is limited by the network. I didn't have many peers, so it took about 2.5 days to finish. Incoming traffic was between 2 and 4 Mb/s.
The state healing phase seems to be limited by CPU and IOPS. My RPi is not overclocked (some recommend overclocking it), and average read IOPS had been around 1.3K during this phase (underperforming? maybe because of the CPU?). Read bandwidth had been around 11 MiB/s, which matches the number from the test.

I posted some statistics to my Twitter:
https://twitter.com/jeiwan7/status/1507565655079686145

These were the last "State heal in progress" lines in the logs (the node was restarted a couple of times):

         accounts=470,344@24.66MiB slots=883,544@64.47MiB codes=4634@36.46MiB nodes=40,140,227@9.88GiB pending=10152
         accounts=470,396@24.66MiB slots=884,109@64.51MiB codes=4634@36.46MiB nodes=40,141,949@9.88GiB pending=8943
         accounts=470,436@24.66MiB slots=884,702@64.55MiB codes=4634@36.46MiB nodes=40,143,775@9.88GiB pending=7697
         accounts=470,461@24.66MiB slots=885,356@64.60MiB codes=4634@36.46MiB nodes=40,145,830@9.88GiB pending=6533
         accounts=470,471@24.66MiB slots=886,227@64.66MiB codes=4634@36.46MiB nodes=40,148,055@9.88GiB pending=4126
         accounts=470,476@24.66MiB slots=886,728@64.69MiB codes=4634@36.46MiB nodes=40,149,541@9.88GiB pending=2891
         accounts=470,477@24.66MiB slots=887,598@64.76MiB codes=4634@36.46MiB nodes=40,151,685@9.89GiB pending=149
         accounts=470,477@24.66MiB slots=887,623@64.76MiB codes=4634@36.46MiB nodes=40,151,717@9.89GiB pending=0

It seems that pending needs to get down to 0 for the phase to be finished.


Geth
Version: 1.10.16-stable
Git Commit: 20356e57b119b4e70ce47665a71964434e15200d
Git Commit Date: 20220215
Architecture: arm64
Go Version: go1.17.5
Operating System: linux
GOPATH=
GOROOT=go

@newearthmartin

Hi, I'm experiencing the same problem now. I had an unclean shutdown three days ago, the node was offline for two hours, and it seems that it has not yet recovered.

  • Still State heal in progress
  • Lots of Unexpected trienode heal packet
  • pending seems to go up every time

I have been running geth on this machine for more than a year. SSD stats:

   READ: bw=112MiB/s (118MB/s), 112MiB/s-112MiB/s (118MB/s-118MB/s), io=3070MiB (3219MB), run=27321-27321msec
  WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1026MiB (1076MB), run=27321-27321msec

geth version 1.10.16-stable

Is there something I should do to speed it up? Maybe increase bandwidth size?

I'm running geth with default values for everything:

geth --mainnet --datadir ... --http --http.api eth,net,web3
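
For illustration only, one knob that is sometimes raised when there is spare RAM is geth's internal database cache (--cache, in MB); a variant of the command above with a larger cache would look like this (the datadir path is a placeholder):

geth --mainnet --datadir /path/to/datadir --cache 4096 --http --http.api eth,net,web3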

holiman (Contributor) commented Jul 23, 2022 via email

@newearthmartin

Thanks, I upgraded the geth version but the problem still remains.


nuliknol commented Jul 27, 2022

Thanks, I upgraded the geth version but the problem still remains.

@newearthmartin
Run the iostat -d -x -m 10 command on your Linux box.
The r/s column (read requests per second) should be about 1500-2000 and the latency (r_await column) should be about 80 microseconds, i.e. 0.08 ms.
If it is lower, then you are going to be state-healing for weeks.


nuliknol commented Jul 27, 2022

Thanks, I upgraded the geth version but the problem still remains.

@newearthmartin
If it happens that you have a slow SSD, you can accelerate it by buying a smaller but faster SSD and configuring it as a caching device for the slow SSD via the bcache kernel module.
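
A rough sketch of that kind of bcache setup, assuming a Debian/Ubuntu system; the device names are placeholders and make-bcache expects empty devices, so this wipes both of them:

sudo apt install bcache-tools
# /dev/sdX = large but slow backing SSD, /dev/nvme0n1 = small fast cache device
sudo make-bcache -B /dev/sdX -C /dev/nvme0n1
# the combined device shows up as /dev/bcache0; format and mount it as the geth data disk
sudo mkfs.ext4 /dev/bcache0
sudo mount /dev/bcache0 /mnt/geth-data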


newearthmartin commented Jul 27, 2022

@nuliknol Here are some values from iostat while running geth:

r/s         r_await
1980.40     0.32
1953.30     0.33
1732.00     0.32
1654.20     0.35
2095.00     0.33

Does this mean r/s is OK but latency is not? Is there any way I can improve this? I'm using a 2 TB WD Blue on a 2016 8 GB Asus Zenbook laptop running Ubuntu 22.04.

What I don't understand is that I've been running a full node for a year; why will geth not recover from this? You say weeks... when I downloaded the blockchain, it took less...

Also, should I be looking into Erigon for this hardware? :) I really would like to stay with Geth.

@newearthmartin

Is there a way to see the progress of the state heal phase and how much more it has to go?
Also, if I restart geth, do I lose the progress?


6ilcarlos commented Aug 1, 2022

Hello, I'm experiencing the same issue.

I ran the Rocketpool test and the results are awful:

test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][99.9%][r=276KiB/s,w=88KiB/s][r=69,w=22 IOPS][eta 00m:05s]
test: (groupid=0, jobs=1): err= 0: pid=384253: Mon Aug 1 13:14:52 2022
read: IOPS=210, BW=842KiB/s (863kB/s)(3070MiB/3731597msec)
bw ( KiB/s): min= 8, max=81176, per=100.00%, avg=879.52, stdev=3872.75, samples=7137
iops : min= 2, max=20294, avg=219.84, stdev=968.19, samples=7137
write: IOPS=70, BW=282KiB/s (288kB/s)(1026MiB/3731597msec); 0 zone resets
bw ( KiB/s): min= 8, max=27040, per=100.00%, avg=294.19, stdev=1298.58, samples=7130
iops : min= 2, max= 6760, avg=73.50, stdev=324.65, samples=7130
cpu : usr=0.12%, sys=0.43%, ctx=76826, majf=0, minf=8
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=842KiB/s (863kB/s), 842KiB/s-842KiB/s (863kB/s-863kB/s), io=3070MiB (3219MB), run=3731597-3731597msec
WRITE: bw=282KiB/s (288kB/s), 282KiB/s-282KiB/s (288kB/s-288kB/s), io=1026MiB (1076MB), run=3731597-3731597msec

Disk stats (read/write):
sda: ios=819211/295271, merge=6718/35856, ticks=151863026/55438750, in_queue=208094988, util=97.34%

This strikes me as a bit strange, since I had previously synced geth successfully on this drive.

These are the specs for the SSD:
Device Model:     HP SSD S650 1920GB
Serial Number:    HBSA51350100117
Firmware Version: T1030A0
User Capacity:    1 920 383 410 176 bytes [1.92 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Aug 1 13:52:08 2022 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable
I don't know how to proceed. Maybe there's some erroneous configuration on my part. I'm running geth 1.10.20


ghost commented Aug 13, 2022

Hello @Jeiwan !

Thanks for your help. I was happy to see that you could sync using a Raspberry Pi. I have much more powerful equipment and was about to give up.

nodes=40,151,717@9.89GiB

For me, your message showing the total number of nodes when you finished syncing was the most useful thing here. All I wanted to know is whether my sync is progressing or not.

Can you update us with the total number of nodes today?

I am still syncing. I hope you're not much further than 40,151,717. I am at about 8 million nodes now.

Thanks. :)

Jeiwan (Contributor) commented Aug 13, 2022

Hi @machinasapiens!

I don't have access to my RPI at the moment so I cannot check that. I had multiple full re-syncs on that same RPI and SSD and each of them went fine.

As to the number of nodes, I think it gets reset after each restart (but progress is not lost).

As far as I understand the synchronization process, there are two stages: during the first stage, it downloads block headers; during the second stage, it reconstructs the account state. The length of the latter cannot be estimated because the account state is a Patricia Trie (which gets updated while your node is syncing). So you just need to wait 🤷‍♂️ If it keeps utilizing CPU and memory, everything's fine.
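
For anyone who wants to watch those counters themselves, the built-in console exposes them via eth.syncing (the healedTrienodes/healingTrienodes fields shown later in this thread); the IPC path below is a placeholder inside the datadir:

geth attach /path/to/datadir/geth.ipc
> eth.syncing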


ghost commented Aug 13, 2022


Excellent information. Thanks a lot. 🙌

@mutazmq1

If your healing is taking forever, this is how I solved it after weeks of trying:

  • geth removedb y n, as @MariusVanDerWijden recommended (see the example after this list)
  • Upgrade your geth to the latest unstable version
  • Get fast hardware with around 11k IOPS or above. The higher the better
  • Wait for 24-48 hours and you should be fine
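
As a rough illustration of that first step (the datadir path is a placeholder): geth removedb asks separate confirmation questions, and the "y n" above corresponds to removing the state database while keeping the already-downloaded ancient chain data:

geth removedb --datadir /path/to/datadir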

@YurGa777

I have the same issue.
My geth output:
Sep 16 10:28:50 Future32 geth[9904]: INFO [09-16|10:28:50.614] State heal in progress accounts=48404@2.22MiB slots=37572@2.90MiB codes=267@1.94MiB nodes=1,220,413@482.41MiB pending=53853
Sep 16 10:28:59 Future32 geth[9904]: INFO [09-16|10:28:59.640] State heal in progress accounts=48412@2.22MiB slots=37574@2.90MiB codes=267@1.94MiB nodes=1,220,466@482.43MiB pending=53919
Sep 16 10:29:02 Future32 geth[9904]: INFO [09-16|10:29:02.391] Forkchoice requested sync to new head number=15,544,704 hash=549400..9527c0
Sep 16 10:29:05 Future32 geth[9904]: INFO [09-16|10:29:05.176] Imported new block headers count=1 elapsed=11.062ms number=15,544,704 hash=549400..9527c0
Sep 16 10:29:06 Future32 geth[9904]: INFO [09-16|10:29:06.182] Downloader queue stats receiptTasks=0 blockTasks=0 itemSize=262.70KiB throttle=998

> eth.syncing
{
  currentBlock: 15544641,
  healedBytecodeBytes: 2030904,
  healedBytecodes: 267,
  healedTrienodeBytes: 507542641,
  healedTrienodes: 1225652,
  healingBytecode: 0,
  healingTrienodes: 1108,
  highestBlock: 15544744,
  startingBlock: 15538964,
  syncedAccountBytes: 44957199813,
  syncedAccounts: 219496732,
  syncedBytecodeBytes: 4321233920,
  syncedBytecodes: 661982,
  syncedStorage: 821946751,
  syncedStorageBytes: 176094687911
}
I understand that my IOPS is low, but I have an M.2 SSD (256 GB).
Could I speed up the sync process by using a cache on the SSD?


rndstr commented Sep 17, 2022

@newearthmartin run the iostat -d -x -m 10 command on your Linux box. The r/s column (read requests per second) should be about 1500-2000 and the latency (r_await column) should be about 80 microseconds, i.e. 0.08 ms. If it is lower, then you are going to be state-healing for weeks.

I'm seeing this (while geth is healing) on a VPS SSD.

$ iostat -d -x -m 10
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
sda            524.10      4.44     0.00   0.00    1.00     8.68    0.20      0.01     1.70  89.47    0.50    38.00    0.00      0.00     0.00   0.00    0.00     0.00    0.09  87.88

I guess I'm fucked.

Is that I/O performance good enough once the healing is done? Or am I waiting for a week just to discover that it won't be able to keep up?

@newearthmartin

Try Erigon; it doesn't have these SSD requirements.

@Liamlu28

I found that other users are having this problem too. They are using local SSD disks or other cloud hosting.


admcudo commented Oct 4, 2022

I am also having this issue.

Oct 04 20:14:36 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:36.484] Forkchoice requested sync to new head    number=15,677,085 hash=6734dc..9730ab
Oct 04 20:14:38 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:38.452] Imported new block headers               count=1    elapsed="676.217µs" number=15,677,085 hash=6734dc..9730ab
Oct 04 20:14:41 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:41.260] State heal in progress                   accounts=9386@633.68KiB slots=3676@296.28KiB codes=14318@111.56MiB nodes=94,577,877@25.16GiB pending=31633
Oct 04 20:14:48 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:48.521] Forkchoice requested sync to new head    number=15,677,086 hash=9c17ac..68214f
Oct 04 20:14:50 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:50.482] Imported new block headers               count=1    elapsed=26.540ms    number=15,677,086 hash=9c17ac..68214f
Oct 04 20:14:51 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:51.498] Downloader queue stats                   receiptTasks=0    blockTasks=0    itemSize=207.11KiB throttle=1266
Oct 04 20:14:51 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:51.782] State heal in progress                   accounts=9386@633.68KiB slots=3676@296.28KiB codes=14318@111.56MiB nodes=94,578,466@25.16GiB pending=32223
Oct 04 20:14:59 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:59.612] Forkchoice requested sync to new head    number=15,677,087 hash=1fcd6b..b1ff8f
Oct 04 20:15:00 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:15:00.831] State heal in progress                   accounts=9483@639.58KiB slots=3676@296.28KiB codes=14318@111.56MiB nodes=94,579,214@25.16GiB pending=32625
Oct 04 20:15:02 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:15:02.467] Imported new block headers               count=1    elapsed="290.785µs" number=15,677,087 hash=1fcd6b..b1ff8f

I ran the iostat command to see what the throughput was:

r/s = 1800.10 but r_await = 0.45

This is on a beast of a Google Cloud server with SSD drives.
Is this thing ever going to catch up, do you think?

holiman (Contributor) commented Oct 5, 2022

We made several fixes to snap sync on master, fixes that were never backported to a stable release. The original reporter's problems came from one of these issues, which have since been solved.

For anyone with "eternal state-heal" problems: try 1.11.xx, also known as 'latest', available both via the ethereum/client-go:latest docker image and the downloads page.
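
For reference, a minimal way to run that image (the volume path and published ports are illustrative; flags after the image name are passed straight to geth):

docker run -d --name geth -v /path/to/datadir:/root/.ethereum -p 30303:30303 -p 8545:8545 ethereum/client-go:latest --http --http.addr 0.0.0.0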

Since this ticket is starting to become one of those endless tickets where people pile on long after the original issue is resolved, I'll close it now.

If 1.11.xx still doesn't solve things for you and you want to file a ticket, including a disk-write speed test would be good background info:

dd if=/dev/zero of=/tmp/output conv=fdatasync bs=384k count=1k; rm -f /tmp/output

holiman closed this as completed Oct 5, 2022