
State heal phase is very slow (not finished after 2 weeks) #23191

Closed · xrchz opened this issue Jul 10, 2021 · 51 comments


xrchz commented Jul 10, 2021

System information

Geth version: geth version

Geth
Version: 1.10.4-stable
Git Commit: aa637fd38a379db6da98df0d520fb1c5139a18ce
Git Commit Date: 20210617
Architecture: amd64
Go Version: go1.16.5
Operating System: linux
GOPATH=
GOROOT=go

OS & Version: Linux


Expected behaviour

geth finishes (default=snap) sync in < 1 day

Actual behaviour

geth is still in 'State heal in progress' after more than 2 weeks

Steps to reproduce the behaviour

Run geth

Logs

https://gist.github.com/xrchz/851eca7ecfb0312422875c90b9a86b2b

xrchz added the type:bug label Jul 10, 2021
holiman (Contributor) commented Jul 16, 2021

I've been looking through the logs a bit. I wonder if your I/O speed is poor?

Also, there's a lot of this:

WARN [07-03|18:00:40.413] Unexpected trienode heal packet          peer=9f0e8fbf         reqid=6,915,308,639,612,522,441

It indicates that requests are timing out, and there's a lot of it. No idea why, but @karalabe added a mechanism to scale the request size depending on performance; not sure if that is in 1.10.4-stable or not. In a rush, can't check right now.

xrchz (Author) commented Jul 16, 2021

I don't think the IO speed is bad. Here is an indication (taken while geth is also still running):

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75 --directory=<redacted> --aux-path=<redacted>
test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.27
Starting 1 process
Jobs: 1 (f=1): [m(1)][100.0%][r=136MiB/s,w=45.2MiB/s][r=34.9k,w=11.6k IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=450154: Fri Jul 16 12:37:52 2021
  read: IOPS=37.9k, BW=148MiB/s (155MB/s)(3070MiB/20719msec)
   bw (  KiB/s): min=90808, max=196112, per=100.00%, avg=151902.44, stdev=24036.61, samples=41
   iops        : min=22702, max=49028, avg=37975.61, stdev=6009.16, samples=41
  write: IOPS=12.7k, BW=49.5MiB/s (51.9MB/s)(1026MiB/20719msec); 0 zone resets
   bw (  KiB/s): min=30840, max=66952, per=100.00%, avg=50762.73, stdev=8052.88, samples=41
   iops        : min= 7710, max=16738, avg=12690.68, stdev=2013.22, samples=41
  cpu          : usr=8.53%, sys=45.38%, ctx=105093, majf=0, minf=8
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=148MiB/s (155MB/s), 148MiB/s-148MiB/s (155MB/s-155MB/s), io=3070MiB (3219MB), run=20719-20719msec
  WRITE: bw=49.5MiB/s (51.9MB/s), 49.5MiB/s-49.5MiB/s (51.9MB/s-51.9MB/s), io=1026MiB (1076MB), run=20719-20719msec

Disk stats (read/write):
  sda: ios=783034/261889, merge=32/32, ticks=870354/151608, in_queue=1021965, util=99.63%


begetan commented Jul 23, 2021

What are the machine specs?

xrchz (Author) commented Jul 23, 2021

8GB RAM, 4x 2.2GHz vCPUs (AMD EPYC 7601)


bFanek commented Jul 28, 2021

I have the same issue: 2+ weeks and still not done, stuck in "State heal in progress".
Geth version:

Geth
Version: 1.10.6-stable
Architecture: amd64
Go Version: go1.16.6
Operating System: darwin
GOPATH=
GOROOT=go

Log snippet:

WARN [07-28|10:29:46.515] Pivot became stale, moving               old=12,914,818 new=12,914,882
INFO [07-28|10:29:50.427] Imported new block headers               count=1    elapsed=253.434ms number=12,914,945 hash=ee1a08..9ce38a
INFO [07-28|10:30:00.224] Imported new block receipts              count=64   elapsed=13.703s   number=12,914,881 hash=fef964..d789fc age=18m5s     size=7.69MiB
INFO [07-28|10:30:18.658] Imported new block headers               count=1    elapsed=46.715ms  number=12,914,946 hash=7b24c8..2d8006
INFO [07-28|10:30:21.965] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,196@11.70GiB pending=112,626
INFO [07-28|10:30:21.969] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,196@11.70GiB pending=112,626
WARN [07-28|10:30:22.014] Unexpected trienode heal packet          peer=cef48f42 reqid=6,500,665,912,563,462,364
WARN [07-28|10:30:22.014] Unexpected trienode heal packet          peer=df5f16fe reqid=2,588,354,188,352,030,655
WARN [07-28|10:30:22.081] Unexpected trienode heal packet          peer=07527e24 reqid=2,299,545,781,862,247,020
WARN [07-28|10:30:22.097] Unexpected trienode heal packet          peer=06fbcb3d reqid=4,616,793,971,286,455,283
WARN [07-28|10:30:22.117] Unexpected trienode heal packet          peer=f26bae89 reqid=5,492,659,643,994,326,497
WARN [07-28|10:30:22.150] Unexpected trienode heal packet          peer=8fe0dc37 reqid=4,841,696,059,284,633,551
WARN [07-28|10:30:22.166] Unexpected trienode heal packet          peer=d8c491e0 reqid=5,089,512,569,912,806,296
WARN [07-28|10:30:22.214] Unexpected trienode heal packet          peer=ab94eac9 reqid=289,535,681,122,254,882
INFO [07-28|10:30:29.698] Imported new block headers               count=1    elapsed=50.129ms  number=12,914,947 hash=502883..e4a4cf
INFO [07-28|10:30:40.282] Imported new block headers               count=1    elapsed=68.952ms  number=12,914,948 hash=ca7379..feb0e4
INFO [07-28|10:30:42.053] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=246.41KiB throttle=1064
INFO [07-28|10:30:51.755] Imported new block headers               count=1    elapsed=455.293ms number=12,914,949 hash=69f8b7..c0be19
INFO [07-28|10:31:23.127] Imported new block headers               count=1    elapsed=36.240ms  number=12,914,950 hash=1c41d1..32309f
INFO [07-28|10:31:51.161] Imported new block headers               count=2    elapsed=303.134ms number=12,914,952 hash=05fedd..bd977f age=1m4s
INFO [07-28|10:31:53.863] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=251.06KiB throttle=1045
INFO [07-28|10:31:58.571] Imported new block headers               count=1    elapsed=50.721ms  number=12,914,953 hash=ade2c1..0a5bc0
INFO [07-28|10:32:06.890] Imported new block headers               count=1    elapsed=635.477ms number=12,914,954 hash=804d5c..ab7b25
INFO [07-28|10:32:10.395] Imported new block headers               count=2    elapsed=488.349ms number=12,914,956 hash=1e8a8f..3e8b40
INFO [07-28|10:32:48.525] Imported new block headers               count=1    elapsed=80.333ms  number=12,914,957 hash=e45180..9b6220
INFO [07-28|10:32:59.615] Imported new block headers               count=1    elapsed=47.949ms  number=12,914,958 hash=d709af..d8cf88
INFO [07-28|10:33:00.748] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=236.64KiB throttle=1108
INFO [07-28|10:33:07.879] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,469@11.70GiB pending=4369
INFO [07-28|10:33:14.157] Imported new block headers               count=1    elapsed=66.366ms  number=12,914,959 hash=b48f74..bf44f3 age=1m6s
INFO [07-28|10:33:17.898] Imported new block headers               count=1    elapsed=34.263ms  number=12,914,960 hash=0bd389..15919a
INFO [07-28|10:33:22.825] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,509@11.70GiB pending=5004
INFO [07-28|10:33:28.418] Imported new block headers               count=1    elapsed=36.038ms  number=12,914,961 hash=71baf1..463032
INFO [07-28|10:33:35.198] State heal in progress                   accounts=169,633@7.48MiB  slots=57314@4.17MiB    codes=4895@38.14MiB nodes=43,293,539@11.70GiB pending=5484
INFO [07-28|10:33:36.183] Imported new block headers               count=1    elapsed=53.880ms  number=12,914,962 hash=d3a956..4d2c35
INFO [07-28|10:34:04.801] Imported new block headers               count=1    elapsed=49.827ms  number=12,914,963 hash=3a9efb..280236
INFO [07-28|10:34:05.954] Downloader queue stats                   receiptTasks=0     blockTasks=0     itemSize=249.31KiB throttle=1052
INFO [07-28|10:34:11.899] Imported new block headers               count=1    elapsed=51.650ms  number=12,914,964 hash=b03469..eed621

Any clue how much longer, based on the logs? Total number of nodes, etc.?

@kraymond37

The same issue. I/O is more than enough.
Geth version

Geth
Version: 1.10.6-stable
Git Commit: 576681f29b895dd39e559b7ba17fcd89b42e4833
Git Commit Date: 20210722
Architecture: amd64
Go Version: go1.16.4
Operating System: linux
GOPATH=
GOROOT=go

machine

CPU: AMD EPYC 7R32 @ 8x 3.224GHz
RAM: 3485MiB / 15853MiB

eth_syncing

{
  "jsonrpc": "2.0",
  "id": "curltest",
  "result": {
    "currentBlock": "0xc5a113",
    "highestBlock": "0xc5a15d",
    "knownStates": "0x2b8c155a",
    "pulledStates": "0x2b8c155a",
    "startingBlock": "0xc595dd"
  }
}

The difference between currentBlock and highestBlock is always about 50~150.
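
For reference, the output above is the standard eth_syncing JSON-RPC response; a query along these lines (assuming geth was started with --http and is listening on the default port, which is an assumption here) reproduces it:

curl -s -X POST -H 'Content-Type: application/json' \
  --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":"curltest"}' \
  http://127.0.0.1:8545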

karalabe (Member) commented Aug 3, 2021

@bFanek Your node is downloading 2 trie nodes per second (1 trie node is 500 bytes). On a different log, it says writing 7MB to the database took 14 seconds. Something's very off there. What kind of machine are you running Geth on?

karalabe (Member) commented Aug 3, 2021

@kraymond37 You'll need to share some logs. How fast are trie nodes being downloaded?

karalabe (Member) commented Aug 3, 2021

@xrchz The logs you shared show the initial snap sync, and even then only the beginning of it. You say you've been running state heal for weeks, but no such thing appears in the logs you've shared.

xrchz (Author) commented Aug 3, 2021

@xrchz The logs you shared show the initial snap sync, and even then only the beginning of it. You say you've been running state heal for weeks, but no such thing appears in the logs you've shared.

@karalabe Gist automatically truncates large files. You need to click "view the full file" at the top of the gist page to see it all.

karalabe (Member) commented Aug 3, 2021 via email

@kraymond37

@karalabe Finally finished. But I don't know why it was so slow. Here is the log.


j4ys0n commented Aug 9, 2021

I've got the same issue. Syncing mainnet and rinkeby, both nodes have been running state heal for 2-3 weeks now.


j4ys0n commented Aug 12, 2021

Is there a resolution to this?


xanoni commented Aug 24, 2021

Similar problem here ...

I'm ~8 days in now ... the state sync took ~2 days (including some downtimes while I was setting up the rest of the server). The other 6 days it's been state healing.

RaspberryPi ARM-64 (8 GiB RAM, overclocked @ 4 x 2,100 Mhz)
Disk: SSD (200+ MiB / sec)
Net: 2 Gbit/s

Not sure if relevant, but I had >40,000 of the below messages in the 6 days it's been state healing:
WARN [08-22|00:03:07.286] Unexpected trienode heal packet peer=XYZ reqid=XYZ

The Geth folder currently utilizes 400 GiB (no archive).


xanoni commented Aug 24, 2021

FYI, I posted more detail in the other ticket: #23228

EDIT:

I switched to fast sync (maintaining the already downloaded data), but it was still not done after 2-3 days. Thus, I rsync'd the whole geth folder to an aarch64 machine and completed the snap sync there, which took 1-2 days. Then I rsync'd everything back and it's running smoothly on the Pi now.

Not ideal, but worked for me.
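
For anyone wanting to try the same workaround, the copy itself is just a plain rsync of the data directory with geth stopped on both machines; the paths and hostname below are placeholders, not the ones actually used here:

# on the slow machine, with geth stopped
rsync -a --info=progress2 /path/to/geth/datadir/ user@faster-box:/path/to/geth/datadir/
# finish the sync on the faster machine, stop geth there, then copy back
rsync -a --info=progress2 user@faster-box:/path/to/geth/datadir/ /path/to/geth/datadir/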

@lrc123456

@xrchz Have you solved this problem? I'm facing the same problem. Can you tell me what happened in the following days? Thanks! (Or can I switch to fast sync without deleting the data I have already downloaded?)

xrchz (Author) commented Sep 14, 2021

No I have not solved it. I stopped trying to use geth myself and switched to using full node service providers instead.

@lrc123456

No I have not solved it. I stopped trying to use geth myself and switched to using full node service providers instead.

Sad news :(
I don't know much about full node service providers; which one did you choose? My purpose is to get the full node data of Ethereum by syncing with geth, and then use my modified geth client for scientific tests. Do you think a full node service provider would work for me? Can I get the same "datadir" folder using it? Thanks again!!


xanoni commented Sep 14, 2021

Try the other sync options; maybe one works better.

@lrc123456

Can I just interrupt the snap sync with Ctrl+C and use fast sync to continue with the same data folder? (I don't want to lose what I have downloaded over the last week.)


xanoni commented Sep 15, 2021

Can I just interrupt the snap sync with Ctrl+C and use fast sync to continue with the same data folder? (I don't want to lose what I have downloaded over the last week.)

When I did this recently, it re-used the already downloaded chaindata (~470 GiB). Not sure about the rest.

@lrc123456

Thanks! BTW, I noticed the resync didn't work for you, right? 😂


xanoni commented Sep 16, 2021

Thanks! BTW, I noticed the resync didn't work for you, right? 😂

I didn't have time to test a lot of things, so I just moved the downloaded chaindata to an x86-64 machine and did a snap sync there ... which worked. Then I transferred it back to the ARM and it's been running smoothly since.

@Hailiebaby16

How can I increase the throttle?


ChuckNorrison commented Nov 21, 2021

Hello everyone. When I faced this issue, I had started my geth with an inaccessible ancient store because of faulty permissions.

Snap sync was used by default. While it was running I made another mistake and my system partition ran out of disk space. After that was solved, the 4 GB of RAM was exceeded, the swap failed, and the system halted.

So we can expect a corrupted database at this point?

After this got fixed I observed the behaviour described here: a lot of "Unexpected trienode heal packet" messages.

My port forwarding was also missing, but fixing that did not seem to resolve it.

IPv6 was enabled; I disabled it and started with "--syncmode fast".

The "Unexpected trienode heal packet" messages disappeared from the logs and it is syncing.

System: AMD Ryzen in a VirtualBox VM attached to a USB 3.0 NVMe drive, running Ubuntu Server 20 (4 vCPU, 6 GB RAM, 6 GB swap)


diglos commented Nov 25, 2021

Hi.

An 8 GB Raspberry Pi 4 is able to complete a snap sync in less than 2 days. Please see this:

https://twitter.com/EthereumOnARM/status/1463796036103770114

And check these points:

https://twitter.com/EthereumOnARM/status/1463828024080707585

@0xadnauseam

Hi, I've also been trying to snap sync a geth node in AWS and have also been stuck in state healing for a few weeks. I've synced nodes locally before within a few days, so this was surprising to me. I'm using NVMe storage on the AWS node, so I would expect it to have the IOPS to sync.

A log snippet:

INFO [03-08|21:10:17.620] State heal in progress                   accounts=1,549,981@76.36MiB slots=4,125,659@300.97MiB codes=12627@99.72MiB nodes=163,743,126@49.96GiB pending=85071
WARN [03-08|21:10:17.622] Unexpected trienode heal packet          peer=73755973 reqid=4,134,898,869,596,324,111
WARN [03-08|21:10:17.633] Unexpected trienode heal packet          peer=25f05f19 reqid=7,011,098,481,708,786,349
WARN [03-08|21:10:17.671] Unexpected trienode heal packet          peer=9c8ce4ee reqid=7,218,003,286,183,975,893
WARN [03-08|21:10:17.674] Unexpected trienode heal packet          peer=bcddcc5d reqid=1,911,971,051,676,565,964
INFO [03-08|21:10:22.508] Imported new block headers               count=1    elapsed=8.076ms   number=14,348,482 hash=de4c70..134896
INFO [03-08|21:10:31.658] Imported new block headers               count=1    elapsed=7.743ms   number=14,348,483 hash=809a3e..684cf6
INFO [03-08|21:10:33.379] State heal in progress                   accounts=1,549,993@76.36MiB slots=4,125,659@300.97MiB codes=12627@99.72MiB nodes=163,743,522@49.96GiB pending=85505
INFO [03-08|21:10:47.765] State heal in progress                   accounts=1,550,012@76.36MiB slots=4,125,660@300.97MiB codes=12627@99.72MiB nodes=163,743,921@49.96GiB pending=85906

I have seen that others solved this by switching to fast sync mode, but that is no longer an option on the geth version I'm using (1.10.16-stable).

@derekmahar

Geth on my Raspberry Pi 4 8 GB has also continuously been reporting "State heal in progress" and frequent "Unexpected trienode heal packet" log messages for a few weeks after I deleted the Geth database and restarted the sync from the beginning. What do these messages mean? Is my Geth node still syncing? Should I bother to continue running my Geth node until this issue is resolved?


xanoni commented Mar 16, 2022

Had the same issue and decided to sync on a different (faster) computer instead ... I then rsync'd the blockchain data to the RPi after the sync was complete. Not an ideal solution, but it worked. I might have mentioned this before somewhere in the tickets here; sorry if that's the case.

Jeiwan (Contributor) commented Mar 25, 2022

I'm also struggling with this problem, and I'm also running Geth on a Raspberry PI 4 and an external SSD.

As I just found out, my SSD has quite low IOPS (I'm using WD Blue 1Tb). I'm following these instructions to test it:
https://docs.rocketpool.net/guides/node/local/prepare-pi.html#testing-the-ssd-s-performance

Not sure what the cause is. Maybe it's outdated firmware or a bad SATA-to-USB adapter. Anyway, it's not a Geth bug in my case.

@derekmahar

I'm also struggling with this problem, and I'm also running Geth on a Raspberry PI 4 and an external SSD.

As I just found out, my SSD has quite low IOPS (I'm using WD Blue 1Tb). I'm following these instructions to test it: https://docs.rocketpool.net/guides/node/local/prepare-pi.html#testing-the-ssd-s-performance

I'm also using a WD Blue 1 TB with my Raspberry Pi 4 8 GB, and its I/O performance (read: IOPS=2284, BW=9139KiB/s; write: IOPS=763, BW=3054KiB/s) is much lower than what the Rocketpool guide recommends. However, Geth was running without errors or warnings on this hardware until a few point versions prior to 1.10.16. Why would it encounter these problems now?

Jeiwan (Contributor) commented Mar 26, 2022

My node has finally finished syncing! It took about 5 days, with the state healing phase taking about half of that.

My setup:

  1. Raspberry Pi 4 Model B Rev 1.5, 8 Gb.
  2. WD Blue 1Tb connected via a cheap SATA-to-USB adapter. IOPS and bandwidth seem to be low (geth was not running during the test):
test: (groupid=0, jobs=1): err= 0: pid=1820: Fri Mar 25 13:19:16 2022
  read: IOPS=3043, BW=11.9MiB/s (12.5MB/s)(3070MiB/258259msec)
   bw (  KiB/s): min= 4728, max=14672, per=99.99%, avg=12170.61, stdev=1868.41, samples=516
   iops        : min= 1182, max= 3668, avg=3042.62, stdev=467.11, samples=516
  write: IOPS=1017, BW=4068KiB/s (4166kB/s)(1026MiB/258259msec); 0 zone resets
   bw (  KiB/s): min= 1544, max= 5104, per=99.99%, avg=4067.59, stdev=651.52, samples=516
   iops        : min=  386, max= 1276, avg=1016.85, stdev=162.86, samples=516
  cpu          : usr=3.86%, sys=24.14%, ctx=1048480, majf=0, minf=19
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=11.9MiB/s (12.5MB/s), 11.9MiB/s-11.9MiB/s (12.5MB/s-12.5MB/s), io=3070MiB (3219MB), run=258259-258259msec
  WRITE: bw=4068KiB/s (4166kB/s), 4068KiB/s-4068KiB/s (4166kB/s-4166kB/s), io=1026MiB (1076MB), run=258259-258259msec

I'd say that the first phase (block downloading) is limited by the network. I didn't have many peers, so it took about 2.5 days to finish. Incoming traffic was between 2 and 4 Mb/s.
The state healing phase seems to be limited by CPU and IOPS. My RPi is not overclocked (some recommend overclocking it), and average read IOPS had been around 1.3K during this phase (underperforming? maybe because of the CPU?). Read bandwidth had been around 11 MiB/s, which matches the number from the test.

I posted some statistics to my Twitter:
https://twitter.com/jeiwan7/status/1507565655079686145

These were the last "State heal in progress" lines in the logs (the node was restarted a couple of times):

         accounts=470,344@24.66MiB slots=883,544@64.47MiB codes=4634@36.46MiB nodes=40,140,227@9.88GiB pending=10152
         accounts=470,396@24.66MiB slots=884,109@64.51MiB codes=4634@36.46MiB nodes=40,141,949@9.88GiB pending=8943
         accounts=470,436@24.66MiB slots=884,702@64.55MiB codes=4634@36.46MiB nodes=40,143,775@9.88GiB pending=7697
         accounts=470,461@24.66MiB slots=885,356@64.60MiB codes=4634@36.46MiB nodes=40,145,830@9.88GiB pending=6533
         accounts=470,471@24.66MiB slots=886,227@64.66MiB codes=4634@36.46MiB nodes=40,148,055@9.88GiB pending=4126
         accounts=470,476@24.66MiB slots=886,728@64.69MiB codes=4634@36.46MiB nodes=40,149,541@9.88GiB pending=2891
         accounts=470,477@24.66MiB slots=887,598@64.76MiB codes=4634@36.46MiB nodes=40,151,685@9.89GiB pending=149
         accounts=470,477@24.66MiB slots=887,623@64.76MiB codes=4634@36.46MiB nodes=40,151,717@9.89GiB pending=0

It seems that pending needs to get down to 0 for the phase to be finished.


Geth
Version: 1.10.16-stable
Git Commit: 20356e57b119b4e70ce47665a71964434e15200d
Git Commit Date: 20220215
Architecture: arm64
Go Version: go1.17.5
Operating System: linux
GOPATH=
GOROOT=go

@newearthmartin

Hi, I'm experiencing the same problem now. I had an unclean shutdown three days ago, the node was offline for two hours, and it seems that it has not yet recovered.

  • Still State heal in progress
  • Lots of Unexpected trienode heal packet
  • pending seems to go up every time

I have been running geth on this machine for more than a year. SSD stats:

   READ: bw=112MiB/s (118MB/s), 112MiB/s-112MiB/s (118MB/s-118MB/s), io=3070MiB (3219MB), run=27321-27321msec
  WRITE: bw=37.6MiB/s (39.4MB/s), 37.6MiB/s-37.6MiB/s (39.4MB/s-39.4MB/s), io=1026MiB (1076MB), run=27321-27321msec

geth version 1.10.16-stable

Is there something I should do to speed it up? Maybe increase bandwidth size?

I'm running geth with default values for everything:

geth --mainnet --datadir ... --http --http.api eth,net,web3
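
For illustration only, one knob that is sometimes raised when there is spare RAM is geth's internal database cache (--cache, in MB); a variant of the command above with a larger cache would look like this (the datadir path is a placeholder):

geth --mainnet --datadir /path/to/datadir --cache 4096 --http --http.api eth,net,web3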

holiman (Contributor) commented Jul 23, 2022 via email

@newearthmartin

Thanks, I upgraded the geth version but the problem still remains.


nuliknol commented Jul 27, 2022

Thanks, I upgraded the geth version but the problem still remains.

@newearthmartin
Run the iostat -d -x -m 10 command on your Linux box.
The r/s column (read requests per second) should be about 1500-2000 and the latency (r_await column) should be about 80 microseconds, i.e. 0.08 ms.
If it is lower, then you are going to be state-healing for weeks.


nuliknol commented Jul 27, 2022

Thanks, I upgraded the geth version but the problem still remains.

@newearthmartin
If it happens that you have a slow SSD, you can accelerate it by buying a smaller but faster SSD and configuring it as a caching device for the slow SSD via the bcache kernel module.
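
A rough sketch of that kind of bcache setup, assuming a Debian/Ubuntu system; the device names are placeholders and make-bcache expects empty devices, so this wipes both of them:

sudo apt install bcache-tools
# /dev/sdX = large but slow backing SSD, /dev/nvme0n1 = small fast cache device
sudo make-bcache -B /dev/sdX -C /dev/nvme0n1
# the combined device shows up as /dev/bcache0; format and mount it as the geth data disk
sudo mkfs.ext4 /dev/bcache0
sudo mount /dev/bcache0 /mnt/geth-data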


newearthmartin commented Jul 27, 2022

@nuliknol Here are some values from iostat while running geth:

r/s         r_await
1980.40     0.32
1953.30     0.33
1732.00     0.32
1654.20     0.35
2095.00     0.33

Does this mean r/s is OK but latency is not? Is there any way I can improve this? I'm using a 2 TB WD Blue on a 2016 8 GB Asus Zenbook laptop running Ubuntu 22.04.

What I don't understand is that I've been running a full node for a year; why will geth not recover from this? You say weeks... when I downloaded the blockchain, it took less...

Also, should I be looking into Erigon for this hardware? :) I really would like to stay with Geth.

@newearthmartin

Is there a way to see the progress of the state heal phase and how much more it has to go?
Also, if I restart geth, do I lose the progress?


6ilcarlos commented Aug 1, 2022

Hello, I'm experiencing the same issue.

I ran the Rocketpool test and the results are awful:

test: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
fio-3.16
Starting 1 process
test: Laying out IO file (1 file / 4096MiB)
Jobs: 1 (f=1): [m(1)][99.9%][r=276KiB/s,w=88KiB/s][r=69,w=22 IOPS][eta 00m:05s]
test: (groupid=0, jobs=1): err= 0: pid=384253: Mon Aug 1 13:14:52 2022
read: IOPS=210, BW=842KiB/s (863kB/s)(3070MiB/3731597msec)
bw ( KiB/s): min= 8, max=81176, per=100.00%, avg=879.52, stdev=3872.75, samples=7137
iops : min= 2, max=20294, avg=219.84, stdev=968.19, samples=7137
write: IOPS=70, BW=282KiB/s (288kB/s)(1026MiB/3731597msec); 0 zone resets
bw ( KiB/s): min= 8, max=27040, per=100.00%, avg=294.19, stdev=1298.58, samples=7130
iops : min= 2, max= 6760, avg=73.50, stdev=324.65, samples=7130
cpu : usr=0.12%, sys=0.43%, ctx=76826, majf=0, minf=8
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=785920,262656,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
READ: bw=842KiB/s (863kB/s), 842KiB/s-842KiB/s (863kB/s-863kB/s), io=3070MiB (3219MB), run=3731597-3731597msec
WRITE: bw=282KiB/s (288kB/s), 282KiB/s-282KiB/s (288kB/s-288kB/s), io=1026MiB (1076MB), run=3731597-3731597msec

Disk stats (read/write):
sda: ios=819211/295271, merge=6718/35856, ticks=151863026/55438750, in_queue=208094988, util=97.34%

This strikes me as a bit strange, since I had previously synced geth successfully on this drive.

These are the specs for the SSD:
Device Model:     HP SSD S650 1920GB
Serial Number:    HBSA51350100117
Firmware Version: T1030A0
User Capacity:    1 920 383 410 176 bytes [1.92 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Aug 1 13:52:08 2022 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     254 (maximum performance)
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable
I don't know how to proceed. Maybe there's some erroneous configuration on my part. I'm running geth 1.10.20


ghost commented Aug 13, 2022

Hello @Jeiwan !

Thanks for your help. I was happy to see that you could sync using a Raspberry Pi. I have much more powerful equipment and was about to give up.

nodes=40,151,717@9.89GiB

For me, your message showing the total number of nodes when you finished syncing was the most useful thing here. All I wanted to know is whether my sync is progressing or not.

Can you update us with the total number of nodes today?

I am still syncing. I hope you're not much further than 40,151,717. I am at about 8 million nodes now.

Thanks. :)

Jeiwan (Contributor) commented Aug 13, 2022

Hi @machinasapiens!

I don't have access to my RPI at the moment so I cannot check that. I had multiple full re-syncs on that same RPI and SSD and each of them went fine.

As to the number of nodes, I think it gets reset after each restart (but progress is not lost).

As far as I understand the synchronization process, there are two stages: during the first stage, it downloads block headers; during the second stage, it reconstructs the account state. The length of the latter cannot be estimated because the account state is a Patricia Trie (which gets updated while your node is syncing). So you just need to wait 🤷‍♂️ If it keeps utilizing CPU and memory, everything's fine.
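
For anyone who wants to watch those counters themselves, the built-in console exposes them via eth.syncing (the healedTrienodes/healingTrienodes fields shown later in this thread); the IPC path below is a placeholder inside the datadir:

geth attach /path/to/datadir/geth.ipc
> eth.syncing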


ghost commented Aug 13, 2022


Excellent information. Thanks a lot. 🙌

@mutazmq1

If your healing is taking forever, this is how I solved it after weeks of trying:

  • geth removedb y n, as @MariusVanDerWijden recommended (see the example after this list)
  • Upgrade your geth to the latest unstable version
  • Get fast hardware with around 11k IOPS or above. The higher the better
  • Wait for 24-48 hours and you should be fine
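
As a rough illustration of that first step (the datadir path is a placeholder): geth removedb asks separate confirmation questions, and the "y n" above corresponds to removing the state database while keeping the already-downloaded ancient chain data:

geth removedb --datadir /path/to/datadir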

@YurGa777

I have the same issue.
My geth output:
Sep 16 10:28:50 Future32 geth[9904]: INFO [09-16|10:28:50.614] State heal in progress accounts=48404@2.22MiB slots=37572@2.90MiB codes=267@1.94MiB nodes=1,220,413@482.41MiB pending=53853
Sep 16 10:28:59 Future32 geth[9904]: INFO [09-16|10:28:59.640] State heal in progress accounts=48412@2.22MiB slots=37574@2.90MiB codes=267@1.94MiB nodes=1,220,466@482.43MiB pending=53919
Sep 16 10:29:02 Future32 geth[9904]: INFO [09-16|10:29:02.391] Forkchoice requested sync to new head number=15,544,704 hash=549400..9527c0
Sep 16 10:29:05 Future32 geth[9904]: INFO [09-16|10:29:05.176] Imported new block headers count=1 elapsed=11.062ms number=15,544,704 hash=549400..9527c0
Sep 16 10:29:06 Future32 geth[9904]: INFO [09-16|10:29:06.182] Downloader queue stats receiptTasks=0 blockTasks=0 itemSize=262.70KiB throttle=998

> eth.syncing
{
  currentBlock: 15544641,
  healedBytecodeBytes: 2030904,
  healedBytecodes: 267,
  healedTrienodeBytes: 507542641,
  healedTrienodes: 1225652,
  healingBytecode: 0,
  healingTrienodes: 1108,
  highestBlock: 15544744,
  startingBlock: 15538964,
  syncedAccountBytes: 44957199813,
  syncedAccounts: 219496732,
  syncedBytecodeBytes: 4321233920,
  syncedBytecodes: 661982,
  syncedStorage: 821946751,
  syncedStorageBytes: 176094687911
}
I understand that my IOPS is low, but I have an M.2 SSD (256 GB).
Could I speed up the sync process by using a cache on the SSD?


rndstr commented Sep 17, 2022

@newearthmartin run the iostat -d -x -m 10 command on your Linux box. The r/s column (read requests per second) should be about 1500-2000 and the latency (r_await column) should be about 80 microseconds, i.e. 0.08 ms. If it is lower, then you are going to be state-healing for weeks.

I'm seeing this (while geth is healing) on a VPS SSD.

$ iostat -d -x -m 10
Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wMB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dMB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
sda            524.10      4.44     0.00   0.00    1.00     8.68    0.20      0.01     1.70  89.47    0.50    38.00    0.00      0.00     0.00   0.00    0.00     0.00    0.09  87.88

I guess I'm fucked.

Is that I/O performance good enough once the healing is done? Or am I waiting for a week just to discover that it won't be able to keep up?

@newearthmartin

Try Erigon; it doesn't have these SSD requirements.

@Liamlu28

I found that other users are having this problem too. They are using local SSD disks or other cloud hosting.


admcudo commented Oct 4, 2022

I am also having this issue.

Oct 04 20:14:36 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:36.484] Forkchoice requested sync to new head    number=15,677,085 hash=6734dc..9730ab
Oct 04 20:14:38 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:38.452] Imported new block headers               count=1    elapsed="676.217µs" number=15,677,085 hash=6734dc..9730ab
Oct 04 20:14:41 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:41.260] State heal in progress                   accounts=9386@633.68KiB slots=3676@296.28KiB codes=14318@111.56MiB nodes=94,577,877@25.16GiB pending=31633
Oct 04 20:14:48 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:48.521] Forkchoice requested sync to new head    number=15,677,086 hash=9c17ac..68214f
Oct 04 20:14:50 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:50.482] Imported new block headers               count=1    elapsed=26.540ms    number=15,677,086 hash=9c17ac..68214f
Oct 04 20:14:51 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:51.498] Downloader queue stats                   receiptTasks=0    blockTasks=0    itemSize=207.11KiB throttle=1266
Oct 04 20:14:51 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:51.782] State heal in progress                   accounts=9386@633.68KiB slots=3676@296.28KiB codes=14318@111.56MiB nodes=94,578,466@25.16GiB pending=32223
Oct 04 20:14:59 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:14:59.612] Forkchoice requested sync to new head    number=15,677,087 hash=1fcd6b..b1ff8f
Oct 04 20:15:00 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:15:00.831] State heal in progress                   accounts=9483@639.58KiB slots=3676@296.28KiB codes=14318@111.56MiB nodes=94,579,214@25.16GiB pending=32625
Oct 04 20:15:02 svcteam-mainnet-eth geth[980924]: INFO [10-04|20:15:02.467] Imported new block headers               count=1    elapsed="290.785µs" number=15,677,087 hash=1fcd6b..b1ff8f

I ran the iostat command to see what the throughput was:

r/s = 1800.10 but r_await = 0.45

This is on a beast of a Google Cloud server with SSD drives.
Is this thing ever going to catch up, do you think?

holiman (Contributor) commented Oct 5, 2022

We made several fixes to snap sync on master, fixes that were never backported to a stable release. The original reporter's problems came from one of these issues, which have since been solved.

For anyone with "eternal state-heal" problems: try 1.11.xx, also known as 'latest', available both via the ethereum/client-go:latest docker image and the downloads page.
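
For reference, a minimal way to run that image (the volume path and published ports are illustrative; flags after the image name are passed straight to geth):

docker run -d --name geth -v /path/to/datadir:/root/.ethereum -p 30303:30303 -p 8545:8545 ethereum/client-go:latest --http --http.addr 0.0.0.0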

Since this ticket is starting to become one of those endless tickets where people pile on long after the original issue is resolved, I'll close it now.

If 1.11.xx still doesn't solve things for you and you want to file a ticket, including a disk-write speed test would be good background info:

dd if=/dev/zero of=/tmp/output conv=fdatasync bs=384k count=1k; rm -f /tmp/output

holiman closed this as completed Oct 5, 2022