
Fast Catchup not working on MainNet with v2.1.5 - catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts #1558

Closed
Thireus opened this issue Sep 26, 2020 · 5 comments


Thireus commented Sep 26, 2020

Subject of the issue

Despite several attempts at fast catchup using the latest catchpoint from https://algorand-catchpoints.s3.us-east-2.amazonaws.com/channel/mainnet/latest.catchpoint, my non-relay node is unable to complete fast catchup.

catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts
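For reference, the catchpoint label used in the steps below was taken straight from that URL. Assuming curl is available and the file contains only the label, the whole attempt can be reproduced in one step:

$ goal node catchup -d /var/lib/algorand "$(curl -s https://algorand-catchpoints.s3.us-east-2.amazonaws.com/channel/mainnet/latest.catchpoint)"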

Your environment

  • Debian 10
  • NVME SSD
  • Dedicated server with 1Gbps connectivity (up/down)
  • Fresh install of algorand with v2.1.5 (no prior install)

Steps to reproduce

Node status before the fast catchup attempt:

$ goal node status -d /var/lib/algorand -w 2000
Last committed block: 1983270
Time since last block: 0.0s
Sync Time: 44769.4s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/5615adc36bad610c7f165fa2967f4ecfa75125f0
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/5615adc36bad610c7f165fa2967f4ecfa75125f0
Round for next consensus protocol: 1983271
Next consensus protocol supported: true
Last Catchpoint: 
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

Catchpoint set:
goal node catchup -d /var/lib/algorand 9290000#PQRSAA3T7USCIGCVRH7MEXWCEXTASVKPZNCH4UMSLYEXXHHK2D3Q

The node then goes through the following additional stages:

  1. Catchpoint accounts processed...
  2. Catchpoint total blocks...

During the last few minutes of fast catchup:

$ goal node status -d /var/lib/algorand -w 2000
Last committed block: 1983548
Sync Time: 1351.4s
Catchpoint: 9290000#PQRSAA3T7USCIGCVRH7MEXWCEXTASVKPZNCH4UMSLYEXXHHK2D3Q
Catchpoint total accounts: 5877716
Catchpoint accounts processed: 5877716
Catchpoint total blocks: 1000
Catchpoint downloaded blocks: 907
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

Once "Catchpoint downloaded" is "done", fast catchup fails silently and the node returns to syncing from where it stopped before the fast catchup attempt:

$ goal node status -d /var/lib/algorand -w 2000
Last committed block: 1983720
Time since last block: 0.4s
Sync Time: 9.2s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/5615adc36bad610c7f165fa2967f4ecfa75125f0
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/5615adc36bad610c7f165fa2967f4ecfa75125f0
Round for next consensus protocol: 1983721
Next consensus protocol supported: true
Last Catchpoint: 
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

Expected behaviour

Fast catchup should complete and bring the node up to the catchpoint round.

Actual behaviour

Fast Catchup fails silently after the "Catchpoint total blocks" stage.

These are the logs from the exact moment fast catchup silently failed:

{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-ag.algorand-mainnet.network:4160 for block 9289094","time":"2020-09-26T12:41:23.370661+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-he.algorand-mainnet.network:4160 for block 9289093","time":"2020-09-26T12:41:23.589744+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-co.algorand-mainnet.network:4160 for block 9289092","time":"2020-09-26T12:41:24.530497+02:00"}
{"file":"logger.go","function":"github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares.(*LoggerMiddleware).handler.func1","level":"info","line":56,"msg":"127.0.0.1:60808 - - [2020-09-26 12:41:25.328691883 +0200 CEST m=+47488.546827919] \"GET /v2/status HTTP/1.1\" 200 637 \"Go-http-client/1.1\" 27.616µs","time":"2020-09-26T12:41:25.328746+02:00"}
{"file":"logger.go","function":"github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares.(*LoggerMiddleware).handler.func1","level":"info","line":56,"msg":"127.0.0.1:60808 - - [2020-09-26 12:41:25.32911893 +0200 CEST m=+47488.547254967] \"GET /versions HTTP/1.1\" 200 0 \"Go-http-client/1.1\" 18.427µs","time":"2020-09-26T12:41:25.329158+02:00"}
{"callee":"github.com/algorand/go-algorand/ledger.(*CatchpointCatchupAccessorImpl).ResetStagingBalances.func1","caller":"github.com/algorand/go-algorand/ledger/catchupaccessor.go:180","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).atomic","level":"warning","line":344,"msg":"dbatomic: tx surpassed expected deadline by 382.724586ms","name":"","readonly":false,"time":"2020-09-26T12:41:26.173825+02:00"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).periodicSync","level":"info","line":422,"msg":"network ready","name":"","time":"2020-09-26T12:41:26.174094+02:00"}
{"Context":"sync","details":{"StartRound":1983548},"file":"telemetry.go","function":"github.com/algorand/go-algorand/logging.(*telemetryState).logTelemetry","instanceName":"/nRykeA74vF7XHoV","level":"info","line":213,"msg":"/ApplicationState/CatchupStart","name":"","session":"","time":"2020-09-26T12:41:26.174162+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.NetworkFetcherFactory.NewOverGossip","level":"info","line":127,"msg":"no gossip peers for NewOverGossip","time":"2020-09-26T12:41:26.174198+02:00"}
{"file":"node.go","function":"github.com/algorand/go-algorand/node.(*AlgorandFullNode).SetCatchpointCatchupMode.func1","level":"info","line":935,"msg":"Indexer is not available - indexer is not active","name":"","time":"2020-09-26T12:41:26.174256+02:00"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-09-26T12:41:26.174291+02:00"}
{"file":"persistence.go","function":"github.com/algorand/go-algorand/agreement.restore.func3","level":"info","line":146,"msg":"restore (agreement): crash state not found (n = 0)","time":"2020-09-26T12:41:26.174387+02:00"}
{"file":"persistence.go","function":"github.com/algorand/go-algorand/agreement.restore.func3.1","level":"info","line":127,"msg":"restore (agreement): resetting crash state","time":"2020-09-26T12:41:26.174416+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-rh.algorand-mainnet.network:4160 for block 1983549","time":"2020-09-26T12:41:26.174662+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-ti.algorand-mainnet.network:4160 for block 1983553","time":"2020-09-26T12:41:26.174693+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-db.algorand-mainnet.network:4160 for block 1983550","time":"2020-09-26T12:41:26.174753+02:00"}

This is the failure message, extracted from the logs above, that caused fast catchup to fail:
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-09-26T12:41:26.174291+02:00"}

Related forum post: https://forum.algorand.org/t/fast-catchup-not-working-for-our-node-on-mainnet/1950/46


Thireus commented Sep 27, 2020

The following two relays appear to return HTTP 400 errors frequently, which might be the cause of the fast catchup failure (a rough sketch of how such a failure rate can be sampled follows the list):

r-rh.algorand-mainnet.network:4160 --> failure rate is ~87%
r-ti.algorand-mainnet.network:4160 --> failure rate is ~82%
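
The rough idea behind those numbers: repeatedly request blocks from each relay and count non-200 responses, as in the sketch below. The /v1/mainnet-v1.0/block/<round> path and the plain decimal round are assumptions here (the catchup fetcher may format the request differently), so treat it purely as an illustration of the sampling, not the exact test:

$ relay=r-rh.algorand-mainnet.network:4160
$ ok=0; bad=0
$ for round in $(seq 9289000 9289099); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "http://$relay/v1/mainnet-v1.0/block/$round")
    if [ "$code" = "200" ]; then ok=$((ok+1)); else bad=$((bad+1)); fi
  done
$ echo "requests=100 ok=$ok failed=$bad"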

Bumping the default config.json parameter “CatchupBlockDownloadRetryAttempts” from 1000 to 100000 allows fast catchup to succeed.
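
For anyone wanting to apply the same workaround: the override goes into config.json in the node's data directory. A minimal sketch, assuming no other overrides are already present there, followed by a restart so the node picks up the change:

$ cat /var/lib/algorand/config.json
{
    "CatchupBlockDownloadRetryAttempts": 100000
}
$ goal node restart -d /var/lib/algorand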

More detail: https://forum.algorand.org/t/fast-catchup-not-working-for-our-node-on-mainnet/1950/53?u=thireus


Thireus commented Sep 28, 2020

Edit: I'm using 2.1.5.stable, title edited.

@Thireus Thireus changed the title Fast Catchup not working on MainNet with v2.1.6 - catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts Fast Catchup not working on MainNet with v2.1.5 - catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts Sep 28, 2020

Thireus commented Sep 28, 2020

Same behaviour with v2.1.6

onetechnical (Contributor) commented

@Thireus Can you confirm this behavior is still a problem? Some relays with errors were removed from the records.

onetechnical (Contributor) commented

Haven't heard more on this, so closing for now. Please re-open if you still experience issues.
