
Fast Catchup not working on MainNet with v2.1.5 - catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts #1558

Closed
Thireus opened this issue Sep 26, 2020 · 5 comments


Thireus commented Sep 26, 2020

Subject of the issue

Despite several attempts at fast catchup using the latest catchpoint from https://algorand-catchpoints.s3.us-east-2.amazonaws.com/channel/mainnet/latest.catchpoint, my non-relay node is unable to complete fast catchup.

catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts
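For reference, the catchpoint label used in the steps below was taken straight from that URL. Assuming curl is available and the file contains only the label, the whole attempt can be reproduced in one step:

$ goal node catchup -d /var/lib/algorand "$(curl -s https://algorand-catchpoints.s3.us-east-2.amazonaws.com/channel/mainnet/latest.catchpoint)"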

Your environment

  • Debian 10
  • NVME SSD
  • Dedicated server with 1Gbps connectivity (up/down)
  • Fresh install of algorand with v2.1.5 (no prior install)

Steps to reproduce

Node status before the fast catchup attempt:

$ goal node status -d /var/lib/algorand -w 2000
Last committed block: 1983270
Time since last block: 0.0s
Sync Time: 44769.4s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/5615adc36bad610c7f165fa2967f4ecfa75125f0
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/5615adc36bad610c7f165fa2967f4ecfa75125f0
Round for next consensus protocol: 1983271
Next consensus protocol supported: true
Last Catchpoint: 
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

Catchpoint set:
goal node catchup -d /var/lib/algorand 9290000#PQRSAA3T7USCIGCVRH7MEXWCEXTASVKPZNCH4UMSLYEXXHHK2D3Q

The node then goes through the following additional stages:

  1. Catchpoint accounts processed...
  2. Catchpoint total blocks...

During the last few minutes of fast catchup:

$ goal node status -d /var/lib/algorand -w 2000
Last committed block: 1983548
Sync Time: 1351.4s
Catchpoint: 9290000#PQRSAA3T7USCIGCVRH7MEXWCEXTASVKPZNCH4UMSLYEXXHHK2D3Q
Catchpoint total accounts: 5877716
Catchpoint accounts processed: 5877716
Catchpoint total blocks: 1000
Catchpoint downloaded blocks: 907
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

Once "Catchpoint downloaded" is "done", fast catchup fails silently and the node returns to syncing from where it stopped before the fast catchup attempt:

$ goal node status -d /var/lib/algorand -w 2000
Last committed block: 1983720
Time since last block: 0.4s
Sync Time: 9.2s
Last consensus protocol: https://github.com/algorandfoundation/specs/tree/5615adc36bad610c7f165fa2967f4ecfa75125f0
Next consensus protocol: https://github.com/algorandfoundation/specs/tree/5615adc36bad610c7f165fa2967f4ecfa75125f0
Round for next consensus protocol: 1983721
Next consensus protocol supported: true
Last Catchpoint: 
Genesis ID: mainnet-v1.0
Genesis hash: wGHE2Pwdvd7S12BL5FaOP20EGYesN73ktiC1qzkkit8=

Expected behaviour

Fast catchup should complete and bring the node up to the catchpoint round.

Actual behaviour

Fast Catchup fails silently after the "Catchpoint total blocks" stage.

These are the logs from the exact moment fast catchup silently failed:

{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-ag.algorand-mainnet.network:4160 for block 9289094","time":"2020-09-26T12:41:23.370661+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-he.algorand-mainnet.network:4160 for block 9289093","time":"2020-09-26T12:41:23.589744+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-co.algorand-mainnet.network:4160 for block 9289092","time":"2020-09-26T12:41:24.530497+02:00"}
{"file":"logger.go","function":"github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares.(*LoggerMiddleware).handler.func1","level":"info","line":56,"msg":"127.0.0.1:60808 - - [2020-09-26 12:41:25.328691883 +0200 CEST m=+47488.546827919] \"GET /v2/status HTTP/1.1\" 200 637 \"Go-http-client/1.1\" 27.616µs","time":"2020-09-26T12:41:25.328746+02:00"}
{"file":"logger.go","function":"github.com/algorand/go-algorand/daemon/algod/api/server/lib/middlewares.(*LoggerMiddleware).handler.func1","level":"info","line":56,"msg":"127.0.0.1:60808 - - [2020-09-26 12:41:25.32911893 +0200 CEST m=+47488.547254967] \"GET /versions HTTP/1.1\" 200 0 \"Go-http-client/1.1\" 18.427µs","time":"2020-09-26T12:41:25.329158+02:00"}
{"callee":"github.com/algorand/go-algorand/ledger.(*CatchpointCatchupAccessorImpl).ResetStagingBalances.func1","caller":"github.com/algorand/go-algorand/ledger/catchupaccessor.go:180","file":"dbutil.go","function":"github.com/algorand/go-algorand/util/db.(*Accessor).atomic","level":"warning","line":344,"msg":"dbatomic: tx surpassed expected deadline by 382.724586ms","name":"","readonly":false,"time":"2020-09-26T12:41:26.173825+02:00"}
{"Context":"sync","file":"service.go","function":"github.com/algorand/go-algorand/catchup.(*Service).periodicSync","level":"info","line":422,"msg":"network ready","name":"","time":"2020-09-26T12:41:26.174094+02:00"}
{"Context":"sync","details":{"StartRound":1983548},"file":"telemetry.go","function":"github.com/algorand/go-algorand/logging.(*telemetryState).logTelemetry","instanceName":"/nRykeA74vF7XHoV","level":"info","line":213,"msg":"/ApplicationState/CatchupStart","name":"","session":"","time":"2020-09-26T12:41:26.174162+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.NetworkFetcherFactory.NewOverGossip","level":"info","line":127,"msg":"no gossip peers for NewOverGossip","time":"2020-09-26T12:41:26.174198+02:00"}
{"file":"node.go","function":"github.com/algorand/go-algorand/node.(*AlgorandFullNode).SetCatchpointCatchupMode.func1","level":"info","line":935,"msg":"Indexer is not available - indexer is not active","name":"","time":"2020-09-26T12:41:26.174256+02:00"}
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-09-26T12:41:26.174291+02:00"}
{"file":"persistence.go","function":"github.com/algorand/go-algorand/agreement.restore.func3","level":"info","line":146,"msg":"restore (agreement): crash state not found (n = 0)","time":"2020-09-26T12:41:26.174387+02:00"}
{"file":"persistence.go","function":"github.com/algorand/go-algorand/agreement.restore.func3.1","level":"info","line":127,"msg":"restore (agreement): resetting crash state","time":"2020-09-26T12:41:26.174416+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-rh.algorand-mainnet.network:4160 for block 1983549","time":"2020-09-26T12:41:26.174662+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-ti.algorand-mainnet.network:4160 for block 1983553","time":"2020-09-26T12:41:26.174693+02:00"}
{"file":"fetcher.go","function":"github.com/algorand/go-algorand/catchup.(*NetworkFetcher).FetchBlock","level":"info","line":219,"msg":"networkFetcher.FetchBlock: asking client r-db.algorand-mainnet.network:4160 for block 1983550","time":"2020-09-26T12:41:26.174753+02:00"}

This is the failure message, extracted from the logs above, that caused fast catchup to fail:
{"file":"catchpointService.go","function":"github.com/algorand/go-algorand/catchup.(*CatchpointCatchupService).run","level":"warning","line":201,"msg":"catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts","name":"","time":"2020-09-26T12:41:26.174291+02:00"}

Related forum post: https://forum.algorand.org/t/fast-catchup-not-working-for-our-node-on-mainnet/1950/46


Thireus commented Sep 27, 2020

The following two relays appear to return HTTP 400 errors frequently, which might be the cause of the fast catchup failure (a rough sketch of how such a failure rate can be sampled follows the list):

r-rh.algorand-mainnet.network:4160 --> failure rate is ~87%
r-ti.algorand-mainnet.network:4160 --> failure rate is ~82%
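
The rough idea behind those numbers: repeatedly request blocks from each relay and count non-200 responses, as in the sketch below. The /v1/mainnet-v1.0/block/<round> path and the plain decimal round are assumptions here (the catchup fetcher may format the request differently), so treat it purely as an illustration of the sampling, not the exact test:

$ relay=r-rh.algorand-mainnet.network:4160
$ ok=0; bad=0
$ for round in $(seq 9289000 9289099); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "http://$relay/v1/mainnet-v1.0/block/$round")
    if [ "$code" = "200" ]; then ok=$((ok+1)); else bad=$((bad+1)); fi
  done
$ echo "requests=100 ok=$ok failed=$bad"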

Bumping the default config.json parameter “CatchupBlockDownloadRetryAttempts” from 1000 to 100000 allows fast catchup to succeed.
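
For anyone wanting to apply the same workaround: the override goes into config.json in the node's data directory. A minimal sketch, assuming no other overrides are already present there, followed by a restart so the node picks up the change:

$ cat /var/lib/algorand/config.json
{
    "CatchupBlockDownloadRetryAttempts": 100000
}
$ goal node restart -d /var/lib/algorand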

More detail: https://forum.algorand.org/t/fast-catchup-not-working-for-our-node-on-mainnet/1950/53?u=thireus


Thireus commented Sep 28, 2020

Edit: I'm using 2.1.5.stable, title edited.

@Thireus Thireus changed the title Fast Catchup not working on MainNet with v2.1.6 - catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts Fast Catchup not working on MainNet with v2.1.5 - catchpoint catchup stage error : processStageBlocksDownload failed after multiple blocks download attempts Sep 28, 2020

Thireus commented Sep 28, 2020

Same behaviour with v2.1.6

onetechnical (Contributor) commented

@Thireus Can you confirm this behavior is still a problem? Some relays with errors were removed from the records.

onetechnical (Contributor) commented

Haven't heard more on this, so closing for now. Please re-open if you still experience issues.
