Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

catchup: fetchAndWrite/fetchRound quit early on errNoBlockForRound #5809

Merged

Conversation

algorandskiy
Copy link
Contributor

@algorandskiy algorandskiy commented Oct 27, 2023

Summary

Checking this test failure (three nodes, one has 99%+ of the stake) I found frequent retries with after errNoBlockForRound could lead to punishing nodes. In particular in this test the relay node punished node1 and failed to catchup.

Two functions are modified in slightly different way:

  1. fetchAndWrite simply checks for errNoBlockForRoundThreshold number of errors before giving up.
  2. fetchRound sleeps and recreate the local peer selector. This flow is for a case when agreement saw the cert but not the block itself, and asking the block immediately could fail since the other side has not added it into the ledger yet. According to the method contract it must try indefinitely. To fix a rare situation with punishing one of two nodes more the peer selector gets recreated if too many errors.

Additionally, modified how peerRankDownloadFailed is handled in rankPeer ->> historicStats.push.

  1. Added peerRankNoBlockForRound to handle "no block error" a bit differently: the penalty function is p*2^(0.1*x) instead of p*2^x for this case to punish no block errors less aggressive.

image

Test Plan

  1. Added a unit test
  2. Removed ReadyCheck from TestBasicCatchup. The test is too fast now, from start to finish catchup on 3-10 blocks takes only 20 ms. ReadyCheck still covered by TestReadyEndpoint.

catchup/service.go Outdated Show resolved Hide resolved
@algorandskiy algorandskiy changed the title catchup: delay fetchAndWrite/fetchRound on errors catchup: fetchAndWrite/fetchRound quit early on errNoBlockForRound Oct 28, 2023
@algorandskiy algorandskiy marked this pull request as ready for review October 28, 2023 02:49
@codecov
Copy link

codecov bot commented Oct 30, 2023

Codecov Report

Merging #5809 (f70f0aa) into master (92af372) will decrease coverage by 0.05%.
The diff coverage is 68.75%.

@@            Coverage Diff             @@
##           master    #5809      +/-   ##
==========================================
- Coverage   55.64%   55.60%   -0.05%     
==========================================
  Files         475      475              
  Lines       66869    66892      +23     
==========================================
- Hits        37209    37193      -16     
- Misses      27146    27178      +32     
- Partials     2514     2521       +7     
Files Coverage Δ
catchup/peerSelector.go 98.71% <100.00%> (-1.29%) ⬇️
catchup/service.go 70.41% <50.00%> (-1.95%) ⬇️

... and 7 files with indirect coverage changes

📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today!

@algorandskiy algorandskiy requested a review from cce October 30, 2023 15:51
catchup/service.go Outdated Show resolved Hide resolved
catchup/service.go Outdated Show resolved Hide resolved
catchup/service.go Outdated Show resolved Hide resolved
catchup/service.go Outdated Show resolved Hide resolved
algorandskiy and others added 2 commits October 30, 2023 15:34
Co-authored-by: Gary <982483+gmalouf@users.noreply.github.com>
catchup/service.go Outdated Show resolved Hide resolved
catchup/service.go Outdated Show resolved Hide resolved
@algorandskiy algorandskiy marked this pull request as draft November 2, 2023 21:44
@algorandskiy algorandskiy marked this pull request as ready for review November 3, 2023 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants