
catchup: Dynamic parallel catchup #5802

Merged
5 commits merged into algorand:master from will/dynamic-parallel-catchup on Nov 9, 2023

Conversation

@winder (Contributor) commented Oct 24, 2023

Summary

This PR adds rate limiting to the block download service. It downloads two blocks and checks the download duration: if both were fetched in less than the round time, it allows full parallelism; otherwise it keeps the download rate limited, since we're likely at the latest round.
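For illustration, a minimal Go sketch of that decision; the names here (fetchNextBlock, chooseParallelism, roundTime, maxParallelism) are hypothetical and not the actual catchup/service.go code:

```go
// Sketch of the dynamic-parallelism decision described above.
// Names are illustrative, not the real service implementation.
package main

import (
	"fmt"
	"time"
)

// fetchNextBlock stands in for a single block download; the real
// service fetches blocks from peers.
func fetchNextBlock() error {
	time.Sleep(50 * time.Millisecond) // placeholder work
	return nil
}

func chooseParallelism(roundTime time.Duration, maxParallelism int) int {
	// Probe: download two blocks and time them.
	start := time.Now()
	for i := 0; i < 2; i++ {
		if err := fetchNextBlock(); err != nil {
			return 1 // on error, stay rate limited
		}
	}
	elapsed := time.Since(start)

	// If both blocks arrived in less than one round time, we are likely
	// behind and can catch up with full parallelism; otherwise we are
	// probably at the latest round, so keep the download rate limited.
	if elapsed < roundTime {
		return maxParallelism
	}
	return 1
}

func main() {
	fmt.Println("parallelism:", chooseParallelism(2800*time.Millisecond, 16))
}
```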

Test Plan

So far, I've tested manually with the print statement in the code. Not sure how to unit test this.

codecov bot commented Oct 24, 2023

Codecov Report

Merging #5802 (2e6b628) into master (f5d901a) will decrease coverage by 0.05%.
The diff coverage is 86.66%.

@@            Coverage Diff             @@
##           master    #5802      +/-   ##
==========================================
- Coverage   55.61%   55.57%   -0.05%     
==========================================
  Files         475      475              
  Lines       66869    66880      +11     
==========================================
- Hits        37189    37166      -23     
- Misses      27161    27192      +31     
- Partials     2519     2522       +3     
Files Coverage Δ
catchup/service.go 71.05% <86.66%> (-0.19%) ⬇️

... and 11 files with indirect coverage changes


@winder changed the title from "Dynamic parallel catchup" to "catchup: Dynamic parallel catchup" Oct 25, 2023
@winder marked this pull request as ready for review October 25, 2023 13:15
Review thread on catchup/service.go (outdated, resolved)
@algorandskiy (Contributor) left a comment

Could you clarify why the duration measurement works? I expected fetchers to exit immediately if there is no such block, or to return the block if it is available in the remote ledger.

Review threads on catchup/service.go (resolved)
@algorandskiy (Contributor) commented Oct 25, 2023

I still think a new REST or WS endpoint streaming blocks would work better.

  1. Follower node connects to a new /v2/blocks/monitor endpoint.
  2. A single goroutine handles it on the server side by using the block notifier, and streams each block.
  3. Follower writes blocks into the ledger if the block is sequential. If a block is too new, AddBlock errors, and eventually regular catchup will fire, downloading everything.
  4. Regular catchup will not fire since new blocks are supplied with this new mechanism (similar to regular operation, when the agreement service adds blocks).
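Roughly, the follower side could look like the sketch below; the types and the AddBlock signature are hypothetical stand-ins rather than the real ledger API, and /v2/blocks/monitor is only the proposed endpoint:

```go
// Sketch of the follower side of the proposed streaming endpoint.
// blockStream, ledger, and AddBlock here are illustrative stand-ins.
package main

import (
	"errors"
	"fmt"
)

type block struct{ round uint64 }

// ledger mimics the behaviour in point 3: AddBlock fails if the block
// is not the next sequential round.
type ledger struct{ latest uint64 }

func (l *ledger) AddBlock(b block) error {
	if b.round != l.latest+1 {
		return errors.New("non-sequential block") // regular catchup would take over
	}
	l.latest = b.round
	return nil
}

func main() {
	l := &ledger{latest: 10}
	// Blocks arriving from the hypothetical /v2/blocks/monitor stream.
	stream := []block{{11}, {12}, {15}}
	for _, b := range stream {
		if err := l.AddBlock(b); err != nil {
			fmt.Printf("round %d: %v; falling back to regular catchup\n", b.round, err)
			break
		}
		fmt.Println("applied round", b.round)
	}
}
```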

@winder (Contributor, Author) commented Oct 25, 2023

I still think a new REST or WS endpoint streaming blocks would work better.

1. Follower node connects to a new `/v2/blocks/monitor` endpoint.

2. A single goroutine handles it on the server side by using the block notifier, and streams each block.

3. Follower writes blocks into the ledger if the block is sequential. If a block is too new, `AddBlock` errors, and eventually regular catchup will fire, downloading everything.

4. Regular catchup will not fire since new blocks are supplied with this new mechanism (similar to regular operation, when the agreement service adds blocks).

I'm not sure what that has to do with this PR

@algorandskiy (Contributor) commented:
I'm not sure what that has to do with this PR

As I understand it, the purpose of this PR is to reduce excess requests for not-yet-existing blocks from follower nodes that have already caught up. I propose an alternative approach to the same problem.

@winder (Contributor, Author) commented Oct 25, 2023

I'm not sure what that has to do with this PR

As I understand it, the purpose of this PR is to reduce excess requests for not-yet-existing blocks from follower nodes that have already caught up. I propose an alternative approach to the same problem.

Unfortunately it's not a solution because Conduit uses the sync API.

@winder requested a review from gmalouf October 26, 2023 14:59
@algorandskiy previously approved these changes Oct 27, 2023
@algorandskiy (Contributor) left a comment

This PR actually reduces the number of requests made by a follower once it has caught up.
The debugging log needs to be removed.

@gmalouf (Contributor) left a comment

Looks good. I think we should cleanly get to a resolution on the conversation around line 47 of service.go.

@cce (Contributor) left a comment

Using the download duration and a 1s threshold seems like a weird hack that affects all nodes, not just follower nodes, and is quite arbitrary (why 1s?). @algorandskiy, shouldn't the right solution to requesting blocks too far into the future involve smarter polling, or additional calls to WaitMem() like how fetchAndWrite waits for lookbackComplete?

I think a better solution would incorporate feedback from peers about what their latest was, like: https://github.com/cce/go-algorand/tree/blockNotAvailableErrMsg-latest

@algorandskiy is working on this approach in #5809 (smarter polling that introduces pauses between retries when asking for blocks from the future); I think it would make this PR unnecessary. The problem this PR and #5809 are both trying to solve is aggressive polling once you get close to the latest round.

@algorandskiy (Contributor) commented:

These are two different cases: 1) extra requests due to parallel requests, which this PR addresses; 2) extra requests due to retries on errors, which my PR addresses.

Review thread on catchup/service.go (resolved)
@cce (Contributor) left a comment

Assuming this has been tested and works without causing the catchup pipeline to fail when continuously following the latest round, this could be a workaround that prevents cancelling the pipeline unnecessarily. But ideally we replace it later with better polling and backoff, perhaps incorporating information from #5819, which tells you the latest round of the node you're requesting from.

It would be a little nicer to limit the scope of this change only to follower nodes, since regular nodes don't need to worry about following the tip of the chain using catchup.
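For reference, a minimal sketch of the polling-with-backoff idea, assuming a hypothetical fetchRound helper and a peer that reports its latest round (the kind of information #5819 would provide); this is not the #5809 implementation:

```go
// Sketch of polling with backoff: back off when the requested round is
// ahead of the peer's reported latest. All names are illustrative.
package main

import (
	"fmt"
	"time"
)

// fetchResult stands in for a peer response that includes the peer's
// latest round.
type fetchResult struct {
	ok         bool
	peerLatest uint64
}

var remoteLatest uint64 = 100

func fetchRound(r uint64) fetchResult {
	remoteLatest++ // pretend the remote chain advances between polls
	return fetchResult{ok: r <= remoteLatest, peerLatest: remoteLatest}
}

func fetchWithBackoff(r uint64, maxWait time.Duration) {
	wait := 100 * time.Millisecond
	for {
		res := fetchRound(r)
		if res.ok {
			fmt.Println("fetched round", r)
			return
		}
		// The peer told us its latest round; we are asking for the
		// future, so wait before retrying instead of hammering it.
		fmt.Printf("round %d ahead of peer latest %d; retrying in %v\n", r, res.peerLatest, wait)
		time.Sleep(wait)
		if wait *= 2; wait > maxWait {
			wait = maxWait
		}
	}
}

func main() {
	fetchWithBackoff(103, 2*time.Second)
}
```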

@winder merged commit 22b7096 into algorand:master Nov 9, 2023
18 checks passed
@winder deleted the will/dynamic-parallel-catchup branch November 9, 2023 20:48