
catchup: pause catchup if ledger lagging behind #5794

Merged: 13 commits, Oct 26, 2023

Conversation

@algorandskiy (Contributor) commented Oct 20, 2023

Summary

It is possible for catchup to download blocks faster than the ledger can commit them on slow disks. This could cause an OOM because of too many pending in-memory deltas.

Implementation overview

  1. Defined a maximum number of pending deltas (256), used to slow down catchup via a new Ledger.Available() method.
  2. If MaxAcctLookback is manually set greater than 256, use MaxAcctLookback+1 as the new maximum.
  3. Added a new boolean flag indicating whether the ledger is committing right now. If the delta limit is exceeded and a commit is in progress, the ledger reports itself as unavailable.
  4. When unavailable, the catchup service stops fetching blocks, similarly to what happens while a catchpoint file is being written (see the sketch after this list).
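
A minimal Go sketch of the availability rule in steps 1–3 above. The type and field names here (ledger, deltaCount, committing, maxAcctLookback) are illustrative stand-ins rather than the actual go-algorand types; only the Ledger.Available() name, MaxAcctLookback, and the 256 cap come from the description above.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultMaxDeltas mirrors the 256 pending-deltas cap from step 1.
const defaultMaxDeltas = 256

// ledger is a stand-in type; the field names are hypothetical.
type ledger struct {
	mu              sync.Mutex
	deltaCount      int    // in-memory deltas not yet flushed to disk
	committing      bool   // set while a commit (DB flush) is in progress
	maxAcctLookback uint64 // the configured MaxAcctLookback value
}

// maxDeltas applies step 2: if MaxAcctLookback is configured above the
// default cap, the effective cap becomes MaxAcctLookback+1.
func (l *ledger) maxDeltas() int {
	if l.maxAcctLookback > defaultMaxDeltas {
		return int(l.maxAcctLookback) + 1
	}
	return defaultMaxDeltas
}

// Available reports whether catchup may keep fetching blocks: the ledger is
// unavailable only when the delta cap is exceeded AND a commit is running.
func (l *ledger) Available() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return !(l.deltaCount > l.maxDeltas() && l.committing)
}

func main() {
	l := &ledger{deltaCount: 300, committing: true}
	fmt.Println(l.Available()) // false: too many deltas and a flush is running
}
```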

Additionally, moved the trackerRegistry.dbRound initialization from initialize to loadFromDisk, because 1) it simplifies testing and 2) it naturally belongs in loadFromDisk.

Considered improvement(s) [not implemented]

  1. It is possible to use the max deltas value as an additional flush trigger, alongside the existing max-accounts and time-delay triggers.

Test Plan

Added unit tests to the catchup and ledger packages.
Ran a manual mainnet catchup test.

[Attached graphs: memory usage (heap), algod_ledger_round - algod_ledger_dbround, and .details.NewBase - .details.OldBase from node.log]

@codecov (codecov bot) commented Oct 20, 2023

Codecov Report

Merging #5794 (b03bd70) into master (6c482ab) will not change coverage.
Report is 20 commits behind head on master.
The diff coverage is 69.38%.

@@           Coverage Diff           @@
##           master    #5794   +/-   ##
=======================================
  Coverage   55.55%   55.55%           
=======================================
  Files         474      474           
  Lines       66850    66877   +27     
=======================================
+ Hits        37138    37153   +15     
- Misses      27188    27199   +11     
- Partials     2524     2525    +1     
Files | Coverage | Δ
ledger/ledger.go | 68.70% <0.00%> | -0.34% ⬇️
catchup/service.go | 71.01% <66.66%> | +1.28% ⬆️
ledger/tracker.go | 77.84% <75.00%> | +0.47% ⬆️

... and 11 files with indirect coverage changes


@algorandskiy (Contributor, Author) commented:
Thanks for the review; pushed fixes.

jannotti previously approved these changes Oct 23, 2023
@algorandskiy requested a review from cce, October 24, 2023 14:10
gmalouf previously approved these changes Oct 24, 2023
cce previously approved these changes Oct 24, 2023
@cce (Contributor) left a comment:


I understand better now: this doesn't strictly stop catchup from downloading more blocks than the max. It only stops catchup if the max deltas is exceeded AND the commitSyncer is running inside the transaction context used to call the trackers' commitRound() functions. So, as a side effect of DB flushing taking a while, once you exceed the max there is a high chance an IsBehind call notices and catchup stops, though another block or two could sneak in past the max. The net effect is that this PR forces catchup to wait for a slow DB flush operation to finish before continuing.
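
A rough sketch of the gating just described, under assumed names (availabilityChecker, fetchNext); the actual wiring in catchup/service.go differs, but the effect is the same: fetching stops once the ledger reports itself unavailable.

```go
package catchupsketch

// availabilityChecker stands in for the ledger's new Available() method.
type availabilityChecker interface {
	Available() bool
}

// fetchLoopSketch illustrates why a block or two can still slip past the cap:
// the check runs between fetches, so once the cap is exceeded and a commit is
// in flight, the next check stops the loop until the flush completes.
func fetchLoopSketch(l availabilityChecker, fetchNext func() bool) {
	for l.Available() {
		if !fetchNext() { // no more blocks to download
			return
		}
	}
	// ledger unavailable: pause here until the slow DB flush finishes
}
```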

@algorandskiy dismissed stale reviews from cce, gmalouf, and jasonpaulos via e85936b, October 25, 2023 18:42
@algorandskiy (Contributor, Author) commented:
Added pausing of pipelinedFetch for s.deadlineTimeout seconds in order to give in-flight goroutines time to complete. In my testing it reduced the number of fetched blocks by about 10%, i.e. less wasted bandwidth.
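
A small sketch of that pause, assuming the service holds a context (s.ctx here, a hypothetical name) alongside the s.deadlineTimeout duration mentioned above; the actual change in catchup/service.go may differ in detail.

```go
package catchupsketch

import (
	"context"
	"time"
)

// serviceSketch is a minimal stand-in for the catchup service state.
type serviceSketch struct {
	ctx             context.Context
	deadlineTimeout time.Duration
}

// pauseBeforeRetry waits roughly deadlineTimeout before pipelinedFetch is
// retried, giving in-flight fetch goroutines time to complete, unless the
// service shuts down first.
func (s *serviceSketch) pauseBeforeRetry() {
	select {
	case <-time.After(s.deadlineTimeout):
		// pause elapsed; resume fetching
	case <-s.ctx.Done():
		// service stopping; abandon the pause
	}
}
```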

@algorandskiy (Contributor, Author) commented:
Memory and deltas data after adding the active sleep in the last commit:

Memory from pprof and runtime metrics, 60s samples [graph]

algod_ledger_round - algod_ledger_dbround, 60s samples [graph]

.details.NewBase - .details.OldBase from the flushing log [graph]

@algorandskiy merged commit f5d901a into algorand:master on Oct 26, 2023. 18 checks passed.