
ledger: fix commit tasks enqueueing #5214

Merged: 10 commits merged into algorand:master on Apr 3, 2023

Conversation

@algorandskiy (Contributor) commented Mar 17, 2023

Summary

Do not enqueue commit tasks if the queue is full. The consumer might be blocked on a long-running operation like a catchpoint/snapshot, so the producer (blockqueue) can get stuck for some time. In this case new blocks are not persisted, and all ledger.Wait() clients either block or time out.

The fix DOES NOT change the frequency of commits. Before, a long commit blocked the syncer, so the producer could not queue more tasks except the one immediately after the long commit; the others were simply not produced. With the fix, the producer is no longer blocked and keeps producing tasks with a growing round range (except for catchpoint rounds, which would yield the same task again and again), but these tasks are not accepted while the queue is full. Eventually the queue opens up and some task gets in. The exact round range does not matter much, since trackers always attempt to commit the maximum range that makes sense from their point of view (no consensus switch within the range, the range ends with catchpoint [data] file creation, etc.).
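Conceptually, the change amounts to a non-blocking send on the commit-task queue. Below is a minimal, self-contained sketch of the idea; the type and function names are illustrative stand-ins, not the actual code in ledger/tracker.go.

package main

import "fmt"

// deferredCommitRange stands in for the real commit task type; only the idea matters here.
type deferredCommitRange struct {
	oldBase, offset uint64
}

// tryEnqueue performs a non-blocking send: if the consumer (the commit syncer) is busy,
// for example writing a catchpoint, and the buffered queue is full, the task is dropped
// instead of blocking the producer. The producer simply retries later with a wider range.
func tryEnqueue(queue chan deferredCommitRange, dcr deferredCommitRange) bool {
	select {
	case queue <- dcr:
		return true
	default:
		return false
	}
}

func main() {
	queue := make(chan deferredCommitRange, 1)
	fmt.Println(tryEnqueue(queue, deferredCommitRange{oldBase: 10, offset: 5})) // true: accepted
	fmt.Println(tryEnqueue(queue, deferredCommitRange{oldBase: 10, offset: 6})) // false: queue full, task dropped
}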

The fix broke some catchpoint tests, and they were adjusted: instead of scheduling commits every round, they now schedule commits only when it is time to write a data file or to create a catchpoint.

Test Plan

Added unit tests.
Reproduced the issue on a local net and confirmed the fix works.

@algorandskiy algorandskiy self-assigned this Mar 17, 2023
@algorandskiy algorandskiy changed the title WIP: WIP: Fix commit tasks enqueueing Mar 17, 2023
@algorandskiy algorandskiy requested a review from cce March 17, 2023 01:10
ledger/tracker.go: outdated review comments (resolved)
@algorandskiy algorandskiy changed the title WIP: Fix commit tasks enqueueing ledger: Fix commit tasks enqueueing Mar 17, 2023
@@ -46,7 +47,7 @@ import (
"github.com/algorand/go-algorand/test/partitiontest"
)

-func TestIsWritingCatchpointFile(t *testing.T) {
+func TestCatchpointIsWritingCatchpointFile(t *testing.T) {
@algorandskiy (Contributor, Author):

I renamed the catchpointtracker_test.go tests to unify their names and make it easier to run them per file.

@@ -401,6 +402,7 @@ func TestReproducibleCatchpointLabels(t *testing.T) {
defer ml.Close()

cfg := config.GetDefaultLocal()
cfg.MaxAcctLookback = 2
@algorandskiy (Contributor, Author):

I set it to an explicit value to be sure it is less than CatchpointInterval or CatchpointLookback.

@@ -1055,6 +1170,15 @@ func TestSecondStagePersistence(t *testing.T) {
t, ml, cfg, filepath.Join(tempDirectory, config.LedgerFilenamePrefix))
defer ct.close()

isCatchpointRound := func(rnd basics.Round) bool {
@algorandskiy (Contributor, Author):

There is some code duplication, but I prefer to have isCatchpointRound take a single argument rather than three.
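For illustration, a single-argument helper can capture the configured interval and lookback in a closure so call sites only pass the round. The constants and the exact predicate below are assumptions for the sketch, not the test's literal code.

package main

import "fmt"

func main() {
	// Hypothetical values; the real test takes its settings from config.GetDefaultLocal().
	const (
		catchpointInterval = 4
		catchpointLookback = 2
	)

	// The closure captures the config, so callers pass a single argument (the round)
	// instead of three (round, interval, lookback).
	isCatchpointRound := func(rnd uint64) bool {
		return (rnd+catchpointLookback)%catchpointInterval == 0
	}

	for rnd := uint64(1); rnd <= 8; rnd++ {
		fmt.Printf("round %d: catchpoint round = %v\n", rnd, isCatchpointRound(rnd))
	}
}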

wg.Wait()
ml.trackers.waitAccountsWriting()
@algorandskiy (Contributor, Author):

This prevents calling accountsWriting.Wait before any accountsWriting.Add has been called. Note these are called from different goroutines, which could lead to a data race. The Go pattern is to call Add and Wait from the same goroutine.
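A minimal illustration of that pattern (hypothetical names, not the tracker code): Add is issued in the same goroutine that later calls Wait, before the worker goroutine starts, so Wait can never observe a zero counter while work is still being scheduled.

package main

import (
	"fmt"
	"sync"
)

func main() {
	var accountsWriting sync.WaitGroup

	// Add happens in the coordinating goroutine, before the worker starts.
	accountsWriting.Add(1)
	go func() {
		defer accountsWriting.Done()
		// ... write account/catchpoint data ...
	}()

	// Wait is called from the same goroutine that called Add, so it cannot race
	// with it; calling Wait before any Add would return immediately and hide the
	// in-flight write.
	accountsWriting.Wait()
	fmt.Println("accounts writing finished")
}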


codecov bot commented Mar 17, 2023

Codecov Report

Merging #5214 (2e0eea9) into master (36ffb59) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #5214   +/-   ##
=======================================
  Coverage   53.73%   53.74%           
=======================================
  Files         444      444           
  Lines       55678    55685    +7     
=======================================
+ Hits        29917    29926    +9     
+ Misses      23430    23428    -2     
  Partials     2331     2331           
Impacted Files Coverage Δ
ledger/tracker.go 73.68% <100.00%> (+0.76%) ⬆️

... and 6 files with indirect coverage changes


@algorandskiy algorandskiy changed the title ledger: Fix commit tasks enqueueing ledger: Fix committing tasks enqueueing Mar 17, 2023
@algorandskiy algorandskiy changed the title ledger: Fix committing tasks enqueueing ledger: fix committing tasks enqueueing Mar 17, 2023
@algorandskiy algorandskiy changed the title ledger: fix committing tasks enqueueing ledger: fix commit tasks enqueueing Mar 17, 2023
}

// ensure Ledger.Wait() is non-blocked for all rounds except the last one (due to possible races)
for rnd := startRound; rnd < endRound; rnd++ {
@algorandskiy (Contributor, Author):

rnd < endRound because the last round might not be committed yet. Of course, endRound-1 might not be committed either, but with much smaller probability.
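One way such a non-blocking check can be expressed is a select with a default case on the channel returned by Wait; the sketch below uses plain channels to stand in for Ledger.Wait() and is only illustrative.

package main

import "fmt"

// waitIsReady reports whether a Wait-style channel is already closed, without
// blocking the caller. A closed channel stands in for a round that has been
// durably committed, so waiting on it would not block.
func waitIsReady(ch <-chan struct{}) bool {
	select {
	case <-ch:
		return true
	default:
		return false
	}
}

func main() {
	committed := make(chan struct{})
	close(committed)               // simulates a round that is already committed
	pending := make(chan struct{}) // simulates the last round, possibly still in flight

	fmt.Println(waitIsReady(committed)) // true
	fmt.Println(waitIsReady(pending))   // false
}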

@cce cce dismissed stale reviews from Eric-Warehime, icorderi, and bbroder-algo via bd0fdfb March 30, 2023 19:16
cce previously approved these changes Mar 30, 2023
@cce (Contributor) left a comment:

Talked to Pavel and now understand how synchronous channel sends to deferredCommits could lock up ledger lookups from agreement, causing agreement message processing (ledger reads for weights and circulation) to be blocked until after a catchpoint file has finished writing to disk. Updated the comments to help me remember the set of interactions that lead to a slow commitRound() blocking reads.
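The interaction can be reproduced in miniature: a producer that performs a blocking send on a full queue while holding a lock stalls every reader behind it until the slow consumer drains the queue. The sketch below is a generic illustration of that failure mode, not the ledger's actual locking code.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex
	tasks := make(chan int, 1)
	tasks <- 0 // the queue is already full: the consumer is busy writing a catchpoint

	go func() {
		// Producer: enqueues a commit task while holding the write lock.
		mu.Lock()
		defer mu.Unlock()
		tasks <- 1 // blocking send: stuck until the consumer reads
	}()
	time.Sleep(10 * time.Millisecond)

	done := make(chan struct{})
	go func() {
		// Reader: e.g. agreement asking the ledger for weights/circulation.
		mu.RLock()
		defer mu.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("read served")
	case <-time.After(100 * time.Millisecond):
		fmt.Println("read blocked behind the full commit queue")
	}
}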

cce previously approved these changes Mar 30, 2023
@algorandskiy (Contributor, Author):

Looks like there is some flakiness in catchpoint tests, looking into it

@@ -609,9 +608,6 @@ func TestCatchpointReproducibleLabels(t *testing.T) {
delta := roundDeltas[i]

ml2.trackers.newBlock(blk, delta)
err := ml2.addMockBlock(blockEntry{block: blk}, delta)
require.NoError(t, err)
ml2.trackers.committedUpTo(i)
@cce (Contributor):

Sorry about this, I saw these lines were added in #4803, do they break the test the way it is written now?

@algorandskiy (Contributor, Author):

Because committedUpTo can skip rounds, there is no guaranteed way to schedule a commit, and not all catchpoint commits can be scheduled because of the fixed round range. To make this test work, committedUpTo is only called on catchpoint rounds.

@cce (Contributor) commented Apr 3, 2023:

oh right, the test in #4803 was assuming committedUpTo would work differently .. I see

@algorandskiy algorandskiy merged commit f8a130e into algorand:master Apr 3, 2023
9 checks passed
bbroder-algo added a commit that referenced this pull request May 23, 2023
algorandskiy added a commit to algorandskiy/go-algorand that referenced this pull request May 23, 2023
The original implementation was setting this flag in the task producer, which created a gap between setting the flag and actually writing the catchpoint in commitSyncer.

Later, algorand#5214 introduced the ability to skip tasks if the queue is full. This caused an issue with the catchup service, which was reading the flag via IsWritingCatchpointDataFile() and stopping catchup. Because of that, the ledger was not receiving new blocks and was unable to schedule a new commit task, yet still had the catchpoint-writing flag set from the previous discarded task.
bbroder-algo pushed a commit that referenced this pull request May 23, 2023

7 participants