ledger: fix commit tasks enqueueing #5214
Conversation
Force-pushed from 2116a3a to 97d1b13
Force-pushed from 97d1b13 to 22e4da8
@@ -46,7 +47,7 @@ import (
 	"github.com/algorand/go-algorand/test/partitiontest"
 )
 
-func TestIsWritingCatchpointFile(t *testing.T) {
+func TestCatchpointIsWritingCatchpointFile(t *testing.T) {
I renamed the catchpointtracker_test.go tests to unify their names and make it easier to run them per-file.
@@ -401,6 +402,7 @@ func TestReproducibleCatchpointLabels(t *testing.T) {
 	defer ml.Close()
 
 	cfg := config.GetDefaultLocal()
+	cfg.MaxAcctLookback = 2
I set it to an explicit value to be sure it is less than CatchpointInterval or CatchpointLookback.
@@ -1055,6 +1170,15 @@ func TestSecondStagePersistence(t *testing.T) {
 		t, ml, cfg, filepath.Join(tempDirectory, config.LedgerFilenamePrefix))
 	defer ct.close()
 
+	isCatchpointRound := func(rnd basics.Round) bool {
There is some code duplication, but I prefer to have isCatchpointRound take a single argument rather than three.
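For context, a minimal runnable sketch of the single-argument closure shape being described. Names and values here are illustrative, not the actual test code, and Round stands in for basics.Round:

```go
package main

import "fmt"

// Round stands in for basics.Round in this sketch.
type Round uint64

func main() {
	// In the real test these would come from the ledger config;
	// hardcoded here for the sketch.
	const catchpointInterval = 16
	const catchpointLookback = 32

	// Single-argument helper: interval and lookback are captured by the
	// closure, so call sites only need to pass the round.
	isCatchpointRound := func(rnd Round) bool {
		return uint64(rnd) >= catchpointLookback &&
			uint64(rnd)%catchpointInterval == 0
	}

	fmt.Println(isCatchpointRound(32)) // true
	fmt.Println(isCatchpointRound(33)) // false
}
```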
 	wg.Wait()
+	ml.trackers.waitAccountsWriting()
This prevents calling accountsWriting.Wait before any accountsWriting.Add has been called. Note these are invoked from different goroutines, which could lead to a data race; the idiomatic Go pattern is to call Add and Wait from the same goroutine.
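For illustration, a hedged sketch of the Add/Wait ordering rule in question, using a plain sync.WaitGroup rather than the tracker code:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	// Anti-pattern (what the fix avoids): calling wg.Add(1) inside the
	// spawned goroutine means wg.Wait() below can run before any Add has
	// happened, so Wait may return immediately and races with Add:
	//
	//   go func() { wg.Add(1); defer wg.Done(); work() }()
	//   wg.Wait() // may observe counter 0 and return too early
	//
	// Idiomatic pattern: Add in the same goroutine that will later Wait,
	// before spawning the worker.
	wg.Add(1)
	go func() {
		defer wg.Done()
		fmt.Println("work done")
	}()
	wg.Wait() // guaranteed to observe the Add above
}
```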
Codecov Report
@@ Coverage Diff @@
## master #5214 +/- ##
=======================================
Coverage 53.73% 53.74%
=======================================
Files 444 444
Lines 55678 55685 +7
=======================================
+ Hits 29917 29926 +9
+ Misses 23430 23428 -2
Partials 2331 2331
... and 6 files with indirect coverage changes
 	}
 
+	// ensure Ledger.Wait() is non-blocked for all rounds except the last one (due to possible races)
+	for rnd := startRound; rnd < endRound; rnd++ {
rnd < endRound is used because the last round might not be committed yet. Sure, endRound-1 might not be either, but with a much smaller probability.
bd0fdfb
Talked to Pavel and now understand how synchronous channel sends to deferredCommits could lock up ledger lookups from agreement, causing agreement message processing (ledger reads for weights and circulation) to be blocked until a catchpoint file finishes writing to disk. Updated the comments to help me remember the set of interactions leading to a slow commitRound() locking reads.
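A toy sketch of the interaction described above, with illustrative names standing in for the tracker internals (this is not the actual ledger code): a synchronous send on an unbuffered channel while holding a lock stalls every reader behind that lock until the slow consumer drains the channel.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex                      // stands in for the lock guarding ledger reads
	deferredCommits := make(chan struct{}) // unbuffered: a send blocks until received

	// Slow consumer, standing in for commitSyncer writing a catchpoint file.
	go func() {
		time.Sleep(500 * time.Millisecond) // long catchpoint write in progress
		<-deferredCommits
	}()

	// Producer: performs a synchronous send while holding the lock.
	go func() {
		mu.Lock()
		deferredCommits <- struct{}{} // blocks for the whole "catchpoint write"
		mu.Unlock()
	}()

	time.Sleep(50 * time.Millisecond) // let the producer grab the lock first

	// Reader, standing in for an agreement ledger lookup (weights,
	// circulation): it is stuck until the consumer drains the channel
	// and the producer finally releases the lock.
	start := time.Now()
	mu.Lock()
	mu.Unlock()
	fmt.Println("read blocked for", time.Since(start))
}
```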
Looks like there is some flakiness in the catchpoint tests; looking into it.
@@ -609,9 +608,6 @@ func TestCatchpointReproducibleLabels(t *testing.T) {
 		delta := roundDeltas[i]
 
 		ml2.trackers.newBlock(blk, delta)
-		err := ml2.addMockBlock(blockEntry{block: blk}, delta)
-		require.NoError(t, err)
-		ml2.trackers.committedUpTo(i)
Sorry about this, I saw these lines were added in #4803. Do they break the test the way it is written now?
Because committedUpTo can skip, there is no guaranteed way to schedule a commit, and not all catchpoint commits can be scheduled because of the fixed range of rounds. To make this test work, committedUpTo is only called on catchpoint rounds.
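A self-contained sketch of that adjusted scheduling, with stand-in types (trackerRegistry, Round) rather than the real test harness: blocks are fed every round, but committedUpTo is invoked only on catchpoint rounds.

```go
package main

import "fmt"

type Round uint64

// Minimal stand-in for the tracker registry in this sketch.
type trackerRegistry struct{ lastScheduled Round }

func (tr *trackerRegistry) newBlock(rnd Round)      { /* feed the block to trackers */ }
func (tr *trackerRegistry) committedUpTo(rnd Round) { tr.lastScheduled = rnd }

func main() {
	const catchpointInterval = 4
	isCatchpointRound := func(rnd Round) bool { return rnd%catchpointInterval == 0 }

	tr := &trackerRegistry{}
	for rnd := Round(1); rnd <= 12; rnd++ {
		tr.newBlock(rnd)
		// Schedule a commit only on catchpoint rounds: committedUpTo may
		// skip, so scheduling every round gives no guarantee of a commit
		// landing exactly where the test expects one.
		if isCatchpointRound(rnd) {
			tr.committedUpTo(rnd)
		}
	}
	fmt.Println("last scheduled commit round:", tr.lastScheduled) // 12
}
```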
Oh right, the test in #4803 was assuming committedUpTo would work differently... I see.
This reverts commit f8a130e.
The original implementation set this flag in the task producer, which created a gap between setting it and actually writing the catchpoint in commitSyncer. Later, #5214 introduced the ability to skip tasks if the queue is full. This caused an issue with the catchup service, which was reading the flag via IsWritingCatchpointDataFile() and stopping the catchup. Because of that, the ledger was not receiving new blocks and was unable to schedule a new commit task, yet it still had the catchpoint-writing flag set from the previously discarded task.
Summary
Do not enqueue commit tasks if the queue is full. The consumer might be blocked on a long-running operation like a catchpoint/snapshot, so the producer (blockqueue) can get stuck for some time. In that case new blocks are not persisted, and all ledger.Wait() clients either get stuck or time out.

The fix DOES NOT change the frequency of commits: before, a long commit blocked the syncer, so the producer could not queue more tasks except for one right after the long commit; the others were never produced. With the fix, the producer is not blocked and keeps producing tasks with a growing round range (except on catchpoint rounds, which would produce the same task again and again), but these tasks are not accepted while the queue is full. Eventually the queue opens up and some task gets in. The exact round range does not matter much, since the trackers always attempt to commit the maximum range that makes sense from their point of view (no consensus switch within the range, the range ends with catchpoint [data] file creation, etc.).
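A minimal sketch of the non-blocking enqueue pattern the summary describes, using illustrative names (deferredCommits, deferredCommitContext) rather than the actual tracker code:

```go
package main

import "fmt"

// deferredCommitContext stands in for the real commit task payload.
type deferredCommitContext struct {
	firstRound, lastRound uint64
}

func main() {
	// Buffered channel standing in for the deferredCommits queue.
	deferredCommits := make(chan deferredCommitContext, 1)

	enqueue := func(dc deferredCommitContext) bool {
		select {
		case deferredCommits <- dc:
			return true // queue had room; the syncer will pick this up
		default:
			// Queue full: the syncer is busy (e.g. writing a catchpoint).
			// Skip instead of blocking the producer; a later task with a
			// wider round range will be produced and eventually accepted.
			return false
		}
	}

	fmt.Println(enqueue(deferredCommitContext{1, 10})) // true: accepted
	fmt.Println(enqueue(deferredCommitContext{1, 20})) // false: full, skipped
}
```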
The fix broke some catchpoint tests, and they were adjusted: instead of scheduling commits every round, they now schedule only when it is time for data file writing or catchpoint creation.
Test Plan
Added unit tests.
Reproduced on a local net and confirmed the fix works.