ledger: fix commit tasks enqueueing #5214
Conversation
Force-pushed from 2116a3a to 97d1b13
Force-pushed from 97d1b13 to 22e4da8
@@ -46,7 +47,7 @@ import (
 	"github.com/algorand/go-algorand/test/partitiontest"
 )
 
-func TestIsWritingCatchpointFile(t *testing.T) {
+func TestCatchpointIsWritingCatchpointFile(t *testing.T) {
I renamed the catchpointtracker_test.go tests to unify their names and make it easier to run them per-file.
@@ -401,6 +402,7 @@ func TestReproducibleCatchpointLabels(t *testing.T) {
 	defer ml.Close()
 
 	cfg := config.GetDefaultLocal()
+	cfg.MaxAcctLookback = 2
I set it to an explicit value to be sure it is less than CatchpointInterval or CatchpointLookback.
@@ -1055,6 +1170,15 @@ func TestSecondStagePersistence(t *testing.T) {
 		t, ml, cfg, filepath.Join(tempDirectory, config.LedgerFilenamePrefix))
 	defer ct.close()
 
+	isCatchpointRound := func(rnd basics.Round) bool {
There is some code duplication, but I prefer to have isCatchpointRound take a single argument rather than three.
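For context, a minimal runnable sketch of the single-argument closure shape being described. Names and values here are illustrative, not the actual test code, and Round stands in for basics.Round:

```go
package main

import "fmt"

// Round stands in for basics.Round in this sketch.
type Round uint64

func main() {
	// In the real test these would come from the ledger config;
	// hardcoded here for the sketch.
	const catchpointInterval = 16
	const catchpointLookback = 32

	// Single-argument helper: interval and lookback are captured by the
	// closure, so call sites only need to pass the round.
	isCatchpointRound := func(rnd Round) bool {
		return uint64(rnd) >= catchpointLookback &&
			uint64(rnd)%catchpointInterval == 0
	}

	fmt.Println(isCatchpointRound(32)) // true
	fmt.Println(isCatchpointRound(33)) // false
}
```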
 	wg.Wait()
+	ml.trackers.waitAccountsWriting()
This prevents calling accountsWriting.Wait before any accountsWriting.Add has been called. Note these are invoked from different goroutines, which could lead to a data race; the idiomatic Go pattern is to call Add and Wait from the same goroutine.
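For illustration, a hedged sketch of the Add/Wait ordering rule in question, using a plain sync.WaitGroup rather than the tracker code:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup

	// Anti-pattern (what the fix avoids): calling wg.Add(1) inside the
	// spawned goroutine means wg.Wait() below can run before any Add has
	// happened, so Wait may return immediately and races with Add:
	//
	//   go func() { wg.Add(1); defer wg.Done(); work() }()
	//   wg.Wait() // may observe counter 0 and return too early
	//
	// Idiomatic pattern: Add in the same goroutine that will later Wait,
	// before spawning the worker.
	wg.Add(1)
	go func() {
		defer wg.Done()
		fmt.Println("work done")
	}()
	wg.Wait() // guaranteed to observe the Add above
}
```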
Codecov Report
@@ Coverage Diff @@
## master #5214 +/- ##
=======================================
Coverage 53.73% 53.74%
=======================================
Files 444 444
Lines 55678 55685 +7
=======================================
+ Hits 29917 29926 +9
+ Misses 23430 23428 -2
Partials 2331 2331
... and 6 files with indirect coverage changes
 	}
 
+	// ensure Ledger.Wait() is non-blocked for all rounds except the last one (due to possible races)
+	for rnd := startRound; rnd < endRound; rnd++ {
rnd < endRound is used because the last round might not be committed yet. Sure, endRound-1 might not be either, but with a much smaller probability.
bd0fdfb
Talked to Pavel and now understand how synchronous channel sends to deferredCommits could lock up ledger lookups from agreement, causing agreement message processing (ledger reads for weights and circulation) to be blocked until a catchpoint file finishes writing to disk. Updated the comments to help me remember the set of interactions leading to a slow commitRound() locking reads.
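A toy sketch of the interaction described above, with illustrative names standing in for the tracker internals (this is not the actual ledger code): a synchronous send on an unbuffered channel while holding a lock stalls every reader behind that lock until the slow consumer drains the channel.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex                      // stands in for the lock guarding ledger reads
	deferredCommits := make(chan struct{}) // unbuffered: a send blocks until received

	// Slow consumer, standing in for commitSyncer writing a catchpoint file.
	go func() {
		time.Sleep(500 * time.Millisecond) // long catchpoint write in progress
		<-deferredCommits
	}()

	// Producer: performs a synchronous send while holding the lock.
	go func() {
		mu.Lock()
		deferredCommits <- struct{}{} // blocks for the whole "catchpoint write"
		mu.Unlock()
	}()

	time.Sleep(50 * time.Millisecond) // let the producer grab the lock first

	// Reader, standing in for an agreement ledger lookup (weights,
	// circulation): it is stuck until the consumer drains the channel
	// and the producer finally releases the lock.
	start := time.Now()
	mu.Lock()
	mu.Unlock()
	fmt.Println("read blocked for", time.Since(start))
}
```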
Looks like there is some flakiness in the catchpoint tests; looking into it.
@@ -609,9 +608,6 @@ func TestCatchpointReproducibleLabels(t *testing.T) {
 		delta := roundDeltas[i]
 
 		ml2.trackers.newBlock(blk, delta)
-		err := ml2.addMockBlock(blockEntry{block: blk}, delta)
-		require.NoError(t, err)
-		ml2.trackers.committedUpTo(i)
Sorry about this, I saw these lines were added in #4803. Do they break the test the way it is written now?
Because committedUpTo can skip, there is no guaranteed way to schedule a commit, and not all catchpoint commits can be scheduled because of the fixed range of rounds. To make this test work, committedUpTo is only called on catchpoint rounds.
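A self-contained sketch of that adjusted scheduling, with stand-in types (trackerRegistry, Round) rather than the real test harness: blocks are fed every round, but committedUpTo is invoked only on catchpoint rounds.

```go
package main

import "fmt"

type Round uint64

// Minimal stand-in for the tracker registry in this sketch.
type trackerRegistry struct{ lastScheduled Round }

func (tr *trackerRegistry) newBlock(rnd Round)      { /* feed the block to trackers */ }
func (tr *trackerRegistry) committedUpTo(rnd Round) { tr.lastScheduled = rnd }

func main() {
	const catchpointInterval = 4
	isCatchpointRound := func(rnd Round) bool { return rnd%catchpointInterval == 0 }

	tr := &trackerRegistry{}
	for rnd := Round(1); rnd <= 12; rnd++ {
		tr.newBlock(rnd)
		// Schedule a commit only on catchpoint rounds: committedUpTo may
		// skip, so scheduling every round gives no guarantee of a commit
		// landing exactly where the test expects one.
		if isCatchpointRound(rnd) {
			tr.committedUpTo(rnd)
		}
	}
	fmt.Println("last scheduled commit round:", tr.lastScheduled) // 12
}
```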
Oh right, the test in #4803 was assuming committedUpTo would work differently... I see.
This reverts commit f8a130e.
The original implementation set this flag in the task producer, which created a gap between setting it and actually writing the catchpoint in commitSyncer. Later, #5214 introduced the ability to skip tasks if the queue is full. This caused an issue with the catchup service, which was reading the flag via IsWritingCatchpointDataFile() and stopping the catchup. Because of that, the ledger was not receiving new blocks and was unable to schedule a new commit task, yet it still had the catchpoint-writing flag set from the previously discarded task.
Summary
Do not enqueue commit tasks if the queue is full. The consumer might be blocked on a long-running operation like a catchpoint/snapshot, so the producer (blockqueue) can get stuck for some time. In that case new blocks are not persisted, and all ledger.Wait() clients either get stuck or time out.

The fix DOES NOT change the frequency of commits: before, a long commit blocked the syncer, so the producer could not queue more tasks except for one right after the long commit; the others were never produced. With the fix, the producer is not blocked and keeps producing tasks with a growing round range (except on catchpoint rounds, which would produce the same task again and again), but these tasks are not accepted while the queue is full. Eventually the queue opens up and some task gets in. The exact round range does not matter much, since the trackers always attempt to commit the maximum range that makes sense from their point of view (no consensus switch within the range, the range ends with catchpoint [data] file creation, etc.).
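A minimal sketch of the non-blocking enqueue pattern the summary describes, using illustrative names (deferredCommits, deferredCommitContext) rather than the actual tracker code:

```go
package main

import "fmt"

// deferredCommitContext stands in for the real commit task payload.
type deferredCommitContext struct {
	firstRound, lastRound uint64
}

func main() {
	// Buffered channel standing in for the deferredCommits queue.
	deferredCommits := make(chan deferredCommitContext, 1)

	enqueue := func(dc deferredCommitContext) bool {
		select {
		case deferredCommits <- dc:
			return true // queue had room; the syncer will pick this up
		default:
			// Queue full: the syncer is busy (e.g. writing a catchpoint).
			// Skip instead of blocking the producer; a later task with a
			// wider round range will be produced and eventually accepted.
			return false
		}
	}

	fmt.Println(enqueue(deferredCommitContext{1, 10})) // true: accepted
	fmt.Println(enqueue(deferredCommitContext{1, 20})) // false: full, skipped
}
```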
The fix broke some catchpoint tests, and they were adjusted: instead of scheduling commits every round, they now schedule only when it is time for data file writing or catchpoint creation.
Test Plan
Added unit tests.
Reproduced on a local net and confirmed the fix works.