Request ancestor blocks for previous rounds #264

yacovm · 2025-10-02T21:50:23Z

This commit makes the node request past ancestor blocks and notarizations it doesn't have.

It is needed in case the node notarizes one or more empty blocks while the rest of the nodes notarize blocks in these rounds.

Signed-off-by: Yacov Manevich <yacov.manevich@avalabs.org>

samliok

I'm concerned this is a lot of code changes and additional complexity for a corner case. All we are trying to do is resend the notarization if we receive a block with a parent we don't know. Why can't our current logic with timeouts and replication solve this? When our node eventually times out on the round, it will send an empty vote. If that vote is for an old round, nodes will respond with their most recent round/seq as well as the round/seq for the stale empty vote.

Secondly, and I think most importantly, if a node empty-notarizes a round then it shouldn't be swayed by notarizations, right? Only a finalization should be able to change that node's mind; otherwise we can get a quorum of nodes that sign off on both an empty notarization and a finalization for the same round.

samliok · 2025-10-14T17:01:14Z

epoch_failover_test.go

+	recordedMessages := make(chan *Message, 100)
+	comm := &recordingComm{Communication: testutil.NoopComm(nodes), SentMessages: recordedMessages, BroadcastMessages: recordedMessages}
+
+	bb2 := &testutil.TestBlockBuilder{Out: make(chan *testutil.TestBlock, 1), BlockShouldBeBuilt: make(chan struct{}, 1)}


why create a second BB?

samliok · 2025-10-14T17:05:53Z

epoch_failover_test.go

+	t.Log("Last block is", lastBlock.BlockHeader().Seq, "for round", lastBlock.BlockHeader().Round)
+
+	leaderIndexOfLastBlock := (int(lastBlock.BlockHeader().Round)) % len(nodes)
+	voteOnLastBlock, err := testutil.NewTestVote(lastBlock, nodes[leaderIndexOfLastBlock])


this err value never gets checked

samliok · 2025-10-14T17:10:04Z

epoch_failover_test.go

+			return nil, nil
+		}
+		seq := msg.NotarizedBlockRequest.Seq
+		for i, b := range blocks {


ik this is a test so it doesn't really matter as much, but would it be easier to store blocks as a map indexed by their seqs?
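A minimal sketch of what the map-based lookup could look like. `testBlock` and `indexBySeq` are hypothetical stand-ins for illustration, not the types or helpers used in the test:

```go
package main

import "fmt"

// testBlock is a hypothetical stand-in for the block type used in the test.
type testBlock struct {
	Seq   uint64
	Round uint64
}

// indexBySeq builds a lookup table from sequence number to block,
// so a request handler does not need to scan the slice on every request.
func indexBySeq(blocks []testBlock) map[uint64]testBlock {
	bySeq := make(map[uint64]testBlock, len(blocks))
	for _, b := range blocks {
		bySeq[b.Seq] = b
	}
	return bySeq
}

func main() {
	blocks := []testBlock{{Seq: 0, Round: 0}, {Seq: 1, Round: 2}, {Seq: 2, Round: 3}}
	bySeq := indexBySeq(blocks)

	// A requested seq is answered with a single map lookup.
	if b, ok := bySeq[1]; ok {
		fmt.Println("found block for seq", b.Seq, "in round", b.Round)
	}
	// A seq that was never stored is reported as missing.
	if _, ok := bySeq[7]; !ok {
		fmt.Println("no block for seq 7")
	}
}
```

The trade-off is that a map loses the slice's ordering, which only matters if the test also iterates over the blocks in sequence.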

samliok · 2025-10-14T17:19:33Z

msg.go

+
+type NotarizedBlockResponse struct {
+	Block         Block
+	VerifiedBlock VerifiedBlock


I think we should separate these into NotarizedBlockResponse and VerifiedNotarizedBlockResponse just like the other messages

samliok · 2025-10-14T17:24:00Z

epoch.go

 		e.increaseRound()
+		increasedRound = true
+	}
+	if increasedRound {


is this important for this pr, or is it a separate bug fix?

samliok · 2025-10-14T17:37:37Z

epoch.go

+func (e *Epoch) maybeCreateFinalizeVoteForAncestor(digest Digest) Digest {
+	for roundNum, round := range e.rounds {
+		if round.block.BlockHeader().Digest != digest {
+			continue


if we are checking for this, should we log a warn? every round in the rounds map should always have a block

samliok · 2025-10-14T17:43:03Z

epoch.go

 			zap.Int("size", len(recordBytes)),
 			zap.Stringer("digest", finalization.Finalization.BlockHeader.Digest))

+		e.finalizeAncestors(finalization.Finalization.Prev)


can we add a short comment on why this is important? I feel like I am going to forget down the line

samliok · 2025-10-14T17:45:18Z

epoch.go

 	vote := message.Vote
 	from = vote.Signature.Signer

-	e.Logger.Debug("Handling block message", zap.Stringer("digest", md.Digest), zap.Uint64("round", md.Round))


why is this log being removed?

samliok · 2025-10-14T18:12:50Z

epoch.go

 	return e.Storage.NumBlocks()
 }

+func (e *Epoch) haveWeTimedOutOnRound(round uint64) bool {


we have a method with a very similar name already. haveWeAlreadyTimedOutOnThisRound checks for timedOut while this checks for an emptyNotarization

samliok · 2025-10-14T18:18:10Z

epoch.go

+		return nil
+	}
+
+	if response.Block != nil {


this will always be true since we check the negation above

yacovm · 2025-10-14T19:25:09Z

> Why can't our current logic with timeouts and replication solve this?

Because here we replicate rounds in the past, and the replication logic replicates rounds from the future.

> When our node eventually times out on the round, it will send an empty vote. If that vote is for an old round, nodes will respond with their most recent round/seq as well as the round/seq for the stale empty vote.

Here the assumption is that all nodes are in the latest round. This doesn't apply to nodes that are behind.

> Secondly, and I think most importantly, if a node empty-notarizes a round then it shouldn't be swayed by notarizations, right? Only a finalization should be able to change that node's mind; otherwise we can get a quorum of nodes that sign off on both an empty notarization and a finalization for the same round.

If a node notarizes an empty round, it will not send a finalize vote for that round. However, it should still be receptive to blocks built on a valid alternative chain. There is no problem replicating notarizations as long as we remember not to send finalize votes for them.

Otherwise, a single node that missed one or more blocks may force the rest of the nodes to notarize empty rounds until it is the leader again.
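The rule above can be sketched roughly as follows. `finalizeGuard` and its methods are hypothetical names used only for illustration; they are not the code in this PR:

```go
package main

import "fmt"

// finalizeGuard is a hypothetical helper that remembers which rounds
// this node has notarized as empty.
type finalizeGuard struct {
	emptyNotarized map[uint64]struct{}
}

func newFinalizeGuard() *finalizeGuard {
	return &finalizeGuard{emptyNotarized: make(map[uint64]struct{})}
}

// recordEmptyNotarization marks the round as notarized empty by this node.
func (g *finalizeGuard) recordEmptyNotarization(round uint64) {
	g.emptyNotarized[round] = struct{}{}
}

// maySendFinalizeVote reports whether a finalize vote is still allowed.
// A replicated block notarization for the round may be stored regardless;
// only the finalize vote is withheld for empty-notarized rounds.
func (g *finalizeGuard) maySendFinalizeVote(round uint64) bool {
	_, empty := g.emptyNotarized[round]
	return !empty
}

func main() {
	g := newFinalizeGuard()
	g.recordEmptyNotarization(5)

	fmt.Println("finalize vote allowed for round 5:", g.maySendFinalizeVote(5))
	fmt.Println("finalize vote allowed for round 6:", g.maySendFinalizeVote(6))
}
```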

Consider we have 4 nodes and one (node 1) has built and broadcast a block.
Nodes 2 and 3 received the block and send votes, but only node 2 manages to collect a notarization for the block.
Node 0 hasn't received the block at all, nor the votes or notarization.

The rest of the nodes (0, 1, 3) collect an empty notarization for that round.
We are now in the next round where node 2 is the leader.
Now node 2 broadcasts a block built on top of the block that node 1 has built.
At this point, if nodes 0, 1 and 3 don't replicate the notarization and node 0 doesn't replicate the block, then we will notarize an empty block for that round and move to the next node, but this will incur needless wait time.

A bigger problem is if node 3 crashes in this round: then we will notarize an empty round both for the round led by node 2 and for the round led by node 3.
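To make the vote counts in this scenario concrete, here is a small sketch. It assumes a quorum of floor((n+f)/2)+1 with f = floor((n-1)/3), which gives 3 for n = 4, and it assumes the leader's proposal counts as its own vote; neither assumption is taken from this PR:

```go
package main

import "fmt"

// quorum returns the assumed quorum size for n nodes: floor((n+f)/2)+1,
// where f = floor((n-1)/3) is the number of tolerated faulty nodes.
func quorum(n int) int {
	f := (n - 1) / 3
	return (n+f)/2 + 1
}

// reached reports whether the given voters form a quorum among n nodes.
func reached(voters []int, n int) bool {
	return len(voters) >= quorum(n)
}

// intersect returns the voters that appear in both lists, in the order of a.
func intersect(a, b []int) []int {
	inB := make(map[int]bool, len(b))
	for _, v := range b {
		inB[v] = true
	}
	var both []int
	for _, v := range a {
		if inB[v] {
			both = append(both, v)
		}
	}
	return both
}

func main() {
	const n = 4
	blockVotes := []int{1, 2, 3} // leader (node 1) plus nodes 2 and 3
	emptyVotes := []int{0, 1, 3} // nodes that timed out on the round

	fmt.Println("quorum:", quorum(n))
	fmt.Println("block notarization possible:", reached(blockVotes, n))
	fmt.Println("empty notarization possible:", reached(emptyVotes, n))
	fmt.Println("nodes that signed both:", intersect(blockVotes, emptyVotes))
}
```

Under these assumptions both a block notarization and an empty notarization can exist for the same round, which is why node 2 can legitimately build on a block that nodes 0, 1 and 3 never accepted.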

> I'm concerned this is a lot of code changes and additional complexity for a corner case.

I tend to agree. If we're OK with unwanted and sub-optimal latency in case of a network failure then we can just agree to not address this corner case.

yacovm force-pushed the requestPreviousBlocks branch 2 times, most recently from b751a1f to 0571fa8 Compare October 9, 2025 12:58

yacovm changed the title ~~...~~ Request ancestor blocks for previous rounds Oct 9, 2025

yacovm force-pushed the requestPreviousBlocks branch 3 times, most recently from 401e2f3 to 3374993 Compare October 9, 2025 17:42

yacovm force-pushed the requestPreviousBlocks branch from 3374993 to 8cdf1c6 Compare October 10, 2025 20:20

Merge branch 'main' into requestPreviousBlocks

a4527f2

samliok reviewed Oct 14, 2025

Merge branch 'main' into requestPreviousBlocks

ee175ba

yacovm closed this Oct 14, 2025
