We have a failing liveness test case in the Python e2e test engine. It is quite an extreme scenario in terms of network topology and does not seem to be caused by the consensus logic: if the nodes successfully re-establish connections, liveness is restored.
Scenario
Line 1: (Node1 ---- Node2) with view height = 1802
Line 2: (Node3 ---- Node4) with view height = 1830
We then connect Node2 and Node3; the p2p handshake (view exchange and block download) happens between them.
The topology is now:
Node1 ------ Node2 ------ Node3 ------ Node4
 1802         1830          1830         1830
Since N1 stays on the old view height 1802, it buffers all the round-state sync messages from N2 as future messages in its backlog, and it cannot vote for the new height 1831 proposed by N2, N3 and N4.
Possible cause of the issue
Node1 doesn't have the information that Node2 has synchronized to height 1830, so it remains at block 1802 and cannot participate in the voting for block 1831. This looks to me like a pure eth service protocol issue.
In the sync logic (eth/sync.go) we broadcast NewBlockHash to our connected peers: https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L207-L215
But the handling of NewBlockHash in eth/handler.go doesn't seem to trigger a sync. I think there is an upstream issue here.
How will the proposed solution solve the issue when nodes are at the same height but in a different round/step?
If the current AskSync doesn't solve this issue, why do we need it?
In my view we need a solution covering all the sync cases; implementing many solutions for many cases in different parts of the project will increase complexity and, down the line, tech debt.
There is a distinction between block synchronization and consensus state synchronization.
The solution proposed here only triggers block sync (validators not at the same height); after that, the consensus state should be synchronized via our pull-based sync mechanism.
AskSync is there to solve consensus state synchronization, and it addresses important issues when, for example, nodes are at the same height but not in the same round. It purely solves a consensus issue.
Block synchronization is handled by the eth protocol; the same problem can also occur with an observer node unable to sync from the network.
Possible solutions
Solution 1 (Unsuccessful)
My proposed solution is simply to update a peer's estimated head block when receiving NewBlockHashesMsg in the handler logic, https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/handler.go#L751, with peer.setHead().
This would trigger a sync later via https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L134
Implemented in https://github.com/clearmatics/autonity/tree/sync_announcement
Status: Unsuccessful. Triggering a sync for every NewBlockHashesMsg received stopped the consensus engine too often and conflicted with the consensus commit function. We now understand why the downloader is only queried with the parent block when receiving a NewBlockMsg.
Solution 2
Send NewBlockMsg after completing sync: this was originally proposed by @Jason-Zhangxin-Chen. The idea is simply to set the propagate flag to true in https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L214
But this is unfortunately insufficient, as BroadcastBlock() only broadcasts to a subset of the potentially unsynced peers. The proposal is to create a separate method that broadcasts to every connected peer in PeersWithoutBlock.