We have a failing liveness test case in the Python e2e test engine. It is quite an extreme scenario in terms of network topology and does not seem to be caused by the consensus logic: if the nodes successfully re-establish connections, liveness is restored.
Scenario
Line 1: (Node1 ---- Node2) with view height = 1802
Line 2: (Node3 ---- Node4) with view height = 1830
We then connect Node2 and Node3; the p2p handshake (view exchange and block download) happens between them.
The topology is now:
Node1 ------ Node2 ------ Node3 ------ Node4
 1802         1830          1830         1830
Since N1 stays on the old view height 1802, it buffers all the round-state sync messages from N2 as future messages in its backlog, and it cannot vote for the new height 1831 proposed by N2, N3 and N4.
Possible cause of the issue
Node1 doesn't have the information that Node2 has synchronized to height 1830, so it remains at block 1802 and cannot participate in the voting for block 1831. This looks to me like a pure eth service protocol issue.
In the sync logic (eth/sync.go) we broadcast NewBlockHash to our connected peers: https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L207-L215
But the handling of NewBlockHash in eth/handler.go doesn't seem to trigger a sync. I think there is an upstream issue here.
How will the proposed solution solve the issue when nodes are at the same height but in a different round/step?
If the current AskSync doesn't solve this issue, why do we need it?
In my view we need a solution covering all the sync cases; implementing many solutions for many cases in different parts of the project will increase complexity and, down the line, tech debt.
There is a distinction between block synchronization and consensus state synchronization.
The solution proposed here only triggers block sync (validators not at the same height); after that, the consensus state should be synchronized via our pull-based sync mechanism.
AskSync is there to solve consensus state synchronization, and it addresses important issues when, for example, nodes are at the same height but not in the same round. It purely solves a consensus issue.
Block synchronization is handled by the eth protocol; the same problem can also occur with an observer node unable to sync from the network.
Possible solutions
Solution 1 (Unsuccessful)
My proposed solution is simply to update a peer's estimated head block when receiving NewBlockHashesMsg in the handler logic, https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/handler.go#L751, with peer.setHead().
This would trigger a sync later via https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L134
Implemented in https://github.com/clearmatics/autonity/tree/sync_announcement
Status: Unsuccessful. Triggering a sync for every NewBlockHashesMsg received stopped the consensus engine too often and conflicted with the consensus commit function. We now understand why the downloader is only queried with the parent block when receiving a NewBlockMsg.
Solution 2
Send NewBlockMsg after completing sync: this was originally proposed by @Jason-Zhangxin-Chen. The idea is simply to set the propagate flag to true in https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L214
But this is unfortunately insufficient, as BroadcastBlock() only broadcasts to a subset of the potentially unsynced peers. The proposal is to create a separate method that broadcasts to every connected peer in PeersWithoutBlock.