Failed e2e test with bus topology network #332

Closed
yazzaoui opened this issue Nov 1, 2019 · 2 comments
Labels: type:design Design and planning

yazzaoui commented Nov 1, 2019

We have a failing liveness test case in the Python e2e test engine. It is quite an extreme scenario in terms of network topology and does not seem to be caused by the consensus logic: if the nodes successfully re-establish connections, liveness is restored.

Scenario

Line 1: (Node1 ---- Node2) with view height = 1802
Line 2: (Node3 ---- Node4) with view height = 1830
We then connect Node2 and Node3, and the p2p handshake (view exchange and block download) happens between Node2 and Node3.
Now we have the topology (with block heights):

Node1 ------ Node2 ------ Node3 ------ Node4
1802         1830         1830         1830

Since Node1 stays on the old view height 1802, it buffers all the round-state sync messages from Node2 as future messages in its backlog, and it cannot vote for the new height 1831 proposed by Node2, Node3 and Node4.
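
To make the stall concrete, here is a minimal, self-contained sketch of the kind of future-message buffering described above. All names (`core`, `message`, `handleMsg`) are hypothetical stand-ins, not the actual Autonity consensus types.

```go
package core

import "errors"

// Hypothetical, simplified types for illustration only; the real Autonity
// core has its own message, backlog and round-state structures.
var errFutureHeightMessage = errors.New("future height message")

type message struct {
	Height uint64
	// round, step and payload omitted for brevity
}

type core struct {
	currentHeight uint64
	backlog       []*message
}

// handleMsg defers any message addressed to a height this node has not
// reached yet. With Node1 stuck at 1802, every round-state message for 1831
// coming from Node2 ends up buffered here and is never acted upon, so Node1
// cannot vote on the new height.
func (c *core) handleMsg(msg *message) error {
	if msg.Height > c.currentHeight {
		c.backlog = append(c.backlog, msg)
		return errFutureHeightMessage
	}
	// ...normal processing for current-height messages...
	return nil
}
```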

Possible cause of the issue

Node1 doesn't have the information that Node2 has synchronized to height 1830, so it remains at block 1802 and cannot participate in the voting for block 1831. It appears to me to be a pure eth service protocol issue.
In the sync logic (eth/sync.go) we broadcast NewBlockHash to our connected peers:
https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L207-L215
But the handling of NewBlockHash in eth/handler.go doesn't seem to trigger a sync. I think there is an upstream issue here.
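
For context, the linked eth/sync.go lines mirror what upstream go-ethereum does after a successful downloader run; roughly paraphrased (not the exact Autonity source):

```go
// Paraphrase of the announce step at the end of synchronise() in eth/sync.go.
if head := pm.blockchain.CurrentBlock(); head.NumberU64() > 0 {
	// propagate == false: peers receive only a NewBlockHashesMsg hash
	// announcement, which the handler feeds to the block fetcher instead of
	// starting a downloader sync on the receiving side.
	go pm.BroadcastBlock(head, false)
}
```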

Possible solutions

Solution 1 (Unsuccessful)

My proposed solution is simply to update a peer's estimated block head with peer.setHead() when receiving NewBlockHashesMsg in the handler logic: https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/handler.go#L751
This would then trigger a sync later via https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L134

Implemented in https://github.com/clearmatics/autonity/tree/sync_announcement
Status: unsuccessful. Triggering a sync for every NewBlockHashesMsg received was stopping the consensus engine too often and conflicting with the consensus commit function. I now understand why we only query the downloader with the parent block when receiving a NewBlockMsg message.
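
For reference, this is roughly the shape of the change that was attempted; a simplified, self-contained sketch with hypothetical names, not the actual code in the sync_announcement branch.

```go
package eth

// Hypothetical stand-ins for the real eth/handler.go and eth/peer.go types;
// this only illustrates the mechanism.

type blockAnnounce struct {
	Hash   [32]byte
	Number uint64
}

type peerStub struct {
	id         string
	headNumber uint64
}

// setHeadEstimate plays the role of peer.setHead in the real handler: bump
// the peer's estimated head when it announces something newer.
func (p *peerStub) setHeadEstimate(n uint64) {
	if n > p.headNumber {
		p.headNumber = n
	}
}

// handleNewBlockHashes sketches the proposed change to the NewBlockHashesMsg
// branch: besides marking the announced hashes as known, update the peer's
// estimated head.
func handleNewBlockHashes(p *peerStub, announces []blockAnnounce) {
	for _, a := range announces {
		p.setHeadEstimate(a.Number)
	}
}

// bestPeer mirrors the selection done by the periodic sync loop: pick the
// peer with the highest estimated head, and sync from it if it is ahead of
// our own chain head.
func bestPeer(peers []*peerStub) *peerStub {
	var best *peerStub
	for _, p := range peers {
		if best == nil || p.headNumber > best.headNumber {
			best = p
		}
	}
	return best
}
```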

Solution 2

Send NewBlockMsg after completing a sync. This was originally proposed by @Jason-Zhangxin-Chen; the idea is simply to set the propagate flag to true in https://github.com/clearmatics/autonity/blob/cfff0d1408cbfa14f00a6275b33a93eff20a4a73/eth/sync.go#L214
Unfortunately this is insufficient on its own, as BroadcastBlock() only propagates the full block to a subset of the potentially unsynced peers. The proposal is therefore to create a separate method that broadcasts to every connected peer returned by PeersWithoutBlock (see the sketch below).
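
A minimal sketch of what such a method could look like, assuming a PeersWithoutBlock-style filter; all names here are illustrative, not the final API.

```go
package eth

// Illustrative types only; the real implementation would live next to
// BroadcastBlock in eth/handler.go and use the existing peer set.

type block struct {
	Hash   [32]byte
	Number uint64
}

type syncPeer interface {
	KnowsBlock(hash [32]byte) bool // has the peer seen / been sent this block?
	SendNewBlock(b *block) error   // send a full NewBlockMsg
}

// broadcastHeadToAll is the proposed post-sync broadcast: unlike
// BroadcastBlock(), which propagates the full block only to a subset of
// peers, it sends NewBlockMsg to every connected peer that does not already
// have the new head.
func broadcastHeadToAll(peers []syncPeer, head *block) {
	for _, p := range peers {
		// The PeersWithoutBlock-style filter: skip peers that already have
		// (or were already sent) this block.
		if p.KnowsBlock(head.Hash) {
			continue
		}
		// Errors are ignored in this sketch; real code would log them.
		_ = p.SendNewBlock(head)
	}
}
```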

Relevant info.

@yazzaoui yazzaoui added this to Backlog in Autonity via automation Nov 1, 2019
@yazzaoui yazzaoui moved this from Backlog to In progress in Autonity Nov 1, 2019
@yazzaoui yazzaoui added this to the Autonity 1.0.0 milestone Nov 1, 2019

JekaMas commented Nov 1, 2019

  1. How will the proposed solution solve the issue when nodes are at the same height but at a different round/step?
  2. If the current AskSync doesn't solve this issue, why do we need it?

For me, we need one solution for all the sync cases; implementing many solutions for many cases in different parts of the project will increase complexity and, in future, tech debt.


yazzaoui commented Nov 1, 2019

There is a distinction between block synchronization and consensus state synchronization.

  1. The solution proposed here is only for triggering block sync (validators not at the same height); after that, the consensus state should be synchronized via our pull-based sync mechanism.
  2. AskSync is there to handle consensus state synchronization and solves important issues when you have, for example, nodes at the same height but not at the same round. It is purely there to solve a consensus issue.

Block synchronization is handled by the eth protocol; this problem can also occur with an observer node that is unable to sync from the network.

@JekaMas JekaMas moved this from In progress to Review in Autonity Nov 7, 2019
@yazzaoui yazzaoui moved this from Review to In progress in Autonity Nov 18, 2019
@JekaMas JekaMas moved this from In progress to Done in Autonity Dec 13, 2019
@yazzaoui yazzaoui added BFT type:design Design and planning labels Jan 27, 2020
@piersy piersy removed the BFT label Apr 6, 2020