Consensus failure test-case #716
Conversation
Codecov Report
```
@@            Coverage Diff            @@
##           v0.x.x     #716      +/-  ##
===========================================
- Coverage   90.55%   30.76%   -59.79%
===========================================
  Files          70       81       +11
  Lines        5536     7940     +2404
===========================================
- Hits         5013     2443     -2570
- Misses        523     5497     +4974
```
|
I've tried testing this with the previous two fixes which are not merged yet:
However the issue is still there. It seems that after a node nominates, it does not begin balloting; at least according to the logs, there are no BallotProtocol outputs.

broken: https://gist.github.com/AndrejMitrovic/fd24357035bd5ee36af808d0614ac482

Here "broken" means:

```
immutable quorums =
[
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(2, [n1, n2]),
];
```

and "fixed" means:

```
immutable quorums =
[
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
];
```

I don't see a reason why the first configuration should fail. |
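For context, a minimal sketch of the threshold semantics assumed by these configs; this is not Agora's actual type, just an illustration where a `QuorumConfig(threshold, nodes)` slice is satisfied once at least `threshold` of the listed nodes agree. The struct, function name, and string node IDs are purely illustrative.

```
import std.algorithm : canFind, count;

/// Illustrative stand-in for the real QuorumConfig (a threshold-of-nodes slice)
struct QuorumConfig
{
    size_t threshold;
    string[] nodes;
}

/// Returns: true if at least `threshold` of `conf.nodes` are in `agreeing`
bool isSliceSatisfied (in QuorumConfig conf, in string[] agreeing)
{
    return conf.nodes.count!(n => agreeing.canFind(n)) >= conf.threshold;
}

unittest
{
    // the "broken" set only differs in n2's slice, which accepts 2-of-{n1, n2}
    assert( isSliceSatisfied(QuorumConfig(3, ["n0", "n1", "n2"]), ["n0", "n1", "n2"]));
    assert(!isSliceSatisfied(QuorumConfig(3, ["n0", "n1", "n2"]), ["n1", "n2"]));
    assert( isSliceSatisfied(QuorumConfig(2, ["n1", "n2"]), ["n1", "n2"]));
}
```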
Well here's a bit of news: it works fine on LDC. Sigh. |
More evidence that dropping DMD support is the future for D developers! |
This one test-case runs OK with LDC, but further tests don't. Also, this single test-case seems to fail sporadically rather than every time. Here's a new diff where the quorum configs are the same this time: ok: https://gist.github.com/AndrejMitrovic/44580a1c7e31dcdf6d61383e88806d4d |
I noticed one diff which seemed very relevant:

```
< 2020-04-03 10:22:02,408 Info [agora.network.NetworkManager] - Discovery reached. 2 peers connected.
---
> 2020-04-03 10:22:36,948 Info [agora.network.NetworkManager] - Discovery reached. 1 peers connected.
```

If I manually set the […]. This means the core issue could be related to gossiping of envelopes. |
It fixes one bug, but it reveals another: https://github.com/bpfkorea/agora/blob/2b787c8fde8999854347072134e0bc9c34e58399/source/scpp/src/scp/BallotProtocol.cpp#L642 I'll add my intended changes to this PR. |
(force-pushed from 0750961 to 17aabc8)
So with the first two commits the new test added in the last commit passes, with
I think the only option left is to try using the SCP test-suite with the same quorum config and log the messages. Maybe it will tell us which messages are missing in Agora. That being said, I don't know what is tripping up SCP. There are other projects using SCP, right? Do you know their names? I could take a look at how they use it. |
I don't need this to be a PR anymore. Can test locally. |
I just did the same test but with the system integration test-suite. And the test works. I'll try to retrieve the logs to compare what's different when running with LocalRest vs vibe.d. |
(force-pushed from 17aabc8 to b9f7b8c)
I changed the test to use the same quorum config for all nodes. It still seems to fail. |
With the quorum balancer that is currently a WIP, this test-case passes. |
This test-case works in my WIP:

```
///
unittest
{
    import core.thread : Thread;
    import core.time : seconds;
    import std.algorithm : canFind, each;
    import std.format : format;
    import std.range : enumerate;
    // TestConf, makeTestNetwork, retryFor, Enrollment, etc. come from the
    // enclosing test module's scaffolding.

    // generate 1007 blocks, 1 short of the enrollments expiring.
    TestConf conf = { extra_blocks : 1007 };
    auto network = makeTestNetwork(conf);
    network.start();
    scope(exit) network.shutdown();
    scope(failure) network.printLogs();
    network.waitForDiscovery();

    auto nodes = network.clients;
    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1007, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1007)));

    // create enrollment data
    // send a request to enroll as a Validator
    Enrollment enroll_0 = nodes[0].createEnrollmentData();
    Enrollment enroll_1 = nodes[1].createEnrollmentData();
    nodes[2].enrollValidator(enroll_0);
    nodes[3].enrollValidator(enroll_1);

    // check enrollments
    nodes.each!(node =>
        retryFor(node.getEnrollment(enroll_0.utxo_key) == enroll_0 &&
                 node.getEnrollment(enroll_1.utxo_key) == enroll_1,
            5.seconds));

    auto txs = makeChainedTransactions(getGenesisKeyPair(),
        network.blocks[$ - 1].txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    // at block height 1008 the validator set changes from 4 => 2
    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1008, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1008)));

    // these are un-enrolled now
    nodes[2 .. $].each!(node => node.sleep(2.seconds, true));

    // verify that consensus can still be reached by the leftover validators
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));
    nodes[0 .. 2].enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1009, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1009)));

    // wait for nodes[2 .. 3] to wake up
    Thread.sleep(3.seconds);

    // now try to re-enroll the rest of the validators
    Enrollment[] enrolls;
    foreach (node; nodes[2 .. $])
    {
        enrolls ~= node.createEnrollmentData();
        nodes[0].enrollValidator(enrolls[$ - 1]);
    }

    // check enrollments
    nodes.each!(node =>
        enrolls.each!(enroll =>
            retryFor(node.getEnrollment(enroll.utxo_key) == enroll, 5.seconds)));

    // this still uses 2 nodes for reaching consensus
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));
    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1010, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1010)));

    // this should use 4 nodes
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));
    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1011, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1011)));

    // this should halt progress because threshold is set to max
    // commenting this out will make the assert below fire.
    nodes[$ - 1].sleep(4.seconds, true);
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));
    try
    {
        // progress was not made, still stuck at 1011 blocks
        nodes.enumerate.each!((idx, node) =>
            retryFor!Exception(node.getBlockHeight() == 1012, 2.seconds,
                format("Node %s has block height %s. Expected: %s",
                    idx, node.getBlockHeight(), 1012)));
        assert(0); // should not be reached
    }
    catch (Exception ex)
    {
        assert(ex.msg.canFind("has block height 1011. Expected: 1012"));
    }

    Thread.sleep(3.seconds); // wait for thread to wake up before shutdown()
}
```

There are a few issues though:
|
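The test above leans heavily on retryFor to poll a condition until it holds or a timeout expires. As a point of reference, here is a minimal sketch of what such a polling helper can look like; this is hypothetical and not the actual Agora implementation, whose signature and default exception type may differ.

```
import core.thread : Thread;
import core.time : Duration, MonoTime, msecs;

/// Re-evaluate a lazy condition until it holds or `timeout` expires,
/// then throw `E` with the given message (sketch only).
void retryFor (E : Throwable = Exception) (lazy bool condition,
    Duration timeout, lazy string msg = "retryFor timed out")
{
    immutable deadline = MonoTime.currTime + timeout;
    do
    {
        if (condition)
            return;
        Thread.sleep(10.msecs);  // poll at a fixed interval
    }
    while (MonoTime.currTime < deadline);
    throw new E(msg);
}
```

Used as in the test above, e.g. `retryFor(node.getBlockHeight() == 1008, 2.seconds, format(...))`.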
(force-pushed from 9597567 to a98895f)
Most CIs seem to be turning green with the new fixes. 🎉 |
The only thing that's left to do here is to rewrite and enable these tests: https://github.com/bpfkorea/agora/blob/c06e8b534119039f0a104899262c15c976bda71c/source/agora/test/Quorum.d#L165 They have to be rewritten because the genesis block will not contain more than N enrollments (because we want it to be hardcoded). The rewrite should spawn the minimum N number of nodes, and then enroll an additional 16-N and 32-N nodes and test if consensus can be reached. |
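A rough sketch of how such a rewritten test could look, reusing the scaffolding from the unittest earlier in this thread. `GenesisValidators` (the minimum N) and the `outsider_validators` TestConf field are hypothetical names used only for illustration; the real constants and fields may differ, as may the exact enrollment flow.

```
/// Spawn the minimum N genesis validators plus (target - N) outsiders,
/// enroll the outsiders, and check that all nodes still reach consensus.
/// `target` would be 16 or 32 in the rewritten Quorum.d tests.
void testWithOutsiders (size_t target)
{
    TestConf conf = { outsider_validators : target - GenesisValidators };  // hypothetical field
    auto network = makeTestNetwork(conf);
    network.start();
    scope (exit) network.shutdown();
    scope (failure) network.printLogs();
    network.waitForDiscovery();

    auto nodes = network.clients;

    // enroll every outsider through one of the genesis validators
    foreach (outsider; nodes[GenesisValidators .. $])
        nodes[0].enrollValidator(outsider.createEnrollmentData());

    // drive one more block and verify every node (old and new) externalizes it
    auto txs = makeChainedTransactions(getGenesisKeyPair(),
        network.blocks[$ - 1].txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));
    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == network.blocks.length, 5.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), network.blocks.length)));
}
```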
It was a bit too low for debugging purposes.
There may be more outsider nodes than the original set of nodes, so the index may be out of bounds.
The spending was too low to be able to use it for many freeze transactions. The QuorumPreimage test had to be changed because of the way dice() works: even though the relative amount of stake is the same between nodes, the random number generator will in fact return different values. For example, ``` dice(rnd, [10, 10, 10, 10, 10]); ``` returns different indices compared to ``` dice(rnd, [100, 100, 100, 100, 100]); ```, even though the relative weights are the same.
The test with the 32 nodes is semantically correct, but it will fail due to timeouts. There seems to be significant overhead in simulating 32 nodes on a single machine; it's possible we haven't optimized our communication overhead yet. These tests were also made to run last, since they're taxing on the system.
(force-pushed from a98895f to b717e3c)
Folded into #1086 |
After finishing writing unittests for the function in #684, I started writing tests in Agora for consensus actually being reached.
But consensus fails even in the simplest scenario with this quorum configuration:
According to my understanding of SCP this should not result in consensus failure. So there is a bug in how we use SCP.