
Consensus failure test-case #716

Closed
wants to merge 7 commits into v0.x.x from quorum-tests

Conversation

AndrejMitrovic
Contributor

After finishing writing unittests for the function in #684, I started writing tests in Agora for consensus actually being reached.

But consensus fails even in the simplest scenario, with this quorum configuration:

immutable quorums =
[
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(2, [n1, n2]),
];

According to my understanding of SCP this should not result in consensus failure. So there is a bug in how we use SCP.
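For reference, here is a rough standalone sketch (plain D, not Agora's actual types) of the quorum check as I understand it: a set is a quorum if every member sees at least its threshold of its configured nodes inside the set. Under the config above the only quorum is the full set {n0, n1, n2}, so quorum intersection holds trivially and consensus should be reachable:

```
import std.algorithm : canFind, count;
import std.stdio : writeln;

/// minimal stand-in for Agora's QuorumConfig (illustration only)
struct QuorumConfig
{
    size_t threshold;
    size_t[] nodes;
}

/// `set` is an SCP quorum if every member has a satisfied slice inside it,
/// i.e. at least `threshold` of its configured nodes are members of `set`
bool isQuorum (in size_t[] set, in QuorumConfig[] configs)
{
    foreach (node; set)
    {
        const cfg = configs[node];
        if (cfg.nodes.count!(n => set.canFind(n)) < cfg.threshold)
            return false;
    }
    return true;
}

void main ()
{
    enum size_t n0 = 0, n1 = 1, n2 = 2;
    immutable quorums =
    [
        QuorumConfig(3, [n0, n1, n2]),
        QuorumConfig(3, [n0, n1, n2]),
        QuorumConfig(2, [n1, n2]),
    ];

    // enumerate every non-empty subset of {n0, n1, n2}
    foreach (mask; 1 .. 8)
    {
        size_t[] set;
        foreach (size_t n; 0 .. 3)
            if (mask & (1 << n))
                set ~= n;
        if (isQuorum(set, quorums))
            writeln("quorum: ", set);  // prints only [0, 1, 2]
    }
}
```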

@AndrejMitrovic added the type-bug and type-testing labels on Apr 1, 2020
@codecov

codecov bot commented Apr 1, 2020

Codecov Report

Merging #716 into v0.x.x will decrease coverage by 59.78%.
The diff coverage is 29.22%.

Impacted file tree graph

@@             Coverage Diff             @@
##           v0.x.x     #716       +/-   ##
===========================================
- Coverage   90.55%   30.76%   -59.79%     
===========================================
  Files          70       81       +11     
  Lines        5536     7940     +2404     
===========================================
- Hits         5013     2443     -2570     
- Misses        523     5497     +4974     
Flag Coverage Δ
#integration ?
#unittests 30.76% <29.22%> (-58.16%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
source/agora/common/Hash.d 50.00% <ø> (-36.67%) ⬇️
source/agora/common/Task.d 0.00% <0.00%> (-100.00%) ⬇️
source/agora/consensus/data/ConsensusData.d 0.00% <0.00%> (-100.00%) ⬇️
source/agora/consensus/data/Enrollment.d 25.00% <ø> (-75.00%) ⬇️
source/agora/consensus/data/PreImageInfo.d 0.00% <ø> (-100.00%) ⬇️
source/agora/consensus/data/Transaction.d 35.13% <ø> (-62.17%) ⬇️
source/agora/consensus/data/genesis/Coinnet.d 0.00% <0.00%> (ø)
source/agora/consensus/data/genesis/Test.d 0.00% <0.00%> (ø)
source/agora/consensus/validation/Enrollment.d 11.26% <0.00%> (-87.49%) ⬇️
source/agora/consensus/validation/Transaction.d 11.15% <0.00%> (-88.11%) ⬇️
... and 118 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d0f3b9...b717e3c. Read the comment docs.

@AndrejMitrovic
Contributor Author

I've tried testing this with the previous two fixes, which are not merged yet; however, the issue is still there.

It seems that after a node nominates, it does not begin the balloting. At least according to the logs, there are no BallotProtocol outputs.

broken: https://gist.github.com/AndrejMitrovic/fd24357035bd5ee36af808d0614ac482
fixed: https://gist.github.com/AndrejMitrovic/230db5cf228990cb79e7f153ede272ad

Here, broken means:

immutable quorums =
[
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(2, [n1, n2]),
];

and fixed means:

immutable quorums =
[
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
];

I don't see a reason why the first configuration should fail.

@AndrejMitrovic
Contributor Author

Well here's a bit of news: it works fine on LDC. Sigh.

@Geod24
Collaborator

Geod24 commented Apr 2, 2020

More evidence that dropping DMD support is the future for D developers!

@AndrejMitrovic
Contributor Author

This one test-case runs ok with LDC, but further tests don't.

Also, this single test-case fails sporadically, not every time. Here's a new diff where the quorum configs are the same this time:

ok: https://gist.github.com/AndrejMitrovic/44580a1c7e31dcdf6d61383e88806d4d
fail: https://gist.github.com/AndrejMitrovic/16890ce1d146832227bf87db3e947e0b

@AndrejMitrovic
Contributor Author

I noticed one diff which seemed very relevant:

< 2020-04-03 10:22:02,408 Info [agora.network.NetworkManager] - Discovery reached. 2 peers connected.
---
> 2020-04-03 10:22:36,948 Info [agora.network.NetworkManager] - Discovery reached. 1 peers connected.

If I manually set the min_listeners to 2, then the tests always pass.

This suggests the core issue could be related to the gossiping of envelopes.
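For the record, the workaround is just raising the discovery requirement in the test config, roughly like this (I'm assuming min_listeners is settable through TestConf here; it may actually live elsewhere in the node config):

```
// assumption: min_listeners is exposed via TestConf. The point is only that
// each node waits for 2 connected peers before discovery is "reached",
// which gives envelopes a direct path to every node.
TestConf conf = { min_listeners : 2 };
auto network = makeTestNetwork(conf);
network.start();
network.waitForDiscovery();
```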

@AndrejMitrovic
Contributor Author

I tested adding gossiping right in receiveEnvelope, and it seems to fix the bug.
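For illustration, the shape of the change is roughly the following (the method and field names here are made up, not Agora's or SCP's actual API; the idea is just to re-broadcast incoming envelopes):

```
// sketch only: names are hypothetical
void receiveEnvelope (in SCPEnvelope envelope)
{
    // hand the envelope to SCP as before
    this.scp.receiveEnvelope(envelope);

    // then gossip it to all known peers, so ballot protocol messages
    // still propagate when the connection graph is only partial
    foreach (peer; this.network.peers)
        peer.sendEnvelope(envelope);
}
```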

I'll submit a fix PR soon.

@Geod24 I will need you to merge #546 once it's green on Travis.

@Geod24
Collaborator

Geod24 commented Apr 15, 2020

> @Geod24 I will need you to merge #546 once it's green on Travis.

This has been done. Anything else blocking you?

@AndrejMitrovic
Contributor Author

> I tested adding gossiping right in receiveEnvelope, and it seems to fix the bug.

It fixes one bug, but it reveals another: https://github.com/bpfkorea/agora/blob/2b787c8fde8999854347072134e0bc9c34e58399/source/scpp/src/scp/BallotProtocol.cpp#L642

I'll add my intended changes to this PR.

@AndrejMitrovic
Contributor Author

So with the first two commits, the new test added in the last commit passes with dtest=agora.test.QuorumTest dub test. But all the other tests then fail with:

  what():  moved to a bad state (ballot protocol)

I think the only option left is to try using the SCP test-suite with the same quorum config and log the messages. Maybe it will tell us which messages are missing in Agora.

That being said, I don't know what is tripping up SCP.

There are other projects using SCP, right? Do you know their names? I could take a look at how they use it.

@AndrejMitrovic
Contributor Author

I don't need this to be a PR anymore; I can test locally.

@AndrejMitrovic
Contributor Author

I just ran the same test with the system integration test-suite, and the test works.

I'll try to retrieve the logs to compare what's different when running with LocalRest vs vibe.d.

@AndrejMitrovic
Contributor Author

I changed the test to use the same quorum config for all nodes. It still seems to fail.

@AndrejMitrovic
Contributor Author

With the quorum balancer that is currently a WIP, this test-case passes.

@AndrejMitrovic
Contributor Author

This test-case works in my WIP:

///
unittest
{
    // generate 1007 blocks, 1 short of the enrollments expiring.
    TestConf conf = { extra_blocks : 1007 };
    auto network = makeTestNetwork(conf);
    network.start();
    scope(exit) network.shutdown();
    scope(failure) network.printLogs();
    network.waitForDiscovery();

    auto nodes = network.clients;

    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1007, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1007)));

    // create enrollment data
    // send a request to enroll as a Validator
    Enrollment enroll_0 = nodes[0].createEnrollmentData();
    Enrollment enroll_1 = nodes[1].createEnrollmentData();
    nodes[2].enrollValidator(enroll_0);
    nodes[3].enrollValidator(enroll_1);

    // check enrollments
    nodes.each!(node =>
        retryFor(node.getEnrollment(enroll_0.utxo_key) == enroll_0 &&
                 node.getEnrollment(enroll_1.utxo_key) == enroll_1,
            5.seconds));

    auto txs = makeChainedTransactions(getGenesisKeyPair(),
        network.blocks[$ - 1].txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    // at block height 1008 the validator set changes from 4 => 2
    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1008, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1008)));

    // these are un-enrolled now
    nodes[2 .. $].each!(node => node.sleep(2.seconds, true));

    // verify that consensus can still be reached by the leftover validators
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    nodes[0 .. 2].enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1009, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1009)));

    // wait for nodes[2 .. $] to wake up
    Thread.sleep(3.seconds);

    // now try to re-enroll the rest of the validators
    Enrollment[] enrolls;
    foreach (node; nodes[2 .. $])
    {
        enrolls ~= node.createEnrollmentData();
        nodes[0].enrollValidator(enrolls[$ - 1]);
    }

    // check enrollments
    nodes.each!(node =>
        enrolls.each!(enroll =>
            retryFor(node.getEnrollment(enroll.utxo_key) == enroll, 5.seconds)));

    // this still uses 2 nodes for reaching consensus
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1010, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1010)));

    // this should use 4 nodes
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1011, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1011)));

    // this should halt progress because threshold is set to max
    // commenting this out will make the assert below fire.
    nodes[$ - 1].sleep(4.seconds, true);

    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    try
    {
        // progress was not made, still stuck at 1011 blocks
        nodes.enumerate.each!((idx, node) =>
            retryFor!Exception(node.getBlockHeight() == 1012, 2.seconds,
                format("Node %s has block height %s. Expected: %s",
                    idx, node.getBlockHeight(), 1012)));
        assert(0);  // should not be reached
    }
    catch (Exception ex)
    {
        assert(ex.msg.canFind("has block height 1011. Expected: 1012"));
    }

    Thread.sleep(3.seconds);  // wait for thread to wake up before shutdown()
}

There are a few issues, though:

  • The excessive Thread.sleep calls
  • There should be a way to force a node to stop sleeping, rather than it being time-based as it currently is. Something like node.ctrl.sleep and node.ctrl.wakeUp would be good to have.
  • I wish we had some kind of event mechanism so we can just "wait" until a node informs us about its ledger state. This means a push mechanism to the unittest thread, rather than the poll mechanism from the unittest thread as we have now. I think this should be possible to implement in TestValidatorNode; we would just have to add some overrides (see the sketch below).
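A minimal sketch of what such a push mechanism could look like. The onAcceptedBlock() hook mentioned in the comments is hypothetical (none of these hooks exist yet), and the sketch ignores LocalRest's thread isolation; it only illustrates the waiting pattern:

```
// hypothetical sketch: the node pushes its ledger height to the test,
// instead of the test polling getBlockHeight()
import core.sync.condition : Condition;
import core.sync.mutex : Mutex;
import core.time : Duration;

class BlockHeightNotifier
{
    private Mutex mutex;
    private Condition cond;
    private ulong height;

    this ()
    {
        this.mutex = new Mutex;
        this.cond = new Condition(this.mutex);
    }

    /// called by the node, e.g. from a hypothetical onAcceptedBlock()
    /// override in TestValidatorNode
    void notify (ulong new_height)
    {
        synchronized (this.mutex)
        {
            this.height = new_height;
            this.cond.notifyAll();
        }
    }

    /// called by the unittest thread: blocks until the node reports `expected`
    void waitFor (ulong expected, Duration timeout)
    {
        synchronized (this.mutex)
        {
            while (this.height < expected)
                if (!this.cond.wait(timeout))
                    assert(0, "Timed out waiting for block height");
        }
    }
}
```

The test would then call something like notifier.waitFor(1008, 5.seconds) instead of looping with retryFor.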

@AndrejMitrovic force-pushed the quorum-tests branch 4 times, most recently from 9597567 to a98895f on July 9, 2020 09:09
@AndrejMitrovic
Contributor Author

Most CIs seem to be turning green with the new fixes. 🎉

@AndrejMitrovic
Contributor Author

The only thing that's left to do here is to rewrite and enable these tests:

https://github.com/bpfkorea/agora/blob/c06e8b534119039f0a104899262c15c976bda71c/source/agora/test/Quorum.d#L165
https://github.com/bpfkorea/agora/blob/c06e8b534119039f0a104899262c15c976bda71c/source/agora/test/Quorum.d#L186

They have to be rewritten because the genesis block will not contain more than N enrollments (we want it to be hardcoded). The rewrite should spawn the minimum N nodes, then enroll an additional 16-N and 32-N nodes, and test whether consensus can still be reached.
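Roughly, the rewrite could look like this, reusing the idioms from the test above. spawnOutsiderNodes, GenesisValidators, and the specific value of N are made-up placeholders, not existing helpers:

```
// sketch only: grow the validator set from the N hardcoded genesis
// enrollments to 16 (the 32-node variant is the same idea)
unittest
{
    enum GenesisValidators = 6;   // assumed value of N
    enum TargetValidators = 16;

    TestConf conf = { extra_blocks : 1 };
    auto network = makeTestNetwork(conf);
    network.start();
    scope(exit) network.shutdown();
    scope(failure) network.printLogs();
    network.waitForDiscovery();

    auto nodes = network.clients;

    // bring up the extra (non-genesis) nodes and enroll them
    // through an existing validator
    auto outsiders = network.spawnOutsiderNodes(TargetValidators - GenesisValidators);
    foreach (node; outsiders)
        nodes[0].enrollValidator(node.createEnrollmentData());

    // once the enrollments are externalized, the enlarged validator set
    // should still be able to reach consensus on new blocks
    auto txs = makeChainedTransactions(getGenesisKeyPair(),
        network.blocks[$ - 1].txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 2, 5.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 2)));
}
```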

It was a bit too low for debugging purposes.

There may be more outsider nodes than the original set of nodes, so the index may be out of bounds.

The spending was too low to be able to use it for many freeze transactions.

The QuorumPreimage test had to be changed because of the way dice() works - even though the relative amount of stake is the same between nodes, the random number generator will in fact return different values.

For example:

```
dice(rnd, [10, 10, 10, 10, 10]);
```

Returns different indices compared to:

```
dice(rnd, [100, 100, 100, 100, 100]);
```

Even though the weight is the same.
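A quick way to see this with plain std.random (outside of Agora), a minimal sketch:

```
import std.random : Random, dice;
import std.stdio : writefln;

void main ()
{
    auto rnd1 = Random(42);
    auto rnd2 = Random(42);

    size_t[] a, b;
    foreach (_; 0 .. 8)
    {
        a ~= dice(rnd1, [10, 10, 10, 10, 10]);
        b ~= dice(rnd2, [100, 100, 100, 100, 100]);
    }

    // same seed and same relative weights, but the index sequences will
    // generally differ, because dice() draws from uniform(0, sum) and the
    // sums are different (50 vs 500)
    writefln("weights 10 : %s", a);
    writefln("weights 100: %s", b);
}
```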
The test with 32 nodes is semantically correct, but it will fail due to timeouts. There seems to be significant overhead in trying to simulate 32 nodes on a single machine - it's possible we haven't optimized our communication overhead yet.

Additionally, these tests were made to run last since they're taxing on the system.
@AndrejMitrovic
Contributor Author

Folded into #1086
