
Consensus failure test-case #716

Closed
wants to merge 7 commits into v0.x.x from quorum-tests

Conversation

AndrejMitrovic
Contributor

After finishing writing unittests for the function in #684, I started writing tests in Agora for consensus actually being reached.

But consensus fails even in the simplest scenario, with this quorum configuration:

immutable quorums =
[
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(2, [n1, n2]),
];

According to my understanding of SCP this should not result in consensus failure. So there is a bug in how we use SCP.
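For reference, here is a rough standalone sketch (plain D, not Agora's actual types) of the quorum check as I understand it: a set is a quorum if every member sees at least its threshold of its configured nodes inside the set. Under the config above the only quorum is the full set {n0, n1, n2}, so quorum intersection holds trivially and consensus should be reachable:

```
import std.algorithm : canFind, count;
import std.stdio : writeln;

/// minimal stand-in for Agora's QuorumConfig (illustration only)
struct QuorumConfig
{
    size_t threshold;
    size_t[] nodes;
}

/// `set` is an SCP quorum if every member has a satisfied slice inside it,
/// i.e. at least `threshold` of its configured nodes are members of `set`
bool isQuorum (in size_t[] set, in QuorumConfig[] configs)
{
    foreach (node; set)
    {
        const cfg = configs[node];
        if (cfg.nodes.count!(n => set.canFind(n)) < cfg.threshold)
            return false;
    }
    return true;
}

void main ()
{
    enum size_t n0 = 0, n1 = 1, n2 = 2;
    immutable quorums =
    [
        QuorumConfig(3, [n0, n1, n2]),
        QuorumConfig(3, [n0, n1, n2]),
        QuorumConfig(2, [n1, n2]),
    ];

    // enumerate every non-empty subset of {n0, n1, n2}
    foreach (mask; 1 .. 8)
    {
        size_t[] set;
        foreach (size_t n; 0 .. 3)
            if (mask & (1 << n))
                set ~= n;
        if (isQuorum(set, quorums))
            writeln("quorum: ", set);  // prints only [0, 1, 2]
    }
}
```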

@AndrejMitrovic added the type-bug and type-testing labels on Apr 1, 2020
@codecov

codecov bot commented Apr 1, 2020

Codecov Report

Merging #716 into v0.x.x will decrease coverage by 59.78%.
The diff coverage is 29.22%.

Impacted file tree graph

@@             Coverage Diff             @@
##           v0.x.x     #716       +/-   ##
===========================================
- Coverage   90.55%   30.76%   -59.79%     
===========================================
  Files          70       81       +11     
  Lines        5536     7940     +2404     
===========================================
- Hits         5013     2443     -2570     
- Misses        523     5497     +4974     
Flag Coverage Δ
#integration ?
#unittests 30.76% <29.22%> (-58.16%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
source/agora/common/Hash.d 50.00% <ø> (-36.67%) ⬇️
source/agora/common/Task.d 0.00% <0.00%> (-100.00%) ⬇️
source/agora/consensus/data/ConsensusData.d 0.00% <0.00%> (-100.00%) ⬇️
source/agora/consensus/data/Enrollment.d 25.00% <ø> (-75.00%) ⬇️
source/agora/consensus/data/PreImageInfo.d 0.00% <ø> (-100.00%) ⬇️
source/agora/consensus/data/Transaction.d 35.13% <ø> (-62.17%) ⬇️
source/agora/consensus/data/genesis/Coinnet.d 0.00% <0.00%> (ø)
source/agora/consensus/data/genesis/Test.d 0.00% <0.00%> (ø)
source/agora/consensus/validation/Enrollment.d 11.26% <0.00%> (-87.49%) ⬇️
source/agora/consensus/validation/Transaction.d 11.15% <0.00%> (-88.11%) ⬇️
... and 118 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d0f3b9...b717e3c. Read the comment docs.

@AndrejMitrovic
Contributor Author

I've tried testing this with the previous two fixes, which are not merged yet; however, the issue is still there.

It seems that after a node nominates, it does not begin the balloting. At least according to the logs, there are no BallotProtocol outputs.

broken: https://gist.github.com/AndrejMitrovic/fd24357035bd5ee36af808d0614ac482
fixed: https://gist.github.com/AndrejMitrovic/230db5cf228990cb79e7f153ede272ad

Here, broken means:

immutable quorums =
[
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(2, [n1, n2]),
];

and fixed means:

immutable quorums =
[
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
    QuorumConfig(3, [n0, n1, n2]),
];

I don't see a reason why the first configuration should fail.

@AndrejMitrovic
Contributor Author

Well here's a bit of news: it works fine on LDC. Sigh.

@Geod24
Collaborator

Geod24 commented Apr 2, 2020

More evidence that dropping DMD support is the future for D developers!

@AndrejMitrovic
Contributor Author

This one test-case runs ok with LDC, but further tests don't.

Also, this single test-case fails sporadically, not every time. Here's a new diff where the quorum configs are the same this time:

ok: https://gist.github.com/AndrejMitrovic/44580a1c7e31dcdf6d61383e88806d4d
fail: https://gist.github.com/AndrejMitrovic/16890ce1d146832227bf87db3e947e0b

@AndrejMitrovic
Contributor Author

I noticed one diff which seemed very relevant:

< 2020-04-03 10:22:02,408 Info [agora.network.NetworkManager] - Discovery reached. 2 peers connected.
---
> 2020-04-03 10:22:36,948 Info [agora.network.NetworkManager] - Discovery reached. 1 peers connected.

If I manually set the min_listeners to 2, then the tests always pass.

This suggests the core issue could be related to the gossiping of envelopes.
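For the record, the workaround is just raising the discovery requirement in the test config, roughly like this (I'm assuming min_listeners is settable through TestConf here; it may actually live elsewhere in the node config):

```
// assumption: min_listeners is exposed via TestConf. The point is only that
// each node waits for 2 connected peers before discovery is "reached",
// which gives envelopes a direct path to every node.
TestConf conf = { min_listeners : 2 };
auto network = makeTestNetwork(conf);
network.start();
network.waitForDiscovery();
```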

@AndrejMitrovic
Contributor Author

I tested adding gossiping right in receiveEnvelope, and it seems to fix the bug.
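For illustration, the shape of the change is roughly the following (the method and field names here are made up, not Agora's or SCP's actual API; the idea is just to re-broadcast incoming envelopes):

```
// sketch only: names are hypothetical
void receiveEnvelope (in SCPEnvelope envelope)
{
    // hand the envelope to SCP as before
    this.scp.receiveEnvelope(envelope);

    // then gossip it to all known peers, so ballot protocol messages
    // still propagate when the connection graph is only partial
    foreach (peer; this.network.peers)
        peer.sendEnvelope(envelope);
}
```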

I'll submit a fix PR soon.

@Geod24 I will need you to merge #546 once it's green on Travis.

@Geod24
Collaborator

Geod24 commented Apr 15, 2020

> @Geod24 I will need you to merge #546 once it's green on Travis.

This has been done. Anything else blocking you?

@AndrejMitrovic
Contributor Author

> I tested adding gossiping right in receiveEnvelope, and it seems to fix the bug.

It fixes one bug, but it reveals another: https://github.com/bpfkorea/agora/blob/2b787c8fde8999854347072134e0bc9c34e58399/source/scpp/src/scp/BallotProtocol.cpp#L642

I'll add my intended changes to this PR.

@AndrejMitrovic
Contributor Author

So with the first two commits, the new test added in the last commit passes with dtest=agora.test.QuorumTest dub test. But all the other tests then fail with:

  what():  moved to a bad state (ballot protocol)

I think the only option left is to try using the SCP test-suite with the same quorum config and log the messages. Maybe it will tell us which messages are missing in Agora.

That being said, I don't know what is tripping up SCP.

There are other projects using SCP, right? Do you know their names? I could take a look at how they use it.

@AndrejMitrovic
Contributor Author

I don't need this to be a PR anymore; I can test locally.

@AndrejMitrovic
Contributor Author

I just ran the same test with the system integration test-suite, and the test works.

I'll try to retrieve the logs to compare what's different when running with LocalRest vs vibe.d.

@AndrejMitrovic
Contributor Author

I changed the test to use the same quorum config for all nodes. It still seems to fail.

@AndrejMitrovic
Contributor Author

With the quorum balancer that is currently a WIP, this test-case passes.

@AndrejMitrovic
Contributor Author

This test-case works in my WIP:

///
unittest
{
    // generate 1007 blocks, 1 short of the enrollments expiring.
    TestConf conf = { extra_blocks : 1007 };
    auto network = makeTestNetwork(conf);
    network.start();
    scope(exit) network.shutdown();
    scope(failure) network.printLogs();
    network.waitForDiscovery();

    auto nodes = network.clients;

    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1007, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1007)));

    // create enrollment data
    // send a request to enroll as a Validator
    Enrollment enroll_0 = nodes[0].createEnrollmentData();
    Enrollment enroll_1 = nodes[1].createEnrollmentData();
    nodes[2].enrollValidator(enroll_0);
    nodes[3].enrollValidator(enroll_1);

    // check enrollments
    nodes.each!(node =>
        retryFor(node.getEnrollment(enroll_0.utxo_key) == enroll_0 &&
                 node.getEnrollment(enroll_1.utxo_key) == enroll_1,
            5.seconds));

    auto txs = makeChainedTransactions(getGenesisKeyPair(),
        network.blocks[$ - 1].txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    // at block height 1008 the validator set changes from 4 => 2
    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1008, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1008)));

    // these are un-enrolled now
    nodes[2 .. $].each!(node => node.sleep(2.seconds, true));

    // verify that consensus can still be reached by the leftover validators
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    nodes[0 .. 2].enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1009, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1009)));

    // wait for nodes[2 .. $] to wake up
    Thread.sleep(3.seconds);

    // now try to re-enroll the rest of the validators
    Enrollment[] enrolls;
    foreach (node; nodes[2 .. $])
    {
        enrolls ~= node.createEnrollmentData();
        nodes[0].enrollValidator(enrolls[$ - 1]);
    }

    // check enrollments
    nodes.each!(node =>
        enrolls.each!(enroll =>
            retryFor(node.getEnrollment(enroll.utxo_key) == enroll, 5.seconds)));

    // this still uses 2 nodes for reaching consensus
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1010, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1010)));

    // this should use 4 nodes
    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 1011, 2.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 1011)));

    // this should halt progress because threshold is set to max
    // commenting this out will make the assert below fire.
    nodes[$ - 1].sleep(4.seconds, true);

    txs = makeChainedTransactions(getGenesisKeyPair(), txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    try
    {
        // progress was not made, still stuck at 1011 blocks
        nodes.enumerate.each!((idx, node) =>
            retryFor!Exception(node.getBlockHeight() == 1012, 2.seconds,
                format("Node %s has block height %s. Expected: %s",
                    idx, node.getBlockHeight(), 1012)));
        assert(0);  // should not be reached
    }
    catch (Exception ex)
    {
        assert(ex.msg.canFind("has block height 1011. Expected: 1012"));
    }

    Thread.sleep(3.seconds);  // wait for thread to wake up before shutdown()
}

There are a few issues, though:

  • The excessive Thread.sleep calls
  • There should be a way to force a node to stop sleeping, rather than it being time-based as it currently is. Something like node.ctrl.sleep and node.ctrl.wakeUp would be good to have.
  • I wish we had some kind of event mechanism so we can just "wait" until a node informs us about its ledger state. This means a push mechanism to the unittest thread, rather than the poll mechanism from the unittest thread as we have now. I think this should be possible to implement in TestValidatorNode; we would just have to add some overrides (see the sketch below).
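A minimal sketch of what such a push mechanism could look like. The onAcceptedBlock() hook mentioned in the comments is hypothetical (none of these hooks exist yet), and the sketch ignores LocalRest's thread isolation; it only illustrates the waiting pattern:

```
// hypothetical sketch: the node pushes its ledger height to the test,
// instead of the test polling getBlockHeight()
import core.sync.condition : Condition;
import core.sync.mutex : Mutex;
import core.time : Duration;

class BlockHeightNotifier
{
    private Mutex mutex;
    private Condition cond;
    private ulong height;

    this ()
    {
        this.mutex = new Mutex;
        this.cond = new Condition(this.mutex);
    }

    /// called by the node, e.g. from a hypothetical onAcceptedBlock()
    /// override in TestValidatorNode
    void notify (ulong new_height)
    {
        synchronized (this.mutex)
        {
            this.height = new_height;
            this.cond.notifyAll();
        }
    }

    /// called by the unittest thread: blocks until the node reports `expected`
    void waitFor (ulong expected, Duration timeout)
    {
        synchronized (this.mutex)
        {
            while (this.height < expected)
                if (!this.cond.wait(timeout))
                    assert(0, "Timed out waiting for block height");
        }
    }
}
```

The test would then call something like notifier.waitFor(1008, 5.seconds) instead of looping with retryFor.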

@AndrejMitrovic force-pushed the quorum-tests branch 4 times, most recently from 9597567 to a98895f on July 9, 2020 09:09
@AndrejMitrovic
Contributor Author

Most CIs seem to be turning green with the new fixes. 🎉

@AndrejMitrovic
Contributor Author

The only thing that's left to do here is to rewrite and enable these tests:

https://github.com/bpfkorea/agora/blob/c06e8b534119039f0a104899262c15c976bda71c/source/agora/test/Quorum.d#L165
https://github.com/bpfkorea/agora/blob/c06e8b534119039f0a104899262c15c976bda71c/source/agora/test/Quorum.d#L186

They have to be rewritten because the genesis block will not contain more than N enrollments (we want it to be hardcoded). The rewrite should spawn the minimum N nodes, then enroll an additional 16-N and 32-N nodes, and test whether consensus can still be reached.
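Roughly, the rewrite could look like this, reusing the idioms from the test above. spawnOutsiderNodes, GenesisValidators, and the specific value of N are made-up placeholders, not existing helpers:

```
// sketch only: grow the validator set from the N hardcoded genesis
// enrollments to 16 (the 32-node variant is the same idea)
unittest
{
    enum GenesisValidators = 6;   // assumed value of N
    enum TargetValidators = 16;

    TestConf conf = { extra_blocks : 1 };
    auto network = makeTestNetwork(conf);
    network.start();
    scope(exit) network.shutdown();
    scope(failure) network.printLogs();
    network.waitForDiscovery();

    auto nodes = network.clients;

    // bring up the extra (non-genesis) nodes and enroll them
    // through an existing validator
    auto outsiders = network.spawnOutsiderNodes(TargetValidators - GenesisValidators);
    foreach (node; outsiders)
        nodes[0].enrollValidator(node.createEnrollmentData());

    // once the enrollments are externalized, the enlarged validator set
    // should still be able to reach consensus on new blocks
    auto txs = makeChainedTransactions(getGenesisKeyPair(),
        network.blocks[$ - 1].txs, 1);
    txs.each!(tx => nodes[0].putTransaction(tx));

    nodes.enumerate.each!((idx, node) =>
        retryFor(node.getBlockHeight() == 2, 5.seconds,
            format("Node %s has block height %s. Expected: %s",
                idx, node.getBlockHeight(), 2)));
}
```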

It was a bit too low for debugging purposes.

There may be more outsider nodes than the original set of nodes, so the index may be out of bounds.

The spending was too low to be able to use it for many freeze transactions.

The QuorumPreimage test had to be changed because of the way dice() works - even though the relative amount of stake is the same between nodes, the random number generator will in fact return different values.

For example:

```
dice(rnd, [10, 10, 10, 10, 10]);
```

Returns different indices compared to:

```
dice(rnd, [100, 100, 100, 100, 100]);
```

Even though the weight is the same.
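A quick way to see this with plain std.random (outside of Agora), a minimal sketch:

```
import std.random : Random, dice;
import std.stdio : writefln;

void main ()
{
    auto rnd1 = Random(42);
    auto rnd2 = Random(42);

    size_t[] a, b;
    foreach (_; 0 .. 8)
    {
        a ~= dice(rnd1, [10, 10, 10, 10, 10]);
        b ~= dice(rnd2, [100, 100, 100, 100, 100]);
    }

    // same seed and same relative weights, but the index sequences will
    // generally differ, because dice() draws from uniform(0, sum) and the
    // sums are different (50 vs 500)
    writefln("weights 10 : %s", a);
    writefln("weights 100: %s", b);
}
```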
The test with 32 nodes is semantically correct, but it will fail due to timeouts. There seems to be significant overhead in trying to simulate 32 nodes on a single machine - it's possible we haven't optimized our communication overhead yet.

Additionally, these tests were made to run last since they're taxing on the system.
@AndrejMitrovic
Contributor Author

Folded into #1086
