Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing a flaky test #924

Merged
merged 6 commits into from
Mar 9, 2022
Merged

Fixing a flaky test #924

merged 6 commits into from
Mar 9, 2022

Conversation

AnomalRoil
Copy link
Member

This should fix one case where the test TestRunDKGReshareTimeout would fail because its daemon weren't yet properly running while the test would carry on.

Here is the typical output one would get:

    util_test.go:266: [RunDKG] Leader obtain group
    util_test.go:270: [RunDKG] Leader    Finished. GroupHash 627ec61f3a2d690bf9f610960c84fb12d26cbfe26cd38fe89c1cea7dbfa440b5
2022-03-03T11:32:50.879+0100	INFO	dkg/dkg.go:1084	[dkg-log [dkg-step newResp fast moving to justifications phase got 3 resps]]
2022-03-03T11:32:50.879+0100	INFO	dkg/dkg.go:1084	[dkg-log msg DKG successful]
2022-03-03T11:32:50.879+0100	INFO	http/server.go:129	New beacon handler registered	{"server": "http", "chainHash": "b984233e2f884eb5af0260700dbb57e5abcdb14c0642a8629671d4e54a4b3193"}
2022-03-03T11:32:50.879+0100	INFO	http/server.go:154	New default beacon handler registered	{"server": "http"}
2022-03-03T11:32:50.879+0100	INFO	core/drand_beacon_control.go:392		{"beacon_id": "", "init_dkg": "dkg_done", "starting_beacon_time": 449884808, "now": 449884800}
2022-03-03T11:32:50.880+0100	INFO	dkg/dkg.go:1084	[dkg-log [dkg-step newResp fast moving to justifications phase got 3 resps]]
2022-03-03T11:32:50.880+0100	INFO	dkg/dkg.go:1084	[dkg-log msg DKG successful]
2022-03-03T11:32:50.880+0100	INFO	http/server.go:129	New beacon handler registered	{"server": "http", "chainHash": "b984233e2f884eb5af0260700dbb57e5abcdb14c0642a8629671d4e54a4b3193"}
2022-03-03T11:32:50.880+0100	INFO	http/server.go:154	New default beacon handler registered	{"server": "http"}
    util_test.go:288: [RunDKG] NonLeader Finished. GroupHash 627ec61f3a2d690bf9f610960c84fb12d26cbfe26cd38fe89c1cea7dbfa440b5
2022-03-03T11:32:50.880+0100	INFO	core/drand_beacon_control.go:392		{"beacon_id": "", "init_dkg": "dkg_done", "starting_beacon_time": 449884808, "now": 449884800}
    util_test.go:288: [RunDKG] NonLeader Finished. GroupHash 627ec61f3a2d690bf9f610960c84fb12d26cbfe26cd38fe89c1cea7dbfa440b5
    util_test.go:295: [RunDKG] Leader 127.0.0.1:34917 FINISHED
2022-03-03T11:32:50.918+0100	INFO	core/drand_beacon.go:192		{"beacon_id": "", "beacon_start": "2022-03-03T11:32:50.918+0100", "catchup": false}
2022-03-03T11:32:50.918+0100	INFO	beacon/node.go:181		{"beacon_id": "", "beacon": "start"}
    util_test.go:300: [-------] Leader 127.0.0.1:34917 FINISHED
    util_test.go:315: [RunDKG] READY!
2022-03-03T11:32:50.919+0100	INFO	core/drand_beacon.go:192		{"beacon_id": "", "beacon_start": "2022-03-03T11:32:50.919+0100", "catchup": false}
    util_test.go:412: Set genesis time: 449884808
2022-03-03T11:32:50.919+0100	INFO	beacon/node.go:181		{"beacon_id": "", "beacon": "start"}
    util_test.go:514: node 127.0.0.1:34917 has its beacon chain running
    drand_test.go:331: Check Beacon Length
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
2022-03-03T11:32:50.921+0100	INFO	core/drand_beacon.go:192		{"beacon_id": "", "beacon_start": "2022-03-03T11:32:50.921+0100", "catchup": false}
2022-03-03T11:32:50.921+0100	INFO	beacon/node.go:181		{"beacon_id": "", "beacon": "start"}
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
    util_test.go:498: node 127.0.0.1:34917 is on 0 round (vs expected 1), waiting some time to ask again...
    util_test.go:431: 
        	Error Trace:	util_test.go:431
        	            				drand_test.go:332
        	Error:      	Received unexpected error:
        	            	timeout waiting node 127.0.0.1:34917 to reach 1 round
        	Test:       	TestRunDKGReshareTimeout
--- FAIL: TestRunDKGReshareTimeout (10.40s)

This is now patched by properly waiting on the Status of each Daemon to be "isRunning".
We could also be checking if they are "isStarted", but it's not exactly at the same place in the code path and we need to clarify if we really need so many states or if we could do with just one state replacing both of these.

Notice this does not appear to be the same issue as in #922 sadly, but I've added a comment about a potentially unintended AdvanceMockClock that could maybe be related. I'll have to come back to studying if this test is really properly working later on.

Currently this is just patching the flaky CI, hopefully.

@willscott
Copy link
Member

fix lint issue, otherwise looks good

core/util_test.go Outdated Show resolved Hide resolved
core/util_test.go Outdated Show resolved Hide resolved
var started bool
for trial < 5 {
for trial := 0; trial < 5; trial += 1 {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to have triggered some extra flakiness in our CI and tests!
Now the "TestLocalOrchestration" test is failing almost always.

We were previously never incrementing the "trial" counter in the loop, so it would keep running non-stop without failing, either until the daemon recovered, or would just "let the other nodes" do their jobs and it used to work thanks to the fact we can support some failure as long as we reached our threshold...

cc @nikkolasg @willscott

😢

@AnomalRoil AnomalRoil changed the title Fixing flaky test Fixing a flaky test Mar 9, 2022
@AnomalRoil AnomalRoil merged commit 8413033 into master Mar 9, 2022
@AnomalRoil AnomalRoil deleted the log/flaky branch March 9, 2022 15:10
@AnomalRoil AnomalRoil added the testing Area: Testing label Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
testing Area: Testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants