
Fix partition recovery tests #2820

Merged

Conversation

Contributor

@algonautshant algonautshant commented Aug 31, 2021

Summary

The expected round number is captured before the node stops. However,
it is likely that the node advances to the next round before it is
stopped. When this happens, the test will fail.

This change captures the most up-to-date round number after the node is
stopped, but before the inducePartitionTime wait begins.

inducePartitionTime is the wait that verifies the expected behavior is
obtained; the round number is now captured just before this wait.
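
As an illustration only (not the exact test code), here is a minimal Go sketch of the new ordering. FullStop, inducePartitionTime, LastRound, and the assertion message come from this PR; the package, interfaces, and helper function are assumed stand-ins for the fixture's real types.

package partitionsketch

import (
    "time"

    "github.com/stretchr/testify/require"
)

// nodeStatus, client, and controller are stand-ins for the fixture's real
// types; only LastRound, Status, and FullStop matter for this sketch.
type nodeStatus struct{ LastRound uint64 }

type client interface {
    Status() (nodeStatus, error)
}

type controller interface {
    FullStop() error
}

// assertNoProgressDuringPartition stops one node, captures the round only
// after the stop has completed, waits out inducePartitionTime, and then
// asserts that the surviving node is still on the same round.
func assertNoProgressDuringPartition(a *require.Assertions, survivor client, stopped controller, inducePartitionTime time.Duration) {
    a.NoError(stopped.FullStop()) // stop the node before reading the round

    status, err := survivor.Status()
    a.NoError(err)
    waitForRound := status.LastRound // most up-to-date round, read after the stop

    time.Sleep(inducePartitionTime) // the partition should prevent any progress here

    status, err = survivor.Status()
    a.NoError(err)
    a.Equal(waitForRound, status.LastRound,
        "We should not have made progress since stopping the first node")
}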

However, I could not identify in this PR why TestBasicPartitionRecovery failed: I could not find anything wrong in the test itself, and the failure logs contain nothing useful.
I suspect that the failures in the other tests triggered this one, and that fixing the other tests resolves it, but I cannot be sure.

As for the data race, it is fixed in #2844.

Fixes #2384 and #2545

Test Plan

This is a fix for a test; see the Summary above for the details of the change.

@algonautshant algonautshant self-assigned this Aug 31, 2021
@algonautshant algonautshant linked an issue Aug 31, 2021 that may be closed by this pull request
@algonautshant algonautshant changed the title from Shant/test basic partition recovery to Fix partition recovery tests Aug 31, 2021
@codecov-commenter

Codecov Report

Merging #2820 (e7fb178) into master (390edd1) will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2820      +/-   ##
==========================================
- Coverage   47.11%   47.10%   -0.02%     
==========================================
  Files         349      349              
  Lines       56424    56424              
==========================================
- Hits        26584    26577       -7     
- Misses      26864    26871       +7     
  Partials     2976     2976              
Impacted Files Coverage Δ
network/requestTracker.go 70.25% <0.00%> (-0.87%) ⬇️
catchup/service.go 68.57% <0.00%> (-0.78%) ⬇️
ledger/acctupdates.go 62.13% <0.00%> (-0.42%) ⬇️
network/wsPeer.go 74.65% <0.00%> (+0.27%) ⬆️
catchup/peerSelector.go 100.00% <0.00%> (+1.04%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 390edd1...e7fb178.

Contributor

@winder winder left a comment

Looks good, just one question about a possible minor improvement to one of the tests.

a.NoError(err)

a.Equal(waitForRound, status.LastRound, "We should not have made progress since stopping the first node")
Contributor

Was the bug that sometimes we would get to waitForRound+1, so waitForRound did not equal status.LastRound?

It seems like there is still a race condition between resuming node 1 and shutting down node 2.

Could we move fixture.StartNode(nc1.GetDataDir()) to after nc2.FullStop() and before time.Sleep(20 * time.Second)?
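
For reference, a sketch of that reordering; the calls are the ones named above, while fixtureLike and nodeController are assumed stand-ins for the real fixture and node-controller types.

package partitionsketch

import "time"

// Stand-in types: only the calls mentioned in the comment above are modeled.
type nodeController interface {
    GetDataDir() string
    FullStop() error
}

type fixtureLike interface {
    StartNode(dataDir string) error
}

// staggeredStopStartSuggested stops nc2 before restarting nc1, so the two
// nodes never run at the same time during the window in which the network
// is expected to be stalled.
func staggeredStopStartSuggested(fixture fixtureLike, nc1, nc2 nodeController) error {
    if err := nc2.FullStop(); err != nil { // stop the second node first
        return err
    }
    if err := fixture.StartNode(nc1.GetDataDir()); err != nil { // then bring the first node back
        return err
    }
    time.Sleep(20 * time.Second) // observe that no progress is made during the stall
    return nil
}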

Contributor

Maybe that race condition is the point of the test? I am not sure exactly what is intended by the name runTestWithStaggeredStopStart; the comments in the other two tests you changed make them a bit easier to review. Maybe you could add a similar comment to this test?

Contributor Author

The intent of the test is to make sure that after stopping Node1, the network does not make any progress.

If the test proceeds in a timely manner, the node will be before round 3 when ClientWaitForRoundWithTimeout is called, and will be at round 3 when Node1 is stopped.
If this happens, there is no problem.

However, if the test thread is slow, nothing prevents Node1 from getting all the way to, say, round 5 by the time it is stopped. This may happen before ClientWaitForRoundWithTimeout, or after ClientWaitForRoundWithTimeout and before FullStop. This can be reproduced by adding a 5-second sleep at any of these points.

By reading the round number after FullStop, we get the round the network was at after Node1 stopped. It might be read a little after the stop, but that is okay, as long as we make sure the network is not making any progress during the inducePartitionTime sleep.
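
To make the failure mode concrete, here is a sketch of the old, racy ordering with the 5-second delay injected; testNet is an assumed stand-in bundling the calls mentioned above (ClientWaitForRoundWithTimeout, FullStop, and reading status.LastRound).

package partitionsketch

import (
    "time"

    "github.com/stretchr/testify/require"
)

// testNet is a stand-in for the real fixture/client/controller trio.
type testNet interface {
    ClientWaitForRoundWithTimeout(round uint64) error // wait until the network reaches round
    StopNode1() error                                 // models nc1.FullStop()
    LastRound() (uint64, error)                       // models reading status.LastRound
}

// oldOrderingWithDelay fixes waitForRound before Node1 is stopped; the
// injected delay lets the network advance past it, so the final check can fail.
func oldOrderingWithDelay(a *require.Assertions, tn testNet, waitForRound uint64) {
    a.NoError(tn.ClientWaitForRoundWithTimeout(waitForRound)) // at (or near) waitForRound here
    time.Sleep(5 * time.Second)                               // injected delay: rounds keep advancing
    a.NoError(tn.StopNode1())                                 // network may now be past waitForRound

    // (the real test also waits inducePartitionTime before the check below)
    last, err := tn.LastRound()
    a.NoError(err)
    a.Equal(waitForRound, last, // this is the assertion that used to flake
        "We should not have made progress since stopping the first node")
}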

Contributor

Thanks for the explanation. I think there is still a race condition; here is what the test does:

n1 - FullStop
roundAfterStop
n1.Start
                        <--- race condition here, n1 and n2 are running
n2.FullStop
roundAfterStall

Contributor Author

In theory, yes, you are correct. However, this is unlikely to happen: at this point the network is stalled, and it will take a long time for it to recover enough to make any progress. By the time that could happen, n2 is already stopped.

In the previous situation, by contrast, the round could change even within a fraction of a second.

Contributor

I'll defer on whether or not this point is blocking for the PR. It seems like we may as well stop n2 before starting n1, but maybe it isn't a problem.

Contributor

@winder winder left a comment

LGTM, just one minor complaint, but it sounds like it may not be a blocker.

@tsachiherman tsachiherman merged commit cd832b3 into algorand:master Sep 7, 2021

Successfully merging this pull request may close these issues.

Flaky test: TestBasicPartitionRecoveryPartOffline
Broken test: TestBasicPartitionRecovery