
Fix partition recovery tests #2820

Merged

Conversation

Contributor

@algonautshant algonautshant commented Aug 31, 2021

Summary

The expected round number is captured before the node stops. However,
it is likely that the node advances to the next round before it is
stopped. When this happens, the test will fail.

This change captures the most up-to-date round number after the node is
stopped, but before the inducePartitionTime wait begins.

inducePartitionTime is the wait that verifies the expected behavior is
obtained; the round number is now captured just before this wait.
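
As an illustration only (not the exact test code), here is a minimal Go sketch of the new ordering. FullStop, inducePartitionTime, LastRound, and the assertion message come from this PR; the package, interfaces, and helper function are assumed stand-ins for the fixture's real types.

package partitionsketch

import (
    "time"

    "github.com/stretchr/testify/require"
)

// nodeStatus, client, and controller are stand-ins for the fixture's real
// types; only LastRound, Status, and FullStop matter for this sketch.
type nodeStatus struct{ LastRound uint64 }

type client interface {
    Status() (nodeStatus, error)
}

type controller interface {
    FullStop() error
}

// assertNoProgressDuringPartition stops one node, captures the round only
// after the stop has completed, waits out inducePartitionTime, and then
// asserts that the surviving node is still on the same round.
func assertNoProgressDuringPartition(a *require.Assertions, survivor client, stopped controller, inducePartitionTime time.Duration) {
    a.NoError(stopped.FullStop()) // stop the node before reading the round

    status, err := survivor.Status()
    a.NoError(err)
    waitForRound := status.LastRound // most up-to-date round, read after the stop

    time.Sleep(inducePartitionTime) // the partition should prevent any progress here

    status, err = survivor.Status()
    a.NoError(err)
    a.Equal(waitForRound, status.LastRound,
        "We should not have made progress since stopping the first node")
}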

However, I could not identify in this PR why TestBasicPartitionRecovery failed: I could not find anything wrong in the test itself, and the failure logs contain nothing useful.
I suspect that the failures in the other tests triggered this one, and that fixing the other tests resolves it, but I cannot be sure.

As for the data race, it is fixed in #2844.

Fixes #2384 and #2545

Test Plan

This is a fix for a test; see the Summary above for the details of the change.

@algonautshant algonautshant self-assigned this Aug 31, 2021
@algonautshant algonautshant linked an issue Aug 31, 2021 that may be closed by this pull request
@algonautshant algonautshant changed the title from Shant/test basic partition recovery to Fix partition recovery tests Aug 31, 2021
@codecov-commenter

Codecov Report

Merging #2820 (e7fb178) into master (390edd1) will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2820      +/-   ##
==========================================
- Coverage   47.11%   47.10%   -0.02%     
==========================================
  Files         349      349              
  Lines       56424    56424              
==========================================
- Hits        26584    26577       -7     
- Misses      26864    26871       +7     
  Partials     2976     2976              
Impacted Files Coverage Δ
network/requestTracker.go 70.25% <0.00%> (-0.87%) ⬇️
catchup/service.go 68.57% <0.00%> (-0.78%) ⬇️
ledger/acctupdates.go 62.13% <0.00%> (-0.42%) ⬇️
network/wsPeer.go 74.65% <0.00%> (+0.27%) ⬆️
catchup/peerSelector.go 100.00% <0.00%> (+1.04%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 390edd1...e7fb178.

Contributor

@winder winder left a comment

Looks good, just one question about a possible minor improvement to one of the tests.

a.NoError(err)

a.Equal(waitForRound, status.LastRound, "We should not have made progress since stopping the first node")
Contributor

Was the bug that sometimes we would get to waitForRound+1, so waitForRound did not equal status.LastRound?

It seems like there is still a race condition between resuming node 1 and shutting down node 2.

Could we move fixture.StartNode(nc1.GetDataDir()) to after nc2.FullStop() and before time.Sleep(20 * time.Second)?
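
For reference, a sketch of that reordering; the calls are the ones named above, while fixtureLike and nodeController are assumed stand-ins for the real fixture and node-controller types.

package partitionsketch

import "time"

// Stand-in types: only the calls mentioned in the comment above are modeled.
type nodeController interface {
    GetDataDir() string
    FullStop() error
}

type fixtureLike interface {
    StartNode(dataDir string) error
}

// staggeredStopStartSuggested stops nc2 before restarting nc1, so the two
// nodes never run at the same time during the window in which the network
// is expected to be stalled.
func staggeredStopStartSuggested(fixture fixtureLike, nc1, nc2 nodeController) error {
    if err := nc2.FullStop(); err != nil { // stop the second node first
        return err
    }
    if err := fixture.StartNode(nc1.GetDataDir()); err != nil { // then bring the first node back
        return err
    }
    time.Sleep(20 * time.Second) // observe that no progress is made during the stall
    return nil
}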

Contributor

Maybe that race condition is the point of the test? I am not sure exactly what is intended by the name runTestWithStaggeredStopStart; the comments in the other two tests you changed make them a bit easier to review. Maybe you could add a similar comment to this test?

Contributor Author

The intent of the test is to make sure that after stopping Node1, the network does not make any progress.

If the test proceeds in a timely manner, the node will be before round 3 when ClientWaitForRoundWithTimeout is called, and will be at round 3 when Node1 is stopped.
If this happens, there is no problem.

However, if the test thread is slow, nothing prevents Node1 from getting all the way to, say, round 5 by the time it is stopped. This may happen before ClientWaitForRoundWithTimeout, or after ClientWaitForRoundWithTimeout and before FullStop. This can be reproduced by adding a 5-second sleep at any of these points.

By reading the round number after FullStop, we get the round the network was at after Node1 stopped. It might be read a little after the stop, but that is okay, as long as we make sure the network is not making any progress during the inducePartitionTime sleep.
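
To make the failure mode concrete, here is a sketch of the old, racy ordering with the 5-second delay injected; testNet is an assumed stand-in bundling the calls mentioned above (ClientWaitForRoundWithTimeout, FullStop, and reading status.LastRound).

package partitionsketch

import (
    "time"

    "github.com/stretchr/testify/require"
)

// testNet is a stand-in for the real fixture/client/controller trio.
type testNet interface {
    ClientWaitForRoundWithTimeout(round uint64) error // wait until the network reaches round
    StopNode1() error                                 // models nc1.FullStop()
    LastRound() (uint64, error)                       // models reading status.LastRound
}

// oldOrderingWithDelay fixes waitForRound before Node1 is stopped; the
// injected delay lets the network advance past it, so the final check can fail.
func oldOrderingWithDelay(a *require.Assertions, tn testNet, waitForRound uint64) {
    a.NoError(tn.ClientWaitForRoundWithTimeout(waitForRound)) // at (or near) waitForRound here
    time.Sleep(5 * time.Second)                               // injected delay: rounds keep advancing
    a.NoError(tn.StopNode1())                                 // network may now be past waitForRound

    // (the real test also waits inducePartitionTime before the check below)
    last, err := tn.LastRound()
    a.NoError(err)
    a.Equal(waitForRound, last, // this is the assertion that used to flake
        "We should not have made progress since stopping the first node")
}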

Contributor

Thanks for the explanation. I think there is still a race condition; here is what the test does:

n1 - FullStop
roundAfterStop
n1.Start
                        <--- race condition here, n1 and n2 are running
n2.FullStop
roundAfterStall

Contributor Author

In theory, yes, you are correct. However, this is unlikely to happen: at this point the network is stalled, and it will take a long time for it to recover enough to make any progress. By the time that could happen, n2 is already stopped.

In the previous situation, by contrast, the round could change even within a fraction of a second.

Contributor

I'll defer on whether or not this point is blocking for the PR. It seems like we may as well stop n2 before starting n1, but maybe it isn't a problem.

Contributor

@winder winder left a comment

LGTM, just one minor complaint, but it sounds like it may not be a blocker.

@tsachiherman tsachiherman merged commit cd832b3 into algorand:master Sep 7, 2021

Successfully merging this pull request may close these issues.

Flaky test: TestBasicPartitionRecoveryPartOffline
Broken test: TestBasicPartitionRecovery