
Data race in tests: TestApplicationsUpgradeOverREST race #2844

Merged

merged 5 commits on Sep 11, 2021

Conversation

@algonautshant (Contributor)

Summary

When multiple threads use the synchTest object and one of them fails, they all call FailNow.

After the first call, the main test routine reads t.finished to terminate the test.

While the main test routine is reading t.finished, another thread calls FailNow, which in turn writes to t.finished, causing the data race.

While FailNow calls made through the synchTest object are guarded by a mutex, the main test routine goes through the underlying TestingTB object, which does not use the same mutex.

The solution is to have the synchTest object call FailNow only the first time.

fixes #2507
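
A minimal sketch of the fix, with hypothetical field names (the real synchTest lives in the test fixtures package and may differ): the mutex already serializes FailNow calls, and a guard flag ensures the underlying testing object sees only the first one.

import (
	"sync"
	"testing"
)

// synchTest wraps the test object; all failure reporting goes through it.
type synchTest struct {
	sync.Mutex
	t      testing.TB
	failed bool // hypothetical guard: set by the first FailNow
}

func (st *synchTest) FailNow() {
	st.Lock()
	defer st.Unlock() // still runs on Goexit, so the mutex is released
	if !st.failed {
		st.failed = true
		st.t.FailNow() // writes t.finished exactly once
	}
}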

More details:

The string of data race reports we are observing in different tests when a test fails boils down to the following:

When the test fails, FailNow in go/src/testing/testing.go sets c.finished, which is the "Previous write" part of the data race:

func (c *common) FailNow() {
...
	c.finished = true
...
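
For context, in that function the write is followed by runtime.Goexit(), which stops the failing goroutine but still runs its deferred calls; paraphrased:

// Paraphrase of testing.(*common).FailNow (the real source has more
// commentary and, in newer Go versions, extra bookkeeping):
func (c *common) FailNow() {
	c.Fail()
	c.finished = true // the racing write
	runtime.Goexit()  // stop this goroutine; deferred calls still run
}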

This call comes from different threads, possibly multiple ones, forked from (among other places) go-algorand/nodecontrol/algodControl.go:

func (nc *NodeController) StartAlgod(args AlgodStartArgs) (alreadyRunning bool, err error) {
...
	go func() {
...
			if args.ExitErrorCallback != nil {
				args.ExitErrorCallback(nc, err)

where ExitErrorCallback is:

func (f *LibGoalFixture) nodeExitWithError(nc *nodecontrol.NodeController, err error) {
...
		require.NoError(f.t, err, "Node at %s has terminated with an error", nc.GetDataDir())
...
}

and require.NoError, when handed a non-nil error, calls FailNow.
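
Paraphrased, require.NoError reduces to roughly the following sketch (not the actual testify source; the FailNow call is the relevant part):

func noErrorSketch(t testing.TB, err error, msgAndArgs ...interface{}) {
	if err != nil {
		t.Errorf("received unexpected error: %v %v", err, msgAndArgs)
		t.FailNow() // aborts the calling goroutine and writes t.finished
	}
}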

f.t happens to be a synchTest, which has a mutex guarding the FailNow call, and hence prevents simultaneous writes to t.finished.

However, there is one more read of t.finished, which comes from the following location; it does not go through the synchTest object, and hence does not hold the mutex that guards the write:

go/src/testing/testing.go

func tRunner(t *T, fn func(t *T)) {
...
	defer func() {
...
		if !t.finished && err == nil {

In other words, when there are multiple threads and all of them are fine, no thread writes to t.finished. When something goes wrong and FailNow is initiated, the other threads are notified of the failure and, in their turn, also fail their require.NoError calls, which call FailNow and write t.finished = true, racing with the unguarded read in tRunner.
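
Schematically, the racing interleaving looks like this (illustrative only, not code from the repo):

node-exit goroutine                       main test goroutine (tRunner)
-------------------                       -----------------------------
require.NoError(f.t, err) fails           test body returns
-> synchTest.FailNow()  (takes mutex)     deferred func in tRunner runs:
   -> TB.FailNow()                          reads !t.finished   <-- read,
      -> t.finished = true  <-- write           without the mutex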

Test Plan

@codecov-commenter commented Sep 5, 2021

Codecov Report

Merging #2844 (74e55f1) into master (b8b7281) will increase coverage by 0.00%.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #2844   +/-   ##
=======================================
  Coverage   47.33%   47.33%           
=======================================
  Files         351      351           
  Lines       56523    56523           
=======================================
+ Hits        26754    26756    +2     
+ Misses      26760    26758    -2     
  Partials     3009     3009           
Impacted Files               Coverage Δ
cmd/algoh/blockWatcher.go     77.77% <0.00%> (-3.18%) ⬇️
network/wsPeer.go             74.37% <0.00%> (-0.28%) ⬇️
catchup/service.go            69.35% <0.00%> (+0.77%) ⬆️
catchup/peerSelector.go      100.00% <0.00%> (+1.04%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tsachiherman pushed a commit that referenced this pull request Sep 7, 2021
The expected round number is captured before the node stops. However,
it is likely that the node advances to the next round before it is
stopped. When this happens, the test will fail.

This change gets the most up-to-date round number after the node is
stopped, but before waiting out the inducePartitionTime timeout.

inducePartitionTime is the wait that makes sure the expected behavior
is obtained. The round number is captured before this wait.

However, I could not identify in this PR why TestBasicPartitionRecovery has failed: there is nothing suspicious in the test, and the failure logs contain nothing useful.
I suspect that the failures in the other tests triggered this failure, and that fixing the other tests resolves it, but I cannot be sure.

As for the data race, it is fixed in #2844.

Fixes #2384 and #2545
@algonautshant changed the title from "Shant/fix synch test" to "Data race in tests: TestApplicationsUpgradeOverREST race" on Sep 7, 2021
@tsachiherman (Contributor)

Based on the details you've provided, I don't believe that the proposed solution would fix the issue:
Consider the case where a new AlgodController is being created, and it gets returned with an error after the test ends (without reporting any issue). It would still call t.Fail, which would data-race with the tRunner.
I think that we should disable all Fail calls once we've completed the test execution. We can do that using a destructor, in a similar way to SetTestContext().
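
A sketch of that suggestion, with hypothetical names (modeled on the existing SetTestContext, whose exact signature may differ): the returned destructor is deferred by the test, and once it runs, the wrapper drops any late Fail/FailNow coming from fixture goroutines. Here FailNow replaces the earlier boolean guard with a nil check on the test handle.

func (st *synchTest) SetTestContext(t testing.TB) func() {
	st.Lock()
	defer st.Unlock()
	st.t = t
	// The destructor clears the test handle; failures reported after
	// the test has completed are then silently dropped.
	return func() {
		st.Lock()
		defer st.Unlock()
		st.t = nil
	}
}

func (st *synchTest) FailNow() {
	st.Lock()
	defer st.Unlock()
	if st.t != nil {
		st.t.FailNow()
	}
}

A test would then start with something like defer fixture.SetTestContext(t)(), so goroutines that outlive the test can no longer touch the finished testing object.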

@algonautshant (Contributor, Author)

You are right. My changes addressed only failures caused by the test, but not failures coming from the fixture.
This change should address it. I still need to update the other tests with this.

@tsachiherman (Contributor) left a comment

looks good. make sure to fix

Go files must be formatted with gofmt. Please run:
  gofmt -w /home/circleci/project/test/e2e-go/upgrades/application_support_test.go

@tsachiherman (Contributor) left a comment

looks good.
