
Data race in tests: TestApplicationsUpgradeOverREST race #2844

Merged

merged 5 commits on Sep 11, 2021

Conversation

@algonautshant (Contributor)

Summary

When multiple threads use the synchTest object and one of them fails, they all call FailNow.

After the first call, the main test routine reads t.finished to terminate the test.

While the main test routine is reading t.finished, another thread calls FailNow, which in turn writes to t.finished, causing the data race.

While FailNow calls made through the synchTest object are guarded by a mutex, the main test routine goes through the underlying TestingTB object, which does not use the same mutex.

The solution is to have the synchTest object call FailNow only the first time.

fixes #2507
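
A minimal sketch of the fix, with hypothetical field names (the real synchTest lives in the test fixtures package and may differ): the mutex already serializes FailNow calls, and a guard flag ensures the underlying testing object sees only the first one.

import (
	"sync"
	"testing"
)

// synchTest wraps the test object; all failure reporting goes through it.
type synchTest struct {
	sync.Mutex
	t      testing.TB
	failed bool // hypothetical guard: set by the first FailNow
}

func (st *synchTest) FailNow() {
	st.Lock()
	defer st.Unlock() // still runs on Goexit, so the mutex is released
	if !st.failed {
		st.failed = true
		st.t.FailNow() // writes t.finished exactly once
	}
}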

More details:

The string of data race reports we are observing in different tests when a test fails boils down to the following:

When the test fails, FailNow in go/src/testing/testing.go sets c.finished, which is the "Previous write" part of the data race:

func (c *common) FailNow() {
...
	c.finished = true
...
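
For context, in that function the write is followed by runtime.Goexit(), which stops the failing goroutine but still runs its deferred calls; paraphrased:

// Paraphrase of testing.(*common).FailNow (the real source has more
// commentary and, in newer Go versions, extra bookkeeping):
func (c *common) FailNow() {
	c.Fail()
	c.finished = true // the racing write
	runtime.Goexit()  // stop this goroutine; deferred calls still run
}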

This call comes from different threads, possibly multiple ones, forked from (among other places) go-algorand/nodecontrol/algodControl.go:

func (nc *NodeController) StartAlgod(args AlgodStartArgs) (alreadyRunning bool, err error) {
...
	go func() {
...
			if args.ExitErrorCallback != nil {
				args.ExitErrorCallback(nc, err)

where ExitErrorCallback is:

func (f *LibGoalFixture) nodeExitWithError(nc *nodecontrol.NodeController, err error) {
...
		require.NoError(f.t, err, "Node at %s has terminated with an error", nc.GetDataDir())
...
}

and require.NoError, when handed a non-nil error, calls FailNow.
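
Paraphrased, require.NoError reduces to roughly the following sketch (not the actual testify source; the FailNow call is the relevant part):

func noErrorSketch(t testing.TB, err error, msgAndArgs ...interface{}) {
	if err != nil {
		t.Errorf("received unexpected error: %v %v", err, msgAndArgs)
		t.FailNow() // aborts the calling goroutine and writes t.finished
	}
}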

f.t happens to be a synchTest, which has a mutex guarding the FailNow call, and hence prevents simultaneous writes to t.finished.

However, there is one more read of t.finished, which comes from the following location; it does not go through the synchTest object, and hence does not hold the mutex that guards the write:

go/src/testing/testing.go

func tRunner(t *T, fn func(t *T)) {
...
	defer func() {
...
		if !t.finished && err == nil {

In other words, when there are multiple threads and all of them are fine, no thread writes to t.finished. When something goes wrong and FailNow is initiated, the other threads are notified of the failure and, in their turn, also fail their require.NoError calls, which call FailNow and write t.finished = true, racing with the unguarded read in tRunner.
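
Schematically, the racing interleaving looks like this (illustrative only, not code from the repo):

node-exit goroutine                       main test goroutine (tRunner)
-------------------                       -----------------------------
require.NoError(f.t, err) fails           test body returns
-> synchTest.FailNow()  (takes mutex)     deferred func in tRunner runs:
   -> TB.FailNow()                          reads !t.finished   <-- read,
      -> t.finished = true  <-- write           without the mutex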

Test Plan

@codecov-commenter commented Sep 5, 2021

Codecov Report

Merging #2844 (74e55f1) into master (b8b7281) will increase coverage by 0.00%.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master    #2844   +/-   ##
=======================================
  Coverage   47.33%   47.33%           
=======================================
  Files         351      351           
  Lines       56523    56523           
=======================================
+ Hits        26754    26756    +2     
+ Misses      26760    26758    -2     
  Partials     3009     3009           
Impacted Files               Coverage Δ
cmd/algoh/blockWatcher.go     77.77% <0.00%> (-3.18%) ⬇️
network/wsPeer.go             74.37% <0.00%> (-0.28%) ⬇️
catchup/service.go            69.35% <0.00%> (+0.77%) ⬆️
catchup/peerSelector.go      100.00% <0.00%> (+1.04%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

tsachiherman pushed a commit that referenced this pull request Sep 7, 2021
The expected round number is captured before the node stops. However,
it is likely that the node advances to the next round before it is
stopped. When this happens, the test will fail.

This change gets the most up-to-date round number after the node is
stopped, but before waiting out the inducePartitionTime timeout.

inducePartitionTime is the wait that makes sure the expected behavior
is obtained. The round number is captured before this wait.

However, I could not identify in this PR why TestBasicPartitionRecovery has failed: there is nothing suspicious in the test, and the failure logs contain nothing useful.
I suspect that the failures in the other tests triggered this failure, and that fixing the other tests resolves it, but I cannot be sure.

As for the data race, it is fixed in #2844.

Fixes #2384 and #2545
@algonautshant changed the title from "Shant/fix synch test" to "Data race in tests: TestApplicationsUpgradeOverREST race" on Sep 7, 2021
@tsachiherman (Contributor)

Based on the details you've provided, I don't believe that the proposed solution would fix the issue:
Consider the case where a new AlgodController is being created, and it gets returned with an error after the test ends (without reporting any issue). It would still call t.Fail, which would data-race with the tRunner.
I think that we should disable all Fail calls once we've completed the test execution. We can do that using a destructor, in a similar way to SetTestContext().
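
A sketch of that suggestion, with hypothetical names (modeled on the existing SetTestContext, whose exact signature may differ): the returned destructor is deferred by the test, and once it runs, the wrapper drops any late Fail/FailNow coming from fixture goroutines. Here FailNow replaces the earlier boolean guard with a nil check on the test handle.

func (st *synchTest) SetTestContext(t testing.TB) func() {
	st.Lock()
	defer st.Unlock()
	st.t = t
	// The destructor clears the test handle; failures reported after
	// the test has completed are then silently dropped.
	return func() {
		st.Lock()
		defer st.Unlock()
		st.t = nil
	}
}

func (st *synchTest) FailNow() {
	st.Lock()
	defer st.Unlock()
	if st.t != nil {
		st.t.FailNow()
	}
}

A test would then start with something like defer fixture.SetTestContext(t)(), so goroutines that outlive the test can no longer touch the finished testing object.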

@algonautshant (Contributor, Author)

You are right. My changes addressed only failures caused by the test, but not failures coming from the fixture.
This change should address it. I still need to update the other tests with this.

@tsachiherman (Contributor) left a comment

looks good. make sure to fix

Go files must be formatted with gofmt. Please run:
  gofmt -w /home/circleci/project/test/e2e-go/upgrades/application_support_test.go

@tsachiherman (Contributor) left a comment

looks good.
