Data race in tests: TestApplicationsUpgradeOverREST race #2844
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##           master    #2844   +/-   ##
=======================================
  Coverage   47.33%   47.33%
=======================================
  Files         351      351
  Lines       56523    56523
=======================================
+ Hits        26754    26756      +2
+ Misses      26760    26758      -2
  Partials     3009     3009
```
Continue to review full report at Codecov.
The expected round number is captured before the node stops. However, the node may advance to the next round before it is actually stopped; when that happens, the test fails. This change captures the most up-to-date round number after the node is stopped but before the inducePartitionTime timeout is waited out. inducePartitionTime is the wait that ensures the expected behavior is obtained, and the round number is now captured before this wait. However, I could not identify in this PR why TestBasicPartitionRecovery has failed: I could not find anything in the test, and the failure logs contain nothing. I suspect that the failures in the other tests triggered this failure, and that fixing the other tests fixed it as well, but I cannot be sure. As for the data race, it is fixed in #2844. Fixes #2384 and #2545
Based on the details you've provided, I don't believe that the proposed solution would fix the issue:
You are right. My changes addressed only the failures caused by the test, not the failures coming from the fixture.
Looks good. Make sure to fix:
Go files must be formatted with gofmt. Please run:
```
gofmt -w /home/circleci/project/test/e2e-go/upgrades/application_support_test.go
```
Force-pushed b7543c1 to 74e55f1
Looks good.
Summary
When multiple threads use the synchTest object and one of them fails, all of them call FailNow.
After the first call, the main test routine reads t.finished to terminate the test.
While the main test routine is reading t.finished, another thread calls FailNow, which in turn writes to t.finished, causing the data race.
While FailNow calls made through synchTest are guarded by a mutex, the main test routine reads t.finished through the underlying TestingTB object, which does not use the same mutex.
The solution is to have the synchTest object call FailNow only the first time.
fixes #2507
More details:
The problem behind the string of data races we are observing in different tests when a test fails is the following:
When the test fails, FailNow in go/src/testing/testing.go sets c.finished, which is the "Previous write" part of the data race. This call comes from different threads, possibly multiple ones, forked from (though not exclusively) go-algorand/nodecontrol/algodControl.go, where ExitErrorCallback reports errors through require.NoError, which, in case of an error, calls FailNow. f.t happens to be synchTest, which has a mutex guarding the FailNow call and hence prevents simultaneous access to t.finished.
However, there is one more read access to t.finished, coming from go/src/testing/testing.go itself. That read does not go through the synchTest object, and hence does not hold the mutex used for the write.
In other words, when there are multiple threads and all of them are fine, no thread writes to t.finished. When something goes wrong and FailNow is initiated, the other threads are notified of the failure and, in their turn, also report it through require.NoError, which calls FailNow and writes t.finished = true.
Test Plan