-
Notifications
You must be signed in to change notification settings - Fork 35.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Acquire CCheckQueue's lock to avoid race condition #5721
Conversation
assert(pqueue->nTotal == pqueue->nIdle); | ||
assert(pqueue->nTodo == 0); | ||
assert(pqueue->fAllOk == true); | ||
assert(pqueue->IsIdle()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you avoid having an effectful statement inside an assert? We're requiring NDEBUG now, but that may not remain the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point -- fixed so that the function gets called outside the assert (commit squashed).
54e9638
to
425eb1c
Compare
0.10? |
Untested ACK (after squash), including 0.10. |
utACK, after squash |
ea2a009
to
425eb1c
Compare
Now that I've eliminated the empty commit I used to bump travis (after it initially failed for reasons I think are unrelated to my pull), travis is again showing the initial failed run. Is there a better way for me to handle this in the future? |
@sdaftuar creating a new commit on top forces travis to test that commit. But when you pop it back off, it's already built that exact revision, so it doesn't bother trying again. Rather than that, just re-commit the change in some way that generates a new commit hash. Edit the commit message somewhat, git format-patch -1 + git am, something like that. Then force-push to overwrite the old one. But of course, the above only applies if the test failure really was a fluke! |
CCheckQueue should be able to drop its friendship with CCheckQueueControl after this change (sorry CCheckQueueControl...) That should keep something like this from happening again, since the member vars would be guarded. |
utACK The following was observed on testnet in one of the initial syncs at height 224873: bitcoind: checkqueue.h:183: CCheckQueueControl::CCheckQueueControl(CCheckQueue*) [with T = CScriptCheck]: Assertion `pqueue->nTotal == pqueue->nIdle' failed. |
Weird, it failed travis again. |
Travis reports during
It looks like test_bitcoin crashes at And this pull introduces a mutex. |
This fixes a potential race condition in the CCheckQueueControl constructor, which was looking directly at data in CCheckQueue without acquiring its lock. Remove the now-unnecessary friendship for CCheckQueueControl
425eb1c
to
cf008ac
Compare
To clarify -- this pull doesn't introduce an actually new mutex, but does it acquire a lock based on the existing mutex that is already used to synchronize access to the member variables in CCheckQueue. The travis failure was unrelated to the pull; the travis log (https://travis-ci.org/bitcoin/bitcoin/builds/48672519) shows the java comparison tool test failed with:
What appears below that is the end of the debug log from bitcoind after the comparison tool failure, which shows a clean shutdown for a bitcoind that is compiled with DEBUG_LOCKORDER and run with -debug (ie the message @jonasschnelli pasted is a normal one, not indicative of the failure). Note also that this is not test_bitcoin which failed; the test_bitcoin unit tests passed:
I tested my original pull by compiling with DEBUG_LOCKORDER and doing a -reindex on testnet, and by verifying that it solved the race condition in the original issue (which I exacerbated in testing by adding a usleep(500000) before nIdle++ at At any rate I just made the change suggested by @theuni (removing the friend class designation for CCheckQueueControl), which is going to bump travis again. |
I've tested this a good bit locally, since although it doesn't look like it should be able to break anything, the random travis failure on the DEBUG_LOCKORDER test is rather scary. I've re-run the comparison tool test dozens of times now, with the exact same conditions (built from depends, NO_QT=1 NO_UPNP=1 DEBUG=1, configured with --enable-glibc-back-compat CPPFLAGS=-DDEBUG_LOCKORDER). No problems here. Also, notice that in the Travis failure, we never even get a "bitcoind connected". Looks to me like failure really was a fluke. |
I'd bet this is related to #5433 (comment) (probably the same issue), try with TheBlueMatt@e21269e merged in. |
@TheBlueMatt Just to clarify is there anything more to be done on this pull? Since travis ran cleanly with the current code, I thought I'd just keep your workaround in mind if that travis issue comes up again, but not necessarily change anything now. |
cf008ac Acquire CCheckQueue's lock to avoid race condition (Suhas Daftuar)
This fixes a potential race condition in the CCheckQueueControl constructor, which was looking directly at data in CCheckQueue without acquiring its lock. Remove the now-unnecessary friendship for CCheckQueueControl Rebased-From: cf008ac Github-Pull: bitcoin#5721 (cherry picked from commit d148f62)
This fixes a potential race condition in the CCheckQueueControl constructor,
which was looking directly at data in CCheckQueue without acquiring its lock.
Even though only one CCheckQueueControl exists at a time, one of the
CCheckQueue threads may have completed work but not yet updated nIdle or
released its lock, so looking at that variable without acquiring the lock first is not
safe.
Fixes #5703.