Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Acquire CCheckQueue's lock to avoid race condition #5721

Merged
merged 1 commit into from
Feb 6, 2015

Conversation

sdaftuar
Copy link
Member

This fixes a potential race condition in the CCheckQueueControl constructor,
which was looking directly at data in CCheckQueue without acquiring its lock.

Even though only one CCheckQueueControl exists at a time, one of the
CCheckQueue threads may have completed work but not yet updated nIdle or
released its lock, so looking at that variable without acquiring the lock first is not
safe.

Fixes #5703.

assert(pqueue->nTotal == pqueue->nIdle);
assert(pqueue->nTodo == 0);
assert(pqueue->fAllOk == true);
assert(pqueue->IsIdle());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you avoid having an effectful statement inside an assert? We're requiring NDEBUG now, but that may not remain the case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point -- fixed so that the function gets called outside the assert (commit squashed).

@wtogami
Copy link
Contributor

wtogami commented Jan 29, 2015

0.10?

@sipa
Copy link
Member

sipa commented Jan 30, 2015

Untested ACK (after squash), including 0.10.

@laanwj laanwj added this to the 0.10.0 milestone Jan 30, 2015
@laanwj laanwj added the Bug label Jan 30, 2015
@TheBlueMatt
Copy link
Contributor

utACK, after squash

@sdaftuar
Copy link
Member Author

sdaftuar commented Feb 2, 2015

Now that I've eliminated the empty commit I used to bump travis (after it initially failed for reasons I think are unrelated to my pull), travis is again showing the initial failed run. Is there a better way for me to handle this in the future?

@theuni
Copy link
Member

theuni commented Feb 2, 2015

@sdaftuar creating a new commit on top forces travis to test that commit. But when you pop it back off, it's already built that exact revision, so it doesn't bother trying again.

Rather than that, just re-commit the change in some way that generates a new commit hash. Edit the commit message somewhat, git format-patch -1 + git am, something like that. Then force-push to overwrite the old one.

But of course, the above only applies if the test failure really was a fluke!

@theuni
Copy link
Member

theuni commented Feb 2, 2015

CCheckQueue should be able to drop its friendship with CCheckQueueControl after this change (sorry CCheckQueueControl...)

That should keep something like this from happening again, since the member vars would be guarded.
utACK after that.

@ghost
Copy link

ghost commented Feb 3, 2015

utACK

The following was observed on testnet in one of the initial syncs at height 224873:

bitcoind: checkqueue.h:183: CCheckQueueControl::CCheckQueueControl(CCheckQueue*) [with T = CScriptCheck]: Assertion `pqueue->nTotal == pqueue->nIdle' failed.

@laanwj
Copy link
Member

laanwj commented Feb 3, 2015

Weird, it failed travis again.

@laanwj laanwj removed this from the 0.10.0 milestone Feb 3, 2015
@jonasschnelli
Copy link
Contributor

Travis reports during make check (a.k.a src/test/test_bitcoin):

2015-01-28 20:27:11 Unlocked: cs_Shutdown  init.cpp:141
make[2]: *** [check-local] Error 1

It looks like test_bitcoin crashes at init.cpp:141 (Shutdown())
This like is TRY_LOCK(cs_Shutdown, lockShutdown); (https://github.com/bitcoin/bitcoin/blob/master/src/init.cpp#L141)

And this pull introduces a mutex.
I tend to NACK unless this is sorted out.

This fixes a potential race condition in the CCheckQueueControl constructor,
which was looking directly at data in CCheckQueue without acquiring its lock.

Remove the now-unnecessary friendship for CCheckQueueControl
@sdaftuar
Copy link
Member Author

sdaftuar commented Feb 3, 2015

To clarify -- this pull doesn't introduce an actually new mutex, but does it acquire a lock based on the existing mutex that is already used to synchronize access to the member variables in CCheckQueue.

The travis failure was unrelated to the pull; the travis log (https://travis-ci.org/bitcoin/bitcoin/builds/48672519) shows the java comparison tool test failed with:

08:27:09 14 BitcoindComparisonTool$1.onPeerDisconnected: bitcoind node disconnected!

What appears below that is the end of the debug log from bitcoind after the comparison tool failure, which shows a clean shutdown for a bitcoind that is compiled with DEBUG_LOCKORDER and run with -debug (ie the message @jonasschnelli pasted is a normal one, not indicative of the failure).

Note also that this is not test_bitcoin which failed; the test_bitcoin unit tests passed:

$ if [ "$RUN_TESTS" = "true" ]; then make check; fi
Making check in src
make[1]: Entering directory `/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/src'
make[2]: Entering directory `/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/src'
make  check-TESTS check-local
make[3]: Entering directory `/home/travis/build/bitcoin/bitcoin/bitcoin-x86_64-unknown-linux-gnu/src'
Running 137 test cases...
*** No errors detected
PASS: test/test_bitcoin
=============
1 test passed
=============

I tested my original pull by compiling with DEBUG_LOCKORDER and doing a -reindex on testnet, and by verifying that it solved the race condition in the original issue (which I exacerbated in testing by adding a usleep(500000) before nIdle++ at checkqueue.h:101).

At any rate I just made the change suggested by @theuni (removing the friend class designation for CCheckQueueControl), which is going to bump travis again.

@theuni
Copy link
Member

theuni commented Feb 3, 2015

I've tested this a good bit locally, since although it doesn't look like it should be able to break anything, the random travis failure on the DEBUG_LOCKORDER test is rather scary.

I've re-run the comparison tool test dozens of times now, with the exact same conditions (built from depends, NO_QT=1 NO_UPNP=1 DEBUG=1, configured with --enable-glibc-back-compat CPPFLAGS=-DDEBUG_LOCKORDER).

No problems here.

Also, notice that in the Travis failure, we never even get a "bitcoind connected". Looks to me like failure really was a fluke.

@TheBlueMatt
Copy link
Contributor

I'd bet this is related to #5433 (comment) (probably the same issue), try with TheBlueMatt@e21269e merged in.

@sdaftuar
Copy link
Member Author

sdaftuar commented Feb 6, 2015

@TheBlueMatt Just to clarify is there anything more to be done on this pull? Since travis ran cleanly with the current code, I thought I'd just keep your workaround in mind if that travis issue comes up again, but not necessarily change anything now.

@laanwj laanwj merged commit cf008ac into bitcoin:master Feb 6, 2015
laanwj added a commit that referenced this pull request Feb 6, 2015
cf008ac Acquire CCheckQueue's lock to avoid race condition (Suhas Daftuar)
laanwj pushed a commit that referenced this pull request Feb 24, 2015
This fixes a potential race condition in the CCheckQueueControl constructor,
which was looking directly at data in CCheckQueue without acquiring its lock.

Remove the now-unnecessary friendship for CCheckQueueControl

Rebased-From: cf008ac
Github-Pull: #5721
reddink pushed a commit to reddcoin-project/reddcoin-3.10 that referenced this pull request May 27, 2020
This fixes a potential race condition in the CCheckQueueControl constructor,
which was looking directly at data in CCheckQueue without acquiring its lock.

Remove the now-unnecessary friendship for CCheckQueueControl

Rebased-From: cf008ac
Github-Pull: bitcoin#5721
(cherry picked from commit d148f62)
@bitcoin bitcoin locked as resolved and limited conversation to collaborators Sep 8, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

IBD Crash with v0.10rc3: checkqueue.h:183: Assertion `pqueue->nTotal == pqueue->nIdle' failed.
7 participants