fix deadlock in subscription teardown
Summary:
In speaking with jbower earlier today, he mentioned that he had
noticed a correlation between this fault and the webserver crashing.
Since the webserver may have multiple subscriptions, that gave me
the idea for an integration test to try to reproduce the problem.

The issue is a race condition: if multiple client sessions are
terminated at the same time, it is possible for the `~Subscriber`
destructor to end up holding the final reference to some other
`Subscriber` instance, recurse into that instance's `~Subscriber`,
and deadlock while attempting to re-acquire the wlock that is
already held by the current thread.

Reviewed By: chadaustin

Differential Revision: D7178797

fbshipit-source-id: 540728d76ade29ea0bcf4716a71b7055966e401e
wez authored and facebook-github-bot committed Mar 7, 2018
1 parent 620ee5a commit 5dfcdd9
Showing 3 changed files with 48 additions and 24 deletions.
41 changes: 29 additions & 12 deletions PubSub.cpp
@@ -16,20 +16,37 @@ Publisher::Subscriber::Subscriber(
       info_(std::move(info)) {}
 
 Publisher::Subscriber::~Subscriber() {
-  auto wlock = publisher_->state_.wlock();
-  auto it = wlock->subscribers.begin();
-  while (it != wlock->subscribers.end()) {
-    auto sub = it->lock();
-    // Prune vacated weak_ptr's or those that point to us
-    if (!sub || sub.get() == this) {
-      it = wlock->subscribers.erase(it);
-    } else {
-      ++it;
+  // In the loop below we may own a reference to some other
+  // Subscriber instance. That is fine, but we need to take
+  // care: if we end up with the last reference to that subscriber
+  // we will end up calling its destructor and effectively recurse
+  // and attempt to acquire the wlock. We therefore need to
+  // defer releasing any of the shared_ptr's that we lock in
+  // the loop below until after we have released the wlock.
+  std::vector<std::shared_ptr<Subscriber>> subscribers;
+
+  {
+    auto wlock = publisher_->state_.wlock();
+    auto it = wlock->subscribers.begin();
+    while (it != wlock->subscribers.end()) {
+      auto sub = it->lock();
+      // Prune vacated weak_ptr's or those that point to us
+      if (!sub || sub.get() == this) {
+        it = wlock->subscribers.erase(it);
+      } else {
+        ++it;
+        // Defer releasing the sub reference until after we've
+        // released the wlock!
+        subscribers.emplace_back(std::move(sub));
+      }
     }
+    // Take this opportunity to reap anything that is no longer
+    // referenced now that we've removed some subscriber(s)
+    wlock->collectGarbage();
   }
-  // Take this opportunity to reap anything that is no longer
-  // referenced now that we've removed some subscriber(s)
-  wlock->collectGarbage();
+
+  // It is now safe for subscribers to be torn down and release
+  // any references we took ownership of in the loop above.
 }
 
 void Publisher::Subscriber::getPending(
6 changes: 3 additions & 3 deletions tests/integration/WatchmanTestCase.py
@@ -90,8 +90,8 @@ def setUp(self):
     def tearDown(self):
         self.__clearClient()
 
-    def getClient(self, inst=None, replace_cached=False):
-        if inst or not hasattr(self, 'client'):
+    def getClient(self, inst=None, replace_cached=False, no_cache=False):
+        if inst or not hasattr(self, 'client') or no_cache:
             client = pywatchman.client(
                 # ASAN-enabled builds can be slow enough that we hit timeouts
                 # with the default of 1 second
@@ -101,7 +101,7 @@ def getClient(self, inst=None, replace_cached=False):
                 recvEncoding=self.encoding,
                 sockpath=(inst or
                           WatchmanInstance.getSharedInstance()).getSockPath())
-        if not inst or replace_cached:
+        if (not inst or replace_cached) and not no_cache:
             # only cache the client if it points to the shared instance
             self.client = client
         return client
25 changes: 16 additions & 9 deletions tests/integration/test_subscribe.py
@@ -572,19 +572,26 @@ def test_flush_subscriptions(self):
     def test_unsub_deadlock(self):
         ''' I saw a stack trace of a lock assertion that seemed to originate
         in the unsubByName() method. It looks possible for this to call
-        itself recursively and this test was intended to try to tickle this,
-        but I was unable to get the deadlock check to trigger :-/ '''
+        itself recursively and this test exercises that code path. It
+        also exercises a similar deadlock where multiple subscriptions from
+        multiple connections are torn down around the same time. '''
         root = self.mkdtemp()
         self.watchmanCommand('watch', root)
         clock = self.watchmanCommand('clock', root)['clock']
-        for _ in range(0, 100):
-            self.watchmanCommand(
-                'subscribe', root, 'sub1', {
-                    'fields': ['name'],
-                    'since': clock
-                }
-            )
-            self.watchmanCommand('unsubscribe', root, 'sub1')
+        clients = []
+        for i in range(0, 20):
+            client = self.getClient(no_cache=True)
+            client.query(
+                'subscribe', root, 'sub%s' % i, {
+                    'fields': ['name'],
+                    'since': clock
+                }
+            )
+            self.touchRelative(root, 'a')
+            clients.append(client)
+        for client in clients:
+            client.close()
 
 
     def test_subscription_cleanup(self):