Skip to content

Commit

Permalink
MB-43546: Fix race between FollyExecutorPool::cancel & snooze
Browse files Browse the repository at this point in the history
There is a potential race in FollyExecutorPool between the cancel()
and snooze() methods.

Unlike CB3ExecutorPool, cancel() isn't a completely blocking operation
- while the task _is_ marked as cancelled synchronously; the actual
deletion of the TaskProxy object is done lazily in the background
after cancel() returns.

This requires that other task operations (e.g. wake, snooze) check the
GlobalTask is not null before oeprating on it. This is correctly
handled for wake (see TaskProxy::wake), but _not_ for cancel. As a
result a snooze() which executes before the background cleanup of
taskOwners can result in a nullptr dereference.

Change-Id: I8ef476426b1056bfff646551fa38d94f4ff64f01
Reviewed-on: http://review.couchbase.org/c/kv_engine/+/144166
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: Trond Norbye <trond.norbye@couchbase.com>
  • Loading branch information
daverigby authored and trondn committed Jan 26, 2021
1 parent b542c0b commit 3872972
Show file tree
Hide file tree
Showing 2 changed files with 83 additions and 2 deletions.
13 changes: 11 additions & 2 deletions engines/ep/src/folly_executorpool.cc
Original file line number Diff line number Diff line change
Expand Up @@ -461,8 +461,17 @@ struct FollyExecutorPool::State {
auto& tasks = owner.second;
it = tasks.locator.find(taskId);
if (it != tasks.locator.end()) {
it->second->task->snooze(toSleep);
it->second->updateTimeoutFromWakeTime();
auto& proxy = it->second;
if (!proxy->task) {
// Task has been cancelled ('task' shared ptr reset to
// null via resetTaskPtrViaCpuPool), but TaskProxy not yet
// been cleaned up) - i.e. a snooze and cancel have raced.
// Treat as if cancellation has completed and task no
// longer present.
return false;
}
proxy->task->snooze(toSleep);
proxy->updateTimeoutFromWakeTime();
return true;
break;
}
Expand Down
72 changes: 72 additions & 0 deletions engines/ep/tests/module_tests/executorpool_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -495,6 +495,78 @@ TYPED_TEST(ExecutorPoolTest, SnoozeWithoutSchedule) {
this->pool->unregisterTaskable(taskable, false);
}

/// Test that cancel() followed by snooze(); where cancel has only partially
/// completed is handled correctly.
/// Regression test for MB-43546 (which affected FollyExecutorPool).
TYPED_TEST(ExecutorPoolTest, CancelThenSnooze) {
if (typeid(TypeParam) == typeid(TestExecutorPool)) {
// CB3ExecutorPool deadlocks if one attempts to call snooze() while
// cancel() is still running.
// Could probably re-write test to call snooze() on a different thread
// with appropriate synchronization, but given this issue doesn't
// affect CB3, just skip the test.
GTEST_SKIP();
}

this->makePool(1);
NiceMock<MockTaskable> taskable;
this->pool->registerTaskable(taskable);

// To force the scheduling we want, need to call snooze after cancel()
// has set TaskProxy::task to null; but before that TaskProxy is removed
// from the owners. To do this, use a custom GlobalTask whose dtor
// calls snooze - GlobalTask is destroyed before removing TaskProxy from
// owners.
struct TestTask : public GlobalTask {
TestTask(Taskable& taskable, ExecutorPool& pool)
: GlobalTask(taskable, TaskId::ItemPager, INT_MAX), pool(pool) {
}

~TestTask() override {
pool.snooze(getId(), 10.0);
}

std::string getDescription() override {
return "TestTask - "s + GlobalTask::getTaskName(taskId);
}

std::chrono::microseconds maxExpectedDuration() override {
return {};
}

protected:
bool run() override {
ADD_FAILURE() << "TestTask should never run";
return false;
}
ExecutorPool& pool;
};

auto taskPtr = std::make_shared<TestTask>(taskable, *this->pool);
auto taskId = this->pool->schedule(taskPtr);

// reset our local taskPtr so only the ExecutorPool has a reference.
taskPtr.reset();

// Test: We want to setup the follow sequence:
// 1. cancel() the task; which should cause TaskProxy::task in taskOwners
// to be set to null.
// 2. Before the taskOwners element is removed
// (via FollyExecutorPool::removeTaskAfterRun), the refcount of the
// ExTask is decremented, which given our above TestTask has just one
// reference should trigger it's destruction.
// 3. In TestTask dtor; attempt to snooze() task again. This _should be
// handled correctly (ignored) given task is already in the process of
// being cancelled, but in MB-43546 snooze() incorrectly attempted to
// dereference the null TaskProxy::task member.
// 4. (Not really important) taskOwners entry is removed via
// FollyExecutorPool::removeTaskAfterRun
this->pool->cancel(taskId, true);

// Cleanup.
this->pool->unregisterTaskable(taskable, false);
}

/* This test creates an ExecutorPool, and attempts to verify that calls to
* setNumWriters are able to dynamically create more workers than were present
* at initialisation. A ThreadGate is used to confirm that two tasks
Expand Down

0 comments on commit 3872972

Please sign in to comment.