Fix race condition in scheduler reset #179

csegarragonz · 2021-11-23T16:25:34Z

In this PR I fix a race condition when resetting executors that made the flushing tests in faasm segfault sporadically.

The race condition happened between:

The main thread, which calls Scheduler::reset() and then Executor::finish and, for each thread in threadPoolThreads: (1) checks that it is not null, (2) enqueues a shutdown task, and (3) joins it.
The thread pool thread, which: (i) times out dequeueing between (1) and (2), and (ii) breaks out of the main executor loop, setting selfShutdown = true and setting itself to null, all before (3).

As a result, we tried to join a nullptr in (3) and segfaulted.

The proposed solution changes the order in the Executor::finish loop to: (1) enqueue the shutdown task, (2) check whether the thread is null, (3) if not, join the thread. This way, when we start killing the thread pool, the thread is in one of two states: (a) it is still blocked dequeueing, so the POOL_SHUTDOWN task will have the desired effect when it wakes up, or (b) it has already timed out, will destroy itself, and will be pointing to null by the time we check for it.

I think the chances that the thread has timed out when we enqueue, is not yet null when we check, and is null by the time we join are very remote, as there is only one instruction between timing out and setting itself to null. I haven't been able to reproduce it.

Note that the task queues are cleared at the end of Executor::finish, so having non-empty queues with leftover POOL_SHUTDOWN tasks is not really a problem.
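
To illustrate the reordered loop, here is a minimal, self-contained sketch of the enqueue, check, join ordering. It is not the code in this PR: `TaskQueue`, `Worker`, `slotMx`, `finishing` and the integer `POOL_SHUTDOWN` sentinel are all hypothetical names made up for the example. The sketch also assumes a mutex around the thread slot plus a `finishing` flag, which the description above does not mention; they are there so the example itself has no data race in the remaining window.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <thread>

using Millis = std::chrono::milliseconds;

// Hypothetical sentinel standing in for the POOL_SHUTDOWN task
constexpr int POOL_SHUTDOWN = -1;

// Minimal blocking queue whose dequeue gives up after a timeout
class TaskQueue {
  public:
    void enqueue(int task) {
        {
            std::lock_guard<std::mutex> lock(mx);
            tasks.push_back(task);
        }
        cv.notify_one();
    }

    // Returns false if nothing arrives within the timeout
    bool dequeue(Millis timeout, int& task) {
        std::unique_lock<std::mutex> lock(mx);
        if (!cv.wait_for(lock, timeout, [this] { return !tasks.empty(); })) {
            return false;
        }
        task = tasks.front();
        tasks.pop_front();
        return true;
    }

    void clear() {
        std::lock_guard<std::mutex> lock(mx);
        tasks.clear();
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lock(mx);
        return tasks.size();
    }

  private:
    std::mutex mx;
    std::condition_variable cv;
    std::deque<int> tasks;
};

// One pool thread plus the slot that holds its handle
class Worker {
  public:
    explicit Worker(Millis timeout) {
        std::lock_guard<std::mutex> lock(slotMx);
        slot = std::make_shared<std::thread>([this, timeout] { loop(timeout); });
    }

    // Returns true if we joined the thread, false if it had already
    // shut itself down and nulled its slot
    bool finish() {
        queue.enqueue(POOL_SHUTDOWN); // (1) enqueue the shutdown task first
        std::shared_ptr<std::thread> handle;
        {
            std::lock_guard<std::mutex> lock(slotMx);
            finishing = true; // assumption of this sketch, not from the PR
            handle = slot;    // (2) check whether the thread is null
            slot = nullptr;
        }
        if (handle != nullptr) {
            handle->join(); // (3) join only a thread that still exists
        }
        queue.clear(); // leftover POOL_SHUTDOWN tasks are dropped here
        return handle != nullptr;
    }

    std::size_t pendingTasks() { return queue.size(); }

  private:
    void loop(Millis timeout) {
        int task = 0;
        while (queue.dequeue(timeout, task)) {
            if (task == POOL_SHUTDOWN) return;
            // Real work would run here
        }
        // Timed out: self-shutdown, unless finish() already took the handle
        std::lock_guard<std::mutex> lock(slotMx);
        if (finishing) return; // finish() will join us
        assert(slot != nullptr);
        slot->detach();
        slot = nullptr;
    }

    TaskQueue queue;
    std::mutex slotMx;
    std::shared_ptr<std::thread> slot;
    bool finishing = false;
};
```

With a long timeout, `finish()` should find the thread still blocked and join it. If the thread has already timed out, `finish()` should find a null slot, skip the join, and just clear the leftover shutdown task. In this sketch a `Worker` must have `finish()` called before it is destroyed.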

I also include a test that crashes the current master branch in fewer than 20 tries (fewer than 10, in fact, in each of the 20 runs I tried locally), but passes with this change.

Shillaker · 2021-11-24T08:42:10Z

tests/test/scheduler/test_executor.cpp

+
+        // We sleep for the same timeout threads have, to force a race condition
+        // between the scheduler's flush and the thread's own cleanup timeout
+        SLEEP_MS(conf.boundTimeout);


This could end up being quite a long sleep depending on the default config, and it's multiplied by 20, so it would be good to set this to something very short at the start of this test (e.g. 100ms).

Good point, thanks

csegarragonz self-assigned this Nov 23, 2021

csegarragonz added the bug Something isn't working label Nov 23, 2021

csegarragonz requested review from Shillaker and removed request for Shillaker November 23, 2021 16:48

fix race condition in scheduler reset and add test

4c0af62

csegarragonz force-pushed the fix-race branch from 3f694f8 to 4c0af62 Compare November 23, 2021 16:50

csegarragonz requested a review from Shillaker November 23, 2021 17:10

Shillaker requested changes Nov 24, 2021

pr comments

bbb4ad5

Shillaker approved these changes Nov 24, 2021

csegarragonz merged commit 4930c61 into master Nov 24, 2021

csegarragonz deleted the fix-race branch November 24, 2021 14:33

csegarragonz mentioned this pull request Feb 23, 2022

Add task to generate release body #233

Merged
