Commit
sched: solve the thread completion/destruction race
Our code had a serious bug in thread completion: when a thread complete()s, i.e., finishes its work, it wake()s the thread doing join() on it, and that joiner thread in turn deletes the completed thread and its stack. On rare occasions, the wake() was very slow but the joiner thread was very quick in deleting the thread, leading to a crash on return (retq) from wake() because the stack on which it was running had already been deleted.

This patch includes a simple fix for this bug: we add a new per-cpu field, cpu::terminating_thread. complete() no longer calls unref() itself, as the thread unref()ing itself caused the bug. Instead, complete() just sets terminating_thread to the current thread. After the scheduler on this CPU switches to the next thread, we call unref() on the thread specified in terminating_thread. We know this is safe because this thread is no longer running.

This fix seems effective (the crashes that were apparent in tst-wake and the sunflow benchmark seem to be gone, as far as I can tell). Its biggest downside is an extra "if" on every context switch. It is possible to devise different solutions without the cost of the extra "if", but they are more complicated and require a lot more code changes. I'll add a bug-tracker entry documenting them.
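The deferred unref() described above can be illustrated with a small single-threaded model. This is only a sketch, not the patch itself: toy_thread, toy_cpu, switch_to() and the destroyed flag are hypothetical stand-ins invented for illustration, and no real stacks or context switches are involved. Only the names terminating_thread, complete() and unref() come from the commit message.

```cpp
#include <cassert>

// Hypothetical stand-in for a scheduler thread. A real thread would own a
// stack; here a flag records when the last reference was dropped.
struct toy_thread {
    int refs = 1;
    bool destroyed = false;

    void unref() {
        if (--refs == 0) {
            destroyed = true;  // models "thread and stack may now be freed"
        }
    }
};

// Hypothetical stand-in for the per-CPU scheduler state.
struct toy_cpu {
    toy_thread* current = nullptr;
    toy_thread* terminating_thread = nullptr;  // field named in the commit

    // Called by the finishing thread while it still runs on its own stack.
    // It does not unref() itself; it only records that it is terminating.
    void complete() {
        terminating_thread = current;
    }

    // Models a context switch. The unref() happens only after "current" has
    // changed, i.e. once the terminating thread is no longer running.
    void switch_to(toy_thread* next) {
        current = next;
        if (terminating_thread) {  // the extra "if" on every context switch
            terminating_thread->unref();
            terminating_thread = nullptr;
        }
    }
};
```

In this model, a thread that calls complete() stays intact until switch_to() has moved the CPU to another thread, which is the property the fix relies on.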