sched: solve the thread completion/destruction race

Our code had a serious bug in thread completion: When a thread complete()s, i.e., finishes its work, it wake()s the thread doing join() on it, and that joiner thread in turn deletes the completed thread and its stack. On rare occasions, the wake() was very slow but the joiner thread was very quick in deleting the thread - leading to a crash on return (retq) from wake() because the stack on which it was running has been deleted. This patch includes a simple, but effective, fix for this bug: We add a new per-cpu field, cpu::terminating_thread. complete() no longer calls unref() itself - as the thread unref()ing itself caused the bug. Instead, complete() just sets terminating_thread to the current thread. After the scheduler on this CPU switches to the next thread, we call unref() on the thread specified in terminating_thread. We know this is safe because this thread is no longer running. This fix seems simple and effective (the crashes that were apparent in tst-wake and the sunflow benchmark seem to be gone, as far as I can tell). Its biggest downside is an extra "if" on every context switch. It is possible to devise different solutions, without the cost of the extra if, but these solutions are more complicated and require a lot more code changes. I'll add a bug-tracker entry documenting them.
cloudius-systems · Jun 23, 2013 · 96ee424 · 96ee424
1 parent 6278271
commit 96ee424
Show file tree

Hide file tree

Showing 2 changed files with 18 additions and 19 deletions.
diff --git a/core/sched.cc b/core/sched.cc
@@ -58,6 +58,7 @@ class thread::reaper {
 cpu::cpu(unsigned _id)
     : id(_id)
     , idle_thread([this] { idle(); }, thread::attr(this))
+    , terminating_thread(nullptr)
 {
     percpu_init(id);
 }
@@ -107,6 +108,10 @@ void cpu::reschedule_from_interrupt(bool preempt)
         if (preempt) {
             p->_fpu.restore();
         }
+        if (p->_cpu->terminating_thread) {
+            p->_cpu->terminating_thread->unref();
+            p->_cpu->terminating_thread = nullptr;
+        }
     }
 }
 
@@ -370,21 +375,16 @@ void thread::ref()
 void thread::unref()
 {
     if (_ref_counter.fetch_add(-1) == 1) {
+        // thread can't unref() itself, because if it decides to wake joiner,
+        // it will delete the stack it is currently running on.
+        assert(thread::current() != this);
+
         // FIXME: we have a problem in case of a race between join() and the
         // thread's completion. Here we can see _joiner==0 and not notify
         // anyone, but at the same time join() decided to go to sleep (because
         // status is not yet status::terminated) and we'll never wake it.
         if (_joiner) {
             _joiner->wake_with([&] { _status.store(status::terminated); });
-            // FIXME: At this point, we've awoken _joiner, which can promptly
-            // clean this thread up, including deleting the stack we and our
-            // caller (complete()) is running on. So right now we can be
-            // running on a deleted stack! This is a serious bug (even if
-            // rare because requires very specific timing).
-            // NOTE: This bug can especially be seen happen when wake above
-            // calls wrmsr (for the ipi) which takes a long time (an exit) and
-            // in the meantime the joiner deletes our stack, and the return
-            // from the ipi/wrmsr function (retq instruction) gets a segfault.
         } else {
             _status.store(status::terminated);
         }
@@ -457,18 +457,16 @@ void thread::complete()
     if (_attr.detached) {
         _s_reaper->add_zombie(this);
     }
-    // If this thread gets preempted after changing status (to terminating
-    // here, or to terminated in unref()), it will never be scheduled
-    // again and never get to wake the joiner. So must disable preemption.
+    // If this thread gets preempted after changing status it will never be
+    // scheduled again to set terminating_thread. So must disable preemption.
     preempt_disable();
     _status.store(status::terminating);
-    // unref() here will either do the exit (change state to terminated, wake
-    // joiner thread) now, or if a ref() is blocking us, leave the state
-    // "terminating" and the matching unref() will complete the termination.
-    unref();
-    preempt_enable();
-    // The thread is now in the "terminating" or "terminated" state, so on
-    // call to schedule() or on preemption, it will never get to run again.
+    // We want to run unref() here, but can't because it cause the stack we're
+    // running on to be deleted. Instead, set a _cpu field telling the next
+    // thread running on this cpu to do the unref() for us.
+    _cpu->terminating_thread = this;
+    // The thread is now in the "terminating" state, so on call to schedule()
+    // it will never get to run again.
     while (true) {
         schedule();
     }

diff --git a/include/sched.hh b/include/sched.hh
@@ -295,6 +295,7 @@ struct cpu {
     typedef lockless_queue<thread, &thread::_wakeup_link> incoming_wakeup_queue;
     cpu_set incoming_wakeups_mask;
     incoming_wakeup_queue* incoming_wakeups;
+    thread* terminating_thread;
     static cpu* current();
     void init_on_cpu();
     void schedule(bool yield = false);