
tst-without-namespace.so is sporadically failing #790

Closed
nyh opened this issue Aug 21, 2016 · 10 comments
nyh commented Aug 21, 2016

As @wkozaczuk first noticed, tst-without-namespace.so (and also test-namespace.so?) fails once in every few runs - the failure rate ranges from once in several hundred runs (on my own laptop) to once in several runs (on OSv's Jenkins machine, which reports these failures once in a while).

Here is some of the information I originally collected in issue #779:

The failure looks like:

  TEST tst-without-namespace.so           OSv v0.24-157-gab34727
eth0: 192.168.122.15
Assertion failed: sched::preemptable() (arch/x64/mmu.cc: page_fault: 33)

The backtrace I got from gdb is (release build, built on Fedora 23):

#0  0x000000000038f002 in processor::cli_hlt () at arch/x64/processor.hh:247
#1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
#2  osv::halt () at arch/x64/power.cc:24
#3  0x0000000000225747 in abort (
    fmt=fmt@entry=0x68b078 "Assertion failed: %s (%s: %s: %d)\n")
    at runtime.cc:130
#4  0x0000000000225789 in __assert_fail (
    expr=expr@entry=0x7143a3 "sched::preemptable()", 
    file=file@entry=0x70722a "arch/x64/mmu.cc", line=line@entry=33, 
    func=func@entry=0x707250 <page_fault::__func__> "page_fault")
    at runtime.cc:137
#5  0x0000000000388cb7 in page_fault (ef=0xffff800001821068)
    at arch/x64/mmu.cc:33
#6  <signal handler called>
#7  std::__atomic_base<sched::thread*>::load (__m=std::memory_order_relaxed, 
    this=0x0) at /usr/include/c++/5.3.1/bits/atomic_base.h:713
#8  std::atomic<sched::thread*>::load (__m=std::memory_order_relaxed, this=0x0)
    at /usr/include/c++/5.3.1/atomic:416
#9  lockless_queue<sched::thread, &sched::thread::_wakeup_link>::empty (
    this=0xffff9000013230c0) at include/osv/lockless-queue.hh:80
#10 sched::cpu::handle_incoming_wakeups (this=this@entry=0xffff8000012b0040)
    at core/sched.cc:439
#11 0x00000000003dfdf2 in sched::cpu::do_idle (
    this=this@entry=0xffff8000012b0040) at core/sched.cc:405
#12 0x00000000003e2138 in sched::cpu::idle (this=0xffff8000012b0040)
    at core/sched.cc:423
#13 0x00000000003e215c in sched::cpu::<lambda()>::operator() (
    __closure=<optimized out>) at core/sched.cc:165
#14 std::_Function_handler<void(), sched::cpu::init_idle_thread()::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/5.3.1/functional:1871
#15 0x00000000003e2297 in sched::thread_main_c (t=0xffff80000181c040)
    at arch/x64/arch-switch.hh:164
#16 0x0000000000388ab6 in thread_main () at arch/x64/entry.S:113

In frame 10,

(gdb) fr 10
#10 sched::cpu::handle_incoming_wakeups (this=this@entry=0xffff8000012b0040)
    at core/sched.cc:439
439             while (!q.empty()) {
(gdb) p q
$1 = (lockless_queue<sched::thread, &sched::thread::_wakeup_link> &) @0xffff9000013230c0: {_head = 0x0, _tail = 0xffffa00002b381d0}

In a second run, after trying to add a few more assertions in the lockless_queue<> code, I got a page fault from a slightly different place:

#5  0x0000000000388cb7 in page_fault (ef=0xffff8000017e7068)
    at arch/x64/mmu.cc:33
#6  <signal handler called>
#7  sched::cpu::handle_incoming_wakeups (this=this@entry=0xffff80000127a040)
    at core/sched.cc:446
#8  0x00000000003dfe82 in sched::cpu::do_idle (
    this=this@entry=0xffff80000127a040) at core/sched.cc:405
#9  0x00000000003e21c8 in sched::cpu::idle (this=0xffff80000127a040)
    at core/sched.cc:423
(gdb) fr 7
#7  sched::cpu::handle_incoming_wakeups (this=this@entry=0xffff80000127a040)
    at core/sched.cc:446
446               } else if (t.tcpu() != this) {

So in this run, q.empty() was fine, but q.front() returned garbage.

My initial suspicion was that we might have a race somehow, either a bug in the lockless_queue implementation, or a bug in its usage (e.g., two different threads writing to the same queue). However, I later reverted commit bca7bbb which re-implemented the lockless-queue.hh wakeup queue, and this test still fails (after several hundred successes) - but in a slightly different place - see #779 (comment). So it's probably not a bug in lockless-queue.hh, and probably also not an issue with multiple writers (which the previous implementation actually allowed, even if unnecessarily).

This bug is also reproducible (again, in one of many runs) with a debug build. I usually get the second type of crash (at t.tcpu()), and it's more informative to debug this way, without information being optimized away.

The thread whose wakeup is crashing is a newly started thread (a thread which was set up and woken, but has never yet run) - i.e., a thread for which "osv info threads" shows state status::waking and an IP at arch/x64/entry.S:112.

The page fault in t.tcpu() has cr2 = 8, suggesting that when t.tcpu() was called and used "_detached_state->_cpu", _detached_state was still a null pointer. But in gdb, I already see _detached_state having a correct, non-null value. Could this be a memory ordering issue?

I thought this was unlikely to be a memory ordering issue because _detached_state is set in thread::thread(), significantly earlier than the thread::start() call which causes the failing wake. But obviously the time difference alone does not ensure ordering. I was worried that wake() sets the status to waking with CST memory ordering but handle_incoming_wakeups() does not read the status, so maybe this doesn't ensure memory ordering. On the other hand, wake() uses "release" memory ordering to write to the queue, and the queue read code uses "consume" memory ordering to read from it. I also tested replacing the "consume" memory ordering with "acquire", and it didn't fix the problem (maybe moved it around a bit).

By printing t, using "osv info threads" and then "osv thread ..." and "where", I can see the stack trace of the thread which started the problematic new thread, to understand what this new thread is:

 sched::thread::switch_to (this=0xffff800002136040)
    at arch/x64/arch-switch.hh:75
#1  0x00000000005ba89a in sched::cpu::reschedule_from_interrupt (
    this=0xffff800001b36040, called_from_yield=false, preempt_after=...)
    at core/sched.cc:339
#2  0x00000000005ba15c in sched::cpu::schedule () at core/sched.cc:228
#3  0x00000000005bdca7 in sched::thread::wait (this=0xffff8000041ae050)
    at core/sched.cc:1177
#4  0x00000000005c298f in sched::thread::do_wait_until<sched::noninterruptible, sched::thread::dummy_lock, sched::thread::join()::<lambda()> >(sched::thread::dummy_lock &, sched::thread::<lambda()>) (mtx=..., pred=...)
    at include/osv/sched.hh:1033
#5  0x00000000005c1365 in sched::thread::wait_until<sched::thread::join()::<lambda()> >(sched::thread::<lambda()>) (pred=...) at include/osv/sched.hh:1044
#6  0x00000000005be01e in sched::thread::join (this=0x2000002fa930)
    at core/sched.cc:1274
#7  0x000000000056fd60 in osv::rcu_flush () at core/rcu.cc:248
#8  0x0000000000402aaa in elf::program::remove_object (this=0xffff900001fd5000, 
    ef=0xffffa000033a0c00) at core/elf.cc:1194
#9  0x0000000000401c68 in elf::program::<lambda(elf::object*)>::operator()(elf::object *) const (__closure=0xffffa000033f8bf0, obj=0xffffa000033a0c00)
    at core/elf.cc:1136
#10 0x0000000000405fc8 in std::_Sp_counted_deleter<elf::file*, elf::program::load_object(std::__cxx11::string, std::vector<std::__cxx11::basic_string<char> >, std::vector<std::shared_ptr<elf::object> >&)::<lambda(elf::object*)>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2u>::_M_dispose(void) (this=0xffffa000033f8be0)
    at /usr/include/c++/5.3.1/bits/shared_ptr_base.h:466
#11 0x000000000020e2f2 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0xffffa000033f8be0)
    at /usr/include/c++/5.3.1/bits/shared_ptr_base.h:150
#12 0x000000000020bdf1 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x2000002ffda8, __in_chrg=<optimized out>)
    at /usr/include/c++/5.3.1/bits/shared_ptr_base.h:659
#13 0x000000000040646c in std::__shared_ptr<elf::object, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x2000002ffda0, __in_chrg=<optimized out>)
    at /usr/include/c++/5.3.1/bits/shared_ptr_base.h:925
#14 0x000000000063dc7d in std::__shared_ptr<elf::object, (__gnu_cxx::_Lock_policy)2>::reset (this=0xffffa00003355e98)
    at /usr/include/c++/5.3.1/bits/shared_ptr_base.h:1022
#15 0x0000000000638bf1 in osv::application::join (this=0xffffa00003355e10)
    at core/app.cc:222
#16 0x00000000006234e9 in osv::run (path="/tests/payload-env.so", 
    args=std::vector of length 1, capacity 1 = {...}, 
    return_code=0x2000002ffe5c, new_program=false, env=0x0) at core/run.cc:12
#17 0x0000100000c0365b in run_environment_payload (new_program=<optimized out>)
    at /home/nyh/osv.tmp2/osv/tests/env-common.inc:23
#18 0x000000000076ad80 in execute_native_thread_routine ()
#19 0x0000000000699e77 in pthread_private::pthread::<lambda()>::operator()(void) const (__closure=0xffffa00003355a00) at libc/pthread.cc:101
#20 0x000000000069c48a in std::_Function_handler<void(), pthread_private::pthread::pthread(void* (*)(void*), void*, sigset_t, const pthread_private::thread_attr*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...)
    at /usr/include/c++/5.3.1/functional:1871
#21 0x000000000044c4d0 in std::function<void ()>::operator()() const (
    this=0xffff8000041ae080) at /usr/include/c++/5.3.1/functional:2267
#22 0x00000000005bdc88 in sched::thread::main (this=0xffff8000041ae050)
    at core/sched.cc:1171
#23 0x00000000005b9cbe in sched::thread_main_c (t=0xffff8000041ae050)
    at arch/x64/arch-switch.hh:164
#24 0x0000000000489036 in thread_main () at arch/x64/entry.S:113

So osv::run() already called osv::application::join(), which apparently decided to delete the shared object because it isn't in use any more (I hope this actually happens after the shared object stopped running). But the more interesting thing we can learn from this backtrace is that frame 6 reveals the actual identity of the thread whose startup is crashing: it's not the regular application thread, it's a temporary thread created by osv::rcu_flush().

Another interesting discovery here is that the osv::rcu_flush() code can wake a thread on a different core. Looking in GDB, we discover that indeed the problem happens when a thread on sched::cpus[1] starts a thread on c=sched::cpus[0]. So this brings back the possibility of memory ordering issues.

Every time this crash happens, it's always from this osv::rcu_flush() thread, and every time I checked, it was a thread on a different CPU from where the wakeup occurred (I was testing this with 4 vcpus).


DerangedMonkeyNinja commented Sep 13, 2016

Could this be a use-after-delete problem with the threads created by rcu_flush? They all have the same address on the stack. Also, this problem appears to be fixed if I change the code to dynamically allocate the threads, e.g.:

diff --git a/core/rcu.cc b/core/rcu.cc
index 46adb2c..f7acf81 100644
--- a/core/rcu.cc
+++ b/core/rcu.cc
@@ -234,14 +234,14 @@ void rcu_flush()
 {
     semaphore s{0};
     for (auto c : sched::cpus) {
-        sched::thread t([&] {
-            rcu_defer([&] { s.post(); });
-            // rcu_defer() might not wake the cleanup thread until enough deferred
-            // callbacks have accumulated, so wake it up now.
-            percpu_quiescent_state_thread->wake();
-        }, sched::thread::attr().pin(c));
-        t.start();
-        t.join();
+        std::unique_ptr<sched::thread> t(new sched::thread ([&] {
+                    rcu_defer([&] { s.post(); });
+                    // rcu_defer() might not wake the cleanup thread until enough deferred
+                    // callbacks have accumulated, so wake it up now.
+                    percpu_quiescent_state_thread->wake();
+                }, sched::thread::attr().pin(c)));
+        t->start();
+        t->join();
     }
     s.wait(sched::cpus.size());
 }


nyh commented Sep 13, 2016

@DerangedMonkeyNinja this is a very interesting observation. In the past we had bugs where a thread would wake up its own join(), or thread::join() raced with thread::destroy(), in both cases causing situations where, if we destroy a thread object immediately after join()ing it, the thread might still be in use while being destroyed. I thought I fixed all these bugs, but perhaps one remains?
I think the use-after-free scenario which you described could indeed cause the problems we saw, although I'm hazy about how exactly the details would match what I've seen.

Looking at the code, I noticed several minor issues (like a second thread::join() returning immediately instead of asserting, and thread::join() using _detached_state without holding the rcu_read_lock), but it doesn't seem to me that either of these can cause the issue we're seeing.

If this is indeed a use-after-free bug, we could debug it with our use-after-free debugger: keep your patch (so the threads will be separate objects on the heap, not reusing the same one), edit conf/base.mk to have conf-debug_memory=1 instead of 0, and recompile. This debug allocator option makes sure that each allocation lives in a separate page, which is unmapped when freed, so a use-after-free causes an immediate page fault. This may allow you to see exactly which piece of code is using an already-freed object.


nyh commented Sep 13, 2016

@DerangedMonkeyNinja sorry, github mailed my comment prematurely, with only partial content and partial crap. Can you please take a look at the bug tracker itself for the fixed content of my comment? Thanks.


DerangedMonkeyNinja commented Sep 13, 2016

So, running with dynamically allocated threads and conf-debug_memory=1 doesn't seem to generate any asserts.

However, I really wasn't thinking of a typical use-after-free type issue. Since the previous thread is overwritten with a new thread in the original code, the raw pointer in _detached_state ends up pointing to a new detached_state struct every time a thread is instantiated. I honestly don't quite understand how the RCU stuff works, so maybe that isn't really a problem (though it seems like it would be...)


DerangedMonkeyNinja commented Sep 16, 2016

I've played around with this issue some more, and came across the following observations.

First, I changed thread::_detached_state to an atomic pointer, added a new tracepoint, and modified cpu::handle_incoming_wakeups() as follows:

@@ -432,29 +436,49 @@ void cpu::handle_incoming_wakeups()
     if (!queues_with_wakes) {
         return;
     }
     for (auto i : queues_with_wakes) {
         irq_save_lock_type irq_lock;
         WITH_LOCK(irq_lock) {
             auto& q = incoming_wakeups[i];
             while (!q.empty()) {
-                auto& t = q.front();
-                q.pop_front();
-                if (&t == thread::current()) {
-                    // Special case of current thread being woken before
-                    // having a chance to be scheduled out.
-                    t._detached_state->st.store(thread::status::running);
-                } else if (t.tcpu() != this) {
-                    // Thread was woken on the wrong cpu. Can be a side-effect
-                    // of sched::thread::pin(thread*, cpu*). Do nothing.
-                } else {
-                    t._detached_state->st.store(thread::status::queued);
-                    // Make sure the CPU-local runtime measure is suitably
-                    // normalized. We may need to convert a global value to the
-                    // local value when waking up after a CPU migration, or to
-                    // perform renormalizations which we missed while sleeping.
-                    t._runtime.update_after_sleep();
-                    enqueue(t);
-                    t.resume_timers();
+                thread::detached_state *ds = nullptr;
+                size_t count = 0;
+                for (;;) {
+                    auto& t = q.front();
+                    std::atomic_thread_fence(std::memory_order_seq_cst);
+                    if ((ds = t._detached_state.load()) == nullptr) {
+                        trace_thread_fail(&t, t.id(), t._detached_state.load());
+                        count++;
+                        continue;
+                    }
+
+                    if (count > 0) {
+                        printf("ds retry count = %zu\n", count);
+                    }
+
+                    assert(ds);
+                    q.pop_front();
+
+                    if (&t == thread::current()) {
+                        // Special case of current thread being woken before
+                        // having a chance to be scheduled out.
+                        t._detached_state.load()->st.store(thread::status::running);
+                    } else if (t.tcpu() != this) {
+                        // Thread was woken on the wrong cpu. Can be a side-effect
+                        // of sched::thread::pin(thread*, cpu*). Do nothing.
+                    } else {
+                        t._detached_state.load()->st.store(thread::status::queued);
+                        // Make sure the CPU-local runtime measure is suitably
+                        // normalized. We may need to convert a global value to the
+                        // local value when waking up after a CPU migration, or to
+                        // perform renormalizations which we missed while sleeping.
+                        t._runtime.update_after_sleep();
+                        enqueue(t);
+                        t.resume_timers();
+                    }
+
+                    break;
                 }
             }
         }

Surprisingly, this function will go into an (apparently) infinite loop.

And the trace data is really strange:

0xffff800003a55050 >/tests/tst-wit  0         3.300407890 thread_create        thread=0x00002000002fa980, id=267, ds=0xffffa0000240ee00
...
0xffff800001c29040 idle1            1        45.535752290 thread_fail          thread=0x00002000002fa980, id=35455400, ds=0x0000000000000000
0xffff800001c29040 idle1            1        45.535753430 thread_fail          thread=0x00002000002fa980, id=35455400, ds=0x0000000000000000
0xffff800001c29040 idle1            1        45.535754560 thread_fail          thread=0x00002000002fa980, id=35455400, ds=0x0000000000000000
0xffff800001c29040 idle1            1        45.535755690 thread_fail          thread=0x00002000002fa980, id=35455400, ds=0x0000000000000000
0xffff800001c29040 idle1            1        45.535756810 thread_fail          thread=0x00002000002fa980, id=35455400, ds=0x0000000000000000
0xffff800001c29040 idle1            1        45.535757940 thread_fail          thread=0x00002000002fa980, id=35455400, ds=0x0000000000000000

Attaching the debugger, you can see that t, t._id, and t._detached_state all have the same values as they do in the thread_create trace line.

I think the only possible explanation for this is that CPU 1 is getting bogus data from its cache. So, as a quick and dirty way to test that out, I did this:

diff --git a/core/sched.cc b/core/sched.cc
index 750b160..1a7609c 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -428,6 +428,7 @@ void cpu::idle()

 void cpu::handle_incoming_wakeups()
 {
+    mmu::flush_tlb_local();
     cpu_set queues_with_wakes{incoming_wakeups_mask.fetch_clear()};
     if (!queues_with_wakes) {
         return;

And with that single change, I'm unable to get the test to fail. Previously, it would reliably fail within 5 attempts, but now I can run it several hundred times without any problems.

I should add that I don't know what a proper fix for this is. A TLB flush seems excessive...


DerangedMonkeyNinja commented Sep 16, 2016

I think I have a proper fix for this.

diff --git a/core/sched.cc b/core/sched.cc
index 750b160..3db56ca 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -402,6 +402,9 @@ void cpu::do_idle()
         }
         guard.release();
         arch::wait_for_interrupt(); // this unlocks irq_lock
+        if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) {
+            mmu::flush_tlb_local();
+        }
         handle_incoming_wakeups();
     } while (runqueue.empty());
 }

In this case at least, the lazy_flush_tlb flag was getting set while the CPU waited for the interrupt. When it resumed, handle_incoming_wakeups grabbed a thread that was in a page region that had been modified.

@DerangedMonkeyNinja

Addendum: repeatedly running this unit test for an extended period produced an identical page fault in cpu::handle_incoming_wakeups when it was called from cpu::reschedule_from_interrupt. So, it seems like the proper fix is really...

diff --git a/core/sched.cc b/core/sched.cc
index 750b160..efe0179 100644
--- a/core/sched.cc
+++ b/core/sched.cc
@@ -333,9 +333,6 @@ void cpu::reschedule_from_interrupt(bool called_from_yield,
     if (app_thread.load(std::memory_order_relaxed) != n->_app) { // don't write into a cache line if it can be avoided
         app_thread.store(n->_app, std::memory_order_relaxed);
     }
-    if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) {
-        mmu::flush_tlb_local();
-    }
     n->switch_to();

     // Note: after the call to n->switch_to(), we should no longer use any of
@@ -432,6 +429,11 @@ void cpu::handle_incoming_wakeups()
     if (!queues_with_wakes) {
         return;
     }
+    // Incoming threads may require new page translations.  Check to
+    // see if we need to flush our cache before handling them.
+    if (lazy_flush_tlb.exchange(false, std::memory_order_seq_cst)) {
+        mmu::flush_tlb_local();
+    }
     for (auto i : queues_with_wakes) {
         irq_save_lock_type irq_lock;
         WITH_LOCK(irq_lock) {

Note: I don't think the flush in cpu::reschedule_from_interrupt is necessary, since one of the first things it does is call into cpu::handle_incoming_wakeups.


nyh commented Sep 18, 2016

Wow, this was surprising!

For reference, the "lazy TLB flush" feature which you are fixing was added by @gleb-cloudius in commit 7e38453.

Here is what I think might be happening (but please correct me if I'm misunderstanding): Imagine CPU 0 creates a new application thread A which itself creates a new thread B sent to CPU 1 - where thread B is a sched::thread object created on thread A's stack. As is normal for pthread threads (see pthread::allocate_stack), thread A is created with an mmap'ed stack, so the "sched::thread" structure allocated for thread B, which is on that stack, is in an mmap'ed area. Reading from this area might require flushing the TLB first. So it's not that running the incoming thread needs the new translation - it's the incoming thread structure itself which might need the new translation!

By the way, I don't think we can get rid of the second test-and-flush in reschedule_from_interrupt: it is quite possible that we have no incoming threads, so your code doesn't reach the place where it tests and flushes - and yet we do need to test-and-flush when we decide to context switch (the purpose of this delayed TLB flush was just to avoid the need for an IPI, not to avoid the flush).

I'm worried, though, that even this patch might not be enough :-( The code which sets the cpu::lazy_flush_tlb variable doesn't always set it: it tries to either set this variable or send an IPI. Sometimes it will send an IPI. If the target CPU already happens to be inside the reschedule_from_interrupt() code after being interrupted in an "app thread", the variable won't be set (because the code will think the CPU is still in app-thread state), the IPI will be ignored (or delayed), and the reschedule code can use an outdated page table for an incoming thread. I wonder if we need to keep this variable set even if we send an IPI?

Another option we could consider is to disallow creating sched::thread on the stack, and only allow allocating it on the heap (this can be achieved with a private constructor, and classes which hold a sched::thread member will now hold a std::unique_ptr<sched::thread> instead). This will mean that the thread structure itself will never be in an mmap'ed area. Does anybody have an opinion about this alternative?

Even more generally, I'm beginning to worry about the validity of the assumption in patch 7e38453 - which assumed that "system threads" do not need to work on user-mmap()ed areas. We have now seen that "class thread" objects can actually live in user-mmap()ed areas, and system threads (even the scheduler) need them. I wonder whether we have similar issues elsewhere...

@gleb-cloudius what do you think about this patch in general, and my questions above in particular?


nyh commented Sep 18, 2016

I just sent a patch to the mailing list titled "RFC: force sched::thread to be in heap" which enforces that sched::thread objects are always allocated on the heap, never on the stack, so it avoids the TLB flush issue (the heap always uses a linear mapping prepared during boot, and never needs a TLB flush).
Part of this patch is, of course, making the threads in rcu.cc allocated on the heap instead of on the stack, which is exactly the first workaround that @DerangedMonkeyNinja discovered. But I tried to make it more general by outright forbidding on-stack sched::thread objects, and changing all the existing code to fit the newly proposed idiom of using std::unique_ptr<sched::thread> to allocate and delete threads.

I still can't make up my mind whether that approach or the one suggested above by @DerangedMonkeyNinja (fixing the TLB flushing issue) is better. My patch is much longer and uglier - which can't be good. On the other hand, I'm worried that @DerangedMonkeyNinja's patch might not cover all the possible cases where a thread is woken in parallel with an interrupt being processed.


nyh commented Sep 27, 2016

@avikivity suggested an approach similar in spirit to my patch (forcing the thread structure to be allocated on the heap) but implemented differently:

The idea is that sched::thread will contain nothing but a pointer to the real thread structure. The user can create the sched::thread object wherever they please, but the constructor will allocate the actual, internal thread structure the way we want (on the heap).

We actually already have exactly this sort of indirection in sched::thread: it is detached_state, which holds the parts of the thread that need to survive the thread itself for a short while, until the RCU quiet period. So basically we'll now move all of the thread's state into this detached state. The sched::thread destructor will rcu_dispose() the real thread structure rather than delete it immediately, but it can immediately call a new method which cleans up the parts of the thread (like its stack) which we are sure we no longer need.

Most of the private "thread" methods will become methods on the new detached-state structure (which needs a new name, maybe thread_impl), and scheduler code which currently handles thread pointers - e.g., the scheduler's run queue - will now hold pointers directly to this thread_impl. So the new scheme will not hurt scheduling performance, quite the opposite (currently a few of the thread's fields need to be indirected through "detached_state", and this will go away).

DerangedMonkeyNinja added a commit to SpirentOrion/osv that referenced this issue Sep 27, 2016
Experimental patch from OSv mailing list for issue
described here: cloudius-systems#790
nyh closed this as completed in c9e5af6 on Oct 5, 2016
myechuri pushed a commit to myechuri/osv that referenced this issue Jun 22, 2017
In issue cloudius-systems#790, Timmons C. Player discovered something dangerous with
on-stack sched::thread objects: the application's stack may be mmap()ed,
so the scheduler now needs to access an application's mmap()ed memory
area. Those accesses can potentially have all sorts of problems (e.g.,
page faults in the scheduler, although in practice issue cloudius-systems#143 precludes
it). But more concretely, the "lazy TLB flush" code added in commit
7e38453 means that the scheduler may see old pages for an application's
mmap()ed pages, so it cannot work with a sched::thread in an mmap()ed
area.

This patch prevents on-stack sched::thread construction by hiding the
constructor, and only allowing a thread to be created through the function
sched::thread::make(...), which creates the object on the heap.

This patch is long but it more-or-less contains just the following
types of changes:

1. The change to sched.hh to hide the sched::thread constructor, and
   instead offer a sched::thread::make(...) function.

2. Every class which used a "sched::thread" member was replaced by
   a std::unique_ptr<sched::thread> member.

   One "casualty" of this change is that a class that used to hold a
   sched::thread member will now hold a sched::thread pointer, and use an
   extra allocation and indirection - even if we know for sure that the
   containing object will always be allocated on the heap (such as the
   case in pthread.cc's pthread::_thread, for example). This can be worked
   around by making the exceptional class a friend of sched::thread to be
   able to use the hidden constructor - but for now I decided the tiny
   performance benefit isn't important enough to make this exception.

3. Every place which used "new sched::thread(...)" to create the thread,
   now uses "sched::thread::make(...)" instead.

   Sadly we no longer allow "new sched::thread(...)", although there is
   nothing wrong with it, because disabling the constructor also disables
   the new expression.

4. One test in misc-wake.cc (which wasn't run by default anyway, because
   it is named "misc-*") was commented out because it relied on creating
   thread objects in mmap()ed areas.

One of the places modified is rcu_flush(), whose on-stack sched::thread
objects are what caused issue cloudius-systems#790.

Fixes cloudius-systems#790.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>