New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tst-without-namespace.so is sporadically failing #790
Comments
Could this be a use after delete problem with the threads created by rcu_flush? They all have the same address on the stack. Also, this problem appears to be fixed if I make a change to dynamically allocate the threads, e.g.
|
@DerangedMonkeyNinja this is a very interesting observation. In the past we had bugs where a thread would wake up its own join(), or thread::join() raced with thread::destroy(), in both cases causing situations where if we destroy a thread object immediately after join()ing it, the thread might still be be in use while being destroyed. I thought I fixed all these bugs, but perhaps one remains? Looking at the code, I noticed several minor issues (like second thread::join() causing immediate return instead of assert, and thread::join() uses _detaced_state without holding the rcu_read_lock), but it doesn't seem to me that either of these can cause the issue we're seeing. If this is indeed a use-after-free bug, we could debug it with our use-after-free debugger: Keep your patch (so the threads will be separate objects on the heap, not reusing the same one), and edit conf/base.mk to to have conf-debug_memory=1 instead of 0, and recompile. This debug allocator option makes sure that each allocation lives in a separate page, which is unmapped when freeded, so a use-after-free causes an immediate page fault. This may allow you to see exactly which piece of code is using an already-freed object. |
@DerangedMonkeyNinja sorry, github mailed my comment prematurely, with only partial content and partial crap. Can you please take a look at the bug tracker itself for the fixed content of my comment? Thanks. |
So, running with dynamically allocated threads and conf-debug_memory=1 doesn't seem to generate any asserts. However, I really wasn't thinking of a typical use-after-free type issue. Since the previous thread is overwritten with a new thread in the original code, the raw pointer in _detached_state ends up pointing to a new detached_state struct every time a thread is instantiated. I honestly don't quite understand how the RCU stuff works, so maybe that isn't really a problem (though it seems like it would be...) |
I've played around with this issue some more, and came across the following observations. First, I changed thread::_detached_state to an atomic ptr, added a new tracepoint, and modified cpu::handle_incoming_wakeups() thusly:
Surprisingly, this function will go into an (apparently) infinite loop. And the trace data is really strange:
Attaching the debugger, you can see that t, t._id, and t._detached_state both have the same values as they do in the thread_create trace line. I think the only possible explanation for this is that CPU 1 is getting bogus data from it's cache. So as a quick and dirty way to test that out, I did this:
And with that single change, I'm unable to get the test to fail. Previously, it would reliably fail within 5 attempts, but now I can run several hundred without any problems. I guess I should add I don't know what a proper fix for this is. A TLB flush seems excessive... |
I think I have a proper fix for this.
In this case at least, the lazy_flush_tlb flag was getting set when the CPU waited for the interrupt. When it resumed, handle_incoming_wakeups grabbed a thread that was in a page region that was modified. |
Addendum: repeatedly running this unit test for an extended period produced an identical page fault in cpu::handle_incoming_wakeups when it was called from cpu::reschedule_from_interrupt. So, it seems like the proper fix is really...
Note: I don't think the flush in cpu::reschedule_from_interrupts is necessary since one of the first things it does is call into cpu::handle_incoming_wakeups. |
Wow, this was surprising! For reference, the "lazy tlb flush" feature which you are fixing was added by @gleb-cloudius in commit 7e38453 Here is what I think might be happening (but please correct me if I'm misunderstanding): Imagine CPU 0 creates a new application thread A which itself creates a new thread B sent to CPU 1 - where thread B is a sched::thread object created on thread A's stack. As is normal for pthread threads (see pthread::allocate_stack), thread A is created with an mmap'ed stack, so the "sched::thread" structure allocated for thread B, which is on that stack, is in an mmap'ed area. Reading from this area might require flushing the TLB first. So it's not that running the incoming thread needs the new translation - it's the incoming thread structure itself which might need the new translation! By the way, I don't think we can get rid of the second test-flush in reschedule_from_interrupts - the problem is that it is quite possible that we have no incoming threads, so your code doesn't reach the place where it tests-and-flushes - and yet we do need to test-and-flush when we decide to context switch (the purpose of this delayed tlb flush was just to avoid the need for an IPI - not to avoid the flush). I'm worried, though, that even this patch might not be enough :-( The code which sets the cpu::lazy_flush_tlb variable doesn't always set it: It tries to either set this variable, or send an IPI. Sometimes it will send an IPI. If the target CPU already happens to be inside the reschedule_from_interrupt() code after interrupted in an "app thread", the variable won't be set (because the code will think the cpu is still app thread state) and the IPI will be ignored (or delayed) and the reschedule code can use an outdated page table for an incoming thread? I wonder if we need to keep this variable set even if we send an IPI? Another option we could consider is to disallow creating sched::thread on the stack - and only allow allocating it on the heap (this can be achieved with a private constructor, and classes which hold a sched::thread member will now hold an std::unique_ptrsched::thread instead). This will mean that the thread structure itself will never be in an mmap'ed area. Anybody have an opinion about this alternative? Even more generally, I'm beginning to be worried about the validity of the assumption of patch 7e38453 - which assumed that "system threads" do not need to work on user-mmap()ed areas. We now saw that "class thread" objects can actually live on user-mmap()ed areas and system threads (even the scheduler) need it. I wonder if we can't have similar issues elsewhere... @gleb-cloudius what do you think about this patch in general, and my questions above in particular? |
I just sent a patch to the mailing list titled "RFC: force sched::thread to be in heap" which enforces that sched::thread objects are always allocated on the heap, never on the stack, so it avoids the TLB flush issue (the heap is always using a linear mapping prepared during boot, and never needs a TLB flush). I still can't make up my mind whether that approach, or the one suggested above by @DerangedMonkeyNinja (to fix the TLB flushing issue) is better. My patch is much longer and uglier - which can't be good. On the other hand, I'm worried that the patch of @DerangedMonkeyNinja might not cover all the possible cases, when a thread is woken in parallel with an interrupt being processed. |
@avikivity suggested an approach similar in spirit to my patch (forcing the thread structure to be allocated on the heap) but implemented differently: The idea is that the sched::thread will contain nothing but a pointer to the real thread structure. The user can create the sched::thread object wherever he pleases, but the constructor will allocate the actual, internal, thread structure in the way we want (on the heap). We actually already have exactly this sort of indirection in sched::thread, and it is detached_state, which we needed to hold parts of the thread that we needed to survive the thread just a bit, until the RCU quiet period. So basically now we'll move all the thread's state into this detached state. The sched::thread destructor will rcu_dispose() the real thread structure rather than delete it immediately, but can call immediately a new method which cleans up part of the thread (like its stack) which we are sure we no longer need. Most of the private "thread" methods will become methods on the new detached state structure (need a new name for it, maybe thread_impl) and scheduler code which currently handles thread pointers - e.g., the scheduler's run queue - will now have pointers directly to this thread_impl - so the new scheme will not hurt scheduling performance, even the opposite (currently a few of the thread's fields need to be indirected through "detached_state" and this will go away). |
Experimental patch from OSv mailing list for issue described here: cloudius-systems#790
In issue cloudius-systems#790, Timmons C. Player discovered something dangerous with on-stack sched::thread objects: The application's stack may be mmap()ed, so the scheduler now needs to access an application's mmap()ed memory area. Those accesses can potentially have all sort of problems (e.g., page faults in the scheduler, although in practice issue cloudius-systems#143 precludes it). But more concretely, The "lazy TLB flush" code added in commit 7e38453 means that the scheduler may see old pages for an application's mmap()ed pages, so it cannot work with a sched::thread in an mmap()ed area. This patch prevents on-stack sched::thread construction by hiding the constructor, and only allowing a thread to be created through the function sched::thread::make(...), which creates the object on the heap. This patch is long but it more-or-less contains just the following types of changes: 1. The change to sched.hh to hide the sched::thread constructor, and instead offer a sched::thread::make(...) function. 2. Every class which used a "sched::thread" member was replaced by a std::unique_ptr<sched::thread> member. One "casualty" of this change is that a class that used to hold a sched::thread member will now hold a sched::thread pointer, and use an extra allocation and indirection - even if we know for sure that the containing object will always be allocated on the heap (such as the case in pthread.cc's pthread::_thread, for example). This can be worked around by making the exceptional class a friend of sched::thread to be able to use the hidden constructor - but for now I decided the tiny performance benefit isn't important enough to make this exception. 3. Every place which used "new sched::thread(...)" to create the thread, now uses "sched::thread::make(...)" instead. Sadly we no longer allow "new sched::thread(...)", although there is nothing wrong with it, because disabling the constructor also disables the new expression. 4. One test in misc-wake.cc (which wasn't run by default anyway, because) it is named "misc-*") was commented out because it relied on creating thread objects in mmap()ed areas. One of the places modified is rcu_flush(), whose on-stack sched::thread objects are what caused issue cloudius-systems#790. Fixes cloudius-systems#790. Signed-off-by: Nadav Har'El <nyh@scylladb.com>
As @wkozaczuk first noticed, tst-without-namespace.so (and also test-namespace.so?) fail once in every few runs - the failure rate goes between once every several hundred runs (on my own laptop) to once every several runs (on the OSv's Jenkins machine which reports these failures once in a while).
Here is some of the information I originally collected in issue #779:
The failure looks like:
The backtrace I got from gdb is (release build, built on Fedora 23):
In frame 10,
In a second run, after trying to add a few more assertions in the lockless_queue<> code, I got a page fault from a slightly different place:
So in this run, q.empty() was fine, but q.front() returned garbage.
My initial suspicion was that we might have a race somehow, either a bug in the lockless_queue implementation, or a bug in its usage (e.g., two different threads writing to the same queue). However, I later reverted commit bca7bbb which re-implemented the lockless-queue.hh wakeup queue, and this test still fails (after several hundred successes) - but in a slightly different place - see #779 (comment). So it's probably not a bug in lockless-queue.hh, and probably also not an issue with multiple writers (which the previous implementation actually allowed, even if unnecessarily).
This bug is also reproducible (again, in one of many runs) with a debug build. I usually get the second type of crash crash (at t.tcpu()), and it's more informative to debug this way without information ebing optimized away.
The thread whose wakeup is crashing a newly started thread (a thread which was set up, woken, but never yet run) - i.e., a thread for which "osv info threads" shows state status::waking and IP is at
arch/x64/entry.S:112
.The page fault in t.tcpu() has cr2 = 8, suggesting that when t.tcpu() was called and used "_detached_state->_cpu", _detached_state was still a null pointer. But in gdb, I already see _detached_state having a correct, non-null value. Could this be a memory ordering issue?
I thought that this is unlikely to be a memory ordering issue because _detached_state is set in thread::thread(), significantly earlier than the thread::start() call which causes the failing wake. But obviously the time difference does not ensure ordering. I was worried that wake() sets status to waking with CST memory ordering but handle_incoming_wakeups() does not read the status, so maybe this doesn't ensure memory ordering. But on the other hand, wake() uses "release" memory ordering to write to the queue, and the queue read code uses "consume" memory ordering to read from it. I also tested replacing the "consume" memory ordering with "acquire", and it didn't fix the problem (maybe moved it around a bit).
By printing t, using "osv info threads" and then "osv thread ..." and "where", I can see the stack trace of the thread which started the problematic new thread, to understand what is this new thread:
So osv::run() called already called osv::application::join() which apparently decided to delete the shared object because it isn't in use any more (I hope this actually happens after the shared object stopped running). But the more interesting thing we can learn from this backtrace is that in frame 6 we discover the actual identity of the thread whose startup is crashing: It's not the regular application thread, it's a temporary thread created by osv::rcu_flush().
Another interesting discovery here is that the osv::rcu_flush() code can wake a thread on a different core. Looking in GDB, we discover that indeed the problem happens when a thread on sched::cpus[1] starts a thread on c=sched::cpus[0]. So this brings back the possibility of memory ordering issues.
Every time this crash happens, it's always from this osv::rcu_flush() thread, and every time I checked, it was a thread on a different CPU from where the wakeup occurred (I was testing this with 4 vcpus).
The text was updated successfully, but these errors were encountered: