Segfault during benchmark #12
I'm also seeing a segfault (though it's rare and intermittent) in enqueue_bulk. The core dump I have is pointing me to line 2427 in
My program is fairly large, so this is in no way a minimal test case. However, I was wondering if you have any idea what may be causing this, or the (I suspect related) bug above.
I tried changing all of the enqueue_bulk and dequeue_bulk operations to their non-bulk variants (with, of course, the appropriate loops). The segfault still remains. In this case, the core file points to this line, which, judging from the comment, seems to be somewhat of a corner case to begin with. Also, as evidence that the crash is not related to some other user-space issue, I can report that I've temporarily replaced the concurrent queue with the Intel TBB concurrent queue and run a stress test (executing the program 100 times) without any segfault. Of course, the bulk operations make the moodycamel variant much faster when there is no segfault, and I'd much prefer your elegant single-header solution to the much bulkier TBB library :).
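For readers following along, the bulk-to-non-bulk substitution described above looks roughly like this. This is a hedged sketch against the public moodycamel::ConcurrentQueue API (`enqueue_bulk`, `try_dequeue_bulk`, and their non-bulk counterparts); it assumes `concurrentqueue.h` is on the include path, and the item counts are illustrative, not from the project:

```cpp
#include "concurrentqueue.h"  // single-header moodycamel queue (assumed available)
#include <vector>

int main() {
    moodycamel::ConcurrentQueue<int> q;
    std::vector<int> items = {1, 2, 3, 4};

    // Bulk form (faster, but the one implicated in the first crash):
    q.enqueue_bulk(items.data(), items.size());

    // Equivalent non-bulk form, used to test whether the crash persists:
    for (int x : items)
        q.enqueue(x);

    // Same substitution on the dequeue side:
    int out[8];
    std::size_t n = q.try_dequeue_bulk(out, 8);  // bulk
    int v;
    while (q.try_dequeue(v)) { /* non-bulk loop */ }
    (void)n;
}
```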
…lmonQuantify There seems to be a bug in the moodycamel::ConcurrentQueue (cameron314/concurrentqueue#12) that, under very high contention, can result in an out-of-bounds memory access and segfault. This was happening on the test data where, due to its very small size, the queue was coming under very heavy contention. The crash was not deterministic, but replacing the moodycamel queue with the TBB queue seems to have resolved the issue in all cases. In the future, I'd like to move back to the moodycamel variant, as it allows bulk enqueue/dequeue operations and is thus faster. However, the TBB implementation will remain in place until the moodycamel bug is fixed. The old code remains in SalmonQuantify.cpp (commented out) so that it will be easy to re-enable when the bug is finally fixed.
Hi @rob-p, thanks for reporting the segfault you're seeing! That makes no fewer than four in a week, after months without any issues. :/ Unfortunately, there's no way to know whether all the segfaults are related (though indeed one bug is all it would take to cause multiple problems). Usually the symptom is far away from the cause, but since the queue is so complex, every little clue helps (especially since I can't reproduce these segfaults on my end yet). I'll fix the bug as soon as I can pinpoint it -- sorry for the trouble! In the meantime, since I can't reproduce this on my own, I'd appreciate any help you can provide.

The second segfault you're seeing with the non-bulk methods is very interesting, as it's completely unrelated to the main queueing algorithm (it's in the block free-list management code). This could either be a bug with the free list itself (unlikely, despite the corner cases), or it could be that a bug in the queueing code causes a block to be freed twice, or accessed after it's freed, etc. If you comment out the

I really, really want to fix these bugs (or, hopefully, bug), but I can't do it unless I can either reproduce the problem (at least intermittently) or have the help of someone who can. Would you be willing to test out certain debug configurations to help pinpoint the root cause of the segfault?
Hi @cameron314, I'd be happy to help where possible. The core dump is pretty large, but I can try putting it up on Google Drive. I'll also try disabling the thread-local support to see if that makes a difference. From the other direction, you could try to reproduce the crash in the context of my project's use of the concurrent queue. Though the latest code uses the TBB queue, commit 2a61f4937c9f9e859403a50e9987e94cee2c80c9 is exactly the one that is producing the crash (on the test data distributed with my project). If you build the project successfully, you can trigger the bug by just running
That code is the variant where the bug happens with enqueue_bulk and dequeue_bulk (which is the variant I hope to be able to use eventually anyway, since it's substantially faster than the non-bulk variants). --Rob
Amazing, thank you so much! You don't have to send the core dump if it's too much trouble. I'll try to reproduce this with your project tomorrow, and will otherwise work on more extensive debugging instrumentation to help narrow down the bug. What compiler/platform are you on? (i.e. OS, processor, compiler + version). Thanks again,
No problem! I really like using your queue (I'm also using your ReaderWriter queue) --- it's wicked fast when there's actually work to be done (i.e. when you're not just hitting an empty queue), and it's so much nicer to have a pair of headers than an entire external library (TBB) that, by the way, can't be statically linked. I'm doing all of my development on a machine running Ubuntu 14.04 server edition. Here are the relevant stats:
I only included the end of the --Rob
OK --- this is very interesting. When I simply edit:

If it's useful to know, the segfault also seems to occur with g++ v4.9.1 (when
… bug with it (at the very least, it doesn't work on some platforms) -- see issue #12)
Ah, that is interesting. It's the most recently added feature, and one that more compilers than usual seem to have trouble with (because it involves a static thread-local object with a non-trivial destructor that runs when the thread exits) -- that's why I suggested disabling it (just in case). I'll have to audit the code very carefully to see whether it's correct or whether this is a compiler/platform bug (it's probably a bug in my code, though).

The thread-local storage is only used to reclaim implicit producers once the thread that uses them exits. (I have my own implementation of per-object thread-local storage for other uses.) This means that memory usage might be higher with it off (e.g. if you use many short-lived threads to enqueue to the queue), and dequeue time might increase too. I'll reply back on this thread when I find out what's going on. I don't have any theories yet. In the meantime, I've disabled it in the master branch for now.

Note that at least one of the other segfault issues is almost certainly unrelated, since it occurs only with the explicit versions of the queue, regardless of whether this flag is turned on/off. So there's still a bug somewhere. Thank you for testing this!

Finally, since you seem to care about performance, it would be faster to use the explicit version of the queue (with one

By the way, it's also possible to statically link TBB (I do it for the queue's benchmarks, for example; it's in the makefile), but since you're not supposed to, you basically can't complain if anything starts to break in the future :-)
Thanks! Also, thanks for the tips regarding performance. I must say, I'm not completely sure I've wrapped my head around the requirements for proper usage of the explicit producer/consumer tokens. I use concurrent queues for a couple of purposes in my codebase, but the biggest two are as (1) work queues and (2) multi-threaded object pools to avoid allocation overhead.

For the purposes of (1), unless I'm misunderstanding the proper usage (which is quite possible), it seems like the implicit consumer model is the one that makes the most sense. There is (usually) one thread populating the queue with work that needs to be performed, and each worker thread wants to obtain a job as soon as it becomes available (i.e. a job is not intended for any specific worker). However, like I said, it's quite possible I'm missing out on the proper use-case for the explicit producer and consumer functions.

(2) is the main reason I'm using pointers in different places. Basically, a lot of the structures placed in the queue are re-used (and sometimes re-allocated). When old work is done, it often makes sense to keep the structure/class around and update its members rather than to destroy the object and construct an entirely new one. The pointers allow this to be done with code that is part of my project, and with the base-level structures that hold a significant amount of raw data (the --Rob
Both (1) and (2) could benefit from producer tokens, though it sounds like it probably won't have much of an impact on (1) since there's usually just one thread. Consumer tokens are almost always very beneficial performance-wise.

The idea behind tokens is that the queue can benefit from keeping state that's shared across calls to the enqueue/dequeue methods, stored in tokens. Every non-token method has a corresponding one that takes a token as its first parameter. To use them, simply declare a token of the appropriate type (producer or consumer) and pass it in as needed. The only caveat is that a token must only be used by one thread at a time (i.e. they are intended to be allocated either on the stack or in some other thread-local manner -- but they don't strictly have to be, as long as the one-thread-at-a-time invariant holds true).

With producer tokens, what happens is that the token contains a pointer to a thread-specific ("explicit") inner queue, so that it can be used directly instead of having to look up an "implicit" thread-local inner queue in a lock-free hash map on every operation (every enqueue operation, one way or another, applies to an SPMC sub-queue). Consumer tokens allow consumers to spread out evenly over the existing sub-queues and develop an affinity, effectively pairing up producers with consumers. This makes things a lot faster if there's more than one sub-queue. I hope this clarifies a bit :-)
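The token pattern described above can be sketched as follows. This is a minimal illustration against the documented moodycamel::ConcurrentQueue token API (`ProducerToken`, `ConsumerToken`, and the token-taking overloads of `enqueue`/`try_dequeue`); it assumes `concurrentqueue.h` is on the include path:

```cpp
#include "concurrentqueue.h"  // single-header moodycamel queue (assumed available)

int main() {
    moodycamel::ConcurrentQueue<int> q;

    // One token per thread. Tokens may live on the stack or in thread-local
    // storage; the invariant is only that a single thread uses a given token
    // at any one time.
    moodycamel::ProducerToken ptok(q);
    moodycamel::ConsumerToken ctok(q);

    // Token overloads mirror the non-token methods, with the token first.
    q.enqueue(ptok, 42);  // goes directly to this producer's explicit sub-queue

    int item;
    if (q.try_dequeue(ctok, item)) {
        // The consumer token develops an affinity for a sub-queue across
        // calls, pairing this consumer with a producer.
    }
}
```

In the work-queue scenario from the previous comment, each worker thread would declare its own `ConsumerToken` once and reuse it for every dequeue in its loop.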
This might be the cause if you're using gcc 4.9.x: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147 |
Dima reported:
I tried to run your benchmark and moodycamel::ConcurrentQueue crashed during the test:

```
only enqueue (pre-allocated):
  (Measures the average operation speed when all threads are producers,
  and the queue has been stretched out first)
```

```
Core was generated by `./benchmarks'.
Program terminated with signal 11, Segmentation fault.
#0  load (__m=std::memory_order_relaxed, this=0x8040) at /usr/include/c++/4.8.2/bits/atomic_base.h:496
496       return __atomic_load_n(&_M_i, __m);
```