Segfault during benchmark #12

Closed
cameron314 opened this issue Feb 11, 2015 · 11 comments

@cameron314 (Owner)

Dima reported:

I tried to run your benchmark and moodycamel::ConcurrentQueue crashed during the test:

only enqueue (pre-allocated):
(Measures the average operation speed when all threads are producers,
and the queue has been stretched out first)

moodycamel::ConcurrentQueue
Without tokens
1 thread: Avg: 0.0131us Range: [0.0131us, 0.0131us] Ops/s: 76.48M Ops/s/t: 76.48M
2 threads: Avg: 0.0296us Range: [0.0295us, 0.0297us] Ops/s: 67.48M Ops/s/t: 33.74M
4 threads: Avg: 0.1216us Range: [0.1212us, 0.1221us] Ops/s: 32.90M Ops/s/t: 8.23M
8 threads: Avg: 0.7136us Range: [0.7077us, 0.7170us] Ops/s: 11.21M Ops/s/t: 1.40M
Segmentation fault (core dumped)

Core was generated by `./benchmarks'.
Program terminated with signal 11, Segmentation fault.
#0 load (__m=std::memory_order_relaxed, this=0x8040) at /usr/include/c++/4.8.2/bits/atomic_base.h:496

496 return __atomic_load_n(&_M_i, __m);

cameron314 added the bug label Feb 11, 2015

rob-p commented Feb 16, 2015

I'm also seeing a segfault (though it's rare and intermittent) in enqueue_bulk. The core dump I have is pointing me to line 2427 in concurrentqueue.h:

#0  0x0000000000509e51 in moodycamel::ConcurrentQueue<AlignmentGroup<SMEMAlignment>*, moodycamel::ConcurrentQueueDefaultTraits>::ImplicitProducer::enqueue_bulk<(moodycamel::ConcurrentQueue<AlignmentGroup<SMEMAlignment>*, moodycamel::ConcurrentQueueDefaultTraits>::AllocationMode)0, std::move_iterator<__gnu_cxx::__normal_iterator<AlignmentGroup<SMEMAlignment>**, std::vector<AlignmentGroup<SMEMAlignment>*, std::allocator<AlignmentGroup<SMEMAlignment>*> > > > > (this=0x2898980, itemFirst=..., itemFirst@entry=..., count=count@entry=1000)
    at /home/robp/SoftwareStaging/sailfish/include/concurrentqueue.h:2427
2427                                                    new ((*this->tailBlock)[currentTailIndex++]) T(*itemFirst++);

My program is fairly large, so this is in no way a minimal test case. However, I was wondering if you have any idea what may be causing this bug or the one above (which I suspect is related).


rob-p commented Feb 16, 2015

I tried changing all of the enqueue_bulk and dequeue_bulk operations to their non-bulk variants (with, of course, the appropriate loops). It seems that the segfault still remains. In this case, the core file points to this line, which, from the comment, seems to be somewhat of a corner case to begin with.
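
For reference, a minimal sketch of the kind of substitution I mean (the element type and names here are illustrative, not from my actual project):

#include "concurrentqueue.h"
#include <vector>

int main()
{
    moodycamel::ConcurrentQueue<int> q;
    std::vector<int> items(1000, 42);

    // Instead of: q.enqueue_bulk(items.begin(), items.size());
    for (auto& item : items)
        q.enqueue(item);

    // Instead of: q.try_dequeue_bulk(items.begin(), items.size());
    int value;
    while (q.try_dequeue(value)) {
        // ... process value ...
    }
    return 0;
}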

Also, as evidence that the crash is not related to some other user-space issue, I can report that I've temporarily replaced the concurrent queue with the Intel TBB concurrent queue and run a stress test (executing the program 100 times) without any segfault. Of course, the bulk operations make the moodycamel variant much faster when there is no segfault, and I'd much prefer your elegant single-header solution to the much bulkier TBB library :).

rob-p added a commit to kingsfordgroup/sailfish that referenced this issue Feb 16, 2015
…lmonQuantify

There seems to be a bug in the moodycamel::ConcurrentQueue (cameron314/concurrentqueue#12)
that, under very high contention, can result in an out-of-bounds memory
access and segfault.  This was happening on the test data where, due to
its very small size, the queue was coming under very heavy contention.
The crash was not deterministic, but replacing the moodycamel queue
with the tbb queue seems to have resolved the issue in all cases.  In
the future, I'd like to move back to the moodycamel variant, as it
allows bulk enqueue/dequeue operations, and is thus faster.  However,
the tbb implementation will remain in place until the moody camel bug is fixed.
The old code remains in SalmonQuantify.cpp (commented out) so that it
will be easy to re-enable when the bug is finally fixed.
cameron314 self-assigned this Feb 16, 2015
@cameron314 (Owner, Author)

Hi @rob-p, thanks for reporting the segfault you're seeing! That makes no less than four in a week after months without any issues. :/

Unfortunately, there's no way to know if all the segfaults are related or not (though indeed one bug is all it would take to cause multiple problems). Usually the symptom is far away from the cause, but since the queue is so complex every little clue helps (especially since I can't reproduce these segfaults on my end yet). I'll fix the bug as soon as I can pinpoint it -- sorry for the trouble!

In the meantime, since I can't reproduce this on my own, I'd appreciate any help you can provide.
With the enqueue_bulk segfault, can you find out what dereference is causing the segfault? (I'm guessing it's tailBlock). Could you perhaps email me the core dump?
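
(For anyone digging into a similar core dump, a rough sketch of the gdb session that would answer that question -- the binary and core-file names are placeholders:)

$ gdb ./your_binary core
(gdb) bt                      # backtrace of the crashing thread
(gdb) frame 0                 # select the frame reported in the crash
(gdb) print this->tailBlock   # inspect the suspected pointer
(gdb) print *this             # dump the rest of the producer's state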

The second segfault you're seeing with the non-bulk methods is very interesting, as it's completely unrelated to the main queueing algorithm (it's in the block free-list management code). This could either be a bug with the free list itself (unlikely, despite the corner cases), or it could be that a bug in the queueing code causes a block to be freed twice, or accessed after it's freed, etc.

If you comment out the MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED section on lines 108-115, does the segfault go away?

I really, really want to fix these bugs (or hopefully bug), but I can't do it unless I can either reproduce the problem (at least intermittently) or have the help of someone who can. Would you be willing to test out certain debug configurations to help pinpoint the root cause of the segfault?


rob-p commented Feb 16, 2015

Hi @cameron314, I'd be happy to help where possible. The core dump is pretty large, but I can try and throw it up on google drive. I'll also try out disabling the thread local support to see if that makes a difference.

From the other direction, you could try seeing if you can reproduce the crash in the context of my project's use of the concurrent queue. Though the latest code uses the TBB queue, commit 2a61f4937c9f9e859403a50e9987e94cee2c80c9 is exactly the one that is producing the crash (on the test data distributed with my project). If you build the project successfully, you can trigger the bug by just running make test a few times. Once make test has been run once (and the sample data has been unpacked), you can run the command that make test uses, which (from the build directory) is just:

$ ./src/salmon quant -i ../sample_data/sample_salmon_index -l IU -1 ../sample_data/reads_1.fastq -2 ../sample_data/reads_2.fastq -o sample_salmon_quant -n 1000000

That code is the variant where the bug happens with enqueue_bulk and dequeue_bulk (which is the variant I hope to be able to use eventually anyway, since it's substantially faster than the non-bulk variants).

--Rob

@cameron314 (Owner, Author)

Amazing, thank you so much! You don't have to send the core dump if it's too much trouble.

I'll try to reproduce this with your project tomorrow, and will work on more extensive debugging instrumentation to help narrow down the bug otherwise.

What compiler/platform are you on? (i.e. OS, processor, compiler + version).

Thanks again,
Cameron


rob-p commented Feb 16, 2015

No problem! I really like using your queue (I'm also using your ReaderWriterQueue) --- it's wicked fast when there's actually work to be done (i.e. when you're not just hitting an empty queue), and, by the way, it's so much nicer to have a pair of headers than an entire external library (TBB) that can't be statically linked. I'm doing all of my development on a machine running Ubuntu 14.04 server edition. Here are the relevant stats:

$ uname -a 
Linux ocean 3.13.0-43-generic #72-Ubuntu SMP Mon Dec 8 19:35:06 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ g++ -v
Using built-in specs.
COLLECT_GCC=g++
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/4.8/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.2-19ubuntu1' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)
$ cat /proc/cpuinfo
...
processor       : 31
vendor_id       : GenuineIntel
cpu family      : 6
model           : 45
model name      : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz
stepping        : 7
microcode       : 0x70d
cpu MHz         : 2599.818
cache size      : 20480 KB
physical id     : 1
siblings        : 16
core id         : 7
cpu cores       : 8
apicid          : 47
initial apicid  : 47
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 5200.60
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

I only included the end of the /proc/cpuinfo output because it's essentially the same thing repeated 32 times (it's a dual-socket machine with two 8-core CPUs & hyperthreading).

--Rob


rob-p commented Feb 16, 2015

OK --- this is very interesting. When I simply #undef MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED (and reinstate the code using the bulk operations), the segfault seems to go away. I "stress tested" this by running the command I mentioned above 100 times in succession without seeing a segfault. Further, even without using the thread-local storage, the moodycamel queue allows the 100 runs to complete about 25% faster than the TBB queue does (~3m 30s on my machine instead of ~4m 30s). Any idea why enabling thread-local storage could be a problem (i.e. do you think it's a compiler / runtime-level error, or do you think there might be something tricky going on in your use of thread-local storage in the queue code)?
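
(A sketch only of how I forced the feature off -- placed in concurrentqueue.h immediately after the compiler-detection block that defines the macro:)

#ifdef MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED
#undef MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED   // disable the thread-local reclamation feature
#endif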

edit: If it's useful to know, the segfault also seems to occur with g++ v4.9.1 (when MOODYCAMEL_CPP11_THREAD_LOCAL_SUPPORTED isn't undefined).

cameron314 added a commit that referenced this issue Feb 16, 2015
… bug with it (at the very least, it doesn't work on some platforms) -- see issue #12)
@cameron314 (Owner, Author)

Ah, that is interesting. It's the most recently added feature, and one that more compilers than normal seem to have trouble with (because it involves a static thread-local object with a non-trivial destructor that runs when the thread exits) -- that's why I suggested disabling it (just in case). I'll have to audit the code very carefully to see whether it's correct or whether it's a compiler/platform bug (it's probably a bug in my code, though).

The thread local storage is only used to reclaim implicit producers once the thread that uses them exits. (I have my own implementation of per-object thread-local storage for other uses.) This means that memory usage might be higher with it off (e.g. if you use many short-lived threads to enqueue to the queue), and dequeue time might increase too.

I'll reply back on this thread when I find out what's going on. I don't have any theories yet. In the meantime, I've disabled it in the master branch for now.

Note that at least one of the other segfault issues is almost certainly unrelated, since it occurs only with the explicit versions of the queue regardless of whether this flag is turned on/off. So there's still a bug somewhere.

Thank you for testing this!

Finally, since you seem to care about performance, it would be faster to use the explicit version of the queue (with one ConcurrentQueue::ProducerToken per producer per thread) if you can. Using ConsumerTokens for dequeueing would also be more efficient. You might also consider overriding the default traits and tweaking a few constants (e.g. using a bigger block size). Also, depending on how you allocate the objects whose pointers you put into the queue, it may be faster to avoid pointers and let the queue handle memory management, using it with regular objects (I didn't look closely at your code -- this may not be an option for you for myriad reasons).
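
(A minimal sketch of the traits-override suggestion -- the BLOCK_SIZE trait comes from ConcurrentQueueDefaultTraits, while the chosen value and the Job type are made up for illustration:)

#include "concurrentqueue.h"

struct Job { int id; };

// Override the default traits to use larger blocks than the default.
struct BigBlockTraits : public moodycamel::ConcurrentQueueDefaultTraits
{
    static const std::size_t BLOCK_SIZE = 256;   // must be a power of two
};

// Store plain objects and let the queue manage their memory.
moodycamel::ConcurrentQueue<Job, BigBlockTraits> queue;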

By the way, it's also possible to statically link TBB (I do it for the queue's benchmarks, for example; it's in the makefile), but since you're not supposed to, you basically can't complain if anything starts to break in the future :-)


rob-p commented Feb 17, 2015

Thanks! Also, thanks for the tips regarding performance. I must say, I'm not completely sure I've wrapped my head around the requirements for proper usage of the explicit producer / consumer tokens. I use concurrent queues for a couple of purposes in my codebase, but the biggest two are (1) work queues and (2) multi-threaded object pools to avoid allocation overhead.

For the purposes of (1), unless I'm misunderstanding the proper usage (which is quite possible), it seems like the implicit consumer model is the one that makes the most sense. There is (usually) one thread populating the queue with work that needs to be performed, and each worker thread wants to obtain a job as soon as it becomes available (i.e. a job isn't intended for any specific worker). However, like I said, it's quite possible I'm missing out on the proper use-case for the explicit producer and consumer functions.

(2) is the main reason I'm using pointers in different places. Basically, a lot of the structures placed in the queue are re-used (and sometimes re-alloced). When old work is done, it often makes sense to keep the structure / class around and update its members rather than to deconstruct the object and construct an entirely new one. The pointers allow this to be done with code that is part of my project, and with the base-level structures that hold a significant amount of raw data (the bam_seq_t type from libstaden).
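
(For what it's worth, a bare-bones sketch of that pool pattern, with a made-up Buffer type standing in for the real structures:)

#include "concurrentqueue.h"

struct Buffer { /* reusable members, e.g. wrapping a bam_seq_t */ };

moodycamel::ConcurrentQueue<Buffer*> pool;

void worker()
{
    // Grab a reusable object if one is available; otherwise allocate a new one.
    Buffer* buf = nullptr;
    if (!pool.try_dequeue(buf))
        buf = new Buffer();

    // ... update buf's members and do the work ...

    // Return the object to the pool instead of destroying it.
    pool.enqueue(buf);
}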

--Rob

@cameron314 (Owner, Author)

Both (1) and (2) could benefit from producer tokens, though it sounds like it probably won't have much of an impact on (1) since there's usually just one thread. Consumer tokens are almost always very beneficial performance-wise.

The idea behind tokens is that the queue can benefit from keeping state that's shared across calls to the enqueue/dequeue methods, stored in tokens. Every non-token method has a corresponding one that takes a token as its first parameter. To use them, simply declare a token of the appropriate type (producer or consumer) and pass it in as needed. The only caveat is that a token must only be used by one thread at a time (i.e. tokens are intended to be allocated either on the stack or in some other thread-local manner -- but they don't strictly have to be, as long as the one-thread-at-a-time invariant holds).
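
(A minimal usage sketch, assuming an int payload:)

#include "concurrentqueue.h"

moodycamel::ConcurrentQueue<int> q;

void producer_thread()
{
    // Create the token once per thread, then reuse it for every enqueue.
    moodycamel::ProducerToken ptok(q);
    q.enqueue(ptok, 17);
}

void consumer_thread()
{
    // Same idea for dequeueing.
    moodycamel::ConsumerToken ctok(q);
    int item;
    while (q.try_dequeue(ctok, item)) {
        // ... process item ...
    }
}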

With producer tokens, what happens is that the token contains a pointer to a thread-specific ("explicit") inner queue, so that it can be used directly instead of having to look up an "implicit" thread-local inner queue in a lock-free hash map on every operation (every enqueue operation, one way or another, applies to an SPMC sub-queue).

Consumer tokens allow consumers to spread out evenly over the existing sub-queues and develop an affinity, effectively pairing up producers with consumers. This makes things a lot faster if there's more than one sub-queue.

I hope this clarifies a bit :-)

@JodiTheTigger

This might be the cause if you're using gcc 4.9.x: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65147
Try a more recent version of GCC?
