arm pzstd ThreadPool test segfault #407

Open
pixelb opened this Issue Oct 6, 2016 · 14 comments

Projects

None yet

2 participants

@pixelb
Contributor
pixelb commented Oct 6, 2016 edited

I've not actually got an arm system to dig into this, as this is coming from
fedora aarch64 and armv7hl build servers.
Anyway FYI...

Running main() from gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from ThreadPool
[ RUN ] ThreadPool.Ordering
make[1]: *** [Makefile:38: test] Segmentation fault (core dumped)

@pixelb pixelb changed the title from aarch64 pzstd ThreadPool test segfault to arm pzstd ThreadPool test segfault Oct 6, 2016
@terrelln
Contributor
terrelln commented Oct 6, 2016

Thanks for the report! As of now pzstd is completely untested on arm systems. I don't currently have an arm machine to test with, but I'll work on getting at least arm emulation set up in the next few days.

@terrelln
Contributor
terrelln commented Oct 7, 2016 edited

I was unable to reproduce the failure using qemu-arm emulator compiling with arm-linux-gnueabi toolchain.

Are you passing -static to the linker? When I tried the code segfaulted when it tried to link it pthread. I solved it by replacing -lpthread with -Wl,--whole-archive -lpthread -Wl,--no-whole-archive. The tests also passed when I did not use static linking.

@terrelln
Contributor
terrelln commented Oct 7, 2016

I'll try to reproduce with those flags. I looked in the logs but couldn't find it, which gcc version are you using?

@terrelln
Contributor
terrelln commented Oct 7, 2016

Thanks! I also realized I only checked arm not aarch64, so I'll check that as well.

@pixelb
Contributor
pixelb commented Oct 7, 2016 edited

I found an (arch linux) arm7vl machine with gcc-6.1.1,
compiled gtest-1.7.0 and zstd-1.1.0 and reproduced the crash \o/

So you can leave this with me to dig into (I'll get back to it soon)

@pixelb
Contributor
pixelb commented Oct 11, 2016 edited

Ugh, Getting back to this, I've no longer access to that arm machine :(

Anyway the segfault was in internal consistency checks in results.push_back(i) in TEST(ThreadPool, Ordering). Specifically an abort in std::vector due to "if (this->_M_impl._M_finish != this->_M_impl._M_end_of_storage)"
That suggests stack corruption to me.

Using valgrind to check on x86_64 we get...

$ valgrind --tool=drd ./ThreadPoolTest 
==1421== drd, a thread error detector
==1421== Copyright (C) 2006-2015, and GNU GPL'd, by Bart Van Assche.
==1421== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==1421== Command: ./ThreadPoolTest
==1421== 
Running main() from gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from ThreadPool
[ RUN      ] ThreadPool.Ordering
==1421== Probably a race condition: condition variable 0xffefff790 has been signaled but the associated mutex 0xffefff768 is not locked by the signalling thread.
==1421==    at 0x4C34825: pthread_cond_signal@* (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x5565AE8: std::condition_variable::notify_one() (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4044AD: ThreadPool_Ordering_Test::TestBody() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x4E81992: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79749: testing::Test::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79897: testing::TestInfo::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79974: testing::TestCase::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E7A2DE: testing::internal::UnitTestImpl::RunAllTests() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E81E72: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79A3F: testing::UnitTest::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x50938C1: main (in /usr/lib64/libgtest_main.so.0.0.0)
==1421== cond 0xffefff790 was first observed at:
==1421==    at 0x4C33DF9: pthread_cond_wait@* (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x5565ACB: std::condition_variable::wait(std::unique_lock<std::mutex>&) (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x40613A: std::thread::_Impl<std::_Bind_simple<pzstd::ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x556AF1F: ??? (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4C2F25B: ??? (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x529C609: start_thread (pthread_create.c:334)
==1421==    by 0x5E4FA4C: clone (clone.S:109)
==1421== mutex 0xffefff768 was first observed at:
==1421==    at 0x4C31E6C: pthread_mutex_lock (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x4060C4: std::thread::_Impl<std::_Bind_simple<pzstd::ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x556AF1F: ??? (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4C2F25B: ??? (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x529C609: start_thread (pthread_create.c:334)
==1421==    by 0x5E4FA4C: clone (clone.S:109)

I've not confirmed the above warning is valid.

@terrelln
Contributor

I think that particular warning is a false positive. In WorkQueue.h I always signal without holding the lock, but valgrind can't analyze that case, so it emits a warning.

I've been looking over the code trying to figure out what is going wrong and I'm stumped. I'll run TSAN enabled tests all day today on x86_64 and see if I can trigger any failures. If that fails I'll try to reproduce again tonight.

@pixelb
Contributor
pixelb commented Oct 11, 2016

arm machine is back (and it's been upgraded to gcc-6.2.1).
I see AddJobWhileJoining consistently passes, while the other two tests in ThreadPoolTest.cpp consistently fail. I see different failures suggesting some racy corruption

@terrelln
Contributor

Thanks again for the help debugging this

Can you compile with TSAN and see if you get TSAN failures? On the current dev branch you have to compile googletest with -fPIC and make test MOREFLAGS="-fsanitize=thread -fPIC -pie". Alternatively, you can checkout my dev branch and run make googletest && make tsan && make check and it should add the necessary flags.

@pixelb
Contributor
pixelb commented Oct 11, 2016

Unfortunately none of the *san libs are included with the gcc-libs package for arm on archlinux (though are available for x86_64). Anyway I compiled up your dev branch which pulled down the latest googletest, and still have consistent segfault...

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00017128 in std::_Function_handler<void (), ThreadPool_Ordering_Test::TestBody()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
@terrelln
Contributor

I tried to reproduce again with qemu-aarch64 and qemu-arm with the flags you used and failed, so it seems like qemu isn't going to work. However, I will have an actual ARM machine to test on soon.

@terrelln
Contributor

I never ended up getting a working ARM machine. But I ran all the tests on my iPhone 6s, and everything passed. So the issue is either specific to gcc on ARM, specific to aarch64 / something more specific, or an issue in the build process somewhere.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment