arm pzstd ThreadPool test segfault #407

pixelb · 2016-10-06T15:19:47Z

I've not actually got an arm system to dig into this, as this is coming from
fedora aarch64 and armv7hl build servers.
Anyway FYI...

Running main() from gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from ThreadPool
[ RUN ] ThreadPool.Ordering
make[1]: *** [Makefile:38: test] Segmentation fault (core dumped)

terrelln · 2016-10-06T19:08:03Z

Thanks for the report! As of now pzstd is completely untested on arm systems. I don't currently have an arm machine to test with, but I'll work on getting at least arm emulation set up in the next few days.

terrelln · 2016-10-07T06:55:24Z

I was unable to reproduce the failure using the qemu-arm emulator, compiling with the arm-linux-gnueabi toolchain.

Are you passing -static to the linker? When I tried static linking, the code segfaulted because of how pthread was linked in. I solved it by replacing -lpthread with -Wl,--whole-archive -lpthread -Wl,--no-whole-archive. The tests also passed when I did not use static linking.

pixelb · 2016-10-07T14:09:35Z

All build flags are logged at https://kojipkgs.fedoraproject.org//work/tasks/171/15970171/build.log

terrelln · 2016-10-07T18:09:25Z

I'll try to reproduce with those flags. I looked in the logs but couldn't find the gcc version. Which one are you using?

pixelb · 2016-10-07T18:23:12Z

https://kojipkgs.fedoraproject.org/work/tasks/171/15970171/root.log shows 6.2.1-2.fc26

terrelln · 2016-10-07T18:29:41Z

Thanks! I also realized I only checked arm, not aarch64, so I'll check that as well.

pixelb · 2016-10-07T19:21:48Z

I found an (Arch Linux) armv7l machine with gcc-6.1.1,
compiled gtest-1.7.0 and zstd-1.1.0 and reproduced the crash \o/

So you can leave this with me to dig into (I'll get back to it soon).

pixelb · 2016-10-11T10:09:18Z

Ugh. Getting back to this, I no longer have access to that arm machine :(

Anyway, the segfault was in the internal consistency checks in results.push_back(i) in TEST(ThreadPool, Ordering). Specifically, an abort in std::vector due to "if (this->_M_impl._M_finish != this->_M_impl._M_end_of_storage)".
That suggests stack corruption to me.

Using valgrind to check on x86_64 we get...

$ valgrind --tool=drd ./ThreadPoolTest 
==1421== drd, a thread error detector
==1421== Copyright (C) 2006-2015, and GNU GPL'd, by Bart Van Assche.
==1421== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==1421== Command: ./ThreadPoolTest
==1421== 
Running main() from gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from ThreadPool
[ RUN      ] ThreadPool.Ordering
==1421== Probably a race condition: condition variable 0xffefff790 has been signaled but the associated mutex 0xffefff768 is not locked by the signalling thread.
==1421==    at 0x4C34825: pthread_cond_signal@* (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x5565AE8: std::condition_variable::notify_one() (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4044AD: ThreadPool_Ordering_Test::TestBody() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x4E81992: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79749: testing::Test::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79897: testing::TestInfo::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79974: testing::TestCase::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E7A2DE: testing::internal::UnitTestImpl::RunAllTests() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E81E72: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79A3F: testing::UnitTest::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x50938C1: main (in /usr/lib64/libgtest_main.so.0.0.0)
==1421== cond 0xffefff790 was first observed at:
==1421==    at 0x4C33DF9: pthread_cond_wait@* (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x5565ACB: std::condition_variable::wait(std::unique_lock<std::mutex>&) (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x40613A: std::thread::_Impl<std::_Bind_simple<pzstd::ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x556AF1F: ??? (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4C2F25B: ??? (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x529C609: start_thread (pthread_create.c:334)
==1421==    by 0x5E4FA4C: clone (clone.S:109)
==1421== mutex 0xffefff768 was first observed at:
==1421==    at 0x4C31E6C: pthread_mutex_lock (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x4060C4: std::thread::_Impl<std::_Bind_simple<pzstd::ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x556AF1F: ??? (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4C2F25B: ??? (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x529C609: start_thread (pthread_create.c:334)
==1421==    by 0x5E4FA4C: clone (clone.S:109)

I've not confirmed the above warning is valid.

terrelln · 2016-10-11T19:55:58Z

I think that particular warning is a false positive. In WorkQueue.h I always signal without holding the lock, but valgrind can't analyze that case, so it emits a warning.
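For readers unfamiliar with the pattern, here is a minimal sketch of signalling a condition variable after releasing the mutex. It is not the code from WorkQueue.h: `SketchQueue` and `sum_through_queue` are hypothetical names made up for illustration. The pattern is generally considered safe provided the shared state is modified under the lock and the waiter re-checks its predicate under the same lock.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical minimal queue for illustration; not the pzstd WorkQueue.
class SketchQueue {
 public:
  void push(int value) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push(value);  // State is changed while holding the lock.
    }                      // Lock is released here...
    cv_.notify_one();      // ...and the signal is sent without it.
  }

  // Blocks until an item is available, or returns false once the
  // queue has been finished and drained.
  bool pop(int& out) {
    std::unique_lock<std::mutex> lock(mutex_);
    // The predicate is evaluated under the lock, so a push that happens
    // before the wait starts is still observed.
    cv_.wait(lock, [this] { return !queue_.empty() || done_; });
    if (queue_.empty()) {
      return false;
    }
    out = queue_.front();
    queue_.pop();
    return true;
  }

  void finish() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<int> queue_;
  bool done_ = false;
};

// Pushes 1..n through the queue to a single consumer thread and
// returns the sum the consumer computed.
int sum_through_queue(int n) {
  SketchQueue q;
  int sum = 0;
  std::thread consumer([&q, &sum] {
    int value = 0;
    while (q.pop(value)) {
      sum += value;
    }
  });
  for (int i = 1; i <= n; ++i) {
    q.push(i);
  }
  q.finish();
  consumer.join();  // After join, reading sum from this thread is safe.
  return sum;
}
```

A tool such as drd will typically still flag the `notify_one()` call in a sketch like this, because it only sees that the signalling thread does not hold the mutex at that moment.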

I've been looking over the code trying to figure out what is going wrong, and I'm stumped. I'll run TSAN-enabled tests all day today on x86_64 and see if I can trigger any failures. If that fails, I'll try to reproduce again tonight.

pixelb · 2016-10-11T20:44:46Z

The arm machine is back (and it's been upgraded to gcc-6.2.1).
I see AddJobWhileJoining consistently passes, while the other two tests in ThreadPoolTest.cpp consistently fail. I see different failures, suggesting some racy corruption.

terrelln · 2016-10-11T21:03:07Z

Thanks again for the help debugging this.

Can you compile with TSAN and see if you get TSAN failures? On the current dev branch you have to compile googletest with -fPIC and make test MOREFLAGS="-fsanitize=thread -fPIC -pie". Alternatively, you can check out my dev branch and run make googletest && make tsan && make check and it should add the necessary flags.

pixelb · 2016-10-11T21:44:41Z

Unfortunately, none of the *san libs are included with the gcc-libs package for arm on archlinux (though they are available for x86_64). Anyway, I compiled your dev branch, which pulled down the latest googletest, and I still get a consistent segfault...

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00017128 in std::_Function_handler<void (), ThreadPool_Ordering_Test::TestBody()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()

terrelln · 2016-10-13T03:47:15Z

I tried to reproduce again with qemu-aarch64 and qemu-arm with the flags you used and failed, so it seems like qemu isn't going to work. However, I will have an actual ARM machine to test on soon.

terrelln · 2016-12-28T21:24:04Z

I never ended up getting a working ARM machine. But I ran all the tests on my iPhone 6s, and everything passed. So the issue is either specific to gcc on ARM, specific to aarch64 or some narrower configuration, or an issue somewhere in the build process.

Cyan4973 · 2017-05-08T18:36:07Z

We are moving away from pzstd, towards native multi-threading support within zstd directly.

One consequence is that the code paths are different: pzstd is C++11, while zstd is C-GNU90.
As a result, we'll stop supporting pzstd in the next release. It will remain available in contrib/pzstd for users who do want it, though we invite them to switch to zstd whenever possible. pzstd will eventually be deprecated and removed from the repository.

pixelb changed the title ~~aarch64 pzstd ThreadPool test segfault~~ arm pzstd ThreadPool test segfault Oct 6, 2016

Cyan4973 closed this as completed May 8, 2017
