Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arm pzstd ThreadPool test segfault #407

Closed
pixelb opened this issue Oct 6, 2016 · 15 comments
Closed

arm pzstd ThreadPool test segfault #407

pixelb opened this issue Oct 6, 2016 · 15 comments

Comments

@pixelb
Copy link
Contributor

pixelb commented Oct 6, 2016

I've not actually got an arm system to dig into this, as this is coming from
fedora aarch64 and armv7hl build servers.
Anyway FYI...

Running main() from gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from ThreadPool
[ RUN ] ThreadPool.Ordering
make[1]: *** [Makefile:38: test] Segmentation fault (core dumped)

@pixelb pixelb changed the title aarch64 pzstd ThreadPool test segfault arm pzstd ThreadPool test segfault Oct 6, 2016
@terrelln
Copy link
Contributor

terrelln commented Oct 6, 2016

Thanks for the report! As of now pzstd is completely untested on arm systems. I don't currently have an arm machine to test with, but I'll work on getting at least arm emulation set up in the next few days.

@terrelln
Copy link
Contributor

terrelln commented Oct 7, 2016

I was unable to reproduce the failure using qemu-arm emulator compiling with arm-linux-gnueabi toolchain.

Are you passing -static to the linker? When I tried the code segfaulted when it tried to link it pthread. I solved it by replacing -lpthread with -Wl,--whole-archive -lpthread -Wl,--no-whole-archive. The tests also passed when I did not use static linking.

@pixelb
Copy link
Contributor Author

pixelb commented Oct 7, 2016

All build flags are logged at https://kojipkgs.fedoraproject.org//work/tasks/171/15970171/build.log

@terrelln
Copy link
Contributor

terrelln commented Oct 7, 2016

I'll try to reproduce with those flags. I looked in the logs but couldn't find it, which gcc version are you using?

@pixelb
Copy link
Contributor Author

pixelb commented Oct 7, 2016

@terrelln
Copy link
Contributor

terrelln commented Oct 7, 2016

Thanks! I also realized I only checked arm not aarch64, so I'll check that as well.

@pixelb
Copy link
Contributor Author

pixelb commented Oct 7, 2016

I found an (arch linux) arm7vl machine with gcc-6.1.1,
compiled gtest-1.7.0 and zstd-1.1.0 and reproduced the crash \o/

So you can leave this with me to dig into (I'll get back to it soon)

@pixelb
Copy link
Contributor Author

pixelb commented Oct 11, 2016

Ugh, Getting back to this, I've no longer access to that arm machine :(

Anyway the segfault was in internal consistency checks in results.push_back(i) in TEST(ThreadPool, Ordering). Specifically an abort in std::vector due to "if (this->_M_impl._M_finish != this->_M_impl._M_end_of_storage)"
That suggests stack corruption to me.

Using valgrind to check on x86_64 we get...

$ valgrind --tool=drd ./ThreadPoolTest 
==1421== drd, a thread error detector
==1421== Copyright (C) 2006-2015, and GNU GPL'd, by Bart Van Assche.
==1421== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==1421== Command: ./ThreadPoolTest
==1421== 
Running main() from gtest_main.cc
[==========] Running 3 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 3 tests from ThreadPool
[ RUN      ] ThreadPool.Ordering
==1421== Probably a race condition: condition variable 0xffefff790 has been signaled but the associated mutex 0xffefff768 is not locked by the signalling thread.
==1421==    at 0x4C34825: pthread_cond_signal@* (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x5565AE8: std::condition_variable::notify_one() (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4044AD: ThreadPool_Ordering_Test::TestBody() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x4E81992: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79749: testing::Test::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79897: testing::TestInfo::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79974: testing::TestCase::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E7A2DE: testing::internal::UnitTestImpl::RunAllTests() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E81E72: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x4E79A3F: testing::UnitTest::Run() (in /usr/lib64/libgtest.so.0.0.0)
==1421==    by 0x50938C1: main (in /usr/lib64/libgtest_main.so.0.0.0)
==1421== cond 0xffefff790 was first observed at:
==1421==    at 0x4C33DF9: pthread_cond_wait@* (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x5565ACB: std::condition_variable::wait(std::unique_lock<std::mutex>&) (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x40613A: std::thread::_Impl<std::_Bind_simple<pzstd::ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x556AF1F: ??? (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4C2F25B: ??? (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x529C609: start_thread (pthread_create.c:334)
==1421==    by 0x5E4FA4C: clone (clone.S:109)
==1421== mutex 0xffefff768 was first observed at:
==1421==    at 0x4C31E6C: pthread_mutex_lock (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x4060C4: std::thread::_Impl<std::_Bind_simple<pzstd::ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run() (in /home/padraig/rhat/fedora-scm/zstd/zstd-1.1.0/contrib/pzstd/utils/test/ThreadPoolTest)
==1421==    by 0x556AF1F: ??? (in /usr/lib64/libstdc++.so.6.0.21)
==1421==    by 0x4C2F25B: ??? (in /usr/lib64/valgrind/vgpreload_drd-amd64-linux.so)
==1421==    by 0x529C609: start_thread (pthread_create.c:334)
==1421==    by 0x5E4FA4C: clone (clone.S:109)

I've not confirmed the above warning is valid.

@terrelln
Copy link
Contributor

I think that particular warning is a false positive. In WorkQueue.h I always signal without holding the lock, but valgrind can't analyze that case, so it emits a warning.

I've been looking over the code trying to figure out what is going wrong and I'm stumped. I'll run TSAN enabled tests all day today on x86_64 and see if I can trigger any failures. If that fails I'll try to reproduce again tonight.

@pixelb
Copy link
Contributor Author

pixelb commented Oct 11, 2016

arm machine is back (and it's been upgraded to gcc-6.2.1).
I see AddJobWhileJoining consistently passes, while the other two tests in ThreadPoolTest.cpp consistently fail. I see different failures suggesting some racy corruption

@terrelln
Copy link
Contributor

Thanks again for the help debugging this

Can you compile with TSAN and see if you get TSAN failures? On the current dev branch you have to compile googletest with -fPIC and make test MOREFLAGS="-fsanitize=thread -fPIC -pie". Alternatively, you can checkout my dev branch and run make googletest && make tsan && make check and it should add the necessary flags.

@pixelb
Copy link
Contributor Author

pixelb commented Oct 11, 2016

Unfortunately none of the *san libs are included with the gcc-libs package for arm on archlinux (though are available for x86_64). Anyway I compiled up your dev branch which pulled down the latest googletest, and still have consistent segfault...

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00017128 in std::_Function_handler<void (), ThreadPool_Ordering_Test::TestBody()::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()

@terrelln
Copy link
Contributor

I tried to reproduce again with qemu-aarch64 and qemu-arm with the flags you used and failed, so it seems like qemu isn't going to work. However, I will have an actual ARM machine to test on soon.

@terrelln
Copy link
Contributor

I never ended up getting a working ARM machine. But I ran all the tests on my iPhone 6s, and everything passed. So the issue is either specific to gcc on ARM, specific to aarch64 / something more specific, or an issue in the build process somewhere.

@Cyan4973
Copy link
Contributor

Cyan4973 commented May 8, 2017

We are moving away from pzstd, towards native multi-threading support within zstd directly.

One consequence is that the code paths are different : pzstd is C++11, while zstd is C-GNU90.
As a consequence, we'll stop supporting pzstd in the next release. It will still remain available in contrib/pzstd for users which do want it, though we invite them to switch to zstd whenever possible. pzstd will eventually be deprecated and disappear from repository in the future.

@Cyan4973 Cyan4973 closed this as completed May 8, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants