
Test util_ipc_reliable_mq.cpp gets nondeterministic result #162

Closed
yuxianch opened this issue Sep 8, 2021 · 10 comments
yuxianch commented Sep 8, 2021

When running the test util_ipc_reliable_mq.cpp alongside other processes, it sometimes fails at line 324, BOOST_TEST_PASSPOINT().

BOOST_AUTO_TEST_CASE(multithreaded_message_passing)
{
unsigned int failure_count1 = 0, failure_count2 = 0, failure_count3 = 0;
boost::atomic_thread_fence(boost::memory_order_release);
boost::thread thread1(&multithreaded_message_passing_feeding_thread, "Thread 1", boost::ref(failure_count1));
boost::thread thread2(&multithreaded_message_passing_feeding_thread, "Thread 2", boost::ref(failure_count2));
boost::thread thread3(&multithreaded_message_passing_feeding_thread, "Thread 3", boost::ref(failure_count3));
BOOST_TEST_PASSPOINT();
queue_t queue(boost::log::open_mode::open_or_create, ipc_queue_name, capacity, block_size);
unsigned int receive_failures = 0, receive_corruptions = 0;
unsigned int message_count1 = 0, message_count2 = 0, message_count3 = 0;
std::string msg;
BOOST_TEST_PASSPOINT();

The latest version of clang++ was used to build the test binary.
The binary's output looks like:

util_ipc_reliable_mq: ../../../libs/log/src/posix/ipc_sync_wrappers.hpp:147: void boost::log::ipc::aux::interprocess_mutex::unlock(): Assertion `pthread_mutex_unlock(&this->mutex) == 0' failed.
Running 4 test cases...
unknown location(0): fatal error: in "multithreaded_message_passing": signal: SIGABRT (application abort requested)
run/util_ipc_reliable_mq.cpp(324): last checkpoint

Could the test owner please take a look?


yuxianch commented Sep 8, 2021

I used valgrind to detect errors in the multithreaded code, and this is what I got:

==537225== Possible data race during read of size 4 at 0x4FE76B8 by thread #3
==537225== Locks held: none
==537225==    at 0x4FD453C: load (core_ops_gcc_atomic.hpp:88)
==537225==    by 0x4FD453C: load (atomic_impl.hpp:338)
==537225==    by 0x4FD453C: boost::thread_detail::enter_once_region(boost::once_flag&) (once_atomic.cpp:39)
==537225==    by 0x4FBFED8: void boost::call_once<void (&)()>(boost::once_flag&, void (&)()) (once_atomic.hpp:123)
==537225==    by 0x4FBD17E: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (thread.cpp:160)
==537225==    by 0x4FBD346: boost::(anonymous namespace)::thread_proxy(void*) (thread.cpp:174)
==537225==    by 0x4C37EEE: mythread_wrapper (hg_intercepts.c:387)
==537225==    by 0x5E4D2DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==537225==    by 0x6160E82: clone (in /usr/lib64/libc-2.28.so)
==537225==
==537225== This conflicts with a previous write of size 4 by thread #2
==537225== Locks held: 1, at address 0x4FE77B0
==537225==    at 0x4FD53F5: store (core_ops_gcc_atomic.hpp:76)
==537225==    by 0x4FD53F5: store (atomic_impl.hpp:330)
==537225==    by 0x4FD53F5: boost::thread_detail::commit_once_region(boost::once_flag&) (once_atomic.cpp:74)
==537225==    by 0x4FBFF66: void boost::call_once<void (&)()>(boost::once_flag&, void (&)()) (once_atomic.hpp:138)
==537225==    by 0x4FBD17E: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (thread.cpp:160)
==537225==    by 0x4FBD346: boost::(anonymous namespace)::thread_proxy(void*) (thread.cpp:174)
==537225==    by 0x4C37EEE: mythread_wrapper (hg_intercepts.c:387)
==537225==    by 0x5E4D2DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==537225==    by 0x6160E82: clone (in /usr/lib64/libc-2.28.so)
==537225==  Address 0x4fe76b8 is 0 bytes inside data symbol "_ZN5boost6detail12_GLOBAL__N_128current_thread_tls_init_flagE"
==537225==
==537225== ----------------------------------------------------------------
==537225==
==537225== Possible data race during read of size 4 at 0x4FE76BC by thread #3
==537225== Locks held: none
==537225==    at 0x4FBD17F: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (thread.cpp:161)
==537225==    by 0x4FBD346: boost::(anonymous namespace)::thread_proxy(void*) (thread.cpp:174)
==537225==    by 0x4C37EEE: mythread_wrapper (hg_intercepts.c:387)
==537225==    by 0x5E4D2DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==537225==    by 0x6160E82: clone (in /usr/lib64/libc-2.28.so)
==537225==
==537225== This conflicts with a previous write of size 4 by thread #2
==537225== Locks held: none
==537225==    at 0x5E5476F: pthread_key_create (in /usr/lib64/libpthread-2.28.so)
==537225==    by 0x4FC9F36: _ZN5boost6detail6invokeIPFvvEJEEEDTclclsr5boostE7forwardIT_Efp_Espclsr5boostE7forwardIT0_Efp0_EEEOS4_DpOS5_ (invoke.hpp:133)
==537225==    by 0x4FBFF10: void boost::call_once<void (&)()>(boost::once_flag&, void (&)()) (once_atomic.hpp:127)
==537225==    by 0x4FBD17E: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (thread.cpp:160)
==537225==    by 0x4FBD346: boost::(anonymous namespace)::thread_proxy(void*) (thread.cpp:174)
==537225==    by 0x4C37EEE: mythread_wrapper (hg_intercepts.c:387)
==537225==    by 0x5E4D2DD: start_thread (in /usr/lib64/libpthread-2.28.so)
==537225==    by 0x6160E82: clone (in /usr/lib64/libc-2.28.so)
==537225==  Address 0x4fe76bc is 0 bytes inside data symbol "_ZN5boost6detail12_GLOBAL__N_122current_thread_tls_keyE"

Command:

valgrind --tool=helgrind binary_path


Lastique commented Sep 8, 2021

The valgrind errors seem to be bogus. I commented in the Boost.Thread ticket.

As for the test failure, I have never seen it happen. If you can reproduce it, please collect the backtrace of the crash. It is not possible to know what is happening from the output alone.

Also, what are your Boost, system and compiler versions?

@yuxianch

> As for the test failure, I have never seen it happen. If you can reproduce it, please collect the backtrace of the crash. It is not possible to know what is happening from the output alone.

I changed BOOST_VERIFY at line 148 to assert and ran the binary:

bash-4.4$ sed -n "145,149p" ./boost_1_75_0/libs/log/src/posix/ipc_sync_wrappers.hpp
    void unlock() BOOST_NOEXCEPT
    {
        //BOOST_VERIFY(pthread_mutex_unlock(&this->mutex) == 0);
        assert(pthread_mutex_unlock(&this->mutex) == 0);
    }

The backtrace looks like:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff67d6700 (LWP 1038721)]
[New Thread 0x7ffff5fd5700 (LWP 1038722)]
util_ipc_reliable_mq: ../../../libs/log/src/posix/ipc_sync_wrappers.hpp:148: void boost::log::ipc::aux::interprocess_mutex::unlock(): Assertion `pthread_mutex_unlock(&this->mutex) == 0' failed.
[New Thread 0x7ffff57d4700 (LWP 1038723)]

Thread 2 "util_ipc_reliab" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff67d6700 (LWP 1038721)]
0x00007ffff682370f in raise () from /lib64/libc.so.6
#0  0x00007ffff682370f in raise () from /lib64/libc.so.6
#1  0x00007ffff680db25 in abort () from /lib64/libc.so.6
#2  0x00007ffff680d9f9 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007ffff681bcc6 in __assert_fail () from /lib64/libc.so.6
#4  0x00007ffff7d4cca2 in boost::log::v2_mt_posix::ipc::aux::interprocess_mutex::unlock (this=0x7ffff4f53050) at ../../../libs/log/src/posix/ipc_sync_wrappers.hpp:148
#5  0x00007ffff7d4cb98 in boost::log::v2_mt_posix::ipc::aux::interprocess_mutex::auto_unlock::~auto_unlock (this=0x7ffff67d5cc0) at ../../../libs/log/src/posix/ipc_sync_wrappers.hpp:98
#6  0x00007ffff7d48f51 in boost::log::v2_mt_posix::ipc::reliable_message_queue::implementation::send (this=0x7ffff0000b60, message_data=0x426829, message_size=8) at ../../../libs/log/src/posix/ipc_reliable_message_queue.cpp:331
#7  0x00007ffff7d47f6c in boost::log::v2_mt_posix::ipc::reliable_message_queue::send (this=0x7ffff67d5dc8, message_data=0x426829, message_size=8) at ../../../libs/log/src/posix/ipc_reliable_message_queue.cpp:807
#8  0x00000000004142cd in (anonymous namespace)::multithreaded_message_passing_feeding_thread (message=0x426829 "Thread 1", failure_count=@0x7fffffff39e4: 0) at run/util_ipc_reliable_mq.cpp:300
#9  0x0000000000421c45 in boost::_bi::list2<boost::_bi::value<char const*>, boost::reference_wrapper<unsigned int> >::operator()<void (*)(char const*, unsigned int&), boost::_bi::list0> (this=0x44ea20, f=@0x44ea18: 0x414260 <(anonymous namespace)::multithreaded_message_passing_feeding_thread(char const*, unsigned int&)>, a=...) at ../../../boost/bind/bind.hpp:297
#10 0x0000000000421bc3 in boost::_bi::bind_t<void, void (*)(char const*, unsigned int&), boost::_bi::list2<boost::_bi::value<char const*>, boost::reference_wrapper<unsigned int> > >::operator() (this=0x44ea18) at ../../../boost/bind/bind.hpp:1272
#11 0x000000000042148c in boost::detail::thread_data<boost::_bi::bind_t<void, void (*)(char const*, unsigned int&), boost::_bi::list2<boost::_bi::value<char const*>, boost::reference_wrapper<unsigned int> > > >::run (this=0x44e8e0) at ../../../boost/thread/detail/thread.hpp:120
#12 0x00007ffff7c3c361 in boost::(anonymous namespace)::thread_proxy (param=0x44e8e0) at ../../../libs/thread/src/pthread/thread.cpp:179
#13 0x00007ffff6bb62de in start_thread () from /lib64/libpthread.so.0
#14 0x00007ffff68e7e83 in clone () from /lib64/libc.so.6

@yuxianch

> what are your Boost, system and compiler versions?

Boost 1.75 is used.
System: Red Hat 8.2
Compiler version: clang-14 (latest clang++)

@Lastique

The latest released clang is 12.0.1; I'm not sure where 14 comes from. Does it reproduce with gcc or an officially released clang?

Could you show the backtrace of the other threads (i.e. the output of thread apply all bt in gdb)?

Also, can you show what pthread_mutex_unlock returns? You can change the unlock method to something like this:

    // Note: this needs <cstdio>, <cstring> and <cstdlib>; with glibc and
    // _GNU_SOURCE, strerror_r returns a const char* as used here.
    void unlock() BOOST_NOEXCEPT
    {
        int res = pthread_mutex_unlock(&this->mutex);
        if (res != 0)
        {
            char buf[1024];
            const char* desc = strerror_r(res, buf, sizeof(buf));
            printf("pthread_mutex_unlock returned %d: %s\n", res, desc);
            abort();
        }
    }

@Lastique

> When running test util_ipc_reliable_mq.cpp with other processes

Also, what other processes? Do you mean you're running multiple instances of the same test?


yuxianch commented Sep 13, 2021

> The latest released clang is 12.0.1; I'm not sure where 14 comes from. Does it reproduce with gcc or an officially released clang?

clang-14 is clang trunk (the development version). It can be reproduced with gcc too.

> what other processes? Do you mean you're running multiple instances of the same test?

Any processes. To make reproduction easier, I am running multiple instances of the same test.


Lastique commented Sep 13, 2021

> Any processes. To make reproduction easier, I am running multiple instances of the same test.

There are always some other processes running when you run the test, and arbitrary processes don't make the test crash. Some specific ones that do specific things may affect the test; that's why I'm asking.

The test creates an interprocess queue with a fixed name, which can be opened and affected by another process. Running multiple instances of the test is one example of this - the processes will access the same IPC queue and conflict with each other, that is expected behavior and should be avoided during testing. If your other processes are using the same name for shared memory segments, they will also conflict with the test.

Lastique reopened this Sep 13, 2021
@Lastique

After updating the test to use unique IPC queue names I'm seeing errors similar to yours. I don't understand why this happens yet, but it looks like pthread_mutex_t in the shared memory doesn't work for some reason - I'm seeing multiple threads able to acquire the lock concurrently.

@Lastique

Should be fixed now. Thank you for the report.
