Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tsan warning: lock-order-inversion (potential deadlock) #14

Closed
larsonmpdx opened this issue May 31, 2017 · 28 comments
Closed

tsan warning: lock-order-inversion (potential deadlock) #14

larsonmpdx opened this issue May 31, 2017 · 28 comments
Labels
bug This issue is a bug. closed-for-staleness guidance Question that needs advice or information. help wanted We are asking the community to submit a PR to resolve this issue.

Comments

@larsonmpdx
Copy link
Contributor

found this when running llvm thread sanitizer on my program which has control flow similar to the pubsub sample:

https://github.com/aws/aws-iot-device-sdk-cpp/blob/master/samples/PubSub/PubSub.cpp

tsan warns about this mutex sequence:

first sync_action_response_lock_ here:
https://github.com/aws/aws-iot-device-sdk-cpp/blob/master/src/ClientCoreState.cpp#L98

ack_map_lock_ here:
https://github.com/aws/aws-iot-device-sdk-cpp/blob/master/src/ClientCoreState.cpp#L197

then sync_action_response_lock_ again here:
https://github.com/aws/aws-iot-device-sdk-cpp/blob/master/src/ClientCoreState.cpp#L112

before I investigate further, is this something you've seen in testing or maybe fixed in an upcoming release?

@chaurah
Copy link
Contributor

chaurah commented May 31, 2017

Hi @larsonmpdx,
We haven't seen this yet in our testing. I haven't seen any lock up issues in our long running tests either so far. I will look into running the thread sanitizer as well to see if we can reproduce the issue. Please do let us know if you find something on your end.

Rahul

@larsonmpdx
Copy link
Contributor Author

thanks. tsan normally has a low false positive rate but this kind of deadlock warning can trigger on code that makes sense. Usually the fix is to reduce the number of mutexes used or use std::lock to handle locking multiple mutexes at once as it guarantees no deadlocks (http://en.cppreference.com/w/cpp/thread/lock).

There are some minor asan/ubsan warnings too, probably nothing. And some basic clang warnings when using this warning set:

-Wall -Wextra -Wpedantic -Wpointer-arith -Wno-unused-parameter -Wcast-qual -Wcast-align
(some of these might be duplicates, I haven't checked carefully)

I'm trying to get my whole project to run clean on tsan/asan/usbsan and no warnings so it would be great to get rid of these even if they're false positives. I'm busy till next week but I can provide some short instructions to reproduce monday or tuesday.

@larsonmpdx
Copy link
Contributor Author

see #15
tsan warnings show up in the existing integration and unit tests. I haven't looked at ubsan/asan but I assume they will show up too

@lerebel103
Copy link

Hi all,
@larsonmpdx and @chaurah, I can confirm I am seeing this issue in production, the CPP SDK locks up on Publish(), which appears related to what is described above. I was able to capture the stack traces of the dead-locked threads (see below).

From what I can see and supported by the Traces:

  1. sync_action_response_lock_ is kept locked in ClientCoreState::PerformAction() as rc = itr->second->PerformAction(p_network_connection_, p_action_data); isn't returning,
  2. This in turn causes a ClientCoreState::DeletePendingAck() call which can't acquire ack_map_lock_ locked by ClientCoreState::DeleteExpiredAcks()
  3. ClientCoreState::ForwardReceivedAck() has ack_map_lock_ locked whilst invoking itr->second->p_async_ack_handler_(action_id, rc) and cannot proceed as this lands back on ClientCoreState::SyncActionHandler(), but sync_action_response_lock_ is locked.
  4. Dead Lock.

We are at this stage unable to Publish with QoS1 as Publish() eventually locks up. Could you please advise @chaurah, any pointers or resolution appreciated?

[Switching to thread 12 (Thread 0xaf2fd440 (LWP 25247))]
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0xb634f0cc in __lll_lock_wait (futex=futex@entry=0xb4d59b88, private=0) at lowlevellock.c:46
#2  0xb634a40a in __GI___pthread_mutex_lock (mutex=0xb4d59b88) at pthread_mutex_lock.c:80
#3  0x005258ca in __gthread_mutex_lock (__mutex=0xb4d59b88)
    at /usr/arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf/bits/gthr-default.h:748
#4  0x005285e4 in std::mutex::lock (this=0xb4d59b88) at /usr/arm-linux-gnueabihf/include/c++/6/bits/std_mutex.h:103
#5  0x005287a4 in std::lock_guard<std::mutex>::lock_guard (this=0xaf2fc128, __m=...)
    at /usr/arm-linux-gnueabihf/include/c++/6/bits/std_mutex.h:162
#6  0x0057f398 in awsiotsdk::ClientCoreState::DeletePendingAck (this=0xb4d59b50, action_id=1036)
    at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/ClientCoreState.cpp:204
#7  0x005503a0 in awsiotsdk::mqtt::PublishActionAsync::PerformAction (this=0xb4d5a0a8,
    p_network_connection=std::shared_ptr (count 8, weak 0) 0xb4d61210,
    p_action_data=std::shared_ptr (count 5, weak 0) 0xa7163afc)
    at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/mqtt/Publish.cpp:207
#8  0x0057ec5e in awsiotsdk::ClientCoreState::PerformAction (this=0xb4d59b50,
    action_type=awsiotsdk::ActionType::PUBLISH, p_action_data=std::shared_ptr (count 5, weak 0) 0xa7163afc,
    action_reponse_timeout=...) at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/ClientCoreState.cpp:118
#9  0x00577f90 in awsiotsdk::ClientCore::PerformAction (this=0xb4d5d4d8, action_type=awsiotsdk::ActionType::PUBLISH,
    p_action_data=std::shared_ptr (count 5, weak 0) 0xa7163afc, action_reponse_timeout=...)
    at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/ClientCore.cpp:59
#10 0x0052d118 in awsiotsdk::MqttClient::Publish (this=0xb4d60cb0,
    p_topic_name=std::unique_ptr<awsiotsdk::Utf8String> containing 0x0, is_retained=false, is_duplicate=false,
    qos=awsiotsdk::mqtt::QoS::QOS1,
    payload="[\n    {\n        \"version\": \"0.1\",\n        \"device_uid\": \"7890609085980302167\",\n        \"start_
timestamp\": 1517452469582000,\n        \"end_timestamp\": 1517452472257000,\n        \"samples\": [\n", ' ' <repeats 12
 times>..., action_response_timeout=...) at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/mqtt/Client.cpp:173
#11 0x004c67da in mqtt::awsiot::AWSConnector::Publish (this=0xb4d2bb98, topic=...,
    inbound_payload="[\n    {\n        \"version\": \"0.1\",\n        \"device_uid\": \"7890609085980302167\",\n
---Type <return> to continue, or q <return> to quit---
\"start_timestamp\": 1517452469582000,\n        \"end_timestamp\": 1517452472257000,\n        \"samples\": [\n", ' ' <re
peats 12 times>...) at /home/willc/rprt-rtg/RTG/src/mqtt/awsiot/aws_connector.cpp:191


thread 19 ----
(gdb) bt
#0  0xb634fb7e in nanosleep () at ../sysdeps/unix/syscall-template.S:84
#1  0x00546878 in std::this_thread::sleep_for<long long, std::ratio<1ll, 1000ll> > (__rtime=...)
    at /usr/arm-linux-gnueabihf/include/c++/6/thread:323
#2  0x0057edce in awsiotsdk::ClientCoreState::ProcessOutboundActionQueue (this=0xb4d59b50,
    thread_task_out_sync=std::shared_ptr (count 3, weak 0) 0xb4d60f24)
    at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/ClientCoreState.cpp:139
#3  0x0057e2d8 in std::__invoke_impl<void, void (awsiotsdk::ClientCoreState::* const&)(std::shared_ptr<std::atomic<bool>
 >), std::shared_ptr<awsiotsdk::ClientCoreState>&, std::shared_ptr<std::atomic<bool> >&> (__f=
    @0xb4d5ea6c: (void (awsiotsdk::ClientCoreState::*)(awsiotsdk::ClientCoreState * const, std::shared_ptr<std::atomic<b
ool> >)) 0x57ed79 <awsiotsdk::ClientCoreState::ProcessOutboundActionQueue(std::shared_ptr<std::atomic<bool> >)>,
    __t=std::shared_ptr (count 13, weak 0) 0xb4d59b50, __args#0=std::shared_ptr (count 3, weak 0) 0xb4d60f24)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:227
#4  0x0057e1b0 in std::__invoke<void (awsiotsdk::ClientCoreState::* const&)(std::shared_ptr<std::atomic<bool> >), std::s
hared_ptr<awsiotsdk::ClientCoreState>&, std::shared_ptr<std::atomic<bool> >&> (__fn=
    @0xb4d5ea6c: (void (awsiotsdk::ClientCoreState::*)(awsiotsdk::ClientCoreState * const, std::shared_ptr<std::atomic<b
ool> >)) 0x57ed79 <awsiotsdk::ClientCoreState::ProcessOutboundActionQueue(std::shared_ptr<std::atomic<bool> >)>,
    __args#0=std::shared_ptr (count 13, weak 0) 0xb4d59b50, __args#1=std::shared_ptr (count 3, weak 0) 0xb4d60f24)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:251
#5  0x0057e106 in std::_Mem_fn_base<void (awsiotsdk::ClientCoreState::*)(std::shared_ptr<std::atomic<bool> >), true>::op
erator()<std::shared_ptr<awsiotsdk::ClientCoreState>&, std::shared_ptr<std::atomic<bool> >&> (this=0xb4d5ea6c,
    __args#0=std::shared_ptr (count 13, weak 0) 0xb4d59b50, __args#1=std::shared_ptr (count 3, weak 0) 0xb4d60f24)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:604
#6  0x0057dffa in std::_Bind<std::_Mem_fn<void (awsiotsdk::ClientCoreState::*)(std::shared_ptr<std::atomic<bool> >)> (st
d::shared_ptr<awsiotsdk::ClientCoreState>, std::shared_ptr<std::atomic<bool> >)>::__call<void, , 0u, 1u>(std::tuple<>&&,
 std::_Index_tuple<0u, 1u>) (this=0xb4d5ea6c,
    __args=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xcc392a, DIE 0xcd9157>)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:934
#7  0x0057de3e in std::_Bind<std::_Mem_fn<void (awsiotsdk::ClientCoreState::*)(std::shared_ptr<std::atomic<bool> >)> (st
---Type <return> to continue, or q <return> to quit---
d::shared_ptr<awsiotsdk::ClientCoreState>, std::shared_ptr<std::atomic<bool> >)>::operator()<, void>() (
    this=0xb4d5ea6c) at /usr/arm-linux-gnueabihf/include/c++/6/functional:993
#8  0x0057ddec in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (awsiotsdk::ClientCoreState::*)(std::shared_ptr<std::at
omic<bool> >)> (std::shared_ptr<awsiotsdk::ClientCoreState>, std::shared_ptr<std::atomic<bool> >)> ()>::_M_invoke<>(std:
:_Index_tuple<>) (this=0xb4d5ea6c) at /usr/arm-linux-gnueabihf/include/c++/6/functional:1391
#9  0x0057dd34 in std::_Bind_simple<std::_Bind<std::_Mem_fn<void (awsiotsdk::ClientCoreState::*)(std::shared_ptr<std::at
omic<bool> >)> (std::shared_ptr<awsiotsdk::ClientCoreState>, std::shared_ptr<std::atomic<bool> >)> ()>::operator()() (
    this=0xb4d5ea6c) at /usr/arm-linux-gnueabihf/include/c++/6/functional:1380
#10 0x0057dcaa in std::thread::_State_impl<std::_Bind_simple<std::_Bind<std::_Mem_fn<void (awsiotsdk::ClientCoreState::*
)(std::shared_ptr<std::atomic<bool> >)> (std::shared_ptr<awsiotsdk::ClientCoreState>, std::shared_ptr<std::atomic<bool>
>)> ()> >::_M_run() (this=0xb4d5ea68) at /usr/arm-linux-gnueabihf/include/c++/6/thread:197
#11 0xb62c3a6a in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#12 0xb63485e8 in start_thread (arg=0x0) at pthread_create.c:335
#13 0xb613c54a in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb)

thread 20 ----
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0xb634f0cc in __lll_lock_wait (futex=futex@entry=0xb4d59bb8, private=0) at lowlevellock.c:46
#2  0xb634a40a in __GI___pthread_mutex_lock (mutex=0xb4d59bb8) at pthread_mutex_lock.c:80
#3  0x005258ca in __gthread_mutex_lock (__mutex=0xb4d59bb8)
    at /usr/arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf/bits/gthr-default.h:748
#4  0x005285e4 in std::mutex::lock (this=0xb4d59bb8) at /usr/arm-linux-gnueabihf/include/c++/6/bits/std_mutex.h:103
#5  0x005287a4 in std::lock_guard<std::mutex>::lock_guard (this=0xab2fcadc, __m=...)
    at /usr/arm-linux-gnueabihf/include/c++/6/bits/std_mutex.h:162
#6  0x0057eae8 in awsiotsdk::ClientCoreState::SyncActionHandler (this=0xb4d59b50, action_id=1022,
    rc=awsiotsdk::ResponseCode::SUCCESS) at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/ClientCoreState.cpp:98
#7  0x0058747e in std::__invoke_impl<void, void (awsiotsdk::ClientCoreState::* const&)(unsigned short, awsiotsdk::Respon
seCode), awsiotsdk::ClientCoreState*&, unsigned short, awsiotsdk::ResponseCode>(std::__invoke_memfun_deref, void (awsiot
sdk::ClientCoreState::* const&)(unsigned short, awsiotsdk::ResponseCode), awsiotsdk::ClientCoreState*&, unsigned short&&
, awsiotsdk::ResponseCode&&) (__f=
    @0xa7163500: (void (awsiotsdk::ClientCoreState::*)(awsiotsdk::ClientCoreState * const, unsigned short, awsiotsdk::Re
sponseCode)) 0x57eac9 <awsiotsdk::ClientCoreState::SyncActionHandler(unsigned short, awsiotsdk::ResponseCode)>,
    __t=@0xa7163508: 0xb4d59b50,
    __args#0=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd0189e>,
    __args#1=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd018aa>)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:227
#8  0x005871ba in std::__invoke<void (awsiotsdk::ClientCoreState::* const&)(unsigned short, awsiotsdk::ResponseCode), aw
siotsdk::ClientCoreState*&, unsigned short, awsiotsdk::ResponseCode>(void (awsiotsdk::ClientCoreState::* const&)(unsigne
d short, awsiotsdk::ResponseCode), awsiotsdk::ClientCoreState*&, unsigned short&&, awsiotsdk::ResponseCode&&) (__fn=
    @0xa7163500: (void (awsiotsdk::ClientCoreState::*)(awsiotsdk::ClientCoreState * const, unsigned short, awsiotsdk::Re
sponseCode)) 0x57eac9 <awsiotsdk::ClientCoreState::SyncActionHandler(unsigned short, awsiotsdk::ResponseCode)>,
    __args#0=@0xa7163508: 0xb4d59b50,
    __args#1=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd0189e>,
    __args#2=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd018aa>)
---Type <return> to continue, or q <return> to quit---
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:251
#9  0x00586ebe in std::_Mem_fn_base<void (awsiotsdk::ClientCoreState::*)(unsigned short, awsiotsdk::ResponseCode), true>
::operator()<awsiotsdk::ClientCoreState*&, unsigned short, awsiotsdk::ResponseCode>(awsiotsdk::ClientCoreState*&, unsign
ed short&&, awsiotsdk::ResponseCode&&) const (this=0xa7163500, __args#0=@0xa7163508: 0xb4d59b50,
    __args#1=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd0189e>,
    __args#2=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd018aa>)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:604
#10 0x005867c0 in std::_Bind<std::_Mem_fn<void (awsiotsdk::ClientCoreState::*)(unsigned short, awsiotsdk::ResponseCode)>
 (awsiotsdk::ClientCoreState*, std::_Placeholder<1>, std::_Placeholder<2>)>::__call<void, unsigned short&&, awsiotsdk::R
esponseCode&&, 0u, 1u, 2u>(std::tuple<unsigned short&&, awsiotsdk::ResponseCode&&>&&, std::_Index_tuple<0u, 1u, 2u>) (
    this=0xa7163500,
    __args=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd02f16>)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:934
#11 0x0058611c in std::_Bind<std::_Mem_fn<void (awsiotsdk::ClientCoreState::*)(unsigned short, awsiotsdk::ResponseCode)>
 (awsiotsdk::ClientCoreState*, std::_Placeholder<1>, std::_Placeholder<2>)>::operator()<unsigned short, awsiotsdk::Respo
nseCode, void>(unsigned short&&, awsiotsdk::ResponseCode&&) (this=0xa7163500,
    __args#0=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd0189e>,
    __args#1=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd018aa>)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:993
#12 0x00585228 in std::_Function_handler<void (unsigned short, awsiotsdk::ResponseCode), std::_Bind<std::_Mem_fn<void (a
wsiotsdk::ClientCoreState::*)(unsigned short, awsiotsdk::ResponseCode)> (awsiotsdk::ClientCoreState*, std::_Placeholder<
1>, std::_Placeholder<2>)> >::_M_invoke(std::_Any_data const&, unsigned short&&, awsiotsdk::ResponseCode&&) (
    __functor=...,
    __args#0=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd0189e>,
    __args#1=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xce2425, DIE 0xd018aa>)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:1731
#13 0x005830b6 in std::function<void (unsigned short, awsiotsdk::ResponseCode)>::operator()(unsigned short, awsiotsdk::R
esponseCode) const (this=0xa71634c8, __args#0=1022, __args#1=awsiotsdk::ResponseCode::SUCCESS)
---Type <return> to continue, or q <return> to quit---
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:2127
#14 0x0057f5de in awsiotsdk::ClientCoreState::ForwardReceivedAck (this=0xb4d59b50, action_id=1022,
    rc=awsiotsdk::ResponseCode::SUCCESS) at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/ClientCoreState.cpp:230
#15 0x00548d8c in awsiotsdk::mqtt::NetworkReadActionRunner::HandlePuback (this=0xb4d60bf8,
    read_buf=std::vector of length 2, capacity 568 = {...})
    at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/mqtt/NetworkRead.cpp:271
#16 0x005486b0 in awsiotsdk::mqtt::NetworkReadActionRunner::PerformAction (this=0xb4d60bf8,
    p_network_connection=std::shared_ptr (count 8, weak 0) 0xb4d61210, p_action_data=std::shared_ptr (empty) 0x0)
    at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/mqtt/NetworkRead.cpp:153
#17 0x0057e236 in std::__invoke_impl<awsiotsdk::ResponseCode, awsiotsdk::ResponseCode (awsiotsdk::Action::* const&)(std:
:shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>), std::unique_ptr<awsiotsdk::Action, s
td::default_delete<awsiotsdk::Action> >&, std::shared_ptr<awsiotsdk::NetworkConnection>&, std::shared_ptr<awsiotsdk::Act
ionData>&> (
    __f=@0xb4d5a614: &virtual awsiotsdk::Action::PerformAction(std::shared_ptr<awsiotsdk::NetworkConnection>, std::share
d_ptr<awsiotsdk::ActionData>), __t=std::unique_ptr<awsiotsdk::Action> containing 0xb4d60bf8,
    __args#0=std::shared_ptr (count 8, weak 0) 0xb4d61210, __args#1=std::shared_ptr (empty) 0x0)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:227
#18 0x0057e160 in std::__invoke<awsiotsdk::ResponseCode (awsiotsdk::Action::* const&)(std::shared_ptr<awsiotsdk::Network
Connection>, std::shared_ptr<awsiotsdk::ActionData>), std::unique_ptr<awsiotsdk::Action, std::default_delete<awsiotsdk::
Action> >&, std::shared_ptr<awsiotsdk::NetworkConnection>&, std::shared_ptr<awsiotsdk::ActionData>&> (
    __fn=@0xb4d5a614: &virtual awsiotsdk::Action::PerformAction(std::shared_ptr<awsiotsdk::NetworkConnection>, std::shar
ed_ptr<awsiotsdk::ActionData>), __args#0=std::unique_ptr<awsiotsdk::Action> containing 0xb4d60bf8,
    __args#1=std::shared_ptr (count 8, weak 0) 0xb4d61210, __args#2=std::shared_ptr (empty) 0x0)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:251
#19 0x0057e09c in std::_Mem_fn_base<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_ptr<awsiotsdk::NetworkCon
nection>, std::shared_ptr<awsiotsdk::ActionData>), true>::operator()<std::unique_ptr<awsiotsdk::Action, std::default_del
ete<awsiotsdk::Action> >&, std::shared_ptr<awsiotsdk::NetworkConnection>&, std::shared_ptr<awsiotsdk::ActionData>&> (
    this=0xb4d5a614, __args#0=std::unique_ptr<awsiotsdk::Action> containing 0xb4d60bf8,
    __args#1=std::shared_ptr (count 8, weak 0) 0xb4d61210, __args#2=std::shared_ptr (empty) 0x0)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:604
#20 0x0057df40 in std::_Bind<std::_Mem_fn<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_ptr<awsiotsdk::Netw
orkConnection>, std::shared_ptr<awsiotsdk::ActionData>)> (std::unique_ptr<awsiotsdk::Action, std::default_delete<awsiots
dk::Action> >, std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)>::__call<awsiotsdk
::ResponseCode, , 0u, 1u, 2u>(std::tuple<>&&, std::_Index_tuple<0u, 1u, 2u>) (this=0xb4d5a614,
    __args=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xcc392a, DIE 0xcd9157>)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:934
#21 0x0057de1a in std::_Bind<std::_Mem_fn<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_ptr<awsiotsdk::Netw
orkConnection>, std::shared_ptr<awsiotsdk::ActionData>)> (std::unique_ptr<awsiotsdk::Action, std::default_delete<awsiots
dk::Action> >, std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)>::operator()<, aws
iotsdk::ResponseCode>() (this=0xb4d5a614) at /usr/arm-linux-gnueabihf/include/c++/6/functional:993
#22 0x0057dd8e in std::_Bind_simple<std::_Bind<std::_Mem_fn<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_p
tr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)> (std::unique_ptr<awsiotsdk::Action, std::defa
ult_delete<awsiotsdk::Action> >, std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)>
 ()>::_M_invoke<>(std::_Index_tuple<>) (this=0xb4d5a614) at /usr/arm-linux-gnueabihf/include/c++/6/functional:1391
#23 0x0057dd1a in std::_Bind_simple<std::_Bind<std::_Mem_fn<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_p
tr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)> (std::unique_ptr<awsiotsdk::Action, std::defa
ult_delete<awsiotsdk::Action> >, std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)>
 ()>::operator()() (this=0xb4d5a614) at /usr/arm-linux-gnueabihf/include/c++/6/functional:1380
#24 0x0057dc90 in std::thread::_State_impl<std::_Bind_simple<std::_Bind<std::_Mem_fn<awsiotsdk::ResponseCode (awsiotsdk:
:Action::*)(std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)> (std::unique_ptr<aws
iotsdk::Action, std::default_delete<awsiotsdk::Action> >, std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr
<awsiotsdk::ActionData>)> ()> >::_M_run() (this=0xb4d5a610) at /usr/arm-linux-gnueabihf/include/c++/6/thread:197
#25 0xb62c3a6a in ?? () from /usr/lib/arm-linux-gnueabihf/libstdc++.so.6
#26 0xb63485e8 in start_thread (arg=0x0) at pthread_create.c:335
#27 0xb613c54a in ?? () at ../sysdeps/unix/sysv/linux/arm/clone.S:76 from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

thread 21 ---
(gdb) bt
#0  __libc_do_syscall () at ../sysdeps/unix/sysv/linux/arm/libc-do-syscall.S:46
#1  0xb634f0cc in __lll_lock_wait (futex=futex@entry=0xb4d59ba0, private=0) at lowlevellock.c:46
#2  0xb634a40a in __GI___pthread_mutex_lock (mutex=0xb4d59ba0) at pthread_mutex_lock.c:80
#3  0x005258ca in __gthread_mutex_lock (__mutex=0xb4d59ba0)
    at /usr/arm-linux-gnueabihf/include/c++/6/arm-linux-gnueabihf/bits/gthr-default.h:748
#4  0x005285e4 in std::mutex::lock (this=0xb4d59ba0) at /usr/arm-linux-gnueabihf/include/c++/6/bits/std_mutex.h:103
#5  0x005287a4 in std::lock_guard<std::mutex>::lock_guard (this=0xaaafc9d0, __m=...)
    at /usr/arm-linux-gnueabihf/include/c++/6/bits/std_mutex.h:162
#6  0x0057eb30 in awsiotsdk::ClientCoreState::PerformAction (this=0xb4d59b50,
    action_type=awsiotsdk::ActionType::DISCONNECT, p_action_data=std::shared_ptr (count 1, weak 0) 0xa7163054,
    action_reponse_timeout=...) at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/ClientCoreState.cpp:105
#7  0x0054205a in awsiotsdk::mqtt::KeepaliveActionRunner::PerformAction (this=0xb4d19b90,
    p_network_connection=std::shared_ptr (count 8, weak 0) 0xb4d61210, p_action_data=std::shared_ptr (empty) 0x0)
    at /home/willc/rprt-rtg/RTG/third-party/aws-iot/src/mqtt/Connect.cpp:579
#8  0x0057e236 in std::__invoke_impl<awsiotsdk::ResponseCode, awsiotsdk::ResponseCode (awsiotsdk::Action::* const&)(std:
:shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>), std::unique_ptr<awsiotsdk::Action, s
td::default_delete<awsiotsdk::Action> >&, std::shared_ptr<awsiotsdk::NetworkConnection>&, std::shared_ptr<awsiotsdk::Act
ionData>&> (
    __f=@0xb4d19c94: &virtual awsiotsdk::Action::PerformAction(std::shared_ptr<awsiotsdk::NetworkConnection>, std::share
d_ptr<awsiotsdk::ActionData>), __t=std::unique_ptr<awsiotsdk::Action> containing 0xb4d19b90,
    __args#0=std::shared_ptr (count 8, weak 0) 0xb4d61210, __args#1=std::shared_ptr (empty) 0x0)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:227
#9  0x0057e160 in std::__invoke<awsiotsdk::ResponseCode (awsiotsdk::Action::* const&)(std::shared_ptr<awsiotsdk::Network
Connection>, std::shared_ptr<awsiotsdk::ActionData>), std::unique_ptr<awsiotsdk::Action, std::default_delete<awsiotsdk::
Action> >&, std::shared_ptr<awsiotsdk::NetworkConnection>&, std::shared_ptr<awsiotsdk::ActionData>&> (
    __fn=@0xb4d19c94: &virtual awsiotsdk::Action::PerformAction(std::shared_ptr<awsiotsdk::NetworkConnection>, std::shar
ed_ptr<awsiotsdk::ActionData>), __args#0=std::unique_ptr<awsiotsdk::Action> containing 0xb4d19b90,
    __args#1=std::shared_ptr (count 8, weak 0) 0xb4d61210, __args#2=std::shared_ptr (empty) 0x0)
---Type <return> to continue, or q <return> to quit---
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:251
#10 0x0057e09c in std::_Mem_fn_base<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_ptr<awsiotsdk::NetworkCon
nection>, std::shared_ptr<awsiotsdk::ActionData>), true>::operator()<std::unique_ptr<awsiotsdk::Action, std::default_del
ete<awsiotsdk::Action> >&, std::shared_ptr<awsiotsdk::NetworkConnection>&, std::shared_ptr<awsiotsdk::ActionData>&> (
    this=0xb4d19c94, __args#0=std::unique_ptr<awsiotsdk::Action> containing 0xb4d19b90,
    __args#1=std::shared_ptr (count 8, weak 0) 0xb4d61210, __args#2=std::shared_ptr (empty) 0x0)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:604
#11 0x0057df40 in std::_Bind<std::_Mem_fn<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_ptr<awsiotsdk::Netw
orkConnection>, std::shared_ptr<awsiotsdk::ActionData>)> (std::unique_ptr<awsiotsdk::Action, std::default_delete<awsiots
dk::Action> >, std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)>::__call<awsiotsdk
::ResponseCode, , 0u, 1u, 2u>(std::tuple<>&&, std::_Index_tuple<0u, 1u, 2u>) (this=0xb4d19c94,
    __args=<unknown type in /home/debian/rtg/cmake-build-debug-bbb/bin/rtg, CU 0xcc392a, DIE 0xcd9157>)
    at /usr/arm-linux-gnueabihf/include/c++/6/functional:934
#12 0x0057de1a in std::_Bind<std::_Mem_fn<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_ptr<awsiotsdk::Netw
orkConnection>, std::shared_ptr<awsiotsdk::ActionData>)> (std::unique_ptr<awsiotsdk::Action, std::default_delete<awsiots
dk::Action> >, std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)>::operator()<, aws
iotsdk::ResponseCode>() (this=0xb4d19c94) at /usr/arm-linux-gnueabihf/include/c++/6/functional:993
#13 0x0057dd8e in std::_Bind_simple<std::_Bind<std::_Mem_fn<awsiotsdk::ResponseCode (awsiotsdk::Action::*)(std::shared_p
tr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)> (std::unique_ptr<awsiotsdk::Action, std::defa
ult_delete<awsiotsdk::Action> >, std::shared_ptr<awsiotsdk::NetworkConnection>, std::shared_ptr<awsiotsdk::ActionData>)>
 ()>::_M_invoke<>(std::_Index_tuple<>) (this=0xb4d19c94) at /usr/arm-linux-gnueabihf/include/c++/6/functional:1391

@chaurah
Copy link
Contributor

chaurah commented Feb 1, 2018

Hi @lerebel103,
Thank you for this. I tried to do this in an actual scenario when it was reported and couldn't. Then it got de-prioritized when other things demanded my attention. This is a tremendous help. I don't have a workaround just yet but I am working on it. Give me a couple of days please to rework things in a way that doesn't break existing expected behavior. If you have any suggestions in the mean time, please do let me know.

Rahul

@lerebel103
Copy link

Thanks @chaurah,
Glad to hear this helps, it was a curly one to pin down in production. I am away for a could of days too, perhaps we can sync up next week.
Cheers,
Will.

@larsonmpdx
Copy link
Contributor Author

I haven't spent time to understand your existing mutexes but I'll add two things

  • for this library you can probably reduce the number of mutexes you're using, possibly to just one "big aws lock", and you won't see a performance impact. A single lock is immensely easier to reason about than multiple locks.
    Ensure this lock is held and released quickly. If you need concurrency controls for longer duration things like waiting on a packet to return, use condition variables (as you already do), std::atomic<>, or promise/future, but avoid the mutex
  • the tsan integration which fails on your existing tests probably catches this. you should iterate until you can run tsan clean

@rwightman
Copy link

rwightman commented Feb 4, 2018

I'm considering using this code for a project and after a quick analysis it's clear that the locking here needs rework. I visually traced at least one deadlock. Even simplifying to one lock would still be problematic as there would be recursive locking paths. I feel using recursive_mutex is generally not the best fix for those.

To make the locking design safer, I find the easiest way of achieving this in a manner that's easier to reason about is to be stricter about the interface design. On any team I would be involved with this design would be sent back for refactor with following recommendations:

  • Consider how many locks you actually need to prevent unnecessary blocking or lock contention. More locks is always harder to reason about. This design looks overly granular.
  • Make it VERY clear which locks are protecting which data or associated with what condition variables. Group related data and lock declarations together, and also make it clear which functions work with data under the same lock whenever possible. If you have granular locks protecting specific member variables, put the locks exclusively in private methods that only interact with these variables, or refactor related data, lock, and methods into another child class. This is a problem in this design.
  • Don't overlap responsibility of the locks. This may be an issue here, but needs close inspection.
  • If you want to protect your public class methods with locks, do not also call those methods (or bind callbacks with them) from within other class methods. Be very strict with your public and private class interface when locks are involved. If you have public methods that lock that you also want to call from within your class, split them so that there is both a DoSomething() public method that performs the locking and calls a private DoSomething_() method with the logic and no lock. Other class methods that already have the lock will also call or bind the private method and NOT the public method. This is a big problem here.

Many of these can be addressed in a more significant refactoring of the class where you break out each lock and it's associated data and manipulation methods into its own class that is used by the parent. This make things much clearer and less error prone if the discipline isn't there to make it clear through naming and grouping conventions. ClientCoreState here could definitely be refactored into multiple classes around the locking.

@chaurah
Copy link
Contributor

chaurah commented Feb 5, 2018

Hi @larsonmpdx and @rwightman,
Thank you for the feedback. Let me add my general opinion on the SDK as it stands. I have mentioned this before in other posts but it makes sense to reitreate it. The SDK was designed this way for specific reasons that I can't really discuss here. My longer term view for this SDK has been to simply rewrite the internal APIs and simplify. A thin MQTT layer which leaves a lot more up to the parent Application. There is more justification for this in my opinion as more of these finer issues are reported and I plan to dedicate some time this week to figure out the best path forward. If I can justify it, I will rework this and simplify things along the lines you both have mentioned plus some of my own thoughts on this. If you have any further suggestions, definitely do let me know.

@larsonmpdx: I will look into simplifying the locks if I end up simply fixing this design. I think a read and write lock at the network layer + a sync/async op lock at client core may make this simpler barring any complications that @rwightman mentions. If you have any specific thoughts, would love to hear them.

@rwightman: Thank you again for the excellent list of recommendations.

  1. Addressed the simplification of locking above
  2. I agree on the granularity of locks
  3. Agreed
  4. This seems to be similar to what I did with the Network layer code. If I understand correctly, you're suggesting a similar model for the Client Core interface as well?

I know this has been open for a long time. But I will address this and your feedback is extremely important in helping me decide what the next step should be and I really appreciate you both taking the time to provide it. As I have repeatedly said, please do keep it coming.

I will have further updates soon :)

Rahul

@lerebel103
Copy link

Hi @chaurah,
Just picking this up again, @larsonmpdx and @rwightman provided excellent feedback.
Will.

@lerebel103
Copy link

Hi,
Is there an update on this issue please?
Will.

@larsonmpdx
Copy link
Contributor Author

larsonmpdx commented Mar 8, 2018

I think I'm seeing this in production now, but our message rates are fairly low so it's every 2-4 days

@lerebel103
Copy link

Hello Amazon Web Services,
Any news on a resolution for this thread lock please? We may have missed it but a quick scan through the recent commits doesn't seem to indicate that this issue is getting attention?

As noted, aws-iot-device-sdk-cpp remains unusable at this stage and locks up within hours.
Will.

@vareddy-zz
Copy link
Contributor

vareddy-zz commented May 4, 2018

Hi @lerebel103,
Thanks for your help and being patient with us. As you know, it is not a simple fix when it involves making changes to the threading model. However, I wanted to let you know that we are actively investigating this issue to resolve it as soon as possible. Again thanks for all the feedback and help so far! If there are additional feedback and help you can provide, please feel free to comment on this issue.
Thanks!
Varun

@lerebel103
Copy link

Thanks @vareddy, we appreciate that this is actively being resolved, as it is critical from a device point of view.

@vareddy-zz
Copy link
Contributor

Hi @lerebel103,
We are looking to reproduce the issue and in your original comment, you state that:

This in turn causes a ClientCoreState::DeletePendingAck() call which can't acquire ack_map_lock_ locked by ClientCoreState::DeleteExpiredAcks()

The SDK itself does not call DeleteExpiredAcks() internally and it is not exposed. I am also not able to find it in the stack trace that you posted. Are you calling DeleteExpiredAcks().

Could you provide a code snippet if possible?
Thanks!
Varun

@lerebel103
Copy link

Hi @vareddy,

See:

DeletePendingAck(p_action_data->GetActionId());

DeleteExpiredAcks() is called internally from ClientCoreState::ProcessOutboundActionQueue(). Please note that for brevity I only provided a subset of the Traces for discussion.

However, by observing how sync_action_response_lock_ is acquired and held as per steps in my earlier comments, one can see that a dead-lock is inevitable (also picked up by llvm thread sanitizer).

Re code snippet: I am afraid we're not using the SDK in a particularly unusual manner, other than attempting to send regular JSON payloads on a single Topic, over a persistent Client instance/connection (Cellular).

Will.

@vareddy-zz
Copy link
Contributor

vareddy-zz commented May 9, 2018

Hi @lerebel103,
We're still debugging your issue but we have a patch that you can try with your application, and check if helps resolve the deadlock. Can you replace your ForwardReceivedAck function with the below code:

void ClientCoreState::ForwardReceivedAck(uint16_t action_id, ResponseCode rc) {
        ack_map_lock_.lock();
        // No response code because all Acks might not have registered handlers. No other possible error
        util::Map<uint16_t, std::unique_ptr<PendingAckData>>::const_iterator itr = pending_ack_map_.find(action_id);
        if (itr != pending_ack_map_.end()) {
            ActionData::AsyncAckNotificationHandlerPtr ack_handler = itr->second->p_async_ack_handler_;
            pending_ack_map_.erase(itr);
            ack_map_lock_.unlock();
            ack_handler(action_id, rc);
            return;
        }
        ack_map_lock_.unlock();
    }

Edit: Another question, are you using sync publish or async publish in your application?
Thanks!
Varun

@vareddy-zz
Copy link
Contributor

vareddy-zz commented May 21, 2018

HI @lerebel103 and @larsonmpdx,
We have created another branch with major changes to the locking system we currently use in the SDK. We ran it through thread sanitizer and it threw no warnings. We did not change the public APIs themselves so there ideally should not be any compatibility issues between the current SDK (on master and release) and this SDK (locking-fixes). Could you please test this branch with your setups?

Any feedback you provide will be valuable!
Thanks!
Varun

@lerebel103
Copy link

lerebel103 commented May 22, 2018 via email

@lerebel103
Copy link

lerebel103 commented May 22, 2018 via email

@larsonmpdx
Copy link
Contributor Author

thank you, I don't think I actually saw this problem in the end (it was the connection stuff in #80). I just suspected this when I saw tsan warnings. I'll give this a test this week just to look for regressions. good job fixing all the warnings

@lerebel103
Copy link

Thanks, for the fixes, we've tried it and so far I haven't seen a deadlock. We'll report back in a few days.
Will.

@vareddy-zz
Copy link
Contributor

Thanks for the update @lerebel103. Please let us know if anything comes up in your extended testing.
Thanks!
Varun

@philllx
Copy link

philllx commented Nov 21, 2018

Hi Guys,
Do we have an ETA on when this will hit the release branch?

@graebm graebm added the bug This issue is a bug. label Jan 18, 2019
@JonathanHenson
Copy link

We're not planning on fixing this since doing so would basically be a rewrite of the entire SDK. We have been putting our effort into a new version of this SDK that should avoid all of these problems. It's currently in dev preview if you'd like to test with that:

github.com/awslabs/aws-iot-device-sdk-cpp-v2

@jaredgisin
Copy link

Reopening as this is still a known issue in v1.4.0

If you are experiencing this issue, we have a proposed fix in a branch that we encourage you to try here: (locking-fixes). If you find success with this branch, please let us know by commenting on this thread.

The long-term fix for this will be in the new (C++ IoT Device SDK which is currently available in developer preview).

@jaredgisin jaredgisin reopened this Feb 18, 2019
@justnance justnance added guidance Question that needs advice or information. help wanted We are asking the community to submit a PR to resolve this issue. wontfix We have determined that we will not resolve the issue. and removed help wanted labels Apr 19, 2019
@KaibaLopez KaibaLopez removed the wontfix We have determined that we will not resolve the issue. label Feb 25, 2020
@github-actions
Copy link

github-actions bot commented Sep 3, 2020

Greetings! Sorry to say but this is a very old issue that is probably not getting as much attention as it deserves. We encourage you to try V2 and if you find that this is still a problem, please feel free to open a new issue there.

@github-actions github-actions bot added closing-soon This issue will automatically close in 4 days unless further comments are made. closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Sep 3, 2020
@github-actions github-actions bot closed this as completed Sep 7, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue is a bug. closed-for-staleness guidance Question that needs advice or information. help wanted We are asking the community to submit a PR to resolve this issue.
Projects
None yet
Development

No branches or pull requests