[Issue 4070][pulsar-client-cpp] Fix for possible deadlock when closing Pulsar client #6277

heronr · 2020-02-09T07:15:05Z

Motivation

This change fixes a possible deadlock when closing the Pulsar client, caused by the ExecutorService worker thread attempting to join itself.

Modifications

The close() method on the ExecutorService no longer joins the worker_ thread when its thread id matches that of the calling thread. The type of worker_ was changed to std::thread to allow this check, since boost::asio::detail::thread does not expose the thread id.
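
As a rough illustration of the guard described above, here is a minimal sketch. MiniExecutor, run(), running_ and workerJoined() are hypothetical names invented for this example; only close() and worker_ come from the description, and the real ExecutorService is more involved.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for ExecutorService (not the real class).
class MiniExecutor {
   public:
    MiniExecutor() : running_(true), worker_([this] { run(); }) {}
    ~MiniExecutor() { close(); }

    void close() {
        running_ = false;
        // A thread cannot join itself, so only join when the caller is a
        // different thread than the worker.
        if (worker_.joinable() && std::this_thread::get_id() != worker_.get_id()) {
            worker_.join();
        }
    }

    bool workerJoined() const { return !worker_.joinable(); }

   private:
    void run() {
        while (running_) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }

    // Declared before worker_ so it is initialized before the thread starts.
    std::atomic<bool> running_;
    std::thread worker_;
};

// Closing from a non-worker thread takes the join() branch.
bool closeFromOtherThread() {
    MiniExecutor executor;
    assert(!executor.workerJoined());  // worker is running and still joinable
    executor.close();
    return executor.workerJoined();
}
```

When close() is reached from the worker itself, the sketch skips join() and the std::thread stays joinable, so it must not be destroyed in that state.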

Verifying this change

Make sure that the change passes the CI checks.

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API: no
The schema: no
The default values of configurations: no
The wire protocol: no
The rest endpoints: no
The admin cli options: no
Anything that affects deployment: no

Documentation

Does this pull request introduce a new feature? no

merlimat · 2020-02-10T21:54:31Z

@heronr I think there might be a possible issue with this change.

I'm getting a segfault when running the tests, along with test failures (for unrelated issues):

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f351d252899 in __GI_abort () at abort.c:79
#2  0x00007f351d4d994a in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f351d4e535c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f351d4e53c7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000055708bcafbbf in std::thread::~thread (this=0x55708c947268, __in_chrg=<optimized out>) at /usr/include/c++/9/thread:139
#6  0x00007f351da183f2 in pulsar::ExecutorService::~ExecutorService (this=0x55708c947250, __in_chrg=<optimized out>) at /pulsar/pulsar-client-cpp/lib/ExecutorService.cc:52

where we have:

    ~thread()
    {
      if (joinable())
        std::terminate();
    }

So I think that in this case we have not called join() before the thread was actually destroyed.
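
The destructor rule quoted above can be observed through joinable(). The following is a small sketch; JoinableStates and observeJoinable() are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper type for reporting what joinable() returned.
struct JoinableStates {
    bool beforeJoin;
    bool afterJoin;
};

JoinableStates observeJoinable() {
    std::thread empty;          // no associated thread of execution
    assert(!empty.joinable());  // safe to destroy as-is

    JoinableStates states{};
    std::thread t([] {});
    states.beforeJoin = t.joinable();  // destroying t now would call std::terminate()
    t.join();
    states.afterJoin = t.joinable();   // after join(), t is safe to destroy
    return states;
}
```

A std::thread that owns a running or finished-but-unjoined thread is joinable, and that is exactly the state in which its destructor aborts the process.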

heronr · 2020-02-11T01:08:02Z

This implies that there is a thread leak in certain cases, probably caused by the ExecutorService being destroyed in the context of the worker thread. I will try to run the tests locally and track this down, but it may prove difficult to resolve.

merlimat · 2020-02-11T17:51:40Z

@heronr I don't think there's a thread leak. Rather, the error seems to be that the std::thread object is destroyed before the actual thread was stopped.

The main issue I see here is that the ~std::thread() is being invoked from the thread itself and therefore it's not "joined" and it cannot be "joined" under any circumstance.

I believe the only solution is to ensure that ~ExecutorService() is never called from the worker thread itself.

heronr · 2020-02-12T08:05:00Z

Agreed on your assessment. I tracked one instance of a non-joined thread to the HTTPLookupService owning the ExecutorServiceProvider that it posts its own work onto. Now the provider it uses is owned by the ClientImpl, but this just exposed a different non-joined thread caused by the ClientImpl itself being destroyed on the ExecutorService worker thread. Here is the callstack

Ultimately, because shared_ptrs are being used to manage the lifetime of objects that then post work to ExecutorServices owned by those same objects, we can end up with a thread that cannot be joined, since there is no fine-grained control over when the ref count goes to 0.

If we can designate the pulsar::Client as non-copyable then its destructor can force a shutdown() on the ClientImpl and ensure that all outstanding worker threads are joined at that point as a result. The latest exception that I linked stems from the fact that if a pulsar::Client is destructed without first calling shutdown, any outstanding work on an ExecutorService thread will likely result in the destruction of the ClientImpl on that thread and make it impossible to join.
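
To make the lifetime problem concrete, here is a minimal sketch of how the last shared_ptr reference can end up being released on a worker thread. Owner and destroyedOnWorker() are hypothetical; Owner merely stands in for an object such as ClientImpl, and a plain std::thread stands in for the ExecutorService worker.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <thread>

// Hypothetical owner type that records which thread ran its destructor.
struct Owner {
    explicit Owner(std::thread::id* destroyedOn) : destroyedOn_(destroyedOn) {}
    ~Owner() { *destroyedOn_ = std::this_thread::get_id(); }
    std::thread::id* destroyedOn_;
};

// Returns true when ~Owner() ran on the worker thread rather than the caller.
bool destroyedOnWorker() {
    std::thread::id destroyedOn;
    std::thread::id workerId;
    std::promise<void> released;
    std::future<void> releasedFuture = released.get_future();
    auto owner = std::make_shared<Owner>(&destroyedOn);

    // The task holds its own copy of the shared_ptr, as a posted callback would.
    std::thread worker([owner, &workerId, &releasedFuture]() mutable {
        workerId = std::this_thread::get_id();
        releasedFuture.wait();  // wait until the caller has dropped its reference
        owner.reset();          // last reference: ~Owner() runs on this thread
    });

    owner.reset();                             // the caller lets go first
    assert(destroyedOn == std::thread::id());  // the object is still alive
    released.set_value();
    worker.join();
    return destroyedOn == workerId;
}
```

If Owner also owned the std::thread running that task, its destructor would be running on the very thread it needs to join.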

merlimat · 2020-02-12T18:01:17Z

@heronr I'm not sure that fixes the underlying issue. It will still result in (another) std::thread object being destroyed from its own thread.

Actually, I think the solution should be easy: just use std::thread::detach() (http://www.cplusplus.com/reference/thread/thread/detach/) on top of your original commit:

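// detach() on a non-joinable thread throws, so check joinable() first.
if (worker_.joinable()) {
    if (std::this_thread::get_id() != worker_.get_id()) {
        worker_.join();
    } else {
        // Called from the worker itself: it cannot join, so detach instead.
        worker_.detach();
    }
}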
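
The join-or-detach idea can be sketched end to end as follows, under stated assumptions. OneShotExecutor, detachedOut_ and destroyOnOwnWorker() are hypothetical names for this illustration, and the executor runs a single task rather than an event loop; this is not the actual ExecutorService code.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <utility>

// Hypothetical single-task executor; only worker_ mirrors a name from the PR.
class OneShotExecutor {
   public:
    OneShotExecutor(std::function<void()> task, bool* detachedOut)
        : detachedOut_(detachedOut), worker_(std::move(task)) {}

    ~OneShotExecutor() {
        if (!worker_.joinable()) {
            return;  // already joined or detached
        }
        if (std::this_thread::get_id() != worker_.get_id()) {
            worker_.join();
        } else {
            // Running on the worker itself: it cannot join, so detach.
            worker_.detach();
            *detachedOut_ = true;
        }
    }

   private:
    bool* detachedOut_;
    std::thread worker_;
};

// Hands the executor to its own task so the destructor runs on the worker.
bool destroyOnOwnWorker() {
    bool detached = false;
    std::promise<std::unique_ptr<OneShotExecutor>> handoff;
    auto handoffFuture = handoff.get_future();
    // Heap-allocated so the detached thread can finish set_value() safely
    // even after this function has returned.
    auto done = std::make_shared<std::promise<void>>();
    auto doneFuture = done->get_future();

    auto executor = std::make_unique<OneShotExecutor>(
        [&handoffFuture, done] {
            std::unique_ptr<OneShotExecutor> self = handoffFuture.get();
            self.reset();  // ~OneShotExecutor() takes the detach branch
            done->set_value();
        },
        &detached);
    handoff.set_value(std::move(executor));
    assert(executor == nullptr);  // ownership now belongs to the task
    doneFuture.wait();
    return detached;
}
```

Once detached, the thread must not touch the destroyed object again. The sketch keeps everything the task still needs (the done promise) in the task's own captures for that reason.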

heronr · 2020-02-13T05:32:42Z

Yes, I also considered using detach() as it is the path of least resistance. I just can't help but feel like it's a workaround solution that sweeps the underlying problem under the rug.
That being said, for the purposes of this PR I will go ahead and use std::thread::detach() to remove the deadlock. I will revisit this later with a (hopefully) better solution.

…g Pulsar client (apache#6277) * Attempt at fixing deadlock during client.close() * Fixed formatting * Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined * Possible formatting fixes

…g Pulsar client (apache#6277) * Attempt at fixing deadlock during client.close() * Fixed formatting * Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined * Possible formatting fixes (cherry picked from commit 2e1c74a)

heronr force-pushed the DeadlockFix branch from 42de203 to 1a0fb88 Compare February 9, 2020 07:41

merlimat approved these changes Feb 9, 2020

heronr force-pushed the DeadlockFix branch from 1a0fb88 to 78defd4 Compare February 10, 2020 00:16

heronr marked this pull request as ready for review February 10, 2020 00:42

sijie approved these changes Feb 10, 2020

merlimat added the component/c++ and type/bug labels Feb 10, 2020

merlimat added this to the 2.6.0 milestone Feb 10, 2020

merlimat added the release/2.5.1 label Feb 10, 2020

heronr force-pushed the DeadlockFix branch from 1a33ce7 to d1b7522 Compare February 13, 2020 06:51

heronr added 4 commits February 13, 2020 08:24

Attempt at fixing deadlock during client.close()

cd5084b

Fixed formatting

07e6132

Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined

4638fd9

Possible formatting fixes

877ff73

heronr force-pushed the DeadlockFix branch from d1b7522 to 877ff73 Compare February 13, 2020 16:30

sijie assigned heronr Feb 13, 2020

merlimat merged commit 2e1c74a into apache:master Feb 14, 2020

heronr deleted the DeadlockFix branch February 14, 2020 04:38
