Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Issue 4070][pulsar-client-cpp] Fix for possible deadlock when closing Pulsar client #6277

Merged
merged 4 commits into from
Feb 14, 2020

Conversation

heronr
Copy link
Contributor

@heronr heronr commented Feb 9, 2020

Fixes #4070

Motivation

This change is to fix a possible deadlock that can occur when closing the Pulsar client that is caused by the ExecutorService worker thread attempting to join itself.

Modifications

The close() method on the ExecutorService will now not join the worker_ thread if its thread id is the same as the calling thread. The type of worker_ was changed to std::thread to allow for the check since the thread id is not exposed by boost::asio::detail::thread.

Verifying this change

  • Make sure that the change passes the CI checks.

This change is already covered by existing tests.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API: no
  • The schema: no
  • The default values of configurations: no
  • The wire protocol: no
  • The rest endpoints: no
  • The admin cli options: no
  • Anything that affects deployment: no

Documentation

  • Does this pull request introduce a new feature? no

@heronr heronr marked this pull request as ready for review February 10, 2020 00:42
@merlimat merlimat added component/c++ type/bug The PR fixed a bug or issue reported a bug labels Feb 10, 2020
@merlimat merlimat added this to the 2.6.0 milestone Feb 10, 2020
@merlimat
Copy link
Contributor

@heronr I think there might be a possible issue with this change.

I'm getting a segfault when running tests and tests failing (for unrelated issues):

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f351d252899 in __GI_abort () at abort.c:79
#2  0x00007f351d4d994a in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007f351d4e535c in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#4  0x00007f351d4e53c7 in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#5  0x000055708bcafbbf in std::thread::~thread (this=0x55708c947268, __in_chrg=<optimized out>) at /usr/include/c++/9/thread:139
#6  0x00007f351da183f2 in pulsar::ExecutorService::~ExecutorService (this=0x55708c947250, __in_chrg=<optimized out>) at /pulsar/pulsar-client-cpp/lib/ExecutorService.cc:52

where we have:

    ~thread()
    {
      if (joinable())
        std::terminate();
    }

So I think that in this case we have not called join() before the thread was actually destroyed.

@heronr
Copy link
Contributor Author

heronr commented Feb 11, 2020

This implies that there is thread leak in certain cases, probably caused by the Executor service being destroyed in the context of the worker thread. I will try to run the tests locally and track this down, but it may prove difficult to resolve.

@merlimat
Copy link
Contributor

@heronr I don't think there's a thread leak. Rather, the error seems to be that the std::thread object is destroyed before the actual thread was stopped.

The main issue I see here is that the ~std::thread() is being invoked from the thread itself and therefore it's not "joined" and it cannot be "joined" under any circumstance.

I believe the only solution is that we should ensure, that ~ExecutorService() is never called from the same thread.

@heronr
Copy link
Contributor Author

heronr commented Feb 12, 2020

Agreed on your assessment. I tracked one instance of a non-joined thread to the HTTPLookupService owning the ExecutorServiceProvider that it posts its own work onto. Now the provider it uses is owned by the ClientImpl, but this just exposed a different non-joined thread caused by the ClientImpl itself being destroyed on the ExecutorService worker thread. Here is the callstack

Ultimately because shared_ptrs are being used to manage the lifetime of objects that then post work to ExecutorServices owned by those same objects, we can encounter a non-joinable thread since there is not fine control over when the ref count goes to 0.

If we can designate the pulsar::Client as non-copyable then its destructor can force a shutdown() on the ClientImpl and ensure that all outstanding worker threads are joined at that point as a result. The latest exception that I linked stems from the fact that if a pulsar::Client is destructed without first calling shutdown, any outstanding work on an ExecutorService thread will likely result in the destruction of the ClientImpl on that thread and make it impossible to join.

@merlimat
Copy link
Contributor

@heronr I'm not sure that fixes the underlying issue. It still will be triggering (another) std::thread object being destroyed from itself.

Actually, I think the solution should be easy: just use std::sthread::detach() (http://www.cplusplus.com/reference/thread/thread/detach/) on top of your original commit:

if (std::this_thread::get_id() != worker_.get_id() && worker_.joinable()) {
    worker_.join();
} else {
    worker_.detach();
}

@heronr
Copy link
Contributor Author

heronr commented Feb 13, 2020

Yes, I also considered using detach() as it is the path of least resistance. I just can't help but feel like it's a workaround solution that sweeps the underlying problem under the rug.
That being said, for the purposes of this PR I will go ahead and use std::thread::detach() to remove the deadlock. I will revisit this later with a (hopefully) better solution.

@merlimat merlimat merged commit 2e1c74a into apache:master Feb 14, 2020
@heronr heronr deleted the DeadlockFix branch February 14, 2020 04:38
tuteng pushed a commit to AmateurEvents/pulsar that referenced this pull request Feb 23, 2020
…g Pulsar client (apache#6277)

* Attempt at fixing deadlock during client.close()

* Fixed formatting

* Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined

* Possible formatting fixes
tuteng pushed a commit to AmateurEvents/pulsar that referenced this pull request Mar 21, 2020
…g Pulsar client (apache#6277)

* Attempt at fixing deadlock during client.close()

* Fixed formatting

* Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined

* Possible formatting fixes

(cherry picked from commit 2e1c74a)
tuteng pushed a commit that referenced this pull request Apr 6, 2020
…g Pulsar client (#6277)

* Attempt at fixing deadlock during client.close()

* Fixed formatting

* Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined

* Possible formatting fixes

(cherry picked from commit 2e1c74a)
tuteng pushed a commit that referenced this pull request Apr 13, 2020
…g Pulsar client (#6277)

* Attempt at fixing deadlock during client.close()

* Fixed formatting

* Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined

* Possible formatting fixes

(cherry picked from commit 2e1c74a)
jiazhai pushed a commit to jiazhai/pulsar that referenced this pull request May 18, 2020
…g Pulsar client (apache#6277)

* Attempt at fixing deadlock during client.close()

* Fixed formatting

* Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined

* Possible formatting fixes

(cherry picked from commit 2e1c74a)
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020
…g Pulsar client (apache#6277)

* Attempt at fixing deadlock during client.close()

* Fixed formatting

* Detach the worker thread in the destructor of ExecutorService if it is still unable to be joined

* Possible formatting fixes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release/2.5.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[pulsar-client-cpp] Deadlock when closing the pulsar Client
3 participants