Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Fix request timeout for GetLastMessageId doesn't work #12586

Merged

Conversation

BewareMyPower
Copy link
Contributor

Motivation

#11600 adds the timeout for GetLastMessageId request by using sendRequestWithId instead of sendCommand. However, it's still incorrect. Because when the request timeout exceeds, the future that is completed with ResultTimeout is what sendRequestWithId returns but not the promise.getFuture(). Therefore, even if the request was not responded in operationTimeout seconds, the future returned by newGetLastMessageId would still be not completed.

Besides, when I tried to complete the promise in sendRequestWithId's callback, I found a deadlock issue if ServerCnx#handleGetLastMessageId hang forever (I just add a long sleep call in this method).

    frame #3: std::__1::mutex::lock() + 9
    frame #4: pulsar::Promise<pulsar::Result, pulsar::ResponseData>::setFailed(pulsar::Result) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock(this=<unavailable>, __m=0x00007ffb1b0044d8) at __mutex_base:119:61 [opt]
    frame #5: pulsar::Promise<pulsar::Result, pulsar::ResponseData>::setFailed(pulsar::Result) const [inlined] std::__1::unique_lock<std::__1::mutex>::unique_lock(this=<unavailable>, __m=0x00007ffb1b0044d8) at __mutex_base:119 [opt]
    frame #6: pulsar::Promise<pulsar::Result, pulsar::ResponseData>::setFailed(this=0x00007ffb1b0046e8, result=ResultConnectError) const at Future.h:118 [opt]
    frame #7: pulsar::ClientConnection::close(this=<unavailable>, result=ResultConnectError) at ClientConnection.cc:1556:27 [opt]

We can see Promise::setFailed stuck in ClientConnection::close:

     for (auto& kv : pendingRequests) {
         kv.second.promise.setFailed(ResultConnectError);
     }  

It's because the future's callback is called in setFailed. However, the callback also calls setFailed, which tries to acquire the same lock that is not reentrant. So deadlock happens.

Modifications

Refactor the Future/Promise infrastructure. The current design is too old and the code style is bad. The important things of the refactoring are:

  1. Change completed_ field (the original complete field) to an atomic variable. So when checking if the future is completed, no lock is required.
  2. Move the triggering of listeners_ out of the locked code block. So that each listener (callback) will be triggered without acquiring any lock.
  3. Move conditional_variable::notify_all() out of the locked code block as well. The notifying thread does not need to hold the lock, see https://en.cppreference.com/w/cpp/thread/condition_variable/notify_all.
  4. Move all related implementations into InternalState so that Future and Promise only need to call them directly.

Then add a PromiseTest to protect the refactoring. Based on the refactor, just add a callback to sendRequestWithId in newGetLastMessageId to make sure the request timeout works.

Verifying this change

  • Make sure that the change passes the CI checks.

It's hard to simulate the operation timeout. So I have to add the following code to ServerCnx#handleGetLastMessageId

        try {
            Thread.sleep(1000000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

and run a reader to call hasMessageAvailable with 3 seconds operation timeout. After it failed, the client exited after 6 seconds, which is twice the operation timeout. The logs are:

2021-11-02 20:39:39.685 ERROR [0x11b16de00] SampleConsumer:48 | Failed to check hasMessageAvailable: TimeOut
2021-11-02 20:39:39.685 INFO  [0x11b16de00] ClientImpl:492 | Closing Pulsar client with 0 producers and 1 consumers
2021-11-02 20:39:39.685 INFO  [0x11b16de00] ConsumerImpl:894 | [persistent://public/default/my-topic-0, reader-35479ac30a, 0] Closing consumer for topic persistent://public/default/my-topic-0
2021-11-02 20:39:43.686 ERROR [0x700006b1e000] ConsumerImpl:952 | [persistent://public/default/my-topic-0, reader-35479ac30a, 0] Failed to close consumer: TimeOut
2021-11-02 20:39:43.686 INFO  [0x700006b1e000] ClientConnection:1542 | [[::1]:60215 -> [::1]:6650] Connection closed
2021-11-02 20:39:43.687 INFO  [0x11b16de00] ClientConnection:255 | [[::1]:60215 -> [::1]:6650] Destroyed connection

@BewareMyPower BewareMyPower added type/bug The PR fixed a bug or issue reported a bug component/c++ doc-not-needed Your PR changes do not impact docs release/2.8.2 release/2.9.1 labels Nov 2, 2021
@BewareMyPower BewareMyPower added this to the 2.10.0 milestone Nov 2, 2021
@BewareMyPower BewareMyPower self-assigned this Nov 2, 2021
@BewareMyPower
Copy link
Contributor Author

I'll fix the compilation error soon.

@BewareMyPower BewareMyPower force-pushed the bewaremypower/cpp-correct-timeout branch from 9bf793b to 3d2a4fd Compare November 2, 2021 15:23
pulsar-client-cpp/lib/Future.h Outdated Show resolved Hide resolved
pulsar-client-cpp/lib/ClientConnection.cc Show resolved Hide resolved
pulsar-client-cpp/lib/Future.h Show resolved Hide resolved
@BewareMyPower BewareMyPower changed the title [C++] Fix request timeout for GetLastMessageId doesn't work [WIP][C++] Fix request timeout for GetLastMessageId doesn't work Nov 3, 2021
@BewareMyPower
Copy link
Contributor Author

Mark it as WIP first. I'll fix the tests in local env before pushing the commits. And I will revert the refactoring of Future.

@BewareMyPower BewareMyPower changed the title [WIP][C++] Fix request timeout for GetLastMessageId doesn't work [C++] Fix request timeout for GetLastMessageId doesn't work Nov 3, 2021
@merlimat merlimat merged commit a54c6c0 into apache:master Nov 3, 2021
@BewareMyPower BewareMyPower deleted the bewaremypower/cpp-correct-timeout branch November 4, 2021 01:17
hangc0276 pushed a commit that referenced this pull request Nov 4, 2021
* Fix request timeout for GetLastMessageId doesn't work

* Fix CentOS 7 build error

* Revert refactors

* Remove redundant clear for listeners

* Use swap instead of move

(cherry picked from commit a54c6c0)
@hangc0276 hangc0276 added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Nov 4, 2021
eolivelli pushed a commit to eolivelli/pulsar that referenced this pull request Nov 29, 2021
…2586)

* Fix request timeout for GetLastMessageId doesn't work

* Fix CentOS 7 build error

* Revert refactors

* Remove redundant clear for listeners

* Use swap instead of move
codelipenghui pushed a commit that referenced this pull request Dec 20, 2021
* Fix request timeout for GetLastMessageId doesn't work

* Fix CentOS 7 build error

* Revert refactors

* Remove redundant clear for listeners

* Use swap instead of move

(cherry picked from commit a54c6c0)
@codelipenghui codelipenghui added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label Dec 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.8 Archived: 2.8 is end of life cherry-picked/branch-2.9 Archived: 2.9 is end of life doc-not-needed Your PR changes do not impact docs release/2.8.2 release/2.9.2 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants