-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++] Fix incorrect connect timeout implementation #11889
Merged
BewareMyPower
merged 1 commit into
apache:master
from
BewareMyPower:bewaremypower/fix-cpp-connect-timeout
Sep 2, 2021
Merged
[C++] Fix incorrect connect timeout implementation #11889
BewareMyPower
merged 1 commit into
apache:master
from
BewareMyPower:bewaremypower/fix-cpp-connect-timeout
Sep 2, 2021
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
merlimat
approved these changes
Sep 2, 2021
hangc0276
pushed a commit
that referenced
this pull request
Sep 4, 2021
Fixes #10721 ### Motivation Currently the connection timer stops after the TCP connection is established immediately (the state becomes `TcpConnected`). However, the connect phase should also include sending the `CommandConnect` request and receiving the `CommandConnected` response from broker successfully. For example, in a case like #10721 described, if the Pulsar broker received a SIGSTOP signal and became pending, the TCP connection can be established but no response could be received from a stopped broker. Then the client would hang forever. ### Modifications Stop the connection timer only after the `ClientConnection`'s state becomes `Ready`, which means the client has received the `CommandConnected` response successfully. It's also consistent with Java client's implementation, see https://github.com/apache/pulsar/blob/235e678a56d0284e68b45e46706b6237d7c6d5f9/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L329-L330 ### Verifying this change - [ ] Make sure that the change passes the CI checks. It's hard to simulate the scenario in unit test. But we can simply reproduce it in local env. 1. Run a pulsar standalone and send `SIGSTOP` signal by `pkill -SIGSTOP -f pulsar`. 2. Run any C++/Python client with this fix to connect to the standalone. ``` >>> import pulsar >>> c = pulsar.Client('pulsar://localhost:6650') >>> c.create_producer('xxx') 2021-09-02 11:54:38.828 INFO [0x10d72ae00] ClientConnection:181 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000 2021-09-02 11:54:38.828 INFO [0x10d72ae00] ConnectionPool:96 | Created connection for pulsar://localhost:6650 2021-09-02 11:54:38.969 INFO [0x700003136000] ClientConnection:367 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connected to broker 2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:532 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection was not established in 10000 ms, close the socket 2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:572 | [127.0.0.1:62784 -> 127.0.0.1:6650] Read failed: Operation canceled 2021-09-02 11:54:48.973 INFO [0x700003136000] ClientConnection:1495 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection closed 2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientImpl:188 | Error Checking/Getting Partition Metadata while creating producer on persistent://public/default/xxx -- ConnectError Traceback (most recent call last): # ... _pulsar.ConnectError: Pulsar error: ConnectError ``` we can see after 10 seconds (the default connect timeout), it failed. (cherry picked from commit cdd0f41)
bharanic-dev
pushed a commit
to bharanic-dev/pulsar
that referenced
this pull request
Mar 18, 2022
Fixes apache#10721 ### Motivation Currently the connection timer stops after the TCP connection is established immediately (the state becomes `TcpConnected`). However, the connect phase should also include sending the `CommandConnect` request and receiving the `CommandConnected` response from broker successfully. For example, in a case like apache#10721 described, if the Pulsar broker received a SIGSTOP signal and became pending, the TCP connection can be established but no response could be received from a stopped broker. Then the client would hang forever. ### Modifications Stop the connection timer only after the `ClientConnection`'s state becomes `Ready`, which means the client has received the `CommandConnected` response successfully. It's also consistent with Java client's implementation, see https://github.com/apache/pulsar/blob/235e678a56d0284e68b45e46706b6237d7c6d5f9/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L329-L330 ### Verifying this change - [ ] Make sure that the change passes the CI checks. It's hard to simulate the scenario in unit test. But we can simply reproduce it in local env. 1. Run a pulsar standalone and send `SIGSTOP` signal by `pkill -SIGSTOP -f pulsar`. 2. Run any C++/Python client with this fix to connect to the standalone. ``` >>> import pulsar >>> c = pulsar.Client('pulsar://localhost:6650') >>> c.create_producer('xxx') 2021-09-02 11:54:38.828 INFO [0x10d72ae00] ClientConnection:181 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000 2021-09-02 11:54:38.828 INFO [0x10d72ae00] ConnectionPool:96 | Created connection for pulsar://localhost:6650 2021-09-02 11:54:38.969 INFO [0x700003136000] ClientConnection:367 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connected to broker 2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:532 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection was not established in 10000 ms, close the socket 2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:572 | [127.0.0.1:62784 -> 127.0.0.1:6650] Read failed: Operation canceled 2021-09-02 11:54:48.973 INFO [0x700003136000] ClientConnection:1495 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection closed 2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientImpl:188 | Error Checking/Getting Partition Metadata while creating producer on persistent://public/default/xxx -- ConnectError Traceback (most recent call last): # ... _pulsar.ConnectError: Pulsar error: ConnectError ``` we can see after 10 seconds (the default connect timeout), it failed.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
cherry-picked/branch-2.8
Archived: 2.8 is end of life
doc-not-needed
Your PR changes do not impact docs
release/2.8.1
release/2.9.0
type/bug
The PR fixed a bug or issue reported a bug
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes #10721
Motivation
Currently the connection timer stops after the TCP connection is established immediately (the state becomes
TcpConnected
). However, the connect phase should also include sending theCommandConnect
request and receiving theCommandConnected
response from broker successfully. For example, in a case like #10721 described, if the Pulsar broker received a SIGSTOP signal and became pending, the TCP connection can be established but no response could be received from a stopped broker. Then the client would hang forever.Modifications
Stop the connection timer only after the
ClientConnection
's state becomesReady
, which means the client has received theCommandConnected
response successfully. It's also consistent with Java client's implementation, seepulsar/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java
Lines 329 to 330 in 235e678
Verifying this change
It's hard to simulate the scenario in unit test. But we can simply reproduce it in local env.
SIGSTOP
signal bypkill -SIGSTOP -f pulsar
.we can see after 10 seconds (the default connect timeout), it failed.