Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C++] Fix incorrect connect timeout implementation #11889

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Sep 2, 2021

Fixes #10721

Motivation

Currently the connection timer stops after the TCP connection is established immediately (the state becomes TcpConnected). However, the connect phase should also include sending the CommandConnect request and receiving the CommandConnected response from broker successfully. For example, in a case like #10721 described, if the Pulsar broker received a SIGSTOP signal and became pending, the TCP connection can be established but no response could be received from a stopped broker. Then the client would hang forever.

Modifications

Stop the connection timer only after the ClientConnection's state becomes Ready, which means the client has received the CommandConnected response successfully. It's also consistent with Java client's implementation, see

connectionFuture.complete(null);
state = State.Ready;

Verifying this change

  • Make sure that the change passes the CI checks.

It's hard to simulate the scenario in unit test. But we can simply reproduce it in local env.

  1. Run a pulsar standalone and send SIGSTOP signal by pkill -SIGSTOP -f pulsar.
  2. Run any C++/Python client with this fix to connect to the standalone.
>>> import pulsar
>>> c = pulsar.Client('pulsar://localhost:6650')
>>> c.create_producer('xxx')
2021-09-02 11:54:38.828 INFO  [0x10d72ae00] ClientConnection:181 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2021-09-02 11:54:38.828 INFO  [0x10d72ae00] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2021-09-02 11:54:38.969 INFO  [0x700003136000] ClientConnection:367 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connected to broker
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:532 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection was not established in 10000 ms, close the socket
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:572 | [127.0.0.1:62784 -> 127.0.0.1:6650] Read failed: Operation canceled
2021-09-02 11:54:48.973 INFO  [0x700003136000] ClientConnection:1495 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection closed
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientImpl:188 | Error Checking/Getting Partition Metadata while creating producer on persistent://public/default/xxx -- ConnectError
Traceback (most recent call last):
    # ...
_pulsar.ConnectError: Pulsar error: ConnectError

we can see after 10 seconds (the default connect timeout), it failed.

@BewareMyPower BewareMyPower self-assigned this Sep 2, 2021
@BewareMyPower BewareMyPower added component/c++ type/bug The PR fixed a bug or issue reported a bug labels Sep 2, 2021
@merlimat merlimat added this to the 2.9.0 milestone Sep 2, 2021
@BewareMyPower BewareMyPower added the doc-not-needed Your PR changes do not impact docs label Sep 2, 2021
@BewareMyPower BewareMyPower merged commit cdd0f41 into apache:master Sep 2, 2021
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-cpp-connect-timeout branch September 2, 2021 14:07
hangc0276 pushed a commit that referenced this pull request Sep 4, 2021
Fixes #10721

### Motivation

Currently the connection timer stops after the TCP connection is established immediately (the state becomes `TcpConnected`). However, the connect phase should also include sending the `CommandConnect` request and receiving the `CommandConnected` response from broker successfully. For example, in a case like #10721 described, if the Pulsar broker received a SIGSTOP signal and became pending, the TCP connection can be established but no response could be received from a stopped broker. Then the client would hang forever.

### Modifications

Stop the connection timer only after the `ClientConnection`'s state becomes `Ready`, which means the client has received the `CommandConnected` response successfully. It's also consistent with Java client's implementation, see https://github.com/apache/pulsar/blob/235e678a56d0284e68b45e46706b6237d7c6d5f9/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L329-L330

### Verifying this change

- [ ] Make sure that the change passes the CI checks.

It's hard to simulate the scenario in unit test. But we can simply reproduce it in local env.
1. Run a pulsar standalone and send `SIGSTOP` signal by `pkill -SIGSTOP -f pulsar`.
2. Run any C++/Python client with this fix to connect to the standalone.

```
>>> import pulsar
>>> c = pulsar.Client('pulsar://localhost:6650')
>>> c.create_producer('xxx')
2021-09-02 11:54:38.828 INFO  [0x10d72ae00] ClientConnection:181 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2021-09-02 11:54:38.828 INFO  [0x10d72ae00] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2021-09-02 11:54:38.969 INFO  [0x700003136000] ClientConnection:367 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connected to broker
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:532 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection was not established in 10000 ms, close the socket
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:572 | [127.0.0.1:62784 -> 127.0.0.1:6650] Read failed: Operation canceled
2021-09-02 11:54:48.973 INFO  [0x700003136000] ClientConnection:1495 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection closed
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientImpl:188 | Error Checking/Getting Partition Metadata while creating producer on persistent://public/default/xxx -- ConnectError
Traceback (most recent call last):
    # ...
_pulsar.ConnectError: Pulsar error: ConnectError
```

we can see after 10 seconds (the default connect timeout), it failed.

(cherry picked from commit cdd0f41)
@hangc0276 hangc0276 added cherry-picked/branch-2.8 Archived: 2.8 is end of life release/2.8.1 and removed release/2.8.2 labels Sep 4, 2021
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
Fixes apache#10721 

### Motivation

Currently the connection timer stops after the TCP connection is established immediately (the state becomes `TcpConnected`). However, the connect phase should also include sending the `CommandConnect` request and receiving the `CommandConnected` response from broker successfully. For example, in a case like apache#10721 described, if the Pulsar broker received a SIGSTOP signal and became pending, the TCP connection can be established but no response could be received from a stopped broker. Then the client would hang forever.

### Modifications

Stop the connection timer only after the `ClientConnection`'s state becomes `Ready`, which means the client has received the `CommandConnected` response successfully. It's also consistent with Java client's implementation, see https://github.com/apache/pulsar/blob/235e678a56d0284e68b45e46706b6237d7c6d5f9/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L329-L330

### Verifying this change

- [ ] Make sure that the change passes the CI checks.

It's hard to simulate the scenario in unit test. But we can simply reproduce it in local env.
1. Run a pulsar standalone and send `SIGSTOP` signal by `pkill -SIGSTOP -f pulsar`.
2. Run any C++/Python client with this fix to connect to the standalone.

```
>>> import pulsar
>>> c = pulsar.Client('pulsar://localhost:6650')
>>> c.create_producer('xxx')
2021-09-02 11:54:38.828 INFO  [0x10d72ae00] ClientConnection:181 | [<none> -> pulsar://localhost:6650] Create ClientConnection, timeout=10000
2021-09-02 11:54:38.828 INFO  [0x10d72ae00] ConnectionPool:96 | Created connection for pulsar://localhost:6650
2021-09-02 11:54:38.969 INFO  [0x700003136000] ClientConnection:367 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connected to broker
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:532 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection was not established in 10000 ms, close the socket
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientConnection:572 | [127.0.0.1:62784 -> 127.0.0.1:6650] Read failed: Operation canceled
2021-09-02 11:54:48.973 INFO  [0x700003136000] ClientConnection:1495 | [127.0.0.1:62784 -> 127.0.0.1:6650] Connection closed
2021-09-02 11:54:48.973 ERROR [0x700003136000] ClientImpl:188 | Error Checking/Getting Partition Metadata while creating producer on persistent://public/default/xxx -- ConnectError
Traceback (most recent call last):
    # ...
_pulsar.ConnectError: Pulsar error: ConnectError
```

we can see after 10 seconds (the default connect timeout), it failed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.8 Archived: 2.8 is end of life doc-not-needed Your PR changes do not impact docs release/2.8.1 release/2.9.0 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
4 participants