
[Proxy] Pulsar Handshake was not completed within timeout, Proxy stops working for proxying Broker connections while Admin API proxying keeps working #14831

Open
kwenzh opened this issue Mar 24, 2022 · 9 comments
Labels: lifecycle/stale, type/bug

Comments


kwenzh commented Mar 24, 2022

Describe the bug

Pulsar Proxy can get into a state where it stops proxying Broker connections while Admin API proxying keeps working.
The proxy logs are filled with warnings of this type, while connecting directly to the pulsar-broker works normally:

13:13:09.996 [pulsar-proxy-io-2-1] WARN org.apache.pulsar.common.protocol.PulsarHandler - [[id: 0x83e12747, L:/ip:port - R:/ip:port]] Pulsar Handshake was not completed within timeout, closing connection

The proxy returns to normal after a restart.

I have seen similar issues:
#14075
#14078

To Reproduce
The steps to reproduce are not known.

kwenzh added the type/bug label on Mar 24, 2022

kwenzh commented Mar 24, 2022

Pulsar version: 2.8.1


kwenzh commented Mar 24, 2022

I see the connection count on the proxy going up all the time. Is this a bug in Pulsar 2.8.1?

root@pulsar-cluster-proxy-1:/pulsar# netstat -nat | wc -l
6368
root@pulsar-cluster-proxy-1:/pulsar# netstat -nat | wc -l
6518
root@pulsar-cluster-proxy-1:/pulsar# netstat -nat | wc -l
6541


lhotari commented Mar 24, 2022

Thanks for the report @kwenzh. This seems like a similar problem to the one that led me to add more logging in #14710.
I also ended up creating #14713, which deals with some issues that can occur.

@kwenzh Would you be able to collect a thread dump (output from jstack -l <PID>), if you are able to reproduce the issue?


lhotari commented Mar 24, 2022

I also initiated a discussion around enabling TCP/IP keepalive: #14841

kwenzh closed this as completed on Mar 25, 2022

kwenzh commented Mar 25, 2022

@lhotari
I know about PR #14710, which adds more logging. Thanks for your commits.

This may not be a Pulsar problem. Yesterday I found that some clients kept retrying consumption of non-existent topics and repeatedly created new client connections, so the number of pulsar-proxy connections hit its limit (the pulsar-proxy ulimit is 1048576), and then this exception occurred. In the end, I found some Go clients that create a new client for every consume and never close it. All in all, this looks like improper client usage rather than a Pulsar problem.
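For reference, a minimal sketch of the client-side fix, assuming the apache/pulsar-client-go library (the proxy URL, topic, and subscription names below are placeholders): create one client, reuse it for all consumers, and close consumers and the client when done, instead of creating a new client for every consume call.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	// Create ONE client and reuse it for the lifetime of the process.
	// Creating a new client per consume call (and never closing it)
	// leaks TCP connections on the proxy until the ulimit is reached.
	client, err := pulsar.NewClient(pulsar.ClientOptions{
		URL:               "pulsar://pulsar-proxy:6650", // placeholder proxy address
		ConnectionTimeout: 10 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close() // releases all connections held by this client

	consumer, err := client.Subscribe(pulsar.ConsumerOptions{
		Topic:            "persistent://public/default/my-topic", // placeholder topic
		SubscriptionName: "my-sub",                                // placeholder subscription
		Type:             pulsar.Shared,
	})
	if err != nil {
		// e.g. the topic does not exist; do not retry by re-creating the client
		log.Fatal(err)
	}
	defer consumer.Close()

	msg, err := consumer.Receive(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("received: %q", string(msg.Payload()))
	consumer.Ack(msg)
}
```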

kwenzh reopened this on Mar 25, 2022

lhotari commented Mar 25, 2022

@kwenzh Can you upgrade to Pulsar 2.8.3? It has been released recently. It also includes #13836, which reduces resource consumption significantly when there is a large number of connections. Before that change, 2 threads were created for each connection.

It would be interesting to hear if the problem reproduces with Pulsar 2.8.3.


kwenzh commented Mar 25, 2022

> @kwenzh Can you upgrade to Pulsar 2.8.3? It has been released recently. It also includes #13836, which reduces resource consumption significantly when there is a large number of connections. Before that change, 2 threads were created for each connection.
>
> It would be interesting to hear if the problem reproduces with Pulsar 2.8.3.

Fine, I will discuss the upgrade to Pulsar 2.8.3 with the team!

I see the latest version is 2.9.1 on https://pulsar.apache.org


and the latest tag is 2.9.2 in the GitHub tags.


@github-actions

The issue had no activity for 30 days, mark with Stale label.

