Broker hangs and crashes when listing non-persistent topics #5417
Comments
We're seeing this in 2.4.0 clusters as well. I suspect that bumping the Jetty thread count would help. Let's make the Jetty thread count configurable.
Currently |
@digi691 Could increasing numHttpServerThreads solve this issue?
As I understand it, this problem can be solved by increasing the numHttpServerThreads parameter. If the number of concurrent HTTP requests is large, this value should be increased. We may also need to add documentation describing how HTTP requests behave under high concurrency.
What is the appropriate number? As I said earlier, I currently have it set to 8, which is the suggested default. What I don't understand is that any other request I make on the API finishes fine; only this endpoint causes the whole broker to crash. I don't have proof of what is going on, but it seems that hitting this endpoint ties up an HttpServerThread indefinitely.
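To illustrate the suspicion above (a request that never returns holding an HTTP server thread indefinitely), here is a minimal Python sketch of a fixed-size worker pool being starved. It is an analogy only, not Pulsar or Jetty code: `demo_starvation` and its parameters are hypothetical names, and the pool size of 8 simply mirrors the numHttpServerThreads value mentioned in this thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Tuple

POOL_SIZE = 8  # assumption: mirrors the numHttpServerThreads value quoted above


def demo_starvation(pool_size: int = POOL_SIZE, wait_s: float = 0.2) -> Tuple[bool, str]:
    """Return (fast_request_was_starved, fast_request_result_after_release)."""
    release = threading.Event()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Occupy every worker with a handler that blocks until released,
        # standing in for requests that never complete.
        blocked = [pool.submit(release.wait) for _ in range(pool_size)]

        # A normally fast request now has no free worker to run on.
        fast = pool.submit(lambda: "ok")
        try:
            fast.result(timeout=wait_s)
            starved = False
        except FutureTimeout:
            starved = True

        # Release the stuck handlers so the queued request can finally run.
        release.set()
        for future in blocked:
            future.result()
        return starved, fast.result(timeout=5)


if __name__ == "__main__":
    starved, result = demo_starvation()
    print(f"fast request starved: {starved}, result after release: {result}")
```

If this is the mechanism at work, raising the thread count would only delay the point at which every thread is occupied; it would not by itself stop a handler from blocking.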
Thank you for your reply. I will test it on a cluster and try to fix this problem.
I tested this problem on a cluster (three brokers and three bookies). I used the I think this is because
a. Get all broker addresses
b. Loop over the following REST API to obtain the bundles
The following results will be returned
c. Get the topics under each bundle.
I think the second method will solve the problem of the broker blocking under a high number of concurrent requests.
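Steps b and c above can be sketched as follows. This is a rough illustration under several assumptions, none of which come from the comment itself (its original snippets did not survive): that the bundle listing is a JSON object with a `boundaries` array of hex strings, that a bundle range is named `start_end`, and that non-persistent topics for one bundle are served at `/admin/v2/non-persistent/{tenant}/{namespace}/{bundle}`. The host name `broker-1:8080` and the function names are hypothetical. Please check the paths against the admin REST API reference for your Pulsar version before relying on them.

```python
from typing import List
from urllib.parse import quote


def bundle_ranges(boundaries: List[str]) -> List[str]:
    """Turn sorted bundle boundaries into 'start_end' range names."""
    return [f"{lo}_{hi}" for lo, hi in zip(boundaries, boundaries[1:])]


def per_bundle_urls(broker_url: str, tenant: str, namespace: str,
                    boundaries: List[str]) -> List[str]:
    """Build one listing URL per bundle (assumed path layout, see lead-in)."""
    base = broker_url.rstrip("/")
    ns = f"{quote(tenant, safe='')}/{quote(namespace, safe='')}"
    return [f"{base}/admin/v2/non-persistent/{ns}/{r}"
            for r in bundle_ranges(boundaries)]


if __name__ == "__main__":
    # Hypothetical boundaries for a namespace split into four bundles.
    example = ["0x00000000", "0x40000000", "0x80000000", "0xc0000000", "0xffffffff"]
    for url in per_bundle_urls("http://broker-1:8080", "public", "default", example):
        print(url)
```

Each URL would then be requested from the broker that owns that bundle, so no single broker has to fan out to all the others while serving one request.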
@digi691 You can try version 2.4.2 https://dist.apache.org/repos/dist/release/pulsar/pulsar-2.4.2/. Version 2.4.1 used a blocking operation https://github.com/apache/pulsar/blob/v2.4.1/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/v2/NonPersistentTopics.java#L184, which has been fixed in version 2.4.2 https://github.com/apache/pulsar/blob/v2.4.2/pulsar-broker/src/main/java/org/apache/pulsar/broker/admin/v2/NonPersistentTopics.java#L234.
I think this issue can be closed. If there are any problems, we can reopen it.
@tuteng I will talk to my team about upgrading our test environment. Hopefully I will be able to test 2.4.2 within the next couple of weeks.
@tuteng I was able to upgrade our dev/test Pulsar cluster to version 2.4.2. Now hitting the /admin/v2/non-persistent/{tenant}/{namespace} API endpoint causes an HTTP ERROR 504 when connecting through the proxy, whether the namespace has non-persistent topics or not. When connecting directly to the broker at /admin/v2/non-persistent/{tenant}/{namespace}, the request seems to wait indefinitely. The crashing of the broker API seems to be resolved, but I still cannot hit this endpoint on a Pulsar cluster. When Pulsar is running in standalone mode, the endpoint returns an empty list.
@tuteng Let me know how I should proceed or whether I can offer further help in figuring out this issue.
Can you give me a detailed description of your setup, such as whether authentication is turned on, the command used, the relevant cluster configuration, etc.? @digi691
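For anyone trying to reproduce the two behaviors described above (a 504 through the proxy versus an indefinite wait against the broker), a small probe with a client-side timeout makes the difference visible without hanging the terminal. This is a minimal sketch using only the Python standard library; `probe` is a hypothetical helper, and the URL in the usage lines is a placeholder that needs a real host, tenant, and namespace.

```python
import socket
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0) -> str:
    """Return 'status <code>', 'timeout', or 'error: <reason>' for one GET."""
    # Ignore any HTTP proxy set in the environment so the request goes
    # straight to the host named in the URL.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_s) as resp:
            return f"status {resp.status}"
    except urllib.error.HTTPError as err:
        # The server answered, just not with a 2xx (for example a 504).
        return f"status {err.code}"
    except (socket.timeout, TimeoutError):
        # Connected, but no response arrived within timeout_s.
        return "timeout"
    except urllib.error.URLError as err:
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        return f"error: {err.reason}"


if __name__ == "__main__":
    # Placeholder address: substitute a real broker or proxy, tenant, and namespace.
    print(probe("http://localhost:8080/admin/v2/non-persistent/public/default"))
```

A result of `timeout` against the broker alongside `status 504` through the proxy would match the report above; `status 200` would indicate that the endpoint answered.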
@tuteng In my development environment I have two HA-Proxy servers that load balance between 2 Pulsar proxy servers. I then have 3 brokers, which point to 4 bookkeepers. I also have 3 Zookeeper nodes. The difference between my development and production environments is that the development environment's configuration store is also the Zookeeper quorum. In my production environment I have a separate Zookeeper quorum cluster (3) and configuration store (3), and also have 8 bookkeepers. Just to note, I am seeing this issue in our production instance as well. Here is the broker.conf I'm using on all three brokers; it is pretty much the same in the production environment, just with different host names, cert names, and bucket names: https://gist.github.com/digi691/2a27c8a6055145e98450fc7efce8c0c4. FYI, I had to scrub the file for hostnames, identifying details in cert names, etc. As you'll notice, TLS is turned on, though authentication is currently turned off. I time out when going through the HA-Proxy and Pulsar proxies, as I would expect, when hitting the admin API to list non-persistent topics. When I point directly at the brokers and hit that admin API, it just sits and waits forever, as I explained above, and never produces a response. I also cannot use the pulsar-admin tool sub command
I know this issue is stale, but we are still having it on our bare-metal clusters. I recently set up Pulsar in Kubernetes and I'm not seeing this behavior there. Is there any kind of misconfiguration of the brokers and Zookeepers that could cause the broker to just swallow these types of requests, not respond, and not log any errors about them?
I'm also seeing this issue on a cluster built around the
The request to the persistent route returns a code 200 and a list of topics immediately. The request to the non-persistent route times out, and only an error is displayed by the client.
@flowchartsman I have tried to reproduce this problem, but it does not seem easy to reproduce. Does it work when you list all topics through the broker directly? If it works, the problem might be related to the Pulsar proxy. Otherwise, the problem is on the broker side. This will help us locate the problem.
I can confirm that this happens when accessing the broker directly and when accessing the proxy:
This is fixed by #9228
Describe the bug
On a Pulsar cluster running version 2.3.0 or 2.4.1, when I send the API request /admin/v2/non-persistent/{tenant}/{namespace} to one of my brokers, the request just hangs. If I send this request too frequently, the broker's API becomes unresponsive until the broker is restarted. The broker also never logs the GET request.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect Pulsar to return an empty list.
Additional context
I could not reproduce this on a standalone instance, but the behavior is present on two of our clusters. We do not use non-persistent topics, but when trying to use Presto against Pulsar, the Presto jar tries to arbitrarily list non-persistent topics.