
Broker hangs and crashes when listing non-persistent topics #5417

Closed
digi691 opened this issue Oct 18, 2019 · 19 comments
Labels
release/2.7.1  type/bug

Comments

digi691 commented Oct 18, 2019

Describe the bug
On a Pulsar cluster with version 2.3.0 or 2.4.1, when I send the API request /admin/v2/non-persistent/{tenant}/{namespace} to one of my brokers, the request just hangs. If I send this request too frequently, the broker's API becomes unresponsive until the broker is restarted. The broker also never logs the GET request.

To Reproduce
Steps to reproduce the behavior:

  1. Run step 2 on a Pulsar broker that is part of a cluster
  2. curl -v http://{HOST}:8080/admin/v2/non-persistent/{tenant}/{namespace}

Expected behavior
I expect Pulsar to return an empty list.
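
For comparison, on a standalone instance (where this does not reproduce) the same request returns immediately with an empty JSON array; host, tenant, and namespace are placeholders:

$ curl http://{HOST}:8080/admin/v2/non-persistent/{tenant}/{namespace}
[]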

Additional context
I could not reproduce this on a standalone instance, but the behavior is present on two of our clusters. We do not use non-persistent topics, but when trying to use Presto against Pulsar, the Presto jar arbitrarily tries to list non-persistent topics.

vicaya (Contributor) commented Oct 31, 2019

We're seeing this in 2.4.0 clusters as well. I suspect that bumping the Jetty thread count would help. Let's make the Jetty thread count configurable.

digi691 (Author) commented Nov 1, 2019

numHttpServerThreads is currently set to 8, which is what #3776 suggests.

jiazhai (Member) commented Dec 3, 2019

@digi691 Could increasing numHttpServerThreads solve this issue?

tuteng (Member) commented Dec 3, 2019

As I understand it, this problem can be mitigated by increasing numHttpServerThreads. If the HTTP request concurrency is high, the value of this parameter should be raised. We may also need to add documentation explaining this HTTP request behavior under high concurrency.
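
For reference, a minimal sketch of the relevant broker.conf entry; the value of 32 below is only an illustrative assumption and should be sized to your expected concurrent admin/HTTP request load:

# broker.conf
# Number of threads used by the broker's Jetty HTTP server (the suggested default is 8).
# 32 is an example value, not a recommendation.
numHttpServerThreads=32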

digi691 (Author) commented Dec 3, 2019

What is the appropriate number? As I said earlier, I currently have it set to 8, which is the suggested default. What I don't understand is that any other request to the API finishes fine; it's only this endpoint that causes the whole broker to crash. I don't have proof of what is going on, but it seems like hitting that endpoint ties up an HTTP server thread indefinitely.

tuteng (Member) commented Dec 4, 2019

Thank you for your reply. I will test this on a cluster and try to fix the problem.

tuteng (Member) commented Dec 6, 2019

I tested this problem in a cluster (three brokers and three bookies) using the ab tool, with numHttpServerThreads at its default of 8. When the number of queries issued is significantly higher than what those threads can handle, the broker does indeed block.

I think this is because non-persistent topics are stored in memory. When there are multiple brokers, your request is sent to one of them, and that broker completes the remaining work: after receiving the request, it forwards it to all brokers in the cluster, looks up all bundles on every broker, traverses all topics under those bundles, and finally returns the topic list. A large part of this operation is network latency, so the problem above appears when there are many concurrent requests. I have not found a suitable way to solve this with a code fix, but I have two alternatives for querying non-persistent topics.

  1. Reasonably evaluate your query load and configure an appropriate numHttpServerThreads value, but this method cannot completely solve the problem above.

  2. Alternatively, split the query into the following three steps:

a. Get all broker addresses

curl -v http://any-broker-ip:8080/admin/v2/brokers/cluster-name

b. Loop the following REST API call over the brokers to obtain the bundles:

curl -v http://broker-1:8080/admin/v2/non-persistent/test/test-namespace
curl -v http://broker-2:8080/admin/v2/non-persistent/test/test-namespace

This returns results like:

"bundles" : {
    "boundaries" : [ "0x00000000", "0x40000000", "0x80000000", "0xc0000000", "0xffffffff" ],
    "numBundles" : 4
  },

c. Get the topics under each bundle:

...
http://broker-ip:8080/admin/v2/non-persistent/test/test-namespace/0x00000000_0x40000000
http://broker-ip:8080/admin/v2/non-persistent/test/test-namespace/0x40000000_0x80000000
...

I think the second method avoids blocking the broker when there are many concurrent requests. A small shell sketch of step (c) follows.
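
A minimal shell sketch of step (c), under the assumption that the bundle boundaries are the ones shown above; the broker address, tenant, and namespace are placeholders, and the boundary list should be replaced with whatever step (b) returns for your namespace:

#!/usr/bin/env bash
# List non-persistent topics bundle by bundle instead of for the whole namespace.
# BROKER and NAMESPACE are placeholders; BOUNDARIES should be the "boundaries"
# array returned in step (b).
BROKER="http://broker-ip:8080"
NAMESPACE="test/test-namespace"
BOUNDARIES=(0x00000000 0x40000000 0x80000000 0xc0000000 0xffffffff)

# Each adjacent pair of boundaries forms one bundle range, e.g. 0x00000000_0x40000000.
for ((i = 0; i < ${#BOUNDARIES[@]} - 1; i++)); do
  range="${BOUNDARIES[i]}_${BOUNDARIES[i+1]}"
  echo "Topics in bundle ${range}:"
  curl -s "${BROKER}/admin/v2/non-persistent/${NAMESPACE}/${range}"
  echo
done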

tuteng (Member) commented Dec 7, 2019

I think this issue can be closed; if there are any further problems, we can consider reopening it.

digi691 (Author) commented Dec 9, 2019

@tuteng I will talk to my team about upgrading our test environment. Hopefully I will be able to get to testing 2.4.2 within the next couple of weeks.

digi691 (Author) commented Jan 7, 2020

@tuteng I was able to upgrade our dev/test Pulsar cluster to version 2.4.2. Now hitting the /admin/v2/non-persistent/{tenant}/{namespace} API endpoint causes an HTTP 504 error when connecting through the proxy, whether the namespace has non-persistent topics or not. When connecting directly to the broker at the same endpoint, it seems to wait indefinitely. The crashing of the broker API seems to be resolved, but I still cannot hit this endpoint on a Pulsar cluster. When Pulsar is running in standalone mode, it produces an empty list.

digi691 (Author) commented Jan 14, 2020

@tuteng Let me know how I should proceed, or how I can offer further help in figuring out this issue.

tuteng (Member) commented Jan 14, 2020

@digi691 Can you give me a detailed description of your setup, such as whether authentication is turned on, the commands used, the relevant cluster configuration, etc.?

digi691 (Author) commented Jan 17, 2020

@tuteng In my development environment I have two HAProxy servers that load balance between two Pulsar proxy servers. Behind those I have 3 brokers, which point to 4 bookkeepers, plus 3 ZooKeeper nodes. The difference between my development and production environments is that in development the configuration store is also the ZooKeeper quorum, while in production I have a separate ZooKeeper quorum (3 nodes) and configuration store (3 nodes), as well as 8 bookkeepers. Just to note, I am seeing this issue in our production instance as well.

Here is the broker.conf I'm using on all three brokers; it is essentially the same in production apart from different host names, cert names, and bucket names: https://gist.github.com/digi691/2a27c8a6055145e98450fc7efce8c0c4. FYI, I had to scrub the file of host names, identifying details in cert names, etc. As you'll notice, TLS is turned on, though authentication is currently turned off.

When hitting the admin API to list non-persistent topics through HAProxy and the Pulsar proxies, I time out, as I would expect. When I point directly at the brokers and hit that admin API, it just sits and waits forever, as I explained above, and never produces a response. I also cannot use the pulsar-admin topics list subcommand, as I believe it tries to list non-persistent topics as well as persistent ones.

digi691 (Author) commented Apr 29, 2020

I know this issue is stale, but we are still seeing it on our bare-metal clusters. I recently set up Pulsar in Kubernetes and I'm not seeing this behavior there. Is there any kind of misconfiguration of the brokers and ZooKeeper that could cause the broker to just swallow these types of requests, never respond, and not log any errors about it?

flowchartsman (Contributor) commented Jan 10, 2021

I'm also seeing this issue on a cluster built around the pulsar-all image with Docker. If I inspect the requests with a proxy, I can see that pulsarctl and pulsar-admin both make two requests in the background:

admin/v2/persistent/<tenant>/<namespace>
admin/v2/non-persistent/<tenant>/<namespace>

The request to the persistent route immediately returns a 200 and a list of topics. The request to the non-persistent route times out, and the client displays only an error.

codelipenghui (Contributor) commented:

@flowchartsman I have tried to reproduce this problem, but it does not seem easy to reproduce. Does it work when you list all topics through the broker directly? If it works, the problem is probably related to the Pulsar proxy; otherwise, the problem is likely on the broker side. This will help us locate the problem.

flowchartsman (Contributor) commented:

I can confirm that this happens when accessing the broker directly and when accessing the proxy:

curl -v http://<broker_addr>:8080/admin/v2/non-persistent/tenantName/namespaceName
*   Trying <broker_addr>...
* TCP_NODELAY set
* Connected to <broker_addr> (<broker_addr>) port 8080 (#0)
> GET /admin/v2/non-persistent/tenantName/namespaceName HTTP/1.1
> Host: <broker_addr>:8080
> User-Agent: curl/7.64.1
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Date: Mon, 11 Jan 2021 19:15:13 GMT
< broker-address: <broker_addr>
< Content-Type: text/plain
< Content-Length: 3183
< Server: Jetty(9.4.33.v20201020)
<

 --- An unexpected error occurred in the server ---

Message: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException

Stacktrace:

org.apache.pulsar.client.admin.PulsarAdminException: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at org.apache.pulsar.client.admin.internal.BaseResource.getApiException(BaseResource.java:231)
        at org.apache.pulsar.client.admin.internal.TopicsImpl$5.failed(TopicsImpl.java:233)
        at org.glassfish.jersey.client.JerseyInvocation$1.failed(JerseyInvocation.java:839)
        at org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:247)
        at org.glassfish.jersey.client.ClientRuntime.processFailure(ClientRuntime.java:242)
        at org.glassfish.jersey.client.ClientRuntime.access$100(ClientRuntime.java:62)
        at org.glassfish.jersey.client.ClientRuntime$2.lambda$failure$1(ClientRuntime.java:178)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:248)
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:244)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:292)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:274)
        at org.glassfish.jersey.internal.Errors.process(Errors.java:244)
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:288)
        at org.glassfish.jersey.client.ClientRuntime$2.failure(ClientRuntime.java:178)
        at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$apply$1(AsyncHttpConnector.java:200)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
        at org.apache.pulsar.client.admin.internal.http.AsyncHttpConnector.lambda$timeoutAfter$7(AsyncHttpConnector.java:300)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.orApply(CompletableFuture.java:1385)
        at java.util.concurrent.CompletableFuture$OrApply.tryFire(CompletableFuture.java:1364)
        at java.util.concurrent.CompletableFuture$CoCompletion.tryFire(CompletableFuture.java:1034)
        ... 10 more
Caused by: java.util.concurrent.TimeoutException
        ... 8 more
* Connection #0 to host <broker_addr> left intact
* Closing connection 0

sijie (Member) commented Jan 21, 2021

This is fixed by #9228

sijie closed this as completed Jan 21, 2021